Python: Programs¶

Exercises¶

Write a Python program that:

takes as input a file name filename containing the multiple alignment of a protein sequence

prints, for each sequence position, the corresponding alignment profile, i.e how many times each amino acid (considering also gaps as an option) was found in that position of the alignment.

Input example (copy-and-paste into a file):

012 C ---------CCAHSLVVTEASAD-------------------------------------------------
013 G ---------GGGYFAVFATAFSS-------------------------------------------------
024 V ---------VILINLLLYVIIIE-------VI----------------------------------------
014 P ---------PPPCKRGKG-SEDS-------RV----------------------------------------
013 A ---------AKACLSSVY-VIN--------FT------L---------------------------------
014 I ---------IIPPEY-QLLVFI--------HL------P---------------------------------
019 Q ---------QEKENEEGTQGGE--------LG------L---------------------------------
020 P ---------PPTPPSSQKP-DI--------SV------P------A--------------------------
077 V --N------VKAP-PPREESTPVGAVKGTCR-------P--L---SR-E----L-DC---AE-W-S---MQ-
087 L --A------LLLP-VISESPKEDHSPPPSSP-------L--E---FD-IG---WDRG---EA-T-T---QI-
106 S --K------SPPL-GSEERIPVTSSSGETGF-------N--KA--ADDQT-F-RLLW--KHP-G-P---TQ-
103 G --T------GGQY-V-PIQKPDNIRFFQTSF-------DG-LT--DSRGA-Y-TDSD--AVD-R-Q---VA-
202 L --F------LAA--RQLQSDSSGPATEMSHG-QQQSSSRD-SGQSKTRNANV-HQMPV-LRGLH-Q--NEGS
274 S VVI-E-PPTSNE--EFYSSNVLLPKGGGPSA-GGGFLSEGPGEENPNSDSGR-ASASGPKSFAP-P-DENQT
399 R GGR-R-RRRRRRR-QKRRRRKER-RFRRTRR-RRRATSRKKRERGDMRRERS-RRLRRDELETR-SAALKDW
610 I IVI-I-IIIIIIM-IIVIVIIIVIIVIVRII-IIIILIIIIIIIIVIIIIII-IIIIVHYVITI-IYLVIVL
463 V YAV-V-VIVVVVAVLLIVIVVVVIVIIVIIV-LMAHQRVVVVIVYAVRVVVV-IIIVAINMVVYPISHLRII
506 N GGG-G-GGGNGGYEGGNGNSGNNEGNNGNCG-GGGGGNGNGGNGNNGEGGGG-GGGNDANGGSNLGSRGHGR
794 G GGG-G-GGGGGGGAGGGGGGGGGGGGGGGGG-GGGGDGGGGGGGGGGGGGGG-GGGGGGSGGTGGGSGGGGS
288 E HHQ-S-HHKESITSHHKSRSSQQTDTHSGQS-EEETRVSTSEKVVGMQ-RIH-TYTVS-YGQVIK-SITPVP
323 E DED-Y-PVEEDEASNNPSDDPENEEEENQNV-DDDKKRNTEVKEEDKL-RDA-EDDED-LTKIKE-PPELSV
493 A IIA-I-SSTAASANAAASAAAVAAAAAARAAAVAAIITAAATAVAAVARAAS-AVVTA-PVASTCSAVVTAQ
320 V SPN-P-DSTVVTRLDPDSPKRVNNAEEAVKSEADVDERSGEPKPKADKPRVLMNDEER-EPQFTPNRQPSNH
351 P IIIII-VIIPPIPVIVLVPLLPRLPPKVNKEVQALIIIPPRKKIFEIDSPAPTIIDIR-NVFRVPAIQVIDR
372 G ESQEE-WKSGHNNHKAFGGGHHGHHHGNENGGGTGKTYGGNFNHDNEDGHGGRSEGNGTFGGPEHTEQGVIT
388 S QDDQY-HEESSQEQLSSQSQQSQEESESTSQSEAQDDLDEEQSLFSQEGADAHNDKEQTNARRSSSDNKKIE
347 W AAYAV-QEVWYWYNHANFFFFIFFFAAAVAFQYTLAAAAFMYFTWFVFYWAAWFAAFFPVKHYLRGYYYYAF
739 P PPPPP-PKPPPSPPPPPPKPPPPPPPPPPPPPPTPPPPVPPPQPTPPTSPPPPPPPPPPPTRPGPPPGPPRP
521 W FFYWW-HYYWWSWWWWWYYWWYYYYYFYFYHHWFYWWYYFYWYWFYYGFFWYWWYYMYYYYYYGWWSYFFSF

Execution example:

python program.py
Data file: alignment
el - A C D E F G H I K L M N P Q R S T V W X Y
C 58 3 2 1 1 0 0 1 0 0 1 0 0 0 0 0 2 1 2 0 0 0
G 58 3 0 0 0 3 3 0 0 0 0 0 0 0 0 0 2 1 1 0 0 1
V 56 0 0 0 1 0 0 0 6 0 4 0 1 0 0 0 0 0 3 0 0 1
P 57 0 1 1 1 0 2 0 0 2 0 0 0 3 0 2 2 0 1 0 0 0
A 57 2 1 0 0 1 0 0 1 1 2 0 1 0 0 0 2 1 2 0 0 1
I 57 0 0 0 1 1 0 1 3 0 3 0 0 3 1 0 0 0 1 0 0 1
Q 56 0 0 0 5 0 4 0 0 1 2 0 1 0 2 0 0 1 0 0 0 0
P 56 1 0 1 0 0 0 0 1 1 0 0 0 6 1 0 3 1 1 0 0 0
V 36 3 2 1 4 0 2 0 0 2 2 1 1 5 1 3 3 2 3 1 0 0
....

Warning

Note that not all the rows contain all the possible aminoacids and gaps (see all the 0 values present in the table). In order to account for this, it’s necessary to extract an alphabet (you can see it in the first row of the output) with all the possible alternatives, to be used as a reference.

You can use five separate functions:

A function reading the file filename and returning a list of couples residue-alignment

A function taking as input the list read from the file, and returning an alphabetically ordered list of all the possible characters appearing in the alignment, to be used as alphabet

A function taking as input a string with an alignment (in a position) and returning a profile (dictionary from character to the corresponding number of occurrences in the alignment)

A function taking as input the list read from the file and the alphabet, printing a header (see execution example) and, for each position in the list, the corresponding sequence element and the profile (obtained from function 3) ordered as alphabet (if the letter is not present in the profile, the number of occurrences will be 0)

A function that runs the program using the functions defined above.

Write a Python program that:
- takes as input a file name filename containing the annotation of a protein sequence in UNIPROT format
- prints, for each secondary structure element (HELIX, STRAND, TURN), the average length of the element, in amino acids, and the average number of consecutive times the element occurs in the sequence
Input example (protein Q1NZL3 on Uniprot):
```
ID ZN224_HUMAN Reviewed; 707 AA.
AC Q9NZL3; A6NFW9; P17033; Q86V10; Q8IZC8; Q9UID9; Q9Y2P6;
DT 08-DEC-2000, integrated into UniProtKB/Swiss-Prot.
DT 30-NOV-2010, sequence version 3.
....
....
....
FT {ECO:0000305}.
FT CONFLICT 562 562 H -> R (in Ref. 1; AAF04106).
FT {ECO:0000305}.
FT STRAND 174 177 {ECO:0000244|PDB:2EN8}.
FT TURN 179 181 {ECO:0000244|PDB:2EN8}.
FT STRAND 184 187 {ECO:0000244|PDB:2EN8}.
FT HELIX 188 195 {ECO:0000244|PDB:2EN8}.
FT TURN 196 198 {ECO:0000244|PDB:2EN8}.
FT STRAND 207 209 {ECO:0000244|PDB:2EM6}.
FT HELIX 216 223 {ECO:0000244|PDB:2EM6}.
FT TURN 224 226 {ECO:0000244|PDB:2EM6}.
FT STRAND 227 229 {ECO:0000244|PDB:2EM6}.
FT STRAND 235 237 {ECO:0000244|PDB:2ELY}.
FT STRAND 241 243 {ECO:0000244|PDB:2ELY}.
```
Execution example:
```
python program.py
Data file: Q9NZL3.txt
average lengths
{’TURN’: 3.125, ’HELIX’: 7.285714285714286,
’STRAND’: 3.323529411764706}
average number of consecutive occurrences
{’TURN’: 1.0, ’HELIX’: 1.4, ’STRAND’: 1.736842105263158}
```
You can use five separate functions:
1. A function reading the file filename and returning a list of pairs: secondary structure element (ss) - length.
2. A function taking as input the list read from the file, and returning two dictionaries, both having ss as keys, the first having as value the corresponding sum of lengths, the second the number of times the ss element occurs.
3. A function taking as input two dictionaries and returning a new dictionary obtained by normalizing the values of the first with the values of the second.
4. A function taking as input the list read from the file, and returning two dictionaries, both having ss as keys: the first dictionary maps from ss the number of occurrences; the second maps from ss to the number of non-consecutive occurrences. The function will return the normalized dictionary of occurrences.
5. A function that implements the program using the above functions.
Write a Python program that:
1. Takes a path to a file describing a list of donor (D) splicing sites in FASTA format. Each site contains 15 base pairs: 7 before the site, 8 after the site.
2. An integer threshold in between 2 and 8.
The program should:
1. Print the number of occurrences of the sub-sequences (1) before and (2) after the donor site, for all sub-sequences of length in-between 2 and 7 (before the site) or 8 (after the site). Sub-sequences occurring less than threshold times should not be printed.
  
  The number of occurrences should be computed over all the sequences contained in the FASTA file.
The full input file is here:

https://drive.google.com/open?id=0B0wILN942aEVWGM4UU9mOG52ODQ

Example input
```
>HUMGLUT4B_2257 <-- header
CTCCGAAGTAGGATT <-- site sequence (7 before + 8 after)
>HUMGLUT4B_3651
TCAGAAGGTGAGGGC
>HUMGLUT4B_3935
TTGGAAGGTTCGCAG
>HUMGLUT4B_4152
TACTCAGGTACTCAC
>HUMGLUT4B_4379
CGCCCAGGTGACCGG
>HUMGLUT4B_4669
AGAAAGAGTAAGCTC
...
```
For instance, given the sequence:
```
CTCCGAAGTAGGATT
```
the sequence after the site is GTAGGATT, and its sub-sequences are:
```
GT, GTA, GTAG, ..., GTAGGATT
```
while the sequence before the site is CTCCGAA, and its sub-sequences are:
```
AA, GAA, CGAA, ...,  CTCCGAA
```
The expected result with threshold = 100 is:
```
> python splice_patterns.py
Insert path: splice_donor.fasta
Insert threshold: 100
Most frequent sub-sequences before the site:
{'AAG': 208, 'AG': 583, 'GG': 107, 'GAG': 108,
'TG': 135, 'CAG': 247}
Most frequent sub-sequences after the site:
{'GTA': 568, 'GT': 1116, 'GTAAGT': 126, 'GTGAG': 388,
'GTG': 491, 'GTGA': 399, 'GTGAGT': 204, 'GTAA': 379,
'GTAAG': 300, 'GTGAGTG': 116}
```
meaning that "AAG" appears 208 times before the site, "AG" appears 583 times before the site, ..., "GTA" appears 568 times after the site, "GT" appears 1116 times after the site, etc.
Note

We suggest implementing five functions:
1. A function that reads the input file and returns two collections: a collection of all sequences before the site, and a collection of all sequences after the site.
2. A function that takes a collection of sequences after the site, and returns a dictionary mapping from sub-sequences to number of occurrences.
3. A function that takes a collection of sequences before the site, and returns a dictionary mapping from sub-sequences to number of occurrences. (note that sub-sequences start at the end of the sequence of the file, and grow backwards.)
4. A function that takes a dictionary of counts and a threshold, and returns a dictionary of counts with only the key-value pairs whose value is larger than or equal to the threshold.
5. A function that implements the program using the above.

Write a Python program that:

Takes a path to a file describing the association between proteins and the PFAM domains that appear in the protein.
Prints the list of all the PFAM domains (including their ID and name), sorted by number of occurrences, as well as the number of proteins that each PFAM domain appears into.

The full input file is output_pfam.

Example input:

78184355|ref|YP_376790.1|     28   145 PF01475.11      1   124 ls    62.1   2.1e-15               FUR
78185596|ref|YP_378030.1|     31   111 PF00575.15      1    79 fs    33.4   2.8e-08                S1
78185596|ref|YP_378030.1|    113   381 PF10150.1       1   273 ls   436.7  3.5e-128         RNase_E_G
78185773|ref|YP_378207.1|     70   222 PF04087.6       1   147 ls   221.5   2.1e-63            DUF389
78183836|ref|YP_376270.1|     13   375 PF01266.16      1   399 ls   182.5   1.2e-51               DAO
78184892|ref|YP_377327.1|      5   191 PF00702.18      1   184 fs    99.1   5.5e-28         Hydrolase
78183858|ref|YP_376292.1|     78   482 PF00909.13      1   455 ls   519.6  4.1e-153   Ammonium_transp
...

^^^^^^^^^^^^^^^^^^^^^^^^^              ^^^^^^^^^^                                     ^^^^^^^^^^^^^^^
      protein                          domain ID                                       domain name

The expected result is:

> python pfam_statistics.py
Type path: output_pfam
PFAM ID     PFAM Name         NumOcc  NumDiffProt
PF00005.19  ABC_tran          35      32
PF00132.16  Hexapep           29      5
PF00534.12  Glycos_transf_1   21      20
PF00515.20  TPR_1             17      8
PF01370.13  Epimerase         16      16
PF03130.8   HEAT_PBS          15      6
PF04055.13  Radical_SAM       14      14
PF00502.11  Phycobilisome     13      12
PF00427.13  PBS_linker_poly   13      10
PF00271.23  Helicase_C        13      13
PF00528.14  BPD_transp_1      12      11
PF00004.21  AAA               12      12
PF00805.14  Pentapeptide      11      5
PF00535.18  Glycos_transf_2   11      11
PF00353.11  HemolysinCabind   11      3
PF07862.3   Nif11             10      10
PF01926.15  MMR_HSR1          10      9

The output shows that, for instance, the domain with ID PF00005.19 and name ABC_tran occurs 35 different times in the complete file, and in particular it occurs in 32 distinct proteins.

Warning

The output text columns are not required to be aligned.

Note

We suggest implementing five functions:

A function that reads the input file and returns a dictionary mapping from domain PFAM ID to doman PFAM name, and for each protein the list of its domains.
A function that takes the information read from the file, and returns a dictionary from domains to their number of occurrences.
A function that takes the information read from the file, and returns a dictionary from domains to their number of distinct proteins they appear into.
A function that print the output as shown in the expected results. The domains should be sorted by their total number of occurrences.
A function that implements the program using the above.

Write a Python program that:
1. Takes a path to a file containing protein sequences in FASTA format. The header of the FASTA contains information about the cellular localization and the organism of the protein.
2. Prints, for each cellular localization, the number of proteins, divided by organism, and sorted from the most frequent to the less frequent organism.
The full input file is here:

https://drive.google.com/file/d/0B7pmXOEcMgmZUG5WTHk1dXh2dFk/view?usp=sharing

Example input:
```
>7B2_HUMAN:Secretory
MVSRMVSTMLSGLLFWLASGWTPAFAYSPRTPDRVSEADIQRLLHGVME ...
>A1AG1_MUSCR:Secretory
MALHMILVMLSLLPLLEAQNPEHVNITIGEPITNETLSWLSDKWFFIGA ...
>A1BG_HUMAN:Secretory
MSMLVVFLLLWGVTWGPVTEAAIFYETQPSLWAESESLLKPLANVTLTC ...
...
>ZNF22_RAT:Nucleus
MRLGKPQKGGISRSATQGKTYESRRKTARQRQKWGVAIRFDSGLSRRRR ...
>ZNF42_HUMAN:Nucleus
MRPAVLGSPDRAPPEDEGPVMVKLEDSEEEGEAALWDPGPEAARLRFRC ...
```
The expected result is:
```
> python program.py
Name of file: sequences.fasta
Mitochondrion
107:HUMAN 27:BOVIN 18:DROME 16:MOUSE 7:RAT 3:ASCSU ...
Secretory
191:HUMAN 79:MOUSE 59:DROME 35:BOVIN 30:RAT 19:CHICK ...
Nucleus
636:HUMAN 188:DROME 145:MOUSE 65:CAEEL 27:RAT ...
Cytoplasm
203:HUMAN 72:MOUSE 47:DROME 30:RAT 20:CAEEL 13:BOVIN ...
```
Note

We suggest implementing four functions:
1. A function that reads the input file and returns a dictionary mapping from each cellular localization to the list of protein with that localization
2. A function that takes the dictionary generated by the previous function and returns a new dictionary, mapping from cellular localizations to dictionaries of counts (mapping from organism to number of proteins belonging to the organism)
3. A function that takes the dictionary generated by the previous function and, for each cellular localization, prints the list of counts, ordered by decreasing number of occurrences.
4. A function that implements the program using the above functions.
Write a function findPerfectMatches(filename,pattern) that:
1. Takes the name of a file filename containing a list of RNA sequences, and an RNA string (for example, a microRNA sequence).
2. Prints, for each sequence, the name and the list of positions corresponding to perfect matches (by complementarity) with the string (only for sequences with at least one match).
The expected result is:
```
>>> import utility
>>> utility.findPerfectMatches(’utr.txt’,’acgaatt’)
ENSG00000050344 [1186]
ENSG00000109929 [204, 373, 3336]
ENSG00000155850 [2162, 5387]
ENSG00000073756 [1152]
ENSG00000175445 [693]
ENSG00000159167 [781]
ENSG00000138061 [1229]
ENSG00000152268 [1362]
ENSG00000197121 [3024]
ENSG00000115159 [1111]
ENSG00000169908 [1695]
ENSG00000150938 [1751]
ENSG00000179314 [2782]
ENSG00000075223 [1743]
```
Note

We suggest implementing five functions:
1. A function that reads the input file and returns a dictionary mapping from name to sequence.
2. A function that takes as input a string and returns the complementary sequence (note that sequences and strings have T, and not U, even if they represent RNA sequences).
3. A function that takes two sequences and return the list of occurrences of the second string on the first one (look carefully at the help of the function find).
4. A function taking the dictionary of sequences and the converted string, that calculates for each sequence the match positions, using the previous function, and if there is at least one match, prints the name of the sequence and the list of matches.
5. A function that implements the program using the above.
Midterm exam 2016/11/02:
Given:
1. The hierarchy.txt file.
2. Optionally, a pair of protein identifiers.
write a Python program that:
1. If the two protein identifiers are given, prints the list of domains shared by the two proteins; further, for each shared domain, it prints the residue histogram of their sequences.
2. If no protein identifiers are given, it prints the above information for all pairs of proteins. The output text columns are not required to be aligned.
Warning

Same domains appear in more than one protein; different occurrences of a domain however may have different sequences!

Example input:
```
# Protein Domain Position Residue
YAL003W PF00736 00120 W
YAL003W PF00736 00121 I
YAL003W PF00736 00154 Q
YAL038W PF00224 00077 E
YAL038W PF00224 00236 K
YAL038W PF00224 00327 H
YAL038W PF00224 00362 I
YAL038W PF02887 00391 L
YAL038W PF02887 00415 S
YAL038W PF02887 00432 G
YAL038W PF02887 00441 F
YAL038W PF02887 00450 M
...
```
Expected output:
```
$ python commdomains.py
insert the path to the hierarchy file: hierarchy.txt
all pairs? (Y or N) N
insert 1st protein ID: YIL109C
insert 2nd protein ID: YPR181C
YIL109C YPR181C shared domains: ['PF00626', 'PF04810', 'PF04811']
PF00626 histogram = {'I': 1, 'R': 1, 'T': 1, 'W': 3, 'V': 1}
PF04810 histogram = {'S': 1, 'R': 1}
PF04811 histogram = {'A': 1, 'E': 1, 'D': 3, 'G': 4, 'F': 3}
PF04815 histogram = {'C': 1, 'D': 2, 'I': 1, 'Q': 2, 'P': 2}
PF08033 histogram = {'C': 1, 'E': 2, 'D': 1, 'I': 3, 'L': 2}
```
Note

We suggest implementing five functions:
1. A function that takes a path to the hierarchy file, and returns a map from each protein to a dictionary; this dictionary maps from each domain in the protein to its sequence. Picture: sequence = hierarchy[protein][domain]
2. A function that takes two lists and returns their intersection.
3. A function that takes a sequence and returns the histogram of its residues.
4. A function that takes the hierarchy read from the file and two protein identifiers, and prints the shared domains. Then, for every shared domain, it extracts the domain sequences from the two proteins, and prints their histogram.
5. A function that implements the program.