Python: Programs¶
Exercises¶
Write a Python program that:
- takes as input a file name
filename
containing the multiple alignment of a protein sequence - prints, for each sequence position, the corresponding alignment profile, i.e how many times each amino acid (considering also gaps as an option) was found in that position of the alignment.
Input example (copy-and-paste into a file):
0.012 C ---------CCAHSLVVTEASAD------------------------------------------------- 0.013 G ---------GGGYFAVFATAFSS------------------------------------------------- 0.024 V ---------VILINLLLYVIIIE-------VI---------------------------------------- 0.014 P ---------PPPCKRGKG-SEDS-------RV---------------------------------------- 0.013 A ---------AKACLSSVY-VIN--------FT------L--------------------------------- 0.014 I ---------IIPPEY-QLLVFI--------HL------P--------------------------------- 0.019 Q ---------QEKENEEGTQGGE--------LG------L--------------------------------- 0.020 P ---------PPTPPSSQKP-DI--------SV------P------A-------------------------- 0.077 V --N------VKAP-PPREESTPVGAVKGTCR-------P--L---SR-E----L-DC---AE-W-S---MQ- 0.087 L --A------LLLP-VISESPKEDHSPPPSSP-------L--E---FD-IG---WDRG---EA-T-T---QI- 0.106 S --K------SPPL-GSEERIPVTSSSGETGF-------N--KA--ADDQT-F-RLLW--KHP-G-P---TQ- 0.103 G --T------GGQY-V-PIQKPDNIRFFQTSF-------DG-LT--DSRGA-Y-TDSD--AVD-R-Q---VA- 0.202 L --F------LAA--RQLQSDSSGPATEMSHG-QQQSSSRD-SGQSKTRNANV-HQMPV-LRGLH-Q--NEGS 0.274 S VVI-E-PPTSNE--EFYSSNVLLPKGGGPSA-GGGFLSEGPGEENPNSDSGR-ASASGPKSFAP-P-DENQT 0.399 R GGR-R-RRRRRRR-QKRRRRKER-RFRRTRR-RRRATSRKKRERGDMRRERS-RRLRRDELETR-SAALKDW 0.610 I IVI-I-IIIIIIM-IIVIVIIIVIIVIVRII-IIIILIIIIIIIIVIIIIII-IIIIVHYVITI-IYLVIVL 0.463 V YAV-V-VIVVVVAVLLIVIVVVVIVIIVIIV-LMAHQRVVVVIVYAVRVVVV-IIIVAINMVVYPISHLRII 0.506 N GGG-G-GGGNGGYEGGNGNSGNNEGNNGNCG-GGGGGNGNGGNGNNGEGGGG-GGGNDANGGSNLGSRGHGR 0.794 G GGG-G-GGGGGGGAGGGGGGGGGGGGGGGGG-GGGGDGGGGGGGGGGGGGGG-GGGGGGSGGTGGGSGGGGS 0.288 E HHQ-S-HHKESITSHHKSRSSQQTDTHSGQS-EEETRVSTSEKVVGMQ-RIH-TYTVS-YGQVIK-SITPVP 0.323 E DED-Y-PVEEDEASNNPSDDPENEEEENQNV-DDDKKRNTEVKEEDKL-RDA-EDDED-LTKIKE-PPELSV 0.493 A IIA-I-SSTAASANAAASAAAVAAAAAARAAAVAAIITAAATAVAAVARAAS-AVVTA-PVASTCSAVVTAQ 0.320 V SPN-P-DSTVVTRLDPDSPKRVNNAEEAVKSEADVDERSGEPKPKADKPRVLMNDEER-EPQFTPNRQPSNH 0.351 P IIIII-VIIPPIPVIVLVPLLPRLPPKVNKEVQALIIIPPRKKIFEIDSPAPTIIDIR-NVFRVPAIQVIDR 0.372 G ESQEE-WKSGHNNHKAFGGGHHGHHHGNENGGGTGKTYGGNFNHDNEDGHGGRSEGNGTFGGPEHTEQGVIT 0.388 S QDDQY-HEESSQEQLSSQSQQSQEESESTSQSEAQDDLDEEQSLFSQEGADAHNDKEQTNARRSSSDNKKIE 0.347 W AAYAV-QEVWYWYNHANFFFFIFFFAAAVAFQYTLAAAAFMYFTWFVFYWAAWFAAFFPVKHYLRGYYYYAF 0.739 P PPPPP-PKPPPSPPPPPPKPPPPPPPPPPPPPPTPPPPVPPPQPTPPTSPPPPPPPPPPPTRPGPPPGPPRP 0.521 W FFYWW-HYYWWSWWWWWYYWWYYYYYFYFYHHWFYWWYYFYWYWFYYGFFWYWWYYMYYYYYYGWWSYFFSF
Execution example:
python program.py Data file: alignment el - A C D E F G H I K L M N P Q R S T V W X Y C 58 3 2 1 1 0 0 1 0 0 1 0 0 0 0 0 2 1 2 0 0 0 G 58 3 0 0 0 3 3 0 0 0 0 0 0 0 0 0 2 1 1 0 0 1 V 56 0 0 0 1 0 0 0 6 0 4 0 1 0 0 0 0 0 3 0 0 1 P 57 0 1 1 1 0 2 0 0 2 0 0 0 3 0 2 2 0 1 0 0 0 A 57 2 1 0 0 1 0 0 1 1 2 0 1 0 0 0 2 1 2 0 0 1 I 57 0 0 0 1 1 0 1 3 0 3 0 0 3 1 0 0 0 1 0 0 1 Q 56 0 0 0 5 0 4 0 0 1 2 0 1 0 2 0 0 1 0 0 0 0 P 56 1 0 1 0 0 0 0 1 1 0 0 0 6 1 0 3 1 1 0 0 0 V 36 3 2 1 4 0 2 0 0 2 2 1 1 5 1 3 3 2 3 1 0 0 ....
Warning
Note that not all the rows contain all the possible aminoacids and gaps (see all the 0 values present in the table). In order to account for this, it’s necessary to extract an alphabet (you can see it in the first row of the output) with all the possible alternatives, to be used as a reference.
You can use five separate functions:
- A function reading the file
filename
and returning a list of couples residue-alignment - A function taking as input the list read from the file, and returning an alphabetically ordered list of all the possible characters appearing in the alignment, to be used as alphabet
- A function taking as input a string with an alignment (in a position) and returning a profile (dictionary from character to the corresponding number of occurrences in the alignment)
- A function taking as input the list read from the file and the alphabet, printing a header (see execution example) and, for each position in the list, the corresponding sequence element and the profile (obtained from function 3) ordered as alphabet (if the letter is not present in the profile, the number of occurrences will be 0)
- A function that runs the program using the functions defined above.
- takes as input a file name
Write a Python program that:
- takes as input a file name
filename
containing the annotation of a protein sequence in UNIPROT format - prints, for each secondary structure element (HELIX, STRAND, TURN), the average length of the element, in amino acids, and the average number of consecutive times the element occurs in the sequence
Input example (protein Q1NZL3 on Uniprot):
ID ZN224_HUMAN Reviewed; 707 AA. AC Q9NZL3; A6NFW9; P17033; Q86V10; Q8IZC8; Q9UID9; Q9Y2P6; DT 08-DEC-2000, integrated into UniProtKB/Swiss-Prot. DT 30-NOV-2010, sequence version 3. .... .... .... FT {ECO:0000305}. FT CONFLICT 562 562 H -> R (in Ref. 1; AAF04106). FT {ECO:0000305}. FT STRAND 174 177 {ECO:0000244|PDB:2EN8}. FT TURN 179 181 {ECO:0000244|PDB:2EN8}. FT STRAND 184 187 {ECO:0000244|PDB:2EN8}. FT HELIX 188 195 {ECO:0000244|PDB:2EN8}. FT TURN 196 198 {ECO:0000244|PDB:2EN8}. FT STRAND 207 209 {ECO:0000244|PDB:2EM6}. FT HELIX 216 223 {ECO:0000244|PDB:2EM6}. FT TURN 224 226 {ECO:0000244|PDB:2EM6}. FT STRAND 227 229 {ECO:0000244|PDB:2EM6}. FT STRAND 235 237 {ECO:0000244|PDB:2ELY}. FT STRAND 241 243 {ECO:0000244|PDB:2ELY}.
Execution example:
python program.py Data file: Q9NZL3.txt average lengths {’TURN’: 3.125, ’HELIX’: 7.285714285714286, ’STRAND’: 3.323529411764706} average number of consecutive occurrences {’TURN’: 1.0, ’HELIX’: 1.4, ’STRAND’: 1.736842105263158}
You can use five separate functions:
- A function reading the file
filename
and returning a list of pairs: secondary structure element (ss) - length. - A function taking as input the list read from the file, and returning two dictionaries, both having ss as keys, the first having as value the corresponding sum of lengths, the second the number of times the ss element occurs.
- A function taking as input two dictionaries and returning a new dictionary obtained by normalizing the values of the first with the values of the second.
- A function taking as input the list read from the file, and returning two dictionaries, both having ss as keys: the first dictionary maps from ss the number of occurrences; the second maps from ss to the number of non-consecutive occurrences. The function will return the normalized dictionary of occurrences.
- A function that implements the program using the above functions.
- takes as input a file name
Write a Python program that:
- Takes a
path
to a file describing a list of donor (D) splicing sites in FASTA format. Each site contains 15 base pairs: 7 before the site, 8 after the site. - An integer
threshold
in between 2 and 8.
The program should:
Print the number of occurrences of the sub-sequences (1) before and (2) after the donor site, for all sub-sequences of length in-between 2 and 7 (before the site) or 8 (after the site). Sub-sequences occurring less than
threshold
times should not be printed.The number of occurrences should be computed over all the sequences contained in the FASTA file.
The full input file is here:
Example input
>HUMGLUT4B_2257 <-- header CTCCGAAGTAGGATT <-- site sequence (7 before + 8 after) >HUMGLUT4B_3651 TCAGAAGGTGAGGGC >HUMGLUT4B_3935 TTGGAAGGTTCGCAG >HUMGLUT4B_4152 TACTCAGGTACTCAC >HUMGLUT4B_4379 CGCCCAGGTGACCGG >HUMGLUT4B_4669 AGAAAGAGTAAGCTC ...
For instance, given the sequence:
CTCCGAAGTAGGATT
the sequence after the site is
GTAGGATT
, and its sub-sequences are:GT, GTA, GTAG, ..., GTAGGATT
while the sequence before the site is
CTCCGAA
, and its sub-sequences are:AA, GAA, CGAA, ..., CTCCGAA
The expected result with
threshold = 100
is:> python splice_patterns.py Insert path: splice_donor.fasta Insert threshold: 100 Most frequent sub-sequences before the site: {'AAG': 208, 'AG': 583, 'GG': 107, 'GAG': 108, 'TG': 135, 'CAG': 247} Most frequent sub-sequences after the site: {'GTA': 568, 'GT': 1116, 'GTAAGT': 126, 'GTGAG': 388, 'GTG': 491, 'GTGA': 399, 'GTGAGT': 204, 'GTAA': 379, 'GTAAG': 300, 'GTGAGTG': 116}
meaning that
"AAG"
appears208
times before the site,"AG"
appears583
times before the site, ...,"GTA"
appears568
times after the site,"GT"
appears1116
times after the site, etc.Note
We suggest implementing five functions:
- A function that reads the input file and returns two collections: a collection of all sequences before the site, and a collection of all sequences after the site.
- A function that takes a collection of sequences after the site, and returns a dictionary mapping from sub-sequences to number of occurrences.
- A function that takes a collection of sequences before the site, and returns a dictionary mapping from sub-sequences to number of occurrences. (note that sub-sequences start at the end of the sequence of the file, and grow backwards.)
- A function that takes a dictionary of counts and a threshold, and returns a dictionary of counts with only the key-value pairs whose value is larger than or equal to the threshold.
- A function that implements the program using the above.
- Takes a
Write a Python program that:
- Takes a
path
to a file describing the association between proteins and the PFAM domains that appear in the protein. - Prints the list of all the PFAM domains (including their ID and name), sorted by number of occurrences, as well as the number of proteins that each PFAM domain appears into.
The full input file is output_pfam.
Example input:
78184355|ref|YP_376790.1| 28 145 PF01475.11 1 124 ls 62.1 2.1e-15 FUR 78185596|ref|YP_378030.1| 31 111 PF00575.15 1 79 fs 33.4 2.8e-08 S1 78185596|ref|YP_378030.1| 113 381 PF10150.1 1 273 ls 436.7 3.5e-128 RNase_E_G 78185773|ref|YP_378207.1| 70 222 PF04087.6 1 147 ls 221.5 2.1e-63 DUF389 78183836|ref|YP_376270.1| 13 375 PF01266.16 1 399 ls 182.5 1.2e-51 DAO 78184892|ref|YP_377327.1| 5 191 PF00702.18 1 184 fs 99.1 5.5e-28 Hydrolase 78183858|ref|YP_376292.1| 78 482 PF00909.13 1 455 ls 519.6 4.1e-153 Ammonium_transp ... ^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^ ^^^^^^^^^^^^^^^ protein domain ID domain name
The expected result is:
> python pfam_statistics.py Type path: output_pfam PFAM ID PFAM Name NumOcc NumDiffProt PF00005.19 ABC_tran 35 32 PF00132.16 Hexapep 29 5 PF00534.12 Glycos_transf_1 21 20 PF00515.20 TPR_1 17 8 PF01370.13 Epimerase 16 16 PF03130.8 HEAT_PBS 15 6 PF04055.13 Radical_SAM 14 14 PF00502.11 Phycobilisome 13 12 PF00427.13 PBS_linker_poly 13 10 PF00271.23 Helicase_C 13 13 PF00528.14 BPD_transp_1 12 11 PF00004.21 AAA 12 12 PF00805.14 Pentapeptide 11 5 PF00535.18 Glycos_transf_2 11 11 PF00353.11 HemolysinCabind 11 3 PF07862.3 Nif11 10 10 PF01926.15 MMR_HSR1 10 9
The output shows that, for instance, the domain with ID
PF00005.19
and nameABC_tran
occurs35
different times in the complete file, and in particular it occurs in32
distinct proteins.Warning
The output text columns are not required to be aligned.
Note
We suggest implementing five functions:
- A function that reads the input file and returns a dictionary mapping from domain PFAM ID to doman PFAM name, and for each protein the list of its domains.
- A function that takes the information read from the file, and returns a dictionary from domains to their number of occurrences.
- A function that takes the information read from the file, and returns a dictionary from domains to their number of distinct proteins they appear into.
- A function that print the output as shown in the expected results. The domains should be sorted by their total number of occurrences.
- A function that implements the program using the above.
- Takes a
Write a Python program that:
- Takes a
path
to a file containing protein sequences in FASTA format. The header of the FASTA contains information about the cellular localization and the organism of the protein. - Prints, for each cellular localization, the number of proteins, divided by organism, and sorted from the most frequent to the less frequent organism.
The full input file is here:
Example input:
>7B2_HUMAN:Secretory MVSRMVSTMLSGLLFWLASGWTPAFAYSPRTPDRVSEADIQRLLHGVME ... >A1AG1_MUSCR:Secretory MALHMILVMLSLLPLLEAQNPEHVNITIGEPITNETLSWLSDKWFFIGA ... >A1BG_HUMAN:Secretory MSMLVVFLLLWGVTWGPVTEAAIFYETQPSLWAESESLLKPLANVTLTC ... ... >ZNF22_RAT:Nucleus MRLGKPQKGGISRSATQGKTYESRRKTARQRQKWGVAIRFDSGLSRRRR ... >ZNF42_HUMAN:Nucleus MRPAVLGSPDRAPPEDEGPVMVKLEDSEEEGEAALWDPGPEAARLRFRC ...
The expected result is:
> python program.py Name of file: sequences.fasta Mitochondrion 107:HUMAN 27:BOVIN 18:DROME 16:MOUSE 7:RAT 3:ASCSU ... Secretory 191:HUMAN 79:MOUSE 59:DROME 35:BOVIN 30:RAT 19:CHICK ... Nucleus 636:HUMAN 188:DROME 145:MOUSE 65:CAEEL 27:RAT ... Cytoplasm 203:HUMAN 72:MOUSE 47:DROME 30:RAT 20:CAEEL 13:BOVIN ...
Note
We suggest implementing four functions:
- A function that reads the input file and returns a dictionary mapping from each cellular localization to the list of protein with that localization
- A function that takes the dictionary generated by the previous function and returns a new dictionary, mapping from cellular localizations to dictionaries of counts (mapping from organism to number of proteins belonging to the organism)
- A function that takes the dictionary generated by the previous function and, for each cellular localization, prints the list of counts, ordered by decreasing number of occurrences.
- A function that implements the program using the above functions.
- Takes a
Write a function
findPerfectMatches(filename,pattern)
that:- Takes the name of a file
filename
containing a list of RNA sequences, and an RNA string (for example, a microRNA sequence). - Prints, for each sequence, the name and the list of positions corresponding to perfect matches (by complementarity) with the string (only for sequences with at least one match).
The expected result is:
>>> import utility >>> utility.findPerfectMatches(’utr.txt’,’acgaatt’) ENSG00000050344 [1186] ENSG00000109929 [204, 373, 3336] ENSG00000155850 [2162, 5387] ENSG00000073756 [1152] ENSG00000175445 [693] ENSG00000159167 [781] ENSG00000138061 [1229] ENSG00000152268 [1362] ENSG00000197121 [3024] ENSG00000115159 [1111] ENSG00000169908 [1695] ENSG00000150938 [1751] ENSG00000179314 [2782] ENSG00000075223 [1743]
Note
We suggest implementing five functions:
- A function that reads the input file and returns a dictionary mapping from name to sequence.
- A function that takes as input a string and returns the complementary sequence (note that sequences and strings have T, and not U, even if they represent RNA sequences).
- A function that takes two sequences and return the list of occurrences of the second string on the first one (look carefully at the help of the function
find
). - A function taking the dictionary of sequences and the converted string, that calculates for each sequence the match positions, using the previous function, and if there is at least one match, prints the name of the sequence and the list of matches.
- A function that implements the program using the above.
- Takes the name of a file
Midterm exam 2016/11/02:
- Given:
- The
hierarchy.txt
file. - Optionally, a pair of protein identifiers.
- The
- write a Python program that:
- If the two protein identifiers are given, prints the list of domains shared by the two proteins; further, for each shared domain, it prints the residue histogram of their sequences.
- If no protein identifiers are given, it prints the above information for all pairs of proteins. The output text columns are not required to be aligned.
Warning
Same domains appear in more than one protein; different occurrences of a domain however may have different sequences!
Example input:
# Protein Domain Position Residue YAL003W PF00736 00120 W YAL003W PF00736 00121 I YAL003W PF00736 00154 Q YAL038W PF00224 00077 E YAL038W PF00224 00236 K YAL038W PF00224 00327 H YAL038W PF00224 00362 I YAL038W PF02887 00391 L YAL038W PF02887 00415 S YAL038W PF02887 00432 G YAL038W PF02887 00441 F YAL038W PF02887 00450 M ...
Expected output:
$ python commdomains.py insert the path to the hierarchy file: hierarchy.txt all pairs? (Y or N) N insert 1st protein ID: YIL109C insert 2nd protein ID: YPR181C YIL109C YPR181C shared domains: ['PF00626', 'PF04810', 'PF04811'] PF00626 histogram = {'I': 1, 'R': 1, 'T': 1, 'W': 3, 'V': 1} PF04810 histogram = {'S': 1, 'R': 1} PF04811 histogram = {'A': 1, 'E': 1, 'D': 3, 'G': 4, 'F': 3} PF04815 histogram = {'C': 1, 'D': 2, 'I': 1, 'Q': 2, 'P': 2} PF08033 histogram = {'C': 1, 'E': 2, 'D': 1, 'I': 3, 'L': 2}
Note
We suggest implementing five functions:
- A function that takes a path to the hierarchy file, and returns
a map from each protein to a dictionary; this dictionary maps
from each domain in the protein to its sequence. Picture:
sequence = hierarchy[protein][domain]
- A function that takes two lists and returns their intersection.
- A function that takes a sequence and returns the histogram of its residues.
- A function that takes the hierarchy read from the file and two protein identifiers, and prints the shared domains. Then, for every shared domain, it extracts the domain sequences from the two proteins, and prints their histogram.
- A function that implements the program.