================ Python: Programs ================ Exercises --------- #. Write a Python program that: - takes as input a file name ``filename`` containing the multiple alignment of a protein sequence - prints, for each sequence position, the corresponding *alignment profile*, i.e how many times each amino acid (considering also gaps as an option) was found in that position of the alignment. Input example (copy-and-paste into a file):: 0.012 C ---------CCAHSLVVTEASAD------------------------------------------------- 0.013 G ---------GGGYFAVFATAFSS------------------------------------------------- 0.024 V ---------VILINLLLYVIIIE-------VI---------------------------------------- 0.014 P ---------PPPCKRGKG-SEDS-------RV---------------------------------------- 0.013 A ---------AKACLSSVY-VIN--------FT------L--------------------------------- 0.014 I ---------IIPPEY-QLLVFI--------HL------P--------------------------------- 0.019 Q ---------QEKENEEGTQGGE--------LG------L--------------------------------- 0.020 P ---------PPTPPSSQKP-DI--------SV------P------A-------------------------- 0.077 V --N------VKAP-PPREESTPVGAVKGTCR-------P--L---SR-E----L-DC---AE-W-S---MQ- 0.087 L --A------LLLP-VISESPKEDHSPPPSSP-------L--E---FD-IG---WDRG---EA-T-T---QI- 0.106 S --K------SPPL-GSEERIPVTSSSGETGF-------N--KA--ADDQT-F-RLLW--KHP-G-P---TQ- 0.103 G --T------GGQY-V-PIQKPDNIRFFQTSF-------DG-LT--DSRGA-Y-TDSD--AVD-R-Q---VA- 0.202 L --F------LAA--RQLQSDSSGPATEMSHG-QQQSSSRD-SGQSKTRNANV-HQMPV-LRGLH-Q--NEGS 0.274 S VVI-E-PPTSNE--EFYSSNVLLPKGGGPSA-GGGFLSEGPGEENPNSDSGR-ASASGPKSFAP-P-DENQT 0.399 R GGR-R-RRRRRRR-QKRRRRKER-RFRRTRR-RRRATSRKKRERGDMRRERS-RRLRRDELETR-SAALKDW 0.610 I IVI-I-IIIIIIM-IIVIVIIIVIIVIVRII-IIIILIIIIIIIIVIIIIII-IIIIVHYVITI-IYLVIVL 0.463 V YAV-V-VIVVVVAVLLIVIVVVVIVIIVIIV-LMAHQRVVVVIVYAVRVVVV-IIIVAINMVVYPISHLRII 0.506 N GGG-G-GGGNGGYEGGNGNSGNNEGNNGNCG-GGGGGNGNGGNGNNGEGGGG-GGGNDANGGSNLGSRGHGR 0.794 G GGG-G-GGGGGGGAGGGGGGGGGGGGGGGGG-GGGGDGGGGGGGGGGGGGGG-GGGGGGSGGTGGGSGGGGS 0.288 E HHQ-S-HHKESITSHHKSRSSQQTDTHSGQS-EEETRVSTSEKVVGMQ-RIH-TYTVS-YGQVIK-SITPVP 0.323 E DED-Y-PVEEDEASNNPSDDPENEEEENQNV-DDDKKRNTEVKEEDKL-RDA-EDDED-LTKIKE-PPELSV 0.493 A IIA-I-SSTAASANAAASAAAVAAAAAARAAAVAAIITAAATAVAAVARAAS-AVVTA-PVASTCSAVVTAQ 0.320 V SPN-P-DSTVVTRLDPDSPKRVNNAEEAVKSEADVDERSGEPKPKADKPRVLMNDEER-EPQFTPNRQPSNH 0.351 P IIIII-VIIPPIPVIVLVPLLPRLPPKVNKEVQALIIIPPRKKIFEIDSPAPTIIDIR-NVFRVPAIQVIDR 0.372 G ESQEE-WKSGHNNHKAFGGGHHGHHHGNENGGGTGKTYGGNFNHDNEDGHGGRSEGNGTFGGPEHTEQGVIT 0.388 S QDDQY-HEESSQEQLSSQSQQSQEESESTSQSEAQDDLDEEQSLFSQEGADAHNDKEQTNARRSSSDNKKIE 0.347 W AAYAV-QEVWYWYNHANFFFFIFFFAAAVAFQYTLAAAAFMYFTWFVFYWAAWFAAFFPVKHYLRGYYYYAF 0.739 P PPPPP-PKPPPSPPPPPPKPPPPPPPPPPPPPPTPPPPVPPPQPTPPTSPPPPPPPPPPPTRPGPPPGPPRP 0.521 W FFYWW-HYYWWSWWWWWYYWWYYYYYFYFYHHWFYWWYYFYWYWFYYGFFWYWWYYMYYYYYYGWWSYFFSF Execution example:: python program.py Data file: alignment el - A C D E F G H I K L M N P Q R S T V W X Y C 58 3 2 1 1 0 0 1 0 0 1 0 0 0 0 0 2 1 2 0 0 0 G 58 3 0 0 0 3 3 0 0 0 0 0 0 0 0 0 2 1 1 0 0 1 V 56 0 0 0 1 0 0 0 6 0 4 0 1 0 0 0 0 0 3 0 0 1 P 57 0 1 1 1 0 2 0 0 2 0 0 0 3 0 2 2 0 1 0 0 0 A 57 2 1 0 0 1 0 0 1 1 2 0 1 0 0 0 2 1 2 0 0 1 I 57 0 0 0 1 1 0 1 3 0 3 0 0 3 1 0 0 0 1 0 0 1 Q 56 0 0 0 5 0 4 0 0 1 2 0 1 0 2 0 0 1 0 0 0 0 P 56 1 0 1 0 0 0 0 1 1 0 0 0 6 1 0 3 1 1 0 0 0 V 36 3 2 1 4 0 2 0 0 2 2 1 1 5 1 3 3 2 3 1 0 0 .... .. warning:: Note that not all the rows contain all the possible aminoacids and gaps (see all the 0 values present in the table). In order to account for this, it's necessary to extract an alphabet (you can see it in the first row of the output) with all the possible alternatives, to be used as a reference. You can use five separate functions: #. A function reading the file ``filename`` and returning a list of couples residue-alignment #. A function taking as input the list read from the file, and returning an alphabetically ordered list of all the possible characters appearing in the alignment, to be used as *alphabet* #. A function taking as input a string with an alignment (in a position) and returning a profile (dictionary from character to the corresponding number of occurrences in the alignment) #. A function taking as input the list read from the file and the *alphabet*, printing a header (see execution example) and, for each position in the list, the corresponding sequence element and the profile (obtained from function 3) ordered as *alphabet* (if the letter is not present in the profile, the number of occurrences will be 0) #. A function that runs the program using the functions defined above. | #. Write a Python program that: - takes as input a file name ``filename`` containing the annotation of a protein sequence in UNIPROT format - prints, for each secondary structure element (HELIX, STRAND, TURN), the average length of the element, in amino acids, and the average number of consecutive times the element occurs in the sequence Input example (protein `Q1NZL3 `_ on Uniprot):: ID ZN224_HUMAN Reviewed; 707 AA. AC Q9NZL3; A6NFW9; P17033; Q86V10; Q8IZC8; Q9UID9; Q9Y2P6; DT 08-DEC-2000, integrated into UniProtKB/Swiss-Prot. DT 30-NOV-2010, sequence version 3. .... .... .... FT {ECO:0000305}. FT CONFLICT 562 562 H -> R (in Ref. 1; AAF04106). FT {ECO:0000305}. FT STRAND 174 177 {ECO:0000244|PDB:2EN8}. FT TURN 179 181 {ECO:0000244|PDB:2EN8}. FT STRAND 184 187 {ECO:0000244|PDB:2EN8}. FT HELIX 188 195 {ECO:0000244|PDB:2EN8}. FT TURN 196 198 {ECO:0000244|PDB:2EN8}. FT STRAND 207 209 {ECO:0000244|PDB:2EM6}. FT HELIX 216 223 {ECO:0000244|PDB:2EM6}. FT TURN 224 226 {ECO:0000244|PDB:2EM6}. FT STRAND 227 229 {ECO:0000244|PDB:2EM6}. FT STRAND 235 237 {ECO:0000244|PDB:2ELY}. FT STRAND 241 243 {ECO:0000244|PDB:2ELY}. Execution example:: python program.py Data file: Q9NZL3.txt average lengths {’TURN’: 3.125, ’HELIX’: 7.285714285714286, ’STRAND’: 3.323529411764706} average number of consecutive occurrences {’TURN’: 1.0, ’HELIX’: 1.4, ’STRAND’: 1.736842105263158} You can use five separate functions: #. A function reading the file ``filename`` and returning a list of pairs: secondary structure element (*ss*) - length. #. A function taking as input the list read from the file, and returning two dictionaries, both having *ss* as keys, the first having as value the corresponding sum of lengths, the second the number of times the *ss* element occurs. #. A function taking as input two dictionaries and returning a new dictionary obtained by normalizing the values of the first with the values of the second. #. A function taking as input the list read from the file, and returning two dictionaries, both having *ss* as keys: the first dictionary maps from *ss* the number of occurrences; the second maps from *ss* to the number of non-consecutive occurrences. The function will return the normalized dictionary of occurrences. #. A function that implements the program using the above functions. | #. Write a Python program that: #. Takes a ``path`` to a file describing a list of donor (D) splicing sites in FASTA format. Each site contains 15 base pairs: 7 before the site, 8 after the site. #. An integer ``threshold`` in between 2 and 8. The program should: #. Print the number of occurrences of the sub-sequences (1) before and (2) after the donor site, for all sub-sequences of length in-between 2 and 7 (before the site) or 8 (after the site). Sub-sequences occurring less than ``threshold`` times should not be printed. The number of occurrences should be computed over **all** the sequences contained in the FASTA file. The full input file is here: https://drive.google.com/open?id=0B0wILN942aEVWGM4UU9mOG52ODQ Example input :: >HUMGLUT4B_2257 <-- header CTCCGAAGTAGGATT <-- site sequence (7 before + 8 after) >HUMGLUT4B_3651 TCAGAAGGTGAGGGC >HUMGLUT4B_3935 TTGGAAGGTTCGCAG >HUMGLUT4B_4152 TACTCAGGTACTCAC >HUMGLUT4B_4379 CGCCCAGGTGACCGG >HUMGLUT4B_4669 AGAAAGAGTAAGCTC ... For instance, given the sequence:: CTCCGAAGTAGGATT the sequence after the site is ``GTAGGATT``, and its sub-sequences are:: GT, GTA, GTAG, ..., GTAGGATT while the sequence before the site is ``CTCCGAA``, and its sub-sequences are:: AA, GAA, CGAA, ..., CTCCGAA The expected result with ``threshold = 100`` is:: > python splice_patterns.py Insert path: splice_donor.fasta Insert threshold: 100 Most frequent sub-sequences before the site: {'AAG': 208, 'AG': 583, 'GG': 107, 'GAG': 108, 'TG': 135, 'CAG': 247} Most frequent sub-sequences after the site: {'GTA': 568, 'GT': 1116, 'GTAAGT': 126, 'GTGAG': 388, 'GTG': 491, 'GTGA': 399, 'GTGAGT': 204, 'GTAA': 379, 'GTAAG': 300, 'GTGAGTG': 116} meaning that ``"AAG"`` appears ``208`` times before the site, ``"AG"`` appears ``583`` times before the site, ..., ``"GTA"`` appears ``568`` times after the site, ``"GT"`` appears ``1116`` times after the site, etc. .. note:: We suggest implementing five functions: #. A function that reads the input file and returns two collections: a collection of all sequences before the site, and a collection of all sequences after the site. #. A function that takes a collection of sequences after the site, and returns a dictionary mapping from sub-sequences to number of occurrences. #. A function that takes a collection of sequences before the site, and returns a dictionary mapping from sub-sequences to number of occurrences. (note that sub-sequences start at the end of the sequence of the file, and grow backwards.) #. A function that takes a dictionary of counts and a threshold, and returns a dictionary of counts with only the key-value pairs whose value is larger than or equal to the threshold. #. A function that implements the program using the above. #. Write a Python program that: #. Takes a ``path`` to a file describing the association between proteins and the PFAM domains that appear in the protein. #. Prints the list of all the PFAM domains (including their ID and name), sorted by number of occurrences, as well as the number of proteins that each PFAM domain appears into. The full input file is `output_pfam `_. Example input:: 78184355|ref|YP_376790.1| 28 145 PF01475.11 1 124 ls 62.1 2.1e-15 FUR 78185596|ref|YP_378030.1| 31 111 PF00575.15 1 79 fs 33.4 2.8e-08 S1 78185596|ref|YP_378030.1| 113 381 PF10150.1 1 273 ls 436.7 3.5e-128 RNase_E_G 78185773|ref|YP_378207.1| 70 222 PF04087.6 1 147 ls 221.5 2.1e-63 DUF389 78183836|ref|YP_376270.1| 13 375 PF01266.16 1 399 ls 182.5 1.2e-51 DAO 78184892|ref|YP_377327.1| 5 191 PF00702.18 1 184 fs 99.1 5.5e-28 Hydrolase 78183858|ref|YP_376292.1| 78 482 PF00909.13 1 455 ls 519.6 4.1e-153 Ammonium_transp ... ^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^ ^^^^^^^^^^^^^^^ protein domain ID domain name The expected result is:: > python pfam_statistics.py Type path: output_pfam PFAM ID PFAM Name NumOcc NumDiffProt PF00005.19 ABC_tran 35 32 PF00132.16 Hexapep 29 5 PF00534.12 Glycos_transf_1 21 20 PF00515.20 TPR_1 17 8 PF01370.13 Epimerase 16 16 PF03130.8 HEAT_PBS 15 6 PF04055.13 Radical_SAM 14 14 PF00502.11 Phycobilisome 13 12 PF00427.13 PBS_linker_poly 13 10 PF00271.23 Helicase_C 13 13 PF00528.14 BPD_transp_1 12 11 PF00004.21 AAA 12 12 PF00805.14 Pentapeptide 11 5 PF00535.18 Glycos_transf_2 11 11 PF00353.11 HemolysinCabind 11 3 PF07862.3 Nif11 10 10 PF01926.15 MMR_HSR1 10 9 The output shows that, for instance, the domain with ID ``PF00005.19`` and name ``ABC_tran`` occurs ``35`` different times in the complete file, and in particular it occurs in ``32`` distinct proteins. .. warning:: The output text columns are not required to be aligned. .. note:: We suggest implementing five functions: #. A function that reads the input file and returns a dictionary mapping from domain PFAM ID to doman PFAM name, and for each protein the list of its domains. #. A function that takes the information read from the file, and returns a dictionary from domains to their number of occurrences. #. A function that takes the information read from the file, and returns a dictionary from domains to their number of distinct proteins they appear into. #. A function that print the output as shown in the expected results. The domains should be sorted by their total number of occurrences. #. A function that implements the program using the above. #. Write a Python program that: #. Takes a ``path`` to a file containing protein sequences in FASTA format. The header of the FASTA contains information about the cellular localization and the organism of the protein. #. Prints, for each cellular localization, the number of proteins, divided by organism, and sorted from the most frequent to the less frequent organism. The full input file is here: https://drive.google.com/file/d/0B7pmXOEcMgmZUG5WTHk1dXh2dFk/view?usp=sharing Example input:: >7B2_HUMAN:Secretory MVSRMVSTMLSGLLFWLASGWTPAFAYSPRTPDRVSEADIQRLLHGVME ... >A1AG1_MUSCR:Secretory MALHMILVMLSLLPLLEAQNPEHVNITIGEPITNETLSWLSDKWFFIGA ... >A1BG_HUMAN:Secretory MSMLVVFLLLWGVTWGPVTEAAIFYETQPSLWAESESLLKPLANVTLTC ... ... >ZNF22_RAT:Nucleus MRLGKPQKGGISRSATQGKTYESRRKTARQRQKWGVAIRFDSGLSRRRR ... >ZNF42_HUMAN:Nucleus MRPAVLGSPDRAPPEDEGPVMVKLEDSEEEGEAALWDPGPEAARLRFRC ... The expected result is:: > python program.py Name of file: sequences.fasta Mitochondrion 107:HUMAN 27:BOVIN 18:DROME 16:MOUSE 7:RAT 3:ASCSU ... Secretory 191:HUMAN 79:MOUSE 59:DROME 35:BOVIN 30:RAT 19:CHICK ... Nucleus 636:HUMAN 188:DROME 145:MOUSE 65:CAEEL 27:RAT ... Cytoplasm 203:HUMAN 72:MOUSE 47:DROME 30:RAT 20:CAEEL 13:BOVIN ... .. note:: We suggest implementing four functions: #. A function that reads the input file and returns a dictionary mapping from each cellular localization to the list of protein with that localization #. A function that takes the dictionary generated by the previous function and returns a new dictionary, mapping from cellular localizations to dictionaries of counts (mapping from organism to number of proteins belonging to the organism) #. A function that takes the dictionary generated by the previous function and, for each cellular localization, prints the list of counts, ordered by decreasing number of occurrences. #. A function that implements the program using the above functions. #. Write a function ``findPerfectMatches(filename,pattern)`` that: #. Takes the name of a file ``filename`` containing a list of RNA sequences, and an RNA string (for example, a microRNA sequence). #. Prints, for each sequence, the name and the list of positions corresponding to perfect matches (by complementarity) with the string (only for sequences with at least one match). The expected result is:: >>> import utility >>> utility.findPerfectMatches(’utr.txt’,’acgaatt’) ENSG00000050344 [1186] ENSG00000109929 [204, 373, 3336] ENSG00000155850 [2162, 5387] ENSG00000073756 [1152] ENSG00000175445 [693] ENSG00000159167 [781] ENSG00000138061 [1229] ENSG00000152268 [1362] ENSG00000197121 [3024] ENSG00000115159 [1111] ENSG00000169908 [1695] ENSG00000150938 [1751] ENSG00000179314 [2782] ENSG00000075223 [1743] .. note:: We suggest implementing five functions: #. A function that reads the input file and returns a dictionary mapping from name to sequence. #. A function that takes as input a string and returns the complementary sequence (note that sequences and strings have T, and not U, even if they represent RNA sequences). #. A function that takes two sequences and return the list of occurrences of the second string on the first one (look carefully at the help of the function ``find``). #. A function taking the dictionary of sequences and the *converted* string, that calculates for each sequence the match positions, using the previous function, and if there is at least one match, prints the name of the sequence and the list of matches. #. A function that implements the program using the above. #. Midterm exam 2016/11/02: Given: #. The ``hierarchy.txt`` file. #. Optionally, a pair of protein identifiers. write a Python program that: #. If the two protein identifiers are given, prints the list of domains shared by the two proteins; further, for each shared domain, it prints the residue histogram of their sequences. #. If no protein identifiers are given, it prints the above information for all pairs of proteins. The output text columns are not required to be aligned. .. warning:: Same domains appear in more than one protein; different occurrences of a domain however may have different sequences! Example input:: # Protein Domain Position Residue YAL003W PF00736 00120 W YAL003W PF00736 00121 I YAL003W PF00736 00154 Q YAL038W PF00224 00077 E YAL038W PF00224 00236 K YAL038W PF00224 00327 H YAL038W PF00224 00362 I YAL038W PF02887 00391 L YAL038W PF02887 00415 S YAL038W PF02887 00432 G YAL038W PF02887 00441 F YAL038W PF02887 00450 M ... Expected output:: $ python commdomains.py insert the path to the hierarchy file: hierarchy.txt all pairs? (Y or N) N insert 1st protein ID: YIL109C insert 2nd protein ID: YPR181C YIL109C YPR181C shared domains: ['PF00626', 'PF04810', 'PF04811'] PF00626 histogram = {'I': 1, 'R': 1, 'T': 1, 'W': 3, 'V': 1} PF04810 histogram = {'S': 1, 'R': 1} PF04811 histogram = {'A': 1, 'E': 1, 'D': 3, 'G': 4, 'F': 3} PF04815 histogram = {'C': 1, 'D': 2, 'I': 1, 'Q': 2, 'P': 2} PF08033 histogram = {'C': 1, 'E': 2, 'D': 1, 'I': 3, 'L': 2} .. note:: We suggest implementing five functions: #. A function that takes a path to the hierarchy file, and returns a map from each protein to a dictionary; this dictionary maps from each domain in the protein to its sequence. Picture: ``sequence = hierarchy[protein][domain]`` #. A function that takes two lists and returns their intersection. #. A function that takes a sequence and returns the histogram of its residues. #. A function that takes the hierarchy read from the file and two protein identifiers, and prints the shared domains. Then, for every shared domain, it extracts the domain sequences from the two proteins, and prints their histogram. #. A function that implements the program.