================
Python: Programs
================

Exercises
---------

#. Write a Python program that:

      - takes as input a file name ``filename`` containing the multiple
        alignment of a protein sequence

      - prints, for each sequence position, the corresponding *alignment
        profile*, i.e how many times each amino acid (considering also gaps as
        an option) was found in that position of the alignment.

   Input example (copy-and-paste into a file)::

      0.012 C ---------CCAHSLVVTEASAD-------------------------------------------------
      0.013 G ---------GGGYFAVFATAFSS-------------------------------------------------
      0.024 V ---------VILINLLLYVIIIE-------VI----------------------------------------
      0.014 P ---------PPPCKRGKG-SEDS-------RV----------------------------------------
      0.013 A ---------AKACLSSVY-VIN--------FT------L---------------------------------
      0.014 I ---------IIPPEY-QLLVFI--------HL------P---------------------------------
      0.019 Q ---------QEKENEEGTQGGE--------LG------L---------------------------------
      0.020 P ---------PPTPPSSQKP-DI--------SV------P------A--------------------------
      0.077 V --N------VKAP-PPREESTPVGAVKGTCR-------P--L---SR-E----L-DC---AE-W-S---MQ-
      0.087 L --A------LLLP-VISESPKEDHSPPPSSP-------L--E---FD-IG---WDRG---EA-T-T---QI-
      0.106 S --K------SPPL-GSEERIPVTSSSGETGF-------N--KA--ADDQT-F-RLLW--KHP-G-P---TQ-
      0.103 G --T------GGQY-V-PIQKPDNIRFFQTSF-------DG-LT--DSRGA-Y-TDSD--AVD-R-Q---VA-
      0.202 L --F------LAA--RQLQSDSSGPATEMSHG-QQQSSSRD-SGQSKTRNANV-HQMPV-LRGLH-Q--NEGS
      0.274 S VVI-E-PPTSNE--EFYSSNVLLPKGGGPSA-GGGFLSEGPGEENPNSDSGR-ASASGPKSFAP-P-DENQT
      0.399 R GGR-R-RRRRRRR-QKRRRRKER-RFRRTRR-RRRATSRKKRERGDMRRERS-RRLRRDELETR-SAALKDW
      0.610 I IVI-I-IIIIIIM-IIVIVIIIVIIVIVRII-IIIILIIIIIIIIVIIIIII-IIIIVHYVITI-IYLVIVL
      0.463 V YAV-V-VIVVVVAVLLIVIVVVVIVIIVIIV-LMAHQRVVVVIVYAVRVVVV-IIIVAINMVVYPISHLRII
      0.506 N GGG-G-GGGNGGYEGGNGNSGNNEGNNGNCG-GGGGGNGNGGNGNNGEGGGG-GGGNDANGGSNLGSRGHGR
      0.794 G GGG-G-GGGGGGGAGGGGGGGGGGGGGGGGG-GGGGDGGGGGGGGGGGGGGG-GGGGGGSGGTGGGSGGGGS
      0.288 E HHQ-S-HHKESITSHHKSRSSQQTDTHSGQS-EEETRVSTSEKVVGMQ-RIH-TYTVS-YGQVIK-SITPVP
      0.323 E DED-Y-PVEEDEASNNPSDDPENEEEENQNV-DDDKKRNTEVKEEDKL-RDA-EDDED-LTKIKE-PPELSV
      0.493 A IIA-I-SSTAASANAAASAAAVAAAAAARAAAVAAIITAAATAVAAVARAAS-AVVTA-PVASTCSAVVTAQ
      0.320 V SPN-P-DSTVVTRLDPDSPKRVNNAEEAVKSEADVDERSGEPKPKADKPRVLMNDEER-EPQFTPNRQPSNH
      0.351 P IIIII-VIIPPIPVIVLVPLLPRLPPKVNKEVQALIIIPPRKKIFEIDSPAPTIIDIR-NVFRVPAIQVIDR
      0.372 G ESQEE-WKSGHNNHKAFGGGHHGHHHGNENGGGTGKTYGGNFNHDNEDGHGGRSEGNGTFGGPEHTEQGVIT
      0.388 S QDDQY-HEESSQEQLSSQSQQSQEESESTSQSEAQDDLDEEQSLFSQEGADAHNDKEQTNARRSSSDNKKIE
      0.347 W AAYAV-QEVWYWYNHANFFFFIFFFAAAVAFQYTLAAAAFMYFTWFVFYWAAWFAAFFPVKHYLRGYYYYAF
      0.739 P PPPPP-PKPPPSPPPPPPKPPPPPPPPPPPPPPTPPPPVPPPQPTPPTSPPPPPPPPPPPTRPGPPPGPPRP
      0.521 W FFYWW-HYYWWSWWWWWYYWWYYYYYFYFYHHWFYWWYYFYWYWFYYGFFWYWWYYMYYYYYYGWWSYFFSF

   Execution example::

      python program.py
      Data file: alignment
      el - A C D E F G H I K L M N P Q R S T V W X Y
      C 58 3 2 1 1 0 0 1 0 0 1 0 0 0 0 0 2 1 2 0 0 0
      G 58 3 0 0 0 3 3 0 0 0 0 0 0 0 0 0 2 1 1 0 0 1
      V 56 0 0 0 1 0 0 0 6 0 4 0 1 0 0 0 0 0 3 0 0 1
      P 57 0 1 1 1 0 2 0 0 2 0 0 0 3 0 2 2 0 1 0 0 0
      A 57 2 1 0 0 1 0 0 1 1 2 0 1 0 0 0 2 1 2 0 0 1
      I 57 0 0 0 1 1 0 1 3 0 3 0 0 3 1 0 0 0 1 0 0 1
      Q 56 0 0 0 5 0 4 0 0 1 2 0 1 0 2 0 0 1 0 0 0 0
      P 56 1 0 1 0 0 0 0 1 1 0 0 0 6 1 0 3 1 1 0 0 0
      V 36 3 2 1 4 0 2 0 0 2 2 1 1 5 1 3 3 2 3 1 0 0
      ....

   .. warning::
      Note that not all the rows contain all the possible aminoacids and gaps
      (see all the 0 values present in the table). In order to account for
      this, it's necessary to extract an alphabet (you can see it in the first
      row of the output) with all the possible alternatives, to be used as a
      reference.

   You can use five separate functions:

      #. A function reading the file ``filename`` and returning a list of
         couples residue-alignment
      #. A function taking as input the list read from the file, and returning
         an alphabetically ordered list of all the possible characters
         appearing in the alignment, to be used as *alphabet*
      #. A function taking as input a string with an alignment (in a position)
         and returning a profile (dictionary from character to the
         corresponding number of occurrences in the alignment)
      #. A function taking as input the list read from the file and the
         *alphabet*, printing a header (see execution example) and, for each
         position in the list, the corresponding sequence element and the
         profile (obtained from function 3) ordered as *alphabet* (if the
         letter is not present in the profile, the number of occurrences will
         be 0)
      #. A function that runs the program using the functions defined above.

   |

#. Write a Python program that:

      - takes as input a file name ``filename`` containing the annotation of a
        protein sequence in UNIPROT format
      - prints, for each secondary structure element (HELIX, STRAND, TURN), the
        average length of the element, in amino acids, and the average number
        of consecutive times the element occurs in the sequence

   Input example (protein `Q1NZL3 <https://drive.google.com/open?id=0B0wILN942aEVdTJjVHJ0dFNqVHM>`_ on Uniprot)::

      ID ZN224_HUMAN Reviewed; 707 AA.
      AC Q9NZL3; A6NFW9; P17033; Q86V10; Q8IZC8; Q9UID9; Q9Y2P6;
      DT 08-DEC-2000, integrated into UniProtKB/Swiss-Prot.
      DT 30-NOV-2010, sequence version 3.
      ....
      ....
      ....
      FT {ECO:0000305}.
      FT CONFLICT 562 562 H -> R (in Ref. 1; AAF04106).
      FT {ECO:0000305}.
      FT STRAND 174 177 {ECO:0000244|PDB:2EN8}.
      FT TURN 179 181 {ECO:0000244|PDB:2EN8}.
      FT STRAND 184 187 {ECO:0000244|PDB:2EN8}.
      FT HELIX 188 195 {ECO:0000244|PDB:2EN8}.
      FT TURN 196 198 {ECO:0000244|PDB:2EN8}.
      FT STRAND 207 209 {ECO:0000244|PDB:2EM6}.
      FT HELIX 216 223 {ECO:0000244|PDB:2EM6}.
      FT TURN 224 226 {ECO:0000244|PDB:2EM6}.
      FT STRAND 227 229 {ECO:0000244|PDB:2EM6}.
      FT STRAND 235 237 {ECO:0000244|PDB:2ELY}.
      FT STRAND 241 243 {ECO:0000244|PDB:2ELY}.

   Execution example::

      python program.py
      Data file: Q9NZL3.txt
      average lengths
      {’TURN’: 3.125, ’HELIX’: 7.285714285714286,
      ’STRAND’: 3.323529411764706}
      average number of consecutive occurrences
      {’TURN’: 1.0, ’HELIX’: 1.4, ’STRAND’: 1.736842105263158}

   You can use five separate functions:

      #. A function reading the file ``filename`` and returning a list of
         pairs: secondary structure element (*ss*) - length.
      #. A function taking as input the list read from the file, and returning
         two dictionaries, both having *ss* as keys, the first having as value
         the corresponding sum of lengths, the second the number of times the
         *ss* element occurs.
      #. A function taking as input two dictionaries and returning a new
         dictionary obtained by normalizing the values of the first with the
         values of the second.
      #. A function taking as input the list read from the file, and returning
         two dictionaries, both having *ss* as keys: the first dictionary maps
         from *ss* the number of occurrences; the second maps from *ss* to the
         number of non-consecutive occurrences. The function will return the
         normalized dictionary of occurrences.
      #. A function that implements the program using the above functions.

   |

#. Write a Python program that:

   #. Takes a ``path`` to a file describing a list of donor (D) splicing sites
      in FASTA format. Each site contains 15 base pairs: 7 before the site,
      8 after the site.
   #. An integer ``threshold`` in between 2 and 8.

   The program should:

   #. Print the number of occurrences of the sub-sequences (1) before and (2)
      after the donor site, for all sub-sequences of length in-between 2 and 7
      (before the site) or 8 (after the site). Sub-sequences occurring less
      than ``threshold`` times should not be printed.

      The number of occurrences should be computed over **all** the sequences
      contained in the FASTA file.

   The full input file is here:

        https://drive.google.com/open?id=0B0wILN942aEVWGM4UU9mOG52ODQ

   Example input ::

      >HUMGLUT4B_2257 <-- header
      CTCCGAAGTAGGATT <-- site sequence (7 before + 8 after)
      >HUMGLUT4B_3651
      TCAGAAGGTGAGGGC
      >HUMGLUT4B_3935
      TTGGAAGGTTCGCAG
      >HUMGLUT4B_4152
      TACTCAGGTACTCAC
      >HUMGLUT4B_4379
      CGCCCAGGTGACCGG
      >HUMGLUT4B_4669
      AGAAAGAGTAAGCTC
      ...

   For instance, given the sequence::

      CTCCGAAGTAGGATT

   the sequence after the site is ``GTAGGATT``, and its sub-sequences are::

      GT, GTA, GTAG, ..., GTAGGATT

   while the sequence before the site is ``CTCCGAA``, and its sub-sequences are::

      AA, GAA, CGAA, ...,  CTCCGAA

   The expected result with ``threshold = 100`` is::

      > python splice_patterns.py
      Insert path: splice_donor.fasta
      Insert threshold: 100
      Most frequent sub-sequences before the site:
      {'AAG': 208, 'AG': 583, 'GG': 107, 'GAG': 108,
      'TG': 135, 'CAG': 247}
      Most frequent sub-sequences after the site:
      {'GTA': 568, 'GT': 1116, 'GTAAGT': 126, 'GTGAG': 388,
      'GTG': 491, 'GTGA': 399, 'GTGAGT': 204, 'GTAA': 379,
      'GTAAG': 300, 'GTGAGTG': 116}

   meaning that ``"AAG"`` appears ``208`` times before the site, ``"AG"``
   appears ``583`` times before the site, ..., ``"GTA"`` appears ``568``
   times after the site, ``"GT"`` appears ``1116`` times after the site, etc.

   .. note::
      We suggest implementing five functions:

      #. A function that reads the input file and returns two collections:
         a collection of all sequences before the site, and a collection
         of all sequences after the site.
      #. A function that takes a collection of sequences after the site, and
         returns a dictionary mapping from sub-sequences to number of
         occurrences.
      #. A function that takes a collection of sequences before the site, and
         returns a dictionary mapping from sub-sequences to number of
         occurrences. (note that sub-sequences start at the end of the
         sequence of the file, and grow backwards.)
      #. A function that takes a dictionary of counts and a threshold, and
         returns a dictionary of counts with only the key-value pairs whose
         value is larger than or equal to the threshold.
      #. A function that implements the program using the above.

#. Write a Python program that:

   #. Takes a ``path`` to a file describing the association between proteins
      and the PFAM domains that appear in the protein.
   #. Prints the list of all the PFAM domains (including their ID and name),
      sorted by number of occurrences, as well as the number of proteins that
      each PFAM domain appears into.

   The full input file is `output_pfam <https://drive.google.com/open?id=0B0wILN942aEVelgzUGdhNG9CUE0>`_.

   Example input::

      78184355|ref|YP_376790.1|     28   145 PF01475.11      1   124 ls    62.1   2.1e-15               FUR
      78185596|ref|YP_378030.1|     31   111 PF00575.15      1    79 fs    33.4   2.8e-08                S1
      78185596|ref|YP_378030.1|    113   381 PF10150.1       1   273 ls   436.7  3.5e-128         RNase_E_G
      78185773|ref|YP_378207.1|     70   222 PF04087.6       1   147 ls   221.5   2.1e-63            DUF389
      78183836|ref|YP_376270.1|     13   375 PF01266.16      1   399 ls   182.5   1.2e-51               DAO
      78184892|ref|YP_377327.1|      5   191 PF00702.18      1   184 fs    99.1   5.5e-28         Hydrolase
      78183858|ref|YP_376292.1|     78   482 PF00909.13      1   455 ls   519.6  4.1e-153   Ammonium_transp
      ...

      ^^^^^^^^^^^^^^^^^^^^^^^^^              ^^^^^^^^^^                                     ^^^^^^^^^^^^^^^
            protein                          domain ID                                       domain name

   The expected result is::

      > python pfam_statistics.py
      Type path: output_pfam
      PFAM ID     PFAM Name         NumOcc  NumDiffProt
      PF00005.19  ABC_tran          35      32
      PF00132.16  Hexapep           29      5
      PF00534.12  Glycos_transf_1   21      20
      PF00515.20  TPR_1             17      8
      PF01370.13  Epimerase         16      16
      PF03130.8   HEAT_PBS          15      6
      PF04055.13  Radical_SAM       14      14
      PF00502.11  Phycobilisome     13      12
      PF00427.13  PBS_linker_poly   13      10
      PF00271.23  Helicase_C        13      13
      PF00528.14  BPD_transp_1      12      11
      PF00004.21  AAA               12      12
      PF00805.14  Pentapeptide      11      5
      PF00535.18  Glycos_transf_2   11      11
      PF00353.11  HemolysinCabind   11      3
      PF07862.3   Nif11             10      10
      PF01926.15  MMR_HSR1          10      9

   The output shows that, for instance, the domain with ID ``PF00005.19`` and
   name ``ABC_tran`` occurs ``35`` different times in the complete file, and
   in particular it occurs in ``32`` distinct proteins.

   .. warning::
      The output text columns are not required to be aligned.

   .. note::
      We suggest implementing five functions:

      #. A function that reads the input file and returns a dictionary mapping
         from domain PFAM ID to doman PFAM name, and for each protein the
         list of its domains.
      #. A function that takes the information read from the file, and returns
         a dictionary from domains to their number of occurrences.
      #. A function that takes the information read from the file, and returns
         a dictionary from domains to their number of distinct proteins they
         appear into.
      #. A function that print the output as shown in the expected results. The
         domains should be sorted by their total number of occurrences.
      #. A function that implements the program using the above.

#. Write a Python program that:

   #. Takes a ``path`` to a file containing protein sequences in FASTA format. The header of the FASTA contains information about the cellular localization and the organism of the protein.
   #. Prints, for each cellular localization, the number of proteins, divided by organism, and sorted from the most frequent to the less frequent organism.
   
   The full input file is  here:

        https://drive.google.com/file/d/0B7pmXOEcMgmZUG5WTHk1dXh2dFk/view?usp=sharing
   
   Example input::
   
       >7B2_HUMAN:Secretory
       MVSRMVSTMLSGLLFWLASGWTPAFAYSPRTPDRVSEADIQRLLHGVME ...
       >A1AG1_MUSCR:Secretory
       MALHMILVMLSLLPLLEAQNPEHVNITIGEPITNETLSWLSDKWFFIGA ...
       >A1BG_HUMAN:Secretory
       MSMLVVFLLLWGVTWGPVTEAAIFYETQPSLWAESESLLKPLANVTLTC ...
       ...
       >ZNF22_RAT:Nucleus
       MRLGKPQKGGISRSATQGKTYESRRKTARQRQKWGVAIRFDSGLSRRRR ...
       >ZNF42_HUMAN:Nucleus
       MRPAVLGSPDRAPPEDEGPVMVKLEDSEEEGEAALWDPGPEAARLRFRC ...
	   
   The expected result is::
   
       > python program.py
       Name of file: sequences.fasta
       Mitochondrion
       107:HUMAN 27:BOVIN 18:DROME 16:MOUSE 7:RAT 3:ASCSU ...
       Secretory
       191:HUMAN 79:MOUSE 59:DROME 35:BOVIN 30:RAT 19:CHICK ...
       Nucleus
       636:HUMAN 188:DROME 145:MOUSE 65:CAEEL 27:RAT ...
       Cytoplasm
       203:HUMAN 72:MOUSE 47:DROME 30:RAT 20:CAEEL 13:BOVIN ...

   .. note::
      We suggest implementing four functions:

      #. A function that reads the input file and returns a dictionary mapping
         from each cellular localization to the list of protein with that localization
      #. A function that takes the dictionary generated by the previous function and returns a new dictionary, mapping from cellular localizations to dictionaries of counts (mapping from organism to number of proteins belonging to the organism)
      #. A function that takes the dictionary generated by the previous function and, for each cellular localization, prints the list of counts, ordered by decreasing number of occurrences.
      #. A function that implements the program using the above functions.	   
 
#. Write a function ``findPerfectMatches(filename,pattern)`` that: 

   #. Takes the name of a file ``filename`` containing a list of RNA sequences, and an RNA string (for example, a microRNA sequence).
   #. Prints, for each sequence, the name and the list of positions corresponding to perfect matches (by complementarity) with the string (only for sequences with at least one match).

   The expected result is::
   
       >>> import utility
       >>> utility.findPerfectMatches(’utr.txt’,’acgaatt’)
       ENSG00000050344 [1186]
       ENSG00000109929 [204, 373, 3336]
       ENSG00000155850 [2162, 5387]
       ENSG00000073756 [1152]
       ENSG00000175445 [693]
       ENSG00000159167 [781]
       ENSG00000138061 [1229]
       ENSG00000152268 [1362]
       ENSG00000197121 [3024]
       ENSG00000115159 [1111]
       ENSG00000169908 [1695]
       ENSG00000150938 [1751]
       ENSG00000179314 [2782]
       ENSG00000075223 [1743]
	   
   .. note::
      We suggest implementing five functions:

      #. A function that reads the input file and returns a dictionary mapping
         from name to sequence.
      #. A function that takes as input a string and returns the complementary sequence (note that sequences and strings have T, and not U, even if they represent RNA sequences).
      #. A function that takes two sequences and return the list of occurrences of the second string on the first one (look carefully at the help of the function ``find``). 
      #. A function taking the dictionary of sequences and the *converted* string, that calculates for each sequence the match positions, using the previous function, and if there is at least one match, prints the name of the sequence and the list of matches.
      #. A function that implements the program using the above.

#. Midterm exam 2016/11/02: 

   Given:
      #. The ``hierarchy.txt`` file.
      #. Optionally, a pair of protein identifiers.
   
   write a Python program that:
      #. If the two protein identifiers are given, prints the list of
         domains shared by the two proteins; further, for each shared
         domain, it prints the residue histogram of their sequences.
      #. If no protein identifiers are given, it prints the above
         information for all pairs of proteins. The output text columns 
         are not required to be aligned.

   .. warning::  
      Same domains appear in more than one protein; different
      occurrences of a domain however may have different sequences!

   Example input::
   
      # Protein Domain Position Residue
      YAL003W PF00736 00120 W
      YAL003W PF00736 00121 I
      YAL003W PF00736 00154 Q
      YAL038W PF00224 00077 E
      YAL038W PF00224 00236 K
      YAL038W PF00224 00327 H
      YAL038W PF00224 00362 I
      YAL038W PF02887 00391 L
      YAL038W PF02887 00415 S
      YAL038W PF02887 00432 G
      YAL038W PF02887 00441 F
      YAL038W PF02887 00450 M
      ...
	  
   Expected output::
   
      $ python commdomains.py
      insert the path to the hierarchy file: hierarchy.txt
      all pairs? (Y or N) N
      insert 1st protein ID: YIL109C
      insert 2nd protein ID: YPR181C
      YIL109C YPR181C shared domains: ['PF00626', 'PF04810', 'PF04811'] 
      PF00626 histogram = {'I': 1, 'R': 1, 'T': 1, 'W': 3, 'V': 1}
      PF04810 histogram = {'S': 1, 'R': 1}
      PF04811 histogram = {'A': 1, 'E': 1, 'D': 3, 'G': 4, 'F': 3} 
      PF04815 histogram = {'C': 1, 'D': 2, 'I': 1, 'Q': 2, 'P': 2} 
      PF08033 histogram = {'C': 1, 'E': 2, 'D': 1, 'I': 3, 'L': 2}
	  
   .. note::
   
      We suggest implementing five functions:
	  
      #. A function that takes a path to the hierarchy file, and returns
         a map from each protein to a dictionary; this dictionary maps
         from each domain in the protein to its sequence. Picture: ``sequence = hierarchy[protein][domain]``
      #. A function that takes two lists and returns their intersection.
      #. A function that takes a sequence and returns the histogram of its residues.
      #. A function that takes the hierarchy read from the file and two
         protein identifiers, and prints the shared domains. Then, for
         every shared domain, it extracts the domain sequences from
         the two proteins, and prints their histogram.
      #. A function that implements the program.