=================
Pandas: Exercises
=================

.. hint::
    You can hone your skills with the (simple) exercises at:

        https://github.com/guipsamora/pandas_exercises


#. Given the *iris dataset*:

   #. How many rows does it contain? How many columns?

   #. Compute the average petal length

   #. Compute the average of all numerical columns

   #. Extract the petal length outliers (i.e. those rows whose petal length is
      at least 50% longer than the average petal length)

   #. Compute the standard deviation of all columns, for each iris species

   #. Extract the petal length outliers (as above) for each iris species

   #. Extract the group-wise petal length outliers, i.e. find the outliers
      (as above) for each iris species using ``groupby()``, ``aggregate()``,
      and ``merge()``.

      .. hint::
        You can pass ``as_index=False`` to ``groupby()`` to keep the
        grouping column as an actual column (rather than turning it into the
        index of the aggregated dataframe). Example::

            grouped = iris.groupby("Name", as_index=False)
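
        The same pattern, combined with ``aggregate()`` and ``merge()``, can
        be sketched on a small made-up dataframe. The ``fruit`` and ``weight``
        columns below are hypothetical and are not part of the iris dataset;
        this is only one possible way to chain the three calls::

            import pandas as pd

            # Hypothetical toy data, not the iris dataset.
            df = pd.DataFrame({
                "fruit": ["apple", "apple", "pear", "pear"],
                "weight": [100.0, 140.0, 200.0, 260.0],
            })

            # as_index=False keeps "fruit" as a regular column of the result.
            means = df.groupby("fruit", as_index=False).aggregate({"weight": "mean"})
            means = means.rename(columns={"weight": "mean_weight"})

            # Attach the group-wise mean to every original row.
            merged = df.merge(means, on="fruit")
            print(merged)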

.. note::
    All the necessary files for the following exercises are here:

        https://drive.google.com/file/d/0B7pmXOEcMgmZWGZtRGpQZE9DYVU/view?usp=sharing

    You will also need the ``value_counts()`` method of the ``Series`` class::

        >>> df = pd.DataFrame({
        ...     "col1": ["a", "a", "d", "c"],
        ...     "col2": ["x", "w", "d", "w"]
        ... })
        >>> df
          col1 col2
        0    a    x
        1    a    w
        2    d    d
        3    c    w
        >>> df.col1.value_counts()
        a    2
        c    1
        d    1
        Name: col1, dtype: int64
        >>> df.col2.value_counts()
        w    2
        d    1
        x    1
        Name: col2, dtype: int64
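
    Several exercises below ask for counts in a given order. As a rough
    sketch on the same toy dataframe, ``value_counts()`` returns the counts
    in decreasing order by default, ``sort_values()`` re-sorts them in
    increasing order, and ``nunique()`` gives the number of distinct values::

        import pandas as pd

        df = pd.DataFrame({
            "col1": ["a", "a", "d", "c"],
            "col2": ["x", "w", "d", "w"],
        })

        counts = df["col1"].value_counts()    # most frequent value first
        increasing = counts.sort_values()     # least frequent value first
        fractions = df["col1"].value_counts(normalize=True)  # relative frequencies
        n_distinct = df["col1"].nunique()     # number of distinct values

        print(counts.index[0], counts.iloc[0])
        print(increasing.iloc[-1])
        print(fractions["a"])
        print(n_distinct)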


|

.. note::
    Please refer to the following figure to refresh the biological concepts
    needed for the exercises below.

    .. image:: figures/genestructure.jpg
        :width: 100%

#. The file ``gene_table.txt`` contains summary annotation on all human genes,
   based on the Ensembl annotation:

       http://www.ensembl.org/index.html

   For each gene, this file contains:

   #. *gene_name* based on the HGNC nomenclature:

      http://www.genenames.org/

   #. *gene_biotype* for example protein_coding, pseudogene, lincRNA, miRNA
      etc. See here for a more detailed description of the biotypes:

      http://vega.sanger.ac.uk/info/about/gene_and_transcript_types.html

   #. *chromosome* on which the gene is located

   #. *strand* on which the gene is located

   #. *transcript_count* the number of known isoforms of the gene

   The first lines of the file::

      gene_name,gene_biotype,chromosome,strand,transcript_count
      TSPAN6,protein_coding,chrX,-,5
      TNMD,protein_coding,chrX,+,2
      DPM1,protein_coding,chr20,-,6
      SCYL3,protein_coding,chr1,-,5
      C1orf112,protein_coding,chr1,+,9
      ...

   Based on this file, write a program that:

   #. computes the number of genes annotated for the human genome

   #. computes the minimum, maximum, average and median number of known
      isoforms per gene (consider the *transcript_count* column as a series).

   #. plots a histogram of the number of known isoforms per gene

   #. computes the number of different biotypes

   #. computes, for each gene_biotype, the number of associated genes (a
      histogram), and prints the gene_biotype with the number of associated
      genes in decreasing order

   #. computes the number of different chromosomes

   #. computes, for each chromosome, the number of genes it contains (again, a
      histogram), and prints the chromosome with the corresponding number of
      genes in increasing order.

   #. computes, for each chromosome, the percentage of genes located on the
      ``+`` strand

   #. computes, for each biotype, the average number of transcripts associated
      with genes belonging to the biotype
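
   As a starting point, the table can be loaded with ``pd.read_csv()``. The
   sketch below parses only the five rows shown above from an in-memory
   string, so that it runs without the file. With the real data you would
   pass the path to ``gene_table.txt`` instead, assuming the file is
   comma-separated with a header row, as the excerpt suggests::

      import io
      import pandas as pd

      excerpt = """gene_name,gene_biotype,chromosome,strand,transcript_count
      TSPAN6,protein_coding,chrX,-,5
      TNMD,protein_coding,chrX,+,2
      DPM1,protein_coding,chr20,-,6
      SCYL3,protein_coding,chr1,-,5
      C1orf112,protein_coding,chr1,+,9
      """

      # With the real file (assumed layout): genes = pd.read_csv("gene_table.txt")
      genes = pd.read_csv(io.StringIO(excerpt))

      n_rows, n_columns = genes.shape
      print(n_rows, n_columns)

      # A single column is a Series, with the usual summary methods.
      isoforms = genes["transcript_count"]
      print(isoforms.min(), isoforms.max(), isoforms.median())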

#. The file ``transcript_table.txt`` contains summary annotation on all human
   transcripts, based on Ensembl annotation:

       http://www.ensembl.org/index.html

   For each transcript, this file contains:

   #. *transcript_name*, composed of the gene name plus a numeric identifier

   #. *transcript_biotype* for example protein_coding, retained_intron,
      nonsense_mediated_decay, etc. See here for a more detailed description
      of the biotypes:

         http://vega.sanger.ac.uk/info/about/gene_and_transcript_types.html

   #. *transcript_length* the length of the transcript (without considering
      introns and the poly-A tail)

   #. *utr5_length* the length of the 5' UTR region (without considering
      introns)

   #. *cds_length* the length of the CDS region (without considering introns)

   #. *utr3_length* the length of the 3' UTR region (without considering
      introns)

   #. *exon_count* the number of exons of the transcript

   #. *canonical_flag* a boolean indicating if the isoform is canonical (i.e.
      the *reference* isoform of the gene) or not. Each gene has only one
      canonical isoform.

   The first lines of the file::

      transcript_name,transcript_biotype,transcript_length,utr5_length,cds_length,utr3_length,exon_count,canonical_flag
      ARF5-001,protein_coding,1103,154,543,406,6,T
      M6PR-001,protein_coding,2756,469,834,1453,7,T
      ESRRA-002,protein_coding,2215,171,1272,772,7,F
      FKBP4-001,protein_coding,3732,187,1380,2165,10,T
      CYP26B1-001,protein_coding,4732,204,1539,2989,6,T
      ...

   Based on this file, write a program that:

   #. computes the number of transcripts annotated for the human genome

   #. computes the minimum, maximum, average and median length of human
      transcripts

   #. computes the minimum, maximum, average and median length of the CDS of
      human transcripts (excluding values equal to 0)

   #. computes the percentage of human transcripts with a CDS length that is a
      multiple of 3 (excluding values equal to 0)

   #. computes the minimum, maximum, average and median length of the UTRs of
      human transcripts (excluding values equal to 0)

   #. computes, for each transcript_biotype, the number of associated
      transcripts (a histogram), and prints the transcript_biotype with the
      number of associated transcripts in decreasing order

   #. computes, for each transcript_biotype, the average transcript length, and
      prints the results in increasing order

   #. computes, for protein_coding transcripts, the average length of the
      5' UTR, CDS and 3' UTR

   #. computes, for each transcript_biotype and considering only canonical
      transcripts, the average number of exons
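
   Several of the tasks above ask to exclude values equal to 0. One way to do
   this is boolean masking, sketched here on three of the excerpt rows plus
   one made-up row (``FAKE-001`` and its values are hypothetical, added only
   so that there is a zero to exclude)::

      import io
      import pandas as pd

      excerpt = """transcript_name,transcript_biotype,transcript_length,utr5_length,cds_length,utr3_length,exon_count,canonical_flag
      ARF5-001,protein_coding,1103,154,543,406,6,T
      M6PR-001,protein_coding,2756,469,834,1453,7,T
      ESRRA-002,protein_coding,2215,171,1272,772,7,F
      FAKE-001,retained_intron,900,0,0,0,3,F
      """
      transcripts = pd.read_csv(io.StringIO(excerpt))

      # Keep only the rows whose CDS length is not 0.
      coding = transcripts[transcripts["cds_length"] != 0]

      # Compute several statistics at once; the result is a Series.
      stats = coding["cds_length"].agg(["min", "max", "mean", "median"])
      print(stats)

      # canonical_flag holds "T"/"F" strings; a comparison yields booleans.
      is_canonical = transcripts["canonical_flag"] == "T"
      print(is_canonical.sum())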

#. The file ``exon_table.txt`` contains summary annotation on human exons
   associated with canonical transcripts, based on the Ensembl annotation:

      http://www.ensembl.org/index.html

   For each exon, this file contains:

   #. *transcript_name*, the name of the transcript

   #. *exon_id*, the id of the exon

   #. *exon_rank*, the rank of the exon inside the transcript (``1`` for the
      first exon, ``2`` for the second exon, etc.)

   #. *exon_chrom_start*, the genomic coordinate corresponding to the start of
      the exon

   #. *exon_chrom_end*, the genomic coordinate corresponding to the end of the
      exon

   The first lines of the file::

      transcript_name,exon_id,exon_rank,exon_chrom_start,exon_chrom_end
      ARF5-001,ENSE00001872691,1,127588345,127588565
      ARF5-001,ENSE00003494180,2,127589083,127589163
      ARF5-001,ENSE00003504066,3,127589485,127589594
      ARF5-001,ENSE00003678978,4,127590066,127590137
      ARF5-001,ENSE00003676786,5,127590963,127591088
      ...

   Based on this file, write a program that:

   #. computes the minimum, maximum, average and median length of
      human exons

   #. given the name of a transcript, returns an ordered list of its exons,
      plus the corresponding length
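
   The table stores start and end coordinates rather than lengths. A possible
   way to derive a length column is sketched below on the first three excerpt
   rows. It assumes that both coordinates are inclusive, so that the length
   is ``end - start + 1``; this assumption should be checked against the
   Ensembl documentation. The ``exon_length`` column name is made up::

      import io
      import pandas as pd

      excerpt = """transcript_name,exon_id,exon_rank,exon_chrom_start,exon_chrom_end
      ARF5-001,ENSE00001872691,1,127588345,127588565
      ARF5-001,ENSE00003494180,2,127589083,127589163
      ARF5-001,ENSE00003504066,3,127589485,127589594
      """
      exons = pd.read_csv(io.StringIO(excerpt))

      # Assumption: start and end coordinates are both inclusive.
      exons["exon_length"] = exons["exon_chrom_end"] - exons["exon_chrom_start"] + 1

      # Select one transcript and order its exons by rank.
      selected = exons[exons["transcript_name"] == "ARF5-001"]
      selected = selected.sort_values("exon_rank")
      print(selected[["exon_id", "exon_length"]])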


#. Load the gene-expression timeseries from:

       https://drive.google.com/open?id=0B0wILN942aEVY2JuaDc4VkkyUkU

   Then:

   #. How many genes are there? How many time steps?

   #. Are there any measurements not assigned to a gene? If there are, fill
      in the missing gene names by using the genes appearing above the
      missing entries. (Take a look at the missing data handling section above.)

   #. Plot the expression of gene ``"ZFX"`` using a line plot.

   #. Plot the mean and variance of the expression of all genes using a line plot.

   #. Find out which genes are the most expressed at time 0, 30m, 3h, 6h, and 12h.

   #. Find out, for each gene, which other gene is the most correlated with it.

   #. Draw the matrix of pairwise gene correlations using ``plt.imshow()``.
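
   Two building blocks that may help here are ``ffill()``, which propagates
   the last valid value downwards, and ``corr()``, which computes pairwise
   correlations between columns. The sketch below uses a tiny made-up table:
   the column names, the ``GENE_B`` name and all values are hypothetical and
   do not reflect the layout of the real file::

      import pandas as pd

      # Hypothetical long-format table with missing gene names.
      raw = pd.DataFrame({
          "gene": ["ZFX", None, None, "GENE_B", None, None],
          "time": [0, 1, 2, 0, 1, 2],
          "expression": [1.0, 2.0, 3.0, 6.0, 4.0, 2.0],
      })

      # Copy the last seen gene name into the missing entries below it.
      raw["gene"] = raw["gene"].ffill()

      # One column per gene, one row per time step.
      wide = raw.pivot(index="time", columns="gene", values="expression")

      # Pairwise Pearson correlations between the gene columns.
      corr = wide.corr()
      print(corr)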

#. Write a Python function that mimics the semantics of ``merge()``: it takes
   two dataframes, matches the row labels, and concatenates the matching rows.

   The result should be a list of lists (or, optionally, a ``DataFrame``).

   It should follow ``merge()``'s ``how="inner"`` semantics, i.e. drop all
   rows that cannot be matched.

   .. hint::
        Use the ``iterrows()`` method to iterate over the rows of a ``DataFrame``.
        It yields ``(index, row)`` pairs.

        Example::

            for index, row in df.iterrows():
                # do something with the row (it is a Series!)
                print(index, row)

#. Same as above, but for the ``how="outer"`` semantics.
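
   For reference, the difference between the two behaviours can be observed
   by calling pandas' own ``merge()`` on two small made-up dataframes (the
   column names and values are hypothetical). The sketch matches on the row
   labels via ``left_index`` and ``right_index``; it only shows the expected
   results and is not a solution to the exercises::

      import pandas as pd

      left = pd.DataFrame({"x": [1, 2, 3]}, index=["a", "b", "c"])
      right = pd.DataFrame({"y": [10, 20, 30]}, index=["b", "c", "d"])

      # Match on the row labels (the index) of both dataframes.
      inner = left.merge(right, left_index=True, right_index=True, how="inner")
      outer = left.merge(right, left_index=True, right_index=True, how="outer")

      print(inner)   # only the labels present in both dataframes
      print(outer)   # all labels; unmatched cells are filled with NaN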