======
Pandas
======

What is Pandas?
---------------

Pandas is a freely available library for loading, manipulating, and visualizing
**sequential and tabular data**, for instance time series or microarrays. Its main
features are:

- **Loading** from and **saving** to "standard" tabular file formats, like *CSV*
  (comma-separated values), *TSV* (tab-separated values), *Excel* files, and
  *database* formats
- Flexible **indexing** and **aggregation** of series and tables
- Efficient **numerical** and **statistical** operations (e.g. broadcasting)
- Pretty, straightforward **visualization**

This list is far from complete!

|

Documentation & Source
----------------------

A few websites worth visiting are the official Pandas website:

    http://pandas.pydata.org/

as well as the official documentation:

    http://pandas.pydata.org/pandas-docs/stable/dsintro.html

The Pandas source code, available here:

    https://github.com/pandas-dev/pandas/

is distributed under the *BSD license*:

    https://github.com/pandas-dev/pandas/blob/master/LICENSE

|

Importing Pandas
----------------

In order to import Pandas, the customary idiom is::

    >>> import pandas as pd
    >>> help(pd)

but feel free to use this instead::

    >>> from pandas import *

To load the (optional, but super-useful) visualization library, write::

    >>> import matplotlib.pyplot as plt

For now, the only function you need is ``plt.show()``.

|

Pandas: A short demonstration
-----------------------------

Let's start with a copy of Fisher's own **iris dataset**:

    https://drive.google.com/open?id=0B0wILN942aEVYTVBekRHLTNON3c

The dataset records petal length/width and sepal length/width of three different iris
species: *Iris setosa*, *Iris versicolor*, and *Iris virginica*.

.. note::

    The iris dataset has quite some history -- there is even a Wikipedia page about it!

        https://en.wikipedia.org/wiki/Iris_flower_data_set

This is what the iris dataset in CSV format looks like::

    SepalLength,SepalWidth,PetalLength,PetalWidth,Name
    5.1,3.5,1.4,0.2,Iris-setosa
    4.9,3.0,1.4,0.2,Iris-setosa
    ...
    5.0,3.3,1.4,0.2,Iris-setosa
    7.0,3.2,4.7,1.4,Iris-versicolor
    6.4,3.2,4.5,1.5,Iris-versicolor
    ...
    5.7,2.8,4.1,1.3,Iris-versicolor
    6.3,3.3,6.0,2.5,Iris-virginica
    5.8,2.7,5.1,1.9,Iris-virginica
    ...
    5.9,3.0,5.1,1.8,Iris-virginica

In an effort to understand the dataset, we would like to visualize the relation between
the four properties for the case of *Iris virginica*. In pure Python, we would have to:

#. Load the dataset by parsing all the rows in the file, then
#. Keep only the rows pertaining to *Iris virginica*, then
#. Compute statistics on the values of the rows, making sure to convert from strings to
   ``float``'s as required, and then
#. Actually draw the plots by using a specialized plotting library.

With Pandas, this all becomes very easy. Check this out::

    >>> import pandas as pd
    >>> from pandas.tools.plotting import scatter_matrix
    >>> import matplotlib.pyplot as plt
    >>> df = pd.read_csv("iris.csv")
    >>> scatter_matrix(df[df.Name == "Iris-virginica"])
    >>> plt.show()

and that's it! The result is:

.. image:: figures/iris_scattermatrix.png
   :width: 100%

.. attention::

    It only looks bad because I used the default colors. Plots can be customized to
    your liking!

Of course, there is a lot more to Pandas than solving simple tasks like this.
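For instance, the following short sketch (which assumes the same ``iris.csv`` file is
available in the working directory) filters the table with a Boolean mask and produces a
full statistical summary of one species; both operations are explained in detail in the
sections below::

    >>> import pandas as pd
    >>> df = pd.read_csv("iris.csv")
    >>> # keep only the Iris virginica rows, then summarize every numerical column
    >>> df[df.Name == "Iris-virginica"].describe()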
|

Pandas datatypes
----------------

Pandas provides a couple of very useful datatypes, ``Series`` and ``DataFrame``:

- ``Series`` represents 1D data, like time series, calendars, the output of one-variable
  functions, etc.
- ``DataFrame`` represents 2D data, like a comma-separated-values (CSV) file, a
  microarray, a database table, a matrix, etc.

A ``DataFrame`` is composed of multiple ``Series``. More specifically, each column of a
``DataFrame`` is a ``Series``. That's why we will see how the ``Series`` datatype works
first.

.. attention::

    Most of what we will say about ``Series`` also applies to ``DataFrame``'s. The
    similarities include how indexing is done, how broadcasting applies to arithmetical
    operators and Boolean masks, how to compute statistics, how to deal with missing
    values (``nan``'s), and how plotting is done.

|

Pandas: ``Series``
------------------

A ``Series`` is a **one-dimensional** array with a labeled axis. It can hold arbitrary
objects. It works a little like a ``list`` and a little like a ``dict``. The axis is
called the ``index``, and can be used to access the elements; it is very flexible, and
not necessarily numerical.

To access the online documentation, type::

    >>> s = pd.Series()
    >>> help(s)

|

Creating a ``Series``
---------------------

Use the ``Series()`` constructor to create a new ``Series`` object. You can either
specify *both* the data and the index, like this::

    >>> s = pd.Series(["a", "b", "c"], index=[2, 5, 8])
    >>> s
    2    a
    5    b
    8    c
    dtype: object
    >>> s.index
    Int64Index([2, 5, 8], dtype='int64')

or pass a single ``dict`` in, in which case the index is built from the dictionary
keys::

    >>> s = pd.Series({"a": "A", "b": "B", "c": "C"})
    >>> s
    a    A
    b    B
    c    C
    dtype: object
    >>> s.index
    Index([u'a', u'b', u'c'], dtype='object')

or skip the index altogether: in this case, the default index (i.e. a numerical index
starting from ``0``) is automatically assigned::

    >>> s = pd.Series(["a", "b", "c"])
    >>> s
    0    a
    1    b
    2    c
    dtype: object
    >>> s.index
    RangeIndex(start=0, stop=3, step=1)

Finally, if given a single scalar (e.g. an ``int``), the ``Series`` constructor will
replicate it for all indices::

    >>> s = pd.Series(3, index=range(10))
    >>> s
    0    3
    1    3
    2    3
    3    3
    4    3
    5    3
    6    3
    7    3
    8    3
    9    3
    dtype: int64

.. note::

    The *type* of the index changes in the previous examples. The library uses
    different types of index for performance reasons. The ``Series`` class however
    works (more or less) the same in all cases, so you can ignore these differences in
    practical applications.

|

Accessing a ``Series``
----------------------

Let's create a ``Series`` representing the hours of sleep we had the chance to get each
day of the past week::

    >>> sleephours = [6, 2, 8, 5, 9]
    >>> days = ["mon", "tue", "wed", "thu", "fri"]
    >>> s = pd.Series(sleephours, index=days)
    >>> s
    mon    6
    tue    2
    wed    8
    thu    5
    fri    9
    dtype: int64

Now we can access the various elements of the ``Series`` both by their label (i.e. the
index) like we would with a ``dict``::

    >>> s["mon"]
    6

as well as by their position, like we would with a ``list``::

    >>> s[0]
    6

.. warning::

    If the label or position is wrong, you get either a ``KeyError``::

        >>> s["sat"]
        KeyError: 'sat'

    or an ``IndexError``::

        >>> s[9]
        IndexError: index out of bounds
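If you would rather get a default value than an exception when a label is missing, the
dictionary-like ``get()`` method can be used instead; a minimal sketch::

    >>> s.get("sat")       # returns None instead of raising a KeyError
    >>> s.get("sat", 0)    # returns the supplied default instead
    0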
We can also slice the positions, like we would with a ``list``::

    >>> s[-3:]
    wed    8
    thu    5
    fri    9
    dtype: int64

Note that *both* the data and the index are extracted correctly. It also works with
labels::

    >>> s["tue":"thu"]
    tue    2
    wed    8
    thu    5
    dtype: int64

Note that, differently from positional slices, label slices *include* the right
endpoint.

The first and last ``n`` elements can also be extracted using ``head()`` and
``tail()``::

    >>> s.head(2)
    mon    6
    tue    2
    dtype: int64
    >>> s.tail(3)
    wed    8
    thu    5
    fri    9
    dtype: int64

Most importantly, we can also explicitly pass in a **list of positions** (or labels),
like this::

    >>> s[[0, 1, 2]]
    mon    6
    tue    2
    wed    8
    dtype: int64
    >>> s[["mon", "wed", "fri"]]
    mon    6
    wed    8
    fri    9
    dtype: int64

.. warning::

    Passing in a **tuple** of positions does not work!

|

Operator Broadcasting
---------------------

The ``Series`` class automatically broadcasts arithmetical operations with a scalar to
all of the elements. Consider adding ``1`` to our ``Series``::

    >>> s
    mon    6
    tue    2
    wed    8
    thu    5
    fri    9
    dtype: int64
    >>> s + 1
    mon     7
    tue     3
    wed     9
    thu     6
    fri    10
    dtype: int64

As you can see, the addition was applied to **all** the entries! This also applies to
multiplication and other arithmetical operations::

    >>> s * 2
    mon    12
    tue     4
    wed    16
    thu    10
    fri    18
    dtype: int64

This is extremely useful: it frees the analyst from having to write loops. The (worse)
alternative to ``s = s + 1`` would be::

    for x in s.index:
        s[x] += 1

which is definitely more verbose and error-prone, and also takes more time to execute!

.. note::

    The concept of *operator broadcasting* was taken from the ``numpy`` library, and is
    one of the key features for writing efficient, clean numerical code in Python. In a
    way, it is a generalization of multiplying a vector by a scalar (from linear
    algebra). The rules governing how broadcasting is applied can be pretty complex
    (and confusing). Here we will cover broadcasting with scalars only.

|

Masks and Filtering
-------------------

Most importantly, broadcasting also applies to **conditions**::

    >>> s > 6
    mon    False
    tue    False
    wed     True
    thu    False
    fri     True
    dtype: bool

Here the result is a ``Series`` with the same index as the original (that is, the very
same labels), and each label ``label`` is associated to the result of the comparison
``s[label] > 6``. This kind of series is called a **mask**.

Is it useful in any way? Yes, of course! Masks can be used to filter the elements of a
``Series`` according to a given condition::

    >>> s[s > 6]
    wed    8
    fri    9
    dtype: int64

Here only those entries corresponding to ``True``'s in the mask have been kept. For
extracting the elements that satisfy the opposite of the condition, write::

    >>> s[s <= 6]
    mon    6
    tue    2
    thu    5
    dtype: int64

or::

    >>> s[~(s > 6)]
    mon    6
    tue    2
    thu    5
    dtype: int64

The latter version is useful when the condition you are checking for is complex and
cannot be inverted easily.

|

Automatic Label Alignment
-------------------------

Operations between multiple ``Series`` are automatically aligned by label, meaning that
elements with the same label are matched prior to carrying out the operation. Consider
this example::

    >>> s[1:] + s[:-1]
    fri     NaN    # <-- not common
    mon     NaN    # <-- not common
    thu    10.0    # <-- common element
    tue     4.0    # <-- common element
    wed    16.0    # <-- common element
    dtype: float64

As you can see, the elements are summed in the right order! But what's up with those
``nan``'s?

The index of the resulting ``Series`` is the **union** of the indices of the operands.
What happens depends on whether a given label appears in both input ``Series`` or not:

- For common labels (in our case ``"tue"``, ``"wed"``, ``"thu"``), the output
  ``Series`` contains the sum of the aligned elements.
- For labels appearing in only one of the operands (``"mon"`` and ``"fri"``), the
  result is a ``nan``, i.e. not-a-number.
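If you would rather treat the missing side as zero instead of getting a ``nan``, the
arithmetic *methods* (``add()``, ``sub()``, etc.) accept a ``fill_value`` argument; a
minimal sketch::

    >>> s[1:].add(s[:-1], fill_value=0)
    fri     9.0
    mon     6.0
    thu    10.0
    tue     4.0
    wed    16.0
    dtype: float64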
.. warning::

    ``nan`` means *not a number*. It is just a symbolic constant that specifies that
    the object is a number-like entity with an invalid or undefined value. Think for
    instance of the result of a division by 0.

    In the example above, the result is a ``nan`` whenever it represents the sum of an
    existing element, like ``s[1:]["fri"]``, and a non-existing one, like
    ``s[:-1]["fri"]``. It is used to indicate undefined results (like above) or missing
    measurements (as we will see).

    The actual value of the ``nan`` symbol is taken from ``numpy``, see ``numpy.nan``.
    You can actually use it to insert ``nan``'s manually::

        >>> import numpy as np
        >>> s["fri"] = np.nan
        >>> s
        mon    6.0
        tue    2.0
        wed    8.0
        thu    5.0
        fri    NaN
        dtype: float64

|

Dealing with Missing Data
-------------------------

By default, missing entries and invalid results in Pandas are represented as ``nan``.
Consider the ``Series`` from the previous example::

    >>> t = s[1:] + s[:-1]
    >>> t
    fri     NaN
    mon     NaN
    thu    10.0
    tue     4.0
    wed    16.0
    dtype: float64

There are different strategies for dealing with ``nan``'s. There is no "best" strategy:
you have to pick one depending on the problem you are trying to solve.

- Drop all entries that have a ``nan``, with ``dropna()`` (read it as *drop N/A*, i.e.
  not available)::

      >>> nonan_t = t.dropna()
      >>> nonan_t
      thu    10.0
      tue     4.0
      wed    16.0
      dtype: float64

- Fill in the ``nan`` entries with a more palatable value, for instance a default value
  that does not disrupt your statistics::

      >>> t.fillna(0.0)
      fri     0.0
      mon     0.0
      thu    10.0
      tue     4.0
      wed    16.0
      dtype: float64

  or::

      >>> t.fillna("unknown")
      fri    unknown
      mon    unknown
      thu         10
      tue          4
      wed         16
      dtype: object

- Fill in the ``nan``'s by imputing the missing values from the surrounding ones. For
  instance, given a ``Series`` with several missing entries, ``ffill()`` propagates the
  last valid value forward::

      >>> u = pd.Series([np.nan, 2.0, 8.0, np.nan, np.nan], index=days)
      >>> u
      mon    NaN
      tue    2.0
      wed    8.0
      thu    NaN
      fri    NaN
      dtype: float64
      >>> u.ffill()
      mon    NaN
      tue    2.0
      wed    8.0
      thu    8.0
      fri    8.0
      dtype: float64

.. warning::

    Most Pandas functions deal gracefully with missing values. For instance, the
    ``mean()`` method has a ``skipna`` argument that can be either ``True`` or
    ``False``. If it is ``True`` (the default), the mean simply ignores the missing
    values; if it is ``False``, the mean of a series containing a ``nan`` will itself
    be ``nan`` (as it should in purely mathematical terms).

|

Computing Statistics
--------------------

``Series`` objects support a variety of mathematical and statistical operations. In the
examples below, ``s`` is again the original (integer) sleep-hours series:

- The sum and product of all values in the ``Series``::

      >>> s.sum()
      30
      >>> s.prod()
      4320

- The cumulative sum::

      >>> s.cumsum()
      mon     6
      tue     8
      wed    16
      thu    21
      fri    30
      dtype: int64

- The minimum and maximum values and their indices::

      >>> s.max()
      9
      >>> s.argmax()
      'fri'
      >>> s[s.argmax()]
      9

  and the same for ``s.min()`` and ``s.argmin()``.

- The mean, variance, std. deviation, median, quantiles, etc.::

      >>> s.mean()
      6.0
      >>> s.var()
      7.5
      >>> s.std()
      2.7386127875258306
      >>> s.median()
      6.0
      >>> s.quantile([0.25, 0.5, 0.75])
      0.25    5.0
      0.50    6.0
      0.75    8.0
      dtype: float64

  These can be paired with **masks** for awesome extraction effects::

      >>> s[s > s.quantile(0.66)]
      wed    8
      fri    9
      dtype: int64

- The Pearson, Kendall, Spearman correlations between series::

      >>> s.corr(s)  # Pearson by default
      1.0
      >>> s.corr(s, method="spearman")
      0.99999999999999989

- The autocorrelation with arbitrary time lag::

      >>> s.autocorr(lag=0)
      1.0
      >>> s.autocorr(lag=1)
      -0.54812812776251907
      >>> s.autocorr(lag=2)
      0.99587059488582252

- ... and many, many others.
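As a small illustration of how broadcasting and these statistics combine, here is a
sketch that standardizes the series to zero mean and unit standard deviation (a
*z-score*)::

    >>> (s - s.mean()) / s.std()
    mon    0.000000
    tue   -1.460593
    wed    0.730297
    thu   -0.365148
    fri    1.095445
    dtype: float64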
.. note::

    A quick way to examine a ``Series`` is to use the ``describe()`` method::

        >>> s.describe()
        count    5.000000
        mean     6.000000
        std      2.738613
        min      2.000000
        25%      5.000000
        50%      6.000000
        75%      8.000000
        max      9.000000
        dtype: float64

.. warning::

    You are not required to *remember* all these methods! Just be aware that if you
    need a basic statistical tool, ``pandas`` probably has it.

|

Plotting
--------

We want to visualize our simple sleeping hours data::

    >>> sleephours = [6, 2, 8, 5, 9]
    >>> days = ["mon", "tue", "wed", "thu", "fri"]
    >>> s = pd.Series(sleephours, index=days)

Now that we have a ``Series``, let's import the additional ``matplotlib`` module::

    >>> import matplotlib.pyplot as plt

This is the customary idiom to import the library. Let's plot the data!

- With a simple **line plot**::

      >>> s.plot()
      >>> plt.show()

- With a more readable **bar plot**::

      >>> s.plot(kind="bar")
      >>> plt.show()

- With a **histogram**::

      >>> s.plot(kind="hist")
      >>> plt.show()

.. note::

    There are very many ways to improve the looks of your plots, like changing the
    shape, size, margins, colors, labels, etc. Browse the documentation for
    inspiration:

        http://pandas.pydata.org/pandas-docs/stable/visualization.html

.. note::

    To save the plot to a file instead of merely showing it on screen, use the
    ``savefig()`` function of ``matplotlib``::

        >>> plt.savefig("output.png")

|

Pandas: ``DataFrame``
---------------------

A Pandas ``DataFrame`` is the 2D analogue of a ``Series``: it is essentially a
**table** of heterogeneous objects. Recall that a ``Series`` holds both the *values*
and the *labels* of all its elements. Analogously, a ``DataFrame`` has a few major
attributes:

- the ``index``, which holds the labels of the **rows**
- the ``columns``, which holds the labels of the **columns**
- the ``shape``, which describes the **dimensions** of the table

Each column in the ``DataFrame`` is a ``Series``: when you extract a column from a
``DataFrame`` you get a proper ``Series``, and you can operate on it using all the
tools presented in the previous sections.

Further, most of the operations that you can do on a ``Series``, you can also do on an
entire ``DataFrame``. The same principles apply: broadcasting, label alignment, and
plotting all work very similarly.

.. warning::

    Not all operations work *exactly* the same on the two datatypes. This is of course
    because some operations that make sense on a 2D table do not really apply to a 1D
    sequence, and vice versa. However the analogy holds well enough in most cases.

|

Creating a ``DataFrame``
------------------------

Just like for ``Series``, there are various options. We can:

- Create a ``DataFrame`` from a dictionary of ``Series``::

      >>> d = {
      ...     "name":
      ...         pd.Series(["bobby", "ronald", "ronald", "ronald"]),
      ...     "surname":
      ...         pd.Series(["fisher", "fisher", "reagan", "mcdonald"]),
      ... }
      >>> df = pd.DataFrame(d)
      >>> df
           name   surname
      0   bobby    fisher
      1  ronald    fisher
      2  ronald    reagan
      3  ronald  mcdonald
      >>> df.columns
      Index([u'name', u'surname'], dtype='object')
      >>> df.index
      RangeIndex(start=0, stop=4, step=1)

As you can see:

- the *keys* of the dictionary became the ``columns`` of ``df``
- the ``index`` of the various ``Series`` became the ``index`` of ``df``.

.. warning::

    If the indices of the input ``Series`` do not match, label alignment applies and
    the missing values are treated as ``nan``'s.
This is exactly what happens in this example::

    >>> d = {
    ...     "name":
    ...         pd.Series([0, 0], index=["a", "b"]),
    ...     "surname":
    ...         pd.Series([0, 0], index=["c", "d"]),
    ... }
    >>> df = pd.DataFrame(d)
    >>> df
       name  surname
    a   0.0      NaN
    b   0.0      NaN
    c   NaN      0.0
    d   NaN      0.0
    >>> df.columns
    Index(['name', 'surname'], dtype='object')
    >>> df.index
    Index(['a', 'b', 'c', 'd'], dtype='object')

- Create a ``DataFrame`` from a single dictionary::

      >>> d = {
      ...     "column1": [1., 2., 6., -1.],
      ...     "column2": [0., 1., -2., 4.],
      ... }
      >>> df = pd.DataFrame(d)
      >>> df
         column1  column2
      0      1.0      0.0
      1      2.0      1.0
      2      6.0     -2.0
      3     -1.0      4.0

  Again, the ``columns`` are taken from the *keys*, while the ``index`` is set to the
  default one (i.e. ``range(0, num_rows)``). A custom ``index`` can be specified in the
  usual way::

      >>> df = pd.DataFrame(d, index=["a", "b", "c", "d"])
      >>> df
         column1  column2
      a      1.0      0.0
      b      2.0      1.0
      c      6.0     -2.0
      d     -1.0      4.0

- Create a ``DataFrame`` from a list of dictionaries::

      >>> data = [
      ...     {"a": 1, "b": 2},
      ...     {"a": 2, "c": 3},
      ... ]
      >>> df = pd.DataFrame(data)
      >>> df
         a    b    c
      0  1  2.0  NaN
      1  2  NaN  3.0

  Here the ``columns`` are taken (again) from the *keys* of the dictionaries, while the
  ``index`` is the default one. Since not all *keys* appear in all input dictionaries,
  missing values (i.e. ``nan``'s) are automatically added to ``df``.

.. note::

    Once a ``DataFrame`` has been created, you can interactively change its ``index``
    and ``columns``::

        >>> d = {
        ...     "column1": [1., 2., 6., -1.],
        ...     "column2": [0., 1., -2., 4.],
        ... }
        >>> df = pd.DataFrame(d)
        >>> df
           column1  column2
        0      1.0      0.0
        1      2.0      1.0
        2      6.0     -2.0
        3     -1.0      4.0
        >>> df.columns = ["foo", "bar"]
        >>> df.columns
        Index([u'foo', u'bar'], dtype='object')
        >>> df.index = range(df.shape[0])
        >>> df.index
        Int64Index([0, 1, 2, 3], dtype='int64', length=4)

|

Loading a CSV file
------------------

Of course, the advantage of Pandas is that it allows you to load data from many
different file formats. Here we will only consider **CSV** (comma-separated values)
files and similar.

To load a CSV file into a ``DataFrame``, for instance the *iris dataset*, type::

    >>> df = pd.read_csv("iris.csv")
    >>> print type(df)
    <class 'pandas.core.frame.DataFrame'>

Now that it's loaded, let's print some information about it::

    >>> df.columns
    Index([u'SepalLength', u'SepalWidth', u'PetalLength', u'PetalWidth', u'Name'],
          dtype='object')
    >>> df.index
    RangeIndex(start=0, stop=150, step=1)
    >>> df.shape
    (150, 5)
    >>> print df
    ...
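Saving works just as easily; here is a quick sketch that writes the same table back to a
new file with the ``to_csv()`` method (``index=False`` leaves out the row labels)::

    >>> df.to_csv("iris_copy.csv", index=False)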
.. note::

    The CSV file format is not really well-defined. This means that the CSV files that
    you'll encounter in the wild can (and probably will!) be very different from one
    another. Given the diversity of this (under-specified) file format, the
    ``read_csv()`` function provides a *large* number of options to deal with many of
    the (otherwise code-breaking) differences between CSV-like files.

Here is an excerpt from the ``read_csv()`` manual::

    >>> help(pd.read_csv)
    Help on function read_csv in module pandas.io.parsers:

    read_csv(filepath_or_buffer, sep=',', delimiter=None, header='infer', names=None,
        index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True,
        dtype=None, engine=None, converters=None, true_values=None, false_values=None,
        skipinitialspace=False, skiprows=None, nrows=None, na_values=None,
        keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True,
        parse_dates=False, infer_datetime_format=False, keep_date_col=False,
        date_parser=None, dayfirst=False, iterator=False, chunksize=None,
        compression='infer', thousands=None, decimal='.', lineterminator=None,
        quotechar='"', quoting=0, escapechar=None, comment=None, encoding=None,
        dialect=None, tupleize_cols=False, error_bad_lines=True, warn_bad_lines=True,
        skipfooter=0, skip_footer=0, doublequote=True, delim_whitespace=False,
        as_recarray=False, compact_ints=False, use_unsigned=False, low_memory=True,
        buffer_lines=None, memory_map=False, float_precision=None)
        Read CSV (comma-separated) file into DataFrame

        Also supports optionally iterating or breaking of the file
        into chunks.

        Additional help can be found in the `online docs for IO Tools `_.

As you can see, you can tweak the ``read_csv()`` function in very many ways, in order to
make it load your data the proper way.

.. note::

    A similar function exists for reading Excel files, ``read_excel()``.

|

**Example: A malformed TSV file.** Let's load a TAB-separated file instead (i.e. a CSV
file that uses TABs instead of commas). The file can be found here:

    https://drive.google.com/open?id=0B0wILN942aEVeWdScy1wQnA3LTA

It describes a mapping from UniProt protein IDs (i.e. sequences) to hits in the Protein
Data Bank (i.e. structures). The TSV file looks like this::

    # 2014/07/08 - 14:59
    SP_PRIMARY      PDB
    A0A011  3vk5;3vka;3vkb;3vkc;3vkd
    A0A2Y1  2jrd
    A0A585  4mnq
    A0A5B4  2ij0
    A0A5B9  2bnq;2eyr;2eys;2eyt;3kxf;3o6f;3o8x;3o9w;3qux;3t0e;3ta3;3tvm
    A0A5E3  1hq4;1ob1
    A0AEF5  4iu2;4iu3
    A0AEF6  4iu2;4iu3
    A0AQQ7  2ib5

Note that the two columns are separated by a TAB. Note also that the first row is
actually a comment.

We can use the ``sep`` (separator) argument of the ``read_csv()`` function to take care
of the TABs. Let's write::

    >>> df = pd.read_csv("uniprot_pdb.tsv", sep="\t")
    >>> df.shape
    (33637, 1)
    >>> df.columns
    Index([u'# 2014/07/08 - 14:59'], dtype='object')
    >>> df.head()
                   # 2014/07/08 - 14:59
    SP_PRIMARY                      PDB
    A0A011     3vk5;3vka;3vkb;3vkc;3vkd
    A0A2Y1                         2jrd
    A0A585                         4mnq
    A0A5B4                         2ij0
    >>> df.head().shape
    (5, 1)

Unfortunately, the shape of the ``DataFrame`` is wrong: there is only **one** column!
The problem is that the very first line (the comment) contains no TABs: this makes
Pandas think that there is only one column. We have to tell ``read_csv()`` to either
ignore the first line::

    >>> df = pd.read_csv("uniprot_pdb.tsv", sep="\t", skiprows=1)
    >>> df.shape
    (33636, 2)
    >>> df.head()
      SP_PRIMARY                                                PDB
    0     A0A011                           3vk5;3vka;3vkb;3vkc;3vkd
    1     A0A2Y1                                               2jrd
    2     A0A585                                               4mnq
    3     A0A5B4                                               2ij0
    4     A0A5B9  2bnq;2eyr;2eys;2eyt;3kxf;3o6f;3o8x;3o9w;3qux;3...
    >>> df.head().shape
    (5, 2)

or, as an alternative, to ignore all comment lines, i.e. all lines that start with a
hash (``"#"``) character::

    >>> df = pd.read_csv("uniprot_pdb.tsv", sep="\t", comment="#")
    >>> df.shape
    (33636, 2)
    >>> df.head()
      SP_PRIMARY                                                PDB
    0     A0A011                           3vk5;3vka;3vkb;3vkc;3vkd
    1     A0A2Y1                                               2jrd
    2     A0A585                                               4mnq
    3     A0A5B4                                               2ij0
    4     A0A5B9  2bnq;2eyr;2eys;2eyt;3kxf;3o6f;3o8x;3o9w;3qux;3...
    >>> df.head().shape
    (5, 2)

Now the number of columns is correct!
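As an aside, if you would rather use the UniProt IDs as the row labels instead of a
regular column, the ``index_col`` option (visible in the signature above) can be
combined with the previous ones; a minimal sketch::

    >>> df = pd.read_csv("uniprot_pdb.tsv", sep="\t", comment="#",
    ...                  index_col="SP_PRIMARY")
    >>> df.shape        # only the PDB column is left; the IDs are now the index
    (33636, 1)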
|

**Example: microarray data.** Let's also try to load the **microarray** file in Stanford
PCL format (namely, a disguised TAB-separated-values file) available here:

    https://drive.google.com/open?id=0B0wILN942aEVMENTUnRpa05uY0k

Again, it is pretty easy::

    >>> df = pd.read_csv("2010.DNAdamage.pcl", sep="\t")
    >>> df.shape
    (6128, 55)
    >>> df.columns
    Index([u'YORF', u'NAME', u'GWEIGHT', u'wildtype + 0.02% MMS (5 min)',
           u'wildtype + 0.02% MMS (15 min)', u'wildtype + 0.02% MMS (30 min)',
           u'wildtype + 0.2% MMS (45 min)', u'wildtype + 0.02% MMS (60 min)',
           u'wildtype + 0.02% MMS (90 min)', u'wildtype + 0.02% MMS (120 min)',
           u'mec1 mutant + 0.02% MMS (5 min)', u'mec1 mutant + 0.02% MMS (15 min)',
           u'mec1 mutant + 0.02% MMS (30 min)', u'mec1 mutant + 0.02% MMS (45 min)',
           u'mec1 mutant + 0.02% MMS (60 min)', u'mec1 mutant + 0.02% MMS (90 min)',
           u'mec1 mutant + 0.02% MMS (120 min)', u'dun1 mutant + 0.02% MMS (30 min)',
           u'dun1 mutant + 0.02% MMS (90 min)', u'dun1 mutant + 0.02% MMS (120 min)',
           u'wildtype + gamma irradiation (5 min)',
           u'wildtype + gamma irradiation (10 min)',
           ...
           u'crt1 mutant vs wild type (expt 1)', u'crt1 mutant vs. wild type (expt 2)',
           u'GAL-inducible ROX1 on galactose'],
          dtype='object')

.. warning::

    ... at least at a quick glance. Additional care must be taken for this data: some
    columns are not separated correctly (just look at the file with the ``less``
    command to see what I mean). Feel free to try and improve the example above to
    properly load the microarray file!

|

Extracting Rows and Columns
---------------------------

Row and column extraction from a ``DataFrame`` can range from very easy to very
complex, depending on what you want to do. Here we will only deal with elementary
extraction methods.

.. note::

    If you wish to know more, especially about the ever useful ``ix`` attribute, take a
    look at the documentation:

        http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe

        http://pandas.pydata.org/pandas-docs/stable/indexing.html

        http://pandas.pydata.org/pandas-docs/stable/advanced.html

The (non-exhaustive) list of alternatives is:

=============================== =================== =========
Operation                       Syntax              Result
=============================== =================== =========
Select column                   ``df[col]``         Series
Select row by label             ``df.loc[label]``   Series
Select row by integer location  ``df.iloc[loc]``    Series
Slice rows                      ``df[5:10]``        DataFrame
Select rows by boolean vector   ``df[bool_vec]``    DataFrame
=============================== =================== =========

.. note::

    For simplicity, in the following examples we will use a random sample taken from
    the *iris dataset*, computed like this::

        >>> import numpy as np
        >>> import pandas as pd
        >>> np.random.seed(0)
        >>> df = pd.read_csv("iris.csv")
        >>> small = df.iloc[np.random.permutation(df.shape[0])].head()
        >>> small.shape
        (5, 5)
        >>> small
             SepalLength  SepalWidth  PetalLength  PetalWidth             Name
        114          5.8         2.8          5.1         2.4   Iris-virginica
        62           6.0         2.2          4.0         1.0  Iris-versicolor
        33           5.5         4.2          1.4         0.2      Iris-setosa
        107          7.3         2.9          6.3         1.8   Iris-virginica
        7            5.0         3.4          1.5         0.2      Iris-setosa

    Brief explanation: here we use ``numpy.random.permutation()`` to generate a random
    permutation of the integers from ``0`` to ``df.shape[0]``, i.e. the number of rows
    in the *iris dataset*; then we use these shuffled integers as row positions to
    permute all the rows of ``df``; finally, we take the first ``5`` rows of the
    permuted ``df`` using the ``head()`` method.
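Incidentally, recent versions of Pandas offer a one-line alternative for this kind of
random sampling, the ``sample()`` method; a small sketch (note that it will not
necessarily pick the same five rows as the permutation trick above)::

    >>> small = df.sample(5, random_state=0)   # five random rows, reproducible seed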
- A ``DataFrame`` can now be accessed like a regular dictionary to extract individual
  **columns** -- indeed, you can also print its *keys*!::

      >>> small.columns
      Index([u'SepalLength', u'SepalWidth', u'PetalLength', u'PetalWidth', u'Name'],
            dtype='object')
      >>> small.keys()
      Index([u'SepalLength', u'SepalWidth', u'PetalLength', u'PetalWidth', u'Name'],
            dtype='object')

  Now we access ``small`` by **column name**; the result is a ``Series``::

      >>> type(small["Name"])
      <class 'pandas.core.series.Series'>
      >>> small["Name"]
      114     Iris-virginica
      62     Iris-versicolor
      33         Iris-setosa
      107     Iris-virginica
      7          Iris-setosa
      Name: Name, dtype: object

  If the name of the column is compatible with the Python conventions for variable
  names, you can also treat columns as if they were actual attributes of the ``df``
  object::

      >>> small.Name
      114     Iris-virginica
      62     Iris-versicolor
      33         Iris-setosa
      107     Iris-virginica
      7          Iris-setosa
      Name: Name, dtype: object

  .. warning::

      Here ``Name`` is an attribute created on-the-fly by Pandas when loading the
      *iris* dataset!

  It is possible to extract **multiple columns** in one go; the result will be a
  ``DataFrame``::

      >>> small[["SepalLength", "PetalLength"]]
           SepalLength  PetalLength
      114          5.8          5.1
      62           6.0          4.0
      33           5.5          1.4
      107          7.3          6.3
      7            5.0          1.5

  .. warning::

      Note that the order of the column labels matters! Compare this with the previous
      code fragment::

          >>> small[["PetalLength", "SepalLength"]]
               PetalLength  SepalLength
          114          5.1          5.8
          62           4.0          6.0
          33           1.4          5.5
          107          6.3          7.3
          7            1.5          5.0

- To extract the **rows**, we can use the ``loc`` and ``iloc`` attributes:

  - ``loc`` retrieves a row **by label**::

        >>> small.loc[114]
        SepalLength               5.8
        SepalWidth                2.8
        PetalLength               5.1
        PetalWidth                2.4
        Name           Iris-virginica
        Name: 114, dtype: object
        >>> type(small.loc[114])
        <class 'pandas.core.series.Series'>

    The result is a ``Series``.

  - ``iloc`` retrieves a row **by position**::

        >>> small.iloc[0]
        SepalLength               5.8
        SepalWidth                2.8
        PetalLength               5.1
        PetalWidth                2.4
        Name           Iris-virginica
        Name: 114, dtype: object

    The result is, again, a ``Series``.

  - The ``loc`` and ``iloc`` attributes also allow to retrieve **multiple rows**::

        >>> small.loc[[114, 7, 62]]
             SepalLength  SepalWidth  PetalLength  PetalWidth             Name
        114          5.8         2.8          5.1         2.4   Iris-virginica
        7            5.0         3.4          1.5         0.2      Iris-setosa
        62           6.0         2.2          4.0         1.0  Iris-versicolor
        >>> small.iloc[[0, 1, 2]]
             SepalLength  SepalWidth  PetalLength  PetalWidth             Name
        114          5.8         2.8          5.1         2.4   Iris-virginica
        62           6.0         2.2          4.0         1.0  Iris-versicolor
        33           5.5         4.2          1.4         0.2      Iris-setosa

    This time, the result is a ``DataFrame``.

  .. warning::

      Again, note that the order of the row labels matters!

|

Broadcasting, Masking, Label Alignment
--------------------------------------

All of these work just like with ``Series``:

- **Broadcasting** is applied automatically to all rows::

      >>> small["SepalLength"] + small["SepalWidth"]
      114     8.6
      62      8.2
      33      9.7
      107    10.2
      7       8.4
      dtype: float64

  or to the whole table::

      >>> small + small
           SepalLength  SepalWidth  PetalLength  PetalWidth  \
      114         11.6         5.6         10.2         4.8
      62          12.0         4.4          8.0         2.0
      33          11.0         8.4          2.8         0.4
      107         14.6         5.8         12.6         3.6
      7           10.0         6.8          3.0         0.4

                                     Name
      114    Iris-virginicaIris-virginica
      62   Iris-versicolorIris-versicolor
      33           Iris-setosaIris-setosa
      107    Iris-virginicaIris-virginica
      7            Iris-setosaIris-setosa

  The usual **label alignment** (and corresponding mismatching-labels-equals-``nan``)
  rules apply::

      >>> small["SepalLength"][1:] + small["SepalWidth"][:-1]
      7       NaN
      33      9.7
      62      8.2
      107    10.2
      114     NaN
      dtype: float64

  .. warning::

      Both row labels and column labels are automatically aligned!
- **Masking** works as well, for instance::

      >>> small.PetalLength[small.PetalLength > 5]
      114    5.1
      107    6.3
      Name: PetalLength, dtype: float64
      >>> small["PetalLength"][small["PetalLength"] > 5]
      114    5.1
      107    6.3
      Name: PetalLength, dtype: float64
      >>> small["Name"][small["PetalLength"] > 5]
      114    Iris-virginica
      107    Iris-virginica
      Name: Name, dtype: object
      >>> small[["Name", "PetalLength", "SepalLength"]][small.Name == "Iris-virginica"]
                     Name  PetalLength  SepalLength
      114  Iris-virginica          5.1          5.8
      107  Iris-virginica          6.3          7.3

- **Statistics** can be computed on rows::

      >>> df.loc[114][:-1].mean()
      4.0249999999999995

  (here the ``[:-1]`` skips over the ``Name`` column), on columns::

      >>> df.PetalLength.mean()
      3.7586666666666662
      >>> df["PetalLength"].mean()
      3.7586666666666662

  or on the whole table::

      >>> df.mean()
      SepalLength    5.843333
      SepalWidth     3.054000
      PetalLength    3.758667
      PetalWidth     1.198667
      dtype: float64

  or between columns::

      >>> small.PetalLength.cov(small.SepalLength)
      1.6084999999999996

  You can combine them with *masking* as well (a tidier alternative based on ``loc`` is
  sketched right after this list)::

      >>> small[["Name", "PetalLength", "SepalLength"]][small.PetalLength > small.PetalLength.mean()]
                      Name  PetalLength  SepalLength
      114   Iris-virginica          5.1          5.8
      62   Iris-versicolor          4.0          6.0
      107   Iris-virginica          6.3          7.3
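Selecting rows by a Boolean mask and columns by name can also be done in a single
``loc`` call, which avoids chaining two indexing operations; a minimal sketch::

    >>> small.loc[small.PetalLength > 5, ["Name", "PetalLength"]]
                   Name  PetalLength
    114  Iris-virginica          5.1
    107  Iris-virginica          6.3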
|

Merging Tables
--------------

Merging different dataframes is performed using the ``merge()`` function. Merging means
that, given two tables with a common column name, first the rows with the same column
value are matched; then a new table is created by concatenating the matching rows. Here
is an explanatory image:

.. image:: figures/merge.png

(Taken from the `Merge, join, and concatenate
<http://pandas.pydata.org/pandas-docs/stable/merging.html>`_ section of the official
Pandas documentation.)

.. note::

    Only the column **name** must be the same in both tables: *the actual values in the
    column may differ*. What happens when the column **values** are different depends
    on the ``how`` argument of the ``merge()`` function. See below.

A simple example of merging two tables describing protein properties (appropriately
simplified for ease of exposition)::

    sequences = pd.DataFrame({
        "id": ["Q99697", "O18400", "P78337", "Q9W5Z2"],
        "seq": ["METNCR", "MDRSSA", "MDAFKG", "MTSMKD"],
    })

    names = pd.DataFrame({
        "id": ["Q99697", "O18400", "P78337", "P59583"],
        "name": ["PITX2_HUMAN", "PITX_DROME", "PITX1_HUMAN", "WRK32_ARATH"],
    })

    print "input dataframes:"
    print sequences
    print names

    print "full inner:"
    print pd.merge(sequences, names, on="id", how="inner")

    print "full left:"
    print pd.merge(sequences, names, on="id", how="left")

    print "full right:"
    print pd.merge(sequences, names, on="id", how="right")

    print "full outer:"
    print pd.merge(sequences, names, on="id", how="outer")

Note that the ``"id"`` column appears in both dataframes. However, the values in the
column differ: ``"Q9W5Z2"`` only appears in the ``sequences`` table, while ``"P59583"``
only appears in ``names``.

Let's see what happens using the four different joining semantics::

    input dataframes:
           id     seq
    0  Q99697  METNCR
    1  O18400  MDRSSA
    2  P78337  MDAFKG
    3  Q9W5Z2  MTSMKD
           id         name
    0  Q99697  PITX2_HUMAN
    1  O18400   PITX_DROME
    2  P78337  PITX1_HUMAN
    3  P59583  WRK32_ARATH

    # mismatched ids are dropped
    full inner:
           id     seq         name
    0  Q99697  METNCR  PITX2_HUMAN
    1  O18400  MDRSSA   PITX_DROME
    2  P78337  MDAFKG  PITX1_HUMAN

    # ids are taken from the left table
    full left:
           id     seq         name
    0  Q99697  METNCR  PITX2_HUMAN
    1  O18400  MDRSSA   PITX_DROME
    2  P78337  MDAFKG  PITX1_HUMAN
    3  Q9W5Z2  MTSMKD          NaN

    # ids are taken from the right table
    full right:
           id     seq         name
    0  Q99697  METNCR  PITX2_HUMAN
    1  O18400  MDRSSA   PITX_DROME
    2  P78337  MDAFKG  PITX1_HUMAN
    3  P59583     NaN  WRK32_ARATH

    # all ids are retained
    full outer:
           id     seq         name
    0  Q99697  METNCR  PITX2_HUMAN
    1  O18400  MDRSSA   PITX_DROME
    2  P78337  MDAFKG  PITX1_HUMAN
    3  Q9W5Z2  MTSMKD          NaN
    4  P59583     NaN  WRK32_ARATH

To summarize, the ``how`` argument establishes the semantics for the mismatched labels,
i.e. which rows are kept and where the ``nan``'s are inserted.

|

**Example**. Dataframe merging automatically takes care of replicating rows
appropriately if the same key appears twice. For instance, consider this code::

    sequences = pd.DataFrame({
        "id": ["Q99697", "Q99697"],
        "seq": ["METNCR", "METNCR"],
    })

    names = pd.DataFrame({
        "id": ["Q99697", "O18400", "P78337", "P59583"],
        "name": ["PITX2_HUMAN", "PITX_DROME", "PITX1_HUMAN", "WRK32_ARATH"],
    })

    print "input dataframes:"
    print sequences
    print names

    print "full:"
    print pd.merge(sequences, names, on="id", how="inner")

The result is::

    input dataframes:
           id     seq
    0  Q99697  METNCR
    1  Q99697  METNCR
           id         name
    0  Q99697  PITX2_HUMAN
    1  O18400   PITX_DROME
    2  P78337  PITX1_HUMAN
    3  P59583  WRK32_ARATH
    full:
           id     seq         name
    0  Q99697  METNCR  PITX2_HUMAN
    1  Q99697  METNCR  PITX2_HUMAN

As you can see, rows are *replicated appropriately*!

.. hint::

    This is *extremely* useful. Consider for instance the problem of having to compute
    statistics on all *three* gene-exon-CDS tables in the exercises section, jointly.
    Using ``merge()`` it is easy to merge the three tables into a single, big table,
    where the gene information is appropriately replicated for all gene exons, and the
    exon information is replicated for all coding sequences!

.. note::

    The ``merge()`` function can also join tables on more than one column: just pass a
    list of column names to the ``on`` argument. See the official documentation here:

        http://pandas.pydata.org/pandas-docs/stable/merging.html
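When the key column has a different name in each table, the ``left_on`` and ``right_on``
arguments can be used instead of ``on``; a small self-contained sketch (the tables and
column names below are made up for illustration)::

    left = pd.DataFrame({
        "uniprot_id": ["Q99697", "O18400"],
        "seq": ["METNCR", "MDRSSA"],
    })
    right = pd.DataFrame({
        "id": ["Q99697", "P59583"],
        "name": ["PITX2_HUMAN", "WRK32_ARATH"],
    })

    # match left.uniprot_id against right.id; both key columns are kept in the result
    print pd.merge(left, right, left_on="uniprot_id", right_on="id", how="inner")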
|

Grouping Tables
---------------

The ``groupby()`` method is essential for efficiently performing operations on groups
of rows -- especially split-apply-aggregate operations. A picture is worth one thousand
words, so here it is:

.. image:: figures/groupby.png

(Picture source: `Hadley Wickham's Data Science in R slides `_)

The ``groupby()`` method allows you to extract groups of rows sharing the same value
(the "split" part), perform some computation on them (the "apply" part), and then
combine the results into a single table over the groups (the "aggregate" part).

**Example**. Let's take the *iris dataset*. We want to compute the average of the four
measurement columns for the three different iris species. The result should be a
dataframe with 3 rows (one per species) and 4 columns (petal/sepal length/width).

First, let's load the CSV file and print some information::

    >>> iris = pd.read_csv("iris.csv")
    >>> iris.info()
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 150 entries, 0 to 149
    Data columns (total 5 columns):
    SepalLength    150 non-null float64
    SepalWidth     150 non-null float64
    PetalLength    150 non-null float64
    PetalWidth     150 non-null float64
    Name           150 non-null object
    dtypes: float64(4), object(1)
    memory usage: 5.9+ KB

Here is how to do that with ``groupby()``. First, let's group all rows of the dataframe
by the ``Name`` column::

    >>> grouped = iris.groupby(iris.Name)
    >>> type(grouped)
    <class 'pandas.core.groupby.DataFrameGroupBy'>
    >>> for group in grouped:
    ...     print group[0], group[1].shape
    ...
    Iris-setosa (50, 5)
    Iris-versicolor (50, 5)
    Iris-virginica (50, 5)

The ``groupby()`` method returns a ``pandas.DataFrameGroupBy`` object (the details are
not important). To select a single group, write::

    >>> group_df = grouped.get_group("Iris-versicolor")
    >>> print type(group_df)
    <class 'pandas.core.frame.DataFrame'>
    >>> group_df.info()
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 50 entries, 50 to 99
    Data columns (total 5 columns):
    SepalLength    50 non-null float64
    SepalWidth     50 non-null float64
    PetalLength    50 non-null float64
    PetalWidth     50 non-null float64
    Name           50 non-null object
    dtypes: float64(4), object(1)
    memory usage: 2.3+ KB

.. note::

    As shown by the above code snippet, iterating over the ``grouped`` variable returns
    ``(value-of-Name, DataFrame)`` tuples:

    - The first element is the value of the column ``Name`` shared by the group
    - The second element is a ``DataFrame`` including only the rows in that group.

|

The whole idea is that you can *apply* some transformation (e.g. ``mean()``) to the
individual groups **automatically**, using the ``aggregate()`` method directly on the
``grouped`` variable. For instance, let's compute the mean of all columns for the three
groups::

    >>> iris_mean_by_name = grouped.aggregate(pd.DataFrame.mean)
    >>> iris_mean_by_name.info()
    <class 'pandas.core.frame.DataFrame'>
    Index: 3 entries, Iris-setosa to Iris-virginica
    Data columns (total 4 columns):
    SepalLength    3 non-null float64
    SepalWidth     3 non-null float64
    PetalLength    3 non-null float64
    PetalWidth     3 non-null float64
    dtypes: float64(4)
    memory usage: 120.0+ bytes

So the result of ``aggregate()`` is a dataframe. Here is what it looks like::

    >>> iris_mean_by_name
                     SepalLength  SepalWidth  PetalLength  PetalWidth
    Name
    Iris-setosa            5.006       3.418        1.464       0.244
    Iris-versicolor        5.936       2.770        4.260       1.326
    Iris-virginica         6.588       2.974        5.552       2.026

.. warning::

    Note that the ``Name`` column of the ``iris`` dataframe becomes the index of the
    aggregated ``iris_mean_by_name`` dataframe: it is not a proper column anymore! The
    only actual columns are::

        >>> iris_mean_by_name.columns
        Index([u'SepalLength', u'SepalWidth', u'PetalLength', u'PetalWidth'],
              dtype='object')

|

.. note::

    The ``groupby()`` method can also group by more than one column: just pass a list
    of column names. See the official documentation here:

        http://pandas.pydata.org/pandas-docs/stable/groupby.html
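To wrap up, here is a small sketch showing that a grouping can also be restricted to a
single column and aggregated with several statistics at once (assuming the same
``iris.csv`` file)::

    >>> iris = pd.read_csv("iris.csv")
    >>> grouped = iris.groupby("Name")
    >>> # one row per species, one column per statistic
    >>> grouped["PetalLength"].agg(["mean", "std", "min", "max"])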