======
Pandas
======

What is Pandas?
---------------

Pandas is a freely available library for loading, manipulating, and visualizing
**sequential and tabular data**, for instance time series or microarrays. Its main
features are:

- **Loading** from and **saving** to "standard" tabular file formats, like *CSV*
  (comma-separated values), *TSV* (tab-separated values), *Excel* files, and
  *database* formats
- Flexible **indexing** and **aggregation** of series and tables
- Efficient **numerical** and **statistical** operations (e.g. broadcasting)
- Pretty, straightforward **visualization**

This list is far from complete!

|

Documentation & Source
----------------------

A few websites worth visiting are the official Pandas website:

    http://pandas.pydata.org/

as well as the official documentation:

    http://pandas.pydata.org/pandas-docs/stable/dsintro.html

The Pandas source code, available here:

    https://github.com/pandas-dev/pandas/

is distributed under the *BSD license*:

    https://github.com/pandas-dev/pandas/blob/master/LICENSE

|

Importing Pandas
----------------

In order to import Pandas, the customary idiom is::

    >>> import pandas as pd
    >>> help(pd)

but feel free to use this instead::

    >>> from pandas import *

To load the (optional, but super-useful) visualization library, write::

    >>> import matplotlib.pyplot as plt

For now, the only function you need is ``plt.show()``.

|

Pandas: A short demonstration
-----------------------------

Let's start with a copy of Fisher's own **iris dataset**:

    https://drive.google.com/open?id=0B0wILN942aEVYTVBekRHLTNON3c

The dataset records petal length/width and sepal length/width of three different iris
species: *Iris setosa*, *Iris versicolor*, and *Iris virginica*.

.. note::

    The iris dataset has quite some history -- there is even a Wikipedia page about it!

        https://en.wikipedia.org/wiki/Iris_flower_data_set

This is what the iris dataset in CSV format looks like::

    SepalLength,SepalWidth,PetalLength,PetalWidth,Name
    5.1,3.5,1.4,0.2,Iris-setosa
    4.9,3.0,1.4,0.2,Iris-setosa
    ...
    5.0,3.3,1.4,0.2,Iris-setosa
    7.0,3.2,4.7,1.4,Iris-versicolor
    6.4,3.2,4.5,1.5,Iris-versicolor
    ...
    5.7,2.8,4.1,1.3,Iris-versicolor
    6.3,3.3,6.0,2.5,Iris-virginica
    5.8,2.7,5.1,1.9,Iris-virginica
    ...
    5.9,3.0,5.1,1.8,Iris-virginica

In an effort to understand the dataset, we would like to visualize the relation between
the four properties for the case of *Iris virginica*. In pure Python, we would have to:

#. Load the dataset by parsing all the rows in the file, then
#. Keep only the rows pertaining to *Iris virginica*, then
#. Compute statistics on the values of the rows, making sure to convert from strings to
   ``float``'s as required, and then
#. Actually draw the plots by using a specialized plotting library.

With Pandas, this all becomes very easy. Check this out::

    >>> import pandas as pd
    >>> from pandas.tools.plotting import scatter_matrix
    >>> import matplotlib.pyplot as plt
    >>> df = pd.read_csv("iris.csv")
    >>> scatter_matrix(df[df.Name == "Iris-virginica"])
    >>> plt.show()

and that's it! The result is:

.. image:: figures/iris_scattermatrix.png
   :width: 100%

.. attention::

    It only looks bad because I used the default colors. Plots can be customized to
    your liking!

Of course, there is a lot more to Pandas than solving simple tasks like this.
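For instance, the following short sketch (which assumes the same ``iris.csv`` file is
available in the working directory) filters the table with a Boolean mask and produces a
full statistical summary of one species; both operations are explained in detail in the
sections below::

    >>> import pandas as pd
    >>> df = pd.read_csv("iris.csv")
    >>> # keep only the Iris virginica rows, then summarize every numerical column
    >>> df[df.Name == "Iris-virginica"].describe()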
|

Pandas datatypes
----------------

Pandas provides a couple of very useful datatypes, ``Series`` and ``DataFrame``:

- ``Series`` represents 1D data, like time series, calendars, the output of one-variable
  functions, etc.
- ``DataFrame`` represents 2D data, like a comma-separated-values (CSV) file, a
  microarray, a database table, a matrix, etc.

A ``DataFrame`` is composed of multiple ``Series``. More specifically, each column of a
``DataFrame`` is a ``Series``. That's why we will see how the ``Series`` datatype works
first.

.. attention::

    Most of what we will say about ``Series`` also applies to ``DataFrame``'s. The
    similarities include how indexing is done, how broadcasting applies to arithmetical
    operators and Boolean masks, how to compute statistics, how to deal with missing
    values (``nan``'s), and how plotting is done.

|

Pandas: ``Series``
------------------

A ``Series`` is a **one-dimensional** array with a labeled axis. It can hold arbitrary
objects. It works a little like a ``list`` and a little like a ``dict``. The axis is
called the ``index``, and can be used to access the elements; it is very flexible, and
not necessarily numerical.

To access the online documentation, type::

    >>> s = pd.Series()
    >>> help(s)

|

Creating a ``Series``
---------------------

Use the ``Series()`` constructor to create a new ``Series`` object. You can either
specify *both* the data and the index, like this::

    >>> s = pd.Series(["a", "b", "c"], index=[2, 5, 8])
    >>> s
    2    a
    5    b
    8    c
    dtype: object
    >>> s.index
    Int64Index([2, 5, 8], dtype='int64')

or pass a single ``dict`` in, in which case the index is built from the dictionary
keys::

    >>> s = pd.Series({"a": "A", "b": "B", "c": "C"})
    >>> s
    a    A
    b    B
    c    C
    dtype: object
    >>> s.index
    Index([u'a', u'b', u'c'], dtype='object')

or skip the index altogether: in this case, the default index (i.e. a numerical index
starting from ``0``) is automatically assigned::

    >>> s = pd.Series(["a", "b", "c"])
    >>> s
    0    a
    1    b
    2    c
    dtype: object
    >>> s.index
    RangeIndex(start=0, stop=3, step=1)

Finally, if given a single scalar (e.g. an ``int``), the ``Series`` constructor will
replicate it for all indices::

    >>> s = pd.Series(3, index=range(10))
    >>> s
    0    3
    1    3
    2    3
    3    3
    4    3
    5    3
    6    3
    7    3
    8    3
    9    3
    dtype: int64

.. note::

    The *type* of the index changes in the previous examples. The library uses
    different types of index for performance reasons. The ``Series`` class however
    works (more or less) the same in all cases, so you can ignore these differences in
    practical applications.

|

Accessing a ``Series``
----------------------

Let's create a ``Series`` representing the hours of sleep we had the chance to get each
day of the past week::

    >>> sleephours = [6, 2, 8, 5, 9]
    >>> days = ["mon", "tue", "wed", "thu", "fri"]
    >>> s = pd.Series(sleephours, index=days)
    >>> s
    mon    6
    tue    2
    wed    8
    thu    5
    fri    9
    dtype: int64

Now we can access the various elements of the ``Series`` both by their label (i.e. the
index) like we would with a ``dict``::

    >>> s["mon"]
    6

as well as by their position, like we would with a ``list``::

    >>> s[0]
    6

.. warning::

    If the label or position is wrong, you get either a ``KeyError``::

        >>> s["sat"]
        KeyError: 'sat'

    or an ``IndexError``::

        >>> s[9]
        IndexError: index out of bounds
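If you would rather get a default value than an exception when a label is missing, the
dictionary-like ``get()`` method can be used instead; a minimal sketch::

    >>> s.get("sat")       # returns None instead of raising a KeyError
    >>> s.get("sat", 0)    # returns the supplied default instead
    0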
We can also slice the positions, like we would with a ``list``::

    >>> s[-3:]
    wed    8
    thu    5
    fri    9
    dtype: int64

Note that *both* the data and the index are extracted correctly. It also works with
labels::

    >>> s["tue":"thu"]
    tue    2
    wed    8
    thu    5
    dtype: int64

Note that, differently from positional slices, label slices *include* the right
endpoint.

The first and last ``n`` elements can also be extracted using ``head()`` and
``tail()``::

    >>> s.head(2)
    mon    6
    tue    2
    dtype: int64
    >>> s.tail(3)
    wed    8
    thu    5
    fri    9
    dtype: int64

Most importantly, we can also explicitly pass in a **list of positions** (or labels),
like this::

    >>> s[[0, 1, 2]]
    mon    6
    tue    2
    wed    8
    dtype: int64
    >>> s[["mon", "wed", "fri"]]
    mon    6
    wed    8
    fri    9
    dtype: int64

.. warning::

    Passing in a **tuple** of positions does not work!

|

Operator Broadcasting
---------------------

The ``Series`` class automatically broadcasts arithmetical operations with a scalar to
all of the elements. Consider adding ``1`` to our ``Series``::

    >>> s
    mon    6
    tue    2
    wed    8
    thu    5
    fri    9
    dtype: int64
    >>> s + 1
    mon     7
    tue     3
    wed     9
    thu     6
    fri    10
    dtype: int64

As you can see, the addition was applied to **all** the entries! This also applies to
multiplication and other arithmetical operations::

    >>> s * 2
    mon    12
    tue     4
    wed    16
    thu    10
    fri    18
    dtype: int64

This is extremely useful: it frees the analyst from having to write loops. The (worse)
alternative to ``s = s + 1`` would be::

    for x in s.index:
        s[x] += 1

which is definitely more verbose and error-prone, and also takes more time to execute!

.. note::

    The concept of *operator broadcasting* was taken from the ``numpy`` library, and is
    one of the key features for writing efficient, clean numerical code in Python. In a
    way, it is a generalization of multiplying a vector by a scalar (from linear
    algebra). The rules governing how broadcasting is applied can be pretty complex
    (and confusing). Here we will cover broadcasting with scalars only.

|

Masks and Filtering
-------------------

Most importantly, broadcasting also applies to **conditions**::

    >>> s > 6
    mon    False
    tue    False
    wed     True
    thu    False
    fri     True
    dtype: bool

Here the result is a ``Series`` with the same index as the original (that is, the very
same labels), and each label ``label`` is associated to the result of the comparison
``s[label] > 6``. This kind of series is called a **mask**.

Is it useful in any way? Yes, of course! Masks can be used to filter the elements of a
``Series`` according to a given condition::

    >>> s[s > 6]
    wed    8
    fri    9
    dtype: int64

Here only those entries corresponding to ``True``'s in the mask have been kept. For
extracting the elements that satisfy the opposite of the condition, write::

    >>> s[s <= 6]
    mon    6
    tue    2
    thu    5
    dtype: int64

or::

    >>> s[~(s > 6)]
    mon    6
    tue    2
    thu    5
    dtype: int64

The latter version is useful when the condition you are checking for is complex and
cannot be inverted easily.

|

Automatic Label Alignment
-------------------------

Operations between multiple ``Series`` are automatically aligned by label, meaning that
elements with the same label are matched prior to carrying out the operation. Consider
this example::

    >>> s[1:] + s[:-1]
    fri     NaN    # <-- not common
    mon     NaN    # <-- not common
    thu    10.0    # <-- common element
    tue     4.0    # <-- common element
    wed    16.0    # <-- common element
    dtype: float64

As you can see, the elements are summed in the right order! But what's up with those
``nan``'s?

The index of the resulting ``Series`` is the **union** of the indices of the operands.
What happens depends on whether a given label appears in both input ``Series`` or not:

- For common labels (in our case ``"tue"``, ``"wed"``, ``"thu"``), the output
  ``Series`` contains the sum of the aligned elements.
- For labels appearing in only one of the operands (``"mon"`` and ``"fri"``), the
  result is a ``nan``, i.e. not-a-number.
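If you would rather treat the missing side as zero instead of getting a ``nan``, the
arithmetic *methods* (``add()``, ``sub()``, etc.) accept a ``fill_value`` argument; a
minimal sketch::

    >>> s[1:].add(s[:-1], fill_value=0)
    fri     9.0
    mon     6.0
    thu    10.0
    tue     4.0
    wed    16.0
    dtype: float64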
.. warning::

    ``nan`` means *not a number*. It is just a symbolic constant that specifies that
    the object is a number-like entity with an invalid or undefined value. Think for
    instance of the result of a division by 0.

    In the example above, the result is a ``nan`` whenever it represents the sum of an
    existing element, like ``s[1:]["fri"]``, and a non-existing one, like
    ``s[:-1]["fri"]``. It is used to indicate undefined results (like above) or missing
    measurements (as we will see).

    The actual value of the ``nan`` symbol is taken from ``numpy``, see ``numpy.nan``.
    You can actually use it to insert ``nan``'s manually::

        >>> import numpy as np
        >>> s["fri"] = np.nan
        >>> s
        mon    6.0
        tue    2.0
        wed    8.0
        thu    5.0
        fri    NaN
        dtype: float64

|

Dealing with Missing Data
-------------------------

By default, missing entries and invalid results in Pandas are represented as ``nan``.
Consider the ``Series`` from the previous example::

    >>> t = s[1:] + s[:-1]
    >>> t
    fri     NaN
    mon     NaN
    thu    10.0
    tue     4.0
    wed    16.0
    dtype: float64

There are different strategies for dealing with ``nan``'s. There is no "best" strategy:
you have to pick one depending on the problem you are trying to solve.

- Drop all entries that have a ``nan``, with ``dropna()`` (read it as *drop N/A*, i.e.
  not available)::

      >>> nonan_t = t.dropna()
      >>> nonan_t
      thu    10.0
      tue     4.0
      wed    16.0
      dtype: float64

- Fill in the ``nan`` entries with a more palatable value, for instance a default value
  that does not disrupt your statistics::

      >>> t.fillna(0.0)
      fri     0.0
      mon     0.0
      thu    10.0
      tue     4.0
      wed    16.0
      dtype: float64

  or::

      >>> t.fillna("unknown")
      fri    unknown
      mon    unknown
      thu         10
      tue          4
      wed         16
      dtype: object

- Fill in the ``nan``'s by imputing the missing values from the surrounding ones. For
  instance, given a ``Series`` with several missing entries, ``ffill()`` propagates the
  last valid value forward::

      >>> u = pd.Series([np.nan, 2.0, 8.0, np.nan, np.nan], index=days)
      >>> u
      mon    NaN
      tue    2.0
      wed    8.0
      thu    NaN
      fri    NaN
      dtype: float64
      >>> u.ffill()
      mon    NaN
      tue    2.0
      wed    8.0
      thu    8.0
      fri    8.0
      dtype: float64

.. warning::

    Most Pandas functions deal gracefully with missing values. For instance, the
    ``mean()`` method has a ``skipna`` argument that can be either ``True`` or
    ``False``. If it is ``True`` (the default), the mean simply ignores the missing
    values; if it is ``False``, the mean of a series containing a ``nan`` will itself
    be ``nan`` (as it should in purely mathematical terms).

|

Computing Statistics
--------------------

``Series`` objects support a variety of mathematical and statistical operations. In the
examples below, ``s`` is again the original (integer) sleep-hours series:

- The sum and product of all values in the ``Series``::

      >>> s.sum()
      30
      >>> s.prod()
      4320

- The cumulative sum::

      >>> s.cumsum()
      mon     6
      tue     8
      wed    16
      thu    21
      fri    30
      dtype: int64

- The minimum and maximum values and their indices::

      >>> s.max()
      9
      >>> s.argmax()
      'fri'
      >>> s[s.argmax()]
      9

  and the same for ``s.min()`` and ``s.argmin()``.

- The mean, variance, std. deviation, median, quantiles, etc.::

      >>> s.mean()
      6.0
      >>> s.var()
      7.5
      >>> s.std()
      2.7386127875258306
      >>> s.median()
      6.0
      >>> s.quantile([0.25, 0.5, 0.75])
      0.25    5.0
      0.50    6.0
      0.75    8.0
      dtype: float64

  These can be paired with **masks** for awesome extraction effects::

      >>> s[s > s.quantile(0.66)]
      wed    8
      fri    9
      dtype: int64

- The Pearson, Kendall, Spearman correlations between series::

      >>> s.corr(s)  # Pearson by default
      1.0
      >>> s.corr(s, method="spearman")
      0.99999999999999989

- The autocorrelation with arbitrary time lag::

      >>> s.autocorr(lag=0)
      1.0
      >>> s.autocorr(lag=1)
      -0.54812812776251907
      >>> s.autocorr(lag=2)
      0.99587059488582252

- ... and many, many others.
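As a small illustration of how broadcasting and these statistics combine, here is a
sketch that standardizes the series to zero mean and unit standard deviation (a
*z-score*)::

    >>> (s - s.mean()) / s.std()
    mon    0.000000
    tue   -1.460593
    wed    0.730297
    thu   -0.365148
    fri    1.095445
    dtype: float64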
.. note::

    A quick way to examine a ``Series`` is to use the ``describe()`` method::

        >>> s.describe()
        count    5.000000
        mean     6.000000
        std      2.738613
        min      2.000000
        25%      5.000000
        50%      6.000000
        75%      8.000000
        max      9.000000
        dtype: float64

.. warning::

    You are not required to *remember* all these methods! Just be aware that if you
    need a basic statistical tool, ``pandas`` probably has it.

|

Plotting
--------

We want to visualize our simple sleeping hours data::

    >>> sleephours = [6, 2, 8, 5, 9]
    >>> days = ["mon", "tue", "wed", "thu", "fri"]
    >>> s = pd.Series(sleephours, index=days)

Now that we have a ``Series``, let's import the additional ``matplotlib`` module::

    >>> import matplotlib.pyplot as plt

This is the customary idiom to import the library. Let's plot the data!

- With a simple **line plot**::

      >>> s.plot()
      >>> plt.show()

- With a more readable **bar plot**::

      >>> s.plot(kind="bar")
      >>> plt.show()

- With a **histogram**::

      >>> s.plot(kind="hist")
      >>> plt.show()

.. note::

    There are very many ways to improve the looks of your plots, like changing the
    shape, size, margins, colors, labels, etc. Browse the documentation for
    inspiration:

        http://pandas.pydata.org/pandas-docs/stable/visualization.html

.. note::

    To save the plot to a file instead of merely showing it on screen, use the
    ``savefig()`` function of ``matplotlib``::

        >>> plt.savefig("output.png")

|

Pandas: ``DataFrame``
---------------------

A Pandas ``DataFrame`` is the 2D analogue of a ``Series``: it is essentially a
**table** of heterogeneous objects. Recall that a ``Series`` holds both the *values*
and the *labels* of all its elements. Analogously, a ``DataFrame`` has a few major
attributes:

- the ``index``, which holds the labels of the **rows**
- the ``columns``, which holds the labels of the **columns**
- the ``shape``, which describes the **dimensions** of the table

Each column in the ``DataFrame`` is a ``Series``: when you extract a column from a
``DataFrame`` you get a proper ``Series``, and you can operate on it using all the
tools presented in the previous sections.

Further, most of the operations that you can do on a ``Series``, you can also do on an
entire ``DataFrame``. The same principles apply: broadcasting, label alignment, and
plotting all work very similarly.

.. warning::

    Not all operations work *exactly* the same on the two datatypes. This is of course
    because some operations that make sense on a 2D table do not really apply to a 1D
    sequence, and vice versa. However the analogy holds well enough in most cases.

|

Creating a ``DataFrame``
------------------------

Just like for ``Series``, there are various options. We can:

- Create a ``DataFrame`` from a dictionary of ``Series``::

      >>> d = {
      ...     "name":
      ...         pd.Series(["bobby", "ronald", "ronald", "ronald"]),
      ...     "surname":
      ...         pd.Series(["fisher", "fisher", "reagan", "mcdonald"]),
      ... }
      >>> df = pd.DataFrame(d)
      >>> df
           name   surname
      0   bobby    fisher
      1  ronald    fisher
      2  ronald    reagan
      3  ronald  mcdonald
      >>> df.columns
      Index([u'name', u'surname'], dtype='object')
      >>> df.index
      RangeIndex(start=0, stop=4, step=1)

As you can see:

- the *keys* of the dictionary became the ``columns`` of ``df``
- the ``index`` of the various ``Series`` became the ``index`` of ``df``.

.. warning::

    If the indices of the input ``Series`` do not match, label alignment applies and
    the missing values are treated as ``nan``'s.
This is exactly what happens in this example::

    >>> d = {
    ...     "name":
    ...         pd.Series([0, 0], index=["a", "b"]),
    ...     "surname":
    ...         pd.Series([0, 0], index=["c", "d"]),
    ... }
    >>> df = pd.DataFrame(d)
    >>> df
       name  surname
    a   0.0      NaN
    b   0.0      NaN
    c   NaN      0.0
    d   NaN      0.0
    >>> df.columns
    Index(['name', 'surname'], dtype='object')
    >>> df.index
    Index(['a', 'b', 'c', 'd'], dtype='object')

- Create a ``DataFrame`` from a single dictionary::

      >>> d = {
      ...     "column1": [1., 2., 6., -1.],
      ...     "column2": [0., 1., -2., 4.],
      ... }
      >>> df = pd.DataFrame(d)
      >>> df
         column1  column2
      0      1.0      0.0
      1      2.0      1.0
      2      6.0     -2.0
      3     -1.0      4.0

  Again, the ``columns`` are taken from the *keys*, while the ``index`` is set to the
  default one (i.e. ``range(0, num_rows)``). A custom ``index`` can be specified in the
  usual way::

      >>> df = pd.DataFrame(d, index=["a", "b", "c", "d"])
      >>> df
         column1  column2
      a      1.0      0.0
      b      2.0      1.0
      c      6.0     -2.0
      d     -1.0      4.0

- Create a ``DataFrame`` from a list of dictionaries::

      >>> data = [
      ...     {"a": 1, "b": 2},
      ...     {"a": 2, "c": 3},
      ... ]
      >>> df = pd.DataFrame(data)
      >>> df
         a    b    c
      0  1  2.0  NaN
      1  2  NaN  3.0

  Here the ``columns`` are taken (again) from the *keys* of the dictionaries, while the
  ``index`` is the default one. Since not all *keys* appear in all input dictionaries,
  missing values (i.e. ``nan``'s) are automatically added to ``df``.

.. note::

    Once a ``DataFrame`` has been created, you can interactively change its ``index``
    and ``columns``::

        >>> d = {
        ...     "column1": [1., 2., 6., -1.],
        ...     "column2": [0., 1., -2., 4.],
        ... }
        >>> df = pd.DataFrame(d)
        >>> df
           column1  column2
        0      1.0      0.0
        1      2.0      1.0
        2      6.0     -2.0
        3     -1.0      4.0
        >>> df.columns = ["foo", "bar"]
        >>> df.columns
        Index([u'foo', u'bar'], dtype='object')
        >>> df.index = range(df.shape[0])
        >>> df.index
        Int64Index([0, 1, 2, 3], dtype='int64', length=4)

|

Loading a CSV file
------------------

Of course, the advantage of Pandas is that it allows you to load data from many
different file formats. Here we will only consider **CSV** (comma-separated values)
files and similar.

To load a CSV file into a ``DataFrame``, for instance the *iris dataset*, type::

    >>> df = pd.read_csv("iris.csv")
    >>> print type(df)
    <class 'pandas.core.frame.DataFrame'>

Now that it's loaded, let's print some information about it::

    >>> df.columns
    Index([u'SepalLength', u'SepalWidth', u'PetalLength', u'PetalWidth', u'Name'],
          dtype='object')
    >>> df.index
    RangeIndex(start=0, stop=150, step=1)
    >>> df.shape
    (150, 5)
    >>> print df
    ...
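Saving works just as easily; here is a quick sketch that writes the same table back to a
new file with the ``to_csv()`` method (``index=False`` leaves out the row labels)::

    >>> df.to_csv("iris_copy.csv", index=False)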
.. note::

    The CSV file format is not really well-defined. This means that the CSV files that
    you'll encounter in the wild can (and probably will!) be very different from one
    another. Given the diversity of this (under-specified) file format, the
    ``read_csv()`` function provides a *large* number of options to deal with many of
    the (otherwise code-breaking) differences between CSV-like files.

Here is an excerpt from the ``read_csv()`` manual::

    >>> help(pd.read_csv)
    Help on function read_csv in module pandas.io.parsers:

    read_csv(filepath_or_buffer, sep=',', delimiter=None, header='infer', names=None,
        index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True,
        dtype=None, engine=None, converters=None, true_values=None, false_values=None,
        skipinitialspace=False, skiprows=None, nrows=None, na_values=None,
        keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True,
        parse_dates=False, infer_datetime_format=False, keep_date_col=False,
        date_parser=None, dayfirst=False, iterator=False, chunksize=None,
        compression='infer', thousands=None, decimal='.', lineterminator=None,
        quotechar='"', quoting=0, escapechar=None, comment=None, encoding=None,
        dialect=None, tupleize_cols=False, error_bad_lines=True, warn_bad_lines=True,
        skipfooter=0, skip_footer=0, doublequote=True, delim_whitespace=False,
        as_recarray=False, compact_ints=False, use_unsigned=False, low_memory=True,
        buffer_lines=None, memory_map=False, float_precision=None)
        Read CSV (comma-separated) file into DataFrame

        Also supports optionally iterating or breaking of the file
        into chunks.

        Additional help can be found in the `online docs for IO Tools `_.

As you can see, you can tweak the ``read_csv()`` function in very many ways, in order to
make it load your data the proper way.

.. note::

    A similar function exists for reading Excel files, ``read_excel()``.

|

**Example: A malformed TSV file.** Let's load a TAB-separated file instead (i.e. a CSV
file that uses TABs instead of commas). The file can be found here:

    https://drive.google.com/open?id=0B0wILN942aEVeWdScy1wQnA3LTA

It describes a mapping from UniProt protein IDs (i.e. sequences) to hits in the Protein
Data Bank (i.e. structures). The TSV file looks like this::

    # 2014/07/08 - 14:59
    SP_PRIMARY      PDB
    A0A011  3vk5;3vka;3vkb;3vkc;3vkd
    A0A2Y1  2jrd
    A0A585  4mnq
    A0A5B4  2ij0
    A0A5B9  2bnq;2eyr;2eys;2eyt;3kxf;3o6f;3o8x;3o9w;3qux;3t0e;3ta3;3tvm
    A0A5E3  1hq4;1ob1
    A0AEF5  4iu2;4iu3
    A0AEF6  4iu2;4iu3
    A0AQQ7  2ib5

Note that the two columns are separated by a TAB. Note also that the first row is
actually a comment.

We can use the ``sep`` (separator) argument of the ``read_csv()`` function to take care
of the TABs. Let's write::

    >>> df = pd.read_csv("uniprot_pdb.tsv", sep="\t")
    >>> df.shape
    (33637, 1)
    >>> df.columns
    Index([u'# 2014/07/08 - 14:59'], dtype='object')
    >>> df.head()
                   # 2014/07/08 - 14:59
    SP_PRIMARY                      PDB
    A0A011     3vk5;3vka;3vkb;3vkc;3vkd
    A0A2Y1                         2jrd
    A0A585                         4mnq
    A0A5B4                         2ij0
    >>> df.head().shape
    (5, 1)

Unfortunately, the shape of the ``DataFrame`` is wrong: there is only **one** column!
The problem is that the very first line (the comment) contains no TABs: this makes
Pandas think that there is only one column. We have to tell ``read_csv()`` to either
ignore the first line::

    >>> df = pd.read_csv("uniprot_pdb.tsv", sep="\t", skiprows=1)
    >>> df.shape
    (33636, 2)
    >>> df.head()
      SP_PRIMARY                                                PDB
    0     A0A011                           3vk5;3vka;3vkb;3vkc;3vkd
    1     A0A2Y1                                               2jrd
    2     A0A585                                               4mnq
    3     A0A5B4                                               2ij0
    4     A0A5B9  2bnq;2eyr;2eys;2eyt;3kxf;3o6f;3o8x;3o9w;3qux;3...
    >>> df.head().shape
    (5, 2)

or, as an alternative, to ignore all comment lines, i.e. all lines that start with a
hash (``"#"``) character::

    >>> df = pd.read_csv("uniprot_pdb.tsv", sep="\t", comment="#")
    >>> df.shape
    (33636, 2)
    >>> df.head()
      SP_PRIMARY                                                PDB
    0     A0A011                           3vk5;3vka;3vkb;3vkc;3vkd
    1     A0A2Y1                                               2jrd
    2     A0A585                                               4mnq
    3     A0A5B4                                               2ij0
    4     A0A5B9  2bnq;2eyr;2eys;2eyt;3kxf;3o6f;3o8x;3o9w;3qux;3...
    >>> df.head().shape
    (5, 2)

Now the number of columns is correct!
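As an aside, if you would rather use the UniProt IDs as the row labels instead of a
regular column, the ``index_col`` option (visible in the signature above) can be
combined with the previous ones; a minimal sketch::

    >>> df = pd.read_csv("uniprot_pdb.tsv", sep="\t", comment="#",
    ...                  index_col="SP_PRIMARY")
    >>> df.shape        # only the PDB column is left; the IDs are now the index
    (33636, 1)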
|

**Example: microarray data.** Let's also try to load the **microarray** file in Stanford
PCL format (namely, a disguised TAB-separated-values file) available here:

    https://drive.google.com/open?id=0B0wILN942aEVMENTUnRpa05uY0k

Again, it is pretty easy::

    >>> df = pd.read_csv("2010.DNAdamage.pcl", sep="\t")
    >>> df.shape
    (6128, 55)
    >>> df.columns
    Index([u'YORF', u'NAME', u'GWEIGHT', u'wildtype + 0.02% MMS (5 min)',
           u'wildtype + 0.02% MMS (15 min)', u'wildtype + 0.02% MMS (30 min)',
           u'wildtype + 0.2% MMS (45 min)', u'wildtype + 0.02% MMS (60 min)',
           u'wildtype + 0.02% MMS (90 min)', u'wildtype + 0.02% MMS (120 min)',
           u'mec1 mutant + 0.02% MMS (5 min)', u'mec1 mutant + 0.02% MMS (15 min)',
           u'mec1 mutant + 0.02% MMS (30 min)', u'mec1 mutant + 0.02% MMS (45 min)',
           u'mec1 mutant + 0.02% MMS (60 min)', u'mec1 mutant + 0.02% MMS (90 min)',
           u'mec1 mutant + 0.02% MMS (120 min)', u'dun1 mutant + 0.02% MMS (30 min)',
           u'dun1 mutant + 0.02% MMS (90 min)', u'dun1 mutant + 0.02% MMS (120 min)',
           u'wildtype + gamma irradiation (5 min)',
           u'wildtype + gamma irradiation (10 min)',
           ...
           u'crt1 mutant vs wild type (expt 1)', u'crt1 mutant vs. wild type (expt 2)',
           u'GAL-inducible ROX1 on galactose'],
          dtype='object')

.. warning::

    ... at least at a quick glance. Additional care must be taken for this data: some
    columns are not separated correctly (just look at the file with the ``less``
    command to see what I mean). Feel free to try and improve the example above to
    properly load the microarray file!

|

Extracting Rows and Columns
---------------------------

Row and column extraction from a ``DataFrame`` can range from very easy to very
complex, depending on what you want to do. Here we will only deal with elementary
extraction methods.

.. note::

    If you wish to know more, especially about the ever useful ``ix`` attribute, take a
    look at the documentation:

        http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe

        http://pandas.pydata.org/pandas-docs/stable/indexing.html

        http://pandas.pydata.org/pandas-docs/stable/advanced.html

The (non-exhaustive) list of alternatives is:

=============================== =================== =========
Operation                       Syntax              Result
=============================== =================== =========
Select column                   ``df[col]``         Series
Select row by label             ``df.loc[label]``   Series
Select row by integer location  ``df.iloc[loc]``    Series
Slice rows                      ``df[5:10]``        DataFrame
Select rows by boolean vector   ``df[bool_vec]``    DataFrame
=============================== =================== =========

.. note::

    For simplicity, in the following examples we will use a random sample taken from
    the *iris dataset*, computed like this::

        >>> import numpy as np
        >>> import pandas as pd
        >>> np.random.seed(0)
        >>> df = pd.read_csv("iris.csv")
        >>> small = df.iloc[np.random.permutation(df.shape[0])].head()
        >>> small.shape
        (5, 5)
        >>> small
             SepalLength  SepalWidth  PetalLength  PetalWidth             Name
        114          5.8         2.8          5.1         2.4   Iris-virginica
        62           6.0         2.2          4.0         1.0  Iris-versicolor
        33           5.5         4.2          1.4         0.2      Iris-setosa
        107          7.3         2.9          6.3         1.8   Iris-virginica
        7            5.0         3.4          1.5         0.2      Iris-setosa

    Brief explanation: here we use ``numpy.random.permutation()`` to generate a random
    permutation of the integers from ``0`` to ``df.shape[0]``, i.e. the number of rows
    in the *iris dataset*; then we use these shuffled integers as row positions to
    permute all the rows of ``df``; finally, we take the first ``5`` rows of the
    permuted ``df`` using the ``head()`` method.
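Incidentally, recent versions of Pandas offer a one-line alternative for this kind of
random sampling, the ``sample()`` method; a small sketch (note that it will not
necessarily pick the same five rows as the permutation trick above)::

    >>> small = df.sample(5, random_state=0)   # five random rows, reproducible seed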
- A ``DataFrame`` can now be accessed like a regular dictionary to extract individual
  **columns** -- indeed, you can also print its *keys*!::

      >>> small.columns
      Index([u'SepalLength', u'SepalWidth', u'PetalLength', u'PetalWidth', u'Name'],
            dtype='object')
      >>> small.keys()
      Index([u'SepalLength', u'SepalWidth', u'PetalLength', u'PetalWidth', u'Name'],
            dtype='object')

  Now we access ``small`` by **column name**; the result is a ``Series``::

      >>> type(small["Name"])
      <class 'pandas.core.series.Series'>
      >>> small["Name"]
      114     Iris-virginica
      62     Iris-versicolor
      33         Iris-setosa
      107     Iris-virginica
      7          Iris-setosa
      Name: Name, dtype: object

  If the name of the column is compatible with the Python conventions for variable
  names, you can also treat columns as if they were actual attributes of the ``df``
  object::

      >>> small.Name
      114     Iris-virginica
      62     Iris-versicolor
      33         Iris-setosa
      107     Iris-virginica
      7          Iris-setosa
      Name: Name, dtype: object

  .. warning::

      Here ``Name`` is an attribute created on-the-fly by Pandas when loading the
      *iris* dataset!

  It is possible to extract **multiple columns** in one go; the result will be a
  ``DataFrame``::

      >>> small[["SepalLength", "PetalLength"]]
           SepalLength  PetalLength
      114          5.8          5.1
      62           6.0          4.0
      33           5.5          1.4
      107          7.3          6.3
      7            5.0          1.5

  .. warning::

      Note that the order of the column labels matters! Compare this with the previous
      code fragment::

          >>> small[["PetalLength", "SepalLength"]]
               PetalLength  SepalLength
          114          5.1          5.8
          62           4.0          6.0
          33           1.4          5.5
          107          6.3          7.3
          7            1.5          5.0

- To extract the **rows**, we can use the ``loc`` and ``iloc`` attributes:

  - ``loc`` retrieves a row **by label**::

        >>> small.loc[114]
        SepalLength               5.8
        SepalWidth                2.8
        PetalLength               5.1
        PetalWidth                2.4
        Name           Iris-virginica
        Name: 114, dtype: object
        >>> type(small.loc[114])
        <class 'pandas.core.series.Series'>

    The result is a ``Series``.

  - ``iloc`` retrieves a row **by position**::

        >>> small.iloc[0]
        SepalLength               5.8
        SepalWidth                2.8
        PetalLength               5.1
        PetalWidth                2.4
        Name           Iris-virginica
        Name: 114, dtype: object

    The result is, again, a ``Series``.

  - The ``loc`` and ``iloc`` attributes also allow to retrieve **multiple rows**::

        >>> small.loc[[114, 7, 62]]
             SepalLength  SepalWidth  PetalLength  PetalWidth             Name
        114          5.8         2.8          5.1         2.4   Iris-virginica
        7            5.0         3.4          1.5         0.2      Iris-setosa
        62           6.0         2.2          4.0         1.0  Iris-versicolor
        >>> small.iloc[[0, 1, 2]]
             SepalLength  SepalWidth  PetalLength  PetalWidth             Name
        114          5.8         2.8          5.1         2.4   Iris-virginica
        62           6.0         2.2          4.0         1.0  Iris-versicolor
        33           5.5         4.2          1.4         0.2      Iris-setosa

    This time, the result is a ``DataFrame``.

  .. warning::

      Again, note that the order of the row labels matters!

|

Broadcasting, Masking, Label Alignment
--------------------------------------

All of these work just like with ``Series``:

- **Broadcasting** is applied automatically to all rows::

      >>> small["SepalLength"] + small["SepalWidth"]
      114     8.6
      62      8.2
      33      9.7
      107    10.2
      7       8.4
      dtype: float64

  or to the whole table::

      >>> small + small
           SepalLength  SepalWidth  PetalLength  PetalWidth  \
      114         11.6         5.6         10.2         4.8
      62          12.0         4.4          8.0         2.0
      33          11.0         8.4          2.8         0.4
      107         14.6         5.8         12.6         3.6
      7           10.0         6.8          3.0         0.4

                                     Name
      114    Iris-virginicaIris-virginica
      62   Iris-versicolorIris-versicolor
      33           Iris-setosaIris-setosa
      107    Iris-virginicaIris-virginica
      7            Iris-setosaIris-setosa

  The usual **label alignment** (and corresponding mismatching-labels-equals-``nan``)
  rules apply::

      >>> small["SepalLength"][1:] + small["SepalWidth"][:-1]
      7       NaN
      33      9.7
      62      8.2
      107    10.2
      114     NaN
      dtype: float64

  .. warning::

      Both row labels and column labels are automatically aligned!
- **Masking** works as well, for instance::

      >>> small.PetalLength[small.PetalLength > 5]
      114    5.1
      107    6.3
      Name: PetalLength, dtype: float64
      >>> small["PetalLength"][small["PetalLength"] > 5]
      114    5.1
      107    6.3
      Name: PetalLength, dtype: float64
      >>> small["Name"][small["PetalLength"] > 5]
      114    Iris-virginica
      107    Iris-virginica
      Name: Name, dtype: object
      >>> small[["Name", "PetalLength", "SepalLength"]][small.Name == "Iris-virginica"]
                     Name  PetalLength  SepalLength
      114  Iris-virginica          5.1          5.8
      107  Iris-virginica          6.3          7.3

- **Statistics** can be computed on rows::

      >>> df.loc[114][:-1].mean()
      4.0249999999999995

  (here the ``[:-1]`` skips over the ``Name`` column), on columns::

      >>> df.PetalLength.mean()
      3.7586666666666662
      >>> df["PetalLength"].mean()
      3.7586666666666662

  or on the whole table::

      >>> df.mean()
      SepalLength    5.843333
      SepalWidth     3.054000
      PetalLength    3.758667
      PetalWidth     1.198667
      dtype: float64

  or between columns::

      >>> small.PetalLength.cov(small.SepalLength)
      1.6084999999999996

  You can combine them with *masking* as well (a tidier alternative based on ``loc`` is
  sketched right after this list)::

      >>> small[["Name", "PetalLength", "SepalLength"]][small.PetalLength > small.PetalLength.mean()]
                      Name  PetalLength  SepalLength
      114   Iris-virginica          5.1          5.8
      62   Iris-versicolor          4.0          6.0
      107   Iris-virginica          6.3          7.3
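Selecting rows by a Boolean mask and columns by name can also be done in a single
``loc`` call, which avoids chaining two indexing operations; a minimal sketch::

    >>> small.loc[small.PetalLength > 5, ["Name", "PetalLength"]]
                   Name  PetalLength
    114  Iris-virginica          5.1
    107  Iris-virginica          6.3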
|

Merging Tables
--------------

Merging different dataframes is performed using the ``merge()`` function. Merging means
that, given two tables with a common column name, first the rows with the same column
value are matched; then a new table is created by concatenating the matching rows. Here
is an explanatory image:

.. image:: figures/merge.png

(Taken from the `Merge, join, and concatenate
<http://pandas.pydata.org/pandas-docs/stable/merging.html>`_ section of the official
Pandas documentation.)

.. note::

    Only the column **name** must be the same in both tables: *the actual values in the
    column may differ*. What happens when the column **values** are different depends
    on the ``how`` argument of the ``merge()`` function. See below.

A simple example of merging two tables describing protein properties (appropriately
simplified for ease of exposition)::

    sequences = pd.DataFrame({
        "id": ["Q99697", "O18400", "P78337", "Q9W5Z2"],
        "seq": ["METNCR", "MDRSSA", "MDAFKG", "MTSMKD"],
    })

    names = pd.DataFrame({
        "id": ["Q99697", "O18400", "P78337", "P59583"],
        "name": ["PITX2_HUMAN", "PITX_DROME", "PITX1_HUMAN", "WRK32_ARATH"],
    })

    print "input dataframes:"
    print sequences
    print names

    print "full inner:"
    print pd.merge(sequences, names, on="id", how="inner")

    print "full left:"
    print pd.merge(sequences, names, on="id", how="left")

    print "full right:"
    print pd.merge(sequences, names, on="id", how="right")

    print "full outer:"
    print pd.merge(sequences, names, on="id", how="outer")

Note that the ``"id"`` column appears in both dataframes. However, the values in the
column differ: ``"Q9W5Z2"`` only appears in the ``sequences`` table, while ``"P59583"``
only appears in ``names``.

Let's see what happens using the four different joining semantics::

    input dataframes:
           id     seq
    0  Q99697  METNCR
    1  O18400  MDRSSA
    2  P78337  MDAFKG
    3  Q9W5Z2  MTSMKD
           id         name
    0  Q99697  PITX2_HUMAN
    1  O18400   PITX_DROME
    2  P78337  PITX1_HUMAN
    3  P59583  WRK32_ARATH

    # mismatched ids are dropped
    full inner:
           id     seq         name
    0  Q99697  METNCR  PITX2_HUMAN
    1  O18400  MDRSSA   PITX_DROME
    2  P78337  MDAFKG  PITX1_HUMAN

    # ids are taken from the left table
    full left:
           id     seq         name
    0  Q99697  METNCR  PITX2_HUMAN
    1  O18400  MDRSSA   PITX_DROME
    2  P78337  MDAFKG  PITX1_HUMAN
    3  Q9W5Z2  MTSMKD          NaN

    # ids are taken from the right table
    full right:
           id     seq         name
    0  Q99697  METNCR  PITX2_HUMAN
    1  O18400  MDRSSA   PITX_DROME
    2  P78337  MDAFKG  PITX1_HUMAN
    3  P59583     NaN  WRK32_ARATH

    # all ids are retained
    full outer:
           id     seq         name
    0  Q99697  METNCR  PITX2_HUMAN
    1  O18400  MDRSSA   PITX_DROME
    2  P78337  MDAFKG  PITX1_HUMAN
    3  Q9W5Z2  MTSMKD          NaN
    4  P59583     NaN  WRK32_ARATH

To summarize, the ``how`` argument establishes the semantics for the mismatched labels,
i.e. which rows are kept and where the ``nan``'s are inserted.

|

**Example**. Dataframe merging automatically takes care of replicating rows
appropriately if the same key appears twice. For instance, consider this code::

    sequences = pd.DataFrame({
        "id": ["Q99697", "Q99697"],
        "seq": ["METNCR", "METNCR"],
    })

    names = pd.DataFrame({
        "id": ["Q99697", "O18400", "P78337", "P59583"],
        "name": ["PITX2_HUMAN", "PITX_DROME", "PITX1_HUMAN", "WRK32_ARATH"],
    })

    print "input dataframes:"
    print sequences
    print names

    print "full:"
    print pd.merge(sequences, names, on="id", how="inner")

The result is::

    input dataframes:
           id     seq
    0  Q99697  METNCR
    1  Q99697  METNCR
           id         name
    0  Q99697  PITX2_HUMAN
    1  O18400   PITX_DROME
    2  P78337  PITX1_HUMAN
    3  P59583  WRK32_ARATH
    full:
           id     seq         name
    0  Q99697  METNCR  PITX2_HUMAN
    1  Q99697  METNCR  PITX2_HUMAN

As you can see, rows are *replicated appropriately*!

.. hint::

    This is *extremely* useful. Consider for instance the problem of having to compute
    statistics on all *three* gene-exon-CDS tables in the exercises section, jointly.
    Using ``merge()`` it is easy to merge the three tables into a single, big table,
    where the gene information is appropriately replicated for all gene exons, and the
    exon information is replicated for all coding sequences!

.. note::

    The ``merge()`` function can also join tables on more than one column: just pass a
    list of column names to the ``on`` argument. See the official documentation here:

        http://pandas.pydata.org/pandas-docs/stable/merging.html
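When the key column has a different name in each table, the ``left_on`` and ``right_on``
arguments can be used instead of ``on``; a small self-contained sketch (the tables and
column names below are made up for illustration)::

    left = pd.DataFrame({
        "uniprot_id": ["Q99697", "O18400"],
        "seq": ["METNCR", "MDRSSA"],
    })
    right = pd.DataFrame({
        "id": ["Q99697", "P59583"],
        "name": ["PITX2_HUMAN", "WRK32_ARATH"],
    })

    # match left.uniprot_id against right.id; both key columns are kept in the result
    print pd.merge(left, right, left_on="uniprot_id", right_on="id", how="inner")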
|

Grouping Tables
---------------

The ``groupby()`` method is essential for efficiently performing operations on groups
of rows -- especially split-apply-aggregate operations. A picture is worth one thousand
words, so here it is:

.. image:: figures/groupby.png

(Picture source: `Hadley Wickham's Data Science in R slides `_)

The ``groupby()`` method allows you to extract groups of rows sharing the same value
(the "split" part), perform some computation on them (the "apply" part), and then
combine the results into a single table over the groups (the "aggregate" part).

**Example**. Let's take the *iris dataset*. We want to compute the average of the four
measurement columns for the three different iris species. The result should be a
dataframe with 3 rows (one per species) and 4 columns (petal/sepal length/width).

First, let's load the CSV file and print some information::

    >>> iris = pd.read_csv("iris.csv")
    >>> iris.info()
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 150 entries, 0 to 149
    Data columns (total 5 columns):
    SepalLength    150 non-null float64
    SepalWidth     150 non-null float64
    PetalLength    150 non-null float64
    PetalWidth     150 non-null float64
    Name           150 non-null object
    dtypes: float64(4), object(1)
    memory usage: 5.9+ KB

Here is how to do that with ``groupby()``. First, let's group all rows of the dataframe
by the ``Name`` column::

    >>> grouped = iris.groupby(iris.Name)
    >>> type(grouped)
    <class 'pandas.core.groupby.DataFrameGroupBy'>
    >>> for group in grouped:
    ...     print group[0], group[1].shape
    ...
    Iris-setosa (50, 5)
    Iris-versicolor (50, 5)
    Iris-virginica (50, 5)

The ``groupby()`` method returns a ``pandas.DataFrameGroupBy`` object (the details are
not important). To select a single group, write::

    >>> group_df = grouped.get_group("Iris-versicolor")
    >>> print type(group_df)
    <class 'pandas.core.frame.DataFrame'>
    >>> group_df.info()
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 50 entries, 50 to 99
    Data columns (total 5 columns):
    SepalLength    50 non-null float64
    SepalWidth     50 non-null float64
    PetalLength    50 non-null float64
    PetalWidth     50 non-null float64
    Name           50 non-null object
    dtypes: float64(4), object(1)
    memory usage: 2.3+ KB

.. note::

    As shown by the above code snippet, iterating over the ``grouped`` variable returns
    ``(value-of-Name, DataFrame)`` tuples:

    - The first element is the value of the column ``Name`` shared by the group
    - The second element is a ``DataFrame`` including only the rows in that group.

|

The whole idea is that you can *apply* some transformation (e.g. ``mean()``) to the
individual groups **automatically**, using the ``aggregate()`` method directly on the
``grouped`` variable. For instance, let's compute the mean of all columns for the three
groups::

    >>> iris_mean_by_name = grouped.aggregate(pd.DataFrame.mean)
    >>> iris_mean_by_name.info()
    <class 'pandas.core.frame.DataFrame'>
    Index: 3 entries, Iris-setosa to Iris-virginica
    Data columns (total 4 columns):
    SepalLength    3 non-null float64
    SepalWidth     3 non-null float64
    PetalLength    3 non-null float64
    PetalWidth     3 non-null float64
    dtypes: float64(4)
    memory usage: 120.0+ bytes

So the result of ``aggregate()`` is a dataframe. Here is what it looks like::

    >>> iris_mean_by_name
                     SepalLength  SepalWidth  PetalLength  PetalWidth
    Name
    Iris-setosa            5.006       3.418        1.464       0.244
    Iris-versicolor        5.936       2.770        4.260       1.326
    Iris-virginica         6.588       2.974        5.552       2.026

.. warning::

    Note that the ``Name`` column of the ``iris`` dataframe becomes the index of the
    aggregated ``iris_mean_by_name`` dataframe: it is not a proper column anymore! The
    only actual columns are::

        >>> iris_mean_by_name.columns
        Index([u'SepalLength', u'SepalWidth', u'PetalLength', u'PetalWidth'],
              dtype='object')

|

.. note::

    The ``groupby()`` method can also group by more than one column: just pass a list
    of column names. See the official documentation here:

        http://pandas.pydata.org/pandas-docs/stable/groupby.html
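To wrap up, here is a small sketch showing that a grouping can also be restricted to a
single column and aggregated with several statistics at once (assuming the same
``iris.csv`` file)::

    >>> iris = pd.read_csv("iris.csv")
    >>> grouped = iris.groupby("Name")
    >>> # one row per species, one column per statistic
    >>> grouped["PetalLength"].agg(["mean", "std", "min", "max"])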