=========================
Numpy: a cursory overview
=========================

.. warning::
    This is an **appendix**.

    This material will **not** be part of the final exam.

|

Numpy is a freely available library for performing efficient numerical
computations over scalars, vectors, matrices, and more generally N-dimensional
tensors.

Its main features are:

- Flexible **indexing** and **manipulation** of arbitrary dimensional data.

- Many **numerical routines** (linear algebra, Fourier analysis, local and
  global optimization, etc.) using a variety of algorithms.

- Clean Python interface to many **underlying libraries** (for instance, the
  GNU Scientific Library, or GSL), which are used as efficient implementations
  of numerical algorithms.

- Support for random sampling from a variety of distributions.

This list is far from complete!

|

Documentation & Source
----------------------

The official numpy website is:

    http://www.numpy.org/

The numpy source code is available at:

    http://github.com/numpy/numpy

and is distributed under the numpy license:

    http://github.com/numpy/numpy/blob/master/LICENSE.txt

|

Importing numpy
---------------

The customary idiom to import numpy is::

    >>> import numpy as np
    >>> help(np)

Importing specific sub-modules also makes sense, so feel free to write (in
this case for the *linear algebra* module)::

    >>> import numpy.linalg as la
    >>> help(la)

|

The Array class
---------------

An ``ndarray`` is what it says: an **N-dimensional array** (also known as a
*tensor*).

The usual concepts encountered in analysis and algebra can be implemented as
arrays: **vectors** are 1D arrays, **matrices** are 2D arrays.

Given an array ``x``, its dimensionality and type can be ascertained by
using the ``shape``, ``ndim`` and ``dtype`` attributes::

    >>> x = np.zeros(10)
    array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.])
    >>> print type(x)
    <type 'numpy.ndarray'>
    >>> print x.shape
    (10,)
    >>> print x.ndim
    1
    >>> print x.dtype
    float64

|

Creating an Array
-----------------

Creating arrays from scratch can be useful to start off your mathematical
code or for debugging purposes. A few useful creation functions are:

- Creating an all-zero array is simple with ``np.zeros()``::

    >>> # One dimensional case
    >>> x = np.zeros(2)
    >>> x
    array([ 0.,  0.])

    >>> # Two dimensional case
    >>> x = np.zeros((2, 2))
    >>> x
    array([[ 0.,  0.],
           [ 0.,  0.]])
    >>> x.shape
    (2, 2)

    >>> # Three dimensional case
    >>> x = np.zeros((2, 2, 2))
    >>> x
    array([[[ 0.,  0.],
            [ 0.,  0.]],
           [[ 0.,  0.],
            [ 0.,  0.]]])
    >>> x.shape
    (2, 2, 2)

- Creating an all-ones array is equally simple with ``np.ones()``::

    >>> # N-dimensional case
    >>> x = np.ones((2, 5, 5, 10))
    >>> x.shape
    (2, 5, 5, 10)

- To create a diagonal matrix, use either the ``diag()`` and ``ones()``
  methods together, or the ``eye()`` method (reminiscent of Matlab)::

    >>> np.diag(np.ones(3))
    array([[ 1.,  0.,  0.],
           [ 0.,  1.,  0.],
           [ 0.,  0.,  1.]])
    >>> np.eye(3)
    array([[ 1.,  0.,  0.],
           [ 0.,  1.,  0.],
           [ 0.,  0.,  1.]])

- The numpy analogue of ``range()`` is ``np.arange()``::

    >>> x = np.arange(10)
    >>> x.shape
    (10,)
    >>> x
    array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

- Converting from an existing list or list-of-lists object::

    >>> l = [range(3*i,3*(i+3),3) for i in range(3)]
    >>> l
    [[0, 3, 6], [3, 6, 9], [6, 9, 12]]
    >>> np.array(l)
    array([[ 0,  3,  6],
           [ 3,  6,  9],
           [ 6,  9, 12]])

  In this case, the ``dtype`` will be based on the type of the elements
  in the input ``list``.

- Random sampling an array from a distribution::

    >>> # Sampling a single scalar from a real-valued uniform (sampling
    >>> # functions return a scalar if no shape information is given)
    >>> x = np.random.uniform(0, 10)
    >>> x
    3.9300397597884897

    >>> # Sampling a 3-element vector from a Normal distribution
    >>> # with mean=1 and sigma=0.2
    >>> x = np.random.normal(1, 0.2, size=3)
    >>> x
    array([ 1.2001157 ,  0.94376082,  1.09514026])

    >>> # Sampling a binary 3x3 matrix, i.e. from a uniform
    >>> # distribution over integers
    >>> x = np.random.randint(0, 2, size=(3, 3))
    >>> x
    array([[1, 0, 0],
           [1, 0, 1],
           [0, 1, 0]])

|

**Example**. To sample a random 100x100 matrix where every entry is distributed
according to a normal distribution, write::

    import numpy as np
    import matplotlib.pyplot as plt

    np.random.seed(0)

    rm = np.random.normal(0, 1, size=(100, 100))

    plt.imshow(rm, cmap="gray", interpolation="nearest")
    plt.savefig("random-matrix.png")

The result is:

.. figure:: figures/random-matrix.png

.. note::
    The reference manual of the ``numpy.random`` module is here:

        http://docs.scipy.org/doc/numpy/reference/routines.random.html

    For a quick-and-dirty explanation of (pseudo!) random number generation
    as used in Computer Science, see:

        http://en.wikipedia.org/wiki/Pseudorandom_number_generator

    The ``numpy.random`` RNG makes use of the very popular Mersenne Twister (MT)
    PRNG. For an introduction, see here:

        http://en.wikipedia.org/wiki/Mersenne_Twister

.. warning::
    For reproducibility, make sure to initialize the pseudo-random number
    generator with ``np.random.seed()`` before running your code, i.e.::

        >>> np.random.seed(0)

    After setting the seed, your code will be deterministic as long as
    calls to the sampling routines above remain the same.


|

From ``pandas`` to ``numpy`` and back
-------------------------------------

It is very easy to "convert" a ``Series`` or ``DataFrame`` into an array: as a
matter of facts, these Pandas types are built on numpy arrays: their
underlying array is accessible through the ``.values`` attribute.

(Pandas of course coats the underlying array with a lot of useful
functionality!)

For instance, given a ``DataFrame`` of the iris dataset::

    iris = pd.read_csv("iris.csv")

the underlying array can be accessed through its ``.values`` attribute::

    >>> # For a table of two columns
    >>> type(iris[["PetalLength", "SepalLength"]])
    <class 'pandas.core.frame.DataFrame'>
    >>> type(iris[["PetalLength", "SepalLength"]].values)
    <type 'numpy.ndarray'>

The same can be done with individual ``Series`` objects, e.g. the columns
of the ``DataFrame``::

    >>> type(iris.PetalWidth.values)
    <type 'numpy.ndarray'>

|

It is equally easy to construct a ``DataFrame`` out of an arrray: just call
the ``DataFrame`` passing it the array as an argument::

    >>> pd.DataFrame(x)
       0  1  2
    0  0  1  2
    1  3  4  5
    2  6  7  8

If you also want to assign the column and row labels, you can pass them as
optional arguments::

    >>> pd.DataFrame(x,
    ...              columns=['a', 'b', 'c'],
    ...              index=['one', 'two', 'three'])
           a  b  c
    one    0  1  2
    two    3  4  5
    three  6  7  8

|

Reshaping
---------

Reshaping. The arrangement of entries into rows, columns, tubes, *etc.* can be
changed at any time using the ``reshape()`` and ``ravel()`` methods::

    >>> # Turn a 9-element vector into a 3x3 matrix
    >>> x = np.arange(9)
    >>> x
    array([0, 1, 2, 3, 4, 5, 6, 7, 8])
    >>> x.reshape((3, 3))
    array([[0, 1, 2],
           [3, 4, 5],
           [6, 7, 8]])

    >>> # Turn a 3x3 matrix into a 9-element vector
    >>> x = np.ones((3, 3))
    >>> x
    array([[ 1.,  1.,  1.],
           [ 1.,  1.,  1.],
           [ 1.,  1.,  1.]])
    >>> x.ravel()
    array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.])

|

Indexing
--------

Given an array of shape ``(n1, n2, ..., nd)``, you can extract a single element
through its index.

For instance, in the 2D case::

    >>> x = np.arange(9).reshape((3, 3))
    >>> x
    array([[0, 1, 2],
           [3, 4, 5],
           [6, 7, 8]])
    >>> x[0,0]  # first row, first column
    0
    >>> x[1,2]  # second row, third column
    5
    >>> x[2,0]  # thrid row, first column
    6

In the example above, ``x[i,j]`` extracts the element in row ``i``, column
``j`` -- just like in standard matrix algebra.

You can also use the colon ``:`` notation to select specify all the elements
on an axis, for instance a whole column, a whole row, or an entire sub-array::

    >>> # The first row
    >>> x[0,:]
    array([0, 1, 2])

    >>> # A shorthand for the first row
    >>> x[0]
    array([0, 1, 2])

    >>> # The first column (no shorthand available)
    >>> x[:,1]
    array([1, 4, 7])


    >>> # A sub-table
    >>> x[1:3, :]
    array([[3, 4, 5],
           [6, 7, 8]])

The very same syntax applies to the general N-dimensional case::

    >>> x = np.arange(10**5).reshape((10,)*5)
    >>> x.shape
    (10, 10, 10, 10, 10)
    >>> x[1,2,3,4,5]
    12345
    >>> x[0,0,:,:,0]
    array([[  0,  10,  20,  30,  40,  50,  60,  70,  80,  90],
           [100, 110, 120, 130, 140, 150, 160, 170, 180, 190],
           [200, 210, 220, 230, 240, 250, 260, 270, 280, 290],
           [300, 310, 320, 330, 340, 350, 360, 370, 380, 390],
           [400, 410, 420, 430, 440, 450, 460, 470, 480, 490],
           [500, 510, 520, 530, 540, 550, 560, 570, 580, 590],
           [600, 610, 620, 630, 640, 650, 660, 670, 680, 690],
           [700, 710, 720, 730, 740, 750, 760, 770, 780, 790],
           [800, 810, 820, 830, 840, 850, 860, 870, 880, 890],
           [900, 910, 920, 930, 940, 950, 960, 970, 980, 990]])

.. note::
    It can be difficult to visualize operations over N-dimensional arrays,
    when N is larger than 3.

    However, it is much easier when the axes have a concrete semantics. For
    instance, assume that the 4D array ``performace`` holds the performance
    of several sequence alignment algorithms over multiple sequence clusters
    in multiple databases:

    #. The first axis is the index of the algorithm

    #. The second axis is the index of the database

    #. The third axis is the index of a cluster of sequences in the database

    #. The fourth axis is one of three measures of performance: precision,
       recall, and accuracy

    Your tensor looks like this::

        performances[alg][db][cluster][measure]

    Then in extract the accuracy of algorithm ``1`` over all databases
    and sequence clusters, you can just do::

        performances[1,:,:,0]

    or even::

        performances[1,:,:,-1]

    Now that the axes have a concrete semantics, it is much easier to see
    what this operation does!

|

Arithmetic and Functions
------------------------

Operations between scalars and arrays are **broadcast**, like in Pandas (and
more generally in linear algebra). For instance::

    >>> x = np.arange(5)
    >>> 10 * x
    array([ 0, 10, 20, 30, 40])
    >>> y = -x
    >>> y
    array([ 0, -10, -20, -30, -40])

Operations between equally-shaped arrays operate **element-wise**::

    >>> x
    array([0, 1, 2, 3, 4])
    >>> x + x
    array([0, 2, 4, 6, 8])
    >>> x + y
    array([0, 0, 0, 0, 0])

Operations can update the sub-arrays automatically::

    >>> x = np.arange(16).reshape((4, 4))
    >>> x
    array([[ 0,  1,  2,  3],
           [ 4,  5,  6,  7],
           [ 8,  9, 10, 11],
           [12, 13, 14, 15]])
    >>> x[1:3,1:3] += 100 # Note the automatic assignment!
    >>> x
    array([[  0,   1,   2,   3],
           [  4, 105, 106,   7],
           [  8, 109, 110,  11],
           [ 12,  13,  14,  15]])

Operations between arrays of different shapes but **compatible sizes** are
broadcast by "matching" the **last** (right-most) dimension. For instance,
addition between a matrix ``x`` and a vector ``y`` broadcasts ``y`` over all
rows of ``x``::

    >>> x = np.arange(9).reshape((3, 3))
    >>> y = np.arange(3)
    >>> x + y
    array([[ 0,  2,  4],
           [ 3,  5,  7],
           [ 6,  8, 10]])

The result, ``x + y``, can be seen as the result of::

    result = np.zeros(x.shape)
    for i in range(x.shape[0]):
        result[i,:] = x[i,:] + y[:]

As you can see, the last (right-most) dimension of ``x`` is matched to the last
(and only) dimension of ``y``.

.. warning::
    Operations between arrays of incompatible sizes raise an error::

        >>> np.arange(3) + np.arange(2)
        Traceback (most recent call last):
          File "<stdin>", line 1, in <module>
        ValueError: operands could not be broadcast together with shapes (3,) (2,)

|

Mathematical Functions
----------------------

Most of the usual mathematical functions are implemented in numpy, including
the usual max and min operators and trigonometric functions.

For instance::

    >>> np.sin(0), np.cos(0)
    (0.0, 1.0)
    >>> np.sin(np.pi / 2), np.cos(np.pi / 2)
    (1.0, 6.123233995736766e-17)

As you can see, the value of the cosine at :math:`\frac{\pi}{2}` should be
zero, but is approximated by a very (very) small value. The same is true for
other functions::

    >>> np.tan(np.pi / 4)
    0.99999999999999989

The nice property of mathematical functions in numpy is that they are
automatically applied to all elements of an array. For instance::

    >>> x = [0, 0.25 * np.pi, 0.5 * np.pi, 0.75 * np.pi]
    >>> np.cos(x)
    array([  1.00000000e+00,   7.07106781e-01,   6.12323400e-17,
            -7.07106781e-01])

where the second and last elements are half the square root of two and its
opposite::

    >>> np.sqrt(2) / 2
    0.70710678118654757

Complex mathematical expressions can be therefore specified very compactly
using this property. For instance, in order to plot a sinewave with small
additive normally-distributed noise, we can write::

    import numpy as np
    import matplotlib.pyplot as plt

    # The x values
    x = np.arange(100)

    # The y values, uncorrupted
    y = np.sin(2 * np.pi * np.arange(100) / 100.0)
    plt.plot(x, y)

    # The y values, now corrupted by Gaussian noise
    y += np.random.normal(0, 0.05, size=100)
    plt.plot(x, y)

    plt.savefig("sin.png")

The result is:

.. image:: figures/sin.png

.. note::
    For the complete list of the available mathematical functions, see:

       https://docs.scipy.org/doc/numpy/reference/routines.math.html


Linear Algebra
--------------

Numpy provides a number of linear algebraic functions, including dot products,
matrix-vector multiplication, and matrix-matrix multiplication::

    import numpy as np

    # dot (scalar) product between vectors
    x = np.array([0, 0, 1, 1])
    print np.dot(x, x)

    z = np.array([1, 1, 0, 0])
    print np.dot(z, z)

    print np.dot(x, z)

    # Matrix-vector product
    a = 2 * np.diag(np.ones(4))
    print(np.dot(a, x))
    print(np.dot(a, z))

    # Matrix-matrix product
    b = np.ones((4, 4)) - a
    print(np.dot(a, b))

The result is::

    # np.dot(x, x)
    2

    # np.dot(z, z)
    2

    # np.dot(x, z)
    0

    # np.dot(a, x)
    [ 0.  0.  2.  2.]

    # np.dot(a, z)
    [ 2.  2.  0.  0.]

    # np.dot(a, b)
    [[-2.  2.  2.  2.]
     [ 2. -2.  2.  2.]
     [ 2.  2. -2.  2.]
     [ 2.  2.  2. -2.]]

Numpy also provides routines for dealing with eigenvalues/eigenvects, singular
value decompositions, and other decompositions.

Let's compute the (trivial) eigenvalues/eigenvectors of the 4x4 identity
matrix::

    >>> import numpy.linalg as la
    >>> eigenvalues, eigenvectors = la.eig(np.ones((4, 4)))
    >>> eigenvalues
    array([ 0.,  4.,  0.,  0.])
    >>> eigenvectors
    array([[ -8.66025404e-01,   5.00000000e-01,  -2.77555756e-17,
             -2.77555756e-17],
           [  2.88675135e-01,   5.00000000e-01,  -5.77350269e-01,
             -5.77350269e-01],
           [  2.88675135e-01,   5.00000000e-01,   7.88675135e-01,
             -2.11324865e-01],
           [  2.88675135e-01,   5.00000000e-01,  -2.11324865e-01,
              7.88675135e-01]])

.. note::
    The reference manual of the ``numpy.linalg`` module is here:

        http://docs.scipy.org/doc/numpy/reference/routines.linalg.html