=========================
Numpy: a cursory overview
=========================
.. warning::
This is an **appendix**.
This material will **not** be part of the final exam.

Numpy is a freely available library for performing efficient numerical
computations over scalars, vectors, matrices, and more generally Ndimensional
tensors.
Its main features are:
 Flexible **indexing** and **manipulation** of arbitrary dimensional data.
 Many **numerical routines** (linear algebra, Fourier analysis, local and
global optimization, etc.) using a variety of algorithms.
 Clean Python interface to many **underlying libraries** (for instance, the
GNU Scientific Library, or GSL), which are used as efficient implementations
of numerical algorithms.
 Support for random sampling from a variety of distributions.
This list is far from complete!

Documentation & Source

The official numpy website is:
http://www.numpy.org/
The numpy source code is available at:
http://github.com/numpy/numpy
and is distributed under the numpy license:
http://github.com/numpy/numpy/blob/master/LICENSE.txt

Importing numpy

The customary idiom to import numpy is::
>>> import numpy as np
>>> help(np)
Importing specific submodules also makes sense, so feel free to write (in
this case for the *linear algebra* module)::
>>> import numpy.linalg as la
>>> help(la)

The Array class

An ``ndarray`` is what it says: an **Ndimensional array** (also known as a
*tensor*).
The usual concepts encountered in analysis and algebra can be implemented as
arrays: **vectors** are 1D arrays, **matrices** are 2D arrays.
Given an array ``x``, its dimensionality and type can be ascertained by
using the ``shape``, ``ndim`` and ``dtype`` attributes::
>>> x = np.zeros(10)
array([ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
>>> print type(x)
>>> print x.shape
(10,)
>>> print x.ndim
1
>>> print x.dtype
float64

Creating an Array

Creating arrays from scratch can be useful to start off your mathematical
code or for debugging purposes. A few useful creation functions are:
 Creating an allzero array is simple with ``np.zeros()``::
>>> # One dimensional case
>>> x = np.zeros(2)
>>> x
array([ 0., 0.])
>>> # Two dimensional case
>>> x = np.zeros((2, 2))
>>> x
array([[ 0., 0.],
[ 0., 0.]])
>>> x.shape
(2, 2)
>>> # Three dimensional case
>>> x = np.zeros((2, 2, 2))
>>> x
array([[[ 0., 0.],
[ 0., 0.]],
[[ 0., 0.],
[ 0., 0.]]])
>>> x.shape
(2, 2, 2)
 Creating an allones array is equally simple with ``np.ones()``::
>>> # Ndimensional case
>>> x = np.ones((2, 5, 5, 10))
>>> x.shape
(2, 5, 5, 10)
 To create a diagonal matrix, use either the ``diag()`` and ``ones()``
methods together, or the ``eye()`` method (reminiscent of Matlab)::
>>> np.diag(np.ones(3))
array([[ 1., 0., 0.],
[ 0., 1., 0.],
[ 0., 0., 1.]])
>>> np.eye(3)
array([[ 1., 0., 0.],
[ 0., 1., 0.],
[ 0., 0., 1.]])
 The numpy analogue of ``range()`` is ``np.arange()``::
>>> x = np.arange(10)
>>> x.shape
(10,)
>>> x
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
 Converting from an existing list or listoflists object::
>>> l = [range(3*i,3*(i+3),3) for i in range(3)]
>>> l
[[0, 3, 6], [3, 6, 9], [6, 9, 12]]
>>> np.array(l)
array([[ 0, 3, 6],
[ 3, 6, 9],
[ 6, 9, 12]])
In this case, the ``dtype`` will be based on the type of the elements
in the input ``list``.
 Random sampling an array from a distribution::
>>> # Sampling a single scalar from a realvalued uniform (sampling
>>> # functions return a scalar if no shape information is given)
>>> x = np.random.uniform(0, 10)
>>> x
3.9300397597884897
>>> # Sampling a 3element vector from a Normal distribution
>>> # with mean=1 and sigma=0.2
>>> x = np.random.normal(1, 0.2, size=3)
>>> x
array([ 1.2001157 , 0.94376082, 1.09514026])
>>> # Sampling a binary 3x3 matrix, i.e. from a uniform
>>> # distribution over integers
>>> x = np.random.randint(0, 2, size=(3, 3))
>>> x
array([[1, 0, 0],
[1, 0, 1],
[0, 1, 0]])

**Example**. To sample a random 100x100 matrix where every entry is distributed
according to a normal distribution, write::
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(0)
rm = np.random.normal(0, 1, size=(100, 100))
plt.imshow(rm, cmap="gray", interpolation="nearest")
plt.savefig("randommatrix.png")
The result is:
.. figure:: figures/randommatrix.png
.. note::
The reference manual of the ``numpy.random`` module is here:
http://docs.scipy.org/doc/numpy/reference/routines.random.html
For a quickanddirty explanation of (pseudo!) random number generation
as used in Computer Science, see:
http://en.wikipedia.org/wiki/Pseudorandom_number_generator
The ``numpy.random`` RNG makes use of the very popular Mersenne Twister (MT)
PRNG. For an introduction, see here:
http://en.wikipedia.org/wiki/Mersenne_Twister
.. warning::
For reproducibility, make sure to initialize the pseudorandom number
generator with ``np.random.seed()`` before running your code, i.e.::
>>> np.random.seed(0)
After setting the seed, your code will be deterministic as long as
calls to the sampling routines above remain the same.

From ``pandas`` to ``numpy`` and back

It is very easy to "convert" a ``Series`` or ``DataFrame`` into an array: as a
matter of facts, these Pandas types are built on numpy arrays: their
underlying array is accessible through the ``.values`` attribute.
(Pandas of course coats the underlying array with a lot of useful
functionality!)
For instance, given a ``DataFrame`` of the iris dataset::
iris = pd.read_csv("iris.csv")
the underlying array can be accessed through its ``.values`` attribute::
>>> # For a table of two columns
>>> type(iris[["PetalLength", "SepalLength"]])
>>> type(iris[["PetalLength", "SepalLength"]].values)
The same can be done with individual ``Series`` objects, e.g. the columns
of the ``DataFrame``::
>>> type(iris.PetalWidth.values)

It is equally easy to construct a ``DataFrame`` out of an arrray: just call
the ``DataFrame`` passing it the array as an argument::
>>> pd.DataFrame(x)
0 1 2
0 0 1 2
1 3 4 5
2 6 7 8
If you also want to assign the column and row labels, you can pass them as
optional arguments::
>>> pd.DataFrame(x,
... columns=['a', 'b', 'c'],
... index=['one', 'two', 'three'])
a b c
one 0 1 2
two 3 4 5
three 6 7 8

Reshaping

Reshaping. The arrangement of entries into rows, columns, tubes, *etc.* can be
changed at any time using the ``reshape()`` and ``ravel()`` methods::
>>> # Turn a 9element vector into a 3x3 matrix
>>> x = np.arange(9)
>>> x
array([0, 1, 2, 3, 4, 5, 6, 7, 8])
>>> x.reshape((3, 3))
array([[0, 1, 2],
[3, 4, 5],
[6, 7, 8]])
>>> # Turn a 3x3 matrix into a 9element vector
>>> x = np.ones((3, 3))
>>> x
array([[ 1., 1., 1.],
[ 1., 1., 1.],
[ 1., 1., 1.]])
>>> x.ravel()
array([ 1., 1., 1., 1., 1., 1., 1., 1., 1.])

Indexing

Given an array of shape ``(n1, n2, ..., nd)``, you can extract a single element
through its index.
For instance, in the 2D case::
>>> x = np.arange(9).reshape((3, 3))
>>> x
array([[0, 1, 2],
[3, 4, 5],
[6, 7, 8]])
>>> x[0,0] # first row, first column
0
>>> x[1,2] # second row, third column
5
>>> x[2,0] # thrid row, first column
6
In the example above, ``x[i,j]`` extracts the element in row ``i``, column
``j``  just like in standard matrix algebra.
You can also use the colon ``:`` notation to select specify all the elements
on an axis, for instance a whole column, a whole row, or an entire subarray::
>>> # The first row
>>> x[0,:]
array([0, 1, 2])
>>> # A shorthand for the first row
>>> x[0]
array([0, 1, 2])
>>> # The first column (no shorthand available)
>>> x[:,1]
array([1, 4, 7])
>>> # A subtable
>>> x[1:3, :]
array([[3, 4, 5],
[6, 7, 8]])
The very same syntax applies to the general Ndimensional case::
>>> x = np.arange(10**5).reshape((10,)*5)
>>> x.shape
(10, 10, 10, 10, 10)
>>> x[1,2,3,4,5]
12345
>>> x[0,0,:,:,0]
array([[ 0, 10, 20, 30, 40, 50, 60, 70, 80, 90],
[100, 110, 120, 130, 140, 150, 160, 170, 180, 190],
[200, 210, 220, 230, 240, 250, 260, 270, 280, 290],
[300, 310, 320, 330, 340, 350, 360, 370, 380, 390],
[400, 410, 420, 430, 440, 450, 460, 470, 480, 490],
[500, 510, 520, 530, 540, 550, 560, 570, 580, 590],
[600, 610, 620, 630, 640, 650, 660, 670, 680, 690],
[700, 710, 720, 730, 740, 750, 760, 770, 780, 790],
[800, 810, 820, 830, 840, 850, 860, 870, 880, 890],
[900, 910, 920, 930, 940, 950, 960, 970, 980, 990]])
.. note::
It can be difficult to visualize operations over Ndimensional arrays,
when N is larger than 3.
However, it is much easier when the axes have a concrete semantics. For
instance, assume that the 4D array ``performace`` holds the performance
of several sequence alignment algorithms over multiple sequence clusters
in multiple databases:
#. The first axis is the index of the algorithm
#. The second axis is the index of the database
#. The third axis is the index of a cluster of sequences in the database
#. The fourth axis is one of three measures of performance: precision,
recall, and accuracy
Your tensor looks like this::
performances[alg][db][cluster][measure]
Then in extract the accuracy of algorithm ``1`` over all databases
and sequence clusters, you can just do::
performances[1,:,:,0]
or even::
performances[1,:,:,1]
Now that the axes have a concrete semantics, it is much easier to see
what this operation does!

Arithmetic and Functions

Operations between scalars and arrays are **broadcast**, like in Pandas (and
more generally in linear algebra). For instance::
>>> x = np.arange(5)
>>> 10 * x
array([ 0, 10, 20, 30, 40])
>>> y = x
>>> y
array([ 0, 10, 20, 30, 40])
Operations between equallyshaped arrays operate **elementwise**::
>>> x
array([0, 1, 2, 3, 4])
>>> x + x
array([0, 2, 4, 6, 8])
>>> x + y
array([0, 0, 0, 0, 0])
Operations can update the subarrays automatically::
>>> x = np.arange(16).reshape((4, 4))
>>> x
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[12, 13, 14, 15]])
>>> x[1:3,1:3] += 100 # Note the automatic assignment!
>>> x
array([[ 0, 1, 2, 3],
[ 4, 105, 106, 7],
[ 8, 109, 110, 11],
[ 12, 13, 14, 15]])
Operations between arrays of different shapes but **compatible sizes** are
broadcast by "matching" the **last** (rightmost) dimension. For instance,
addition between a matrix ``x`` and a vector ``y`` broadcasts ``y`` over all
rows of ``x``::
>>> x = np.arange(9).reshape((3, 3))
>>> y = np.arange(3)
>>> x + y
array([[ 0, 2, 4],
[ 3, 5, 7],
[ 6, 8, 10]])
The result, ``x + y``, can be seen as the result of::
result = np.zeros(x.shape)
for i in range(x.shape[0]):
result[i,:] = x[i,:] + y[:]
As you can see, the last (rightmost) dimension of ``x`` is matched to the last
(and only) dimension of ``y``.
.. warning::
Operations between arrays of incompatible sizes raise an error::
>>> np.arange(3) + np.arange(2)
Traceback (most recent call last):
File "", line 1, in
ValueError: operands could not be broadcast together with shapes (3,) (2,)

Mathematical Functions

Most of the usual mathematical functions are implemented in numpy, including
the usual max and min operators and trigonometric functions.
For instance::
>>> np.sin(0), np.cos(0)
(0.0, 1.0)
>>> np.sin(np.pi / 2), np.cos(np.pi / 2)
(1.0, 6.123233995736766e17)
As you can see, the value of the cosine at :math:`\frac{\pi}{2}` should be
zero, but is approximated by a very (very) small value. The same is true for
other functions::
>>> np.tan(np.pi / 4)
0.99999999999999989
The nice property of mathematical functions in numpy is that they are
automatically applied to all elements of an array. For instance::
>>> x = [0, 0.25 * np.pi, 0.5 * np.pi, 0.75 * np.pi]
>>> np.cos(x)
array([ 1.00000000e+00, 7.07106781e01, 6.12323400e17,
7.07106781e01])
where the second and last elements are half the square root of two and its
opposite::
>>> np.sqrt(2) / 2
0.70710678118654757
Complex mathematical expressions can be therefore specified very compactly
using this property. For instance, in order to plot a sinewave with small
additive normallydistributed noise, we can write::
import numpy as np
import matplotlib.pyplot as plt
# The x values
x = np.arange(100)
# The y values, uncorrupted
y = np.sin(2 * np.pi * np.arange(100) / 100.0)
plt.plot(x, y)
# The y values, now corrupted by Gaussian noise
y += np.random.normal(0, 0.05, size=100)
plt.plot(x, y)
plt.savefig("sin.png")
The result is:
.. image:: figures/sin.png
.. note::
For the complete list of the available mathematical functions, see:
https://docs.scipy.org/doc/numpy/reference/routines.math.html
Linear Algebra

Numpy provides a number of linear algebraic functions, including dot products,
matrixvector multiplication, and matrixmatrix multiplication::
import numpy as np
# dot (scalar) product between vectors
x = np.array([0, 0, 1, 1])
print np.dot(x, x)
z = np.array([1, 1, 0, 0])
print np.dot(z, z)
print np.dot(x, z)
# Matrixvector product
a = 2 * np.diag(np.ones(4))
print(np.dot(a, x))
print(np.dot(a, z))
# Matrixmatrix product
b = np.ones((4, 4))  a
print(np.dot(a, b))
The result is::
# np.dot(x, x)
2
# np.dot(z, z)
2
# np.dot(x, z)
0
# np.dot(a, x)
[ 0. 0. 2. 2.]
# np.dot(a, z)
[ 2. 2. 0. 0.]
# np.dot(a, b)
[[2. 2. 2. 2.]
[ 2. 2. 2. 2.]
[ 2. 2. 2. 2.]
[ 2. 2. 2. 2.]]
Numpy also provides routines for dealing with eigenvalues/eigenvects, singular
value decompositions, and other decompositions.
Let's compute the (trivial) eigenvalues/eigenvectors of the 4x4 identity
matrix::
>>> import numpy.linalg as la
>>> eigenvalues, eigenvectors = la.eig(np.ones((4, 4)))
>>> eigenvalues
array([ 0., 4., 0., 0.])
>>> eigenvectors
array([[ 8.66025404e01, 5.00000000e01, 2.77555756e17,
2.77555756e17],
[ 2.88675135e01, 5.00000000e01, 5.77350269e01,
5.77350269e01],
[ 2.88675135e01, 5.00000000e01, 7.88675135e01,
2.11324865e01],
[ 2.88675135e01, 5.00000000e01, 2.11324865e01,
7.88675135e01]])
.. note::
The reference manual of the ``numpy.linalg`` module is here:
http://docs.scipy.org/doc/numpy/reference/routines.linalg.html