# Numpy: a cursory overview¶

Warning

This is an appendix.

This material will not be part of the final exam.

Numpy is a freely available library for performing efficient numerical computations over scalars, vectors, matrices, and more generally N-dimensional tensors.

Its main features are:

• Flexible indexing and manipulation of arbitrary dimensional data.
• Many numerical routines (linear algebra, Fourier analysis, local and global optimization, etc.) using a variety of algorithms.
• Clean Python interface to many underlying libraries (for instance, the GNU Scientific Library, or GSL), which are used as efficient implementations of numerical algorithms.
• Support for random sampling from a variety of distributions.

This list is far from complete!

## Documentation & Source¶

The official numpy website is:

The numpy source code is available at:

## Importing numpy¶

The customary idiom to import numpy is:

>>> import numpy as np
>>> help(np)


Importing specific sub-modules also makes sense, so feel free to write (in this case for the linear algebra module):

>>> import numpy.linalg as la
>>> help(la)


## The Array class¶

An ndarray is what it says: an N-dimensional array (also known as a tensor).

The usual concepts encountered in analysis and algebra can be implemented as arrays: vectors are 1D arrays, matrices are 2D arrays.

Given an array x, its dimensionality and type can be ascertained by using the shape, ndim and dtype attributes:

>>> x = np.zeros(10)
array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.])
>>> print type(x)
<type 'numpy.ndarray'>
>>> print x.shape
(10,)
>>> print x.ndim
1
>>> print x.dtype
float64


## Creating an Array¶

Creating arrays from scratch can be useful to start off your mathematical code or for debugging purposes. A few useful creation functions are:

• Creating an all-zero array is simple with np.zeros():

>>> # One dimensional case
>>> x = np.zeros(2)
>>> x
array([ 0.,  0.])

>>> # Two dimensional case
>>> x = np.zeros((2, 2))
>>> x
array([[ 0.,  0.],
[ 0.,  0.]])
>>> x.shape
(2, 2)

>>> # Three dimensional case
>>> x = np.zeros((2, 2, 2))
>>> x
array([[[ 0.,  0.],
[ 0.,  0.]],
[[ 0.,  0.],
[ 0.,  0.]]])
>>> x.shape
(2, 2, 2)

• Creating an all-ones array is equally simple with np.ones():

>>> # N-dimensional case
>>> x = np.ones((2, 5, 5, 10))
>>> x.shape
(2, 5, 5, 10)

• To create a diagonal matrix, use either the diag() and ones() methods together, or the eye() method (reminiscent of Matlab):

>>> np.diag(np.ones(3))
array([[ 1.,  0.,  0.],
[ 0.,  1.,  0.],
[ 0.,  0.,  1.]])
>>> np.eye(3)
array([[ 1.,  0.,  0.],
[ 0.,  1.,  0.],
[ 0.,  0.,  1.]])

• The numpy analogue of range() is np.arange():

>>> x = np.arange(10)
>>> x.shape
(10,)
>>> x
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

• Converting from an existing list or list-of-lists object:

>>> l = [range(3*i,3*(i+3),3) for i in range(3)]
>>> l
[[0, 3, 6], [3, 6, 9], [6, 9, 12]]
>>> np.array(l)
array([[ 0,  3,  6],
[ 3,  6,  9],
[ 6,  9, 12]])


In this case, the dtype will be based on the type of the elements in the input list.

• Random sampling an array from a distribution:

>>> # Sampling a single scalar from a real-valued uniform (sampling
>>> # functions return a scalar if no shape information is given)
>>> x = np.random.uniform(0, 10)
>>> x
3.9300397597884897

>>> # Sampling a 3-element vector from a Normal distribution
>>> # with mean=1 and sigma=0.2
>>> x = np.random.normal(1, 0.2, size=3)
>>> x
array([ 1.2001157 ,  0.94376082,  1.09514026])

>>> # Sampling a binary 3x3 matrix, i.e. from a uniform
>>> # distribution over integers
>>> x = np.random.randint(0, 2, size=(3, 3))
>>> x
array([[1, 0, 0],
[1, 0, 1],
[0, 1, 0]])


Example. To sample a random 100x100 matrix where every entry is distributed according to a normal distribution, write:

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)

rm = np.random.normal(0, 1, size=(100, 100))

plt.imshow(rm, cmap="gray", interpolation="nearest")
plt.savefig("random-matrix.png")


The result is: Note

The reference manual of the numpy.random module is here:

For a quick-and-dirty explanation of (pseudo!) random number generation as used in Computer Science, see:

The numpy.random RNG makes use of the very popular Mersenne Twister (MT) PRNG. For an introduction, see here:

Warning

For reproducibility, make sure to initialize the pseudo-random number generator with np.random.seed() before running your code, i.e.:

>>> np.random.seed(0)


After setting the seed, your code will be deterministic as long as calls to the sampling routines above remain the same.

## From pandas to numpy and back¶

It is very easy to “convert” a Series or DataFrame into an array: as a matter of facts, these Pandas types are built on numpy arrays: their underlying array is accessible through the .values attribute.

(Pandas of course coats the underlying array with a lot of useful functionality!)

For instance, given a DataFrame of the iris dataset:

iris = pd.read_csv("iris.csv")


the underlying array can be accessed through its .values attribute:

>>> # For a table of two columns
>>> type(iris[["PetalLength", "SepalLength"]])
<class 'pandas.core.frame.DataFrame'>
>>> type(iris[["PetalLength", "SepalLength"]].values)
<type 'numpy.ndarray'>


The same can be done with individual Series objects, e.g. the columns of the DataFrame:

>>> type(iris.PetalWidth.values)
<type 'numpy.ndarray'>


It is equally easy to construct a DataFrame out of an arrray: just call the DataFrame passing it the array as an argument:

>>> pd.DataFrame(x)
0  1  2
0  0  1  2
1  3  4  5
2  6  7  8


If you also want to assign the column and row labels, you can pass them as optional arguments:

>>> pd.DataFrame(x,
...              columns=['a', 'b', 'c'],
...              index=['one', 'two', 'three'])
a  b  c
one    0  1  2
two    3  4  5
three  6  7  8


## Reshaping¶

Reshaping. The arrangement of entries into rows, columns, tubes, etc. can be changed at any time using the reshape() and ravel() methods:

>>> # Turn a 9-element vector into a 3x3 matrix
>>> x = np.arange(9)
>>> x
array([0, 1, 2, 3, 4, 5, 6, 7, 8])
>>> x.reshape((3, 3))
array([[0, 1, 2],
[3, 4, 5],
[6, 7, 8]])

>>> # Turn a 3x3 matrix into a 9-element vector
>>> x = np.ones((3, 3))
>>> x
array([[ 1.,  1.,  1.],
[ 1.,  1.,  1.],
[ 1.,  1.,  1.]])
>>> x.ravel()
array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.])


## Indexing¶

Given an array of shape (n1, n2, ..., nd), you can extract a single element through its index.

For instance, in the 2D case:

>>> x = np.arange(9).reshape((3, 3))
>>> x
array([[0, 1, 2],
[3, 4, 5],
[6, 7, 8]])
>>> x[0,0]  # first row, first column
0
>>> x[1,2]  # second row, third column
5
>>> x[2,0]  # thrid row, first column
6


In the example above, x[i,j] extracts the element in row i, column j – just like in standard matrix algebra.

You can also use the colon : notation to select specify all the elements on an axis, for instance a whole column, a whole row, or an entire sub-array:

>>> # The first row
>>> x[0,:]
array([0, 1, 2])

>>> # A shorthand for the first row
>>> x
array([0, 1, 2])

>>> # The first column (no shorthand available)
>>> x[:,1]
array([1, 4, 7])

>>> # A sub-table
>>> x[1:3, :]
array([[3, 4, 5],
[6, 7, 8]])


The very same syntax applies to the general N-dimensional case:

>>> x = np.arange(10**5).reshape((10,)*5)
>>> x.shape
(10, 10, 10, 10, 10)
>>> x[1,2,3,4,5]
12345
>>> x[0,0,:,:,0]
array([[  0,  10,  20,  30,  40,  50,  60,  70,  80,  90],
[100, 110, 120, 130, 140, 150, 160, 170, 180, 190],
[200, 210, 220, 230, 240, 250, 260, 270, 280, 290],
[300, 310, 320, 330, 340, 350, 360, 370, 380, 390],
[400, 410, 420, 430, 440, 450, 460, 470, 480, 490],
[500, 510, 520, 530, 540, 550, 560, 570, 580, 590],
[600, 610, 620, 630, 640, 650, 660, 670, 680, 690],
[700, 710, 720, 730, 740, 750, 760, 770, 780, 790],
[800, 810, 820, 830, 840, 850, 860, 870, 880, 890],
[900, 910, 920, 930, 940, 950, 960, 970, 980, 990]])


Note

It can be difficult to visualize operations over N-dimensional arrays, when N is larger than 3.

However, it is much easier when the axes have a concrete semantics. For instance, assume that the 4D array performace holds the performance of several sequence alignment algorithms over multiple sequence clusters in multiple databases:

1. The first axis is the index of the algorithm
2. The second axis is the index of the database
3. The third axis is the index of a cluster of sequences in the database
4. The fourth axis is one of three measures of performance: precision, recall, and accuracy

performances[alg][db][cluster][measure]


Then in extract the accuracy of algorithm 1 over all databases and sequence clusters, you can just do:

performances[1,:,:,0]


or even:

performances[1,:,:,-1]


Now that the axes have a concrete semantics, it is much easier to see what this operation does!

## Arithmetic and Functions¶

Operations between scalars and arrays are broadcast, like in Pandas (and more generally in linear algebra). For instance:

>>> x = np.arange(5)
>>> 10 * x
array([ 0, 10, 20, 30, 40])
>>> y = -x
>>> y
array([ 0, -10, -20, -30, -40])


Operations between equally-shaped arrays operate element-wise:

>>> x
array([0, 1, 2, 3, 4])
>>> x + x
array([0, 2, 4, 6, 8])
>>> x + y
array([0, 0, 0, 0, 0])


Operations can update the sub-arrays automatically:

>>> x = np.arange(16).reshape((4, 4))
>>> x
array([[ 0,  1,  2,  3],
[ 4,  5,  6,  7],
[ 8,  9, 10, 11],
[12, 13, 14, 15]])
>>> x[1:3,1:3] += 100 # Note the automatic assignment!
>>> x
array([[  0,   1,   2,   3],
[  4, 105, 106,   7],
[  8, 109, 110,  11],
[ 12,  13,  14,  15]])


Operations between arrays of different shapes but compatible sizes are broadcast by “matching” the last (right-most) dimension. For instance, addition between a matrix x and a vector y broadcasts y over all rows of x:

>>> x = np.arange(9).reshape((3, 3))
>>> y = np.arange(3)
>>> x + y
array([[ 0,  2,  4],
[ 3,  5,  7],
[ 6,  8, 10]])


The result, x + y, can be seen as the result of:

result = np.zeros(x.shape)
for i in range(x.shape):
result[i,:] = x[i,:] + y[:]


As you can see, the last (right-most) dimension of x is matched to the last (and only) dimension of y.

Warning

Operations between arrays of incompatible sizes raise an error:

>>> np.arange(3) + np.arange(2)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: operands could not be broadcast together with shapes (3,) (2,)


## Mathematical Functions¶

Most of the usual mathematical functions are implemented in numpy, including the usual max and min operators and trigonometric functions.

For instance:

>>> np.sin(0), np.cos(0)
(0.0, 1.0)
>>> np.sin(np.pi / 2), np.cos(np.pi / 2)
(1.0, 6.123233995736766e-17)


As you can see, the value of the cosine at $$\frac{\pi}{2}$$ should be zero, but is approximated by a very (very) small value. The same is true for other functions:

>>> np.tan(np.pi / 4)
0.99999999999999989


The nice property of mathematical functions in numpy is that they are automatically applied to all elements of an array. For instance:

>>> x = [0, 0.25 * np.pi, 0.5 * np.pi, 0.75 * np.pi]
>>> np.cos(x)
array([  1.00000000e+00,   7.07106781e-01,   6.12323400e-17,
-7.07106781e-01])


where the second and last elements are half the square root of two and its opposite:

>>> np.sqrt(2) / 2
0.70710678118654757


Complex mathematical expressions can be therefore specified very compactly using this property. For instance, in order to plot a sinewave with small additive normally-distributed noise, we can write:

import numpy as np
import matplotlib.pyplot as plt

# The x values
x = np.arange(100)

# The y values, uncorrupted
y = np.sin(2 * np.pi * np.arange(100) / 100.0)
plt.plot(x, y)

# The y values, now corrupted by Gaussian noise
y += np.random.normal(0, 0.05, size=100)
plt.plot(x, y)

plt.savefig("sin.png")


The result is: Note

For the complete list of the available mathematical functions, see:

## Linear Algebra¶

Numpy provides a number of linear algebraic functions, including dot products, matrix-vector multiplication, and matrix-matrix multiplication:

import numpy as np

# dot (scalar) product between vectors
x = np.array([0, 0, 1, 1])
print np.dot(x, x)

z = np.array([1, 1, 0, 0])
print np.dot(z, z)

print np.dot(x, z)

# Matrix-vector product
a = 2 * np.diag(np.ones(4))
print(np.dot(a, x))
print(np.dot(a, z))

# Matrix-matrix product
b = np.ones((4, 4)) - a
print(np.dot(a, b))


The result is:

# np.dot(x, x)
2

# np.dot(z, z)
2

# np.dot(x, z)
0

# np.dot(a, x)
[ 0.  0.  2.  2.]

# np.dot(a, z)
[ 2.  2.  0.  0.]

# np.dot(a, b)
[[-2.  2.  2.  2.]
[ 2. -2.  2.  2.]
[ 2.  2. -2.  2.]
[ 2.  2.  2. -2.]]


Numpy also provides routines for dealing with eigenvalues/eigenvects, singular value decompositions, and other decompositions.

Let’s compute the (trivial) eigenvalues/eigenvectors of the 4x4 identity matrix:

>>> import numpy.linalg as la
>>> eigenvalues, eigenvectors = la.eig(np.ones((4, 4)))
>>> eigenvalues
array([ 0.,  4.,  0.,  0.])
>>> eigenvectors
array([[ -8.66025404e-01,   5.00000000e-01,  -2.77555756e-17,
-2.77555756e-17],
[  2.88675135e-01,   5.00000000e-01,  -5.77350269e-01,
-5.77350269e-01],
[  2.88675135e-01,   5.00000000e-01,   7.88675135e-01,
-2.11324865e-01],
[  2.88675135e-01,   5.00000000e-01,  -2.11324865e-01,
7.88675135e-01]])


Note

The reference manual of the numpy.linalg module is here: