Numpy: a cursory overview

Warning

This is an appendix.

This material will not be part of the final exam.


Numpy is a freely available library for performing efficient numerical computations over scalars, vectors, matrices, and more generally N-dimensional tensors.

Its main features are:

  • Flexible indexing and manipulation of arbitrary dimensional data.
  • Many numerical routines (linear algebra, Fourier analysis, local and global optimization, etc.) using a variety of algorithms.
  • Clean Python interface to many underlying libraries (for instance, the GNU Scientific Library, or GSL), which are used as efficient implementations of numerical algorithms.
  • Support for random sampling from a variety of distributions.

This list is far from complete!


Documentation & Source

The official numpy website is:

The numpy source code is available at:

and is distributed under the numpy license:


Importing numpy

The customary idiom to import numpy is:

>>> import numpy as np
>>> help(np)

Importing specific sub-modules also makes sense, so feel free to write (in this case for the linear algebra module):

>>> import numpy.linalg as la
>>> help(la)

The Array class

An ndarray is what it says: an N-dimensional array (also known as a tensor).

The usual concepts encountered in analysis and algebra can be implemented as arrays: vectors are 1D arrays, matrices are 2D arrays.

Given an array x, its dimensionality and type can be ascertained by using the shape, ndim and dtype attributes:

>>> x = np.zeros(10)
array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.])
>>> print type(x)
<type 'numpy.ndarray'>
>>> print x.shape
(10,)
>>> print x.ndim
1
>>> print x.dtype
float64

Creating an Array

Creating arrays from scratch can be useful to start off your mathematical code or for debugging purposes. A few useful creation functions are:

  • Creating an all-zero array is simple with np.zeros():

    >>> # One dimensional case
    >>> x = np.zeros(2)
    >>> x
    array([ 0.,  0.])
    
    >>> # Two dimensional case
    >>> x = np.zeros((2, 2))
    >>> x
    array([[ 0.,  0.],
           [ 0.,  0.]])
    >>> x.shape
    (2, 2)
    
    >>> # Three dimensional case
    >>> x = np.zeros((2, 2, 2))
    >>> x
    array([[[ 0.,  0.],
            [ 0.,  0.]],
           [[ 0.,  0.],
            [ 0.,  0.]]])
    >>> x.shape
    (2, 2, 2)
    
  • Creating an all-ones array is equally simple with np.ones():

    >>> # N-dimensional case
    >>> x = np.ones((2, 5, 5, 10))
    >>> x.shape
    (2, 5, 5, 10)
    
  • To create a diagonal matrix, use either the diag() and ones() methods together, or the eye() method (reminiscent of Matlab):

    >>> np.diag(np.ones(3))
    array([[ 1.,  0.,  0.],
           [ 0.,  1.,  0.],
           [ 0.,  0.,  1.]])
    >>> np.eye(3)
    array([[ 1.,  0.,  0.],
           [ 0.,  1.,  0.],
           [ 0.,  0.,  1.]])
    
  • The numpy analogue of range() is np.arange():

    >>> x = np.arange(10)
    >>> x.shape
    (10,)
    >>> x
    array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    
  • Converting from an existing list or list-of-lists object:

    >>> l = [range(3*i,3*(i+3),3) for i in range(3)]
    >>> l
    [[0, 3, 6], [3, 6, 9], [6, 9, 12]]
    >>> np.array(l)
    array([[ 0,  3,  6],
           [ 3,  6,  9],
           [ 6,  9, 12]])
    

    In this case, the dtype will be based on the type of the elements in the input list.

  • Random sampling an array from a distribution:

    >>> # Sampling a single scalar from a real-valued uniform (sampling
    >>> # functions return a scalar if no shape information is given)
    >>> x = np.random.uniform(0, 10)
    >>> x
    3.9300397597884897
    
    >>> # Sampling a 3-element vector from a Normal distribution
    >>> # with mean=1 and sigma=0.2
    >>> x = np.random.normal(1, 0.2, size=3)
    >>> x
    array([ 1.2001157 ,  0.94376082,  1.09514026])
    
    >>> # Sampling a binary 3x3 matrix, i.e. from a uniform
    >>> # distribution over integers
    >>> x = np.random.randint(0, 2, size=(3, 3))
    >>> x
    array([[1, 0, 0],
           [1, 0, 1],
           [0, 1, 0]])
    

Example. To sample a random 100x100 matrix where every entry is distributed according to a normal distribution, write:

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)

rm = np.random.normal(0, 1, size=(100, 100))

plt.imshow(rm, cmap="gray", interpolation="nearest")
plt.savefig("random-matrix.png")

The result is:

_images/random-matrix.png

Note

The reference manual of the numpy.random module is here:

For a quick-and-dirty explanation of (pseudo!) random number generation as used in Computer Science, see:

The numpy.random RNG makes use of the very popular Mersenne Twister (MT) PRNG. For an introduction, see here:

Warning

For reproducibility, make sure to initialize the pseudo-random number generator with np.random.seed() before running your code, i.e.:

>>> np.random.seed(0)

After setting the seed, your code will be deterministic as long as calls to the sampling routines above remain the same.


From pandas to numpy and back

It is very easy to “convert” a Series or DataFrame into an array: as a matter of facts, these Pandas types are built on numpy arrays: their underlying array is accessible through the .values attribute.

(Pandas of course coats the underlying array with a lot of useful functionality!)

For instance, given a DataFrame of the iris dataset:

iris = pd.read_csv("iris.csv")

the underlying array can be accessed through its .values attribute:

>>> # For a table of two columns
>>> type(iris[["PetalLength", "SepalLength"]])
<class 'pandas.core.frame.DataFrame'>
>>> type(iris[["PetalLength", "SepalLength"]].values)
<type 'numpy.ndarray'>

The same can be done with individual Series objects, e.g. the columns of the DataFrame:

>>> type(iris.PetalWidth.values)
<type 'numpy.ndarray'>

It is equally easy to construct a DataFrame out of an arrray: just call the DataFrame passing it the array as an argument:

>>> pd.DataFrame(x)
   0  1  2
0  0  1  2
1  3  4  5
2  6  7  8

If you also want to assign the column and row labels, you can pass them as optional arguments:

>>> pd.DataFrame(x,
...              columns=['a', 'b', 'c'],
...              index=['one', 'two', 'three'])
       a  b  c
one    0  1  2
two    3  4  5
three  6  7  8

Reshaping

Reshaping. The arrangement of entries into rows, columns, tubes, etc. can be changed at any time using the reshape() and ravel() methods:

>>> # Turn a 9-element vector into a 3x3 matrix
>>> x = np.arange(9)
>>> x
array([0, 1, 2, 3, 4, 5, 6, 7, 8])
>>> x.reshape((3, 3))
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

>>> # Turn a 3x3 matrix into a 9-element vector
>>> x = np.ones((3, 3))
>>> x
array([[ 1.,  1.,  1.],
       [ 1.,  1.,  1.],
       [ 1.,  1.,  1.]])
>>> x.ravel()
array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.])

Indexing

Given an array of shape (n1, n2, ..., nd), you can extract a single element through its index.

For instance, in the 2D case:

>>> x = np.arange(9).reshape((3, 3))
>>> x
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
>>> x[0,0]  # first row, first column
0
>>> x[1,2]  # second row, third column
5
>>> x[2,0]  # thrid row, first column
6

In the example above, x[i,j] extracts the element in row i, column j – just like in standard matrix algebra.

You can also use the colon : notation to select specify all the elements on an axis, for instance a whole column, a whole row, or an entire sub-array:

>>> # The first row
>>> x[0,:]
array([0, 1, 2])

>>> # A shorthand for the first row
>>> x[0]
array([0, 1, 2])

>>> # The first column (no shorthand available)
>>> x[:,1]
array([1, 4, 7])


>>> # A sub-table
>>> x[1:3, :]
array([[3, 4, 5],
       [6, 7, 8]])

The very same syntax applies to the general N-dimensional case:

>>> x = np.arange(10**5).reshape((10,)*5)
>>> x.shape
(10, 10, 10, 10, 10)
>>> x[1,2,3,4,5]
12345
>>> x[0,0,:,:,0]
array([[  0,  10,  20,  30,  40,  50,  60,  70,  80,  90],
       [100, 110, 120, 130, 140, 150, 160, 170, 180, 190],
       [200, 210, 220, 230, 240, 250, 260, 270, 280, 290],
       [300, 310, 320, 330, 340, 350, 360, 370, 380, 390],
       [400, 410, 420, 430, 440, 450, 460, 470, 480, 490],
       [500, 510, 520, 530, 540, 550, 560, 570, 580, 590],
       [600, 610, 620, 630, 640, 650, 660, 670, 680, 690],
       [700, 710, 720, 730, 740, 750, 760, 770, 780, 790],
       [800, 810, 820, 830, 840, 850, 860, 870, 880, 890],
       [900, 910, 920, 930, 940, 950, 960, 970, 980, 990]])

Note

It can be difficult to visualize operations over N-dimensional arrays, when N is larger than 3.

However, it is much easier when the axes have a concrete semantics. For instance, assume that the 4D array performace holds the performance of several sequence alignment algorithms over multiple sequence clusters in multiple databases:

  1. The first axis is the index of the algorithm
  2. The second axis is the index of the database
  3. The third axis is the index of a cluster of sequences in the database
  4. The fourth axis is one of three measures of performance: precision, recall, and accuracy

Your tensor looks like this:

performances[alg][db][cluster][measure]

Then in extract the accuracy of algorithm 1 over all databases and sequence clusters, you can just do:

performances[1,:,:,0]

or even:

performances[1,:,:,-1]

Now that the axes have a concrete semantics, it is much easier to see what this operation does!


Arithmetic and Functions

Operations between scalars and arrays are broadcast, like in Pandas (and more generally in linear algebra). For instance:

>>> x = np.arange(5)
>>> 10 * x
array([ 0, 10, 20, 30, 40])
>>> y = -x
>>> y
array([ 0, -10, -20, -30, -40])

Operations between equally-shaped arrays operate element-wise:

>>> x
array([0, 1, 2, 3, 4])
>>> x + x
array([0, 2, 4, 6, 8])
>>> x + y
array([0, 0, 0, 0, 0])

Operations can update the sub-arrays automatically:

>>> x = np.arange(16).reshape((4, 4))
>>> x
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])
>>> x[1:3,1:3] += 100 # Note the automatic assignment!
>>> x
array([[  0,   1,   2,   3],
       [  4, 105, 106,   7],
       [  8, 109, 110,  11],
       [ 12,  13,  14,  15]])

Operations between arrays of different shapes but compatible sizes are broadcast by “matching” the last (right-most) dimension. For instance, addition between a matrix x and a vector y broadcasts y over all rows of x:

>>> x = np.arange(9).reshape((3, 3))
>>> y = np.arange(3)
>>> x + y
array([[ 0,  2,  4],
       [ 3,  5,  7],
       [ 6,  8, 10]])

The result, x + y, can be seen as the result of:

result = np.zeros(x.shape)
for i in range(x.shape[0]):
    result[i,:] = x[i,:] + y[:]

As you can see, the last (right-most) dimension of x is matched to the last (and only) dimension of y.

Warning

Operations between arrays of incompatible sizes raise an error:

>>> np.arange(3) + np.arange(2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: operands could not be broadcast together with shapes (3,) (2,)

Mathematical Functions

Most of the usual mathematical functions are implemented in numpy, including the usual max and min operators and trigonometric functions.

For instance:

>>> np.sin(0), np.cos(0)
(0.0, 1.0)
>>> np.sin(np.pi / 2), np.cos(np.pi / 2)
(1.0, 6.123233995736766e-17)

As you can see, the value of the cosine at \(\frac{\pi}{2}\) should be zero, but is approximated by a very (very) small value. The same is true for other functions:

>>> np.tan(np.pi / 4)
0.99999999999999989

The nice property of mathematical functions in numpy is that they are automatically applied to all elements of an array. For instance:

>>> x = [0, 0.25 * np.pi, 0.5 * np.pi, 0.75 * np.pi]
>>> np.cos(x)
array([  1.00000000e+00,   7.07106781e-01,   6.12323400e-17,
        -7.07106781e-01])

where the second and last elements are half the square root of two and its opposite:

>>> np.sqrt(2) / 2
0.70710678118654757

Complex mathematical expressions can be therefore specified very compactly using this property. For instance, in order to plot a sinewave with small additive normally-distributed noise, we can write:

import numpy as np
import matplotlib.pyplot as plt

# The x values
x = np.arange(100)

# The y values, uncorrupted
y = np.sin(2 * np.pi * np.arange(100) / 100.0)
plt.plot(x, y)

# The y values, now corrupted by Gaussian noise
y += np.random.normal(0, 0.05, size=100)
plt.plot(x, y)

plt.savefig("sin.png")

The result is:

_images/sin.png

Note

For the complete list of the available mathematical functions, see:

Linear Algebra

Numpy provides a number of linear algebraic functions, including dot products, matrix-vector multiplication, and matrix-matrix multiplication:

import numpy as np

# dot (scalar) product between vectors
x = np.array([0, 0, 1, 1])
print np.dot(x, x)

z = np.array([1, 1, 0, 0])
print np.dot(z, z)

print np.dot(x, z)

# Matrix-vector product
a = 2 * np.diag(np.ones(4))
print(np.dot(a, x))
print(np.dot(a, z))

# Matrix-matrix product
b = np.ones((4, 4)) - a
print(np.dot(a, b))

The result is:

# np.dot(x, x)
2

# np.dot(z, z)
2

# np.dot(x, z)
0

# np.dot(a, x)
[ 0.  0.  2.  2.]

# np.dot(a, z)
[ 2.  2.  0.  0.]

# np.dot(a, b)
[[-2.  2.  2.  2.]
 [ 2. -2.  2.  2.]
 [ 2.  2. -2.  2.]
 [ 2.  2.  2. -2.]]

Numpy also provides routines for dealing with eigenvalues/eigenvects, singular value decompositions, and other decompositions.

Let’s compute the (trivial) eigenvalues/eigenvectors of the 4x4 identity matrix:

>>> import numpy.linalg as la
>>> eigenvalues, eigenvectors = la.eig(np.ones((4, 4)))
>>> eigenvalues
array([ 0.,  4.,  0.,  0.])
>>> eigenvectors
array([[ -8.66025404e-01,   5.00000000e-01,  -2.77555756e-17,
         -2.77555756e-17],
       [  2.88675135e-01,   5.00000000e-01,  -5.77350269e-01,
         -5.77350269e-01],
       [  2.88675135e-01,   5.00000000e-01,   7.88675135e-01,
         -2.11324865e-01],
       [  2.88675135e-01,   5.00000000e-01,  -2.11324865e-01,
          7.88675135e-01]])

Note

The reference manual of the numpy.linalg module is here: