=================
Python: Functions
=================

Functions are **named blocks of code**. They take inputs and produce outputs.

The abstract syntax is::

    def function(arg1, arg2, ...):
        # the code
        return result

Once a function is defined (as above), it can be called as follows::

    the_result = function(value1, value2, ...)

The arguments (``arg1``, ``arg2``, etc.) are variables that receive the inputs
of the function: their number determines how many values the caller must pass.
The value of the variable ``result`` is handed back to the caller by the
``return`` statement.

.. warning::

    - The names of the variables I pass to the function have nothing to do
      with the names of the arguments.

      In the code above, the values of the variables ``valueX`` are visible
      from inside the function as ``argX``:

      - ``arg1`` takes the value of ``value1``
      - ``arg2`` takes the value of ``value2``
      - *etc.*

    - The name of the variable I use to store the result of the function has
      nothing to do with the name of the variable used inside the function to
      store the result.

      In the code above, the value of the result is stored in the ``result``
      variable inside the function, but is stored inside the ``the_result``
      variable by the caller.

      - ``the_result`` takes the value of ``result``

    Example::

        def function(a, b):
            r = a + b
            return r

        x = 5
        y = 10

        z = function(x, y)
        print z

    Here ``a`` takes its value from ``x``, ``b`` from ``y``; ``return r``
    makes the function return the value of ``r``, which is assigned by the
    caller to the variable ``z``.
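
A further sketch of the same point, using a hypothetical ``subtract()``
function that is not part of the example above. The caller's variables are
deliberately named like the arguments but passed in the opposite order, to show
that only the position of each value in the call matters. (A single-argument
``print(...)`` is used so that the snippet runs under both Python 2 and
Python 3.)

```python
def subtract(a, b):
    # `a` is the first value passed, `b` the second,
    # whatever the caller's variables are called
    return a - b

a = 1
b = 10

# The caller's `b` is passed first, so inside the function a == 10 and b == 1
difference = subtract(b, a)
print(difference)           # 9
```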

****

**Example**. Let's define a function that takes two numbers and **returns**
their sum::

    def add(n, m):
        return n + m

It can be used as follows::

    result = add(4, 6)
    print result            # 10

    result = add(6, 4)
    print result            # 10

By replacing the calls to ``add()`` with the code in the definition of the
add function, and substituting the arguments with the input values, we get
the equivalent code::

    n = 4
    m = 6
    result = n + m
    print result            # 10

    n = 6
    m = 4
    result = n + m
    print result            # 10

As with all Python-provided functions, I can also skip assigning the return
value to a variable (``result`` in the code above)::

    print add(4, 6)         # 10
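
Because a call evaluates to its return value, a call can also appear wherever
a value is expected, for instance as an argument of another call. A minimal
sketch, which redefines ``add()`` so that it can be run on its own (the
single-argument ``print(...)`` works under both Python 2 and Python 3):

```python
def add(n, m):
    return n + m

# The inner call is evaluated first: add(1, 2) gives 3, then add(3, 3) gives 6
total = add(add(1, 2), 3)
print(total)                # 6

# The return value can also be used directly in an expression
doubled = 2 * add(4, 6)
print(doubled)              # 20
```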

****

**Example**. Let's write a function ``print_sum()`` that **prints** the sum of
two numbers::

    def print_sum(n, m):
        print n + m

It can be used as follows::

    print_sum(4, 6)         # prints 10
    print_sum(6, 4)         # prints 10

.. warning::

    Notice that there is no ``return`` statement in ``print_sum()``.

    When ``return`` is omitted, the function automatically returns ``None``::

        result = print_sum(4, 6)    # prints 10
        print result                # prints None
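
The practical consequence is that the value printed by ``print_sum()`` cannot
be reused by the caller. The sketch below contrasts it with ``add()`` from the
previous example. Both functions are redefined here so that the snippet is
self-contained, and the ``try``/``except`` is only there to show the error
without stopping the program:

```python
def add(n, m):
    return n + m

def print_sum(n, m):
    print(n + m)

# Works: add() returns a number that can be used in further computations
ok = add(4, 6) + 1
print(ok)                   # 11

# Fails: print_sum() prints 10 but returns None, and None + 1 is not defined
try:
    broken = print_sum(4, 6) + 1
except TypeError:
    broken = None
    print("cannot add 1 to None")
```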

****

.. warning::

    A function does **nothing** until it is called.

    Consider the Python module ``test.py``::

        print "beginning"

        def function():
            print "I do stuff"

        print "end"

    Running it with the Python interpreter::

        $ python test.py

    produces the following output::

        beginning
        end

    As you can see, the interpreter executes the code line by line. However,
    while the function ``function()`` is defined in the middle of the module,
    it is not **called** anywhere. Therefore it is not executed at all.

    In order to actually call the function, let's write::

        print "beginning"

        def function():
            print "I do stuff"

        function()

        print "end"

    This code prints::

        beginning
        I do stuff
        end

    as expected.

****

**Example**. Let's write a function ``factorial()`` that computes the factorial
of an integer ``n``:

.. math::

    n! = 1 \times 2 \times 3 \times \ldots \times (n - 2) \times (n - 1) \times n

Now, let's compute the factorial of ``n`` normally (i.e. without defining
a new function)::

    fact = 1
    for k in range(1, n + 1):
        fact = fact * k

Now that we have the code for computing the factorial, it is easy to write
a *function* that computes the factorial: it is sufficient to plug the above
code inside a new function, as follows::

    def factorial(n):
        fact = 1
        for k in range(1, n + 1):
            fact = fact * k
        return fact

And let's check if it works::

    print factorial(1)          # 1
    print factorial(2)          # 2
    print factorial(3)          # 6
    print factorial(4)          # 24
    print factorial(5)          # 120
    print factorial(6)          # 720

Of course, the new function can be used like any of the Python-defined
functions, e.g. in *list comprehensions*::

    factorials = [factorial(n) for n in range(10)]
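
As an additional check, the results can be compared with ``math.factorial()``
from the standard library. This is a sketch rather than part of the original
exercise; ``factorial()`` is repeated so that the snippet runs on its own:

```python
import math

def factorial(n):
    fact = 1
    for k in range(1, n + 1):
        fact = fact * k
    return fact

factorials = [factorial(n) for n in range(10)]
print(factorials[:5])       # [1, 1, 2, 6, 24]

# For n = 0 the loop body never runs, so the function returns 1, matching 0! = 1
expected = [math.factorial(n) for n in range(10)]
print(factorials == expected)   # True
```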

****

.. warning::

    The names of the function and of its arguments are arbitrary: pick
    whichever names you find most fitting.

**Quiz**. What is the difference between this code::

    def arith(op, a, b):
        if op == "+":
            return a + b
        elif op == "*":
            return a * b
        else:
            return None

    print arith("+", 10, 10)
    print arith("*", 2, 2)

and this code?::

    def f(what, x, y):
        if what == "+":
            return x + y
        elif what == "*":
            return x * y
        else:
            return 0

    print f("+", 10, 10)
    print f("*", 2, 2)

****

.. note::

    A function can return more than one result, as follows::

        def multiresult():
            result_1 = "first result"
            result_2 = 0.12
            result_3 = "something else"
            return result_1, result_2, result_3

    Internally, Python interprets the ``return`` statement as returning
    a tuple. In practice, the above code is equivalent to::

        def multiresult():
            return ("first result", 0.12, "something else")

    When I call a "multi-result" function, I can either put the resulting
    tuple into a variable and extract the various elements individually::

        result = multiresult()
        res1 = result[0]
        print res1
        res2 = result[1]
        print res2
        res3 = result[2]
        print res3

    or I can use the "automatic unpacking" feature of Python, as follows::

        res1, res2, res3 = multiresult()
        print res1
        print res2
        print res3
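
A slightly more practical sketch of the same idea, using a hypothetical
``min_max()`` function (not part of the original text) that returns both the
smallest and the largest element of a list:

```python
def min_max(values):
    # The two values separated by a comma are packed into one tuple
    return min(values), max(values)

both = min_max([3, 1, 4, 1, 5])
print(both)                 # (1, 5)

# Unpacking requires exactly as many names on the left as returned values
smallest, largest = min_max([3, 1, 4, 1, 5])
print(smallest)             # 1
print(largest)              # 5
```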

****

.. warning::

    Variables have a **scope**, and in particular:

    - Variables declared outside the function are not visible from the inside. [1]

      If you want to pass one or more values from the outside to the function,
      pass them through the arguments.

    - Variables declared inside the function are not visible from the outside.

      If you want to pass one or more values from the function to the
      external world, use the ``return`` statement.

    [1] *There are exceptions to this rule; we will ignore them in this
    presentation.*
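
A minimal sketch of the second rule, using a hypothetical ``compute()``
function. The ``try``/``except`` is only there so that the snippet keeps
running after the error:

```python
def compute():
    inner = 42          # `inner` exists only while compute() is running
    return inner

value = compute()       # the *value* 42 leaves the function through return
print(value)            # 42

try:
    print(inner)        # the *name* `inner` is unknown out here
except NameError:
    visible = False
    print("inner is not visible outside compute()")
```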

****

**Example**. Consider this code::

    def find_physical(triples):
        """Takes a mixed interaction protein network, example:

            [("1A3A", "physical", "5ARM"),
             ("5JTD", "genetic", "5TGD")]

        and extracts physical interacting protein pairs, example:

            [("1A3A", "5ARM")]
        """
        phys_pairs = []
        for p1, relation, p2 in triples:
            if relation == "physical":
                phys_pairs.append((p1, p2))
        # XXX I forgot to return `phys_pairs` here!

    network = [
        ("1A3A", "physical", "5ARM"),
        ("5JTD", "genetic", "5TGD")
    ]
    find_physical(network)
    print phys_pairs

Here ``phys_pairs`` is declared **inside** the function: it is not visible
from the **outside**, so the last line fails with a ``NameError``!

In order to fix this issue, I have to explicitly ``return`` it::

    def find_physical(triples):
        phys_pairs = []
        for p1, relation, p2 in triples:
            if relation == "physical":
                phys_pairs.append((p1, p2))
        return phys_pairs

    network = [
        ("1A3A", "physical", "5ARM"),
        ("5JTD", "genetic", "5TGD")
    ]
    result = find_physical(network)
    print result

****

**Example**. Functions can call other functions. Let's write two functions::

    def read_fasta(path):
        """Reads a FASTA file with one-line sequences."""
        fasta = {}
        with open(path) as lines:
            for line in lines:
                line = line.strip()
                if not line:
                    continue        # skip empty lines
                if line.startswith(">"):
                    header = line
                else:
                    fasta[header] = line
        return fasta

    def compute_histogram(sequence):
        """Computes the histogram of the characters."""
        histogram = {}
        for letter in sequence:
            if letter not in histogram:
                histogram[letter] = 0
            histogram[letter] += 1
        return histogram

These functions can be used to implement a complex program that:

#. Reads a FASTA file into a dictionary

#. For each sequence in the FASTA file, computes the histogram of its letters

#. Prints each sequence header and the corresponding histogram

as follows::

    path = raw_input("enter a path: ")
    fasta = read_fasta(path)

    for header, sequence in fasta.items():
        histogram = compute_histogram(sequence)
        print "header =", header.lstrip(">"), ":"
        print histogram
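
To try the two functions without preparing a file by hand, the sketch below
writes a small FASTA file to a temporary location and processes it. The
headers and sequences are made up for illustration, and the functions are
repeated, written with ``line.startswith()`` and the ``in`` operator, so that
the snippet runs on its own under both Python 2 and Python 3:

```python
import os
import tempfile

def read_fasta(path):
    """Reads a FASTA file with one-line sequences."""
    fasta = {}
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue            # skip empty lines
            if line.startswith(">"):
                header = line
            else:
                fasta[header] = line
    return fasta

def compute_histogram(sequence):
    """Computes the histogram of the characters."""
    histogram = {}
    for letter in sequence:
        if letter not in histogram:
            histogram[letter] = 0
        histogram[letter] += 1
    return histogram

# Made-up data, for illustration only
content = ">seq1\nAAGC\n>seq2\nTTTT\n"

# Create an empty temporary file, then fill it with the made-up data
fd, path = tempfile.mkstemp(suffix=".fasta")
os.close(fd)
with open(path, "w") as out:
    out.write(content)

fasta = read_fasta(path)
os.remove(path)

histograms = {}
for header, sequence in fasta.items():
    histograms[header.lstrip(">")] = compute_histogram(sequence)

print(histograms["seq1"] == {"A": 2, "G": 1, "C": 1})   # True
print(histograms["seq2"] == {"T": 4})                   # True
```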

****

**Example**. Since functions can call other functions, the "call graph" of a
program can become arbitrarily complicated. Let's see a moderately realistic
example of what a call graph looks like.

Let's write a (mock!) program, composed of multiple functions,
that asks the user for:

#. the path to one or more FASTA files.

#. the path to a file describing a physical protein interaction network (PIN).

and computes some average statistic (say, a histogram) of the amino acid
composition of interacting proteins.

When run, the program does the following:

#. reads the sequence data from each FASTA file, see the ``read_sequences()``
   and ``read_fasta()`` functions

#. reads the interaction network with the ``read_interactions()`` function

#. for each pair of interacting proteins, computes statistics about their
   joint amino acid composition, through the ``compute_aa_stats()`` function,
   and computes an "average" summary statistic in the ``compute_avg_stats()``
   function.

Here is the code::

    def read_fasta(path):
        """Takes a path to a FASTA file, returns a
        (header, sequence) pair."""
        # TODO actually read the file
        return "1A3A:A", "MANLFKLG..."

    def read_sequences(paths):
        """Reads a bunch of FASTA files, returns a
        header->sequence dict."""
        header_to_seq = {}
        for path in paths:
            header, seq = read_fasta(path)
            header_to_seq[header] = seq
        return header_to_seq

    def read_interactions(path):
        """Reads physical protein interactions from a
        file. Returns a list of pairs of strings."""
        # TODO actually read the file
        return [("1A3A:A", "5AA3:F"), ("5AA3:F", "5K9C:A")]

    def compute_aa_stats(seq1, seq2):
        """Compute amino acid statistics, e.g.
        co-occurrence."""
        # TODO actually compute co-occurrence and MI
        cooccurrence = {"A": 0.2, "C": 0.01}
        mutual_information = 0.72
        return cooccurrence, mutual_information

    def compute_avg_stats(sequences, interactions):
        """Takes a header->sequence dict and a list of
        interacting pairs, computes the average statistics."""
        stats = []
        for prot1, prot2 in interactions:
            if not (prot1 in sequences and prot2 in sequences):
                continue
            seq1 = sequences[prot1]
            seq2 = sequences[prot2]
            stats.append(compute_aa_stats(seq1, seq2))
        # TODO actually average all the collected statistics
        return 0.3

    def main():
        """The whole (fake) program."""

        # Read the sequence files
        paths = []
        while True:
            ans = raw_input("path to FASTA file: ")
            if len(ans) == 0:
                break
            paths.append(ans)

        sequences = read_sequences(paths)

        # Read the interaction file
        ans = raw_input("path to interaction data: ")
        interactions = read_interactions(ans)

        # Print the average stats
        print "average stats =", compute_avg_stats(sequences, interactions)

    main()

As you can see, Python first executes the ``def`` statements, which define the
functions without running them, and only then calls ``main()`` on the very
last line of the program. The ``main()`` function calls all the other
"major" functions: ``read_sequences()``, ``read_interactions()`` and
``compute_avg_stats()``.

The ``read_sequences()`` function internally calls the ``read_fasta()``
function multiple times, once for each user-provided FASTA file.

The ``read_interactions()`` function calls no other function.

The ``compute_avg_stats()`` function uses the ``compute_aa_stats()`` function
to compute the statistics of individual protein-protein pairs.

The above can be summarized using a "call graph" like this:

.. image:: figures/callgraph.png
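
One way to explore a call graph empirically is to record each call as it
happens. The sketch below uses three hypothetical functions (unrelated to the
mock program above) that append their own name to a list, which is passed
along as an extra argument:

```python
def leaf(x, log):
    log.append("leaf")
    return x * 2

def middle(values, log):
    log.append("middle")
    # leaf() is called once per element of `values`
    return [leaf(v, log) for v in values]

def main(log):
    log.append("main")
    return middle([1, 2, 3], log)

log = []
result = main(log)
print(result)   # [2, 4, 6]
print(log)      # ['main', 'middle', 'leaf', 'leaf', 'leaf']
```

The order of the entries in ``log`` follows the arrows of the call graph:
``main()`` runs once, calls ``middle()`` once, which in turn calls ``leaf()``
once per input value.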

**Quiz**. How many times is:

- the ``main()`` function called?
- the ``read_fasta()`` function called?
- the ``compute_aa_stats()`` function called?