=================
Python: Functions
=================

Functions are **named blocks of code**. They take inputs and produce outputs.

The abstract syntax is::

    def function(arg1, arg2, ...):
        # the code
        return result

Once a function is defined (as above), it can be called as follows::

    the_result = function(value1, value2, ...)

The arguments (``arg1``, ``arg2``, etc.) are variables that receive the inputs:
the function takes as many inputs as it has arguments. The value of the
variable ``result`` is passed back from the function to its caller.

.. warning::

    - The names of the variables I pass to the function have nothing to do
      with the names of the arguments.

      In the code above, the values of the variables ``valueX`` are visible
      from inside the function as ``argX``:

      - ``arg1`` takes the value of ``value1``
      - ``arg2`` takes the value of ``value2``
      - *etc.*

    - The name of the variable I use to store the result of the function has
      nothing to do with the name of the variable used inside the function to
      store the result.

      In the code above, the value of the result is stored in the ``result``
      variable inside the function, but is stored in the ``the_result``
      variable by the caller.

      - ``the_result`` takes the value of ``result``

Example::

    def function(a, b):
        r = a + b
        return r

    x = 5
    y = 10
    z = function(x, y)
    print z

Here ``a`` takes its value from ``x``, ``b`` from ``y``; ``return r`` makes
the function return the value of ``r``, which is assigned by the caller to
the variable ``z``.

****

**Example**.
Let's define a function that takes two numbers and **returns** their sum::

    def add(n, m):
        return n + m

It can be used as follows::

    result = add(4, 6)
    print result            # 10

    result = add(6, 4)
    print result            # 10

By replacing the calls to ``add()`` with the code in the definition of the
``add()`` function, and substituting the arguments with the input values, we
get the equivalent code::

    n = 4
    m = 6
    result = n + m
    print result            # 10

    n = 6
    m = 4
    result = n + m
    print result            # 10

As with all Python-provided functions, I can also skip assigning the return
value to a variable (``result`` in the code above)::

    print add(4, 6)         # 10

****

**Example**. Let's write a function ``print_sum()`` that **prints** the sum
of two numbers::

    def print_sum(n, m):
        print n + m

It can be used as follows::

    print_sum(4, 6)         # prints 10
    print_sum(6, 4)         # prints 10

.. warning::

    Notice that there is no ``return`` statement in ``print_sum()``.

    When ``return`` is omitted, the function automatically returns ``None``::

        result = print_sum(4, 6)    # prints 10
        print result                # prints None

****

.. warning::

    A function does **nothing** until it is called.

Consider the Python module ``test.py``::

    print "beginning"

    def function():
        print "I do stuff"

    print "end"

Running it with the Python interpreter::

    $ python test.py

produces the following output::

    beginning
    end

As you can see, the interpreter executes the code line by line. However,
while the function ``function()`` is defined in the middle of the module, it
is not **called** anywhere. Therefore its body is not executed at all.

In order to actually call the function, let's write::

    print "beginning"

    def function():
        print "I do stuff"

    function()

    print "end"

This code prints::

    beginning
    I do stuff
    end

as expected.

****

**Example**. Let's write a function ``factorial()`` that computes the
factorial of an integer ``n``:

.. math::

    n! = 1 \times 2 \times 3 \times \ldots \times (n - 2) \times (n - 1) \times n

Now, let's compute the factorial of ``n`` normally (i.e.
without defining a new function)::

    fact = 1
    for k in range(1, n + 1):
        fact = fact * k

Now that we have the code for computing the factorial, it is easy to write a
*function* that computes the factorial: it is sufficient to plug the above
code inside a new function, as follows::

    def factorial(n):
        fact = 1
        for k in range(1, n + 1):
            fact = fact * k
        return fact

And let's check if it works::

    print factorial(1)      # 1
    print factorial(2)      # 2
    print factorial(3)      # 6
    print factorial(4)      # 24
    print factorial(5)      # 120
    print factorial(6)      # 720

Of course, the new function can be used like any of the Python-defined
functions, e.g. in *list comprehensions*::

    factorials = [factorial(n) for n in range(10)]

****

.. warning::

    The names of the function and of its arguments are arbitrary: pick
    whichever names you find most fitting.

    **Quiz**. What is the difference between this code::

        def arith(op, a, b):
            if op == "+":
                return a + b
            elif op == "*":
                return a * b
            else:
                return None

        print arith("+", 10, 10)
        print arith("*", 2, 2)

    and this code?::

        def f(what, x, y):
            if what == "+":
                return x + y
            elif what == "*":
                return x * y
            else:
                return 0

        print f("+", 10, 10)
        print f("*", 2, 2)

****

.. note::

    A function can return more than one result, as follows::

        def multiresult():
            result_1 = "first result"
            result_2 = 0.12
            result_3 = "something else"
            return result_1, result_2, result_3

    Internally, Python interprets the ``return`` statement as returning a
    tuple. In practice, the above code is equivalent to::

        def multiresult():
            return ("first result", 0.12, "something else")

    When I call a "multi-result" function, I can either put the resulting
    tuple into a variable and extract the various elements individually::

        result = multiresult()

        res1 = result[0]
        print res1

        res2 = result[1]
        print res2

        res3 = result[2]
        print res3

    or I can use the "automatic unpacking" feature of Python, as follows::

        res1, res2, res3 = multiresult()
        print res1
        print res2
        print res3

****

.. warning::

    Variables have a **scope**, and in particular:

    - Variables declared outside the function are not visible from the
      inside. [1]

      If you want to pass one or more values from the outside to the
      function, pass them through the arguments.

    - Variables declared inside the function are not visible from the
      outside.

      If you want to pass one or more values from the function to the
      external world, use the ``return`` statement.

    [1] *There are exceptions to this rule; we will ignore them in this
    presentation.*

****

**Example**. Consider this code::

    def find_physical(triples):
        """Takes a mixed interaction protein network, example:

            [("1A3A", "physical", "5ARM"), ("5JTD", "genetic", "5TGD")]

        and extracts physical interacting protein pairs, example:

            [("1A3A", "5ARM")]
        """
        phys_pairs = []
        for p1, relation, p2 in triples:
            if relation == "physical":
                phys_pairs.append((p1, p2))
        # XXX I forgot to return `phys_pairs` here!

    network = [
        ("1A3A", "physical", "5ARM"),
        ("5JTD", "genetic", "5TGD")
    ]

    find_physical(network)
    print phys_pairs

Here ``phys_pairs`` is declared **inside** the function: it is not visible
from the **outside**, so the final ``print`` fails! In order to fix this
issue, I have to explicitly ``return`` it and store the returned value::

    def find_physical(triples):
        phys_pairs = []
        for p1, relation, p2 in triples:
            if relation == "physical":
                phys_pairs.append((p1, p2))
        return phys_pairs

    network = [
        ("1A3A", "physical", "5ARM"),
        ("5JTD", "genetic", "5TGD")
    ]

    result = find_physical(network)
    print result

****

**Example**. Functions can call other functions.
Let's write two functions::

    def read_fasta(path):
        """Reads a FASTA file with one-line sequences."""
        fasta = {}
        for line in open(path).readlines():
            line = line.strip()
            # A header line starts with ">"; the line after it is its sequence
            if line.startswith(">"):
                header = line
            else:
                fasta[header] = line
        return fasta

    def compute_histogram(sequence):
        """Computes the histogram of the characters."""
        histogram = {}
        for letter in sequence:
            if letter not in histogram:
                histogram[letter] = 0
            histogram[letter] += 1
        return histogram

These functions can be used to implement a complex program that:

#. Reads a FASTA file into a dictionary
#. For each sequence in the FASTA file, computes the histogram of its letters
#. Prints each sequence header and the corresponding histogram

as follows::

    path = raw_input("enter a path: ")
    fasta = read_fasta(path)

    for header, sequence in fasta.items():
        histogram = compute_histogram(sequence)

        print "header =", header.lstrip(">"), ":"
        print histogram

****

**Example**. Since functions can call other functions, the "call graph" of a
program can become arbitrarily complicated. Let's see a moderately realistic
example of what a call graph looks like.

Let's write a (mock!) program, composed of multiple functions, that asks the
user for:

#. the path to one or more FASTA files.
#. the path to a file describing a physical protein interaction network
   (PIN).

and computes some average statistic (say, a histogram) of the amino acid
composition of interacting proteins.

When run, the program does the following:

#. reads the sequence data from each FASTA file, see the
   ``read_sequences()`` and ``read_fasta()`` functions
#. reads the interaction network with the ``read_interactions()`` function
#. for each pair of interacting proteins, computes statistics about their
   joint amino acid composition, through the ``compute_aa_stats()`` function,
   and computes an "average" summary statistic in the
   ``compute_avg_stats()`` function.

Here is the code::

    def read_fasta(path):
        """Takes a path to a FASTA file, returns a (header, sequence) pair."""
        # TODO actually read the file
        return "1A3A:A", "MANLFKLG..."

    def read_sequences(paths):
        """Reads a bunch of FASTA files, returns a header->sequence dict."""
        header_to_seq = {}
        for path in paths:
            header, seq = read_fasta(path)
            header_to_seq[header] = seq
        return header_to_seq

    def read_interactions(path):
        """Reads physical protein interactions from a file. Returns
        a list of pairs of strings."""
        # TODO actually read the file
        return [("1A3A:A", "5AA3:F"), ("5AA3:F", "5K9C:A")]

    def compute_aa_stats(seq1, seq2):
        """Compute amino acid statistics, e.g. co-occurrence."""
        # TODO actually compute co-occurrence and MI
        cooccurrence = {"A": 0.2, "C": 0.01}
        mutual_information = 0.72
        return cooccurrence, mutual_information

    def compute_avg_stats(sequences, interactions):
        """Takes the sequences and the interactions, computes the statistics
        of each interacting pair, and returns the average statistics."""
        stats = []
        for prot1, prot2 in interactions:
            # Skip the pairs for which a sequence is missing
            if not (prot1 in sequences and prot2 in sequences):
                continue
            seq1 = sequences[prot1]
            seq2 = sequences[prot2]
            stats.append(compute_aa_stats(seq1, seq2))
        # TODO actually average all the collected statistics
        return 0.3

    def main():
        """The whole (fake) program."""

        # Read the sequence files
        paths = []
        while True:
            ans = raw_input("path to FASTA file: ")
            if len(ans) == 0:
                break
            paths.append(ans)
        sequences = read_sequences(paths)

        # Read the interaction file
        ans = raw_input("path to interaction data: ")
        interactions = read_interactions(ans)

        # Print the average stats
        print "average stats =", compute_avg_stats(sequences, interactions)

    main()

As you can see, Python first goes through the function definitions without
executing their bodies; the actual work only begins when it reaches the call
to ``main()`` on the very last line of the program. The ``main()`` function
calls all the other "major" functions: ``read_sequences()``,
``read_interactions()`` and ``compute_avg_stats()``.
The ``read_sequences()`` function internally calls the ``read_fasta()``
function multiple times, once for each user-provided FASTA file. The
``read_interactions()`` function calls no other function. The
``compute_avg_stats()`` function uses the ``compute_aa_stats()`` function to
compute the statistics of individual protein-protein pairs.

The above can be summarized using a "call graph" like this:

.. image:: figures/callgraph.png

**Quiz**. How many times is:

- the ``main()`` function called?
- the ``read_fasta()`` function called?
- the ``compute_aa_stats()`` function called?
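
One way to check an answer to a question like the ones in the quiz is to let the program count its own calls. The following is only a minimal sketch on a toy call graph, not on the program above: ``calls``, ``counted()``, ``read_one()``, ``read_all()`` and ``toy_main()`` are hypothetical names introduced for illustration. It relies on a *decorator* (the ``@counted`` lines), a Python feature that this presentation does not cover, and it prints with ``print(...)`` so that it runs unchanged under both Python 2 and Python 3.

```python
# Hypothetical helper: `calls` maps the name of a function to the number of
# times that function has been called so far.
calls = {}

def counted(func):
    """Wraps `func` so that every call to it is recorded in `calls`."""
    def wrapper(*args):
        calls[func.__name__] = calls.get(func.__name__, 0) + 1
        return func(*args)
    return wrapper

@counted
def read_one(path):
    # Stand-in for a function that reads a single file
    return path.upper()

@counted
def read_all(paths):
    # Calls read_one() once for each path
    return [read_one(path) for path in paths]

@counted
def toy_main():
    return read_all(["a.fasta", "b.fasta", "c.fasta"])

result = toy_main()

for name in sorted(calls):
    print("%s was called %d time(s)" % (name, calls[name]))
```

In this toy program ``toy_main()`` and ``read_all()`` are each called once, while ``read_one()`` is called once per path. The same idea could be applied to the mock program above, where the counts depend on how many paths the user types in.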