==========================
Python: Complex Statements
==========================

Conditional code: ``if``
------------------------

The ``if``/``elif``/``else`` statements let us write code that gets
executed **if and only if** some condition is satisfied.

For instance::

    if condition:
        print "the condition is True"

executes the ``print`` statement if and only if ``condition`` evaluates to
``True``.

We can also have multiple *mutually exclusive* alternatives::

    if condition:
        print "the condition is True"
    else:
        print "the condition is not True"

Here only one *branch* is executed, based on the value of ``condition``.

The same is true here::

    if condition1:
        print "condition1 is True"
    elif condition2:
        print "condition1 is not True"
        print "but condition2 is!"
    elif condition3:
        print "condition1 and condition2 are not True"
        print "but condition3 is!"
    else:
        print "no condition is True"

The ``if``, ``elif`` and ``else`` form a "chain": only one of the branches
is executed.
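
As a small illustration of such a chain, the sketch below classifies a number.
The variable names ``x`` and ``sign`` are made up for this example, and the
code uses the parenthesized, single-argument ``print(...)`` form so that it
should run unchanged under both Python 2 and Python 3.

```python
x = -7

# exactly one of the three branches below is executed
if x > 0:
    sign = "positive"
elif x < 0:
    sign = "negative"
else:
    sign = "zero"

print(sign)   # negative
```

Changing ``x`` to ``3`` or to ``0`` selects a different branch, but never more
than one.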

****

**Example**. Suppose we have two Booleans ``c1`` and ``c2``. Let's see in
detail which lines are executed based on the value of the two variables::

                     # c1   c2   | c1   c2    | c1    c2   | c1    c2
                     # True True | True False | False True | False False
                     # ----------+------------+------------+------------
    print "begin"    # yes       | yes        | yes        | yes
    if c1:           # yes       | yes        | yes        | yes
        print "1"    # yes       | yes        | no         | no
    elif c2:         # no        | no         | yes        | yes
        print "2"    # no        | no         | yes        | no
    else:            # no        | no         | no         | yes
        print "none" # no        | no         | no         | yes
    print "end"      # yes       | yes        | yes        | yes

Let's break the above into pieces:

- if ``c1`` is ``True``, then the value of ``c2`` does not matter: neither
  the ``elif c2`` nor the ``else`` is executed.

- if ``c1`` is ``False``, ``c2`` decides whether the ``elif c2`` is executed
  or the ``else`` is.

In other words, ``c1`` and ``c2`` are evaluated sequentially: ``c2`` is checked
only if ``c1`` is ``False``.

|

Assume that, instead, we want to print ``"2"`` independently of whether ``c1``
is ``True`` or not. The only way to do that is to avoid the ``elif``'s::

    print "begin"

    if c1:
        print "1"

    if c2:
        print "2"

    if not c1 and not c2:
        print "0"

Here the ``if``'s do not form a chain anymore: they are *independent* of
one another!
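
The difference can be made visible by recording which branches run when both
conditions are ``True``. This is only a sketch: the lists ``chained`` and
``independent`` are helper names introduced here, and the single-argument
``print(...)`` form is used so that the code runs under Python 2 and Python 3.

```python
c1 = True
c2 = True

# chained version: the elif is skipped because c1 is already True
chained = []
if c1:
    chained.append("1")
elif c2:
    chained.append("2")

# independent version: both conditions are checked
independent = []
if c1:
    independent.append("1")
if c2:
    independent.append("2")

print(chained)       # ['1']
print(independent)   # ['1', '2']
```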

****

**Example**. Python uses the *indentation* to decide which code is "inside"
the ``if`` and which code is "outside".

Let's write a short program to check whether the user is a mentalist::

    print "I am thinking of a number between 1 and 10"
    is_mentalist = int(raw_input("which one? ")) == 7

    print "Computing..."

    if is_mentalist:
        print "CONGRATULATIONS!!!"
        print "you are a mentalist"
    else:
        print "thanks for playing"
        print "better luck next time"

    print "done"

In this example, the ``print`` statements that are indented after the ``if``
and the ``else`` are "inside": they are conditional on ``is_mentalist``.

The other ``print``'s are "outside": they are executed unconditionally.

****

**Example**. This code opens a file and checks whether it is a "valid" FASTA
file. In order to do so, it checks whether (1) the file is empty, and (2) the
file contains lines that start with the ``">"`` character::

    lines = open("data/prot-fasta/1A3A.fasta").readlines()

    if len(lines) == 0:
        print "the file is empty"
    else:
        first_characters = [line[0] for line in lines]
        if ">" not in first_characters:
            print "not a fasta file"
        else:
            print "a fasta file"

    print "done"

**Quiz**:

    #. Can the code print both that the file is empty *and* that the file
       is valid?
    #. Can the code *not* print ``"done"``?
    #. If the file is actually empty, what is the value of the variable
       ``first_characters`` at the end of the execution?
    #. Can I simplify the code by using an ``elif`` statement?

|

Iterative code: ``for``
-----------------------

The ``for`` statement lets us write code that is executed multiple times.

In particular, the code inside the ``for`` is executed once for each and
every element in a collection (e.g. a string, list, tuple or dictionary).

The abstract syntax is::

    collection = range(10) # a list, for instance

    for element in collection:
        body(element)

This ``for`` iterates over all the elements ``element`` in the collection and
executes the ``body(element)`` block on each of them.

Just as in *list comprehensions*, the ``element`` variable is defined by the
``for`` loop. At the Nth iteration, ``element`` will refer to the Nth element
of ``collection``.

The flow of the execution can be modified with the ``break`` and ``continue``
statements, see below for details.

.. warning::

    If ``collection`` is a:

    - ``str``, the ``for`` iterates over the characters.
    - ``list``, the ``for`` iterates over the elements.
    - ``tuple``, the ``for`` iterates over the elements.
    - ``dict``, the ``for`` iterates over the keys.
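
The four cases can be checked with a short sketch. The sample collections and
the list ``seen`` are invented for this illustration; the dictionary has a
single key so that the result does not depend on the order of the keys.

```python
seen = []

for character in "ACG":                  # str: the characters
    seen.append(character)

for element in [1, 2]:                   # list: the elements
    seen.append(element)

for element in ("x", "y"):               # tuple: the elements
    seen.append(element)

for key in {"only_key": "some value"}:   # dict: the keys, not the values
    seen.append(key)

print(seen)   # ['A', 'C', 'G', 1, 2, 'x', 'y', 'only_key']
```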

****

**Example**. This ``for``::

    l = [1, 25, 6, 27, 57, 12]

    for number in l:
        print number

iterates over all elements of ``l``, from beginning to end. At each iteration,
the value of ``number`` changes (as shown by the ``print``).

The above ``for`` is equivalent to this code::

    number = l[0]
    print number

    number = l[1]
    print number

    number = l[2]
    print number

    # etc.

except that it is a **lot** shorter!

****

**Example**. Let's compute the *sum* of all elements of the previous list.

We can do that by modifying the ``for`` as follows::

    l = [1, 25, 6, 27, 57, 12]

    s = 0
    for number in l:
        s = s + number          # equiv. s += number

    print "the sum is", s

Here ``s`` plays the role of a support variable. It is initialized to ``0``
just before the loop. Then, each number in ``l`` is added, in turn, to ``s``.
By the end of the ``for``, ``s`` holds the sum of all the elements of ``l``.

The above code is equivalent to::

    s = 0

    number = l[0]
    s += number

    number = l[1]
    s += number

    # etc.
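
One way to convince ourselves that the loop is right is to compare its result
with Python's built-in ``sum()`` function. The sketch below repeats the loop
so that it is self-contained, and uses the single-argument ``print(...)`` form
so that it runs under both Python 2 and Python 3.

```python
l = [1, 25, 6, 27, 57, 12]

s = 0
for number in l:
    s += number

print(s)             # 128
print(s == sum(l))   # True
```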

****

**Example**. Now let's find the *largest* element in the list. The idea is:

* We use a support variable ``largest_so_far`` that always (at all iterations)
  holds the largest element found so far. It is initialized to some sensible
  value.

* We use a ``for`` to iterate over all elements of the list.

* If the current element is smaller than or equal to ``largest_so_far``,
  the latter is left untouched.

* Otherwise, ``largest_so_far`` is updated to reference the current element.

Once the ``for`` is done, i.e. after iterating over the very last element
of the list, ``largest_so_far`` will hold the largest element in the list.

Let's write::

    l = [1, 25, 6, 27, 57, 12]

    # l[0] is a sensible initial value
    largest_so_far = l[0]

    for number in l[1:]:
        if number > largest_so_far:
            largest_so_far = number

    print "the maximum is", largest_so_far
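
A possible variant, sketched here under the assumption that we also want to
know *where* the largest element is, keeps track of a position rather than of
a value. The name ``position_of_largest`` is introduced only for this example.

```python
l = [1, 25, 6, 27, 57, 12]

# position 0 is a sensible initial guess
position_of_largest = 0

for i in range(1, len(l)):
    # compare the current element with the largest one found so far
    if l[i] > l[position_of_largest]:
        position_of_largest = i

print(position_of_largest)      # 4
print(l[position_of_largest])   # 57
```

Note that both versions assume a non-empty list: ``l[0]`` raises an
``IndexError`` if ``l`` is empty.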

****

**Example**. Given the following table (list of strings)::

    table = [
        "protein domain start end",
        "YNL275W PF00955 236 498",
        "YHR065C SM00490 335 416",
        "YKL053C-A PF05254 5 72",
        "YOR349W PANTHER 353 414",
    ]

I want to convert it to a dictionary like this::

    data = {
        "YNL275W": ("PF00955", 236, 498),
        "YHR065C": ("SM00490", 335, 416),
        "YKL053C-A": ("PF05254", 5, 72),
        "YOR349W": ("PANTHER", 353, 414)
    }

The keys are taken from the first column, while the values are the remaining
columns. Let's write::

    # the dictionary is initially empty
    data = {}

    # for each line in the table (except the header)
    for line in table[1:]:
        words = line.split()

        protein = words[0]
        domain  = words[1]
        pos0    = int(words[2])
        pos1    = int(words[3])

        # update the dictionary
        data[protein] = (domain, pos0, pos1)
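
The same conversion can be written a bit more compactly with tuple unpacking.
The sketch below is one possible variant, not the only way to do it; it also
shows how to read a value back from the dictionary. It uses the
single-argument ``print(...)`` form so that it runs under Python 2 and 3.

```python
table = [
    "protein domain start end",
    "YNL275W PF00955 236 498",
    "YHR065C SM00490 335 416",
    "YKL053C-A PF05254 5 72",
    "YOR349W PANTHER 353 414",
]

data = {}
for line in table[1:]:
    # each data line has exactly four whitespace-separated columns
    protein, domain, start, end = line.split()
    data[protein] = (domain, int(start), int(end))

# read one entry back and compute the length of the annotated region
domain, start, end = data["YHR065C"]
print(domain)        # SM00490
print(end - start)   # 81
```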

****

**Example**. The ``break`` statement lets us interrupt the ``for``. For
instance::

    path = raw_input("write a path to a file: ")
    lines = open(path).readlines()

    for line in lines:
        line = line.strip()
        print "processing:", line

        # if the line is "STOP", we break out of the
        # for loop: the remaining lines are not
        # processed
        if line == "STOP":
            break

    # <--- when Python encounters the break statement,
    #      it "jumps" here

This code reads a text file and prints each line on screen. However, as soon
as it finds a ``"STOP"`` line, it executes the ``break``, which exits the
``for`` loop.

All the lines coming after the ``"STOP"`` line are *not* processed.
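
To try this out without a file, the sketch below replaces the file contents
with a hand-written list (an assumption made for the sake of the example) and
records the processed lines in a list called ``processed``.

```python
# stand-in for the lines read from a file
lines = ["first line\n", "second line\n", "STOP\n", "never processed\n"]

processed = []
for line in lines:
    line = line.strip()
    processed.append(line)

    # leave the loop as soon as the STOP line is found
    if line == "STOP":
        break

print(processed)   # ['first line', 'second line', 'STOP']
```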

****

**Example**. The ``continue`` statement lets us jump straight to the next
iteration, skipping the remainder of the code in the current one. For instance::

    path = raw_input("write a path to a file: ")
    lines = open(path).readlines()

    for line in lines:

        line = line.strip()
        print "processing:", line

        if line == "CONTINUE":
            continue

        # <--- if the continue is executed, the code from here...

        print "this is not a CONTINUE line"

        # <--- ... to here is not executed

This code reads a user-provided text file and prints every line in turn. If
the line is ``"CONTINUE"``, the ``continue`` statement skips the second
``print`` and the ``for`` loop moves on to the *next* line.
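
Here is a file-free sketch of the same behaviour. The input list and the name
``not_skipped`` are made up for this illustration.

```python
# stand-in for the lines read from a file
lines = ["alpha\n", "CONTINUE\n", "beta\n"]

not_skipped = []
for line in lines:
    line = line.strip()

    if line == "CONTINUE":
        # skip the rest of this iteration, go on with the next line
        continue

    not_skipped.append(line)

print(not_skipped)   # ['alpha', 'beta']
```

Unlike ``break``, the ``continue`` does not end the loop: ``"beta"`` is still
processed.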

|

Iterative code: ``while``
-------------------------

The ``while`` statement lets us write code that repeats as long as a certain
condition is true. The ``while`` stops iterating as soon as the condition is
not true anymore.

The abstract syntax is::

    while condition:
        do_stuff()
        condition = check_condition()

As with the ``for``, the ``break`` and ``continue`` statements can be used to
modify the flow of the execution.

.. note::

    The big difference between the ``for`` and ``while`` statements is:

    - ``for element in collection:`` executes **N times**, where N is the
      length of ``collection`` (fewer if a ``break`` interrupts it).

    - ``while condition:`` executes an **indefinite** number of times, that
      is, as long as the condition is true.
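
As a minimal sketch of a loop whose number of iterations is not written
anywhere in the code, the following ``while`` counts how many times a number
can be halved (with integer division) before it reaches 1. The starting value
``100`` and the name ``halvings`` are arbitrary choices for this example.

```python
n = 100
halvings = 0

# the loop keeps going as long as the condition holds
while n > 1:
    n = n // 2        # integer division
    halvings += 1

print(halvings)   # 6
```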

****

**Example**. The ``while`` statement is useful when the value of ``condition``
cannot be known beforehand, for instance when interacting with a user.

Let's write a ``while`` that asks the user whether she wants to stop, and
keeps asking as long as the user does not reply ``"yes"``::

    while raw_input("do you want me to stop? ") != "yes":
        print "Then I'll keep going!"

****

**Example**. Let's see another simple example with a ``break``::

    while True: # this is an infinite while!

        ans = raw_input("what is the capital of Italy? ")

        if ans.lower() == "rome":
            print "correct"
            break

        print "try again"

    # <--- the break jumps here
    print "done"

We cannot really do the same with a ``for`` loop, because the number of
attempts the user will need is not known in advance!

Let's make the code ask the user whether she actually wants to retry::

    while True:

        ans = raw_input("what is the capital of Italy? ")
        if ans.lower() == "rome":
            print "correct"
            break

        ans = raw_input("try again? ")
        if ans.lower() == "no":
            print "all right"
            break
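
To follow the flow of this loop without typing anything, the sketch below
replaces ``raw_input`` with a scripted sequence of replies. This is an
assumption made only for illustration: ``answers`` and ``outcome`` are
hypothetical names, and ``next()`` simply returns the next scripted reply.

```python
# scripted replies standing in for what a user would type
answers = iter(["Milan", "yes", "Rome"])

outcome = None
while True:
    ans = next(answers)          # reply to "what is the capital of Italy?"
    if ans.lower() == "rome":
        outcome = "correct"
        break

    ans = next(answers)          # reply to "try again?"
    if ans.lower() == "no":
        outcome = "gave up"
        break

print(outcome)   # correct
```

The first pass through the loop consumes ``"Milan"`` and ``"yes"``; the second
pass reads ``"Rome"`` and the first ``break`` ends the loop.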

|

Nested code
-----------

Now that we know what ``if``, ``for`` and ``while`` do, we can combine them in
arbitrary ways by properly nesting (that is, indenting) the statements.

****

**Example**. Let's write a simulator of a two-hand clock (hours and minutes)::

    for h in range(24):
        for m in range(60):
            print "time =", h, ":", m

Here the external ``for`` iterates over the 24 hours; for each hour, the inner
``for`` iterates over the 60 minutes.

Every time the internal ``for`` completes, the external ``for`` completes
one iteration.

Let's extend the simulator to an hour-minute-second clock::

    for h in range(24):
        for m in range(60):
            for s in range(60):
                print "time =", h, ":", m, ":", s

Of course, it is possible to take days into consideration by adding one more
external loop that iterates over ``range(1, 366)``.
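
To get a feeling for how quickly nested loops multiply the work, this sketch
counts the iterations instead of printing them. The counters ``minute_ticks``
and ``second_ticks`` are names introduced here.

```python
minute_ticks = 0
for h in range(24):
    for m in range(60):
        minute_ticks += 1

second_ticks = 0
for h in range(24):
    for m in range(60):
        for s in range(60):
            second_ticks += 1

print(minute_ticks)   # 1440  (24 * 60)
print(second_ticks)   # 86400 (24 * 60 * 60)
```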

****

**Example**. I want to check whether a list contains repeated elements and,
if it does, what their positions are. Starting from::

    numbers = [5, 9, 4, 4, 9, 2]

we can use two nested ``for`` statements to iterate over the *pairs* of
elements of ``numbers``.

For every element (let's say the one in position ``i``), I want to check
whether the following elements (those in position ``i+1`` to ``len(numbers) - 1``)
match.

A picture is worth a thousand words::

    +---+---+---+---+---+---+
    | 5 | 9 | 4 | 4 | 9 | 2 |
    +---+---+---+---+---+---+
      ^
      i
        \__________________/
          the possible positions of the 2nd element


    +---+---+---+---+---+---+
    | 5 | 9 | 4 | 4 | 9 | 2 |
    +---+---+---+---+---+---+
          ^           ^
          i           MATCH!
            \______________/
          the possible positions of the 2nd element


    +---+---+---+---+---+---+
    | 5 | 9 | 4 | 4 | 9 | 2 |
    +---+---+---+---+---+---+
              ^   ^
              i   MATCH!
                \__________/
          the possible positions of the 2nd element

Let's write::

    matches = []

    for i in range(len(numbers)):

        # the number at position i
        number_at_i = numbers[i]

        for j in range(i + 1, len(numbers)):

            # the number at position j
            number_at_j = numbers[j]

            # do they match?
            if number_at_i == number_at_j:

                # they do! let's store their
                # positions
                matches.append((i, j))

    print matches

Let's verify whether ``matches`` actually identifies pairs of identical
elements::

    for pair in matches:
        number_at_i = numbers[pair[0]]
        number_at_j = numbers[pair[1]]
        print number_at_i == number_at_j
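
For reference, here is a compact, self-contained sketch of the same search
that also counts how many pairs are compared. The counter ``pairs_examined``
is an addition made for this example only.

```python
numbers = [5, 9, 4, 4, 9, 2]

matches = []
pairs_examined = 0

for i in range(len(numbers)):
    # j only runs over the positions that come after i
    for j in range(i + 1, len(numbers)):
        pairs_examined += 1
        if numbers[i] == numbers[j]:
            matches.append((i, j))

print(matches)          # [(1, 4), (2, 3)]
print(pairs_examined)   # 15
```

Since the inner loop starts at ``i + 1``, each pair is examined only once:
with 6 elements there are 5 + 4 + 3 + 2 + 1 = 15 pairs.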

****

**Example**. Given the contents of a FASTA file::

    >>> lines = open("data/prot-fasta/3J01.fasta").readlines()
    >>> print lines
    [
        ">3J01:0|PDBID|CHAIN|SEQUENCE",
        "AVQQNKPTRSKRGMRRSHDALTAVTSLSVDKTSGEKHLRHHITADGYYRGRKVIAK",
        ">3J01:1|PDBID|CHAIN|SEQUENCE",
        "AKGIREKIKLVSSAGTGHFYTTTKNKRTKPEKLELKKFDPVVRQHVIYKEAKIK",
        ">3J01:2|PDBID|CHAIN|SEQUENCE",
        "MKRTFQPSVLKRNRSHGFRARMATKNGRQVLARRRAKGRARLTVSK",
        ">3J01:3|PDBID|CHAIN|SEQUENCE",
        # ...
    ]

I want to convert ``lines`` into a dictionary that maps from each header (key)
to the corresponding sequence (value). Let's write::

    sequence_of = {}

    for line in lines:

        # remove newlines and spaces around the line
        line = line.strip()

        if line.startswith(">"):
            # this is a header, store it for later use
            header = line
        else:
            # this is a sequence
            sequence = line

            # now let's use the header we read at the
            # *previous* iteration and the sequence we
            # got at the *current* iteration to update
            # the dictionary
            sequence_of[header] = sequence

    # we are done; print the dictionary
    print sequence_of

This code works as long as the sequences only span one line. However, this is
not the case for the FASTA file we have. Looking closer, we see that ``lines``
includes these lines::

    lines = [
        # ...
        ">3J01:5|PDBID|CHAIN|SEQUENCE",
        "MAKLTKRMRVIREKVDATKQYDINEAIALLKELATAKFVESVDVAVNLGIDARKSDQNVRGATVLPHGTGRSVRVAVFTQ",
        "GANAEAAKAAGAELVGMEDLADQIKKGEMNFDVVIASPDAMRVVGQLGQVLGPRGLMPNPKVGTVTPNVAEAVKNAKAGQ",
        "VRYRNDKNGIIHTTIGKVDFDADKLKENLEALLVALKKAKPTQAKGVYIKKVSISTTMGAGVAVDQAGLSASVN",
        # ...
    ]

So there is one header, then a multi-line sequence.

Unfortunately with the above code, the first line of the sequence is
overwritten by the second line of the sequence, which is then overwritten by
the third line of the sequence. In other words, *only the last line of a
multi-line sequence makes it to the dictionary*.

Let's fix the code::

    sequence_of = {}

    for line in lines:

        line = line.strip()

        if line.startswith(">"):
            header = line
        else:
            sequence = line

            # the first time we encounter a header, we
            # associate it to an empty string
            if header not in sequence_of:
                sequence_of[header] = ""

            # now we take whatever sequence is associated
            # to the header and concatenate it with the
            # current line
            old_sequence = sequence_of[header]
            new_sequence = old_sequence + sequence
            sequence_of[header] = new_sequence

A shorter version::

    sequence_of = {}

    for line in lines:

        line = line.strip()

        if line.startswith(">"):
            header = line
        else:
            if header not in sequence_of:
                sequence_of[header] = line
            else:
                sequence_of[header] += line
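
The multi-line logic can be tried on a tiny, hand-written input. The headers
and sequences below are invented for the example and are much shorter than
real ones.

```python
# a toy FASTA file: the first sequence spans two lines
lines = [
    ">header_A\n",
    "AAAA\n",
    "CCCC\n",
    ">header_B\n",
    "GGGG\n",
]

sequence_of = {}

for line in lines:
    line = line.strip()

    if line.startswith(">"):
        header = line
    else:
        if header not in sequence_of:
            # first line of the sequence for this header
            sequence_of[header] = line
        else:
            # following lines are appended to what we already have
            sequence_of[header] += line

print(sequence_of[">header_A"])   # AAAACCCC
print(sequence_of[">header_B"])   # GGGG
```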

****

**Example**. Same setup as before. Some anonymous jester wrote the FASTA
file wrong: sequences come **before** the corresponding headers. For
instance::

    wrong_fasta = [
        # first sequence
        "AVQQNKPTRSKRGMRRSHDALTAVTSLSVDKTSGEKHLRHHITADGYYRGRKVIAK",
        # first header
        ">3J01:0|PDBID|CHAIN|SEQUENCE",

        # second sequence
        "AKGIREKIKLVSSAGTGHFYTTTKNKRTKPEKLELKKFDPVVRQHVIYKEAKIK",
        # second header
        ">3J01:1|PDBID|CHAIN|SEQUENCE",
    ]

Our code of course relies on the header coming before the sequence -- so it
does not work in this case. How do we make it work again?

We have to rewrite the code based on the fact that the header is *not* known
when we get the sequence. It is true however that we know the sequence when
we get the header.

Let's write::

    sequence_of = {}

    # this variable is used to hold the multi-line
    # sequence we have seen "so far".
    # it is initialized with an empty string, because
    # we have not seen any sequence yet!
    latest_sequence_seen = ""

    for line in wrong_fasta:
        line = line.strip()

        if line.startswith(">"):
            # this is a header line. at this point the
            # sequence is known, and we can update the
            # dictionary
            sequence_of[line] = latest_sequence_seen

            # reset the latest sequence seen (so as not to
            # mix the sequences of different proteins/genes)
            latest_sequence_seen = ""
        else:

            # this is a sequence line. we do not know
            # the header yet. let's just add this sequence
            # to the latest sequence seen
            latest_sequence_seen += line

    print sequence_of
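
As before, the idea can be checked on a toy input. The headers and sequences
below are invented for the example; the first sequence spans two lines to
show that the pieces are accumulated before the header arrives.

```python
# a toy "wrong" FASTA file: sequences come before their headers
wrong_fasta = [
    "AAAA\n",
    "CCCC\n",
    ">header_A\n",
    "GGGG\n",
    ">header_B\n",
]

sequence_of = {}
latest_sequence_seen = ""

for line in wrong_fasta:
    line = line.strip()

    if line.startswith(">"):
        # the header closes the sequence accumulated so far
        sequence_of[line] = latest_sequence_seen
        latest_sequence_seen = ""
    else:
        # sequence line: keep accumulating until a header shows up
        latest_sequence_seen += line

print(sequence_of[">header_A"])   # AAAACCCC
print(sequence_of[">header_B"])   # GGGG
```

Note that any sequence lines coming after the very last header would be left
in ``latest_sequence_seen`` and never stored.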