Python: Complex Statements

Conditional code: if

The if/elif/else statements let you write code that gets executed if and only if some condition is satisfied.

For instance:

if condition:
    print "the condition is True"

executes the print statement if and only if condition evaluates to True.

We can also have multiple mutually exclusive alternatives:

if condition:
    print "the condition is True"
else:
    print "the condition is not True"

Here only one branch is executed, based on the value of condition.

The same is true here:

if condition1:
    print "condition1 is True"
elif condition2:
    print "condition1 is not True"
    print "but condition2 is!"
elif condition3:
    print "condition1 and condition2 are not True"
    print "but condition3 is!"
else:
    print "no condition is True"

The if, elif and else form a “chain”: only one of the branches is executed.
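As a concrete illustration of the chain, here is a minimal sketch. The variable names and the thresholds are made up for this example, and print is written with parentheses so that the sketch runs under both Python 2 and Python 3. Note that the value 10 satisfies two of the conditions, yet only the first matching branch runs:

```python
temperature = 10   # hypothetical value, try changing it

if temperature < 0:
    state = "freezing"
elif temperature < 20:
    # 10 < 20 is True, so this branch runs...
    state = "cold"
elif temperature < 30:
    # ...and this one is skipped, even though 10 < 30 is also True
    state = "warm"
else:
    state = "hot"

print(state)   # prints: cold
```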


Example. Suppose we have two Booleans c1 and c2. Let’s see in detail which lines are executed based on the value of the two variables:

                 # c1   c2   | c1   c2    | c1    c2   | c1    c2
                 # True True | True False | False True | False False
                 # ----------+------------+------------+------------
print "begin"    # yes       | yes        | yes        | yes
if c1:           # yes       | yes        | yes        | yes
    print "1"    # yes       | yes        | no         | no
elif c2:         # no        | no         | yes        | yes
    print "2"    # no        | no         | yes        | no
else:            # no        | no         | no         | yes
    print "none" # no        | no         | no         | yes
print "end"      # yes       | yes        | yes        | yes

Let’s break the above into pieces:

  • if c1 is True, then the value of c2 does not matter: neither the elif c2 branch nor the else branch is executed.
  • if c1 is False, c2 decides whether the elif c2 branch or the else branch is executed.

In other words, c1 and c2 are evaluated sequentially: c2 is checked only if c1 is False.


Assume that, instead, we want to print "2" regardless of whether c1 is True or not. To do that, we have to avoid the elif:

print "begin"

if c1:
    print "1"

if c2:
    print "2"

if not c1 and not c2:
    print "0"

Here the ifs do not form a chain anymore: they are independent of one another!
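The difference between a chain and independent ifs can be checked side by side. The following is only a sketch: the two lists chained and independent are names introduced here to record which branches ran, and print uses parentheses so the code runs under both Python 2 and Python 3.

```python
c1 = True
c2 = True

# chained version: at most one branch runs
chained = []
if c1:
    chained.append("1")
elif c2:
    chained.append("2")
else:
    chained.append("none")

# independent version: every condition is checked on its own
independent = []
if c1:
    independent.append("1")
if c2:
    independent.append("2")
if not c1 and not c2:
    independent.append("0")

print(chained)       # prints: ['1']
print(independent)   # prints: ['1', '2']
```

With both variables set to True, the chain stops at the first branch, while the independent version also reaches the second if.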


Example. Python uses the indentation to decide which code is “inside” the if and which code is “outside”.

Let’s write a short program to check whether the user is a mentalist:

print "I am thinking of a number between 1 and 10"
is_mentalist = int(raw_input("which one? ")) == 72

print "Computing..."

if is_mentalist:
    print "CONGRATULATIONS!!!"
    print "you are a mentalist"
else:
    print "thanks for playing"
    print "better luck next time"

print "done"

In this example, the print statements that are indented after the if and the else are “inside”: they are conditional on is_mentalist.

The other prints are “outside”: they are executed unconditionally.


Example. This code opens a file and checks whether it is a “valid” FASTA file. In order to do so, it checks whether (1) the file is empty, and (2) the file contains lines that start with the ">" character:

lines = open("data/prot-fasta/1A3A.fasta").readlines()

if len(lines) == 0:
    print "the file is empty"
else:
    first_characters = [line[0] for line in lines]
    if ">" not in first_characters:
        print "not a fasta file"
    else:
        print "a fasta file"

print "done"

Quiz:

  1. Can the code print both that the file is empty and that the file is valid?
  2. Can the code not print "done"?
  3. If the file is actually empty, what is the value of the variable first_characters at the end of the execution?
  4. Can I simplify the code by using an elif statement?

Iterative code: for

The for statement lets you write code that is executed multiple times.

In particular, the code inside the for is executed once for each and every element in a collection (e.g. a string, list, tuple or dictionary).

The abstract syntax is:

collection = range(10) # a list, for instance

for element in collection:
    body(element)

This for iterates over all the elements of collection and executes the body(element) block on each of them.

Just like for list comprehensions, the element variable is defined by the for loop. At the Nth iteration, element will refer to the Nth element of collection.

The flow of the execution can be modified with the break and continue statements; see below for details.

Warning

If collection is a:

  • str, the for iterates over the characters.
  • list, the for iterates over the elements.
  • tuple, the for iterates over the elements.
  • dict, the for iterates over the keys.
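The four cases can be tried out with a few lines of code. This is a minimal sketch on made-up sample data (the variable names are introduced here only for illustration); print is written with parentheses so it runs under both Python 2 and Python 3.

```python
# str: one character per iteration
characters = []
for character in "ACG":
    characters.append(character)
print(characters)        # prints: ['A', 'C', 'G']

# list and tuple: one element per iteration
doubled = []
for element in [1, 2, 3]:
    doubled.append(element * 2)
for element in (10, 20):
    doubled.append(element * 2)
print(doubled)           # prints: [2, 4, 6, 20, 40]

# dict: one *key* per iteration; the value has to be
# looked up explicitly
codon_to_aminoacid = {"ATG": "M"}
keys = []
for key in codon_to_aminoacid:
    keys.append(key)
    print(key + " -> " + codon_to_aminoacid[key])
```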

Example. This for:

l = [1, 25, 6, 27, 57, 12]

for number in l:
    print number

iterates over all elements of l, from beginning to end. At each iteration, the value of number changes (as shown by the print).

The above for is equivalent to this code:

number = l[0]
print number

number = l[1]
print number

number = l[2]
print number

# etc.

except that it is a lot shorter!


Example. Let’s compute the sum of all elements of the previous list.

We can do that by modifying the for as follows:

l = [1, 25, 6, 27, 57, 12]

s = 0
for number in l:
    s = s + number          # equiv. s += number

print "the sum is", s

Here s plays the role of a support variable. It is initialized to 0 just before the loop. Then, each number in l is added, in turn, to s. By the end of the for, s holds the sum of all the elements of l.

The above code is equivalent to:

s = 0

number = l[0]
s += number

number = l[1]
s += number

# etc.

Example. Now let’s find the largest element in the list. The idea is:

  • We use a support variable largest_so_far that always (at all iterations) holds the largest element found so far. It is initialized to some sensible value.
  • We use a for to iterate over all elements of the list.
  • If the current element is smaller than or equal to largest_so_far, the latter is left untouched.
  • Otherwise, largest_so_far is updated to reference the current element.

Once the for is done, i.e. after iterating over the very last element of the list, largest_so_far will hold the largest element in the list.

Let’s write:

l = [1, 25, 6, 27, 57, 12]

# l[0] is a sensible initial value
largest_so_far = l[0]

for number in l[1:]:
    if number > largest_so_far:
        largest_so_far = number

print "the maximum is", largest_so_far
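The hand-written loops for the sum and for the maximum can be cross-checked against Python's built-in sum() and max() functions. The sketch below merges the two loops into one for brevity, which is a choice made here and not part of the examples above; print uses parentheses so the code runs under both Python 2 and Python 3.

```python
l = [1, 25, 6, 27, 57, 12]

s = 0
largest_so_far = l[0]

for number in l:
    s += number
    if number > largest_so_far:
        largest_so_far = number

# sum() and max() are built-in functions
print(s == sum(l))                 # prints: True
print(largest_so_far == max(l))    # prints: True
```

Writing the loops by hand is still worthwhile: the same "support variable" pattern applies to computations that have no ready-made built-in.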

Example. Given the following table (list of strings):

table = [
    "protein domain start end",
    "YNL275W PF00955 236 498",
    "YHR065C SM00490 335 416",
    "YKL053C-A PF05254 5 72",
    "YOR349W PANTHER 353 414",
]

I want to convert it to a dictionary like this:

data = {
    "YNL275W": ("PF00955", 236, 498),
    "YHR065C": ("SM00490", 335, 416),
    "YKL053C-A": ("PF05254", 5, 72),
    "YOR349W": ("PANTHER", 353, 414)
}

The keys are taken from the first column, while the values are the remaining columns. Let’s write:

# the dictionary is initially empty
data = {}

# for each line in the table (except the header)
for line in table[1:]:
    words = line.split()

    protein = words[0]
    domain  = words[1]
    pos0    = int(words[2])
    pos1    = int(words[3])

    # update the dictionary
    data[protein] = (domain, pos0, pos1)
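As a possible variation, the four columns can be unpacked in a single assignment. This sketch assumes that every data line has exactly four whitespace-separated columns (otherwise the unpacking raises an error); the names start and end are chosen here to match the header. print is written with parentheses so it runs under both Python 2 and Python 3.

```python
table = [
    "protein domain start end",
    "YNL275W PF00955 236 498",
    "YHR065C SM00490 335 416",
    "YKL053C-A PF05254 5 72",
    "YOR349W PANTHER 353 414",
]

data = {}

for line in table[1:]:
    # unpack the four columns in one go
    protein, domain, start, end = line.split()
    data[protein] = (domain, int(start), int(end))

# look up one protein to check the result
print(data["YNL275W"])   # prints: ('PF00955', 236, 498)
```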

Example. The break statement interrupts the for before it has gone through all the elements. For instance:

path = raw_input("write a path to a file: ")
lines = open(path).readlines()

for line in lines:
    line = line.strip()
    print "processing:", line

    # if the line is "STOP", we break out of the
    # for loop: the remaining lines are not
    # processed
    if line == "STOP":
        break

# <--- when Python encounters the break statement,
#      it "jumps" here

This code reads a text file and prints each line on screen. However, as soon as it finds a "STOP" line, it executes the break, which exits the for loop.

All the lines coming after the "STOP" line are not processed.
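To try this out without a file, the same loop can be run on a list of strings. The sample lines and the processed list below are made up for this sketch; print uses parentheses so the code runs under both Python 2 and Python 3.

```python
# hypothetical file contents, as readlines() would return them
lines = ["first\n", "second\n", "STOP\n", "never reached\n"]

processed = []

for line in lines:
    line = line.strip()
    processed.append(line)

    if line == "STOP":
        break

# the line after "STOP" was never processed
print(processed)   # prints: ['first', 'second', 'STOP']
```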

Example. The continue statement jumps straight to the next iteration, skipping the remainder of the code in the current one. For instance:

path = raw_input("write a path to a file: ")
lines = open(path).readlines()

for line in lines:

    line = line.strip()
    print "processing:", line

    if line == "CONTINUE":
        continue

    # <--- if the continue is executed, the code from here...

    print "this is not a CONTINUE line"

    # <--- ... to here is not executed

This code reads a user-provided text file and prints every line in turn. If the line is "CONTINUE", the continue statement skips over the second print, and the for loop moves on to the next line.
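A typical use of continue is filtering: lines that should be ignored are skipped, and everything else is processed. Here is a small sketch on made-up data (the list names are introduced only for this example); print uses parentheses so the code runs under both Python 2 and Python 3.

```python
# hypothetical file contents, as readlines() would return them
lines = ["alpha\n", "CONTINUE\n", "beta\n"]

regular_lines = []

for line in lines:
    line = line.strip()

    if line == "CONTINUE":
        # skip the append below and move on to the next line
        continue

    regular_lines.append(line)

print(regular_lines)   # prints: ['alpha', 'beta']
```

Unlike break, continue does not leave the loop: the line "beta" is still processed.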


Iterative code: while

The while statement lets you write code that repeats as long as a certain condition is true. The while stops iterating as soon as the condition is not true anymore.

The abstract syntax is:

while condition:
    do_stuff()
    condition = check_condition()

As with the for, the break and continue statements can be used to modify the flow of the execution.

Note

The big difference between the for and while statements is:

  • for element in collection: executes N times, where N is the length of collection.
  • while condition: executes an indefinite number of times, that is, as long as the condition is true.
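For instance, the number of iterations of the following while is not written anywhere in the code: it depends on the starting value. This is just a sketch with a made-up starting value; print uses parentheses so the code runs under both Python 2 and Python 3.

```python
value = 100.0   # hypothetical starting value
steps = 0

# keep halving until the value drops below 1
while value >= 1:
    value = value / 2
    steps += 1

print(steps)   # prints: 7
```

Changing the starting value changes the number of iterations, without touching the loop itself.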

Example. The while statement is useful when the value of condition cannot be known beforehand, for instance when interacting with a user.

Let’s write a while that asks the user whether she wants to stop, and keeps asking as long as the user does not reply "yes":

while raw_input("do you want me to stop? ") != "yes":
    print "Then I'll keep going!"

Example. Let’s see another simple example with a break:

while True: # this is an infinite while!

    ans = raw_input("what is the capital of Italy? ")

    if ans.lower() == "rome":
        print "correct"
        break

    print "try again"

# <--- the break jumps here
print "done"

I cannot really do the same with a for loop, because the number of attempts the user needs is not known in advance!

Let’s make the code ask the user whether she actually wants to retry:

while True:

    ans = raw_input("what is the capital of Italy? ")
    if ans.lower() == "rome":
        print "correct"
        break

    ans = raw_input("try again? ")
    if ans.lower() == "no":
        print "all right"
        break

Nested code

Now that we know what if, for and while do, we can combine them in arbitrary ways by properly nesting (that is, indenting) the statements.


Example. Let’s write a simulator of a two-hand clock (hours and minutes):

for h in range(24):
    for m in range(60):
        print "time =", h, ":", m

Here the external for iterates over the 24 hours; for each hour, the inner for iterates over the 60 minutes.

Every time the internal for completes, the external for completes one iteration.

Let’s extend the simulator to an hour-minute-second clock:

for h in range(24):
    for m in range(60):
        for s in range(60):
            print "time =", h, ":", m, ":", s

Of course, it is possible to take days into consideration by adding one more external loop that iterates over range(1, 366).
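With nested loops, the body of the innermost loop runs once for every combination of the loop variables, so the total number of iterations is the product of the lengths of the ranges. A quick sketch to check this for the hours-and-minutes clock (the counter ticks is a name introduced here; print uses parentheses so the code runs under both Python 2 and Python 3):

```python
ticks = 0

for h in range(24):
    for m in range(60):
        # executed once per (hour, minute) pair
        ticks += 1

print(ticks)             # prints: 1440
print(ticks == 24 * 60)  # prints: True
```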


Example. I want to check whether a list contains repeated elements and, if it does, what their positions are. Starting from:

numbers = [5, 9, 4, 4, 9, 2]

we can use two nested for statements to iterate over the pairs of elements of numbers.

For every element (let’s say the one in position i), I want to check whether the following elements (those in position i+1 to len(numbers) - 1) match.

A picture is worth a thousand words:

+---+---+---+---+---+---+
| 5 | 9 | 4 | 4 | 9 | 2 |
+---+---+---+---+---+---+
  ^
  i
    \__________________/
      the possible positions of the 2nd element


+---+---+---+---+---+---+
| 5 | 9 | 4 | 4 | 9 | 2 |
+---+---+---+---+---+---+
      ^           ^
      i           MATCH!
        \______________/
      the possible positions of the 2nd element


+---+---+---+---+---+---+
| 5 | 9 | 4 | 4 | 9 | 2 |
+---+---+---+---+---+---+
          ^   ^
          i   MATCH!
            \__________/
      the possible positions of the 2nd element

Let’s write:

matches = []

for i in range(len(numbers)):

    # the number at position i
    number_at_i = numbers[i]

    for j in range(i + 1, len(numbers)):

        # the number at position j
        number_at_j = numbers[j]

        # do they match?
        if number_at_i == number_at_j:

            # they do! let's store their
            # positions
            matches.append((i, j))

print matches

Let’s verify whether matches actually identifies pairs of identical elements:

for pair in matches:
    number_at_i = numbers[pair[0]]
    number_at_j = numbers[pair[1]]
    print number_at_i == number_at_j
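For comparison, here is a different technique that needs only one pass over the list: a dictionary that maps each value to the list of positions where it occurs. This is an alternative sketch, not the nested-loop method shown above; the names positions_of and repeated are made up for this example, and print uses parentheses so the code runs under both Python 2 and Python 3.

```python
numbers = [5, 9, 4, 4, 9, 2]

# map each value to the positions where it occurs
positions_of = {}
for i in range(len(numbers)):
    number = numbers[i]
    if number not in positions_of:
        positions_of[number] = []
    positions_of[number].append(i)

# keep only the values that occur more than once
repeated = {}
for number in positions_of:
    if len(positions_of[number]) > 1:
        repeated[number] = positions_of[number]

print(repeated == {9: [1, 4], 4: [2, 3]})   # prints: True
```

The positions agree with the pairs (1, 4) and (2, 3) found by the nested loops for this list.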

Example. Given the contents of a FASTA file:

>>> lines = open("data/prot-fasta/3J01.fasta").readlines()
>>> print lines
[
    ">3J01:0|PDBID|CHAIN|SEQUENCE",
    "AVQQNKPTRSKRGMRRSHDALTAVTSLSVDKTSGEKHLRHHITADGYYRGRKVIAK",
    ">3J01:1|PDBID|CHAIN|SEQUENCE",
    "AKGIREKIKLVSSAGTGHFYTTTKNKRTKPEKLELKKFDPVVRQHVIYKEAKIK",
    ">3J01:2|PDBID|CHAIN|SEQUENCE",
    "MKRTFQPSVLKRNRSHGFRARMATKNGRQVLARRRAKGRARLTVSK",
    ">3J01:3|PDBID|CHAIN|SEQUENCE",
    # ...
]

I want to convert lines into a dictionary that maps from each header (key) to the corresponding sequence (value). Let’s write:

sequence_of = {}

for line in lines:

    # remove newlines and spaces around the line
    line = line.strip()

    if line.startswith(">"):
        # this is a header, store it for later use
        header = line
    else:
        # this is a sequence
        sequence = line

        # now let's use the header we read at the
        # *previous* iteration and the sequence we
        # got at the *current* iteration to update
        # the dictionary
        sequence_of[header] = sequence

# we are done; print the dictionary
print sequence_of

This code works as long as the sequences only span one line. However, this is not the case for the FASTA file we have. Looking closer, we see that lines includes these lines:

lines = [
    # ...
    ">3J01:5|PDBID|CHAIN|SEQUENCE",
    "MAKLTKRMRVIREKVDATKQYDINEAIALLKELATAKFVESVDVAVNLGIDARKSDQNVRGATVLPHGTGRSVRVAVFTQ",
    "GANAEAAKAAGAELVGMEDLADQIKKGEMNFDVVIASPDAMRVVGQLGQVLGPRGLMPNPKVGTVTPNVAEAVKNAKAGQ",
    "VRYRNDKNGIIHTTIGKVDFDADKLKENLEALLVALKKAKPTQAKGVYIKKVSISTTMGAGVAVDQAGLSASVN",
    # ...
]

So there is one header, then a multi-line sequence.

Unfortunately, with the above code, the first line of the sequence is overwritten by the second line of the sequence, which is then overwritten by the third line of the sequence. In other words, only the last line of a multi-line sequence makes it to the dictionary.

Let’s fix the code:

sequence_of = {}

for line in lines:

    line = line.strip()

    if line.startswith(">"):
        header = line
    else:
        sequence = line

        # the first time we encounter a header, we
        # associate it to an empty string
        if header not in sequence_of:
            sequence_of[header] = ""

        # now we take whatever sequence is associated
        # to the header and concatenate it with the
        # current line
        old_sequence = sequence_of[header]
        new_sequence = old_sequence + sequence
        sequence_of[header] = new_sequence

A shorter version:

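# start from an empty dictionary, otherwise the lines
# would be appended to the sequences stored earlier
sequence_of = {}

for line in lines:

    line = line.strip()

    if line.startswith(">"):
        header = line
    else:
        if header not in sequence_of:
            sequence_of[header] = line
        else:
            sequence_of[header] += line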
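To convince ourselves that multi-line sequences are now handled, the loop can be run on a tiny hand-made input. The headers and sequences below are invented for this sketch and are not taken from the 3J01 file; print uses parentheses so the code runs under both Python 2 and Python 3.

```python
# hypothetical FASTA lines, as readlines() would return them
lines = [
    ">header1\n",
    "AAAA\n",
    "CCCC\n",
    ">header2\n",
    "GGGG\n",
]

sequence_of = {}

for line in lines:
    line = line.strip()

    if line.startswith(">"):
        header = line
    else:
        if header not in sequence_of:
            sequence_of[header] = line
        else:
            sequence_of[header] += line

# the two lines of the first sequence are joined
print(sequence_of[">header1"])   # prints: AAAACCCC
print(sequence_of[">header2"])   # prints: GGGG
```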

Example. Same setup as before. Some anonymous jester wrote the FASTA file wrong: sequences come before the corresponding headers. For instance:

wrong_fasta = [
    # first sequence
    "AVQQNKPTRSKRGMRRSHDALTAVTSLSVDKTSGEKHLRHHITADGYYRGRKVIAK",
    # first header
    ">3J01:0|PDBID|CHAIN|SEQUENCE",

    # second sequence
    "AKGIREKIKLVSSAGTGHFYTTTKNKRTKPEKLELKKFDPVVRQHVIYKEAKIK",
    # second header
    ">3J01:1|PDBID|CHAIN|SEQUENCE",
]

Our code of course relies on the header coming before the sequence, so it does not work in this case. How do we make it work again?

We have to rewrite the code based on the fact that the header is not known when we get the sequence. It is true however that we know the sequence when we get the header.

Let’s write:

sequence_of = {}

# this variable is used to hold the multi-line
# sequence we have seen "so far"
# it is initialized with an empty string, because
# we have not seen any sequence yet!
latest_sequence_seen = ""

for line in wrong_fasta:
    line = line.strip()

    if line.startswith(">"):
        # this is a header line. at this point the
        # sequence is known, and we can update the
        # dictionary
        sequence_of[line] = latest_sequence_seen

        # reset the latest sequence seen (so as not to
        # mix the sequences of different proteins/genes)
        latest_sequence_seen = ""
    else:

        # this is a sequence line. we do not know
        # the header yet. let's just add this line
        # to the sequence seen so far
        latest_sequence_seen += line

print sequence_of
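One consequence of this approach is worth checking: a sequence that is not followed by any header never reaches the dictionary, and is simply left over in latest_sequence_seen. The sketch below uses invented data with such a trailing sequence; print uses parentheses so the code runs under both Python 2 and Python 3.

```python
# hypothetical "reversed" FASTA: sequences come before headers,
# and the last sequence has no header at all
wrong_fasta = [
    "AAAA",
    "CCCC",
    ">header1",
    "GGGG",
    ">header2",
    "TTTT",
]

sequence_of = {}
latest_sequence_seen = ""

for line in wrong_fasta:
    line = line.strip()

    if line.startswith(">"):
        sequence_of[line] = latest_sequence_seen
        latest_sequence_seen = ""
    else:
        latest_sequence_seen += line

print(sequence_of[">header1"])   # prints: AAAACCCC
print(sequence_of[">header2"])   # prints: GGGG

# the headerless sequence is still sitting here
print(latest_sequence_seen)      # prints: TTTT
```

Whether the leftover should be reported as an error or silently ignored depends on the application.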