============================
Python: Strings (Solutions)
============================

.. note::

    Later on in the solutions, I will sometimes use the backslash character
    ``\`` at the end of a line.

    When used this way, ``\`` tells Python that the command continues on the
    following line, allowing you to break long commands over multiple lines.
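
As a small illustration of the note above, here is a minimal sketch of a line continuation. The variable names and the sample string are made up for this example; the parenthesised variant is shown only as a common alternative and is not used in the solutions below.

```python
# A long boolean expression split over two lines with a trailing backslash.
text = "xx-some text-x"
condition = text.startswith("xx") and \
            text.endswith("x")

# The same expression wrapped in parentheses needs no backslash:
# Python keeps reading until the parenthesis is closed.
same_condition = (text.startswith("xx") and
                  text.endswith("x"))

print(condition == same_condition)
```

Note that nothing, not even a space or a comment, may follow the backslash on the same line.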

#. Solutions:

   #. Solution::

        #       12345
        text = "     "
        print text
        print len(text)

   #. Solution::

        at_least_one_space = " " in text

        # check whether it works
        print " " in "nospaceatallhere"
        print " " in "onlyonespacehere--> <--"
        print " " in "more spaces in here"

   #. Solution::

        exactly_5_characters = len(text) == 5

        # check whether it works
        print len("1234") == 5
        print len("12345") == 5
        print len("123456") == 5

   #. Solution::

        empty_string = ""
        print len(empty_string) == 0

   #. Solution::

        base = "Python is great"
        repeats = base * 100

        # check whether the length is correct
        print len(repeats) == len(base) * 100

   #. Solution::

        part_1 = "but cell"
        part_2 = "biology"
        part_3 = "is way better"

        text = (part_1 + part_2 + part_3) * 1000

   #. Let's try this::

        start_with_1 = "12345".startswith(1)

      but Python gives an error message::

        Traceback (most recent call last):
          File "<stdin>", line 1, in <module>
        TypeError: startswith first arg must be str, unicode, or tuple, not int
        #                     ^^^^^^^^^^^^^^^^^^^^^                     ^^^^^^^

      The error message (see the highlighted parts) says that ``startswith()``
      requires its argument to be a string, not an *int*; in our case the
      argument ``1`` is an int.

      The solution is::

        start_with_1 = "12345".startswith("1")
        print start_with_1

      the value is ``True``, as expected.

   #. Solution::

        string = "\\"
        string
        print string
        print len(string)                  # 1

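      alternatively, using the character code (note that the raw string
      ``r"\"`` is *not* valid, because a raw string cannot end with a single
      backslash)::

        string = chr(92)                   # 92 is the ASCII code of the backslash
        string
        print string
        print len(string)                  # 1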

   #. Already checked before, the answer is no. Anyway::

        backslash = "\\"                    # r"\" would be a syntax error

        print backslash*2 in "\\"           # False

   #. First method::

        backslash = "\\"                    # r"\" would be a syntax error

        condition = text.startswith(backslash) or \
                     text.endswith(backslash)

      Second method::

        condition = (text[0] == backslash) or \
                     (text[-1] == backslash)

   #. Solution::

        condition = \
             text.startswith("xxx") or \
            (text.startswith("xx") and text.endswith("x")) or \
            (text.startswith("x")  and text.endswith("xx")) or \
                                        text.endswith("xxx")

      It is worth checking the condition against the examples provided in the exercise.

#. Solution::

    s = "0123456789"
    print len(s)                        # 10

   Which of the following extractions are correct?

   #. ``s[9]``: correct, extracts the last character.
   #. ``s[10]``: invalid.
   #. ``s[:10]``: correct, extracts all characters (remember that the second index, ``10`` in this case, is exclusive).
   #. ``s[1000]``: invalid.
   #. ``s[0]``: correct, extracts the first character.
   #. ``s[-1]``: correct, extracts the last character.
   #. ``s[1:5]``: correct, extracts from the 2nd to the 5th character.
   #. ``s[-1:-5]``: correct, but nothing is extracted (indexes are inverted!)
   #. ``s[-5:-1]``: correct, extracts from the 6th to the 9th character.
   #. ``s[-1000]``: invalid.
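
   The answers above can be checked directly in the interpreter. The following is only a sketch; the variable names are chosen for this example and do not come from the exercise.

   ```python
   s = "0123456789"

   # Slices never raise an error: out-of-range bounds are simply clipped.
   all_chars = s[:10]        # the whole string
   middle = s[1:5]           # 2nd to 5th character
   inverted = s[-1:-5]       # empty: the start lies to the right of the end
   tail = s[-5:-1]           # 6th to 9th character
   clipped = s[5:1000]       # the end is clipped to len(s)

   # Indexing a single position, instead, fails when it is out of range.
   try:
       s[10]
       index_error_raised = False
   except IndexError:
       index_error_raised = True
   ```

   This is why ``s[10]`` is invalid while ``s[:10]`` is fine: a single index must point at an existing character, whereas slice bounds are only limits.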

#. Solution (one of two possible solutions)::

    text = """never say \"never!\"
    \\said the sad turtle."""

#. Solution::

    string = "a 1 b 2 c 3"

    digit = "DIGIT"
    character = "CHARACTER"

    result = string.replace("1", digit)
    result = result.replace("2", digit)
    result = result.replace("3", digit)
    result = result.replace("a", character)
    result = result.replace("b", character)
    result = result.replace("c", character)

    print result                     # "CHARACTER DIGIT CHARACTER ..."

   In one line::

    print string.replace("1", digit).replace("2", digit) ...

#. Solution::

    chain_a = """SSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKM
    FCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVV
    RRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFR
    HSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILT
    IITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKG
    EPHHELPPGSTKRALPNNT"""


    num_lines = chain_a.count("\n") + 1
    print num_lines                          # 6


    # NOTE: we want to know the length of the actual *sequence*, not the length of the *string*
    length_sequence = len(chain_a) - chain_a.count("\n")
    print length_sequence                    # 219


    sequence = chain_a.replace("\n", "")
    print len(chain_a) - len(sequence)          # 5 (correct)
    print len(sequence)                         # 219


    num_cysteine = sequence.count("C")
    num_histidine = sequence.count("H")
    print num_cysteine, num_histidine            # 10, 9


    print "NLRVEYLDDRN" in sequence             # True
    print sequence.find("NLRVEYLDDRN")          # 106
    # let's check
    print sequence[106 : 106 + len("NLRVEYLDDRN")]  # "NLRVEYLDDRN"

	
    index_first_newline = chain_a.find("\n")
    first_line = chain_a[:index_first_newline]
    print first_line

#. Solution::

    structure_chain_a = """SER A 96 77.253 20.522 75.007
    VAL A 97 76.066 22.304 71.921
    PRO A 98 77.731 23.371 68.681
    SER A 99 80.136 26.246 68.973
    GLN A 100 79.039 29.534 67.364
    LYS A 101 81.787 32.022 68.157"""

    # I use a variable with a shorter name 
    chain = structure_chain_a


    # the second argument of find() is the position where the search starts,
    # so every index refers to the whole string "chain"
    index_first_newline = chain.find("\n")
    index_second_newline = chain.find("\n", index_first_newline + 1)
    index_third_newline = chain.find("\n", index_second_newline + 1)
    print index_first_newline, index_second_newline, index_third_newline

    second_line = chain[index_first_newline + 1 : index_second_newline]
    print second_line                      # "VAL A 97 76.066 22.304 71.921"
                                            #           |    | |    | |    |
                                            #  01234567890123456789012345678
                                            #  0         1         2

    x = second_line[9:15]
    y = second_line[16:22]
    z = second_line[23:]
    print x, y, z
    # NOTE: they are all strings


    third_line = chain[index_second_newline + 1 : index_third_newline]
    print third_line                        # "PRO A 98 77.731 23.371 68.681"
                                            #           |    | |    | |    |
                                            #  01234567890123456789012345678
                                            #  0         1         2

    x_prime = third_line[9:15]
    y_prime = third_line[16:22]
    z_prime = third_line[23:]
    print x_prime, y_prime, z_prime
    # NOTE: they are all strings


    # we should convert all variables to floats, in order to calculate distances 
    x, y, z = float(x), float(y), float(z)
    x_prime, y_prime, z_prime = float(x_prime), float(y_prime), float(z_prime)

    diff_x = x - x_prime
    diff_y = y - y_prime
    diff_z = z - z_prime

    distance = (diff_x**2 + diff_y**2 + diff_z**2)**0.5
    print distance

   The solution is way simpler using ``split()``::

    lines = chain.split("\n")
    second_line = lines[1]
    third_line = lines[2]

    words = second_line.split()
    x, y, z = float(words[-3]), float(words[-2]), float(words[-1])

    words = third_line.split()
    x_prime, y_prime, z_prime = float(words[-3]), float(words[-2]), float(words[-1])

    distance = ((x - x_prime)**2 + (y - y_prime)**2 + (z - z_prime)**2)**0.5
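
   The ``split()``-based approach can also be wrapped in two small helper functions, so that it works for any pair of lines. This is only a sketch: ``coordinates()`` and ``distance()`` are hypothetical names introduced here, and the code assumes that the last three fields of each line are the x, y and z coordinates, as in the data above.

   ```python
   def coordinates(line):
       """Return the last three whitespace-separated fields of a line as floats."""
       words = line.split()
       return float(words[-3]), float(words[-2]), float(words[-1])


   def distance(line_a, line_b):
       """Euclidean distance between the coordinates found in two lines."""
       x_a, y_a, z_a = coordinates(line_a)
       x_b, y_b, z_b = coordinates(line_b)
       return ((x_a - x_b)**2 + (y_a - y_b)**2 + (z_a - z_b)**2)**0.5


   # Second and third line of the structure used in the exercise.
   second_line = "VAL A 97 76.066 22.304 71.921"
   third_line = "PRO A 98 77.731 23.371 68.681"

   dist = distance(second_line, third_line)
   print(dist)
   ```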

#. Solutions:
	
   #. Solution::
	
       dna_seq = dna_seq.replace("\n", "") # Remove newline characters
       length = len(dna_seq)               # Calculate length
       ng = dna_seq.count("G")             # Calculate the number of Gs
       nc = dna_seq.count("C")             # Calculate the number of Cs
       gc_cont = (ng + nc)/float(length)   # Calculate the GC-content
	
   #. Solution::
	
       rna_seq = dna_seq.replace("T","U")
		
   #. Solution::
	
       intron = dna_seq[50:156]        # Careful with indexes
       exon1 = dna_seq[:50]            # Careful with indexes
       exon2 = dna_seq[156:]           # Careful with indexes
       spliced = exon1+exon2
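
The "careful with indexes" remarks above refer to the fact that positions in a sequence are usually counted from 1 with both ends included, while Python slices count from 0 and exclude the end. The sketch below makes the conversion explicit. The helper ``slice_1_based()`` and the toy sequence are hypothetical and are not part of the exercise; the exon and intron boundaries are made up for illustration.

```python
def slice_1_based(seq, first, last):
    """Return the characters from 1-based position first to last, both included."""
    # Position p (1-based) is index p - 1; the end of a slice is exclusive,
    # so the 1-based inclusive end can be used as it is.
    return seq[first - 1:last]


# Hypothetical toy sequence, NOT the one from the exercise:
# two 6-letter exons with a 6-letter intron (lowercase) in between.
dna_seq = "ATGGCC" + "gtaagt" + "TTTAAA"

exon1 = slice_1_based(dna_seq, 1, 6)
intron = slice_1_based(dna_seq, 7, 12)
exon2 = slice_1_based(dna_seq, 13, len(dna_seq))
spliced = exon1 + exon2

print(spliced)
```

With this convention an intron spanning positions 51 to 156 corresponds to the slice ``[50:156]``, which matches the indexes used in the solution.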