================
Python: Strings
================

Strings are **immutable** objects representing text.

To define a string, we can write::

    var = "text"

or, equivalently::

    var = 'text'

To insert special characters, we need to perform *escaping* with a backslash
``\``::

    path = "data\\fasta"

or use the prefix ``r`` (*raw*)::

    path = r"data\fasta"
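
As a quick check, the two spellings can be compared directly. The sketch below uses illustrative variable names (``escaped``, ``raw``) that are not part of the examples above:

```python
escaped = "data\\fasta"    # backslash escaped by hand
raw = r"data\fasta"        # raw string: the backslash is taken literally

same = (escaped == raw)    # True: both spell the same text
length = len(escaped)      # 10: the escaped backslash counts as ONE character
```

Both variables hold the same ten characters; the doubled backslash exists only in the source code.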


.. note::

    Here is a reference list of escape characters. You will probably only need
    the most obvious ones, like ``\\``, ``\n`` and ``\t``.

    ================    =====================================================
    Escape Character    Meaning
    ================    =====================================================
    ``\\``              Backslash
    ``\'``              Single-quote
    ``\"``              Double-quote
    ``\a``              ASCII bell
    ``\b``              ASCII backspace
    ``\f``              ASCII formfeed
    ``\n``              ASCII linefeed (also known as newline)
    ``\r``              Carriage Return
    ``\t``              Horizontal Tab
    ``\v``              ASCII vertical tab
    ``\N{name}``        Unicode character name (Unicode only!)
    ``\uxxxx``          Unicode 16-bit hex value xxxx (u'' string only)
    ``\Uxxxxxxxx``      Unicode 32-bit hex value xxxxxxxx (u'' string only)
    ``\ooo``            Character with octal value ooo
    ``\xhh``            Character with hex value hh
    ================    =====================================================

To create a multi-line string, we can manually place the
*newline* character ``\n`` at the end of each line::

    sad_joke = "Time flies like an arrow.\nFruit flies like a banana."

    print sad_joke

or we can use triple quotes::

    sad_joke = """Time flies like an arrow.
    Fruit flies like a banana."""

    print sad_joke
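
The two notations should produce exactly the same string. A minimal sketch to verify this, using illustrative variable names:

```python
with_escape = "Time flies like an arrow.\nFruit flies like a banana."
with_triple_quotes = """Time flies like an arrow.
Fruit flies like a banana."""

# The line break typed inside the triple quotes is stored as "\n"
identical = (with_escape == with_triple_quotes)   # True
```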

.. warning::

    ``print`` interprets special characters, while the interpreter's echo doesn't.
    Try to write::

        print path

    and (from the interpreter)::

        path

    In the first case, we see one backslash (``print`` interprets the escape sequence); in the second case, we see two backslashes (the echo displays the string in its escaped form).

    The same happens with ``sad_joke``.
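
The interpreter's echo displays the result of the built-in ``repr()``, so the difference can also be observed from a script. A minimal sketch, redefining ``path`` so that it is self-contained (``print`` is written with parentheses and a single argument, which behaves the same in Python 2 and Python 3):

```python
path = "data\\fasta"

echoed = repr(path)   # the text that the interpreter echo displays
print(echoed)         # 'data\\fasta'  (quotes and both backslashes are shown)
print(path)           # data\fasta    (a single backslash)
```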

String-Number conversion
--------------------------

We can convert a number into a string with ``str()``::

    n = 10
    s = str(n)
    print n, type(n)
    print s, type(s)

``int()`` or ``float()`` perform the opposite conversion::

    n = int("123")
    q = float("1.23")
    print n, type(n)
    print q, type(q)

If the string doesn't contain the correct numeric type, Python will give an error message::

    int("3.14")             # Not an int
    float("ribosome")       # Not a number
    int("1 2 3")            # Not a number
    int("fifteen")          # Not a number
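
Each of the calls above raises a ``ValueError``. The sketch below shows one possible way to deal with it; it relies on ``try``/``except``, which is not covered in this chapter, and ``to_int`` is a hypothetical helper written only for this illustration:

```python
def to_int(text):
    """Hypothetical helper: return int(text), or None if text is not an integer."""
    try:
        return int(text)
    except ValueError:
        return None

ok = to_int("123")                # 123
bad = to_int("3.14")              # None: "3.14" does not represent an integer
truncated = int(float("3.14"))    # 3: convert to float first, then truncate
```

Converting through ``float()`` first is a deliberate two-step choice: it discards the fractional part instead of failing.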

Operations
----------

======== ================== ===========================================
Result   Operator           Meaning
======== ================== ===========================================
``bool`` ``==``             Check whether two strings are identical.
``int``  ``len(str)``       Return the length of the string
``str``  ``str + str``      Concatenate two strings
``str``  ``str * int``      Replicate the string
``bool`` ``str in str``     Check if a string is present in another string
``str``  ``str[int:int]``   Extract a sub-string
======== ================== ===========================================

**Example**. Let's concatenate two strings::

    string = "one" + " " + "string"
    length = len(string)
    print "the string:", string, "is", length, "characters long"

Another example::

    string = "Python is hell!" * 1000
    print "the string is", len(string), "characters long"

.. warning::

    We cannot concatenate strings with other types. For example::

        var = 123
        print "the value of var is" + var

    gives an error message. Two working alternatives::

        print "the value of var is" + str(var)

    or::

        print "the value of var is", var

    (In the first case, there is no space between ``is`` and ``123``; in the second, ``print`` inserts one automatically.)

**Example**. The operator ``substring in string`` checks if
``substring`` appears one or more times in ``string``, for example::

    string = "A beautiful journey"

    print "A" in string            # True
    print "beautiful" in string    # True
    print "BEAUTIFUL" in string    # False
    print "ul jour" in string      # True
    print "Gengis Khan" in string  # False
    print " " in string            # True
    print "     " in string        # False

The result is always ``True`` or ``False``.

**Example**. To extract a substring we can use indexes::

    #           0                       -1
    #           |1                     -2|
    #           ||2                   -3||
    #           |||        ...         |||
    alphabet = "abcdefghijklmnopqrstuvwxyz"

    print alphabet[0]               # "a"
    print alphabet[1]               # "b"
    print alphabet[len(alphabet)-1] # "z"
    print alphabet[len(alphabet)]   # Error
    print alphabet[10000]           # Error

    print alphabet[-1]              # "z"
    print alphabet[-2]              # "y"

    print alphabet[0:1]             # "a"
    print alphabet[0:2]             # "ab"
    print alphabet[0:5]             # "abcde"
    print alphabet[:5]              # "abcde"

    print alphabet[-5:-1]           # "vwxy"
    print alphabet[-5:]             # "vwxyz"

    print alphabet[10:-10]          # "klmnop"

.. warning::

    Extraction is inclusive with respect to the first index, but exclusive with respect to the second. In other words ``alphabet[i:j]`` corresponds to::

        alphabet[i] + alphabet[i+1] + ... + alphabet[j-1]

    Note that ``alphabet[j]`` is excluded.
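
The rule can be checked on a concrete pair of indexes. In this sketch the values ``i = 2`` and ``j = 6`` are arbitrary choices made for illustration:

```python
alphabet = "abcdefghijklmnopqrstuvwxyz"
i = 2
j = 6

piece = alphabet[i:j]                                            # "cdef"
by_hand = alphabet[2] + alphabet[3] + alphabet[4] + alphabet[5]  # indexes i .. j-1

same = (piece == by_hand)                 # True
right_length = (len(piece) == j - i)      # True: 6 - 2 = 4 characters
rejoined = alphabet[:j] + alphabet[j:]    # cutting at j loses no character
```

A practical consequence of the half-open convention is that ``alphabet[:j] + alphabet[j:]`` always gives back the whole string.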

.. warning::

    Extraction returns a *new* string, leaving the original unchanged::

        alphabet = "abcdefghijklmnopqrstuvwxyz"

        substring = alphabet[2:-2]
        print substring
        print alphabet                  # unchanged


Methods
-------

======== =========================== ===================================================
Result   Method                      Meaning
======== =========================== ===================================================
``str``  ``str.upper()``             Return the string in upper case
``str``  ``str.lower()``             Return the string in lower case
``str``  ``str.strip(str)``          Remove the given characters from both ends
``str``  ``str.lstrip(str)``         Remove the given characters from the left end
``str``  ``str.rstrip(str)``         Remove the given characters from the right end
``bool`` ``str.startswith(str)``     Check if the string starts with another
``bool`` ``str.endswith(str)``       Check if the string ends with another
``int``  ``str.find(str)``           Return the position of a substring
``int``  ``str.count(str)``          Count the number of occurrences of a substring
``str``  ``str.replace(str, str)``   Replace substrings
======== =========================== ===================================================

.. warning::

    Methods return a *new* string, leaving the original unchanged (as with extraction)::

        alphabet = "abcdefghijklmnopqrstuvwxyz"

        alphabet_upper = alphabet.upper()
        print alphabet_upper
        print alphabet                 # unchanged

**Example**. ``upper()`` and ``lower()`` are very simple::

    text = "No Yelling"

    result = text.upper()
    print result

    result = result.lower()
    print result

**Example**. ``strip()`` variants are also simple::

    text = "    one example    "

    print text.strip()         # removes whitespace from both sides
    print text.lstrip()        # same, but only from the left
    print text.rstrip()        # same, but only from the right

    print text                 # text is unchanged

Note that the space between ``"one"`` and ``"example"`` is never removed. We can specify more than one *character* to be removed::

    "AAAA one example BBBB".strip("AB")

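The argument of ``strip()`` is treated as a set of characters, not as a substring to be matched as a whole. A small sketch of what this implies (the second and third strings are made up for illustration):

```python
text = "AAAA one example BBBB"

stripped = text.strip("AB")            # " one example ": the inner spaces stop the stripping
mixed = "ABBA one ABBA".strip("AB")    # " one ": the order of A and B does not matter
cleaned = text.strip("AB ")            # "one example": the space is now stripped too
```
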
**Example**. ``startswith()`` and ``endswith()`` are just as simple::

    text = "123456789"

    print text.startswith("1")     # True
    print text.startswith("a")     # False

    print text.endswith("56789")   # True
    print text.endswith("5ABC9")   # False

**Example**. ``find()`` returns the position of the first occurrence of a substring, or ``-1`` if the substring never occurs::

    text = "123456789"

    print text.find("1")           # 0
    print text.find("56789")       # 4

    print text.find("Q")           # -1
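
Since ``find()`` returns a position, its result can be passed to the extraction operator. A minimal sketch, where the ``record`` string is made up for illustration:

```python
record = "gene=BRCA2"          # made-up example text

pos = record.find("=")         # 4
key = record[:pos]             # "gene"
value = record[pos + 1:]       # "BRCA2"

missing = record.find(";")     # -1: ";" never occurs
```

It is worth checking for ``-1`` before extracting: ``record[:-1]`` is a valid extraction, so a failed search would silently drop the last character instead of raising an error.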

**Example**. ``replace()`` returns a copy of the string where a substring is replaced with another::

    text = "if roses were rotten, then"

    print text.replace("ro", "gro")
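
``replace()`` acts on every occurrence of the substring, and ``count()`` tells us how many there are. A short sketch on the same text (the optional third argument of ``replace()``, which limits the number of replacements, is not covered in the table above):

```python
text = "if roses were rotten, then"

occurrences = text.count("ro")              # 2
replaced = text.replace("ro", "gro")        # "if groses were grotten, then"
first_only = text.replace("ro", "gro", 1)   # "if groses were rotten, then"
```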

**Example**. Given this unformatted string of amino acids::

    sequence = ">MAnlFKLgaENIFLGrKW    "

To increase uniformity, we want to remove the ``">"`` character, remove spaces and finally convert everything to upper case::

    s1 = sequence.lstrip(">")
    s2 = s1.rstrip(" ")
    s3 = s2.upper()

    print s3

Alternatively, all in one step::

    print sequence.lstrip(">").rstrip(" ").upper()

Why does it work? Let's write it with brackets::

    print ( ( sequence.lstrip(">") ).rstrip(" ") ).upper()
            \_____________________/
                      str
          \_____________________________________/
                             str
          \_____________________________________________/
                                 str

As you can see, the result of each method is a string (like ``s1``, ``s2`` and ``s3`` in the example above),
so we can invoke string methods on it.
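
The equivalence between the step-by-step and the chained version can be verified directly. A minimal sketch (``chained`` and ``same`` are illustrative names):

```python
sequence = ">MAnlFKLgaENIFLGrKW    "

# Step by step: every intermediate result is itself a string
s1 = sequence.lstrip(">")
s2 = s1.rstrip(" ")
s3 = s2.upper()

# All in one step
chained = sequence.lstrip(">").rstrip(" ").upper()

same = (s3 == chained)    # True: both are "MANLFKLGAENIFLGRKW"
```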

Exercises
---------

#. How can I:

   #. Create a string consisting of five spaces only.
   #. Check whether a string contains at least one space.
   #. Check whether a string contains exactly five (arbitrary) characters.
   #. Create an empty string, and check whether it is really empty.
   #. Create a string that contains one hundred copies of ``"Python is great"``.
   #. Given the strings ``"but cell"``, ``"biology"`` and ``"is way better"``,
      compose them into the string ``"but cell biology is way better"``.
   #. Check whether the string ``"12345"`` begins with ``1`` (the character, not the number!)
   #. Create a string consisting of a single character ``\``. (Check whether
      the output matches using both the echo of the interpreter and ``print``,
      and possibly also with ``len()``)
   #. Check whether the string ``"\\"`` contains one or two backslashes.
   #. Check whether a string (of choice) begins or ends with ``\``.
   #. Check whether a string (of choice) contains ``x`` at least three times
      at the beginning and/or at the end. For instance, the following strings
      satisfy the requirement::

       "x....xx"           # 1 + 2 >= 3
       "xx....x"           # 2 + 1 >= 3
       "xxxx..."           # 4 + 0 >= 3

      while these do not::

       "x.....x"           # 1 + 1 < 3
       "...x..."           # 0 + 0 < 3
       "......."           # 0 + 0 < 3

#. Given the string::

    s = "0123456789"

   which of the following extractions are correct?

   #. ``s[9]``
   #. ``s[10]``
   #. ``s[:10]``
   #. ``s[1000]``
   #. ``s[0]``
   #. ``s[-1]``
   #. ``s[1:5]``
   #. ``s[-1:-5]``
   #. ``s[-5:-1]``
   #. ``s[-1000]``

#. Create a two-line string that contains the two following lines of text
   **literally**, including all the special characters and the implicit newline
   character:

    *never say "never"!*

    *said the sad turtle*

#. Given the strings::

    string = "a 1 b 2"

    digit = "DIGIT"
    character = "CHARACTER"

   replace all the digits in the variable ``string`` with the text provided by
   the variable ``digit``, and all alphabetic characters with the content of
   the variable ``character``.

   The result should look like this::

     "CHARACTER DIGIT CHARACTER DIGIT"

   You are free to use auxiliary variables to hold any intermediate results,
   but you do not need to.

#. Given the following multi-line sequence::

    chain_a = """SSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKM
    FCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVV
    RRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFR
    HSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILT
    IITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKG
    EPHHELPPGSTKRALPNNT"""

   which represents the amino acid sequence of the DNA-binding domain of the
   `Tumor Suppressor Protein TP53 <http://www.rcsb.org/pdb/explore.do?structureId=1TSR>`_,
   answer the following questions.

   #. How many lines does it hold?
   #. How long is the sequence? (Do not forget to ignore the special characters!)
   #. Remove all newline characters, and put the result in a new variable
      ``sequence``.
   #. How many cysteines ``"C"`` are there in the sequence? How many histidines ``"H"``?
   #. Does the chain contain the sub-sequence ``"NLRVEYLDDRN"``? In what position?
   #. How can I use ``find()`` and the sub-string extraction ``[i:j]`` operators
      to extract the first line from ``chain_a``?

#. Given (a small portion of) the tertiary structure of chain A of the TP53 protein::

    structure_chain_a = """SER A 96 77.253 20.522 75.007
    VAL A 97 76.066 22.304 71.921
    PRO A 98 77.731 23.371 68.681
    SER A 99 80.136 26.246 68.973
    GLN A 100 79.039 29.534 67.364
    LYS A 101 81.787 32.022 68.157"""

   Each line represents a :math:`C_\alpha` atom of the backbone of the
   structure. For each atom, we know:

   - the amino acid code of the residue
   - the chain (which is always ``"A"`` in this example)
   - the position of the residue within the chain (starting from the N-terminus)
   - and the :math:`x, y, z` coordinates of the atom

   #. Extract the second line using ``find()`` and the extraction operator. Put
      the line in a new variable ``line``.
   #. Extract the coordinates of the second residue, and put them into three
      variables ``x``, ``y``, and ``z``.
   #. Extract the coordinates of the *third* residue as well, putting them in
      different variables ``x_prime``, ``y_prime``, ``z_prime``.
   #. Compute the Euclidean distance between the two residues:

        :math:`d((x,y,z),(x',y',z')) = \sqrt{(x-x')^2 + (y-y')^2 + (z-z')^2}`

      *Hint*: make sure to use ``float`` numbers when computing the distance.

#. Given the following DNA sequence, part of the `BRCA2 <https://en.wikipedia.org/wiki/BRCA2>`_ human gene::

    dna_seq = """GGGCTTGTGGCGCGAGCTTCTGAAACTAGGCGGCAGAGGCGGAGCCGCT
    GTGGCACTGCTGCGCCTCTGCTGCGCCTCGGGTGTCTTTT
    GCGGCGGTGGGTCGCCGCCGGGAGAAGCGTGAGGGGACAG
    ATTTGTGACCGGCGCGGTTTTTGTCAGCTTACTCCGGCCA
    AAAAAGAACTGCACCTCTGGAGCGG"""

   #. Calculate the **GC-content** of the sequence
   #. Convert the **DNA** sequence into an **RNA** sequence
   #. Assuming that this sequence contains an **intron** ranging from nucleotide *51* to nucleotide *156*, store the sequence of the intron in a string, and the sequence of the spliced transcript in another string.