Python: Strings

Strings are immutable objects representing text.

To define a string, we can write:

var = "text"

or, equivalently:

var = 'text'

To insert special characters, we must escape them with a backslash \:

path = "data\\fasta"

or use the prefix r (raw):

path = r"data\fasta"
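
As a quick check, the escaped and the raw spelling produce exactly the same string. The sketch below is only illustrative (the names escaped and raw do not appear elsewhere in this page), and print is written with parentheses so that it runs under both Python 2 and Python 3:

```python
escaped = "data\\fasta"   # backslash written with an escape sequence
raw = r"data\fasta"       # same text written as a raw string

print(escaped == raw)     # True: both hold the same characters
print(len(escaped))       # 10: the backslash counts as one character
```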

Note

Here is a reference list of escape characters. You will probably only need the most obvious ones, like \\, \n and \t.

Escape Character   Meaning
\\                 Backslash
\'                 Single quote
\"                 Double quote
\a                 ASCII bell
\b                 ASCII backspace
\f                 ASCII formfeed
\n                 ASCII linefeed (also known as newline)
\r                 Carriage return
\t                 Horizontal tab
\v                 ASCII vertical tab
\N{name}           Unicode character with the given name (u'' string only)
\uxxxx             Unicode 16-bit hex value xxxx (u'' string only)
\Uxxxxxxxx         Unicode 32-bit hex value xxxxxxxx (u'' string only)
\ooo               Character with octal value ooo
\xhh               Character with hex value hh

To create a multi-line string, we can manually place the newline character \n at the end of each line:

sad_joke = "Time flies like an arrow.\nFruit flies like a banana."

print sad_joke

or we can use triple quotes:

sad_joke = """Time flies like an arrow.
Fruit flies like a banana."""

print sad_joke
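
The two spellings build the same string, which can be verified directly. A minimal sketch, assuming the illustrative names one_line and triple (print is parenthesised so the code runs under both Python 2 and Python 3):

```python
one_line = "Time flies like an arrow.\nFruit flies like a banana."
triple = """Time flies like an arrow.
Fruit flies like a banana."""

print(one_line == triple)     # True: the line break inside the triple quotes is a \n
print(triple.count("\n"))     # 1: a single newline separates the two lines
```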

Warning

print interprets special characters, while the interpreter’s echo doesn’t. Try writing:

print path

and (from the interpreter):

path

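In the first case, we see one backslash (print interprets the escape sequence); in the second case we see two backslashes (the echo shows the string as it would be written in code, with the escape left uninterpreted).

The same happens with sad_joke: print shows two lines of text, while the echo shows a single line containing \n.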
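
Outside the interactive interpreter, the echo can be approximated with the built-in repr(), which returns the representation that the interpreter would display. A small sketch (parenthesised print, so it runs under both Python 2 and Python 3):

```python
path = "data\\fasta"

print(path)          # data\fasta      (escape interpreted)
print(repr(path))    # 'data\\fasta'   (what the interpreter echo shows)
print(len(path))     # 10: the string really holds a single backslash
```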

String-Number conversion

We can convert a number into a string with str():

n = 10
s = str(n)
print n, type(n)
print s, type(s)

int() or float() perform the opposite conversion:

n = int("123")
q = float("1.23")
print n, type(n)
print q, type(q)

If the string doesn’t represent a number of the requested type, Python raises an error (a ValueError):

int("3.14")             # Not an int
float("ribosome")       # Not a number
int("1 2 3")            # Not a number
int("fifteen")          # Not a number
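
When the input may not be a valid number, the error can be intercepted instead of stopping the program. The sketch below relies on try/except and on a function definition, neither of which is covered in this section, and the helper name to_int is purely hypothetical:

```python
def to_int(text):
    """Return int(text), or None if text is not a valid integer literal."""
    try:
        return int(text)
    except ValueError:
        return None

print(to_int("123"))        # 123
print(to_int("3.14"))       # None: "3.14" is not an integer literal
print(to_int("fifteen"))    # None
print(int(float("3.14")))   # 3: convert to float first, then truncate
```

Going through float() first is one way to read a string such as "3.14" when only its integer part is needed.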

Operations

Result   Operator        Meaning
bool     str == str      Check whether two strings are identical
int      len(str)        Return the length of the string
str      str + str       Concatenate two strings
str      str * int       Replicate the string
bool     str in str      Check if a string is present in another string
str      str[int:int]    Extract a sub-string

Example. Let’s concatenate two strings:

string = "one" + " " + "string"
length = len(string)
print "the string:", string, "is", length, "characters long"

Another example:

string = "Python is hell!" * 1000
print "the string is", len(string), "characters long"

Warning

We cannot concatenate strings with other types. For example:

var = 123
print "the value of var is" + var

gives an error message. Two working alternatives:

print "the value of var is " + str(var)

or:

print "the value of var is", var

(In the first case we must add the space after is ourselves; in the second case print inserts it automatically.)

Example. The operator substring in string checks whether substring appears at least once in string:

string = "A beautiful journey"

print "A" in string            # True
print "beautiful" in string    # True
print "BEAUTIFUL" in string    # False
print "ul jour" in string      # True
print "Gengis Khan" in string  # False
print " " in string            # True
print "     " in string        # False

The result is always True or False.

Example. To extract a substring we can use indexes:

#           0                       -1
#           |1                     -2|
#           ||2                   -3||
#           |||        ...         |||
alphabet = "abcdefghijklmnopqrstuvwxyz"

print alphabet[0]               # "a"
print alphabet[1]               # "b"
print alphabet[len(alphabet)-1] # "z"
print alphabet[len(alphabet)]   # Error
print alphabet[10000]           # Error

print alphabet[-1]              # "z"
print alphabet[-2]              # "y"

print alphabet[0:1]             # "a"
print alphabet[0:2]             # "ab"
print alphabet[0:5]             # "abcde"
print alphabet[:5]              # "abcde"

print alphabet[-5:-1]           # "vwxy"
print alphabet[-5:]             # "vwxyz"

print alphabet[10:-10]          # "klmnop"

Warning

Extraction is inclusive with respect to the first index, but exclusive with respect to the second. In other words alphabet[i:j] corresponds to:

alphabet[i] + alphabet[i+1] + ... + alphabet[j-1]

Note that alphabet[j] is excluded.
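
Two consequences of this convention are easy to check: the slice alphabet[i:j] has exactly j - i characters, and splitting a string at any index i and gluing the parts back together gives the original. A minimal sketch with arbitrarily chosen indexes (print is parenthesised so it runs under both Python 2 and Python 3):

```python
alphabet = "abcdefghijklmnopqrstuvwxyz"
i = 2
j = 7

piece = alphabet[i:j]
print(piece)                                    # cdefg
print(len(piece) == j - i)                      # True
print(alphabet[:i] + alphabet[i:] == alphabet)  # True
```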

Warning

Extraction returns a new string, leaving the original unchanged:

alphabet = "abcdefghijklmnopqrstuvwxyz"

substring = alphabet[2:-2]
print substring
print alphabet                  # unchanged
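
This follows from strings being immutable: assigning to a position raises a TypeError, so a "modified" string is always built as a new one. The sketch below uses try/except (not covered in this section) only to keep the script running, and the name capitalized is illustrative:

```python
alphabet = "abcdefghijklmnopqrstuvwxyz"

try:
    alphabet[0] = "A"              # strings do not support item assignment
except TypeError:
    print("cannot modify a string in place")

capitalized = "A" + alphabet[1:]   # build a new string instead
print(capitalized)                 # Abcdefghijklmnopqrstuvwxyz
print(alphabet)                    # abcdefghijklmnopqrstuvwxyz
```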

Methods

Result   Method                   Meaning
str      str.upper()              Return the string in upper case
str      str.lower()              Return the string in lower case
str      str.strip(str)           Remove the given characters from both ends
str      str.lstrip(str)          Remove the given characters from the left
str      str.rstrip(str)          Remove the given characters from the right
bool     str.startswith(str)      Check if the string starts with another
bool     str.endswith(str)        Check if the string ends with another
int      str.find(str)            Return the position of a substring
int      str.count(str)           Count the number of occurrences of a substring
str      str.replace(str, str)    Replace substrings

Warning

Methods return a new string, leaving the original unchanged (as with extraction):

alphabet = "abcdefghijklmnopqrstuvwxyz"

alphabet_upper = alphabet.upper()
print alphabet_upper
print alphabet                 # unchanged

Example. upper() and lower() are very simple:

text = "No Yelling"

result = text.upper()
print result

result = result.lower()
print result

Example. strip() variants are also simple:

text = "    one example    "

print text.strip()         # same result here as text.strip(" ")
print text.lstrip()        # strips the left side only
print text.rstrip()        # strips the right side only

print text                 # text is unchanged

Note that the space between "one" and "example" is never removed. We can also specify which characters to remove; the argument is treated as a set of characters, not as a literal prefix or suffix:

"AAAA one example BBBB".strip("AB")    # " one example "

Example. startswith() and endswith() are just as simple:

text = "123456789"

print text.startswith("1")     # True
print text.startswith("a")     # False

print text.endswith("56789")   # True
print text.endswith("5ABC9")   # False

Example. find() returns the position of the first occurrence of a substring, or -1 if the substring never occurs:

text = "123456789"

print text.find("1")           # 0
print text.find("56789")       # 4

print text.find("Q")           # -1
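
The table above also lists count(), which has no example of its own. A brief sketch of how it behaves, together with the usual test for a failed find() (the sample strings other than text are made up for illustration; print is parenthesised so it runs under both Python 2 and Python 3):

```python
text = "123456789"

print(text.count("1"))         # 1
print(text.count("Q"))         # 0: no error, just zero occurrences
print("banana".count("an"))    # 2
print("aaaa".count("aa"))      # 2, not 3: overlapping matches are not counted

position = text.find("Q")
print(position == -1)          # True: "Q" does not occur in text
```

Checking for -1 before using the result as an index matters, because text[-1] is a valid index (the last character) and would silently return "9".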

Example. replace() returns a copy of the string where a substring is replaced with another:

text = "if roses were rotten, then"

print text.replace("ro", "gro")
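
By default replace() substitutes every occurrence; an optional third argument limits the number of substitutions. A minimal sketch showing the outputs (print is parenthesised so it runs under both Python 2 and Python 3):

```python
text = "if roses were rotten, then"

print(text.replace("ro", "gro"))      # if groses were grotten, then
print(text.replace("ro", "gro", 1))   # if groses were rotten, then
print(text)                           # if roses were rotten, then (unchanged)
```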

Example. Given this unformatted string of amino acids:

sequence = ">MAnlFKLgaENIFLGrKW    "

To increase uniformity, we want to remove the ">" character, remove spaces and finally convert everything to upper case:

s1 = sequence.lstrip(">")
s2 = s1.rstrip(" ")
s3 = s2.upper()

print s3

Alternatively, all in one step:

print sequence.lstrip(">").rstrip(" ").upper()

Why does it work? Let’s write it with brackets:

print ( ( sequence.lstrip(">") ).rstrip(" ") ).upper()
        \_____________________/
                  str
      \_____________________________________/
                         str
      \_____________________________________________/
                             str

As you can see, the result of each method is a string (like s1, s2 and s3 in the example above), so we can invoke another string method on it.

Exercises

  1. How can I:

    1. Create a string consisting of five spaces only.

    2. Check whether a string contains at least one space.

    3. Check whether a string contains exactly five (arbitrary) characters.

    4. Create an empty string, and check whether it is really empty.

    5. Create a string that contains one hundred copies of "Python is great".

    6. Given the strings "but cell", "biology" and "is way better", compose them into the string "but cell biology is way better".

    7. Check whether the string "12345" begins with 1 (the character, not the number!)

    8. Create a string consisting of a single character \. (Check whether the output matches using both the echo of the interpreter and print, and possibly also with len())

    9. Check whether the string "\\" contains one or two backslashes.

    10. Check whether a string (of choice) begins or ends with \.

    11. Check whether a string (of choice) contains x at least three times at the beginning and/or at the end. For instance, the following strings satisfy the requirement:

      "x....xx"           # 1 + 2 >= 3
      "xx....x"           # 2 + 1 >= 3
      "xxxx..."           # 4 + 0 >= 3
      

      while these do not:

      "x.....x"           # 1 + 1 < 3
      "...x..."           # 0 + 0 < 3
      "......."           # 0 + 0 < 3
      
  2. Given the string:

    s = "0123456789"
    

    which of the following extractions are correct?

    1. s[9]
    2. s[10]
    3. s[:10]
    4. s[1000]
    5. s[0]
    6. s[-1]
    7. s[1:5]
    8. s[-1:-5]
    9. s[-5:-1]
    10. s[-1000]
  3. Create a two-line string that contains the two following lines of text literally, including all the special characters and the implicit newline character:

    never say "never"!

    said the sad turtle

  4. Given the strings:

    string = "a 1 b 2"
    
    digit = "DIGIT"
    character = "CHARACTER"
    

    replace all the digits in the variable string with the text provided by the variable digit, and all alphabetic characters with the content of the variable character.

    The result should look like this:

    "CHARACTER DIGIT CHARACTER DIGIT"
    

    You are free to use auxiliary variables to hold any intermediate results, but do not need to.

  5. Given the following multi-line sequence:

    chain_a = """SSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKM
       FCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVV
       RRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFR
       HSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILT
       IITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKG
       EPHHELPPGSTKRALPNNT"""
    

    which represents the amino acid sequence of the DNA-binding domain of the Tumor Suppressor Protein TP53, answer the following questions.

    1. How many lines does it hold?
    2. How long is the sequence? (Do not forget to ignore the special characters!)
    3. Remove all newline characters, and put the result in a new variable sequence.
    4. How many cysteines "C" are there in the sequence? How many histidines "H"?
    5. Does the chain contain the sub-sequence "NLRVEYLDDRN"? In what position?
    6. How can I use find() and the sub-string extraction [i:j] operators to extract the first line from chain_a?
  6. Given (a small portion of) the tertiary structure of chain A of the TP53 protein:

    structure_chain_a = """SER A 96 77.253 20.522 75.007
    VAL A 97 76.066 22.304 71.921
    PRO A 98 77.731 23.371 68.681
    SER A 99 80.136 26.246 68.973
    GLN A 100 79.039 29.534 67.364
    LYS A 101 81.787 32.022 68.157"""
    

    Each line represents a \(C_\alpha\) atom of the backbone of the structure. For each atom, we know:

    - the amino acid code of the residue
    - the chain (which is always "A" in this example)
    - the position of the residue within the chain (starting from the N-terminal)
    - the \(x, y, z\) coordinates of the atom

    1. Extract the second line using find() and the extraction operator. Put the line in a new variable line.

    2. Extract the coordinates of the second residue, and put them into three variables x, y, and z.

    3. Extract the coordinates of the third residue as well, and put them into three different variables x_prime, y_prime, and z_prime.

    4. Compute the Euclidean distance between the two residues:

      \(d((x,y,z),(x',y',z')) = \sqrt{(x-x')^2 + (y-y')^2 + (z-z')^2}\)

      Hint: make sure to use float numbers when computing the distance.

  7. Given the following DNA sequence, part of the BRCA2 human gene:

    dna_seq = """GGGCTTGTGGCGCGAGCTTCTGAAACTAGGCGGCAGAGGCGGAGCCGCT
    GTGGCACTGCTGCGCCTCTGCTGCGCCTCGGGTGTCTTTT
    GCGGCGGTGGGTCGCCGCCGGGAGAAGCGTGAGGGGACAG
    ATTTGTGACCGGCGCGGTTTTTGTCAGCTTACTCCGGCCA
    AAAAAGAACTGCACCTCTGGAGCGG"""
    
    1. Calculate the GC-content of the sequence.
    2. Convert the DNA sequence into an RNA sequence.
    3. Assuming that this sequence contains an intron ranging from nucleotide 51 to nucleotide 156, store the sequence of the intron in a string, and the sequence of the spliced transcript in another string.