Biopython (Exercises)

Biopython: Sequences

Note

The sequences.fasta file is available at:

https://drive.google.com/open?id=0B0wILN942aEVcXhqZFlGdTh0U3c

Write a Python program that asks the user for two DNA sequences, and prints the reverse complement of their concatenation.
Write a Python program that asks the user for a DNA sequence, and prints both the corresponding mRNA sequence and protein sequence, including stop codons (according to the standard translation table).
Write a Python program that asks the user for a DNA sequence, and prints both the corresponding mRNA sequence and protein sequence, including stop codons (according to the Yeast Mitochondrial Code translation table).

Note

See the table list at:

https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi
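
The Seq methods that the first three exercises rely on can be sketched as follows. This is a minimal illustration, not a full solution: the input sequences are hard-coded placeholders rather than read with input(), and the table is selected by its NCBI identifier (3 for the Yeast Mitochondrial Code in the list above).

```python
from Bio.Seq import Seq

# Placeholder inputs; an actual solution would read them with input().
first = Seq("ATGC")
second = Seq("GGTA")

# Seq objects concatenate with "+", like strings.
revcomp = (first + second).reverse_complement()

# A short made-up coding sequence whose length is a multiple of three.
dna = Seq("ATGCTTTGATAA")
mrna = dna.transcribe()  # replaces T with U

# The default table is the standard code (NCBI table 1).
standard_protein = dna.translate()
# NCBI table 3 is the Yeast Mitochondrial Code.
yeast_mito_protein = dna.translate(table=3)

print(revcomp)
print(mrna)
print(standard_protein)
print(yeast_mito_protein)
```

With this input the two tables disagree: the standard code reads TGA as a stop codon, while table 3 reads it as tryptophan.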
Write a Python program that takes the sequence of the 1AI4 PDB protein (download the FASTA file manually), and writes a corresponding UniProt file.
Write a Python program that takes the sequences.fasta file and writes N single-sequence FASTA files, called sequence{number}.fasta, each one containing a single sequence of the original file.
Do the same, but create N GenBank files instead.
Write a Python program that takes the sequences.fasta file and writes a revcomp.fasta file with the reverse complements of the original sequences.

Hint. The SeqIO.write() function can write an entire list of SeqRecord objects in a single call.
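
The hint can be illustrated with a small sketch. It assumes two made-up records held in an in-memory string instead of the real sequences.fasta, and the "_rc" suffix on the identifiers is only an illustrative naming choice.

```python
from io import StringIO

from Bio import SeqIO
from Bio.SeqRecord import SeqRecord

# Two made-up records standing in for sequences.fasta.
fasta_text = ">seq1\nATGC\n>seq2\nAAGT\n"

revcomp_records = []
for record in SeqIO.parse(StringIO(fasta_text), "fasta"):
    revcomp_records.append(
        SeqRecord(
            record.seq.reverse_complement(),
            id=record.id + "_rc",
            description="",  # keep the FASTA header to the ID only
        )
    )

# One call writes the whole list; the return value is the record count.
output = StringIO()
count = SeqIO.write(revcomp_records, output, "fasta")

print(count)
print(output.getvalue())
```

To write to disk instead, a file name such as "revcomp.fasta" can be passed in place of the StringIO handle.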
Solve Exercise 3 of the Programs section using Biopython where appropriate.
Solve Exercise 2 of the Programs section using Biopython where appropriate.

Hint. Study the .annotations of the SeqRecord obtained by parsing the UniProt file carefully.
Find and download a single sequence record from GenBank. The GenBank identifier of the record is HE805982.1. This record contains information about the DNA region coding for HBx, a multifunctional hepatitis B viral protein involved in modulating several pathways by directly or indirectly interacting with host factors (protein degradation, apoptosis, transcription, signal transduction, cell cycle progress, and genetic stability). Write a Python program that, using the GenBank record, saves the corresponding protein sequence in FASTA format.
From UniProt, find and download the records for the four human ELAV proteins (ELAVL1: Q15717, ELAVL2: Q12926, ELAVL3: Q14576, ELAVL4: P26378). Download each record in text format and store the four records in a dedicated directory. Now write a Python program that takes all the UniProt files in the directory and appends all the sequences, in FASTA format, to the file created in the previous exercise. Can you print the sequences ordered by increasing length?

Hint. The os.listdir() function in the os module returns a list with the names of all the files in a directory.
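
A minimal sketch of the hint, using only the standard library. The directory is a temporary one created for the example and the file names are placeholders; in the exercise the directory would be the one holding the downloaded UniProt records.

```python
import os
import tempfile

# Build a throwaway directory with placeholder file names.
with tempfile.TemporaryDirectory() as directory:
    for name in ("Q15717.txt", "Q12926.txt", "notes.md"):
        with open(os.path.join(directory, name), "w") as handle:
            handle.write("placeholder\n")

    # os.listdir() returns bare names in arbitrary order, so sort them
    # and join each one with the directory to get a usable path.
    uniprot_paths = [
        os.path.join(directory, name)
        for name in sorted(os.listdir(directory))
        if name.endswith(".txt")
    ]
    uniprot_names = [os.path.basename(path) for path in uniprot_paths]

print(uniprot_names)
```

Filtering on the extension keeps unrelated files in the same directory from being parsed as UniProt records.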
Write a program that, given a FASTA file containing multiple protein sequences and a string specified by the user, writes to a new file only the sequences that contain at least one occurrence of the string (regular expressions are allowed). Test your program with the file sequences.fasta, printing sequences containing a stretch of at least three glutamines.
Try exercises 20.1.6 and 20.1.7 in the Biopython tutorial:

http://biopython.org/DIST/docs/tutorial/Tutorial.html
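
For the filtering exercise above (sequences containing a stretch of at least three glutamines), the matching step can be sketched with the standard re module. The (identifier, sequence) pairs are made up and stand in for records parsed from sequences.fasta; reading and writing the FASTA files is left out.

```python
import re

# Made-up (identifier, sequence) pairs standing in for parsed FASTA records.
proteins = [
    ("prot1", "MKTAYQQQLLA"),
    ("prot2", "MKQAQQAL"),
    ("prot3", "MQQQQ"),
]

# Q{3,} matches three or more consecutive glutamines.
pattern = re.compile(r"Q{3,}")

# search() looks for the pattern anywhere in the sequence.
matching_ids = [name for name, sequence in proteins if pattern.search(sequence)]

print(matching_ids)
```

Compiling the user's string as a pattern means a plain substring and a regular expression are handled by the same code path.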

Biopython: Structures

Note

The P53 files (FASTA and three PDB files) are available here:

https://drive.google.com/open?id=0B0wILN942aEVSUZXamFlUmRnUU0

Ask the user for the path to a PDB file. Print the header of the PDB file on screen, including the name, author, resolution and release date of the PDB structure.
Same as above, but for an mmCIF file.
Ask the user for the path to a PDB file. Print the number of models it contains. For each model, print the name and length of its chains.
Same as above, but for an mmCIF file.
Ask the user for the path to a PDB file, as well as the name of a chain. Print the number of atoms that each of its residues has.
Ask the user for the path to a PDB file, and convert its sequence to a FASTA file.

Note. You will need to convert from three-letter amino acid codes to one-letter codes.

Hint. You can use the SeqIO module, but it is not mandatory.
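
One possible way to do the conversion mentioned in the note is the seq1() function from Bio.SeqUtils. The sketch below assumes a hard-coded list of residue names in place of the names that would be read from a parsed structure.

```python
from Bio.SeqUtils import seq1

# Placeholder residue names, as residue.get_resname() would return them.
residue_names = ["MET", "ALA", "ILE", "HOH"]

# seq1() maps each three-letter code to one letter; unknown codes
# (here the water molecule) become "X" by default.
one_letter = "".join(seq1(name) for name in residue_names)

print(one_letter)
```

Since non-amino-acid residues come out as "X", a solution would normally skip waters and other hetero residues before converting.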
Ask the user for the ID of a PDB structure (e.g. "1A3A"), and download it somewhere using the PDBList class. Then print the hierarchy of the structure: how many models; for each model how many chains; for each chain how many residues.

Note. Try to make the output look pretty.
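
The traversal of the structure hierarchy can be sketched as below. To stay self-contained, the example does not download anything with PDBList: it parses a tiny made-up structure from an in-memory string, and atom_line() is a hypothetical helper written only to produce correctly aligned ATOM records.

```python
from io import StringIO

from Bio.PDB import PDBParser


def atom_line(serial, atom, resname, chain, resseq, x, y, z, element):
    """Format one fixed-column ATOM record (occupancy 1.00, B-factor 0.00)."""
    return (
        f"ATOM  {serial:>5} {atom:<4} {resname:>3} {chain}{resseq:>4}"
        f"{x:12.3f}{y:8.3f}{z:8.3f}{1.0:6.2f}{0.0:6.2f}{element:>12}"
    )


# A made-up two-chain structure, small enough to keep in memory.
pdb_text = "\n".join(
    [
        atom_line(1, "N", "ALA", "A", 1, 0.0, 0.0, 0.0, "N"),
        atom_line(2, "CA", "ALA", "A", 1, 1.5, 0.0, 0.0, "C"),
        atom_line(3, "N", "GLY", "A", 2, 3.0, 1.0, 0.0, "N"),
        atom_line(4, "N", "SER", "B", 1, 9.0, 9.0, 9.0, "N"),
        "END",
    ]
) + "\n"

# QUIET=True suppresses warnings about the minimal file.
structure = PDBParser(QUIET=True).get_structure("demo", StringIO(pdb_text))

# Structure -> Model -> Chain -> Residue; len() counts the children.
summary = {
    model.id: {chain.id: len(chain) for chain in model} for model in structure
}

print(f"Structure {structure.id}: {len(structure)} model(s)")
for model in structure:
    print(f"  Model {model.id}: {len(model)} chain(s)")
    for chain in model:
        print(f"    Chain {chain.id}: {len(chain)} residue(s)")
```

The same loops work unchanged on a structure parsed from a downloaded file; only the argument passed to get_structure() differs.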
Ask the user for the paths to two PDB files, then perform structural alignment on them.
Ask the user for the paths to two PDB files and a float threshold. For each chain in the first structure, compare it to each chain in the second structure: if any residue in the first chain is close enough to a residue of the second chain (i.e. their distance is smaller than the threshold), print the chain ID and residue type of the two residues, as well as their distance.
Given the files 1TSR.pdb and wt_1tsr_B.pdb in the P53 package, make sure that they have exactly the same residues.
Given the files wt_1tsr_B.pdb and wt_1tsr_B_nowater.pdb in the P53 package, make sure that the only missing molecules are the "HOH" ones.

Hint. Use the hetero flag of the residues, stored as the first element of each residue's id.
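
As a rough illustration of the hint: in Bio.PDB the hetero flag is the first element of the residue id tuple. The residues below are built by hand for the example (the glycerol ligand "GOL" and the residue numbers are arbitrary); in the exercise they would come from the parsed P53 files.

```python
from Bio.PDB.Residue import Residue

# Hand-built residues; a parsed structure yields the same kind of objects.
# The id tuple is (hetero flag, sequence number, insertion code).
residues = [
    Residue((" ", 1, " "), "ALA", " "),        # standard residue: blank flag
    Residue(("W", 201, " "), "HOH", " "),      # water: flag "W"
    Residue(("H_GOL", 301, " "), "GOL", " "),  # other hetero group: "H_" + name
]

# Split the residues on the hetero flag stored in id[0].
water_names = [res.get_resname() for res in residues if res.id[0] == "W"]
non_water = [res.get_resname() for res in residues if res.id[0] != "W"]

print(water_names)
print(non_water)
```

Comparing the non-water residues of the two files, and checking that every residue missing from the second file carries the "W" flag, is one way to approach the last exercise.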