Pandas: Solutions¶

Solutions:

import pandas as pd
import matplotlib.pyplot as plt
df_iris = pd.read_csv("iris.csv")

Solution:

# number of rows and columns
print df_iris.shape

# only rows
print len(df_iris.index)

# only columns
print len(df_iris.columns)

Solution:

# average petal length
print df_iris.PetalLength.mean()

Solution:

# average of all numeric columns
print df_iris.mean()

Solution:

# rows corresponding to petal length outliers
print df_iris[df_iris.PetalLength > 1.5*df_iris.PetalLength.mean()]

Solution:

# group-wise standard deviation
grouped = df_iris.groupby(df_iris.Name)
iris_std_by_name = grouped.aggregate(pd.DataFrame.std)
print iris_std_by_name

# or in a single row
print df_iris.groupby(df_iris.Name).aggregate(pd.DataFrame.std)

Solution:

# grouped outliers
for name, dataframe in df_iris.groupby(df_iris.Name):
    print dataframe[dataframe.PetalLength > 1.5*df_iris.PetalLength.mean()]

Solution:

# WARNING: there are *no* group-wise outliers!

# group-wise outliers
for name, dataframe in df_iris.groupby(df_iris.Name):
    print dataframe[dataframe.PetalLength > 1.5*dataframe.PetalLength.mean()]

# using groupby-aggregate-merge
grouped = df_iris.groupby(df_iris.Name, as_index=False)
petlen_mean_by_name = grouped.aggregate(pd.DataFrame.mean)
merged = pd.merge(iris, petlen_mean_by_name, on="Name")
# use `print merged.columns` to take a peek at the column names; the
# column_x columns come from the left table (i.e. the original iris
# DataFrame), the column_y ones come # from the right table (i.e. the
# group-wise means)
print merged[merged.PetalLength_x > 1.5 * merged.PetalLength_y]

Solutions:

import pandas as pd
import matplotlib.pyplot as plt
df_gene = pd.read_csv("gene_table.txt")

Solution:

# number of genes (there is exactly one per row)
print df_gene.shape[0]

Solution:

# minimum, maximum, average and median number of isoforms per gene
print df_gene.transcript_count.min(), \
      df_gene.transcript_count.max(), \
      df_gene.transcript_count.mean(), \
      df_gene.transcript_count.median()

Solution:

# plot a histogram of the number of known isoforms per gene
df_gene.transcript_count.plot(kind="hist",bins=100)
plt.show()

Solution:

# computes, for each gene_biotype, the number of associated genes
grouped = df_gene.groupby(df_gene.gene_biotype)
grouped_number_by_biotype = grouped.aggregate(pd.DataFrame.count)
print grouped_number_by_biotype

Solution:

rows = []
for biotype, subdf in df_gene.groupby(df_gene.gene_biotype):
    rows.append((len(subdf), biotype))

# alternative 1
rows.sort()
rows.reverse()
print rows

# alternative 2, more advanced
print sorted(rows, reverse=True)

Solution:

print len(df_gene.chromosome.unique())

# or equivalently (the result is a numpy array)
print df_gene.shape[0]

Solution: almost identical to the solution of point 5.

Solution:

for chromosome, subdf in df_gene.groupby(df_gene.chromosome):
    num_plus = float(len(subdf[subdf.strand == "+"]))
    perc_plus = 100 * num_plus / len(subdf)
    print chromosome, perc_plus

Solution:

df_gene.groupby(df_gene.gene_biotype, as_index=False) \
       .aggregate(pd.DataFrame.mean)[['gene_biotype', 'transcript_count']]

Solutions:

import pandas as pd
import matplotlib.pyplot as plt
df_tt = pd.read_csv("transcript_table.txt")

Solution:
```
print df_tt.shape[0]
```

Solution:

# minimum, maximum, average and median length of human transcripts
print df_tt.transcript_length.min(), \
      df_tt.transcript_length.max(), \
      df_tt.transcript_length.mean(), \
      df_tt.transcript_length.median()

Solution:

# minimum, maximum, average and median length of the CDS of human transcripts (excluding values equal to 0)
print (df_tt.cds_length[df_tt.cds_length>0]).min(), \
      (df_tt.cds_length[df_tt.cds_length>0]).max(), \
      (df_tt.cds_length[df_tt.cds_length>0]).mean(), \
      (df_tt.cds_length[df_tt.cds_length>0]).median()

Solution:

# computes the percentage of human transcripts with a CDS length that is a multiple of 3 (excluding values equal to 0)
cds_number = df_tt.cds_length[df_tt.cds_length>0].shape[0]
cds_multiple_number = df_tt.cds_length[df_tt.cds_length>0][df_tt.cds_length%3==0].shape[0]
print (float(cds_multiple_number)/float(cds_number))*100
# Is the percentage equal to 100%?
# How can you explain coding sequences whose length is not multiple of 3? (remember the rules of the genetic code)

Solution:

# minimum, maximum, average and median length of the UTRs of human transcripts (excluding values equal to 0)
# for 5' UTRs
filter_df = df_tt[df_tt.utr5_length>0]
print filter_df.utr5_length.min(), \
      filter_df.utr5_length.max(), \
      filter_df.utr5_length.mean(), \
      filter_df.utr5_length.median()

# for 3' UTRs
filter_df = df_tt[df_tt.utr3_length>0]
print filter_df.utr3_length.min(), \
      filter_df.utr3_length.max(), \
      filter_df.utr3_length.mean(), \
      filter_df.utr3_length.median()

# for both 5' and 5' UTRs (calculating the sum of lengths)
filter_df = df_tt[df_tt.utr5_length>0][df_tt.utr3_length>0]
sum_of_lengths = filter_df.utr5_length + filter_df.utr3_length
print sum_of_lengths.min(), \
      sum_of_lengths.max(), \
      sum_of_lengths.mean(), \
      sum_of_lengths.median()

Solution:

# computes, for each transcript_biotype, the number of associated
#transcripts (a histogram), and prints the transcript_biotype with the
#number of associated transcripts in decreasing order
grouped = df_tt.groupby(df_tt.transcript_biotype)
grouped_number_by_biotype = grouped.size() # the result is a series
grouped_number_by_biotype.sort(ascending=0) # sort the series
print grouped_number_by_biotype

Solution:

# computes, for each transcript_biotype, the average transcript length, and prints the results in increasing order
grouped = df_tt.groupby(df_tt.transcript_biotype)
grouped_number_by_biotype = grouped.aggregate(pd.DataFrame.mean)
sorted_number_by_biotype = grouped_number_by_biotype.sort("transcript_length")
print sorted_number_by_biotype.transcript_length

Solution:

# computes, for protein_coding transcripts, the average length of the 5'UTR, CDS and 3' UTR
filter_df = df_tt[df_tt.transcript_biotype=="protein_coding"]
print filter_df.utr5_length.mean(), \
      filter_df.cds_length.mean(), \
      filter_df.utr3_length.mean()

Solution:

# computes, for each transcript_biotype and considering only canonical transcripts, the average number of exons
filter_df = df_tt[df_tt.canonical_flag=="T"]
grouped = filter_df.groupby(filter_df.transcript_biotype)
grouped_number_by_biotype = grouped.aggregate(pd.DataFrame.mean)
print grouped_number_by_biotype.exon_count

Solutions:

import pandas as pd
import matplotlib.pyplot as plt
df_exon = pd.read_csv("exon_table.txt")

Solution:

df_length = df_exon.exon_chrom_start - df_exon.exon_chrom_end + 1
print df_length.min(), df_length.max(), df_length.mean(), df_length.median()

Solution:

name = raw_input("write a trascript name :")

filtered = df_exon[df_exon.transcript_name == name]
filtered_length = filtered.exon_chrom_start - filtered.exon_chrom_end + 1
merged = df.merge(filtered, filtered_length)
merged.sort(["exon_chrom_start"])

Solutions:

ts = pd.read_csv("expression_timeseries.txt", sep="\t")

Solution:

# number of genes
print ts.shape[0]

# number of time steps (just count how many columns end by 'h')
print len([c for c in ts.columns if c[-1] == "h"])

Solution:

# if there are NaNs, then dropna will remove at least one row
print "got nans?", ts.shape[0] != ts.dropna().shape[0]

# how many rows with at least one NaN are there? (just to double check)
print ts.shape[0] - ts.dropna().shape[0]

# fill in the missing values
ts = ts.ffill()

Note

In the following solutions we will assume that ffill() has already been called.

Solution: there are actually two expression profiles for the "ZFX" gene:

>>> expr_zfx = ts[ts["Gene Symbol"] == "ZFX"]
>>> expr_zfx
      SPOT Gene Symbol     0h   0.5h     3h     6h    12h
0        1         ZFX -0.092  0.126  0.109  0.043  0.074
1739  1740         ZFX -0.487  0.298 -0.058 -0.158 -0.155

Let’s plot the first; the other can be dealt with similarly:

# first is a Series
first = expr_zfx.iloc[0]
numbers = first[2:]
numbers.plot(kind="line")
plt.savefig("output.png")

Solution: let’s take the mean over the time steps:
```
means = ts[ts.columns[2:]].mean(axis=1)
```
Here the first part extracts only those columns that encode expression measurements (from the third onwards), while axis=1 specifies that the average should be taken by averaging over columns, rather than over rows as we are used to. The latter case corresponds to axis=0, and is the default.

Now:
```
variances = ts[ts.columns[2:]].var(axis=1)
```
does the same thing for the variance.

Let’s plot the mean and the variance:
```
means.plot(kind="line")
plt.savefig("means.png")

# clear the screen between plots
plt.clf()

# same thing for variance here
```

Warning

The above exercise is slightly bugged. In order to be solved “the Pandas way”, it requires more advanced material that is not provided by the notes. I gave a fully Pandas-approved solution.

Solution:

# we know that the only columns we care about are from the third
# onwards -- for simplicity
for column in ts.columns[2:]:
    index_of_max = ts[column].argmax()
    print column, ts.iloc[index_of_max]

Unfortunately most of the gene names are NaN; the solution is correct nevertheless.

Solution: same as below, except that the inner loop keeps track of the index of the maximally correlated gene.

Very elementary solution, it will take a long time to complete:

ts = pd.read_csv("expression_timeseries.txt", sep="\t")
num_rows = ts.shape[0] # set this to 100 or so to make it faster

corr = []
for i in range(num_rows):
    row_i = ts.iloc[i][2:].astype(float)
    corr_i = []
    for j in range(num_rows):
        row_j = ts.iloc[j][2:].astype(float)
        corr_i.append(row_i.corr(row_j))
    corr.append(corr_i)

plt.imshow(pd.DataFrame(corr).values, cmap="gray", interpolation="nearest")
plt.savefig("corr.png")

Left to the reader.