INFO390C DNDS Pset05
INFO390C DNDS Pset05
INFO390C DNDS Pset05
Instructions
This problem set is provided in the form of a Jupyter notebook. Problems are posed
within this notebook file and you are expected to provide code and/or written answers
when prompted. Remember that you can use Markdown cells to format written
responses where necessary.
Before submitting your assignment, be sure to do a clean run of your notebook and
verify that your cell outputs (e.g., prints, figures, tables) are correctly shown. To do
a clean run, click Kernel→Restart Kernal and Run All Cells....
1. Submit a PDF of the completed notebook. To produce a PDF, you can use File→Save
and Export Notebook As...→HTML and then convert the HTML file to a PDF using
your preferred web browser. Verify that your code, written answers, and cell
outputs are visible in the submitted PDF.
2. Submit a zip file (including the .ipynb file) of this assignment to Gradescope.
Online resoures:
Student Names:
In [ ]: import pandas as pd
from Bio import SeqIO
import numpy as np
from copy import deepcopy
The purpose of this notebook is to write a series of functions to calculate the dN/dS ratio
of two coding nucleotide sequences.
We will then compute the dN/dS ratio of two proteins from the Sars-Cov-2 genome, to
determine whether the genes are under selection, and of what type.
file:///Users/chaitanyasachdeva/Downloads/pset05/INFO390C_dNdS_pset05.html Page 1 of 9
INFO390C_dNdS_pset05 11/5/24, 11:11 PM
Write a function to read in these files and prepare the sequences for downstream
calculation (you may want to write this function iteratively with the downstream
calculation functions). I recommend converting the sequences into numpy arrays of type
str, where one nucleotide is one entry in the array, as this allows for easy indexing.
remove any positions from consideration where either sequence has a gap
ensure that the length of both sequences is divisible by 3 after removal of the gaps
ensure that both sequences are the same length
(20 points)
file:///Users/chaitanyasachdeva/Downloads/pset05/INFO390C_dNdS_pset05.html Page 2 of 9
INFO390C_dNdS_pset05 11/5/24, 11:11 PM
In [ ]: def preprocess_alignment(alignment_file):
"""
Parameters:
alignment_file: str, path to file containing alignments
Returns:
seq1, seq2: np.array of type str, containing positions in the sequence amenabl
"""
sequences = []
for record in SeqIO.parse(alignment_file, "fasta"):
sequences.append(str(record.seq))
seq_1, seq_2 = sequences[0], sequences[1]
# check that your sequences are of equal length and then a multiple of 3
if len(seq1) != len(seq2):
print("sequence noe equal ")
if (len(seq1)) %3 != 0:
return print("not a multiple of 3 ")
13206
13206
file:///Users/chaitanyasachdeva/Downloads/pset05/INFO390C_dNdS_pset05.html Page 3 of 9
INFO390C_dNdS_pset05 11/5/24, 11:11 PM
Write a function that, given a codon and a nucleotide position (0, 1, or 2) within the
codon, compute the fraction of possible changes in that site that are synonymous and
non-synonymous (15 points). The codon table below is given.
You will notice that some sequences have an "N" in some position. This, and other non-
traditional nucleotide characters, are introduced as a placeholder when the results of the
sequencing experiment and/or variant calling pipeline are inconclusive about the
nucleotide present in a particular position. For the purposes of this calculation, we are
going to exclude codons with an "N" from both the calculation of number of S vs. NS
sites, and number of S vs. NS mutations, since we can't be certain about their sequence.
In [ ]: # provide the dictionary below - this gives all codons to their amino acid and/or
codon_table = {
'TCA': 'S', # Serine
'TCC': 'S', # Serine
'TCG': 'S', # Serine
'TCT': 'S', # Serine
'TTC': 'F', # Phenylalanine
'TTT': 'F', # Phenylalanine
'TTA': 'L', # Leucine
'TTG': 'L', # Leucine
'TAC': 'Y', # Tyrosine
'TAT': 'Y', # Tyrosine
'TAA': '*', # Stop
'TAG': '*', # Stop
'TGC': 'C', # Cysteine
'TGT': 'C', # Cysteine
'TGA': '*', # Stop
'TGG': 'W', # Tryptophan
'CTA': 'L', # Leucine
'CTC': 'L', # Leucine
'CTG': 'L', # Leucine
'CTT': 'L', # Leucine
'CCA': 'P', # Proline
'CCC': 'P', # Proline
'CCG': 'P', # Proline
'CCT': 'P', # Proline
'CAC': 'H', # Histidine
'CAT': 'H', # Histidine
'CAA': 'Q', # Glutamine
'CAG': 'Q', # Glutamine
'CGA': 'R', # Arginine
'CGC': 'R', # Arginine
'CGG': 'R', # Arginine
'CGT': 'R', # Arginine
'ATA': 'I', # Isoleucine
'ATC': 'I', # Isoleucine
'ATT': 'I', # Isoleucine
'ATG': 'M', # Methionine
file:///Users/chaitanyasachdeva/Downloads/pset05/INFO390C_dNdS_pset05.html Page 4 of 9
INFO390C_dNdS_pset05 11/5/24, 11:11 PM
file:///Users/chaitanyasachdeva/Downloads/pset05/INFO390C_dNdS_pset05.html Page 5 of 9
INFO390C_dNdS_pset05 11/5/24, 11:11 PM
Returns:
total_S, total_NS: float
the fraction of possible changes at the given position that are synonymous
"""
# error check for codons not in the table - ie, non-standard nucleotides
# code here
if 'N' in codon or codon not in codon_table:
print("invalid!!!!")
return None , None
amino_acid = codon_table[codon]
o_nucleotide = codon[position]
# determine which three nucleotides are not present at the current site
# code here
possible_substitutions = [n for n in ['A', 'T', 'C', 'G'] if n != o_nucleotide
Write a function that iterates through a sequence and counts the total synonymous and
non-synonymous sites in that sequence (by summing up the values returned by your
previous function). The start and stop codon should be ignored. (15 points)
file:///Users/chaitanyasachdeva/Downloads/pset05/INFO390C_dNdS_pset05.html Page 6 of 9
INFO390C_dNdS_pset05 11/5/24, 11:11 PM
def compute_total_sites(seq1):
"""
Parameters:
seq1: np.array of type str containing the sequence
returns:
total_S_sites, total_NS_sites: float, number of possible synonymous vs. non-sy
"""
total_S_sites = 0
total_NS_sites = 0
for i in range(0, len(seq1) - 2, 3):
condon = ''.join(seq1[i:i+3])
if codon_table[condon] == '*' or condon == "ATG" :
continue
for i in range(3):
s, ns = calculate_num_synonymous_vs_nonsynonymous(condon , i )
if s is None and ns is None :
continue
total_S_sites += s
total_NS_sites += ns
file:///Users/chaitanyasachdeva/Downloads/pset05/INFO390C_dNdS_pset05.html Page 7 of 9
INFO390C_dNdS_pset05 11/5/24, 11:11 PM
returns:
total_S, total_NS: float, number of OBSERVED synonymous vs. non-synonymous mut
"""
total_S = 0
total_NS = 0
for i in range(0, len(seq1) - 2, 3):
codon1 = ''.join(seq1[i:i+3])
codon2 = ''.join(seq2[i:i+3])
if codon_table[codon1] == '*' or codon2 == "ATG" or 'N' in codon1 or 'N' i
continue
amino_acid1 = codon_table[codon1]
amino_acid2 = codon_table[codon2]
if amino_acid1 is None or amino_acid2 is None:
continue
if amino_acid1 == amino_acid2:
total_S += 1
else:
total_NS += 1
return ratio
file:///Users/chaitanyasachdeva/Downloads/pset05/INFO390C_dNdS_pset05.html Page 8 of 9
INFO390C_dNdS_pset05 11/5/24, 11:11 PM
Apply your function to the S protein sequences and the ORF1ab protein sequences (10
points). Does the S protein show evidence of positive selection, negative selection, or
neither (5 points)? Does the ORF1ab sequence show evidence of positive selection,
negative selection, or neither (5 points)?
Speculate about why the S protein displays the signature it does, using the information
found here (5 points): https://en.wikipedia.org/wiki/Coronavirus_spike_protein
0.00040614193402633365
Put Answer with explanation here: the low ratio suggest srong negative selection on the
ORF!ab protien This is expected since ORF1ab encodes essential enzymes and proteins
required for viral replication and transcription
print(ratio_s)
0.00662276822276831
Put Answer with explanation here: the raio here is also low which shows us negative
selection but better that the ORFab , Although the s protein is subject to purifying
selection to maintain its functional integrity.
Put Answer on your speculation about with S protein displays the signature here: the
wikipidea sources tells us that the s protein is responsible for Binding to the ACE2
receptor, allowing the virus to enter host cells
file:///Users/chaitanyasachdeva/Downloads/pset05/INFO390C_dNdS_pset05.html Page 9 of 9