INFO390C DNDS Pset05

Download as pdf or txt
Download as pdf or txt
You are on page 1of 9

INFO390C_dNdS_pset05 11/5/24, 11:11 PM

Instructions
This problem set is provided in the form of a Jupyter notebook. Problems are posed
within this notebook file and you are expected to provide code and/or written answers
when prompted. Remember that you can use Markdown cells to format written
responses where necessary.

Before submitting your assignment, be sure to do a clean run of your notebook and
verify that your cell outputs (e.g., prints, figures, tables) are correctly shown. To do
a clean run, click Kernel→Restart Kernal and Run All Cells....

You are required to submit this notebook to Gradescope in two forms:

1. Submit a PDF of the completed notebook. To produce a PDF, you can use File→Save
and Export Notebook As...→HTML and then convert the HTML file to a PDF using
your preferred web browser. Verify that your code, written answers, and cell
outputs are visible in the submitted PDF.
2. Submit a zip file (including the .ipynb file) of this assignment to Gradescope.

Online Resources and Collaborators


Please list the online resources you used and the names of other students you
collaborated with while working on this problem set.

Online resoures:
Student Names:

In [ ]: import pandas as pd
from Bio import SeqIO
import numpy as np
from copy import deepcopy

The purpose of this notebook is to write a series of functions to calculate the dN/dS ratio
of two coding nucleotide sequences.

We will then compute the dN/dS ratio of two proteins from the Sars-Cov-2 genome, to
determine whether the genes are under selection, and of what type.

file:///Users/chaitanyasachdeva/Downloads/pset05/INFO390C_dNdS_pset05.html Page 1 of 9
INFO390C_dNdS_pset05 11/5/24, 11:11 PM

01. read in the two sequences (20 points)


You are provided with two files that contain the alignments of the two proteins we will be
studying, the gene which encodes the S protein, and the gene called "ORF1ab" which
codes for the ORF1ab gene, which encodes several proteins essential for viral replication
inside the cell (yes, it's weird that one open reading frame encodes for several proteins,
this does not typically happen, and is a feature of viruses and the covid-19 virus in
particular). https://en.wikipedia.org/wiki/ORF1ab

Write a function to read in these files and prepare the sequences for downstream
calculation (you may want to write this function iteratively with the downstream
calculation functions). I recommend converting the sequences into numpy arrays of type
str, where one nucleotide is one entry in the array, as this allows for easy indexing.

Your function should contain the following checks:

remove any positions from consideration where either sequence has a gap
ensure that the length of both sequences is divisible by 3 after removal of the gaps
ensure that both sequences are the same length

(20 points)

file:///Users/chaitanyasachdeva/Downloads/pset05/INFO390C_dNdS_pset05.html Page 2 of 9
INFO390C_dNdS_pset05 11/5/24, 11:11 PM

In [ ]: def preprocess_alignment(alignment_file):
"""
Parameters:
alignment_file: str, path to file containing alignments

Returns:
seq1, seq2: np.array of type str, containing positions in the sequence amenabl
"""

sequences = []
for record in SeqIO.parse(alignment_file, "fasta"):
sequences.append(str(record.seq))
seq_1, seq_2 = sequences[0], sequences[1]

# convert to numpy arrays for easy indexing


seq1 = np.array(list(seq_1),dtype=str)
seq2 = np.array(list(seq_2),dtype=str)

# determine which positions have gaps in either sequences


gaps = (seq1 == '-') | (seq2 == '-')

# remove gapped positions from sequences


seq1 = seq1[~gaps]
seq2 = seq2[~gaps]

# check that your sequences are of equal length and then a multiple of 3
if len(seq1) != len(seq2):
print("sequence noe equal ")

if (len(seq1)) %3 != 0:
return print("not a multiple of 3 ")

return seq1, seq2

In [ ]: seq1, seq2 = preprocess_alignment("ORFab_alignment.fasta")


print(len(seq1))
print(len(seq2))

13206
13206

In [ ]: seq1, seq2 = preprocess_alignment("S_protein_alignment.fasta")

02. Count the number of synonymous and non-synonymous


sites in the first sequence (30 points)

file:///Users/chaitanyasachdeva/Downloads/pset05/INFO390C_dNdS_pset05.html Page 3 of 9
INFO390C_dNdS_pset05 11/5/24, 11:11 PM

Write a function that, given a codon and a nucleotide position (0, 1, or 2) within the
codon, compute the fraction of possible changes in that site that are synonymous and
non-synonymous (15 points). The codon table below is given.

You will notice that some sequences have an "N" in some position. This, and other non-
traditional nucleotide characters, are introduced as a placeholder when the results of the
sequencing experiment and/or variant calling pipeline are inconclusive about the
nucleotide present in a particular position. For the purposes of this calculation, we are
going to exclude codons with an "N" from both the calculation of number of S vs. NS
sites, and number of S vs. NS mutations, since we can't be certain about their sequence.

In [ ]: # provide the dictionary below - this gives all codons to their amino acid and/or
codon_table = {
'TCA': 'S', # Serine
'TCC': 'S', # Serine
'TCG': 'S', # Serine
'TCT': 'S', # Serine
'TTC': 'F', # Phenylalanine
'TTT': 'F', # Phenylalanine
'TTA': 'L', # Leucine
'TTG': 'L', # Leucine
'TAC': 'Y', # Tyrosine
'TAT': 'Y', # Tyrosine
'TAA': '*', # Stop
'TAG': '*', # Stop
'TGC': 'C', # Cysteine
'TGT': 'C', # Cysteine
'TGA': '*', # Stop
'TGG': 'W', # Tryptophan
'CTA': 'L', # Leucine
'CTC': 'L', # Leucine
'CTG': 'L', # Leucine
'CTT': 'L', # Leucine
'CCA': 'P', # Proline
'CCC': 'P', # Proline
'CCG': 'P', # Proline
'CCT': 'P', # Proline
'CAC': 'H', # Histidine
'CAT': 'H', # Histidine
'CAA': 'Q', # Glutamine
'CAG': 'Q', # Glutamine
'CGA': 'R', # Arginine
'CGC': 'R', # Arginine
'CGG': 'R', # Arginine
'CGT': 'R', # Arginine
'ATA': 'I', # Isoleucine
'ATC': 'I', # Isoleucine
'ATT': 'I', # Isoleucine
'ATG': 'M', # Methionine

file:///Users/chaitanyasachdeva/Downloads/pset05/INFO390C_dNdS_pset05.html Page 4 of 9
INFO390C_dNdS_pset05 11/5/24, 11:11 PM

'ACA': 'T', # Threonine


'ACC': 'T', # Threonine
'ACG': 'T', # Threonine
'ACT': 'T', # Threonine
'AAC': 'N', # Asparagine
'AAT': 'N', # Asparagine
'AAA': 'K', # Lysine
'AAG': 'K', # Lysine
'AGC': 'S', # Serine
'AGT': 'S', # Serine
'AGA': 'R', # Arginine
'AGG': 'R', # Arginine
'GTA': 'V', # Valine
'GTC': 'V', # Valine
'GTG': 'V', # Valine
'GTT': 'V', # Valine
'GCA': 'A', # Alanine
'GCC': 'A', # Alanine
'GCG': 'A', # Alanine
'GCT': 'A', # Alanine
'GAC': 'D', # Aspartic Acid
'GAT': 'D', # Aspartic Acid
'GAA': 'E', # Glutamic Acid
'GAG': 'E', # Glutamic Acid
'GGA': 'G', # Glycine
'GGC': 'G', # Glycine
'GGG': 'G', # Glycine
'GGT': 'G' # Glycine
}

file:///Users/chaitanyasachdeva/Downloads/pset05/INFO390C_dNdS_pset05.html Page 5 of 9
INFO390C_dNdS_pset05 11/5/24, 11:11 PM

In [ ]: def calculate_num_synonymous_vs_nonsynonymous(codon, position):


"""
Parameters:
codon: str of length 3 giving the nucleotide codon
position: int [0,1,or 2] giving the position in the codon

Returns:
total_S, total_NS: float
the fraction of possible changes at the given position that are synonymous

"""

# error check for codons not in the table - ie, non-standard nucleotides
# code here
if 'N' in codon or codon not in codon_table:
print("invalid!!!!")
return None , None
amino_acid = codon_table[codon]
o_nucleotide = codon[position]
# determine which three nucleotides are not present at the current site
# code here
possible_substitutions = [n for n in ['A', 'T', 'C', 'G'] if n != o_nucleotide

# count the number of S or NS substitutions that would be made by all possible


# code here3
total_NS =0
total_S = 0
for n in possible_substitutions:
new_codon = list(codon)
new_codon[position] = n
new_codon = ''.join(new_codon)
if new_codon in codon_table:
new_amino_acid = codon_table[new_codon]
if new_amino_acid == amino_acid:
total_S+=1
else :
total_NS+=1

#divide by 3 for the 3 possible changes


return total_S/3, total_NS/3

Write a function that iterates through a sequence and counts the total synonymous and
non-synonymous sites in that sequence (by summing up the values returned by your
previous function). The start and stop codon should be ignored. (15 points)

file:///Users/chaitanyasachdeva/Downloads/pset05/INFO390C_dNdS_pset05.html Page 6 of 9
INFO390C_dNdS_pset05 11/5/24, 11:11 PM

In [ ]: # iterate through the sequence in chunks of 3


# ignore the start and end codons

def compute_total_sites(seq1):
"""
Parameters:
seq1: np.array of type str containing the sequence

returns:
total_S_sites, total_NS_sites: float, number of possible synonymous vs. non-sy
"""
total_S_sites = 0
total_NS_sites = 0
for i in range(0, len(seq1) - 2, 3):
condon = ''.join(seq1[i:i+3])
if codon_table[condon] == '*' or condon == "ATG" :
continue
for i in range(3):
s, ns = calculate_num_synonymous_vs_nonsynonymous(condon , i )
if s is None and ns is None :
continue
total_S_sites += s
total_NS_sites += ns

return total_S_sites, total_NS_sites

03. Compute the number of synonymous and non-synonymous


changes that have occurred (15 points)
write a function that iterates through two sequences and computes the number of
OBSERVED synonymous or non-synonymous changes.

file:///Users/chaitanyasachdeva/Downloads/pset05/INFO390C_dNdS_pset05.html Page 7 of 9
INFO390C_dNdS_pset05 11/5/24, 11:11 PM

In [ ]: def compute_total_mutations(seq1, seq2):


"""
Parameters:
seq1, seq2: np.arrays of type str containing the sequences to compare

returns:
total_S, total_NS: float, number of OBSERVED synonymous vs. non-synonymous mut
"""
total_S = 0
total_NS = 0
for i in range(0, len(seq1) - 2, 3):
codon1 = ''.join(seq1[i:i+3])
codon2 = ''.join(seq2[i:i+3])
if codon_table[codon1] == '*' or codon2 == "ATG" or 'N' in codon1 or 'N' i
continue
amino_acid1 = codon_table[codon1]
amino_acid2 = codon_table[codon2]
if amino_acid1 is None or amino_acid2 is None:
continue
if amino_acid1 == amino_acid2:
total_S += 1
else:
total_NS += 1

return total_S, total_NS

04. Compute the dN/dS ratio and interpret the results(35


points)
Write a function that computes the dN/dS ratio for a pair of sequences, given the
functions you've written above (10 points)

In [ ]: def compute_dn_ds(seq1, seq2):


total_S_sites, total_NS_sites = compute_total_sites(seq1)
total_S, total_NS = compute_total_mutations(seq1, seq2)
dS = total_S / total_S_sites
dN = total_NS / total_NS_sites
ratio = dN / dS

return ratio

file:///Users/chaitanyasachdeva/Downloads/pset05/INFO390C_dNdS_pset05.html Page 8 of 9
INFO390C_dNdS_pset05 11/5/24, 11:11 PM

Apply your function to the S protein sequences and the ORF1ab protein sequences (10
points). Does the S protein show evidence of positive selection, negative selection, or
neither (5 points)? Does the ORF1ab sequence show evidence of positive selection,
negative selection, or neither (5 points)?

Speculate about why the S protein displays the signature it does, using the information
found here (5 points): https://en.wikipedia.org/wiki/Coronavirus_spike_protein

In [ ]: ## put code here for ORFab alignment


seq1_orf, seq2_orf = preprocess_alignment("ORFab_alignment.fasta")

ratio_orf = compute_dn_ds(seq1_orf, seq2_orf)


print(ratio_orf)

0.00040614193402633365

Put Answer with explanation here: the low ratio suggest srong negative selection on the
ORF!ab protien This is expected since ORF1ab encodes essential enzymes and proteins
required for viral replication and transcription

In [ ]: ## put code here


seq1_s, seq2_s = preprocess_alignment("S_protein_alignment.fasta")
ratio_s = compute_dn_ds(seq1_s, seq2_s)

print(ratio_s)

0.00662276822276831

Put Answer with explanation here: the raio here is also low which shows us negative
selection but better that the ORFab , Although the s protein is subject to purifying
selection to maintain its functional integrity.

Put Answer on your speculation about with S protein displays the signature here: the
wikipidea sources tells us that the s protein is responsible for Binding to the ACE2
receptor, allowing the virus to enter host cells

file:///Users/chaitanyasachdeva/Downloads/pset05/INFO390C_dNdS_pset05.html Page 9 of 9

You might also like

pFad - Phonifier reborn

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.


Alternative Proxies:

Alternative Proxy

pFad Proxy

pFad v3 Proxy

pFad v4 Proxy