INFO390C DNDS Pset05

INFO390C_dNdS_pset05 11/5/24, 11:11 PM
Instructions
This problem set is provided in the form of a Jupyter notebook. Problems are posed
within this notebook file and you are expected to provide code and/or written answers
when prompted. Remember that you can use Markdown cells to format written
responses where necessary.
Before submitting your assignment, be sure to do a clean run of your notebook and
verify that your cell outputs (e.g., prints, figures, tables) are correctly shown. To do
a clean run, click Kernel→Restart Kernel and Run All Cells....
You are required to submit this notebook to Gradescope in two forms:
1. Submit a PDF of the completed notebook. To produce a PDF, you can use File→Save
and Export Notebook As...→HTML and then convert the HTML file to a PDF using
your preferred web browser. Verify that your code, written answers, and cell
outputs are visible in the submitted PDF.
2. Submit a zip file (including the .ipynb file) of this assignment to Gradescope.
Online Resources and Collaborators

Please list the online resources you used and the names of other students you
collaborated with while working on this problem set.
Online resources:
Student Names:
In [ ]: import pandas as pd
from Bio import SeqIO
import numpy as np
from copy import deepcopy
The purpose of this notebook is to write a series of functions to calculate the dN/dS ratio
of two coding nucleotide sequences.
We will then compute the dN/dS ratio of two proteins from the SARS-CoV-2 genome, to
determine whether the genes are under selection, and of what type.
01. Read in the two sequences (20 points)

You are provided with two files that contain the alignments for the two proteins we will be
studying: the gene which encodes the S protein, and the gene called "ORF1ab", which
encodes several proteins essential for viral replication inside the cell (yes, it's weird
that one open reading frame encodes several proteins; this does not typically happen, and
is a feature of viruses and the covid-19 virus in particular).
https://en.wikipedia.org/wiki/ORF1ab
Write a function to read in these files and prepare the sequences for downstream
calculation (you may want to write this function iteratively with the downstream
calculation functions). I recommend converting the sequences into numpy arrays of type
str, where one nucleotide is one entry in the array, as this allows for easy indexing.
Your function should contain the following checks:
remove any positions from consideration where either sequence has a gap
ensure that the length of both sequences is divisible by 3 after removal of the gaps
ensure that both sequences are the same length
(20 points)
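The three checks can be illustrated on a pair of toy sequences. This is a minimal pure-Python sketch rather than the numpy approach recommended in the text; the sequences `aln1` and `aln2` are hypothetical and are not taken from the provided files.

```python
# Toy aligned sequences (hypothetical, not taken from the provided files).
aln1 = "ATG-CCGTA"
aln2 = "ATGACC-TA"

# Aligned rows must have equal length before columns can be compared.
if len(aln1) != len(aln2):
    raise ValueError("aligned sequences differ in length")

# Keep only the columns where neither sequence has a gap.
columns = [(a, b) for a, b in zip(aln1, aln2) if a != "-" and b != "-"]
kept1 = "".join(a for a, _ in columns)
kept2 = "".join(b for _, b in columns)

# Codon-based analysis needs whole codons, hence the divisible-by-3 check.
is_whole_codons = len(kept1) % 3 == 0
print(kept1, kept2, is_whole_codons)
```

Here two gapped columns are dropped, leaving 7 positions, so the divisible-by-3 check fails. A real function would have to decide how to handle that case.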
In [ ]: def preprocess_alignment(alignment_file):
    """
    Parameters:
    alignment_file: str, path to file containing alignments
    Returns:
    seq1, seq2: np.array of type str, containing positions in the sequence amenabl
    """
    sequences = []
    for record in SeqIO.parse(alignment_file, "fasta"):
        sequences.append(str(record.seq))
    seq_1, seq_2 = sequences[0], sequences[1]
    # the aligned sequences must be the same length to be compared position by position
    if len(seq_1) != len(seq_2):
        raise ValueError("sequences are not of equal length")
    # convert to numpy arrays for easy indexing
    seq1 = np.array(list(seq_1), dtype=str)
    seq2 = np.array(list(seq_2), dtype=str)
    # determine which positions have gaps in either sequence
    gaps = (seq1 == '-') | (seq2 == '-')
    # remove gapped positions from sequences
    seq1 = seq1[~gaps]
    seq2 = seq2[~gaps]
    # check that the gap-free sequences are a multiple of 3 in length
    if len(seq1) % 3 != 0:
        raise ValueError("sequence length is not a multiple of 3")
    return seq1, seq2
In [ ]: seq1, seq2 = preprocess_alignment("ORFab_alignment.fasta")

print(len(seq1))
print(len(seq2))
13206
13206
In [ ]: seq1, seq2 = preprocess_alignment("S_protein_alignment.fasta")
02. Count the number of synonymous and non-synonymous sites in the first sequence (30 points)
Write a function that, given a codon and a nucleotide position (0, 1, or 2) within the
codon, computes the fraction of possible changes at that site that are synonymous and
non-synonymous (15 points). The codon table below is given.
You will notice that some sequences have an "N" in some position. This, and other non-
traditional nucleotide characters, are introduced as a placeholder when the results of the
sequencing experiment and/or variant calling pipeline are inconclusive about the
nucleotide present in a particular position. For the purposes of this calculation, we are
going to exclude codons with an "N" from both the calculation of number of S vs. NS
sites, and number of S vs. NS mutations, since we can't be certain about their sequence.
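One way to apply this exclusion is to split the sequence into codons and keep only those made up entirely of A, C, G, and T. The helper names `split_codons` and `is_unambiguous` and the toy sequence are hypothetical, chosen only for illustration.

```python
VALID_BASES = set("ACGT")

def split_codons(seq):
    """Split a sequence (length divisible by 3) into 3-letter codons."""
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]

def is_unambiguous(codon):
    """True if every character of the codon is A, C, G, or T."""
    return set(codon) <= VALID_BASES

toy = "ATGGCNTTAGGT"  # hypothetical sequence with one ambiguous codon
codons = split_codons(toy)
usable = [c for c in codons if is_unambiguous(c)]
print(codons)
print(usable)
```

Checking against the full set of valid bases also excludes other non-traditional characters, not only "N".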
In [ ]: # provide the dictionary below - this gives all codons to their amino acid and/or
codon_table = {
'TCA': 'S', # Serine
'TCC': 'S', # Serine
'TCG': 'S', # Serine
'TCT': 'S', # Serine
'TTC': 'F', # Phenylalanine
'TTT': 'F', # Phenylalanine
'TTA': 'L', # Leucine
'TTG': 'L', # Leucine
'TAC': 'Y', # Tyrosine
'TAT': 'Y', # Tyrosine
'TAA': '*', # Stop
'TAG': '*', # Stop
'TGC': 'C', # Cysteine
'TGT': 'C', # Cysteine
'TGA': '*', # Stop
'TGG': 'W', # Tryptophan
'CTA': 'L', # Leucine
'CTC': 'L', # Leucine
'CTG': 'L', # Leucine
'CTT': 'L', # Leucine
'CCA': 'P', # Proline
'CCC': 'P', # Proline
'CCG': 'P', # Proline
'CCT': 'P', # Proline
'CAC': 'H', # Histidine
'CAT': 'H', # Histidine
'CAA': 'Q', # Glutamine
'CAG': 'Q', # Glutamine
'CGA': 'R', # Arginine
'CGC': 'R', # Arginine
'CGG': 'R', # Arginine
'CGT': 'R', # Arginine
'ATA': 'I', # Isoleucine
'ATC': 'I', # Isoleucine
'ATT': 'I', # Isoleucine
'ATG': 'M', # Methionine
'ACA': 'T', # Threonine

'ACC': 'T', # Threonine
'ACG': 'T', # Threonine
'ACT': 'T', # Threonine
'AAC': 'N', # Asparagine
'AAT': 'N', # Asparagine
'AAA': 'K', # Lysine
'AAG': 'K', # Lysine
'AGC': 'S', # Serine
'AGT': 'S', # Serine
'AGA': 'R', # Arginine
'AGG': 'R', # Arginine
'GTA': 'V', # Valine
'GTC': 'V', # Valine
'GTG': 'V', # Valine
'GTT': 'V', # Valine
'GCA': 'A', # Alanine
'GCC': 'A', # Alanine
'GCG': 'A', # Alanine
'GCT': 'A', # Alanine
'GAC': 'D', # Aspartic Acid
'GAT': 'D', # Aspartic Acid
'GAA': 'E', # Glutamic Acid
'GAG': 'E', # Glutamic Acid
'GGA': 'G', # Glycine
'GGC': 'G', # Glycine
'GGG': 'G', # Glycine
'GGT': 'G' # Glycine
}
In [ ]: def calculate_num_synonymous_vs_nonsynonymous(codon, position):
    """
    Parameters:
    codon: str of length 3 giving the nucleotide codon
    position: int [0,1,or 2] giving the position in the codon
    Returns:
    total_S, total_NS: float
    the fraction of possible changes at the given position that are synonymous
    """
    # error check for codons not in the table - ie, non-standard nucleotides
    if codon not in codon_table:
        return None, None
    amino_acid = codon_table[codon]
    o_nucleotide = codon[position]
    # determine which three nucleotides are not present at the current site
    possible_substitutions = [n for n in ['A', 'T', 'C', 'G'] if n != o_nucleotide]
    # count the number of S or NS substitutions that would be made by all possible
    total_NS = 0
    total_S = 0
    for n in possible_substitutions:
        new_codon = list(codon)
        new_codon[position] = n
        new_codon = ''.join(new_codon)
        # every single-base change of a valid codon is in the table (stops are '*')
        new_amino_acid = codon_table[new_codon]
        if new_amino_acid == amino_acid:
            total_S += 1
        else:
            total_NS += 1
    # divide by 3 for the 3 possible changes
    return total_S/3, total_NS/3
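A few worked cases make the per-site fractions concrete. The sketch below is self-contained: it rebuilds the standard genetic code compactly under the hypothetical name `TABLE` (intended to match `codon_table`) and uses a hypothetical helper `site_fractions`. Changes to a stop codon are treated as non-synonymous, which is an assumption.

```python
from itertools import product

# Standard genetic code, built compactly in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def site_fractions(codon, position):
    """Fractions of the 3 single-base changes at `position` that are
    synonymous and non-synonymous (a change to a stop is non-synonymous)."""
    original = TABLE[codon]
    syn = 0
    for base in BASES:
        if base == codon[position]:
            continue
        mutant = codon[:position] + base + codon[position + 1:]
        if TABLE[mutant] == original:
            syn += 1
    return syn / 3, (3 - syn) / 3

print(site_fractions("GGT", 2))  # glycine, third position: fully synonymous
print(site_fractions("ATG", 2))  # methionine: every change alters the amino acid
print(site_fractions("TTA", 0))  # leucine, first position: only CTA stays leucine
```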
Write a function that iterates through a sequence and counts the total synonymous and
non-synonymous sites in that sequence (by summing up the values returned by your
previous function). The start and stop codon should be ignored. (15 points)
In [ ]: # iterate through the sequence in chunks of 3
# ignore the start and end codons
def compute_total_sites(seq1):
    """
    Parameters:
    seq1: np.array of type str containing the sequence
    returns:
    total_S_sites, total_NS_sites: float, number of possible synonymous vs. non-sy
    """
    total_S_sites = 0
    total_NS_sites = 0
    for i in range(0, len(seq1) - 2, 3):
        codon = ''.join(seq1[i:i+3])
        # skip stop codons and ATG codons; .get avoids a KeyError on codons with N
        if codon_table.get(codon) == '*' or codon == "ATG":
            continue
        for position in range(3):
            s, ns = calculate_num_synonymous_vs_nonsynonymous(codon, position)
            if s is None and ns is None:
                continue
            total_S_sites += s
            total_NS_sites += ns
    return total_S_sites, total_NS_sites
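One possible reading of "the start and stop codon should be ignored" is to skip the first codon by position, plus any stop codon, so that internal ATG (methionine) codons still contribute sites. The sketch below follows that reading on a toy sequence. It works on plain strings, and the names `TABLE`, `codon_sites`, and `total_sites` are hypothetical.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def codon_sites(codon):
    """Synonymous and non-synonymous site counts for one codon (they sum to 3)."""
    s_sites = 0.0
    for pos in range(3):
        for base in BASES:
            if base != codon[pos]:
                mutant = codon[:pos] + base + codon[pos + 1:]
                if TABLE[mutant] == TABLE[codon]:
                    s_sites += 1 / 3
    return s_sites, 3 - s_sites

def total_sites(seq):
    """Sum site counts over codons, skipping the first codon, stop codons,
    and codons containing characters outside A/C/G/T."""
    total_s = total_ns = 0.0
    for i in range(3, len(seq) - 2, 3):  # start at 3 to skip the start codon
        codon = seq[i:i + 3]
        if codon not in TABLE or TABLE[codon] == "*":
            continue
        s, ns = codon_sites(codon)
        total_s += s
        total_ns += ns
    return total_s, total_ns

# Toy sequence: start codon, GGT, an internal ATG, an ambiguous codon, a stop.
toy_s, toy_ns = total_sites("ATGGGTATGNNNTAA")
print(toy_s, toy_ns)
```

Only GGT and the internal ATG are counted here, so the two totals sum to 6 sites (3 per counted codon).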
03. Compute the number of synonymous and non-synonymous changes that have occurred (15 points)
Write a function that iterates through two sequences and computes the number of
OBSERVED synonymous or non-synonymous changes.
In [ ]: def compute_total_mutations(seq1, seq2):
    """
    Parameters:
    seq1, seq2: np.arrays of type str containing the sequences to compare
    returns:
    total_S, total_NS: float, number of OBSERVED synonymous vs. non-synonymous mut
    """
    total_S = 0
    total_NS = 0
    for i in range(0, len(seq1) - 2, 3):
        codon1 = ''.join(seq1[i:i+3])
        codon2 = ''.join(seq2[i:i+3])
        # skip codons with an N, stop codons in seq1, and ATG codons in seq2
        if 'N' in codon1 or 'N' in codon2 or codon_table.get(codon1) == '*' or codon2 == "ATG":
            continue
        amino_acid1 = codon_table.get(codon1)
        amino_acid2 = codon_table.get(codon2)
        # skip codons with any other non-standard character
        if amino_acid1 is None or amino_acid2 is None:
            continue
        # note: codon pairs that are identical also fall into the first branch
        if amino_acid1 == amino_acid2:
            total_S += 1
        else:
            total_NS += 1
    return total_S, total_NS
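The function above places every compared codon pair in one of the two counts, including pairs that are identical. An alternative reading of "observed changes" counts only codons that actually differ between the two sequences. The sketch below shows that variant on plain strings. The names `TABLE` and `count_observed_changes` are hypothetical, and a codon with several differing bases is counted as a single change, which is a simplification.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def count_observed_changes(seq1, seq2):
    """Count codons that differ between two equal-length sequences, classed as
    synonymous (same amino acid) or non-synonymous (different amino acid).
    Identical codons are not counted as changes."""
    n_syn = n_nonsyn = 0
    for i in range(0, len(seq1) - 2, 3):
        c1, c2 = seq1[i:i + 3], seq2[i:i + 3]
        if c1 not in TABLE or c2 not in TABLE:
            continue  # ambiguous characters such as N
        if c1 == c2:
            continue  # no change observed at this codon
        if TABLE[c1] == TABLE[c2]:
            n_syn += 1
        else:
            n_nonsyn += 1
    return n_syn, n_nonsyn

# Toy pair: GGT->GGC is synonymous, TTA->TTT is not, CCN is skipped.
observed = count_observed_changes("ATGGGTTTACCN", "ATGGGCTTTCCA")
print(observed)
```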
04. Compute the dN/dS ratio and interpret the results (35 points)
Write a function that computes the dN/dS ratio for a pair of sequences, given the
functions you've written above (10 points)
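In [ ]: def compute_dn_ds(seq1, seq2):
    # possible sites come from seq1 alone; observed changes compare seq1 with seq2
    total_S_sites, total_NS_sites = compute_total_sites(seq1)
    total_S, total_NS = compute_total_mutations(seq1, seq2)
    # changes per site, synonymous and non-synonymous
    dS = total_S / total_S_sites
    dN = total_NS / total_NS_sites
    ratio = dN / dS
    return ratio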
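Written as a formula, the quantity returned by `compute_dn_ds` is the following. The symbols are introduced here only for illustration: $S_{\text{obs}}$ and $N_{\text{obs}}$ stand for `total_S` and `total_NS`, and $S_{\text{sites}}$ and $N_{\text{sites}}$ stand for `total_S_sites` and `total_NS_sites`.

```latex
\[
dS = \frac{S_{\text{obs}}}{S_{\text{sites}}}, \qquad
dN = \frac{N_{\text{obs}}}{N_{\text{sites}}}, \qquad
\omega = \frac{dN}{dS}
\]
```

No correction for multiple substitutions at the same site is applied in this calculation, and the ratio is undefined when $dS = 0$.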
Apply your function to the S protein sequences and the ORF1ab protein sequences (10
points). Does the S protein show evidence of positive selection, negative selection, or
neither (5 points)? Does the ORF1ab sequence show evidence of positive selection,
negative selection, or neither (5 points)?
Speculate about why the S protein displays the signature it does, using the information
found here (5 points): https://en.wikipedia.org/wiki/Coronavirus_spike_protein
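The usual qualitative reading of the ratio can be sketched as a small helper. The function name `interpret_dn_ds` and the `tolerance` band around 1 are hypothetical choices made for illustration; this is not a statistical test, and the example values are not results from this assignment.

```python
def interpret_dn_ds(ratio, tolerance=0.1):
    """Map a dN/dS ratio to the usual qualitative reading.

    `tolerance` is an arbitrary illustrative band around 1.
    """
    if ratio < 1 - tolerance:
        return "negative (purifying) selection"
    if ratio > 1 + tolerance:
        return "positive selection"
    return "approximately neutral"

# Illustrative values only.
for example in (0.2, 1.0, 2.5):
    print(example, interpret_dn_ds(example))
```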
In [ ]: ## put code here for ORFab alignment

seq1_orf, seq2_orf = preprocess_alignment("ORFab_alignment.fasta")
ratio_orf = compute_dn_ds(seq1_orf, seq2_orf)

print(ratio_orf)
0.00040614193402633365
Put Answer with explanation here: The low ratio suggests strong negative selection on the
ORF1ab protein. This is expected, since ORF1ab encodes essential enzymes and proteins
required for viral replication and transcription.
In [ ]: ## put code here

seq1_s, seq2_s = preprocess_alignment("S_protein_alignment.fasta")
ratio_s = compute_dn_ds(seq1_s, seq2_s)
print(ratio_s)
0.00662276822276831
Put Answer with explanation here: The ratio here is also low, which indicates negative
selection, although it is higher than the ratio for ORF1ab. The S protein is subject to
purifying selection to maintain its functional integrity.
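Put Answer on your speculation about why the S protein displays the signature here: The
Wikipedia source tells us that the S protein is responsible for binding to the ACE2
receptor, allowing the virus to enter host cells. Mutations that disrupt this binding
would likely prevent cell entry, so most amino-acid changes are expected to be selected
against, which is consistent with the low ratio.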
