Biopython Tutorial
Table of Contents
Basics of Biopython 1.1
First Steps 1.2
Using NCBI E-utilities 1.3
Diagnosing Sickle Cell Anemia 1.4
BLAST 1.5
Biopython Examples 1.6
Acknowledgements 1.7
Basics of Biopython
What is Biopython?
Biopython is a Python library for reading and writing many common biological data formats.
It contains some functionality to perform calculations, in particular on 3D structures.
License
This tutorial is distributed under the conditions of the Creative Commons Attribution
Share-alike License 4.0 (CC-BY-SA 4.0).
First Steps
Preparations
1. Start a Python console
2. Type import Bio
print(dna.name)
print(dna.description)
print(dna.seq[:100])
2. Manipulating a sequence
result = dna.__________()
print(result)
if result == 'GTGCAACGT':
    print('OK')
1. reverse_complement
2. transcribe
3. translate
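For orientation, here is a minimal sketch of what the three candidate methods return. It deliberately uses a made-up six-base sequence (the name toy is an assumption for illustration), not the sequence from the exercise.

```python
from Bio.Seq import Seq

# Hypothetical toy sequence, chosen only for illustration
toy = Seq("ATGGCC")

rev = toy.reverse_complement()  # complement every base, then reverse the order
rna = toy.transcribe()          # replace T by U
protein = toy.translate()       # read codons with the standard genetic code

print(rev)
print(rna)
print(protein)
```

All three methods return new Seq objects and leave the original sequence unchanged.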
3. Calculating GC-content
Look up Section 3.2 of the Biopython documentation
(http://biopython.org/DIST/docs/tutorial/Tutorial.html) to find out how to calculate the
GC-content of a sequence.
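To make clear what is being calculated, the following sketch computes the GC-content in plain Python. It is not the Biopython function described in the documentation; the function name gc_percent and the toy sequence are assumptions for illustration.

```python
def gc_percent(sequence):
    """Return the share of G and C bases in a DNA string, as a percentage."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    gc_count = sequence.count("G") + sequence.count("C")
    return 100.0 * gc_count / len(sequence)

# Hypothetical toy sequence: 3 of its 5 bases are G or C
toy = "ATGCG"
print("%.3f" % gc_percent(toy))
```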
55.556
46.875
33.514
50.000
print(dna.annotations['organism'])
print(dna.annotations['references'][0])
print(dna.annotations.keys())
There is a bug in the program. Execute the program. Identify the problem and fix it.
parser = PDB.PDBParser()
struc = parser.get_structure("tRNA", "1ehz.pdb")
n_atoms = 0
for model in struc:
    for chain in model:
        for residue in chain:
            for atom in residue:
                print(residue.resname, residue.id)
                n_atoms += 1
print(n_atoms)
Use the following code to download identifiers (with the esearch web app) and protein
sequences for these identifiers (with the efetch web app) from the NCBI databases.
The order of lines got messed up! Please sort the lines to make the code work.
identifiers = records['IdList']
records = handle.read()
open("globins.fasta", "w").write(records)
records = Entrez.read(searchresult)
Entrez.email = "krother@academis.eu"
Using NCBI E-utilities
Entrez.email = "my@email.eu"
Parameters include:
parameter   examples
db          nucleotide, protein, pubmed
term        human[Organism], hemoglobin, hemoglobin AND alpha
records = Entrez.read(handle)
identifiers = records['IdList']
parameter   examples
id          single id
rettype     fasta, gb
retmode     text, xml
Documentation:
You can find a full list of available options at http://www.ncbi.nlm.nih.gov/books/NBK25500/
Example URLs
1. Searching for papers in PubMed
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=thermophilic,packing&rettype=uilist
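Such a URL can also be assembled programmatically. The sketch below uses only the Python standard library; the helper name build_esearch_url is an assumption for illustration. Note that urlencode escapes the comma in the search term as %2C, which is an equivalent spelling of the same query.

```python
from urllib.parse import urlencode

# Base address taken from the example URL above
ESEARCH_URL = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_esearch_url(db, term, **extra):
    """Combine the database, search term, and optional parameters into a URL."""
    params = {"db": db, "term": term}
    params.update(extra)
    return ESEARCH_URL + "?" + urlencode(params)

url = build_esearch_url("pubmed", "thermophilic,packing", rettype="uilist")
print(url)
```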
Diagnosing Sickle Cell Anemia
Goal
Your goal is to develop an experimental test that reveals whether a patient suffers from the
hereditary disease sickle cell anemia. The test for diagnosis should use a restriction enzyme
on a patient's DNA sample. For the test to work, you need to know exactly what genetic
difference to test against. In this tutorial, you will use Biopython to find out.
The idea is to compare DNA and protein sequences of sickle cell and healthy globin, and to
try out different restriction enzymes on them.
1. Use the module Bio.Entrez to retrieve DNA and protein sequences from NCBI
databases.
2. Use the module Bio.SeqIO to read, write, and filter information in sequence files.
3. Use the modules Bio.Seq and Bio.SeqRecord to extract exons, transcribe and
translate them to protein sequences.
4. Use the module re to identify restriction sites.
Have fun!
Write Python code that searches for the cDNA sequence of the sickle cell globin protein from
NCBI. Use the Entrez.esearch function. As keywords, use sickle cell AND human NOT
chromosome. Print the resulting database identifiers (not the full sequences).
Print the identifier and defline for each entry using a for loop.
Hint:
It is often more convenient to design the query in a browser window before moving to
Python.
2. Bio.SeqIO
Reading, writing, and filtering sequence files
The first parameter of parse() is the filename; the second is the format, genbank. Print the
records object.
The first parameter of write() is a list of sequence records, the second a file open for
writing, and the third should be fasta.
Hint:
This is a great occasion to exercise string formatting, e.g. to obtain tabular output:
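A minimal sketch of such formatting follows. The identifiers and lengths are hypothetical placeholders; in the exercise they would come from the sequence records.

```python
# Hypothetical (identifier, length) pairs standing in for real sequence records
rows = [("seq_A", 1234), ("seq_B", 56)]

lines = []
for identifier, length in rows:
    # Left-align the identifier in 12 characters, right-align the length in 8
    lines.append("{:<12}{:>8}".format(identifier, length))

print("\n".join(lines))
```

Fixed column widths keep the numbers aligned even when the identifiers differ in length.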
GenBank file.
Print all features of the L26452 entry. Use the field r.features on a SeqRecord object.
feature[nofuzzy_start:nofuzzy_end]
Use the indices to extract portions of the complete sequence. Concatenate all exon
sequences to a single string.
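The slicing and concatenation step can be sketched as follows. A plain string stands in for the sequence, and the index pairs are hypothetical; in the exercise they would be read from the exon features of the GenBank entry.

```python
def join_exons(sequence, exon_ranges):
    """Cut out each (start, end) slice and concatenate the pieces in order."""
    return "".join(sequence[start:end] for start, end in exon_ranges)

# Hypothetical toy data: two "exons" at positions 2-5 and 8-11 (end exclusive)
toy_sequence = "AAATGCCCGGGTTTAA"
toy_exons = [(2, 5), (8, 11)]

coding = join_exons(toy_sequence, toy_exons)
print(coding)
```

As with any Python slice, the start index is included and the end index is excluded.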
3.8 Results
Print the sickle cell and healthy beta-globin sequences on consecutive lines or write them to
a text file. Also print the two corresponding protein sequences on consecutive lines. What
differences do you see?
4. Pattern Matching
Identification of restriction sites
import re
If the search method returns a match, print the start and stop positions found in both DNA sequences.
HinfI (GTNNAC)
BceAI (ACGGCNNNNNNNNNNNNN)
BseRI (GAGGAGNNNNNNNNNN)
EcoRI (GAATTC)
MstII (CCTNAGG)
on both DNA sequences. Which restriction enzyme could you use to specifically identify
carriers of the sickle cell anemia gene?
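One way to search for a recognition site with the re module is to translate every N into a character class that matches any base. The sketch below assumes this approach; the helper names and the two toy sequences are made up for illustration and are not the globin sequences from the exercise.

```python
import re

def site_to_regex(site):
    """Turn a recognition site into a regular expression; N matches any base."""
    return site.upper().replace("N", "[ACGT]")

def find_site(site, dna):
    """Return (start, stop) of the first occurrence of the site, or None."""
    match = re.search(site_to_regex(site), dna.upper())
    if match:
        return match.start(), match.end()
    return None

# Hypothetical toy sequences that differ in a single base
toy_a = "TTCCTAAGGAA"
toy_b = "TTCCTAAGCAA"

print(find_site("CCTNAGG", toy_a))
print(find_site("CCTNAGG", toy_b))
```

A site that is found in one sequence but not in the other is the kind of difference a restriction test can exploit.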
Optional Exercises
Take a look at the website Regex101
To facilitate the restriction analysis, replace the N's in the sickle cell DNA by the
corresponding positions from the healthy DNA. Print the resulting DNA sequence.
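A possible sketch of this replacement is shown below. It assumes that both sequences have the same length and are already aligned position by position; the function name fill_unknown and the toy strings are hypothetical.

```python
def fill_unknown(sequence, reference):
    """Replace each N in sequence by the base at the same position in reference."""
    if len(sequence) != len(reference):
        raise ValueError("sequences must have the same length")
    return "".join(
        ref_base if base == "N" else base
        for base, ref_base in zip(sequence, reference)
    )

# Hypothetical toy strings, not the real globin sequences
filled = fill_unknown("ANGT", "TCGA")
print(filled)
```

If the two sequences differ in length, they would have to be aligned first, which this sketch does not attempt.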
BLAST
Overview
In this tutorial, you will automate BLAST queries with Python. You will learn how to run
BLAST locally, multiple times, and how to read BLAST results with Python. In the
process, you will build a program pipeline, a concept useful in many biological analyses
independent of BLAST.
1. Preparations
2. Running local BLAST manually
3. Running local BLAST from Python
4. Running BLAST many times with Python
5. Reading BLAST output with Biopython
6. Plotting the results
As a small sample study, we will BLAST a set of peptides from a few Homo sapiens proteins
against the proteome of Plasmodium falciparum. As a control, we will use the proteome of
Schizosaccharomyces pombe.
1. Preparations
1.1 Check whether BLAST+ is properly installed
Enter the following two commands in a Linux console:
makeblastdb
blastp
Both should result in an error message other than command not found.
1.4 Questions
What files have appeared in the data/ directory?
Why do we need to create a database first? Why can't BLAST do that right before each
query?
DAAITAALNANAVK
Make sure that there are no other characters in the file (no empty lines or FASTA deflines).
Save the file to the data/ directory.
2.5 Questions
Take a look at the BLAST output. Is the result what you would expect?
Does the control group support your assumptions so far?
Which of the output formats do you find the easiest to read?
Which of the output formats is probably the easiest to read for a program?
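To get a feeling for how a program reads BLAST output, here is a sketch that parses the tabular format with comment lines (-outfmt 7) using only the standard library. It assumes the default twelve columns; the column names in COLUMNS and the sample hit line are made up for illustration and are not real results.

```python
# Assumed default column order of the tabular BLAST+ output
COLUMNS = ["query", "subject", "identity", "length", "mismatches", "gap_opens",
           "q_start", "q_end", "s_start", "s_end", "evalue", "bit_score"]

def parse_tabular(text):
    """Return one dictionary per hit line; lines starting with # are comments."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        hit = dict(zip(COLUMNS, line.split("\t")))
        hit["identity"] = float(hit["identity"])
        hit["evalue"] = float(hit["evalue"])
        hits.append(hit)
    return hits

# Made-up sample output, for illustration only
sample = ("# hypothetical comment line\n"
          "query1\tsubjectA\t85.71\t14\t2\t0\t1\t14\t120\t133\t1e-04\t30.4\n")

hits = parse_tabular(sample)
for hit in hits:
    print(hit["subject"], hit["identity"], hit["evalue"])
```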
Now we will use the function os.system to run BLAST. Create a Python script run_blast.py
in the data/ directory. Write the following commands into it:
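import os

# Name of the BLAST database created with makeblastdb
db = "Plasmodium_falciparum.fasta"
cmd = "blastp -query query.seq -db " + db + " -out output.txt -outfmt 7"
# Run the command in a shell, as announced above
os.system(cmd)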
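As an alternative to os.system, the same command can be passed to the subprocess module as a list of arguments, which avoids problems with spaces in file names. This is only a sketch: the helper names are assumptions, and run_blastp is defined but not called, so nothing is executed unless you call it yourself.

```python
import shutil
import subprocess

def build_blastp_command(query, db, out, outfmt=7):
    """Return the blastp call as a list of arguments."""
    return ["blastp", "-query", query, "-db", db,
            "-out", out, "-outfmt", str(outfmt)]

def run_blastp(command):
    """Run the command and return its exit code; requires BLAST+ on the PATH."""
    if shutil.which(command[0]) is None:
        raise RuntimeError(command[0] + " was not found on the PATH")
    return subprocess.run(command).returncode

cmd = build_blastp_command("query.seq", "Plasmodium_falciparum.fasta", "output.txt")
print(" ".join(cmd))
```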
3.6 Questions
Is the output of the Python BLAST run identical to the one you did manually? How can
you check that?
mkdir data/queries
The Python script multiblast/split_fasta.py does that using Bio.SeqIO. You can use it by
typing in the multiblast/ directory:
ls -l data/queries
more data/queries/9568103_99.fasta
mkdir data/Plasmodium_out
mkdir data/Pombe_out
You need to complete the BLAST command by inserting the file names from the given
variables. Use the parameter -outfmt 5 to create XML output. We will need this later
to read the results with Biopython.
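The idea of building one command per query file can be sketched as follows. This is not the script from the course material; the function name plan_blast_runs, the database path, and the .xml naming scheme for the output files are assumptions for illustration.

```python
import os

def plan_blast_runs(query_files, db, out_dir):
    """Return one blastp command string per query file, each writing XML output."""
    commands = []
    for query in query_files:
        # Reuse the query file name, without its extension, for the output file
        base = os.path.splitext(os.path.basename(query))[0]
        out = out_dir + "/" + base + ".xml"
        commands.append(
            "blastp -query %s -db %s -out %s -outfmt 5" % (query, db, out)
        )
    return commands

# Hypothetical input; in practice the list could come from os.listdir
queries = ["data/queries/9568103_99.fasta"]
commands = plan_blast_runs(queries, "data/Plasmodium_falciparum.fasta",
                           "data/Plasmodium_out")
for command in commands:
    print(command)
```

Each command string could then be executed in turn, for example with os.system.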
When everything is done, you should be able to execute the script with:
python parse_blast_xml.py
References
BLAST+ is a new, faster (C++ based) version that replaces BLAST2, as of Oct 2013. Also see:
http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download
Biopython Examples
1. Getting started
import Bio
from Bio.Seq import Seq
dna = Seq("ACGTTGCAC")
print(dna)
(alternative)
(alternative)
3. Calculating GC-content
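from Bio.SeqUtils import GC
# GC returns the GC-content as a percentage. Recent Biopython releases
# replace it with gc_fraction, which returns a fraction between 0 and 1.
GC(dna)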
import pylab
from Bio import SeqIO

# Collect the length of every sequence in the FASTA file
sizes = [len(r.seq) for r in SeqIO.parse("ls_orchid.fasta", "fasta")]
# Plot the length distribution as a histogram
pylab.hist(sizes, bins=20)
pylab.title("%i orchid sequences\nLengths %i to %i"
            % (len(sizes), min(sizes), max(sizes)))
pylab.xlabel("Sequence length (bp)")
pylab.ylabel("Count")
pylab.show()
Acknowledgements
Authors
2015 Kristian Rother (krother@academis.eu)
This document contains contributions by Allegra Via, Magdalena Rother and Olga
Sheshukova. I would like to thank Pedro Fernandes, Janick Mathys, Janusz M. Bujnicki and
Artur Jarmolowski for their support during the courses in which the material was developed.
License
Distributed under the conditions of a Creative Commons Attribution Share-alike License 3.0.