Strategies for Whole Genome Sequencing
sequencing - which sequences mapped fragments (1000kb)
sequencing - which sequences random fragments (500kb)
1. genome is cut into 160Kb fragments
2. fragments inserted into
3. segments are cut into 1000 bp pieces
4. 1K pieces are put into plasmids & plasmids copied
5. copies are sequenced
6. the sequenced pieces are overlapped in order, recreating the entire 160 Kb sequence
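The overlap step above can be illustrated with a minimal sketch. It assumes error-free reads from a single strand and uses a simple greedy merge; the function names `overlap` and `greedy_assemble` and the `min_len` threshold are hypothetical choices for illustration, not the method used by any particular sequencing project.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    max_len = min(len(a), len(b))
    for n in range(max_len, min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap (toy greedy assembly)."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # no remaining overlaps; leftover pieces stay as separate contigs
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Three short, made-up "sequenced pieces" that overlap end to end
pieces = ["ATGGCGT", "GCGTACC", "TACCTGA"]
print(greedy_assemble(pieces))
```

Real assemblers must also cope with sequencing errors, both strands, and repeats, which a greedy merge like this does not handle.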
Understanding the genome
genome sequence is like a parts list or dictionary, except that it has no
one needs to
annotate the sequence by finding gene (coding) sequences,
called Open Reading Frames (ORFs)
look for start (5'-ATG-3') and stop (5'-TAA-3', 5'-TAG-3', 5'-TGA-3') codons. Computers
can also search for promoters, operators, and regulators.
in eukaryotes: introns & exons present a problem; known sequences of cDNA
probes can be used by computers to spot ORFs.
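The start/stop codon scan described above can be sketched in a few lines. This is a minimal illustration, assuming an intron-free sequence read 5' to 3' on one strand only; the function name `find_orfs` and the `min_codons` cutoff are hypothetical, and the stop codons are those of the standard genetic code (TAA, TAG, TGA).

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end, sequence) for each ATG...stop run in the three forward frames."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon in this frame opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                # count codons from ATG up to (not including) the stop codon
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, dna[start:i + 3]))
                start = None
    return orfs


# Made-up sequence with one ORF in the third reading frame
print(find_orfs("CCATGAAATTTTGACC"))
```

A fuller scan would also check the reverse complement strand, and for eukaryotes would need to account for introns, as noted above.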
Open reading frames: once found one can check sequence against catalogs of known
Sequences in eukaryotic genomes...
Humans contain up to
2,000 more DNA bp's than other organisms (table)
Human genome expected to have 100,000+ genes, but only has
if prokaryotes show a strong correlation between genome size, gene number,
and metabolic capability (# proteins), why don't eukaryotes?
alternative splicing - on average each coding eukaryotic locus
produces 3 distinct mRNA transcripts = 120,000 proteins?
repeat sequences: more than 50% of human DNA is repetitive
a. most is due to transposable elements - segments capable of moving
from one location to another in a genome.
LINEs - long interspersed nuclear elements, each with
a promoter and 2 genes (a reverse transcriptase & an integrase)
b. simple repeat sequences: make up 3% of the human genome
microsatellites - stretches of 1 to 13 bp - mostly
minisatellites - repeat units of 14 to 500 bp
locations of satellites in chromosomes are highly variable from individual to individual
during replication repeat satellites
synapse and DNA polymerase slips
result is every individual has a UNIQUE number of repeats at satellite loci,
which becomes the basis of DNA fingerprinting.
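As a small illustration of why repeat counts can distinguish individuals, the sketch below counts back-to-back copies of a repeat unit at one locus. The two sequences, the repeat unit "CA", and the function name `max_tandem_repeats` are made up for the example; real fingerprinting compares many loci.

```python
import re


def max_tandem_repeats(dna, unit):
    """Largest number of back-to-back copies of `unit` found anywhere in `dna`."""
    unit = unit.upper()
    runs = re.findall(f"(?:{re.escape(unit)})+", dna.upper())
    return max((len(run) // len(unit) for run in runs), default=0)


# Hypothetical alleles at the same satellite locus in two individuals
individual_a = "GGT" + "CA" * 5 + "TTG"
individual_b = "GGT" + "CA" * 8 + "TTG"

print(max_tandem_repeats(individual_a, "CA"))
print(max_tandem_repeats(individual_b, "CA"))
```

Because the flanking sequence is the same and only the repeat number differs, the two alleles differ in length, which is what makes the locus usable as a marker.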
split genes: in humans only 5% of DNA is
What have we learned of bacterial
genomes to date (summer 2002)?
1. there is a general correlation between
SIZE of genome & metabolic capability
parasites have small genomes and require host metabolism (Mycoplasmas)
non-parasites have large genomes (E. coli &
Pseudomonas 10x larger)
2. many sequenced genes have NO known function
the function of 38% of the E. coli genome is unknown
3. about 15% of each prokaryotic genome is unique
to that species
4. redundancy of sequence is very common
E. coli has 86 gene pairs that are identical (diploidy?)
5. more than one chromosome in prokaryotes is common
many prokaryotes have 2 circular DNA molecules
6. most prokaryotes have plasmids
7. many prokaryotes are scavengers, acquiring DNA
from other species
done via Lateral Transfer, i.e.,
a) prokaryotes take up raw pieces of DNA from the environment
b) Chlamydia bacteria (cause STDs) contain many euc-like
c) pathogenic E. coli have 1,400 or more additional genes compared with non-pathogens,
that are similar to gene sequences from phage viruses, thus
"prokaryotic pathogenic genes may have a viral origin".
- groups of genes clustered along the same chromosome
probably arose from a common ancestral gene sequence through gene duplication
especially via unequal crossing over.
a Pseudogene - a DNA sequence that closely
resembles a working gene,
but is not transcribed, & has no known function.