
Some facts about the human genome



What are the options for sequencing a whole genome or part of it?

1. "Chromosome walking"

 

2. "Genome shotgun sequencing"


 



 
HUMAN GENOME PROJECT (HGP)   1990 
Hierarchical Shotgun
(“map-based”, “BAC-based”,
“clone-by-clone”)

- the hierarchical method allows additional sequencing to be targeted at under-represented regions

 
 
 
CELERA GENOMICS
1998
Whole Genome Shotgun
 
 

Potential risks:
- the human genome is the first repeat-rich genome to be sequenced, so regions with incorrect assembly are hard to identify

- outbred organism, so at least 2 copies of the genome are present -> large-scale structural heterozygosity and Single Nucleotide Polymorphisms (about 1 per 1300 bp)

 
 



CELERA GENOMICS

Sequencing
- fully automated from library transformation to reading
- DNA from 5 subjects was selected for sequencing

Random Shotgun Data Set
8 Sep 1999 – 17 Jun 2000 -> 27 mil. reads of average
length 540 bp (175 000 reads per day) -> ~ 5x coverage
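
The coverage figure is simple arithmetic: total bases read divided by genome size. A minimal sketch, assuming a genome size of about 2.9 Gbp (the notes do not state the value used, so `GENOME_SIZE_BP` is an assumption, as is the function name):

```python
# Sketch: fold coverage = (number of reads * mean read length) / genome size.
# GENOME_SIZE_BP is an assumed value; the notes do not say which size was used.
GENOME_SIZE_BP = 2.9e9


def fold_coverage(n_reads: float, mean_read_len_bp: float,
                  genome_size_bp: float = GENOME_SIZE_BP) -> float:
    """Return the average number of times each base is covered by reads."""
    return n_reads * mean_read_len_bp / genome_size_bp


if __name__ == "__main__":
    # Figures from the notes: 27 million reads of average length 540 bp.
    print(f"{fold_coverage(27e6, 540):.2f}x")
```

With these inputs the result comes out at roughly 5x, in line with the figure quoted above.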

Mate pairs (a key feature of the sequencing)
from 2-, 10- and 50-kbp inserts (3.42x, 16.40x and 18.84x clone coverage, respectively)
 

Two different approaches to assembly:

1. Whole-genome assembly  (WGA)
Two sets of data:
- random shotgun sequences produced at Celera and trimmed (5x coverage)
- publicly funded HGP data derived from BAC clones (2.96x coverage), downloaded from GenBank on 2 Sep 2000 and shredded into reads; the locations of the BACs were not used in this process

2.  Compartmentalized assembly (CSA) 
- first, the data were partitioned into sets localized to large chromosomal segments (using HGP information); each set was then assembled by the shotgun method (similar to the hierarchical approach)
 

Whole-genome assembly

SCREENER – screened out all repeat elements but microsatellites

OVERLAPPER – compared every read against every other read in search of overlaps of at least 40 bp with <6% differences (4-5 days for 40 computers with 4 GB of RAM each, operating in parallel); the algorithm used by Celera was able to identify reads from repetitive elements and to find the boundaries where such elements start
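
The overlap criterion (at least 40 bp, fewer than 6% differences) can be illustrated with a naive all-against-all comparison. This is only a sketch and not Celera's algorithm, which the notes do not describe: it counts substitutions only (no insertions or deletions), ignores reverse complements, and runs in quadratic time. The function names are hypothetical.

```python
from __future__ import annotations

from itertools import permutations

MIN_OVERLAP = 40   # minimum overlap length in bp (from the notes)
MAX_DIFF = 0.06    # overlaps must show fewer than 6% differences (from the notes)


def best_overlap(a: str, b: str, min_len: int = MIN_OVERLAP,
                 max_diff: float = MAX_DIFF) -> int:
    """Length of the longest suffix of `a` matching a prefix of `b` with a
    mismatch fraction below `max_diff`; 0 if no overlap reaches `min_len`.
    Only substitutions are counted (simplifying assumption)."""
    if min_len < 1:
        raise ValueError("min_len must be at least 1")
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        suffix, prefix = a[-length:], b[:length]
        mismatches = sum(x != y for x, y in zip(suffix, prefix))
        if mismatches / length < max_diff:
            return length
    return 0


def all_overlaps(reads: dict[str, str], **kwargs) -> dict[tuple[str, str], int]:
    """Compare every read against every other read (illustration only)."""
    found = {}
    for (name_a, seq_a), (name_b, seq_b) in permutations(reads.items(), 2):
        length = best_overlap(seq_a, seq_b, **kwargs)
        if length:
            found[(name_a, name_b)] = length
    return found
```

A production overlapper would index reads (for example by shared k-mers) rather than test every pair, which is why the step needed days on 40 machines even with an optimized method.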

SCAFFOLDER – used mate-pair information to link contigs together into scaffolds: 2- and 10-kbp mate pairs gave intermediate-sized scaffolds, which were then linked together by confirming 50-kbp mate pairs
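
The idea of linking contigs through mate pairs can be sketched as follows. This is a simplified illustration, not Celera's Scaffolder: it only groups contigs and ignores their order, orientation and gap sizes. The threshold `MIN_LINKS` and the function name are assumptions made for the example.

```python
from collections import Counter

MIN_LINKS = 2  # assumed number of mate pairs needed to trust a join


def build_scaffolds(contigs, mate_links, min_links=MIN_LINKS):
    """Group contigs into scaffolds (returned as a list of sets of names).

    contigs    -- iterable of contig names
    mate_links -- iterable of (contig_a, contig_b) pairs, one per mate pair
                  whose two reads landed in different known contigs
    """
    # Count how many mate pairs support each unordered contig pair.
    support = Counter(frozenset(link) for link in mate_links
                      if link[0] != link[1])

    # Union-find over contigs: each confirmed link merges two groups.
    parent = {c: c for c in contigs}

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path halving
            c = parent[c]
        return c

    for pair, n in support.items():
        if n >= min_links:
            a, b = tuple(pair)
            parent[find(a)] = find(b)

    groups = {}
    for c in parent:
        groups.setdefault(find(c), set()).add(c)
    return sorted(groups.values(), key=lambda s: sorted(s))
```

Requiring more than one supporting mate pair before joining two contigs guards against a single mispaired or chimeric clone creating a false join.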

REPEAT RESOLVER – filled the remaining gaps, at the cost of a certain error rate

Set of scaffolds -> 2.85 Gbp in span, 2.6 Gbp of sequence
Scaffolds >100 kbp long cover 84% of the genome
Scaffolds >10 Mbp long cover 25% of the genome
The average scaffold size was 1.5 Mbp
The average contig size was 24 kbp
The average gap size was 2.4 kbp



Human Genome Project

HIERARCHICAL SHOTGUN METHOD
Genomic DNA from anonymous human donors was partially digested with restriction enzymes ->
clones from 8 large-insert libraries of BACs or PACs (bacterial or P1-derived artificial chromosomes),
together giving 65-fold coverage

BACs --(HindIII digestion)--> agarose gels --------> fingerprints

Fingerprint clone contigs – anchoring to chromosomes by STS markers from existing genetic and physical maps and also by FISH.

SELECTION OF CLONES FOR SEQUENCING
the clones that make up the draft genome sequence were chosen to have minimal overlaps

(in addition, the overlaps between BAC clones provide a rich collection of SNPs (Single Nucleotide Polymorphisms))
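
Choosing clones with minimal overlap can be viewed as an interval-covering problem. The sketch below uses a standard greedy method and assumes that each clone's position along the contig is already known from the fingerprint map; it is an illustration, not the procedure the sequencing centres actually used, and the function name and coordinates are hypothetical.

```python
def minimal_tiling_path(clones, start, end):
    """Pick a smallest set of clones whose intervals cover [start, end).

    clones -- list of (name, clone_start, clone_end), half-open coordinates
    Returns the chosen clone names in order; raises ValueError when the
    clones leave part of the region uncovered.
    """
    ordered = sorted(clones, key=lambda c: c[1])
    path, covered_to, i = [], start, 0
    while covered_to < end:
        best = None
        # Among clones starting at or before the covered position,
        # take the one that reaches furthest to the right.
        while i < len(ordered) and ordered[i][1] <= covered_to:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= covered_to:
            raise ValueError(f"gap in clone coverage at position {covered_to}")
        path.append(best[0])
        covered_to = best[2]
    return path


if __name__ == "__main__":
    example = [("A", 0, 100), ("B", 50, 150), ("C", 90, 220),
               ("D", 200, 300), ("E", 210, 260)]
    print(minimal_tiling_path(example, 0, 300))
```

In the example, clone B is skipped because clone C starts inside A and extends further, so fewer clones need to be sequenced.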

Sequencing project shared among 20 centres from 6 countries ---> necessary to coordinate the selection of clones --->
most centres focused on particular chromosomes
 

SHOTGUN SEQUENCING OF SELECTED CLONES
the details of the protocol and of automation varied among the centres; the most aggressive automation reached 100 000 reactions in 12 hours

Data integration by a common computational procedure; all assembled contigs >2kb deposited in public databases within 24 hours

Sequencing output rose sharply during production:
by June 2000, the centres were producing sequence equivalent to 1-fold coverage
of the entire human genome in less than 6 weeks
 

GIGASSEMBLER
Version of the draft sequence on 7 Oct 2000:
 29 298 overlapping BACs  ~  4.26 Gbp
   ---> 23 Gbp sequences ~ 7.5-fold coverage
   ---> 90% of euchromatic part of the genome
 

additionally:
3 centres – WHOLE GENOME SHOTGUN
     ~ 0.75-fold coverage
     ~ statistically includes 50% of the nucleotides in the human genome
Comparing these raw data with the draft sequence ---> SNPs
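
The statement that 0.75-fold coverage includes about half of the nucleotides is consistent with a Poisson model of random read placement, in which the expected fraction of bases covered at least once at coverage c is 1 - e^(-c). The notes do not say which model was used, so the model and the function name below are assumptions:

```python
import math


def fraction_covered(fold_coverage: float) -> float:
    """Expected fraction of bases hit by at least one read, assuming reads
    land uniformly at random (Poisson approximation)."""
    return 1.0 - math.exp(-fold_coverage)


if __name__ == "__main__":
    for c in (0.75, 5.0, 7.5):
        print(f"{c}x coverage -> {fraction_covered(c):.1%} of bases covered")
```

At 0.75x the model gives about 53%, close to the 50% quoted above; real libraries are not perfectly random, so the observed fraction is expected to be somewhat lower than the model predicts.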
 

When is the human genome “finished”?
- Fewer than 1 base in 10 000 is incorrectly assigned
- More than 95% of euchromatic region is sequenced
- Each gap is smaller than 150 kb

(Such standards represent realistic goals given current technology)
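
The accuracy criterion can also be expressed on the Phred quality scale, Q = -10 log10(p), where p is the per-base error probability. The notes do not use this scale; it is added here as a common convention, and the function names are hypothetical. An error rate of 1 in 10 000 corresponds to Q40.

```python
import math


def phred_quality(error_probability: float) -> float:
    """Convert a per-base error probability to a Phred-scaled quality score."""
    if not 0.0 < error_probability <= 1.0:
        raise ValueError("error probability must be in (0, 1]")
    return -10.0 * math.log10(error_probability)


def meets_accuracy_standard(errors: int, bases: int,
                            max_error_rate: float = 1e-4) -> bool:
    """True when the observed error rate is below 1 in 10 000 (from the notes)."""
    return errors / bases < max_error_rate


if __name__ == "__main__":
    print(phred_quality(1e-4))                  # quality score for 1 error in 10 000
    print(meets_accuracy_standard(5, 100_000))  # 5 errors in 100 kb
```

This checks only the first of the three criteria; the coverage and gap-size criteria would need the assembly itself.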
 
7 Oct 2000: 25% of the human genome was in finished form
(this includes the finished chromosomes 21 and 22)
 

FILLING THE GAPS