What does Genbank contain?
From genomewiki
In my experience, the human part of GenBank consists of four kinds of records: fragments, ends, full-length sequences and partial full-length sequences.
- fragments: cloned PCR products, mostly of exons. They are rarely longer than 1 kb. They make up the majority of GenBank submissions by count, but not by total sequence length. A prototypical example is the sequence of an exon involved in some genetic disease, including the mutation. These submissions are made by individual authors, typically reference a single article, and usually comprise fewer than ~30 sequences per author. The sequences were submitted because journals require it and to preserve the sequence information in a standard format. They are only mapped on the NCBI genome browser (is this still true?).
- ends: some projects clone a fragment but then sequence only its two ends. The cloning can be done with BACs, fosmids, cosmids or plasmids (BAC < 150 kb, fosmid < 40 kb, cosmid < 20 kb, plasmid < 10 kb). A GenBank submission typically includes hundreds to thousands of sequences, with the two ends of each clone as separate records. They are submitted by projects and core facilities, not by individual authors. The ends are used to find a BAC covering a given region, so that the frozen clone can then be ordered from a supplier. Most of them are mapped on the UCSC genome browser.
- full-length: these can be full-length sequences of cosmids, BACs, fosmids or cDNAs (= genes). The clones were shredded into smaller pieces, which were then sequenced and assembled. The biggest contributor of this type was the Human Genome Project, which sequenced full-length BACs and then put them together to make the genome. Other main contributors were projects like MGC, which aimed to build gene models. The rest came from groups interested in a single clone for their own project. These records contain the bulk of the sequence information in GenBank, concentrated on only a few hundred submitters. They are mapped on the NCBI genome browser.
- partial full-length: a BAC whose sequencing was started but never finished. Reasons can include: the sequences indicated a mixed or hybrid clone, or the sequences pointed towards a clone that had already been sequenced before. The record is a long sequence of 1 kb pieces, separated by stretches of 100 N characters. These were mostly produced by the Human Genome Project and are mostly junk by today's standards, so they are not mapped by any genome browser.
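The sequence-level differences between these categories can be sketched in a few lines of Python. This is only a rough illustration, not a method used by GenBank or any genome browser: the function names `split_on_gaps` and `guess_category` are hypothetical, the 100-N gap length and the 1 kb fragment limit are taken from the descriptions above, and a sequence alone cannot tell a fragment from an end.

```python
import re


def split_on_gaps(seq, min_gap=100):
    """Split a sequence into pieces at runs of at least `min_gap` N characters.

    The default of 100 follows the description of partial full-length
    records above; real records may use other gap lengths (assumption).
    """
    gap = re.compile(r"[Nn]{%d,}" % min_gap)
    return [piece for piece in gap.split(seq) if piece]


def guess_category(seq, min_gap=100, fragment_max=1000):
    """Guess a category from sequence content alone (hypothetical heuristic).

    - several pieces separated by N runs -> "partial full-length"
    - at most `fragment_max` bases       -> "fragment or end"
    - anything longer                    -> "full-length"
    """
    pieces = split_on_gaps(seq, min_gap)
    if len(pieces) > 1:
        return "partial full-length"
    if len(seq) <= fragment_max:
        return "fragment or end"
    return "full-length"


if __name__ == "__main__":
    # Two 1 kb pieces joined by 100 Ns, as in an unfinished BAC record.
    draft = "ACGT" * 250 + "N" * 100 + "GGCC" * 250
    print([len(p) for p in split_on_gaps(draft)])
    print(guess_category(draft))
```

Telling fragments from ends, or a BAC from a cDNA, would need the record's annotation (submitter, clone name, keywords), which this sketch deliberately ignores.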