What does Genbank contain?
From genomewiki
Human Genbank RNA is easy: apart from a few errors, it contains mostly cDNA and is used to create gene models. cDNAs are submitted either by individual authors (of older articles, from before the big-data biology era) or by sequencing centers that clone cDNAs into plasmids with robots and sequence them.
DNA in Genbank is more diverse. In my experience, human Genbank DNA falls into four groups: short fragments, ends, full-length sequences and partial full-length sequences.
- short fragments: cloned PCR products, mostly of exons. They are rarely longer than 1kb. They make up the majority of Genbank submissions, but not of its sequence size. A prototypical example is the sequence of an exon involved in some genetic disease, which includes the mutation. These Genbank submissions are made by individual authors, typically reference a single article, and usually comprise fewer than ~30 sequences. The sequences were submitted because journals require it and to preserve the sequence information in a standard format. They are only mapped on the NCBI genome browser (is this still true?).
- ends: some projects clone a fragment, but then sequence only its ends. The cloning can be done with BACs, fosmids, cosmids or plasmids (BAC<150kb, Cosmid<20kb, Fosmid<40kb, plasmid<10kb). These Genbank submissions typically include hundreds to thousands of sequences, with the two ends of a clone as two separate records. They are submitted by projects and core facilities, not by individual authors. The ends are used to find a BAC for a given region, so that the frozen clone can then be ordered from a supplier. Most of them are mapped on the UCSC genome browser.
- full-length: These can be full-length sequences of cosmids, BACs, fosmids or cDNAs (=genes). The clones were shredded into smaller pieces, which were sequenced and assembled. The biggest contributor to this type was the human genome project, which sequenced full-length BACs and then put them together to make the genome. Other main contributors were projects like MGC, which aimed to build gene models, and projects that were interested in a single given clone. These records contain the bulk of the sequence information in Genbank, concentrated on only a few hundred submitters. They are mapped on the NCBI genome browser.
- partial full-length: This is a BAC whose sequencing was started but never finished. Possible reasons: the reads indicated a mixed or hybrid clone, or they pointed towards a clone that had already been sequenced. The record is a long sequence made of pieces of about 1kb, separated by stretches of 100 N-characters. Mostly produced by the human genome project and mostly junk by today's standards, so not mapped by any genome browser.
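The categories above can be roughly told apart from the sequence alone. The following is only a sketch, not an official Genbank rule: the 1kb cutoff and the 100-N gap pattern are taken loosely from the descriptions above, and the names `SHORT_MAX_BP`, `GAP_RUN`, `split_pieces` and `guess_category` are made up for this example. Short fragments and ends cannot be separated by length, so they share one label here; telling them apart would need the record annotation.

```python
import re

# Assumed thresholds, read off the descriptions above (not Genbank rules).
SHORT_MAX_BP = 1_000                             # "rarely longer than 1kb"
GAP_RUN = re.compile(r"N{100,}", re.IGNORECASE)  # "stretches of 100 N-characters"


def split_pieces(seq: str) -> list[str]:
    """Split a sequence on runs of at least 100 N and drop empty pieces."""
    return [piece for piece in GAP_RUN.split(seq) if piece]


def guess_category(seq: str) -> str:
    """Guess the category of a DNA record from its sequence only."""
    if GAP_RUN.search(seq):
        # Pieces joined by long N-runs look like an unfinished clone.
        return "partial full-length"
    if len(seq) <= SHORT_MAX_BP:
        # Length alone cannot distinguish a PCR fragment from a clone end.
        return "short fragment or end"
    return "full-length"
```

For an unfinished BAC, `split_pieces` returns the individual sequenced pieces, which can then be inspected or aligned separately.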
The mouse and zebrafish parts of Genbank have a structure similar to human. More exotic species consist mostly of their genome project output plus many cDNAs contributed by gene sequencing projects or individual labs. Even more exotic species have only a few cDNAs in Genbank.