  • AanaNahum
    replied
    Nice, helpful answers. I was about to ask a similar question and found it already answered here.


  • Ric69
    replied
    Here's a nice uncomplicated summary of hg19... https://grch37.ensembl.org/Homo_sapiens/Info/Annotation


  • zzta
    replied
    Sorry to revive this thread, but exons or CDSs are not the only thing transcribed, so how can we account for non-coding RNAs? My understanding is that they are also part of the transcriptome...


  • steven
    replied
    Originally posted by ulz_peter:
    Just to throw my 2 cents in: as far as I know, most exome-enrichment kits use the CCDS database to design the exome library. As this database is less comprehensive than the RefSeq or knownGene annotations in UCSC, some exons will be missed because of that. Of course, others are discarded because of hybridization difficulties (repetitive regions, etc.).
    That makes sense, thanks.


  • steven
    replied
    Originally posted by ssully:
    I keep seeing a figure of 30-33Mb for the human exome e.g.

    This 2009 Nature paper

    "Protein-coding regions constitute ~1% of the human genome or ~30 megabases (Mb), split across ~180,000 exons."

    30-33Mb is also the figure cited in Illumina's "Sequencing Output Calculator", sent to me by tech support.

    Anyone know why the number is so much higher on this thread?
    Because "protein coding regions" and "exons" are different things. UTRs can be long, especially in human.

    I think it is important to know what we are talking about:

    1. number of genomic positions that are annotated as coding (included in CDS)
    2. number of genomic positions that are annotated as exonic (included in exons)

    As frozenlyse and Richard Finney indicated, values for 2. range from roughly 60 to 80 Mb, depending on the annotation source.
    Ssully, the 30 Mb figure in the citation you mention refers to 1. ("protein-coding regions").
    Rstarke, what does this number of 1 billion refer to? "Annotated bases" can be anything: on a genome you can annotate introns, promoters, repeated regions, and so on. A link to this information would help.
    Now, is there a precise definition of "exome" or is it a loose term? Is it supposed to include coding regions only, or can anyone put in there some UTR, promoters, intronic flanks, etc?
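
    To make the difference between 1. and 2. concrete, here is a minimal sketch for a single transcript. It assumes UCSC genePred-style coordinates (0-based, half-open exon starts/ends plus one cdsStart/cdsEnd pair); the function name and the example transcript are made up for illustration.

```python
def exon_and_cds_bases(exon_starts, exon_ends, cds_start, cds_end):
    """Return (exonic_bases, coding_bases) for one transcript.

    Coordinates are assumed to be 0-based, half-open.
    """
    exonic = 0
    coding = 0
    for start, end in zip(exon_starts, exon_ends):
        exonic += end - start
        # Clip the exon to the CDS interval; whatever falls outside is UTR.
        overlap = min(end, cds_end) - max(start, cds_start)
        if overlap > 0:
            coding += overlap
    return exonic, coding


# Hypothetical three-exon transcript with a 5' and a 3' UTR.
exonic, coding = exon_and_cds_bases([100, 500, 900], [200, 600, 1200], 150, 950)
print(exonic, coding)
```

    Summed over a whole annotation (after removing overlaps), the first value corresponds to definition 2. and the second to definition 1. A transcript whose cdsStart equals its cdsEnd contributes only to the exonic count.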


  • ulz_peter
    replied
    Just to throw my 2 cents in: as far as I know, most exome-enrichment kits use the CCDS database to design the exome library. As this database is less comprehensive than the RefSeq or knownGene annotations in UCSC, some exons will be missed because of that. Of course, others are discarded because of hybridization difficulties (repetitive regions, etc.).


  • Richard Finney
    replied
    Our friend Mr. Ref Seq says ...

    Back of the envelope calculations:
    The sum of the values for base coverage of the exons for the data above in the hg19/UCSCknown table (posted above) is
    81,105,734

    The Refseq table from UCSC for hg19 (jan 2011 version) says : 63,995,498
    [Method: load the table into a data structure, sort by name, and traverse; if (currentname == previousname), don't count the row, otherwise calculate the sum of its exons and add it to the total.] Nota bene: this won't eliminate some overlapping situations.
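
    The bracketed method could be sketched roughly as follows. This is only an illustration under assumptions: each table row is reduced to a (name, exon_starts, exon_ends) tuple with half-open coordinates, and the function and row names are hypothetical.

```python
def sum_exons_first_per_name(rows):
    """Sum exon lengths, counting only the first row seen for each name.

    rows: iterable of (name, exon_starts, exon_ends) tuples.
    """
    total = 0
    previous_name = None
    for name, starts, ends in sorted(rows, key=lambda row: row[0]):
        if name == previous_name:
            continue  # same name as the previous row: don't count it again
        total += sum(end - start for start, end in zip(starts, ends))
        previous_name = name
    return total


# Hypothetical rows: NM_1 appears twice; NM_2 and NM_3 overlap on the genome.
rows = [
    ("NM_2", [0], [10]),
    ("NM_1", [100, 300], [200, 350]),
    ("NM_1", [100, 300], [200, 350]),
    ("NM_3", [5], [15]),
]
print(sum_exons_first_per_name(rows))
```

    Note how NM_2 and NM_3 are both counted in full even though they share positions 5-10: that is the kind of overlapping situation this approach does not eliminate.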

    Refseq is more conservative than UCSCknown and relies more on hand curation and less on computation.

    I don't know about GENCODE but if it's that for human only and that number is right then it's probably any transcript ever measured. I could only speculate on what that extra bonus coverage is. A free trip to Sweden goes to the guy that can explain and prove it (if it's functionally real).


  • rstarke
    replied
    I would also like to know why the huge discrepancy between what's in the literature (~30-40Mb) and the numbers cited in this thread. I just checked the GENCODE v6 annotations and the total annotated base count is over a billion, supporting the estimates in this thread. I'm confused. Can anyone clear up the discrepancy?


  • ssully
    replied
    I keep seeing a figure of 30-33Mb for the human exome e.g.

    This 2009 Nature paper

    "Protein-coding regions constitute ~1% of the human genome or ~30 megabases (Mb), split across ~180,000 exons."

    30-33Mb is also the figure cited in Illumina's "Sequencing Output Calculator", sent to me by tech support.

    Anyone know why the number is so much higher on this thread?


  • NextGenSeq
    replied
    By comparing the genes listed in the bed file to the UCSC annotation. I tried attaching the bed file but it's too large for this site to allow it.
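
    No detail was given on how the comparison was run, so the following is only a guess at what a gene-level comparison might look like. It assumes the fourth (name) column of the BED file holds a gene symbol and that the annotation is available as a plain list of gene symbols; the function name and the gene names are hypothetical.

```python
def fraction_of_genes_targeted(bed_lines, annotated_genes):
    """Fraction of annotated genes named at least once in the BED file."""
    targeted = set()
    for line in bed_lines:
        # Skip blank lines and BED header/comment lines.
        if not line.strip() or line.startswith(("track", "browser", "#")):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 4:
            targeted.add(fields[3])
    annotated = set(annotated_genes)
    if not annotated:
        return 0.0
    return len(targeted & annotated) / len(annotated)


# Hypothetical input: two of the four annotated genes appear in the BED file.
bed_lines = [
    "track name=targets",
    "chr1\t100\t200\tGENE_A",
    "chr1\t300\t400\tGENE_A",
    "chr2\t50\t80\tGENE_B",
    "chr3\t1\t9\tGENE_X",
]
print(fraction_of_genes_targeted(bed_lines, ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]))
```

    A gene counts as targeted here as soon as one bait carries its name, which says nothing about how many of its exons are covered; a base-level comparison would give a different, probably lower, figure.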


  • bioinfosm
    replied
    Originally posted by NextGenSeq:
    I assume you are interested in this since you are doing whole exome sequence enrichment and subsequent sequencing.

    Different vendors have different amounts of "whole exome" coverage. We found that the Agilent Sure Select only enriches for ~89% of the human whole exome.
    NextGenSeq, how did you get the figure of ~89% of the exome targeted by Agilent? Could you share some detail on that?

    Thanks,
    sm


  • NextGenSeq
    replied
    I assume you are interested in this since you are doing whole exome sequence enrichment and subsequent sequencing.

    Different vendors have different amounts of "whole exome" coverage. We found that the Agilent Sure Select only enriches for ~89% of the human whole exome.


  • apratap
    replied
    Thanks, guys. I understand that it is acceptable to remove redundancy at the exon level.

    @frozenlyse: your final number (exons) seems to match mine.

    How do I deal with gene-level coverage? Many genes overlap each other, as noted in my first post.

    Total # bases in RefSeq Genes: 2,011,862,672

    Is it acceptable to remove redundancy when counting bases across all human genes? In a way this would lead us to underestimate coverage. I say so because overlapping genes can be co-expressed, right?

    Thanks for your time to help me understand this.

    Best,
    -Abhi


  • frozenlyse
    replied
    If you just want a base pair count for different annotations, you can use the UCSC Table Browser: choose the genome build and the annotation you are interested in, and press "summary/statistics" at the bottom. E.g. for hg18 RefSeq you get

    item count 34,702
    item bases 1,166,592,699 (40.49%)
    item total 2,020,112,601 (70.11%)
    smallest item 33
    average item 58,213
    biggest item 2,304,634
    block count 347,347
    block bases 66,601,430 (2.31%)
    block total 104,526,351 (3.63%)
    smallest block 3
    average block 301
    biggest block 59,461


    The "block" lines are what you are interested in: 347,347 exons from 34,702 RefSeq genes, with a total size of 104 Mb; after removing redundancy, 66 Mb is covered.


  • steven
    replied
    Originally posted by apratap:
    Clearly there are overlapping regions in each of these annotation files [...] Just wondering if I should count the bases common to two genes twice or only uniq regions should be counted.
    Most of the transcribed nucleotides of the human genome are represented in several different transcripts (whether or not these are considered the same "gene"). As Bio.X2Y pointed out, you definitely have to remove redundancy. You can send your annotations to Galaxy or use BEDtools to "collapse" ("project"/"fuse"/"merge") your annotated exons before adding up the lengths.
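
    For anyone who wants to see what "collapsing" does before adding up the lengths, here is a small self-contained sketch. It assumes exons are given as (chrom, start, end) tuples with 0-based, half-open coordinates; the function name and the example intervals are made up.

```python
from collections import defaultdict


def merged_length(intervals):
    """Total number of positions covered by at least one interval.

    intervals: iterable of (chrom, start, end), 0-based half-open.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in intervals:
        by_chrom[chrom].append((start, end))

    total = 0
    for spans in by_chrom.values():
        spans.sort()
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start <= cur_end:
                # Overlapping or directly adjacent: extend the current block.
                cur_end = max(cur_end, end)
            else:
                total += cur_end - cur_start
                cur_start, cur_end = start, end
        total += cur_end - cur_start
    return total


# Hypothetical exons: the first two overlap by 50 bases.
exons = [
    ("chr1", 100, 200),
    ("chr1", 150, 250),
    ("chr1", 300, 400),
    ("chr2", 100, 200),
]
naive = sum(end - start for _, start, end in exons)
print(naive, merged_length(exons))
```

    The naive sum counts the shared 50 bases twice, while the merged length counts each position once. With BEDtools the same collapsing is done by sorting the BED file by chromosome and start and then running `bedtools merge` on it.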
