  • AanaNahum
    replied
    Nice, helpful answers. I was about to ask a similar question and found it already answered here.


  • Ric69
    replied
    Here's a nice uncomplicated summary of hg19... https://grch37.ensembl.org/Homo_sapiens/Info/Annotation


  • zzta
    replied
    Sorry to revive this thread, but exons or CDSs are not the only thing transcribed, so how can we account for non-coding RNAs? My understanding is that they are also part of the transcriptome...


  • steven
    replied
    Originally posted by ulz_peter:
    Just to throw my 2 cents in: as far as I know, most exome-enriching kits use the CDS database for generating the exome library. As this database is less comprehensive than the RefSeq or knownGene annotations in UCSC, some exons will be missed. Of course, others are discarded because of hybridization difficulties (repetitive regions, etc.).
    That makes sense, thanks.


  • steven
    replied
    Originally posted by ssully:
    I keep seeing a figure of 30-33Mb for the human exome e.g.

    This 2009 Nature paper

    "Protein-coding regions constitute ~1% of the human genome or ~30 megabases (Mb), split across ~180,000 exons."

    30-33Mb is also the figure cited in Illumina's "Sequencing Output Calculator", sent to me by tech support.

    Anyone know why the number is so much higher on this thread?
    Because "protein coding regions" and "exons" are different things. UTRs can be long, especially in human.

    I think it is important to know what we are talking about:

    1. number of genomic positions that are annotated as coding (included in CDS)
    2. number of genomic positions that are annotated as exonic (included in exons)

    As frozenlyse and Richard Finney indicated, values for 2. range from about 60 to 80 Mb, depending on the annotation source.
    ssully, the 30 Mb figure in the citation you mention refers to 1. ("protein coding regions").
    rstarke, what does this figure of 1 billion refer to? "Annotated bases" can be anything: on a genome you can annotate introns, promoters, repeated regions... A link to this information would help.
    Now, is there a precise definition of "exome" or is it a loose term? Is it supposed to include coding regions only, or can anyone put in there some UTR, promoters, intronic flanks, etc?
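
To make the difference between counts 1. and 2. above concrete, here is a minimal Python sketch for a single transcript. The function name, the half-open (start, end) coordinate convention, and the toy coordinates are assumptions made for illustration; they are not taken from any annotation discussed in this thread.

```python
def exonic_and_coding_bases(exons, cds_start, cds_end):
    """Return (exonic, coding) base counts for one transcript.

    exons: list of half-open (start, end) intervals, assumed non-overlapping.
    cds_start, cds_end: half-open bounds of the coding region on the genome.
    """
    # Count 2: every position that falls inside an exon (UTRs included).
    exonic = sum(end - start for start, end in exons)
    # Count 1: only the part of each exon that lies inside the CDS bounds.
    coding = sum(
        max(0, min(end, cds_end) - max(start, cds_start))
        for start, end in exons
    )
    return exonic, coding


# Hypothetical transcript with a long 3' UTR in its last exon.
toy_exons = [(0, 500), (1000, 1200), (2000, 3500)]
print(exonic_and_coding_bases(toy_exons, 400, 2100))  # (2200, 400)
```

In this toy case most of the exonic sequence is UTR, which is the kind of gap that separates a "coding" total from an "exonic" total.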


  • ulz_peter
    replied
    Just to throw my 2 cents in: as far as I know, most exome-enriching kits use the CDS database for generating the exome library. As this database is less comprehensive than the RefSeq or knownGene annotations in UCSC, some exons will be missed. Of course, others are discarded because of hybridization difficulties (repetitive regions, etc.).


  • Richard Finney
    replied
    Our friend Mr. Ref Seq says ...

    Back of the envelope calculations:
    The sum of the values for base coverage of the exons for the data above in the hg19/UCSCknown table (posted above) is
    81,105,734

    The RefSeq table from UCSC for hg19 (Jan 2011 version) gives 63,995,498.
    [Method: load the table into a data structure, sort by name, and traverse it; if the current name equals the previous name, skip the row, otherwise sum its exon lengths and add that to the running total.] Nota bene: this won't eliminate some overlapping situations.
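
A rough Python sketch of the back-of-the-envelope method described above, not the actual code used. It assumes each row has already been reduced to (name, exonStarts, exonEnds), with the exon coordinates stored as comma-separated strings with a trailing comma, as in UCSC gene tables; the function name and the sample rows are made up.

```python
def sum_exons_first_per_name(rows):
    """Sum exon lengths, counting only the first row seen for each name.

    rows: iterable of (name, exon_starts, exon_ends), where the last two
    are comma-separated coordinate strings such as "100,300,".
    """
    total = 0
    previous_name = None
    # Sort by name only, so rows sharing a name end up adjacent.
    for name, exon_starts, exon_ends in sorted(rows, key=lambda r: r[0]):
        if name == previous_name:
            continue  # same name as the previous row: don't count it again
        previous_name = name
        starts = [int(x) for x in exon_starts.split(",") if x]
        ends = [int(x) for x in exon_ends.split(",") if x]
        total += sum(e - s for s, e in zip(starts, ends))
    return total


# Made-up rows: NM_B appears twice, so only its first row is counted.
toy_rows = [
    ("NM_B", "100,300,", "200,350,"),
    ("NM_A", "10,", "60,"),
    ("NM_B", "1000,", "2000,"),
]
print(sum_exons_first_per_name(toy_rows))  # 200
```

As noted, this only skips repeated names. Exons shared between differently named transcripts are still counted more than once.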

    Refseq is more conservative than UCSCknown and relies more on hand curation and less on computation.

    I don't know about GENCODE but if it's that for human only and that number is right then it's probably any transcript ever measured. I could only speculate on what that extra bonus coverage is. A free trip to Sweden goes to the guy that can explain and prove it (if it's functionally real).


  • rstarke
    replied
    I would also like to know why the huge discrepancy between what's in the literature (~30-40Mb) and the numbers cited in this thread. I just checked the GENCODE v6 annotations and the total annotated base count is over a billion, supporting the estimates in this thread. I'm confused. Can anyone clear up the discrepancy?


  • ssully
    replied
    I keep seeing a figure of 30-33Mb for the human exome e.g.

    This 2009 Nature paper

    "Protein-coding regions constitute ~1% of the human genome or ~30 megabases (Mb), split across ~180,000 exons."

    30-33Mb is also the figure cited in Illumina's "Sequencing Output Calculator", sent to me by tech support.

    Anyone know why the number is so much higher on this thread?


  • NextGenSeq
    replied
    By comparing the genes listed in the bed file to the UCSC annotation. I tried attaching the bed file but it's too large for this site to allow it.


  • bioinfosm
    replied
    Originally posted by NextGenSeq:
    I assume you are interested in this since you are doing whole exome sequence enrichment and subsequent sequencing.

    Different vendors have different amounts of "whole exome" coverage. We found that the Agilent SureSelect only enriches for ~89% of the human whole exome.
    NextGenSeq, how did you get the figure of ~89% of the exome targeted by Agilent? Could you share some detail on that?

    Thanks,
    sm


  • NextGenSeq
    replied
    I assume you are interested in this since you are doing whole exome sequence enrichment and subsequent sequencing.

    Different vendors have different amounts of "whole exome" coverage. We found that the Agilent SureSelect only enriches for ~89% of the human whole exome.


  • apratap
    replied
    Thanks Guys. I understand that it is acceptable to remove redundancy at exon level.

    @frozenlyse : your end number (exons) seems to match mine.

    How do I deal with gene-level coverage? Many genes overlap each other, as noted in my first post.

    Total # bases in RefSeq Genes : 2,011,862,672

    Is it acceptable to remove redundancy while counting bases in all human genes? In a way, this will lead us to underestimate coverage. I say so because overlapping genes can be co-expressed, right?

    Thanks for your time to help me understand this.

    Best,
    -Abhi


  • frozenlyse
    replied
    If you just want a base-pair count for different annotations, you can use the UCSC Table Browser: choose the genome build you are using and the annotation you are interested in, and press "summary/statistics" at the bottom. E.g., for hg18 RefSeq you get:

    item count 34,702
    item bases 1,166,592,699 (40.49%)
    item total 2,020,112,601 (70.11%)
    smallest item 33
    average item 58,213
    biggest item 2,304,634
    block count 347,347
    block bases 66,601,430 (2.31%)
    block total 104,526,351 (3.63%)
    smallest block 3
    average block 301
    biggest block 59,461


    The "block" lines are what you are interested in: 347,347 exons from 34,702 RefSeq genes, with a total size of 104 Mb; after removing redundancy, 66 Mb is covered.


  • steven
    replied
    Originally posted by apratap:
    Clearly there are overlapping regions in each of these annotation files [...] Just wondering if I should count the bases common to two genes twice or only uniq regions should be counted.
    Most of the transcribed nucleotides of the human genome are represented in several transcripts (whether or not these are considered the same "gene"). As Bio.X2Y pointed out, you definitely have to remove redundancy. You can send your annotations to Galaxy or use BEDtools to "collapse" ("project"/"fusion"/"merge") your annotated exons before adding up the lengths.
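
For anyone who wants to see what the "collapse" step does before summing, here is a minimal pure-Python sketch. It does not use Galaxy or BEDtools; the function names, the half-open (chrom, start, end) tuples, and the toy coordinates are assumptions for illustration only.

```python
from collections import defaultdict


def redundant_bases(exons):
    """Total exon length with overlapping bases counted once per exon."""
    return sum(end - start for _chrom, start, end in exons)


def merged_bases(exons):
    """Total exon length with every genomic position counted once.

    exons: iterable of half-open (chrom, start, end) intervals.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in exons:
        by_chrom[chrom].append((start, end))

    total = 0
    for spans in by_chrom.values():
        spans.sort()
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start <= cur_end:
                # Overlapping or touching: extend the current merged block.
                cur_end = max(cur_end, end)
            else:
                # Gap: close the current block and start a new one.
                total += cur_end - cur_start
                cur_start, cur_end = start, end
        total += cur_end - cur_start
    return total


toy_exons = [
    ("chr1", 100, 200),
    ("chr1", 150, 250),  # overlaps the previous exon by 50 bases
    ("chr1", 400, 450),
    ("chr2", 100, 200),  # same coordinates, different chromosome: no overlap
]
print(redundant_bases(toy_exons))  # 350
print(merged_bases(toy_exons))     # 300
```

The gap between the two totals is the same kind of difference as between the "block total" and "block bases" lines in the Table Browser summary.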
