  • kmcarr
    replied
    I am quite familiar with that data set! As you pointed out, this is an RNA sequencing project, so that is an added complication. The cDNA was generated using the Clontech SMART PCR protocol, which is supposed to generate full-length cDNAs, but this could introduce a bias. Also, if you look at some of the supplementary figures for the paper, you will see that there is a bias for reads starting at the 5' or 3' end of the predicted cDNA. This is to be expected. Unlike genomic DNA, where fragmentation should produce random start points, cDNA will always have fixed end points to start sequencing from.

    An important point to understand about the chemistry of the 454 sequencer is that if two reads start at exactly the same point and there are no missed or extra incorporations, then they will end at exactly the same point. The 454 runs a fixed number of sequencing cycles, so the read length is fixed for a given sequence. The GS20 (which was used for this study) ran 42 cycles with the base order TACG. If the bases in the sample are randomly distributed, you should see on average 2.5 bases incorporated per cycle, or an average read length of 105 nt (42 × 2.5). If the bases happen to be a repetitive stretch in the exact same order as the flow cycle, you would get 4 bases incorporated per cycle, for a read length of 168 nt (42 × 4). If the base order of the read were adverse to the flow order, you can see that you would end up with a length shorter than the expected ~105 nt. This may explain the 4000 72 nt RuBisCO reads. This library was prepared from leaf and was not normalized. There was a ton of RuBisCO mRNA present. In fact, only 10 transcripts (RuBisCO and chlorophyll subunits) accounted for >50% of all reads.
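The flow-order argument above can be sketched in a few lines of Python. This is only an idealized model: it assumes no missed or extra incorporations, uses the 42 cycles and TACG flow order quoted in the post, and the function name `read_length` and the example templates are made up for illustration.

```python
def read_length(template, flow_order="TACG", cycles=42):
    """Number of bases read from `template` under idealized flow chemistry.

    Assumption: every flow incorporates the whole homopolymer run that
    matches the flowed base, with no missed or extra incorporations.
    """
    pos = 0
    for _ in range(cycles):
        for base in flow_order:
            # Extend through all consecutive bases matching this flow.
            while pos < len(template) and template[pos] == base:
                pos += 1
    return pos


# Hypothetical templates, chosen only to show the two extremes.
in_phase = "TACG" * 100   # same order as the flows: 4 bases per cycle
adverse = "GCAT" * 100    # reverse order: each base waits 3 flows

print(read_length(in_phase))  # 168
print(read_length(adverse))   # 55
```

Because the function is deterministic, two reads that begin at the same template position always stop at the same position in this model, which is the point made in the post.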

  • behoward
    replied
    I think I have noticed the same thing...

    I am looking at a public 454 GS20 dataset from the paper "Sampling the Arabidopsis Transcriptome with massively parallel pyrosequencing" (Weber et al., Plant Physiology, May 2007). This was actually an early 'RNA-Seq' experiment, not a genome sequencing project. I have never worked with the equipment, though, so I'm no expert here.

    In any event, there seems to be an unexpected number of reads that are duplicates (multiple reads with exactly the same read start position and read length). Often you can see the exact same read two or three times, and in one extreme case (the extremely highly expressed RuBisCO gene), there are about 4000 duplicates of one short 72 bp read.
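As a rough illustration of the definition above (same start position and same read length), duplicates can be tallied from alignment coordinates. This is a minimal sketch: the tuple layout, the inclusion of strand, the function name `duplicate_groups`, and the toy records are all assumptions, not taken from the dataset.

```python
from collections import Counter


def duplicate_groups(alignments):
    """Return {(reference, strand, start, length): count} for keys seen > once.

    `alignments` is an iterable of (reference, strand, start, read_length)
    tuples; this layout is a hypothetical one chosen for the example.
    """
    counts = Counter(alignments)
    return {key: n for key, n in counts.items() if n > 1}


# Made-up records for illustration only.
alignments = [
    ("geneA", "+", 120, 72),
    ("geneA", "+", 120, 72),
    ("geneA", "+", 120, 72),
    ("geneA", "+", 120, 98),   # same start, different length: not a duplicate
    ("geneB", "-", 40, 101),
]

groups = duplicate_groups(alignments)
extra_copies = sum(n - 1 for n in groups.values())
print(groups, extra_copies)
```

Counting `n - 1` per group treats the first read as the original and the rest as redundant copies; other conventions are possible.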

    In some cases, I suppose, the duplicates could be a result of the end of a transcript, i.e. any fragment starting x bp before the end of a transcript will have the same length. But a lot of these reads occur in the middle of known gene models. Maybe occasionally they are short non-coding RNAs. But there are so many that it seems like it must be a technical bias...

  • kmcarr
    replied
    [c]oma,

    We did have a researcher report discovering this in their data as well. I never had a chance to follow up on any other samples, so I can't say how common this problem is in our hands. Our immediate thought was duplicates generated during the library amplification, since that is the most logical explanation. As in your case, though, there was evidence that it was not occurring at this step (different duplicates observed from multiple runs of the same library). The only explanation we could think of is that during the emulsification step some micelles (microreactors in 454 speak) were created that contained a single DNA molecule but multiple beads. This was only a hypothesis; we never did anything to test it. Maybe when I get some free time (yeah, right) I'll look at some of our other 454 runs to see how many duplicates may exist.

  • [c]oma
    replied
    Some of the duplicated reads are indeed short sequences, but a good majority of them fall inside the normal sequence length distribution. There is no PCR step involved prior to the emPCR as far as I know, since it's a genomic shotgun library.

    As for amplicon sequencing bias, a colleague of mine pointed me to this publication (granted, it's about Illumina data, but some of it might also apply to 454). Maybe it can help you understand your data better.

    I realize now that I underestimated the contribution of repeats to this phenomenon, so I am currently looking into that. Nonetheless any insight into this is appreciated!

  • joa_ds
    replied
    I can only tell you that the emulsion PCR has a huge quantification bias.

    When we were running amplicon sequencing, we normalized the amplicon DNA amounts before the emulsion PCR, and after the emulsion PCR there were sometimes 100-fold differences in coverage between certain amplicons.

    So yeah, it could be that certain sequences are preferentially amplified in the emPCR. Are those sequences the short sequences?

    How does the experiment work? Is there any amplification prior to the emulsion PCR?

  • Duplicate reads ("same start" reads) in 454 FLX/Titanium shotgun runs

    Hi all,

    I have been performing an in-depth quality analysis of some of our 454 whole-genome shotgun runs for a fungal species (~35-70 Mb genome) and a plant species (~1 Gb genome) from both FLX and Titanium runs. In both datasets, between 15 and 35% of the reads in each individual run are duplicate reads, i.e. the first 100 nt or more are exactly the same and they start at exactly the same nucleotide. Even though both genomes are repetitive (to some extent), this is far more than expected by chance alone. Our hypothesis at the moment is that these duplicates are a result of the emulsion PCR step, but we think the percentage is really on the high side! Few duplicates are shared between runs from the same library, so it is not a library issue. Furthermore, we observe roughly the same numbers for paired-end libraries, which supports our hypothesis of this being an emPCR problem.
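For anyone wanting to run a comparable check, here is a minimal sketch of the prefix-based definition used above (identical first 100 nt). It assumes reads are already loaded as plain strings, skips reads shorter than the prefix, and counts only the redundant copies in each group as duplicates; the function names and toy reads are hypothetical.

```python
from collections import Counter


def duplicate_fraction(reads, prefix_len=100):
    """Fraction of reads that repeat an already-seen prefix.

    Assumption: two reads are duplicates if their first `prefix_len`
    bases are identical. Reads shorter than `prefix_len` are ignored.
    """
    prefixes = [r[:prefix_len] for r in reads if len(r) >= prefix_len]
    if not prefixes:
        return 0.0
    counts = Counter(prefixes)
    redundant = sum(n - 1 for n in counts.values())
    return redundant / len(prefixes)


def shared_prefix_count(run_a, run_b, prefix_len=100):
    """Number of distinct prefixes seen in both runs of the same library."""
    a = {r[:prefix_len] for r in run_a if len(r) >= prefix_len}
    b = {r[:prefix_len] for r in run_b if len(r) >= prefix_len}
    return len(a & b)


# Toy reads with a short prefix so the example stays readable.
toy_run = ["ACGTACC", "ACGTAGG", "TTTTTAA", "ACG"]
toy_fraction = duplicate_fraction(toy_run, prefix_len=5)
toy_shared = shared_prefix_count(toy_run, ["ACGTATT", "GGGGGGG"], prefix_len=5)
print(toy_fraction, toy_shared)
```

Comparing `duplicate_fraction` within each run against `shared_prefix_count` between runs of the same library is one way to separate library-level duplicates from those arising later, under the assumptions stated above.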

    Does anyone here have any experience with such analyses, and if so, do you find similar numbers?
    Last edited by [c]oma; 03-27-2009, 09:34 AM.
