
repeat reads in Solexa


  • repeat reads in Solexa

    I found lots of reads were mapped at the same position on the reference.

    For 454 reads, if they start at the same position, and look the same, then only one of them would be kept, because it is likely caused by some experimental process.

    How about Solexa? Does it have the same problem?

  • #2
    If you mean, will the Solexa pipeline throw them out, the answer is no. PhiX is only about 5 kb, so there are lots and lots of identical reads.

    But I suppose every aligner is different, though throwing them out would be a little strange... what if you are trying to count coverage of something which has been sequenced very thoroughly? You'd need all the reads.



    • #3
      How much starting material did you have and how amplified is the material? Overamplified material will align in a 'blocky' fashion no matter what the platform. However, if you're getting multiple alignments spread only 1bp apart then it is most likely high coverage and is fine.

      In agreement with swbarnes2, counting the 'pile-up' of sequences is required for a number of applications: de novo sequencing, ChIP-Seq/BS-Seq, RNA-Seq, SNP discovery/allele determination, etc.



      • #4
        Something to remember when using high-throughput sequencing is that you get so much sequence that duplicates from the PCR amplification step during library preparation can become an issue. This is much more of an issue with a lot of PCR cycles or very deep coverage of a particular library. For single-read sequence it's hard to tell what's a duplicate and what's not, but for paired-end sequence you'll see pretty obvious re-sequencing, depending on the quality of your library preparation.

        But to answer your question, the default Eland/Phagealign aligners in the Illumina pipeline will keep all reads that map to the genome given their criteria, because for all intents and purposes the sequence was there and it's fine. If you don't use those algorithms, then it depends on which alignment algorithm you use, because the base-calling step outputs sequence for all the clusters that pass the filters.
        Cheers,

        Brad



        • #5
          Actually, at the Illumina user meeting, Illumina showed off some of their own software that is downstream of ELAND, and I think they said that GenomeStudio, or maybe CASAVA, which feeds into GenomeStudio, will throw away paired reads if they are identical at both ends to the reads from other clusters, figuring (I think rightly) that if your sequence is the same at both ends, it's probably a PCR artifact. But if you are looking at something like SOAP or Bowtie, they just output the hits, one at a time. They don't remember where every hit is.



          • #6
            Yes, by default CASAVA throws away duplicate reads, which can account for up to 10% or so of deep-coverage sequencing runs (30X on the human genome). But ELAND, like Bowtie, just outputs alignments.



            • #7
              CASAVA does it only for paired-end reads.

              It is very important to do it for single reads too, but Illumina does not provide that.
              If you don't remove repeated reads you'll get more errors (see Nature, Sarah B. Ng).

              Does anybody know of software that can do this?



              • #8
                Originally posted by vasvale
                If you don't remove repeated reads you'll get more errors (see Nature, Sarah B. Ng).
                Depending on the nature of your data, removing duplicates may or may not be the correct thing to do. If you're using short reads and have high coverage over a region then eventually you're going to start getting exact duplicates - there are only so many places you can put a 36bp read after all! Depending on your application, removing these could be the wrong thing to do and might bias your quantitation.
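
A rough sketch of why exact duplicates become unavoidable at high coverage, assuming read starts are spread uniformly over the region (an idealisation, not something stated in this thread). With n reads and m possible start positions on one strand, the expected number of distinct starts is m(1 - (1 - 1/m)^n). The function name and the 1 kb example region are made up for illustration.

```python
def expected_duplicate_fraction(n_reads: int, n_positions: int) -> float:
    """Expected fraction of reads that repeat an already-used start position.

    Assumes every start position is equally likely (an assumption of this
    sketch; real libraries are not uniform).
    """
    if n_reads <= 0 or n_positions <= 0:
        raise ValueError("n_reads and n_positions must be positive")
    # Expected number of distinct occupied start positions (occupancy problem).
    expected_distinct = n_positions * (1.0 - (1.0 - 1.0 / n_positions) ** n_reads)
    return 1.0 - expected_distinct / n_reads


if __name__ == "__main__":
    # Hypothetical 1 kb region: a 36 bp read can start at 965 forward-strand positions.
    region_length, read_length = 1000, 36
    positions = region_length - read_length + 1
    for n in (100, 1000, 10000):
        print(n, round(expected_duplicate_fraction(n, positions), 3))
```

Comparing an observed duplicate fraction with this chance expectation is one way to judge whether a pileup looks like deep coverage or like a PCR artefact.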

                The approach I've tended to take is to filter out regions where the proportion of reads coming from exact overlaps is above a cutoff (e.g. 10%), and this seems to work pretty well to remove artefacts. This too will break down where you have lots of reads in a really short region, but it's scaled pretty well in the work we've done so far. I suppose eventually some observed/expected value calculated from the size of the region and the length and number of reads would be the best way to spot regions which have been affected by PCR or mapping artefacts.
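
The cutoff filter described above could look something like the sketch below. It is not the poster's actual code: the read layout (chromosome, start, strand), the function names, and the choice to count only the extra copies in each stack as "reads coming from exact overlaps" are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# Hypothetical read layout: (chromosome, start, strand).
Read = Tuple[str, int, str]


def duplicate_fraction(reads: Iterable[Read]) -> float:
    """Fraction of reads that are extra copies of an already-seen position."""
    counts = Counter(reads)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    extra_copies = total - len(counts)
    return extra_copies / total


def filter_regions(regions: Dict[str, List[Read]],
                   cutoff: float = 0.10) -> Dict[str, List[Read]]:
    """Keep only regions whose duplicate fraction is at or below the cutoff."""
    return {name: reads for name, reads in regions.items()
            if duplicate_fraction(reads) <= cutoff}


if __name__ == "__main__":
    regions = {
        # Staggered starts: looks like genuine coverage.
        "peak_a": [("chr1", 100 + i, "+") for i in range(20)],
        # One tall stack at a single position: looks like an artefact.
        "peak_b": [("chr1", 500, "+")] * 15 + [("chr1", 510, "+")],
    }
    print(sorted(filter_regions(regions)))
```

As the post notes, a fixed cutoff like this will start rejecting genuine signal once a short region holds very many reads.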



                • #9
                  Originally posted by vasvale
                  CASAVA does it only for paired-end reads.

                  It is very important to do it for single reads too, but Illumina does not provide that.
                  If you don't remove repeated reads you'll get more errors (see Nature, Sarah B. Ng).

                  Does anybody know of software that can do this?
                  @vasvale: I believe Maq can do this.

                  maq rmdup out.rmdup.map in.ori.map
                  Remove pairs with identical outer coordinates.

                  Check out http://maq.sourceforge.net/maq-manpage.shtml
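
To make "identical outer coordinates" concrete, here is a minimal sketch of the idea. It is not Maq's implementation: the `Pair` record, its field names, and the rule of keeping the pair with the highest mapping quality are assumptions made for illustration.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Pair(NamedTuple):
    """A mapped read pair (hypothetical record, not Maq's map format)."""
    name: str
    chrom: str
    left: int   # leftmost mapped base of the pair
    right: int  # rightmost mapped base of the pair
    mapq: int   # mapping quality of the pair


def remove_duplicate_pairs(pairs: Iterable[Pair]) -> List[Pair]:
    """Keep one pair per (chrom, left, right).

    Among duplicates the pair with the highest mapping quality is kept;
    on a tie the first one seen wins (both choices are assumptions).
    """
    best: Dict[Tuple[str, int, int], Pair] = {}
    for pair in pairs:
        key = (pair.chrom, pair.left, pair.right)
        kept = best.get(key)
        if kept is None or pair.mapq > kept.mapq:
            best[key] = pair
    return list(best.values())


if __name__ == "__main__":
    pairs = [
        Pair("p1", "chr1", 100, 350, 30),
        Pair("p2", "chr1", 100, 350, 60),  # same outer coordinates as p1
        Pair("p3", "chr1", 100, 351, 20),  # differs by 1 bp at the far end
    ]
    for kept in remove_duplicate_pairs(pairs):
        print(kept.name)
```

Because both ends have to match, a pair that differs by a single base at either end is kept as a separate fragment.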



                  • #10
                    This is an interesting discussion. Here are the points I had:

                    - How do you reach CASAVA with paired-end RNA-Seq data? GERALD has the options eland_pair or eland_rna; how do you run ELAND on paired RNA-Seq data?

                    - Vasvale, how do you run CASAVA for PE RNA-Seq data?

                    - Simon, do you apply this filter (repeats making up more than 10% of the data) to RNA-Seq only, or to all kinds of sequencing?

                    - I know of TopHat, which also uses the paired-end information for PE RNA-Seq. Has anyone compared that with CASAVA->GenomeStudio results?
                    --
                    bioinfosm



                    • #11
                      I'd need it for single-end reads from genomic DNA.
                      I don't know how to use MAQ.



                      • #12
                        If your output alignment is in the SAM/BAM format, SAMtools can remove duplicate reads for you.
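
For the single-end case asked about earlier, the underlying idea can be sketched as below. This is not SAMtools' code: the `Alignment` record and function names are hypothetical, and treating two reads as duplicates when they share chromosome, strand, and 5' end is an assumption of the sketch.

```python
from typing import Iterable, List, NamedTuple, Set, Tuple


class Alignment(NamedTuple):
    """A mapped single-end read (hypothetical record, not a SAM parser)."""
    name: str
    chrom: str
    start: int   # 1-based leftmost mapped base
    length: int  # number of reference bases covered
    strand: str  # "+" or "-"


def five_prime_end(aln: Alignment) -> int:
    """Reference coordinate of the read's 5' end."""
    if aln.strand == "+":
        return aln.start
    # Reverse-strand reads begin at their rightmost mapped base.
    return aln.start + aln.length - 1


def remove_single_end_duplicates(alignments: Iterable[Alignment]) -> List[Alignment]:
    """Keep the first read seen for each (chrom, strand, 5' end)."""
    seen: Set[Tuple[str, str, int]] = set()
    unique: List[Alignment] = []
    for aln in alignments:
        key = (aln.chrom, aln.strand, five_prime_end(aln))
        if key not in seen:
            seen.add(key)
            unique.append(aln)
    return unique


if __name__ == "__main__":
    reads = [
        Alignment("r1", "chr1", 100, 36, "+"),
        Alignment("r2", "chr1", 100, 36, "+"),  # same start and strand as r1
        Alignment("r3", "chr1", 100, 36, "-"),  # same span, opposite strand
        Alignment("r4", "chr1", 105, 31, "-"),  # trimmed, same 5' end as r3
    ]
    print([aln.name for aln in remove_single_end_duplicates(reads)])
```

With only one end to compare, this removes far more genuine reads at high coverage than the paired-end version does, which is the concern raised earlier in the thread.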



                        • #13
                          Originally posted by vasvale
                          It is very important to do it for single reads too, but Illumina does not provide that.
                          If you don't remove repeated reads you'll get more errors (see Nature, Sarah B. Ng).
                          Could you provide more details on that "Nature Sarah B. Ng" reference? I can't locate it on PubMed or Google Scholar...

                          I tend to agree that removing exact duplicates is important, especially for any quantitative counting. I've seen huge suspect duplicate pileups that are difficult to explain other than being due to PCR artefacts...



                          • #14
                            exome sequencing

                            Originally posted by hingamp
                            Could you provide more details on that "Nature Sarah B. Ng" reference? I can't locate it on PubMed or Google Scholar...

                            I tend to agree that removing exact duplicates is important, especially for any quantitative counting. I've seen huge suspect duplicate pileups that are difficult to explain other than being due to PCR artefacts...
                            It might be this paper:
                            Title: Targeted capture and massively parallel sequencing of 12 human exomes

                            Nature 461, 272-276 (10 September 2009) | doi:10.1038/nature08250; Received 5 June 2009; Accepted 29 June 2009; Published online 16 August 2009

                            Sarah B. Ng, Emily H. Turner, Peggy D. Robertson, Steven D. Flygare, Abigail W. Bigham, Choli Lee, Tristan Shaffer, Michelle Wong, Arindam Bhattacharjee, Evan E. Eichler, Michael Bamshad, Deborah A. Nickerson & Jay Shendure

                            In their particular case, duplicated reads should be filtered out, as their goal is to find mutations (SNPs). But for RNA-Seq, it's hard to say whether the duplicated reads are artifacts or a reflection of real biology.
                            Last edited by liguow; 12-31-2009, 09:53 AM.



                            • #15
                              artifact?

                              Would you not expect many identical reads in an RNA-Seq experiment where amplification has been conducted? If fragments of length 300 are selected and then amplified (>=15 cycles or so), I would suspect that many identical clusters will form on the flow cell. This is an effect of PCR amplification, sure, but I would not say that it is an artifact. Or am I missing something?
