  • Bueller_007
    Member
    • May 2010
    • 16

    Removing duplicate reads from multigig .csfasta

    Hi all.

    I'm trying to do a de novo transcriptome assembly using ABI SOLiD data. I'm trying to use Velvet/Oases at the moment, and I've found that PCR duplicates seem to be a serious problem during the postprocessing step when the double-encoded contigs are converted back into colour-space reads prior to the final assembly. This step takes at least 72 hours, which is an order of magnitude greater than the time required by the Velvet/Oases assemblers themselves. The postprocessing output file just keeps swelling in size because there are so many PCR duplicates.

    So the question is: is there an efficient program out there I can use to remove duplicate reads from my .csfasta (and preferably the corresponding _QV.qual) file prior to assembly? I know there's an option to do this filtering on the SOLiD machine itself, but the person who did the sequencing didn't enable it.

    Thanks.
  • drio
    Senior Member
    • Oct 2008
    • 323

    #2
    I don't think there is anything like that out there. You need alignments to detect duplicates.
    About the SOLiD instrument filtering, perhaps you are talking about dropping reads with low quality?
    -drd


    • Bueller_007
      Member
      • May 2010
      • 16

      #3
      Originally posted by drio
      I don't think there is anything like that out there. You need alignments to detect duplicates.
      About the SOLiD instrument filtering, perhaps you are talking about dropping reads with low quality?
      I don't think I need alignments, as I'm talking about identical ~reads~. Removing these duplicates can be performed by Corona prior to data output using the --noduplicates option. However, I can't find an equivalent for data that has already been outputted by the SOLiD system.

      There are multiple programs available for filtering out low-quality reads. That's not what I need.


      • nilshomer
        Nils Homer
        • Nov 2008
        • 1283

        #4
        Originally posted by Bueller_007
        I don't think I need alignments, as I'm talking about identical ~reads~. Removing these duplicates can be performed by Corona prior to data output using the --noduplicates option. However, I can't find an equivalent for data that has already been outputted by the SOLiD system.

        There are multiple programs available for filtering out low-quality reads. That's not what I need.
        A few lines of your favorite programming language should be able to do it. Lexicographically sort by sequence and remove duplicates.
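A minimal sketch of the sort-and-collapse idea, not a tested tool. It assumes each read occupies one header line (starting with ">") and one data line in both the .csfasta and the _QV.qual file, that both files list reads in the same order, and that lines starting with "#" are comments. The function names are hypothetical. Everything is held in memory here; for multi-gigabyte input, an on-disk (external) sort would probably be needed instead.

```python
import itertools


def read_records(lines):
    """Yield (header, body) pairs, skipping blank lines and '#' comments."""
    header = None
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        if line.startswith(">"):
            header = line
        elif header is not None:
            yield header, line
            header = None


def dedupe_sorted(csfasta_lines, qual_lines):
    """Sort reads by sequence and keep the first read of each identical group.

    Returns a list of (sequence, header, quality_line) tuples.
    """
    records = []
    paired = zip(read_records(csfasta_lines), read_records(qual_lines))
    for (seq_header, seq), (qual_header, qual) in paired:
        if seq_header != qual_header:
            raise ValueError(f"header mismatch: {seq_header} vs {qual_header}")
        records.append((seq, seq_header, qual))

    # list.sort is stable, so within a group of identical sequences the
    # read that appeared first in the input stays first.
    records.sort(key=lambda rec: rec[0])

    kept = []
    for _, group in itertools.groupby(records, key=lambda rec: rec[0]):
        kept.append(next(group))
    return kept
```

With real files, the two arguments would be open file handles, and each kept tuple would be written back out to a new .csfasta and _QV.qual pair.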


        • drio
          Senior Member
          • Oct 2008
          • 323

          #5
          Originally posted by nilshomer
          A few lines of your favorite programming language should be able to do it. Lexicographically sort by sequence and remove duplicates.
          Something like this: http://github.com/drio/dups.fasta.qual
          -drd


          • Bueller_007
            Member
            • May 2010
            • 16

            #6
            Thanks. I didn't get email notifications that people had replied to my post, so I didn't find these until just now.

            For what it's worth, I believe that FASTX_collapser ( http://hannonlab.cshl.edu/fastx_toolkit/ ) can also do this, with the caveat that your .csfasta and _QV.qual have to be merged into a .fastq first (with the .csfasta double-encoded) if you also want to remove the duplicates from your _QV.qual file.
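The merge step mentioned above might look roughly like the sketch below. It is an illustration under stated assumptions, not the FASTX toolkit's own procedure: the helper name is hypothetical, double-encoding is taken to mean mapping colours 0/1/2/3 to A/C/G/T and "." to N, only the leading primer base is dropped (some tools also drop the first colour, so check what your assembler expects), negative quality values are clamped to 0, and qualities are written as Phred+33.

```python
# Assumed double-encoding table: colour digits to pseudo-bases, '.' to N.
DOUBLE_ENCODE = str.maketrans("0123.", "ACGTN")


def to_fastq_record(header, colour_read, qual_line):
    """Build one FASTQ record from a csfasta header/read and its qual line."""
    name = header.lstrip(">")
    colours = colour_read[1:]  # drop the leading primer base, e.g. 'T'
    seq = colours.translate(DOUBLE_ENCODE)

    # Missing calls are commonly reported as -1; clamp them to 0.
    quals = [max(int(q), 0) for q in qual_line.split()]
    if len(quals) != len(seq):
        raise ValueError(f"{name}: {len(seq)} colours but {len(quals)} qualities")

    # Phred+33 encoding, capped at the highest printable value.
    qual_str = "".join(chr(min(q, 93) + 33) for q in quals)
    return f"@{name}\n{seq}\n+\n{qual_str}\n"
```

The resulting double-encoded records are not real nucleotide sequence, so they are only meaningful to tools that expect this encoding.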


            • Chipper
              Senior Member
              • Mar 2008
              • 323

              #7
              Wouldn't removing all identical reads result in enrichment of reads with errors? Perhaps filtering on the first part of each read and allowing some duplicates would work better.
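One way to read this suggestion, sketched under assumptions: compare only a fixed-length prefix of each read and keep up to a small number of reads per prefix. The function name and the `prefix_len` and `max_copies` parameters are hypothetical, and their default values are arbitrary placeholders, not recommendations.

```python
from collections import Counter


def filter_by_prefix(records, prefix_len=25, max_copies=2):
    """Yield (header, read) pairs, keeping at most max_copies per read prefix.

    The prefix is taken from the raw csfasta read, so it includes the
    leading primer base.
    """
    seen = Counter()
    for header, read in records:
        key = read[:prefix_len]
        if seen[key] < max_copies:
            seen[key] += 1
            yield header, read
```

The counter holds one entry per distinct prefix, so memory use grows with the number of distinct prefixes rather than with the total number of reads.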


              • Bueller_007
                Member
                • May 2010
                • 16

                #8
                Originally posted by Chipper
                Wouldn't removing all identical reads result in enrichment of reads with errors? Perhaps filtering on the first part of each read and allowing some duplicates would work better.
                Probably true. That's why it's better to remove duplicates after alignment/assembly. Unfortunately, I'm feeding the end-product to CLC Genomics Workbench and they don't have duplicate removal yet. The dupes are messing up my SNP discovery pretty badly.

                I'd turn on a maximum coverage limit, but since it's a transcriptome, the coverage varies with expression level, so I'm hesitant to omit highly covered regions. I've tried exporting to BAM, removing dupes with Picard and importing back in, but the reimport didn't work for whatever reason.
