Dereplication tools needed

  • Dereplication tools needed

    It's been a while since I last worked with shotgun Illumina metagenomic data. I've found it important to dereplicate before any downstream analysis, to avoid assembly problems and inaccurate quantification. Last time I used usearch --derep_fulllength on a subset of the data to filter out artificial replicate reads, but it is choking on the larger datasets I have now. My approach was to identify a high-quality subsection of R1, dereplicate on that, and then filter the corresponding reads out of the raw data. The reason is that a single cycle often has a high error rate, and error is always higher toward the end of the read, so some true replicates would be missed if the whole read were compared.
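The window-based approach described above can be sketched in Python as follows. This is a minimal illustration, not the usearch method: the function names, the `start`/`length` window parameters, and the choice to keep short reads are all assumptions made for the example. It holds one key per unique window in memory, so for tens of millions of reads the set may need several gigabytes.

```python
import io


def read_fastq(handle):
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ handle."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def derep_by_window(handle, start=10, length=50):
    """Yield the first read seen for each distinct window seq[start:start+length].

    Reads too short to cover the whole window are kept, since they cannot
    be compared on equal terms (an assumption of this sketch).
    """
    seen = set()
    for header, seq, qual in read_fastq(handle):
        key = seq[start:start + length]
        if len(key) < length:
            yield header, seq, qual
            continue
        if key in seen:
            continue  # same high-quality window already seen: treat as replicate
        seen.add(key)
        yield header, seq, qual


if __name__ == "__main__":
    # r1 and r2 differ only in the last base, outside the compared window.
    demo = (
        "@r1\nACGTACGTAC\n+\nIIIIIIIIII\n"
        "@r2\nACGTACGTAG\n+\nIIIIIIIIII\n"
        "@r3\nTTTTACGTAC\n+\nIIIIIIIIII\n"
    )
    for header, seq, qual in derep_by_window(io.StringIO(demo), start=0, length=8):
        print(header, seq)
```

The headers of the reads that survive can then be used to pull the matching full-length R1 and R2 records out of the raw files.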

    Can anyone recommend a good current tool for dereplicating Illumina reads? My datasets are about 20-30 million reads each. I came across Fulcrum through a Google search. Does anyone have experience with it? (paper)

  • #2
    Here is a Python script that I wrote to dereplicate larger FASTA files: github
    The only requirement is that BioPython be on your Python path. Use the -h option for more information. Example usage: derep_seqs.py -i somefasta.fna

    I hope it helps you!



    • #3
      The BBMap package contains a tool for dereplication. It's intended for assembly dereplication, but it is written so that it also works with paired-end FASTQ reads.

      Usage:
      dedupe.sh in=reads.fq out=fixed.fq maxsubs=0 int=t ac=f

      If your OS does not process bash shell scripts, you can replace "dedupe.sh" with "java -Xmx31g -cp /path/to/current jgi.Dedupe", where 31g should be adjusted to around 80% of your physical memory.

      "maxsubs=0" means that only exact matches are allowed; you can make that number higher if you want. "int=t" is used to indicate that the data is paired and interleaved. If your data is paired and in 2 files, you need to interleave it first:

      reformat.sh in1=reads1.fq in2=reads2.fq out=interleaved.fq
      (or java -Xmx200m -cp /path/to/current jgi.ReformatReads)

      Other options are random subsampling to reduce the data volume, and error correction to make duplicates easier to detect. The BBMap package contains an error-correction program (ecc.sh), though its effectiveness depends on the makeup of the metagenomic community and on data volume, because it relies on high kmer depth. 30 million reads is pretty small for a real metagenome.
      Last edited by Brian Bushnell; 02-18-2015, 10:27 AM.

