
  • GenoMax
    replied
    @Dario1984: Have you run clumpify on your data? If you have excellent libraries with tightly controlled insert sizes you will find the duplicate rate to be well controlled. Brian has some explicit use cases in his Biostars post.


  • Dario1984
    replied
    The Clumpify documentation webpage doesn't mention anything about removing duplicates, which I read about in a blog. Could it be expanded? I have NovaSeq whole-genome sequencing data (NovaSeq Control Software version 1.4.0 and Real Time Analysis version 3.3.3 acquisition, followed by bcl2fastq version 2.20.0.422 conversion) for human samples at about 90x coverage, so I think it's important that I use Clumpify. I intend to map the reads with bwa, and I'm not sure whether it supports a mix of paired and merged reads (its documentation is minimal), so I plan to skip the clumping, if possible.


  • DCZ
    replied
    Ah, if only everything was this simple! Should've thought of that! Thanks!


  • GenoMax
    replied
    @DCZ: That is easy to do. If you do not provide an "out=" argument, most BBTools will perform the operation and report the relevant statistics without writing any output.

    Tip: If you ever want to pipe things, you can use "out=stdout.fq" in the first tool and then "in=stdin.fq" in the next tool. You get the idea.
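
Both suggestions can be sketched in shell as below. This is only an illustration: the two function names and the file names are hypothetical, and the functions print the command lines rather than run them so they can be checked first. The `dedupe` and `optical` flags are the ones used elsewhere in this thread. The pipe example uses bbduk.sh feeding dedupe.sh (whose usage line lists stdin and stdout), since the thread does not confirm that clumpify.sh itself can stream.

```shell
# Sketch only: these helpers print BBTools command lines instead of running them.
# Function names and file names are hypothetical.

# Statistics only: no "out=" argument is given, so no output file is written.
stats_only_cmd() {
    printf 'clumpify.sh in=%s dedupe optical\n' "$1"
}

# Piping: the first tool writes to stdout.fq and the next tool reads stdin.fq.
piped_cmd() {
    printf 'bbduk.sh in=%s out=stdout.fq | dedupe.sh in=stdin.fq out=%s\n' "$1" "$2"
}
```

Calling `stats_only_cmd reads.fq.gz` prints the command to paste into the shell once it looks right.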


  • DCZ
    replied
    Hi Brian,

    I'm really appreciating clumpify, it's fast & does exactly what it should.

    I'm trying to use it for optical duplicate detection, which works great. However, I wish only to report the number of optical duplicates, without creating the deduplicated output fastq file. Is there a possibility to skip producing output? At the moment the writing output step takes the longest time in my pipeline.

    Thanks in advance


  • kokyriakidis
    replied
    I am having trouble using clumpify with the optical and dedupe parameters to remove optical duplicates, e.g. clumpify.sh in=temp.fq.gz out=clumped.fq.gz dedupe optical. Clumpify works without these parameters.


  • GenoMax
    replied
    I am able to do something like

    Code:
    for f in *_1.fastq; do i=${f%_1.fastq}; clumpify.sh -Xmx10g in1="${i}_1.fastq" in2="${i}_2.fastq" out1="${i}_clu_1.fastq" out2="${i}_clu_2.fastq"; done
    and have clumpify.sh produce two files. I am not sure why you are having trouble.


  • kokyriakidis
    replied
    I am using the latest version of BBTools. I can't get it to work.


  • GenoMax
    replied
    Are you using the latest version of BBMap? Have you tried to run a test with actual file names instead of shell variables?


  • kokyriakidis
    replied
    Originally posted by GenoMax
    It looks like out1= and out2= variables are not being correctly expanded. BBMap seems to think that your outputs are inputs (in1=./Preproccesing/ERR522065/ERR522065_1_optical.fastq.gz, in2=./Preproccesing/ERR522065/ERR522065_2_optical.fastq.gz). Are the input files in the correct directory with the right names?
    Yes! All the files are in the same folder! Actually, neither clumpify with dedupe optical nor filterbytile works, so I have to remove them in order to complete my pipeline...


  • GenoMax
    replied
    It looks like out1= and out2= variables are not being correctly expanded. BBMap seems to think that your outputs are inputs (in1=./Preproccesing/ERR522065/ERR522065_1_optical.fastq.gz, in2=./Preproccesing/ERR522065/ERR522065_2_optical.fastq.gz). Are the input files in the correct directory with the right names?
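
One way to rule out unexpanded variables or missing files is to check each path right before the failing step. A minimal sketch in shell, where `check_inputs` is a hypothetical helper and not part of BBTools:

```shell
# Return 0 only if every listed file exists and is non-empty.
# Each missing or empty path is reported on stderr.
check_inputs() {
    rc=0
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            rc=1
        fi
    done
    return "$rc"
}
```

For example, `check_inputs "./Preproccesing/${ERR}/${ERR}_1_optical.fastq.gz" "./Preproccesing/${ERR}/${ERR}_2_optical.fastq.gz"` would show whether the second clumpify output was ever written before the next tool tries to read it.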


  • kokyriakidis
    replied
    After running clumpify to remove duplicates with in1/in2 and out1/out2, it seems to produce only one output file, which breaks the rest of the pipeline! Why does this happen?

    ./bbmap/clumpify.sh in1=./Preproccesing/${ERR}/${ERR}_1_1.fastq.gz in2=./Preproccesing/${ERR}/${ERR}_2_1.fastq.gz out1=./Preproccesing/${ERR}/${ERR}_1_optical.fastq.gz out2=./Preproccesing/${ERR}/${ERR}_2_optical.fastq.gz dedupe=true optical=true overwrite=true

    ------

    Reset INTERLEAVED to false because paired input files were specified.
    Set INTERLEAVED to false
    Input is being processed as paired
    Writing interleaved.
    Made a comparator with k=31, seed=1, border=1, hashes=4
    Time: 22.512 seconds.
    Reads Processed: 13371k 593.99k reads/sec
    Bases Processed: 1145m 50.88m bases/sec
    Executing clump.KmerSort3 [in1=./Preproccesing/ERR522065/ERR522065_1_optical_clumpify_p1_temp%_10a607a7b7090ec6.fastq.gz, in2=, out=./Preproccesing/ERR522065/ERR522065_1_optical.fastq.gz, out2=, groups=11, ecco=f, addname=false, shortname=f, unpair=f, repair=false, namesort=false, ow=true]

    ------

    java -Djava.library.path=/mnt/scratchdir/home/kyriakidk/KIWI/bbmap/jni/ -ea -Xmx33412m -Xms33412m -cp /mnt/scratchdir/home/kyriakidk/KIWI/bbmap/current/ jgi.BBDukF in1=./Preproccesing/ERR522065/ERR522065_1_optical.fastq.gz in2=./Preproccesing/ERR522065/ERR522065_2_optical.fastq.gz
    Executing jgi.BBDukF [in1=./Preproccesing/ERR522065/ERR522065_1_optical.fastq.gz, in2=./Preproccesing/ERR522065/ERR522065_2_optical.fastq.gz]
    Version 38.11

    No output stream specified. To write to stdout, please specify 'out=stdout.fq' or similar.
    Exception in thread "main" java.lang.RuntimeException: Can't read file './Preproccesing/ERR522065/ERR522065_2_optical.fastq.gz'
    Last edited by kokyriakidis; 07-21-2018, 12:51 AM.


  • GenoMax
    replied
    Interesting point. I have always worked with data of uniform length. Based on what you have discovered, clumpify does seem to assume that all reads are the same length.

    Two options come to mind:

    1. You could trim the extra base off the end of the 251 bp reads with bbduk.sh to make them 250 bp.
    2. You could try dedupe.sh, which can match subsequences.
    Code:
    dedupe.sh
    
    Written by Brian Bushnell and Jonathan Rood
    Last modified March 9, 2017
    
    Description:  Accepts one or more files containing sets of sequences (reads or scaffolds).
    Removes duplicate sequences, which may be specified to be exact matches, subsequences, or sequences within some percent identity.
    Can also find overlapping sequences and group them into clusters.
    Please read bbmap/docs/guides/DedupeGuide.txt for more information.
    
    Usage:     dedupe.sh in=<file or stdin> out=<file or stdout>
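
Before choosing between the two options above, it can help to confirm that the reads really do have mixed lengths. A minimal sketch, assuming plain 4-line FASTQ records with no wrapped sequence lines; `read_length_histogram` is a hypothetical helper name:

```shell
# Print each distinct read length in a FASTQ file with its count, shortest first.
# Assumes 4-line records: the sequence is always the second line of each record.
read_length_histogram() {
    awk 'NR % 4 == 2 { n[length($0)]++ } END { for (l in n) print l, n[l] }' "$1" | sort -n
}
```

For gzipped input, something like `gzip -dc reads.fq.gz | read_length_histogram -` should work, since awk reads standard input when given `-`. If both 250 and 251 show up, the length mismatch described in this thread applies.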


  • silask
    replied
    Sorry. For example, I have two reads that are 250 and 251 nt long and otherwise identical.
    Clumpify doesn't mark them as duplicates, even with dupesubs=2. I would say the reads are duplicates. What do you think?


  • GenoMax
    replied
    On a test set with two paired end raw reads which are normally detected as duplicates, I can prevent marking the reads as duplicates, by only removing one nt from the end.
    I am not sure exactly what you are referring to. Clumpify by default will allow two substitutions (errors if you will). If you want to do strict matching then use dupesubs=0. Can you include the command line options you are using?
    Last edited by GenoMax; 04-18-2018, 03:24 AM.
