
  • GenoMax
    replied
    Originally posted by duartemolha
    I am having problems using clumpify with my fastqs, and I believe it is related to the UMI in the header of the fastq reads.

    Here is a read from my read1 fastq:

    @VL00773:6:AAFVNLMM5:1:1101:21412:1000:CTGGTGGTT 1:N:0:ACTCTCGA+CTGTACCA
    GTGGGCACTAGCATACTTCCCAAGCTTGGGGTAGGGCAATATAGGCAAGTCGATCAAGCTTGCAGCTGACTCCCTTTGGGATCTTGGGCTTAACCTCCTTGGGCTTTACGAGGGCCTCGATAGCCTTGGCACGTGCACTCATGGCCTTGGC
    +
    CCCCCCCCCCCCCCCCCCCCCCC;CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC;CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC;CCCC;CCCCCCCCCCCCCCCCCCCCCCCC​

    If I remove the :CTGGTGGTT from the end of the read name, clumpify runs.

    With it in place, it fails:


    clumpify.sh in1=sample1_R1_001.fastq.gz in2=sample1_R2_001.fastq.gz out1=sample1_dedup_R1_001.fastq.gz out2=sample1_dedup_R2_001.fastq.gz dedupe=t optical=t dupedist=40 spany=t t=1 -Xmx100g -Xms100g

    openjdk version "1.8.0_112"
    OpenJDK Runtime Environment (Zulu 8.19.0.1-linux64) (build 1.8.0_112-b16)
    OpenJDK 64-Bit Server VM (Zulu 8.19.0.1-linux64) (build 25.112-b16, mixed mode)
    java -ea -Xmx100g -Xms100g -cp .../bbtools/lib/current/ clump.Clumpify in1=sample1_R1_001.fastq.gz in2=sample1_R2_001.fastq.gz out1=sample1_dedup_R1_001.fastq.gz out out2=sample1_dedup_R2_001.fastq.gz out dedupe=t optical=t dupedist=40 spany=t t=1 -Xmx100g -Xms100g
    Executing clump.Clumpify [in1=sample1_R1_001.fastq.gz, in2=sample1_R2_001.fastq.gz, out1=sample1_dedup_R1_001.fastq.gz, out
    2=sample1_dedup_R2_001.fastq.gz​, dedupe=t, optical=t, dupedist=40, spany=t, t=1, -Xmx100g, -Xms100g]


    Clumpify version 37.62
    Read Estimate: 21805466
    Memory Estimate: 16636 MB
    Memory Available: 80430 MB
    Set groups to 1
    Executing clump.KmerSort [in1=sample1_R1_001.fastq.gz, in2=sample1_R2_001.fastq.gz, out1=sample1_dedup_R1_001.fastq.gz, out
    2=sample1_dedup_R2_001.fastq.gz​, groups=1, ecco=false, rename=false, shortname=f, unpair=false, repair=false, namesort=false, ow=true, dedupe=t, t=1, -Xmx100g, -Xms100g]

    Set threads to 1
    Making comparator.
    Made a comparator with k=31, seed=1, border=1, hashes=4
    Starting cris 0.
    Fetching reads.
    Making fetch threads.
    Starting threads.
    Waiting for threads.
    Exception in thread "Thread-3" java.lang.AssertionError: VL00773:7:AAFYLV7M5:1:1101:18648:1000:TAACCCATC 1:N:0:ACTCCATC+GATCAAGG
    at hiseq.FlowcellCoordinate.setFrom(FlowcellCoordinate.java:92)
    at clump.ReadKey.<init>(ReadKey.java:46)
    at clump.ReadKey.<init>(ReadKey.java:33)
    at clump.ReadKey.makeKey(ReadKey.java:23)
    at clump.KmerComparator.hash(KmerComparator.java:73)
    at clump.KmerComparator.hash(KmerComparator.java:66)
    at clump.KmerSort$FetchThread.run(KmerSort.java:816)
    Fetch time: 0.076 seconds.
    Closing input stream.
    Combining thread output.
    Combine time: 0.000 seconds.
    Exception in thread "main" java.lang.AssertionError: 0, 400, true
    at clump.KmerSort.fetchReads(KmerSort.java:718)
    at clump.KmerSort.processInner(KmerSort.java:400)
    at clump.KmerSort.process(KmerSort.java:320)
    at clump.KmerSort.main(KmerSort.java:51)
    at clump.Clumpify.process(Clumpify.java:247)
    at clump.Clumpify.main(Clumpify.java:37)




    Does anyone have a solution to make this work without losing all my UMI information?
    I just did a brief test with the sample you included above and had no problem running clumpify on a couple of reads. So the issue likely lies somewhere else and not in the UMI.
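One way to follow up on this is to check which read names actually occur in the file. The read named in the assertion (run 7, flowcell AAFYLV7M5) is not the read quoted in the post (run 6, flowcell AAFVNLMM5), so a quick tally of header shapes may help narrow things down. The sketch below is plain Python and not part of BBTools; the function names are made up for illustration, and it assumes four-line FASTQ records with colon-separated Illumina-style read names.

```python
import gzip
from collections import Counter
from itertools import islice


def header_summary(header):
    """Return (number of name fields, run, flowcell) for one FASTQ header line."""
    # The read name is everything before the first space; the comment follows it.
    name = header.lstrip("@").split(" ", 1)[0]
    fields = name.split(":")
    run = fields[1] if len(fields) > 1 else None
    flowcell = fields[2] if len(fields) > 2 else None
    return len(fields), run, flowcell


def census(lines):
    """Tally header summaries, taking every 4th line as a header."""
    headers = islice(lines, 0, None, 4)
    return Counter(header_summary(h.rstrip("\n")) for h in headers)


def census_file(path):
    """Run the tally over a gzipped FASTQ file."""
    with gzip.open(path, "rt") as handle:
        return census(handle)
```

Calling `census_file("sample1_R1_001.fastq.gz")` returns a counter keyed by (field count, run, flowcell). More than one key would suggest the file mixes runs or header layouts, which may or may not be related to the failure.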



  • duartemolha
    replied
    I am having problems using clumpify with my fastqs, and I believe it is related to the UMI in the header of the fastq reads.

    Here is a read from my read1 fastq:

    @VL00773:6:AAFVNLMM5:1:1101:21412:1000:CTGGTGGTT 1:N:0:ACTCTCGA+CTGTACCA
    GTGGGCACTAGCATACTTCCCAAGCTTGGGGTAGGGCAATATAGGCAAGTCGATCAAGCTTGCAGCTGACTCCCTTTGGGATCTTGGGCTTAACCTCCTTGGGCTTTACGAGGGCCTCGATAGCCTTGGCACGTGCACTCATGGCCTTGGC
    +
    CCCCCCCCCCCCCCCCCCCCCCC;CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC;CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC;CCCC;CCCCCCCCCCCCCCCCCCCCCCCC​

    If I remove the :CTGGTGGTT from the end of the read name, clumpify runs.

    With it in place, it fails:


    clumpify.sh in1=sample1_R1_001.fastq.gz in2=sample1_R2_001.fastq.gz out1=sample1_dedup_R1_001.fastq.gz out2=sample1_dedup_R2_001.fastq.gz dedupe=t optical=t dupedist=40 spany=t t=1 -Xmx100g -Xms100g

    openjdk version "1.8.0_112"
    OpenJDK Runtime Environment (Zulu 8.19.0.1-linux64) (build 1.8.0_112-b16)
    OpenJDK 64-Bit Server VM (Zulu 8.19.0.1-linux64) (build 25.112-b16, mixed mode)
    java -ea -Xmx100g -Xms100g -cp .../bbtools/lib/current/ clump.Clumpify in1=sample1_R1_001.fastq.gz in2=sample1_R2_001.fastq.gz out1=sample1_dedup_R1_001.fastq.gz out out2=sample1_dedup_R2_001.fastq.gz out dedupe=t optical=t dupedist=40 spany=t t=1 -Xmx100g -Xms100g
    Executing clump.Clumpify [in1=sample1_R1_001.fastq.gz, in2=sample1_R2_001.fastq.gz, out1=sample1_dedup_R1_001.fastq.gz, out
    2=sample1_dedup_R2_001.fastq.gz​, dedupe=t, optical=t, dupedist=40, spany=t, t=1, -Xmx100g, -Xms100g]


    Clumpify version 37.62
    Read Estimate: 21805466
    Memory Estimate: 16636 MB
    Memory Available: 80430 MB
    Set groups to 1
    Executing clump.KmerSort [in1=sample1_R1_001.fastq.gz, in2=sample1_R2_001.fastq.gz, out1=sample1_dedup_R1_001.fastq.gz, out
    2=sample1_dedup_R2_001.fastq.gz​, groups=1, ecco=false, rename=false, shortname=f, unpair=false, repair=false, namesort=false, ow=true, dedupe=t, t=1, -Xmx100g, -Xms100g]

    Set threads to 1
    Making comparator.
    Made a comparator with k=31, seed=1, border=1, hashes=4
    Starting cris 0.
    Fetching reads.
    Making fetch threads.
    Starting threads.
    Waiting for threads.
    Exception in thread "Thread-3" java.lang.AssertionError: VL00773:7:AAFYLV7M5:1:1101:18648:1000:TAACCCATC 1:N:0:ACTCCATC+GATCAAGG
    at hiseq.FlowcellCoordinate.setFrom(FlowcellCoordinate.java:92)
    at clump.ReadKey.<init>(ReadKey.java:46)
    at clump.ReadKey.<init>(ReadKey.java:33)
    at clump.ReadKey.makeKey(ReadKey.java:23)
    at clump.KmerComparator.hash(KmerComparator.java:73)
    at clump.KmerComparator.hash(KmerComparator.java:66)
    at clump.KmerSort$FetchThread.run(KmerSort.java:816)
    Fetch time: 0.076 seconds.
    Closing input stream.
    Combining thread output.
    Combine time: 0.000 seconds.
    Exception in thread "main" java.lang.AssertionError: 0, 400, true
    at clump.KmerSort.fetchReads(KmerSort.java:718)
    at clump.KmerSort.processInner(KmerSort.java:400)
    at clump.KmerSort.process(KmerSort.java:320)
    at clump.KmerSort.main(KmerSort.java:51)
    at clump.Clumpify.process(Clumpify.java:247)
    at clump.Clumpify.main(Clumpify.java:37)




    Does anyone have a solution to make this work without losing all my UMI information?
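Since removing the UMI field from the read name lets clumpify run, one possible workaround is to park the UMI somewhere else in the header before clumpify and put it back afterwards. The sketch below moves the eighth name field to the end of the comment and can reverse the move. It is an untested assumption that clumpify leaves the comment untouched and accepts the shortened name, so it is worth trying on a small subset first. The function names are hypothetical, and the same transformation would have to be applied to both R1 and R2 so that the pairs still match.

```python
def split_header(header):
    """Split a FASTQ header into (read name, comment) at the first space."""
    name, _, comment = header.rstrip("\n").partition(" ")
    return name, comment


def hide_umi(header):
    """Move an 8th colon-separated name field (the UMI) to the end of the comment."""
    name, comment = split_header(header)
    fields = name.split(":")
    if len(fields) != 8 or not comment:
        return header.rstrip("\n")  # no UMI field, or no comment to park it in
    umi = fields.pop()
    return ":".join(fields) + " " + comment + ":" + umi


def restore_umi(header):
    """Undo hide_umi: move the 5th comment field back onto the read name."""
    name, comment = split_header(header)
    name_fields = name.split(":")
    comment_fields = comment.split(":")
    if len(name_fields) != 7 or len(comment_fields) != 5:
        return header.rstrip("\n")  # not a header produced by hide_umi
    umi = comment_fields.pop()
    return ":".join(name_fields) + ":" + umi + " " + ":".join(comment_fields)
```

Applied to the read above, `hide_umi` gives `@VL00773:6:AAFVNLMM5:1:1101:21412:1000 1:N:0:ACTCTCGA+CTGTACCA:CTGGTGGTT`, and `restore_umi` turns that back into the original header.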



  • stevekm
    replied
    Is there any way to run Clumpify directly from within another program, such as a library that could be imported? I saw that the main Clumpify program is written in Java, but I am not a Java programmer. I'm not sure what other options there are if I want my own custom program, which outputs fastq data, to pass its output directly to Clumpify, especially given the handling of paired-end files.
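The thread does not describe an importable library. One generic option is to have the custom program write its paired FASTQ files and then launch `clumpify.sh` as a child process. Below is a minimal Python sketch of that approach. The function names are hypothetical, only the `in=`/`in1=`/`in2=`/`out=`/`out1=`/`out2=` arguments seen in this thread are used, and it assumes `clumpify.sh` is on the PATH.

```python
import shutil
import subprocess


def clumpify_command(in1, out1, in2=None, out2=None, options=()):
    """Build the argument list for clumpify.sh (single-end or paired-end)."""
    if in2 is None:
        cmd = ["clumpify.sh", f"in={in1}", f"out={out1}"]
    else:
        if out2 is None:
            raise ValueError("paired input needs out2 as well")
        cmd = ["clumpify.sh", f"in1={in1}", f"in2={in2}",
               f"out1={out1}", f"out2={out2}"]
    return cmd + list(options)


def run_clumpify(in1, out1, in2=None, out2=None, options=()):
    """Run clumpify.sh and return the completed process; raises on failure."""
    if shutil.which("clumpify.sh") is None:
        raise FileNotFoundError("clumpify.sh is not on PATH")
    cmd = clumpify_command(in1, out1, in2, out2, options)
    # check=True turns a non-zero exit status into CalledProcessError.
    return subprocess.run(cmd, check=True, capture_output=True, text=True)
```

For example, `run_clumpify("R1.fq.gz", "R1.dd.fq.gz", "R2.fq.gz", "R2.dd.fq.gz", ["dedupe=t"])` would run the paired-end case once the custom program has finished writing both input files.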



  • phylloxera
    replied
    Looks like everything went fine after I 'unwrapped' the input fasta.
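For anyone who wants to try the same fix, here is a small sketch of "unwrapping": joining each record's sequence lines into a single line. It is generic Python rather than a BBTools feature, the function names are made up, and the thread does not explain why the wrapped input produced only two reads.

```python
import gzip


def unwrap_fasta(lines):
    """Yield FASTA lines with each record's sequence joined onto one line."""
    seq = []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if seq:  # flush the previous record's sequence
                yield "".join(seq)
                seq = []
            yield line
        elif line:  # skip blank lines, collect sequence chunks
            seq.append(line)
    if seq:
        yield "".join(seq)


def unwrap_file(src, dest):
    """Unwrap a gzipped FASTA file into a new gzipped FASTA file."""
    with gzip.open(src, "rt") as fin, gzip.open(dest, "wt") as fout:
        for line in unwrap_fasta(fin):
            fout.write(line + "\n")
```

The number of `>` header lines is unchanged by this step, so the `grep "^>" | wc -l` check works the same way on the unwrapped file.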



  • phylloxera
    replied
    Hi, I've been using clumpify for some time now. Thanks!
    I seem to have encountered a strange and unexpected result.
    pigz -dc test.fna.gz | grep "^>" | wc -l #4149
    ~/bbmap/clumpify.sh in=test.fna.gz out=test_dd.fna.gz dedupe subs=0
    #Version 38.51
    #Read Estimate: 352386
    ...
    #Reads In: 2
    #Clumps Formed: 2
    #Duplicates Found: 0
    #Reads Out: 2
    ...
    pigz -dc test_dd.fna.gz | grep "^>" | wc -l #2

    Any idea what might have happened?



  • DCZ
    replied
    Thanks for your reply. I'm still confused though. Just as there can be empty wells on the same tile, there can also be empty wells on neighboring tiles (correct me if I'm wrong). I suppose these wells would not show a mixed signal but would simply be filled with a duplicate, the same way optical duplicates form on the same tile.



  • GenoMax
    replied
    Illumina's software pre-processing takes care of clusters that show mixed signals and the like, so they may never pass that step. spantiles=t is mainly for NextSeq, where the clusters are (relatively) huge and so may cross tiles. I believe this was based on empirical observations Brian made while developing clumpify.



  • DCZ
    replied
    Hi all,

    I was wondering why the default for spantiles is set to false. If a read has coordinates (1000,1000), for instance, and dupedist is set to 2500 (see the attached sketch), the search radius can overlap 3 other tiles. So even on a HiSeq 4000, for instance, where there are no tile-edge duplicates, optical duplicates could still end up on neighboring tiles (or even farther away). Can anyone shed light on this?

    Thanks in advance!

    Attachment: The dot represents the "original read", the circle represents the distance of 2500 around the "original read". Rectangles represent tiles.
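The arithmetic behind the "3 other tiles" can be sketched as follows. This assumes, as the drawing does, that tiles sit edge to edge in one shared coordinate system with the tile's corner at (0,0). Whether real flowcell coordinates can be compared across tiles this way is exactly the open question, so treat the layout and the function name as hypothetical.

```python
import math


def neighbors_within_reach(x, y, dupedist):
    """Return which neighboring tiles a circle of radius dupedist around (x, y) reaches.

    Hypothetical layout: the tile's corner is at (0, 0), the left neighbor lies
    beyond x = 0, the lower neighbor beyond y = 0, and the diagonal neighbor
    beyond the corner itself. Only these two edges are considered.
    """
    reached = set()
    if x < dupedist:
        reached.add("left")
    if y < dupedist:
        reached.add("below")
    # The diagonal tile is reached only if the corner is inside the circle.
    if math.hypot(x, y) < dupedist:
        reached.add("diagonal")
    return reached
```

For the read at (1000,1000) with dupedist=2500, the corner is about 1414 units away, which is inside the radius, so all three neighbors are reached under these assumptions.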
    Last edited by DCZ; 05-23-2019, 07:27 AM.



  • Chief_Lazy_Bison
    replied
    Thank you for the quick advice. I had attempted to merge many samples together at the front end of the pipeline so that I could do all the QC and error correction at once. My problem was fixed when I did QC and error correction on each sample individually and then merged the results for a co-assembly.

    Thanks again.
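A rough sketch of that order of operations, for anyone hitting the same stall: run clumpify on each library separately, then concatenate the corrected files. The clumpify flags are the ones from the command earlier in this thread; the output naming and the function names are made up for illustration. Concatenating gzip files byte for byte is the same thing `cat` does and yields a valid multi-member gzip file.

```python
import gzip
import shutil


def per_sample_command(sample):
    """Build the clumpify error-correction command for one library."""
    suffix = ".fq.gz"
    base = sample[:-len(suffix)] if sample.endswith(suffix) else sample
    out = base + ".eccc.fq.gz"  # hypothetical naming scheme
    return ["clumpify.sh", f"in={sample}", f"out={out}",
            "ecc", "passes=4", "reorder"]


def concatenate_gzip(paths, dest):
    """Concatenate gzipped files byte for byte, like `cat a.gz b.gz > dest`."""
    with open(dest, "wb") as out:
        for path in paths:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)
```

Each command from `per_sample_command` can be run with `subprocess.run(cmd, check=True)`, and the resulting `*.eccc.fq.gz` files passed to `concatenate_gzip` before the co-assembly.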



  • GenoMax
    replied
    I think you should follow the order of tools that Brian has in his script example. Run the clumpify job first. Since you are merging the reads first, I am going to speculate that clumpify is unable to identify duplicates properly. If your data is not from a patterned flowcell you could remove the "optical" flag for clumpify.



  • Chief_Lazy_Bison
    replied
    It is submitted to a SLURM queue via the attached script.

    These reads are a collection of concatenated, interleaved paired-end libraries.

    The same script worked well on the individual libraries, but I wanted to do an assembly with all of the reads together so I concatenated them all with
    Code:
    cat *fq.gz > ALL.fq.gz
    The command that ends up stalling is this:
    Code:
    clumpify.sh in=ALL_temp.fq.gz out=ALL.eccc.fq.gz ecc passes=4 reorder

    bbmerge plows through these reads with no complaints just prior to clumpify

    Code:
    bbmerge.sh in=ALL_temp.fq.gz out=ALL.ecco.fq.gz ecco mix vstrict ordered ihist=ALL_ihist_merge1.txt


  • GenoMax
    replied
    Can you provide the exact command line you are using? Is this being submitted via a job scheduler?



  • Chief_Lazy_Bison
    replied
    So I resubmitted the job on a node with 40 processors and 1TB of memory and I received two very similar exceptions and the job is hanging again.

    Exception in thread "Thread-147" java.lang.AssertionError
    at clump.KmerSort3$FetchThread3.fetchNext_inner(KmerSort3.java:706)
    at clump.KmerSort3$FetchThread3.fetchNext(KmerSort3.java:655)
    at clump.KmerSort3$FetchThread3.run(KmerSort3.java:577)
    --
    Exception in thread "Thread-146" java.lang.AssertionError
    at clump.KmerSort3$FetchThread3.fetchNext_inner(KmerSort3.java:706)
    at clump.KmerSort3$FetchThread3.fetchNext(KmerSort3.java:655)
    at clump.KmerSort3$FetchThread3.run(KmerSort3.java:577)



  • Chief_Lazy_Bison
    replied
    I just resubmitted on a high-memory partition; hopefully this resolves the issue. I will update once the job finishes.



  • GenoMax
    replied
    Clumpify can need a lot of memory depending on size of data. With the data you have it is possible that you are simply running out of available memory. Have you looked into that?
