Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files

stevekm replied

10-27-2023, 09:33 AM
Is there any way to run Clumpify directly from within another program, such as a library that could be imported? I saw that the main Clumpify program is written in Java; however, I am not a Java programmer. I am not sure what other options there might be if I want my own custom program, which outputs fastq data, to pass its output directly to Clumpify, especially considering the handling of paired-end files.
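One possible approach, sketched here under assumptions rather than as a confirmed recipe, is to launch clumpify.sh as a child process and stream interleaved FASTQ text to its standard input. The sketch assumes that clumpify.sh is on the PATH and that Clumpify honors the general BBTools conventions `in=stdin.fq` and `interleaved=t`; neither is stated in this thread, so check the tool's own help output first. The function names and the record layout are made up for illustration.

```python
import shutil
import subprocess
from typing import Iterable, Iterator, List, Sequence, Tuple

# One read as (name, sequence, quality); this layout is illustrative only.
Record = Tuple[str, str, str]


def fastq_record(name: str, seq: str, qual: str) -> str:
    """Format one read as a four-line FASTQ record."""
    return f"@{name}\n{seq}\n+\n{qual}\n"


def interleave(pairs: Iterable[Tuple[Record, Record]]) -> Iterator[str]:
    """Yield read 1 then read 2 of each pair, i.e. interleaved FASTQ."""
    for read1, read2 in pairs:
        yield fastq_record(*read1)
        yield fastq_record(*read2)


def clumpify_command(out_path: str, extra_args: Sequence[str] = ()) -> List[str]:
    """Build the argument list.

    ASSUMPTION: in=stdin.fq and interleaved=t are the usual BBTools
    conventions; confirm that your Clumpify version accepts them.
    """
    return ["clumpify.sh", "in=stdin.fq", "interleaved=t",
            f"out={out_path}", *extra_args]


def run_clumpify(pairs: Iterable[Tuple[Record, Record]], out_path: str,
                 extra_args: Sequence[str] = ()) -> int:
    """Stream the pairs to a clumpify.sh child process; return its exit code."""
    cmd = clumpify_command(out_path, extra_args)
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError("clumpify.sh was not found on PATH")
    proc = subprocess.Popen(cmd, stdin=subprocess.PIPE, text=True)
    assert proc.stdin is not None
    with proc.stdin:  # closing stdin signals end of input to the child
        for chunk in interleave(pairs):
            proc.stdin.write(chunk)
    return proc.wait()
```

Writing the two mates of each pair back to back keeps them together in a single stream, which avoids having to coordinate two separate input files.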
phylloxera replied

07-10-2020, 06:06 PM
Looks like everything went fine after I 'unwrapped' the input fasta.
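For readers who hit the same symptom, "unwrapping" here means putting each record's sequence on a single line instead of several fixed-width lines. The thread does not say which tool was used or why the wrapped input was misread, so the following is only a minimal sketch of the transformation itself; the function name is hypothetical.

```python
from typing import Iterable, Iterator, List, Optional


def unwrap_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Yield header and sequence lines with each sequence joined onto one line.

    Sequence lines that appear before the first header are discarded.
    """
    header: Optional[str] = None
    seq_parts: List[str] = []
    for raw in lines:
        line = raw.strip()
        if line.startswith(">"):
            if header is not None:
                # Emit the record that just ended.
                yield header
                yield "".join(seq_parts)
            header = line
            seq_parts = []
        elif line:
            seq_parts.append(line)
    if header is not None:
        # Emit the final record.
        yield header
        yield "".join(seq_parts)


if __name__ == "__main__":
    wrapped = [">seq1 example\n", "ACGT\n", "ACGT\n", ">seq2\n", "TTGA\n"]
    for out_line in unwrap_fasta(wrapped):
        print(out_line)
```

Because the function accepts any iterable of lines, it can be fed an open file handle, including one returned by `gzip.open(path, "rt")`.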
phylloxera replied

07-09-2020, 01:15 PM
Hi, I've been using Clumpify for some time now. Thanks!
I seem to have encountered a strange and unexpected result.
pigz -dc test.fna.gz | grep "^>" | wc -l #4149
~/bbmap/clumpify.sh in=test.fna.gz out=test_dd.fna.gz dedupe subs=0
#Version 38.51
#Read Estimate: 352386
...
#Reads In: 2
#Clumps Formed: 2
#Duplicates Found: 0
#Reads Out: 2
...
pigz -dc test_dd.fna.gz | grep "^>" | wc -l #2

Any idea what might have happened?
DCZ replied

05-26-2019, 11:55 PM
Thanks for your reply. I'm still confused, though. Just as there can be empty wells on the same tile, there can also be empty wells on neighboring tiles (correct me if I'm wrong). I suppose these wells would not show a mixed signal but would simply be filled with a duplicate, in the same way that optical duplicates form on the same tile.
GenoMax replied

05-24-2019, 09:57 AM
Illumina's software pre-processing takes care of clusters that show mixed signals and the like, so those may never pass that step. spantiles=t is mainly for the NextSeq, where the clusters are (relatively) huge and as a result have a chance of crossing tiles. I believe this was based on empirical observations Brian made while he was developing Clumpify.
DCZ replied

05-23-2019, 07:15 AM
Hi all,

I was wondering why the default for spantiles is set to false. If a read has coordinates (1000,1000), for instance, and dupedist is set to 2500 (see the attached sketch), there is a possible overlap with 3 other tiles. So even on an instrument other than a NextSeq, such as a HiSeq 4000, where there are no tile-edge duplicates, there is still a possibility that optical duplicates end up on neighboring tiles (or even further away). Can anyone shed light on this?

Thanks in advance!

Attachment: The dot represents the "original read", the circle represents the distance of 2500 around the "original read". Rectangles represent tiles.
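The geometry in the sketch can be checked with a few lines of code. This is only an illustration of the question being asked, not a description of how Clumpify implements spantiles. It assumes, for simplicity, that adjacent tiles share one continuous coordinate frame, and the tile width and height below are hypothetical placeholders, not real instrument dimensions.

```python
import math
from typing import List, Tuple


def overlapped_neighbors(x: float, y: float, radius: float,
                         tile_width: float, tile_height: float
                         ) -> List[Tuple[int, int]]:
    """Return (dx, dy) offsets of the adjacent tiles that a circle reaches.

    The home tile spans [0, tile_width] x [0, tile_height], and (x, y)
    lies inside it. ASSUMPTION: neighbors continue the same coordinate frame.
    """
    hits = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            if dx == 0 and dy == 0:
                continue  # skip the home tile itself
            # Bounds of the neighboring tile in the home tile's frame.
            x_min, x_max = dx * tile_width, (dx + 1) * tile_width
            y_min, y_max = dy * tile_height, (dy + 1) * tile_height
            # Point of that rectangle closest to the circle's center.
            nearest_x = min(max(x, x_min), x_max)
            nearest_y = min(max(y, y_min), y_max)
            if math.hypot(x - nearest_x, y - nearest_y) < radius:
                hits.append((dx, dy))
    return hits


if __name__ == "__main__":
    # Read at (1000, 1000) with dupedist 2500; tile size is hypothetical.
    print(overlapped_neighbors(1000, 1000, 2500, 20000, 100000))
    # prints [(-1, -1), (-1, 0), (0, -1)]
```

With these placeholder dimensions the circle reaches the tile to the left, the tile below, and the diagonal tile between them, which matches the "3 other tiles" in the question.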
Attached file: Screenshot from 2019-05-23 17-11-08.png
Last edited by DCZ; 05-23-2019, 07:27 AM.
Chief_Lazy_Bison replied

04-15-2019, 03:17 AM
Thank you for the quick advice. I had attempted to merge many samples together at the front end of the pipeline so that I could do all the QC and error correction at once. My problem was fixed when I did QC and error correction on each sample individually and then merged for a co-assembly.

Thanks again.
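The per-sample workflow described above might be scripted roughly as follows. This is a sketch, not the poster's actual script: it reuses the Clumpify flags quoted elsewhere in this thread (ecc passes=4 reorder), while the function names and the output naming scheme are assumptions.

```python
import subprocess
from pathlib import Path
from typing import List


def per_sample_commands(samples: List[str]) -> List[List[str]]:
    """Build one clumpify.sh command per input library.

    The flags are the ones quoted in this thread; the ".eccc.fq.gz"
    output naming is a hypothetical convention.
    """
    commands = []
    for sample in samples:
        path = Path(sample)
        suffix = ".fq.gz"
        if path.name.endswith(suffix):
            stem = path.name[:-len(suffix)]
        else:
            stem = path.stem
        out_path = path.with_name(f"{stem}.eccc.fq.gz")
        commands.append(["clumpify.sh", f"in={sample}", f"out={out_path}",
                         "ecc", "passes=4", "reorder"])
    return commands


def run_all(samples: List[str]) -> None:
    """Run the libraries one at a time, stopping at the first failure."""
    for cmd in per_sample_commands(samples):
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    for cmd in per_sample_commands(["libA.fq.gz", "libB.fq.gz"]):
        print(" ".join(cmd))
```

The corrected per-sample outputs can then be concatenated for the co-assembly, as the post describes.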
GenoMax replied

04-11-2019, 06:12 AM
I think you should follow the order of tools that Brian has in his script example: run the Clumpify job first. Since you are merging the reads first, I am going to speculate that Clumpify is unable to identify duplicates properly. If your data is not from a patterned flowcell, you could remove the "optical" flag for Clumpify.
Chief_Lazy_Bison replied

04-11-2019, 04:24 AM
It is submitted to a SLURM queue via the attached script.

These reads are a collection of concatenated, interleaved paired-end libraries.

The same script worked well on the individual libraries, but I wanted to do an assembly with all of the reads together, so I concatenated them all with:

Code:

cat *fq.gz > ALL.fq.gz

The command that ends up stalling is this:

Code:

clumpify.sh in=ALL_temp.fq.gz out=ALL.eccc.fq.gz ecc passes=4 reorder

BBMerge plows through these reads with no complaints just before the Clumpify step:

Code:

bbmerge.sh in=ALL_temp.fq.gz out=ALL.ecco.fq.gz ecco mix vstrict ordered ihist=ALL_ihist_merge1.txt

Attached file: ALL_ec.SLURM.txt
GenoMax replied

04-11-2019, 03:53 AM
Can you provide the exact command line you are using? Is this being submitted via a job scheduler?
Chief_Lazy_Bison replied

04-11-2019, 03:01 AM
I resubmitted the job on a node with 40 processors and 1TB of memory, received two very similar exceptions, and the job is hanging again.

Exception in thread "Thread-147" java.lang.AssertionError
at clump.KmerSort3$FetchThread3.fetchNext_inner(KmerSort3.java:706)
at clump.KmerSort3$FetchThread3.fetchNext(KmerSort3.java:655)
at clump.KmerSort3$FetchThread3.run(KmerSort3.java:577)
--
Exception in thread "Thread-146" java.lang.AssertionError
at clump.KmerSort3$FetchThread3.fetchNext_inner(KmerSort3.java:706)
at clump.KmerSort3$FetchThread3.fetchNext(KmerSort3.java:655)
at clump.KmerSort3$FetchThread3.run(KmerSort3.java:577)
Chief_Lazy_Bison replied

04-10-2019, 11:07 AM
Just resubmitted on a high memory partition, hopefully this resolves the issue. Will update once the job finishes.
GenoMax replied

04-10-2019, 10:18 AM
Clumpify can need a lot of memory, depending on the size of the data. With the data you have, it is possible that you are simply running out of available memory. Have you looked into that?
Chief_Lazy_Bison replied

04-10-2019, 10:03 AM
java.lang.AssertionError

Hello,

I'm using BBTools to preprocess some metagenomic HiSeq reads prior to assembly, and I've run into a small issue with Clumpify. I am using the recommended three-step error correction found in the AssemblyPipeline.sh script, but the second error-correction step stalls.

When I check the stderr file generated by the job, I see these exceptions:

Exception in thread "Thread-1202" java.lang.AssertionError
at clump.KmerSort3$FetchThread3.fetchNext_inner(KmerSort3.java:706)
at clump.KmerSort3$FetchThread3.fetchNext(KmerSort3.java:655)
at clump.KmerSort3$FetchThread3.run(KmerSort3.java:577)
Fetched 4595507 reads: 12.948 seconds.
--
Exception in thread "Thread-1203" java.lang.AssertionError
at clump.KmerSort3$FetchThread3.fetchNext_inner(KmerSort3.java:706)
at clump.KmerSort3$FetchThread3.fetchNext(KmerSort3.java:655)
at clump.KmerSort3$FetchThread3.run(KmerSort3.java:577)

I resubmitted the job and got exactly the same exceptions the second time.

A little background: this job is running on a cluster with SLURM scheduling. The job requests an entire node with 40 processors and 125G of RAM.

The reads are HiSeq PE 2x150, and the total size of the compressed reads is 343G.
This is the command that keeps stalling:
clumpify.sh in=ALL_temp.fq.gz out=ALL.eccc.fq.gz ecc passes=4 reorder

There are 1158 temp files generated by Clumpify, occupying ~750G.

Once this exception is thrown, the whole job just hangs.

Using version 38.43 with java 1.8.0_121

Any feedback would be greatly appreciated.

Thanks!
Dario1984 replied

02-25-2019, 06:00 PM
I have not run Clumpify yet, but I will, following the examples you linked to.
