
  • #91
    I am able to do something like

    Code:
    for f in *_1.fastq; do i=${f%_1.fastq}; clumpify.sh -Xmx10g in1="${i}_1.fastq" in2="${i}_2.fastq" out1="${i}_clu_1.fastq" out2="${i}_clu_2.fastq"; done
    and have clumpify.sh produce two files. I am not sure why you are having trouble.

    • #92
      I am having trouble using clumpify with the optical and dedupe parameters to remove optical duplicates, e.g. clumpify.sh in=temp.fq.gz out=clumped.fq.gz dedupe optical. Clumpify works without these parameters.

      • #93
        Hi Brian,

        I'm really appreciating clumpify, it's fast & does exactly what it should.

        I'm trying to use it for optical duplicate detection, which works great. However, I only want to report the number of optical duplicates, without creating the deduplicated output fastq file. Is it possible to skip producing output? At the moment, writing the output is the slowest step in my pipeline.

        Thanks in advance

        • #94
          @DCZ: That is easy to do. If you do not provide any "out=" argument to most BBTools, they will perform the operation and produce the relevant statistics without writing any output.

          Tip: If you ever want to pipe things, you can use "out=stdout.fq" in the first tool and "in=stdin.fq" in the next tool. You get the idea.
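
A minimal sketch of both suggestions, for illustration only. The function names, the file names, and the `CLUMPIFY` override variable (handy for a dry run with `CLUMPIFY=echo`) are hypothetical. `reformat.sh` and `bbduk.sh` stand in for any two BBTools, and the trimming flags are just an example.

```shell
# Stats only: with no out= argument, clumpify runs and reports duplicate
# statistics but writes no reads.
# CLUMPIFY is a hypothetical override, e.g. CLUMPIFY=echo for a dry run.
count_optical_dupes() {
    "${CLUMPIFY:-clumpify.sh}" in="$1" dedupe optical
}

# Piping: the first tool writes FASTQ to stdout, the next reads it from stdin,
# so no intermediate file is created. For interleaved paired input, the
# second tool may also need int=t, since it cannot inspect a file name.
pipe_trim() {
    reformat.sh in="$1" out=stdout.fq | bbduk.sh in=stdin.fq out="$2" qtrim=r trimq=10
}
```

Usage would look like `count_optical_dupes temp.fq.gz`, with the duplicate counts appearing in the tool's normal log output on stderr.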

          • #95
            Ah, if only everything was this simple! Should've thought of that! Thanks!

            • #96
              The Clumpify documentation webpage doesn't mention anything about removing duplicates, which I read about in a blog. Could it be expanded to cover that? I have NovaSeq whole genome sequencing data (NovaSeq Control Software version 1.4.0 and Real Time Analysis version 3.3.3 acquisition, followed by bcl2fastq version 2.20.0.422 conversion) for human samples at about 90x coverage, so I think it's important that I use Clumpify. I intend to map the reads with bwa, and I'm not sure whether it supports a mix of paired and merged reads (its documentation is minimal), so I plan to skip the clumping, if possible.

              • #97
                @Dario1984: Have you run clumpify on your data? If you have excellent libraries with tightly controlled insert sizes you will find the duplicate rate to be well controlled. Brian has some explicit use cases in his Biostars post.
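
For paired NovaSeq files, a duplicate-removal run might look like the sketch below. This is an assumption-laden example rather than a quote from the Biostars post. The function name, file naming scheme, and `CLUMPIFY` override are hypothetical, and `dupedist=12000` is the NovaSeq value as I recall it from the clumpify.sh help text, so check it against your installed version. As post #91 shows, paired input (in1/in2) gives paired output (out1/out2).

```shell
# Remove optical duplicates from one paired-end sample.
# Assumes files are named <prefix>_1.fastq.gz and <prefix>_2.fastq.gz (hypothetical).
# CLUMPIFY is a hypothetical override, e.g. CLUMPIFY=echo to print the arguments.
dedupe_novaseq_pair() {
    local prefix="$1"
    "${CLUMPIFY:-clumpify.sh}" \
        in1="${prefix}_1.fastq.gz" in2="${prefix}_2.fastq.gz" \
        out1="${prefix}_dedup_1.fastq.gz" out2="${prefix}_dedup_2.fastq.gz" \
        dedupe optical dupedist=12000
}
```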

                • #98
                  I have not run Clumpify yet, but I will, following the examples you linked to.

                  • #99
                    java.lang.AssertionError

                    Hello,

                    I'm using BBTools to preprocess some metagenomic HiSeq reads prior to assembly, and I've run into a little issue with clumpify. I am using the recommended 3-step error correction found in the AssemblyPipeline.sh script, but the second error correction step stalls/freezes.

                    When I check the stderr file generated by the job, I see these exceptions:

                    Exception in thread "Thread-1202" java.lang.AssertionError
                    at clump.KmerSort3$FetchThread3.fetchNext_inner(KmerSort3.java:706)
                    at clump.KmerSort3$FetchThread3.fetchNext(KmerSort3.java:655)
                    at clump.KmerSort3$FetchThread3.run(KmerSort3.java:577)
                    Fetched 4595507 reads: 12.948 seconds.
                    --
                    Exception in thread "Thread-1203" java.lang.AssertionError
                    at clump.KmerSort3$FetchThread3.fetchNext_inner(KmerSort3.java:706)
                    at clump.KmerSort3$FetchThread3.fetchNext(KmerSort3.java:655)
                    at clump.KmerSort3$FetchThread3.run(KmerSort3.java:577)

                    I have resubmitted the job and got the exact same exceptions the second time as well.

                    A little background: this job is running on a cluster with SLURM scheduling. The job requests an entire node with 40 processors and 125G of RAM.

                    The reads are HiSeq PE 2x150 and the total size of the compressed reads is 343G.
                    This is the command that keeps stalling:
                    clumpify.sh in=ALL_temp.fq.gz out=ALL.eccc.fq.gz ecc passes=4 reorder

                    There are 1158 temp files generated by clumpify, occupying ~750G.

                    Once this exception is thrown, the whole job just hangs.

                    Using version 38.43 with java 1.8.0_121

                    Any feedback would be greatly appreciated.

                    Thanks!

                    • Clumpify can need a lot of memory, depending on the size of the data. With the data you have, it is possible that you are simply running out of available memory. Have you looked into that?
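
One way to rule memory in or out is to pass an explicit Java heap size (the `-Xmx` flag, as used in post #91) that matches the job's allocation. The sketch below is only an illustration. The function name and the 85% headroom figure are my own assumptions, and `SLURM_MEM_PER_NODE` (in MB) is only set by SLURM when the job requests memory per node.

```shell
# Print a -Xmx flag sized to about 85% of the SLURM memory allocation.
# The 85% headroom is an arbitrary, illustrative choice.
xmx_from_slurm() {
    if [ -z "${SLURM_MEM_PER_NODE:-}" ]; then
        echo "SLURM_MEM_PER_NODE is not set; pass -Xmx by hand" >&2
        return 1
    fi
    echo "-Xmx$(( SLURM_MEM_PER_NODE * 85 / 100 ))m"
}

# Usage inside a job script (not run here):
#   clumpify.sh "$(xmx_from_slurm)" in=ALL_temp.fq.gz out=ALL.eccc.fq.gz ecc passes=4 reorder
```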

                      • Just resubmitted on a high memory partition, hopefully this resolves the issue. Will update once the job finishes.

                        • So I resubmitted the job on a node with 40 processors and 1TB of memory and I received two very similar exceptions and the job is hanging again.

                          Exception in thread "Thread-147" java.lang.AssertionError
                          at clump.KmerSort3$FetchThread3.fetchNext_inner(KmerSort3.java:706)
                          at clump.KmerSort3$FetchThread3.fetchNext(KmerSort3.java:655)
                          at clump.KmerSort3$FetchThread3.run(KmerSort3.java:577)
                          --
                          Exception in thread "Thread-146" java.lang.AssertionError
                          at clump.KmerSort3$FetchThread3.fetchNext_inner(KmerSort3.java:706)
                          at clump.KmerSort3$FetchThread3.fetchNext(KmerSort3.java:655)
                          at clump.KmerSort3$FetchThread3.run(KmerSort3.java:577)

                          • Can you provide the exact command line you are using? Is this being submitted via a job scheduler?

                            • It is submitted to a SLURM queue via the attached script.

                              These reads are a collection of concatenated, interleaved paired-end libraries.

                              The same script worked well on the individual libraries, but I wanted to do an assembly with all of the reads together so I concatenated them all with
                              Code:
                              cat *fq.gz > ALL.fq.gz
                              The command that ends up stalling is this:
                              Code:
                              clumpify.sh in=ALL_temp.fq.gz out=ALL.eccc.fq.gz ecc passes=4 reorder

                              Just prior to clumpify, bbmerge plows through these reads with no complaints:

                              Code:
                              bbmerge.sh in=ALL_temp.fq.gz out=ALL.ecco.fq.gz ecco mix vstrict ordered ihist=ALL_ihist_merge1.txt
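
Since the input is a concatenation of interleaved libraries, a quick sanity check on the combined file may help narrow things down. The sketch below is a rough heuristic, not a diagnosis of the AssertionError. The function name is hypothetical, and it assumes standard 4-line FASTQ records, so an intact interleaved file has a line count divisible by 8.

```shell
# Succeed only if the gzip stream is intact and the line count is a
# non-zero multiple of 8 (4 lines per read x 2 reads per pair).
looks_interleaved() {
    local n
    gzip -t "$1" || return 1                  # integrity test; handles concatenated gzip members
    n=$(gzip -dc "$1" | wc -l | tr -d ' ')    # strip padding some wc implementations add
    [ "$n" -gt 0 ] && [ $(( n % 8 )) -eq 0 ]
}
```

For example, `looks_interleaved ALL.fq.gz || echo "pairing looks broken"`. As far as I know, reformat.sh also has a verifypaired (vpair) option for a name-based check, but confirm that against the help text of your version.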

                              • I think you should follow the order of tools that Brian has in his script example. Do the clumpify job first. Since you are merging the reads first, I am going to speculate that clumpify is unable to identify duplicates properly. If your data is not from a patterned flowcell, you could remove the "optical" flag for clumpify.
