Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files


  • #31
    Hi Brian,

    That dedupe function looks great! We have been waiting for such a tool.

    • #32
      Thanks, luc, I appreciate it.

      • #33
        Hi Brian, any update on allowing non-interleaved input/output? I'd love to remove the reformat.sh steps before and after clumpify.sh

        • #34
          Hi Devon,

          Yes, this is all done, I just haven't released it yet. I'll do so tomorrow (difficult for me to do where I am now; today's a vacation day here).

          • #35
            Ah, right, MLK day. Enjoy the day off and stop checking SEQanswers!

            • #36
              It's a day late since our cluster was down yesterday, but BBTools 36.85 is released, and Clumpify now supports twin files:

              Code:
              clumpify.sh in1=r1.fq.gz in2=r2.fq.gz out1=c1.fq.gz out2=c2.fq.gz

              • #37
                Originally posted by Brian Bushnell View Post
                It's a day late since our cluster was down yesterday, but BBTools 36.85 is released, and Clumpify now supports twin files:

                Code:
                clumpify.sh in1=r1.fq.gz in2=r2.fq.gz out1=c1.fq.gz out2=c2.fq.gz
                Yay! Two fewer operations ...

                • #38
                  Originally posted by Brian Bushnell View Post
                  It's a day late since our cluster was down yesterday, but BBTools 36.85 is released, and Clumpify now supports twin files:

                  Code:
                  clumpify.sh in1=r1.fq.gz in2=r2.fq.gz out1=c1.fq.gz out2=c2.fq.gz
                  A day late is still a quick turnaround.

                  Thanks for the great update!

                  • #39
                    Feature request: It'd be quite nice to be able to write marked duplicates to a different file or files. At the moment, I have to mark duplicates and write everything to a temporary file, which is then processed. Granted, one CAN use "out=stderr.fastq" and send that to a pipe, but then one needs to deal with all of the normal stuff that's written to stderr.

                    The impetus behind this is removing optical duplicates before delivery to our labs, but still writing them to a separate file or files in case they need them for some reason.

                    BTW, do you have any recommendations for the "dist" parameter on a HiSeq 4000? I was planning to just do a parameter sweep, but if that's already been done by someone else...

                    • #40
                      Originally posted by dpryan View Post
                      Feature request: It'd be quite nice to be able to write marked duplicates to a different file or files. At the moment, I have to mark duplicates and write everything to a temporary file, which is then processed. Granted, one CAN use "out=stderr.fastq" and send that to a pipe, but then one needs to deal with all of the normal stuff that's written to stderr.
                      That request has been in for some time. I also wanted to see counts (with the associated sequences) to see how acute a problem the duplicates may be.

                      For now use the following workaround provided by @Brian.

                      Code:
                      clumpify.sh in=x.fq out=y.fq markduplicates [optical allduplicates subs=0]
                      filterbyname.sh in=y.fq out=dupes.fq names=duplicate substring include
                      filterbyname.sh in=y.fq out=unique.fq names=duplicate substring include=f
                      BTW, do you have any recommendations for the "dist" parameter on a HiSeq 4000? I was planning to just do a parameter sweep, but if that's already been done by someone else...
                      This is a bit murky. I have done the sweeps with 4000 data I have access to. If I keep spantiles=f then I don't see any optical dups until dupedist=20. Note: the edge-duplicates problem seen with NextSeq (for which @Brian sets spantiles=t by default) is not present on HiSeq 4000/MiSeq (again, based on data I have seen).

                      I have not pulled out the reads using the method above to look at the co-ordinates/sequence as yet.

                      It may be good to see what you get.
                      Last edited by GenoMax; 01-23-2017, 06:13 AM.
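
The sweep GenoMax describes is easy to script. A minimal dry-run sketch follows; the dupedist values and file names are placeholder examples, not recommendations, and the echo is there because each real job is long-running (drop it to execute, then pull the duplicate counts from each log):

```shell
# Build one clumpify.sh command per candidate dupedist value; printed as a
# dry run rather than executed. Input/output names are hypothetical.
SWEEP=$(for d in 20 40 100 2500; do
  echo "clumpify.sh in=x.fq.gz out=marked_d${d}.fq.gz markduplicates optical dupedist=$d spantiles=f"
done)
echo "$SWEEP"
```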

                      • #41
                        Additionally, is there any way to make clumpify itself respect the "threads=" setting? pigz seems to, but clumpify itself seems to use as many as it can get regardless of what I specify. This is in version 36.86.

                        • #42
                          Originally posted by GenoMax View Post
                          For now use the following workaround provided by @Brian.

                          Code:
                          clumpify.sh in=x.fq out=y.fq markduplicates [optical allduplicates subs=0]
                          filterbyname.sh in=y.fq out=dupes.fq names=duplicate substring include
                          filterbyname.sh in=y.fq out=unique.fq names=duplicate substring include=f
                          Thanks; in the interim I wrote something in C that I can call once to do this (it also strips "duplicate" from the read names).

                          Originally posted by GenoMax View Post
                          This is a bit murky. I have done the sweeps with 4000 data I have access to. If I keep spantiles=f then I don't see any optical dups until dupedist=20. Note: the edge-duplicates problem seen with NextSeq (for which @Brian sets spantiles=t by default) is not present on HiSeq 4000/MiSeq (again, based on data I have seen).
                          Thanks, I'm running this now on a single sample; I'll post an image when I have a worthwhile sweep range.
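
For anyone without dpryan's C tool, the same split-and-strip step can be sketched in shell with awk. This assumes 4-line FASTQ records and that Clumpify appends " duplicate" to the names of marked reads; the exact tag format is an assumption, and the toy input below stands in for real Clumpify output:

```shell
# Toy stand-in for clumpify.sh markduplicates output; "@r1 duplicate" is
# what a marked read name is assumed to look like.
cat > y.fq <<'EOF'
@r1 duplicate
ACGT
+
IIII
@r2
TTTT
+
IIII
EOF

# On each header line, remember whether the read is marked and strip the
# tag; then route all four lines of the record to dupes.fq or unique.fq.
awk 'NR % 4 == 1 { dup = /duplicate/; sub(/ duplicate$/, "") }
     { print > (dup ? "dupes.fq" : "unique.fq") }' y.fq
```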

                          • #43
                            Originally posted by dpryan View Post
                            Thanks, I'm running this now on a single sample; I'll post an image when I have a worthwhile sweep range.
                            Do the sweep with spantiles=f and t. I was only interested in optical duplicates when I did mine.

                            • #44
                              Originally posted by dpryan View Post
                              Additionally, is there any way to make clumpify itself respect the "threads=" setting? pigz seems to, but clumpify itself seems to use as many as it can get regardless of what I specify. This is in version 36.86.
                              Oh, hmmm... that will be very tricky. When running with one group, Clumpify should respect threads correctly. But when writing temp files (which happens whenever the reads won't all fit in memory), it uses at least one thread per temp file, and the default is a minimum of 11 temp files. Your best bet, unfortunately, would be to bind the process to a certain number of cores. You can also manually set the number of groups, which indirectly affects the number of threads used.

                              Clumpify also uses multithreaded sorting, which uses all available cores, but normally that only happens for a small fraction of the runtime. However, I will add a flag to disable it.
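
On Linux, the core-binding suggested above can be done with taskset; a sketch, with a hypothetical core list and file names (the command string is printed rather than executed, since the inputs are placeholders):

```shell
# Pin Clumpify's JVM (and the pigz processes it spawns) to four cores.
# taskset is Linux-only; core list 0-3 is an example.
CMD="taskset -c 0-3 clumpify.sh in1=r1.fq.gz in2=r2.fq.gz out1=c1.fq.gz out2=c2.fq.gz threads=4"
echo "$CMD"
```

Alternatively, per the post above, forcing a single group (if the reads fit in memory) should make the threads= setting itself effective.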

                              • #45
                                Originally posted by dpryan View Post
                                Feature request: It'd be quite nice to be able to write marked duplicates to a different file or files. At the moment, I have to mark duplicates and write everything to a temporary file, which is then processed. Granted, one CAN use "out=stderr.fastq" and send that to a pipe, but then one needs to deal with all of the normal stuff that's written to stderr.

                                The impetus behind this is removing optical duplicates before delivery to our labs, but still writing them to a separate file or files in case they need them for some reason.
                                I will plan to add a new output stream for duplicate files as well, though I might not get to it this week.
