  • lianov
    Junior Member
    • Jun 2014
    • 5

    cutadapt trimming for multiple files

    Hi,

    First, I'm sorry if I missed this but I couldn't find another thread about this issue:

    I would like to run cutadapt on over 200 different .fastq files, so I clearly need to automate this process but I am not sure how to do this.

    What is the best way to tell cutadapt to remove the adaptors from all files within a directory and create an output file for each of them? The manual only mentions running two files at once (if paired-end sequencing was done).

    In short, I need to find all "*.fastq", trim sequences and output as trim.fastq

    Thanks!
  • GenoMax
    Senior Member
    • Feb 2008
    • 7142

    #2
    You can use one of the shell script ideas here to iterate over the set of files you have: http://stackoverflow.com/questions/1...-files-in-unix
    and http://stackoverflow.com/questions/1...s-in-directory

    If you are submitting to a job scheduler on a cluster a similar loop can be used.
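For the single-end case in the original question, such a loop might look like the sketch below. It is only an illustration: the adapter sequence is a placeholder, and the `trim_all` function name and the `RUN` dry-run switch are made up for this example. `-a` (3' adapter) and `-o` (output file) are standard cutadapt options.

```shell
# Placeholder adapter; replace with the adapter actually used in your library prep.
ADAPTER="${ADAPTER-AGATCGGAAGAGC}"
# Dry run by default: commands are printed, not executed. Call with RUN= to execute.
RUN="${RUN-echo}"

# Trim every *.fastq in the given directory, writing NAME.trim.fastq next to NAME.fastq.
trim_all() (
    dir="${1-.}"
    for fq in "$dir"/*.fastq; do
        [ -e "$fq" ] || continue                       # no matches: the glob stays literal
        case "$fq" in *.trim.fastq) continue ;; esac   # skip outputs of an earlier run
        out="${fq%.fastq}.trim.fastq"
        $RUN cutadapt -a "$ADAPTER" -o "$out" "$fq"
    done
)
```

Running `trim_all /path/to/fastq_dir` prints the commands that would be run; `RUN= trim_all /path/to/fastq_dir` executes them.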


    • mgogol
      Senior Member
      • Mar 2008
      • 197

      #3
      I bet you could do something snazzy with gnu parallel.
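One possible shape for the GNU parallel approach on single-end files is sketched below. It assumes GNU parallel (not the moreutils program of the same name) is installed; the adapter is a placeholder, and the function name and job count are arbitrary. `--dry-run` only prints the commands, `-k` keeps the output in input order, and `{.}` is the input name with its last extension removed.

```shell
# Placeholder adapter; replace with the adapter actually used in your library prep.
ADAPTER="${ADAPTER-AGATCGGAAGAGC}"

# Print (not run) one cutadapt command per input file, 4 jobs at a time.
# Remove --dry-run to execute the commands for real.
parallel_trim_preview() {
    parallel --dry-run -k -j 4 cutadapt -a "$ADAPTER" -o '{.}.trim.fastq' '{}' ::: "$@"
}
```

For example, `parallel_trim_preview *.fastq` lists one command per file in the current directory.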


      • slefevre
        Junior Member
        • Jun 2015
        • 4

        #4
        Hi - for starters, I am a complete newbie at this, so bear with me if it's basic.

        I understand how to do a for loop with one object, but if you have paired-end reads, trim-galore takes two files as input, and I just can't figure out how to make a for loop over a list of pairs. It would be easy if trim-galore accepted wildcards in the two filenames, but I am not sure if it does. Or is it possible to specify two objects to use as input? i.e.
        for read1 in /*/xxxxR1.fastq
        for read2 in /*/xxxxR2.fastq
        do trim-galore etc etc "$read1" "$read2"
        done

        Any thoughts?


        • Michael.Ante
          Senior Member
          • Oct 2011
          • 127

          #5
          You can deduce the read2 variable from read1:

          Code:
          for read1 in */xxxxR1.fastq; do read2=$(echo "$read1" | sed 's/R1\.fastq$/R2.fastq/'); trim_galore etc etc "$read1" "$read2"; done
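A variant of this idea, sketched under some assumptions: it uses shell parameter expansion instead of sed, and skips any R1 file whose mate is missing. It assumes the two file names differ only in the `R1.fastq` / `R2.fastq` suffix. The `trim_pairs` name and the `RUN` dry-run switch are made up for this example; `--paired` is Trim Galore's option for paired-end input, and any further options would go where it is.

```shell
# Dry run by default: commands are printed, not executed. Call with RUN= to execute.
RUN="${RUN-echo}"

# For each R1 file given, derive the R2 name and trim the pair together.
trim_pairs() (
    for read1 in "$@"; do
        case "$read1" in
            *R1.fastq) read2="${read1%R1.fastq}R2.fastq" ;;
            *) echo "skipping $read1: not an R1 file" >&2; continue ;;
        esac
        if [ ! -e "$read2" ]; then
            echo "skipping $read1: $read2 not found" >&2
            continue
        fi
        $RUN trim_galore --paired "$read1" "$read2"
    done
)
```

`trim_pairs */*R1.fastq` prints the commands; `RUN= trim_pairs */*R1.fastq` runs them.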


          • mgogol
            Senior Member
            • Mar 2008
            • 197

            #6
            Ah. The way I usually handle stuff like this is to keep a text file listing all the files - in this case, in two columns. Then I have a Perl script with a for loop that goes through each line of the text file and generates the commands I want to run. The file is useful later on in R or whatever as well.
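The same two-column idea can be sketched in plain shell rather than Perl (a swapped-in language for illustration, not the script described above). Both function names are hypothetical, and the R1/R2 naming scheme is an assumption. The second function only prints commands; nothing is executed.

```shell
# Write one "R1<TAB>R2" line for every R1 file given as an argument.
make_pair_list() (
    for read1 in "$@"; do
        printf '%s\t%s\n' "$read1" "${read1%R1.fastq}R2.fastq"
    done
)

# Read a tab-separated pair list and print one command per line.
print_commands() (
    tab=$(printf '\t')
    while IFS="$tab" read -r read1 read2; do
        echo "trim_galore --paired $read1 $read2"
    done < "$1"
)
```

For example, `make_pair_list */*R1.fastq > pairs.tsv` builds the list, and `print_commands pairs.tsv > commands.sh` turns it into a script that can be inspected before it is run.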


            • slefevre
              Junior Member
              • Jun 2015
              • 4

              #7
              Originally posted by Michael.Ante
              You can deduce the read2 variable from read1:

              Code:
              for read1 in */xxxxR1.fastq; do read2=$(echo "$read1" | sed 's/R1\.fastq$/R2.fastq/'); trim_galore etc etc "$read1" "$read2"; done

              That makes sense - and even better, it seems to be working. Thank you so much!


              • Michael.Ante
                Senior Member
                • Oct 2011
                • 127

                #8
                Originally posted by mgogol
                Ah. The way I usually handle stuff like this is to keep a text file listing all the files - in this case, in two columns. Then I have a Perl script with a for loop that goes through each line of the text file and generates the commands I want to run. The file is useful later on in R or whatever as well.

                In case you want to speed things up a bit, you can use such a file as input for GNU parallel:

                Code:
                parallel -j 8 --colsep '\t' trim_galore etc etc {1} {2} :::: read-pairs-location.tsv

                Just adjust the number of simultaneous jobs (-j) according to your CPUs.


                • yaseen.ladak
                  Junior Member
                  • Nov 2013
                  • 7

                  #9
                  Hello Michael,

                  I am trying to do exactly as you suggested; I just need a bit of help with that code.

                  What is --colsep? I take it that '\t' means your read-pairs-location.tsv is tab separated, and that if it were a CSV you would write --colsep ','. Please correct me if I am wrong. Also, does -j 8 mean that 8 threads will be used for each trim_galore command?

                  I am currently testing on a quad-core i7 in my Mac, which has hyper-threading, so in total I have 8 cores (4 physical and 4 virtual) and could use -j 8, although I need to test which value gives the best performance, as using the maximum number of threads generally gives lower performance.

                  Thanks,
                  Yaseen


                  • GenoMax
                    Senior Member
                    • Feb 2008
                    • 7142

                    #10
                    That --colsep option should work. Give it a try. Otherwise convert your file into tsv format and then use the original \t option.

                    Having 8 cores does not mean that you will be able to use them efficiently. They are connected to your storage subsystem through a common bus which can only allow a certain amount of data to be read/written.

                    Again, experiment with a small subset of data (a test file), starting with 4 cores and going up and down in number to see what works best for your setup.
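Both suggestions can be sketched with standard tools. The helper names below are hypothetical, and the conversion assumes that no file path contains a comma.

```shell
# Convert a comma-separated pair list to tab-separated, written to stdout.
csv_to_tsv() {
    tr ',' '\t' < "$1"
}

# Print only the first N pairs of a list, for a quick benchmark on a subset.
# Usage: first_pairs LIST N
first_pairs() {
    head -n "$2" "$1"
}
```

For example, `csv_to_tsv pairs.csv > pairs.tsv` followed by `first_pairs pairs.tsv 4 > test.tsv` gives a small list that can be run with different -j values and timed.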


                    • yaseen.ladak
                      Junior Member
                      • Nov 2013
                      • 7

                      #11
                      Thank you GenoMax.

                      I ran the following command:

                      parallel -j 3 --colsep '\t' trim_galore --output_dir tg/3 --paired --fastqc -a CTGTCTCTTATA -a2 CTGTCTCTTATA {1} {2} :::: tg.tsv

                      Is there a way to give this command a name? And how would I check the number of threads being used? I am using a Mac, and when I check Activity Monitor I cannot see 3 threads. Had I named this command, would it have shown up in Activity Monitor or in top? The command is indeed running, as it is generating output. tg.tsv contains 4 file names with their complete paths, i.e. 2 paired-end samples.

                      Can someone please advise? Thanks.


                      • yaseen.ladak
                        Junior Member
                        • Nov 2013
                        • 7

                        #12
                        In Activity Monitor, when I run the above command, it shows a gzip process with 1 thread; I think it is decompressing the fastq.gz file first? What is the best way to see the number of threads used by this command on a Mac or on Linux? I have a Mac.
                        Can I give this command a name so I can quickly find it in top etc.?


                        • GenoMax
                          Senior Member
                          • Feb 2008
                          • 7142

                          #13
                          When you run "top", processes that are actively consuming CPU cycles should show up at the top of the list. You can also use Activity Monitor: https://support.apple.com/en-us/HT201464

                          Your file contains one pair of file names per line, separated by a tab, right?
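Another way to find the jobs is to search the process list by program name. The sketch below assumes `pgrep` is available (it ships with macOS and with procps on Linux). The `count_jobs` name and the default pattern are assumptions; the pattern matches both names because Trim Galore is a wrapper script that calls cutadapt.

```shell
# Count running processes whose full command line matches a pattern
# (an extended regular expression; defaults to trim_galore or cutadapt).
count_jobs() {
    pgrep -f "${1-trim_galore|cutadapt}" | wc -l | tr -d ' '
}
```

Called with no argument while the trimming is running, `count_jobs` prints how many matching processes exist at that moment. One job may appear as more than one process, for example a wrapper script plus the program it launches.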


                          • yaseen.ladak
                            Junior Member
                            • Nov 2013
                            • 7

                            #14
                            When I run top or Activity Monitor, it shows gzip most of the time, with the number of threads as 1, even when I run with -j 3.

                            Yes, it is a tsv made in a text editor, with the two read files separated by a tab. I did not include two samples because my Mac gives me an out-of-space error, so for testing I can only use 1 sample with paired-end reads. I then gave the file the .tsv extension. Below is what my tab-separated file looks like; the gap between the two file names is a tab.

                            /Users/Yaseen/S1R1.fastq.gz /Users/Yaseen/S1R2.fastq.gz
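One point that may explain seeing a single job: GNU parallel starts one job per input line, and -j only caps how many of those jobs run at the same time. With a single line in the list, as above, there is only one job to run whatever -j is set to. The hypothetical helper below makes that arithmetic explicit; it is a sketch, not part of parallel.

```shell
# Print how many jobs can actually run at once:
# the smaller of the number of non-empty lines in the list and the -j value.
# Usage: effective_jobs LIST JOBS
effective_jobs() (
    lines=$(grep -c . < "$1" || true)
    if [ "$lines" -lt "$2" ]; then
        echo "$lines"
    else
        echo "$2"
    fi
)
```

For a list holding one pair, `effective_jobs tg.tsv 3` prints 1, so more pairs are needed before a higher -j can make any difference.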
