Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • cutadapt trimming for multiple files

    Hi,

    First, I'm sorry if I missed this but I couldn't find another thread about this issue:

    I would like to run cutadapt on over 200 different .fastq files, so I clearly need to automate this process but I am not sure how to do this.

    What is the best way to tell cutadapt to remove the adaptors on all files within a directory and create an output for each of them? The manual only mentions running two files at one (if paired end sequencing was done)

    In short, I need to find all "*.fastq", trim sequences and output as trim.fastq

    Thanks!

  • #2
    You can use one of the shell script ideas here to iterate over the set of files you have: http://stackoverflow.com/questions/1...-files-in-unix
    and http://stackoverflow.com/questions/1...s-in-directory

    If you are submitting to a job scheduler on a cluster a similar loop can be used.

    Comment


    • #3
      I bet you could do something snazzy with gnu parallel.

      This tutorial follows Ole Tange’s GNU parallel tutorial ( http://www.gnu.org/software/parallel/parallel_tutorial.html ) but I tried to use some bioinformatics-related  examples (align with BWA, Samtools, etc.. ).  

      Comment


      • #4
        Hi - for starters, I am a completely newbie at this, so bare with me if it's basic.

        I understand how to do a for loop with one object, but if you have paired-end reads, trim-galore takes two files as input, and I just can't figure out how to make a for loop including a list of pairs.... Would be easy if trim-galore accepted wildcards in the two filenames, but I am not sure of it does? Or is it possible to specify two objects to use as input? i.e.
        for read1 in /*/xxxxR1.fastq
        for read2 in /*/xxxxR2.fastq
        do trim-galore etc etc "$read1$ "$read2"
        done

        Any thoughts?

        Comment


        • #5
          You can deduce the read2 variable from read1:

          Code:
          for read1 in */xxxxR1.fastq; do read2=$(echo $read1| sed 's/R1.fastq/R2.fastq/'); trim-galore etc etc $read1 $read2 ; done

          Comment


          • #6
            Ah. The way I usually handle stuff like this is just I have a text file listing all the files - in this case, in two columns. Then I have a perl script with a for loop going through each line in the text file and generating the commands I want to run. It's useful later on in R or whatever as well.

            Comment


            • #7
              Originally posted by Michael.Ante View Post
              You can deduce the read2 variable from read1:

              Code:
              for read1 in */xxxxR1.fastq; do read2=$(echo $read1| sed 's/R1.fastq/R2.fastq/'); trim-galore etc etc $read1 $read2 ; done
              That makes sense - and even better, it seems to be working. Thank you so much!

              Comment


              • #8
                Originally posted by mgogol View Post
                Ah. The way I usually handle stuff like this is just I have a text file listing all the files - in this case, in two columns. Then I have a perl script with a for loop going through each line in the text file and generating the commands I want to run. It's useful later on in R or whatever as well.
                In case you want to speed up a bit, you can use such a file as input for GNU parallel:

                Code:
                parallel -j 8 --colsep '\t'  trim-galore etc etc {1} {2} :::: read-pairs-location.tsv
                Just adjust the number of threads (-j) according to your CPUs.

                Comment


                • #9
                  Hello Michael,

                  I am trying to do exactly as you suggested just need a bit of help with that code.

                  what is --colsep and when you say '\t' means that you read-pairs-location.tsv is tab separated if it was a CSV you would write --colsep ',' please correct me if I am wrong. Also it means that it will use 8 threads for each of the trim_galore command?

                  I am currently testing on a Quad code i7 on my mac that has hyper threading so in total I have 8 cores 4 physical and 4 virtual so I can use -j8 although I need to test for what value of thread I get the best performance as using maximum threads generally gives lower performance.

                  Thanks,
                  Yaseen

                  Comment


                  • #10
                    That --colsep option should work. Give it a try Otherwise convert your file into tsv format and then use the original \t option.

                    Having 8 cores does not mean that you will be able to use them efficiently. They are connected to your storage subsystem through a common bus which can only allow a certain amount of data to be read/written.

                    Again experiment with a small subset of data (test file) starting with 4 cores and go up and down in number to see what number works best for your setup.

                    Comment


                    • #11
                      Thank you GenoMax.

                      I ran the following command:

                      parallel -j 3 --colsep '\t' trim_galore --output_dir tg/3 --paired --fastqc -a CTGTCTCTTATA -a2 CTGTCTCTTATA {1} {2} :::: tg.tsv

                      Is there a way that I can give a name to this command? and how would i check the number of threads being used. I am using a mac and when I check the activity monitor i cannot see 3 i.e. number of threads. Had i named this command it would have shown up in the activity monitor and or using top? Indeed this command is running as it is generating output. The tg.tsv has 4 files names with their complete paths i.e. 2 paired end samples.

                      Can someone please advise? thanks

                      Comment


                      • #12
                        In my activity monitor when i run the above command its shows gzip command with 1 thread as i think it uncompressed the fastq.gz file initially? Whats the best to see the number of threads run by this command on a mac or a linux. I have a mac.
                        Can I name this command to something so i can quickly find it in top etc?

                        Comment


                        • #13
                          When you run "top" processes that are actively consuming CPU cycles should show up at the top of the list. You can also use the activity monitor: https://support.apple.com/en-us/HT201464

                          Your file contains the two file name pairs separated by a tab on each line, right?

                          Comment


                          • #14
                            When I do a top or a activity monitor it shows gzip most of the time and the number of threads as 1 even when i run with 3 threads.

                            Yes tsv in a text editor, with two reads each separated by a tab. I did not put two samples as my mac gives me out of space error so for testing I can only use 1 sample with paired end read. I then labelled the extension of the file with .tsv The below is the section how my tab separated file looks like, the gap between the two files is a tab.

                            /Users/Yaseen/S1R1.fastq.gz /Users/Yaseen/S1R2.fastq.gz

                            Comment

                            Latest Articles

                            Collapse

                            • seqadmin
                              Genetic Variation in Immunogenetics and Antibody Diversity
                              by seqadmin



                              The field of immunogenetics explores how genetic variations influence immune responses and susceptibility to disease. In a recent SEQanswers webinar, Oscar Rodriguez, Ph.D., Postdoctoral Researcher at the University of Louisville, and Ruben Martínez Barricarte, Ph.D., Assistant Professor of Medicine at Vanderbilt University, shared recent advancements in immunogenetics. This article discusses their research on genetic variation in antibody loci, antibody production processes,...
                              11-06-2024, 07:24 PM
                            • seqadmin
                              Choosing Between NGS and qPCR
                              by seqadmin



                              Next-generation sequencing (NGS) and quantitative polymerase chain reaction (qPCR) are essential techniques for investigating the genome, transcriptome, and epigenome. In many cases, choosing the appropriate technique is straightforward, but in others, it can be more challenging to determine the most effective option. A simple distinction is that smaller, more focused projects are typically better suited for qPCR, while larger, more complex datasets benefit from NGS. However,...
                              10-18-2024, 07:11 AM

                            ad_right_rmr

                            Collapse

                            News

                            Collapse

                            Topics Statistics Last Post
                            Started by seqadmin, 11-08-2024, 11:09 AM
                            0 responses
                            221 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 11-08-2024, 06:13 AM
                            0 responses
                            163 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 11-01-2024, 06:09 AM
                            0 responses
                            80 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 10-30-2024, 05:31 AM
                            0 responses
                            27 views
                            0 likes
                            Last Post seqadmin  
                            Working...
                            X