  • GNU parallel - cutadapt with paired end reads

    Hi all,

    I have many files that I need to quality-trim using cutadapt. The reads are paired-end, and the filenames are in the following format:

    sample#_Lane#_R#.fastq.gz, where # is a number.

    e.g.
    29_L001_R1.fastq, 29_L001_R2.fastq, 29_L002_R1.fastq, 29_L002_R2.fastq,
    30_L003_R1.fastq, 30_L003_R2.fastq, 30_L004_R1.fastq, 30_L004_R2.fastq, etc.

    I will use cutadapt to trim these sequences using this command:

    cutadapt -a adaptors_to_trim -A adaptors_to_trim -q 20 --minimum-length 5 -o outputfile_R1 -p outputfile_R2 inputfile_R1 inputfile_R2

    I would also like to use GNU parallel so that the work uses as many of my 16 cores as possible and loops over all files, passing each pair to cutadapt.

    This means that each cutadapt run needs the R1 and R2 files of one sample, with every sample#_Lane# combination kept separate.

    So I was thinking of listing all input files, piping that list into GNU parallel, and there defining the R1 and R2 files for each sample# and lane# combination as input to cutadapt. Something like this:

    find *_L00*_R*.fastq.gz | parallel DEFINING_TWO_INPUT_FILES_FROM_SAME_SAMPLE_AND_LANE -j +0 cutadapt -a adaptors_to_trim -A adaptors_to_trim -q 20 --minimum-length 5 -o outputfile_R1 -p outputfile_R2 inputfile_R1 inputfile_R2

    Is this possible at all?

    Hope it makes sense. I will of course explain more if needed.

    Thank you very much in advance.

    Best,
    Toke

  • #2
    How about:

    find *_L00*_R1.fastq.gz | sed 's/_R1.fastq.gz$//' |parallel 'cutadapt -a adaptors_to_trim -A adaptors_to_trim -q 20 --minimum-length 5 -o {}_R1_cutadapt.fastq.gz -p {}_R2_cutadapt.fastq.gz {}_R1.fastq.gz {}_R2.fastq.gz &> {}.cutadapt'
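
Before launching the real jobs, it may help to print the commands and confirm that every R1 file has its R2 mate. The following is only a sketch: `build_cmd` is a hypothetical helper, not part of cutadapt or parallel, and `adaptors_to_trim` remains a placeholder as in the command above. It prints the cutadapt command for each prefix instead of running it.

```shell
# Placeholder adapter sequence, as in the command above.
ADAPTER="adaptors_to_trim"

# Hypothetical helper: print (do not run) the cutadapt command for one
# sample_lane prefix such as 29_L001.
build_cmd() {
    printf 'cutadapt -a %s -A %s -q 20 --minimum-length 5 -o %s_R1_cutadapt.fastq.gz -p %s_R2_cutadapt.fastq.gz %s_R1.fastq.gz %s_R2.fastq.gz\n' \
        "$ADAPTER" "$ADAPTER" "$1" "$1" "$1" "$1"
}

# Walk over every R1 file in the current directory and warn when R2 is missing.
for r1 in *_L00*_R1.fastq.gz; do
    [ -e "$r1" ] || continue               # the glob matched nothing
    prefix="${r1%_R1.fastq.gz}"            # same effect as the sed step
    if [ -e "${prefix}_R2.fastq.gz" ]; then
        build_cmd "$prefix"
    else
        echo "missing R2 mate for $r1" >&2
    fi
done
```

If the printed commands look right, the same prefixes can be handed to parallel as shown above.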


    • #3
      Thank you Roy! That worked! I really owe you a beer :-)

      I have to look into this sed option. Do I understand it correctly: the sed 's/_R1.fastq.gz$//' indicates that _R1.fastq.gz is the part of the file names (listed by find) that gets substituted with whatever is defined by the {}?

      Also, I added -j +0 after parallel to use all cores of my server.

      Again, thank you very much!


      • #4
        No problem.

        The sed command gets rid of the _R1.fastq.gz from the end of the filenames produced by find before they are passed to parallel. Parallel then takes each shortened filename and uses it in place of {} in the command.
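
The two steps described above can be tried without cutadapt or parallel. This is a small sketch, assuming file names like those in the first post with .gz added. `strip_r1` is a hypothetical wrapper around the same sed expression, with the dots escaped so they only match literal dots (the unescaped version behaves the same on these names). A shell variable stands in for parallel's {}.

```shell
# Hypothetical wrapper around the sed step: drop the _R1.fastq.gz suffix.
strip_r1() {
    sed 's/_R1\.fastq\.gz$//'
}

# Example names (assumed), fed through the same kind of pipeline as find | sed.
prefixes="$(printf '%s\n' 29_L001_R1.fastq.gz 29_L002_R1.fastq.gz 30_L003_R1.fastq.gz | strip_r1)"
printf '%s\n' "$prefixes"

# Mimic what parallel does with {}: reuse one prefix to name both mates.
p="$(printf '%s\n' "$prefixes" | head -n 1)"
echo "${p}_R1.fastq.gz ${p}_R2.fastq.gz"
```

So sed only removes the suffix, and the suffixes for R1 and R2 are added back inside the quoted parallel command.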

        The default for parallel is -j 100%, which is 1 process per core and is usually the optimal solution; -j +0 (number of cores plus zero) gives the same job count, so it is not needed. -j 0, by contrast, is defined as "run as many jobs as possible", which may result in processes fighting for resources.
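
To make the arithmetic concrete, here is an illustration of how those -j values map to a job count on a 16-core machine. `jobs_for` is a hypothetical function written for this example; it is not GNU parallel's own code, and it ignores rounding for percentages that do not divide evenly.

```shell
# Illustration only: translate a -j value into a job count for a given
# number of cores. Arguments: $1 = -j value, $2 = number of cores.
jobs_for() {
    j_spec="$1"
    n_cores="$2"
    case "$j_spec" in
        0)   echo "as many as possible" ;;                 # -j 0
        +*)  echo $(( n_cores + ${j_spec#+} )) ;;          # -j +N: cores plus N
        *%)  echo $(( n_cores * ${j_spec%\%} / 100 )) ;;   # -j N%: share of cores
        *)   echo "$j_spec" ;;                             # -j N: exactly N jobs
    esac
}

for j in 100% +0 8 0; do
    printf '%s -> %s\n' "-j $j" "$(jobs_for "$j" 16)"
done
```

On 16 cores both -j 100% and -j +0 come out as 16 jobs, which is why adding -j +0 does not change anything compared with the default.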


        • #5
          You are the man!

          Thanks a lot!

          Have a nice day.
