  • GNU parallel - cutadapt with paired end reads

    Hi all,

    I have many files which I need to quality trim using cutadapt. The sequences are paired-end and the filenames are in the following format:

    sample#_Lane#_R#.fastq.gz, where # is a number.

    e.g.
    29_L001_R1.fastq, 29_L001_R2.fastq, 29_L002_R1.fastq, 29_L002_R2.fastq,
    30_L003_R1.fastq, 30_L003_R2.fastq, 30_L004_R1.fastq, 30_L004_R2.fastq, etc.

    I will use cutadapt to trim these sequences using this command:

    cutadapt -a adaptors_to_trim -A adaptors_to_trim -q 20 --minimum-length 5 -o outputfile_R1 -p outputfile_R2 inputfile_R1 inputfile_R2

    Further, I would like to use GNU parallel to run cutadapt over all the files, using as many of my 16 cores as possible.

    This means that each cutadapt run needs the R1 and R2 files of one sample, with each sample#_Lane# combination kept separate.

    So I was thinking of listing all input files and piping that list into GNU parallel, defining R1 and R2 there for each combination of sample# and lane#, and passing them to cutadapt. Something like this:

    find *_L00*_R*.fastq.gz | parallel DEFINING_TWO_INPUT_FILES_FROM_SAME_SAMPLE_AND_LANE -j +0 cutadapt -a adaptors_to_trim -A adaptors_to_trim -q 20 --minimum-length 5 -o outputfile_R1 -p outputfile_R2 inputfile_R1 inputfile_R2

    Is this possible at all?

    Hope it makes sense. I will of course explain more if needed.

    Thank you very much in advance.

    Best,
    Toke

  • #2
    How about:

    find *_L00*_R1.fastq.gz | sed 's/_R1.fastq.gz$//' |parallel 'cutadapt -a adaptors_to_trim -A adaptors_to_trim -q 20 --minimum-length 5 -o {}_R1_cutadapt.fastq.gz -p {}_R2_cutadapt.fastq.gz {}_R1.fastq.gz {}_R2.fastq.gz &> {}.cutadapt'

    • #3
      Thank you Roy! That worked! I really owe you a beer :-)

      I have to look into this sed option. Do I understand it correctly: does sed 's/_R1.fastq.gz$//' mean that the _R1.fastq.gz part of the filenames listed by find is substituted with whatever is defined by the {}?

      Also, I added -j +0 after parallel to use all cores of my server.

      Again. Thank you very much!

      • #4
        No problem.

        The sed command gets rid of the _R1.fastq.gz from the end of the filenames produced by find before they are passed to parallel. Parallel then takes each shortened filename and uses it in place of {} in the command.
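The two steps described above can be tried without parallel or cutadapt installed. This is only a sketch: the filenames are the examples from the first post with the .gz extension assumed, the dots in the sed pattern are escaped so they match literal dots, and the loop merely imitates how parallel substitutes {} by printing each command instead of running it. The `[options]` text is a placeholder, not a cutadapt argument.

```shell
# Example R1 filenames from the first post (.gz extension assumed).
# sed strips the trailing _R1.fastq.gz, leaving one prefix per sample/lane pair.
prefixes=$(printf '%s\n' 29_L001_R1.fastq.gz 29_L002_R1.fastq.gz 30_L003_R1.fastq.gz \
  | sed 's/_R1\.fastq\.gz$//')
echo "$prefixes"

# Imitate the {} substitution: build one command line per prefix.
# The commands are printed only, never executed.
commands=$(printf '%s\n' "$prefixes" | while IFS= read -r p; do
  printf 'cutadapt [options] -o %s_R1_cutadapt.fastq.gz -p %s_R2_cutadapt.fastq.gz %s_R1.fastq.gz %s_R2.fastq.gz\n' \
    "$p" "$p" "$p" "$p"
done)
echo "$commands"
```

With GNU parallel itself, adding --dry-run to the real command should give a similar preview, since that option prints the jobs instead of running them.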

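        The default for parallel is -j 100%, which is one job per core and is usually the best choice. The -j +0 you added means "number of cores plus zero", so it gives the same result as the default and can be left out. Note that -j 0, without the plus sign, is different: it is defined as "run as many jobs as possible", which may result in processes fighting for resources.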
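As a rough illustration of the job counts involved, the arithmetic behind the percentage and plus forms of -j can be written out by hand. This sketch assumes that `getconf _NPROCESSORS_ONLN` reports about the same CPU count that parallel uses on the machine, which may not hold for every parallel version or platform; the variable names are made up for the example.

```shell
# Logical CPUs currently online (assumed to approximate parallel's CPU count).
cores=$(getconf _NPROCESSORS_ONLN)

# -j 100% multiplies the CPU count by 100 percent (the default).
jobs_default=$(( cores * 100 / 100 ))

# -j +0 adds zero to the CPU count, so it resolves to the same number.
jobs_plus0=$(( cores + 0 ))

echo "-j 100% -> $jobs_default jobs, -j +0 -> $jobs_plus0 jobs"
```

On the 16-core server from the first post, both forms would be expected to allow 16 simultaneous cutadapt runs.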

        • #5
          You are the man!

          Thanks a lot!

          Have a nice day.
