  • Illumina Paired End Merge script

    Hi all,

    I am looking for a script/program which will take paired-end reads from an Illumina run and put them into a single FastQ file. I searched this site and found a Perl script, but it does not work. Any help would be appreciated. Thanks.

  • #2
    Not tested, and assumes no blank lines in your files, but this should work:

    Code:
    #!/usr/bin/perl
    use warnings;
    use strict;
    
    # Merge together two FastQ files
    # Usage is merge_fastq.pl [read1 file] [read2 file] [outfile]
    
    
    my ($in1,$in2,$out) = @ARGV;
    
    die "Usage is merge_fastq.pl [read1 file] [read2 file] [outfile]\n" unless (defined $out);
    
    open (my $in1_fh,'<',$in1) or die "Can't open $in1: $!";
    open (my $in2_fh,'<',$in2) or die "Can't open $in2: $!";
    open (my $out_fh,'>',$out) or die "Can't write to $out: $!";
    
    my $count = 0;
    while (1) {
      ++$count;
      my $line1 = <$in1_fh>;
      my $line2 = <$in2_fh>;
    
      last unless (defined $line1 and defined $line2);
    
      if ($count % 2) {
        # Odd lines (header and '+' separator): keep only read 1's copy
        print $out_fh $line1;
      }
      else {
        # Even lines (sequence and quality): append read 2's line to read 1's
        chomp $line1;
        print $out_fh $line1,$line2;
      }
    
    }
    
    close $out_fh or die "Can't write to $out: $!";

    • #3
      Shorty provides a very fast Perl script that merges FastQ sequences in the following way:

      @read_id1/1
      ...
      +
      ...
      @read_id1/2
      ...
      +
      ...
      and so on..

      Code:
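      #!/usr/bin/perl
      use strict;
      use warnings;
      
      # Interleave two FastQ files: each record from file A is followed by
      # the record at the same position in file B.
      my ($filenameA, $filenameB, $filenameOut) = @ARGV;
      die "Usage is merge.pl file1.fastq file2.fastq out.fastq\n" unless defined $filenameOut;
      
      open(my $FILEA, '<', $filenameA) or die "Can't open $filenameA: $!";
      open(my $FILEB, '<', $filenameB) or die "Can't open $filenameB: $!";
      open(my $OUTFILE, '>', $filenameOut) or die "Can't write to $filenameOut: $!";
      
      while (defined(my $header = <$FILEA>)) {
      	# Header line plus the remaining three lines of the record from file A ...
      	print $OUTFILE $header;
      	for (1 .. 3) {
      		my $line = <$FILEA>;
      		die "Truncated record in $filenameA\n" unless defined $line;
      		print $OUTFILE $line;
      	}
      	# ... then the four-line record at the same position in file B.
      	for (1 .. 4) {
      		my $line = <$FILEB>;
      		die "$filenameB ended before $filenameA\n" unless defined $line;
      		print $OUTFILE $line;
      	}
      }
      
      close $OUTFILE or die "Can't write to $filenameOut: $!";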
      Note: It assumes that both files contain the same number of reads and that the reads are in the same order.
      Usage should be: merge.pl file1.fastq file2.fastq out.fastq
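
      The note above leaves the same-order assumption unchecked. Below is a minimal sketch of a check that could be run before merging. It assumes old-style Illumina headers that end in /1 and /2, like the ones shown in this thread. The names check_pairing, read_record and pair_id are hypothetical helpers, not part of either posted script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read one 4-line FastQ record; returns undef at end of file.
sub read_record {
    my ($fh, $name) = @_;
    my @lines;
    for (1 .. 4) {
        my $line = <$fh>;
        last unless defined $line;
        push @lines, $line;
    }
    return undef unless @lines;
    die "Truncated record in $name\n" unless @lines == 4;
    return \@lines;
}

# Reduce a header line to the part shared by both mates.
# Assumption: headers end in /1 or /2 (older Illumina naming).
sub pair_id {
    my ($header) = @_;
    $header =~ s/\s+\z//;      # drop the line ending (LF or CRLF)
    $header =~ s{/[12]$}{};    # drop the mate suffix
    return $header;
}

# Returns the number of pairs checked; dies on the first problem found.
sub check_pairing {
    my ($file1, $file2) = @_;
    open(my $in1, '<', $file1) or die "Can't open $file1: $!";
    open(my $in2, '<', $file2) or die "Can't open $file2: $!";
    my $pairs = 0;
    while (1) {
        my $rec1 = read_record($in1, $file1);
        my $rec2 = read_record($in2, $file2);
        last unless $rec1 or $rec2;
        die "Files differ in length after $pairs pairs\n" unless $rec1 and $rec2;
        my ($id1, $id2) = (pair_id($rec1->[0]), pair_id($rec2->[0]));
        die "ID mismatch at pair " . ($pairs + 1) . ": $id1 vs $id2\n"
            unless $id1 eq $id2;
        ++$pairs;
    }
    return $pairs;
}

# Only runs when two file names are given on the command line.
if (@ARGV == 2) {
    my $pairs = check_pairing(@ARGV);
    print "$pairs pairs, all IDs match\n";
}
```

      Newer Illumina headers put the mate number after a space rather than after a slash, so pair_id would need adjusting for those files.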
      Last edited by Jenzo; 04-07-2011, 12:32 AM.

      • #4
        The two posted scripts do slightly different things. The one I posted concatenates the sequences and qualities together so if you started with a 2 x 40bp run then you'd end up with a file of 80bp reads.

        The second script simply places the reads from the two files one after another in the combined file, so you'd end up with a 40bp file which was twice as long. It's roughly equivalent to doing:

        Code:
        cat [file1] [file2] > [outfile]
        except that it puts the equivalent reads next to each other in the final file.

        I guess which one you use depends on how you wanted to combine the files....
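
        To make the difference concrete, here is a small sketch that applies both behaviours to one made-up pair of 4 bp reads. The sub names concatenate_pair and interleave_pair are only illustrative labels for what the two posted scripts do to each pair; they are not taken from either script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A record here is [header, sequence, '+', quality], without newlines.

# Behaviour of the first script: read 2 is joined onto read 1,
# giving one record that is twice as long.
sub concatenate_pair {
    my ($r1, $r2) = @_;
    return ([ $r1->[0], $r1->[1] . $r2->[1], $r1->[2], $r1->[3] . $r2->[3] ]);
}

# Behaviour of the second script: both records are kept intact,
# with read 2 placed straight after read 1.
sub interleave_pair {
    my ($r1, $r2) = @_;
    return ([@$r1], [@$r2]);
}

# Made-up example data, not real reads.
my $read1 = [ '@pair_a/1', 'ACGT', '+', 'IIII' ];
my $read2 = [ '@pair_a/2', 'TTGG', '+', 'HHHH' ];

for my $mode ([ concatenated => \&concatenate_pair ],
              [ interleaved  => \&interleave_pair ]) {
    my ($label, $merge) = @$mode;
    print "== $label ==\n";
    print join("\n", @$_), "\n" for $merge->($read1, $read2);
}
```

        The concatenated mode prints a single 8 bp record under the read 1 header, while the interleaved mode prints two 4 bp records, one per mate.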

        • #5
          Hehe, that's right ;-) Thanks for pointing it out!

          • #6
            Thanks for the script, Andrew. I tried it out, and it joins the files, but not in the paired-end fashion I need.

            Original file (1)

            @HWI-EAS216_0001:1:1:1079:15982#0/1
            TATGCTCTGCCTTGGCTGTGTCATCGTGTTGATGCCAACTGACACGAAACTTCTAGGCTGATTCATCCTAAGTAT
            +
            CCCCCCCCCCCCCCCC@BCCCCCCCCCCCCC@@CCC>2?>>>A?C@CC7@@@@A<@@@A@@@?C@=CC#######
            @HWI-EAS216_0001:1:1:1079:9356#0/1
            CGCTCAAGAGATGGGCTTTGGGTGCGGAATGGGGATTTGGGTTGTGACCCAATACAGCGGTAGTAGCGTGCAGCA
            +
            BBB=>B=BCCCCCCCCCACCCCBCBCC@BBCCCABC@CCCB@CCA@C?B9C7?@:<@##################

            Original file (2)


            @HWI-EAS216_0001:1:1:1079:15982#0/2
            GTTTCTGAAGAGGCAGGCAGCAGAATTTGGTTTATTGAGTCTGTGTTGAAAAGAAACCACTTACGCATTATACTT
            +
            BCCCCBCCCCCCB7CCCC;9*;8:>?BB<CC<C@A?A5C<C@?C=CC4;>A########################
            @HWI-EAS216_0001:1:1:1079:9356#0/2
            GCAGGATTGCCATTCCCATCAGCTTTCTGCTGCACGCTACTACCGCTGTATTGGGTCACAACCCAAATCCCCATT
            +
            CCCCCBBCCCCCACCCCCCCCCC?CCCCBCCCCCCCCCCCBCCCC@ABCCCCCC<C;C>CCCBCCCBC>CCBC>>

            The script from Andrew does this (putting all the 0/1 reads first):


            @HWI-EAS216_0001:1:1:1079:15982#0/1
            TATGCTCTGCCTTGGCTGTGTCATCGTGTTGATGCCAACTGACACGAAACTTCTAGGCTGATTCATCCTAAGTATGTTTCTGAAGAGGCAGGCAGCAGAATTTGGTTTATTGAGTCTGTGTTGAAAAGAAACCACTTACGCATTATACTT
            +
            CCCCCCCCCCCCCCCC@BCCCCCCCCCCCCC@@CCC>2?>>>A?C@CC7@@@@A<@@@A@@@?C@=CC#######BCCCCBCCCCCCB7CCCC;9*;8:>?BB<CC<C@A?A5C<C@?C=CC4;>A########################
            @HWI-EAS216_0001:1:1:1079:9356#0/1
            CGCTCAAGAGATGGGCTTTGGGTGCGGAATGGGGATTTGGGTTGTGACCCAATACAGCGGTAGTAGCGTGCAGCAGCAGGATTGCCATTCCCATCAGCTTTCTGCTGCACGCTACTACCGCTGTATTGGGTCACAACCCAAATCCCCATT
            +
            BBB=>B=BCCCCCCCCCACCCCBCBCC@BBCCCABC@CCCB@CCA@C?B9C7?@:<@##################CCCCCBBCCCCCACCCCCCCCCC?CCCCBCCCCCCCCCCCBCCCC@ABCCCCCC<C;C>CCCBCCCBC>CCBC>>


            What I want is:

            @HWI-EAS216_0001:1:1:1079:15982#0/1
            TATGCTCTGCCTTGGCTGTGTCATCGTGTTGATGCCAACTGACACGAAACTTCTAGGCTGATTCATCCTAAGTAT
            +
            CCCCCCCCCCCCCCCC@BCCCCCCCCCCCCC@@CCC>2?>>>A?C@CC7@@@@A<@@@A@@@?C@=CC#######
            @HWI-EAS216_0001:1:1:1079:15982#0/2
            GTTTCTGAAGAGGCAGGCAGCAGAATTTGGTTTATTGAGTCTGTGTTGAAAAGAAACCACTTACGCATTATACTT
            +
            BCCCCBCCCCCCB7CCCC;9*;8:>?BB<CC<C@A?A5C<C@?C=CC4;>A########################

            Hope this helps. I know it's possible.

            Thanks for all the help.

            • #7
              Originally posted by newbietonextgen
              What I want is:

              @HWI-EAS216_0001:1:1:1079:15982#0/1
              TATGCTCTGCCTTGGCTGTGTCATCGTGTTGAT
              +
              CCCCCCCCCCCCCCCC@BCCCCCCCCCCCCC@
              @HWI-EAS216_0001:1:1:1079:15982#0/2
              GTTTCTGAAGAGGCAGGCAGCAGAATTTGGTTT
              +
              BCCCCBCCCCCCB7CCCC;9*;8:>?BB<CC<C@
              That's what Jenzo's script would produce, isn't it?

              • #8
                I think so, but I did not try it. I used fastq_merge.pl, your script.

                • #9
                  I explained in the second note I added that the two posted scripts do different things, and that the choice depends on how you want to merge your files. Just out of interest, which pipeline are you using that requires the paired reads to be placed one after another?

                  • #10
                    Ha, sorry, my mistake. I figured it out. Thanks. SHRiMP requires that paired reads are placed one after the other.

                    • #11
                      Thanks guys. Scarpa also requires a merged FastQ file with "interleaved" reads (so that reads from the same pair follow each other), and Jenzo's script does that.

                      • #12
                        And what about combining PE reads from multiple runs? I have two runs from the same library and I would like to combine the PE reads into the same files (one file for R1 and one file for R2), keeping the read separation as in Jenzo's script. Would my code look something like this?

                        #!/usr/bin/perl

                        $filename_R1_Run1 = $ARGV[0];
                        $filename_R1_Run2 = $ARGV[1];
                        $filename_R1_Runs1And2 = $ARGV[2];

                        open $FILE_R1_Run1, "< $filename_R1_Run1";
                        open $FILE_R1_Run2, "< $filename_R1_Run2";

                        open $FILE_R1_Runs1And2, "> $filename_R1_Runs1And2";

                        while(<$FILE_R1_Run1>) {
                        print $FILE_R1_Runs1And2 $_;
                        $_ = <$FILE_R1_Run1>;
                        print $FILE_R1_Runs1And2 $_;
                        $_ = <$FILE_R1_Run1>;
                        print $FILE_R1_Runs1And2 $_;
                        $_ = <$FILE_R1_Run1>;
                        print $FILE_R1_Runs1And2 $_;

                        $_ = <$FILE_R1_Run2>;
                        print $FILE_R1_Runs1And2 $_;
                        $_ = <$FILE_R1_Run2>;
                        print $FILE_R1_Runs1And2 $_;
                        $_ = <$FILE_R1_Run2>;
                        print $FILE_R1_Runs1And2 $_;
                        $_ = <$FILE_R1_Run2>;
                        print $FILE_R1_Runs1And2 $_;
                        }
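
                        A hedged note on the code above: as written, the loop alternates one record from run 1 with one record from run 2 and stops when run 1 ends, so it only copies everything when both runs contain the same number of reads. If the goal is one R1 file and one R2 file, appending run 2 after run 1 should be enough, provided the R1 and R2 files are appended in the same run order so that mates stay at matching positions. A minimal sketch under that assumption (append_runs and the script name combine_runs.pl are hypothetical):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Append each input file to $out in the order given.
# Returns the number of FastQ records written (lines / 4).
sub append_runs {
    my ($out, @inputs) = @_;
    open(my $out_fh, '>', $out) or die "Can't write to $out: $!";
    my $lines = 0;
    for my $file (@inputs) {
        open(my $in_fh, '<', $file) or die "Can't open $file: $!";
        while (my $line = <$in_fh>) {
            # Guard against a final line with no newline, which would
            # otherwise run into the first header of the next file.
            $line .= "\n" unless $line =~ /\n\z/;
            print {$out_fh} $line;
            ++$lines;
        }
        close $in_fh;
    }
    close $out_fh or die "Can't write to $out: $!";
    die "$out has $lines lines, not a multiple of 4\n" if $lines % 4;
    return $lines / 4;
}

# Hypothetical usage: combine_runs.pl out.fastq run1.fastq run2.fastq
# Run it once for the R1 files and once for the R2 files, same run order.
if (@ARGV >= 3) {
    my $records = append_runs(@ARGV);
    print "Wrote $records records\n";
}
```

                        This is roughly the cat command from post #4 applied to each read file separately, plus a check that the result is a whole number of 4-line records.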
