  • nposnien
    Member
    • May 2011
    • 13

    Sort and Split shuffled (interleaved) fastq file

    Hi all,

    I have a huge FASTQ file that contains paired reads (after trimming and quality filtering). The reads are shuffled because they were used for de novo assembly with velvet/oases:

    @HTKZQN1:329:C09BUACXX:6:1101:1442:2235 2:N:0:
    TTTTGATTTCTACATTTCATCACTTTTCAGATAATACGATTTTTGAAGATTTTTTCAATGTTATTCGGGAATTATATTCCAA
    +
    +==A;B?DHHH>4+2CG:<IIA:AC>>C4ACHFEH<EF)CFHF*?1?*: DFG90@;=BCHGI4BFD824'-=EAE;.??;CC

    @HTKZQN1:329:C09BUACXX:6:1101:1575:2055 1:N:0:
    ATAATTTTGGGTGTTAATACAACAAGGAATCATGCTTTCATATTTGAAAAAATATAGATTAATTATAAAAAATACATTTAATTTGATATTAATGTATAAAA
    +
    @C@DFDFFHHFACFEHIIIHHIIIJIIJGGHIJJJJJJIIJJJJIIEHCBDFGGEGEIICGCCHGIEDHIHEB:??CBEE;AEEC@A>CCDDDCDADD@D>

    @HTKZQN1:329:C09BUACXX:6:1101:1735:2058 1:N:0:
    GTGAAGTATAGTAGTTCCATAGGGAATATAGTTAACAAAACACATAAAATCTATAAACTTCAATTTTTCTAGAGCAATAATGTCCCCTTGCAAAAATAAGT
    +
    @CCFFFFFHHHHHJJJJJHIIJJJGIJJJJJGIIJIJJJJIJJIJJJIJJJJJIIHIJIIJJGIJJJIJJIGHHIHIICHIGHIHHHFFFFFFF@AECEC3

    @HTKZQN1:329:C09BUACXX:6:1101:1701:2060 2:N:0:
    CTTTGCCATTTAATTCATAAACTGCATCATCAGCATCCCTGTAGTCATCAAATTCCACAAATCCAAAACCATTTTTAATAAGAATCTCTCGTATTTTCCCA
    +
    CCCFFFFFHGHHHJJJIJJIJJJJFHGJJJJJJJJJGIJJIJJJIIJIJJHIJJJJIIHIJJJJJJJJJCEHIJJJJJJJGHHHHHFFFFFCDEEEEEDDD

    Now I would like to reuse that file for mapping with Bowtie. The problem is that Bowtie needs the two reads of each pair in separate files. I tried several solutions, like this or this script:

    Code:
    #!/usr/local/bin/perl -w
    # Daniel Brami
    # Util to split interlaced FASTQ files into pairs

    use strict;

    # Standard lib
    use IO::File;
    use File::Basename;

    my $INPUT = shift;

    if (!defined($INPUT) || $INPUT =~ /^-/) {
        die "Usage: $0 <interleaved paired FASTQ file>\n";
    }

    # $name keeps its trailing dot, e.g. "reads." for "reads.fastq"
    my ($name, $path, $suffix) = fileparse($INPUT, qw/fastq FASTQ txt TXT/);
    my $OUT1 = $name . "split1.$suffix";
    my $OUT2 = $name . "split2.$suffix";

    my $FH_IN   = IO::File->new($INPUT, "r") or die "could not open $INPUT: $!\n";
    my $FH_OUT1 = IO::File->new($OUT1, "w") or die "could not open $OUT1 for writing: $!\n";
    my $FH_OUT2 = IO::File->new($OUT2, "w") or die "could not open $OUT2 for writing: $!\n";

    my ($recs1, $recs2) = (0, 0);
    my $flipflop = 1;    # 1: next record goes to file 1; -1: to file 2
    my $counter  = 0;    # lines collected for the current record
    my $TXT      = '';
    while (defined(my $line = $FH_IN->getline())) {
        $TXT .= $line;
        $counter++;
        if ($counter == 4) {    # a FASTQ record is 4 lines
            if ($flipflop == 1) {
                print $FH_OUT1 $TXT;
                ++$recs1;
            } else {
                print $FH_OUT2 $TXT;
                ++$recs2;
            }
            $counter = 0;
            $TXT = '';
            $flipflop *= -1;
        }
    }
    $FH_IN->close();
    $FH_OUT1->close();
    $FH_OUT2->close();

    print STDERR "Processed $recs1 records for pair file 1 and $recs2 records for pair file 2.\n";
    if ($recs1 != $recs2) {
        print STDERR "The number of processed records does not match - check input data!\n";
    }
    exit;

    The problem is that those solutions seem to assume that the two reads of a pair are consecutive. Thus, I would need to sort the reads first so that the pairs are consecutive, and afterwards apply one of the above solutions to split the file.
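
    Whether a file really violates that assumption can be checked before splitting. The following is a minimal Python sketch, not part of any of the scripts mentioned here. It assumes headers in the style shown above ("@NAME 1:N:0:" / "@NAME 2:N:0:"), strict 4-line records, and the hypothetical function names read_names and count_adjacent_pairs:

```python
def read_names(path):
    """Yield (name, mate) for each 4-line FASTQ record.

    Assumption: headers look like '@NAME 1:N:0:' or '@NAME 2:N:0:'.
    """
    with open(path) as handle:
        while True:
            header = handle.readline()
            if not header:
                return
            if not header.strip():
                continue  # tolerate blank lines between records
            for _ in range(3):  # skip sequence, '+' and quality lines
                handle.readline()
            name, _, comment = header[1:].strip().partition(" ")
            yield name, comment[:1]


def count_adjacent_pairs(path):
    """Return (number of records, number of adjacent read-1/read-2 pairs)."""
    total = pairs = 0
    previous = None
    for current in read_names(path):
        total += 1
        if (previous and previous[0] == current[0]
                and previous[1] == "1" and current[1] == "2"):
            pairs += 1
            previous = None  # both mates consumed
        else:
            previous = current
    return total, pairs
```

    If the second number is not exactly half of the first, a flip-flop splitter such as the script above will put non-matching reads at the same position in its two output files.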

    My question is now: How can I sort the fastq file?
    I would appreciate any hint!

    Thanks.
  • maubp
    Peter (Biopython etc)
    • Jul 2009
    • 1544

    #2
    There are many FASTQ de-interlacing scripts out there, and yes, most do assume sorting so that pairs are next to each other. But not all make this assumption - at the expense of needing to build an index or some similar technique.

    Don't you have the filtered and trimmed FASTQ files from before running Velvet?
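
    To illustrate the index idea mentioned above, here is a rough Python sketch rather than a reference to any particular existing script. It assumes headers like "@NAME 1:N:0:" / "@NAME 2:N:0:" as in the first post and strict 4-line records; the function names (build_index, split_unsorted) are made up for this example. It keeps only byte offsets in memory, not the reads themselves:

```python
def parse_header(header):
    """Split b'@NAME 1:N:0:' into (b'NAME', b'1').

    Assumption: the mate number is the first character after the space.
    """
    name, _, comment = header[1:].strip().partition(b" ")
    return name, comment[:1]


def build_index(path):
    """Map (name, mate) to the byte offset of each 4-line record."""
    index = {}
    with open(path, "rb") as handle:
        while True:
            offset = handle.tell()
            header = handle.readline()
            if not header:
                break
            if not header.strip():
                continue  # tolerate blank lines between records
            for _ in range(3):  # skip sequence, '+' and quality lines
                handle.readline()
            index[parse_header(header)] = offset
    return index


def fetch(handle, offset):
    """Return the 4-line record starting at the given byte offset."""
    handle.seek(offset)
    return b"".join(handle.readline() for _ in range(4))


def split_unsorted(path, out1, out2, out_single):
    """Write mates to out1/out2 in matching order, orphans to out_single."""
    index = build_index(path)
    pairs = singles = 0
    with open(path, "rb") as src, open(out1, "wb") as f1, \
            open(out2, "wb") as f2, open(out_single, "wb") as fs:
        for (name, mate), offset in index.items():
            if mate == b"1" and (name, b"2") in index:
                f1.write(fetch(src, offset))
                f2.write(fetch(src, index[(name, b"2")]))
                pairs += 1
            elif mate == b"2" and (name, b"1") in index:
                continue  # written together with its read 1
            else:
                fs.write(fetch(src, offset))  # mate was filtered out
                singles += 1
    return pairs, singles
```

    Reads whose mate was lost during filtering go to a separate file instead of breaking the pairing. The dictionary still holds one entry per read, so memory use grows with the number of reads, and the random seeks can be slow on large files.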

    • nposnien
      Member
      • May 2011
      • 13

      #3
      Thanks a lot for your reply!
      @Peter: Do you have a script in mind that works by building an index? I would give it a try, because a collaborator did all the shuffling steps and the original files are lost.
      Thanks a lot!

      • maubp
        Peter (Biopython etc)
        • Jul 2009
        • 1544

        #4
        How many reads are in this file? And how much RAM do you have? Some indexing is done in memory, but that may not be practical with your dataset, which might require a disk-based index.

        • nposnien
          Member
          • May 2011
          • 13

          #5
          33 million reads in the file, and I have 16 GB of RAM locally. This should work with a memory-based approach. However, I will need to do this with other datasets (up to 170 million reads), so a disk-based method would be better, I guess.
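
          One way a disk-based method could look is sketched below with Python's standard sqlite3 module. This is only an illustration under assumptions, not a tested tool for files of this size: headers are assumed to look like "@NAME 1:N:0:" / "@NAME 2:N:0:", records are assumed to be exactly 4 lines, and the names load_reads and write_pairs are invented for the example.

```python
import sqlite3


def iter_records(handle):
    """Yield (name, mate, record text) for each 4-line FASTQ record."""
    while True:
        header = handle.readline()
        if not header:
            return
        if not header.strip():
            continue  # tolerate blank lines between records
        body = header + "".join(handle.readline() for _ in range(3))
        name, _, comment = header[1:].strip().partition(" ")
        mate = comment[:1]
        if mate not in ("1", "2"):
            # Assumption: mate number is the first character after the space
            raise ValueError("cannot determine mate for " + name)
        yield name, int(mate), body


def load_reads(fastq_path, db_path):
    """Store all records in an on-disk SQLite table and index them."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE reads (name TEXT, mate INTEGER, record TEXT)")
    with open(fastq_path) as handle:
        con.executemany("INSERT INTO reads VALUES (?, ?, ?)",
                        iter_records(handle))
    con.execute("CREATE INDEX idx_reads ON reads (name, mate)")
    con.commit()
    return con


def write_pairs(con, out1, out2):
    """Write complete pairs, sorted by read name, to two files."""
    query = """
        SELECT a.record, b.record
        FROM reads AS a JOIN reads AS b ON a.name = b.name
        WHERE a.mate = 1 AND b.mate = 2
        ORDER BY a.name
    """
    pairs = 0
    with open(out1, "w") as f1, open(out2, "w") as f2:
        for rec1, rec2 in con.execute(query):
            f1.write(rec1)
            f2.write(rec2)
            pairs += 1
    return pairs
```

          The database lives on disk, so memory use stays roughly constant regardless of the number of reads, at the cost of extra disk space and run time. Reads without a mate are simply left out of both output files here.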
