Hi all,
I have a huge FASTQ file that contains paired reads (after trimming and quality filtering). The reads are shuffled because they were used for de novo assembly with Velvet/Oases.
Now I would like to reuse that file for mapping with Bowtie. The problem is that I need the two mates of each pair in separate files. I tried several solutions, like this one or this script (reproduced below).
The problem is that those solutions seem to assume that the two reads of a pair are consecutive. So I would first need to sort the reads so that the pairs are consecutive, and afterwards apply one of the above solutions to split the file.
So my question is: how can I sort the FASTQ file?
I would appreciate any hint!
Thanks.
An excerpt of the shuffled reads:
@HTKZQN1:329:C09BUACXX:6:1101:1442:2235 2:N:0:
TTTTGATTTCTACATTTCATCACTTTTCAGATAATACGATTTTTGAAGATTTTTTCAATGTTATTCGGGAATTATATTCCAA
+
+==A;B?DHHH>4+2CG:<IIA:AC>>C4ACHFEH<EF)CFHF*?1?*:DFG90@;=BCHGI4BFD824'-=EAE;.??;CC
@HTKZQN1:329:C09BUACXX:6:1101:1575:2055 1:N:0:
ATAATTTTGGGTGTTAATACAACAAGGAATCATGCTTTCATATTTGAAAAAATATAGATTAATTATAAAAAATACATTTAATTTGATATTAATGTATAAAA
+
@C@DFDFFHHFACFEHIIIHHIIIJIIJGGHIJJJJJJIIJJJJIIEHCBDFGGEGEIICGCCHGIEDHIHEB:??CBEE;AEEC@A>CCDDDCDADD@D>
@HTKZQN1:329:C09BUACXX:6:1101:1735:2058 1:N:0:
GTGAAGTATAGTAGTTCCATAGGGAATATAGTTAACAAAACACATAAAATCTATAAACTTCAATTTTTCTAGAGCAATAATGTCCCCTTGCAAAAATAAGT
+
@CCFFFFFHHHHHJJJJJHIIJJJGIJJJJJGIIJIJJJJIJJIJJJIJJJJJIIHIJIIJJGIJJJIJJIGHHIHIICHIGHIHHHFFFFFFF@AECEC3
@HTKZQN1:329:C09BUACXX:6:1101:1701:2060 2:N:0:
CTTTGCCATTTAATTCATAAACTGCATCATCAGCATCCCTGTAGTCATCAAATTCCACAAATCCAAAACCATTTTTAATAAGAATCTCTCGTATTTTCCCA
+
CCCFFFFFHGHHHJJJIJJIJJJJFHGJJJJJJJJJGIJJIJJJIIJIJJHIJJJJIIHIJJJJJJJJJCEHIJJJJJJJGHHHHHFFFFFCDEEEEEDDD
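For headers like the ones above (read name, a space, then the mate number 1 or 2), sorting by name and then by mate number would put the two reads of each pair next to each other. Below is a minimal Python sketch of that idea, not a tested tool. It assumes four-line records with no wrapped sequence lines, and a file small enough to sort in memory. The function names are made up for illustration.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str, str, str]  # header, sequence, "+" line, quality


def read_records(lines: Iterable[str]) -> Iterator[Record]:
    """Group FASTQ lines into 4-line records (newlines stripped)."""
    stripped = (line.rstrip("\n") for line in lines)
    # Zipping one iterator with itself yields consecutive groups of four.
    return zip(stripped, stripped, stripped, stripped)


def sort_key(record: Record) -> Tuple[str, str]:
    """'@NAME 2:N:0:' -> ('NAME', '2'): read name first, then mate number."""
    name, _, tag = record[0][1:].partition(" ")
    return name, tag[:1]


def sort_fastq(lines: Iterable[str]) -> List[Record]:
    """Return all records sorted so that mates 1 and 2 of a pair are adjacent."""
    return sorted(read_records(lines), key=sort_key)
```

Because records are grouped strictly by line count, a quality line that happens to start with "@" (as in the excerpt) is not mistaken for a header. For a file too large for memory, the same key could be used with an external sort instead.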
Code:
#!/usr/local/bin/perl -w
# Daniel Brami
# Util to split interlaced FASTQ files into pairs
# Note: assumes records strictly alternate mate 1, mate 2, mate 1, ...
use strict;
# Standard lib
use IO::File;
use File::Basename;

my $INPUT = shift;
if (!defined($INPUT) || $INPUT =~ /^-/) {
    die "Usage: $0 <interleaved paired FASTQ file>\n";
}

# $name keeps its trailing dot, e.g. "reads." for "reads.fastq"
my ($name, $path, $suffix) = fileparse($INPUT, qw/fastq FASTQ txt TXT/);
my $OUT1 = $name . "split1.$suffix";
my $OUT2 = $name . "split2.$suffix";
my $FH_IN   = IO::File->new($INPUT, "r") or die "could not open $INPUT: $!\n";
my $FH_OUT1 = IO::File->new($OUT1, "w") or die "could not open $OUT1 for writing: $!\n";
my $FH_OUT2 = IO::File->new($OUT2, "w") or die "could not open $OUT2 for writing: $!\n";

my ($recs1, $recs2) = (0, 0);
my $flipflop = 1;    # 1: next record goes to file 1; -1: to file 2
my $counter  = 0;    # lines collected for the current record
my $TXT      = '';
while (defined(my $line = $FH_IN->getline())) {
    $TXT .= $line;
    $counter++;
    if ($counter == 4) {    # one FASTQ record is four lines
        if ($flipflop == 1) {
            print $FH_OUT1 $TXT;
            ++$recs1;
        } else {
            print $FH_OUT2 $TXT;
            ++$recs2;
        }
        $counter = 0;
        $TXT = '';
        $flipflop *= -1;
    }
}
$FH_IN->close();
$FH_OUT1->close();
$FH_OUT2->close();

print STDERR "Processed $recs1 records for pair file 1 and $recs2 records for pair file 2.\n";
if ($recs1 != $recs2) {
    print STDERR "The number of processed records does not match - check input data!\n";
}
exit;
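The script above sends records alternately to the two output files, so it only works when mates are already adjacent. One way to avoid sorting altogether might be to remember each read until its mate shows up and then write both at once. Below is a rough Python sketch of that idea; it is not part of the original script. It assumes four-line records, headers of the form "@NAME 1:..." / "@NAME 2:...", and that each name occurs at most once per mate. `split_pairs` and its arguments are hypothetical names.

```python
from typing import Dict, IO, Iterable, Tuple

Record = Tuple[str, str, str, str]  # header, sequence, "+" line, quality


def split_pairs(lines: Iterable[str], out1: IO[str], out2: IO[str],
                orphans: IO[str]) -> Tuple[int, int]:
    """Write mate 1 to out1 and mate 2 to out2, in matching order.

    Returns (number of pairs written, number of reads without a mate).
    """
    pending: Dict[str, Tuple[str, Record]] = {}  # name -> (mate, record)
    pairs = 0
    it = iter(lines)
    # Zipping one iterator with itself yields consecutive 4-line records.
    for record in zip(it, it, it, it):
        # "@NAME 2:N:0:" -> name "NAME", mate "2"
        name, _, tag = record[0][1:].rstrip("\n").partition(" ")
        mate = tag[:1]
        if name not in pending:
            pending[name] = (mate, record)  # wait for the other mate
            continue
        other_mate, other = pending.pop(name)
        first, second = (record, other) if mate < other_mate else (other, record)
        out1.writelines(first)
        out2.writelines(second)
        pairs += 1
    # Whatever is left never found its mate (e.g. removed by filtering).
    for _, record in pending.values():
        orphans.writelines(record)
    return pairs, len(pending)
```

With real data this would be called with the shuffled file opened for reading and three files opened for writing; the two pair files could then be passed to Bowtie with -1 and -2. Memory use depends on how far apart the mates are: in the worst case about half of the reads are held at once.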