**BAMseek** · 08-30-2011, 07:29 PM

Could you post the first few lines of the file? Is there something in the description line that indicates where to split them? Or are paired reads listed one right after the other?

**Balat** · 08-30-2011, 07:39 PM

The paired reads are listed as first mate read followed by second mate read.

@HWI-ST945:92:d059facxx:8:1101:1567:2217 1:N:0:TGACCA
@HWI-ST945:92:d059facxx:8:1101:1567:2217 2:N:0:TGACCA

**ocs** · 08-31-2011, 03:36 AM

Assuming you have the standard FASTQ format with quality scores:

Code:

@test1.1
acgt
+test1.1
1234
@test1.2
acgt
+test1.2
1234

Then this should work. It's quick and dirty, and there may be more sophisticated solutions, but nevertheless:

Code:

sed -ne '1~8H;2~8H;3~8H;4~8H;${g;s/^\n//;p}' y.fastq > y_1.fastq
sed -ne '5~8H;6~8H;7~8H;8~8H;${g;s/^\n//;p}' y.fastq > y_2.fastq

If the file only contains the header lines you posted, it's simpler:

Code:

sed -ne '1~2p' x.fastq > x_1.fastq
sed -ne '2~2p' x.fastq > x_2.fastq

Both solutions assume that the reads are consecutive.
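That assumption can be checked before splitting. The following is only a rough sketch: `check_interleaved` is a made-up helper name, and it assumes 4 lines per record and headers where the read name ends at the first space or `/` (as in Balat's example), so it would not suit names like `@test1.1` / `@test1.2`.

```shell
# Sketch: verify that an interleaved FASTQ alternates mate 1 / mate 2.
# check_interleaved is a hypothetical helper, not part of any tool.
check_interleaved() {
    awk '
        NR % 4 == 1 {                      # header line of each record
            split($0, parts, "[ /]")       # read name = text before a space or "/"
            name = parts[1]
            if (++n % 2 == 1) {
                first = name               # remember the name of mate 1
            } else if (name != first) {    # mate 2 must carry the same name
                printf "mismatch at line %d: %s vs %s\n", NR, first, name
                bad = 1
            }
        }
        END {
            if (NR % 8 != 0) { print "line count is not a multiple of 8"; bad = 1 }
            exit bad + 0
        }
    ' "$1"
}
```

Usage: `check_interleaved y.fastq && echo "looks interleaved"`. A non-zero exit status means the sed commands above would put reads in the wrong file.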

**swbarnes2** · 08-31-2011, 08:41 AM

You could also grep for each line that has the 1:N:0 pattern plus the three lines that follow it. You may have to remove the '--' separator lines that grep puts between the groups (though bwa and samtools don't seem to mind them).
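As a rough sketch of that idea (the helper name `split_by_grep` is made up, and the pattern assumes Illumina-style headers like the ones Balat posted):

```shell
# Sketch of the grep approach; split_by_grep is a hypothetical helper.
# -A3 prints each matching header plus the three lines after it.
# grep writes "--" between non-adjacent groups, so a second grep drops those lines.
# The pattern is anchored to "@" and includes the space before the mate number,
# so it cannot match a "+" line that repeats the header or a quality string.
split_by_grep() {
    grep -A3 '^@.* 1:N:' "$1" | grep -v '^--$' > "$2"
    grep -A3 '^@.* 2:N:' "$1" | grep -v '^--$' > "$3"
}
```

Usage: `split_by_grep in.fastq in_1.fastq in_2.fastq`. One caveat: `grep -v '^--$'` would also drop a sequence or quality line that consists of exactly `--`, which should only be possible for 2-base reads.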

**dcfargo** · 08-31-2011, 08:57 AM

With one read per line, alternating between the two mates:

awk '0 == (NR + 1) % 2' infile > end1 &
awk '0 == (NR + 2) % 2' infile > end2 &

**BAMseek** · 08-31-2011, 05:42 PM

Yet another solution. To add to dcfargo's solution, if the file (infile) is indeed in FASTQ format (4 lines per record, as shown by ocs), then this should work too:

awk '0 == ((NR+4) % 8)*((NR+5) % 8)*((NR+6) % 8)*((NR+7) %8)' infile > end1 &
awk '0 == (NR % 8)*((NR+1) % 8)*((NR+2) % 8)*((NR+3) %8)' infile > end2
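The product in those commands is zero exactly when one of its factors is zero, which for the first command means NR mod 8 is 1, 2, 3 or 4, and for the second 5, 6, 7 or 0. The same selection can be written as a range test and done in a single pass. This is just a sketch, and `split_interleaved_awk` is a made-up name:

```shell
# Sketch: one pass over the file, routing lines by position within each pair.
# (NR - 1) % 8 runs 0..7 across two records: 0-3 is mate 1, 4-7 is mate 2.
# split_interleaved_awk is a hypothetical helper name.
split_interleaved_awk() {
    awk -v out1="$2" -v out2="$3" '{
        if ((NR - 1) % 8 < 4) print > out1
        else                  print > out2
    }' "$1"
}
```

Usage: `split_interleaved_awk infile end1 end2`. Like the commands above, it assumes 4 lines per record and consecutive mates.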

**Balat** · 08-31-2011, 05:50 PM

Thank you all. Yes, the file is in FASTQ format with 4 lines per read. I was able to split my FASTQ file using both the sed and the awk commands.

**ocs** · 08-31-2011, 11:48 PM

Just out of curiosity, I tried a bit more with sed and came up with simpler solutions (for those who are interested). My initial solution is quite complicated.

This also works:

Code:

sed -ne '1~8p;2~8p;3~8p;4~8p' x.fastq > x_1.fastq
sed -ne '5~8p;6~8p;7~8p;8~8p' x.fastq > x_2.fastq

Even simpler:

Code:

sed -ne '1~8{N;N;N;p}' x.fastq > x_1.fastq
sed -ne '4~8{N;N;N;p}' x.fastq > x_2.fastq

Also nice to see some awk solutions! Always exciting to see how things work in awk.

**robp** · 08-23-2013, 10:07 AM

That's a very concise solution! However, I think that the commands should be:

Code:

sed -ne '1~8{N;N;N;p}' x.fastq > x_1.fastq
sed -ne '5~8{N;N;N;p}' x.fastq > x_2.fastq

Here I've replaced the 4 with a 5 in the second command. sed counts lines from 1, so line 4 is the line at offset 3 from the first header, i.e. the quality line of the first mate. The header for the second mate of the pair is line 5.
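A quick way to see the difference is to run both addresses on numbered lines instead of a real file. This sketch assumes GNU sed, since the `first~step` address form is a GNU extension:

```shell
# Lines 1-8 stand in for the first read pair, 9-16 for the second.
from4=$(seq 1 16 | sed -ne '4~8{N;N;N;p}' | paste -sd' ' -)
from5=$(seq 1 16 | sed -ne '5~8{N;N;N;p}' | paste -sd' ' -)
echo "$from4"   # 4 5 6 7 12 13 14 15  (starts on the first mate's quality line)
echo "$from5"   # 5 6 7 8 13 14 15 16  (the complete second record of each pair)
```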

**skycreative** · 06-20-2016, 06:11 PM

It is so helpful and effective! Great thanks!

**tahia** · 09-22-2016, 07:55 AM

I think grep is the easy option if read1 and read2 are not consecutive:

grep -A3 -P "1:N:" --no-group-separator in.fastq >in_1.fastq
grep -A3 -P "2:N:" --no-group-separator in.fastq >in_2.fastq

Adjust the pattern to match however your read names mark the mate (/1, _1 or 1:N:#:#).
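An awk variant of the same idea looks only at header lines, so a quality string that happens to contain the pattern cannot trigger a match. This is a sketch under stated assumptions: 4 lines per record, Illumina-style headers where the second whitespace-separated field starts with `1:` or `2:`, and `split_by_header` is a made-up name.

```shell
# Sketch: route each 4-line record by its header, so mates need not be adjacent.
# split_by_header is a hypothetical helper name.
split_by_header() {
    awk -v out1="$2" -v out2="$3" '
        NR % 4 == 1 {                        # header line: decide where this record goes
            if      ($2 ~ /^1:/) dest = out1
            else if ($2 ~ /^2:/) dest = out2
            else                 dest = ""   # unrecognised header: skip the record
        }
        dest != "" { print > dest }
    ' "$1"
}
```

Usage: `split_by_header in.fastq in_1.fastq in_2.fastq`. For `/1` or `_1` style names the two header tests would need to be changed accordingly.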

Best,

split fastq file
