  • Balat
    Member
    • May 2010
    • 36

    split fastq file

    Hi,
    I have a single fastq file containing both mates of my paired-end reads. I would like to split this file into two files, each containing one mate of every pair. I have looked into Galaxy, but it needs the read pairs of equal size.

    Does anyone have a script for splitting a fastq file?

    Thank you.
  • BAMseek
    Senior Member
    • Apr 2011
    • 124

    #2
    Could you post the first few lines of the file? Is there something in the description line that indicates where to split them? Or are paired reads listed one right after the other?

    • Balat
      Member
      • May 2010
      • 36

      #3
      The paired reads are listed as first mate read followed by second mate read.

      @HWI-ST945:92:d059facxx:8:1101:1567:2217 1:N:0:TGACCA
      @HWI-ST945:92:d059facxx:8:1101:1567:2217 2:N:0:TGACCA

      • ocs
        Member
        • May 2011
        • 27

        #4
        Assuming you have the standard fastq file format with quality scores
        Code:
        @test1.1
        acgt
        +test1.1
        1234
        @test1.2
        acgt
        +test1.2
        1234
        Then this should work. It's quick and dirty and there may be more sophisticated solutions, but nevertheless:
        Code:
        sed -ne '1~8H;2~8H;3~8H;4~8H;${g;s/^\n//;p}' y.fastq > y_1.fastq
        sed -ne '5~8H;6~8H;7~8H;8~8H;${g;s/^\n//;p}' y.fastq > y_2.fastq
        If you only have the header lines, as you have stated, it's simpler:
        Code:
        sed -ne '1~2p' x.fastq > x_1.fastq
        sed -ne '2~2p' x.fastq > x_2.fastq
        Both solutions assume that the reads are consecutive.
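
Since both solutions rely on the mates being consecutive, it may be worth checking that assumption before splitting. A minimal sketch with portable awk, assuming 4-line records and headers where the read name is the first space-separated field (as in post #3). The file name sample.fastq and its two toy read pairs are made up for illustration:

```shell
# Toy interleaved file (hypothetical data): two pairs, 4 lines per record.
printf '%s\n' \
  '@r1 1:N:0:TGACCA' 'ACGT' '+' 'IIII' \
  '@r1 2:N:0:TGACCA' 'TGCA' '+' 'IIII' \
  '@r2 1:N:0:TGACCA' 'GGCC' '+' 'IIII' \
  '@r2 2:N:0:TGACCA' 'CCGG' '+' 'IIII' > sample.fastq

# Remember the read name on lines 1, 9, 17, ... and compare it with the
# header four lines later; also require a whole number of 8-line pairs.
status=$(awk 'NR % 8 == 1 { name = $1 }
              NR % 8 == 5 && $1 != name { bad = 1 }
              END { if (NR % 8 != 0) bad = 1
                    print (bad ? "not interleaved" : "interleaved") }' sample.fastq)
echo "$status"
```

If this reports "not interleaved", the line-counting sed commands above would mix up the mates, and a pattern-based split is probably the safer route.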

        • swbarnes2
          Senior Member
          • May 2008
          • 910

          #5
          You could also grep for each line that has the 1:N:0 pattern plus the three lines that follow it (grep -A 3). You may then have to get rid of the '--' separators that grep puts between the groups (though bwa and samtools don't seem to mind them).
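
The grep idea could be sketched as follows. This is only an illustration under assumptions: the file names and the toy reads are made up, and the pattern starts with a space so that it cannot match a quality line (a space is not a valid quality character).

```shell
# Toy interleaved file (hypothetical data): two pairs, 4 lines per record.
printf '%s\n' \
  '@r1 1:N:0:TGACCA' 'ACGT' '+' 'IIII' \
  '@r1 2:N:0:TGACCA' 'TGCA' '+' 'IIII' \
  '@r2 1:N:0:TGACCA' 'GGCC' '+' 'IIII' \
  '@r2 2:N:0:TGACCA' 'CCGG' '+' 'IIII' > in.fastq

# -A 3 prints each matching header plus the 3 lines after it; the second
# grep drops the "--" separators printed between non-adjacent groups.
# Caveat: a 2-base read whose quality line is exactly "--" would be dropped too.
grep -A 3 ' 1:N:' in.fastq | grep -v '^--$' > in_1.fastq
grep -A 3 ' 2:N:' in.fastq | grep -v '^--$' > in_2.fastq
```

Unlike the line-counting approaches, this does not depend on the two mates being adjacent in the input.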

          • dcfargo
            Member
            • Aug 2008
            • 22

            #6
            With one read per line, alternating between the two mates:

            awk '0 == (NR + 1) % 2' infile > end1 &
            awk '0 == (NR + 2) % 2' infile > end2 &
            Last edited by dcfargo; 08-31-2011, 09:03 AM.

            • BAMseek
              Senior Member
              • Apr 2011
              • 124

              #7
              Yet another solution. To add to dcfargo's solution, if the file (infile) is indeed in fastq format (4 lines per record, as shown by ocs), then this should work too

              awk '0 == ((NR+4) % 8)*((NR+5) % 8)*((NR+6) % 8)*((NR+7) %8)' infile > end1 &
              awk '0 == (NR % 8)*((NR+1) % 8)*((NR+2) % 8)*((NR+3) %8)' infile > end2
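
The same split can probably be done in one pass over the file, which avoids reading a large fastq twice. A hedged sketch, assuming strictly interleaved 4-line records; the toy input is made up, and the output names end1 and end2 follow the commands above:

```shell
# Toy interleaved file (hypothetical data): two pairs, 4 lines per record.
printf '%s\n' \
  '@r1 1:N:0:TGACCA' 'ACGT' '+' 'IIII' \
  '@r1 2:N:0:TGACCA' 'TGCA' '+' 'IIII' \
  '@r2 1:N:0:TGACCA' 'GGCC' '+' 'IIII' \
  '@r2 2:N:0:TGACCA' 'CCGG' '+' 'IIII' > infile

# (NR - 1) % 8 is 0..3 for the first mate's record and 4..7 for the second's.
awk '(NR - 1) % 8 < 4 { print > "end1"; next }
     { print > "end2" }' infile
```

The modulo test is just a compact way of writing the same "lines 1-4 versus lines 5-8 of every block of 8" rule as the two products above.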

              • Balat
                Member
                • May 2010
                • 36

                #8
                Thank you all. Yes the file is fastq format with 4 lines per read. I was able to split my fastq file using both sed and awk commands.

                • ocs
                  Member
                  • May 2011
                  • 27

                  #9
                  Just out of curiosity I tried a bit more with sed and came up with simpler solutions (for those who are interested). My initial solution is quite complicated.

                  This also works:
                  Code:
                  sed -ne '1~8p;2~8p;3~8p;4~8p' x.fastq > x_1.fastq
                  sed -ne '5~8p;6~8p;7~8p;8~8p' x.fastq > x_2.fastq
                  Even simpler:
                  Code:
                  sed -ne '1~8{N;N;N;p}' x.fastq > x_1.fastq
                  sed -ne '4~8{N;N;N;p}' x.fastq > x_2.fastq
                  Also nice to see some awk solutions! Always exciting to see how things work in awk.

                  • robp
                    Member
                    • Aug 2013
                    • 13

                    #10
                    That's a very concise solution! However, I think that the commands should be:

                    Code:
                    sed -ne '1~8{N;N;N;p}' x.fastq > x_1.fastq
                    sed -ne '5~8{N;N;N;p}' x.fastq > x_2.fastq
                    Here I've replaced the 4 in the second command with a 5. sed counts lines from 1, so line 4 is the line at offset 3 (the quality line of the first mate), not the header of the second mate; that header is line 5.
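
To see where the headers actually fall, one could list the line number of every header line. A small sketch with a made-up interleaved file (x.fastq here is toy data, not the original poster's file):

```shell
# Toy interleaved file (hypothetical data): two pairs, 4 lines per record.
printf '%s\n' \
  '@r1 1:N:0:TGACCA' 'ACGT' '+' 'IIII' \
  '@r1 2:N:0:TGACCA' 'TGCA' '+' 'IIII' \
  '@r2 1:N:0:TGACCA' 'GGCC' '+' 'IIII' \
  '@r2 2:N:0:TGACCA' 'CCGG' '+' 'IIII' > x.fastq

# Print the line number and the mate field of every header line.
headers=$(awk 'NR % 4 == 1 { print NR, $2 }' x.fastq)
echo "$headers"
# 1 1:N:0:TGACCA
# 5 2:N:0:TGACCA
# 9 1:N:0:TGACCA
# 13 2:N:0:TGACCA
```

The second-mate headers sit on lines 5, 13, and so on, which is why the sed address has to be 5~8.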

                    • skycreative
                      Member
                      • Jan 2010
                      • 33

                      #11
                      It is so helpful and effective! Great thanks!

                      • tahia
                        Junior Member
                        • Aug 2010
                        • 2

                        #12
                        I think grep is the easy option if read1 and read2 are not consecutive:

                        grep -A3 -P "1:N:" --no-group-separator in.fastq >in_1.fastq
                        grep -A3 -P "2:N:" --no-group-separator in.fastq >in_2.fastq

                        You can adapt the pattern to match however your read names mark the mate (/1, _1 or 1:N:#:#).

                        Best,
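
When the mates are not consecutive in the input, the two output files are not guaranteed to list the pairs in the same order, so a quick sync check may be useful before aligning. A rough sketch, assuming the read name is the first space-separated field of the header (names ending in /1 or /2 would need that suffix stripped first). The two toy files below stand in for real split output:

```shell
# Toy split outputs (hypothetical data): two reads per file.
printf '%s\n' '@r1 1:N:0:TGACCA' 'ACGT' '+' 'IIII' \
              '@r2 1:N:0:TGACCA' 'GGCC' '+' 'IIII' > in_1.fastq
printf '%s\n' '@r1 2:N:0:TGACCA' 'TGCA' '+' 'IIII' \
              '@r2 2:N:0:TGACCA' 'CCGG' '+' 'IIII' > in_2.fastq

# Collect the read names of file 1, then compare them one by one,
# in order, with the read names of file 2.
result=$(awk 'FNR % 4 != 1 { next }
              NR == FNR { name[++n] = $1; next }
              $1 != name[++m] { bad = 1 }
              END { print ((bad || n != m) ? "out of sync" : "in sync") }' in_1.fastq in_2.fastq)
echo "$result"
```

This keeps every read name of the first file in memory, so for very large files a streaming comparison would likely be preferable.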
