
  • split fastq file

    Hi,
    I have a single fastq file containing both mates of paired-end reads. I would like to split this file into two files, each containing one mate of every pair. I have looked into Galaxy, but it needs the read pairs to be of equal size.

    Does anyone have a script for splitting a fastq file?

    Thank you.

  • #2
    Could you post the first few lines of the file? Is there something in the description line that indicates where to split them? Or are paired reads listed one right after the other?



    • #3
      The paired reads are listed one after the other: the first mate, followed by the second mate.

      @HWI-ST945:92:d059facxx:8:1101:1567:2217 1:N:0:TGACCA
      @HWI-ST945:92:d059facxx:8:1101:1567:2217 2:N:0:TGACCA



      • #4
        Assuming you have the standard fastq file format with quality scores
        Code:
        @test1.1
        acgt
        +test1.1
        1234
        @test1.2
        acgt
        +test1.2
        1234
        Then this should work. It's quick and dirty and there may be more sophisticated solutions, but nevertheless:
        Code:
        sed -ne '1~8H;2~8H;3~8H;4~8H;${g;s/^\n//;p}' y.fastq > y_1.fastq
        sed -ne '5~8H;6~8H;7~8H;8~8H;${g;s/^\n//;p}' y.fastq > y_2.fastq
        If you only have the header lines, as you have posted, it's simpler:
        Code:
        sed -ne '1~2p' x.fastq > x_1.fastq
        sed -ne '2~2p' x.fastq > x_2.fastq
        Both solutions assume that the reads are consecutive.
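
Since everything above depends on the mates being consecutive, a quick check before splitting may be worthwhile. The following is only a sketch: it assumes 4-line records and headers like those in post #3, where the read name is the first whitespace-separated field and the mate number is in the second. The demo input and its read names (r1, r2) are made up.

```shell
# Hypothetical demo input: two interleaved read pairs (names are made up).
tmp=$(mktemp -d)
printf '%s\n' \
  '@r1 1:N:0:TGACCA' 'ACGT' '+' 'IIII' \
  '@r1 2:N:0:TGACCA' 'TGCA' '+' 'IIII' \
  '@r2 1:N:0:TGACCA' 'GGCC' '+' 'IIII' \
  '@r2 2:N:0:TGACCA' 'CCGG' '+' 'IIII' > "$tmp/x.fastq"

# Count 8-line blocks whose two header lines (lines 1 and 5 of the block)
# carry different read names; 0 means the mates look properly interleaved.
mismatches=$(awk 'NR % 8 == 1 { name = $1 }
                  NR % 8 == 5 && $1 != name { bad++ }
                  END { print bad + 0 }' "$tmp/x.fastq")

# The total line count should also be a multiple of 8.
lines=$(awk 'END { print NR }' "$tmp/x.fastq")
echo "lines=$lines mismatched_pairs=$mismatches"
```

With older read names that end in /1 and /2, the first field differs between mates, so the comparison would need to strip that suffix first.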



        • #5
          You could also grep for the header lines that match the 1:N:0 pattern and print each match together with the three lines that follow it. You may have to remove the '--' separators that grep inserts between matches (though bwa and samtools don't seem to mind them).



          • #6
            With one read per line and the mates on alternating lines:

            awk '0 == (NR + 1) % 2' infile > end1 &
            awk '0 == (NR + 2) % 2' infile > end2 &
            Last edited by dcfargo; 08-31-2011, 09:03 AM.



            • #7
              Yet another solution. To add to dcfargo's solution, if the file (infile) is indeed in fastq format (4 lines per record, as shown by ocs), then this should work too:

              awk '0 == ((NR+4) % 8)*((NR+5) % 8)*((NR+6) % 8)*((NR+7) %8)' infile > end1 &
              awk '0 == (NR % 8)*((NR+1) % 8)*((NR+2) % 8)*((NR+3) %8)' infile > end2
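
The same split can probably be done in a single pass over the file by letting awk choose the output file per line. This is a sketch under the same assumption as above (4 lines per record, mates consecutive); the variable names a and b and the demo file are made up for illustration.

```shell
# Hypothetical demo input: one interleaved pair, as in post #4.
tmp=$(mktemp -d)
printf '%s\n' '@test1.1' 'acgt' '+test1.1' '1234' \
              '@test1.2' 'tgca' '+test1.2' '4321' > "$tmp/infile"

# Lines 1-4 of every 8-line block go to end1, lines 5-8 go to end2.
awk -v a="$tmp/end1" -v b="$tmp/end2" \
    '{ print > (((NR - 1) % 8 < 4) ? a : b) }' "$tmp/infile"

end1_header=$(head -n 1 "$tmp/end1")
end2_header=$(head -n 1 "$tmp/end2")
end1_lines=$(awk 'END { print NR }' "$tmp/end1")
end2_lines=$(awk 'END { print NR }' "$tmp/end2")
echo "end1 starts with $end1_header, end2 starts with $end2_header"
```

Reading the input once instead of twice mostly matters for large files; for small ones the two-command version is just as good.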



              • #8
                Thank you all. Yes, the file is in fastq format with 4 lines per read. I was able to split my fastq file using both the sed and the awk commands.



                • #9
                  Just out of curiosity I tried a bit more with sed and came up with simpler solutions (for those who are interested). My initial solution is quite complicated.

                  This also works:
                  Code:
                  sed -ne '1~8p;2~8p;3~8p;4~8p' x.fastq > x_1.fastq
                  sed -ne '5~8p;6~8p;7~8p;8~8p' x.fastq > x_2.fastq
                  Even simpler:
                  Code:
                  sed -ne '1~8{N;N;N;p}' x.fastq > x_1.fastq
                  sed -ne '4~8{N;N;N;p}' x.fastq > x_2.fastq
                  Also nice to see some awk solutions! Always exciting to see how things work in awk.



                  • #10
                    That's a very concise solution! However, I think that the commands should be:

                    Code:
                    sed -ne '1~8{N;N;N;p}' x.fastq > x_1.fastq
                    sed -ne '5~8{N;N;N;p}' x.fastq > x_2.fastq
                    For the second command I've replaced the 4 with a 5. sed numbers lines from 1, so line 4 of each 8-line block is the quality line of the first mate, not the header of the second mate. That header is line 5.
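
To see the difference between the two addresses, it may help to print only the line each one matches on a small file. The sketch below assumes GNU sed, since the first~step address form is a GNU extension; the demo file reuses the made-up records from post #4.

```shell
# Hypothetical demo input: one interleaved pair (8 lines).
tmp=$(mktemp -d)
printf '%s\n' '@test1.1' 'acgt' '+test1.1' '1234' \
              '@test1.2' 'tgca' '+test1.2' '4321' > "$tmp/x.fastq"

# first~step (GNU sed): match line `first`, then every `step`-th line after it.
start4=$(sed -n '4~8p' "$tmp/x.fastq")   # line 4 of each 8-line block
start5=$(sed -n '5~8p' "$tmp/x.fastq")   # line 5 of each 8-line block
echo "4~8 starts at: $start4"
echo "5~8 starts at: $start5"
```

On this input the 4~8 address lands on the quality line 1234, while 5~8 lands on the @test1.2 header, which is where the N;N;N;p block has to start.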



                    • #11
                      It is so helpful and effective! Great thanks!



                      • #12
                        I think grep is the easier option if read 1 and read 2 are not consecutive:

                        grep -A3 -P "1:N:" --no-group-separator in.fastq >in_1.fastq
                        grep -A3 -P "2:N:" --no-group-separator in.fastq >in_2.fastq

                        You can adjust the pattern to match however your read names mark the mate (/1, _1 or 1:N:#:#).

                        Best,
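
A variant of the same idea in awk, which only tests the header line of each record, so sequence and quality lines can never match the pattern by accident. This is a sketch, assuming 4-line records and headers where the mate number starts the second field, as in post #3; the variable names r1 and r2 and the demo reads are made up.

```shell
# Hypothetical demo input: reads are NOT interleaved
# (both read-1 records first, then both read-2 records).
tmp=$(mktemp -d)
printf '%s\n' \
  '@r1 1:N:0:TGACCA' 'ACGT' '+' 'IIII' \
  '@r2 1:N:0:TGACCA' 'GGCC' '+' 'IIII' \
  '@r1 2:N:0:TGACCA' 'TGCA' '+' 'IIII' \
  '@r2 2:N:0:TGACCA' 'CCGG' '+' 'IIII' > "$tmp/in.fastq"

awk -v r1="$tmp/in_1.fastq" -v r2="$tmp/in_2.fastq" '
  NR % 4 == 1 { out = ($2 ~ /^1:/) ? r1 : r2 }  # header line picks the target file
  { print > out }                               # the other three lines follow it
' "$tmp/in.fastq"

n1=$(awk 'END { print NR / 4 }' "$tmp/in_1.fastq")
n2=$(awk 'END { print NR / 4 }' "$tmp/in_2.fastq")
first2=$(head -n 1 "$tmp/in_2.fastq")
echo "read1 records: $n1, read2 records: $n2"
```

The two output files keep the input order of their records, so the mates only line up if both were in the same relative order to begin with.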
