  • split fastq file

    Hi,
    I have a single fastq file containing both mates of paired-end reads. I would like to split this file into two files, each containing one mate of every pair. I have looked into Galaxy, but it needs the read pairs to be of equal size.

    Does anyone have a script for splitting a fastq file?

    Thank you.

  • #2
    Could you post the first few lines of the file? Is there something in the description line that indicates where to split them? Or are paired reads listed one right after the other?


    • #3
      The paired reads are listed as first mate read followed by second mate read.

      @HWI-ST945:92:d059facxx:8:1101:1567:2217 1:N:0:TGACCA
      @HWI-ST945:92:d059facxx:8:1101:1567:2217 2:N:0:TGACCA


      • #4
        Assuming you have the standard fastq file format with quality scores
        Code:
        @test1.1
        acgt
        +test1.1
        1234
        @test1.2
        acgt
        +test1.2
        1234
        Then this should work. It's quick and dirty and there may be more sophisticated solutions, but nevertheless:
        Code:
        sed -ne '1~8H;2~8H;3~8H;4~8H;${g;s/^\n//;p}' y.fastq > y_1.fastq
        sed -ne '5~8H;6~8H;7~8H;8~8H;${g;s/^\n//;p}' y.fastq > y_2.fastq
        When you only have the header lines as you have stated, it's simpler:
        Code:
        sed -ne '1~2p' x.fastq > x_1.fastq
        sed -ne '2~2p' x.fastq > x_2.fastq
        Both solutions assume that the reads are consecutive.
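Since both solutions depend on the mates being consecutive, it may be worth checking that before splitting. A minimal sketch, assuming 4-line records and headers like the ones in #3, where the read ID is the first whitespace-separated field; the toy reads, the temp directory and the `check_interleaved` helper are made up for illustration:

```shell
# Toy interleaved file (made-up reads) just to exercise the check.
tmp=$(mktemp -d)
printf '%s\n' \
  '@r1 1:N:0:TGACCA' 'ACGT' '+' 'IIII' \
  '@r1 2:N:0:TGACCA' 'TGCA' '+' 'IIII' \
  '@r2 1:N:0:TGACCA' 'GGCC' '+' 'IIII' \
  '@r2 2:N:0:TGACCA' 'CCGG' '+' 'IIII' > "$tmp/toy.fastq"

# Hypothetical helper: remember the read ID of each header at lines
# 1, 9, 17, ... and compare it with the header four lines later.
# Prints the number of problems found (0 means the file looks interleaved).
check_interleaved() {
  awk 'NR % 8 == 1 { id = $1 }
       NR % 8 == 5 && $1 != id { bad++ }
       END { if (NR % 8 != 0) bad++; print bad + 0 }' "$1"
}

mismatches=$(check_interleaved "$tmp/toy.fastq")
echo "mismatched pairs: $mismatches"
```

If the count is not 0, the line-number based commands above would put reads in the wrong file, and a header-based approach is probably safer.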


        • #5
          You could also grep for the lines that have the 1:N:0 pattern together with the three lines following each of them. But you may have to get rid of the '--' separator lines that grep puts between the groups (though bwa and samtools don't seem to mind them).
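A rough sketch of that grep idea on a made-up interleaved file (file names and reads are only for illustration). The pattern is anchored to `^@` and includes the space before the mate flag, on the assumption that quality lines never contain a space, so only header lines should match; the second grep drops the `--` separators:

```shell
tmp=$(mktemp -d)
printf '%s\n' \
  '@r1 1:N:0:TGACCA' 'ACGT' '+' 'IIII' \
  '@r1 2:N:0:TGACCA' 'TGCA' '+' 'IIII' \
  '@r2 1:N:0:TGACCA' 'GGCC' '+' 'IIII' \
  '@r2 2:N:0:TGACCA' 'CCGG' '+' 'IIII' > "$tmp/toy.fastq"

# -A3 prints each matching header plus the three lines after it.
# grep writes a "--" line between non-adjacent groups; grep -v removes it.
grep -A3 '^@.* 1:N:0' "$tmp/toy.fastq" | grep -v '^--$' > "$tmp/toy_1.fastq"
grep -A3 '^@.* 2:N:0' "$tmp/toy.fastq" | grep -v '^--$' > "$tmp/toy_2.fastq"

wc -l "$tmp/toy_1.fastq" "$tmp/toy_2.fastq"
```

One caveat: `grep -v '^--$'` would also drop a sequence or quality line that consists of exactly `--`, which should be very rare but is not impossible.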


          • #6
            With one read per line and the mates on alternating lines:

            awk '0 == (NR + 1) % 2' infile > end1 &
            awk '0 == (NR + 2) % 2' infile > end2 &
            Last edited by dcfargo; 08-31-2011, 09:03 AM.


            • #7
              Yet another solution. To add to dcfargo's solution, if the file (infile) is indeed in fastq format (4 lines per record, as shown by ocs), then this should work too:

              awk '0 == ((NR+4) % 8)*((NR+5) % 8)*((NR+6) % 8)*((NR+7) % 8)' infile > end1 &
              awk '0 == (NR % 8)*((NR+1) % 8)*((NR+2) % 8)*((NR+3) % 8)' infile > end2
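The product trick above selects lines 1-4 of every block of 8 for end1 and lines 5-8 for end2. The same selection can probably be done in a single pass by looking at `(NR - 1) % 8` directly. A sketch on a made-up file, where `out1`/`out2` and the file names are just illustrative:

```shell
tmp=$(mktemp -d)
printf '%s\n' \
  '@r1 1:N:0:TGACCA' 'ACGT' '+' 'IIII' \
  '@r1 2:N:0:TGACCA' 'TGCA' '+' 'IIII' \
  '@r2 1:N:0:TGACCA' 'GGCC' '+' 'IIII' \
  '@r2 2:N:0:TGACCA' 'CCGG' '+' 'IIII' > "$tmp/toy.fastq"

# (NR - 1) % 8 is 0..3 for the first mate's record and 4..7 for the
# second mate's record, so one awk process can write both files.
awk -v out1="$tmp/end1.fastq" -v out2="$tmp/end2.fastq" \
  '{ print > ((NR - 1) % 8 < 4 ? out1 : out2) }' "$tmp/toy.fastq"

head -n 1 "$tmp/end1.fastq" "$tmp/end2.fastq"
```

This reads the input once instead of twice, which may matter for large files; like the commands above, it still assumes the mates are strictly consecutive.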


              • #8
                Thank you all. Yes the file is fastq format with 4 lines per read. I was able to split my fastq file using both sed and awk commands.


                • #9
                  Just out of curiosity I tried a bit more with sed and came up with simpler solutions (for those who are interested). My initial solution is quite complicated.

                  This also works:
                  Code:
                  sed -ne '1~8p;2~8p;3~8p;4~8p' x.fastq > x_1.fastq
                  sed -ne '5~8p;6~8p;7~8p;8~8p' x.fastq > x_2.fastq
                  Even simpler:
                  Code:
                  sed -ne '1~8{N;N;N;p}' x.fastq > x_1.fastq
                  sed -ne '4~8{N;N;N;p}' x.fastq > x_2.fastq
                  Also nice to see some awk solutions! Always exciting to see how things work in awk.


                  • #10
                    That's a very concise solution! However, I think that the commands should be:

                    Code:
                    sed -ne '1~8{N;N;N;p}' x.fastq > x_1.fastq
                    sed -ne '5~8{N;N;N;p}' x.fastq > x_2.fastq
                    Here I've replaced the 4 with a 5 in the second command. This is because sed counts lines from 1, so line 4 is the line at offset 3 (the last line of the first mate's record), which is not the header for the second mate of the pair.
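To see the difference between the two start addresses, it can help to run both on plain line numbers instead of a real fastq file. A small sketch assuming GNU sed (the `first~step` address is a GNU extension) and `seq`; the variable names are just for illustration:

```shell
# Feed the numbers 1..16 through both variants and join the output on one line.
with_4=$(seq 16 | sed -n '4~8{N;N;N;p}' | paste -sd ' ' -)
with_5=$(seq 16 | sed -n '5~8{N;N;N;p}' | paste -sd ' ' -)

echo "4~8 selects lines: $with_4"   # 4 5 6 7 12 13 14 15
echo "5~8 selects lines: $with_5"   # 5 6 7 8 13 14 15 16
```

With 4-line records, lines 5-8 and 13-16 are exactly the second mate's records, while the `4~8` variant starts one line too early in each block.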


                    • #11
                      It is so helpful and effective! Great thanks!


                      • #12
                        I think grep is the easiest option if read1 and read2 are not consecutive:

                        grep -A3 -P "1:N:" --no-group-separator in.fastq >in_1.fastq
                        grep -A3 -P "2:N:" --no-group-separator in.fastq >in_2.fastq

                        You can adjust the pattern to match the way your read names mark the mate (/1, _1 or 1:N:#:#).
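If there is any worry that the pattern could also match a quality line, an alternative is to look only at the header line of each record. A sketch, assuming 4-line records and headers whose second whitespace-separated field starts with `1:` or `2:`; the toy file is made up, its mates are not consecutive, and one quality string is deliberately set to `2:N:` to show that only headers are inspected:

```shell
tmp=$(mktemp -d)
printf '%s\n' \
  '@r1 1:N:0:TGACCA' 'ACGT' '+' 'IIII' \
  '@r2 1:N:0:TGACCA' 'GGCC' '+' '2:N:' \
  '@r1 2:N:0:TGACCA' 'TGCA' '+' 'IIII' \
  '@r2 2:N:0:TGACCA' 'CCGG' '+' 'IIII' > "$tmp/in.fastq"

# On every header line (line 1 of each 4-line record) choose the output
# file from the mate flag, then write all four lines of the record there.
awk -v out1="$tmp/in_1.fastq" -v out2="$tmp/in_2.fastq" '
  NR % 4 == 1 { out = ($2 ~ /^2:/) ? out2 : out1 }
  { print > out }' "$tmp/in.fastq"

grep -c '^@r' "$tmp/in_1.fastq" "$tmp/in_2.fastq"
```

For read names that end in /1 and /2 instead, the test on `$2` would need to be changed accordingly.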

                        Best,
