  • split fastq file

    Hi,
    I have a single fastq file with both mate pairs of paired end reads. I would like to split this file into two files each containing one of the two pairs. I have looked into Galaxy, but it needs the read pairs of equal size.

    Does anyone have a script for splitting a fastq file?

    Thank you.

  • #2
    Could you post the first few lines of the file? Is there something in the description line that indicates where to split them? Or are paired reads listed one right after the other?



    • #3
      The paired reads are listed as first mate read followed by second mate read.

      @HWI-ST945:92:d059facxx:8:1101:1567:2217 1:N:0:TGACCA
      @HWI-ST945:92:d059facxx:8:1101:1567:2217 2:N:0:TGACCA



      • #4
        Assuming you have the standard fastq file format with quality scores
        Code:
        @test1.1
        acgt
        +test1.1
        1234
        @test1.2
        acgt
        +test1.2
        1234
        Then this should work. It's quick and dirty and there may be more sophisticated solutions, but nevertheless:
        Code:
        sed -ne '1~8H;2~8H;3~8H;4~8H;${g;s/^\n//;p}' y.fastq > y_1.fastq
        sed -ne '5~8H;6~8H;7~8H;8~8H;${g;s/^\n//;p}' y.fastq > y_2.fastq
        If you only have single lines as you have stated, it's simpler:
        Code:
        sed -ne '1~2p' x.fastq > x_1.fastq
        sed -ne '2~2p' x.fastq > x_2.fastq
        Both solutions assume that the reads are consecutive.



        • #5
          You could also do a grep for the line and the three lines following the lines that have the 1:N:0 pattern. But you may have to get rid of the '--' that'll be put in there (though bwa and samtools don't seem to mind them)
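A minimal sketch of the grep idea described above, assuming 4-line records, Illumina-style headers like the ones shown in #3, and a grep that supports `-A` (GNU and BSD grep do). The function name `extract_mate` is hypothetical, used only for illustration.

```shell
# extract_mate MATE_NUMBER INPUT.fastq > OUTPUT.fastq
# Hypothetical helper. Assumes 4-line records and headers such as
#   @HWI-ST945:92:d059facxx:8:1101:1567:2217 1:N:0:TGACCA
extract_mate() {
    # -A3 prints every matching header plus the three lines after it.
    # The pattern is anchored to "@" and requires a space before the mate
    # number; quality strings never contain spaces, so only headers match.
    # grep writes "--" between non-adjacent groups, and the second grep
    # drops every line that consists of exactly "--".
    grep -A3 "^@.* ${1}:N:0" "$2" | grep -v '^--$'
}
```

It would be called as `extract_mate 1 in.fastq > in_1.fastq` and `extract_mate 2 in.fastq > in_2.fastq`. One caveat: the second grep would also drop a genuine quality line that is exactly `--`, which can only occur for a 2-base read.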



          • #6
            With one read per line, taking every other line:

            awk '0 == (NR + 1) % 2' infile > end1 &
            awk '0 == (NR + 2) % 2' infile > end2 &
            Last edited by dcfargo; 08-31-2011, 09:03 AM.



            • #7
              Yet another solution. To add to dcfargo's solution, if the file (infile) is indeed in fastq format (4 lines per record, as shown by ocs), then this should work too:

              awk '0 == ((NR+4) % 8)*((NR+5) % 8)*((NR+6) % 8)*((NR+7) %8)' infile > end1 &
              awk '0 == (NR % 8)*((NR+1) % 8)*((NR+2) % 8)*((NR+3) %8)' infile > end2
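The product-of-remainders test above selects lines by their position inside each block of 8 lines. The same selection can be written as a single comparison, and both outputs can be produced in one pass. This is only a sketch, assuming every record is exactly 4 lines and the two mates of a pair are stored consecutively; `split_interleaved` and its argument order are hypothetical.

```shell
# split_interleaved INPUT.fastq OUT1.fastq OUT2.fastq
# Hypothetical helper. Assumes 4-line records, mate 1 directly before mate 2.
split_interleaved() {
    # (NR - 1) % 8 is the 0-based position of a line inside each 8-line pair:
    # positions 0-3 belong to mate 1, positions 4-7 belong to mate 2.
    awk -v out1="$2" -v out2="$3" '
        (NR - 1) % 8 < 4 { print > out1; next }
                         { print > out2 }
    ' "$1"
}
```

For example, `split_interleaved infile end1 end2`. Unlike the `first~step` addresses used in the sed commands, which are a GNU sed extension, this relies only on standard awk features.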



              • #8
                Thank you all. Yes, the file is in fastq format with 4 lines per read. I was able to split my fastq file using both the sed and the awk commands.



                • #9
                  Just out of curiosity, I experimented a bit more with sed and came up with simpler solutions (for those who are interested). My initial solution is quite complicated.

                  This also works:
                  Code:
                  sed -ne '1~8p;2~8p;3~8p;4~8p' x.fastq > x_1.fastq
                  sed -ne '5~8p;6~8p;7~8p;8~8p' x.fastq > x_2.fastq
                  Even simpler:
                  Code:
                  sed -ne '1~8{N;N;N;p}' x.fastq > x_1.fastq
                  sed -ne '4~8{N;N;N;p}' x.fastq > x_2.fastq
                  Also nice to see some awk solutions! Always exciting to see how things work in awk.



                  • #10
                    That's a very concise solution! However, I think that the commands should be:

                    Code:
                    sed -ne '1~8{N;N;N;p}' x.fastq > x_1.fastq
                    sed -ne '5~8{N;N;N;p}' x.fastq > x_2.fastq
                    where, for the second command, I've replaced the 4 with a 5. This is because sed counts lines from 1, so line 4 is the line at offset 3 (the quality line of the first mate), not the header of the second mate, which is line 5.
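One way to catch an off-by-one address like the one corrected above is to compare the read names in the two output files. A rough sketch, assuming non-empty files with 4-line records and headers whose first whitespace-delimited field is identical for both mates (as in the example headers in #3); `check_pairs` is a hypothetical helper, not part of any tool mentioned in this thread.

```shell
# check_pairs MATE1.fastq MATE2.fastq
# Hypothetical helper. Exits 0 if both files have the same number of lines
# and the read names (first field of each header line) agree record by record.
check_pairs() {
    awk '
        FNR % 4 == 1 {
            if (NR == FNR) name[FNR] = $1        # first file: remember the name
            else if (name[FNR] != $1) bad++      # second file: compare names
        }
        NR == FNR { n1 = FNR }                   # running line count of file 1
        END {
            n2 = NR - n1                         # line count of file 2
            if (bad > 0 || n1 != n2) exit 1
        }
    ' "$1" "$2"
}
```

Usage: `check_pairs x_1.fastq x_2.fastq && echo "pairs match"`. With the `4~8` address, the second file starts on a quality line, so the first name comparison already fails.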



                    • #11
                      It is so helpful and effective! Great thanks!


                      • #12
                        I think grep will be easy if you don't have consecutive read1 and read2

                        grep -A3 -P "1:N:" --no-group-separator in.fastq >in_1.fastq
                        grep -A3 -P "2:N:" --no-group-separator in.fastq >in_2.fastq

                        You can adjust the pattern to match the way your read names mark the mate (/1, _1 or 1:N:#:#).

                        Best,
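For reads that are not consecutive, the same header-based idea can be done in one pass with awk, which looks only at header lines and therefore cannot match a quality string by accident. This is a sketch under stated assumptions: 4-line records, and mate 1 marked by a header ending in `/1` or `_1` or carrying a ` 1:N:` style field (`Y` is also accepted, since that position is the Illumina filter flag). The name `split_by_header` is hypothetical.

```shell
# split_by_header INPUT.fastq OUT1.fastq OUT2.fastq
# Hypothetical helper. Assumes 4-line records. A record goes to OUT1 when its
# header ends in /1 or _1, or contains " 1:N:" / " 1:Y:"; all other records
# go to OUT2.
split_by_header() {
    awk -v out1="$2" -v out2="$3" '
        NR % 4 == 1 {
            # Decide once per record, on the header line only.
            dest = ($0 ~ /(\/1|_1)$/ || $0 ~ / 1:[YN]:/) ? out1 : out2
        }
        { print > dest }    # the other three lines follow their header
    ' "$1"
}
```

Usage: `split_by_header in.fastq in_1.fastq in_2.fastq`. Records keep their input order within each output file, so the two files only line up pair by pair if both mates appear in the same relative order in the input.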

