Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • zlu
    Member
    • Nov 2008
    • 34

    converting consensus fastq to fasta

    I was trying to convert fastq consensus sequneces (generated by the pileup2fq utility of samtools) to a multi-fasta file with the MAQ's fq_all2std.pl fq2fa script but it didn't work. The result is a mix of 60bp sequence and quality lines with the > headers.

    Is there quick way to extract only the sequences from the fastq file?

    Thank you.
  • maubp
    Peter (Biopython etc)
    • Jul 2009
    • 1544

    #2
    Originally posted by zlu View Post
    I was trying to convert fastq consensus sequneces (generated by the pileup2fq utility of samtools) to a multi-fasta file with the MAQ's fq_all2std.pl fq2fa script but it didn't work. The result is a mix of 60bp sequence and quality lines with the > headers.

    Is there quick way to extract only the sequences from the fastq file?

    Thank you.
    I personally use Biopython, e.g.



    Code:
    from Bio import SeqIO
    count = SeqIO.convert("example.fastq", "fastq", "example.fasta", "fasta")
    print "Converted %i reads" % count
    You might also consider BioPerl or EMBOSS seqret, etc.

    However, the problem could be a broken FASTQ file. Could you show use the start of the file (say two reads)?

    Comment

    • zlu
      Member
      • Nov 2008
      • 34

      #3
      Perhaps I didn't explain it clearly. I wasn't trying to convert the Illumina short reads but mapped consensus sequence of a few chromosomes. Each is a few Mbp long, e.g:

      @chr1
      nnnnnnnagatagaaataCACGATGCGAGCAATCAAATTTCATAACATCACCATGAGTTT
      GGTCCGAAGCATGAGTGTTTACAATGTTTGAAtaCCTTATACAGTTCTTATACATACTTT
      ATAAATTATTTCCCaagctgttttgatacactcactaacagaTATTCTATAGAAGGAAAA
      GTTATCCACTTATGCACATTTATAGTTTTCAGAATTGTGGATAATTAGAAATTACACACA
      AAGTTATACTATTTTTAGCAACATATTCACAGGTATTTGACATATAGAGAACTGAAAAAG
      TATAATTGTGTGGATAAGTCGTCCAACTCATGATTTTATAAGGATTTATTTATTGATATT
      TACATAAAAATACTGTGCATAACTAATAAGCAGGATAAAGTTATCCACCGATTGTTATTA
      +
      !!!!!!!????????BBBEEEHKKKKKKKKKKKKNNNNNNQQNNNNNNQWWWW7WZWWWZ
      ZZZ]]]`````````>]]]]]]]ZTQQQTQNNKKKKKKHKKHHHEEEEEEEHHHHHHHHH
      HKKHHHHHHEEEEEBBBBBBBBBBBB?=B@BBBBBB??BBBBEEEEEEEEEEHKNNNNNQ
      QQTTTTWWWWWWWTTWZWWW]`>```c``]`ccc``cfcfliiloouSxxuuuuTollLo
      olliifilloolfif````]]WWWWWTTTTTTTTQQQQQQQNKKHHHHEEEEHEEEHHEE
      EEEEEEEHHHHHHHHHEEEEEEHHHHHEEHKHHHHHHHHHHHEEHHHHHHHHKNNNNKNN
      NQQ@NKNNNQQQTWWWZZZBZZZZ]<`]ZZZZZZZZLZZ``]]]ZZZWTTTQQQNNNNNK

      And using the method described above, this is what I got:

      >chr1nnnnnnnagatagaaataCACGATGCGAGCAATCAAATTTCATAACATCACCATGAGTTT
      >`]]``fGff`````]iifc`````cciiLoxuuuxZ{{xxruuxx{{{~~rrrrrrrrr
      Ooiiff`ZZZWTTTTTTQQNTQTWWWWWW]]]`]]cfifffciiiiiorrxx{{~x~{{x
      >QNNQQQTTQQQQNNNNQZ``cfffolllloollruuuurrroruxxxxx{~~xuroorr
      rrorruxxxroooux{{{xxuuuuuruurrrrollccfcc]ZZB]?]]W7QQTTWWWWWW
      >QTT7TQQTTTWZZZZ]]]]]]ZZZZ]]]]]]`````]]]]]`]`c`]```Z]``fflll
      louxxxxuS{{{{{~~~~~~{{~{{{{xuuroiilroiiollorlloorZ{~~{xxxuuu
      >KKKKKKKHHEB????BBBEEKHHHKKKKNNNNQQQQTZZZZZZ]]````````]]WWW<
      WWWW7TTTTTTTQKKKKKKHHEEEE<BBEHHH=HEEEEEHHHHEEEEHHHHHHKKKKKKK
      >QQQBQQQNNNNKKKKKKQQQQQQTTTTWWZWTTQQQQQQQQTTWZZZZZ]ZZTWWTTTQ
      QQTQQNNNTWWWZZ]]]ZZWWWWWWTTTTQQQEQQQQNNNNNNHHHHEEBEEEEEBBBBB
      >QQQATTTTTTTWWTTTQQQQQQQNNKKKQQQTTTT7TTQQ5QNNNQKNNQQQQQQQQQQ
      QQN4HKKHHHHHHHHHHHHHKHHEEBBBBBBBBBBBBBBB7??????????????!!!!!
      >LAcffffooor{{Z~~~~~~~~~~|~~~q~~~~~v~~~~~~~~~~~~~~~~~~~~~~~~
      ~~~~~~~~~~~~~~c~~xuuric```]]ZZWTNKHEEB????!!!!!!!!!!!!!!!!!!
      >QQQ5>QQ@QNNNNNNQQQQNNNNNNKHEBBBB???BBBBBBBBBEEEEEEBBBBBBBBB
      BBBBBBEHKNNKKKKKKNNNKKKKKKKKKKKW]]]cccfffc`]Z`>cfiOollllllLl
      >NQQTTTWQQQQQQTTNNNQQQT7GTTAWWWTTTZZW7TQTTQQQQQQTQQTTTQTTTTT
      WWZWZ]]]]WZ]``D]]]D```D]]]Z]]]ZZWWWT7@QNKKKKKHEBBBBHKHHHNQTT

      Comment

      • maubp
        Peter (Biopython etc)
        • Jul 2009
        • 1544

        #4
        Are the sequences and quality lines in the original FASTQ line wrapped? This is valid but not recommended as some tools (perhaps including MAQ) don't understand it.

        The tools I suggested can all cope with line wrapped FASTQ. See also: http://dx.doi.org/10.1093/nar/gkp1137

        (Or it may be the forum making it look line wrapped and adding spaces when it isn't really - trying using the [ code ] text [ /code ] tags to stop this)
        Last edited by maubp; 12-18-2009, 04:37 AM.

        Comment

        • zlu
          Member
          • Nov 2008
          • 34

          #5
          yes, the lines are indedd wrapped. Thanks for the reminder.

          However, trying seqret of EMBOSS 6.1, I got the following error:

          [SBSUser@pipeline Assembly2]$ seqret -sformat fastq-sanger
          Reads and writes (returns) sequences
          Input (gapped) sequence(s): CNS.fq
          Error: Unable to read sequence CNS.fq'

          Comment

          • maubp
            Peter (Biopython etc)
            • Jul 2009
            • 1544

            #6
            Originally posted by zlu View Post
            yes, the lines are indedd wrapped. Thanks for the reminder.
            That probably does explain MAQ's failure.
            Originally posted by zlu View Post
            However, trying seqret of EMBOSS 6.1,
            Is that plain un-patched EMBOSS 6.1.0? They are currently on patch 3, and this did include some FASTQ fixes. See:
            ftp://emboss.open-bio.org/pub/EMBOSS/fixes/README.fixes

            Originally posted by zlu View Post
            I got the following error:

            [SBSUser@pipeline Assembly2]$ seqret -sformat fastq-sanger
            Reads and writes (returns) sequences
            Input (gapped) sequence(s): CNS.fq
            Error: Unable to read sequence CNS.fq'
            That is odd. Note you can also try giving the filename as part of the command line, e.g.

            $ seqret -sformat fastq-sanger -sequence CNS.fq -osformat fasta -outseq CNS.fasta

            How big is the file? If you zipped it up and emailed it to me I could try it here for you if you liked.

            Comment

            • lh3
              Senior Member
              • Feb 2008
              • 686

              #7
              Most of fastq parsers do not work with multi-line fastq files. You'd better write your own script. After all, processing multi-line fastq is not that hard.

              Comment

              • zlu
                Member
                • Nov 2008
                • 34

                #8
                Originally posted by maubp View Post
                Is that plain un-patched EMBOSS 6.1.0? They are currently on patch 3, and this did include some FASTQ fixes.
                I'll try to patch it and see how it goes.

                Originally posted by maubp View Post
                How big is the file? If you zipped it up and emailed it to me I could try it here for you if you liked
                The compressed file is more than 1.3GB.

                Thanks for your help. I'll try to process the fastq with a script.

                Comment

                • zlu
                  Member
                  • Nov 2008
                  • 34

                  #9
                  Using the patched emboss, I can now convert the line wrapped fastq files. Thank you.

                  Comment

                  • maubp
                    Peter (Biopython etc)
                    • Jul 2009
                    • 1544

                    #10
                    Originally posted by lh3 View Post
                    Most of fastq parsers do not work with multi-line fastq files. You'd better write your own script. After all, processing multi-line fastq is not that hard.
                    Why doesn't MAQ understand multi-line fastq? Would you accept a patch to implement this?

                    Comment

                    • kmcarr
                      Senior Member
                      • May 2008
                      • 1181

                      #11
                      Originally posted by maubp View Post
                      Originally posted by lh3 View Post
                      Most of fastq parsers do not work with multi-line fastq files.
                      Why doesn't MAQ understand multi-line fastq? Would you accept a patch to implement this?
                      From the Wikipedia article on FASTQ (http://en.wikipedia.org/wiki/FASTQ_format):

                      The original Sanger FASTQ files also allowed the sequence and quality strings to be wrapped (split over multiple lines), but this is generally discouraged as it can make parsing complicated due to the unfortunate choice of "@" and "+" as markers (these characters can also occur in the quality string).

                      Comment

                      • maubp
                        Peter (Biopython etc)
                        • Jul 2009
                        • 1544

                        #12
                        Just because the consensus is that line wrapping in FASTQ is/was a bad idea, doesn't mean it can't be reliably parsed. It does make the code a little more complicated I admit - but as this thread shows, it would be useful to some if MAQ could understand line-wrapped FASTQ.

                        Comment

                        • lh3
                          Senior Member
                          • Feb 2008
                          • 686

                          #13
                          The version in MAQ svn, which is never released, parses multi-line fastq and accepts gzipped input, as it uses the same parser as bwa. Nonetheless, as maq only works with reads no longer than 127bp, not supporting multi-line fastq is not a major problem. It is more important for bwa to parse multi-line fastq as it works with long reads.

                          Comment

                          • gpcr
                            Member
                            • May 2010
                            • 18

                            #14
                            Originally posted by zlu View Post
                            Perhaps I didn't explain it clearly. I wasn't trying to convert the Illumina short reads but mapped consensus sequence of a few chromosomes. Each is a few Mbp long, e.g:

                            @chr1
                            nnnnnnnagatagaaataCACGATGCGAGCAATCAAATTTCATAACATCACCATGAGTTT
                            GGTCCGAAGCATGAGTGTTTACAATGTTTGAAtaCCTTATACAGTTCTTATACATACTTT
                            ATAAATTATTTCCCaagctgttttgatacactcactaacagaTATTCTATAGAAGGAAAA
                            GTTATCCACTTATGCACATTTATAGTTTTCAGAATTGTGGATAATTAGAAATTACACACA
                            AAGTTATACTATTTTTAGCAACATATTCACAGGTATTTGACATATAGAGAACTGAAAAAG
                            TATAATTGTGTGGATAAGTCGTCCAACTCATGATTTTATAAGGATTTATTTATTGATATT
                            TACATAAAAATACTGTGCATAACTAATAAGCAGGATAAAGTTATCCACCGATTGTTATTA
                            +
                            !!!!!!!????????BBBEEEHKKKKKKKKKKKKNNNNNNQQNNNNNNQWWWW7WZWWWZ
                            ZZZ]]]`````````>]]]]]]]ZTQQQTQNNKKKKKKHKKHHHEEEEEEEHHHHHHHHH
                            HKKHHHHHHEEEEEBBBBBBBBBBBB?=B@BBBBBB??BBBBEEEEEEEEEEHKNNNNNQ
                            QQTTTTWWWWWWWTTWZWWW]`>```c``]`ccc``cfcfliiloouSxxuuuuTollLo
                            olliifilloolfif````]]WWWWWTTTTTTTTQQQQQQQNKKHHHHEEEEHEEEHHEE
                            EEEEEEEHHHHHHHHHEEEEEEHHHHHEEHKHHHHHHHHHHHEEHHHHHHHHKNNNNKNN
                            NQQ@NKNNNQQQTWWWZZZBZZZZ]<`]ZZZZZZZZLZZ``]]]ZZZWTTTQQQNNNNNK

                            And using the method described above, this is what I got:

                            >chr1nnnnnnnagatagaaataCACGATGCGAGCAATCAAATTTCATAACATCACCATGAGTTT
                            >`]]``fGff`````]iifc`````cciiLoxuuuxZ{{xxruuxx{{{~~rrrrrrrrr
                            Ooiiff`ZZZWTTTTTTQQNTQTWWWWWW]]]`]]cfifffciiiiiorrxx{{~x~{{x
                            >QNNQQQTTQQQQNNNNQZ``cfffolllloollruuuurrroruxxxxx{~~xuroorr
                            rrorruxxxroooux{{{xxuuuuuruurrrrollccfcc]ZZB]?]]W7QQTTWWWWWW
                            >QTT7TQQTTTWZZZZ]]]]]]ZZZZ]]]]]]`````]]]]]`]`c`]```Z]``fflll
                            louxxxxuS{{{{{~~~~~~{{~{{{{xuuroiilroiiollorlloorZ{~~{xxxuuu
                            >KKKKKKKHHEB????BBBEEKHHHKKKKNNNNQQQQTZZZZZZ]]````````]]WWW<
                            WWWW7TTTTTTTQKKKKKKHHEEEE<BBEHHH=HEEEEEHHHHEEEEHHHHHHKKKKKKK
                            >QQQBQQQNNNNKKKKKKQQQQQQTTTTWWZWTTQQQQQQQQTTWZZZZZ]ZZTWWTTTQ
                            QQTQQNNNTWWWZZ]]]ZZWWWWWWTTTTQQQEQQQQNNNNNNHHHHEEBEEEEEBBBBB
                            >QQQATTTTTTTWWTTTQQQQQQQNNKKKQQQTTTT7TTQQ5QNNNQKNNQQQQQQQQQQ
                            QQN4HKKHHHHHHHHHHHHHKHHEEBBBBBBBBBBBBBBB7??????????????!!!!!
                            >LAcffffooor{{Z~~~~~~~~~~|~~~q~~~~~v~~~~~~~~~~~~~~~~~~~~~~~~
                            ~~~~~~~~~~~~~~c~~xuuric```]]ZZWTNKHEEB????!!!!!!!!!!!!!!!!!!
                            >QQQ5>QQ@QNNNNNNQQQQNNNNNNKHEBBBB???BBBBBBBBBEEEEEEBBBBBBBBB
                            BBBBBBEHKNNKKKKKKNNNKKKKKKKKKKKW]]]cccfffc`]Z`>cfiOollllllLl
                            >NQQTTTWQQQQQQTTNNNQQQT7GTTAWWWTTTZZW7TQTTQQQQQQTQQTTTQTTTTT
                            WWZWZ]]]]WZ]``D]]]D```D]]]Z]]]ZZWWWT7@QNKKKKKHEBBBBHKHHHNQTT
                            awk '/@chr1$/,/^+$/' consensus.fastq | perl -pe "s/@/>/ ; s/\+//" > chr1.fasta

                            here "chr1" is chromosome

                            Comment

                            • yifangt
                              Member
                              • Feb 2011
                              • 61

                              #15
                              My question is related, but not quite the same, which is: I need to get a sub-string of the sequence and the corresponding quality score for each of the entries, the file format untouched. My Illumina reads consists of 101bp long. I want remove the first 10bp and last 25bp then only the middle 66bp left.

                              The reason I need do this is my DNA sequence is methylated and the first 10 and last 25bp seem not having good quality in general so that I want get rid of them. My challenge is to remove both quality score and sequence correspondingly.
                              Code:
                              use Bio::SeqIO; use Bio::Seq::Quality;
                              
                              $seqio = Bio::SeqIO->new('-format'=>'fastq' , '-file'=>'some.fasq');
                              
                              my $out_fastq = Bio::SeqIO->new( -format => 'fastq', '-file'=> 'subset.fastq');
                              
                              while((my $line = $seqio->next_seq() ) { 
                              # keep the id 
                              # substring the sequence (from 11 to 66) 
                              # substring the quality score (from 11 to 66) $out_fastq->write_seq($line); }
                              Or any tools there to do my job?

                              Thanks a lot!

                              Yifang
                              Last edited by yifangt; 08-13-2011, 04:44 AM.

                              Comment

                              Latest Articles

                              Collapse

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by SEQadmin2, Yesterday, 10:09 AM
                              0 responses
                              9 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-04-2026, 08:59 AM
                              0 responses
                              17 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-02-2026, 12:03 PM
                              0 responses
                              26 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-02-2026, 11:40 AM
                              0 responses
                              21 views
                              0 reactions
                              Last Post SEQadmin2  
                              Working...