Unconfigured Ad

**maubp** · 12-18-2009, 03:45 AM

Originally posted by zlu View Post

I was trying to convert fastq consensus sequneces (generated by the pileup2fq utility of samtools) to a multi-fasta file with the MAQ's fq_all2std.pl fq2fa script but it didn't work. The result is a mix of 60bp sequence and quality lines with the > headers.

Is there quick way to extract only the sequences from the fastq file?

Thank you.

I personally use Biopython, e.g.

Reading data from UNIX pipes. · Biopython

http://www.biopython.org/wiki/Reading_from_unix_pipes

404 Page not found

http://news.open-bio.org/news/2009/09/biopython-convert-function/

Code:

from Bio import SeqIO
count = SeqIO.convert("example.fastq", "fastq", "example.fasta", "fasta")
print "Converted %i reads" % count

You might also consider BioPerl or EMBOSS seqret, etc.

However, the problem could be a broken FASTQ file. Could you show use the start of the file (say two reads)?

**zlu** · 12-18-2009, 04:05 AM

Perhaps I didn't explain it clearly. I wasn't trying to convert the Illumina short reads but mapped consensus sequence of a few chromosomes. Each is a few Mbp long, e.g:

@chr1
nnnnnnnagatagaaataCACGATGCGAGCAATCAAATTTCATAACATCACCATGAGTTT
GGTCCGAAGCATGAGTGTTTACAATGTTTGAAtaCCTTATACAGTTCTTATACATACTTT
ATAAATTATTTCCCaagctgttttgatacactcactaacagaTATTCTATAGAAGGAAAA
GTTATCCACTTATGCACATTTATAGTTTTCAGAATTGTGGATAATTAGAAATTACACACA
AAGTTATACTATTTTTAGCAACATATTCACAGGTATTTGACATATAGAGAACTGAAAAAG
TATAATTGTGTGGATAAGTCGTCCAACTCATGATTTTATAAGGATTTATTTATTGATATT
TACATAAAAATACTGTGCATAACTAATAAGCAGGATAAAGTTATCCACCGATTGTTATTA
+
!!!!!!!????????BBBEEEHKKKKKKKKKKKKNNNNNNQQNNNNNNQWWWW7WZWWWZ
ZZZ]]]`````````>]]]]]]]ZTQQQTQNNKKKKKKHKKHHHEEEEEEEHHHHHHHHH
HKKHHHHHHEEEEEBBBBBBBBBBBB?=B@BBBBBB??BBBBEEEEEEEEEEHKNNNNNQ
QQTTTTWWWWWWWTTWZWWW]`>```c``]`ccc``cfcfliiloouSxxuuuuTollLo
olliifilloolfif````]]WWWWWTTTTTTTTQQQQQQQNKKHHHHEEEEHEEEHHEE
EEEEEEEHHHHHHHHHEEEEEEHHHHHEEHKHHHHHHHHHHHEEHHHHHHHHKNNNNKNN
NQQ@NKNNNQQQTWWWZZZBZZZZ]<`]ZZZZZZZZLZZ``]]]ZZZWTTTQQQNNNNNK

And using the method described above, this is what I got:

>chr1nnnnnnnagatagaaataCACGATGCGAGCAATCAAATTTCATAACATCACCATGAGTTT
>`]]``fGff`````]iifc`````cciiLoxuuuxZ{{xxruuxx{{{~~rrrrrrrrr
Ooiiff`ZZZWTTTTTTQQNTQTWWWWWW]]]`]]cfifffciiiiiorrxx{{~x~{{x
>QNNQQQTTQQQQNNNNQZ``cfffolllloollruuuurrroruxxxxx{~~xuroorr
rrorruxxxroooux{{{xxuuuuuruurrrrollccfcc]ZZB]?]]W7QQTTWWWWWW
>QTT7TQQTTTWZZZZ]]]]]]ZZZZ]]]]]]`````]]]]]`]`c`]```Z]``fflll
louxxxxuS{{{{{~~~~~~{{~{{{{xuuroiilroiiollorlloorZ{~~{xxxuuu
>KKKKKKKHHEB????BBBEEKHHHKKKKNNNNQQQQTZZZZZZ]]````````]]WWW<
WWWW7TTTTTTTQKKKKKKHHEEEE<BBEHHH=HEEEEEHHHHEEEEHHHHHHKKKKKKK
>QQQBQQQNNNNKKKKKKQQQQQQTTTTWWZWTTQQQQQQQQTTWZZZZZ]ZZTWWTTTQ
QQTQQNNNTWWWZZ]]]ZZWWWWWWTTTTQQQEQQQQNNNNNNHHHHEEBEEEEEBBBBB
>QQQATTTTTTTWWTTTQQQQQQQNNKKKQQQTTTT7TTQQ5QNNNQKNNQQQQQQQQQQ
QQN4HKKHHHHHHHHHHHHHKHHEEBBBBBBBBBBBBBBB7??????????????!!!!!
>LAcffffooor{{Z~~~~~~~~~~|~~~q~~~~~v~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~c~~xuuric```]]ZZWTNKHEEB????!!!!!!!!!!!!!!!!!!
>QQQ5>QQ@QNNNNNNQQQQNNNNNNKHEBBBB???BBBBBBBBBEEEEEEBBBBBBBBB
BBBBBBEHKNNKKKKKKNNNKKKKKKKKKKKW]]]cccfffc`]Z`>cfiOollllllLl
>NQQTTTWQQQQQQTTNNNQQQT7GTTAWWWTTTZZW7TQTTQQQQQQTQQTTTQTTTTT
WWZWZ]]]]WZ]``D]]]D```D]]]Z]]]ZZWWWT7@QNKKKKKHEBBBBHKHHHNQTT

**maubp** · 12-18-2009, 04:35 AM

Are the sequences and quality lines in the original FASTQ line wrapped? This is valid but not recommended as some tools (perhaps including MAQ) don't understand it.

The tools I suggested can all cope with line wrapped FASTQ. See also: http://dx.doi.org/10.1093/nar/gkp1137

(Or it may be the forum making it look line wrapped and adding spaces when it isn't really - trying using the [ code ] text [ /code ] tags to stop this)

**zlu** · 12-18-2009, 06:56 AM

yes, the lines are indedd wrapped. Thanks for the reminder.

However, trying seqret of EMBOSS 6.1, I got the following error:

[SBSUser@pipeline Assembly2]$ seqret -sformat fastq-sanger
Reads and writes (returns) sequences
Input (gapped) sequence(s): CNS.fq
Error: Unable to read sequence CNS.fq'

**maubp** · 12-18-2009, 07:11 AM

Originally posted by zlu View Post

yes, the lines are indedd wrapped. Thanks for the reminder.

That probably does explain MAQ's failure.

Originally posted by zlu View Post

However, trying seqret of EMBOSS 6.1,

Is that plain un-patched EMBOSS 6.1.0? They are currently on patch 3, and this did include some FASTQ fixes. See:
ftp://emboss.open-bio.org/pub/EMBOSS/fixes/README.fixes

Originally posted by zlu View Post

I got the following error:

[SBSUser@pipeline Assembly2]$ seqret -sformat fastq-sanger
Reads and writes (returns) sequences
Input (gapped) sequence(s): CNS.fq
Error: Unable to read sequence CNS.fq'

That is odd. Note you can also try giving the filename as part of the command line, e.g.

$ seqret -sformat fastq-sanger -sequence CNS.fq -osformat fasta -outseq CNS.fasta

How big is the file? If you zipped it up and emailed it to me I could try it here for you if you liked.

**lh3** · 12-18-2009, 07:39 AM

Most of fastq parsers do not work with multi-line fastq files. You'd better write your own script. After all, processing multi-line fastq is not that hard.

**zlu** · 12-18-2009, 08:08 AM

Originally posted by maubp View Post

Is that plain un-patched EMBOSS 6.1.0? They are currently on patch 3, and this did include some FASTQ fixes.

I'll try to patch it and see how it goes.

Originally posted by maubp View Post

How big is the file? If you zipped it up and emailed it to me I could try it here for you if you liked

The compressed file is more than 1.3GB.

Thanks for your help. I'll try to process the fastq with a script.

**zlu** · 12-22-2009, 03:42 AM

Using the patched emboss, I can now convert the line wrapped fastq files. Thank you.

**maubp** · 12-22-2009, 03:49 AM

Originally posted by lh3 View Post

Most of fastq parsers do not work with multi-line fastq files. You'd better write your own script. After all, processing multi-line fastq is not that hard.

Why doesn't MAQ understand multi-line fastq? Would you accept a patch to implement this?

**kmcarr** · 12-22-2009, 06:04 AM

Originally posted by maubp View Post

Originally posted by lh3 View Post

Most of fastq parsers do not work with multi-line fastq files.

Why doesn't MAQ understand multi-line fastq? Would you accept a patch to implement this?

From the Wikipedia article on FASTQ (http://en.wikipedia.org/wiki/FASTQ_format):

The original Sanger FASTQ files also allowed the sequence and quality strings to be wrapped (split over multiple lines), but this is generally discouraged as it can make parsing complicated due to the unfortunate choice of "@" and "+" as markers (these characters can also occur in the quality string).

**maubp** · 12-22-2009, 06:22 AM

Just because the consensus is that line wrapping in FASTQ is/was a bad idea, doesn't mean it can't be reliably parsed. It does make the code a little more complicated I admit - but as this thread shows, it would be useful to some if MAQ could understand line-wrapped FASTQ.

**lh3** · 12-22-2009, 07:34 AM

The version in MAQ svn, which is never released, parses multi-line fastq and accepts gzipped input, as it uses the same parser as bwa. Nonetheless, as maq only works with reads no longer than 127bp, not supporting multi-line fastq is not a major problem. It is more important for bwa to parse multi-line fastq as it works with long reads.

**gpcr** · 09-10-2010, 11:55 PM

Originally posted by zlu View Post

Perhaps I didn't explain it clearly. I wasn't trying to convert the Illumina short reads but mapped consensus sequence of a few chromosomes. Each is a few Mbp long, e.g:

@chr1
nnnnnnnagatagaaataCACGATGCGAGCAATCAAATTTCATAACATCACCATGAGTTT
GGTCCGAAGCATGAGTGTTTACAATGTTTGAAtaCCTTATACAGTTCTTATACATACTTT
ATAAATTATTTCCCaagctgttttgatacactcactaacagaTATTCTATAGAAGGAAAA
GTTATCCACTTATGCACATTTATAGTTTTCAGAATTGTGGATAATTAGAAATTACACACA
AAGTTATACTATTTTTAGCAACATATTCACAGGTATTTGACATATAGAGAACTGAAAAAG
TATAATTGTGTGGATAAGTCGTCCAACTCATGATTTTATAAGGATTTATTTATTGATATT
TACATAAAAATACTGTGCATAACTAATAAGCAGGATAAAGTTATCCACCGATTGTTATTA
+
!!!!!!!????????BBBEEEHKKKKKKKKKKKKNNNNNNQQNNNNNNQWWWW7WZWWWZ
ZZZ]]]`````````>]]]]]]]ZTQQQTQNNKKKKKKHKKHHHEEEEEEEHHHHHHHHH
HKKHHHHHHEEEEEBBBBBBBBBBBB?=B@BBBBBB??BBBBEEEEEEEEEEHKNNNNNQ
QQTTTTWWWWWWWTTWZWWW]`>```c``]`ccc``cfcfliiloouSxxuuuuTollLo
olliifilloolfif````]]WWWWWTTTTTTTTQQQQQQQNKKHHHHEEEEHEEEHHEE
EEEEEEEHHHHHHHHHEEEEEEHHHHHEEHKHHHHHHHHHHHEEHHHHHHHHKNNNNKNN
NQQ@NKNNNQQQTWWWZZZBZZZZ]<`]ZZZZZZZZLZZ``]]]ZZZWTTTQQQNNNNNK

And using the method described above, this is what I got:

>chr1nnnnnnnagatagaaataCACGATGCGAGCAATCAAATTTCATAACATCACCATGAGTTT
>`]]``fGff`````]iifc`````cciiLoxuuuxZ{{xxruuxx{{{~~rrrrrrrrr
Ooiiff`ZZZWTTTTTTQQNTQTWWWWWW]]]`]]cfifffciiiiiorrxx{{~x~{{x
>QNNQQQTTQQQQNNNNQZ``cfffolllloollruuuurrroruxxxxx{~~xuroorr
rrorruxxxroooux{{{xxuuuuuruurrrrollccfcc]ZZB]?]]W7QQTTWWWWWW
>QTT7TQQTTTWZZZZ]]]]]]ZZZZ]]]]]]`````]]]]]`]`c`]```Z]``fflll
louxxxxuS{{{{{~~~~~~{{~{{{{xuuroiilroiiollorlloorZ{~~{xxxuuu
>KKKKKKKHHEB????BBBEEKHHHKKKKNNNNQQQQTZZZZZZ]]````````]]WWW<
WWWW7TTTTTTTQKKKKKKHHEEEE<BBEHHH=HEEEEEHHHHEEEEHHHHHHKKKKKKK
>QQQBQQQNNNNKKKKKKQQQQQQTTTTWWZWTTQQQQQQQQTTWZZZZZ]ZZTWWTTTQ
QQTQQNNNTWWWZZ]]]ZZWWWWWWTTTTQQQEQQQQNNNNNNHHHHEEBEEEEEBBBBB
>QQQATTTTTTTWWTTTQQQQQQQNNKKKQQQTTTT7TTQQ5QNNNQKNNQQQQQQQQQQ
QQN4HKKHHHHHHHHHHHHHKHHEEBBBBBBBBBBBBBBB7??????????????!!!!!
>LAcffffooor{{Z~~~~~~~~~~|~~~q~~~~~v~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~c~~xuuric```]]ZZWTNKHEEB????!!!!!!!!!!!!!!!!!!
>QQQ5>QQ@QNNNNNNQQQQNNNNNNKHEBBBB???BBBBBBBBBEEEEEEBBBBBBBBB
BBBBBBEHKNNKKKKKKNNNKKKKKKKKKKKW]]]cccfffc`]Z`>cfiOollllllLl
>NQQTTTWQQQQQQTTNNNQQQT7GTTAWWWTTTZZW7TQTTQQQQQQTQQTTTQTTTTT
WWZWZ]]]]WZ]``D]]]D```D]]]Z]]]ZZWWWT7@QNKKKKKHEBBBBHKHHHNQTT

awk '/@chr1$/,/^+$/' consensus.fastq | perl -pe "s/@/>/ ; s/\+//" > chr1.fasta

here "chr1" is chromosome

**yifangt** · 08-13-2011, 04:41 AM

My question is related, but not quite the same, which is: I need to get a sub-string of the sequence and the corresponding quality score for each of the entries, the file format untouched. My Illumina reads consists of 101bp long. I want remove the first 10bp and last 25bp then only the middle 66bp left.

The reason I need do this is my DNA sequence is methylated and the first 10 and last 25bp seem not having good quality in general so that I want get rid of them. My challenge is to remove both quality score and sequence correspondingly.

Code:

use Bio::SeqIO; use Bio::Seq::Quality;

$seqio = Bio::SeqIO->new('-format'=>'fastq' , '-file'=>'some.fasq');

my $out_fastq = Bio::SeqIO->new( -format => 'fastq', '-file'=> 'subset.fastq');

while((my $line = $seqio->next_seq() ) { 
# keep the id 
# substring the sequence (from 11 to 66) 
# substring the quality score (from 11 to 66) $out_fastq->write_seq($line); }

Or any tools there to do my job?

Thanks a lot!

Yifang

Topics	Statistics	Last Post
A New Method Makes Hantavirus Genome Analysis Faster and More Accessible by SEQadmin2 Started by SEQadmin2, Yesterday, 10:09 AM	0 responses 9 views 0 reactions	Last Post by SEQadmin2 Yesterday, 10:09 AM
A New Single-Cell Method Maps DNA-Protein Interactions by SEQadmin2 Started by SEQadmin2, 06-04-2026, 08:59 AM	0 responses 17 views 0 reactions	Last Post by SEQadmin2 06-04-2026, 08:59 AM
Long-Read RNA Sequencing Uncovers a Hidden Layer of Immune Cell Regulation by SEQadmin2 Started by SEQadmin2, 06-02-2026, 12:03 PM	0 responses 26 views 0 reactions	Last Post by SEQadmin2 06-02-2026, 12:03 PM
DNA Methylation Study Reveals How Epigenetic Changes Pass Between Generations by SEQadmin2 Started by SEQadmin2, 06-02-2026, 11:40 AM	0 responses 21 views 0 reactions	Last Post by SEQadmin2 06-02-2026, 11:40 AM

Unconfigured Ad

converting consensus fastq to fasta

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News