**GenoMax** · 01-28-2015, 09:37 AM

The Fasta file format is meant for plain sequence (without quality information). There may be extensions of the Fasta format, but the normal usage is for plain sequence.

What you are looking for is the Fastq format, which has become the de facto standard for NGS data.

You can get fastq data files from the NCBI Sequence Read Archive (SRA) (http://www.ncbi.nlm.nih.gov/sra). Retrieving the data requires a utility called the SRA Toolkit. Fastq files can be several gigabytes in size.

**dpryan** · 01-28-2015, 09:37 AM

1. Probably JGI. They seem to have a LOT of bacterial datasets, which should be more tractable.
2. No, that would be incredibly unusual. They're usually stored in fastq, typically in separate files.
3. Each fragment sequenced produces two reads, one from each end. So if you sequence 100 million fragments, you'll have 200 million reads (100 million pairs). This is as opposed to single-end reads, where you just sequence one end of each fragment.
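To make point 3 concrete, here is a minimal sketch that walks two mate files in lockstep and counts fragments versus reads. It assumes 4-line fastq records and that mates are named with a trailing `/1` and `/2`; that naming convention, the helper names (`read_names`, `fragment_id`, `count_pairs`) and the sample data are all made up for illustration, and real files may label mates differently.

```python
import io
from typing import Iterator, TextIO


def read_names(handle: TextIO) -> Iterator[str]:
    """Yield the read name from each fastq record (assumes 4 lines per record)."""
    for line_number, line in enumerate(handle):
        if line_number % 4 == 0:          # header lines are lines 0, 4, 8, ...
            yield line[1:].split()[0]     # drop '@' and any trailing comment


def fragment_id(name: str) -> str:
    """Strip a trailing /1 or /2 (an assumed mate-naming convention)."""
    return name[:-2] if name.endswith(("/1", "/2")) else name


def count_pairs(mate1: TextIO, mate2: TextIO) -> int:
    """Count read pairs, checking that both files list fragments in the same order."""
    pairs = 0
    # Note: zip stops at the shorter file, so unequal file lengths go unnoticed here.
    for name1, name2 in zip(read_names(mate1), read_names(mate2)):
        if fragment_id(name1) != fragment_id(name2):
            raise ValueError(f"mates out of sync: {name1} vs {name2}")
        pairs += 1
    return pairs


# Tiny in-memory stand-ins for the two mate files of a paired-end run.
mate1 = io.StringIO("@frag1/1\nACGT\n+\nIIII\n@frag2/1\nTTGA\n+\nIIII\n")
mate2 = io.StringIO("@frag1/2\nCCAT\n+\nIIII\n@frag2/2\nGGTC\n+\nIIII\n")

pairs = count_pairs(mate1, mate2)
print(f"{pairs} fragments, {2 * pairs} reads")
```

With the two sample fragments this reports 2 fragments and 4 reads, the same 1:2 ratio as the 100 million / 200 million example.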

**mastal** · 01-28-2015, 09:42 AM

2. fastq format

**schakalakka** · 02-03-2015, 06:04 AM

Thank you for your answers.

So I have to look up fastq and write a parser. I had hoped I could avoid that.

**rhinoceros** · 02-03-2015, 06:13 AM

2. Some assemblers require input in fasta format, e.g. IDBA wants pairs to be consecutive sequences in fasta format, and they have bundled a small script for converting from fastq to fasta.
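For anyone curious what "pairs as consecutive sequences in fasta" looks like, here is a rough sketch of the conversion. This is not the script bundled with IDBA, just an illustration of the idea; it assumes 4-line fastq records and two mate files in the same order, and the function names (`read_fastq`, `interleave_to_fasta`) and sample reads are hypothetical.

```python
import io
from itertools import islice
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) tuples, assuming exactly 4 lines per record."""
    while True:
        chunk = list(islice(handle, 4))
        if len(chunk) < 4:
            return  # end of file (or a truncated final record)
        yield chunk[0][1:].strip(), chunk[1].strip()


def interleave_to_fasta(mate1: TextIO, mate2: TextIO, out: TextIO) -> int:
    """Write mates as consecutive fasta entries; return the number of pairs."""
    pairs = 0
    # zip stops at the shorter input, so extra reads in one file are dropped.
    for (name1, seq1), (name2, seq2) in zip(read_fastq(mate1), read_fastq(mate2)):
        out.write(f">{name1}\n{seq1}\n>{name2}\n{seq2}\n")
        pairs += 1
    return pairs


# In-memory stand-ins for the two mate files.
mate1 = io.StringIO("@frag1/1\nACGT\n+\nIIII\n@frag2/1\nTTGA\n+\nIIII\n")
mate2 = io.StringIO("@frag1/2\nCCAT\n+\nIIII\n@frag2/2\nGGTC\n+\nIIII\n")
fasta = io.StringIO()

n_pairs = interleave_to_fasta(mate1, mate2, fasta)
print(fasta.getvalue(), end="")
```

The quality lines are simply discarded, which is the information lost when going from fastq to fasta.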

Also, don't reinvent the wheel. I'm sure there are many OSS fastq parsers available. A good place to start could be https://github.com/samtools/htslib

**dpryan** · 02-03-2015, 06:28 AM

HTSlib doesn't have a fastq parser. Anyway, with any modern data it's fine to assume that fastq entries are always 4 lines, so a parser is then trivial to write.
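Under that 4-lines-per-entry assumption, a parser can be sketched in a few lines of Python. This is one possible version, not a reference implementation; `FastqRecord` and `parse_fastq` are names chosen here for illustration, and multi-line (wrapped) fastq records are deliberately not handled.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str       # header line without the leading '@'
    sequence: str
    quality: str    # raw quality string, not decoded to numeric scores


def parse_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a fastq stream, assuming exactly 4 lines per record."""
    while True:
        header = handle.readline()
        if not header:
            return                      # end of file
        if not header.strip():
            continue                    # tolerate blank lines between records
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline()
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed fastq record near: {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in: {header!r}")
        yield FastqRecord(header[1:].rstrip("\r\n"), sequence, quality)


# In-memory example; a real file would be passed as open("reads.fastq").
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\n!!#I\n")
records = list(parse_fastq(example))
for record in records:
    print(record.name, record.sequence, record.quality)
```

The two sanity checks (the `@` and `+` markers, and equal sequence/quality lengths) are cheap and catch most cases where the 4-line assumption does not hold.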

**milw** · 02-03-2015, 07:47 AM

If you're using Python, there's a decent parser already at https://scipher.wordpress.com/2010/0...-fastq-parser/

Sample FASTA data and questions about paired end reads