
**flxlex** · 09-01-2011, 04:57 AM

Your question made me finally write a blogpost about this: http://contig.wordpress.com/2011/09/...-fastq-header/. The awk command mentioned can be adjusted as needed. Also, check out http://en.wikipedia.org/wiki/Fastq for a comparison of old- and new-style headers.

**Wallysb01** · 09-01-2011, 07:15 AM

That looks great, thank you. I was all set to attempt to write my own script, but I'm very new to actually writing scripts, so I'm glad it didn't come to that.

**cement_head** · 08-26-2014, 12:16 PM

Hi,

Is it possible to ADD a term to the header of an Illumina FASTQ file? Will that interfere with programs that map reads to a reference? Or will those programs generally ignore everything after the "#"?

Thanks,
Andor

**Brian Bushnell** · 08-26-2014, 12:39 PM

You can add whatever you want to the read name, but you can't add new lines or anything to the lines with bases or qualities.
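A minimal Python sketch of that rule, assuming plain four-line records (sequence and quality lines not wrapped). The function name `add_suffix_to_headers` and the `#sampleA` suffix are made up for illustration. Only the first line of each record is changed; the base and quality lines pass through untouched.

```python
from typing import Iterable, Iterator


def add_suffix_to_headers(lines: Iterable[str], suffix: str) -> Iterator[str]:
    """Append `suffix` to the header line of every 4-line FASTQ record."""
    for index, line in enumerate(lines):
        text = line.rstrip("\r\n")
        # Headers are located by position (every 4th line), not by a leading
        # "@", because a quality line may also start with "@".
        if index % 4 == 0:
            if not text.startswith("@"):
                raise ValueError(f"line {index + 1} is not a FASTQ header: {text!r}")
            text += suffix
        yield text + "\n"


record = ["@read1\n", "ACGT\n", "+\n", "IIII\n"]
print("".join(add_suffix_to_headers(record, "#sampleA")), end="")
```

For a real file you would pass an open file handle as `lines` and write the yielded lines to a new file.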

**dariober** · 08-27-2014, 01:50 AM

Originally posted by cement_head

Is it possible to ADD a term to the Illumina FASTQ file? Will that interfere with programs that map reads to a reference? Or will those programs generally ignore everything after the "#"?

As Brian Bushnell said, the fastq header line is free text. Quoting from Cock et al. 2010:

‘@’ title line which often holds just a record identifier. This is a free format field with no length limit—allowing arbitrary annotation or comments to be included...

However, you can't assume that downstream programs, e.g. aligners, will accept arbitrary headers; some expect more stringent constraints, e.g. the absence of blank spaces. Also, some programs expect PE reads to have the same name (not sure if the fastq spec requires this?).
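One defensive option, if a downstream tool chokes on blank spaces, is to replace the whitespace before handing the file over. This is only a sketch: `sanitize_header` and the underscore replacement are illustrative choices, and whether a given aligner wants this is something to check against its documentation.

```python
import re


def sanitize_header(header: str, replacement: str = "_") -> str:
    """Replace each run of whitespace in a FASTQ header with `replacement`."""
    # strip() drops the trailing newline so it is not turned into "_".
    return re.sub(r"\s+", replacement, header.strip())


print(sanitize_header("@read1 1:N:0:ATCACG my note"))
# prints: @read1_1:N:0:ATCACG_my_note
```

Note that this makes the whole header part of the read name, which is the opposite of what whitespace-truncating tools do, so apply it only when you want the annotation to survive.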

**dpryan** · 08-27-2014, 04:12 AM

I've also seen a couple programs require paired-end read names to end in /1 and /2, even though that's neither a standard nor common practice (in fact, it's a stupid requirement). You'd be surprised how easy it is to break some aligners...

**SES** · 08-27-2014, 07:50 AM

Originally posted by dpryan

I've also seen a couple programs require paired-end read names to end in /1 and /2, even though that's neither a standard nor common practice (in fact, it's a stupid requirement). You'd be surprised how easy it is to break some aligners...

Indeed, and what is more annoying is when programs discard anything after the first part of the name because then the pair information is lost. Examples: seqtk and the readfq library (I submitted a patch for the Perl version a long time ago, and this fix is on github).

If the pair information has been lost, or you need to adjust the format for some aligner, you can use Pairfq (specifically with the subcommand addinfo). For this one simple task though, it is probably just as easy to write out a shell command. If this is a useful tool you are using, it would probably be worth asking the developers to support Illumina Fastq files.
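For anyone who prefers a script over a shell one-liner, here is a rough Python sketch of adding the pair information back. It is not how Pairfq does it; `add_pair_info` is a hypothetical helper that puts /1 or /2 on the read name (the first whitespace-delimited token) and keeps any comment after it.

```python
def add_pair_info(header: str, mate: int) -> str:
    """Return `header` with /1 or /2 appended to the read name."""
    if mate not in (1, 2):
        raise ValueError("mate must be 1 or 2")
    parts = header.rstrip("\r\n").split(None, 1)
    if not parts:
        raise ValueError("empty header")
    name = parts[0]
    comment = parts[1] if len(parts) > 1 else ""
    # Leave names alone if they already carry a pair suffix.
    if not name.endswith(("/1", "/2")):
        name = f"{name}/{mate}"
    # Name and comment are rejoined with a single space.
    return f"{name} {comment}" if comment else name


print(add_pair_info("@read1 1:N:0:ATCACG", 1))
# prints: @read1/1 1:N:0:ATCACG
```

Apply it with `mate=1` to every header in the forward file and `mate=2` to every header in the reverse file.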

**Brian Bushnell** · 08-27-2014, 09:21 AM

Originally posted by SES

Indeed, and what is more annoying is when programs discard anything after the first part of the name because then the pair information is lost. Examples: seqtk and the readfq library (I submitted a patch for the Perl version a long time ago, and this fix is on github).

I just want to mention that everything in the BBTools package does NOT do this, so you can subsample, normalize, trim, filter, etc. while leaving the names intact. But, some pipelines require that everything after the first whitespace be truncated, on the assumption that these are comments. For example, sam format requires read 1 and read 2 to have the exact same name, while in Illumina's output they have different names (/1 and /2, for example). So BBMap has a couple of related flags - "trimreaddescriptions", which will truncate everything after the first whitespace (for both reads and reference contigs), default false; and "keepnames", which will force read 1 and read 2 to retain their original name, even though the resulting sam file will not technically be spec-compliant (it's still useful in many situations). By default, for paired reads, read 1 and read 2 will both get the full name of read 1 so as to produce a valid sam file.
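To illustrate the kind of truncation described above (this is a generic sketch, not BBMap's implementation), the hypothetical helper below reduces a FASTQ header to a name that both mates share: it drops the leading "@", cuts at the first whitespace, and removes a trailing /1 or /2.

```python
def sam_query_name(header: str) -> str:
    """Reduce a FASTQ header to a read name shared by both mates."""
    text = header.strip()
    if text.startswith("@"):
        text = text[1:]
    # Keep only the part before the first whitespace.
    name = text.split(None, 1)[0] if text else ""
    # Drop an old-style pair suffix so read 1 and read 2 match.
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


print(sam_query_name("@HWI:1:1101:1234:5678 1:N:0:ATCACG"))
print(sam_query_name("@HWI:1:1101:1234:5678 2:N:0:ATCACG"))
# both lines print: HWI:1:1101:1234:5678
```

The read names in the example are invented. Anything you appended after the first whitespace is lost by this step, which is the pair-information problem raised earlier in the thread.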

Originally posted by dpryan

I've also seen a couple programs require paired-end read names to end in /1 and /2, even though that's neither a standard nor common practice (in fact, it's a stupid requirement). You'd be surprised how easy it is to break some aligners...

Also, BBTools makes use of that information for autodetecting whether a single file is paired and interleaved, but it can be overridden. And it's certainly not required.

**SES** · 08-27-2014, 09:52 AM

Originally posted by Brian Bushnell

I just want to mention that everything in the BBTools package does NOT do this, so you can subsample, normalize, trim, filter, etc. while leaving the names intact. But, some pipelines require that everything after the first whitespace be truncated, on the assumption that these are comments. For example, sam format requires read 1 and read 2 to have the exact same name, while in Illumina's output they have different names (/1 and /2, for example). So BBMap has a couple of related flags - "trimreaddescriptions", which will truncate everything after the first whitespace (for both reads and reference contigs), default false; and "keepnames", which will force read 1 and read 2 to retain their original name, even though the resulting sam file will not technically be spec-compliant (it's still useful in many situations). By default, for paired reads, read 1 and read 2 will both get the full name of read 1 so as to produce a valid sam file.

That is helpful information. I guess the developers of some tools assume you are only going to be mapping to a reference and working with SAM files (thus, trimming the read names to be valid). Of course, this is not the case for many of us but I can see how that assumption is valid for some (possibly most) use cases.


How to add a suffix to fastq file
