**Heisman** · 11-02-2011, 08:36 AM

I'm not sure about 1, but for 2, it will probably be safer to keep them separate and it would take at most twice as long to analyze. Realistically, if you end up fiddling around a lot with the analysis for the first one then it will take much less time to analyze the second one when you get all the kinks worked out. If it's only a one time thing I wouldn't worry about screwing things up while merging.

**maubp** · 11-02-2011, 11:49 AM

Originally posted by frymor

Q1: Are space characters allowed in the header of the FASTQ file?

Yes, and they mark the end of the identifier and the start of the free format text.
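To make that concrete, here is a minimal sketch of how a parser that follows this convention would treat the two header lines from the question. The function name `split_fastq_header` is made up for illustration and is not taken from any particular tool.

```python
def split_fastq_header(header_line):
    """Split a FASTQ '@' header into (identifier, description).

    Hypothetical helper: the identifier is everything up to the first
    whitespace, the description is the free-format text after it.
    """
    if not header_line.startswith("@"):
        raise ValueError("FASTQ header lines start with '@'")
    # Split once on the first run of whitespace.
    parts = header_line[1:].rstrip("\n").split(None, 1)
    if not parts:
        raise ValueError("empty FASTQ header")
    identifier = parts[0]
    description = parts[1] if len(parts) > 1 else ""
    return identifier, description


h1 = "@HWI-ST0764:77:C03HCACXX:4:1101:1242:2075 1:Y:0:TTAGGC"
h2 = "@HWI-ST0764:77:C03HCACXX:4:1101:1242:2075 2:Y:0:TTAGGC"
print(split_fastq_header(h1))
print(split_fastq_header(h2))
```

Both lines yield the same identifier, so a tool that keeps only the identifier cannot tell the two records apart.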

You are suffering from the Illumina 1.8 switch from /1 and /2 suffixes to putting the segment number separately from the identifier, i.e. they switched from giving each read segment its own ID to a shared ID (similar to how SAM/BAM records parts one and two with the same ID but different FLAG values). See also:

Upcoming changes in CASAVA - SEQanswers

http://seqanswers.com/forums/showthread.php?t=8895

FASTQ must die! Long live SAM/BAM!

http://blastedbio.blogspot.com/2011/10/fastq-must-die-long-live-sambam.html

I think it is time to retire the FASTQ file format in favour of storing unaligned reads in SAM/BAM format . I will try to explain, as thi...

**kmcarr** · 11-02-2011, 12:33 PM

Originally posted by frymor

Hi,

I was wondering if anyone has had problems with space characters in the header of a FASTQ file.

We have two technical replicates, and our sequencing people told us it will save us time if we merge the two files and analyse them together.
The problem is that the headers are the same except for the number of the technical replicate:

Code:

technical replicate 1
@HWI-ST0764:77:C03HCACXX:4:1101:1242:2075 1:Y:0:TTAGGC
technical replicate 2
@HWI-ST0764:77:C03HCACXX:4:1101:1242:2075 2:Y:0:TTAGGC

Q1: Are space characters allowed in the header of the FASTQ file?

When we analysed the merged file we got the same results as for the first of the two original files.
This led me to think that the space and everything after it are being ignored. We are working with MAQGene to try to find SNPs in C. elegans.

Q2: Does it make sense at all to merge the two technical replicates to save analysis time? Do I lose information by merging them?

Thanks for the help

Assa

Those reads you have shown are not two different technical replicates, they are read 1 and read 2 from a paired end format run. The format of the headers shows them to be generated by Illumina's CASAVA v1.8+. The CASAVA pipeline normally produces two separate files, one for each read. They should be kept this way until you know what you are going to do with them. Some software requires that the two reads be supplied as separate files; other programs want the reads shuffled together in one file. I don't know what MAQGene expects for paired reads.
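For the "shuffled together" case, a rough sketch of what interleaving means is below. It assumes plain four-line FASTQ records with both files in the same read order; the function names and the sequence and quality strings in the demo are invented placeholders, not real data from this run.

```python
import io
from itertools import zip_longest


def read_fastq(handle):
    """Yield (header, sequence, plus, quality), assuming 4-line records."""
    while True:
        record = [handle.readline() for _ in range(4)]
        if not record[0]:
            return  # clean end of file
        if not record[3]:
            raise ValueError("truncated FASTQ record")
        yield tuple(line.rstrip("\n") for line in record)


def interleave(handle1, handle2, out):
    """Write read 1 then read 2 for each pair, checking the shared ID."""
    pairs = zip_longest(read_fastq(handle1), read_fastq(handle2))
    for rec1, rec2 in pairs:
        if rec1 is None or rec2 is None:
            raise ValueError("files hold different numbers of reads")
        # Compare only the identifier, i.e. the text before the space.
        id1 = rec1[0].split(None, 1)[0]
        id2 = rec2[0].split(None, 1)[0]
        if id1 != id2:
            raise ValueError(f"read IDs out of sync: {id1} vs {id2}")
        for rec in (rec1, rec2):
            out.write("\n".join(rec) + "\n")


# Demo with in-memory files; bases and qualities are made up.
r1 = io.StringIO(
    "@HWI-ST0764:77:C03HCACXX:4:1101:1242:2075 1:Y:0:TTAGGC\nACGT\n+\nIIII\n"
)
r2 = io.StringIO(
    "@HWI-ST0764:77:C03HCACXX:4:1101:1242:2075 2:Y:0:TTAGGC\nTTGA\n+\nHHHH\n"
)
out = io.StringIO()
interleave(r1, r2, out)
print(out.getvalue(), end="")
```

Note that this is different from simply concatenating the two files, which puts all of read 1 before all of read 2 and loses the pairing for most tools.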

**frymor** · 11-02-2011, 11:46 PM

Originally posted by kmcarr

Those reads you have shown are not two different technical replicates, they are read 1 and read 2 from a paired end format run.

Are you sure these are two files of a paired-end reads data set?

As far as we understood from our technicians here at the sequencing centre, they didn't do any PE analysis, but produced technical replicates for us.

Where can I find out what is the meaning of the header?
What would it look like if it were two sets of technical replicates instead of PE reads?

Thanks a lot for the info about the PE data. It needs a totally different analysis. Up until now we analyzed them separately, because we didn't know they were connected.

We are working with the MAQGene software and I am not even sure if there's an option for explicit work with PE reads.
Does it make sense at all to merge the two files, if they are really paired-end reads?

**maubp** · 11-03-2011, 01:34 AM

Originally posted by frymor

Are you sure these are two files of a paired-end reads data set?

Pretty sure based on the naming.

Originally posted by frymor

As far as we understood from our technicians here at the sequencing centre, they didn't do any PE analysis, but produced technical replicates for us.

Odd. Double check that with them.

Originally posted by frymor

Where can I find out what is the meaning of the header?

Read the thread I linked to earlier, and/or the official Illumina 1.8 documentation:

Upcoming changes in CASAVA - SEQanswers

http://seqanswers.com/forums/showthread.php?t=8895

**kmcarr** · 11-03-2011, 07:11 AM

Originally posted by frymor

Are you sure these are two files of a paired-end reads data set?

Yes, absolutely.

As far as we understood from our technicians here at the sequencing centre, they didn't do any PE analysis, but produced technical replicates for us.

I think you had better speak with your sequencing facility to clarify exactly what they are telling you.

Where can I find out what is the meaning of the header?

I have attached a PDF I normally give to our clients which describes the file name and FASTQ header line format for CASAVA 1.8. (This file was heavily cribbed from Illumina's documentation.)

What can be seen from your examples:
The read IDs are identical so they are two reads from the same cluster. They are reads 1 and 2 of a paired read. The barcode tag sequence for that cluster is TTAGGC. The read did NOT pass the Illumina quality filter ("Y" means it failed filtering).
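If you want to pull those fields out programmatically, something along these lines should work. It is only a sketch that assumes the CASAVA 1.8 layout `@<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos> <read>:<is filtered>:<control number>:<index sequence>`; the class and function names are my own labels for illustration.

```python
from typing import NamedTuple


class CasavaHeader(NamedTuple):
    """Fields of a CASAVA 1.8 style header (names are illustrative)."""
    instrument: str
    run_number: int
    flowcell_id: str
    lane: int
    tile: int
    x_pos: int
    y_pos: int
    read: int
    is_filtered: bool  # True when the flag is "Y" (read failed the filter)
    control_number: int
    index_sequence: str


def parse_casava_header(header_line):
    """Parse one '@' header line, assuming the CASAVA 1.8 layout."""
    if not header_line.startswith("@"):
        raise ValueError("FASTQ header lines start with '@'")
    # The space separates the shared read ID from the per-read fields.
    identifier, _, description = header_line[1:].strip().partition(" ")
    id_fields = identifier.split(":")
    desc_fields = description.split(":")
    if len(id_fields) != 7 or len(desc_fields) != 4:
        raise ValueError("not a CASAVA 1.8 style header: " + header_line)
    instrument, run, flowcell, lane, tile, x_pos, y_pos = id_fields
    read, filtered, control, index = desc_fields
    if filtered not in ("Y", "N"):
        raise ValueError("filter flag must be Y or N, got " + filtered)
    return CasavaHeader(
        instrument, int(run), flowcell, int(lane), int(tile),
        int(x_pos), int(y_pos), int(read), filtered == "Y",
        int(control), index,
    )


hdr = parse_casava_header(
    "@HWI-ST0764:77:C03HCACXX:4:1101:1242:2075 1:Y:0:TTAGGC"
)
print(hdr)
```

For the first example header this gives read 1, lane 4, flowcell C03HCACXX, index TTAGGC, with the filter flag set.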

What would it look like if it were two sets of technical replicates instead of PE reads?

I can't say since I'm not completely clear on what you mean by "two sets of technical replicates". If you mean they just ran the same library twice then the read ID information would be entirely different and there would be no matching of reads from one replicate to the other.
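A quick way to check this on your own files is to compare the read IDs near the top of each file. The sketch below assumes four-line FASTQ records and files in their original order; the function names are hypothetical, and the "other run" header in the demo is invented purely for illustration.

```python
import io
from itertools import islice


def read_ids(handle, limit=1000):
    """Collect identifiers from the first `limit` 4-line FASTQ records."""
    ids = set()
    # In 4-line FASTQ every 4th line, starting with the first, is a header.
    for line in islice(handle, 0, limit * 4, 4):
        fields = line[1:].split(None, 1)
        if not line.startswith("@") or not fields:
            raise ValueError("expected a FASTQ header, got: " + line.strip())
        ids.add(fields[0])  # text before the first space
    return ids


def shared_id_fraction(handle_a, handle_b, limit=1000):
    """Fraction of IDs sampled from file A that also occur in file B."""
    ids_a = read_ids(handle_a, limit)
    ids_b = read_ids(handle_b, limit)
    if not ids_a:
        return 0.0
    return len(ids_a & ids_b) / len(ids_a)


# Headers from the question: same cluster, reads 1 and 2.
file_1 = "@HWI-ST0764:77:C03HCACXX:4:1101:1242:2075 1:Y:0:TTAGGC\nACGT\n+\nIIII\n"
file_2 = "@HWI-ST0764:77:C03HCACXX:4:1101:1242:2075 2:Y:0:TTAGGC\nTTGA\n+\nHHHH\n"
# Invented header standing in for a separate run of the same library.
other = "@HWI-ST0764:78:C03HDACXX:4:1101:1300:2100 1:N:0:TTAGGC\nGGCA\n+\nIIII\n"

print(shared_id_fraction(io.StringIO(file_1), io.StringIO(file_2)))
print(shared_id_fraction(io.StringIO(file_1), io.StringIO(other)))
```

A fraction near 1 points to paired reads from the same clusters; a fraction near 0 is what I would expect from two independent runs.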

Thanks a lot for the info about the PE data. It needs a totally different analysis. Up until now we analyzed them separately, because we didn't know they were connected.

We are working with the MAQGene software and I am not even sure if there's an option for explicit work with PE reads.
Does it make sense at all to merge the two files, if they are really paired-end reads?

I don't know if/how MAQGene manages paired reads. You'll have to consult the documentation for that.

Attached Files

IlluminaFASTQ_CASAVA_1.8.pdf (74.8 KB, 139 views)


space character in headers