**kmcarr** · 01-20-2013, 06:48 PM

Originally posted by rnaeye

It seems that index sequence is replaced by sample number...

Is there a way to make the software print index sequences instead of the sample number? What is the reason for this change? To make files smaller? Thank you.

Yes, they do report sample ID in place of the barcode. There is no way to alter this behavior in the MiSeq analysis software. If you want the files to match the output of the HiSeq you can run the MiSeq BCL files through CASAVA to create FASTQ files.

Why did they do this? Simply to be difficult I think.

**GenoMax** · 01-21-2013, 05:03 AM

Would it not be simpler to replace the index numbers with the actual sequences?

**mcnelson.phd** · 01-21-2013, 05:05 AM

I'm curious as to why you would need the index sequence in the sequence header as opposed to a number? If it's because you have custom scripts that separate the sequences based on the index, then it's just as easy to modify them to handle a number. Or you can do a simple awk/perl script to replace the number with the sequence if you absolutely must have that info in the header, much simpler than running CASAVA.

BTW, if you want the full index read file, there is a flag that you can add to the MiSeqReporter.config XML file.

**kmcarr** · 01-21-2013, 05:56 AM

Originally posted by GenoMax

Would it not be simpler to replace the index numbers with the actual sequences?

Originally posted by mcnelson.phd

I'm curious as to why you would need the index sequence in the sequence header as opposed to a number? If it's because you have custom scripts that separate the sequences based on the index, then it's just as easy to modify them to handle a number. Or you can do a simple awk/perl script to replace the number with the sequence if you absolutely must have that info in the header, much simpler than running CASAVA.

But this would not be equivalent to what you get from CASAVA. Your (GenoMax & mcnelson) suggestion is to replace the index ID with the nominal index sequence. What CASAVA records in this field is the OBSERVED index sequence; thus, if you are permitting mismatches in the index, the mismatched sequence is what is reported. And running CASAVA BclToFastq on a MiSeq run takes very little time.
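To make the nominal vs. observed distinction concrete, here is a minimal Python sketch of mismatch-tolerant index matching. It is not how CASAVA or MiSeq Reporter is implemented; the function names, the one-mismatch default, and the two-sample `SAMPLE_INDEXES` table are assumptions for illustration only.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def assign_sample(observed, sample_indexes, max_mismatches=1):
    """Return (sample_id, nominal_index) for an observed index read, or None.

    The read is assigned only if exactly one nominal index lies within
    max_mismatches of the observed sequence.
    """
    hits = [
        (sample_id, nominal)
        for sample_id, nominal in sample_indexes.items()
        if len(nominal) == len(observed)
        and hamming(observed, nominal) <= max_mismatches
    ]
    return hits[0] if len(hits) == 1 else None


# Hypothetical sample sheet: sample number -> nominal index sequence.
SAMPLE_INDEXES = {"1": "ATCACG", "2": "CGATGT"}

observed = "ATCACT"  # differs from sample 1's nominal index at one base
print(assign_sample(observed, SAMPLE_INDEXES))  # sample and NOMINAL index
print(observed)  # the OBSERVED sequence, which is what CASAVA would report
```

Substituting the nominal index into the header loses the information in `observed`, which is the point being made here.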

BTW, if you want the full index read file, there is a flag that you can add to the MiSeqReporter.config XML file.

Interesting mcn. I wasn't aware of this flag. It seems as though this output may be similar to what CASAVA is reporting.

**mcnelson.phd** · 01-21-2013, 06:13 AM

Originally posted by kmcarr

Interesting mcn. I wasn't aware of this flag. It seems as though this output may be similar to what CASAVA is reporting.

If you choose to use that flag for Reporter, it gives you the actual index read file(s). During sample demultiplexing, Reporter does use an error correction scheme to assign reads that have identifiable and correctable index errors, which means that for the most part sequences are assigned to the correct sample.

Now, I have looked at the quality metrics for a number of index read files, and it's quite disturbing how poor index read quality is in many cases. Not only do we constantly see low level phiX contamination, but I've also seen obvious sample cross-contamination in some genomes we did once. We're working with a group looking at speciation in very closely related archaeal strains, and I've recommended to them that we do "manual" demultiplexing using my own scripts to reduce the level of cross-contamination.

**GenoMax** · 01-21-2013, 06:21 AM

Originally posted by kmcarr

But this would not be equivalent to what you get from CASAVA. Your (GenoMax & mcnelson) suggestion is to replace the index ID with the nominal index sequence. What CASAVA records in this field is the OBSERVED index sequence; thus, if you are permitting mismatches in the index, the mismatched sequence is what is reported. And running CASAVA BclToFastq on a MiSeq run takes very little time.

Good catch. I missed that finer technical point when I thought about the simplest solution for numerical indexes.

If the OP does not have easy access to MiSeq/CASAVA, then simple sequence replacement would still be a practical workaround.
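For anyone who wants to try the simple replacement, a rough Python sketch follows. It assumes the CASAVA 1.8-style header layout discussed in this thread (the last colon-separated field of the header comment holds the sample number), plain four-line FASTQ records, and a hypothetical `SAMPLE_TO_INDEX` table copied by hand from the sample sheet. As kmcarr notes, this writes the nominal index, not the observed one.

```python
import re

# Hypothetical mapping taken from the run's sample sheet:
# sample number (as written in the header) -> nominal index sequence.
SAMPLE_TO_INDEX = {"1": "ATCACG", "2": "CGATGT"}

# Header comment looks like " 1:N:0:2"; the final field is the sample number.
HEADER_TAIL = re.compile(r"^(@\S+ [12]:[YN]:\d+:)(\d+)$")


def rewrite_header(line, sample_to_index):
    """Replace the trailing sample number with its nominal index, if known."""
    match = HEADER_TAIL.match(line)
    if match is None or match.group(2) not in sample_to_index:
        return line  # leave unrecognised headers untouched
    return match.group(1) + sample_to_index[match.group(2)]


def rewrite_fastq(lines, sample_to_index):
    """Yield FASTQ lines with each record's header rewritten.

    Only every fourth line is treated as a header, because quality
    strings may also begin with '@'.
    """
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        yield rewrite_header(line, sample_to_index) if i % 4 == 0 else line


record = [
    "@M00001:12:000000000-A1B2C:1:1101:15589:1332 1:N:0:2",
    "ACGT",
    "+",
    "@III",
]
for out in rewrite_fastq(record, SAMPLE_TO_INDEX):
    print(out)
```

The instrument, run, and flow cell names in `record` are made up. For real files, pass an open file handle as `lines` and write the yielded lines to a new file.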

**bbeitzel** · 01-22-2013, 08:24 AM

Originally posted by mcnelson.phd

Now, I have looked at the quality metrics for a number of index read files, and it's quite disturbing how poor index read quality is in many cases. Not only do we constantly see low level phiX contamination, but I've also seen obvious sample cross-contamination in some genomes we did once. We're working with a group looking at speciation in very closely related archaeal strains, and I've recommended to them that we do "manual" demultiplexing using my own scripts to reduce the level of cross-contamination.

We are seeing the same thing on our MiSeq runs. We were doing some pathogen identification runs, and were seeing cross-contamination in demultiplexed reads (i.e., reads from "known" samples run on the same flow cell were showing up in reads from "unknowns"). At first we thought that we were cross-contaminating during library prep, but we also see a lot of PhiX showing up in unknowns. If we were somehow cross-contaminating during library prep, the PhiX should still show up in the unindexed reads file. The fact that it shows up with indexed reads makes me think that it is a problem with demultiplexing. Forcing the index reads to have average quality > Q30 before demultiplexing cleans this up somewhat, but not completely.

**mcnelson.phd** · 01-22-2013, 08:53 AM

Originally posted by bbeitzel

We are seeing the same thing on our MiSeq runs. We were doing some pathogen identification runs, and were seeing cross-contamination in demultiplexed reads (i.e., reads from "known" samples run on the same flow cell were showing up in reads from "unknowns"). At first we thought that we were cross-contaminating during library prep, but we also see a lot of PhiX showing up in unknowns. If we were somehow cross-contaminating during library prep, the PhiX should still show up in the unindexed reads file. The fact that it shows up with indexed reads makes me think that it is a problem with demultiplexing. Forcing the index reads to have average quality > Q30 before demultiplexing cleans this up somewhat, but not completely.

With the phiX a lot of that is due to the lack of an index on the phiX v3 control DNA. For amplicons where we use a lot of phiX in the run, we were using the v2 phiX that comes with the Multiplexing kit for the HiSeq as that has a TruSeq index on it and thus the overall quality of all index reads was much better. But now with the MiSeq hardware upgrades we've had issues the two times we tried to use the indexed v2 phiX because of fragment size so we're stuck using the v3 phiX without the index.

What appears to happen, and this is for both the index and the reads themselves, is that RTA seems to assign A's by default to clusters where it can't determine the sequence because there's no signal (e.g., phiX during the index read, or if you sequenced fully through a small fragment). So for phiX, during the index read, most of those clusters get AAAAAA..., but in some cases the cluster is close enough to another one that has an index and that signal gets picked up for the phiX, hence faulty assignment.

The only way I see around this is to make sure that all fragments on the flow-cell have an index, and that they're all error correcting (which I believe the TruSeq and Nextera indices are) and then do a quality pass on the index before demultiplexing. Our examination showed that a Q30 was overly strict, getting rid of too many reads, while a Q20 kept >90%. There's still very low level contamination, but I guess that's just something we'll have to live with unless Illumina drastically changes their sequencing methodology.

**aboyfromnowhere** · 01-28-2013, 02:57 PM

Originally posted by bbeitzel

We are seeing the same thing on our MiSeq runs. We were doing some pathogen identification runs, and were seeing cross-contamination in demultiplexed reads (i.e., reads from "known" samples run on the same flow cell were showing up in reads from "unknowns"). At first we thought that we were cross-contaminating during library prep, but we also see a lot of PhiX showing up in unknowns. If we were somehow cross-contaminating during library prep, the PhiX should still show up in the unindexed reads file. The fact that it shows up with indexed reads makes me think that it is a problem with demultiplexing. Forcing the index reads to have average quality > Q30 before demultiplexing cleans this up somewhat, but not completely.

Been dealing with this exact same problem today, finding PhiX reads in our de novo assemblies. Probably a dumb question, but how do you force the index reads to have an average quality > Q30?

What we're more concerned about though is sequence from one strain somehow being indexed/assembled with another. Does anyone know of a way to check for/prevent this? Thanks.

**mcnelson.phd** · 01-28-2013, 04:46 PM

Unfortunately, there's no way to force Reporter to do any filtering based on the quality of the index read, so you're forced to do it with your own scripts. It's not really that hard: average quality over all index bases must be > X, no bases can have a quality < Y, index must have 0 ambiguous bases.
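The three checks listed above can be sketched in a few lines of Python. This is only an illustration: the function names are made up, the default thresholds stand in for X and Y and are not recommendations, and the quality strings are assumed to be Phred+33 encoded.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores."""
    return [ord(char) - offset for char in quality_string]


def index_passes(seq, qual, min_mean=20.0, min_base=10, offset=33):
    """Return True if an index read passes all three checks.

    1. no ambiguous (N) bases
    2. no single base with quality below min_base  (Y in the post)
    3. mean quality strictly above min_mean        (X in the post)
    """
    if not seq or len(seq) != len(qual):
        return False
    if "N" in seq.upper():
        return False
    scores = phred_scores(qual, offset)
    if min(scores) < min_base:
        return False
    return sum(scores) / len(scores) > min_mean


print(index_passes("ATCACG", "IIIIII"))  # all bases Q40
print(index_passes("ATCANG", "IIIIII"))  # contains an ambiguous base
print(index_passes("ATCACG", "IIIII#"))  # last base is Q2
```

The function only decides whether one index read is acceptable; removing the matching records from the read 1 and read 2 files is a separate step.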

As far as cross-contamination, it depends on how much of a problem you have as to whether or not it will affect your assemblies. Given that any contamination should be pretty low, I just ignore it for most of our de novo assembly and resequencing/mapping runs. If I were doing variant calling though, I'd have to implement some sort of index quality cut-off based on how closely related the strains are expected to be.

**aboyfromnowhere** · 01-28-2013, 05:02 PM

Originally posted by mcnelson.phd

Unfortunately, there's no way to force Reporter to do any filtering based on the quality of the index read, so you're forced to do it with your own scripts. It's not really that hard: average quality over all index bases must be > X, no bases can have a quality < Y, index must have 0 ambiguous bases.

As far as cross-contamination, it depends on how much of a problem you have as to whether or not it will affect your assemblies. Given that any contamination should be pretty low, I just ignore it for most of our de novo assembly and resequencing/mapping runs. If I were doing variant calling though, I'd have to implement some sort of index quality cut-off based on how closely related the strains are expected to be.

Hey, thanks for the reply. I'm going to have to break out some coding books then - trying to learn some perl at the moment.

For PhiX we got a single 5386 bp contig, at between roughly 70X and 300X coverage, depending on the run (so that was sequencing 5 and 2 S. pneumoniae genomes, respectively). So given that we've got a perfect size contig, at pretty high coverage, we're pretty nervous about cross-contamination from the strains themselves. Will give the index quality filtering a try though, to see if that has an effect. Thanks.

EDIT: Is this something you do with paired-end reads? If so, once you've deleted a read due to low index quality, how do you deal with its paired read in the corresponding file, given that you need to keep the order the same?

**mcnelson.phd** · 01-29-2013, 05:04 AM

Originally posted by aboyfromnowhere

Hey, thanks for the reply. I'm going to have to break out some coding books then - trying to learn some perl at the moment.

For PhiX we got a single 5386 bp contig, at between roughly 70X and 300X coverage, depending on the run (so that was sequencing 5 and 2 S. pneumoniae genomes, respectively). So given that we've got a perfect size contig, at pretty high coverage, we're pretty nervous about cross-contamination from the strains themselves. Will give the index quality filtering a try though, to see if that has an effect. Thanks.

EDIT: Is this something you do with paired-end reads? If so, once you've deleted a read due to low index quality, how do you deal with its paired read in the corresponding file, given that you need to keep the order the same?

Paired-end should have no real effect on the index quality scores. I guess index 2 could have lower average quality because of the re-synthesis, but I've never looked at that.

I've been using a custom script that uses bowtie2 as its back-end to map and remove any phiX reads from 16S runs that we do. It doesn't catch every read, but by my estimate it's effective at removing >90% of all phiX reads from a sample.

If you want to go the quick route to get a script put together for index quality filtering, I'd suggest using a shell script that takes advantage of one of the many quality trimming tools already available. You would essentially do a quality trim on the index read(s), then you can filter the reads with a bad index out of your read 1/2 files, then proceed to demultiplexing. FastX toolkit can handle the quality trimming of the index and the demultiplexing, and then all you need is a simple filtering script that shouldn't be too hard to whip up in perl. Wrap it all up in a shell script to make it all work in one go and there you have it.

**aboyfromnowhere** · 01-29-2013, 06:31 AM

Originally posted by mcnelson.phd

Paired-end should have no real effect on the index quality scores. I guess index 2 could have lower average quality because of the re-synthesis, but I've never looked at that.

I've been using a custom script that uses bowtie2 as its back-end to map and remove any phiX reads from 16S runs that we do. It doesn't catch every read, but by my estimate it's effective at removing >90% of all phiX reads from a sample.

If you want to go the quick route to get a script put together for index quality filtering, I'd suggest using a shell script that takes advantage of one of the many quality trimming tools already available. You would essentially do a quality trim on the index read(s), then you can filter the reads with a bad index out of your read 1/2 files, then proceed to demultiplexing. FastX toolkit can handle the quality trimming of the index and the demultiplexing, and then all you need is a simple filtering script that shouldn't be too hard to whip up in perl. Wrap it all up in a shell script to make it all work in one go and there you have it.

No, I didn't mean it would affect the index score of the paired read. My understanding was that for assembly, paired reads need to be in the same position in the forward and reverse files to be recognised as pairs. If you filter one of them (say the forward read) out due to a poor index score, do you just delete the reverse to maintain the order for the other reads in the file?

I'll give the FastX/shell script method a go. Thanks for the suggestions.

**mcnelson.phd** · 01-29-2013, 06:35 AM

Originally posted by aboyfromnowhere

No, I didn't mean it would affect the index score of the paired read. My understanding was that for assembly, paired reads need to be in the same position in the forward and reverse files to be recognised as pairs. If you filter one of them (say the forward read) out due to a poor index score, do you just delete the reverse to maintain the order for the other reads in the file?

If you're filtering based on the index quality, you would have to filter out both read 1 and read 2. That's actually pretty simple because you're removing the same reads from both files as opposed to having to keep read 1 and read 2 unified while doing sequence trimming on those files.
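One way to keep the two files in step is to walk read 1, read 2, and the index read file together and drop a whole pair whenever its index fails. A minimal Python sketch, assuming uncompressed four-line FASTQ files in which the same read appears at the same position in all three files; the function names and the `keep` callback are hypothetical.

```python
from itertools import islice


def read_fastq(handle):
    """Yield (header, sequence, plus, quality) tuples from a 4-line FASTQ stream."""
    lines = iter(handle)
    while True:
        record = [line.rstrip("\n") for line in islice(lines, 4)]
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield tuple(record)


def read_id(header):
    """Read name without the comment field (text before the first space)."""
    return header.split(" ", 1)[0]


def filter_pairs(r1, r2, index, keep):
    """Yield (read1, read2) record pairs whose index read satisfies keep(seq, qual).

    Both mates are dropped together, so the outputs stay in the same order.
    """
    for rec1, rec2, idx in zip(read_fastq(r1), read_fastq(r2), read_fastq(index)):
        if not (read_id(rec1[0]) == read_id(rec2[0]) == read_id(idx[0])):
            raise ValueError("files are out of sync at " + rec1[0])
        if keep(idx[1], idx[2]):
            yield rec1, rec2
```

Usage would look something like `filter_pairs(open(r1_path), open(r2_path), open(index_path), keep)`, where `keep` is whatever index quality test you settle on, with each kept record written back out as four lines to new read 1 and read 2 files.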

MiSeq FASTQ sequence identifier missing index read?
