Hi all,
There is a lot of software for performing various forms of quality control and filtering on short-read data generated by NGS platforms, but it's much harder to find guidance on the actual decisions: where to set your thresholds, how much to trim, and so on.
I have just begun mapping >100 bp Illumina reads to a small genome, but I seem to have a lot of noise in the data set (both adapter contamination and low-quality sequence), resulting in low mapping rates. Quality plummets towards the 3' end, and trimming all reads improves the mapping rate, but of course you don't want to trim away good sequence.
I understand BWA has a pretty neat approach to trimming reads individually (the -q flag), but then there is the question of what value to set for this parameter.
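If I'm reading the bwa manual correctly, the -q logic is roughly the following (a toy Python sketch of my understanding, not BWA's actual code; the function name and the example quality values are made up, and the threshold choice is exactly what I'm unsure about):

```python
def bwa_style_trim(quals, threshold):
    """Kept read length under (my reading of) BWA's -q trimming rule.

    Per the bwa manual, a read is trimmed down to length
    argmax_x { sum_{i=x+1..l} (threshold - q_i) } whenever the
    quality q_l of the last base is below the threshold.

    quals: list of per-base Phred scores, 5' -> 3'.
    """
    l = len(quals)
    if l == 0 or quals[-1] >= threshold:
        return l  # last base is fine: no trimming
    best_len, best_sum, running = l, 0, 0
    # Walk in from the 3' end accumulating (threshold - q); the cut
    # point is wherever this running sum peaks.
    for x in range(l - 1, -1, -1):
        running += threshold - quals[x]
        if running > best_sum:
            best_sum, best_len = running, x
    return best_len

# e.g. 90 good bases followed by a decaying 3' tail
quals = [38] * 90 + [20, 15, 10, 8, 5, 3, 2, 2, 2, 2]
print(bwa_style_trim(quals, 20))  # -> 91: the bad tail is clipped
```

What I like about this is that it trades off tail length against how bad the tail is, rather than cutting a fixed number of bases, but that still leaves the question of where to put the threshold (people seem to use anything from 10 to 30).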
Other approaches, like discarding whole reads that fail some minimum quality threshold, counting ambiguous bases ("N"s), looking for windows of poor-quality sequence, etc., are also mentioned now and then and sound good in principle, but here too it is hard to know how aggressively to apply any given approach. Another thought: if a read is really bad it probably won't align anyway, but it still feels better to identify and exclude such reads beforehand, right? I have also considered mapping full-length reads first, then trimming the unmapped ones and trying them again. There are clearly many approaches one could take, but it's not obvious which would be best.
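To make the question concrete, here is a toy Python version of the kind of whole-read filter I have in mind (the function name and every threshold below are arbitrary placeholders; picking those numbers sensibly is precisely my problem):

```python
def passes_filters(seq, quals, max_n=2, min_mean_q=20, win=10, win_min_q=15):
    """Reject a read if it has too many ambiguous bases, a low
    overall mean quality, or any window of `win` bases whose mean
    quality falls below win_min_q.  All thresholds are placeholders.
    """
    if seq.upper().count("N") > max_n:
        return False
    if sum(quals) / len(quals) < min_mean_q:
        return False
    for i in range(len(quals) - win + 1):
        if sum(quals[i:i + win]) / win < win_min_q:
            return False
    return True

print(passes_filters("ACGT" * 25, [30] * 100))            # True
print(passes_filters("ACGT" * 25, [30] * 90 + [5] * 10))  # False: bad 3' window
```

Every one of those parameters could plausibly be moved up or down, which is why I'm hoping there is some established practice to lean on.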
Since this is such a common task, one would expect some standard practices to have emerged by now; at the least, there should be plenty of people with experience making these kinds of decisions. So does anyone have any opinions on this, or possibly links to other resources on the topic? Many thanks in advance.