We just tested Bowtie2 for reads mapping multiple times in a genome:
1 . We took one read with multiple (perfect) mapping loci in the genome (hg19):
FASTQ entry:
UCSC - BLAT:
2 . We created a fastq file containing 1,000 times this entry.
3 . We started bowtie2 with the following command:
4 . bowtie2 claims in its manual:
5 . BUT:
=> For all reads bowtie2 reports the same location (chr1:11635-11735) in the SAM output
=> Using default parameters, bowtie2 does NOT select the reported mapping randomly
------------------------------------------------
Why is that a problem?
Two examples:
Assume you analyze a small RNA-seq experiment to quantify microRNAs. Now you have an example like hsa-let-7a. The 5' arm of this microRNA (hsa-let-7a-5p) occurs three times in the human genome, while the 3' arm (hsa-let-7a-3p) is unqiue. When you now use bowtie2 for mapping (without knowing the effect shown above), you might think, that your hsa-let-7a-5p reads are randomly distributed over these three loci, normalizing the "expression" in a way. But that is not the case. Instead one of the three let-7 microRNAs gets all reads and you think it is highly expressed. But is it?
The same problem occurs when doing transcriptome sequencing. It might happen, that all the reads of a gene having pseudogene copy in the genome, map to the latter, resulting in no expression at all. Is it not expressed?
TIP: You can and should force bowtie2 to report all multiple loci (-k or -a parameter). Do not use the bowtie2 default parameter for NGS read mapping!
Also follow the disccusion here
Thanks to Christian Otto (transcriptome bioinformatics group - University Leipzig)!
1 . We took one read with multiple (perfect) mapping loci in the genome (hg19):
FASTQ entry:
Code:
@test_sequence CAAAGTGTTTACTTTTGGATTTTTGCCAGTCTAACAGGTGAAGCCCTGGAGATTCTTATTAGTGATTTGGGCTGGGGCCTGGCCATGTGTATTTTTTTAAA + IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
Code:
browser details YourSeq 101 1 101 101 100.0% 2 - 114359280 114359380 101 browser details YourSeq 101 1 101 101 100.0% 15 - 102519435 102519535 101 browser details YourSeq 101 1 101 101 100.0% 1 + 11635 11735 101
3 . We started bowtie2 with the following command:
Code:
bowtie2 -x hg19 -U test.fq -S test.sam
By default, bowtie2 searches for distinct, valid alignments for each read. When it finds a valid alignment, it continues looking for alignments that are nearly as good or better. The best alignment found is reported (randomly selected from among best if tied).
=> For all reads bowtie2 reports the same location (chr1:11635-11735) in the SAM output
=> Using default parameters, bowtie2 does NOT select the reported mapping randomly
------------------------------------------------
Why is that a problem?
Two examples:
Assume you analyze a small RNA-seq experiment to quantify microRNAs. Now you have an example like hsa-let-7a. The 5' arm of this microRNA (hsa-let-7a-5p) occurs three times in the human genome, while the 3' arm (hsa-let-7a-3p) is unqiue. When you now use bowtie2 for mapping (without knowing the effect shown above), you might think, that your hsa-let-7a-5p reads are randomly distributed over these three loci, normalizing the "expression" in a way. But that is not the case. Instead one of the three let-7 microRNAs gets all reads and you think it is highly expressed. But is it?
The same problem occurs when doing transcriptome sequencing. It might happen, that all the reads of a gene having pseudogene copy in the genome, map to the latter, resulting in no expression at all. Is it not expressed?
TIP: You can and should force bowtie2 to report all multiple loci (-k or -a parameter). Do not use the bowtie2 default parameter for NGS read mapping!
Also follow the disccusion here
Thanks to Christian Otto (transcriptome bioinformatics group - University Leipzig)!
Comment