Hi,
I am wondering whether trimming is necessary if using BFAST for aligning SOLiD reads? BFAST has been pretty powerful in mapping cs reads accurately while tolerating mismatches and indels - ~90% of my reads (single-end, 75bp, from genomic DNA) were uniquely mapped with best alignment score.
However, the base quality drops dramatically at 3' end of the SOLiD reads. Here are the base quality distributions of raw reads (csfastq format) and mapped reads (bam format)
- Raw reads: I suspect it detects the wrong quality format - Illumina not Sanger - though (not sure)
- BFAST mapped reads: I think this time it detects the correct quality format - Sanger
If mapping without trimming, the low-quality 3' end of the reads are mapped, but with lots of mismatches. Here is a snapshot from the bam file in GenomeView (all mismatches are at the 3' end of the mapped reads)
- green: Read mapped to the forward strand
- blue: Read mapped to the reverse strand
I am debating whether to trim the reads before mapping, in the purpose of accurate multi-sample variant calling, as my preliminary analysis (without re-align or re-cal) using varscan returned lots of false positive snps, which are located in the the noisy mapping regions....on the other hand, I would like not to throw out reads that are indeed mapped (even with mismatches).
here are the thresholds that I set for variant calling:
Any help is highly appreciated! Much thanks in advance.
Sonia
I am wondering whether trimming is necessary if using BFAST for aligning SOLiD reads? BFAST has been pretty powerful in mapping cs reads accurately while tolerating mismatches and indels - ~90% of my reads (single-end, 75bp, from genomic DNA) were uniquely mapped with best alignment score.
However, the base quality drops dramatically at 3' end of the SOLiD reads. Here are the base quality distributions of raw reads (csfastq format) and mapped reads (bam format)
- Raw reads: I suspect it detects the wrong quality format - Illumina not Sanger - though (not sure)
Code:
fastqc $reads.csfastq
- BFAST mapped reads: I think this time it detects the correct quality format - Sanger
Code:
fastqc $reads.mapped.srt.rmdup.bam
If mapping without trimming, the low-quality 3' end of the reads are mapped, but with lots of mismatches. Here is a snapshot from the bam file in GenomeView (all mismatches are at the 3' end of the mapped reads)
- green: Read mapped to the forward strand
- blue: Read mapped to the reverse strand
I am debating whether to trim the reads before mapping, in the purpose of accurate multi-sample variant calling, as my preliminary analysis (without re-align or re-cal) using varscan returned lots of false positive snps, which are located in the the noisy mapping regions....on the other hand, I would like not to throw out reads that are indeed mapped (even with mismatches).
here are the thresholds that I set for variant calling:
Code:
java -Xmx4g -jar VarScan.v2.2.11.jar mpileup2snp \ $reads.mapped.srt.rmdup.merged.mpileup \ --min-coverage 8 \ --min-reads2 2 \ --min-var-freq 0.2 \ --min-avg-qual 20 \ --p-value 0.01 \ > $reads.snp
Sonia
Comment