I am quite new to exome sequencing, so please forgive me if I am clouding the forum with basic questions.
I am working with Illumina paired-end data (100 bp), and since we use Illumina's pipeline 1.7, I installed BWA version 0.5.9b, which can handle Illumina quality scores.
A few questions though:
1) Trimming reads with BWA
- The online BWA manual says the following about bwa aln -q:
Parameter for read trimming. BWA trims a read down to argmax_x{\sum_{i=x+1}^l(INT-q_i)} if q_l<INT where l is the original read length. [0]
- The description of bwa aln -q printed by the program itself says:
quality threshold for read trimming down to 35bp [0]
a) Since I don't know what all the variables in the BWA manual stand for, I was wondering whether both descriptions mean the same thing. (I have my doubts about that.)
b) Trimming might be a wise thing to do, but do I really want to trim my 100 bp reads down to 35 bp? It sounds to me like I might be losing too much valuable data.
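One possible reading of the manual's formula, written out as a minimal Python sketch: q_i is taken to be the Phred quality of the i-th base, l the original read length, INT the value given to -q, and x the number of bases kept. Treating the "35bp" in the built-in help as a lower bound on the trimmed length is an assumption here, not something the quoted manual text states, and the function name and the min_len parameter are hypothetical. The real implementation may differ in details such as tie-breaking or early stopping.

```python
def trim_length(quals, threshold, min_len=35):
    """Return the number of bases kept after 3'-end quality trimming.

    quals     -- Phred quality scores as integers, one per base
    threshold -- the INT passed to -q
    min_len   -- ASSUMED floor on the trimmed length (hypothetical parameter)
    """
    l = len(quals)
    # No trimming if it is disabled or the last base is already good enough.
    if threshold < 1 or l == 0 or quals[-1] >= threshold:
        return l

    best_sum = 0   # an empty sum (x = l, nothing trimmed) scores 0
    best_x = l
    running = 0
    # x is the candidate number of bases to keep; the sum in the formula
    # runs over the trimmed bases x+1..l (1-based), i.e. quals[x:] here.
    for x in range(l - 1, min_len - 1, -1):
        running += threshold - quals[x]
        if running > best_sum:   # strict '>' keeps the longer read on ties
            best_sum = running
            best_x = x
    return best_x
```

Under these assumptions, a 50 bp read with 40 good bases (Q30) followed by 10 bad ones (Q2) and a threshold of 20 would be cut to 40 bp, while a 100 bp read consisting entirely of Q2 bases would be cut to 35 bp and no further. In this reading, 35 bp is the worst case rather than the length every read ends up with.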
2) Examining the fastq files of three exomes, I noticed a peculiar yet consistent anomaly. Looking at the Q2/B-flagged read ends, I find that ~50,000 reads in the first fastq file are entirely flagged, compared with ~2,000,000 in the second one. This 40x difference is seen in all exomes. Does anyone have any thoughts on what could explain these differences?
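For anyone wanting to reproduce this kind of count, here is a minimal Python sketch. It assumes plain four-line fastq records (no wrapped sequence or quality lines) and Illumina 1.5+ style qualities, where 'B' marks Q2. The function name is hypothetical, and this is not necessarily how the numbers above were obtained.

```python
def count_fully_b_reads(handle):
    """Count reads whose quality string consists only of 'B' characters.

    handle -- any iterable of fastq lines (e.g. an open file)
    Returns (fully_flagged, total_reads).
    Assumes strict 4-line fastq records.
    """
    total = 0
    fully_b = 0
    for i, line in enumerate(handle):
        if i % 4 == 3:                      # 4th line of each record = qualities
            qual = line.rstrip("\r\n")
            total += 1
            if qual and set(qual) == {"B"}:
                fully_b += 1
    return fully_b, total
```

It could be run once per fastq file, for example `count_fully_b_reads(open("reads_1.fastq"))` (file name hypothetical), to compare the fraction of fully flagged reads between the two files.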
3) After aligning both fastq files (without trimming), I noticed that a lot of the MAPQ scores in the SAM file are Q0, Q29 or Q60. My fastq files definitely use Illumina quality scores, and the base qualities in the SAM file are in Sanger encoding, so it seems the -I option did its job. Yet the MAPQ distribution looks very random to me. When I compare it between exomes, however, I see the same pattern (see attachment), which means it is not random.
I cannot explain this phenomenon and I was wondering if someone else has any thoughts on this?
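A distribution like the one in the attachment can be tabulated with a short script. The following is a minimal Python sketch, assuming an uncompressed, tab-delimited SAM file in which MAPQ is the fifth column and header lines start with '@'; the function name is hypothetical.

```python
from collections import Counter

def mapq_histogram(sam_lines):
    """Tally MAPQ values (5th SAM column) over all alignment lines."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):            # skip header lines
            continue
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 11:                # not a complete alignment record
            continue
        counts[int(fields[4])] += 1
    return counts
```

Printing `sorted(mapq_histogram(open("aln.sam")).items())` (file name hypothetical) for each exome gives tables that can be compared side by side.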
Thank you for your time!