No announcement yet.

BWA and Illumina data/results

  • Filter
  • Time
  • Show
Clear All
new posts

  • BWA and Illumina data/results

    I am quite new to exome sequencing, so please forgive me if I am clouding the forum with basic questions.
    I am working with Illumina PE data (100bp) and since we use Illumina's pipeline 1.7, I installed BWA version 0.5.9b, which can deal with Illumina quality scoring.

    A few questions though:

    1) Trimming reads with BWA
    - The BWA manual on internet tells me the following on using BWA aln -q:
    Parameter for read trimming. BWA trims a read down to argmax_x{\sum_{i=x+1}^l(INT-q_i)} if q_l<INT where l is the original read length. [0]
    - Looking at the description of BWA aln -q within the program itself, it tells me:
    quality threshold for read trimming down to 35bp [0]

    a) since I don't know what all variables stand for in the BWA manual, I was wondering if both description mean the same thing? (I have my doubts about that)
    b) Trimming might be a wise thing to do, but do I really want to trim down my 100bp reads to 35bp? Sounds to me like I might be loosing too much valuable data.

    2) Examining my fastq-files of three exomes I noticed a peculiar yet consequent anomaly. If I am looking at the Q2/B-flagged read-ends, I find that ~50,000 reads in the first fastq-file are entirely flagged and ~2,000,000 in the second one. This 40x difference is seen in all exomes. Does anyone have any thoughts on what could explain these differences?

    3) After aligning both fastq-files (without trimming) I noticed a lot of MAPQ scores within the sam-file are Q0, Q29 or Q60. My fastq-files are definitely Illumina-scores and looking at the scoring in the sam file, I see Sanger scoring, so it seems the -I did its job. Yet I see a MAPQ distribution that looks very random to me, but when I compare it between exomes, I see the same randomness (see attachment) which means it is not random.
    I cannot explain this phenomenon and I was wondering if someone else has any thoughts on this?

    Thank you for your time!
    Attached Files
    Last edited by Papillon; 04-11-2011, 12:34 PM.

  • #2
    MapQ is is a measure of how sure the aligner is that the read is in the right position. If read quality is low, then the read could be falsely placed.

    But if the underlying genome is non-complex ore repetative at that position, then there might also be amibiguities with other non-complex regions. A mapQ of 0 suggests a perfect duplication in the underlying genome, not that there's anything wrong with your read quality


    • #3
      Thank you for taking the time to share your knowledge and experience. A youngster like myself has to start somewhere and this forum is a very valuable place for that.

      But on topic...

      I thought that maybe the read quality score was involved in mapping a read to the reference genome (and determine the mapping quality), since I can imagine that parts that don't match to the reference, but have a low read quality score, don't count as heavily as high quality reads in a similar situation.

      But I guess the Q0 and Q60 can be explained, although I didn't thought of ambiguity as a cause for Q0. Looking at the number of Q0-reads, those could be responsible for quite some false negatives. Ouch!
      The high amount of Q29 reads puzzled and still puzzles me most of all. Any thoughts on that would be appreciated.

      Thanks a bunch!


      • #4
        I recently extracted from my BWA aligned BAM file all reads that were trimmed from 5'-end for at least 8 bp. This is (please correct me if I'm wrong) indicative of the fact that the read could be aligned to the reference genome only using SW using its mate as an anchor. Majority of these reads turned out to be Q29. So my guess is that the highest score a read can get if it was aligned to the genome this way is 29. The scale is in Sanger quality format.


        • #5
          Thank you for your answer!

          I am not quite sure if I understand what you are saying, so please forgive me (and correct me) if I make miss-assumtions or when I am asking the wrong questions.

          Do I understand correctly that reads that need help from their mate to get aligned, will get a Q29 score and are therefor overrepresented? Maybe because a read is too short after trimming to align properly on its own or maybe because one of the reads is aligned to an repetitive fragment of the exome/genome.

          In regards to my first post:

          1) Is it safe to assume BWA does hard trimming these days?
          2) Could the 40-fold difference in reads that are entirely Q2/B-flagged be explained by re-synthesis of the library? The strange part is that I do not see any significant differences between the first reads of the PE and the second reads that are only partly flagged from 3'. It's all or nothing.

          Thank you very much!
          Last edited by Papillon; 04-12-2011, 09:26 AM.


          • #6
            Yeah I think at least in my case the Q29 reads got overrepresented because in my set I only had reads that needed help from their mate to be mapped. Rather than read been too short, I think a more likely reason that a read needs its mate's help is that it contains too many mismatches to the reference genome at the position where it originated from. Therefore it cannot be aligned to the reference genome "by itself".

            1) I don't think BWA does hard trimming. Rather, you can opt for dynamic trimming where the reads are trimmed according to the formula you showed in your first post. In dynamic trimming, reads are trimmed - based on their base qualities - from 0 bases up to length - 35bp bases of length.

            So say you have a 100nt read, but the 65 bases at the 3'-end are of very bad quality. If you didn't trim the read, you couldn't align the 100nt read at all because of too many mismatches, but if you trimmed the bad quality bases, you might be able to rescue the 35nt good quality bases at the 5'-end of the read and be able to align it.

            2) I've never checked this from my datasets, but an interesting observation! As far as I remember (read it either from Wikipedia or Illumina manual) that Illumina does the Q2 flagging based on their own proprietary algorithm. But at least I don't know why you would see such a big difference.


            • #7
              Thank you so much for your answer! I have checked my data and all the Q29 pairs I looked at, are indeed, in contrast to others, (unique/mate-sw).

              When I look a bit further in the SAM-tag, I notice that in a lot of cases there are no mismatches (or just one), but at the same time there are references to positions of mismatches:

              XT:A:M NM:i:4 SM:i:29 AM:i:29 XM:i:0 XO:i:0 XG:i:0 MD:Z:72A0T0A16T3
              XT:A:U NM:i:1 SM:i:29 AM:i:29 X0:i:1 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:100T0

              Am i missing something?

              I wrote a simple script that does the trimming/removing pairs that get too short (gotta love python), so I hope that will improve my results significantly. Fingers crossed.