Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Low alignment rates with bowtie2 and bwa for a bacterial genome

    Apologies for cross-posting.

    I am getting really low alignment rates while trying to do SNP analysis.

    I sequenced the genome of a bacterial strain. Then conducted an experiment and then sequenced the genome of the strain at the end of the experiment to see what mutations (SNPs) took place during the course of the experiment.

    Bowtie2 gives me the following output:

    #map the reads
    -bash-4.1$ ./bowtie2 -p 1 -x AER -1 S25_R1_001.fastq -2 S25_R2_001.fastq > S25_bowtie2.sam

    #output
    4240966 reads; of these:
    4240966 (100.00%) were paired; of these:
    4240777 (100.00%) aligned concordantly 0 times
    161 (0.00%) aligned concordantly exactly 1 time
    28 (0.00%) aligned concordantly >1 times
    ----
    4240777 pairs aligned concordantly 0 times; of these:
    10902 (0.26%) aligned discordantly 1 time
    ----
    4229875 pairs aligned 0 times concordantly or discordantly; of these:
    8459750 mates make up the pairs; of these:
    8168790 (96.56%) aligned 0 times
    49047 (0.58%) aligned exactly 1 time
    241913 (2.86%) aligned >1 times
    3.69% overall alignment rate

    and bwa gives me the following output (I ran samtools flagstat command to see % overall alignment rate, which is 47% if I understand the output correctly)

    #map the reads
    -bash-4.1$ bwa mem -t 4 AER.fasta S25_R1_001.fastq S25_R2_001.fastq > S25_bwa.sam
    #convert to bam
    -bash-4.1$ ./samtools view -bS S25_bwa.sam > S25_bwa.bam
    #get flagstats
    -bash-4.1$ ./samtools flagstat S25_bwa.bam

    #output
    8816220 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 secondary
    334288 + 0 supplementary
    0 + 0 duplicates
    4153225 + 0 mapped (47.11%:-nan%)
    8481932 + 0 paired in sequencing
    4240966 + 0 read1
    4240966 + 0 read2
    2605074 + 0 properly paired (30.71%:-nan%)
    3513674 + 0 with itself and mate mapped
    305263 + 0 singletons (3.60%:-nan%)
    901042 + 0 with mate mapped to a different chr
    41706 + 0 with mate mapped to a different chr (mapQ>=5)

    I have read that bwa me is generally a very aggressive aligner and that probably explains the 47% rate.

    I am looking into how to extract unmapped reads so that I can blast them to see any contamination issues.
    The fastqc reports look fine (no red crosses), especially after I trim the first ~10 and last ~3 bases. Is there any other quality control I should be doing before mapping? What are all of the reasons for low alignment rates?

  • #2
    I am going to suggest that you try BBMap (I know .. yet another aligner) since it will easily allow you to capture reads that do not map to a file. Leave settings at default values. Only provide a reference index (that you will need to build or build at run time) input files and outputs (to contain the alignment and files for unmapped reads).

    I would also suggest that you grab a sample of reads and do blast @NCBI to figure out if there is a possibility of contamination from an unexpected species.
    Last edited by GenoMax; 03-01-2016, 05:06 PM.

    Comment


    • #3
      Your check for contamination angle is the thing to do.

      You can even *randomly* subsample the unmappeds and paste 'em as fastas into into NCBI web blast just to get your bearings.

      If for instance 20% of them map to another bacteria family, then , yeah, it's contamination.

      Check out their adapter contamination tool, too.

      Comment


      • #4
        Thanks Richard and GenoMax. I will try your suggestions. Somewhere I was hoping that it was not a contamination issue. But let's see. Will post here what I find.

        Comment


        • #5
          I have never used BBMap and I am in the process of learning how to install and use it.
          But I found and did the following to get the unmapped reads from my bwa output

          ./samtools view -b -f 4 S25_bwa.bam > S25_unmapped.bam
          ./samtools bam2fq S25_unmapped.bam > S25_unmapped_bam2fq.fastq

          Fortunately or unfortunately, my unmapped reads blast with the expected genome i.e. contamination with unexpected genome does not seem to be the overwhelming issue.

          I will now see if I can rerun the analysis by checking for and removing adapter contamination but I also read that unmapped reads is not a problem for SNP analysis (my goal) and that I can proceed with just the mapped reads. But if I do that, doesn't the low alignment rate mean that I would be looking at just a very incomplete picture of how SNPs are distributed across the genome?

          Could I just skip assembly with reference and do de-novo assembly? Because I have genomes from time t=0 and time t=end of experiment, would it be possible to make SNPs inferences by comparing the SNPs at initial vs final timepoints?

          I would really appreciate any number of insights!

          Comment


          • #6
            It is good to know that the possibility of contamination is low/not there. But then why is bowtie2 having trouble aligning the reads (I am not sure if bowtie2 defaults are strict and/or your "reference" is different enough to not allow the reads to map). If the second possibility is true then you may indeed want to try and assemble your own reference. Lab specific strains can turn out to be different from the "reference" out in the wild.

            Comment


            • #7
              I tried reassembling the reference and then trying reference based assembly using bowtie2 and bwa-mem. The alignment rates improved to 25% and 70% respectively. The adapter content in fastqc is not marked to be a problematic field.

              When I extract the unmapped reads in fast format, I see a pattern though. The first and the last few unmapped reads are give below (I inserted the names "identifier 1, 2, 3 4, 5, 6 for the purpose of this post). Some reads clearly have lots (or only) string of NNNNNNs (identifier 1, 2,3) and will be unmappable. But there are reads (identifier 4, 5,6) that have no NNNNNs and still don't map. Is the quality too low for these reads, especially if some of them blast to the right genome?

              @Identifier1
              NNNNNCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGNTNTCCATCTTTTNTTTCCTTCGNTNTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
              +
              #####E#####################################################/#E#E//EEAEEE<#E<EE/A//E#E#/############################################

              @Identifier2
              AGATCTCCGGCTTGATGNNTCTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGNNGNNCGNATCGCGCNACTTNTNGTCNTGCGCGGCCAGGGCATCCAGCGTCTGCTTGG
              +
              EEEEEEEEEEEEEEEEE##EEE#######################################E##6##EE#EEEEEEA#EEEE#E#EEE#EEEEEEEEEEEEEEEEEAEEEEEEEEEEEEE

              @Identifier3
              CCGGGCGCTACCACCGGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNANNTNNTNNCANGNGNNGACNNNNATGNGGGNCAGCTGGCCNCTGACGGAGAGCCCCTGCCAGCCGGCAT
              +
              EEEEEEEEEEEEEEEEE############################################E##E##A##EA#E#<##/AA####EEE#EEE#<EEEEEE<E#E<AEEEEEE<EEEEEEE6AEA<AAAA/E

              @Identifier 4
              AACAGCAGGCTCCCGATCGAGTAGCCGGCGCCAAACGAGCAGAGCACGCCGCGGGCCCCCGCGGAAAGGTCGCGATTGTACAGATGGAACGCAATAATGGAACCCGCCGAGCTGGTGTTGCCGTAGCTGTC
              +
              EAEAEE<EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEAEEEEEEEEEEEAAEEEEEEEAEEEE<AEEEEEEEEEEEEEEEEEAE/EEEEAEEEEEE<EEAAEEEEEEAEAEEEEEAAAEAEEEEE<

              @Identifier 5
              CCTTGGCGTAGTCGCCAGCCCGATACCAGGCATTGCCCTGCCAGACGGGATCCGCAAAGGCGCGGGCGGCCTCCTTATAGTGCTTTTGCAAGAAGGCCTGCCAGGCCTGATTGTCATGGGTAG
              +
              EEEEEEEEEEEEEEEEEEAEEEEEAEEEEEE/EEEEEEEEE<EE/EEEAEEEEEEEE6EEEEEEEEEEEA<<EEAEEEAEEEE//<AEEEEEE/EEE6EE<A<EEEA/AA//EE<</EEEEEE

              @Identifier 6
              CAGGCCTGGCAGGCCTTCTTGCAAAAGCACTATAAGGAGGCCGCCCGCGCCTTTGCGGATCCCGTCTGGCAGGGCAATGCCTGGTATCGGGCTGGCGACTACGCCAAGGCGGTCGCCGCCTAT
              +
              EEEEEEEEEEEEEEEEEEEEEEEEE/EEEEEEEAEEEEEEEEEAEEEEEEEEEEEEEEE<EEEEEEEAEEEEEEEEEAE<<<EEEEEE<E<EEEEEE/EEEEEEEA/AEEEEAEEAAA<A/A/

              Thanks very much for your help!

              Comment


              • #8
                Blasting #4 matches 97% to Aeromonas hydrophila YL17 with 4 mismatches.
                (via web blast).

                #5 matches 98% to several Aeromonas hydrophila strains

                Perhaps you can use blast to align your unmapped reads if this is your target genome.

                Is this the target? Is it a contaminant species?

                I think there's some support for sam output and likely there is support for converting blast output to sam from other utilities.

                UCSC blat is also good for checking on problem reads.

                Note also that alignment software often has parameters that you can use which will allow more mismatches and gaps.
                Last edited by Richard Finney; 03-10-2016, 05:35 PM.

                Comment


                • #9
                  As a side comment, you can make bwa mem more sensitive by lowering the minimum score for a read to be outputted (-T option, e.g. try -T 20) and/or by making the seed length shorter (-k option). It will be much slower and memory demanding but for a bacterial genome it shouldn't be a problem.

                  Obviously, if there is a problem with the reads this will not fix it but at least you can get an idea of what's going on...

                  Comment

                  Latest Articles

                  Collapse

                  • seqadmin
                    Best Practices for Single-Cell Sequencing Analysis
                    by seqadmin



                    While isolating and preparing single cells for sequencing was historically the bottleneck, recent technological advancements have shifted the challenge to data analysis. This highlights the rapidly evolving nature of single-cell sequencing. The inherent complexity of single-cell analysis has intensified with the surge in data volume and the incorporation of diverse and more complex datasets. This article explores the challenges in analysis, examines common pitfalls, offers...
                    06-06-2024, 07:15 AM
                  • seqadmin
                    Latest Developments in Precision Medicine
                    by seqadmin



                    Technological advances have led to drastic improvements in the field of precision medicine, enabling more personalized approaches to treatment. This article explores four leading groups that are overcoming many of the challenges of genomic profiling and precision medicine through their innovative platforms and technologies.

                    Somatic Genomics
                    “We have such a tremendous amount of genetic diversity that exists within each of us, and not just between us as individuals,”...
                    05-24-2024, 01:16 PM

                  ad_right_rmr

                  Collapse

                  News

                  Collapse

                  Topics Statistics Last Post
                  Started by seqadmin, Yesterday, 07:49 AM
                  0 responses
                  14 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 06-20-2024, 07:23 AM
                  0 responses
                  14 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 06-17-2024, 06:54 AM
                  0 responses
                  16 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 06-14-2024, 07:24 AM
                  0 responses
                  25 views
                  0 likes
                  Last Post seqadmin  
                  Working...
                  X