Bowtie2 transcriptome mapping issues

  • Bowtie2 transcriptome mapping issues

    Hi all,

    Excuse my newbie-ness, as I typically work within the DNA realm but decided to do an expression analysis for a final project in my Genomics class. Anyway, I have all the reads and the reference transcriptome downloaded, and for part 1 of the analysis I am trying to map the read data back to the reference transcriptome. Here are the commands that I used:

    bowtie2-2.2.4/bowtie2-build 33496_Ahyacinthus_CoralContigs.fasta hg19

    bowtie2-2.2.4/bowtie2 -p 6 -x hg19 06_control_HV_100000.fastq > 6controlHV.sam

    100000 reads; of these:
    100000 (100.00%) were unpaired; of these:
    77030 (77.03%) aligned 0 times
    20885 (20.89%) aligned exactly 1 time
    2085 (2.08%) aligned >1 times
    22.97% overall alignment rate

    As you can see, the overall alignment rate is quite low. I have tried changing various parts of the code with no luck. Am I executing this in the wrong way/is there something that I am missing?

    Thanks in advance.
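
The summary above can be re-derived from its own counts, which is a handy sanity check when comparing runs. This is a minimal sketch, assuming the unpaired-read summary layout shown in the post; the function name `alignment_rate` is made up for illustration.

```shell
# Recompute the overall alignment rate from a bowtie2 summary read on stdin.
# Assumes the unpaired (single-end) summary layout shown in the post above.
alignment_rate() {
  awk '
    /reads; of these:/       { total  = $1 }   # e.g. "100000 reads; of these:"
    /aligned exactly 1 time/ { unique = $1 }   # reads with one reported location
    /aligned >1 times/       { multi  = $1 }   # reads with more than one location
    END {
      if (total > 0)
        printf "%d of %d aligned (%.2f%%)\n", unique + multi, total, 100 * (unique + multi) / total
    }'
}

# bowtie2 prints this summary on stderr, so it can be saved with "2> summary.txt"
# and then checked with:
#   alignment_rate < summary.txt
```

For the run in the post this adds 20885 and 2085 to give 22970 aligned reads out of 100000.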

  • #2
    Check the quality of the bases before aligning.
    You can do so with FastQC.
    If necessary, trim the reads before aligning.

    Also, you seem to have a coral genome, "Ahyacinthus_Coral", yet you named your index hg19, which is a human reference genome.
    This is confusing, so I would change the name of the index.
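
One way to avoid the naming confusion is to derive the index name from the FASTA file itself. The sketch below only prints the two commands rather than running them, so it works without bowtie2 installed; `index_basename` is a hypothetical helper, and the `-U`/`-S` options are simply the explicit forms for the unpaired input and the SAM output.

```shell
# Derive an index basename from a FASTA path instead of reusing "hg19".
index_basename() {
  name="${1##*/}"               # drop any leading directory
  printf '%s\n' "${name%.*}"    # drop the final extension
}

fasta="33496_Ahyacinthus_CoralContigs.fasta"
index="$(index_basename "$fasta")"

# Printed, not executed. Prefix the commands with bowtie2-2.2.4/ if that is
# where the binaries live, as in the original post.
echo "bowtie2-build $fasta $index"
echo "bowtie2 -p 6 -x $index -U 06_control_HV_100000.fastq -S 6controlHV.sam"
```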


    • #3
      I mean that you should check for the presence of adapter sequences. If there are any, they will affect the alignment percentage.
      The actual quality of the bases will also affect the alignment, but to a lesser extent.
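
A rough way to check for adapters without any extra tools is to count the reads that contain the start of a known adapter. This is only a sketch: `count_adapter_reads` is a made-up helper, it assumes plain 4-line FASTQ records, and the fragment in the usage line (the start of the common Illumina TruSeq adapter) is an assumption about how this library was prepared.

```shell
# Count reads whose sequence line contains a given adapter fragment.
# Usage: count_adapter_reads reads.fastq ADAPTER
count_adapter_reads() {
  awk -v adapter="$2" '
    NR % 4 == 2 {                          # sequence line of each 4-line record
      total++
      if (index($0, adapter) > 0) hits++   # exact substring match only
    }
    END { printf "%d of %d reads contain %s\n", hits, total, adapter }
  ' "$1"
}

# Example (adapter fragment is an assumption, not taken from the thread):
#   count_adapter_reads 06_control_HV_100000.fastq AGATCGGAAGAGC
```

An exact substring match misses adapters with sequencing errors or partial adapters at the read end, so a low count here does not rule out contamination.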


      • #4
        As blancha already pointed out, coral sequences will not map well to a human reference (if that is indeed what Kates106 did). Plus, if the reads contain any adapter sequences, the alignment percentage would drop further.


        • #5
          Thanks for the advice.

          I mapped the reads to the coral transcriptome (not the human genome)-sorry for the confusion there. I will use FastQC to look at the quality of the reads. One more question: is there anything else within my bowtie code that could be causing this low alignment rate? (i.e. alignment options -n, -v, etc)...


          • #6
            I didn't see anything inherently wrong with the bowtie code you used. The top three possibilities that I'd consider are:
            1) Preprocessing of the reads to remove adapter sequences and low quality reads.
            2) Poor reference transcriptome. I don't know how well characterized the coral transcriptome is, but it's unlikely to be nearly as complete as with other model organisms.
            3) Library construction. There's lots of abundant non-mRNA RNA species (rRNA etc.) that you generally don't want to sequence and aren't included as part of the reference transcriptome. Different choices made on the sample prep end can have a huge impact on how much "garbage" sequence you get (<1%-95%).

            Options 1 and 3 can be checked by looking at the FastQC report for adapter contamination and overrepresented sequences, respectively. If it's option 2, there's not much you can do. Maybe run a program like Trinity to try to assemble your transcriptome de novo...
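
As a cross-check on the overrepresented-sequences idea (option 3), the most frequent exact read sequences can be tallied directly from the FASTQ file. This is a sketch, not a replacement for the FastQC module; `top_sequences` is a hypothetical helper and assumes 4-line FASTQ records.

```shell
# List the most frequent read sequences in a FASTQ file as "count<TAB>sequence".
# Usage: top_sequences reads.fastq [N]     (N defaults to 10)
top_sequences() {
  awk 'NR % 4 == 2' "$1" |          # keep only the sequence lines
    sort | uniq -c |                # count identical sequences
    sort -k1,1nr -k2,2 |            # highest count first, ties by sequence
    head -n "${2:-10}" |
    awk '{ print $1 "\t" $2 }'      # normalize the padding added by uniq -c
}
```

The top few sequences can then be pasted into BLAST to see whether they look like rRNA or some other unwanted RNA species.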


            • #7
              @Kates106: Since this is a class project perhaps you chose a bad dataset (unless it was pre-selected by the instructor for you). Any possibility that you can go back and choose a different dataset for this analysis?


              • #8
                I checked the quality of all the transcriptome data. The majority produced warnings in "per base sequence content" and "kmer content", along with various "overrepresented sequences". I think at this point the best course of action would be to find another dataset or ask my professor for one. I chose the dataset, but he approved it... but he has been very understanding.

                Thanks again for all of your suggestions/help/advice


                • #9
                  Before you give up on this dataset, take a few minutes and run it through BBDuk to see if it cleans out some of the adapters etc. You may need to find out what kind of adapters (TruSeq or Nextera) were used. If you can't figure it out, post the SRA accession # (I assume you got the data from there) and someone can help.

                  "Warnings" on FastQC do not necessarily indicate a bad dataset. Post graphs from your analysis if you need help with those.


                  • #10
                    http://www.pnas.org/content/suppl/20...nameddest=STXT

                    Here are the supplemental methods from the paper.

                    http://www.ncbi.nlm.nih.gov/bioproje...rm=PRJNA177515

                    Here is the link to the dataset used; however, my professor cut down the reads for the control/heated treatments, and I am only using a total of 12 samples: 3 control and 3 heated for MV corals, and 3 control and 3 heated for HV corals.

                    I think one of the errors is due to the "random hexamer primers" that were used...?

                    Currently working in BWA using their EXACT methods (my professor suggested bowtie2...)
                    Last edited by Kates106; 12-02-2014, 11:24 AM.


                    • #11
                      After a very quick read through the methods you posted (10 min and a cup of coffee, so don't hold me to it), it seems like your bowtie results are actually what you'd expect. The methods imply that lots of additional species are represented in the sequence data, and the authors did a ton of filtering of their assembled contigs to retain only those from coral. Since you're only aligning to their assembled transcriptome, reads coming from those other species shouldn't align (all coral reads should map, since they assembled the transcriptome from the very reads you're using). From the description of figure S1, it sounds like non-coral makes up the majority of the reads, since only ~20% of their assembled contigs were designated as coral, which is actually in very good agreement with your alignment rate. You could always check this by collecting the unaligned sequences and running a quick BLAST search on some of them to see if they hit non-coral, just to satisfy curiosity.
                      "A total of 220,233 individual contigs were assembled from the data, incorporating 64.71% of the filtered sequences (Table S2). Of these contigs, 41,709 (18.9%) were putatively identified as coral in origin via nucleotide similarity to known Cnidarian sequence resources (larval Acropora ESTs and sequenced Cnidarian genomes) and subsequently metaassembled into our final reference transcriptome of 33,496 contigs (N50 = 529 totaling 14.9 Mb; Table S2)."
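
The BLAST spot check suggested above needs a handful of unaligned reads in FASTA form. A minimal sketch, assuming a single-end SAM file like 6controlHV.sam from the first post; `unaligned_to_fasta` is a made-up helper name. bowtie2 also has a `--un` option that writes unaligned unpaired reads to a file during alignment, which would avoid this step on a re-run.

```shell
# Write the first N unaligned reads from a SAM file as FASTA records.
# Usage: unaligned_to_fasta alignments.sam [N]     (N defaults to 20)
unaligned_to_fasta() {
  awk -F '\t' -v max="${2:-20}" '
    /^@/ { next }                  # skip SAM header lines
    int($2 / 4) % 2 == 1 {         # FLAG bit 0x4 set: read is unmapped
      print ">" $1                 # QNAME becomes the FASTA header
      print $10                    # SEQ column
      if (++n >= max) exit
    }' "$1"
}

# Example:
#   unaligned_to_fasta 6controlHV.sam 20 > unaligned_sample.fasta
```

This takes the first unaligned reads in file order rather than a random sample, which is usually fine for a curiosity check.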


                      • #12
                        Should have spent that extra minute to read on to figure S2. Even the authors only saw a 13% alignment rate to the assembled transcriptome:
                        "Alignment of 395.93 million sequences from 31 samples (16 control and 15 heat stressed corals; n = 16 individuals; range: 1.98–22.35 million reads per sample) produced 53.96 million (13.63%) unambiguously aligned coral sequences"
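
The percentages quoted from the paper can be re-derived from the raw counts and put next to the bowtie2 run from the first post. A small sketch; `pct` is a hypothetical helper, and all numbers are copied from the quotes and the alignment summary in this thread.

```shell
# Percentage helper: pct PART WHOLE, printed with two decimals.
pct() {
  awk -v part="$1" -v whole="$2" 'BEGIN { printf "%.2f\n", 100 * part / whole }'
}

pct 41709 220233    # contigs identified as coral, out of all assembled contigs
pct 53.96 395.93    # unambiguously aligned reads (millions), out of all reads
pct 22970 100000    # aligned reads in the bowtie2 run from the first post
```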


                        • #13
                          Yeah I noticed that too after reading through the methods again. All good news- I will continue with the analysis...

                          I am learning a lot, which is great. It is interesting to see exactly how each algorithm affects mapping. For the purpose of this project I am only looking at reads that map in one place, but I am wondering what you would do if a read maps in multiple places? I would assume that you would first pick the best match, but if there isn't one, how do you know which one to pick? Wouldn't that indicate isoforms...? Just out of curiosity...

                          I haven't done any mapping or assembly work, but I am considering doing some assembling with this dataset once I finish this project.


                          • #14
                            For reads that map in multiple places with equal scores, it's common to either throw them away, or pick one location at random, or keep all mapping locations. None of these is ideal, but that's the inevitable result of using short reads.

                            Typically, you will get a higher rate of unique mapping if you map to the genome rather than the transcriptome, because a sequence shared by alternative isoforms is represented only once in the genome.
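
To see how many reads in a bowtie2 SAM file fall into the equal-score case described above, the optional AS:i (score of the reported alignment) and XS:i (score of the best other alignment found, present only if one was found) tags can be compared. This is a sketch under those assumptions for single-end output in bowtie2's default reporting mode; `mapping_classes` is a made-up helper name.

```shell
# Tally aligned reads by whether bowtie2 found a competing alignment.
# Usage: mapping_classes alignments.sam
mapping_classes() {
  awk -F '\t' '
    /^@/ { next }                       # skip SAM header lines
    int($2 / 4) % 2 == 1 { next }       # FLAG bit 0x4 set: unaligned, ignore
    {
      best = ""; other = ""
      for (i = 12; i <= NF; i++) {      # optional tags start at column 12
        if ($i ~ /^AS:i:/) best  = substr($i, 6)
        if ($i ~ /^XS:i:/) other = substr($i, 6)
      }
      if (other == "")        unique++        # no other alignment found
      else if (other == best) tied++          # equal scores: placement is arbitrary
      else                    second_best++   # another hit exists but scores differ
    }
    END { printf "unique=%d tied=%d second_best=%d\n", unique, tied, second_best }
  ' "$1"
}
```

Reads in the `tied` class are the ones where throwing them away, picking one location at random, or keeping every location gives different results downstream.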


                            • #15
                              Interesting...thanks!
