
**blancha** · 12-01-2014, 10:46 AM

Check the quality of the bases before aligning.
You can do so with FastQC.
If necessary, trim the reads before aligning.

Also, you seem to have a coral genome, "Ahyacinthus_Coral", yet you named your index hg19, which is a human reference genome.
This is confusing, so I would change the name of the index.

**blancha** · 12-01-2014, 12:51 PM

I mean that you should check for the presence of adapter sequences. If there are any, they will affect the alignment percentage.
The actual quality of the bases will also affect the alignment, but to a lesser extent.
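As a rough illustration of the adapter check described above, here is a minimal Python sketch that estimates what fraction of reads contain an adapter. It is not a substitute for FastQC. The adapter string is an assumption (the start of the common Illumina TruSeq adapter) and should be replaced with whatever matches the actual library prep; the toy reads are made up.

```python
# Assumption: the first 13 bases of the Illumina TruSeq adapter. Swap in the
# sequence that matches the actual library prep.
TRUSEQ_PREFIX = "AGATCGGAAGAGC"


def read_fastq_sequences(lines):
    """Yield the sequence line of each 4-line FASTQ record."""
    for i, line in enumerate(lines):
        if i % 4 == 1:  # line 2 of every record holds the bases
            yield line.strip()


def adapter_fraction(sequences, adapter=TRUSEQ_PREFIX):
    """Return the fraction of reads that contain the adapter prefix."""
    total = 0
    hits = 0
    for seq in sequences:
        total += 1
        if adapter in seq.upper():
            hits += 1
    return hits / total if total else 0.0


# Toy FASTQ with two made-up reads; the second one runs into the adapter.
toy_fastq = [
    "@read1", "ACGTACGTACGTACGTACGT", "+", "IIIIIIIIIIIIIIIIIIII",
    "@read2", "ACGTACGTAGATCGGAAGAGCACA", "+", "IIIIIIIIIIIIIIIIIIIIIIII",
]
print(adapter_fraction(read_fastq_sequences(toy_fastq)))
```

For a real file, the same function can be fed an open file handle instead of the toy list. Only exact matches are counted, so adapters with sequencing errors or partial adapters at the very end of a read are missed.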

**GenoMax** · 12-01-2014, 03:30 PM

As blancha already pointed out, coral sequences will not map well to a human reference (if that is indeed what Kates106 did). In addition, if the reads contain any adapter sequences, the alignment percentage will drop further.

**Kates106** · 12-01-2014, 03:57 PM

Thanks for the advice.

I mapped the reads to the coral transcriptome (not the human genome). Sorry for the confusion there. I will use FastQC to look at the quality of the reads. One more question: is there anything else within my bowtie code that could be causing this low alignment rate (e.g. alignment options -n, -v, etc.)?

**cmbetts** · 12-01-2014, 04:48 PM

I didn't see anything inherently wrong with the bowtie code you used. The top three possibilities that I'd consider are:
1) Preprocessing of the reads to remove adapter sequences and low quality reads.
2) Poor reference transcriptome. I don't know how well characterized the coral transcriptome is, but it's unlikely to be nearly as complete as with other model organisms.
3) Library construction. There's lots of abundant non-mRNA RNA species (rRNA etc.) that you generally don't want to sequence and aren't included as part of the reference transcriptome. Different choices made on the sample prep end can have a huge impact on how much "garbage" sequence you get (<1%-95%).

Possibilities 1 and 3 can be assessed by looking at the FastQC data for adapter contamination and overrepresented sequences, respectively. If it's possibility 2, there's not much you can do. Maybe run a program like Trinity to try to assemble your transcriptome de novo...

**GenoMax** · 12-01-2014, 04:58 PM

@Kates106: Since this is a class project perhaps you chose a bad dataset (unless it was pre-selected by the instructor for you). Any possibility that you can go back and choose a different dataset for this analysis?

**Kates106** · 12-02-2014, 10:26 AM

I checked the quality of all the transcriptome data. The majority produced warnings for "per base sequence content" and "kmer content," along with various "overrepresented sequences." I think that at this point the best course of action would be to find another dataset or ask my professor for one. I chose the dataset, but he approved it... But he has been very understanding.

Thanks again for all of your suggestions/help/advice

**GenoMax** · 12-02-2014, 10:32 AM

Before you give up on this dataset, take a few minutes and run it through BBDuk to see if it cleans out some of the adapters etc. You may need to find out what kind of adapters (TruSeq or Nextera) were used. If you can't figure it out, post the SRA accession # (I assume you got the data from there) and someone can help.

"Warnings" on FastQC do not necessarily indicate a bad dataset. Post graphs from your analysis if you need help with those.

**Kates106** · 12-02-2014, 11:10 AM


http://www.pnas.org/content/suppl/2013/01/02/1210224110.DCSupplemental/pnas.201210224SI.pdf#nameddest=STXT

Here are the supplemental methods from the paper.

Acropora hyacinthus (ID 177515) - BioProject - NCBI

http://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA177515


Here is the link to the dataset used; however, my professor cut down the reads for the control/heated treatments, and I am only using a total of 12: 3 control and 3 heated for MV corals, and 3 control and 3 heated for HV corals.

I think one of the errors is due to the "random hexamer primers" that were used...?

Currently working in BWA using their EXACT methods (my professor suggested bowtie2...)

**cmbetts** · 12-02-2014, 12:09 PM

After a very quick read through the methods you posted (10 min and a cup of coffee, so don't hold me to it), it seems like your bowtie results are actually what you'd expect. The methods imply that lots of additional species were represented in the sequence data, and the authors did a ton of filtering of their assembled contigs to retain only those from coral. Since you're only aligning to their assembled transcriptome, reads coming from those other species shouldn't align (all coral reads should map, since they assembled the transcriptome from the very reads you're using). From the description of Figure S1, it sounds like non-coral sequence makes up the majority of the reads, since only ~20% of their assembled contigs were designated as coral, which is actually in very good agreement with your alignment rate. To satisfy curiosity, you could do a quick check of this by collecting the unaligned sequences, running a BLAST search on some of them, and seeing whether they hit non-coral species.

"A total of 220,233 individual contigs were assembled from the data, incorporating 64.71% of the filtered sequences (Table S2). Of these contigs, 41,709 (18.9%) were putatively identified as coral in origin via nucleotide similarity to known Cnidarian sequence resources (larval Acropora ESTs and sequenced Cnidarian genomes) and subsequently metaassembled into our final reference transcriptome of 33,496 contigs (N50 = 529 totaling 14.9 Mb; Table S2)."
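For the BLAST spot check suggested above, one possible way to pull out unaligned reads is sketched below in Python. It assumes the aligner output is in SAM format and relies on the SAM flag bit 0x4 ("segment unmapped"); the function name and the toy records are made up for illustration. Bowtie and bowtie2 can also write unaligned reads directly with their `--un` option, which may be simpler.

```python
def unmapped_to_fasta(sam_lines, limit=100):
    """Return FASTA text for up to `limit` unmapped reads found in SAM lines."""
    records = []
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # a SAM record has 11 mandatory columns
            continue
        name, flag, seq = fields[0], int(fields[1]), fields[9]
        if flag & 0x4:  # SAM flag bit 0x4 means the read is unmapped
            records.append(f">{name}\n{seq}")
            if len(records) >= limit:
                break
    return "\n".join(records)


# Toy SAM input (made-up reads): r1 aligned, r2 unaligned.
toy_sam = [
    "@HD\tVN:1.0\tSO:unsorted",
    "r1\t0\tcontig1\t10\t42\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tTTGACCGA\tIIIIIIII",
]
print(unmapped_to_fasta(toy_sam))
```

The resulting FASTA text can be pasted into a BLAST search. A small `limit` keeps the check quick, since only a handful of reads is needed to see which organisms they hit.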

**cmbetts** · 12-02-2014, 12:13 PM

Should have spent that extra minute to read on to Figure S2. Even the authors only saw a 13% alignment rate to the assembled transcriptome:

"Alignment of 395.93 million sequences from 31 samples (16 control and 15 heat stressed corals; n = 16 individuals; range: 1.98–22.35 million reads per sample) produced 53.96 million (13.63%) unambiguously aligned coral sequences"
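The two percentages quoted from the paper can be reproduced from the counts given in the same passages. A quick Python check, using only the numbers quoted above (the variable names are just for illustration):

```python
# Counts quoted from the paper's supplemental methods.
coral_contigs = 41_709
all_contigs = 220_233
aligned_millions = 53.96
total_millions = 395.93

# Share of assembled contigs identified as coral.
contig_pct = 100 * coral_contigs / all_contigs
# Share of all sequences that aligned unambiguously to the coral transcriptome.
aligned_pct = 100 * aligned_millions / total_millions

print(f"coral contigs: {contig_pct:.1f}%")
print(f"unambiguously aligned reads: {aligned_pct:.2f}%")
```

This gives 18.9% and 13.63%, matching the paper, so a low alignment rate against this transcriptome is in line with what the authors reported.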

**Kates106** · 12-02-2014, 12:35 PM

Yeah I noticed that too after reading through the methods again. All good news- I will continue with the analysis...

I am learning a lot, which is great. It is interesting to see exactly how each algorithm affects mapping. For the purpose of this project I am only looking at reads that map in one place, but I am wondering what you would do if a read maps in multiple places. I would assume that you would first pick the best match, but if there is no single best match, how do you know which one to pick? Wouldn't that indicate isoforms...? Just out of curiosity.

I haven't done any mapping or assembly work, but I am considering doing some assembling with this dataset once I finish this project.

**Brian Bushnell** · 12-02-2014, 02:56 PM

For reads that map in multiple places with equal scores, it's common to either throw them away, or pick one location at random, or keep all mapping locations. None of these is ideal, but that's the inevitable result of using short reads.
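The three options can be sketched in a few lines of Python. This is only an illustration of the decision logic, not how any particular aligner implements it; the function name, the strategy labels, and the toy hits are all made up.

```python
import random


def resolve_multimappers(hits, strategy, rng=None):
    """Pick mapping locations for one read.

    hits: list of (location, score) tuples, higher score is better.
    strategy: "discard", "random", or "all"; only used when several
    locations tie for the best score.
    """
    if not hits:
        return []
    best = max(score for _, score in hits)
    top = [loc for loc, score in hits if score == best]
    if len(top) == 1:  # a unique best hit needs no tie-breaking
        return top
    if strategy == "discard":
        return []
    if strategy == "random":
        rng = rng or random.Random()
        return [rng.choice(top)]
    if strategy == "all":
        return top
    raise ValueError(f"unknown strategy: {strategy}")


# Toy read with two equally good hits and one worse hit.
hits = [("contigA:120", 40), ("contigB:57", 40), ("contigC:9", 31)]
for strategy in ("discard", "random", "all"):
    print(strategy, resolve_multimappers(hits, strategy))
```

Passing a seeded `random.Random` as `rng` makes the "random" choice reproducible between runs.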

Typically, you will have a higher rate of unique mapping if you map to the genome rather than the transcriptome, because a sequence shared by alternative isoforms is represented only once in the genome but once per isoform in the transcriptome.
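A toy Python example of that effect, with made-up exon and intron sequences: a read from an exon shared by two isoforms has one exact hit in the genome but two in the transcriptome.

```python
# Made-up sequences for illustration only.
exon1 = "ATGGCCATTGTAATGGGCCG"
exon2 = "CTGAAAGGGTGCCCGATAGT"
exon3 = "TTACGGATCCAGGTTCAAGC"
intron = "GTAAGTTTTTTTTTTTCAG"

genome = exon1 + intron + exon2 + intron + exon3
transcriptome = {
    "isoform_1": exon1 + exon2 + exon3,  # includes every exon
    "isoform_2": exon1 + exon3,          # skips exon2
}


def count_exact_hits(read, reference):
    """Count exact (possibly overlapping) occurrences of read in reference."""
    count = 0
    start = reference.find(read)
    while start != -1:
        count += 1
        start = reference.find(read, start + 1)
    return count


read = exon1[2:18]  # a read that lies entirely within the shared exon1
genome_hits = count_exact_hits(read, genome)
transcriptome_hits = sum(
    count_exact_hits(read, seq) for seq in transcriptome.values()
)
print(genome_hits, transcriptome_hits)
```

Here the read is unique in the genome but ambiguous in the transcriptome. Note that reads spanning an exon junction would need a splice-aware aligner to map to the genome at all, which this exact-match toy does not model.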

**Kates106** · 12-02-2014, 05:57 PM

Interesting...thanks!


Bowtie2 transcriptome mapping issues
