**Simon Anders** · 05-29-2013, 08:52 AM

This is what sdriscoll meant when he said that you are mapping to a transcriptome. You have provided your aligner with a FASTA file that contains not one sequence per chromosome but one sequence per transcript. (Otherwise, how would samtools know where the transcripts are, given that you haven't supplied a GFF file?) This is known as "mapping against the transcriptome", and it is "bad" if you don't know exactly what you are doing, for various reasons that you'll find in old threads here.

**bob-loblaw** · 05-29-2013, 09:07 AM

Ah okay. I misunderstood: I thought sdriscoll was asking whether I was mapping against a reference assembled from transcriptomics data, as opposed to the DNA sequences of predicted proteins from sequenced genomes (which is what I'm using). Sorry for being a "student" and for making a "mistake". I'll correct that post.

**sdriscoll** · 05-29-2013, 09:15 AM

Exactly as Simon said. If you were mapping to a genome reference, then idxstats would return read counts per chromosome. Because your reference is a set of transcripts, it returns counts per transcript instead, and getting from there to gene-level counts is more complicated. A couple of tools for that are eXpress and RSEM, but neither will give you counts at the gene level unless you tell it which reference sequences come from the same gene.
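To make the "which references are from the same gene" point concrete, here is a minimal sketch that sums per-reference counts from `samtools idxstats` into per-gene counts. It assumes the standard four tab-separated idxstats columns (reference name, length, mapped reads, unmapped reads); the function name `gene_counts_from_idxstats`, the `ref_to_gene` mapping and the example values are hypothetical and only for illustration. Note that this only adds up whatever the aligner reported, so it does nothing to resolve reads that could have mapped to several references.

```python
import csv
import io
from collections import defaultdict


def gene_counts_from_idxstats(idxstats_text, ref_to_gene):
    """Sum mapped-read counts per gene from `samtools idxstats` text output.

    ref_to_gene is a user-supplied dict (hypothetical here) mapping each
    reference sequence name to a gene ID. References missing from the
    mapping are reported under their own name.
    """
    totals = defaultdict(int)
    reader = csv.reader(io.StringIO(idxstats_text), delimiter="\t")
    for row in reader:
        if len(row) < 4:
            continue  # skip blank or malformed lines
        ref, _length, mapped, _unmapped = row[:4]
        if ref == "*":
            continue  # the "*" line holds reads with no reference
        totals[ref_to_gene.get(ref, ref)] += int(mapped)
    return dict(totals)


# Made-up example: three transcript-like references, two from one gene.
example = "tx1\t1200\t40\t0\ntx2\t900\t10\t0\ntx3\t1500\t7\t0\n*\t0\t0\t3\n"
mapping = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}
counts = gene_counts_from_idxstats(example, mapping)
print(counts)
```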

Probably the most straightforward approach is to align your reads to a genome reference (full chromosome sequences) with TopHat, or with STAR if you have the RAM for it, and then count hits to genes with something like htseq-count, which finds overlaps between the genomic coordinates of your alignments and the gene features annotated in a GTF file.
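As a rough illustration of what "counting hits to genes by overlap" means, here is a simplified sketch. It is not htseq-count's actual implementation: it ignores strand, treats each gene as a single interval rather than a set of exons, and the function name and the `no_feature` / `ambiguous` labels are chosen for this example only. Coordinates are assumed to be 1-based and inclusive, as in GTF.

```python
def count_reads_per_gene(genes, reads):
    """Count reads per gene by coordinate overlap.

    genes: dict of gene_id -> (chrom, start, end), 1-based inclusive.
    reads: iterable of (chrom, start, end) aligned-read intervals.
    A read is assigned to a gene only if it overlaps exactly one gene;
    otherwise it is tallied as "no_feature" or "ambiguous".
    """
    counts = {gene_id: 0 for gene_id in genes}
    counts["no_feature"] = 0
    counts["ambiguous"] = 0
    for chrom, start, end in reads:
        hits = [
            gene_id
            for gene_id, (g_chrom, g_start, g_end) in genes.items()
            if g_chrom == chrom and start <= g_end and end >= g_start
        ]
        if not hits:
            counts["no_feature"] += 1
        elif len(hits) > 1:
            counts["ambiguous"] += 1
        else:
            counts[hits[0]] += 1
    return counts


# Made-up example: two overlapping genes on one chromosome.
genes = {"geneA": ("chr1", 100, 200), "geneB": ("chr1", 180, 300)}
reads = [
    ("chr1", 110, 150),  # inside geneA only
    ("chr1", 190, 195),  # in the geneA/geneB overlap
    ("chr1", 400, 450),  # outside both genes
    ("chr1", 250, 290),  # inside geneB only
]
overlap_counts = count_reads_per_gene(genes, reads)
print(overlap_counts)
```

The linear scan over all genes per read is fine for a toy example; a real tool would use an interval index.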

**sdriscoll** · 05-29-2013, 09:16 AM

If you really want to align to this database you're using I suggest trying RSEM.

**bob-loblaw** · 05-29-2013, 09:17 AM

I'll look into it. Thank you.

**sdriscoll** · 05-29-2013, 10:24 AM

Also, I don't want to send you down a confusing path. I don't mind providing you with some help to get that pipeline working.

One more thing to consider: do you expect insertions/deletions to be important? If so, then RSEM may not be what you want, since it uses bowtie1 for alignments. eXpress is a similar solution, and with eXpress you can use alignments from bowtie1, bowtie2, bwa (with some tweaking) and really any aligner that can output all possible alignments for a given read. These tools attempt to disambiguate the alignments to a set of gene/protein/transcript sequences, giving you "unique" mappings even for reads that can align equally well to several references.

I've done a bit of benchmarking and honestly I haven't seen great results from eXpress, but RSEM does pretty well. Both work VERY well if you are able to sum the counts of sequences that share exons or share sequence (as in multi-copy genes or alternatively spliced genes). They work OK for per-sequence counts, certainly better than what the aligners can do on their own, but they are not perfect. Just keep in mind that your per-sequence expression values will likely contain some false positives (maybe a lot...) and will also likely be missing a few true positives. In the end, your knowledge of which sequences in your database share sequence or share exons will help you immensely in getting stable and reliable read counts.
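For intuition about how ambiguous reads can be "disambiguated", here is a toy expectation-maximization sketch. It is not the actual model used by RSEM or eXpress (it ignores sequence length, alignment quality, fragment size and everything else those tools account for), and the function name `em_allocate` and the example reads are made up. Each read lists the references it aligns to equally well, and the loop repeatedly splits each read among those references in proportion to the current abundance estimates.

```python
def em_allocate(read_hits, n_iter=50):
    """Fractionally assign multi-mapped reads to references with a toy EM.

    read_hits: list of tuples; each tuple holds the distinct reference
    names one read aligns to. Returns expected read counts per reference.
    """
    refs = sorted({ref for hits in read_hits for ref in hits})
    # Start from equal abundance for every reference.
    theta = {ref: 1.0 / len(refs) for ref in refs}
    counts = {}
    for _ in range(n_iter):
        # E-step: split each read according to current abundances.
        counts = {ref: 0.0 for ref in refs}
        for hits in read_hits:
            total = sum(theta[ref] for ref in hits)
            for ref in hits:
                counts[ref] += theta[ref] / total
        # M-step: re-estimate abundances from the fractional counts.
        n_reads = sum(counts.values())
        theta = {ref: counts[ref] / n_reads for ref in refs}
    return counts


# Made-up example: 10 reads unique to A, 10 reads shared between A and B,
# and no read unique to B.
example_reads = [("A",)] * 10 + [("A", "B")] * 10
expected = em_allocate(example_reads)
print(expected)
```

In this example there is no read that supports B on its own, so the shared reads drift toward A over the iterations. That is also why the aligner has to report all equally good alignments: a read that arrives with only one reported hit looks unique, and there is nothing left to redistribute.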

**bob-loblaw** · 05-31-2013, 05:46 AM

Well, alternatively spliced genes won't be a problem, since it's all bacteria I'm mapping to. I have been playing around with RSEM, but I have a pretty large sample size, and realigning all of the samples would be very time consuming. Obviously I'll do it if necessary, but I'd prefer not to have to. I haven't tried eXpress yet, but I will.

When I was mapping with bowtie2 I left its reporting mode at the default (i.e. report only the best alignment), but eXpress wants to be able to select the best alignment itself. Do you think this will be a big issue?
