**Simon Anders** · 05-29-2013, 08:52 AM

This is what sdriscoll meant when he said that you are mapping to a transcriptome. You have provided your aligner with a FASTA file that contains not one sequence per chromosome but one sequence per transcript. (Otherwise, how would samtools know where the transcripts are, given that you haven't supplied a GFF file?) This is known as "mapping against the transcriptome", and it is "bad" if you don't know exactly what you are doing, for various reasons that you'll find in old threads here.

**bob-loblaw** · 05-29-2013, 09:07 AM

Ah okay. I misunderstood: I thought sdriscoll was asking if I was mapping against a reference assembled from transcriptomics data, as opposed to the DNA sequences of predicted proteins from sequenced genomes (which is what I'm using). Sorry for being a "student" and for making a "mistake". I'll correct that post.

**sdriscoll** · 05-29-2013, 09:15 AM

Exactly as Simon said. If you were mapping to a genome reference, idxstats would return read counts per chromosome. Mapping to a transcriptome reference is definitely more complicated. A couple of tools for that are eXpress and RSEM, but neither will give you gene-level counts unless you tell them which reference sequences come from the same gene.
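To illustrate the "which references are from the same gene" point, here is a minimal Python sketch that sums per-reference counts from `samtools idxstats` output into per-gene counts. It relies on the idxstats columns being reference name, sequence length, mapped count and unmapped count, separated by tabs. The `transcript_to_gene` dictionary and the sequence names are hypothetical; you would have to build that mapping from your own annotation. It is only a rough tally, since it does nothing about reads that could align equally well to several references.

```python
import csv
import io
from collections import defaultdict


def gene_counts_from_idxstats(idxstats_text, transcript_to_gene):
    """Sum the mapped-read column of `samtools idxstats` output per gene.

    transcript_to_gene is a user-supplied (hypothetical) mapping from
    reference sequence name to gene name. References missing from the
    mapping are reported under their own name.
    """
    counts = defaultdict(int)
    reader = csv.reader(io.StringIO(idxstats_text), delimiter="\t")
    for row in reader:
        if len(row) < 4:
            continue  # skip blank or malformed lines
        ref_name, _length, mapped, _unmapped = row[:4]
        if ref_name == "*":
            continue  # summary line for reads without a reference
        gene = transcript_to_gene.get(ref_name, ref_name)
        counts[gene] += int(mapped)
    return dict(counts)


# Made-up idxstats output for three reference sequences.
example = "tx1\t900\t120\t0\ntx2\t1200\t30\t0\ntx3\t500\t7\t0\n*\t0\t0\t15\n"
mapping = {"tx1": "geneA", "tx2": "geneA"}  # hypothetical grouping
result = gene_counts_from_idxstats(example, mapping)
print(result)
```

In practice the text would come from running `samtools idxstats` on a sorted, indexed BAM file and capturing its standard output.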

Probably the most straightforward approach is to align your reads to a genome reference (full chromosome sequences) with TopHat or STAR (if you have the RAM for it), and then to count hits to genes with something like htseq-count, which finds overlaps between genomic coordinates and the gene features annotated in a GTF file.
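As a rough picture of what "finding overlaps with gene features" means, here is a toy Python sketch. It is not how htseq-count is implemented; it only mimics the general idea that a read overlapping exactly one gene is counted for that gene, while a read overlapping several genes or none is set aside. The coordinates, gene names and the simple tuple layout are made up for illustration, and the `__no_feature` / `__ambiguous` labels just echo the special counters htseq-count prints.

```python
from collections import Counter


def count_reads_per_gene(reads, features):
    """Count reads per gene by coordinate overlap (toy version).

    reads:    list of (chrom, start, end)
    features: list of (gene_id, chrom, start, end)
    Coordinates are 1-based and inclusive, as in GTF.
    """
    counts = Counter()
    for chrom, r_start, r_end in reads:
        # Collect every gene whose interval overlaps the read.
        hit_genes = {
            gene
            for gene, f_chrom, f_start, f_end in features
            if f_chrom == chrom and r_start <= f_end and f_start <= r_end
        }
        if not hit_genes:
            counts["__no_feature"] += 1
        elif len(hit_genes) > 1:
            counts["__ambiguous"] += 1
        else:
            counts[hit_genes.pop()] += 1
    return counts


features = [("geneA", "chr1", 100, 500), ("geneB", "chr1", 450, 900)]
reads = [
    ("chr1", 120, 170),  # inside geneA only
    ("chr1", 460, 510),  # overlaps geneA and geneB
    ("chr1", 600, 650),  # inside geneB only
    ("chr2", 10, 60),    # no annotated gene
]
gene_counts = count_reads_per_gene(reads, features)
print(dict(gene_counts))
```

The nested loop is fine for a demonstration but far too slow for real data, which is one reason to use a dedicated counting tool.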

**sdriscoll** · 05-29-2013, 09:16 AM

If you really want to align to this database you're using, I suggest trying RSEM.

**bob-loblaw** · 05-29-2013, 09:17 AM

I'll look into it. Thank you.

**sdriscoll** · 05-29-2013, 10:24 AM

Also, I don't want to send you down a confusing path, and I don't mind providing you with some help to get that pipeline working. One more thing to consider: do you expect insertions/deletions to be important? If so, RSEM may not be what you want, since it uses bowtie1 for alignments. eXpress is a similar solution, and with eXpress you can use alignments from bowtie1, bowtie2, bwa (with some tweaking) and really any aligner that can output all possible alignments for a given read.

These tools attempt to disambiguate the alignments to a set of gene/protein/transcript sequences, giving you "unique" mappings even for reads that can align equally well to several references. I've done a bit of benchmarking and honestly I haven't seen great results from eXpress, but RSEM does pretty well. Both work VERY well if you are able to sum counts together for sequences that share exons or share sequence (as in multi-copy genes or alternatively spliced genes). They work OK in terms of per-sequence counts, certainly better than what the aligners can do on their own, but not perfectly. Just keep in mind that your per-sequence expression values will likely contain some false positives (maybe a lot...) and will also likely be missing a few true positives. In the end, your knowledge of which sequences in your database share sequence or share exons will help you immensely in getting stable and reliable read counts.
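To give a feel for what "disambiguating" multi-mapped reads involves, here is a toy expectation-maximization sketch in Python. It is not the RSEM or eXpress algorithm: those also model things like sequence length and alignment quality, which this ignores. The function name, the reference names and the read counts are all made up, and every read is assumed to align equally well to each of its candidate references.

```python
def em_assign(read_hits, n_iter=50):
    """Estimate expected read counts per reference with a toy EM.

    read_hits: list of lists; each inner list names the references one
    read aligns to equally well. Reads with no hits are ignored.
    """
    read_hits = [hits for hits in read_hits if hits]
    refs = sorted({r for hits in read_hits for r in hits})
    # Start from equal relative abundance for every reference.
    abundance = {r: 1.0 / len(refs) for r in refs}
    counts = {r: 0.0 for r in refs}
    for _ in range(n_iter):
        counts = {r: 0.0 for r in refs}
        # E-step: split each read across its candidate references
        # in proportion to the current abundance estimates.
        for hits in read_hits:
            total = sum(abundance[r] for r in hits)
            for r in hits:
                counts[r] += abundance[r] / total
        # M-step: new abundance is the share of reads assigned.
        abundance = {r: counts[r] / len(read_hits) for r in refs}
    return counts


# 8 reads unique to seqA, 2 unique to seqB, 10 that fit both equally.
reads = [["seqA"]] * 8 + [["seqB"]] * 2 + [["seqA", "seqB"]] * 10
expected = em_assign(reads)
print({name: round(value, 3) for name, value in expected.items()})
```

In this example the ambiguous reads end up split roughly 4:1 in favour of `seqA`, following the ratio of the unique reads. Summing the two sequences together always gives 20, which illustrates why grouped counts are more stable than per-sequence counts.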

**bob-loblaw** · 05-31-2013, 05:46 AM

Well, alternatively spliced genes won't be a problem, since it's all bacteria I'm mapping to. I have been playing around with RSEM, but I have a pretty large sample size, and realigning all of the samples would be very time consuming. Obviously I'll do it if necessary, but I'd prefer not to have to. I haven't tried eXpress yet, but I will.

When I was mapping with bowtie2 I left its reporting mode at the default (i.e. report only the best alignment), but eXpress wants to be able to select the best alignment itself. Do you think this will be a big issue?
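One way to see what the existing alignments contain is to count how many alignment records each read has in the SAM output. The Python sketch below assumes single-end reads and plain-text SAM lines; the read and reference names are invented. With bowtie2's default reporting mode each read should show at most one record, whereas output produced with bowtie2's `-k` or `-a` options can show several per read.

```python
from collections import Counter


def alignments_per_read(sam_lines):
    """Count mapped alignment records per read name (single-end SAM)."""
    per_read = Counter()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # not a valid alignment record
        name, flag = fields[0], int(fields[1])
        if flag & 0x4:
            continue  # read is flagged as unmapped
        per_read[name] += 1
    return per_read


def fraction_multi_reported(sam_lines):
    """Fraction of mapped reads that have more than one record."""
    per_read = alignments_per_read(sam_lines)
    if not per_read:
        return 0.0
    multi = sum(1 for n in per_read.values() if n > 1)
    return multi / len(per_read)


# Made-up SAM records: r1 is reported twice, r2 once, r3 is unmapped.
sam = [
    "@HD\tVN:1.0\tSO:unsorted",
    "r1\t0\tgeneA\t10\t1\t4M\t*\t0\t0\tACGT\tIIII",
    "r1\t256\tgeneB\t25\t1\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t0\tgeneA\t40\t42\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]
per_read_counts = alignments_per_read(sam)
multi_fraction = fraction_multi_reported(sam)
print(dict(per_read_counts), multi_fraction)
```

For BAM files, the lines could come from `samtools view` output piped into the script.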
