This is what sdriscoll meant when he said that you are mapping to a transcriptome. You have provided your aligner with a FASTA file that contains not one sequence per chromosome but one sequence per transcript. (Otherwise, how would samtools know where the transcripts are, given that you haven't supplied a GFF file?) This is known as "mapping against the transcriptome", and it is "bad" if you don't know exactly what you are doing, for various reasons that you'll find in old threads here.
Ah okay. I misunderstood: I thought sdriscoll was asking whether I was mapping against a reference assembled from transcriptomics data, as opposed to the DNA sequences of predicted proteins from sequenced genomes (which is what I'm using). Sorry for being a "student" and for making a "mistake". I'll correct that post.
Exactly as Simon said. If you were mapping to a genome reference, idxstats would return read counts per chromosome. It's absolutely more complicated to map to a transcriptome reference. A couple of tools for that are eXpress and RSEM, but neither of those will help you get counts at the gene level unless you provide some knowledge of which references come from the same gene.
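To make the gene-level point concrete, here is a minimal sketch of summing per-reference counts from `samtools idxstats` into per-gene counts. It assumes the standard four-column, tab-separated idxstats output (reference name, length, mapped reads, unmapped reads). The reference names, the counts, and the `tx2gene` table are made up for illustration; the table is exactly the "knowledge of which references come from the same gene" that you would have to supply yourself.

```python
import csv
import io
from collections import defaultdict


def parse_idxstats(text):
    """Return {reference_name: mapped_read_count} from idxstats-style text."""
    counts = {}
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        # Skip malformed lines and the trailing "*" line (reads with no reference).
        if len(row) < 4 or row[0] == "*":
            continue
        counts[row[0]] = int(row[2])
    return counts


def sum_by_gene(reference_counts, reference_to_gene):
    """Add up per-reference counts for references assigned to the same gene."""
    gene_counts = defaultdict(int)
    for ref, n in reference_counts.items():
        # References missing from the table are kept under their own name.
        gene_counts[reference_to_gene.get(ref, ref)] += n
    return dict(gene_counts)


# Hypothetical idxstats output and a hypothetical reference-to-gene table.
example = "tx1\t1200\t50\t0\ntx2\t900\t30\t0\ntx3\t1500\t20\t0\n*\t0\t0\t5\n"
tx2gene = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}

gene_counts = sum_by_gene(parse_idxstats(example), tx2gene)
print(gene_counts)
```

In practice you would feed in the real output of `samtools idxstats` on a sorted, indexed BAM instead of the `example` string.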
Probably the most straightforward approach is to align your reads to a genome reference (full chromosome sequences) with Tophat or STAR, if you have the RAM for it, then to count hits to genes with something like htseq-count which can find overlaps of genomic coordinates with gene features annotated in a GTF file.
/* Shawn Driscoll, Gene Expression Laboratory, Pfaff
Salk Institute for Biological Studies, La Jolla, CA, USA */
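The idea of counting reads by overlapping genomic coordinates with GTF features can be sketched in a few lines. This is only a simplified illustration of the concept, not htseq-count's actual implementation: it ignores strand and spliced alignments, and it simply skips reads that overlap more than one gene. The GTF lines, the read coordinates, and the function names are all hypothetical.

```python
def parse_gtf_features(gtf_text, feature="exon"):
    """Collect (chrom, start, end, gene_id) tuples for one GTF feature type."""
    features = []
    for line in gtf_text.splitlines():
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 9 or fields[2] != feature:
            continue
        gene_id = None
        # Column 9 holds attributes such as: gene_id "geneA"; transcript_id "a1";
        for item in fields[8].split(";"):
            key, _, value = item.strip().partition(" ")
            if key == "gene_id":
                gene_id = value.strip().strip('"')
        if gene_id:
            features.append((fields[0], int(fields[3]), int(fields[4]), gene_id))
    return features


def count_reads(reads, features):
    """Count reads (chrom, start, end; 1-based inclusive) that hit exactly one gene."""
    counts = {}
    for chrom, r_start, r_end in reads:
        hits = {g for c, s, e, g in features
                if c == chrom and r_start <= e and r_end >= s}
        if len(hits) == 1:  # reads hitting no gene or several genes are skipped
            gene = hits.pop()
            counts[gene] = counts.get(gene, 0) + 1
    return counts


# Hypothetical annotation and read alignments.
gtf = (
    'chr1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "geneA"; transcript_id "a1";\n'
    'chr1\tsrc\texon\t300\t400\t.\t+\t.\tgene_id "geneB"; transcript_id "b1";\n'
)
reads = [("chr1", 150, 180), ("chr1", 190, 310), ("chr1", 350, 390), ("chr1", 500, 550)]

overlap_counts = count_reads(reads, parse_gtf_features(gtf))
print(overlap_counts)
```

For bacterial annotations the relevant feature type may be something other than `exon` (for example `CDS` or `gene`), which is why it is left as a parameter here.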
Also, I don't want to send you down a confusing path, and I don't mind providing some help to get that pipeline working. One more thing to consider: do you expect insertions/deletions to be important? If so, RSEM may not be what you want, since it uses bowtie1 for alignments. eXpress is a similar solution, and with eXpress you can use alignments from bowtie1, bowtie2, bwa (with some tweaking), and really any aligner that can output all possible alignments for a given read.
These tools attempt to disambiguate the alignments to a set of gene/protein/transcript sequences, giving you "unique" mappings even for reads that can align equally well to several references. I've done a bit of benchmarking, and honestly I haven't seen great results from eXpress, but RSEM does pretty well. Both work VERY well if you are able to sum the counts of sequences that share exons or share sequence (as in multi-copy genes or alternatively spliced genes). They work OK for per-sequence counts, certainly better than what the aligners can do on their own, but they're not perfect.
Just keep in mind that your per-sequence expression values will likely contain some false positives (maybe a lot...) and will also likely be missing a few true positives. In the end, your knowledge of which sequences in your database share sequence or share exons will help you immensely in getting stable and reliable read counts.
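To give a feel for what "disambiguating" multi-mapping reads means, here is a toy expectation-maximization sketch. It is not RSEM's or eXpress's actual model (the real tools account for much more, such as reference length); it only shows the basic idea of splitting an ambiguous read among references in proportion to their current estimated abundance. The function name and the read data are invented for the example.

```python
def em_assign(read_hits, n_iter=50):
    """Estimate expected read counts per reference from ambiguous alignments.

    read_hits: one list per read, naming the (distinct) references the read
    aligns to equally well.
    """
    refs = sorted({r for hits in read_hits for r in hits})
    # Start from equal abundances.
    abundance = {r: 1.0 / len(refs) for r in refs}
    expected = {r: 0.0 for r in refs}
    for _ in range(n_iter):
        # E-step: split each read among its candidate references.
        expected = {r: 0.0 for r in refs}
        for hits in read_hits:
            total = sum(abundance[r] for r in hits)
            for r in hits:
                expected[r] += abundance[r] / total
        # M-step: abundances become the normalized expected counts.
        n = sum(expected.values())
        abundance = {r: expected[r] / n for r in refs}
    return expected


# Toy data: 6 reads unique to A, 2 unique to B, 4 that fit A and B equally well.
toy_reads = [["A"]] * 6 + [["B"]] * 2 + [["A", "B"]] * 4

em_counts = em_assign(toy_reads)
print({ref: round(count, 3) for ref, count in em_counts.items()})
```

Note that in this toy case the sum over A and B is always 12 however the ambiguous reads are split, while the individual estimates depend entirely on the model. That is the sense in which summed counts for sequences that share sequence are more stable than per-sequence counts.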
Well, alternatively spliced genes won't be a problem; it's all bacteria I'm mapping to. I have been playing around with RSEM, but I have a pretty large sample size, and realigning all of the samples would be very time-consuming. Obviously I'll do it if necessary, but I'd prefer not to have to. I haven't tried eXpress yet, but I will.
When I was mapping with bowtie2 I left its reporting mode at the default (i.e. report only the best alignment), but eXpress wants to be able to select the best alignment itself. Do you think this will be a big issue?
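For reference, if realigning does turn out to be necessary, the reporting mode is controlled on the bowtie2 command line: the default reports one best alignment per read, `-k N` reports up to N alignments, and `-a` reports all of them. Below is a minimal sketch that only builds such a command for unpaired reads; the index prefix and file names are placeholders, and whether the default-mode alignments are actually a big issue depends on how many reads are multi-mapping in your data.

```python
import shlex


def bowtie2_command(index_prefix, reads_fastq, out_sam, report_all=True, k=None, threads=1):
    """Build a bowtie2 command line for unpaired reads as a list of arguments."""
    cmd = ["bowtie2", "-x", index_prefix, "-U", reads_fastq, "-S", out_sam,
           "-p", str(threads)]
    if k is not None:
        cmd += ["-k", str(k)]   # report up to k alignments per read
    elif report_all:
        cmd.append("-a")        # report all alignments per read
    return cmd                  # neither flag: default mode, one best alignment


# Placeholder paths, for illustration only.
cmd = bowtie2_command("refs/bacterial_cds", "sample1.fastq", "sample1.all.sam", threads=4)
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` to actually run the alignment. Keep in mind that `-a` can be much slower and produce much larger output than the default mode.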