This is what sdriscoll meant when he said that you are mapping to a transcriptome. You have provided your aligner with a FASTA file that contains one sequence per transcript rather than one sequence per chromosome. (Otherwise, how would samtools know where the transcripts are, given that you haven't supplied a GFF file?) This is known as "mapping against the transcriptome", and it is "bad" if you don't know exactly what you are doing, for various reasons that you'll find in old threads here.
-
Originally posted by Simon Anders:
This is what sdriscoll meant when he said that you are mapping to a transcriptome. You have provided your aligner with a FASTA file that contains one sequence per transcript rather than one sequence per chromosome. (Otherwise, how would samtools know where the transcripts are, given that you haven't supplied a GFF file?) This is known as "mapping against the transcriptome", and it is "bad" if you don't know exactly what you are doing, for various reasons that you'll find in old threads here.
-
Exactly as Simon said. If you were mapping to a genome reference, idxstats would return read counts per chromosome. Mapping to a transcriptome reference is definitely more complicated. A couple of tools for that are eXpress and RSEM, but neither will give you gene-level counts unless you provide some knowledge of which reference sequences come from the same gene.
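To make the "which references are from the same gene" point concrete, here is a minimal sketch that sums per-transcript read counts into gene-level counts. It assumes the input is the tab-separated text printed by `samtools idxstats` (reference name, sequence length, mapped reads, unmapped reads). The function name, the example transcript IDs and the `transcript_to_gene` mapping are made up for illustration; you would have to build the real mapping from your own annotation.

```python
import csv
import io
from collections import defaultdict


def gene_counts_from_idxstats(idxstats_text, transcript_to_gene):
    """Sum mapped-read counts per gene from `samtools idxstats` text output.

    transcript_to_gene is a user-supplied dict (hypothetical here) that maps
    each reference sequence name to a gene ID. References missing from the
    mapping are reported under their own name.
    """
    counts = defaultdict(int)
    reader = csv.reader(io.StringIO(idxstats_text), delimiter="\t")
    for row in reader:
        if len(row) < 3:
            continue  # skip blank or malformed lines
        ref_name, mapped = row[0], int(row[2])
        if ref_name == "*":
            continue  # the "*" line holds reads without a reference
        gene = transcript_to_gene.get(ref_name, ref_name)
        counts[gene] += mapped
    return dict(counts)


# Invented example data: three transcripts, two of them from the same gene.
example_idxstats = (
    "tx1\t1500\t120\t0\n"
    "tx2\t900\t30\t0\n"
    "tx3\t2000\t50\t0\n"
    "*\t0\t0\t7\n"
)
example_mapping = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}

print(gene_counts_from_idxstats(example_idxstats, example_mapping))
# prints {'geneA': 150, 'geneB': 50}
```

This only adds up whatever the aligner reported, so it inherits any arbitrary choices the aligner made for reads that fit several transcripts equally well.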
Probably the most straightforward approach is to align your reads to a genome reference (full chromosome sequences) with TopHat or STAR, if you have the RAM for it, and then to count hits to genes with something like htseq-count, which finds overlaps between the genomic coordinates of your alignments and the gene features annotated in a GTF file.
/* Shawn Driscoll, Gene Expression Laboratory, Pfaff
Salk Institute for Biological Studies, La Jolla, CA, USA */
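The idea of counting hits to genes by coordinate overlap can be sketched in a few lines. This is a toy illustration only, not htseq-count's actual implementation: it ignores strand, spliced reads and mapping quality, and it uses plain tuples instead of real BAM and GTF parsing. The function name and the example genes are hypothetical. The two special category names are borrowed from htseq-count's output. A read that overlaps exactly one gene is counted for it, a read that overlaps several genes is counted as ambiguous, and a read that overlaps none is counted as having no feature.

```python
from collections import Counter


def count_reads_per_gene(genes, reads):
    """Count reads per gene by interval overlap (simplified sketch).

    genes: dict gene_id -> list of (chrom, start, end) exon intervals,
           1-based and inclusive, as coordinates appear in a GTF file.
    reads: iterable of (chrom, start, end) aligned read intervals.
    Returns (per-gene counts, counts of reads that were not assigned).
    """
    counts = Counter({gene_id: 0 for gene_id in genes})
    unassigned = Counter()
    for chrom, r_start, r_end in reads:
        # Collect every gene that has at least one exon overlapping the read.
        hits = {
            gene_id
            for gene_id, exons in genes.items()
            for e_chrom, e_start, e_end in exons
            if e_chrom == chrom and e_start <= r_end and r_start <= e_end
        }
        if len(hits) == 1:
            counts[hits.pop()] += 1
        elif hits:
            unassigned["__ambiguous"] += 1
        else:
            unassigned["__no_feature"] += 1
    return dict(counts), dict(unassigned)


# Invented example: geneA has two exons, geneB overlaps the end of geneA.
example_genes = {
    "geneA": [("chr1", 100, 200), ("chr1", 300, 400)],
    "geneB": [("chr1", 380, 500)],
}
example_reads = [
    ("chr1", 150, 180),  # inside geneA exon 1
    ("chr1", 390, 410),  # overlaps both geneA and geneB
    ("chr1", 450, 480),  # inside geneB only
    ("chr1", 210, 290),  # falls in the geneA intron
    ("chr2", 150, 180),  # different chromosome
]

gene_counts, unassigned_counts = count_reads_per_gene(example_genes, example_reads)
print(gene_counts)
print(unassigned_counts)
```

For real data, use htseq-count itself or a comparable tool; the sketch is only meant to show why a genome alignment plus a GTF file is enough to get gene-level counts.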
-
Also, I don't want to send you down a confusing path, and I don't mind helping you get that pipeline working. One more thing to consider: do you expect insertions/deletions to be important? If so, RSEM may not be what you want, since it uses bowtie1 for alignments. eXpress is a similar solution, and with eXpress you can use alignments from bowtie1, bowtie2, bwa (with some tweaking) and really any aligner that can output all possible alignments for a given read. These tools attempt to disambiguate the alignments to a set of gene/protein/transcript sequences, giving you "unique" mappings even for reads that can align equally well to several references.
I've done a bit of benchmarking and honestly I haven't seen great results from eXpress, but RSEM does pretty well. Both work VERY well if you are able to sum counts across sequences that share exons or share sequence (as in multi-copy genes or alternatively spliced genes). They work OK for per-sequence counts, certainly better than what the aligners can do on their own, but they are not perfect. Just keep in mind that your per-sequence expression values will likely contain some false positives (maybe a lot...) and will also likely be missing a few true positives. In the end, your knowledge of which sequences in your database share sequence or share exons will help you immensely in getting stable and reliable read counts.
/* Shawn Driscoll, Gene Expression Laboratory, Pfaff
Salk Institute for Biological Studies, La Jolla, CA, USA */
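For readers wondering what "disambiguating" multi-mapped reads can look like, here is a toy expectation-maximization sketch. It is not the actual RSEM or eXpress code: it ignores sequence lengths, alignment scores, error models and paired ends, and the function name and input layout are assumptions made for illustration. Each read is given as the list of references it aligns to equally well, and on every round each read is split across its candidates in proportion to their current estimated abundance.

```python
def em_assign(read_hits, n_iter=100):
    """Fractionally assign multi-mapped reads with a toy EM loop.

    read_hits: list with one entry per read; each entry lists the reference
               names the read aligns to (all treated as equally good).
    Returns a dict of expected (fractional) read counts per reference.
    """
    usable = [hits for hits in read_hits if hits]  # drop reads with no hits
    refs = sorted({ref for hits in usable for ref in hits})
    if not refs:
        return {}
    # Start from equal abundances for every reference.
    theta = {ref: 1.0 / len(refs) for ref in refs}
    counts = {ref: 0.0 for ref in refs}
    for _ in range(n_iter):
        # E-step: split each read across its candidate references
        # in proportion to the current abundance estimates.
        counts = {ref: 0.0 for ref in refs}
        for hits in usable:
            total = sum(theta[ref] for ref in hits)
            for ref in hits:
                counts[ref] += theta[ref] / total
        # M-step: new abundances are the normalized expected counts.
        theta = {ref: counts[ref] / len(usable) for ref in refs}
    return counts


# Invented example: 8 reads unique to A, 2 unique to B, 10 shared by both.
example_hits = [["A"]] * 8 + [["B"]] * 2 + [["A", "B"]] * 10
expected = em_assign(example_hits)
print({ref: round(value, 3) for ref, value in expected.items()})
# approximately {'A': 16.0, 'B': 4.0}
```

In this example the shared reads end up split 8 to 2, following the unique evidence. This also shows why such tools need the aligner to report every candidate alignment: a read reported against only one reference never reaches the ambiguous branch.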
-
Originally posted by sdriscoll:
Also, I don't want to send you down a confusing path, and I don't mind helping you get that pipeline working. One more thing to consider: do you expect insertions/deletions to be important? If so, RSEM may not be what you want, since it uses bowtie1 for alignments. eXpress is a similar solution, and with eXpress you can use alignments from bowtie1, bowtie2, bwa (with some tweaking) and really any aligner that can output all possible alignments for a given read. These tools attempt to disambiguate the alignments to a set of gene/protein/transcript sequences, giving you "unique" mappings even for reads that can align equally well to several references. I've done a bit of benchmarking and honestly I haven't seen great results from eXpress, but RSEM does pretty well. Both work VERY well if you are able to sum counts across sequences that share exons or share sequence (as in multi-copy genes or alternatively spliced genes). They work OK for per-sequence counts, certainly better than what the aligners can do on their own, but they are not perfect. Just keep in mind that your per-sequence expression values will likely contain some false positives (maybe a lot...) and will also likely be missing a few true positives. In the end, your knowledge of which sequences in your database share sequence or share exons will help you immensely in getting stable and reliable read counts.
When I was mapping with bowtie2 I left its reporting mode at the default (i.e. report only the best alignment), but eXpress wants to be able to select the best alignment itself. Do you think this will be a big issue?