I ran an RNA-seq experiment and processed it with TopHat > Cufflinks. It is a time-series experiment. When I looked at the RPKM values, some transcripts go as high as 32765.04908 or 2073.978485. Is this reasonable? I also looked at the BAM files, and there were certainly a very large number of reads. Any feedback or suggestions on how to explain these high RPKM values?
Very high RPKM values from Cufflinks
Originally posted by honey: I ran an RNA-seq experiment and processed it with TopHat > Cufflinks. It is a time-series experiment. When I looked at the RPKM values, some transcripts go as high as 32765.04908 or 2073.978485. Is this reasonable? I also looked at the BAM files, and there were certainly a very large number of reads. Any feedback or suggestions on how to explain these high RPKM values?
The log10 of the values from Cufflinks roughly equals the FPKM values from Cuffdiff.
-
Very high RPKM values, from 4.5 to several thousand
However, my problem is that the RPKM values are high: approximately 5% are > 1000 RPKM (like 3245, 4356 and so on) in the same sample. If I switch to log10, what will happen to the values around zero? Is it a usual method to log-transform RPKM values?
Any feedback is welcome
Thanks
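On the question of what happens to values around zero: log10 of 0 is undefined, so a small constant (a "pseudocount") is often added before transforming. The snippet below is only a minimal sketch of that idea. The pseudocount of 1 is an arbitrary assumption, and the function name `log10_fpkm` is made up for illustration; it is not part of Cufflinks.

```python
import math

def log10_fpkm(values, pseudocount=1.0):
    """Return log10(value + pseudocount) for each expression value.

    The pseudocount keeps zero values defined (log10(0 + 1) = 0) and
    compresses the very large values into a narrow range.
    """
    return [math.log10(v + pseudocount) for v in values]

# Values taken from the posts above, plus a zero.
fpkm = [0.0, 4.5, 2073.978485, 32765.04908]
for raw, transformed in zip(fpkm, log10_fpkm(fpkm)):
    print(f"{raw:>14.4f} -> {transformed:.3f}")
```

Note that the transform only changes the scale for plotting or clustering. It does not explain or correct the high values themselves.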
-
Honey, I could be totally wrong here about the log(10) thing, but I don't think I am..
Can you look at the mappings for some of those transcripts where the 'raw' FPKM is about 0? Do they have few reads mapped?
See:
-
High RPKM
Thanks. I looked at the BAM files and can say that there are very few reads wherever the RPKM is 0, whereas the regions with very high values are hot spots with a large number of reads. Now the question is: is this an artifact? And how do we reconcile the two extremes, the very high and the very low RPKM values?
-
honey, it might be a good idea to look a bit more in depth into that specific gene. You can certainly get high FPKMs mapping to genes like actin that make up a lot of the mRNA percentage of a cell. I had huge numbers of reads mapping to one region of a miRNA gene once that all turned out to be within a LINE and a SINE. For that gene at least, it was clear the repeat regions skewed the results.
-
Very large RPKM
It is the human genome, so it is not small.
The genes that have very high RPKM values are relevant to the biology of the tissue samples, but my problem is how to provide a scientific rationale that our results are not nonspecific.
Thanks for the input
-
Originally posted by peromhc: I think that these values should be taken to the log(10).. this is not documented, but my suspicion.
log(10) values from cufflinks roughly equals FPKM values from cuffdiff..
Honey, how did you run Cufflinks? RABT mode or simple "quantification" mode? How long are the genes with super-high RPKM?
It seems to me that Cufflinks has a tendency to report super-high RPKM for very short transcripts (such as microRNA). I now routinely filter out the transcripts shorter than the expected fragment size (from the GTF annotation file). I think there is a good rationale to filter them out, because they can not be accurately captured by the RNA-Seq protocol....
In RABT mode, Cufflinks also reports a large number of short transcripts with crazy high values. A solution could be to re-quantify the discovered transcripts with something like BEDtools or HTSeq-count...
-
High RPKM
Originally posted by Nicolas: That does not make sense to me. Unless it is an option in either Cufflinks or Cuffdiff, I have never seen a log relationship between Cufflinks and Cuffdiff outputs.
So you mean a count-based method is probably better?
-
This issue has been discussed elsewhere on this board. As Nicolas points out, RNA-Seq really isn't reliable for very short transcripts. The reason is that all the fragments that map to these transcripts come from the "tail" of the distribution of library fragment lengths. That is, fragments that map to microRNAs are much, much shorter than most fragments in the library. This is by design in the RNA-Seq protocol, which size-selects away very short inserts. Thus, Cufflinks infers that even though relatively few fragments actually mapped to the microRNAs, there were probably TONS of individual microRNA molecules in the transcriptome before all of the various size selection parts of the protocol kicked in. Cufflinks accordingly increases the FPKM of these short transcripts to compensate for the bias against short fragments in the library.
This compensation was designed to improve accuracy for transcripts that are in the 500bp-1kb range - for longer transcripts, the "edge effects" due to library fragment size aren't much of an issue. However, I wouldn't trust FPKM values for transcripts shorter than your average fragment length. There's really just not enough data in most standard RNA-Seq libraries to say much about small RNA abundance.
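The inflation described above can be illustrated with a toy calculation. This is a minimal sketch, assuming the commonly described "effective length" form, in which each fragment length is weighted by its probability and by the number of positions it can occupy on the transcript. It is not Cufflinks' actual code, and `toy_dist`, the fragment counts, and the function names are invented for illustration.

```python
def effective_length(transcript_len, frag_len_probs):
    """Expected number of positions a fragment can start at.

    frag_len_probs maps fragment length -> probability. Fragment
    lengths longer than the transcript cannot fit and contribute 0.
    """
    return sum(
        prob * (transcript_len - frag_len + 1)
        for frag_len, prob in frag_len_probs.items()
        if frag_len <= transcript_len
    )

def fpkm(fragments, total_mapped, length):
    """Fragments per kilobase of (effective) length per million mapped."""
    return fragments * 1e9 / (total_mapped * length)

# Hypothetical library: almost all fragments are 150-300 bp long.
toy_dist = {80: 0.01, 150: 0.09, 200: 0.40, 250: 0.40, 300: 0.10}
total_mapped = 20_000_000  # hypothetical number of mapped fragments

for name, length, fragments in [("short", 100, 10), ("long", 2000, 10)]:
    eff = effective_length(length, toy_dist)
    print(name,
          "naive FPKM:", round(fpkm(fragments, total_mapped, length), 2),
          "effective-length FPKM:", round(fpkm(fragments, total_mapped, eff), 2))
```

In this toy setup only the rare 80 bp fragments fit on the 100 bp transcript, so its effective length is far below 1 and the same 10 fragments translate into an FPKM in the thousands, while the 2000 bp transcript is barely affected.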
I should also point out that other methods use this same bias correction technique (RSEM for example). As far as I'm aware, the "count-based" methods don't, but that doesn't mean they shouldn't. Most of those methods are strictly for differential analysis, where any edge effects are assumed to be affecting each condition the same way. That may or may not be the case in your data.
In any case, the quick answer to this problem is to simply remove or ignore transcripts shorter than around 300bp from your GTF. In a future version, we will be flagging these transcripts as too short for reliable quantification where appropriate.
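One way to apply the filtering suggested above is sketched below. It assumes a standard 9-column GTF in which exon records carry a `transcript_id` attribute, and it defines transcript length as the summed exon lengths. The 300 bp cutoff comes from the post; the function names are hypothetical, and this is not a Cufflinks feature.

```python
import re
from collections import defaultdict

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')

def transcript_lengths(gtf_lines):
    """Sum exon lengths per transcript_id (GTF coordinates are 1-based, inclusive)."""
    lengths = defaultdict(int)
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        match = TRANSCRIPT_ID.search(fields[8])
        if match:
            lengths[match.group(1)] += int(fields[4]) - int(fields[3]) + 1
    return lengths

def filter_short_transcripts(gtf_lines, min_length=300):
    """Keep records whose transcript is at least min_length bp long.

    Comment lines and records without a transcript_id are kept unchanged.
    Transcripts with no exon records count as length 0 and are dropped.
    """
    gtf_lines = list(gtf_lines)
    lengths = transcript_lengths(gtf_lines)
    kept = []
    for line in gtf_lines:
        fields = line.rstrip("\n").split("\t")
        match = TRANSCRIPT_ID.search(fields[8]) if len(fields) >= 9 else None
        if match is None or lengths.get(match.group(1), 0) >= min_length:
            kept.append(line)
    return kept
```

A typical use would be to read the annotation with `open(path)`, pass the lines to `filter_short_transcripts`, and write the result to a new GTF that is then given to the quantification step.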
-
Hi Cole, Thanks for your post. I keep reading your comments here which are useful for many including me. I asked a similar question, with a twist, here: http://seqanswers.com/forums/showthread.php?t=17992
Could you comment, please? In short, it is about how to deal with longer (>300 bp) transcripts with high FPKMs.