Announcement

Collapse
No announcement yet.

HTseq to DeSeq/EdgeR to Heatmap

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #91
    Biological replicates are the most important. In an ideal world, you'd do a pilot experiment and then use a tool like scotty to determine how many replicates you would need for the power you want.

    Comment


    • #92
      Originally posted by dpryan View Post
      Biological replicates are the most important. In an ideal world, you'd do a pilot experiment and then use a tool like scotty to determine how many replicates you would need for the power you want.
      Hi how do you uniform the gene name between different methods (Texudo vs Htseq+edgeR)?
      Now I am doing RNA-Seq on the bovine, my reference genome use Ensembl database.
      As you know, the Texudo pipeline: Tophat2-Cufflinks-Cuffmerge-Cuffdiff.
      After the Cuffmerge and get the merged.gtf which could give you gene name output only but not the Ensembl ID. However, the gene name of Cuffdiff, some are like "Q865A2_BOVIN", which made me confused, cause it is not official gene name.
      Is there any methods to uniform gene names (whatsoever Ensembl ID, gene name or UCSC ID ,.etc.) across different methods?
      I think I could do Tophat2-Cuffdiff without Cufflinks+Cuffmerge to do this pipeline. However, I am not sure I am right and is there any other way to uniform them?
      Thank you!

      Comment


      • #93
        Originally Posted by dpryan
        ....FPKM/RPKM values have some downsides in that they lead you to believe that you've properly corrected for library size differences, when you actually haven't robustly.
        ...
        Some people mistakenly believe that the count-based method is "biased" toward longer genes. This is incorrect since that's not a real bias (in the common rather than technical meaning), that's a benefit. Etc.
        ...
        .
        Could you please elaborte on these two points please?

        When Im analysing expression profiles I export one table with RPKM, one with CPM and also one with normalised counts from DEseq2.

        However I don't know which one is the best? So far I have used CPM values, because the have correlated best with RT-PCR delta CT values. Also with corresponding protein values.

        How do these three methods relate? What are pros and cons?

        Comment


        • #94
          CPM is basically a non-length-normalized variant of RPKM, so it has many of the same pros and cons. In an ideal world, RPKM should be the most accurate of the three for many things, since if you have differences in major isoform length between groups then it can better represent that. Having said that, it's a bit more difficult to calculate accurately and then use well statistically (look at the evolution of cufflinks for an example). The normalized counts from DESeq2 (or edgeR) are mostly used for visualization and not much more. In most cases, there's not one "best" thing. These are all metrics that are more/less useful for a variety of things. If CPM has better predictive power in your samples then I wouldn't argue much with that. All the theoretical discussions in the world have to take a back seat to what validates experimentally.

          Comment


          • #95
            Hi Devon and Zapages,
            Have you seen my questions on post #92 ? Thank you!
            I am confused on how to uniform the gene name or ID between Tuxedo and count based pipelines. The different gene annotation file is not convenient especially I don't know how to deal with sth like "Q865A2_BOVIN"
            Last edited by super0925; 05-12-2014, 03:23 AM.

            Comment


            • #96
              Originally posted by super0925 View Post
              Hi how do you uniform the gene name between different methods (Texudo vs Htseq+edgeR)?
              Now I am doing RNA-Seq on the bovine, my reference genome use Ensembl database.
              As you know, the Texudo pipeline: Tophat2-Cufflinks-Cuffmerge-Cuffdiff.
              After the Cuffmerge and get the merged.gtf which could give you gene name output only but not the Ensembl ID. However, the gene name of Cuffdiff, some are like "Q865A2_BOVIN", which made me confused, cause it is not official gene name.
              Is there any methods to uniform gene names (whatsoever Ensembl ID, gene name or UCSC ID ,.etc.) across different methods?
              I think I could do Tophat2-Cuffdiff without Cufflinks+Cuffmerge to do this pipeline. However, I am not sure I am right and is there any other way to uniform them?
              Thank you!
              Q865A2_BOVIN looks like a uniprot ID (with a "_BOVIN" suffix, perhaps), so just convert that with Biomart or R (there's probably a bovine annotation package). BTW, this is usually not an issue since the IDs output by cufflinks and those used by htseq-count can be the same (they're both sourced from the annotation file you provide, so you can just tell htseq-count to count according to the same ID).

              Comment


              • #97
                Originally posted by dpryan View Post
                Q865A2_BOVIN looks like a uniprot ID (with a "_BOVIN" suffix, perhaps), so just convert that with Biomart or R (there's probably a bovine annotation package). BTW, this is usually not an issue since the IDs output by cufflinks and those used by htseq-count can be the same (they're both sourced from the annotation file you provide, so you can just tell htseq-count to count according to the same ID).
                Which R package could do this? I know how to convert the Ensembl gene ID list to gene names list on Biomart website http://www.ensembl.org/biomart/martview/ ,but it is manually (I mean I need to copy and pasted one list by one list), not automatically.

                Comment


                • #98
                  org.Bt.eg.db from Bioconductor. It has a uniprot to ensembl conversion map. BTW, you can also script biomart with the biomaRt package in R (this is very useful for things like ID conversions and fetching sequence information).

                  Comment


                  • #99
                    Originally posted by dpryan View Post
                    org.Bt.eg.db from Bioconductor. It has a uniprot to ensembl conversion map. BTW, you can also script biomart with the biomaRt package in R (this is very useful for things like ID conversions and fetching sequence information).
                    I have two questions now:
                    1. I have tried the Biomart(http://www.ensembl.org/biomart/martview/) on Q865A2_BOVIN but I didn't find the result...
                    2. I also try bioDBnet (http://biodbnet.abcc.ncifcrf.gov/db/db2dbRes.php) and it is not work as well.
                    Could you please try it? (may take you 1 minute) I think may be I wrongly choose the catalog (e.g. UnigeneID, Gene ID) of the input or output.

                    The result of Q865A2_BOVIN is OAS1 (please see (http://www.uniprot.org/uniprot/Q865A2), btw what is '2'-5'-oligoadenylate synthetase', could it be transfer to from Biomart or bioDBnet?

                    Thank you!
                    Last edited by super0925; 05-12-2014, 05:05 AM.

                    Comment


                    • Perhaps the uniprot IDs haven't been merged into biomart yet. Since you have an annotation file, the conversions are contained in it. So just load it into R with GenomicRanges and then get the conversions from the mcols() of the resulting GRanges object.

                      A "2'-5'-oligoadenylate synthetase" would add ATP to oligos (if you google this, you'll find that apparently this particular method ends up activating ribonucleases and leading to RNA degradation).

                      Comment


                      • Originally posted by dpryan View Post
                        Perhaps the uniprot IDs haven't been merged into biomart yet. Since you have an annotation file, the conversions are contained in it. So just load it into R with GenomicRanges and then get the conversions from the mcols() of the resulting GRanges object.

                        A "2'-5'-oligoadenylate synthetase" would add ATP to oligos (if you google this, you'll find that apparently this particular method ends up activating ribonucleases and leading to RNA degradation).

                        Thank you! But bioDBnet (http://biodbnet.abcc.ncifcrf.gov/db/db2db.php) has merged the uniprot, however it seems not working. Could you please try it? I am wondering if I selected the wrong catalog (e.g. gene symbol, gene ID...). Because if I input the gene symbol 'OAS1 ', and the Outputs select the 'UniProt Accession', I don't find the 'Q865A2' in the result list, and hence, if I input 'UniProt Accession' is 'Q865A2', outputs is null. It is strange!

                        BTW, how to define the "2'-5'-oligoadenylate synthetase" for 'OAS1' , or "prostaglandin D2 receptor" for 'PTGDR'? Are these long name called "Full gene name"? Could I use Biomart or bioDBnet or R packages you mentioned to transfer the Ensembl ID to this so called 'Full gene name'? Or the only way is to Google them one by one after I got the list?
                        Thank you!
                        Last edited by super0925; 05-12-2014, 09:51 AM.

                        Comment


                        • "2'-5'-oligoadenylate synthetase" is an example of a gene name, yes. I also can't get bioDBnet to work with that example, but I've never used the site before. As I suggested earlier, just parse the annotation file you're using, since it apparently has both.
                          Last edited by dpryan; 05-13-2014, 12:39 AM.

                          Comment


                          • Originally posted by dpryan View Post
                            "2'-5'-oligoadenylate synthetase" is an example of a gene name, yes. I also can't get bioDBnet to work with that example, but I've never used the site before. As I suggested earlier, just parse the annotation file you're using, since it apparently have both.
                            Sorry I haven't got your idea. Do you mean once I get the gene name list , like OAS1, PTGDR, etc. If I want to transfer them to their full name (or as you said the example of gene name) to "2'-5'-oligoadenylate synthetase","prostaglandin D2 receptor", I need to Google it?
                            Could the annotation file e.g. GTF file have these full name? I only remember it has the Ensembl ID and hence I need to transfer the ID to gene names (e.g. OAS1) by online converter tools. But for how to transfer them to the full name, I am still confused.
                            Thank you!
                            Last edited by super0925; 05-13-2014, 12:15 AM.

                            Comment


                            • No, just use R/biopython/bioperl/whatever to read in the GTF/GFF file that you used with cufflinks & htseq-count and then extract the ID conversions with that. Since I assume that you used the same annotation file for both tools, it must have both IDs, which means that that file can be used to define the ID mappings. Yes, this will take a modicum of programming.

                              Comment


                              • Originally posted by dpryan View Post
                                No, just use R/biopython/bioperl/whatever to read in the GTF/GFF file that you used with cufflinks & htseq-count and then extract the ID conversions with that. Since I assume that you used the same annotation file for both tools, it must have both IDs, which means that that file can be used to define the ID mappings. Yes, this will take a modicum of programming.
                                Hi Devon
                                I still have two questions:
                                Q1:
                                The full name "prostaglandin D2 receptor" is not from my GTF file... What is strange , in gene names from GTF file, I think some use UniProt Accession(e.g. ATPO_BOVIN) but some use gene names(e.g. ITSN1 ). It made me confused.
                                I have screenshot to you (My GTF file is from Ensembl Database. Bovine's genome annotation, Btau_4.0).
                                Hence I don't know how to uniform them? I think the reasonable method is use Ensembl ID which is uniformed (by (1)"Tophat-htseq-edgeR" or (2)"Tophat-Cuffdiff" but without Cufflinks and Cuffmerge) , after then, I translate them to gene name by R package or online tools (Biomart) .
                                Am I right? Or DO you have any better solution?
                                Q2:
                                The full name(e.g. "2'-5'-oligoadenylate synthetase","prostaglandin D2 receptor") is my collaborator sent to me, who is more interested with the long full name than gene name. He google the gene name or Ensembl ID and get the full name...
                                Hence I am think a automatically transfer method than manually Google it.
                                Or may I change another annotation file (GTF) e.g. from UCSC database rather than Ensembl?
                                Thank you!
                                Attached Files
                                Last edited by super0925; 05-13-2014, 03:23 AM.

                                Comment

                                Working...
                                X