Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Which ID should be used for HTSeq-count?

    1. transcripts1.gtf , transcripts2.gtf , ... , transcripts20.gtf were produced by cufflinks.

    2. I combined all of the gtf file to make my own annotation file, new.combined.gtf
    cuffcompare -o new original.gtf transcripts1.gtf transcripts2.gtf ... transcripts20.gtf

    3. I want to use HTSeq-count, by using new.combined.gtf.
    However, in new.combined.gtf, among gene_id and oID and transcript_id and tss_id, which one do I have to choose? My thought is to use oID or tssID. But some reads that have different oId share the same tssID. In other words, some reads that have been previously annotated and other reads that have not been annotated have different oID but the same tssID.


    Q: My goal is to see differentially expressed genes across the five different time points. To achieve this goal, when using HTSeq-count, which id is reasonable to choose as option?



    Thank you in advance!
    The below is the new.combined.gtf.

    Mt:3452:Mt3.5.1Chr1 E exon 154859 155160 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1"; oId "Medtr1g005010.1"; tss_id "TSS1";
    Mt:3452:Mt3.5.1Chr1 E exon 155201 161330 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "1"; oId "Medtr1g005020.1"; tss_id "TSS2";
    Mt:3452:Mt3.5.1Chr1 E exon 161501 161880 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "2"; oId "Medtr1g005020.1"; tss_id "TSS2";
    Mt:3452:Mt3.5.1Chr1 E exon 162000 162267 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "3"; oId "Medtr1g005020.1"; tss_id "TSS2";
    Mt:3452:Mt3.5.1Chr1 E exon 162314 162479 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "4"; oId "Medtr1g005020.1"; tss_id "TSS2";
    Mt:3452:Mt3.5.1Chr1 Cufflinks exon 161506 162151 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00004880"; exon_number "1"; oId "CUFF.2.1"; tss_id "TSS4";
    Mt:3452:Mt3.5.1Chr1 Cufflinks exon 162327 162630 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00004881"; exon_number "1"; oId "CUFF.2.1"; tss_id "TSS5";
    Mt:3452:Mt3.5.1Chr1 Cufflinks exon 174216 175572 . + . gene_id "XLOC_000003"; transcript_id "TCONS_00007444"; exon_number "1"; oId "CUFF.9.1"; tss_id "TSS6";
    Mt:3452:Mt3.5.1Chr1 Cufflinks exon 188340 188516 . + . gene_id "XLOC_000003"; transcript_id "TCONS_00007444"; exon_number "2"; oId "CUFF.9.1"; tss_id "TSS6";
    Mt:3452:Mt3.5.1Chr1 Cufflinks exon 175331 175619 . + . gene_id "XLOC_000003"; transcript_id "TCONS_00014784"; exon_number "1"; oId "CUFF.5.1"; tss_id "TSS8";
    Mt:3452:Mt3.5.1Chr1 Cufflinks exon 176045 176603 . + . gene_id "XLOC_000004"; transcript_id "TCONS_00004882"; exon_number "1"; oId "CUFF.8.1"; tss_id "TSS9";
    Mt:3452:Mt3.5.1Chr1 Cufflinks exon 176728 177280 . + . gene_id "XLOC_000004"; transcript_id "TCONS_00004882"; exon_number "2"; oId "CUFF.8.1"; tss_id "TSS9";
    Mt:3452:Mt3.5.1Chr1 Cufflinks exon 177480 177481 . + . gene_id "XLOC_000004"; transcript_id "TCONS_00004882"; exon_number "3"; oId "CUFF.8.1"; tss_id "TSS9";
    Mt:3452:Mt3.5.1Chr1 Cufflinks exon 176052 177755 . + . gene_id "XLOC_000004"; transcript_id "TCONS_00007445"; exon_number "1"; oId "CUFF.10.1"; tss_id "TSS9";
    Mt:3452:Mt3.5.1Chr1 E exon 176184 176603 . + . gene_id "XLOC_000004"; transcript_id "TCONS_00000003"; exon_number "1"; oId "Medtr1g005100.1"; tss_id "TSS10";
    Mt:3452:Mt3.5.1Chr1 E exon 176728 176892 . + . gene_id "XLOC_000004"; transcript_id "TCONS_00000003"; exon_number "2"; oId "Medtr1g005100.1"; tss_id "TSS10";
    Mt:3452:Mt3.5.1Chr1 E exon 177155 177280 . + . gene_id "XLOC_000004"; transcript_id "TCONS_00000003"; exon_number "3"; oId "Medtr1g005100.1"; tss_id "TSS10";
    Last edited by syintel87; 02-03-2013, 03:19 PM.

  • #2
    Originally posted by syintel87 View Post
    I want to use HTSeq-count, by using new.combined.gtf.
    However, in new.combined.gtf, among gene_id and oID and transcript_id and tss_id, which one do I have to choose? My thought is to use oID or tssID. But some reads that have different oID share the same tssID. In other words, some reads that have been previously annotated and other reads that have not been annotated have different oID but the same tssID.
    TSS -> Transcript Start Site (according to convention, and cufflinks documentation). This is related to the gene_id. There will be multiple isoforms / transcripts that have the same transcript start site, and arise from different splicing events.

    It looks like 'oID' is the name of the original transcript (or cufflink's name if it's found a novel transcript):

    http://cufflinks.cbcb.umd.edu/manual...uffcomp_output

    There also should be a .tmap file that links transcripts to genes. The first column of this file is "The gene_name attribute of the reference GTF record for this transcript, if present. Otherwise gene_id is used." If you're doing gene-level comparisons, that's probably what you want to use. For transcript-level comparisons, the second column names are probably appropriate.

    Q: My goal is to see differentially expressed genes across the five different time points. To achieve this goal, when using HTSeq-count, which id is reasonable to choose as option?
    Any particular reason why you want to use HTSeq-count? What do you propose to use as your count data?

    Cufflinks has its own recommended differential expression test method, cuffdiff, which identifies differences at both a gene level and an isoform level:

    http://cufflinks.cbcb.umd.edu/manual.html#gene_exp_diff

    Comment


    • #3
      You're best off using gene_id. tss_id will definitely work poorly (you could maybe use that with DEXSeq) and oId will also work poorly (though not nearly as bad as tss_id).

      gringer: I've done what syintel87 is trying to do, so I'll give what my reason was. In my case, I had a lot of reads mapping to UTR regions (and the occasional exon) that weren't in the annotation that I was using. cufflinks can pick a lot of these up and spit out a merged GTF file (with exons added or extended, as appropriate) that better represented the apparent structure of the genes. In my case, that really didn't end up making much of a difference in the expression analysis (it just shuffled around genes on the margin of significance), but perhaps for others it proves more useful.

      Comment


      • #4
        I have a question, and I would really appreciate it if I was to get a prompt reply!

        I am performing an RNASeq analysis and I have paired-end 101 bp reads. I want to generate count data and perform a DE analysis using DESeq where I can identify genes that are differentially expressed between 2 conditions (for which we have replicates for).

        I have run ht-seq count using an alignment file (sam file sorted by read name of course) and the latest version of Ensembl's annotation file as input to obtain this count data. I ran this using two different identifiers (-i option).

        1) gene_id (default, and 'preferred' for RNASeq)
        2) gene_name

        My preference was that I wouldn't want to go through the headache of converting these gene-id's into gene symbols which are easier for everyone to look at and identify. But what I realized was that these 2 runs (keeping the mode and everything else constant) yielded different number of rows (entries) in the resulting count dataset.

        Why is this? Isn't it a 1 to 1 (id to gene name) mapping/conversion? I'm not sure what the difference in it's working is between the two runs. Am I losing anything by using gene_name instead of gene_id?

        Again, any help would be greatly appreciated.


        Thanks

        Comment


        • #5
          Originally posted by vkartha View Post
          Isn't it a 1 to 1 (id to gene name) mapping/conversion?
          Doesn't seem so, does it?

          I quickly checked with Biomart, and the first thing I noticed is that snoRNAs that have multiple copies appear with different ENSG gene IDs but the same gene symbol. I'm sure there are many such things, probably especially (but maybe only) regarding non-coding genes.

          Sometimes it seems to me that the biggest everyday task in bioinformatics is struggling with ID conversions.

          Comment


          • #6
            Originally posted by dpryan View Post
            You're best off using gene_id. tss_id will definitely work poorly (you could maybe use that with DEXSeq) and oId will also work poorly (though not nearly as bad as tss_id).

            gringer: I've done what syintel87 is trying to do, so I'll give what my reason was. In my case, I had a lot of reads mapping to UTR regions (and the occasional exon) that weren't in the annotation that I was using. cufflinks can pick a lot of these up and spit out a merged GTF file (with exons added or extended, as appropriate) that better represented the apparent structure of the genes. In my case, that really didn't end up making much of a difference in the expression analysis (it just shuffled around genes on the margin of significance), but perhaps for others it proves more useful.
            1.
            So, you mean tss_id and oId work poorly, so that gene_id has to be used?

            2.
            My concern is the probability that even in the same gene, some transcripts could be up-regulated and other transcripts could be down-regulated. This is why I want to analyze data based on the transcript level. Does my thought make sense biologically?

            3.
            My data is RNA-seq. And my goal is to see the differentially expressed genes by using edgeR. I am concerened about some reads that are mapped to reference genome but are not annotated, so that I want to use cuffcompare to combine original.gtf file and transcripts.gtf files.
            Though considering my situation, since tss_id is working poorly, the way to go is to use gene_id, ignoring transcript levels?

            Comment


            • #7
              Originally posted by syintel87 View Post
              1.
              So, you mean tss_id and oId work poorly, so that gene_id has to be used?
              Yes
              2.
              My concern is the probability that even in the same gene, some transcripts could be up-regulated and other transcripts could be down-regulated. This is why I want to analyze data based on the transcript level. Does my thought make sense biologically?
              That's why you use something like DEXSeq in addition to something like edgeR (or DESeq, for similarity).

              3.
              My data is RNA-seq. And my goal is to see the differentially expressed genes by using edgeR. I am concerened about some reads that are mapped to reference genome but are not annotated, so that I want to use cuffcompare to combine original.gtf file and transcripts.gtf files.
              Though considering my situation, since tss_id is working poorly, the way to go is to use gene_id, ignoring transcript levels?
              Gene and transcript level analyses needn't be the same. I wouldn't try to shoehorn an analysis into a pipeline designed for something else when more appropriate pipelines exist.

              Comment


              • #8
                Originally posted by dpryan View Post
                Yes

                That's why you use something like DEXSeq in addition to something like edgeR (or DESeq, for similarity).



                Gene and transcript level analyses needn't be the same. I wouldn't try to shoehorn an analysis into a pipeline designed for something else when more appropriate pipelines exist.
                Thank you so much!!!!

                So, you mean to see differential expression at gene level, gene_id has to be used in htseq count, and edgeR analysis will be fine.

                Also, to see differential expression at transcript level, tss_id(???) has to be used in htseq count, and DEXSeq or something will be fine.

                Do I understand correctly?
                Last edited by syintel87; 02-06-2013, 02:29 AM.

                Comment


                • #9
                  Originally posted by syintel87 View Post
                  Thank you so much!!!!

                  So, you mean to see differential expression at gene level, gene_id has to be used in htseq count, and edgeR analysis will be fine.
                  Exactly

                  Also, to see differential expression at transcript level, tss_id has to be used in htseq count, and DEXSeq or something will be fine.

                  Do I understand correctly?
                  As the Germans I work with would say, jein ("yes and no"). For DEXSeq, you would use something like dexseq_count.py (read the DEXSeq vignette). DEXSeq also technically looks at differential exon usage, which is related to transcript level expression changes but not exactly the same. Browse through bioconductor (and the literature/this forum, since a lot of them are stand-alone apps) for other ways of looking at transcript level changes.

                  Comment


                  • #10
                    Originally posted by dpryan View Post
                    Exactly



                    As the Germans I work with would say, jein ("yes and no"). For DEXSeq, you would use something like dexseq_count.py (read the DEXSeq vignette). DEXSeq also technically looks at differential exon usage, which is related to transcript level expression changes but not exactly the same. Browse through bioconductor (and the literature/this forum, since a lot of them are stand-alone apps) for other ways of looking at transcript level changes.

                    Now I am thinking of two ways.

                    1.
                    1) original.gtf (original annotation ) + transcripts.gtf ( output from cufflinks ) => cuffcompare => combined.gtf
                    2) HTSeq-count by using combined.gtf and its "gene_id".

                    2.
                    1) original.gtf (original annotation ) + transcripts.gtf ( output from cufflinks ) => cuffcompare => combined.gtf
                    2) python dexseq_prepare_annotation.py combined.gtf combined_DEXSeq.gtf
                    3) python dexseq_count.py -p no -s no -a 10 combined_DEXSeq.gtf accepted_hits.sam treatment1_counts.txt
                    4) read.HTSeqCounts( treatment1_counts.txt, design, flattenedfile = "combined_DEXSeq.gtf" )

                    Would you please give me a tip about whetehr these two ways seem to make sense?

                    Thank you very much!!!

                    Comment


                    • #11
                      Originally posted by syintel87 View Post
                      2) HTSeq-count by using combined.gtf and its "gene_id".
                      HTSeq-count on its own is not going to be all that useful due to normalisation issues. It needs to be paired up with another program that can filter out the noise and make the true positive results more obvious. At least use cuffdiff if you want some results directly from cufflinks' results files.

                      Comment


                      • #12
                        Originally posted by gringer View Post
                        HTSeq-count on its own is not going to be all that useful due to normalisation issues. It needs to be paired up with another program that can filter out the noise and make the true positive results more obvious. At least use cuffdiff if you want some results directly from cufflinks' results files.
                        From earlier in the thread, there's an implicit "use edgeR" third step

                        Comment

                        Latest Articles

                        Collapse

                        • seqadmin
                          Exploring the Dynamics of the Tumor Microenvironment
                          by seqadmin




                          The complexity of cancer is clearly demonstrated in the diverse ecosystem of the tumor microenvironment (TME). The TME is made up of numerous cell types and its development begins with the changes that happen during oncogenesis. “Genomic mutations, copy number changes, epigenetic alterations, and alternative gene expression occur to varying degrees within the affected tumor cells,” explained Andrea O’Hara, Ph.D., Strategic Technical Specialist at Azenta. “As...
                          07-08-2024, 03:19 PM
                        • seqadmin
                          Exploring Human Diversity Through Large-Scale Omics
                          by seqadmin


                          In 2003, researchers from the Human Genome Project (HGP) announced the most comprehensive genome to date1. Although the genome wasn’t fully completed until nearly 20 years later2, numerous large-scale projects, such as the International HapMap Project and 1000 Genomes Project, continued the HGP's work, capturing extensive variation and genomic diversity within humans. Recently, newer initiatives have significantly increased in scale and expanded beyond genomics, offering a more detailed...
                          06-25-2024, 06:43 AM

                        ad_right_rmr

                        Collapse

                        News

                        Collapse

                        Topics Statistics Last Post
                        Started by seqadmin, 07-19-2024, 07:20 AM
                        0 responses
                        37 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 07-16-2024, 05:49 AM
                        0 responses
                        49 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 07-15-2024, 06:53 AM
                        0 responses
                        60 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 07-10-2024, 07:30 AM
                        0 responses
                        43 views
                        0 likes
                        Last Post seqadmin  
                        Working...
                        X