Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • DEXSeq (1.too many empty 2. paried sample comparison?)

    Hi Everyone,

    1. I tried to use DEXSeq to look into differential exon usage. Frist, I tried to creat my own ExonCountSet using aligned sam files, and an annotated gtf file (generated by dexseq_prepare_annotation.py from the same gtf I used for tophat alignment).

    However, I got a lot of _empty. I showed part of the result in the following:

    ENSG00000000003:001 387
    ENSG00000000003:002 270
    ENSG00000000003:003 194
    ENSG00000000003:004 79
    ENSG00000000003:005 36
    ENSG00000000003:006 59
    ENSG00000000003:007 60
    ENSG00000000003:008 32
    ENSG00000000003:009 42
    ENSG00000000003:010 32
    ENSG00000000003:011 20
    ENSG00000000003:012 25
    ENSG00000000003:013 1
    ENSG00000000003:014 9
    ENSG00000000003:015 3
    ENSG00000000003:016 0
    ENSG00000000003:017 0
    ENSG00000000003:018 0
    ENSG00000000003:019 0
    ENSG00000000005:001 0
    ENSG00000000005:002 0
    ENSG00000000005:003 20
    ENSG00000000005:004 0
    ENSG00000000005:005 0
    ENSG00000000005:006 0

    ......
    ENSG00000258370:001 0
    ENSG00000258370:002 0
    ENSG00000258371:001 0
    ENSG00000258371:002 0
    ENSG00000258372:001 0
    ENSG00000258372:002 0
    ENSG00000258372:003 0
    ENSG00000258372:004 0
    ENSG00000258372:005 0
    ENSG00000258372:006 0
    ENSG00000258372:007 0
    ENSG00000258372:008 0
    ENSG00000258372:009 0
    ENSG00000258372:010 0
    ENSG00000258372:011 0
    ENSG00000258373:001 119
    ENSG00000258374:001 0
    ENSG00000258374:002 0
    ENSG00000258375:001 0


    _ambiguous 44367
    _empty 140632650
    _lowaqual 7773318
    _notaligned 0

    Does that mean it discard 140 million reads? How could I improve this? When I use HTseq to calculate gene-level count, the no_feature is around 70 million.


    2. If I want to compare two cancer samples along with their normals, i.e.

    Cancer 1/normal 1 (cancer 1 replica/normal 1 rep) versus Cancer 2/normal 2 (cancer 2 replica/normal 2 rep) ....

    Could DEXSeq do this? Or if other tools ?

    Any suggestions are appreciated. Thanks!

  • #2
    Hi senpeng!

    1.

    What are the exact lines are you using for the dexseq scripts? You have a lot of reads! and it sounds strange that a lot of them are empty, how many reads do you have in your initial fastq files?

    2.

    DEXSeq can do this! Have a look at the section 4: "Additional technical or experimental variables".

    Let me know if you have additional questions,

    Alejandro

    Comment


    • #3
      Are you sure the coordinates in your GTF file are based on the same genome build as the Fasta files you have aligned your read against?

      Comment


      • #4
        Originally posted by areyes View Post
        Hi senpeng!

        1.

        What are the exact lines are you using for the dexseq scripts? You have a lot of reads! and it sounds strange that a lot of them are empty, how many reads do you have in your initial fastq files?

        2.

        DEXSeq can do this! Have a look at the section 4: "Additional technical or experimental variables".

        Let me know if you have additional questions,

        Alejandro
        Dear Alejandro,
        Thanks for your reply.

        1.We have around 220 million reads in fastq files(paired end, so 110 million for one end).

        I just used python dexseq_prepare_annotation.py ensembl ensembl.63.genes.gtf <output.gtf>
        to generate the flattened_gtf_file.

        Then python dexseq_count.py <output.gtf> <sam_file> <output_file> for the DEXSeq counts.
        Last edited by senpeng; 03-27-2012, 02:04 PM.

        Comment


        • #5
          Originally posted by Simon Anders View Post
          Are you sure the coordinates in your GTF file are based on the same genome build as the Fasta files you have aligned your read against?

          Dear Simon,

          Thanks so much for your reply.

          For the alignment we used
          TopHat VN:1.3.2 -r 40 -p 8 -G annotation/ensembl.63.genes.gtf /refgenome/GRCh37/Homo_sapiens.GRCh37.62

          And I used the exactly the same ensembl gtf for dexseq_prepare_annotation.py and dexseq_count.py .

          I checked our alignment files in IGV too, there's some corrected alignment in exon area, but also a lot of alignment reads fall into intron areas. Is it the possible cause of the _empty, not perfect alignment?

          Also,two more questions:

          1.Since we've already have around 150-200 million reads, but it seems that we still could not get a good count on isoform-level exons (gene-level, it seems OK). How many reads do you recommend for study of isoform-level changes?

          2. We also aligned the data to NCBI gtf (Homo_sapiens.NCBI36.54.noMT.gtf). Could we try using NCBI gtf in DEXSeq? since you recommend ensembl in your manual and the file size are different (ensembl ~450Mb vs NCBI ~120Mb).

          Thanks again for your help and look forward to your reply.

          Comment


          • #6
            Originally posted by senpeng View Post
            I checked our alignment files in IGV too, there's some corrected alignment in exon area, but also a lot of alignment reads fall into intron areas. Is it the possible cause of the _empty, not perfect alignment?
            Yes. A read that wholly falls into an intron, without even touching the exon, will be counted as "empty".

            You may want to figure out why you have so many read aligning to introns. Are the introns evenly filled with reads or are there just small islands of reads within the introns. The former would mean intron retention, the latter the presence of exons missing in your annotation.

            In case that these are biological signals of interest, you should use some tool like cufflinks or RSEM to find boundaries for these additional features and let DEXSeq test for them, too. We have not yet tried such a toolchain, though.

            1.Since we've already have around 150-200 million reads, but it seems that we still could not get a good count on isoform-level exons (gene-level, it seems OK). How many reads do you recommend for study of isoform-level changes?
            As long as most of them align to exons, you should get quite far with 150M reads. Look at the MA plot to see whether you are Poisson-limited.

            We also aligned the data to NCBI gtf (Homo_sapiens.NCBI36.54.noMT.gtf). Could we try using NCBI gtf in DEXSeq? since you recommend ensembl in your manual and the file size are different (ensembl ~450Mb vs NCBI ~120Mb).
            The Python script needs to know for each exon to which gene it belongs. The UCSC files do not contain this information; their "gene_id" field is just a copy of the "transcript_id" field. You need to manually rectify this.

            Simon

            Comment


            • #7
              indeed. Our core facility folk who did tophat for us used a different version of gtf. In my case there is ZERO counts. Every row of the output file has zero. So ...

              I will re-run everything and will update when I know more.
              Thanks a lot
              Wenhong

              Comment


              • #8
                Hello,
                I am trying to run DEXSeq on a set of several samples. As the first step, I generated the annotation file in the required format using "dexseq_prepare_annotation.py". Then I used the "dexseq_count.py" to generate counts.

                I am worried since the "empty" cases are too many:
                _ambiguous 0
                _empty 66686221
                _lowaqual 0
                _notaligned 0

                Is this expected?

                Total mapped reads in this dataset is 351,853,340. My annotation is based on gencode v14 in gtf format (I also tried using gff3).

                python dexseq_prepare_annotation.py gencv14_levels_12.gtf htseqfrmtdgencv14L12
                python dexseq_count.py -p yes -s yes htseqfrmtdgencv14L12 SrtdBAM_FT_1.sam dexcount_FT1_Strndd

                Comment


                • #9
                  I guess your protocol has a step enriching for polyadenylated RNAs?

                  You have 20% mapping outside of exons (could also mean unspliced introns, not annotated transcripts and lincRNAs, etc). Would not think is particularly bad.

                  Comment


                  • #10
                    Originally posted by areyes View Post
                    I guess your protocol has a step enriching for polyadenylated RNAs?

                    You have 20% mapping outside of exons (could also mean unspliced introns, not annotated transcripts and lincRNAs, etc). Would not think is particularly bad.
                    Thanks for your insight, areyes. Yes, the protocol does enrich for polyA RNAs.

                    Comment

                    Latest Articles

                    Collapse

                    • seqadmin
                      Exploring the Dynamics of the Tumor Microenvironment
                      by seqadmin




                      The complexity of cancer is clearly demonstrated in the diverse ecosystem of the tumor microenvironment (TME). The TME is made up of numerous cell types and its development begins with the changes that happen during oncogenesis. “Genomic mutations, copy number changes, epigenetic alterations, and alternative gene expression occur to varying degrees within the affected tumor cells,” explained Andrea O’Hara, Ph.D., Strategic Technical Specialist at Azenta. “As...
                      07-08-2024, 03:19 PM
                    • seqadmin
                      Exploring Human Diversity Through Large-Scale Omics
                      by seqadmin


                      In 2003, researchers from the Human Genome Project (HGP) announced the most comprehensive genome to date1. Although the genome wasn’t fully completed until nearly 20 years later2, numerous large-scale projects, such as the International HapMap Project and 1000 Genomes Project, continued the HGP's work, capturing extensive variation and genomic diversity within humans. Recently, newer initiatives have significantly increased in scale and expanded beyond genomics, offering a more detailed...
                      06-25-2024, 06:43 AM

                    ad_right_rmr

                    Collapse

                    News

                    Collapse

                    Topics Statistics Last Post
                    Started by seqadmin, Yesterday, 05:49 AM
                    0 responses
                    15 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 07-15-2024, 06:53 AM
                    0 responses
                    26 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 07-10-2024, 07:30 AM
                    0 responses
                    37 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 07-03-2024, 09:45 AM
                    0 responses
                    204 views
                    0 likes
                    Last Post seqadmin  
                    Working...
                    X