Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Low number of replicates DESeq

    Hi all,

    I am using DESeq for DGE analysis.

    I have STRANDED RNA-Seq data for 4 developmental stages with no replicates.
    To have a more reliable DGE I should have replicates and so I obtained (from another lab member) UNSTRANDED RNA-Seq data with 3 replicates per stage.

    Before doing a DGE, I thought to test the correlation between these samples, just to show that similar samples “cluster” together. If so, I can then use the unstranded data for my DGE analysis to have more replicates per each stage.

    I mapped the raw reads to the genome using TOPHAT, sorted the bam files by name and used htseq-count to get the raw reads counts for both the data. For the stranded data I used the option -s yes and for the unstranded data I used -s no.

    I used DESeq to include metadata and for normalization, and I removed the genes that always have a 0 value. I then calcualted the correlation which was really low.

    I then tried to use htseq-count with the option -s reverse for the stranded data and still got really low correlation.

    So I reran htseq-count on the stranded data selecting the option -s no and in this way I got a very similar number of total counts between the unstranded and stranded data (while both cases before the stranded ones were double in number). I then included metadata, estimated the new size factors, normalized and calculated the new correlation. Both Pearson and Spearman performed pretty well, confirmed by both a PCA and correlogram.

    Though, I'd still like to figure out a way to use the stranded counts. I am not sure if I lose some information running htseq-count using -s no on the stranded data.

    What I had in mind was using unstranded data to estimate the level of variation to get a threshold for DE detection but still use the stranded data as expression values. Not sure if I can do that though given one is stranded and the other is not.

    I would like to hear from you if you have any thoughts about this.

    Let me know if you need more information to better understand the issue.

    Thanks a lot
    Federico

  • #2
    Hi Federico,
    have you confirmed that the data you have is indeed strand specific?
    You can use the infer_experiment.py script from the RSeQC package to check it out - https://code.google.com/p/rseqc/

    Depending on the library preparation protocol you either use the '-s reverse' or '-s yes' option. Do you know the specifics of the library prep? In any case, you will find out if you manage to run 'infer_experiment.py'.

    /blanco

    Comment


    • #3
      Hi Blanco,

      to be sure I did a quick test using infer_experiment.py,it is indeed forward-reverse. I also did a blat to the genome and I got the same result.

      This is PairEnd Data
      Fraction of reads explained by "1++,1--,2+-,2-+": 0.9189
      Fraction of reads explained by "1+-,1-+,2++,2--": 0.0811
      Fraction of reads explained by other combinations: 0.0000

      1++,1–,2+-,2-+

      read1 mapped to ‘+’ strand indicates parental gene on ‘+’ strand
      read1 mapped to ‘-‘ strand indicates parental gene on ‘-‘ strand
      read2 mapped to ‘+’ strand indicates parental gene on ‘-‘ strand
      read2 mapped to ‘-‘ strand indicates parental gene on ‘+’ strand
      Based on this I ran TOPHAT with fr-secondstrand and htseq-count with -s yes

      Comment


      • #4
        Have you had a look at the alignments yet? Pick a two or three genes for which the counts between stranded and unstranded libraries differ strongly and two or three genes for which they agree reasonably well, then look at the eads at these loci with a genome browser (e.g., IGV). Do you see antisense transcripts which should not be there? Do you see many reads with low alignment quality? Any other peculiarity?

        Comment


        • #5
          Hi Simon,

          Good idea!
          I am sorting my bam files at the moment. I will write back once I had a look at the alignment

          In the meantime thanks for the suggestion

          Comment


          • #6
            Hi Simon,

            I have had a look at the alignments.
            When the counts between stranded and unstraded librarires agree reasonably well, it looks alright. I can see that when they differ strongly it is because the locus is particulary complex, with a lot of antisense transcription and overlapping genes, which are pickep up only by the stranded libraries.

            I've also noticed that in a lot of cases, genes that are well supported by alignment reads, have a count of 0. This happens more often for unstranded libraries, I guess particularly when there is antisense transcription. But it also happens for stranded libraries. Highly expressed genes with good alignment support have a count 0.

            I wonder if running htseq-count with
            -m union
            could have influenced this.
            For each developmental stage, around 15-20 million counts are ambigous.

            Would -m intersection_nonempty more appropriate in your opinion?

            Thanks for help
            Federico

            Comment


            • #7
              Does your GTF file contain "exon" lines for all the anti-sense transcripts? If so, this would explain the discrepancies.

              htseq-count considers a read as "ambiguous" if it maps to more than one feature (gene). So, if a read maps to a position which is covered by a regular gene on one strand and by an overlapping antisense transcript on the other strand, then this read will be counted as ambiguous if you have set "stranded" to "no", because there is no information to decide whether the read originated from the sense of from the antisense transcript. For "stranded=yes", however, the read will be counted for the feature that is on the same strand as the read.

              Comment


              • #8
                yes my gtf file contains all exon lines for antisense transcripts as well.
                this explains the discrepancy since I used -s yes for the stranded libraries.
                But why sometimes the counts are zero when the read maps to a position which is covered by a regular gene on one strand and a overlapping antisense gene on the opposite strand?

                selecting -stranded yes should account for this,shouldn't it?

                thanks

                Comment


                • #9
                  Write down the names of a few such reads, then use the '-o' option to get an output SAM file, grep in it for the read names and find out this way what htseq-count has done with these reads.

                  Comment


                  • #10
                    Ok, thanks again for help!

                    Let you know how it goes

                    Comment


                    • #11
                      Hi Simon,

                      I might have figured this out.

                      I think the issue is that I have particular loci in the genome that have overlapping genes on the same strand. Though, these genes have a different ID so that's way I get a count of 0 .

                      If a read maps to an exon shared by several genes, this will appear to htseq-count as ambigous.

                      So at the moment I am having zero counts for all these particular loci, which sometimes contain also really highly expressed genes.

                      Do you have any idea on how I could account for this?

                      Let me know if I'm not clear

                      Thanks
                      Federico

                      Comment


                      • #12
                        I am now indexing my reference with your suggestions above. Let me know if I'm not clear

                        Comment


                        • #13
                          Crossposting a question on two forums is never a good idea. At least the link to the other thread should be provided: https://stat.ethz.ch/pipermail/bioco...ry/057986.html

                          Comment


                          • #14
                            Hi Simon,

                            sorry for that. I'm still a beginner in this field and i'm still learning how all this works.
                            The thread on bioconductor list is more focused on DESeq2 analysis than htseq-count but I was indeed going to provide the link to this thread since the discussion was shifting towards htseq-count again.

                            Did you have a chance to read my previous post?

                            I think the issue I'm having (other than having the correct design for DESeq2) is that there are particular loci in the genome that have overlapping genes on the same strand. Though, these genes have a different ID so that's way I get a count of 0 .

                            As explaind in htseq manual If a read maps to an exon shared by several genes, this will appear to htseq-count as ambigous.

                            So at the moment I am having zero counts for all these particular loci, which sometimes contain also really highly expressed genes that I'd like to account for in the subsequent DGE analysis.

                            Do you have any idea on how I could account for this?

                            Thanks

                            Comment


                            • #15
                              Overlapping genes on the same strand cannot be distinguished in principle -- at least not by reads that map to the overlapping part. So, if the overlap is small, these will only be a few read, and you could ignore the issue. If, however, the overlap is big, you should consider the two genes as different isoforms of the same gene, and hence give them the same name.

                              So, I would somehow edit the GTF file such that whenever there are two genes A and B with large overlap, you replace all occurances of the gene IDs "A" and "B" with something like "A+B".

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Non-Coding RNA Research and Technologies
                                by seqadmin




                                Non-coding RNAs (ncRNAs) do not code for proteins but play important roles in numerous cellular processes including gene silencing, developmental pathways, and more. There are numerous types including microRNA (miRNA), long ncRNA (lncRNA), circular RNA (circRNA), and more. In this article, we discuss innovative ncRNA research and explore recent technological advancements that improve the study of ncRNAs.

                                Nobel Prize for MicroRNA Discovery
                                This week,...
                                10-07-2024, 08:07 AM
                              • seqadmin
                                Recent Developments in Metagenomics
                                by seqadmin





                                Metagenomics has improved the way researchers study microorganisms across diverse environments. Historically, studying microorganisms relied on culturing them in the lab, a method that limits the investigation of many species since most are unculturable1. Metagenomics overcomes these issues by allowing the study of microorganisms regardless of their ability to be cultured or the environments they inhabit. Over time, the field has evolved, especially with the advent...
                                09-23-2024, 06:35 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 10-02-2024, 04:51 AM
                              0 responses
                              104 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 10-01-2024, 07:10 AM
                              0 responses
                              112 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 09-30-2024, 08:33 AM
                              1 response
                              115 views
                              0 likes
                              Last Post EmiTom
                              by EmiTom
                               
                              Started by seqadmin, 09-26-2024, 12:57 PM
                              0 responses
                              21 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X