Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #16
    Simon and all,

    I found another one(see the attached). I used cufflinks to assemble the transcripts and it appears that cufflinks assembled the reads in between these two genes as shown. I went further in ensemble and I realised that this (I mean reads between these two genes) were supported by EST alignments (Sequence: GH304014.1 : HW_FAT_169_A03.ab1 HW cDNA library intermuscular fat Bos taurus cDNA clone 169_A03, mRNA sequence.). This is the first time I'm analysing rna-seq data and I basically don't have a clue as to what's going on.

    Is there any inference one could deduce from this?
    Attached Files

    Comment


    • #17
      I guess you could filter the reads and only keep the reads that map uniquely (samtools view with the -q option should do that). I would worry about relying on degraded samples and the potential biases introduced, though.

      I suspect that you've got a perfect storm of incomplete annotation of gene models (which primarily affects UTRs), a highly paralogous genome, and 3' bias in your sample.

      How did the read quality look and what kind of sequencing did you do (length, paired/single)?

      Comment


      • #18
        I usually work with model organisms, so I am used to quite complete annotations. For cow, it is maybe not surprising that there are genes missing. Whether it is worth spending time on filling in these missing genes is another question. After all, if you find some of them to be differentially epxressed, you won't know what their function is, anyway. Maybe some other users working with non-model organisms have mroe advice there.

        I would still try to figure out first why most of your reads seem to map to repetitive regions. Take a few of those reads with NH:20 and maybe BLAST their sequence against the NCBI nucleotide collection. That is sometimes helpful.

        On the other hand, I just wondered whether you really have that many NH:20 reads. You have 32,980,337 SAM records with NH>1. however, if most of these have NH:20, then they all occur 20 times in your SAM file (NH is the number of _reported_ alignments), and so, these are only around 2 million reads, which would be less than 5% of your total number of reads and hence nothing to worry that much about.

        Comment


        • #19
          Originally posted by Darwin View Post
          I guess you could filter the reads and only keep the reads that map uniquely (samtools view with the -q option should do that). I would worry about relying on degraded samples and the potential biases introduced, though.

          I suspect that you've got a perfect storm of incomplete annotation of gene models (which primarily affects UTRs), a highly paralogous genome, and 3' bias in your sample.

          How did the read quality look and what kind of sequencing did you do (length, paired/single)?
          Thanks Darwin and truthfully, it is a perfect storm indeed. I used fastX tools to clip and trim my sequences. The boxplots ranged from 30 to 40. The size of the pictures is more than what it can accept as the other end is scrapped off. I can send it to you as an e-mail attachment if you don't mind.

          Moreso, these are single end reads...

          Thanks.
          Attached Files
          Last edited by Ajayi Oyeyemi; 04-05-2013, 08:48 AM.

          Comment


          • #20
            sure, if you can't resize the picture (which is probably better so that others can chime in). dichmann<at>berkeley<dot>edu. What's on the uploaded picture looks good.

            As Simon points out, it would be helpful if you analyzed the multi mapping statistics.

            It still leaves you with 3'bias, but on the other hand some protocols for DE genes rather than transcript discovery actually do targeted sequencing of the 3'end of transcripts. But again, I have no idea of potential biases appearing after RNA degradation.

            Comment


            • #21
              Originally posted by Simon Anders View Post
              I usually work with model organisms, so I am used to quite complete annotations. For cow, it is maybe not surprising that there are genes missing. Whether it is worth spending time on filling in these missing genes is another question. After all, if you find some of them to be differentially epxressed, you won't know what their function is, anyway. Maybe some other users working with non-model organisms have mroe advice there.

              I would still try to figure out first why most of your reads seem to map to repetitive regions. Take a few of those reads with NH:20 and maybe BLAST their sequence against the NCBI nucleotide collection. That is sometimes helpful.

              On the other hand, I just wondered whether you really have that many NH:20 reads. You have 32,980,337 SAM records with NH>1. however, if most of these have NH:20, then they all occur 20 times in your SAM file (NH is the number of _reported_ alignments), and so, these are only around 2 million reads, which would be less than 5% of your total number of reads and hence nothing to worry that much about.
              @Simon.,
              I'll do some of the sequences and blast them as advised. However, a thorough inspection of my sam file revealed that there are some with NH:2, NH:4 NH:16 and NH:20. I guess that's the reason why I have millions of reads with no feature.

              While I await input from scientist working with non-model organisms, Is it logical to re-run some tophat alignments using either UCSC or NCBI? (I had wanted to use DESeq that was why I used ensemble anyway) and even if it is logical which of the UCSC or NCBI is more compatible with htseq and DEXSeq scripts? I'm just worried about millions of reads with no features.
              Thanks Simon.

              Comment


              • #22
                Your quality scores are great.

                I used samtools view -q 50 to get unique maps in my data. lowering -q leaves in progressively more multi aligned reads, so that -q1 filters reads that align 10 times or more. Combine with -c option to output the number of reads. Consult the manual for more.

                Comment


                • #23
                  No need to rerun tophat, if you want to use a more complete GTF file, provided the new GTF file uses the same coordinate system (assembly) as the old one. If you use something other than Ensembl GTF file, have a very close look at the gene ID attributes. (UCSC tends to mess them up; different transcripts of the same gene must have the same gene ID; but this can be corrected with some fiddling)

                  Comment


                  • #24
                    @Darwin,

                    Thanks for the advice. I followed your lead and I use the following code:
                    samtools view -q 50 -c sample1.bam > sample_uniquehits.bam

                    count = 39647278

                    samtools view -q 1 -c sample1.bam > sample_uniquehits.bam

                    count = 44277228

                    For my second sample:
                    -q50 option gave 39847467 while -q1 option gave 46427418.

                    Any clue given my output from HTseq count? I have ~48 million reads and ~50 million reads for sample 1 and 2 respectively.

                    Comment


                    • #25
                      Originally posted by Simon Anders View Post
                      No need to rerun tophat, if you want to use a more complete GTF file, provided the new GTF file uses the same coordinate system (assembly) as the old one. If you use something other than Ensembl GTF file, have a very close look at the gene ID attributes. (UCSC tends to mess them up; different transcripts of the same gene must have the same gene ID; but this can be corrected with some fiddling)
                      Thanks Simon.

                      Given that I don't have much experience with the fiddling strategy you proposed, I'll just stick to ensemble anyway. I used samtools to check counts for unique alignments and I just posted it. If I can't wiggle my way out, I'll just use the htseq-count output in DESeq.

                      Comment

                      Latest Articles

                      Collapse

                      • seqadmin
                        Essential Discoveries and Tools in Epitranscriptomics
                        by seqadmin


                        The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist on Modified Bases...
                        Yesterday, 07:01 AM
                      • seqadmin
                        Current Approaches to Protein Sequencing
                        by seqadmin


                        Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                        04-04-2024, 04:25 PM

                      ad_right_rmr

                      Collapse

                      News

                      Collapse

                      Topics Statistics Last Post
                      Started by seqadmin, 04-11-2024, 12:08 PM
                      0 responses
                      39 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 04-10-2024, 10:19 PM
                      0 responses
                      41 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 04-10-2024, 09:21 AM
                      0 responses
                      35 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 04-04-2024, 09:00 AM
                      0 responses
                      55 views
                      0 likes
                      Last Post seqadmin  
                      Working...
                      X