  • HTseq help

    Hi all,
    I'm using htseq-count and, fortunately, I didn't get any errors. However, what concerns me most are the no_feature and alignment_not_unique counts. They have been giving me a headache; they are just very high.

    I used the following command:
    htseq-count -m intersection-nonempty -s no -i gene_id infile.sam genes.gtf > outfile.txt

    My first sample gave:
    no_feature 7411932
    ambiguous 26682
    too_low_aQual 0
    not_aligned 0
    alignment_not_unique 32980337

    For my second sample:

    no_feature 7024086
    ambiguous 28678
    too_low_aQual 0
    not_aligned 0
    alignment_not_unique 9537858

    I got my genes.gtf from iGenomes (Ensembl). The first sample has ~48 million reads and the second has ~50 million. I used TopHat2 for the alignments. Please help, I'm stuck.

    My second question: what should these count statistics look like for the counts to be usable in DESeq?
    Thanks.
    Last edited by Ajayi Oyeyemi; 03-28-2013, 08:27 AM.
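
To put the numbers above in proportion, here is a minimal Python sketch that divides each reported counter by the approximate read total quoted in the post. The dictionary layout and the function name `fraction_table` are illustrative assumptions, not part of htseq-count. The ratios are only rough: the totals are approximate, and a read aligned to several places can contribute more than one SAM record.

```python
# Counters copied from the post; "total_reads" uses the approximate
# figures quoted there (~48 million and ~50 million), so ratios are rough.
samples = {
    "sample 1": {
        "total_reads": 48_000_000,
        "no_feature": 7_411_932,
        "ambiguous": 26_682,
        "alignment_not_unique": 32_980_337,
    },
    "sample 2": {
        "total_reads": 50_000_000,
        "no_feature": 7_024_086,
        "ambiguous": 28_678,
        "alignment_not_unique": 9_537_858,
    },
}


def fraction_table(stats):
    """Return each counter as a fraction of the sample's read total."""
    total = stats["total_reads"]
    return {name: count / total
            for name, count in stats.items()
            if name != "total_reads"}


for sample_name, stats in samples.items():
    for category, fraction in fraction_table(stats).items():
        print(f"{sample_name}: {category} = {fraction:.1%}")
```

Under these assumptions, alignment_not_unique comes to roughly 69% of the quoted total for the first sample and roughly 19% for the second, which is why the replies below concentrate on the multiply-mapped reads.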

  • #2
    Hey guys, please help me. My question hasn't been answered yet. Can someone help me out?


    • #3
      Please don't post specific inquiries in major threads and then bump them. This is your own thread for this query.


      • #4
        Try to run HTSeq without the header of the SAM file.


        • #5
          The vast majority of your reads seem to map to more than one locus. How come?

          Maybe visualize your BAM files with a genome browser and investigate.


          • #6
            Originally posted by eszter.ari
            Try to run HTSeq without the header of the SAM file.
            Sorry, I'm a newbie. Can you be more explicit? How can I run HTSeq without the header of the SAM file, and how can the header be removed?
            I'm sorry if this question is naive.
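
As an illustration of what removing the header means: in a SAM file the header lines are the ones that begin with `@`, and everything else is an alignment record. The Python sketch below copies a SAM file without those lines. The function names and the output file name are hypothetical, and whether dropping the header changes the htseq-count result is not established in this thread.

```python
def strip_sam_header(lines):
    """Yield alignment records only, skipping SAM header lines (those starting with '@')."""
    for line in lines:
        if not line.startswith("@"):
            yield line


def write_headerless_sam(src_path, dst_path):
    """Copy src_path to dst_path without header lines; return the number of records written."""
    written = 0
    with open(src_path) as src, open(dst_path, "w") as dst:
        for record in strip_sam_header(src):
            dst.write(record)
            written += 1
    return written


# Hypothetical usage; "infile.noheader.sam" is an invented output name:
# write_headerless_sam("infile.sam", "infile.noheader.sam")
```

The resulting file could then be given to htseq-count in place of infile.sam in the command from the first post.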


            • #7
              Originally posted by Simon Anders
              The vast majority of your reads seem to map to more than one locus. How come?

              Maybe visualize your BAM files with a genome browser and investigate.
              Thanks, Simon. I have IGV installed on my system. What should I watch out for? How can I tell in IGV whether a read maps to multiple regions? I'm sorry, I'm just a newbie...


              • #8
                htseq

                @Simon and All,

                I investigated my BAM files as advised by Simon and took some snapshots. Could you take a look? I can see some of my reads in intergenic regions, with most of them strongly enriched at the 5' and 3' ends.

                I'd appreciate your comments. Thanks in advance.

                Yemi.
                Attached Files


                • #9
                  Your third screenshot shows how things are supposed to look. These heaps way beyond the 3' ends in the other screenshots look quite unusual. They are probably what gives rise to all the "no_feature" counts.

                  Are there more such heaps in regions even further away from the genes?

                  For anybody to make a guess what is going on, you'll need to tell us more about your experiment. (Which organism? What kind of samples? Which wet-lab protocol? What biological question? Anything non-standard in your procedure or samples?)

                  About the reads with "alignment_not_unique": We cannot see from the screenshot which ones these are. If you hover your mouse over a read, you get the full information on it from the SAM file. Look for the optional field "NH". It tells you how many places this read was mapped to. Are, for example, the uniquely mapping reads ("NH:i:1") in the genes and the multireads (NH>1) in these intergenic heaps?
                  Last edited by Simon Anders; 04-04-2013, 07:58 AM. Reason: edited wording
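
Hovering over reads one at a time does not scale, so the NH check described above can also be done over the whole SAM file. The following is a minimal Python sketch, assuming the aligner writes the optional `NH:i:<n>` field among the tab-separated fields after the eleventh column. The function names are illustrative, and records without an NH field are tallied under `None`.

```python
from collections import Counter


def nh_value(sam_line):
    """Return the integer NH value of one SAM alignment line, or None if the tag is absent."""
    fields = sam_line.rstrip("\n").split("\t")
    # The first 11 columns are mandatory; optional TAG:TYPE:VALUE fields follow.
    for tag in fields[11:]:
        if tag.startswith("NH:i:"):
            return int(tag[len("NH:i:"):])
    return None


def nh_histogram(lines):
    """Count alignment records per NH value, ignoring header and blank lines."""
    counts = Counter()
    for line in lines:
        if line.startswith("@") or not line.strip():
            continue
        counts[nh_value(line)] += 1
    return counts


# Hypothetical usage with the file name from the first post:
# with open("infile.sam") as handle:
#     print(sorted(nh_histogram(handle).items(), key=lambda kv: str(kv[0])))
```

A histogram dominated by NH values greater than 1 would be consistent with the high alignment_not_unique counter reported in the first post.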


                  • #10
                    The screenshots suggest massive RNA degradation, causing 3' bias. 3' UTRs are often repetitive, and since most of your reads align there, that would explain why your alignments are not unique. Did the total RNA look good on the Bioanalyzer prior to library synthesis?
                    Last edited by Darwin; 04-04-2013, 08:20 AM. Reason: Typo.


                    • #11
                      Originally posted by Simon Anders
                      Your third screenshot shows how things are supposed to look. These heaps way beyond the 3' ends in the other screenshots look quite unusual. They are probably what gives rise to all the "no_feature" counts.

                      Are there more such heaps in regions even further away from the genes?

                      For anybody to make a guess what is going on, you'll need to tell us more about your experiment. (Which organism? What kind of samples? Which wet-lab protocol? What biological question? Anything non-standard in your procedure or samples?)

                      About the reads with "alignment_not_unique": We cannot see from the screenshot which ones these are. If you hover your mouse over a read, you get the full information on it from the SAM file. Look for the optional field "NH". It tells you how many places this read was mapped to. Are, for example, the uniquely mapping reads ("NH:i:1") in the genes and the multireads (NH>1) in these intergenic heaps?
                      Thanks, Simon and all. I had a second look at the alignments and observed that, while some reads align within genes, the vast majority appear to align to regions farther away from known genes, with most extending beyond the 3' end.

                      As to the other questions posed: I'm working with skin samples obtained from cattle, and I used the Illumina TruSeq kit to make the libraries. Our study seeks to investigate cattle species raised in different environmental conditions (tropically adapted and temperate adapted).

                      I investigated my SAM files. While some reads had NH:i:1, the vast majority had NH greater than 1, with some having NH:i:20. I checked the TopHat manual and realised that the default maximum number of alignments reported per read (-g/--max-multihits) is 20. Is there any way this can be resolved, given that there are many paralogous genes in this species?

                      As for the RNA quality, it ranged from 6.8 to 7.4. We decided to give it a shot since the samples were so hard to get, all the more so because the RNA was extracted from skin.

                      Please let me know your thoughts...

                      Yemi.


                      • #12
                        @Darwin,
                        Thanks Darwin. The Agilent readings were between 6.8 and 7.4. We decided to give this a shot because the samples were so hard to get.

                        Is there any way to work around this?


                        • #13
                          "NH:20" means that these reads mapped to 20 or more loci, all with extremely similar sequences. This must be some highly repetitive feature that is all over the genome. So, have a look at some of those, and check in Ensembl or wherever what kind of repetitive elements the reads map to. I guess it will not be genes, because at least in the species I work with, there are few genes with that many paralogous copies (especially not copies so similar that TopHat cannot decide between them). Of course, there are many repetitive elements with thousands of copies in mammalian genomes, but they should not be transcribed and hence should not turn up in RNA-Seq data. So, have a look and see what exactly all these multireads map to.
                          Last edited by Simon Anders; 04-05-2013, 06:23 AM.
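
One way to decide which regions to look up is to bin the heavily multi-mapped alignments into fixed-size windows and list the busiest ones. The Python sketch below does this under stated assumptions: the window size, the NH threshold of 20, and the function name are illustrative choices, and records without an NH field are treated as uniquely mapped.

```python
from collections import Counter


def multiread_hotspots(lines, min_nh=20, window=10_000, top=10):
    """Return the `top` (chromosome, window start) bins holding the most
    alignment records whose NH value is at least `min_nh`.

    Window starts are 1-based, matching the SAM POS column.
    """
    hits = Counter()
    for line in lines:
        if line.startswith("@") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11 or fields[2] == "*":
            continue  # malformed or unaligned record
        # Assumption: a record without an NH tag counts as uniquely mapped.
        nh = 1
        for tag in fields[11:]:
            if tag.startswith("NH:i:"):
                nh = int(tag[len("NH:i:"):])
                break
        if nh >= min_nh:
            pos = int(fields[3])
            window_start = (pos - 1) // window * window + 1
            hits[(fields[2], window_start)] += 1
    return hits.most_common(top)


# Hypothetical usage with the file name from the first post:
# with open("infile.sam") as handle:
#     for (chrom, start), n in multiread_hotspots(handle):
#         print(chrom, start, start + 10_000 - 1, n)
```

The coordinates of the top windows can then be looked up in a genome browser or in Ensembl to see what kind of element lies there.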


                          • #14
                            HTSeq

                            @Simon,

                            I took a snapshot of one of the regions with a very large number of reads mapping to it (in the repeat_igv_file). There isn't any gene in that region. I checked in Ensembl as advised, since that is the annotation I used for the TopHat alignments. The "View bottom" file shows the region in Ensembl.

                            Any clues?
                            Attached Files


                            • #15
                              Originally posted by Simon Anders
                              "NH:20" means that these reads mapped to 20 or more loci, all with extremely similar sequences. This must be some highly repetitive feature that is all over the genome. So, have a look at some of those, and check in Ensembl or wherever what kind of repetitive elements the reads map to. I guess it will not be genes, because at least in the species I work with, there are few genes with that many paralogous copies (especially not copies so similar that TopHat cannot decide between them). Of course, there are many repetitive elements with thousands of copies in mammalian genomes, but they should not be transcribed and hence should not turn up in RNA-Seq data. So, have a look and see what exactly all these multireads map to.
                              I clicked on the link that connects Ensembl to UCSC and NCBI. Interestingly, in NCBI Mapview (version 6.1), in the region 79,025,650-79,027,700 bp, there seems to be a gene, LOC100847108, lying between LPP and TPRG1, which is missing from Ensembl. Could you please check the attached "View top" file (the image that was above the "View bottom" file I posted previously)? I appreciate your efforts...
                              Attached Files
                              Last edited by Ajayi Oyeyemi; 04-05-2013, 07:37 AM. Reason: typo
