Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • zaki
    Member
    • Dec 2012
    • 15

    #16
    Originally posted by Simon Anders View Post
    Please take a few of these many reads in location 2, hover your mouse over them in IGV to get the read name, then grep for the read name to find the corresponding lines in the SAM file. Maybe they are all non-unique mapping reads.

    The fact that htseq-count does not count these reads is deliberate, of course. See earlier threads for an explanation.
    Many thanks Simon! I think I need to get my basic understanding of mapping solid.

    Forgive my naivety, If multi-mapping becomes an issue when we are translating mapping reads to count data (expression level)...what than would be an acceptable standard to extract count data?? (is it using htseq-count ,bedtools, cuffdiff)

    Also....during the mapping stage..what bias would be introduced, if we discard multimap reads?


    Edit - you are correct in saying that the reads at location 2 is multimapped, many thanks for the insight!

    Comment

    • Simon Anders
      Senior Member
      • Feb 2010
      • 995

      #17
      See my post #4 in this thread: http://seqanswers.com/forums/showthread.php?t=9129

      Comment

      • lpachter
        Member
        • Feb 2010
        • 40

        #18
        Dear Zaki,

        You ask an excellent question. This paper answers it in detail:

        Lior

        Comment

        • zaki
          Member
          • Dec 2012
          • 15

          #19
          Thanks Simon and Lior for the suggested readings!

          Taking a step back and reading the fundamentals behind each approach is really essential. I think its something I overlooked at first.

          Since the protocol and vignette of each approach is very well described and easy to follow, I guess I jumped the gun and started running commands without understanding much behind it (I think I still have a lot yet to understand... statistics...mapping..biology...etc...)

          Thanks again for your comments and direction! It helped a lot.



          Zaki
          Last edited by zaki; 04-01-2013, 12:47 AM.

          Comment

          • Jose Garcia
            Junior Member
            • Jul 2013
            • 4

            #20
            Originally posted by Simon Anders View Post
            Please take a few of these many reads in location 2, hover your mouse over them in IGV to get the read name, then grep for the read name to find the corresponding lines in the SAM file. Maybe they are all non-unique mapping reads.

            The fact that htseq-count does not count these reads is deliberate, of course. See earlier threads for an explanation.
            Indeed. I was having the same problem. I was getting crazy because I had already checked what Simon pointed about the possibility of having non-unique mapping reads. I think I have found out why.
            If you use htseq-count with -i gene_id, it will construct the set where to add up reads based on the ENSEMBL GENE ID, which will sum reads that could come from different ENSEMBL transcript ID. You may lose the information about transcripts at this level (maybe you can use next DEXSeq approach to look inside...) but at least, you do not lose reads mapping to different identifiers(transcript id), as was my case because I used the UCSC gtf file where gene_id=trascritpt_id.

            However, I still encountered IGV regions in my BAM files with many reads that htseq count anyhow flagged as ambigous, but were not multi aligned. I think I have found out why by looking at the locus with the ensembl genome browser and looking for the "gene_name", that is, the SYMBOL. I realized that the Ensembl annotation (present in the GTF) contained for the same locus, with the same gene_name (SYMBOL), two different ENSEMBLGENE Ids for what it called: "protein coding" and "non-sense-mediated decay", as gene biotype. Both lying in the same locus with high overlap. I guessed then that htseq-count had found reads aligning to both ENSEMBLGENE identifiers and discarded them as ambiguous.

            While discarding reads mapping to different locus or different genes which overlap (hence also different gene Symbols) may have sense, I do not see the sense in discarding reads merely because we have a maybe too exhaustive overlapping annotation. We might also have other features like this, antisense ncRNAs, pseudogenes, for instance, which may have a big overlapping with its parental, and hence, all counts in regions with the same sequence will be discarded.
            For the pseudogenes, unfortunately, if they have a huge overlap, all those reads will be discarded by htseq-count -i gene_name either, but if we are not interested in looking at pseudogenes we could choose to use a more limited annotation like RefSeq.

            So, what do you think of using as a global strategy with htseq-count when interested in having also all non coding, pseudogenes et. al a command line like this:

            htseq-count -m union -t exon -i gene_name - $GTF > $out

            It is always possible to try and deconvolve counts thereafter,
            What do you think about it?
            Thanks
            Best
            Jose
            Last edited by Jose Garcia; 07-17-2013, 07:28 AM.

            Comment

            • vyellapa
              Member
              • Oct 2011
              • 59

              #21
              In order to get gene level expression which is a better tool to use? HTSeq or Bedtools. Or has anyone tried any other tool for this purpose?

              Comment

              • dpryan
                Devon Ryan
                • Jul 2011
                • 3478

                #22
                Don't use bedtools, the results will be wrong for what you want to do. Use htseq-count or featureCounts (from subread).

                Comment

                • vyellapa
                  Member
                  • Oct 2011
                  • 59

                  #23
                  Thank you for your reply.
                  For gene expression measurement, htseq values I believe could be conservative as only unique mapping events are used along with non-overalapping reads. Feature counts is fast and through there is substantial concordance of count values with htseq count. About 20 features per sample have count differences of greater than 10000.

                  Which one does one believe?

                  Comment

                  • dpryan
                    Devon Ryan
                    • Jul 2011
                    • 3478

                    #24
                    As a general rule, htseq-count. You might use the debug options ("-o" for htseq-count, I don't recall what it is for featureCounts) to see why there's such a big difference between the two. Given the same thresholds (i.e., MAPQ filtering) and intersection model, the counts should be the same.

                    Comment

                    • shi
                      Wei Shi
                      • Feb 2010
                      • 236

                      #25
                      Originally posted by vyellapa View Post
                      Thank you for your reply.
                      For gene expression measurement, htseq values I believe could be conservative as only unique mapping events are used along with non-overalapping reads. Feature counts is fast and through there is substantial concordance of count values with htseq count. About 20 features per sample have count differences of greater than 10000.

                      Which one does one believe?
                      The following paper explains in details the difference between featureCounts and htseq-count:

                      featureCounts is available under GNU General Public License as part of the Subread (http://subread.sourceforge.net) or Rsubread (http://www.bioconductor.org) software packages.

                      Comment

                      Latest Articles

                      Collapse

                      • SEQadmin2
                        From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                        by SEQadmin2


                        Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                        The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                        ...
                        Yesterday, 10:05 AM
                      • SEQadmin2
                        Single-Cell Sequencing at an Inflection Point: Early Impacts of New Platforms and Emerging Trends
                        by SEQadmin2


                        With the launch of new single-cell sequencing platforms in 2026, the field stands at an exciting inflection point. This article surveys the most impactful advances in the field and discusses how they’re reshaping research in cancer, immunology, and beyond.


                        Introduction

                        Single-cell sequencing technologies have undergone remarkable advances over the past decade, transitioning from low-throughput experimental approaches to highly scalable platforms capable of...
                        05-22-2026, 06:42 AM
                      • SEQadmin2
                        Environmental Genomics in the Age of NGS: From Microbes to Conservation Strategies
                        by SEQadmin2

                        Studying ecosystems means dealing with complex, multi-species communities that are hard to observe at scale. This complexity, however, hides many important questions to be answered, from how biogeochemical cycles work and how climate change can affect species distribution to how conservation strategies can work best.


                        Genomics, particularly since the expansion of NGS, has transformed ecosystem ecology. By sequencing environmental DNA, we can now assess biodiversity without direct...
                        05-06-2026, 09:04 AM

                      ad_right_rmr

                      Collapse

                      News

                      Collapse

                      Topics Statistics Last Post
                      Started by SEQadmin2, Yesterday, 12:03 PM
                      0 responses
                      19 views
                      0 reactions
                      Last Post SEQadmin2  
                      Started by SEQadmin2, Yesterday, 11:40 AM
                      0 responses
                      14 views
                      0 reactions
                      Last Post SEQadmin2  
                      Started by SEQadmin2, 05-28-2026, 11:40 AM
                      0 responses
                      29 views
                      0 reactions
                      Last Post SEQadmin2  
                      Started by SEQadmin2, 05-26-2026, 10:12 AM
                      0 responses
                      31 views
                      0 reactions
                      Last Post SEQadmin2  
                      Working...