Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • shi
    Wei Shi
    • Feb 2010
    • 236

    Originally posted by tamfernandez View Post
    Dear Wei,

    I would like to use featureCounts for counting multiple aligned reads, but instead of adding 1 to the count for each location y would like to add fractional counts (1/number of locations where the read mapped). Is this posible?

    Thank you in advance for your reply.
    Tamara
    Hi Tamara,

    Although this is something we are considering to add to featureCounts, I do not really think you will get much gain from doing this since this will add noise to your read count data, which may then compromise your downstream differential expression analysis.

    If you worry about those reads overlapping with multiple genes, you should consider using a more conservative annotation such as RefSeq gene annotation which contains much less overlapping genes compared to other annotations such as Ensembl genes.

    Wei

    Comment

    • tamfernandez
      Junior Member
      • Jan 2015
      • 3

      Originally posted by shi View Post
      Hi Tamara,

      Although this is something we are considering to add to featureCounts, I do not really think you will get much gain from doing this since this will add noise to your read count data, which may then compromise your downstream differential expression analysis.

      If you worry about those reads overlapping with multiple genes, you should consider using a more conservative annotation such as RefSeq gene annotation which contains much less overlapping genes compared to other annotations such as Ensembl genes.

      Wei
      Hi Wei,

      Thank you for your reply! I absolutely agree with you but I'm trying to count short reads that map to some multicopy genes and some conserved genes like tRNAs among others. I thought that assigning a fraction to each gene copy could be a good solution to count reads that map specifically to these genes. Do you have any suggestions for dealing with this using featureCounts?

      Again, thank you for your attention.
      Tamara

      Comment

      • shi
        Wei Shi
        • Feb 2010
        • 236

        In feautureCounts built-in annotation, multicopy genes (with the same Entrez gene id) were combined to form one gene that contains exons from all its copies. Assigning 1 count for a read overlapping with this meta-gene should be equivalent to assigning a 1/n count to each of its n copies. You may try the inbuilt annotation to see it works for you.

        Wei

        Comment

        • tamfernandez
          Junior Member
          • Jan 2015
          • 3

          Originally posted by shi View Post
          In feautureCounts built-in annotation, multicopy genes (with the same Entrez gene id) were combined to form one gene that contains exons from all its copies. Assigning 1 count for a read overlapping with this meta-gene should be equivalent to assigning a 1/n count to each of its n copies. You may try the inbuilt annotation to see it works for you.

          Wei
          Thank you Wei for your ansewer. This will solve part of the problem. Maybe for genes like tRNA-Pro isodecoders, which have almost the same sequence but still different ids, I can try to modify the id's so I can use what you suggested.

          Thank you,
          Tamara

          Comment

          • colindaven
            Senior Member
            • Oct 2008
            • 417

            Originally posted by Gordon Smyth View Post
            Counts are needed for a proper statistical analyses, as is explained in the publications behind edgeR, voom and DESeq.

            However, generating RPKM from featureCounts is easy, see:



            for an example.
            Dear Gordon,

            thanks for your response and the link, I agree entirely. Incidentally I have been using edgeR and more recently voom/limma for statistical analysis for many years. However, my users are still very keen to see RPKM values for many, many samples.

            RPKM, despite it's statistical weaknesses, is used widely. At the moment I do take the R route to generate RPM or RPKM, yet this is labour intensive for non-model organisms as I need to create multiple annotation objects. Seeing as all the information is available to featureCounts, I think there would be a demand for this.

            Like you say, the downside is that people then might want to put RPKM into statistical packages like edgeR, but this topic has been dealt with repeatedly and might be avoided by an appropriate warning.

            Thanks again,
            Colin

            Comment

            • Gordon Smyth
              Member
              • Apr 2011
              • 91

              Originally posted by colindaven View Post
              At the moment I do take the R route to generate RPM or RPKM, yet this is labour intensive for non-model organisms as I need to create multiple annotation objects. Seeing as all the information is available to featureCounts, I think there would be a demand for this.
              Hi Colin,

              We seem to be in complete agreement. Our group probably uses RPKM much the same as you do, for descriptive plots and expression summaries.

              We don't output RPKM from featureCounts just on the principle that it seems redundant information. featureCounts already outputs all the information one needs to compute RPKM. You can use the rpkm() function in the edgeR package to extract the RPKM (as in the example I linked to) or else just do it yourself in a couple of lines:

              fc <- featureCounts(stuff)
              RPKM <- fc$counts / fc$annotation$Length * 1e3
              RPKM <- t(t(RPKM) / colSums(fc$counts) * 1e6)

              In our own analyses, we generally like to filter unwanted rows (mitochondrial genes, Ig DNA segments, ribosomal genes, non-expressed genes and what not) and estimate normalization factors before computing RPKM. So there are few steps to do before we extract RPKM.

              BTW, did you know that voom and limma can work directly with RPKM? Just do this:

              v <- voom(counts, design, plot=TRUE)
              v$E <- v$E - log2(GeneLength/1e6)

              Then v$E contains log2(RPKM), and all the limma analyses will work correctly in terms of these expression values.
              Last edited by Gordon Smyth; 01-29-2015, 10:01 PM.

              Comment

              • bruce01
                Senior Member
                • Mar 2011
                • 160

                Strand issues

                Hi Wei,

                I have been using single end 50bp stranded library, and was getting low counts and losing quite a lot of data. I tried '-s =0, 1, 2' and found discrepancies with counts. I expected 1+2 to sum to output of 0, is this correct?

                Code:
                ENS_ID          0    1    2
                ENSG00000000003 48   15   33
                ENSG00000000005 44   17   27
                ENSG00000000419 77   50   91
                ENSG00000000457 235  72   193
                ENSG00000000460 677  581  360
                I have 23,038 matching, and 40,642 which do not sum to output from '-s 0'.

                I have tried resorting (aligned with STAR giving 'sort-by-coord' option) which made no difference to featureCounts output, and my command is:
                Code:
                featureCounts -a ${HOME}/Homo_sapiens.GRCh37.75.saf -F SAF -t gene -o $out -s "$x" -T 8 -d 30 $1
                Any ideas would be very welcome and I appreciate your continued support of you tools.

                Comment

                • shi
                  Wei Shi
                  • Feb 2010
                  • 236

                  Hi @bruce01,

                  The '-s 0' option is likely to give you less counts compared to the sum of counts from '-s 1' and '-s 2'. This is because reads are more likely to be found to overlap with more than one gene when '-s 0' is used (genes with overlapping coordinates but on different strands will be taken as overlapping genes at unstranded counting mode, but not taken as overlapping genes at stranded counting mode).

                  Wei
                  Last edited by shi; 03-23-2015, 09:19 PM. Reason: Further clarification

                  Comment

                  • bruce01
                    Senior Member
                    • Mar 2011
                    • 160

                    Hi Wei,

                    that makes sense for s=0, of course. I suppose my real question is that if we know what strand the RNA originally comes from do we not always need to use s=1+2, seeing as some will be from '+' and some '-'? Is this something other people do?

                    Bruce.

                    Comment

                    • dpryan
                      Devon Ryan
                      • Jul 2011
                      • 3478

                      Strand refers to sense and antisense, not +/-.

                      Comment

                      • bruce01
                        Senior Member
                        • Mar 2011
                        • 160

                        Originally posted by dpryan View Post
                        Strand refers to sense and antisense, not +/-.
                        Sorry about the nomenclature, I have +/- in my head as they are in GTF and that is how I denote it in there.

                        Can I ask your opinion on what way to run featureCounts with stranded libraries? Presumably if strand is known, then using both s=1 and s=2 gives the more accurate counts that using s=0, particularly for overlapping genes, when strand is known?

                        Comment

                        • dpryan
                          Devon Ryan
                          • Jul 2011
                          • 3478

                          If your library is stranded, then use either s=1 or s=2, as appropriate. It seems that s=2 would typically be correct, since most stranded kits use a dUTP-based method (so the sense strand is the opposite of how read #1 aligns). When in doubt, featureCounts is really fast, so you can run it with s=1 and s=2 separately and then just determine which one was the correct setting from the numbers output (afterall, sometimes you don't know the strandedness). The incorrect one will have vastly lower numbers.

                          Comment

                          • bruce01
                            Senior Member
                            • Mar 2011
                            • 160

                            Originally posted by dpryan View Post
                            The incorrect one will have vastly lower numbers.
                            This is my problem, I have samples where s=1, which is incorrect based on the library prep protocol which was dUTP based, have more counts than s=2. This is from FFPE tumor samples, and so I expect it to be bad, but not like this. Will go with s=2 anyway, thanks for your help.

                            Comment

                            • shi
                              Wei Shi
                              • Feb 2010
                              • 236

                              Hi Bruce,

                              I agree with Devon that if your library is stranded you should count reads at the stranded mode (s=1 or s=2) rather than unstranded mode (s=0).

                              Also as I said earlier, my concern with performing stranded counting for unstranded reads is that you may have reduced capability to detect reads that overlap with more than one gene. This may result in reads being assigned to wrong genes.

                              Another thing you need to consider is that the annotated strands for genes might be incorrect, particularly for those computationally predicted genes. Stranded read counting will be problematic for such genes.

                              Wei

                              Comment

                              • GenoMax
                                Senior Member
                                • Feb 2008
                                • 7142

                                Hi Wei,

                                Do you have plans to add support for SAM v.1.4 format tags to featureCounts?

                                We use BBMap for alignments and it outputs SAM v. 1.4 tags by default. We have been having trouble assigning read counts using those alignments with featureCount. featureCount does not generate an error but only assigns a small number of (< 1%) reads.

                                Note: BBMap can output SAM v.1.3 tags and then the counting works, which we are using as a workaround for now.

                                Thanks.

                                Comment

                                Latest Articles

                                Collapse

                                • SEQadmin2
                                  From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                                  by SEQadmin2


                                  Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                                  The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                                  ...
                                  Yesterday, 10:05 AM
                                • SEQadmin2
                                  Single-Cell Sequencing at an Inflection Point: Early Impacts of New Platforms and Emerging Trends
                                  by SEQadmin2


                                  With the launch of new single-cell sequencing platforms in 2026, the field stands at an exciting inflection point. This article surveys the most impactful advances in the field and discusses how they’re reshaping research in cancer, immunology, and beyond.


                                  Introduction

                                  Single-cell sequencing technologies have undergone remarkable advances over the past decade, transitioning from low-throughput experimental approaches to highly scalable platforms capable of...
                                  05-22-2026, 06:42 AM
                                • SEQadmin2
                                  Environmental Genomics in the Age of NGS: From Microbes to Conservation Strategies
                                  by SEQadmin2

                                  Studying ecosystems means dealing with complex, multi-species communities that are hard to observe at scale. This complexity, however, hides many important questions to be answered, from how biogeochemical cycles work and how climate change can affect species distribution to how conservation strategies can work best.


                                  Genomics, particularly since the expansion of NGS, has transformed ecosystem ecology. By sequencing environmental DNA, we can now assess biodiversity without direct...
                                  05-06-2026, 09:04 AM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by SEQadmin2, Yesterday, 12:03 PM
                                0 responses
                                19 views
                                0 reactions
                                Last Post SEQadmin2  
                                Started by SEQadmin2, Yesterday, 11:40 AM
                                0 responses
                                14 views
                                0 reactions
                                Last Post SEQadmin2  
                                Started by SEQadmin2, 05-28-2026, 11:40 AM
                                0 responses
                                29 views
                                0 reactions
                                Last Post SEQadmin2  
                                Started by SEQadmin2, 05-26-2026, 10:12 AM
                                0 responses
                                31 views
                                0 reactions
                                Last Post SEQadmin2  
                                Working...