Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Originally posted by tamfernandez View Post
    Dear Wei,

    I would like to use featureCounts for counting multiple aligned reads, but instead of adding 1 to the count for each location y would like to add fractional counts (1/number of locations where the read mapped). Is this posible?

    Thank you in advance for your reply.
    Tamara
    Hi Tamara,

    Although this is something we are considering to add to featureCounts, I do not really think you will get much gain from doing this since this will add noise to your read count data, which may then compromise your downstream differential expression analysis.

    If you worry about those reads overlapping with multiple genes, you should consider using a more conservative annotation such as RefSeq gene annotation which contains much less overlapping genes compared to other annotations such as Ensembl genes.

    Wei

    Comment


    • Originally posted by shi View Post
      Hi Tamara,

      Although this is something we are considering to add to featureCounts, I do not really think you will get much gain from doing this since this will add noise to your read count data, which may then compromise your downstream differential expression analysis.

      If you worry about those reads overlapping with multiple genes, you should consider using a more conservative annotation such as RefSeq gene annotation which contains much less overlapping genes compared to other annotations such as Ensembl genes.

      Wei
      Hi Wei,

      Thank you for your reply! I absolutely agree with you but I'm trying to count short reads that map to some multicopy genes and some conserved genes like tRNAs among others. I thought that assigning a fraction to each gene copy could be a good solution to count reads that map specifically to these genes. Do you have any suggestions for dealing with this using featureCounts?

      Again, thank you for your attention.
      Tamara

      Comment


      • In feautureCounts built-in annotation, multicopy genes (with the same Entrez gene id) were combined to form one gene that contains exons from all its copies. Assigning 1 count for a read overlapping with this meta-gene should be equivalent to assigning a 1/n count to each of its n copies. You may try the inbuilt annotation to see it works for you.

        Wei

        Comment


        • Originally posted by shi View Post
          In feautureCounts built-in annotation, multicopy genes (with the same Entrez gene id) were combined to form one gene that contains exons from all its copies. Assigning 1 count for a read overlapping with this meta-gene should be equivalent to assigning a 1/n count to each of its n copies. You may try the inbuilt annotation to see it works for you.

          Wei
          Thank you Wei for your ansewer. This will solve part of the problem. Maybe for genes like tRNA-Pro isodecoders, which have almost the same sequence but still different ids, I can try to modify the id's so I can use what you suggested.

          Thank you,
          Tamara

          Comment


          • Originally posted by Gordon Smyth View Post
            Counts are needed for a proper statistical analyses, as is explained in the publications behind edgeR, voom and DESeq.

            However, generating RPKM from featureCounts is easy, see:



            for an example.
            Dear Gordon,

            thanks for your response and the link, I agree entirely. Incidentally I have been using edgeR and more recently voom/limma for statistical analysis for many years. However, my users are still very keen to see RPKM values for many, many samples.

            RPKM, despite it's statistical weaknesses, is used widely. At the moment I do take the R route to generate RPM or RPKM, yet this is labour intensive for non-model organisms as I need to create multiple annotation objects. Seeing as all the information is available to featureCounts, I think there would be a demand for this.

            Like you say, the downside is that people then might want to put RPKM into statistical packages like edgeR, but this topic has been dealt with repeatedly and might be avoided by an appropriate warning.

            Thanks again,
            Colin

            Comment


            • Originally posted by colindaven View Post
              At the moment I do take the R route to generate RPM or RPKM, yet this is labour intensive for non-model organisms as I need to create multiple annotation objects. Seeing as all the information is available to featureCounts, I think there would be a demand for this.
              Hi Colin,

              We seem to be in complete agreement. Our group probably uses RPKM much the same as you do, for descriptive plots and expression summaries.

              We don't output RPKM from featureCounts just on the principle that it seems redundant information. featureCounts already outputs all the information one needs to compute RPKM. You can use the rpkm() function in the edgeR package to extract the RPKM (as in the example I linked to) or else just do it yourself in a couple of lines:

              fc <- featureCounts(stuff)
              RPKM <- fc$counts / fc$annotation$Length * 1e3
              RPKM <- t(t(RPKM) / colSums(fc$counts) * 1e6)

              In our own analyses, we generally like to filter unwanted rows (mitochondrial genes, Ig DNA segments, ribosomal genes, non-expressed genes and what not) and estimate normalization factors before computing RPKM. So there are few steps to do before we extract RPKM.

              BTW, did you know that voom and limma can work directly with RPKM? Just do this:

              v <- voom(counts, design, plot=TRUE)
              v$E <- v$E - log2(GeneLength/1e6)

              Then v$E contains log2(RPKM), and all the limma analyses will work correctly in terms of these expression values.
              Last edited by Gordon Smyth; 01-29-2015, 10:01 PM.

              Comment


              • Strand issues

                Hi Wei,

                I have been using single end 50bp stranded library, and was getting low counts and losing quite a lot of data. I tried '-s =0, 1, 2' and found discrepancies with counts. I expected 1+2 to sum to output of 0, is this correct?

                Code:
                ENS_ID          0    1    2
                ENSG00000000003 48   15   33
                ENSG00000000005 44   17   27
                ENSG00000000419 77   50   91
                ENSG00000000457 235  72   193
                ENSG00000000460 677  581  360
                I have 23,038 matching, and 40,642 which do not sum to output from '-s 0'.

                I have tried resorting (aligned with STAR giving 'sort-by-coord' option) which made no difference to featureCounts output, and my command is:
                Code:
                featureCounts -a ${HOME}/Homo_sapiens.GRCh37.75.saf -F SAF -t gene -o $out -s "$x" -T 8 -d 30 $1
                Any ideas would be very welcome and I appreciate your continued support of you tools.

                Comment


                • Hi @bruce01,

                  The '-s 0' option is likely to give you less counts compared to the sum of counts from '-s 1' and '-s 2'. This is because reads are more likely to be found to overlap with more than one gene when '-s 0' is used (genes with overlapping coordinates but on different strands will be taken as overlapping genes at unstranded counting mode, but not taken as overlapping genes at stranded counting mode).

                  Wei
                  Last edited by shi; 03-23-2015, 09:19 PM. Reason: Further clarification

                  Comment


                  • Hi Wei,

                    that makes sense for s=0, of course. I suppose my real question is that if we know what strand the RNA originally comes from do we not always need to use s=1+2, seeing as some will be from '+' and some '-'? Is this something other people do?

                    Bruce.

                    Comment


                    • Strand refers to sense and antisense, not +/-.

                      Comment


                      • Originally posted by dpryan View Post
                        Strand refers to sense and antisense, not +/-.
                        Sorry about the nomenclature, I have +/- in my head as they are in GTF and that is how I denote it in there.

                        Can I ask your opinion on what way to run featureCounts with stranded libraries? Presumably if strand is known, then using both s=1 and s=2 gives the more accurate counts that using s=0, particularly for overlapping genes, when strand is known?

                        Comment


                        • If your library is stranded, then use either s=1 or s=2, as appropriate. It seems that s=2 would typically be correct, since most stranded kits use a dUTP-based method (so the sense strand is the opposite of how read #1 aligns). When in doubt, featureCounts is really fast, so you can run it with s=1 and s=2 separately and then just determine which one was the correct setting from the numbers output (afterall, sometimes you don't know the strandedness). The incorrect one will have vastly lower numbers.

                          Comment


                          • Originally posted by dpryan View Post
                            The incorrect one will have vastly lower numbers.
                            This is my problem, I have samples where s=1, which is incorrect based on the library prep protocol which was dUTP based, have more counts than s=2. This is from FFPE tumor samples, and so I expect it to be bad, but not like this. Will go with s=2 anyway, thanks for your help.

                            Comment


                            • Hi Bruce,

                              I agree with Devon that if your library is stranded you should count reads at the stranded mode (s=1 or s=2) rather than unstranded mode (s=0).

                              Also as I said earlier, my concern with performing stranded counting for unstranded reads is that you may have reduced capability to detect reads that overlap with more than one gene. This may result in reads being assigned to wrong genes.

                              Another thing you need to consider is that the annotated strands for genes might be incorrect, particularly for those computationally predicted genes. Stranded read counting will be problematic for such genes.

                              Wei

                              Comment


                              • Hi Wei,

                                Do you have plans to add support for SAM v.1.4 format tags to featureCounts?

                                We use BBMap for alignments and it outputs SAM v. 1.4 tags by default. We have been having trouble assigning read counts using those alignments with featureCount. featureCount does not generate an error but only assigns a small number of (< 1%) reads.

                                Note: BBMap can output SAM v.1.3 tags and then the counting works, which we are using as a workaround for now.

                                Thanks.

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Choosing Between NGS and qPCR
                                  by seqadmin



                                  Next-generation sequencing (NGS) and quantitative polymerase chain reaction (qPCR) are essential techniques for investigating the genome, transcriptome, and epigenome. In many cases, choosing the appropriate technique is straightforward, but in others, it can be more challenging to determine the most effective option. A simple distinction is that smaller, more focused projects are typically better suited for qPCR, while larger, more complex datasets benefit from NGS. However,...
                                  10-18-2024, 07:11 AM
                                • seqadmin
                                  Non-Coding RNA Research and Technologies
                                  by seqadmin




                                  Non-coding RNAs (ncRNAs) do not code for proteins but play important roles in numerous cellular processes including gene silencing, developmental pathways, and more. There are numerous types including microRNA (miRNA), long ncRNA (lncRNA), circular RNA (circRNA), and more. In this article, we discuss innovative ncRNA research and explore recent technological advancements that improve the study of ncRNAs.

                                  Nobel Prize for MicroRNA Discovery
                                  This week,...
                                  10-07-2024, 08:07 AM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, Yesterday, 05:31 AM
                                0 responses
                                10 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 10-24-2024, 06:58 AM
                                0 responses
                                20 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 10-23-2024, 08:43 AM
                                0 responses
                                50 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 10-17-2024, 07:29 AM
                                0 responses
                                58 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X