Announcement

Collapse

Welcome to the New Seqanswers!

Welcome to the new Seqanswers! We'd love your feedback, please post any you have to this topic: New Seqanswers Feedback.
See more
See less

featureCounts: a universal read summarization program

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Hi @GenoMax,

    featureCounts does support SAM specification version 1.4. Have you looked at the .summary file generated by featureCounts to see why reads were not assigned?

    Best,
    Wei

    Comment


    • Originally posted by shi View Post
      Hi @GenoMax,

      featureCounts does support SAM specification version 1.4. Have you looked at the .summary file generated by featureCounts to see why reads were not assigned?

      Best,
      Wei
      Hi Wei,

      Here is a comparison of the two flags for a sample. I have used the -M flag for featureCounts.

      Code:
      With SAM v.1.3 flags
      
      Assigned	     104867565
      Unassigned_Ambiguity	577037
      Unassigned_MultiMapping	0
      Unassigned_NoFeatures	52436519
      Unassigned_Unmapped	1772756
      
      With SAM v.1.4 flags
      
      Assigned	      66620
      Unassigned_Ambiguity	280
      Unassigned_MultiMapping	0
      Unassigned_NoFeatures	157814221
      Unassigned_Unmapped	1772756
      Last edited by GenoMax; 10-22-2015, 04:56 PM.

      Comment


      • Thanks @GenoMax,

        Could you also provide mapping results for a few reads for each version?

        Comment


        • For what it's worth, the only difference between SAM 1.3 and 1.4 format output BBMap is that in 1.3 mode, it use the "M" symbol in the cigar string; in 1.4 mode, it uses "=" and "X" for match and mismatch in the cigar string. 1.4 will still occasionally contain "M" symbols when there is a "N"-called bases, as that does not strictly speaking match or mismatch. Given that a small fraction of the reads do get assigned from v1.4 sam files, I wonder if featureCounts might be using only the reads that contain at least one M operation in the cigar string?

          Comment


          • featureCounts currently does not support letters '=' and 'X' in the CIGAR string and we think that is the reason why it failed to assign most of the reads in your v1.4 data. We will add support to these letters in featureCounts.

            We are going to make a major release of our software package next week but unfortunately we are not able to include this support in the release. We will work on fixing this issue after the release. Will let you know once this is done.

            Comment


            • Hi Shi,

              Brian is on the right track.

              Here is a sample of reads

              Code:
              With v.1.3 tags
              
              XXXXXXX:XXX:XXXXXX:2:1213:18683:18437 1:N:0:        0       chr10   376     3       50M     *       0       0       GTCTGTGTTCTGTGTCAGAGGAGCAGTAGAGCTGAGGCTATCAAGCTATC    [email protected]@<[email protected]@[email protected]?GEHHHGEHHFHE       XT:A:R  NM:i:0  AM:i:3
              XXXXXXX:XXX:XXXXXX:2:1203:10888:83435 1:N:0:        0       chr10   413     3       50M     *       0       0       CTATCAAGCTATCAGGACATCCCAGATGCCCAAGACTACACTAGTCACCA    DDDBD?HFEE?GHGHEEEFIFIIHGHHHI?HIIIEHIIIIEHHHEHFHHI       XT:A:R  NM:i:0  AM:i:3
              XXXXXXX:XXX:XXXXXX:2:1114:19668:64532 1:N:0:        0       chr10   938     2       5M3I42M *       0       0       CCACTGACACCATTCTTACTGTCATCCTCTGAAGCCAGGTCCTGCTCAGT    DDDDDIIIIIGIIIIIIIIIHIIIIIIIIHIIIHIIIIIIIIIIIHIIIF       XT:A:R  NM:i:4  AM:i:2
              XXXXXXX:XXX:XXXXXX:2:1103:21004:71100 1:N:0:        0       chr10   946     3       50M     *       0       0       ATTCTTACTGTCATCCTCTGAAGCCAGGTCCTGCTCAGTGTTGGCATAGG    0<[email protected]@[email protected][email protected][email protected]?ECHHII       XT:A:R  NM:i:0  AM:i:3
              XXXXXXX:XXX:XXXXXX:2:2106:11030:47701 1:N:0:        0       chr10   946     3       50M     *       0       0       ATTCTTACTGTCATCCTCTGAAGCCAGGTCCTGCTCAGTGTTGGCATAGG    DDDDDHIFIIIHIIIIIIIIHHEHEHIIGHHIIIIHIIHHIIIHHHHHHE       XT:A:R  NM:i:0  AM:i:3
              
              With v.1.4 tags
              
              XXXXXXX:XXX:XXXXXX:2:1213:18683:18437 1:N:0:        0       chr10   376     3       50=     *       0       0       GTCTGTGTTCTGTGTCAGAGGAGCAGTAGAGCTGAGGCTATCAAGCTATC    [email protected]@<[email protected]@[email protected]?GEHHHGEHHFHE       XT:A:R  NM:i:0  AM:i:3
              XXXXXXX:XXX:XXXXXX:2:1203:10888:83435 1:N:0:        0       chr10   413     3       50=     *       0       0       CTATCAAGCTATCAGGACATCCCAGATGCCCAAGACTACACTAGTCACCA    DDDBD?HFEE?GHGHEEEFIFIIHGHHHI?HIIIEHIIIIEHHHEHFHHI       XT:A:R  NM:i:0  AM:i:3
              XXXXXXX:XXX:XXXXXX:2:1114:19668:64532 1:N:0:        0       chr10   938     2       2=1X2=3I42=     *       0       0       CCACTGACACCATTCTTACTGTCATCCTCTGAAGCCAGGTCCTGCTC
              AGT     DDDDDIIIIIGIIIIIIIIIHIIIIIIIIHIIIHIIIIIIIIIIIHIIIF      XT:A:R  NM:i:4  AM:i:2
              XXXXXXX:XXX:XXXXXX:2:1103:21004:71100 1:N:0:        0       chr10   946     3       50=     *       0       0       ATTCTTACTGTCATCCTCTGAAGCCAGGTCCTGCTCAGTGTTGGCATAGG    0<[email protected]@[email protected][email protected][email protected]?ECHHII       XT:A:R  NM:i:0  AM:i:3
              XXXXXXX:XXX:XXXXXX:2:2106:11030:47701 1:N:0:        0       chr10   946     3       50=     *       0       0       ATTCTTACTGTCATCCTCTGAAGCCAGGTCCTGCTCAGTGTTGGCATAGG    DDDDDHIFIIIHIIIIIIIIHHEHEHIIGHHIIIIHIIHHIIIHHHHHHE       XT:A:R  NM:i:0  AM:i:3
              Edit: Just saw the last post from Wei. Will look forward to a future release of subread.

              Edit2: There was an error in the copy/paste of the data in original post. I have replaced the original with correct copy.
              Last edited by GenoMax; 10-23-2015, 11:03 AM.

              Comment


              • Hi GenoMax,

                Thanks for providing the sample reads. However, we noticed that there are missing columns in your mapping results (either PNEXT or TLEN or both). This might cause problems for read counting by featureCounts and lead to incorrect counts. For example, featureCounts might not be able to find NH tags which are used to identify multi-mapping reads.

                Comment


                • That's really strange. It is, indeed, not a valid sam file. I just did some mapping and generated this:

                  Code:
                  @HD	VN:1.4	SO:unsorted
                  @SQ	SN:ecoli_K12	LN:4639675
                  @PG	ID:BBMap	PN:BBMap	VN:35.51	CL:java -ea -Xmx1g align2.BBMap in=80x_perfect.fa.gz reads=1000 int=f out=xx2.sam ow
                  400	16	ecoli_K12	778201	45	150=	*	0	0	TCTTCCGGCAACTGATGGACAGGTCAAATTCCCTGCCTGGTCGCCGTATCTGTGATAATAATTAATTGAATAGTAAAGGAATCATTGAAATGCAACTGAACAAAGTGCTGAAAGGGCTGATGATTGCTCTGCCTGTTATGGCAATTGCGG	*	NM:i:0	AM:i:45
                  401	0	ecoli_K12	940032	45	150=	*	0	0	GTAATACTACTTTCGAGTGAAAATCTACCTATCTCTTTGATTTTCAAATTATTCGATGTATACAAGCCTATATAGCGAACTGCTATAGAAATAATTACACAATACGGTTTGTTACTGGAATCAATCGTGAGCAAGCTTGAGTGAGCCATT	*	NM:i:0	AM:i:45
                  402	0	ecoli_K12	543578	45	150=	*	0	0	CATTTCATCGCAAATCGCCCGCATGTCGTTTTCTAACTGTTGGGTGAAATCGCGCAGCACGGCAGCGTCGGTATGACGACAATCAATGGTGAACGTGGTTTTACCCGGCACCACATTTACCGTATTCGGGCGCGGCTCTACTTTGCCAAA	*	NM:i:0	AM:i:45
                  All of the columns are present. I have no idea how you were able to generate output missing some columns... could it be some kind of copy-paste error? Or did you reformat the data in some way?

                  Comment


                  • Originally posted by Brian Bushnell View Post
                    That's really strange. It is, indeed, not a valid sam file.
                    All of the columns are present. I have no idea how you were able to generate output missing some columns... could it be some kind of copy-paste error? Or did you reformat the data in some way?
                    When I copied and pasted the data last night something must have gone wrong. I have recopied the data in post above. Sorry about that.

                    Comment


                    • Question on Output

                      Hi,
                      Thanks for this wonderful tool. We use it routinely.
                      I had a question regarding the output (.featureCounts file). What is the difference between a read having more than one gene assigned with commas in one line (Read1) and a read assigned to more than one gene listed over multiple lines (Read2)?

                      Read1 Assigned ENSNNNG00000079745,ENSNNNG00000059028 Total=2

                      Read2 Unassigned_NoFeatures * *
                      Read2 Unassigned_NoFeatures * *
                      Read2 Assigned ENSNNNG00000098385 *
                      Read2 Assigned ENSNNNG00000101430 *
                      Read2 Assigned ENSNNNG00000100881 *

                      The output is a result of dropping the --primary option and including the -M and -O options.

                      Thanks in advance for your help.
                      Karthik

                      Comment


                      • Read 1 was reported only once in your mapping result and this read was found to overlap with two genes (ENSNNNG00000079745 and ENSNNNG00000059028) by featureCounts.

                        Read 2 was reported 5 times in your mapping result (5 rows in your bam/sam file). The first two alignments reported for this read do not overlap with any genes and the other three alignments were found to overlap with genes.

                        Read 1 was counted two times because of multi-overlapping and Read 2 was counted three times due to multi-mapping.

                        Comment


                        • Thanks!

                          Thanks for the quick response and clarification provided.

                          Comment


                          • Hello,

                            I am trying to use featureCount with a bam file that aligned using Hisat2 with an indexed genome and spliced sites file created from a gtf file. Now I am using the same gtf file in featureCount but am getting 0% Successfully assigned reads.

                            Here is the code I am using
                            featureCounts -a gene.gtf -F GTF -b -o countsSb10.txt -t gene -R -g gene_id Sb10.aln-sort.bam

                            I am not sure what I am doing wrong. These are reads that aligned to the reference genome, my unmapped reads are removed.

                            Thank you
                            htetre

                            Comment


                            • Are you using the latest version? '-b' is an invalid argument. Could you also provide the assignment summary that can be found in the 'countsSb10.txt.summary' file?

                              Comment


                              • output tagged bamfile

                                Hi Shi,

                                Would it be possible to include an option for outputting a tagged bam file (similar to the htseq --samout option) in the next release. Basically a bamfile with a XF:Z: tag that has the name of the assignment there. We're using this currently to allow umi-tools pcr deduplication of RNAseq reads on a per gene basis, and use the FC -R files to add the tag to the bamfiles ourselves but it would a lot faster if featurecounts could just output the tagged bamfiles.

                                Thanks, Marinus

                                Comment

                                Working...
                                X