Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #91
    Originally posted by sindrle View Post
    Thank you, you have been very helpful.

    Im currently struggling with how to implement featureCounts to DEXseq, its much easier to use HTSeq, since its described in the manual.

    I really want to be able to use featureCounts, since with HTSeq I had to "hack" the python script to work with my annotation file, whereas I don't need that with featureCounts.

    Also, I want to be able, in the end, to present expression values of transcripts as counts/library size/gene length, which also is easier with featureCounts, since it outputs gene length!

    Could you elaborate a bit the difficulty you encountered with using featureCounts output with dexseq? Basically, featureCounts just outputs a read count table and it can be readily used by downstream DE analysis programs such as limma/voom and edgeR. The analysis can be performed at both gene level and exon level with these programs.

    Comment


    • #92
      Its only because me, and maybe more, do not have a background in programming at all.

      Its very simple to follow an explained guide, but with FeatureCounts I have to read and search for every step I take. Taking time which may be saved using HTSeq.

      Comment


      • #93
        Hi Wei,

        featureCounts is awesome-- thanks for writing such nice software. We use it in the bcbio-nextgen pipeline for generating counts and it is insanely fast.

        We have encountered this error on some alignment files when running featureCounts, version 1.4.3-p:

        *** glibc detected *** /n/hsphS10/hsphfs1/chb/local/bin/featureCounts: double free or corruption (!prev): 0x0d7312f0 ***
        ======= Backtrace: =========
        [0x80cfb0c]
        [0x80d2e3b]
        [0x808e405]
        ======= Memory map: ========
        08048000-08133000 r-xp 00000000 00:20 366245842982 /net/hsphfs1/srv/export/hsphfs1/share_root/chb/local/bin/featureCounts
        08133000-08134000 rw-p 000eb000 00:20 366245842982 /net/hsphfs1/srv/export/hsphfs1/share_root/chb/local/bin/featureCounts
        08134000-08138000 rw-p 00000000 00:00 0
        084e6000-0d744000 rw-p 00000000 00:00 0 [heap]
        55555000-55556000 r-xp 00000000 00:00 0 [vdso]
        55556000-55855000 rw-p 00000000 00:00 0
        55857000-5644c000 rw-p 00000000 00:00 0
        5644c000-57333000 rw-p 00000000 00:00 0
        57333000-58391000 rw-p 00000000 00:00 0
        58400000-58421000 rw-p 00000000 00:00 0
        58421000-58500000 ---p 00000000 00:00 0
        589f7000-589f8000 ---p 00000000 00:00 0
        589f8000-59f83000 rw-p 00000000 00:00 0
        fff96000-fffac000 rw-p 00000000 00:00 0 [stack]

        The command is:

        /n/hsphS10/hsphfs1/chb/local/bin/featureCounts -a /net/hsphfs1/srv/export/hsphfs1/share_root/chb/biodata/genomes/Mmusculus/mm10/rnaseq/ref-transcripts.gtf -o /net/hsphfs1/srv/export/hsphfs1/share_root/chb/projects/ljg_cardiac_rnaseq/project/work/htseq-count/tx/tmpJu9kkp/CRUSH2673-active.counts -s 0 -p -B -C /net/hsphfs1/srv/export/hsphfs1/share_root/chb/projects/ljg_cardiac_rnaseq/project/work/align/CRUSH2673-active/5_2014-03-12_project_star/CRUSH2673-active.sorted.bam

        The header looks fine and the BAM and GTF files are also all good. Any thoughts?

        Comment


        • #94
          We have just released a new version (1.4.4). Please try that one to see if the problem persists.

          Cheers,
          Wei

          Comment


          • #95
            Issues using ensembl (version 75) GTF

            I'm attempting to evaluate featureCounts to summarize a collection of bam files from an RNA-Seq experiment aligned to ensembl sequences. FASTA and GTF files for ensemble transcripts were obtained from the ensembl FTP site, modified by appending ERCC sequences and descriptors, respectively, and aligned using bwa-mem.

            featureCounts happily processes the resulting bam files when using the ERCC GTF info only (and count totals seem to match that obtained using summarizeOverlaps in R), however using the ensembl GTF alone results in 0 "successfully assigned reads", while the combined ERCC and ensembl transcript GTF file results in only ERCC mapping reads being quantified (i.e. all ensembl transcripts appear in the output file, but all 0 counts). I think the issue may be because of some sort of formatting issue in the ensembl GTF? Can anyone shed any light on this issue?

            The head of ensembl gtf I'm using is attached.

            Thanks!
            Attached Files

            Comment


            • #96
              Originally posted by also.casey View Post
              I'm attempting to evaluate featureCounts to summarize a collection of bam files from an RNA-Seq experiment aligned to ensembl sequences. FASTA and GTF files for ensemble transcripts were obtained from the ensembl FTP site, modified by appending ERCC sequences and descriptors, respectively, and aligned using bwa-mem.

              featureCounts happily processes the resulting bam files when using the ERCC GTF info only (and count totals seem to match that obtained using summarizeOverlaps in R), however using the ensembl GTF alone results in 0 "successfully assigned reads", while the combined ERCC and ensembl transcript GTF file results in only ERCC mapping reads being quantified (i.e. all ensembl transcripts appear in the output file, but all 0 counts). I think the issue may be because of some sort of formatting issue in the ensembl GTF? Can anyone shed any light on this issue?

              The head of ensembl gtf I'm using is attached.

              Thanks!
              Is your combined FASTA file or the index you made for bwa-mem messed up? You would get this behavior if the reads were not aligned to the genome at all, only the ERCC sequences.

              Comment


              • #97
                Thanks for your reply roryk. I checked that reads were aligned using:

                samtools view foo.bam | awk '$3 != "*"' | head -1

                which returns:

                "HWI-ST913:209:C2V7JACXX:6:1101:1587:2173 83 ENST00000394020 4460 0 58M = 4417 -101 CCTTATAGTTCTCTGCACACAATAGGTGCTTCTTGGATGTTCTTGGGTTTGGAAATAA IGHGGCCFHHFCHIIHIGJIJIJIHHAEB3EFD<CGGIJIIIIIHGIGGIHHIHIGJH NM:i:0 MD:Z:58 AS:i:58 XS:i:58"

                Counting reads mapped to reference sequence names such as the one above in the example bam files using:

                samtools view foo.bam | awk '$3 != "*"' | wc -l

                indicates ~58% of reads were mapped in total, with ~3.5% mapping to ERCC reference sequences. Reads were aligned, but aren't being counted by featureCounts it seems.
                Last edited by also.casey; 04-03-2014, 10:35 AM.

                Comment


                • #98
                  The ensemble file you posted are chromosome coordinates but you aligned to a FASTA file of transcript sequences, not the whole chromosomes.

                  Comment


                  • #99
                    Can I modify the linked gtf in any way to allow for summarization of the mapped reads as aligned?

                    Comment


                    • You could just count the third column in your SAM file. Aligning to the whole chromosomes is what you really want to do though.

                      Comment


                      • If this is used for counting reads mapping on genes using Ensembl GTF file:

                        featureCounts \
                        -a organism.gtf \
                        -t exon \
                        -g gene_id \
                        -o counts_per_gene.txt \
                        accepted_hits.sorted.bam

                        then how should be counted the reads mapping on exons using the same Ensembl GTF file?

                        Is it something like this?
                        featureCounts \
                        -a organism.gtf \
                        -t exon \
                        -f exon \
                        -g exon_id \
                        -o counts_per_exons.txt \
                        accepted_hits.sorted.bam

                        ???
                        Last edited by ndaniel; 04-29-2014, 04:11 AM. Reason: ++

                        Comment


                        • Replacing "-f exon" with just "-f" will instruct featureCounts to perform feature (eg exon) level summarization. The output will include number of reads assigned to each exon.

                          You may also want to turn on "-O" option, which will instruct featureCounts to assign a read to all its overlapping exons. This is often useful for exon-level read counting.

                          Wei

                          Comment


                          • featureCounts for counting reads mapping on exons

                            Originally posted by shi View Post
                            Replacing "-f exon" with just "-f" will instruct featureCounts to perform feature (eg exon) level summarization. The output will include number of reads assigned to each exon.

                            You may also want to turn on "-O" option, which will instruct featureCounts to assign a read to all its overlapping exons. This is often useful for exon-level read counting.

                            Wei
                            Thanks!

                            Is this:


                            featureCounts \
                            -a organism.gtf \
                            -t exon \
                            -f \
                            -g exon_id \
                            -O \
                            -o counts_per_exons.txt \
                            accepted_hits.sorted.bam


                            or this (the difference is "-g gene_id" and "-g exon_id" ?


                            featureCounts \
                            -a organism.gtf \
                            -t exon \
                            -f \
                            -g gene_id \
                            -O \
                            -o counts_per_exons.txt \
                            accepted_hits.sorted.bam


                            Also, what are the command line parameters for running featureCounts for counting reads mapping on exons from command line in order to produce exactly the same output as this R version of featureCounts as shown below (taken from limma user guide which was updated last time on April 12, 2014, page 127):


                            fc_SE <- featureCounts(SE_bam_files, annot.ext="unique.gff",
                            + isGTFAnnotationFile=TRUE,
                            + GTF.featureType="exon", GTF.attrType="ID",
                            + useMetaFeatures=FALSE, allowMultiOverlap=TRUE)

                            Comment


                            • Originally posted by ndaniel View Post
                              Thanks!

                              Is this:


                              featureCounts \
                              -a organism.gtf \
                              -t exon \
                              -f \
                              -g exon_id \
                              -O \
                              -o counts_per_exons.txt \
                              accepted_hits.sorted.bam


                              or this (the difference is "-g gene_id" and "-g exon_id" ?


                              featureCounts \
                              -a organism.gtf \
                              -t exon \
                              -f \
                              -g gene_id \
                              -O \
                              -o counts_per_exons.txt \
                              accepted_hits.sorted.bam
                              Typically this should be "gene_id", but you may check your GTF annotation file to make sure it is correct (the 9th column).

                              Also, what are the command line parameters for running featureCounts for counting reads mapping on exons from command line in order to produce exactly the same output as this R version of featureCounts as shown below (taken from limma user guide which was updated last time on April 12, 2014, page 127):


                              fc_SE <- featureCounts(SE_bam_files, annot.ext="unique.gff",
                              + isGTFAnnotationFile=TRUE,
                              + GTF.featureType="exon", GTF.attrType="ID",
                              + useMetaFeatures=FALSE, allowMultiOverlap=TRUE)
                              This should give you the same results as you got from sourceforge featureCounts. You need to adapt this R command to make it work for your data though. Also, please make sure you are using Rsubread 1.14.0 and sourceforget Subread 1.4.4 (these two are matched versions), otherwise you may get slightly different results.

                              Comment


                              • how exons/features are treated?

                                In case that one counts the reads mapping on exons, how does featureCounts treats the exons/feature which overlap (and they belong to the same gene)?

                                More exactly, if there are two exons which belong to the same gene such that:
                                (i) each exon belongs to different isoforms of the same gene, and
                                (ii) both exons have the same start position on the genome, and
                                (iii) second exon is longer than the first one
                                how the reads count for these exons is reported? Is reported the count of reads which (i) map on the intersection of these two exons, or (ii) map on the longest exon in this case (because the longest exons contains also the shortest one)? Are there reported two counts of reads for each exon?

                                Here is how the featureCounts is run for these questions (i.e. counting the reads mapping on exons)
                                # Program:featureCounts v1.4.4; Command:"featureCounts" "-T" "4" "-a" "/ensembl.gtf" "-t" "exon" "-f" "-g" "gene_id" "-O" "-o" "/output$
                                Geneid Chr Start End Strand Length /accepted_hits.sorted.bam
                                ENSG00000223972 1 11869 12227 + 359 0
                                ENSG00000223972 1 12613 12721 + 109 0
                                ENSG00000223972 1 13221 14409 + 1189 4
                                ENSG00000223972 1 11872 12227 + 356 0
                                ENSG00000223972 1 12613 12721 + 109 0
                                ENSG00000223972 1 13225 14412 + 1188 4
                                ENSG00000223972 1 11874 12227 + 354 0
                                ENSG00000223972 1 12595 12721 + 127 0
                                ENSG00000223972 1 13403 13655 + 253 3
                                ENSG00000223972 1 13661 14409 + 749 1
                                ENSG00000223972 1 12010 12057 + 48 0
                                ENSG00000223972 1 12179 12227 + 49 0
                                ...

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Current Approaches to Protein Sequencing
                                  by seqadmin


                                  Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                  04-04-2024, 04:25 PM
                                • seqadmin
                                  Strategies for Sequencing Challenging Samples
                                  by seqadmin


                                  Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                  03-22-2024, 06:39 AM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, 04-11-2024, 12:08 PM
                                0 responses
                                30 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-10-2024, 10:19 PM
                                0 responses
                                32 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-10-2024, 09:21 AM
                                0 responses
                                28 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-04-2024, 09:00 AM
                                0 responses
                                52 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X