Seqanswers Leaderboard Ad

**amitm** · 11-13-2014, 06:57 AM

hi heso,
`Your script to run featureC has problem

1) -t is the feature type. This is 3rd col of GTF. Its normally "exon" and it is so in the snapshot of your GTF too. So, specify -t exon and it should be fine.

2) The -g parameter specifies the attribute used to group features ('exon' here). Since its miRNA so either gene or transcript would mean the same I think.
So, you can specify -g gene_id
OR, -g transcript_id

Whatever you do in 2nd, the result file will have the primary identifier as that. Again the value for -g is in the 9th col of GTF

**heso** · 11-13-2014, 08:16 AM

Thank you for the answere.
I now tried the following command:

$ featureCounts -a Homo_sapiens_miRNA.gtf -F GTF -t exon -M -g gene_id -o Rd4_UCexo_r1_fcounts.txt input_files Rd4_UCexo_r1_seqok_n0l15abest_hg37_onlymapped.bam

of some reason I still get 0% successfully assigned reads:

//================================= Running ==================================\\
|| ||
|| Load annotation file Homo_sapiens_miRNA.gtf ... ||
|| Features : 3424 ||
|| Meta-features : 3411 ||
|| Chromosomes : 116 ||
|| ||
|| Process Unknown file input_files... ||
|| Single-end reads are included. ||
|| Failed to open file input_files ||
|| No counts were generated for this file. ||
|| ||
|| Process BAM file Rd4_UCexo_r1_seqok_n0l15abest_hg37_onlymapped.bam... ||
|| Single-end reads are included. ||
|| Assign reads to features... ||
|| Total reads : 92004855 ||
|| Successfully assigned reads : 0 (0.0%) ||
|| Running time : 3.18 minutes ||
|| ||
|| Read assignment finished. ||
|| ||
\\===================== http://subread.sourceforge.net/ ======================//

The output file looks like that:

Geneid Chr Start End Strand Length Sample_1_hg37_onlymapped.bam
ENSG00000243485 1 30366 30503 + 138
ENSG00000207730 1 1102484 1102578 + 95
ENSG00000207607 1 1103243 1103332 + 90
ENSG00000198976 1 1104385 1104467 + 83
ENSG00000207776 1 3477259 3477354 - 96
ENSG00000264428 1 3800628 3800697 + 70
ENSG00000264341 1 5624131 5624203 + 73
ENSG00000216045 1 5875931 5876019 - 89
ENSG00000264101 1 5922732 5922801 - 70
ENSG00000266687 1 5953899 5953984 - 86
ENSG00000265392 1 6489894 6489956 - 63
ENSG00000207865 1 9211727 9211836 - 110
ENSG00000252841 1 9338450 9338592 + 143

What could be the problem?

**amitm** · 11-13-2014, 08:44 AM

hi,
In the cmdline, what is the input_files term doing there?
featureCounts -a Homo_sapiens_miRNA.gtf -F GTF -t exon -M -g gene_id -o Rd4_UCexo_r1_fcounts.txt input_files Rd4_UCexo_r1_seqok_n0l15abest_hg37_onlymapped.bam

Just remove that term. See in the diagnostic log file, featureC says
|| Process Unknown file input_files... ||

Its not reading your BAM file. The correct cmd -

featureCounts \
-a Homo_sapiens_miRNA.gtf \
-F GTF \
-t exon \
-M \
-g gene_id \
-o Rd4_UCexo_r1_fcounts.txt \
Rd4_UCexo_r1_seqok_n0l15abest_hg37_onlymapped.bam

(i have just formatted for easier readability)

**heso** · 11-13-2014, 01:06 PM

hi again,

I tried You suggestion. Still no help

Code:

//================================= Running ==================================\\
||                                                                            ||
|| Load annotation file Homo_sapiens_miRNA.gtf ...                            ||
||    Features : 3424                                                         ||
||    Meta-features : 3411                                                    ||
||    Chromosomes : 116                                                       ||
||                                                                            ||
|| Process BAM file Rd4_UCexo_r1_seqok_n0l15abest_hg37_onlymapped.bam...      ||
||    Single-end reads are included.                                          ||
||    Assign reads to features...                                             ||
||    Total reads : 92004855                                                  ||
||    Successfully assigned reads : 0 (0.0%)                                  ||
||    Running time : 3.46 minutes                                             ||
||                                                                            ||
||                         Read assignment finished.                          ||
||                                                                            ||
\\===================== [url]http://subread.sourceforge.net/[/url] ======================//

And the output file:

Code:

# Program:featureCounts v1.4.6; Command:"featureCounts" "-a" "Homo_sapiens_miRNA.gtf" "-$
Geneid  Chr     Start   End     Strand  Length  Rd4_UCexo_r1_seqok_n0l15abest_hg37_onlym$
ENSG00000243485 1       30366   30503   +       138     0
ENSG00000207730 1       1102484 1102578 +       95      0
ENSG00000207607 1       1103243 1103332 +       90      0
ENSG00000198976 1       1104385 1104467 +       83      0
ENSG00000207776 1       3477259 3477354 -       96      0
ENSG00000264428 1       3800628 3800697 +       70      0
ENSG00000264341 1       5624131 5624203 +       73      0
ENSG00000216045 1       5875931 5876019 -       89      0
ENSG00000264101 1       5922732 5922801 -       70      0
ENSG00000266687 1       5953899 5953984 -       86      0
ENSG00000265392 1       6489894 6489956 -       63      0
ENSG00000207865 1       9211727 9211836 -       110     0
ENSG00000252841 1       9338450 9338592 +       143     0

But generally, does the output file have to include only the gene_id's where it has found a 'hit' (read count) or all the gene_id's of the gtf file (even if no reads are found for the specific gene_id)?

**shi** · 11-13-2014, 02:49 PM

All the gene ids are included in the output no matter they got a hit or not.

featureCounts also outputs a .summary file which summarizes the read counting results and lists the numbers of reads that were not assigned due to various reasons. Can you provide the content of that file please?

You may also have a look at the Subread users guide which includes a section for mapping and counting miRNA reads using Subread and featureCounts (section 5.5):

http://bioinf.wehi.edu.au/subread-pa...UsersGuide.pdf

The miRBase database (http://www.mirbase.org/) contains miRNA gene annotations which can be used for read summarization as well.

**heso** · 11-13-2014, 11:58 PM

The .summary file contents is the following:

Code:

Status  Rd4_UCexo_r1_seqok_n0l15abest_hg37_onlymapped.bam
Assigned        0
Unassigned_Ambiguity    0
Unassigned_MultiMapping 0
Unassigned_NoFeatures   92004855
Unassigned_Unmapped     0
Unassigned_MappingQuality	0
Unassigned_FragementLength	0
Unassigned_Chimera	0
Unassigned_Secondary    0
Unassigned_Nonjunction  0
Unassigned_Duplicate    0

I have also tried to map the reads against the mature and hairpin miRNA sequences, which I got from miRbase homepage. (I built the bowtie indexes myself after grep'ing Human sequences, changing U's to T's and adding also 25G's to the end of the mature sequences to use them for bowtie)
In this case for the same sample (which I above have used to map against the genome) I get 10% of the reads mapped to mature sequences and 6% mapped to the hairpin sequences. Therefore, I don't expect loads of miRNA's to pop up in featureCounts.
However, I should get some hits! Now I get 0!!!!

In addition, can someone explain me why I get a higher % of hits when mapping against hairpin than mature sequences??

**dpryan** · 11-14-2014, 01:03 AM

Post the first few alignments, perhaps you just have mismatching chromosome names (that's a common cause of this).

Regarding mapping against the hairpin vs. mature sequence, those results sound unsurprising. The hairpin contains the mature sequence plus extra surrounding sequence. We know that the annotated mature sequences don't fully capture the heterogeneity of real miRNA bounds, but those bounds should generally be contained within the hairpin sequence. So, the hairpin will often yield higher mapping rates.

**heso** · 11-14-2014, 01:14 PM

Yes, that's what it seemed to be.
I downloaded ebwt as well as gtf files from the UCSC site and now everything worked

Status /.............................................bam
Assigned 164848
Unassigned_Ambiguity 24960
Unassigned_MultiMapping 0
Unassigned_NoFeatures 91815047
Unassigned_Unmapped 0
Unassigned_MappingQuality 0
Unassigned_FragementLength 0
Unassigned_Chimera 0
Unassigned_Secondary 0
Unassigned_Nonjunction 0
Unassigned_Duplicate 0

Thanks again to everybody for helping out !

Topics	Statistics	Last Post
SIX2 Protein Identified as a Key Player in Prostate Cancer Treatment Resistance by seqadmin Started by seqadmin, Yesterday, 06:55 AM	0 responses 12 views 0 likes	Last Post by seqadmin Yesterday, 06:55 AM
Genetic Mosaicism More Prevalent Than Previously Thought by seqadmin Started by seqadmin, 05-30-2024, 03:16 PM	0 responses 24 views 0 likes	Last Post by seqadmin 05-30-2024, 03:16 PM
Comprehensive Sequencing of Great Ape Sex Chromosomes Yields Insights into Evolution and Genetic Variability by seqadmin Started by seqadmin, 05-29-2024, 01:32 PM	0 responses 29 views 0 likes	Last Post by seqadmin 05-29-2024, 01:32 PM
New Toolkit Enhances Plant Mitochondrial Genome Research by seqadmin Started by seqadmin, 05-24-2024, 07:15 AM	0 responses 215 views 0 likes	Last Post by seqadmin 05-24-2024, 07:15 AM

Seqanswers Leaderboard Ad

Announcement

featureCounts error

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News