My task is to estimate the expression of certain unannotated intergenic regions (i.e. not genes) in several tissues.
The answer I need is pretty much Yes/No i.e. whether these regions are expressed or not.
On my previous steps, I used UCSC database and therefore my coordinates are in BED format.
So, I need to:
(1) count the reads mapping to my regions.
(2) calculate expression.
For (1)
I started searching for the package that would accept BED as an input and count the reads.
I found HTSeq. Great package! I really love its simplicity yet powerfulness; also the fact that it is in Python and I can easily incorporate it into my pipeline.
So I modified the script HTSeq-count to accept the BED file as an input instead of GFF and ran it with default parameters:
The script happily read SAM, BED and gave me the standard output of counts 
I even created another script which generates the table of format
For (2)
I don't know what to use to calculate actual expression (RPKM/FPKM) from this counts table.
I calculated RPKMs manually used maximum of expressions from all tissues, but I don't know how to deal with the 'background RPKM' that must be subtracted.
What would you suggest?
ADDITIONAL QUESTIONS:
* Would you suggest some other way of using HTSeq with BED that is better than mine?
(I tried converting BED to GFF via Galaxy and using Cufflinks without success. Cufflinks is very capricious)
* What would be your preferred way to go further i.e. estimate the actual expressions from HTSeq package?
I would prefer to stay in Python and not to use R, if possible.
Thanks,
Proteos
The answer I need is pretty much Yes/No i.e. whether these regions are expressed or not.
On my previous steps, I used UCSC database and therefore my coordinates are in BED format.
So, I need to:
(1) count the reads mapping to my regions.
(2) calculate expression.
For (1)
I started searching for the package that would accept BED as an input and count the reads.
I found HTSeq. Great package! I really love its simplicity yet powerfulness; also the fact that it is in Python and I can easily incorporate it into my pipeline.
So I modified the script HTSeq-count to accept the BED file as an input instead of GFF and ran it with default parameters:
Code:
$>./tool.htseq_count_by_bed --countout <counts_file> <sam_file> <bed_file>

I even created another script which generates the table of format
Code:
ID tissue1 tissue2 tissue3 id1 1.3 1.2 0.0 id2 0.0 0.0 10 id3 0.5 0.7 0.9
For (2)
I don't know what to use to calculate actual expression (RPKM/FPKM) from this counts table.
I calculated RPKMs manually used maximum of expressions from all tissues, but I don't know how to deal with the 'background RPKM' that must be subtracted.
What would you suggest?
ADDITIONAL QUESTIONS:
* Would you suggest some other way of using HTSeq with BED that is better than mine?
(I tried converting BED to GFF via Galaxy and using Cufflinks without success. Cufflinks is very capricious)
* What would be your preferred way to go further i.e. estimate the actual expressions from HTSeq package?
I would prefer to stay in Python and not to use R, if possible.
Thanks,
Proteos
Comment