Header Leaderboard Ad


HTSeq-count and BED-formatted coordinates



No announcement yet.
  • Filter
  • Time
  • Show
Clear All
new posts

  • HTSeq-count and BED-formatted coordinates

    My task is to estimate the expression of certain unannotated intergenic regions (i.e. not genes) in several tissues.
    The answer I need is pretty much Yes/No i.e. whether these regions are expressed or not.

    On my previous steps, I used UCSC database and therefore my coordinates are in BED format.
    So, I need to:
    (1) count the reads mapping to my regions.
    (2) calculate expression.

    For (1)
    I started searching for the package that would accept BED as an input and count the reads.
    I found HTSeq. Great package! I really love its simplicity yet powerfulness; also the fact that it is in Python and I can easily incorporate it into my pipeline.
    So I modified the script HTSeq-count to accept the BED file as an input instead of GFF and ran it with default parameters:
    $>./tool.htseq_count_by_bed --countout <counts_file> <sam_file> <bed_file>
    The script happily read SAM, BED and gave me the standard output of counts
    I even created another script which generates the table of format

    ID	tissue1	tissue2	tissue3
    id1	1.3	1.2	0.0
    id2	0.0	0.0	10
    id3	0.5	0.7	0.9

    For (2)
    I don't know what to use to calculate actual expression (RPKM/FPKM) from this counts table.
    I calculated RPKMs manually used maximum of expressions from all tissues, but I don't know how to deal with the 'background RPKM' that must be subtracted.

    What would you suggest?


    * Would you suggest some other way of using HTSeq with BED that is better than mine?
    (I tried converting BED to GFF via Galaxy and using Cufflinks without success. Cufflinks is very capricious)

    * What would be your preferred way to go further i.e. estimate the actual expressions from HTSeq package?
    I would prefer to stay in Python and not to use R, if possible.



  • #2
    Is there anyone who can help me?


    • #3
      RNA-Seq read counts usually show quite good proportionality to the concentration of transcripts, as several authors demonstrated with spike-ins. Hence, the counts are a good measure of expression strength, once you normalize for sequencing depth. For more information, see this thread, e.g. my post #13.

      It might also be appropriate to divide by transcript length. If you have multiple isoforms, figuring out what length to use is a highly non-trivial (but nevertheless often ignored) problem.

      There are bias effects due to CG content and transcript length etc. which you may want to look into, but this is not always that important.


      • #4
        Thank you very much Simon!
        You saved me!

        So, I've read your 'post #13' and used the function estimateSizeFactors() to estimate the factors instead of using the library sizes.
        The full logics is as follows:
        * get counts
        * get newCountDataSet from counts and conditions (in my case simply the samples)
        * calculate size factors
        * calculate base means

        I realized that by dividing count on sizeFactor you get what would be 'baseMeanA' if you did nbinomTest().
        (The baseMeanB is simply the second condition's baseMean)
        Although I don't know how to get 'baseMean' i.e. the first column in nbinomTest's result and whether it is the one I need.

        I guess not, and what I need is dividing counts by sizeFactor i.e. baseMeanA. Is not it correct?

        So, with my scarce R knowledge and with a help of Google, I created a little script that I am sharing:

        #!/usr/bin/env Rscript
        # Accepts counts file as an input ( fileIn )
        # Outputs the file with baseMeans (fileOut)
        # In the R shell, create variable:
        # fileIn = "filename.counts"
        # and run this script
        # example:
        # fileIn="counts_hs.tab"
        # source("tool.counts_to_bmeans.r")
        library( DESeq )
        # read the counts data from file
        countsTable <- read.delim( fileIn, header=TRUE, stringsAsFactors=TRUE )
        # Convert column 1 to row names
        rownames( countsTable ) <- countsTable$name
        countsTable <- countsTable[ , -1 ]
        # Get conditions from column names
        # 'conds' should determine conditions, but in our case when every sample is separate, it is the same as 'samples'
        conds <- colnames(countsTable)	
        # Calculate factors
        cds <- newCountDataSet( countsTable, conds )
        cds <- estimateSizeFactors( cds )	# Factors to normalize from count data
        # Calculate baseMeans
        nfeatures <- nrow(assayData(cds)$counts)
        nfactors  <- NROW(pData(cds)$sizeFactor)
        mfactors <- matrix(pData(cds)$sizeFactor,nfeatures,nfactors, byrow=TRUE)
        bmeans <- assayData(cds)$counts / mfactors
        # Output to tab file
        #write.table(bmeans, file="cn.tab", sep="\t")
        # Output to CSV file
        write.csv(bmeans, file=fileOut)
        I've checked the script. It is working correctly.
        (compared the results to baseMeanA of nbinomTest() results and they are the same.)

        Now I will take these basemeans data and simply divide by the summary length of exons of the corresponding gene.
        ( because my case is simple: I don't need the particular isoform expression; I just need pretty much Yes/No value.)

        Right now, I will simply import basemeans to Excel and do the length normalization there. Later on, maybe I will do it in R or better in python and will share that too

        Please let me know if I am doing something wrong.
        I hope not, because I have to present these data next Wednesday before my group and I don't have much time

        Thanks again!

        P.S. I think it would be a good idea to create Python script for all this.
        If you are interested and think it is worth, at some point when it is more mature, one might even add it to your HTSeq package; I will be more than glad to share what I have


        • #5
          The 'baseMeans' itself is just the mean of the normalized counts over all samples, ignoring the condition.

          If you want to get rid of your script's dependency on R, just reimplement the scheme I explained in the post in Python. With numpy, this should be just three or four lines.


          • #6
            And what would you say about the algorithm and logics?
            Is everything correct or I am missing some point?

            Right now, because the task is simple, it is not worth porting to Python and R is fine.
            (I have very limited time in these weeks)

            But at some point, when I am more 'hardcore HTS guy', I will probably try to do that


            • #7
              Oops, forgot to ask:
              If I have replicates, should I do something in addition to this? (variance, dispersion etc)
              At least for my simple task?


              • #8
                Hi Proteos,

                I am also interested in read coverage of intergenic regions and have only bed formatted coordinates for those. Could you possibly share or give instruction on how to modify htseq-count to accept a bed format with say 4 columns (chrom, start, end, gene_name)? Bed formatted coordinates are 1-based start and 0-based end, but it is opposite in htseq, correct? I am decent in perl, but clueless in python.

                thanks much!


                • #9
                  correction, bed format is 0-based start and 1-based end, so it's the same as htseq.


                  Latest Articles






                  Topics Statistics Last Post
                  Started by seqadmin, 05-26-2023, 09:22 AM
                  0 responses
                  Last Post seqadmin  
                  Started by seqadmin, 05-24-2023, 09:49 AM
                  0 responses
                  Last Post seqadmin  
                  Started by seqadmin, 05-23-2023, 07:14 AM
                  0 responses
                  Last Post seqadmin  
                  Started by seqadmin, 05-18-2023, 11:36 AM
                  0 responses
                  Last Post seqadmin