Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Read coverage for each gene

    Hi,

    I'm looking for a method to give the read coverage for each gene. Our current method for making an RPKM cutoff is to manually look at our bedgraphs on UCSC Genome Browser and visually assess coverage. <----- Very tedious and subject to a number of flaws.

    Ideally I would like to get a value for each gene that indicates how many bases are covered. i.e. 100% of all bases covered or 0%. Is there already something like this in BEDtools or an R package?

    I use tophat to map and Seqmonk to count reads then manually calculate RPKM.

    Thanks

    Miguel

  • #2
    Bedtools genomecov should do this.



    The default options produce the following histogram output / fields, which should be easily modifiable for your purposes:
    1. chromosome (or entire genome)
    2. depth of coverage from features in input file
    3. number of bases on chromosome (or genome) with depth equal to column 2.
    4. size of chromosome (or entire genome) in base pairs
    5. fraction of bases on chromosome (or entire genome) with depth equal to column 2.
    If you set the maximum coverage to 1 (-max 1), then you can easily get the coverage fraction for all bases covered at any depth. If you set the max coverage to 20 (or something more suitable for your needs), then you get the covered fraction for well-covered bases.

    Comment


    • #3
      I've used 'bedtools coverage' for this in the past but it's a little tricky if you're working with an alternatively spliced transcriptome. For example you can compare your alignments to a GTF with this command and for every feature you get 4 values:

      1) The number of features in A that overlapped the B interval.
      2) The number of bases in B that had non-zero coverage.
      3) The length of the entry in B.
      4) The fraction of bases in B that had non-zero coverage.
      Where A is your alignments file and B is the GTF file. The thing is you have to parse this output fetching all of the rows corresponding to 'exon' features from the GTF and then group them based on the 'transcript_id' GTF field (which is included in the output) and finally using the values they provide with each exon feature you can compute the total transcript coverage ratio. If you can write perl or python you can do this. Even if you get that far it's still a little deceptive because you don't now which isoform(s) are actually expressed so therein lies the real issue. It does, however, produce a number telling you how well covered the isoforms are.

      There's another kinda wacky idea you can try. The above method isn't much different than looking at the coverage of alignments to a transcriptome (rather than a genome) allowing all possible alignments to be reported since the above case will take a single aligned read to the genome and count it towards all features it overlaps. Alternatively you may build an aligner index for the transcriptome reference, align with bowtie using the -a option, make a small bed file that describes each transcript feature in the transcriptome reference and then use 'bedtools coverage' to access the coverage of all features that the reads may align to. This is getting a little messy but for rough estimates it should be OK especially if you can group the output into genes by joining in some gene names and sorting on that column.

      Here's kinda how it would work provided you have a genome fasta (genome.fa) file, a GTF gene annotation (genes.gtf) and some reads in a fastq file (reads.fq). You also need to have Tophat installed on your computer (for its gffread utility), bowtie1 and bedtools.

      Code:
      # make the transcriptome sequence file
      gffread -w transcriptome.fa -g genome.fa genes.gtf
      
      # index with samtools
      samtools faidx transcriptome.fa
      
      # build bowtie index for the transcriptome
      bowtie-build transcriptome_ref transcriptome.fa
      
      # align your reads with bowtie
      bowtie -a --best -S transcriptome_ref reads.fq | samtools view -bS -o aligned.bam - 
      
      # parse the fasta index to make a bed file describing the transcripts
      cut -f1,2 transcriptome.fa.fai | perl -slane 'print join("\t", $F[0], 0, $F[1]);' > transcriptome_fa.bed
      
      # use bedtools to compute coverage of features
      bedtools coverage -abam aligned.bam -b transcriptome_fa.bed > feature_coverages.bed
      as long as its understood that this process in no way whatsoever is producing correct counts at the isoform level but you can use this to get an idea of how well the isoforms are potentially covered by your reads. call me crazy but it does work.
      /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
      Salk Institute for Biological Studies, La Jolla, CA, USA */

      Comment


      • #4
        while I'm at it you could even incorporate the results of eXpress or RSEM into this transcript coverage pipeline since those seek to unambiguously assign each read. For eXpress you could use the same bowtie index built from my previous example and then run these commands:

        Code:
        # express needs alignments with all possible (and reasonable) 
        # mapping locations
        bowtie -aS -n 2 -e 999 transcriptome_ref reads.fq | samtools view -bS -o aligned.bam - 
        
        # run express making a BAM file with all of the uniquely mapped reads
        express --output-align-samp -B 1 transcriptome.fa aligned.bam
        the '--output-align-samp' option instructs eXpress to produce a BAM file named 'hits.1.samp.bam' which contains only the single alignments for each read selected by their EM algorithm as the alignment with the highest probability of being correct. in other words the alignments in this file should produce counts from the 'bedtools coverage' command that match up with the 'est_counts' column in the 'results.xprs' file produced by eXpress. You can use this BED file in the same 'bedtools coverage' command I put into my previous post. Now you'll have coverages without any redundant alignments.
        /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
        Salk Institute for Biological Studies, La Jolla, CA, USA */

        Comment


        • #5
          Thanks for the suggestions. I ended up using the following.

          coverageBed -abam CCE.bam -s -b mm9.refSeq.Wholegene.bed > CCEcoverage.txt

          The final column gave percentage coverage and it matches well with the bedgraphs I have on UCSC.

          Alternative splicing is not a primary concern at this point so this should be fine for now. If that changes I'll be sure to have a headache trying to figure out! I'll come back to this post for sure.

          Thanks!

          Comment


          • #6
            Originally posted by migs54 View Post
            Thanks for the suggestions. I ended up using the following.

            coverageBed -abam CCE.bam -s -b mm9.refSeq.Wholegene.bed > CCEcoverage.txt

            The final column gave percentage coverage and it matches well with the bedgraphs I have on UCSC.

            Alternative splicing is not a primary concern at this point so this should be fine for now. If that changes I'll be sure to have a headache trying to figure out! I'll come back to this post for sure.

            Thanks!
            Just to remind that this method will not give the coverage of the exons of one gene.

            coverageBed -abam CCE.bam -s -b mm9.refSeq.Wholegene.bed > CCEcoverage.txt

            The percentage coverage given by this command is the gene body, including introns, coverage. And further, if you donn't use "-split" parameter, the region in "-abam" will use all the span that include "Ns". The detailed explanation can be found at section 1.3.19.

            Comment


            • #7
              Hi Pengchy,

              Yes you're right. This experiment was chromatin RNA seq, so included introns when calculating coverage.

              Comment

              Latest Articles

              Collapse

              • seqadmin
                Current Approaches to Protein Sequencing
                by seqadmin


                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                04-04-2024, 04:25 PM
              • seqadmin
                Strategies for Sequencing Challenging Samples
                by seqadmin


                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                03-22-2024, 06:39 AM

              ad_right_rmr

              Collapse

              News

              Collapse

              Topics Statistics Last Post
              Started by seqadmin, 04-11-2024, 12:08 PM
              0 responses
              22 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 04-10-2024, 10:19 PM
              0 responses
              24 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 04-10-2024, 09:21 AM
              0 responses
              19 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 04-04-2024, 09:00 AM
              0 responses
              50 views
              0 likes
              Last Post seqadmin  
              Working...
              X