Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to calculate coverage

    Hello,

    I would like to sequence coverage along the genome from aligned reads. Any suggestions or tools to do this efficiently? Maybe also being able to choose different bin sizes?

    Many thanks,

    a

  • #2
    Maybe it is early in the morning but I am not able to parse the phrase "I would like to sequence coverage along the genome from aligned reads". Perhaps you meant "... like to determine sequence coverage ...". In which case, yes, binning would work. Or just taking the number of reads times the average length of the reads all divided the genome length.

    Many of the alignment programs will give you something that you can toss into a spreadsheet and come up with a fancy graph. Basically via binning with a bin size of 1.

    Really it all depends on exactly what you want in the end. A single "X coverage" number? A graph? A mean with standard deviation? In any case the computational portion does not seem that complex to me.

    Comment


    • #3
      Terribly sorry, I did make little sense.

      I would like to know at each base along the genome how many reads saw that base. Alternatively, one can ask not at the single base level but at some interval distance, say 10bp. The output could be a wig file. This does not sound terribly complicated to do if the reads are sorted. I am more wondering whether there are tools that already do this.

      Many thanks,

      a

      Comment


      • #4
        Hi! I was working on just this issue a while back and was surprised by the relative lack of tools. Binning should work just fine. I'd recommend Aaron Quinlan's "BEDTools". the CoverageBed and genomeCoverageBed seem applicable, though I haven't used them yet.



        A friend of mine wrote a nifty script in R for me that calculates the coverage at each base-pair across the genome, using input of a text file with read genome coordinates. I'd be happy to send it to you, if that seems helpful. The data from this is quite noisy and usually needs smoothing of some sort (I did rolling means, using the R "zoo package"). I don't know what genome size you're working with, but I was using this on microbial genomes (~4 Mb) and the program runs in ~20-30 min.

        Cheers,
        Lizzy

        Comment


        • #5
          I do not remember exactly but some time ago I messed with velvet I think it calculated coverage (I must admit I do not remember was that single base coverage)
          and there was a tutorial how to get nice graphs using R.

          Originally posted by ewilbanks View Post
          Hi! I was working on just this issue a while back and was surprised by the relative lack of tools. Binning should work just fine. I'd recommend Aaron Quinlan's "BEDTools". the CoverageBed and genomeCoverageBed seem applicable, though I haven't used them yet.



          A friend of mine wrote a nifty script in R for me that calculates the coverage at each base-pair across the genome, using input of a text file with read genome coordinates. I'd be happy to send it to you, if that seems helpful. The data from this is quite noisy and usually needs smoothing of some sort (I did rolling means, using the R "zoo package"). I don't know what genome size you're working with, but I was using this on microbial genomes (~4 Mb) and the program runs in ~20-30 min.

          Cheers,
          Lizzy

          Comment


          • #6
            Hello all,

            instead of having a complete chromosome/genome as a reference, I use many gene-scale sequences as my reference. I now want to see what the coverage per base is when I map my solexa reads against these many reference sequences. Is there already a tool out there that can do the job?
            Can programs like soap, bowtie, ... provide me this type of information?
            Is it also possible to do a blast (with stringent parameters) and than parse these blast results?

            Any help/comments are more than welcome

            Comment


            • #7
              There's an easy way to do it using output from the maq-pipeline:

              Code:
              ...
              [maq-steps]
              ...
              maq pileup -p [your bfa] [your map] > pileup.out
              
              cut -f 2-4 pileup.out > croppedpileup.out
              
              #then launch R
              R
              #following are R commands
              data <-read.table(file="croppedpileup.out",sep="\t",header=F)
              colnames(data)<-c("pos","consensus","coverage")
              depth<-mean(data[,"coverage"])
              # depth now has the mean (overall)coverage
              #set the bin-size
              window<-101
              rangefrom<-0
              rangeto<-length(data[,"pos"])
              data.smoothed<-runmed(data[,"coverage"],k=window)
              png(file="cov_out.png",width=1900,height=1000)
              plot(x=data[rangefrom:rangeto,"pos"],y=data.smoothed[rangefrom:rangeto],pch=".", cex=1,xlab="bp position",ylab="depth",type="l")
              dev.off()
              Feel free to leave R afterwards,
              you should (unless some error occured) find a PNG-file containing the coverageplot in your directory;
              Of course window can be changed (needs to be odd-numbered, though)
              as well as rangefrom and rangeto values.

              Edit:
              Of course when using many sequences in maq,
              you will most likely be interessted in keeping the first column of the pileup.out.
              However, this will leed to much bigger files (longer R-load-times), and will require R-handling
              as you probably want to slice and dice and plot them by sequence-ID I take it?

              Any questions?
              Best
              -Jonathan
              Last edited by Jonathan; 06-09-2009, 02:20 AM.

              Comment


              • #8
                hi all;
                some papers mention X coverage and some says % coverage, is there any difference between both? we get % coverage by dividing total (reads*length) by genome size. if these two terms are different, how the X coverage is calculated? do we use haploid genome size instead of genome size?

                Comment


                • #9
                  Originally posted by Malabady View Post
                  hi all;
                  some papers mention X coverage and some says % coverage, is there any difference between both? we get % coverage by dividing total (reads*length) by genome size. if these two terms are different, how the X coverage is calculated? do we use haploid genome size instead of genome size?
                  From my understanding yes they are different and what you are calculating is the 'X' coverage. I.e., given the number of raw bases sequenced how many times (or X) does the sequencing potentially cover the genome.

                  % coverage is how well the genome is actually covered after all mapping and assembly is done.

                  As an example let's say we have 300M reads of 50 bases or 1.5 Gbase total. Our genome is 150M bases. After mapping (or assembly) we have a bunch of non-overlapping contigs that have 100M bases total.

                  So our 'X coverage' is 10X (1.5 Gbases / 150 Mbases)
                  Our '% coverage' is 66.6% (100 Mbases / 150 Mbases)


                  One way to think about this is that percentages generally range from 0% to 100% and so having a percentage greater that 100 can be confusing.


                  I use the haploid genome size or more specifically the C-value times 965Mbases/pg.
                  Last edited by westerman; 07-13-2009, 10:25 AM. Reason: Corrected my division order. Color me red-faced!

                  Comment


                  • #10
                    Originally posted by westerman View Post
                    From my understanding yes they are different and what you are calculating is the 'X' coverage. I.e., given the number of raw bases sequenced how many times (or X) does the sequencing potentially cover the genome.

                    % coverage is how well the genome is actually covered after all mapping and assembly is done.

                    As an example let's say we have 300M reads of 50 bases or 1.5 Gbase total. Our genome is 150M bases. After mapping (or assembly) we have a bunch of non-overlapping contigs that have 100M bases total.

                    So our 'X coverage' is 10X (150Mbases / 1.5Gbases)
                    Our '% coverage' is 66.6% (100Mbases / 150Mbases)


                    One way to think about this is that percentages generally range from 0% to 100% and so having a percentage greater that 100 can be confusing.


                    I use the haploid genome size or more specifically the C-value times 965Mbases/pg.
                    Also, what is a genome? Is it the non-repetitive part? Is it the part that is sequencable with your X base-pair reads? Most genome-sequencing papers say >98% but I highly doubt this given the large fraction of ALU, SINE, and other repeat elements that confound short reads.

                    Comment


                    • #11
                      Thanks Westerman,

                      In the example you gave:
                      So our 'X coverage' is 10X (150Mbases / 1.5Gbases)
                      Our '% coverage' is 66.6% (100Mbases / 150Mbases)

                      Isn't the X coverage should be 0.1X in this case?

                      Also, did you mean that you use the haploid genome size (NOT the diploid, triploid, etc)?

                      Comment


                      • #12
                        Originally posted by Malabady View Post
                        Thanks Westerman,

                        In the example you gave:
                        So our 'X coverage' is 10X (150Mbases / 1.5Gbases)
                        Our '% coverage' is 66.6% (100Mbases / 150Mbases)

                        Isn't the X coverage should be 0.1X in this case?
                        That is what I get for posting early Monday morning.

                        'X' coverage is raw bases divided by genome size. So the above should be 1.5 Gbases / 150 Mbases or 10X. I will correct my original post.

                        Also, did you mean that you use the haploid genome size (NOT the diploid, triploid, etc)?
                        I believe that all published C-values for for haploid genomes. Although if someone knows for certain then please chime in.

                        Comment


                        • #13
                          I agree that all published C-values are for haploid genomes. But one can roughly estimated the total genome size by multiplying the haploid size by the number of genomes. so if we are talking about diploid plant, we multiply the haploid genome size by 2. Then use this estimated genome size in the coverage calculation. Does this sounds correct to you?.....many thanks

                          Comment


                          • #14
                            computing coverage

                            Hi,
                            As ewilbanks suggested, BEDTools will do this for you.



                            If you want to compute coverage for "bins/windows" that march along the genome, you would use coverageBed. Let's say you've created a BED file called windows.bed for 10Kb windows across your genome and it looks like this (note BEDTools uses UCSC 0-based starts):
                            chr1 0 10000
                            chr1 9999 20000
                            ...

                            Now, you also have a bed file of sequence reads called reads.bed. The following command with calculate for each window in windows.bed:
                            1) The number of reads in reads.bed that overlap the window
                            2) Coverage density (i.e. the fraction of base pairs in window.bed that are covered by >= 1 read in reads.bed)

                            coverageBed -a reads.bed -b windows.bed | sortBed -i stdin > windows.bed.coverage

                            Sample output:
                            <chr> <start> <end> <#reads> <# window bases covered> <windowSize> <fraction of window covered>
                            chr1 0 10000 0 0 10000 0
                            chr1 9999 20000 33 1000 10000 0.10

                            I hope this helps. If you need a script to create a BED file with windows of a given size, just let me know.

                            Best,
                            Aaron

                            Comment


                            • #15
                              Also, "per base" coverage can be computed with genomeCoverageBed using the "-d" option. Unfortunately, it doesn't currently have an option to assume that the input BED file is sorted by chrom/start. Consequently, it loads the data into memory and sorts it internally. Thus, memory use is quite high if you have millions and millions of reads. A forthcoming release will fix this obvious limitation.

                              Note that genomeCoverageBed requires a "genome" file that tells it how long each chromosome is. One can quickly produce this by querying the UCSC databases.

                              For example, human (hg18):
                              > mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e "select chrom, size from hg18.chromInfo" | grep -v ^chrom | head
                              chr1 247249719
                              chr1_random 1663265
                              chr10 135374737
                              chr10_random 113275
                              chr11 134452384
                              chr11_random 215294
                              chr12 132349534
                              chr13 114142980
                              chr13_random 186858
                              chr14 106368585

                              Or, of course, just use their browser.

                              Best,
                              Aaron

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Recent Developments in Metagenomics
                                by seqadmin





                                Metagenomics has improved the way researchers study microorganisms across diverse environments. Historically, studying microorganisms relied on culturing them in the lab, a method that limits the investigation of many species since most are unculturable1. Metagenomics overcomes these issues by allowing the study of microorganisms regardless of their ability to be cultured or the environments they inhabit. Over time, the field has evolved, especially with the advent...
                                09-23-2024, 06:35 AM
                              • seqadmin
                                Understanding Genetic Influence on Infectious Disease
                                by seqadmin




                                During the COVID-19 pandemic, scientists observed that while some individuals experienced severe illness when infected with SARS-CoV-2, others were barely affected. These disparities left researchers and clinicians wondering what causes the wide variations in response to viral infections and what role genetics plays.

                                Jean-Laurent Casanova, M.D., Ph.D., Professor at Rockefeller University, is a leading expert in this crossover between genetics and infectious...
                                09-09-2024, 10:59 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 10-02-2024, 04:51 AM
                              0 responses
                              13 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 10-01-2024, 07:10 AM
                              0 responses
                              21 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 09-30-2024, 08:33 AM
                              0 responses
                              25 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 09-26-2024, 12:57 PM
                              0 responses
                              18 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X