Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • arendon
    replied
    Hey ohofmann,

    At the moment readAligned does not support SAM format. The developers are planning to include it for the next release. Take a look at https://stat.ethz.ch/pipermail/bioc-...ly/000474.html

    Also, it is very worthwhile to have a look at http://htseq.ucr.edu They have a nice tutorial on all things R and seq.

    Leave a comment:


  • ohofmann
    replied
    Originally posted by arendon View Post
    I would like to suggest the ShortRead and IRange packages in R/Bioconductor. If you have enough memory to load a lane in memory 8GB should be enough, they provide excellent functions to compute per base coverage and many other things.
    And a question related to the one above -- does readAligned also support (or plans to support) the SAM format?

    Leave a comment:


  • srao
    replied
    Does readAligned support Paired End export.txt files?

    Hello,
    I figured I'd ask, since this might be faster than me figuring it out - does readAligned from the ShortRead package allow data from PE export files to be plotted together?

    Thanks a lot!


    Originally posted by arendon View Post
    I would like to suggest the ShortRead and IRange packages in R/Bioconductor. If you have enough memory to load a lane in memory 8GB should be enough, they provide excellent functions to compute per base coverage and many other things.

    for example for a solexa export file:
    Code:
    require(ShortRead)
    aln<-readAligned("/path/to/file","filename_export.txt",type="SolexaExport")
    # Filtering of reads e.g.:
    aln <- aln[alignData(aln)$filtering=="Y" & !is.na(strand(aln)) ]
    #Remove duplicated reads
    aln<-aln[!srduplicated(aln)]
    #Coverage
    cvg<-coverage(aln,extend=60L) #in this case reads are extended 60 bp 3'
    One can then use the package rtracklayer to export it as a wig
    Code:
    require(rtracklayer)
    export.ucsc(as(cvg,"RangedData")),"test.wig",subformat="wig")
    you might need to change the chromosome names afterwards if your original names already contained chr.

    Leave a comment:


  • Nix
    replied
    It calculates a per base read coverage over the regions you provide. There is no scaling. It also generates a histogram of the fraction of interrogated bases with 1,2,3,4,.... or more reads. Good for assessing what % of your array capture and sequencing hit 10 or more reads. It also gives an estimate of what your coverage would be with additional sequencing.

    Leave a comment:


  • xguo
    replied
    Thanks. Could you please let me know how you scale the base coverage for transcripts with different length?

    Leave a comment:


  • Nix
    replied
    I wrote an app called "ReadCoverage" which I think does exactly what you want for defined regions (e.g. exons, repeatmasked regions). Also generates track data. It's part of http://useq.sourceforge.net . Here's the command line menu. There's also a GUI if you prefer that approach.

    **************************************************************************************
    ** Read Coverage: June 2009 **
    **************************************************************************************
    Generates read coverage stair-step xxx.bar graph files for visualization in IGB. Will
    also calculate per base coverage stats for a given file of interrogated regions.

    Options:
    -p Point Data directories, full path, comma delimited. Should contain chromosome
    specific xxx.bar.zip or xxx_-_.bar files.
    -s Save directory, full path.
    -i (Optional) Full path file name for a tab delimited bed file (chr start stop ...)
    containing interrogated regions to use in calculating a per base coverage
    statistics. Interbase coordinates assumed.
    -b Just calculate stats, skip coverage graph generation.

    Example: java -Xmx1500M -jar pathTo/USeq/Apps/ReadCoverage -p
    /Data/Ets1Rep1/,/Data/Ets1Rep2/ -s /Data/MergedHitTrckEts1 -i
    /CapSeqDesign/interrogatedExonsChrX.bed

    **************************************************************************************

    Leave a comment:


  • xguo
    replied
    Hi, there,

    I have a related question about coverage calculation for RNASeq data. The average exon coverage may be computed as the sum of # reads mapped to each exonic position divided by exon length. However, as dePhi said, "using the average coverage is not really adequate in describing the trustworthiness of your data." Drawing a distribution graph along the transcript length is helpful. It is easy to get the distribution graph for an individual transcript. But to get a summary graph for all transcripts, coverage values have to be scaled to transcript length and summed over all transcripts. I'm having trouble to do that. Could someone help me on that, probably some simple R code would do it?

    thanks a lot

    Xiang

    Leave a comment:


  • dePhi
    replied
    Originally posted by westerman View Post
    From my understanding yes they are different and what you are calculating is the 'X' coverage. I.e., given the number of raw bases sequenced how many times (or X) does the sequencing potentially cover the genome.

    % coverage is how well the genome is actually covered after all mapping and assembly is done.

    As an example let's say we have 300M reads of 50 bases or 1.5 Gbase total. Our genome is 150M bases. After mapping (or assembly) we have a bunch of non-overlapping contigs that have 100M bases total.

    So our 'X coverage' is 10X (1.5 Gbases / 150 Mbases)
    Our '% coverage' is 66.6% (100 Mbases / 150 Mbases)


    One way to think about this is that percentages generally range from 0% to 100% and so having a percentage greater that 100 can be confusing.


    I use the haploid genome size or more specifically the C-value times 965Mbases/pg.
    Maybe I misunderstood what coverage is about, but...

    Using coverage to determine how many times you COULD cover a total genome is not really usefull, is it? (Unless you want to brag during coffee breaks, of course) You would like to know how many times the sequenced part of your genome actually IS covered, right? Giving you an idea of how many reads confirm a found base at a position. And coverage gives you an estimate of just that; how many reads do you have covering the parts you actually sequenced.

    Then shouldn't the result be 15X coverage? Because actual covering of the genome was 66.6% which contains 100Mb. So coverage would be 1.5Gb/100Mb which makes it 15X?
    So on average your sequenced part of the genome has 15X coverage, while in theory you could cover the entire genome 10X.

    Adding to that, I think that using the average coverage is not really adequate in describing the trustworthiness of your data. Using coverage distribution seems to make more sense to me. What if you have 50% covered only 5 times, and 50% covered 55 times? Still gives you 30X coverage, but it's not a good representation.

    Sorry for being such a pain...
    Cheers, dePhi

    Leave a comment:


  • arendon
    replied
    I would like to suggest the ShortRead and IRange packages in R/Bioconductor. If you have enough memory to load a lane in memory 8GB should be enough, they provide excellent functions to compute per base coverage and many other things.

    for example for a solexa export file:
    Code:
    require(ShortRead)
    aln<-readAligned("/path/to/file","filename_export.txt",type="SolexaExport")
    # Filtering of reads e.g.:
    aln <- aln[alignData(aln)$filtering=="Y" & !is.na(strand(aln)) ]
    #Remove duplicated reads
    aln<-aln[!srduplicated(aln)]
    #Coverage
    cvg<-coverage(aln,extend=60L) #in this case reads are extended 60 bp 3'
    One can then use the package rtracklayer to export it as a wig
    Code:
    require(rtracklayer)
    export.ucsc(as(cvg,"RangedData")),"test.wig",subformat="wig")
    you might need to change the chromosome names afterwards if your original names already contained chr.

    Leave a comment:


  • quinlana
    replied
    Also, "per base" coverage can be computed with genomeCoverageBed using the "-d" option. Unfortunately, it doesn't currently have an option to assume that the input BED file is sorted by chrom/start. Consequently, it loads the data into memory and sorts it internally. Thus, memory use is quite high if you have millions and millions of reads. A forthcoming release will fix this obvious limitation.

    Note that genomeCoverageBed requires a "genome" file that tells it how long each chromosome is. One can quickly produce this by querying the UCSC databases.

    For example, human (hg18):
    > mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e "select chrom, size from hg18.chromInfo" | grep -v ^chrom | head
    chr1 247249719
    chr1_random 1663265
    chr10 135374737
    chr10_random 113275
    chr11 134452384
    chr11_random 215294
    chr12 132349534
    chr13 114142980
    chr13_random 186858
    chr14 106368585

    Or, of course, just use their browser.

    Best,
    Aaron

    Leave a comment:


  • quinlana
    replied
    computing coverage

    Hi,
    As ewilbanks suggested, BEDTools will do this for you.



    If you want to compute coverage for "bins/windows" that march along the genome, you would use coverageBed. Let's say you've created a BED file called windows.bed for 10Kb windows across your genome and it looks like this (note BEDTools uses UCSC 0-based starts):
    chr1 0 10000
    chr1 9999 20000
    ...

    Now, you also have a bed file of sequence reads called reads.bed. The following command with calculate for each window in windows.bed:
    1) The number of reads in reads.bed that overlap the window
    2) Coverage density (i.e. the fraction of base pairs in window.bed that are covered by >= 1 read in reads.bed)

    coverageBed -a reads.bed -b windows.bed | sortBed -i stdin > windows.bed.coverage

    Sample output:
    <chr> <start> <end> <#reads> <# window bases covered> <windowSize> <fraction of window covered>
    chr1 0 10000 0 0 10000 0
    chr1 9999 20000 33 1000 10000 0.10

    I hope this helps. If you need a script to create a BED file with windows of a given size, just let me know.

    Best,
    Aaron

    Leave a comment:


  • Malabady
    replied
    I agree that all published C-values are for haploid genomes. But one can roughly estimated the total genome size by multiplying the haploid size by the number of genomes. so if we are talking about diploid plant, we multiply the haploid genome size by 2. Then use this estimated genome size in the coverage calculation. Does this sounds correct to you?.....many thanks

    Leave a comment:


  • westerman
    replied
    Originally posted by Malabady View Post
    Thanks Westerman,

    In the example you gave:
    So our 'X coverage' is 10X (150Mbases / 1.5Gbases)
    Our '% coverage' is 66.6% (100Mbases / 150Mbases)

    Isn't the X coverage should be 0.1X in this case?
    That is what I get for posting early Monday morning.

    'X' coverage is raw bases divided by genome size. So the above should be 1.5 Gbases / 150 Mbases or 10X. I will correct my original post.

    Also, did you mean that you use the haploid genome size (NOT the diploid, triploid, etc)?
    I believe that all published C-values for for haploid genomes. Although if someone knows for certain then please chime in.

    Leave a comment:


  • Malabady
    replied
    Thanks Westerman,

    In the example you gave:
    So our 'X coverage' is 10X (150Mbases / 1.5Gbases)
    Our '% coverage' is 66.6% (100Mbases / 150Mbases)

    Isn't the X coverage should be 0.1X in this case?

    Also, did you mean that you use the haploid genome size (NOT the diploid, triploid, etc)?

    Leave a comment:


  • nilshomer
    replied
    Originally posted by westerman View Post
    From my understanding yes they are different and what you are calculating is the 'X' coverage. I.e., given the number of raw bases sequenced how many times (or X) does the sequencing potentially cover the genome.

    % coverage is how well the genome is actually covered after all mapping and assembly is done.

    As an example let's say we have 300M reads of 50 bases or 1.5 Gbase total. Our genome is 150M bases. After mapping (or assembly) we have a bunch of non-overlapping contigs that have 100M bases total.

    So our 'X coverage' is 10X (150Mbases / 1.5Gbases)
    Our '% coverage' is 66.6% (100Mbases / 150Mbases)


    One way to think about this is that percentages generally range from 0% to 100% and so having a percentage greater that 100 can be confusing.


    I use the haploid genome size or more specifically the C-value times 965Mbases/pg.
    Also, what is a genome? Is it the non-repetitive part? Is it the part that is sequencable with your X base-pair reads? Most genome-sequencing papers say >98% but I highly doubt this given the large fraction of ALU, SINE, and other repeat elements that confound short reads.

    Leave a comment:

Latest Articles

Collapse

  • seqadmin
    Exploring the Dynamics of the Tumor Microenvironment
    by seqadmin




    The complexity of cancer is clearly demonstrated in the diverse ecosystem of the tumor microenvironment (TME). The TME is made up of numerous cell types and its development begins with the changes that happen during oncogenesis. “Genomic mutations, copy number changes, epigenetic alterations, and alternative gene expression occur to varying degrees within the affected tumor cells,” explained Andrea O’Hara, Ph.D., Strategic Technical Specialist at Azenta. “As...
    07-08-2024, 03:19 PM
  • seqadmin
    Exploring Human Diversity Through Large-Scale Omics
    by seqadmin


    In 2003, researchers from the Human Genome Project (HGP) announced the most comprehensive genome to date1. Although the genome wasn’t fully completed until nearly 20 years later2, numerous large-scale projects, such as the International HapMap Project and 1000 Genomes Project, continued the HGP's work, capturing extensive variation and genomic diversity within humans. Recently, newer initiatives have significantly increased in scale and expanded beyond genomics, offering a more detailed...
    06-25-2024, 06:43 AM

ad_right_rmr

Collapse

News

Collapse

Topics Statistics Last Post
Started by seqadmin, 07-16-2024, 05:49 AM
0 responses
18 views
0 likes
Last Post seqadmin  
Started by seqadmin, 07-15-2024, 06:53 AM
0 responses
27 views
0 likes
Last Post seqadmin  
Started by seqadmin, 07-10-2024, 07:30 AM
0 responses
40 views
0 likes
Last Post seqadmin  
Started by seqadmin, 07-03-2024, 09:45 AM
0 responses
205 views
0 likes
Last Post seqadmin  
Working...
X