  • criscruz
    replied
    Thanks, westerman.



  • westerman
    replied
    50-100x coverage per genome is good. Much more than that and you will start getting misassemblies.

    For viruses I suggest using Mira. It is a good small-genome assembler that can handle a lot of potential misassemblies.



  • criscruz
    replied
    Hi everyone,

    I have just read your posts, but I still have a doubt in mind.

    I'm working with the Ion PGM to generate whole-genome sequences of some RNA viruses. I then want to build a phylogenetic tree from the consensus sequences of each virus I can identify. So the question is: how much coverage (reads per base) do I need in order to make a good consensus sequence for my phylogenetic analysis? I don't want to see or analyze variants or quasispecies, so I need just the minimum necessary.

    Thanks for your time,
    my best,

    Cris



  • westerman
    replied
    1. The SNPs/indels are usually not a big part of the genome; I doubt they would throw off the calculations by even a percent.

    2. Count the read only once; i.e., choose the best match, or if there are multiple best matches then just choose one at random.

    Really, unless you are working with a well-characterized organism (e.g., human), the numbers are going to be 'squishy' in any case. They are mainly there to give you an idea of how good your sequencing is. In other words, if you calculate that you had 50x coverage (which is a nice de-novo assembly target) but only get 10% coverage against a closely related organism, then that tells you something.
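
    A minimal sketch of point 2 in Python, assuming a coordinate-sorted, indexed BAM and the pysam library (the skip-secondary rule is one reasonable reading of "count the read only once", not necessarily westerman's exact method):

    import pysam

    def primary_aligned_bases(bam_path):
        # Sum aligned bases, counting each read only once by skipping
        # secondary and supplementary alignments.
        total = 0
        with pysam.AlignmentFile(bam_path, "rb") as bam:
            for read in bam.fetch():
                if read.is_unmapped or read.is_secondary or read.is_supplementary:
                    continue
                total += read.query_alignment_length
        return total

    # X coverage is then primary_aligned_bases("sample.bam") / genome_size.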



  • rathankar
    replied
    Calculating coverage depth

    Originally posted by westerman View Post
    From my understanding, yes, they are different, and what you are calculating is the 'X' coverage; i.e., given the number of raw bases sequenced, how many times (or 'X') does the sequencing potentially cover the genome.

    % coverage is how well the genome is actually covered after all mapping and assembly is done.

    As an example let's say we have 300M reads of 50 bases or 1.5 Gbase total. Our genome is 150M bases. After mapping (or assembly) we have a bunch of non-overlapping contigs that have 100M bases total.

    So our 'X coverage' is 10X (1.5 Gbases / 150 Mbases)
    Our '% coverage' is 66.6% (100 Mbases / 150 Mbases)


    One way to think about this is that percentages generally range from 0% to 100%, and so having a percentage greater than 100 can be confusing.


    I use the haploid genome size or more specifically the C-value times 965Mbases/pg.
    Hi,

    I went through this post and understood how we express coverage depth, but I need a small clarification.

    1. Does this coverage depth account for mutations in the reads (I mean non-matching positions with respect to the reference sequence), since it only uses the number of bases in the sample and the number of bases in the reference sequence?

    2. If a read matches at more than one location, won't the coverage depth be inflated? Is there a way to reduce that error?
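
    For reference, westerman's quoted example reduces to two one-line calculations. A minimal sketch, using his hypothetical numbers:

    # Hypothetical numbers from westerman's example above.
    total_read_bases = 300e6 * 50   # 300M reads of 50 bases = 1.5 Gbase
    genome_size      = 150e6        # 150 Mbase genome
    assembled_bases  = 100e6        # non-overlapping contig bases after assembly

    x_coverage   = total_read_bases / genome_size   # 10.0   -> '10X'
    pct_coverage = assembled_bases / genome_size    # 0.666  -> '66.6%'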



  • gringer
    replied
    Originally posted by recombinationhotspot View Post
    Is there any software that can do that without actually having to write any commands
    Er, you want a program to run that means you don't have to run a program? That's a difficult request.

    I suppose you could try using Galaxy, which hides all that pesky "running commands" stuff from you. It has a feature coverage tool, but that requires input files to be in BED format, and presumably there are other tools closer to what you desire. From this email:

    To calculate coverage, please see the tool "Regional Variation -> Feature coverage". Query and target must both be in Interval/BED format. Query data in Interval/BED format is possible in most of the dataflow paths through the tools and from external sources. The reference genome file will likely need to be imported and formatted.



  • recombinationhotspot
    replied
    I am trying to calculate the average coverage for a given region, e.g. 200 bp, where my reads are aligned. Is there any software that can do that without actually having to write any commands? Please note that I have no bioinformatics background and don't have access to a Linux or similar operating system. The best solution I have so far is to use the Savant genome browser and convert the .bam files into .bam.cov.tdf files, which show me the maximum coverage.



  • swbarnes2
    replied
    Originally posted by mrood View Post
    It seems to me that if you are mapping to a reference genome and there are regions that have more than twice the average coverage that it is probably the result of a duplication or something in the genome of the sequenced organism.
    Natural variation in sequencing coverage could easily produce 2-fold differences in coverage, or more.

    Likewise, if it has very poor coverage the organism likely does not have that region in its genome and it is likely the result of improper mapping.
    Or the region is there but so divergent from your reference that reads map poorly; or the region could be GC-rich or similar, causing few reads to be generated there.



  • gringer
    replied
    If duplications/deletions are rare enough, then median coverage should be fine. A median statistic will typically cope with the spikes and troughs that make the mean problematic as a descriptive statistic.
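
    A quick numeric sketch of that robustness (depths invented for illustration):

    import statistics

    # Hypothetical per-base depths with one duplicated (spiked) region.
    depths = [30, 32, 29, 31, 30, 250, 260, 255, 30, 28]

    print(statistics.mean(depths))    # 97.5, dragged up by the spike
    print(statistics.median(depths))  # 30.5, close to the typical depth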



  • mrood
    replied
    I realize this is quite an old thread, but I am currently trying to calculate the per-bp coverage of my NGS data. I am new to NGS and bioinformatics, so my apologies if this does not make sense...

    For quality control I would like to determine a sensible minimum and maximum coverage to exclude from my downstream analyses; however, I cannot seem to find a "gold standard" for this. It seems to me that if you are mapping to a reference genome and a region has more than twice the average coverage, that is probably the result of a duplication or something similar in the genome of the sequenced organism. Likewise, if a region has very poor coverage, the organism likely does not have that region in its genome and whatever maps there is likely the result of improper mapping. As such, I would like to exclude these regions before performing genome-wide population-genetic analyses. Does anyone have suggestions on what cutoffs to use and how to go about applying them?

    Thanks in advance!
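
    A minimal sketch of one such cutoff scheme, assuming per-region mean depths are already computed (the 0.5x/2x-of-median thresholds are an illustration, not a gold standard):

    import statistics

    def flag_regions(region_depths, low_factor=0.5, high_factor=2.0):
        # Return indices of regions whose depth falls outside
        # [low_factor, high_factor] times the median across regions.
        med = statistics.median(region_depths)
        return [i for i, d in enumerate(region_depths)
                if d < low_factor * med or d > high_factor * med]

    # e.g. flag_regions([30, 31, 29, 70, 2]) -> [3, 4]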



  • SEQond
    replied
    By utilising BEDtools and the UCSC Genome Browser, I am trying to get a picture like this (a histogram of the bedgraph in wiggle form).

    So far I have used a sorted BAM on which genomeCoverageBed was run with the -bg and -ibam options and -g mm8.fa.fai as the genome index.

    The resulting bedgraph was uploaded to the UCSC Genome Browser and produced this "stripe" track, which has the numbers 1, 2, 3 and so on before each stripe (where do I find the meaning of these numbers?).

    Then I added a first line to the bedgraph file like this:
    track type=bedGraph name="set11bamSorted" description="BedGraph format" visibility=full color=200,100,0 altColor=0,100,200 priority=20

    and uploaded the new file.

    This time I am getting an error:
    "Error File 'set11_fixed.bedgraph' - Error line 1224705 of custom track: chromEnd larger than chrom chr13 size (114176901 > 114142980)"

    Why should I get this error this time, while the contents of the bedgraph file are exactly the same as before? Is this the correct way to get the wiggle histogram like the one in the first picture?

    Below you may find the command that I used to generate the original bedgraph file:

    genomeCoverageBed -bg -ibam ~/set11/tophatOut11/set11TopHataccepted_hits.Sorted.bam -g /opt/bowtie/indexes/mm8.fa.fai > ~/set11/set11.bedgraph &
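
    One likely cause of the chromEnd error is that the sizes in mm8.fa.fai do not match the chromosome sizes of the assembly selected in the browser. A minimal sketch that clips bedgraph intervals to a chrom.sizes file before upload (the file names are assumptions):

    # chrom.sizes has two columns, chrom<TAB>size (e.g. from UCSC fetchChromSizes).
    sizes = {}
    with open("mm8.chrom.sizes") as f:
        for line in f:
            chrom, size = line.split()
            sizes[chrom] = int(size)

    with open("set11.bedgraph") as src, open("set11_clipped.bedgraph", "w") as out:
        for line in src:
            if line.startswith("track"):
                out.write(line)
                continue
            chrom, start, end, value = line.split()
            end = min(int(end), sizes.get(chrom, int(end)))
            if int(start) < end:
                out.write(f"{chrom}\t{start}\t{end}\t{value}\n")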



  • SEQond
    replied
    Originally posted by fhb View Post
    Hi Dr. Quinlan,

    Thanks for the bedtools. I am using the genomeCoverageBed, but I would like to use the coverageBed.
    I am not sure if you have posted this somewhere else, but I would appreciate if you could provide this code that you offered that creates a BED file with intervals of a given size.
    Thanks in advance,
    Best
    Fernando
    If possible I would like that too.

    Thanks



  • qqtwee
    replied
    Cufflinks error

    Hello all,
    When I run Cufflinks without a GTF file on bacterial RNA-seq data, I get the following result:
    tracking_id class_code nearest_ref_id gene_id gene_short_name tss_id locus length coverage FPKM FPKM_conf_lo FPKM_conf_hi FPKM_status
    That header line is all there is in the result file. I don't know why there are no values; can anyone help me?
    Thanks, best wishes.



  • gringer
    replied
    You're going to get a huge variation in expression across transcripts, so calculating a 'global deviation' doesn't really make sense. If you do want to do it globally, I'd recommend log-transforming the coverage values before working out the mean coverage and deviation, because the transformed values seem to fit a normal curve better.

    On a per-gene (or per-isoform) basis, I've been summarizing coverage variation with the CV (i.e. SD / mean). For the roughly 20 genes I did this for, the CV ranged from about 40% to 400%, but anything above about 140% appeared to be a mis-mapped transcript.

    It might be useful to only do this for expressed regions (e.g. coverage > 15) to attempt to control for these bad mappings.
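
    A minimal sketch of that per-gene calculation, assuming you already have a list of per-base depths for the gene (the function shape is my assumption; the coverage > 15 filter is from the post above):

    import statistics

    def coverage_cv(depths, min_depth=15):
        # Coefficient of variation (SD / mean) of per-base depths,
        # restricted to expressed positions (depth > min_depth).
        expressed = [d for d in depths if d > min_depth]
        if len(expressed) < 2:
            return None
        return statistics.stdev(expressed) / statistics.mean(expressed)

    # e.g. a CV above ~1.4 (140%) suggested a mis-mapped transcript.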



  • Valerie2011
    replied
    Hi all,

    I have a little question concerning the best way to estimate RNA-Seq coverage, or more precisely the global transcription rate distribution.

    To put it simply, let's assume a bacterium on which RNA-Seq has been performed. For this analysis we work only with the pairs (positive and negative strands) of .wig files, filtering out rRNAs, tRNAs and other known RNAs that are transcribed at very high levels.

    Briefly (omitting a normalization step between the samples), we then calculate a global transcription rate for each strand by simply adding up all the values in the corresponding .wig file and dividing this sum by the genome size: we get the global average number of reads per nucleotide. Easy. A standard deviation around this mean is also easily calculated at the same time.

    My question is: is the standard deviation the best parameter to describe the distribution around this mean? As the majority of nucleotides are not covered by even a single read, and as some transcripts are very abundant, this standard deviation is obviously huge compared to the mean (e.g. mean = 8, stdev = 420). Would the standard error (= stdev / sqrt(genome size)) be more relevant? Or should the 0 values be excluded from the analysis, counting only the nucleotides covered and their number of reads?

    Sorry for not being very good at stats; I forgot most of what I learnt ages ago. Basically, the aim is to test whether one strand is being transcribed at (statistically) the same rate as the other... We would prefer not to use fancy software for this analysis but to do it ourselves "manually" (with a little help from Perl, of course).

    Note: we are interested in the transcripts independently of the coding domain sequences and other annotations.

    Thanks for any suggestion!
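
    A minimal sketch of the per-strand summary, assuming the .wig values for one strand have already been parsed into a list of per-position counts (the genome size below is hypothetical). Given the skew, the median may describe a typical position better than mean ± SD:

    import statistics

    def strand_summary(covered_values, genome_size):
        # Positions absent from the .wig are treated as zero.
        full = list(covered_values) + [0] * (genome_size - len(covered_values))
        return {
            "mean": statistics.mean(full),
            "stdev": statistics.stdev(full),
            "median": statistics.median(full),
        }

    # Compare the two strands, e.g.:
    # plus  = strand_summary(plus_values, 4_600_000)
    # minus = strand_summary(minus_values, 4_600_000)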

