Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #31
    thank you very much for your answer westerman. You're right, I have no idea about the transcriptome size. It's a tissue specific from a non-model specie. I got 62M from a 454 pyrosequencing process from 8 different organisms (same plate) and I suspect that there were too much samples for a single plate. I would like to guess the ideal ratio sample/plate so I thought that coverage could be a good way to estimate this.

    Comment


    • #32
      We've started using BED Tools as part of our pipeline and so far it's been good. We calculate 2 types of coverage, non-collapsed (total mapped reads x read length / genome size) and collapsed coverage (total mapped reads with PCR duplicates removed x read length / genome size).

      The tough part is identifying those reads aligning to multiple locations. What do people do in that case? Are people using uniquely aligned reads only and using a mappable genome size (different read lengths will change this number). We use Picard to collapse our reads. If you use BEDTools, you'll need to remove the duplicates since BEDTools doesn't seem to do a flag check for the read.

      Comment


      • #33
        Originally posted by quinlana View Post
        Hi,
        As ewilbanks suggested, BEDTools will do this for you.



        If you want to compute coverage for "bins/windows" that march along the genome, you would use coverageBed. Let's say you've created a BED file called windows.bed for 10Kb windows across your genome and it looks like this (note BEDTools uses UCSC 0-based starts):
        chr1 0 10000
        chr1 9999 20000
        ...

        Now, you also have a bed file of sequence reads called reads.bed. The following command with calculate for each window in windows.bed:
        1) The number of reads in reads.bed that overlap the window
        2) Coverage density (i.e. the fraction of base pairs in window.bed that are covered by >= 1 read in reads.bed)

        coverageBed -a reads.bed -b windows.bed | sortBed -i stdin > windows.bed.coverage

        Sample output:
        <chr> <start> <end> <#reads> <# window bases covered> <windowSize> <fraction of window covered>
        chr1 0 10000 0 0 10000 0
        chr1 9999 20000 33 1000 10000 0.10

        I hope this helps. If you need a script to create a BED file with windows of a given size, just let me know.

        Best,
        Aaron
        Hi Dr. Quinlan,

        Thanks for the bedtools. I am using the genomeCoverageBed, but I would like to use the coverageBed.
        I am not sure if you have posted this somewhere else, but I would appreciate if you could provide this code that you offered that creates a BED file with intervals of a given size.
        Thanks in advance,
        Best
        Fernando

        Comment


        • #34
          Hi Jonathan,

          About R script above, I have tried and go these error. Any suggestion is great as I'm R beginner.


          > data <-read.table(file="./illusmp454merCbed",sep="\t",header=F)
          > colnames(data)<-c("Scaffold","sca_position","coverage")
          > depth<-mean(data[,"coverage'])
          + #depth now has the mean (overall)coverage
          + #set the bin-size
          + window<-1001
          + rangefrom<-0
          + rangeto<-length(data[,"sca_position"])
          Error: unexpected symbol in:
          "rangefrom<-0
          rangeto<-length(data[,"sca_position"
          > data.1kb<-runmed(data[,"coverage"],k=window)
          Error in as.integer(k) :
          cannot coerce type 'closure' to vector of type 'integer'
          > png(file="cov_1k.png",width=400,height=1000)
          > plot(x=data[rangefrom:rangeto,"sca_position"],y=data.1kb[rangefrom:rangeto],pch=".",cex1,xlab="bp position",ylab="depth",type="p")
          Error in `[.data.frame`(data, rangefrom:rangeto, "sca_position") :
          object 'rangefrom' not found
          > dev.off()

          Comment


          • #35
            Originally posted by [email protected] View Post
            Hi Jonathan,
            > data <-read.table(file="./illusmp454merCbed",sep="\t",header=F)
            > colnames(data)<-c("Scaffold","sca_position","coverage")
            > depth<-mean(data[,"coverage'])
            + #depth now has the mean (overall)coverage
            + #set the bin-size
            + window<-1001
            + rangefrom<-0
            + rangeto<-length(data[,"sca_position"])
            You have two different types of quoting in coverage
            It should be like that:
            Code:
            depth<-mean(data[,[B][COLOR="Red"]"coverage"[/COLOR][/B]])
            and than you will also define the windows size and the ranges

            Comment


            • #36
              Hi frymor,

              Thank you for the suggestion.
              I have fix the script and now I got another message from coverPlot.Rout.

              > data <-read.table(file="~/q20snpref/illusmp454merCbed",sep="\t",header=F)
              > colnames(data)<-c("Scaffold","sca_position","coverage")
              > depth<-mean(data[,"coverage"])
              Warning message:
              In mean.default(data[, "coverage"]) :
              argument is not numeric or logical: returning NA
              > #depth now has the mean (overall)coverage
              > #set the bin-size
              > window<-1001
              > rangefrom<-0
              > rangeto<-length(data[,"sca_position"])
              > data.1kb<-runmed(data[,"coverage"],k=window)
              > png(file="cov_1k.png",width=400,height=1000)
              coverPlot.Rout

              Any suggestion would be great.

              Best regards,
              Sutada

              Comment


              • #37
                Based on the options you've given, "~/q20snpref/illusmp454merCbed" [note no extension] should be a headerless file with fields separated by tabs.

                The warning message suggests that you either have a header line in your file (or possibly a comment line), or a non-numeric line in the 'coverage' column.

                Just to make sure the input looks like what is expected, what is the output of this command [which displays the first 5 lines of a file]?
                Code:
                readLines(con = "~/q20snpref/illusmp454merCbed", n = 5)

                Comment


                • #38
                  hi gringer,

                  Thank you very much for suggestion.
                  I got this with this code.
                  > readLines(con = "~/q20snpref/illusmp454merCbed", n = 5)
                  [1] "Scaffold\tsca_position\tcoverage" "Scaffold1\t1\t0"
                  [3] "Scaffold1\t2\t0" "Scaffold1\t3\t0"
                  [5] "Scaffold1\t4\t0"

                  I have also tried
                  data <-read.table(file="~/q20snpref/illusmp454merCbed",sep="\t",header=T) and I got
                  > data <-read.table(file="~/q20snpref/illusmp454merCbed",sep="\t",header=T)
                  Killed
                  ^@^@^Killed.

                  Should I put # for the header line. Actually illusmp454merCbed is a text file contains 3 column separate by tab. I thought that to run R we need the column name to define the data.

                  Any suggestion will be great.
                  Best,

                  Comment


                  • #39
                    With a file looking like that, just what you've put should work (i.e. changing the header value to 'TRUE'):

                    data <-read.table(file="~/q20snpref/illusmp454merCbed",sep="\t",header=T)
                    Is there a particular reason why the reading of the table was killed? How big is the input file? How much memory does your computer have? Did you break out of it manually? How long did it take before the 'Killed' message appeared?

                    Comment


                    • #40
                      Hi you all,

                      I have a little question concerning the best way to estimate the RNASeq coverage, or more precisely global transcription rate distribution.

                      To put it simple let's assume a bacterium from which an RNASeq has been performed. For this analysis we work only with the pairs (positive and negative strands) of .wig files, filtering out rRNAs, tRNAs and other known RNAs that are transcribed at very high levels.

                      Briefly (ommiting a normalization step between the samples) we then calculate a global trancription rate for each strand by simply adding up all the values in the corresponding .wig file and by dividing this sum by the genome size: we get the global average read per nucleotide. Easy. A standard deviation on this mean is also easily calculated at the same time.

                      My question is, is the standard deviation the best parameter to describe the distribution around this mean? As the majority of the nucleotides are not read a single time and as some transcripts are very abundant this standard deviation is obviously huge compared to the mean (e.g. mean = 8, stdev = 420). Would then the standard error (= stdev / sqrt (genome size)) be more relevant? Or should the 0 values be skipped from the analysis, counting only nucleotides called and their number of reads?

                      Sorry for not being very good at stats, I forgot most of what I learnt ages ago. Basically the aim is to test whether one strand is being transcribed at (statistically) the same rate as the other... We would prefer not to use fancy software to do this analysis but to do it ourselves "manually" (with a little help of Perl, of course).

                      Note: we are interested in the transcripts independently of the coding domain sequences and other annotations.

                      Thanks for any suggestion!

                      Comment


                      • #41
                        You're going to get a huge variation in expression for all transcripts, such that calculating a 'global deviation' doesn't really make sense. If you do want to do it globally, I'd recommend log-transforming the coverage values before working out mean coverage and deviation, because that transformation seems to better fit a normal curve.

                        On a per-gene (or per-isoform) basis, I've been calculating coverage using CV (i.e. SD / mean) as an indication of variation. With about 20 genes that I did this for, the CV ranged from about 40% to 400%, but anything above about 140% appeared to be a mis-mapped transcript.

                        It might be useful to only do this for expressed regions (e.g. coverage > 15) to attempt to control for these bad mappings.
                        Last edited by gringer; 11-24-2011, 04:38 AM.

                        Comment


                        • #42
                          cufflinks Error

                          Hello,all
                          when I use cufflinks without a gtf file for bacterial RNA-seq, I get the follow result:
                          tracking_id class_code nearest_ref_id gene_id gene_short_name tss_id locus length coverage FPKM FPKM_conf_lo FPKM_conf_hi FPKM_status
                          ,that is all in the result file.
                          I don't know the reason why there are not values.anyone can help me?
                          thanks. best wishes.
                          Attached Files

                          Comment


                          • #43
                            Originally posted by fhb View Post
                            Hi Dr. Quinlan,

                            Thanks for the bedtools. I am using the genomeCoverageBed, but I would like to use the coverageBed.
                            I am not sure if you have posted this somewhere else, but I would appreciate if you could provide this code that you offered that creates a BED file with intervals of a given size.
                            Thanks in advance,
                            Best
                            Fernando
                            If possible I would like that too.

                            Thanks

                            Comment


                            • #44
                              By utilising BEDtools and UCSC GB I am trying to get a picture like this (histogram of the bedgraph in wiggle form)

                              so far i have used a SORTED bam on which genomeCoverageBed was run with -bg -ibam options and -g mm8.fa.fai as the index

                              The resulting bedgraph was uploaded on UCSC GB and produced this "stripe" track which has numbers 1 2 3 and os on , before each stripe ( where do I find the meaning of the numbers?)

                              Then added a first line of the bedgraph file like this
                              track type=bedGraph name="set11bamSorted" description="BedGraph format" visibility=full color=200,100,0 altColor=0,100,200 priority=20

                              and uploaded the new file.

                              This time I am getting an error
                              "Error File 'set11_fixed.bedgraph' - Error line 1224705 of custom track: chromEnd larger than chrom chr13 size (114176901 > 114142980)"

                              Why should I get this error this time , while the contents of the bedgraph file are exactly the same as before? Is this the correct way to get the wiggle histogram like the one on the first picture ?

                              Below you may find the code that I used to get the original bedgraph file

                              genomeCoverageBed -bg -ibam ~/set11/tophatOut11/set11TopHataccepted_hits.Sorted.bam -g /opt/bowtie/indexes/mm8.fa.fai > ~/set11/set11.bedgraph &

                              Comment


                              • #45
                                I realize this is quite an old forum but I am currently trying to calculate the coverage per bp of my NGS data. I am new to NGS and bioinformatics so my apologies if this does not make sense...

                                For quality control I would like to determine a logical min and max coverage to exclude from my downstream analyses; however, I cannot seem to find a "gold standard" for such purposes. It seems to me that if you are mapping to a reference genome and there are regions that have more than twice the average coverage that it is probably the result of a duplication or something in the genome of the sequenced organism. Likewise, if it has very poor coverage the organism likely does not have that region in its genome and it is likely the result of improper mapping. As such, I would like to exclude these regions before performing genome-wide population genetic analyses. Does anyone have any suggestions on what cutoffs to use and how to go about doing so?

                                Thanks in advance!

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Best Practices for Single-Cell Sequencing Analysis
                                  by seqadmin



                                  While isolating and preparing single cells for sequencing was historically the bottleneck, recent technological advancements have shifted the challenge to data analysis. This highlights the rapidly evolving nature of single-cell sequencing. The inherent complexity of single-cell analysis has intensified with the surge in data volume and the incorporation of diverse and more complex datasets. This article explores the challenges in analysis, examines common pitfalls, offers...
                                  06-06-2024, 07:15 AM
                                • seqadmin
                                  Latest Developments in Precision Medicine
                                  by seqadmin



                                  Technological advances have led to drastic improvements in the field of precision medicine, enabling more personalized approaches to treatment. This article explores four leading groups that are overcoming many of the challenges of genomic profiling and precision medicine through their innovative platforms and technologies.

                                  Somatic Genomics
                                  “We have such a tremendous amount of genetic diversity that exists within each of us, and not just between us as individuals,”...
                                  05-24-2024, 01:16 PM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, Today, 08:58 AM
                                0 responses
                                8 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, Yesterday, 02:20 PM
                                0 responses
                                14 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 06-07-2024, 06:58 AM
                                0 responses
                                181 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 06-06-2024, 08:18 AM
                                0 responses
                                231 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X