thank you very much for your answer westerman. You're right, I have no idea about the transcriptome size. It's a tissue specific from a non-model specie. I got 62M from a 454 pyrosequencing process from 8 different organisms (same plate) and I suspect that there were too much samples for a single plate. I would like to guess the ideal ratio sample/plate so I thought that coverage could be a good way to estimate this.
Seqanswers Leaderboard Ad
Collapse
Announcement
Collapse
No announcement yet.
X
-
We've started using BED Tools as part of our pipeline and so far it's been good. We calculate 2 types of coverage, non-collapsed (total mapped reads x read length / genome size) and collapsed coverage (total mapped reads with PCR duplicates removed x read length / genome size).
The tough part is identifying those reads aligning to multiple locations. What do people do in that case? Are people using uniquely aligned reads only and using a mappable genome size (different read lengths will change this number). We use Picard to collapse our reads. If you use BEDTools, you'll need to remove the duplicates since BEDTools doesn't seem to do a flag check for the read.
Comment
-
Originally posted by quinlana View PostHi,
As ewilbanks suggested, BEDTools will do this for you.
If you want to compute coverage for "bins/windows" that march along the genome, you would use coverageBed. Let's say you've created a BED file called windows.bed for 10Kb windows across your genome and it looks like this (note BEDTools uses UCSC 0-based starts):
chr1 0 10000
chr1 9999 20000
...
Now, you also have a bed file of sequence reads called reads.bed. The following command with calculate for each window in windows.bed:
1) The number of reads in reads.bed that overlap the window
2) Coverage density (i.e. the fraction of base pairs in window.bed that are covered by >= 1 read in reads.bed)
coverageBed -a reads.bed -b windows.bed | sortBed -i stdin > windows.bed.coverage
Sample output:
<chr> <start> <end> <#reads> <# window bases covered> <windowSize> <fraction of window covered>
chr1 0 10000 0 0 10000 0
chr1 9999 20000 33 1000 10000 0.10
I hope this helps. If you need a script to create a BED file with windows of a given size, just let me know.
Best,
Aaron
Thanks for the bedtools. I am using the genomeCoverageBed, but I would like to use the coverageBed.
I am not sure if you have posted this somewhere else, but I would appreciate if you could provide this code that you offered that creates a BED file with intervals of a given size.
Thanks in advance,
Best
Fernando
Comment
-
Hi Jonathan,
About R script above, I have tried and go these error. Any suggestion is great as I'm R beginner.
> data <-read.table(file="./illusmp454merCbed",sep="\t",header=F)
> colnames(data)<-c("Scaffold","sca_position","coverage")
> depth<-mean(data[,"coverage'])
+ #depth now has the mean (overall)coverage
+ #set the bin-size
+ window<-1001
+ rangefrom<-0
+ rangeto<-length(data[,"sca_position"])
Error: unexpected symbol in:
"rangefrom<-0
rangeto<-length(data[,"sca_position"
> data.1kb<-runmed(data[,"coverage"],k=window)
Error in as.integer(k) :
cannot coerce type 'closure' to vector of type 'integer'
> png(file="cov_1k.png",width=400,height=1000)
> plot(x=data[rangefrom:rangeto,"sca_position"],y=data.1kb[rangefrom:rangeto],pch=".",cex1,xlab="bp position",ylab="depth",type="p")
Error in `[.data.frame`(data, rangefrom:rangeto, "sca_position") :
object 'rangefrom' not found
> dev.off()
Comment
-
Originally posted by [email protected] View PostHi Jonathan,
> data <-read.table(file="./illusmp454merCbed",sep="\t",header=F)
> colnames(data)<-c("Scaffold","sca_position","coverage")
> depth<-mean(data[,"coverage'])
+ #depth now has the mean (overall)coverage
+ #set the bin-size
+ window<-1001
+ rangefrom<-0
+ rangeto<-length(data[,"sca_position"])
It should be like that:
Code:depth<-mean(data[,[B][COLOR="Red"]"coverage"[/COLOR][/B]])
Comment
-
Hi frymor,
Thank you for the suggestion.
I have fix the script and now I got another message from coverPlot.Rout.
> data <-read.table(file="~/q20snpref/illusmp454merCbed",sep="\t",header=F)
> colnames(data)<-c("Scaffold","sca_position","coverage")
> depth<-mean(data[,"coverage"])
Warning message:
In mean.default(data[, "coverage"]) :
argument is not numeric or logical: returning NA
> #depth now has the mean (overall)coverage
> #set the bin-size
> window<-1001
> rangefrom<-0
> rangeto<-length(data[,"sca_position"])
> data.1kb<-runmed(data[,"coverage"],k=window)
> png(file="cov_1k.png",width=400,height=1000)
coverPlot.Rout
Any suggestion would be great.
Best regards,
Sutada
Comment
-
Based on the options you've given, "~/q20snpref/illusmp454merCbed" [note no extension] should be a headerless file with fields separated by tabs.
The warning message suggests that you either have a header line in your file (or possibly a comment line), or a non-numeric line in the 'coverage' column.
Just to make sure the input looks like what is expected, what is the output of this command [which displays the first 5 lines of a file]?
Code:readLines(con = "~/q20snpref/illusmp454merCbed", n = 5)
Comment
-
hi gringer,
Thank you very much for suggestion.
I got this with this code.
> readLines(con = "~/q20snpref/illusmp454merCbed", n = 5)
[1] "Scaffold\tsca_position\tcoverage" "Scaffold1\t1\t0"
[3] "Scaffold1\t2\t0" "Scaffold1\t3\t0"
[5] "Scaffold1\t4\t0"
I have also tried
data <-read.table(file="~/q20snpref/illusmp454merCbed",sep="\t",header=T) and I got
> data <-read.table(file="~/q20snpref/illusmp454merCbed",sep="\t",header=T)
Killed
^@^@^Killed.
Should I put # for the header line. Actually illusmp454merCbed is a text file contains 3 column separate by tab. I thought that to run R we need the column name to define the data.
Any suggestion will be great.
Best,
Comment
-
With a file looking like that, just what you've put should work (i.e. changing the header value to 'TRUE'):
data <-read.table(file="~/q20snpref/illusmp454merCbed",sep="\t",header=T)
Comment
-
Hi you all,
I have a little question concerning the best way to estimate the RNASeq coverage, or more precisely global transcription rate distribution.
To put it simple let's assume a bacterium from which an RNASeq has been performed. For this analysis we work only with the pairs (positive and negative strands) of .wig files, filtering out rRNAs, tRNAs and other known RNAs that are transcribed at very high levels.
Briefly (ommiting a normalization step between the samples) we then calculate a global trancription rate for each strand by simply adding up all the values in the corresponding .wig file and by dividing this sum by the genome size: we get the global average read per nucleotide. Easy. A standard deviation on this mean is also easily calculated at the same time.
My question is, is the standard deviation the best parameter to describe the distribution around this mean? As the majority of the nucleotides are not read a single time and as some transcripts are very abundant this standard deviation is obviously huge compared to the mean (e.g. mean = 8, stdev = 420). Would then the standard error (= stdev / sqrt (genome size)) be more relevant? Or should the 0 values be skipped from the analysis, counting only nucleotides called and their number of reads?
Sorry for not being very good at stats, I forgot most of what I learnt ages ago. Basically the aim is to test whether one strand is being transcribed at (statistically) the same rate as the other... We would prefer not to use fancy software to do this analysis but to do it ourselves "manually" (with a little help of Perl, of course).
Note: we are interested in the transcripts independently of the coding domain sequences and other annotations.
Thanks for any suggestion!
Comment
-
You're going to get a huge variation in expression for all transcripts, such that calculating a 'global deviation' doesn't really make sense. If you do want to do it globally, I'd recommend log-transforming the coverage values before working out mean coverage and deviation, because that transformation seems to better fit a normal curve.
On a per-gene (or per-isoform) basis, I've been calculating coverage using CV (i.e. SD / mean) as an indication of variation. With about 20 genes that I did this for, the CV ranged from about 40% to 400%, but anything above about 140% appeared to be a mis-mapped transcript.
It might be useful to only do this for expressed regions (e.g. coverage > 15) to attempt to control for these bad mappings.Last edited by gringer; 11-24-2011, 04:38 AM.
Comment
-
cufflinks Error
Hello,all
when I use cufflinks without a gtf file for bacterial RNA-seq, I get the follow result:
tracking_id class_code nearest_ref_id gene_id gene_short_name tss_id locus length coverage FPKM FPKM_conf_lo FPKM_conf_hi FPKM_status
,that is all in the result file.
I don't know the reason why there are not values.anyone can help me?
thanks. best wishes.Attached Files
Comment
-
Originally posted by fhb View PostHi Dr. Quinlan,
Thanks for the bedtools. I am using the genomeCoverageBed, but I would like to use the coverageBed.
I am not sure if you have posted this somewhere else, but I would appreciate if you could provide this code that you offered that creates a BED file with intervals of a given size.
Thanks in advance,
Best
Fernando
Thanks
Comment
-
By utilising BEDtools and UCSC GB I am trying to get a picture like this (histogram of the bedgraph in wiggle form)
so far i have used a SORTED bam on which genomeCoverageBed was run with -bg -ibam options and -g mm8.fa.fai as the index
The resulting bedgraph was uploaded on UCSC GB and produced this "stripe" track which has numbers 1 2 3 and os on , before each stripe ( where do I find the meaning of the numbers?)
Then added a first line of the bedgraph file like this
track type=bedGraph name="set11bamSorted" description="BedGraph format" visibility=full color=200,100,0 altColor=0,100,200 priority=20
and uploaded the new file.
This time I am getting an error
"Error File 'set11_fixed.bedgraph' - Error line 1224705 of custom track: chromEnd larger than chrom chr13 size (114176901 > 114142980)"
Why should I get this error this time , while the contents of the bedgraph file are exactly the same as before? Is this the correct way to get the wiggle histogram like the one on the first picture ?
Below you may find the code that I used to get the original bedgraph file
genomeCoverageBed -bg -ibam ~/set11/tophatOut11/set11TopHataccepted_hits.Sorted.bam -g /opt/bowtie/indexes/mm8.fa.fai > ~/set11/set11.bedgraph &
Comment
-
I realize this is quite an old forum but I am currently trying to calculate the coverage per bp of my NGS data. I am new to NGS and bioinformatics so my apologies if this does not make sense...
For quality control I would like to determine a logical min and max coverage to exclude from my downstream analyses; however, I cannot seem to find a "gold standard" for such purposes. It seems to me that if you are mapping to a reference genome and there are regions that have more than twice the average coverage that it is probably the result of a duplication or something in the genome of the sequenced organism. Likewise, if it has very poor coverage the organism likely does not have that region in its genome and it is likely the result of improper mapping. As such, I would like to exclude these regions before performing genome-wide population genetic analyses. Does anyone have any suggestions on what cutoffs to use and how to go about doing so?
Thanks in advance!
Comment
Latest Articles
Collapse
-
by seqadmin
Non-coding RNAs (ncRNAs) do not code for proteins but play important roles in numerous cellular processes including gene silencing, developmental pathways, and more. There are numerous types including microRNA (miRNA), long ncRNA (lncRNA), circular RNA (circRNA), and more. In this article, we discuss innovative ncRNA research and explore recent technological advancements that improve the study of ncRNAs.
Nobel Prize for MicroRNA Discovery
This week,...-
Channel: Articles
Yesterday, 08:07 AM -
-
by seqadmin
Metagenomics has improved the way researchers study microorganisms across diverse environments. Historically, studying microorganisms relied on culturing them in the lab, a method that limits the investigation of many species since most are unculturable1. Metagenomics overcomes these issues by allowing the study of microorganisms regardless of their ability to be cultured or the environments they inhabit. Over time, the field has evolved, especially with the advent...-
Channel: Articles
09-23-2024, 06:35 AM -
-
by seqadmin
During the COVID-19 pandemic, scientists observed that while some individuals experienced severe illness when infected with SARS-CoV-2, others were barely affected. These disparities left researchers and clinicians wondering what causes the wide variations in response to viral infections and what role genetics plays.
Jean-Laurent Casanova, M.D., Ph.D., Professor at Rockefeller University, is a leading expert in this crossover between genetics and infectious...-
Channel: Articles
09-09-2024, 10:59 AM -
ad_right_rmr
Collapse
News
Collapse
Topics | Statistics | Last Post | ||
---|---|---|---|---|
Started by seqadmin, 10-02-2024, 04:51 AM
|
0 responses
95 views
0 likes
|
Last Post
by seqadmin
10-02-2024, 04:51 AM
|
||
Started by seqadmin, 10-01-2024, 07:10 AM
|
0 responses
106 views
0 likes
|
Last Post
by seqadmin
10-01-2024, 07:10 AM
|
||
Started by seqadmin, 09-30-2024, 08:33 AM
|
1 response
106 views
0 likes
|
Last Post
by EmiTom
Yesterday, 06:46 AM
|
||
Started by seqadmin, 09-26-2024, 12:57 PM
|
0 responses
20 views
0 likes
|
Last Post
by seqadmin
09-26-2024, 12:57 PM
|
Comment