**quinlana** · 08-17-2010, 08:25 AM

Hi all,
I've received a few emails expressing confusion over the utility and limitations of the new "groupBy" tool. First, it is not limited to processing output from BEDTools: it will work on any tab-delimited file or stream. To illustrate this and to fulfill requests for additional examples, the command below computes the mean and standard deviation of the insert size (ISIZE) for each sequencing library present in a BAM file containing multiple libraries. In the interest of clarity, this example assumes that each read records the library from which it came in the read group (RG) tag and that this tag is the 12th column in the SAM output.

I hope this helps and I apologize that things weren't expressed with more clarity earlier.

Aaron

Code:

##########################################################################
# Goal: Compute the mean and stdev of the insert size (ISIZE) for each
#       sequencing library (RG tag)
# Steps:
# Line 1 (samtools) : extract all properly-paired reads
# Line 2 (awk):       print the RG/library and ISIZE (positive ISIZE only)
# Line 3 (sort):      sort the output by RG/library
# Line 4 (groupBy):   compute the mean & stdev for each library
##########################################################################
$ samtools view -f 0x2 aln.multipleLibraries.bam | \
    awk '{if ($9>0) {print $12"\t"$9}}' | \
    sort -k1,1 | \
    groupBy -i stdin -grp 1 -opCols 2,2 -ops mean,stdev

# library	mean		stdev
RG:Z:libA	319.5959	32.86841
RG:Z:libB	389.8465	32.60053
RG:Z:libC	329.1906	32.86142
RG:Z:libD	318.8107	33.33372
RG:Z:libE	359.0431	33.34611
RG:Z:libF	320.4461	32.79852
RG:Z:libG	399.0043	32.98773
RG:Z:libH	329.6738	33.15160
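The example above assumes that the RG tag is always the 12th column. As a hedged sketch only (not part of BEDTools), the two helper functions below, `rg_isize` and `group_mean_stdev`, are hypothetical names introduced here for illustration. The first looks for the RG tag in any optional SAM column (12 onward), and the second is a plain awk cross-check of the per-group mean and standard deviation. It uses the population formula, which is an assumption; the post does not say which formula groupBy uses.

```shell
# Hypothetical helper: read SAM records on stdin and print "RG<TAB>ISIZE"
# for records with a positive ISIZE, wherever the RG tag appears among
# the optional fields (columns 12 and later).
rg_isize() {
    awk -F'\t' '$9 > 0 {
        for (i = 12; i <= NF; i++) {
            if ($i ~ /^RG:Z:/) { print $i "\t" $9; break }
        }
    }'
}

# Hypothetical helper: read "group<TAB>value" lines that are already
# sorted by group and print "group<TAB>mean<TAB>stdev" for each group.
# Assumption: population standard deviation (divide by n).
group_mean_stdev() {
    awk -F'\t' '
        function emit() {
            if (n > 0) {
                mean = sum / n
                var = sumsq / n - mean * mean
                if (var < 0) var = 0   # guard against rounding error
                printf "%s\t%.4f\t%.4f\n", key, mean, sqrt(var)
            }
        }
        $1 != key { emit(); key = $1; n = 0; sum = 0; sumsq = 0 }
        { n++; sum += $2; sumsq += $2 * $2 }
        END { emit() }
    '
}

# Usage (assumes samtools is on the PATH; the BAM name is from the post):
#   samtools view -f 0x2 aln.multipleLibraries.bam | rg_isize | \
#       sort -k1,1 | group_mean_stdev
```

The output of `rg_isize | sort -k1,1` can equally be piped into `groupBy -i stdin -grp 1 -opCols 2,2 -ops mean,stdev` as in the original command; the awk version is only meant as a sanity check on small inputs.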

**Yilong Li** · 04-14-2011, 09:50 AM

Thanks for the amazing program, I've been looking for such a program for a very long time!

One question: will bedtools run faster (esp. coverageBed or intersectBed) if the input BED or BAM files are sorted, or does it not matter?

**quinlana** · 04-14-2011, 10:41 AM

Currently, sorting makes no difference for intersect or coverage.

**Adriano** · 06-22-2011, 04:42 PM

Hi Aaron,

Thank you very much for your program. I am starting to use it, and it seems very well documented and quick to get used to.

I have one major observation. When you download a GFF file from NCBI Genome, for example, the first feature is called "source" and spans the whole chromosome, in a line like:

NC_008405.1 RefSeq source 1 27566993

This causes intersectBed to intersect every short read with this feature, when the intersections should really be only with features such as "gene", "exon", etc.

To avoid this problem, I need to edit the GFF file to remove the "source" feature.

I hope you can have a look at this issue and improve your fantastic BEDTools even more.

Cheers,

Adriano
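As a possible workaround for the manual editing described above, here is a minimal sketch that drops "source" features with awk before running intersectBed. The function name `drop_source` and the file names are hypothetical, and the sketch assumes a standard tab-delimited GFF with the feature type in column 3 (the sample line above appears space-separated, which is probably a copy-and-paste artifact).

```shell
# Hypothetical helper: read a tab-delimited GFF on stdin and print every
# line except features whose type (column 3) is "source".
# Comment and header lines (starting with "#") are passed through.
drop_source() {
    awk -F'\t' '/^#/ || $3 != "source"'
}

# Usage (file names are placeholders):
#   drop_source < genome.gff > genome.noSource.gff
#   intersectBed -abam reads.bam -b genome.noSource.gff > reads.inFeatures.bam
```

This leaves the original GFF untouched, so the filter can be re-run whenever a fresh annotation file is downloaded.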

BEDTools v2.9 - new tools/features