
  • BEDTools v2.9 - new tools/features

    Hi all,
    I just posted Version 2.9.0. The details of the release are below. Highlights include a new unionBedGraphs tool, a "per-base" coverage option for coverageBed, a "distance" option for closestBed, and multi-column operations for groupBy.



    Best,
    Aaron

    === New tools ===
    1. unionBedGraphs. This is a powerful new tool contributed by Assaf Gordon from CSHL. It will combine/merge multiple BEDGRAPH files into a single file, thus allowing comparisons of coverage (or any text-value) across multiple samples. The example below illustrates how to compare coverage across three different BEDGRAPH files.
    Code:
     $ cat 1.bg
     chr1	1000	1500	10
     chr1	2000	2100	20
    
     $ cat 2.bg
     chr1	900	1600	60
     chr1	1700	2050	50
    
     $ cat 3.bg
     chr1	1980	2070	80
     chr1	2090	2100	20
    
     $ unionBedGraphs -header -i 1.bg 2.bg 3.bg -names WT-1 WT-2 KO-1
     chrom	start	end	WT-1	WT-2	KO-1
     chr1	900	1000	0	60	0
     chr1	1000	1500	10	60	0
     chr1	1500	1600	0	60	0
     chr1	1700	1980	0	50	0
     chr1	1980	2000	0	50	80
     chr1	2000	2050	20	50	80
     chr1	2050	2070	20	0	80
     chr1	2070	2090	20	0	0
     chr1	2090	2100	20	0	20
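For readers who want to see the merge logic spelled out, here is a minimal Python sketch of the idea behind the example above. It is not the unionBedGraphs implementation: the names `union_bedgraphs` and `value_at` and the in-memory tuple representation are made up for illustration, and it assumes the intervals within a single file do not overlap each other.

```python
from typing import Dict, List, Optional, Set, Tuple

Interval = Tuple[str, int, int, int]  # (chrom, start, end, value)


def value_at(track: List[Interval], chrom: str, start: int, end: int) -> Optional[int]:
    """Return the value of the interval in `track` that spans [start, end), or None."""
    for c, s, e, v in track:
        if c == chrom and s <= start and end <= e:
            return v
    return None


def union_bedgraphs(tracks: List[List[Interval]]) -> List[tuple]:
    """Hypothetical helper: one output row per sub-interval, one value column per track."""
    # Every start and end, from any track, becomes a breakpoint.
    breakpoints: Dict[str, Set[int]] = {}
    for track in tracks:
        for chrom, start, end, _ in track:
            breakpoints.setdefault(chrom, set()).update((start, end))

    rows = []
    for chrom in sorted(breakpoints):
        points = sorted(breakpoints[chrom])
        for seg_start, seg_end in zip(points, points[1:]):
            values = [value_at(t, chrom, seg_start, seg_end) for t in tracks]
            if all(v is None for v in values):
                continue  # a gap that no file covers, such as 1600-1700 above
            rows.append((chrom, seg_start, seg_end, *(0 if v is None else v for v in values)))
    return rows


# The three files from the example above.
bg1 = [("chr1", 1000, 1500, 10), ("chr1", 2000, 2100, 20)]
bg2 = [("chr1", 900, 1600, 60), ("chr1", 1700, 2050, 50)]
bg3 = [("chr1", 1980, 2070, 80), ("chr1", 2090, 2100, 20)]

rows = union_bedgraphs([bg1, bg2, bg3])
for row in rows:
    print("\t".join(str(field) for field in row))
```

The sketch rescans each track for every segment, which is fine for a toy example but far too slow for real BEDGRAPH files.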

    === New features ===

    1. The "groupBy" tool now allows one to operate on multiple columns for each group. For example:
    Code:
    $ cat ex1.out
    chr1	10	20	A	chr1	15	25	B.1	1000
    chr1	10	20	A	chr1	25	35	B.2	10000
    
    $ groupBy -i ex1.out -grp 1,2,3,4 -opCols 8,9 -ops collapse,mean
    chr1	10	20	A	B.1,B.2,	5500
    2. New "distance feature" (-d) added to closestBed by Erik Arner. In addition to finding the closest feature to each feature in A, the -d option will report the distance to the closest feature in B. Overlapping features have a distance of 0.
    3. New "per base depth feature" (-d) added to coverageBed. This reports the per base coverage (1-based) of each feature in file B based on the coverage of features found in file A. For example, this could report the per-base depth of sequencing reads (-a) across each capture target (-b).
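The two -d options described above can be illustrated with a short Python sketch. This is only a sketch under stated assumptions, not how closestBed or coverageBed are implemented: features are (start, end) pairs on one chromosome in 0-based, half-open BED coordinates, the function names are hypothetical, the gap between non-overlapping features is taken as "later start minus earlier end" (the tools' exact convention, especially for book-ended features, may differ), and the per-base position is assumed to be a 1-based offset within the target.

```python
from typing import List, Tuple

Feature = Tuple[int, int]  # (start, end), 0-based half-open, single chromosome


def distance(a: Feature, b: Feature) -> int:
    """0 for overlapping features, otherwise the size of the gap between them.

    Assumption: book-ended features (end == start) also get 0 in this sketch.
    """
    gap = max(a[0], b[0]) - min(a[1], b[1])
    return max(gap, 0)


def closest(a: Feature, candidates: List[Feature]) -> Tuple[Feature, int]:
    """Return the nearest candidate to `a` together with its distance."""
    return min(((b, distance(a, b)) for b in candidates), key=lambda pair: pair[1])


def per_base_depth(target: Feature, reads: List[Feature]) -> List[Tuple[int, int]]:
    """Return (1-based position within target, depth) for every base of the target."""
    start, end = target
    result = []
    for pos in range(start, end):
        depth = sum(1 for r_start, r_end in reads if r_start <= pos < r_end)
        result.append((pos - start + 1, depth))
    return result


# Made-up coordinates for illustration.
nearest = closest((10, 20), [(30, 40), (22, 25), (100, 200)])
overlapping = distance((10, 20), (15, 25))
depths = per_base_depth((0, 5), [(0, 3), (2, 5), (4, 10)])
print(nearest, overlapping, depths)
```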

    Best,
    Aaron
    Last edited by quinlana; 08-16-2010, 01:46 PM. Reason: superfluous

  • #2
    Hi all,
    I've received a few emails expressing confusion over the utility and limitations of the new "groupBy" tool. It is not limited to processing output from BEDTools: it will work on any tab-delimited file or stream. To illustrate this and to fulfill requests for additional examples, the command below computes the mean and standard deviation of the insert size (ISIZE) for each sequencing library in a BAM file containing multiple libraries. In the interest of clarity, this example assumes that each read records the library from which it came in the read group (RG) tag and that this tag is the 12th column in the SAM output.

    I hope this helps and I apologize that things weren't expressed with more clarity earlier.

    Aaron
    Code:
    ##########################################################################
    # Goal: Compute the mean and stdev of ISIZE for each sequencing library (RG tag)
    # Steps:
    # Line 1 (samtools) : extract all properly-paired reads
    # Line 2 (awk):       print the RG/library and ISIZE (positive ISIZE only)
    # Line 3 (sort):      sort the output by RG/library
    # Line 4 (groupBy):   compute the mean & stdev for each library
    ###########################################################################
    $ samtools view -f 0x2 aln.multipleLibraries.bam | \
        awk '{if ($9>0) {print $12"\t"$9}}' | \
        sort -k1,1 | \
        groupBy -i stdin -grp 1 -opCols 2,2 -ops mean,stdev
    
    # library	mean		stdev
    RG:Z:libA	319.5959	32.86841
    RG:Z:libB	389.8465	32.60053
    RG:Z:libC	329.1906	32.86142
    RG:Z:libD	318.8107	33.33372
    RG:Z:libE	359.0431	33.34611
    RG:Z:libF	320.4461	32.79852
    RG:Z:libG	399.0043	32.98773
    RG:Z:libH	329.6738	33.15160
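To show what the grouping step does to a sorted, tab-delimited stream, here is a minimal Python sketch of the same idea. It is not the groupBy source code: the function `group_by` and the `OPS` table are hypothetical, the library rows are made-up values, and "stdev" is assumed here to mean the population standard deviation (the tool may compute the sample version). Unlike the groupBy output shown in the release notes, the sketch does not append a trailing comma to collapsed values.

```python
from itertools import groupby
from statistics import mean, pstdev
from typing import Callable, Dict, List

# Operation name -> function over the string values of one column in one group.
OPS: Dict[str, Callable[[List[str]], object]] = {
    "collapse": lambda vals: ",".join(vals),
    "mean": lambda vals: mean([float(v) for v in vals]),
    "stdev": lambda vals: pstdev([float(v) for v in vals]),  # assumption: population stdev
}


def group_by(rows: List[List[str]], grp: List[int], op_cols: List[int], ops: List[str]) -> List[list]:
    """Group consecutive rows that share the `grp` columns (1-based, input pre-sorted)."""
    def key(row: List[str]) -> tuple:
        return tuple(row[c - 1] for c in grp)

    out = []
    for group_key, members in groupby(rows, key=key):
        members = list(members)
        results = [OPS[op]([m[c - 1] for m in members]) for c, op in zip(op_cols, ops)]
        out.append(list(group_key) + results)
    return out


# The ex1.out rows from the release notes: collapse column 8, average column 9.
ex1 = [
    "chr1\t10\t20\tA\tchr1\t15\t25\tB.1\t1000".split("\t"),
    "chr1\t10\t20\tA\tchr1\t25\t35\tB.2\t10000".split("\t"),
]
ex1_grouped = group_by(ex1, grp=[1, 2, 3, 4], op_cols=[8, 9], ops=["collapse", "mean"])

# Made-up (library, ISIZE) pairs, already sorted by library as `sort -k1,1` would do.
isizes = [["RG:Z:libA", "300"], ["RG:Z:libA", "340"], ["RG:Z:libB", "400"]]
per_library = group_by(isizes, grp=[1], op_cols=[2, 2], ops=["mean", "stdev"])

print(ex1_grouped)
print(per_library)
```

Because the grouping only merges consecutive rows with the same key, unsorted input would yield several rows per library, which is why the pipeline above sorts before calling groupBy.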



    • #3
      Thanks for the amazing program, I've been looking for such a program for a very long time!

      One question, will bedtools perform faster (esp. coverageBed or intersectBed), if the input BED or BAM files are sorted or does it matter?



      • #4
        Originally posted by Yilong Li:
        Thanks for the amazing program, I've been looking for such a program for a very long time!

        One question, will bedtools perform faster (esp. coverageBed or intersectBed), if the input BED or BAM files are sorted or does it matter?

        Currently, sorting makes no difference for intersect or coverage.



        • #5
          Hi Aaron,

          Thank you very much for your program. I am starting to use it, and it seems very well documented and quick to learn.

          I have one major observation. When you download a GFF file from NCBI Genome, for example, the first feature is called "source" and spans the whole chromosome, in a line like:

          NC_008405.1 RefSeq source 1 27566993

          This causes intersectBed to report an overlap between every short read and this feature, when the intersections should only be with features such as "gene", "exon", etc.

          To avoid this problem, I need to edit the GFF file and remove this "source" feature.

          I hope you can have a look at this issue and improve your fantastic BEDTools even more.

          Cheers,

          Adriano
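The manual edit described in this post can be scripted. Below is a minimal Python sketch of one possible workaround, not a BEDTools feature: the function name `drop_feature_type` is hypothetical, the file is assumed to be tab-delimited GFF with the feature type in the third column, and the sample lines (including the "gene" row and the columns after the end coordinate) are made up for illustration.

```python
from typing import Iterable, Iterator


def drop_feature_type(lines: Iterable[str], unwanted: str = "source") -> Iterator[str]:
    """Yield GFF lines, skipping records whose third column equals `unwanted`."""
    for line in lines:
        if line.startswith("#"):
            yield line  # keep header and comment lines untouched
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3 and fields[2] == unwanted:
            continue  # whole-chromosome "source" record: drop it
        yield line


# Made-up sample; only the first five columns of the "source" row come from the post.
gff = [
    "##gff-version 3\n",
    "NC_008405.1\tRefSeq\tsource\t1\t27566993\t.\t+\t.\tID=id0\n",
    "NC_008405.1\tRefSeq\tgene\t1200\t3400\t.\t+\t.\tID=gene0\n",
]

kept = list(drop_feature_type(gff))
print("".join(kept), end="")
```

The filtered lines can then be written to a new file and passed to intersectBed in place of the original GFF.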
