  • BEDTools v2.9 - new tools/features

    Hi all,
    I just posted Version 2.9.0. The details of the release are below. Highlights include a new unionBedGraphs tool, a "per-base" coverage option for coverageBed, a "distance" option for closestBed, and multi-column operations for groupBy.



    Best,
    Aaron

    === New tools ===
    1. unionBedGraphs. This is a powerful new tool contributed by Assaf Gordon from CSHL. It will combine/merge multiple BEDGRAPH files into a single file, thus allowing comparisons of coverage (or any text-value) across multiple samples. The example below illustrates how to compare coverage across three different BEDGRAPH files.
    Code:
     $ cat 1.bg
     chr1	1000	1500	10
     chr1	2000	2100	20
    
     $ cat 2.bg
     chr1	900	1600	60
     chr1	1700	2050	50
    
     $ cat 3.bg
     chr1	1980	2070	80
     chr1	2090	2100	20
    
     $ unionBedGraphs -header -i 1.bg 2.bg 3.bg -names WT-1 WT-2 KO-1
     chrom	start	end	WT-1	WT-2	KO-1
     chr1	900	1000	0	60	0
     chr1	1000	1500	10	60	0
     chr1	1500	1600	0	60	0
     chr1	1700	1980	0	50	0
     chr1	1980	2000	0	50	80
     chr1	2000	2050	20	50	80
     chr1	2050	2070	20	0	80
     chr1	2070	2090	20	0	0
     chr1	2090	2100	20	0	20
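
For readers who want to see the logic behind the output above, here is a minimal Python sketch. It is not the tool's actual implementation. It assumes the intervals within each input do not overlap one another, and the names `union_bedgraphs` and `filler` are hypothetical. Every start and end coordinate becomes a breakpoint, and each segment between consecutive breakpoints is reported with one value per input.

```python
from typing import Dict, List, Set, Tuple

Interval = Tuple[str, int, int, str]  # chrom, start, end, value (kept as text)


def union_bedgraphs(files: List[List[Interval]], filler: str = "0") -> List[tuple]:
    """Split covered regions at every interval boundary; report one value per input."""
    # Collect every start/end coordinate per chromosome.
    breakpoints: Dict[str, Set[int]] = {}
    for intervals in files:
        for chrom, start, end, _ in intervals:
            breakpoints.setdefault(chrom, set()).update((start, end))

    rows = []
    for chrom in sorted(breakpoints):
        points = sorted(breakpoints[chrom])
        for seg_start, seg_end in zip(points, points[1:]):
            values = []
            covered = False
            for intervals in files:
                value = filler  # used when this input has no interval here
                for c, start, end, v in intervals:
                    if c == chrom and start <= seg_start and seg_end <= end:
                        value = v
                        covered = True
                        break
                values.append(value)
            if covered:  # skip gaps that no input covers
                rows.append((chrom, seg_start, seg_end, *values))
    return rows


# The three example files from the post, in the order 1.bg, 2.bg, 3.bg.
bg1 = [("chr1", 1000, 1500, "10"), ("chr1", 2000, 2100, "20")]
bg2 = [("chr1", 900, 1600, "60"), ("chr1", 1700, 2050, "50")]
bg3 = [("chr1", 1980, 2070, "80"), ("chr1", 2090, 2100, "20")]

for row in union_bedgraphs([bg1, bg2, bg3]):
    print("\t".join(map(str, row)))
```

The region 1600-1700 is absent from the result because none of the three inputs covers it.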

    === New features ===

    1. The "groupBy" tool now allows one to operate on multiple columns for each group. For example:
    Code:
    $ cat ex1.out
    chr1	10	20	A	chr1	15	25	B.1	1000
    chr1	10	20	A	chr1	25	35	B.2	10000
    
    $ groupBy -i ex1.out -grp 1,2,3,4 -opCols 8,9 -ops collapse,mean
    chr1	10	20	A	B.1,B.2,	5500
    2. New "distance feature" (-d) added to closestBed by Erik Arner. In addition to finding the closest feature to each feature in A, the -d option will report the distance to the closest feature in B. Overlapping features have a distance of 0.
    3. New "per base depth feature" (-d) added to coverageBed. This reports the depth at each base (positions are 1-based) of each feature in file B, computed from the features in file A that cover it. For example, this could report the per-base depth of sequencing reads (-a) across each capture target (-b).
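
The two -d options can be illustrated with a rough Python sketch. It assumes BED-style coordinates (0-based start, exclusive end), and the function names are hypothetical. The only distance rule stated above is that overlapping features have a distance of 0, so the non-zero distances here are simply the gap in bases. Adjacent features also come out as 0 in this sketch, which may not match what closestBed reports.

```python
from typing import List, Tuple

Feature = Tuple[str, int, int]  # chrom, start (0-based), end (exclusive)


def distance(a: Feature, b: Feature) -> int:
    """Gap in bases between two features on one chromosome; 0 if they overlap."""
    if a[0] != b[0]:
        raise ValueError("features are on different chromosomes")
    gap = max(a[1], b[1]) - min(a[2], b[2])
    return max(gap, 0)  # a negative gap means the features overlap


def per_base_depth(target: Feature, reads: List[Feature]) -> List[Tuple[int, int]]:
    """Return (1-based position within the target, depth) for every target base."""
    chrom, t_start, t_end = target
    depths = [0] * (t_end - t_start)
    for r_chrom, r_start, r_end in reads:
        if r_chrom != chrom:
            continue
        # Count only the part of the read that falls inside the target.
        for pos in range(max(r_start, t_start), min(r_end, t_end)):
            depths[pos - t_start] += 1
    return [(i + 1, depth) for i, depth in enumerate(depths)]


target = ("chr1", 0, 5)  # stands in for a capture target (-b)
reads = [("chr1", 0, 3), ("chr1", 2, 10), ("chr2", 0, 5)]  # stand in for reads (-a)
for position, depth in per_base_depth(target, reads):
    print(position, depth)
```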

    Best,
    Aaron
    Last edited by quinlana; 08-16-2010, 01:46 PM. Reason: superfluous

  • #2
    Hi all,
    I've received a few emails expressing confusion over the utility and limitations of the new "groupBy" tool. It is not limited to processing output from BEDTools: it will work on any tab-delimited file or stream. To illustrate this and to fulfill requests for additional examples, the command below computes the mean and standard deviation of the insert size (ISIZE) for each sequencing library in a BAM file containing multiple libraries. In the interest of clarity, this example assumes that each read records the library from which it came in the read group (RG) tag and that this tag is the 12th column in the SAM output.

    I hope this helps and I apologize that things weren't expressed with more clarity earlier.

    Aaron
    Code:
    ##########################################################################
    # Goal: Compute the mean and stdev of the insert size (ISIZE) for each
    #       sequencing library (RG tag)
    # Steps:
    # Line 1 (samtools) : extract all properly-paired reads
    # Line 2 (awk):       print the RG/library and ISIZE (positive ISIZE only)
    # Line 3 (sort):      sort the output by RG/library
    # Line 4 (groupBy):   compute the mean & stdev of ISIZE for each library
    ##########################################################################
    $ samtools view -f 0x2 aln.multipleLibraries.bam | \
        awk '{if ($9>0) {print $12"\t"$9}}' | \
        sort -k1,1 | \
        groupBy -i stdin -grp 1 -opCols 2,2 -ops mean,stdev
    
    # library	mean		stdev
    RG:Z:libA	319.5959	32.86841
    RG:Z:libB	389.8465	32.60053
    RG:Z:libC	329.1906	32.86142
    RG:Z:libD	318.8107	33.33372
    RG:Z:libE	359.0431	33.34611
    RG:Z:libF	320.4461	32.79852
    RG:Z:libG	399.0043	32.98773
    RG:Z:libH	329.6738	33.15160
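
To make the grouping step concrete, here is a small Python sketch of what a groupBy-style summary over a sorted, tab-delimited stream might look like. It is an illustration, not the tool's code. The function name `group_by` and the `OPS` table are hypothetical, and the sample standard deviation is an assumption, since this post does not say which estimator groupBy uses.

```python
import statistics
from itertools import groupby
from typing import Callable, Dict, List, Sequence

# Hypothetical table of supported operations; each takes a column's values as text.
OPS: Dict[str, Callable[[List[str]], object]] = {
    "collapse": lambda vals: ",".join(vals),
    "mean": lambda vals: statistics.mean([float(v) for v in vals]),
    # Sample standard deviation (assumption); needs at least two values per group.
    "stdev": lambda vals: statistics.stdev([float(v) for v in vals]),
}


def group_by(rows: Sequence[Sequence[str]], grp_cols: Sequence[int],
             op_cols: Sequence[int], ops: Sequence[str]) -> List[list]:
    """Summarize rows that are already sorted by the grouping columns.

    Column numbers are 1-based, mirroring -grp and -opCols.
    """
    def key(row: Sequence[str]) -> tuple:
        return tuple(row[c - 1] for c in grp_cols)

    out = []
    for group_key, members in groupby(rows, key=key):
        members = list(members)
        summaries = [OPS[op]([m[col - 1] for m in members])
                     for col, op in zip(op_cols, ops)]
        out.append(list(group_key) + summaries)
    return out


# Toy (library, ISIZE) pairs standing in for the awk | sort output.
pairs = [["libA", "300"], ["libA", "340"], ["libB", "400"], ["libB", "380"]]
for library, mean, stdev in group_by(pairs, [1], [2, 2], ["mean", "stdev"]):
    print(f"{library}\t{mean:.4f}\t{stdev:.4f}")
```

As with the sort step in the pipeline, the input must be sorted by the grouping column first, because `itertools.groupby` only merges adjacent rows with the same key.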

    • #3
      Thanks for the amazing program, I've been looking for such a program for a very long time!

      One question: will BEDTools run faster (especially coverageBed or intersectBed) if the input BED or BAM files are sorted, or does it not matter?

      • #4
        Originally posted by Yilong Li
        Thanks for the amazing program, I've been looking for such a program for a very long time!

        One question: will BEDTools run faster (especially coverageBed or intersectBed) if the input BED or BAM files are sorted, or does it not matter?

        Currently, sorting makes no difference for intersectBed or coverageBed.

        • #5
          Hi Aaron,

          Thank you very much for your program. I am starting to use it, and it seems very well documented and quick to get used to.

          I have one major observation. When you download a GFF file from NCBI Genome, for example, the first feature is called "source" and spans the whole chromosome, in a line like:

          NC_008405.1 RefSeq source 1 27566993

          and this causes intersectBed to intersect every short read with this feature. The intersections should really be only with features such as "gene", "exon", etc.

          To avoid this problem, I need to edit the GFF file to remove this "source" feature.
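
That manual edit could be scripted. The sketch below is only an assumption about how one might do it: it treats the feature type as the third tab-separated column, as in standard GFF, and the function name `drop_source_features` is hypothetical.

```python
from typing import Iterable, Iterator


def drop_source_features(lines: Iterable[str], skip_type: str = "source") -> Iterator[str]:
    """Yield GFF lines whose feature type (3rd tab-separated column) is not skip_type."""
    for line in lines:
        if line.startswith("#"):  # keep header and comment lines untouched
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3 and fields[2] == skip_type:
            continue  # drop the whole-chromosome "source" record
        yield line


example = [
    "##gff-version 3\n",
    "NC_008405.1\tRefSeq\tsource\t1\t27566993\t.\t+\t.\tID=id0\n",
    "NC_008405.1\tRefSeq\tgene\t100\t200\t.\t+\t.\tID=gene0\n",
]
for kept in drop_source_features(example):
    print(kept, end="")
```

The filtered lines can be written to a new GFF file, which is then used in place of the original download.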

          I hope you can have a look at this issue and improve your fantastic BEDTools even more.

          Cheers,

          Adriano
