Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • BEDTools v2.8: VCF, split-alignments, new tools

    Hi all,

    I just posted the latest version of BEDTools (Version 2.8). The main highlights of this release are 1) native support for VCF (v4.0), 2) proper support for both split BAM alignments and blocked BED features, 3) a new and very useful tool called groupBy, 4) a new tool called bed12ToBed6, and several bug fixes. The gory details are below.

    Expect some additional minor releases with further updates this summer.

    Cheers,
    Aaron

    Download: http://bedtools.googlecode.com/files....v2.8.0.tar.gz


    Detailed release notes:

    1. Proper support for "split" BAM alignments and "blocked" BED (aka BED12) features. By using the "-split" option, intersectBed, coverageBed, genomeCoverageBed, and bamToBed will now correctly compute overlaps/coverage solely for the "split" portions of BAM alignments or the "blocks" of BED12 features such as genes.

    2. Added native support for the 1000 Genome Variant Calling Format (VCF) version 4.0.

    3. New bed12ToBed6 tool. This tool will convert each block of a BED12 feature into discrete BED6 features.

    4. Useful new groupBy tool. This new tool mimics the "groupBy" clause in SQL. Given a file or stream that is sorted by the appropriate "grouping columns", groupBy will compute summary statistics on another column in the file or stream. This will work with output from all BEDTools as well as any other tab-delimited file or stream. Example summary operations include: sum, mean, stdev, min, max, etc. Please see the help for the tools for examples.

    5. Improvements to genomeCoverageBed. Most notably, beyond the several efficiency and organizational changes made, it now includes a "-strand" option which allows one to specify that coverage should only be computed on either the "+" or the "-" strand.

    6. Fixed a bug in closestBed which incorrectly reported "null" overlaps for features that did not have a closest feature in the B file.

    7. Fixed a careless bug in slopBed that caused an infinite loop when the "-excl" option was used.

    8. Reduced memory consumption by ca. 15% and run time by ca. 10% for most tools.

    9. Several code-cleanliness updates such as templated functions and common tyedefs.

    10. Tweaked the genome binning approach such that 16kb bins are the most granular.

  • #2
    Hi all,

    I just updated the latest release to include a new tool called bedToIgv that will create an IGV (v1.5 and later) "batch" script from any BED/GFF/VCF file or stream. The resulting batch scripts can be run using IGV to automagically take "snapshots" of as many loci as are in your input file (time permitting). Batch files are run in IGV via File->"Run Batch Script".

    Download: http://bedtools.googlecode.com/files....v2.8.1.tar.gz

    Best,
    Aaron

    Comment


    • #3
      bedToIgv looks like an outstanding tool!

      Many thanks!

      Comment


      • #4
        Detecting insertions from dbSNP

        Hi Aaron,

        BEDTools is just the thing I need for my research. After trying to develop my own overlap script and dealing with the UCSC database and tools locally, I really appreciate your work. However, I have to bug you because insertions from dbSNP are giving me trouble: They seem to be ignored.

        Example SNPs:
        Code:
        #chrom	chromStart	chromEnd	name
        chr1	10430	10431	fake_SNP
        chr1	10433	10433	rs56289060_insertion
        chr1	10491	10492	rs55998931
        chr1	10518	10519	rs62636508
        chr1	10518	10519	same_coordinate
        Example genes:
        Code:
        # genes
        chr1	10430	10440	Gene_with_Insertion_and_fake_SNP
        chr1	10490	10520	Gene_with_3SNPs
        intersectBed -a SNPs.txt -b genes.txt -wo results in not reporting the insertion:
        Code:
        chr1	10430	10431	fake_SNP	chr1	10430	10440	Gene_with_Insertion_and_fake_SNP	1
        chr1	10491	10492	rs55998931	chr1	10490	10520	Gene_with_3SNPs	1
        chr1	10518	10519	rs62636508	chr1	10490	10520	Gene_with_3SNPs	1
        chr1	10518	10519	same_coordinate	chr1	10490	10520	Gene_with_3SNPs	1
        It seems that the problem is that for insertions, the start coordinate is the same as the end coordinate. Is there any trick you could recommend or should I just go through my SNP file and subtract 1 from all the insertion starts?

        Thank you

        Barbara

        Comment


        • #5
          split working correctly?

          Hi,
          I am wondering if coverage with split is working correctly, also if b contains the splits?
          bed12b.bed contains a split 1000-1010 & 1090-1100 and is reported as being covered by 30 basepairs and to have a length of 100.

          also: a -bbam not only -abam would be nice to have for coverage.

          best,
          ido

          Code:
          coverageBed -split -a bed12a.bed -b bed12b.bed
          chr1    1000    1100    test    0       +       1000    1100    255,0,0 2       10,10   0,90    1       30      100     0.3000000
          coverageBed -split -b bed12a.bed -a bed12b.bed
          chr1    1040    1070    test    0       +       1040    1070    0,0,255 1       30      0       0       0       30      0.0000000
          cat bed12a.bed
          chr1    1040    1070    test    0       +       1040    1070    0,0,255 1       30      0
          cat bed12b.bed
          chr1    1000    1100    test    0       +       1000    1100    255,0,0 2       10,10   0,90

          Comment


          • #6
            Originally posted by epigen View Post
            Is there any trick you could recommend or should I just go through my SNP file and subtract 1 from all the insertion starts?
            awk '/insertion/{print $1"\t"$2-1"\t"$3"\t"$4};!/insertion/' test.bed

            will subtract 1 from the start position of any SNP listed as insertion and leave other lines alone.

            awk '$2==$3{print $1"\t"$2-1"\t"$3"\t"$4};$2!=$3' test.bed

            Will subtract 1 from the start position of any line where the start and end positions match.

            Comment


            • #7
              Originally posted by idot View Post
              Hi,
              I am wondering if coverage with split is working correctly, also if b contains the splits?
              bed12b.bed contains a split 1000-1010 & 1090-1100 and is reported as being covered by 30 basepairs and to have a length of 100.

              also: a -bbam not only -abam would be nice to have for coverage.

              best,
              ido

              Code:
              coverageBed -split -a bed12a.bed -b bed12b.bed
              chr1    1000    1100    test    0       +       1000    1100    255,0,0 2       10,10   0,90    1       30      100     0.3000000
              coverageBed -split -b bed12a.bed -a bed12b.bed
              chr1    1040    1070    test    0       +       1040    1070    0,0,255 1       30      0       0       0       30      0.0000000
              cat bed12a.bed
              chr1    1040    1070    test    0       +       1040    1070    0,0,255 1       30      0
              cat bed12b.bed
              chr1    1000    1100    test    0       +       1000    1100    255,0,0 2       10,10   0,90
              Currently, -split applies to only the "A" file. Adding both is possible but I have not yet implemented it. It is on my mind, however...

              -abam and -bam will be much more difficult as BEDTools loads the "B" file into memory which could be a bad idea for many BAM files.

              Thanks for your suggestions.
              Aaron
              Thanks fo

              Comment


              • #8
                Hi all,

                I just posted version 2.8.2, which addresses the bugs that have been reported since the release of 2.8.0. I've also updated the manual to reflect the new VCF support, new tools, and the new support for "spliced / split" alignments.

                Source: http://bedtools.googlecode.com/files....v2.8.2.tar.gz
                Manual: http://bedtools.googlecode.com/files...-Manual.v3.pdf

                Best,
                Aaron

                ==================
                Bug fixes (the dirty laundry):
                ==================
                1. Fixed a clutzy bug in bedFile.h preventing GFF strands from being read properly.
                2. Fixed a bug in intersectBed that occasionally caused spurious overlaps between BAM alignments and BED features.
                3. Fixed bug in intersectBed causing -r to not report the same result when files are swapped.
                4. Added checks to groupBy to prevent the selection of improper opCols and groups.
                5. Fixed various compilation issues for newer GCC versions, esp. for groupBy, bedToBam, and bedToIgv.
                6. Updated the usage statements to reflect bed/gff/vcf support.
                7. Added new fileType functions for auto-detecting gzipped or regular files. Thanks to Assaf Gordon.

                Comment


                • #9
                  I should also mention that in addition to the usage example in the documentation and on the Google Code site, there is a nascent collection of usage examples posted by users at:



                  You may find these to be useful.
                  Aaron

                  Comment


                  • #10
                    Hi Aaron, I love bedtools and thanks for developing it. I want to see the per base coverage profile of a few housekeeping genes between my replicates as I suspect an experimental error. I was wondering if there is a way to normalize the coverage using bedtools? Right now, I cannot compare coverages between samples, because they are sequenced to different depths. Please let me know if you know of any other way I can answer this simple but critical question.

                    Thanks!

                    Comment


                    • #11
                      Originally posted by thinkRNA View Post
                      Hi Aaron, I love bedtools and thanks for developing it. I want to see the per base coverage profile of a few housekeeping genes between my replicates as I suspect an experimental error. I was wondering if there is a way to normalize the coverage using bedtools? Right now, I cannot compare coverages between samples, because they are sequenced to different depths. Please let me know if you know of any other way I can answer this simple but critical question.

                      Thanks!
                      Hi,
                      There is no tool for directly normalizing coverage. However, assuming I correctly understand your issue, one thing I could think of would be to use genomeCoverageBed to get the per base coverage (-d) option for each replicate. You could then compute the mean or median per base coverage for the replicate with the deeper coverage and normalize the depth in the shallower replicate accordingly. For example, given housekeeping gene A, you might find the median coverage in the deeper replicate is 10, while the analogous value in the shallower replicate is 1. You could then multiple the depths in the shallower sample by 10 and do your comparison.

                      This approach will be slow because you must use genomeCoverageBed and then cull out the relevant regions. I am hoping to include a "per base" depth option for coverageBed in the next release. This will allow you to quickly grab per base coverage profiles for subsets of the genome.

                      Best,
                      Aaron

                      Comment


                      • #12
                        Alternate way get normalized per base coverage

                        Originally posted by quinlana View Post
                        Hi,
                        There is no tool for directly normalizing coverage. However, assuming I correctly understand your issue, one thing I could think of would be to use genomeCoverageBed to get the per base coverage (-d) option for each replicate. You could then compute the mean or median per base coverage for the replicate with the deeper coverage and normalize the depth in the shallower replicate accordingly. For example, given housekeeping gene A, you might find the median coverage in the deeper replicate is 10, while the analogous value in the shallower replicate is 1. You could then multiple the depths in the shallower sample by 10 and do your comparison.

                        This approach will be slow because you must use genomeCoverageBed and then cull out the relevant regions. I am hoping to include a "per base" depth option for coverageBed in the next release. This will allow you to quickly grab per base coverage profiles for subsets of the genome.

                        Best,
                        Aaron
                        Hi Aaron,
                        Thanks for your prompt reply. Unfortunately genomeCoverageBed -d option on a mouse genome generates too big a file . I came up with a round about way, do you think it will work?
                        genomeCoverageBed -bg -ibam B1.bam -g mm9.genome > genomeCoverage.out

                        use intersect bed to get the housekeeping gene's coverage where GAPDH.bed contains start and end position of the gene
                        intersectBed -wb -a genomeCoverage.out -b GAPDH.bed > onlyGAPDH.txt

                        [Do I need to use -split here?]

                        loop thru each base in onlyGAPDH.txt to break it into per base coverage and multiply each base's coverage by (1000000 / (total bases covered)).

                        I'll have to write another script to get total bases covered for each sample.

                        Thanks so much,
                        Priyam

                        Comment


                        • #13
                          Hi all,

                          I know this is a old thread, but I thought my question was adequate here.

                          I'm using bedtools coverage and bedtools multicov to get read counts in intervals.
                          I'm using alignments from tophat (containing split alignments) and bwa (no split).

                          1) Does "bedtools multicov" handle -split like bedtools coverage?

                          It doesn't seem so, as read counts from tophat seem quite exaggerated. I also tried another splice-aware aligner (osa), and I had the same, when comparing to bwa alignments.

                          Then I tried using "bedtools coverage -split -counts", which should handle this properly, and still bwa counts are much more in agreement with what I see in the IGV browser, even if I look at the tophat alignments! Again, when I use edgeR or baySeq differential expression, bwa results make much more sense...

                          2) Does anyone know of issues with split alignments in bedtools?

                          Any thoughts welcome, or suggestions of other tools to do the job?
                          I tried htseq-counts, but I have to convert the bam to a sam, and my bed to a gff, and still I get errors. Finally, the R package easyrnaseq is virtually useless, as it starts exploding the memory (and I have a bit of memory available!).

                          Thanks,
                          Daniel

                          PS: Thanks for bedtools! It's really a great tool!

                          Comment

                          Latest Articles

                          Collapse

                          • seqadmin
                            Essential Discoveries and Tools in Epitranscriptomics
                            by seqadmin




                            The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                            04-22-2024, 07:01 AM
                          • seqadmin
                            Current Approaches to Protein Sequencing
                            by seqadmin


                            Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                            04-04-2024, 04:25 PM

                          ad_right_rmr

                          Collapse

                          News

                          Collapse

                          Topics Statistics Last Post
                          Started by seqadmin, 04-25-2024, 11:49 AM
                          0 responses
                          19 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 04-24-2024, 08:47 AM
                          0 responses
                          18 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 04-11-2024, 12:08 PM
                          0 responses
                          62 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 04-10-2024, 10:19 PM
                          0 responses
                          60 views
                          0 likes
                          Last Post seqadmin  
                          Working...
                          X