The bedmap application takes three inputs:
- Operators
- Reference BED file (e.g., your intervals)
- Map BED file (e.g., your regions denoting "UTR" elements)
One or more operations are applied on elements of the Map file that overlap elements in the Reference file by one or more bases. These operations can calculate a numerical result, or summarize ID, range or other information about mapped elements.
I'll show you below how you can use bedmap with the --echo-map-id operator to grab UTR IDs.
As an example, let's say your intervals are the following regions, adjacent 2000 bp windows that are stored in the tab-delimited file Intervals.bed:
$ more Intervals.bed
chr1 10000 12000
chr1 12000 14000
chr1 14000 16000
...
Let's say your UTRs are stored in a file called UTRs.bed.
(If you don't have this yet, you can grab these for the entire genome from the UCSC Genome Browser's table browser, through the Genes and Gene Prediction Tracks group and the knownGene table. Change the output format to BED and click get output, selecting the desired UTR category. Save the resulting BED file to your file system.)
For hg19, here are a few of the first 5' UTRs over the genome, taken from the UCSC Genome Browser:
$ more UTRs.bed
chr1 11873 12189 uc010nxq.1_utr5_0_0_chr1_11874_f 0 +
chr1 14361 14829 uc009vis.3_utr5_0_0_chr1_14362_r 0 -
chr1 14969 15038 uc009vis.3_utr5_1_0_chr1_14970_r 0 -
chr1 15795 15942 uc009vis.3_utr5_2_0_chr1_15796_r 0 -
chr1 16606 16765 uc009vis.3_utr5_3_0_chr1_16607_r 0 -
...
We can find the IDs of the UTRs that overlap our Intervals.bed elements with the following command:
$ bedmap --echo --echo-map-id --delim '\t' Intervals.bed UTRs.bed > UTR_IDs_overlapping_Intervals.bed
Let's take a look at the results:
$ more UTR_IDs_overlapping_Intervals.bed
chr1 10000 12000 uc010nxq.1_utr5_0_0_chr1_11874_f
chr1 12000 14000
chr1 14000 16000 uc009vis.3_utr5_0_0_chr1_14362_r;uc009vis.3_utr5_1_0_chr1_14970_r;uc009vis.3_utr5_2_0_chr1_15796_r
...
In other words, the UTR uc010nxq.1_utr5_0_0_chr1_11874_f overlaps interval chr1:10000-12000, no UTRs overlap the interval chr1:12000-14000, three UTRs overlap the interval chr1:14000-16000 and so on...
Let's explain what the command did. We used three operators with this application of bedmap:
- --echo
- --echo-map-id
- --delim
The --echo operator prints each of the interval elements.
The --echo-map-id operator prints each of the UTR IDs that overlap the specified interval element. If there are multiple IDs, they are delimited with a semi-colon.
The --delim operator separates the --echo and --echo-map-id results with a tab character. This allows the output to remain a relaxed, three-column UCSC BED file, so you can process this with any downstream tools that take in BED data (like bedmap, bedops, or other BEDOPS utilities).
You don't have to just grab IDs from the UTR file. If you want the entire UTR element, use --echo-map. If you want just the UTR regions, use --echo-map-range.
There are several other --echo-map-* operators which summarize different information from mapped elements. Check out the Echo section of the bedmap documentation for more detail.
Note: BEDOPS utilities run very fast and use very little memory, compared with alternative toolkits which do not yet use our design optimizations. This is because we require that BED inputs are sorted (at most, only once), which adds structure that we can take advantage of. Data you get through the UCSC Genome Browser will be sorted. However, you might need to sort data in your intervals file, if you do not know its sort status. This is easy and relatively quick, making use of the BEDOPS sort-bed tool:
$ sort-bed Intervals.bed > Sorted_Intervals.bed
Again, this only needs to be done once. BEDOPS tools read and write sorted data, so any results from, say, bedmap do not need any further sorting before using results with downstream tools.
Leave a comment: