Unconfigured Ad

**baritone** · 07-20-2015, 09:40 PM

Why 50

Originally posted by blakeoft View Post

You can use bamToBed from bedtools to convert from bam to bed format as you have mentioned. It's quite easy to adjust the start and end site of reads in bed format. I'm not an awk master, so there's probably a better way to do this, but it should work:

Code:

awk 'BEGIN {OFS = "\t"} ; {if ($6 == "+") print $1, $2 + 50, $3 + 50, $4, $5, $6; else print $1, $2 - 50, $3 - 50, $4, $5, $6}' alignment.bed > alignment_shifted.bed

That will shift the start and end forward by fifty base pairs if the strand is +, backward if the strand is -, and print fields 1 through 6. If you need more fields, just tack them on the end.

I don't think 50 is the number to shift. In Jason's paper, they did "all reads aligning to the + strand were offset by +4 bp, and all reads aligning to the – strand were offset −5 bp". In fact, whether you shift or not, the distribution of reads will not significantly change.

**cloverliang** · 08-27-2015, 07:03 AM

Originally posted by dariober View Post

I was going to say this myself. I treated ATAC-Seq libraries pretty much like ChIP-Seq libraries.

Trim adapters -> Align (bwa mem) -> Remove reads in blacklist regions if applicable (https://sites.google.com/site/anshul...cts/blacklists) -> Remove reads with MAPQ < 10 or there about -> Remove duplicates (debatable, but I tend to see better results w/o dups) -> call peaks (macs2, with mitochondrial reads removed, see below)

We find 10s to 100s of thousands of peaks, quite sharp, often mapping to transcription start sites. A feature we find in our data and in the published one is the large number of reads mapping to the mitochondrial genome (up to 40-60%), but it can vary a lot. So I prefer to remove these reads before peak calling.

Hi Dariober,

I'm also analyzing ATAC-seq data. I did in a similar way to your procedure:

1. Map to hg19 with bowtie2 (-X 2000).
2. Remove reads with MAPQ < 10 and sort bam file using samtools.
3. Remove duplicates using picard tools.
4. Merge sorted bam of four replicates together.
5. Get cutting position for each fragment and visualize cutting frequency (per basepair) in IGV.

But the peaks I observed are quite noisy. I tried the data from Buenrostro 2013 Nature method paper. I got a good mapping rate (>90%) but a lot of duplicates (60%-70%). After removing duplicates, the per base cutting frequency becomes quite low. I was wondering if you observed a similar pattern for the public data. Would you have some suggestions on how I could improve the signal? Thank you!!

**baritone** · 08-27-2015, 09:46 AM

Do not remove duplicates

Originally posted by cloverliang View Post

Hi Dariober,

I'm also analyzing ATAC-seq data. I did in a similar way to your procedure:

1. Map to hg19 with bowtie2 (-X 2000).
2. Remove reads with MAPQ < 10 and sort bam file using samtools.
3. Remove duplicates using picard tools.
4. Merge sorted bam of four replicates together.
5. Get cutting position for each fragment and visualize cutting frequency (per basepair) in IGV.

But the peaks I observed are quite noisy. I tried the data from Buenrostro 2013 Nature method paper. I got a good mapping rate (>90%) but a lot of duplicates (60%-70%). After removing duplicates, the per base cutting frequency becomes quite low. I was wondering if you observed a similar pattern for the public data. Would you have some suggestions on how I could improve the signal? Thank you!!

According to my experience, removing duplicates and trimming the reads to 1 bp (i.e, tracking the integration site) will both decrease the resolution. Removing duplicates will force the coverage to be no higher than the read length, thus compromise good separation of peaks and non-peaks ; trimming the reads to 1bp is equivalent to giving up Kernel functions, thus the peaks will look very narrow and nasty. If you do want to trim to 1bp, you'd better separate the track into plus and minus strands so that you can easily tell which peak is real and which is not based on tag shifting explained here http://www.genomebiology.com/content/9/9/R137

**fanli** · 08-27-2015, 12:23 PM

Originally posted by resara View Post

Thanks blakeoft, checking for negative indices and removing them is a good idea.

I understand what you mean, and in most cases you can disregard the 0/1-based positioning, as it won't affect your shift (it will still shift it by 50bp).

I was thinking that since bam is 0-based, if the conversion to bed changes the position to 1-based, that's a 1nt difference in position, which could be an issue if you're planning to do footprinting with the alignments.

edit: sorry, looks like bed is 0-based, so should be fine

Just fyi, SAM/BAM is 1-based

**cloverliang** · 08-28-2015, 06:28 AM

Originally posted by baritone View Post

According to my experience, removing duplicates and trimming the reads to 1 bp (i.e, tracking the integration site) will both decrease the resolution. Removing duplicates will force the coverage to be no higher than the read length, thus compromise good separation of peaks and non-peaks ; trimming the reads to 1bp is equivalent to giving up Kernel functions, thus the peaks will look very narrow and nasty. If you do want to trim to 1bp, you'd better separate the track into plus and minus strands so that you can easily tell which peak is real and which is not based on tag shifting explained here http://www.genomebiology.com/content/9/9/R137

Thank you baritone!! I'm looking for TF binding footprints, which presumably have less transposase cuts compared to its flanking region. So I wanted to trim reads to the 1bp cutting position. The peaks I got are really shadow (as in the first track of attached figure). I was wondering if that's common pattern people observed in ATAC-seq data.

Attached Files

ATAC-seq Buenrostro 2013 data.png (47.5 KB, 406 views)

**Pef** · 09-28-2015, 07:34 AM

Dear all,

I am currently analyzing my ATAC-seq data and I am wondering what is the best way to adjust the coordinates of alignments.

I am aware of the solution proposed in this thread by blakeoft, which consists in converting bam file to bed, and to shift reads using awk.

Even if this approach is efficient and does the job, I was wondering if anyone knows about an easy way to shift coordinates on bam directly, and to conserve read sequences (and qualities, at least for the bases that belong to the original sequence before shifting), as well as information about paires.

Thanks :-)

**Azazel** · 02-14-2016, 11:43 PM

Originally posted by baritone View Post

I don't think 50 is the number to shift. In Jason's paper, they did "all reads aligning to the + strand were offset by +4 bp, and all reads aligning to the – strand were offset −5 bp". In fact, whether you shift or not, the distribution of reads will not significantly change.

This. I do not understand the reasoning behind blakeoft's suggestion to simply shift everything by 50 into one direction. What is this supposed to achieve? Also, as baritone pointed out the original paper has a very different explanation of how they fixed the coordinates.

**LacquerHead** · 05-04-2016, 12:15 PM

Since BED files are a combination of +/- strands how can I use this AWK script for the individual strands?

Originally posted by blakeoft View Post

You can use bamToBed from bedtools to convert from bam to bed format as you have mentioned. It's quite easy to adjust the start and end site of reads in bed format. I'm not an awk master, so there's probably a better way to do this, but it should work:

Code:

awk 'BEGIN {OFS = "\t"} ; {if ($6 == "+") print $1, $2 + 50, $3 + 50, $4, $5, $6; else print $1, $2 - 50, $3 - 50, $4, $5, $6}' alignment.bed > alignment_shifted.bed

That will shift the start and end forward by fifty base pairs if the strand is +, backward if the strand is -, and print fields 1 through 6. If you need more fields, just tack them on the end.

Topics	Statistics	Last Post
Large-Scale Protein Screen Uncovers Hidden Regulators of Alternative Polyadenylation by SEQadmin2 Started by SEQadmin2, Today, 11:10 AM	0 responses 6 views 0 reactions	Last Post by SEQadmin2 Today, 11:10 AM
Whole-Genome Sequencing Traces Faroe Islands Ancestry to a North Atlantic Founder Population by SEQadmin2 Started by SEQadmin2, 06-17-2026, 06:09 AM	0 responses 42 views 0 reactions	Last Post by SEQadmin2 06-17-2026, 06:09 AM
Sequencing the Two-Toed Sloth Genome Reveals Jumping Genes Tied to Its Extreme Metabolism by SEQadmin2 Started by SEQadmin2, 06-09-2026, 11:58 AM	0 responses 102 views 0 reactions	Last Post by SEQadmin2 06-09-2026, 11:58 AM
A New Method Makes Hantavirus Genome Analysis Faster and More Accessible by SEQadmin2 Started by SEQadmin2, 06-05-2026, 10:09 AM	0 responses 124 views 0 reactions	Last Post by SEQadmin2 06-05-2026, 10:09 AM

Unconfigured Ad

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News