Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • baritone
    Junior Member
    • Jun 2015
    • 5

    #16
    Why 50

    Originally posted by blakeoft View Post
    You can use bamToBed from bedtools to convert from bam to bed format as you have mentioned. It's quite easy to adjust the start and end site of reads in bed format. I'm not an awk master, so there's probably a better way to do this, but it should work:

    Code:
    awk 'BEGIN {OFS = "\t"} ; {if ($6 == "+") print $1, $2 + 50, $3 + 50, $4, $5, $6; else print $1, $2 - 50, $3 - 50, $4, $5, $6}' alignment.bed > alignment_shifted.bed
    That will shift the start and end forward by fifty base pairs if the strand is +, backward if the strand is -, and print fields 1 through 6. If you need more fields, just tack them on the end.


    I don't think 50 is the number to shift. In Jason's paper, they did "all reads aligning to the + strand were offset by +4 bp, and all reads aligning to the – strand were offset −5 bp". In fact, whether you shift or not, the distribution of reads will not significantly change.

    Comment

    • cloverliang
      Junior Member
      • Aug 2015
      • 3

      #17
      Originally posted by dariober View Post
      I was going to say this myself. I treated ATAC-Seq libraries pretty much like ChIP-Seq libraries.

      Trim adapters -> Align (bwa mem) -> Remove reads in blacklist regions if applicable (https://sites.google.com/site/anshul...cts/blacklists) -> Remove reads with MAPQ < 10 or there about -> Remove duplicates (debatable, but I tend to see better results w/o dups) -> call peaks (macs2, with mitochondrial reads removed, see below)

      We find 10s to 100s of thousands of peaks, quite sharp, often mapping to transcription start sites. A feature we find in our data and in the published one is the large number of reads mapping to the mitochondrial genome (up to 40-60%), but it can vary a lot. So I prefer to remove these reads before peak calling.
      Hi Dariober,

      I'm also analyzing ATAC-seq data. I did in a similar way to your procedure:

      1. Map to hg19 with bowtie2 (-X 2000).
      2. Remove reads with MAPQ < 10 and sort bam file using samtools.
      3. Remove duplicates using picard tools.
      4. Merge sorted bam of four replicates together.
      5. Get cutting position for each fragment and visualize cutting frequency (per basepair) in IGV.

      But the peaks I observed are quite noisy. I tried the data from Buenrostro 2013 Nature method paper. I got a good mapping rate (>90%) but a lot of duplicates (60%-70%). After removing duplicates, the per base cutting frequency becomes quite low. I was wondering if you observed a similar pattern for the public data. Would you have some suggestions on how I could improve the signal? Thank you!!

      Comment

      • baritone
        Junior Member
        • Jun 2015
        • 5

        #18
        Do not remove duplicates

        Originally posted by cloverliang View Post
        Hi Dariober,

        I'm also analyzing ATAC-seq data. I did in a similar way to your procedure:

        1. Map to hg19 with bowtie2 (-X 2000).
        2. Remove reads with MAPQ < 10 and sort bam file using samtools.
        3. Remove duplicates using picard tools.
        4. Merge sorted bam of four replicates together.
        5. Get cutting position for each fragment and visualize cutting frequency (per basepair) in IGV.

        But the peaks I observed are quite noisy. I tried the data from Buenrostro 2013 Nature method paper. I got a good mapping rate (>90%) but a lot of duplicates (60%-70%). After removing duplicates, the per base cutting frequency becomes quite low. I was wondering if you observed a similar pattern for the public data. Would you have some suggestions on how I could improve the signal? Thank you!!

        According to my experience, removing duplicates and trimming the reads to 1 bp (i.e, tracking the integration site) will both decrease the resolution. Removing duplicates will force the coverage to be no higher than the read length, thus compromise good separation of peaks and non-peaks ; trimming the reads to 1bp is equivalent to giving up Kernel functions, thus the peaks will look very narrow and nasty. If you do want to trim to 1bp, you'd better separate the track into plus and minus strands so that you can easily tell which peak is real and which is not based on tag shifting explained here http://www.genomebiology.com/content/9/9/R137

        Comment

        • fanli
          Senior Member
          • Jul 2014
          • 197

          #19
          Originally posted by resara View Post
          Thanks blakeoft, checking for negative indices and removing them is a good idea.

          I understand what you mean, and in most cases you can disregard the 0/1-based positioning, as it won't affect your shift (it will still shift it by 50bp).

          I was thinking that since bam is 0-based, if the conversion to bed changes the position to 1-based, that's a 1nt difference in position, which could be an issue if you're planning to do footprinting with the alignments.

          edit: sorry, looks like bed is 0-based, so should be fine
          Just fyi, SAM/BAM is 1-based

          Comment

          • cloverliang
            Junior Member
            • Aug 2015
            • 3

            #20
            Originally posted by baritone View Post
            According to my experience, removing duplicates and trimming the reads to 1 bp (i.e, tracking the integration site) will both decrease the resolution. Removing duplicates will force the coverage to be no higher than the read length, thus compromise good separation of peaks and non-peaks ; trimming the reads to 1bp is equivalent to giving up Kernel functions, thus the peaks will look very narrow and nasty. If you do want to trim to 1bp, you'd better separate the track into plus and minus strands so that you can easily tell which peak is real and which is not based on tag shifting explained here http://www.genomebiology.com/content/9/9/R137
            Thank you baritone!! I'm looking for TF binding footprints, which presumably have less transposase cuts compared to its flanking region. So I wanted to trim reads to the 1bp cutting position. The peaks I got are really shadow (as in the first track of attached figure). I was wondering if that's common pattern people observed in ATAC-seq data.
            Attached Files

            Comment

            • Pef
              Junior Member
              • Nov 2013
              • 1

              #21
              Dear all,

              I am currently analyzing my ATAC-seq data and I am wondering what is the best way to adjust the coordinates of alignments.

              I am aware of the solution proposed in this thread by blakeoft, which consists in converting bam file to bed, and to shift reads using awk.

              Even if this approach is efficient and does the job, I was wondering if anyone knows about an easy way to shift coordinates on bam directly, and to conserve read sequences (and qualities, at least for the bases that belong to the original sequence before shifting), as well as information about paires.

              Thanks :-)
              Last edited by Pef; 09-28-2015, 07:39 AM.

              Comment

              • Azazel
                Member
                • Oct 2010
                • 52

                #22
                Originally posted by baritone View Post
                I don't think 50 is the number to shift. In Jason's paper, they did "all reads aligning to the + strand were offset by +4 bp, and all reads aligning to the – strand were offset −5 bp". In fact, whether you shift or not, the distribution of reads will not significantly change.
                This. I do not understand the reasoning behind blakeoft's suggestion to simply shift everything by 50 into one direction. What is this supposed to achieve? Also, as baritone pointed out the original paper has a very different explanation of how they fixed the coordinates.

                Comment

                • LacquerHead
                  Member
                  • Nov 2015
                  • 31

                  #23
                  Since BED files are a combination of +/- strands how can I use this AWK script for the individual strands?


                  Originally posted by blakeoft View Post
                  You can use bamToBed from bedtools to convert from bam to bed format as you have mentioned. It's quite easy to adjust the start and end site of reads in bed format. I'm not an awk master, so there's probably a better way to do this, but it should work:

                  Code:
                  awk 'BEGIN {OFS = "\t"} ; {if ($6 == "+") print $1, $2 + 50, $3 + 50, $4, $5, $6; else print $1, $2 - 50, $3 - 50, $4, $5, $6}' alignment.bed > alignment_shifted.bed
                  That will shift the start and end forward by fifty base pairs if the strand is +, backward if the strand is -, and print fields 1 through 6. If you need more fields, just tack them on the end.

                  Comment

                  Latest Articles

                  Collapse

                  • SEQadmin2
                    Nine Things a Sample Prep Scientist Thinks About Before Sequencing
                    by SEQadmin2


                    I’m not a sequencing expert. I’m a purification scientist who uses NGS to evaluate workflows my group develops. With this perspective, we think about the sample first and the NGS workflow second. The sequencer is an exceptionally honest reporter, but it can only report on what you give it, so whether you get clean, interpretable data from an NGS workflow is largely determined before you begin.

                    Here are nine questions we think about, in roughly the order they matter, before...
                    06-18-2026, 07:11 AM
                  • SEQadmin2
                    From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                    by SEQadmin2


                    Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                    The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                    ...
                    06-02-2026, 10:05 AM

                  ad_right_rmr

                  Collapse

                  News

                  Collapse

                  Topics Statistics Last Post
                  Started by SEQadmin2, Today, 11:10 AM
                  0 responses
                  6 views
                  0 reactions
                  Last Post SEQadmin2  
                  Started by SEQadmin2, 06-17-2026, 06:09 AM
                  0 responses
                  42 views
                  0 reactions
                  Last Post SEQadmin2  
                  Started by SEQadmin2, 06-09-2026, 11:58 AM
                  0 responses
                  102 views
                  0 reactions
                  Last Post SEQadmin2  
                  Started by SEQadmin2, 06-05-2026, 10:09 AM
                  0 responses
                  124 views
                  0 reactions
                  Last Post SEQadmin2  
                  Working...