I'm getting confused trying to sort out how sequence reads (50 bp) relate to the actual ChIP-seq fragments (~250 bp), specifically with regard to the plus/minus strand.
I'm aligning a fastq file of ~20 million single-end reads using Bowtie, then I want to find peaks using SISSRS. The first 5 columns of Bowtie output are described as:
1. Name of read that aligned
2. Reference strand aligned to, + for forward strand, - for reverse
3. Name of reference sequence where alignment occurs, or numeric ID if no name was provided
4. 0-based offset into the forward reference strand where leftmost character of the alignment occurs
5. Read sequence (reverse-complemented if orientation is -)
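To make the column semantics concrete, here is a minimal Python sketch of how I read those five columns, assuming the default tab-separated Bowtie output. The names `Alignment`, `parse_bowtie_line` and `five_prime_end` are made up for illustration and are not part of Bowtie. The point is that column 4 is always the leftmost base on the forward strand, so for a - read the 5' end of the read (the end that marks the fragment boundary) sits at offset + read length - 1.

```python
from typing import NamedTuple


class Alignment(NamedTuple):
    name: str
    strand: str   # "+" or "-"
    ref: str
    offset: int   # 0-based leftmost position on the forward strand
    seq: str      # already reverse-complemented by Bowtie for "-" reads


def parse_bowtie_line(line: str) -> Alignment:
    """Parse the first five tab-separated columns of one Bowtie output line."""
    name, strand, ref, offset, seq = line.rstrip("\n").split("\t")[:5]
    return Alignment(name, strand, ref, int(offset), seq)


def five_prime_end(aln: Alignment) -> int:
    """0-based forward-strand coordinate of the read's 5' end."""
    if aln.strand == "+":
        return aln.offset
    # For "-" reads the reported offset is the 3' end of the read,
    # so the 5' end is the rightmost aligned base.
    return aln.offset + len(aln.seq) - 1
```

For example, a 4 bp read reported at offset 100 has its 5' end at 100 if it is on the + strand and at 103 if it is on the - strand.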
If I plot the Bowtie aligned reads I'd expect something like this, at any given peak:
Image link: http://i.imgur.com/pnrFW.png
This is how I believe SISSRS expects things to look, based on their paper
(http://nar.oxfordjournals.org/conten...1/F1.large.jpg)
But I'm getting this (+ and - aligned reads more or less overlap):
Image link: http://i.imgur.com/eXUcU.png
Here's an example:
Image link: http://i.imgur.com/vyxJ8.png
Did something go wrong with the ChIP-seq? (This isn't my data; it comes from a published dataset.)
Given an alignment as in Fig.1, I would extend the reads to simulate the actual sequences like this:
Image link: http://i.imgur.com/OONAW.png
SISSRS would estimate the sequence length based on how far apart the + and - clusters are, and the peak would be found in the middle of the + and - clusters.
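As a rough sketch of what I mean (this is my simplified understanding, not the actual SISSRS algorithm): + reads get extended to the right from their leftmost base, - reads get extended to the left from their rightmost base, and the fragment length is taken from the offset between the + and - clusters. The function names and the use of the median as the cluster centre are my own assumptions for illustration.

```python
from statistics import median


def extend_read(strand, offset, read_len, fragment_len):
    """Return the 0-based, half-open (start, end) of the inferred fragment.

    offset is the Bowtie column-4 value (leftmost base on the forward strand).
    """
    if strand == "+":
        # The 5' end is the leftmost base; extend downstream (to the right).
        return offset, offset + fragment_len
    # The 5' end is the rightmost base; extend to the left on the forward strand.
    end = offset + read_len
    return max(0, end - fragment_len), end


def estimate_from_clusters(plus_5p, minus_5p):
    """Estimate fragment length and binding-site position from 5' end clusters.

    plus_5p / minus_5p are the 5' end coordinates of + and - reads near one peak.
    """
    plus_centre = median(plus_5p)
    minus_centre = median(minus_5p)
    fragment_len = minus_centre - plus_centre + 1
    site = (plus_centre + minus_centre) / 2
    return fragment_len, site
```

With 50 bp reads and a 250 bp fragment, a + read at offset 100 and a - read at offset 300 both extend to the same interval, (100, 350). If the + and - clusters overlap instead of being offset, the estimated fragment length from this kind of calculation comes out much shorter than the real one.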
Since my data looks like Fig.2, I think SISSRS is underestimating the fragment sizes and placing the peaks slightly off from where they should be.
How should I extend the reads to more accurately visualize the sequences, and do I need to modify the data before submitting to SISSRS?
Maybe like this?
Image link: http://i.imgur.com/Hbpk1.png
Or like this?
Image link: http://i.imgur.com/CFhv8.png
This is my first time analyzing ChIP-seq data, thanks for the help!