Hi all,
I have performed RNA-seq on 2 samples, a "case" and "control", with the idea being to find differential expression between the two.
I have noticed, for one gene, that I have loads of "double" sequences mapping, i.e.: the bamfile for the sequence (which I have converted to a GenomicRanges class), has two reads mapping to the same point in the gene, but mapping repeatedly at different points along the gene.
>countOverlaps(bamlist[[1]], tx_by_gene$ENSRNOG00000024028)[which(countOverlaps(bamlist[[1]], tx_by_gene$ENSRNOG00000024028) != 0)]
[1] 2 2 2 2 1 2 1 2 1 2 2 2 1 2 1 1 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[38] 2 2 2 2 2 2 2 2 1 2 2 1 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2
[75] 2 2 2 2 2 1 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2
[112] 2 1 2
Whereas for another gene (and I think typically...at least for my top differentially expressed genes) I get:
>countOverlaps(bamlist[[1]], tx_by_gene$ENSRNOG00000006151)[which(countOverlaps(bamlist[[1]], tx_by_gene$ENSRNOG00000006151) != 0)]
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[38] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[149] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[186] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[223] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[260] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[297] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[334] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[371] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
I have traced these "double mappings" back to the fastq files, they are definitely the same sequences, e.g. for one of the sequences, I found the original reads in the sam and fastq files, here I show an example:
% grep HWUSI-EAS4752312156373911 s_1_1.qseq_ACAGTGA.fastq -A 1
@HWUSI-EAS4752312156373911
GGCTCTGGCACCTTGGGGTTGCAGGGCTCAGGAA
+HWUSI-EAS4752312156373911
geggggggggdggggfcfdfffaffggggcffff
% grep HWUSI-EAS4752317017106159551 s_1_1.qseq_ACAGTGA.fastq -A 1
@HWUSI-EAS4752317017106159551
GGCTCTGGCACCTTGGGGTTGCAGGGCTCAGGAA
+HWUSI-EAS4752317017106159551
hhhhhhhhhhhhhhghhhehhfdhhhhhhhhghh
It seems really weird, like somehow the reads for this gene are being duplicated. Is there a known reason for this?
To make it clear, it's not the same read is mapping at different positions in the same gene,The reads map to one position only, it's just that 2 reads map to the same position, consistently, so it's almost the opposite problem of a read mapping unambiguously!
so if you imagine:
Situation for most genes in sample:
Situation for this 1 particular gene:
At least, this is what I think is happening.
I tested for a load of other genes in the same sample and they all follow the pattern of the first figure, i.e. reads map in different regions, the second figure shows something that only happens for this gene.
I've written a script that looks to see if this happens for any other genes at all in the genome, but it's taking a while to run! (to be fair my R could probably be more efficient!).
The other weird thing is it happens in different samples! It happens in all 3 "case" samples, and not at all in the 3 "control" samples, but I can't think of any possible biological reason for it (at least not before going into more detail..I just wanted to see if any one here had encountered it, or if it was a common thing before I did so).
Any thougts at all would be greatly appreciated. It's very weird.
My plan otherwise is to count each "double" as one count. Though since it is not expressed in control, it's not such a problem..we can't calculated a fold change anyway.
Cheers,
I have performed RNA-seq on 2 samples, a "case" and "control", with the idea being to find differential expression between the two.
I have noticed, for one gene, that I have loads of "double" sequences mapping, i.e.: the bamfile for the sequence (which I have converted to a GenomicRanges class), has two reads mapping to the same point in the gene, but mapping repeatedly at different points along the gene.
>countOverlaps(bamlist[[1]], tx_by_gene$ENSRNOG00000024028)[which(countOverlaps(bamlist[[1]], tx_by_gene$ENSRNOG00000024028) != 0)]
[1] 2 2 2 2 1 2 1 2 1 2 2 2 1 2 1 1 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[38] 2 2 2 2 2 2 2 2 1 2 2 1 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2
[75] 2 2 2 2 2 1 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2
[112] 2 1 2
Whereas for another gene (and I think typically...at least for my top differentially expressed genes) I get:
>countOverlaps(bamlist[[1]], tx_by_gene$ENSRNOG00000006151)[which(countOverlaps(bamlist[[1]], tx_by_gene$ENSRNOG00000006151) != 0)]
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[38] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[149] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[186] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[223] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[260] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[297] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[334] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[371] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
I have traced these "double mappings" back to the fastq files, they are definitely the same sequences, e.g. for one of the sequences, I found the original reads in the sam and fastq files, here I show an example:
% grep HWUSI-EAS4752312156373911 s_1_1.qseq_ACAGTGA.fastq -A 1
@HWUSI-EAS4752312156373911
GGCTCTGGCACCTTGGGGTTGCAGGGCTCAGGAA
+HWUSI-EAS4752312156373911
geggggggggdggggfcfdfffaffggggcffff
% grep HWUSI-EAS4752317017106159551 s_1_1.qseq_ACAGTGA.fastq -A 1
@HWUSI-EAS4752317017106159551
GGCTCTGGCACCTTGGGGTTGCAGGGCTCAGGAA
+HWUSI-EAS4752317017106159551
hhhhhhhhhhhhhhghhhehhfdhhhhhhhhghh
It seems really weird, like somehow the reads for this gene are being duplicated. Is there a known reason for this?
To make it clear, it's not the same read is mapping at different positions in the same gene,The reads map to one position only, it's just that 2 reads map to the same position, consistently, so it's almost the opposite problem of a read mapping unambiguously!
so if you imagine:
Situation for most genes in sample:
Code:
[FONT="Courier New"]DIFFERENT READS: ------ ------ ------ ------ ------ GENE :------------------------------------------------->[/FONT]
Code:
[FONT="Courier New"]DIFFERENT READS: ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ GENE :------------------------------------------->[/FONT]
I tested for a load of other genes in the same sample and they all follow the pattern of the first figure, i.e. reads map in different regions, the second figure shows something that only happens for this gene.
I've written a script that looks to see if this happens for any other genes at all in the genome, but it's taking a while to run! (to be fair my R could probably be more efficient!).
The other weird thing is it happens in different samples! It happens in all 3 "case" samples, and not at all in the 3 "control" samples, but I can't think of any possible biological reason for it (at least not before going into more detail..I just wanted to see if any one here had encountered it, or if it was a common thing before I did so).
Any thougts at all would be greatly appreciated. It's very weird.
My plan otherwise is to count each "double" as one count. Though since it is not expressed in control, it's not such a problem..we can't calculated a fold change anyway.
Cheers,
Comment