**lukas1848** · 10-27-2011, 06:41 AM

My first wild guess is that they look like PCR duplicates.

**TonyBrooks** · 10-27-2011, 06:48 AM

Originally posted by lukas1848:

My first wild guess is that they look like PCR duplicates.

But surely PCR duplicates should be seen more randomly across the genome rather than for just one gene?

**lukas1848** · 10-27-2011, 06:54 AM

PCR duplicates should be removed by SAMtools anyway. So these might have just slipped through the filtering. Don't ask me why though.

**GW_OK** · 10-27-2011, 06:56 AM

First guess is, of course, PCR duplicates.
Second guess is repeat/low complexity regions within the gene.
Third guess is a co-expressed homologous gene.
Fourth guess is gene duplication.

**jimineep** · 10-27-2011, 06:58 AM

Originally posted by TonyBrooks:

But surely PCR duplicates should be seen more randomly across the genome rather than for just one gene?

Exactly... it's definitely not random. It might happen for other genes (we'll find out when my script finishes) but it doesn't happen "a little bit" for many genes, just (almost) all for one.

It's like something happens when the mRNA gets sheared, and for a given mRNA it doubles it... or something?

**jimineep** · 10-27-2011, 07:06 AM

Originally posted by GW_OK:

First guess is, of course, PCR duplicates.
Second guess is repeat/low complexity regions within the gene.
Third guess is a co-expressed homologous gene.
Fourth guess is gene duplication.

Again, if it's duplicates, why might it be particularly happening for this one gene? And why an exact doubling each time? Has anyone seen this before specifically for one gene?

I'm not sure that it's repeat/low complexity regions: wouldn't a read then map ambiguously to several positions in the gene, rather than exactly two reads at each of several separate positions? I blatted a couple of reads and they do map unambiguously to the genome.

If it is a co-expressed homologous gene, it seems strange that the reads would map in exactly the same places. The gene length is ~500 bp, and there are ~200 reads mapping, but most of them are mapping twice in the same location, which seems odd to me.

Also, it seems weird that it's so reproducible: it's happening specifically for this gene in different biological replicates.

**gringer** · 10-27-2011, 07:14 AM

Is it possible that having different isoforms for the same gene is messing up the mapping?

**jimineep** · 10-27-2011, 07:47 AM

Originally posted by gringer:

Is it possible that having different isoforms for the same gene is messing up the mapping?

How do you mean? You mean there are 2 different isoforms being expressed in the tissue? I don't see how that would lead to the 2x sequence generation?

**gringer** · 10-27-2011, 07:52 AM

You mentioned "converted to a GenomicRanges class", and I wonder if hits to different isoforms for the same gene might be recorded multiple times. Is there any way you can look at overlaps by isoform ('tx_by_isoform'), rather than overlaps by gene ('tx_by_gene')?

**jimineep** · 10-27-2011, 08:00 AM

Interesting point. I will investigate. I get the same result when mapping to exons though (i.e. tx_by_exon).

By isoforms do you mean splice variants or something deeper? One thing I can think of against that theory is that the problem can be traced back to the fastq file:

% grep HWUSI-EAS4752312156373911 s_1_1.qseq_ACAGTGA.fastq -A 1
@HWUSI-EAS4752312156373911
GGCTCTGGCACCTTGGGGTTGCAGGGCTCAGGAA
+HWUSI-EAS4752312156373911
geggggggggdggggfcfdfffaffggggcffff

% grep HWUSI-EAS4752317017106159551 s_1_1.qseq_ACAGTGA.fastq -A 1
@HWUSI-EAS4752317017106159551
GGCTCTGGCACCTTGGGGTTGCAGGGCTCAGGAA
+HWUSI-EAS4752317017106159551
hhhhhhhhhhhhhhghhhehhfdhhhhhhhhghh

So there is definitely a pair of identical sequences in the fastq file that map unambiguously to the genome. That said, I'm going to go even further back to the qseq file just to triple check!

**gringer** · 10-27-2011, 08:29 AM

Wouldn't you expect multiple hits to the same gene regions for RNA-seq experiments with a suitably large number of reads? What happens if you grep for the sequence itself (or arbitrary other sequences that have been mapped)?

Code:

% grep '^GGCTCTGGCACCTTGGGGTTGCAGGGCTCAGGAA' s_1_1.qseq_ACAGTGA.fastq | wc -l
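To check every sequence at once rather than one at a time, the same idea can be extended as below. This is only a rough sketch: `count_dup_seqs` is a made-up helper name, and it assumes a strict four-line-per-record FASTQ (no wrapped sequence lines).

```shell
# Tally how often each read sequence occurs in a FASTQ file and
# print only the sequences seen more than once, most frequent first.
# Assumption: every record is exactly 4 lines (header, sequence, '+', quality).
count_dup_seqs() {
    # Line 2 of every 4-line record is the sequence.
    awk 'NR % 4 == 2' "$1" \
        | sort \
        | uniq -c \
        | sort -k1,1nr -k2,2 \
        | awk '$1 > 1 { print $1 "\t" $2 }'
}

# Example call (file name taken from the posts above):
#   count_dup_seqs s_1_1.qseq_ACAGTGA.fastq | head
```

If most sequences in the file turn up more than once, the duplication is in the reads themselves; if only a handful do, the doubling is more likely being introduced downstream, at the counting step.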

"By isoforms do you mean splice variants or something deeper?"

I think they're usually just splice variants. To be more verbose, different transcripts that map to the same gene region. I'll give a rough example. Say you have a gene with two different isoforms:

Code:

Gene:  [exon 1  ][exon 2  ]-------[exon 3  ]---[exon 4  ]---[exon 5  ]
Iso_1: [exon 1  ][exon 4  ][exon 5  ]
Iso_2: [exon 1  ][exon 3  ][exon 5  ]

Let's say you get a cell that has both isoforms expressed at the same time:

Code:

Iso_1: [exon 1  ][exon 4  ][exon 5  ]
[hit]    ----     ----      ----
Iso_2: [exon 1  ][exon 3  ][exon 5  ]
[hit]      ----    ----  ---- ----

And have some way of converting those to binary counts per isoform:

Code:

Iso_1: [exon 1  ][exon 4  ][exon 5  ]
[hit]    1111     1111      1111
Iso_2: [exon 1  ][exon 3  ][exon 5  ]
[hit]      1111    1111  1111 1111

When these are summed up for a single gene, it will look something like this:

Code:

Gene:  [exon 1  ][exon 2  ]-------[exon 3  ]---[exon 4  ]---[exon 5  ]
[hit]    112211                     1111  11     1111       1212211

[note that the exon mapping will be the same as the gene mapping in terms of summed coverage]

**lukas1848** · 10-27-2011, 09:41 AM

Originally posted by gringer:

When these are summed up for a single gene, it will look something like this:

Code:

Gene:  [exon 1  ][exon 2  ]-------[exon 3  ]---[exon 4  ]---[exon 5  ]
[hit]    112211                     1111  11     1111       1212211

[note that the exon mapping will be the same as the gene mapping in terms of summed coverage]

This still wouldn't explain why there are always two reads mapping exactly to the same position. I'd still say that this is some artifact resulting from the library prep.

**seq_me** · 10-27-2011, 09:54 AM

I feel the same. At first look, they look like PCR duplicates, which would point to over-amplification of the samples. I would contact the person who made the libraries and get details on the amount of input DNA. If the run is a PE run, you can safely remove duplicate reads that have the same start and end, as they are PCR duplicates.
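As a rough illustration of what "same start and end" means in practice, here is a minimal sketch. It assumes a hypothetical tab-separated table with the columns read_id, chrom, start, end, strand (one line per sequenced fragment); `dedup_fragments` is a made-up name, and this is not how any particular duplicate-removal tool is implemented.

```shell
# Keep only the first fragment seen for each (chrom, start, end, strand)
# combination; later fragments with identical coordinates are dropped.
# Assumed input columns (tab-separated): read_id, chrom, start, end, strand.
dedup_fragments() {
    awk -F'\t' '!seen[$2, $3, $4, $5]++' "$1"
}

# Example call on a hypothetical file:
#   dedup_fragments fragments.tsv > fragments.dedup.tsv
```

On real alignments you would normally let a dedicated tool do this on the BAM file rather than on a hand-made table.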

**jimineep** · 10-27-2011, 10:32 AM

I think gringer has it.

I think the problem comes down to the GRanges object, which looks like this:

> tx_by_exon$ENSRNOG00000024028
GRanges with 4 ranges and 2 elementMetadata values
    seqnames                 ranges strand |   exon_id   exon_name
       <Rle>              <IRanges>  <Rle> | <integer> <character>
[1]     chr2 [185464227, 185464733]      - |     40667          NA
[2]     chr2 [185464256, 185464351]      - |     40669          NA
[3]     chr2 [185464373, 185464714]      - |     40670          NA
[4]     chr2 [185465839, 185465857]      - |     40668          NA

and tx_by_gene looks like this:

> tx_by_gene$ENSRNOG00000024028
GRanges with 2 ranges and 2 elementMetadata values
    seqnames                 ranges strand |     tx_id            tx_name
       <Rle>              <IRanges>  <Rle> | <integer>        <character>
[1]     chr2 [185464227, 185465857]      - |      6823 ENSRNOT00000012174
[2]     chr2 [185464256, 185464714]      - |      6824 ENSRNOT00000068099

These ranges are overlapping, so clearly the reads are being counted twice for the different isoforms, which is why I was getting a 2.

I believe I have misinterpreted the results, thinking something much weirder was going on than was actually happening. Apologies.

Basically, I wasn't expecting the two isoforms to coexist in this way. I didn't think that exonsBy would allow exons to overlap like this and therefore get counted twice. I hadn't noticed because all the other genes I had checked had non-overlapping exons.

This is interesting. Does this therefore mean that when someone performs:

library(Rsamtools)
reads=readBamGappedAlignments("aligned_reads_sorted.bam")
counts=countOverlaps(tx_by_gene,reads)

to get the overlap between the reads and genes, they could actually count the same read several times? If so they could in theory end up with more reads mapping to genes than there are reads in a bam file?
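To make the arithmetic behind that question concrete, here is a small sketch that tallies reads against each range separately. It does not reproduce what countOverlaps does internally; it only shows what happens when every overlapping range gets its own tally. The file layouts (name, start, end, tab-separated) and the name `count_overlaps` are assumptions made up for the illustration.

```shell
# For every range in the first file, count the reads in the second file
# that overlap it. A read overlapping several ranges is tallied once
# per range, so the per-range counts can sum to more than the read total.
# Assumed columns in both files (tab-separated): name, start, end.
count_overlaps() {
    awk -F'\t' '
        # First file: remember each range.
        NR == FNR { name[FNR] = $1; s[FNR] = $2 + 0; e[FNR] = $3 + 0; n = FNR; next }
        # Second file: test each read against every range (closed intervals).
        {
            for (i = 1; i <= n; i++)
                if (($2 + 0) <= e[i] && ($3 + 0) >= s[i]) hits[i]++
        }
        END { for (i = 1; i <= n; i++) print name[i] "\t" (hits[i] + 0) }
    ' "$1" "$2"
}

# Example call on hypothetical files:
#   count_overlaps transcripts.tsv reads.tsv
```

With the two transcript ranges shown above and three made-up reads, two inside the shared region and one near the far end of the longer transcript, the tallies come out as 3 and 2. That sums to 5 from only 3 reads, so with this kind of per-range counting the total can indeed exceed the number of reads in the file.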


Double reads for one gene
