  • Gene Copy Number

    Hello,

    I am attempting to determine which genes have a very high copy number in the genome. My organism of interest is a non-model species, and I have Illumina paired-end sequencing data.

    What is the best approach to identify genes that have 2, 3, or even 10 copies in the genome?

    My first thought is that it will most likely be coverage based, and some tools even use paired-end information. I have trouble deciding which tool to use given my objective: I am not interested in regions of the genome that have a high copy number; I am specifically interested in which genes have a high copy number.

    I have a genome assembly that has been annotated. I have tried to do the analysis as follows:
    - Extract all ORFs (longer than 100 AA) into a FASTA file.
    - Map my reads to all ORFs simultaneously. (Here, if a read can map to two ORFs, I allow it to map to both, because I suspect my assembly contains ORFs with 90% to 100% identity; the point is to see the copy number of ORFs, and if two ORFs are the same, both should show the same copy number.)
    - After determining the mean coverage of each individual ORF, I divide it by the genome mean coverage to estimate the copy number.
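 
    The three steps above reduce to a simple ratio once per-ORF mean coverages are in hand. A minimal sketch, assuming the coverages have already been computed (e.g. from samtools depth output); all names and numbers here are illustrative:

```python
# Copy number as (ORF mean coverage) / (genome mean coverage),
# as described in the post. Inputs are assumed precomputed.

def estimate_copy_numbers(orf_mean_coverage, genome_mean_coverage):
    """Divide each ORF's mean coverage by the genome-wide mean."""
    return {orf: cov / genome_mean_coverage
            for orf, cov in orf_mean_coverage.items()}

orf_cov = {"orf_001": 31.2, "orf_002": 148.9, "orf_003": 30.5}
copy_number = estimate_copy_numbers(orf_cov, genome_mean_coverage=30.0)
# orf_002 comes out near 5 copies; the single-copy ORFs near 1.
```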

    This method seems a bit primitive, and I am sure there are better ways to do it. A similar method would be to map the reads to the ORFs and calculate FPKMs with Cufflinks, but that would be prone to the same pitfalls as the approach I employed.

    Any ideas?

  • #2
    While I only have experience estimating CNV in targeted amplicons and can't answer your question directly, I know there are at least a couple of reviews of CNV detection methods.

    Here is one from a quick Google search: http://www.plosone.org/article/info%...l.pone.0059128

    • #3
      I think your approach will work fine for rough estimates of high-CN genes. However, I think you are going to underestimate copy number, because you are normalizing high-CN ORFs by the mean genome coverage, and that mean includes high-CN regions, repetitive regions, etc.

      I just went through this myself and had to find an appropriate "baseline" measurement to normalize by. You really want your "baseline" coverage to reflect the coverage of a "single copy" region in the genome.

      The way I went about this was to use the coverage of single-copy exons as my baseline normalizing factor; in your case, single-copy ORFs would work fine. I BLASTed all exons against all exons and excluded any sequence that had hits to anything other than itself. This worked quite well.
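 
      The all-vs-all BLAST filter described above can be sketched as follows, assuming tabular output (-outfmt 6), where the first two columns are the query and subject IDs. The sequence names are illustrative:

```python
# Keep only sequences whose sole BLAST hit is themselves; these serve
# as the "single copy" set for the baseline coverage.
from collections import defaultdict

def single_copy_ids(blast_tab_lines):
    hits = defaultdict(set)
    for line in blast_tab_lines:
        query, subject = line.split("\t")[:2]
        hits[query].add(subject)
    return {q for q, subjects in hits.items() if subjects == {q}}

lines = ["exon1\texon1\t100.0", "exon2\texon2\t100.0",
         "exon2\texon3\t95.2", "exon3\texon3\t100.0",
         "exon3\texon2\t95.2"]
print(single_copy_ids(lines))  # only exon1 has no non-self hit
```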

      Good luck!

      • #4
        I think you would do better to map to the entire genome, then extract the coverage of the ORFs afterwards using a GFF of the ORF locations and something like HTSeq-count. The mapping quality will be much better; otherwise you could end up with a lot of falsely mapped reads.
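 
        The "map to the whole genome, then extract ORF coverage" idea can be sketched like this, given per-base depths for a contig and ORF intervals taken from a GFF. This is a simplification (one contig, 1-based inclusive coordinates as in GFF); the names and numbers are illustrative:

```python
# Mean coverage per ORF interval over a per-base depth track.

def orf_mean_coverage(depth, orfs):
    """depth: per-base coverage list; orfs: {name: (start, end)}, 1-based inclusive."""
    return {name: sum(depth[start - 1:end]) / (end - start + 1)
            for name, (start, end) in orfs.items()}

depth = [10] * 100 + [40] * 50 + [10] * 100   # a 4x-coverage segment mid-contig
orfs = {"orfA": (20, 80), "orfB": (101, 150)}
cov = orf_mean_coverage(depth, orfs)
# orfA sits in the 10x background; orfB in the 40x segment
```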

        • #5
          Originally posted by Jeremy View Post
          I think you would do better to map to the entire genome, then extract the coverage of the ORFs afterwards using a GFF of the ORF locations and something like HTSeq-count. The mapping quality will be much better; otherwise you could end up with a lot of falsely mapped reads.
          That's true, but that is why using the "single copy ORFs" would circumvent the problem of mapping quality and falsely mapped reads.

          • #6
            Originally posted by jgibbons1 View Post
            That's true, but that is why using the "single copy ORFs" would circumvent the problem of mapping quality and falsely mapped reads.
            Could you elaborate? I do not understand how to use single-copy ORFs, or what problem that circumvents.

            • #7
              Wait, what sort of sequence data do you have? (normalised) RNA-Seq? genomic?

              I assumed you had genomic data, which led me to think along these lines: while I don't know what species you are working with, the % of your genome that is ORF is going to be, what, 1-3%? That means you will be mapping 97-99% of the reads against a reference that doesn't contain the sequence they should map to, which gives a reasonable chance of them falsely mapping to one of the ORFs in the reference.

              • #8
                Originally posted by Jeremy View Post
                Wait, what sort of sequence data do you have? (normalised) RNA-Seq? genomic?

                I assumed you had genomic data, which led me to think along these lines: while I don't know what species you are working with, the % of your genome that is ORF is going to be, what, 1-3%? That means you will be mapping 97-99% of the reads against a reference that doesn't contain the sequence they should map to, which gives a reasonable chance of them falsely mapping to one of the ORFs in the reference.
                Yes, it is genomic. That makes sense.

                • #9
                  Originally posted by AdrianP View Post
                  Could you elaborate? I do not understand how to use single-copy ORFs, or what problem that circumvents.
                  Using only single-copy regions of the genome/transcriptome would give you a better idea of the true single-copy coverage. When you take the average coverage of the genome/transcriptome, you are also including values from copy-number-variable genomic regions and/or highly homologous gene families. In this sense, your estimate of background coverage will be skewed higher, and thus your copy number estimates will be lower (because the background coverage is the denominator). Does that make sense? It's worth putting the effort into being confident about your background coverage.
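 
                  A quick numeric illustration of this point, with made-up coverage values: if repeats inflate the genome-wide mean, every copy-number estimate shrinks, because that mean is the denominator.

```python
# Inflated background coverage -> underestimated copy number.
true_single_copy = 30.0          # coverage of a genuine 1-copy region
inflated_background = 45.0       # genome-wide mean pulled up by repeats
orf_coverage = 150.0             # an ORF that is really ~5 copies

print(orf_coverage / true_single_copy)      # 5.0  (correct)
print(orf_coverage / inflated_background)   # ~3.3 (underestimate)
```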

                  • #10
                    Originally posted by jgibbons1 View Post
                    Using only single copy regions of the genome/transcriptome would give you a better idea of the true single copy coverage. When you take the average coverage of the genome/transcriptome, you are also introducing values from copy number variable genomic regions and/or highly homologous gene families. In this sense, your estimation of background coverage will be skewed higher, and thus, your copy number estimates will be lower (because background coverage is denominator). Does that make sense? It's worth putting the effort into being confident about your background coverage.
                    When I estimated the average coverage of the genome, I didn't take the average coverage of all contigs, but rather found a 20 kb region where the coverage is pretty evenly distributed and took the average of that. In fact, most of the contigs have that coverage across most of their length; I checked a couple of other locations that do not look repetitive.

                    It is slightly lower than the average coverage of all contigs, but that makes sense.

                    Really, my biggest problem is that among my ORFs there are pairs with 90-100% pairwise identity at the nt level. This analysis would be more successful if such duplicates were removed, because at 90% identity we would still consider them the same gene, one that likely got duplicated.

                    So if I do map my reads to the entire assembly and extract the coverage of ORFs, should I select the option that allows a read to map to multiple locations? Otherwise, if a gene has 4 copies but only 2 are present in the assembly, each of those 2 ORFs will look like it has 2 copies, when in fact it is one gene with 4 copies.

                    • #11
                      I think you could use a statistical approach based on coverage. A very simplistic approach could assume reads are Poisson distributed, so you could look at deviations from the average coverage: if the average single-copy coverage is L, two copies would generate 2L, three copies 3L, and so on. Based on the mean and the variance, you could test whether the mean at a location is 2L, 3L, etc., to get an estimate of the number of copies.
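 
                      A toy version of this Poisson idea: given single-copy mean coverage L and an observed coverage at a locus, pick the copy number k whose Poisson mean k*L makes the observation most likely. The numbers are illustrative:

```python
# Maximum-likelihood copy number under a Poisson coverage model.
import math

def poisson_logpmf(x, mu):
    """Log of the Poisson pmf at integer count x with mean mu."""
    return x * math.log(mu) - mu - math.lgamma(x + 1)

def best_copy_number(observed, L, max_k=20):
    """Copy number k in 1..max_k maximizing the likelihood of `observed`."""
    return max(range(1, max_k + 1),
               key=lambda k: poisson_logpmf(observed, k * L))

# With L = 30, a locus observed at ~92x is best explained by 3 copies.
print(best_copy_number(92, 30))
```

A real analysis would also account for overdispersion (coverage variance usually exceeds the Poisson mean), which is why the post mentions using the variance as well.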

                      • #12
                        Originally posted by AdrianP View Post
                        When I estimated the average coverage of the genome, I didn't take the average coverage of all contigs, but rather found a 20 kb region where the coverage is pretty evenly distributed and took the average of that. In fact, most of the contigs have that coverage across most of their length; I checked a couple of other locations that do not look repetitive.

                        It is slightly lower than the average coverage of all contigs, but that makes sense.

                        Really, my biggest problem is that among my ORFs there are pairs with 90-100% pairwise identity at the nt level. This analysis would be more successful if such duplicates were removed, because at 90% identity we would still consider them the same gene, one that likely got duplicated.

                        So if I do map my reads to the entire assembly and extract the coverage of ORFs, should I select the option that allows a read to map to multiple locations? Otherwise, if a gene has 4 copies but only 2 are present in the assembly, each of those 2 ORFs will look like it has 2 copies, when in fact it is one gene with 4 copies.
                        Seems like that could easily be solved by BLASTing the ORFs against each other and identifying which are similar. You don't need any elaborate mapping approach for genes that are that different; if a gene is different enough to be separated out in assembly, then it is pretty different. A gene with 90% identity can have a completely different function; are you sure you want to lump them together?

                        Your mapping approach can identify duplicate genes with just a few nt of difference, or even none at all. Such highly similar duplicates can be difficult to separate during assembly, resulting in assemblies where multiple identical duplicates have been collapsed into a single locus; these are the cases you can find by looking at read depth.

                        • #13
                          Originally posted by Jeremy View Post
                          Seems like that could easily be solved by BLASTing the ORFs against each other and identifying which are similar. You don't need any elaborate mapping approach for genes that are that different; if a gene is different enough to be separated out in assembly, then it is pretty different. A gene with 90% identity can have a completely different function; are you sure you want to lump them together?

                          Your mapping approach can identify duplicate genes with just a few nt of difference, or even none at all. Such highly similar duplicates can be difficult to separate during assembly, resulting in assemblies where multiple identical duplicates have been collapsed into a single locus; these are the cases you can find by looking at read depth.
                          Here we are getting into whether the alleles of a diploid genome have enough SNPs to be placed in 2 different contigs (one of them being redundant). I have taken care of that by running dipSPAdes. But in this organism there are genes that are actually duplicated and are 100% identical. Some are 90% identical because they are truncated at their 5' or 3' end and are shorter; others are indeed 90% identical due to SNPs.

                          I agree that 90% identity, especially due to truncation, can mean different functions, or at least slightly different functions. I am okay with that, because I will be looking for any KOG enrichment among these genes with many copies.

                          Now back to BLASTing. Sure, I can do this, but then I would need to sum the copy numbers of genes that are very similar but located at different loci. That will be somewhat challenging.
