  • Gene Copy Number

    Hello,

    I am attempting to determine which genes have a very high copy number in the genome. My organism of interest is a non-model species, and I have Illumina paired-end sequencing data.

    What is the best approach to identify genes that have 2, 3, or even 10 copies in the genome?

    My first thought is that it will most likely be coverage based, and some tools even use paired-end information. I have trouble deciding which tool to use given my objective: I am not interested in regions of the genome that have a high copy number; I am specifically interested in which genes have a high copy number.

    I have a genome assembly that has been annotated. I have tried to do the analysis as follows:
    - Extract all ORFs (longer than 100 AA) into a FASTA file.
    - Map my reads to all ORFs simultaneously. (Here, if a read can map to two ORFs, I allow it to map to both, because I suspect my assembly contains ORFs with 90% to 100% identity; the point is to see the copy number of ORFs, and if two ORFs are the same, both should show the same copy number.)
    - After determining the mean coverage of each individual ORF, I divide it by the genome mean coverage to estimate the copy number.
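 
    The three steps above reduce to a simple ratio once per-ORF mean coverages are in hand. A minimal sketch, assuming the coverages have already been computed (e.g. from samtools depth output); all names and numbers here are illustrative:

```python
# Copy number as (ORF mean coverage) / (genome mean coverage),
# as described in the post. Inputs are assumed precomputed.

def estimate_copy_numbers(orf_mean_coverage, genome_mean_coverage):
    """Divide each ORF's mean coverage by the genome-wide mean."""
    return {orf: cov / genome_mean_coverage
            for orf, cov in orf_mean_coverage.items()}

orf_cov = {"orf_001": 31.2, "orf_002": 148.9, "orf_003": 30.5}
copy_number = estimate_copy_numbers(orf_cov, genome_mean_coverage=30.0)
# orf_002 comes out near 5 copies; the single-copy ORFs near 1.
```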

    This method seems a bit primitive, and I am sure there are better ways to do it. A similar method would be to map the reads to the ORFs and calculate FPKMs with Cufflinks, but that would be prone to the same pitfalls as the approach I employed.

    Any ideas?

  • #2
    While I only have experience estimating CNV in targeted amplicons and can't answer your question directly, I know there are at least a couple of reviews of CNV detection methods.

    Here is one from a quick Google search: http://www.plosone.org/article/info%...l.pone.0059128

    • #3
      I think your approach will work fine for rough estimates of high-CN genes. However, I think you are going to underestimate copy number, because you are normalizing high-CN ORFs by the mean genome coverage, and that mean includes high-CN regions, repetitive regions, etc.

      I just went through this myself and had to find an appropriate "baseline" measurement to normalize by. You really want your "baseline" coverage to reflect the coverage of a "single copy" region in the genome.

      The way I went about this was to use the coverage of single-copy exons as my baseline normalizing factor; in your case, single-copy ORFs would work fine. I BLASTed all exons against all exons and excluded any sequence that had hits to anything other than itself. This worked quite well.
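 
      The all-vs-all BLAST filter described above can be sketched as follows, assuming tabular output (-outfmt 6), where the first two columns are the query and subject IDs. The sequence names are illustrative:

```python
# Keep only sequences whose sole BLAST hit is themselves; these serve
# as the "single copy" set for the baseline coverage.
from collections import defaultdict

def single_copy_ids(blast_tab_lines):
    hits = defaultdict(set)
    for line in blast_tab_lines:
        query, subject = line.split("\t")[:2]
        hits[query].add(subject)
    return {q for q, subjects in hits.items() if subjects == {q}}

lines = ["exon1\texon1\t100.0", "exon2\texon2\t100.0",
         "exon2\texon3\t95.2", "exon3\texon3\t100.0",
         "exon3\texon2\t95.2"]
print(single_copy_ids(lines))  # only exon1 has no non-self hit
```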

      Good luck!

      • #4
        I think you would do better to map to the entire genome, then extract the coverage of the ORFs afterwards using a GFF of the ORF locations and something like HTSeq-count. The mapping quality will be much better; otherwise you could end up with a lot of falsely mapped reads.
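 
        The "map to the whole genome, then extract ORF coverage" idea can be sketched like this, given per-base depths for a contig and ORF intervals taken from a GFF. This is a simplification (one contig, 1-based inclusive coordinates as in GFF); the names and numbers are illustrative:

```python
# Mean coverage per ORF interval over a per-base depth track.

def orf_mean_coverage(depth, orfs):
    """depth: per-base coverage list; orfs: {name: (start, end)}, 1-based inclusive."""
    return {name: sum(depth[start - 1:end]) / (end - start + 1)
            for name, (start, end) in orfs.items()}

depth = [10] * 100 + [40] * 50 + [10] * 100   # a 4x-coverage segment mid-contig
orfs = {"orfA": (20, 80), "orfB": (101, 150)}
cov = orf_mean_coverage(depth, orfs)
# orfA sits in the 10x background; orfB in the 40x segment
```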

        • #5
          Originally posted by Jeremy View Post
          I think you would do better to map to the entire genome, then extract the coverage of the ORFs afterwards using a GFF of the ORF locations and something like HTSeq-count. The mapping quality will be much better; otherwise you could end up with a lot of falsely mapped reads.
          That's true, but that is why using the "single copy ORFs" would circumvent the problem of mapping quality and falsely mapped reads.

          • #6
            Originally posted by jgibbons1 View Post
            That's true, but that is why using the "single copy ORFs" would circumvent the problem of mapping quality and falsely mapped reads.
            Could you elaborate? I do not understand how to use single-copy ORFs, or what problem that circumvents.

            • #7
              Wait, what sort of sequence data do you have? (normalised) RNA-Seq? genomic?

              I assumed you had genomic data, which led me to think along these lines: while I don't know what species you are working with, the % of your genome that is ORF is going to be, what, 1-3%? That means you will be mapping 97-99% of the reads against a reference that doesn't contain the sequence they should map to, which gives a reasonable chance of them falsely mapping to one of the ORFs in the reference.

              • #8
                Originally posted by Jeremy View Post
                Wait, what sort of sequence data do you have? (normalised) RNA-Seq? genomic?

                I assumed you had genomic data, which led me to think along these lines: while I don't know what species you are working with, the % of your genome that is ORF is going to be, what, 1-3%? That means you will be mapping 97-99% of the reads against a reference that doesn't contain the sequence they should map to, which gives a reasonable chance of them falsely mapping to one of the ORFs in the reference.
                Yes, it is genomic. That makes sense.

                • #9
                  Originally posted by AdrianP View Post
                  Could you elaborate? I do not understand how to use single-copy ORFs, or what problem that circumvents.
                  Using only single-copy regions of the genome/transcriptome would give you a better idea of the true single-copy coverage. When you take the average coverage of the genome/transcriptome, you are also including values from copy-number-variable genomic regions and/or highly homologous gene families. In this sense, your estimate of background coverage will be skewed higher, and thus your copy number estimates will be lower (because the background coverage is the denominator). Does that make sense? It's worth putting the effort into being confident about your background coverage.
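 
                  A quick numeric illustration of this point, with made-up coverage values: if repeats inflate the genome-wide mean, every copy-number estimate shrinks, because that mean is the denominator.

```python
# Inflated background coverage -> underestimated copy number.
true_single_copy = 30.0          # coverage of a genuine 1-copy region
inflated_background = 45.0       # genome-wide mean pulled up by repeats
orf_coverage = 150.0             # an ORF that is really ~5 copies

print(orf_coverage / true_single_copy)      # 5.0  (correct)
print(orf_coverage / inflated_background)   # ~3.3 (underestimate)
```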

                  • #10
                    Originally posted by jgibbons1 View Post
                    Using only single copy regions of the genome/transcriptome would give you a better idea of the true single copy coverage. When you take the average coverage of the genome/transcriptome, you are also introducing values from copy number variable genomic regions and/or highly homologous gene families. In this sense, your estimation of background coverage will be skewed higher, and thus, your copy number estimates will be lower (because background coverage is denominator). Does that make sense? It's worth putting the effort into being confident about your background coverage.
                    When I estimated the average coverage of the genome, I didn't take the average coverage of all contigs, but rather found a 20 kb region where the coverage is pretty evenly distributed and took the average of that. In fact, most of the contigs have that coverage across most of their length; I checked a couple of other locations that do not look repetitive.

                    It is slightly lower than the average coverage of all contigs, but that makes sense.

                    Really, my biggest problem is that among my ORFs there are pairs with 90-100% pairwise identity at the nt level. This analysis would be more successful if such duplicates were removed, because at 90% identity we would still consider them the same gene, one that likely got duplicated.

                    So if I do map my reads to the entire assembly and extract the coverage of ORFs, should I select the option that allows a read to map to multiple locations? Otherwise, if a gene has 4 copies but only 2 are present in the assembly, each of those 2 ORFs will look like it has 2 copies, when in fact it is one gene with 4 copies.

                    • #11
                      I think you could use a statistical approach based on coverage. A very simplistic approach could assume reads are Poisson distributed, so you could look at deviations from the average coverage: if the average single-copy coverage is L, two copies would generate 2L, three copies 3L, and so on. Based on the mean and the variance, you could test whether the mean at a location is 2L, 3L, etc., to get an estimate of the number of copies.
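 
                      A toy version of this Poisson idea: given single-copy mean coverage L and an observed coverage at a locus, pick the copy number k whose Poisson mean k*L makes the observation most likely. The numbers are illustrative:

```python
# Maximum-likelihood copy number under a Poisson coverage model.
import math

def poisson_logpmf(x, mu):
    """Log of the Poisson pmf at integer count x with mean mu."""
    return x * math.log(mu) - mu - math.lgamma(x + 1)

def best_copy_number(observed, L, max_k=20):
    """Copy number k in 1..max_k maximizing the likelihood of `observed`."""
    return max(range(1, max_k + 1),
               key=lambda k: poisson_logpmf(observed, k * L))

# With L = 30, a locus observed at ~92x is best explained by 3 copies.
print(best_copy_number(92, 30))
```

A real analysis would also account for overdispersion (coverage variance usually exceeds the Poisson mean), which is why the post mentions using the variance as well.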

                      • #12
                        Originally posted by AdrianP View Post
                        When I estimated the average coverage of the genome, I didn't take the average coverage of all contigs, but rather found a 20 kb region where the coverage is pretty evenly distributed and took the average of that. In fact, most of the contigs have that coverage across most of their length; I checked a couple of other locations that do not look repetitive.

                        It is slightly lower than the average coverage of all contigs, but that makes sense.

                        Really, my biggest problem is that among my ORFs there are pairs with 90-100% pairwise identity at the nt level. This analysis would be more successful if such duplicates were removed, because at 90% identity we would still consider them the same gene, one that likely got duplicated.

                        So if I do map my reads to the entire assembly and extract the coverage of ORFs, should I select the option that allows a read to map to multiple locations? Otherwise, if a gene has 4 copies but only 2 are present in the assembly, each of those 2 ORFs will look like it has 2 copies, when in fact it is one gene with 4 copies.
                        Seems like that could easily be solved by BLASTing the ORFs against each other and identifying which are similar. You don't need any elaborate mapping approach for genes that are that different; if a gene is different enough to be separated out in assembly, then it is pretty different. A gene with 90% identity can have a completely different function; are you sure you want to lump them together?

                        Your mapping approach can identify duplicate genes with just a few nt of difference, or even none at all. Such highly similar duplicates can be difficult to separate during assembly, resulting in assemblies where multiple identical duplicates have been collapsed into a single locus; these are the cases you can find by looking at read depth.

                        • #13
                          Originally posted by Jeremy View Post
                          Seems like that could easily be solved by BLASTing the ORFs against each other and identifying which are similar. You don't need any elaborate mapping approach for genes that are that different; if a gene is different enough to be separated out in assembly, then it is pretty different. A gene with 90% identity can have a completely different function; are you sure you want to lump them together?

                          Your mapping approach can identify duplicate genes with just a few nt of difference, or even none at all. Such highly similar duplicates can be difficult to separate during assembly, resulting in assemblies where multiple identical duplicates have been collapsed into a single locus; these are the cases you can find by looking at read depth.
                          Here we are getting into whether the alleles of a diploid genome have enough SNPs to be placed in 2 different contigs (one of them being redundant). I have taken care of that by running dipSPAdes. But in this organism there are genes that are actually duplicated and are 100% identical. Some are 90% identical because they are truncated at their 5' or 3' end and are shorter; others are indeed 90% identical due to SNPs.

                          I agree that 90% identity, especially due to truncation, can mean different functions, or at least slightly different functions. I am okay with that, because I will be looking for any KOG enrichment among these genes with many copies.

                          Now back to BLASTing. Sure, I can do this, but then I would need to sum the copy numbers of genes that are very similar but located at different loci. That will be somewhat challenging.
