Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Question on counting number of reads per gene

    Hello All,

    I had a question about per gene read-count data. I wrote 2 short perl scripts to find the read-count per gene, from the sam file I get in the Tophat output. However, I am not feeling too confident on the logic I used to find the read count. Therefore, I would appreciate any help from the community on this.

    My project is to find differentially expressed genes between a normal drosophila fly and a mutant. Since Drosophila has a robust (compared to some other eukaryotes) gene annotation, I decided to run a bowtie build on the Drosophila gene sequences, rather than on the individual Chromosomes. I then used Tophat on this to align the sequencing reads (The files are fastq format with solexa1.3 quality values. The length of the reads is 42bp and they are NOT paired-end).

    In the generated SAM file I would have Ambiguous reads (One read matching 2 or more different genes) and Unique reads (2 or more different reads matching the same gene). I decided to get rid of the ambiguos reads from the Sam file with the first perl script and then count the number of unique reads going to each gene with the second script. I am uncomfortable with this logic and here's why.

    Lets consider Gene A and Gene B. In the original sam file lets suppose gene A has 25 reads (all ambiguous) while gene B has the same 25 ambiguous reads plus an additional 75 unique reads. If I use my method, I drop the read count of gene A from 25 to zero, while the read-count of gene B drops from 100 to 75. In other words, gene A will have "no expression" however in reality it does. I'd like to know what you guys think, since it does not sound ok to me. But I don't know what to do.

    Is there a method to incorporate ambiguous reads? Is there a score associated with each read in the sam file or elsewhere, that outlines the strength of the gene match? Was my idea of using the gene sequences as subject for alignment rather than the chromosome sequences flawed? The total number of reads is about 1000 Mb and I get about 100 Mb of unaligned reads when I run against the gene sequences. Can this might improve if I run the sequences against the chromosomes?

    Please let me know.

    Thanks,
    Abhijit

  • #2
    Multimapping reads are definitely an issue. Tophat/bowtie will filter these for you if you set -g to 1. I think there are some papers that use some kind of likelyhood approach to assign ambiguous reads to one location. different people seem to do different things depending on their data and question.

    Comment


    • #3
      Originally posted by gen2prot View Post
      Hello All,

      Lets consider Gene A and Gene B. In the original sam file lets suppose gene A has 25 reads (all ambiguous) while gene B has the same 25 ambiguous reads plus an additional 75 unique reads. If I use my method, I drop the read count of gene A from 25 to zero, while the read-count of gene B drops from 100 to 75. In other words, gene A will have "no expression" however in reality it does. I'd like to know what you guys think, since it does not sound ok to me. But I don't know what to do.

      Is there a method to incorporate ambiguous reads? Is there a score associated with each read in the sam file or elsewhere, that outlines the strength of the gene match? Was my idea of using the gene sequences as subject for alignment rather than the chromosome sequences flawed? The total number of reads is about 1000 Mb and I get about 100 Mb of unaligned reads when I run against the gene sequences. Can this might improve if I run the sequences against the chromosomes?

      Please let me know.

      Thanks,
      Abhijit
      Tophat is designed to align reads against chromosome/genomic sequence, so I really don't know how useful it will be to use tophat against mRNA sequences.

      Finally, if gene A has 25 multireads and gene B has 75 uniq reads + 25 multireads, that is telling you can be more confident of the expression of gene B but not of gene A. However if you have sequenced deep enough (~40-60 million reads), and gene A has portions of it that are unique, you will notice uniq reads show up and you should be fine with throwing away multi reads. An important question is what proportion of genes have the entire sequence paralogous and these will have absolutely no reads mapping to them.

      Throwing away multireads introduces non-uniformity in the coverage of the data on a gene. There are papers that have dealt with multireads by using this logic. If a gene X has certain coverage and multireads align gene X as well as gene Y which has no coverage, you can choose gene X over gene Y. Read this paper:http://www.ncbi.nlm.nih.gov/pmc/arti...7/?tool=pubmed

      Comment


      • #4
        **Is there a method to incorporate ambiguous reads?**
        There are a few. I think Cufflinks (part of the bowtie/tophat pipeline) will handle them. An important question to ask here is if the ambiguous reads actually tell you anything. How do you know which gene to assign the ambiguous reads to? Do the genes serve a different biological purpose? If so, why are they so similar? Are you sure that Gene A from your example *shouldn't* have 0 reads? Maybe it really does have zero expression. Maybe it doesn't. How do you know? Does it matter? Assigning the ambiguous reads is fraught with issues. There are enough interesting results from the un-ambiguous reads that it's probably not worth worrying about the ambiguous ones except under some very specific experimental conditions.
        [The questions are rhetorical, and may not have good answers. I suspect that, because new papers are coming out on how to deal with multimapped reads, that there are no great answers yet, and/or that the answers vary by experiment.]

        ** Is there a score associated with each read in the sam file or elsewhere, that outlines the strength of the gene match? **
        There is a count of mismatched basepairs, and a length of the read, which can be used to calculate a score. There is also a Phred-style quality, and you could add up the phred score of the matched base-pairs to generate a quality score. Those are the main methods used by mapping software to determine which reads map to which genes. This may be of use, but most likely was already taken into consideration by Bowtie, and there was no clear winner: hence the ambiguity of the reads.

        **Was my idea of using the gene sequences as subject for alignment rather than the chromosome sequences flawed?**
        You have assumed that running against the whole chromosome will give no additional useful information than running against the gene annotations. That may be justified and reasonable. That may be foolish. We run against the chromosomes in part because we don't trust the gene annotations for our organisms to be 100% right. If you have 100% trust in the drosophila annotations your method is probably reasonable. I personally would want some data to justify that decision, so I would recommend mapping at least one sample against the whole chromosome and comparing the results.

        ** The total number of reads is about 1000 Mb and I get about 100 Mb of unaligned reads when I run against the gene sequences. Can this might improve if I run the sequences against the chromosomes?**
        Maybe. You should try and find out. It shouldn't take more than a day or two of computation to get those results, and comparing the outputs in IGV or another viewer should give you a good indication of how much maps to the intergenic sequences.

        Comment

        Latest Articles

        Collapse

        • seqadmin
          Current Approaches to Protein Sequencing
          by seqadmin


          Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
          04-04-2024, 04:25 PM
        • seqadmin
          Strategies for Sequencing Challenging Samples
          by seqadmin


          Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
          03-22-2024, 06:39 AM

        ad_right_rmr

        Collapse

        News

        Collapse

        Topics Statistics Last Post
        Started by seqadmin, 04-11-2024, 12:08 PM
        0 responses
        29 views
        0 likes
        Last Post seqadmin  
        Started by seqadmin, 04-10-2024, 10:19 PM
        0 responses
        31 views
        0 likes
        Last Post seqadmin  
        Started by seqadmin, 04-10-2024, 09:21 AM
        0 responses
        28 views
        0 likes
        Last Post seqadmin  
        Started by seqadmin, 04-04-2024, 09:00 AM
        0 responses
        52 views
        0 likes
        Last Post seqadmin  
        Working...
        X