Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • mrawlins
    replied
    **Is there a method to incorporate ambiguous reads?**
    There are a few. I think Cufflinks (part of the bowtie/tophat pipeline) will handle them. An important question to ask here is if the ambiguous reads actually tell you anything. How do you know which gene to assign the ambiguous reads to? Do the genes serve a different biological purpose? If so, why are they so similar? Are you sure that Gene A from your example *shouldn't* have 0 reads? Maybe it really does have zero expression. Maybe it doesn't. How do you know? Does it matter? Assigning the ambiguous reads is fraught with issues. There are enough interesting results from the un-ambiguous reads that it's probably not worth worrying about the ambiguous ones except under some very specific experimental conditions.
    [The questions are rhetorical, and may not have good answers. I suspect that, because new papers are coming out on how to deal with multimapped reads, that there are no great answers yet, and/or that the answers vary by experiment.]

    ** Is there a score associated with each read in the sam file or elsewhere, that outlines the strength of the gene match? **
    There is a count of mismatched basepairs, and a length of the read, which can be used to calculate a score. There is also a Phred-style quality, and you could add up the phred score of the matched base-pairs to generate a quality score. Those are the main methods used by mapping software to determine which reads map to which genes. This may be of use, but most likely was already taken into consideration by Bowtie, and there was no clear winner: hence the ambiguity of the reads.

    **Was my idea of using the gene sequences as subject for alignment rather than the chromosome sequences flawed?**
    You have assumed that running against the whole chromosome will give no additional useful information than running against the gene annotations. That may be justified and reasonable. That may be foolish. We run against the chromosomes in part because we don't trust the gene annotations for our organisms to be 100% right. If you have 100% trust in the drosophila annotations your method is probably reasonable. I personally would want some data to justify that decision, so I would recommend mapping at least one sample against the whole chromosome and comparing the results.

    ** The total number of reads is about 1000 Mb and I get about 100 Mb of unaligned reads when I run against the gene sequences. Can this might improve if I run the sequences against the chromosomes?**
    Maybe. You should try and find out. It shouldn't take more than a day or two of computation to get those results, and comparing the outputs in IGV or another viewer should give you a good indication of how much maps to the intergenic sequences.

    Leave a comment:


  • thinkRNA
    replied
    Originally posted by gen2prot View Post
    Hello All,

    Lets consider Gene A and Gene B. In the original sam file lets suppose gene A has 25 reads (all ambiguous) while gene B has the same 25 ambiguous reads plus an additional 75 unique reads. If I use my method, I drop the read count of gene A from 25 to zero, while the read-count of gene B drops from 100 to 75. In other words, gene A will have "no expression" however in reality it does. I'd like to know what you guys think, since it does not sound ok to me. But I don't know what to do.

    Is there a method to incorporate ambiguous reads? Is there a score associated with each read in the sam file or elsewhere, that outlines the strength of the gene match? Was my idea of using the gene sequences as subject for alignment rather than the chromosome sequences flawed? The total number of reads is about 1000 Mb and I get about 100 Mb of unaligned reads when I run against the gene sequences. Can this might improve if I run the sequences against the chromosomes?

    Please let me know.

    Thanks,
    Abhijit
    Tophat is designed to align reads against chromosome/genomic sequence, so I really don't know how useful it will be to use tophat against mRNA sequences.

    Finally, if gene A has 25 multireads and gene B has 75 uniq reads + 25 multireads, that is telling you can be more confident of the expression of gene B but not of gene A. However if you have sequenced deep enough (~40-60 million reads), and gene A has portions of it that are unique, you will notice uniq reads show up and you should be fine with throwing away multi reads. An important question is what proportion of genes have the entire sequence paralogous and these will have absolutely no reads mapping to them.

    Throwing away multireads introduces non-uniformity in the coverage of the data on a gene. There are papers that have dealt with multireads by using this logic. If a gene X has certain coverage and multireads align gene X as well as gene Y which has no coverage, you can choose gene X over gene Y. Read this paper:http://www.ncbi.nlm.nih.gov/pmc/arti...7/?tool=pubmed

    Leave a comment:


  • Greg
    replied
    Multimapping reads are definitely an issue. Tophat/bowtie will filter these for you if you set -g to 1. I think there are some papers that use some kind of likelyhood approach to assign ambiguous reads to one location. different people seem to do different things depending on their data and question.

    Leave a comment:


  • gen2prot
    started a topic Question on counting number of reads per gene

    Question on counting number of reads per gene

    Hello All,

    I had a question about per gene read-count data. I wrote 2 short perl scripts to find the read-count per gene, from the sam file I get in the Tophat output. However, I am not feeling too confident on the logic I used to find the read count. Therefore, I would appreciate any help from the community on this.

    My project is to find differentially expressed genes between a normal drosophila fly and a mutant. Since Drosophila has a robust (compared to some other eukaryotes) gene annotation, I decided to run a bowtie build on the Drosophila gene sequences, rather than on the individual Chromosomes. I then used Tophat on this to align the sequencing reads (The files are fastq format with solexa1.3 quality values. The length of the reads is 42bp and they are NOT paired-end).

    In the generated SAM file I would have Ambiguous reads (One read matching 2 or more different genes) and Unique reads (2 or more different reads matching the same gene). I decided to get rid of the ambiguos reads from the Sam file with the first perl script and then count the number of unique reads going to each gene with the second script. I am uncomfortable with this logic and here's why.

    Lets consider Gene A and Gene B. In the original sam file lets suppose gene A has 25 reads (all ambiguous) while gene B has the same 25 ambiguous reads plus an additional 75 unique reads. If I use my method, I drop the read count of gene A from 25 to zero, while the read-count of gene B drops from 100 to 75. In other words, gene A will have "no expression" however in reality it does. I'd like to know what you guys think, since it does not sound ok to me. But I don't know what to do.

    Is there a method to incorporate ambiguous reads? Is there a score associated with each read in the sam file or elsewhere, that outlines the strength of the gene match? Was my idea of using the gene sequences as subject for alignment rather than the chromosome sequences flawed? The total number of reads is about 1000 Mb and I get about 100 Mb of unaligned reads when I run against the gene sequences. Can this might improve if I run the sequences against the chromosomes?

    Please let me know.

    Thanks,
    Abhijit

Latest Articles

Collapse

  • seqadmin
    Recent Developments in Metagenomics
    by seqadmin





    Metagenomics has improved the way researchers study microorganisms across diverse environments. Historically, studying microorganisms relied on culturing them in the lab, a method that limits the investigation of many species since most are unculturable1. Metagenomics overcomes these issues by allowing the study of microorganisms regardless of their ability to be cultured or the environments they inhabit. Over time, the field has evolved, especially with the advent...
    09-23-2024, 06:35 AM
  • seqadmin
    Understanding Genetic Influence on Infectious Disease
    by seqadmin




    During the COVID-19 pandemic, scientists observed that while some individuals experienced severe illness when infected with SARS-CoV-2, others were barely affected. These disparities left researchers and clinicians wondering what causes the wide variations in response to viral infections and what role genetics plays.

    Jean-Laurent Casanova, M.D., Ph.D., Professor at Rockefeller University, is a leading expert in this crossover between genetics and infectious...
    09-09-2024, 10:59 AM

ad_right_rmr

Collapse

News

Collapse

Topics Statistics Last Post
Started by seqadmin, 10-02-2024, 04:51 AM
0 responses
13 views
0 likes
Last Post seqadmin  
Started by seqadmin, 10-01-2024, 07:10 AM
0 responses
21 views
0 likes
Last Post seqadmin  
Started by seqadmin, 09-30-2024, 08:33 AM
0 responses
25 views
0 likes
Last Post seqadmin  
Started by seqadmin, 09-26-2024, 12:57 PM
0 responses
18 views
0 likes
Last Post seqadmin  
Working...
X