Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • mrawlins
    replied
    **Is there a method to incorporate ambiguous reads?**
    There are a few. I think Cufflinks (part of the bowtie/tophat pipeline) will handle them. An important question to ask here is if the ambiguous reads actually tell you anything. How do you know which gene to assign the ambiguous reads to? Do the genes serve a different biological purpose? If so, why are they so similar? Are you sure that Gene A from your example *shouldn't* have 0 reads? Maybe it really does have zero expression. Maybe it doesn't. How do you know? Does it matter? Assigning the ambiguous reads is fraught with issues. There are enough interesting results from the un-ambiguous reads that it's probably not worth worrying about the ambiguous ones except under some very specific experimental conditions.
    [The questions are rhetorical, and may not have good answers. I suspect that, because new papers are coming out on how to deal with multimapped reads, that there are no great answers yet, and/or that the answers vary by experiment.]

    ** Is there a score associated with each read in the sam file or elsewhere, that outlines the strength of the gene match? **
    There is a count of mismatched basepairs, and a length of the read, which can be used to calculate a score. There is also a Phred-style quality, and you could add up the phred score of the matched base-pairs to generate a quality score. Those are the main methods used by mapping software to determine which reads map to which genes. This may be of use, but most likely was already taken into consideration by Bowtie, and there was no clear winner: hence the ambiguity of the reads.

    **Was my idea of using the gene sequences as subject for alignment rather than the chromosome sequences flawed?**
    You have assumed that running against the whole chromosome will give no additional useful information than running against the gene annotations. That may be justified and reasonable. That may be foolish. We run against the chromosomes in part because we don't trust the gene annotations for our organisms to be 100% right. If you have 100% trust in the drosophila annotations your method is probably reasonable. I personally would want some data to justify that decision, so I would recommend mapping at least one sample against the whole chromosome and comparing the results.

    ** The total number of reads is about 1000 Mb and I get about 100 Mb of unaligned reads when I run against the gene sequences. Can this might improve if I run the sequences against the chromosomes?**
    Maybe. You should try and find out. It shouldn't take more than a day or two of computation to get those results, and comparing the outputs in IGV or another viewer should give you a good indication of how much maps to the intergenic sequences.

    Leave a comment:


  • thinkRNA
    replied
    Originally posted by gen2prot View Post
    Hello All,

    Lets consider Gene A and Gene B. In the original sam file lets suppose gene A has 25 reads (all ambiguous) while gene B has the same 25 ambiguous reads plus an additional 75 unique reads. If I use my method, I drop the read count of gene A from 25 to zero, while the read-count of gene B drops from 100 to 75. In other words, gene A will have "no expression" however in reality it does. I'd like to know what you guys think, since it does not sound ok to me. But I don't know what to do.

    Is there a method to incorporate ambiguous reads? Is there a score associated with each read in the sam file or elsewhere, that outlines the strength of the gene match? Was my idea of using the gene sequences as subject for alignment rather than the chromosome sequences flawed? The total number of reads is about 1000 Mb and I get about 100 Mb of unaligned reads when I run against the gene sequences. Can this might improve if I run the sequences against the chromosomes?

    Please let me know.

    Thanks,
    Abhijit
    Tophat is designed to align reads against chromosome/genomic sequence, so I really don't know how useful it will be to use tophat against mRNA sequences.

    Finally, if gene A has 25 multireads and gene B has 75 uniq reads + 25 multireads, that is telling you can be more confident of the expression of gene B but not of gene A. However if you have sequenced deep enough (~40-60 million reads), and gene A has portions of it that are unique, you will notice uniq reads show up and you should be fine with throwing away multi reads. An important question is what proportion of genes have the entire sequence paralogous and these will have absolutely no reads mapping to them.

    Throwing away multireads introduces non-uniformity in the coverage of the data on a gene. There are papers that have dealt with multireads by using this logic. If a gene X has certain coverage and multireads align gene X as well as gene Y which has no coverage, you can choose gene X over gene Y. Read this paper:http://www.ncbi.nlm.nih.gov/pmc/arti...7/?tool=pubmed

    Leave a comment:


  • Greg
    replied
    Multimapping reads are definitely an issue. Tophat/bowtie will filter these for you if you set -g to 1. I think there are some papers that use some kind of likelyhood approach to assign ambiguous reads to one location. different people seem to do different things depending on their data and question.

    Leave a comment:


  • gen2prot
    started a topic Question on counting number of reads per gene

    Question on counting number of reads per gene

    Hello All,

    I had a question about per gene read-count data. I wrote 2 short perl scripts to find the read-count per gene, from the sam file I get in the Tophat output. However, I am not feeling too confident on the logic I used to find the read count. Therefore, I would appreciate any help from the community on this.

    My project is to find differentially expressed genes between a normal drosophila fly and a mutant. Since Drosophila has a robust (compared to some other eukaryotes) gene annotation, I decided to run a bowtie build on the Drosophila gene sequences, rather than on the individual Chromosomes. I then used Tophat on this to align the sequencing reads (The files are fastq format with solexa1.3 quality values. The length of the reads is 42bp and they are NOT paired-end).

    In the generated SAM file I would have Ambiguous reads (One read matching 2 or more different genes) and Unique reads (2 or more different reads matching the same gene). I decided to get rid of the ambiguos reads from the Sam file with the first perl script and then count the number of unique reads going to each gene with the second script. I am uncomfortable with this logic and here's why.

    Lets consider Gene A and Gene B. In the original sam file lets suppose gene A has 25 reads (all ambiguous) while gene B has the same 25 ambiguous reads plus an additional 75 unique reads. If I use my method, I drop the read count of gene A from 25 to zero, while the read-count of gene B drops from 100 to 75. In other words, gene A will have "no expression" however in reality it does. I'd like to know what you guys think, since it does not sound ok to me. But I don't know what to do.

    Is there a method to incorporate ambiguous reads? Is there a score associated with each read in the sam file or elsewhere, that outlines the strength of the gene match? Was my idea of using the gene sequences as subject for alignment rather than the chromosome sequences flawed? The total number of reads is about 1000 Mb and I get about 100 Mb of unaligned reads when I run against the gene sequences. Can this might improve if I run the sequences against the chromosomes?

    Please let me know.

    Thanks,
    Abhijit

Latest Articles

Collapse

  • seqadmin
    Best Practices for Single-Cell Sequencing Analysis
    by seqadmin



    While isolating and preparing single cells for sequencing was historically the bottleneck, recent technological advancements have shifted the challenge to data analysis. This highlights the rapidly evolving nature of single-cell sequencing. The inherent complexity of single-cell analysis has intensified with the surge in data volume and the incorporation of diverse and more complex datasets. This article explores the challenges in analysis, examines common pitfalls, offers...
    06-06-2024, 07:15 AM
  • seqadmin
    Latest Developments in Precision Medicine
    by seqadmin



    Technological advances have led to drastic improvements in the field of precision medicine, enabling more personalized approaches to treatment. This article explores four leading groups that are overcoming many of the challenges of genomic profiling and precision medicine through their innovative platforms and technologies.

    Somatic Genomics
    “We have such a tremendous amount of genetic diversity that exists within each of us, and not just between us as individuals,”...
    05-24-2024, 01:16 PM

ad_right_rmr

Collapse

News

Collapse

Topics Statistics Last Post
Started by seqadmin, Yesterday, 07:24 AM
0 responses
11 views
0 likes
Last Post seqadmin  
Started by seqadmin, 06-13-2024, 08:58 AM
0 responses
11 views
0 likes
Last Post seqadmin  
Started by seqadmin, 06-12-2024, 02:20 PM
0 responses
16 views
0 likes
Last Post seqadmin  
Started by seqadmin, 06-07-2024, 06:58 AM
0 responses
184 views
0 likes
Last Post seqadmin  
Working...
X