  • colindaven
    replied
    The best aligners for CS data use iterative trimming, see for example NovoalignCS and Lifescope.



  • Chirag
    replied
    Try this:



  • Chirag
    replied
    Hey,

    Could anyone help me trim color space data?
    I am using bowtie/tophat to map the data, and I map only around 40% of the reads. Reads are 75 bp long and I allowed 3 mismatches; allowing 2 mismatches gave me only 35% mapping.

    Could someone help me trim the ends of the reads? For example, if the last 10 or 15 bases were trimmed, more reads could probably map. I haven't yet found any software that does trimming for SOLiD.

    regards
    Chirag
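    A minimal sketch of fixed-length trimming for color space reads, offered as an illustration and not as an existing tool. It assumes the usual csfasta layout (a `>` header line, then a primer base followed by one color call per position) and a matching quality file with one space-separated value per color. The function names are hypothetical.

```python
def trim_cs_read(read: str, keep: int) -> str:
    """Keep the leading primer base plus the first `keep` color calls."""
    if keep < 0:
        raise ValueError("keep must be non-negative")
    return read[: 1 + keep]


def trim_qual_line(line: str, keep: int) -> str:
    """Keep the first `keep` quality values (assumed space-separated)."""
    return " ".join(line.split()[:keep])


def trim_records(lines, keep, trim_fn):
    """Pass comment and header lines through; trim every other line."""
    for raw in lines:
        line = raw.rstrip("\n")
        if not line or line[0] in "#>":
            yield line
        else:
            yield trim_fn(line, keep)


if __name__ == "__main__":
    # Hypothetical 10-color read trimmed to 6 colors; for 75 bp reads with
    # the last 15 positions removed one would pass keep=60 instead.
    demo = [">read_1", "T0123012301"]
    for out in trim_records(demo, 6, trim_cs_read):
        print(out)
```

    The read file and the quality file have to be trimmed to the same length, otherwise the aligner will see mismatched records.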



  • bacdirector
    replied
    Ok, here it is - Percent of Reads that Map that Also Map Uniquely

    In our study that started this thread, I originally reported a doubling of reads mapped from 25 million to 50 million achieved by trimming all reads >35 to 35. At 50b, we get only 25 million reads mapped. This was based on empirical comparisons of reported valid adj errors.

    The answer to the question of what percent of reads map uniquely is: it barely matters. The percent of mapped reads that also map uniquely ranges only from 85.5% to 86.5%, increasing from 25b trimmed reads to 50b reads.

    So far, we've seen the optimal overall response to trimming at 35b. This is for Bioscope, not Nextgene (my bad!!!), but without any masking of -1's, so in response to feedback from the thread, we'll be expanding our repertoire to include progressive mapping as well as non-call masking. And bfast. And bowtie. We have all the trimmed data sets ready to go, it's just a matter of punching buttons...

    thanks, everyone for your interest and feedback!

    James Lyons-Weiler
    Director
    Bioinformatics Analysis Core
    University of Pittsburgh



  • drio
    replied
    For your evaluations (simulations), take a look at this.
    The author of bfast talks about simulating reads and then plotting the results using ROC curves to find the best tool for the problem.



  • bacdirector
    replied
    thanks, everyone

    well, a lot of great feedback and perspective here; looks like a lot of people are doing different things; some use a masking tool, which should allow reads to map in spite of non-calls; that makes sense; we have heard of progressive mapping, but not done it yet, and will be studying that in comparison to what we've found. also, re: # uniquely mapped reads, will be sure to report back on that after we look at that closely (over the various trimmed lengths).

    with the various strategies for data representation (filtering on qvals, trimming back based on qvals) and the various algorithms (bfast and bowtie, and right now we're doing nextgene and bioscope), there seems, as usual, to be a combinatorial number of strategies. we've been careful to look at the number of valid adjacent errors as well, to be sure we're not just mapping short reads anywhere and everywhere.

    yeah, the dibase encoding = higher accuracy is real.

    looks like we have a lot of careful comparative evaluations to do!

    any additional thoughts would be appreciated.



  • Michael.James.Clark
    replied
    Originally posted by poisson200 View Post
    Thanks for the correction; I am interested to know where the error trends are modelled? Is that in a publication somewhere and/or do some mapping packages correct/know about these errors when mapping reads and correct for it?

    To clarify; you say that colourspace reads are more accurate than base space "theoretically", but because of chemistry issues, in reality they are not. Is that what you mean?
    Not exactly. Sorry, that may have been a little unclear. You can see any of the tech specs for SOLiD to help understand how the base correction works. Check the ABI site--many publications listed here: http://www.appliedbiosystems.com/abs...equencing.html

    Basically, colorspace reads are more accurate than base space reads because of the ability to correct colorspace errors. There are (IIRC) five repeat resets during SOLiD sequencing, so a single read is observed five separate times at different ligation starting positions. Due to the increased number of observations, one can correct away errors thanks to knowing how the colorspace-to-basespace translation would be affected by a specific mismatching colorspace read.
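    To make the translation point concrete, here is a small sketch of two-base encoding using the standard color code (a color is the XOR of the 2-bit codes of adjacent bases, with A=0, C=1, G=2, T=3). The function names and the example sequence are made up for illustration. It shows that a single miscalled color changes every downstream base when decoded naively, while a real SNP changes exactly two adjacent colors, which is what lets an aligner tell the two apart.

```python
BASES = "ACGT"  # index doubles as the 2-bit code: A=0, C=1, G=2, T=3


def encode(primer: str, seq: str) -> str:
    """Translate a base sequence into a color space read (primer + colors)."""
    prev, colors = primer, []
    for base in seq:
        colors.append(str(BASES.index(prev) ^ BASES.index(base)))
        prev = base
    return primer + "".join(colors)


def decode(cs_read: str) -> str:
    """Naively translate a color space read back to bases, left to right."""
    prev, out = cs_read[0], []
    for color in cs_read[1:]:
        prev = BASES[BASES.index(prev) ^ int(color)]
        out.append(prev)
    return "".join(out)


def color_differences(a: str, b: str) -> list:
    """0-based color positions (primer excluded) where two reads differ."""
    return [i for i, (x, y) in enumerate(zip(a[1:], b[1:])) if x != y]


if __name__ == "__main__":
    truth = encode("T", "ACGTACGT")        # T31313131
    miscall = "T31013131"                  # one color call is wrong
    snp = encode("T", "ACTTACGT")          # true G->T change at base 3

    print(decode(truth))                   # ACGTACGT
    print(decode(miscall))                 # ACCATGCA, wrong from the error on
    print(color_differences(truth, snp))   # [2, 3], two adjacent colors
```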

    The chemistry gets messy at the 3' end, so the accuracy starts falling off, but it doesn't get as bad as single base sequencing after 35 bases; it's more like the last five bases or so. So 2-base encoded colorspace reads are more accurate than base space reads generally, but along the length of the read, they may be less accurate at the very 3' end. However, that inaccuracy at the end has zero impact on gapped alignment anchored by masking at the 5' end.

    Nils Homer has a nice article about two-base encoding and how it works to improve accuracy here: http://www.biomedcentral.com/1471-2105/10/175
    Last edited by Michael.James.Clark; 11-03-2010, 09:36 AM.



  • westerman
    replied
    The ABI/LifeTech Bioscope software does 'progressive' mapping where it starts considering reads that map at 50 bases (and with various mismatches), then at 49 bases, etc. This approach is probably superior to one of simply chopping off the reads to 35 bases.

    As an example, a partial statistics file from one of my recent SOLiD runs shows:

    Read Length 50 0 mismatches 19,901,946 (64.86%)
    Read Length 50 1 mismatches 1,336,272 ( 4.35%)
    ...
    Read Length 35 0 mismatches 137,601 ( 0.45%)
    Read Length 35 1 mismatches 40,989 ( 0.13%)
    ...
    Down to a read length of 25.
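    The progressive idea can be sketched as a toy loop: try the full read at increasing mismatch counts, and only if nothing maps, drop one position from the 3' end and try again. This is an assumption-laden illustration, not Bioscope's actual algorithm: it uses brute-force Hamming comparison against a short made-up reference, and `progressive_map` is a hypothetical name.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def progressive_map(read: str, reference: str, min_len: int = 25,
                    max_mismatches: int = 1):
    """Return (length, mismatches, positions) for the longest mappable prefix.

    Lengths are tried from full length down to `min_len`; at each length,
    mismatch counts are tried from 0 up to `max_mismatches`.
    """
    for length in range(len(read), min_len - 1, -1):
        prefix = read[:length]
        for mismatches in range(max_mismatches + 1):
            hits = [
                start
                for start in range(len(reference) - length + 1)
                if hamming(prefix, reference[start:start + length]) == mismatches
            ]
            if hits:
                return length, mismatches, hits
    return None  # nothing mapped even at the shortest allowed length


if __name__ == "__main__":
    # Made-up color strings: the read matches the reference at offset 4,
    # except that its last three calls are wrong.
    reference = "0012310230112003213302"
    read = "310230112321"
    print(progressive_map(read, reference, min_len=6, max_mismatches=1))
```

    Unlike a hard trim to 35, a read that is clean along its full length keeps all of its positions, and only reads with a noisy 3' end get shortened.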



  • drio
    replied
    I am looking forward to seeing the number of uniquely mapped reads when trimming compared to the non-trimmed version. I am with MJC: if you use bwa, NovoalignCS or bfast (particularly the last two) you'll see an increase in mapped reads without trimming. It would be nice if you could recompute your alignments with those and show the numbers.



  • dsidote
    replied
    Bacdirector,

    What did you use to trim your SOLiD reads?

    Dave



  • poisson200
    replied
    Originally posted by Michael.James.Clark View Post
    This is misinformation. An error in colorspace does mess with the rest of the read momentarily, but colorspace errors are easily corrected thanks to the same phenomenon because there will be an inconsistency across the different ligations, and colorspace error trends are known and modeled. In actuality, colorspace reads are a lot more accurate in theory than base space (but in chemistry, they may get messier at the ends).
    Thanks for the correction; I am interested to know where the error trends are modelled? Is that in a publication somewhere and/or do some mapping packages correct/know about these errors when mapping reads and correct for it?

    To clarify; you say that colourspace reads are more accurate than base space "theoretically", but because of chemistry issues, in reality they are not. Is that what you mean?



  • Chipper
    replied
    bacdirector,

    you need to allow at least 10% errors to get decent 50 bp alignments. Trimming can help to rescue reads, we have used that strategy in SOLiD publications and a similar strategy was also used in the Gibbs/Lupski/ABI genome paper in NEJM (http://www.nejm.org/doi/full/10.1056...99307083290205).

    Better yet is to use a decent mapping strategy in the first place. If you redo your alignment with BFAST it will find an anchor in the good part of the read and still use the lower-qv part of the read, which should give you more coverage. Likewise bowtie will use the good part as a seed and still use full-length reads. Bowtie (at least early cs versions) sometimes gives different base calls than bfast for the same reads, so it may give you more of a reference bias, or fewer false SNPs.

    Given that the read is read in 5 bp steps, your errors don't necessarily accumulate towards the end; if you have a bad primer E you will still not be able to map them at 35 bp or 25 bp. With bfast you can easily design an index that handles this.
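    A rough sketch of why trimming does not rescue a bad primer round. It assumes a simplified model in which each of the five primer rounds reads every fifth color position (0-based position i belongs to round i mod 5); the real assignment of positions to primers may be offset from this. The 0/1 mask is only an illustration of the idea of skipping the affected positions, not bfast's actual index syntax, and the function names are hypothetical.

```python
def positions_for_round(round_index: int, read_length: int, rounds: int = 5):
    """Color positions (0-based) read by one primer round, under the
    simplified assumption that position i belongs to round i % rounds."""
    return list(range(round_index, read_length, rounds))


def mask_ignoring_round(bad_round: int, length: int, rounds: int = 5) -> str:
    """Build a 0/1 string with 0 at every position read by the bad round."""
    return "".join("0" if i % rounds == bad_round else "1" for i in range(length))


if __name__ == "__main__":
    bad_round = 4  # hypothetical: the last primer round performed poorly
    for length in (50, 35, 25):
        affected = positions_for_round(bad_round, length)
        print(length, len(affected))   # 50 10, then 35 7, then 25 5
    print(mask_ignoring_round(bad_round, 10))  # 1111011110
```

    Under this model a fifth of the positions stay unreliable at any trimmed length, so a seed pattern that skips them is the more useful fix.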

    Still, a thorough evaluation of mapping strategies for CS would be useful, especially to look into effects on SNP calls.
    Last edited by Chipper; 11-02-2010, 11:31 PM.



  • Michael.James.Clark
    replied
    Have you tried using something other than NextGENe for alignment? BFAST? Novoalign? BWA? These are aligners you should definitely assess for SOLiD data. This phenomenon seems like an alignment issue, not a technology issue.

    If you have tried these algorithms and you're seeing such a phenomenon still, it may be time to call up LifeTech and ask them to come see if something is wrong with your machine. It does not sound right to me.

    Originally posted by poisson200 View Post
    I think an error in colourspace causes the rest of the read to be incorrect, whereas it might look like a SNP in Illumina/base space. I am more than happy to be corrected here if I am wrong but based on that scenario, it is possibly more beneficial for cutting off the ends of colourspace reads than Illumina's.
    This is misinformation. An error in colorspace does mess with the rest of the read momentarily, but colorspace errors are easily corrected thanks to the same phenomenon because there will be an inconsistency across the different ligations, and colorspace error trends are known and modeled. In actuality, colorspace reads are a lot more accurate in theory than base space (but in chemistry, they may get messier at the ends).



  • poisson200
    replied
    Originally posted by bacdirector View Post
    50/52 million reads mapped....
    96% reads mapping; that sounds impressive to me. I wonder if that is best unique reads mapping. For some applications, I think it is better to ignore reads that map to multiple positions in the genome with the same number of mismatches and alignment length. I am thinking that you don't exclude those reads?

    To be honest, it is not so novel to trim reads, even colourspace reads, as we have been experimenting with that ourselves. However, although we get an improvement, we don't get 96% of total reads mapping hence my slight doubts. If you can get a publication out of it, why not? Also, let me know, I can contribute some data :-). Did Watson and Crick really first elucidate the structure of DNA on their own? Maybe/maybe not but they are famous. So go for a publication if you can.

    I also wonder if there is a bit of difference with colourspace and Illumina data. I think an error in colourspace causes the rest of the read to be incorrect, whereas it might look like a SNP in Illumina/base space. I am more than happy to be corrected here if I am wrong but based on that scenario, it is possibly more beneficial for cutting off the ends of colourspace reads than Illumina's.

    What exactly is your command line for NextGene and/or bowtie?



  • bacdirector
    replied
    50/52 million reads mapped....
