  • Melissa
    replied
Unmappable reads are new to me. I'm particularly interested in difficulties in de novo transcriptome assembly.

    Can you define these unmappable reads more precisely? Are you referring to the raw data or to high-quality reads left after filtering? If you mean filtered reads, then sequencing errors cannot contribute to this problem. I believe these reads result from technical or experimental problems rather than from the nature of the sequences themselves: for example, low-quality reads caused by temperature problems on the platform, or artifacts created at the edges of the flow cell.

    Most RNA-seq data sets contain 30-40% rRNA. After filtering out low-quality reads, contaminating reads, and poly(A) tails, having 50-60% of reads left is REALLY good to begin with. So the answer is NO, it is not a waste to end up with only 50-60% of the reads. High redundancy is another reason why some reads are simply not useful.
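    To make that attrition concrete, here is a back-of-the-envelope sketch in Python. Every percentage is a hypothetical assumption chosen to match the ranges above, not a measurement from any real run:

    # Hypothetical read attrition for one RNA-seq lane; every fraction
    # below is an illustrative assumption, not a measured value.
    total_reads = 10_000_000

    filters = [
        ("rRNA",            0.35),  # middle of the 30-40% range above
        ("low quality",     0.05),  # assumed quality-filter loss
        ("adapter/poly(A)", 0.03),  # assumed trimming loss
    ]

    surviving = total_reads
    for label, frac in filters:
        removed = int(surviving * frac)
        surviving -= removed
        print(f"removed {removed:>9,} ({label}); {surviving:,} remain")

    print(f"usable fraction: {surviving / total_reads:.0%}")
    # With these assumptions ~60% of the raw reads survive filtering,
    # before any of them are even aligned.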

    I'm not sure why some reads cannot map to the reference genome; they just don't. Some reads that overlap/span exon splice junctions may be lost during mapping. RNA processing and other regulatory mechanisms also sound like a plausible explanation.
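    The splice-junction point is easy to see with a toy example. The sequences below are invented purely for illustration: a read straddling an exon-exon junction matches the spliced transcript exactly, but in the genome the intron interrupts it, so an exact (or low-mismatch) genomic aligner drops it.

    # Toy example: junction-spanning reads match the transcript, not the genome.
    exon1  = "ATGGCGTACGTT"
    intron = "GTAAGT" + "T" * 20 + "CAG"   # GT...AG intron, made up
    exon2  = "GCCATTGACTGA"

    genome     = exon1 + intron + exon2
    transcript = exon1 + exon2             # intron spliced out

    # A 12 bp read straddling the junction: 6 bp from each exon.
    read = exon1[-6:] + exon2[:6]

    print("exact match in genome:    ", read in genome)       # False
    print("exact match in transcript:", read in transcript)   # True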

    Cheers,
    Melissa



  • francesco.vezzi
    replied
    Originally posted by NGSfan:
    "In my opinion, it seems really odd that after spending several thousand dollars on an expensive experiment you only get to use 50-60% of the data, no?"
    Well, this depends a lot on which reference you are aligning against. We have an Illumina Genome Analyzer, and two years ago we sequenced the genome of the grapevine. When we sequence the same reference plant that we used to build the assembly, we align more than 74% of the reads with parameters quite similar to yours. When we sequence a different grapevine variety, we can align only 60% of all the reads. That is not strange, because the two organisms differ.
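    As a rough illustration of that effect, the Monte Carlo sketch below (my own toy model, not grapevine data) treats each base of a 25 bp read as independently mismatching the reference with some probability and counts how many reads stay within a 2-mismatch budget. Point divergence alone costs only a few percent; the larger drop between varieties presumably comes from indels and genuinely novel sequence, which this model ignores.

    import random
    random.seed(1)

    READ_LEN, MAX_MM, N_READS = 25, 2, 100_000

    def alignable_fraction(p_mismatch):
        """Fraction of simulated reads with <= MAX_MM mismatches when each
        base independently differs from the reference with prob p_mismatch."""
        ok = 0
        for _ in range(N_READS):
            mm = sum(random.random() < p_mismatch for _ in range(READ_LEN))
            ok += mm <= MAX_MM
        return ok / N_READS

    for p in (0.002, 0.01, 0.02):   # illustrative divergence levels
        print(f"{p:.1%} per-base divergence -> {alignable_fraction(p):.1%} alignable")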

    You are using 25-base reads, which means the data are quite old (Illumina can now produce 75 bp paired-end reads) and were probably analysed with the old pipeline. You can find an interesting data set at http://tinyurl.com/68aeq3 and in the article "De novo assembly of the Pseudomonas syringae pv. syringae B728a genome using Illumina/Solexa short sequence reads".

    As for the non-trivial information hidden in unaligned reads, there are many possibilities, such as repeated regions carrying many differences from the reference sequence, or entirely novel insertions.

    Originally posted by NGSfan:
    "In my opinion, it seems really odd that after spending several thousand dollars on an expensive experiment you only get to use 50-60% of the data, no?"
    No, I disagree. In an Illumina experiment with seven lanes, 50% of the reads still amounts to more than 2 gigabases of data, and all of it is obtained at a fraction of the cost of the methods available only a year ago.
    If anything, the problem is the opposite: there is too much data to analyse, and we are more or less using tools that were developed for totally different kinds of data.

    Obviously, this is only the opinion of a four-month PhD student...



  • NGSfan
    replied
    Hi francesco!

    Thanks for joining in on the discussion.

    For example, I examined some published data with 25 bp reads and aligned them to the mouse genome allowing at most 2 mismatches. About 60% of the reads were alignable under those parameters. I checked the remaining unmapped reads for possible contamination, but only about 2% mapped to human or E. coli, for example.

    Perhaps I'll take another look at the unmapped reads, allow 3 or 4 mismatches, and see how many more I can recover. But I'm sure many will still be unmapped, and I wonder whether that is just because there are more errors than Illumina advertises, or whether something else is going on.
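    If it helps, one way to do that second pass is to pull the unmapped reads back out into FASTQ and re-align them with a looser mismatch setting. A minimal sketch, assuming the alignments are in BAM format and pysam is installed (the file names here are hypothetical):

    import pysam  # assumes pysam is installed and "aligned.bam" exists

    with pysam.AlignmentFile("aligned.bam", "rb") as bam, \
         open("unmapped.fastq", "w") as out:
        for read in bam.fetch(until_eof=True):   # include unplaced reads
            if read.is_unmapped:
                quals = "".join(chr(q + 33) for q in read.query_qualities)
                out.write(f"@{read.query_name}\n"
                          f"{read.query_sequence}\n+\n{quals}\n")

    # The FASTQ can then be re-aligned with a more permissive mismatch
    # limit, e.g. bowtie's -v 3 mode (bowtie caps -v at 3 mismatches).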

    By non-trivial, do you mean functional sequences? Any guesses as to what these other sequences could be? I'm very curious.

    In my opinion, it seems really odd that after spending several thousand dollars on an expensive experiment you only get to use 50-60% of the data, no?



  • francesco.vezzi
    replied
    Hi,
    I think you need to be more precise: what is the length of the reads you want to place, and how many errors do you allow per read? Are reads that place at multiple locations counted as placed or not?

    Actually, I spent last weekend wondering about unplaced reads (much to my girlfriend's delight). In particular, I think that if we exclude reads of low quality, the unplaced reads hide some non-trivial information...



  • NGSfan
    started a topic why low mapping rates for RNAseq?


    Hi everyone!

    I must say, I'm very happy to find a community where we can discuss this new technology.

    I have searched the forum, but could not turn up a thread that discusses the issue of unmappable RNAseq reads.

    According to the article "The digital generation" by Nathan Blow, Dr. Liu is quoted as saying that it is not unusual for only "40-50%" of the data generated to be mappable. There is some mention that this unmappable sequence may come from antisense transcripts or from artifacts of RNA processing.

    Interestingly, J. Shendure mentions being able to achieve 95% mapping with genomic DNA.

    Losing 50-60% of the RNA-seq data seems quite high. Has anyone looked into this more carefully? Are the majority of these unmappable reads just full of sequencing errors? Could there be contamination? And what is meant by artifacts in making a sequencing library from RNA? What would these artifacts look like, such that they become unmappable?
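    On the sequencing-error question, a quick binomial sanity check is possible (my own back-of-the-envelope model, assuming errors hit bases independently): if a 25 bp read is only discarded once it exceeds 2 mismatches, even a pessimistic per-base error rate leaves most reads mappable, so errors alone seem unlikely to explain a 40-60% loss.

    from math import comb

    def frac_lost(p, read_len=25, max_mm=2):
        """P(more than max_mm erroneous bases) if each base is wrong
        independently with probability p."""
        p_ok = sum(comb(read_len, k) * p**k * (1 - p)**(read_len - k)
                   for k in range(max_mm + 1))
        return 1 - p_ok

    for p in (0.005, 0.01, 0.02, 0.05):   # assumed per-base error rates
        print(f"error rate {p:.1%}: {frac_lost(p):.1%} of reads exceed 2 mismatches")

    # Even at a pessimistic 5% per-base error rate only ~13% of 25 bp reads
    # carry more than 2 errors, far short of a 40-60% loss.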

    Thanks for any thoughts.
