75+35 Pair-end SOLiD RNA-seq data analysis


  • carmeyeii
    replied
    Thanks hildebs!

    I'm analyzing a "second-hand" dataset generated on a SOLiD 4. It is a transcriptome mate-pair library with 52 x 37 nt reads, and I cannot for the life of me find the protocol that was used to generate those specific read lengths. I have F3 and R3 reads, so I am assuming it is a circularization protocol, but I do not know what the size selection parameters were, or how the circles were cut to produce the final fragments. This information would be very valuable for more accurate mapping.

    Any knowledge would be greatly appreciated!

    Thanks a lot,

    Carmen


  • hildebs
    replied
    Originally posted by carmeyeii View Post
    Hello,

    I am analyzing 9 RNA-seq libraries which were sequenced on SOLiD.
    It seems like the best aligners for SOLiD data are SHRiMP, Novoalign and LifeScope. And from what I've read, LifeScope seems to be the only colorspace aligner with splicing capabilities.

    I've worked with Illumina data before and mapped using TopHat, but I don't really use the novel junction discovery option - I supply a reference transcriptome against which it maps during the first round, and the remaining reads are then mapped against the genome [without novel junction discovery].

    Is there something similar that can be done using any of these three colorspace aligners?

    What have you found best when working with SOLiD RNA libraries?

    Thanks a lot for your help,

    Carmen

    Hello Carmen,

    If you have access to LifeScope, that is what I have used in the past. You can specify a .gtf file describing the transcriptome and use it for mapping, similar to how you have used TopHat in the past.

    LifeScope first aligns to a "filter" fasta, if specified, to filter out the reads that map to "junk" sequences (adapters, rRNA sequences). The reads that map to this filter are excluded from further analysis. Then it maps to exon junctions (pulled from the gtf file, F3 read only) and to exons (F5 read only), and then finally to the genome, for those reads not mapped to the other references. It then merges all of the mapped reads into one file, pairs them and creates a .bam.

    I personally do not have much experience with the other mappers. LifeScope came with our 5500 install and I decided it was the best way to go.

    I hope this helps!


  • carmeyeii
    replied
    Hello,

    I am analyzing 9 RNA-seq libraries which were sequenced on SOLiD.
    It seems like the best aligners for SOLiD data are SHRiMP, Novoalign and LifeScope. And from what I've read, LifeScope seems to be the only colorspace aligner with splicing capabilities.

    I've worked with Illumina data before and mapped using TopHat, but I don't really use the novel junction discovery option - I supply a reference transcriptome against which it maps during the first round, and the remaining reads are then mapped against the genome [without novel junction discovery].

    Is there something similar that can be done using any of these three colorspace aligners?

    What have you found best when working with SOLiD RNA libraries?

    Thanks a lot for your help,

    Carmen


  • snetmcom
    replied
    Originally posted by endether View Post
    Hi hildebs,

    Thank you so much for your information. Our library prep group did indeed use the ribo-minus kit for depletion. We actually tried different rRNA filter files, because there is no official rRNA annotation for the genome we are working on. With some filter files more than 70% of reads are filtered out, but with others only around 10%, so the filter files themselves might be the problem. After applying the filter fasta, we are left with <1M reads per library that can be mapped to exonic regions, which makes our downstream analysis really hard.

    We did further analysis by BLASTing the genes to which a large number of mapping-quality-0 reads were mapped against a repeats database. We found that those genes are somehow associated with 45S rRNAs, so I think it's now clear that this is rRNA contamination. We are now considering how to "rescue" the data and materials as well as redoing the library preparation, and I will definitely let our group know about the ribo-zero option.

    Thank you so much!

    Best,
    Zheng

    Was this ribominus or ribominus v2? I just now heard about the v2 kits.


  • endether
    replied
    Originally posted by hildebs View Post
    Hey endether,

    I have observed similar issues with PE SOLiD data. You may want to ask your library prep group which kit they used for rRNA depletion. The sequencing core I work with has used both the ribo-minus and ribo-zero kits for depletion. The ribo-minus kit is very hit-or-miss, and you may need to do it twice. After mapping, I have seen libraries that still contain up to 50% rRNA contamination.
    The ribo-zero kit, however, gets consistently low (<5%) ribosomal levels.

    If you use a filter fasta for LifeScope mapping, you should be able to quantify rRNA levels as well as mapping levels in the same step. If you still have a high number of low-quality reads after mapping, you may need to remove those (with samtools or some such) before transcript assembly (if you have any reads left).

    Hi hildebs,

    Thank you so much for your information. Our library prep group did indeed use the ribo-minus kit for depletion. We actually tried different rRNA filter files, because there is no official rRNA annotation for the genome we are working on. With some filter files more than 70% of reads are filtered out, but with others only around 10%, so the filter files themselves might be the problem. After applying the filter fasta, we are left with <1M reads per library that can be mapped to exonic regions, which makes our downstream analysis really hard.

    We did further analysis by BLASTing the genes to which a large number of mapping-quality-0 reads were mapped against a repeats database. We found that those genes are somehow associated with 45S rRNAs, so I think it's now clear that this is rRNA contamination. We are now considering how to "rescue" the data and materials as well as redoing the library preparation, and I will definitely let our group know about the ribo-zero option.

    Thank you so much!

    Best,
    Zheng


  • hildebs
    replied
    Hey endether,

    I have observed similar issues with PE SOLiD data. You may want to ask your library prep group which kit they used for rRNA depletion. The sequencing core I work with has used both the ribo-minus and ribo-zero kits for depletion. The ribo-minus kit is very hit-or-miss, and you may need to do it twice. After mapping, I have seen libraries that still contain up to 50% rRNA contamination.
    The ribo-zero kit, however, gets consistently low (<5%) ribosomal levels.

    If you use a filter fasta for LifeScope mapping, you should be able to quantify rRNA levels as well as mapping levels in the same step. If you still have a high number of low-quality reads after mapping, you may need to remove those (with samtools or some such) before transcript assembly (if you have any reads left).
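
The low-quality filtering step mentioned above can be sketched in plain Python on SAM text. This is only an illustration under assumptions: the records are made-up toy lines, the function name `split_by_mapq` and the threshold of 1 are hypothetical choices, and on real BAM files the `-q` option of `samtools view` does the same job.

```python
def split_by_mapq(sam_lines, min_mapq=1):
    """Split SAM lines into (kept, dropped) by the MAPQ column (5th field).

    Header lines (starting with '@') are always kept.
    min_mapq=1 is an assumed threshold that removes only MAPQ-0 alignments.
    """
    kept, dropped = [], []
    for line in sam_lines:
        if line.startswith("@"):
            kept.append(line)
            continue
        fields = line.rstrip("\n").split("\t")
        mapq = int(fields[4])
        if mapq >= min_mapq:
            kept.append(line)
        else:
            dropped.append(line)
    return kept, dropped


# Toy SAM records (hypothetical data, not from the thread).
toy_sam = [
    "@SQ\tSN:chr1\tLN:1000",
    "r1\t0\tchr1\t100\t0\t50M\t*\t0\t0\t*\t*",
    "r2\t0\tchr1\t200\t37\t50M\t*\t0\t0\t*\t*",
    "r3\t16\tchr1\t300\t0\t50M\t*\t0\t0\t*\t*",
]

kept, dropped = split_by_mapq(toy_sam)
n_alignments = len(toy_sam) - 1  # exclude the header line
mapq0_fraction = len(dropped) / n_alignments
print(f"kept {len(kept) - 1} of {n_alignments} alignments")
```

Counting the dropped records before discarding them also gives the share of MAPQ-0 alignments, which is useful for judging how much multi-mapping (for example from rRNA) is present.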


  • endether
    replied
    Originally posted by morellr View Post
    I was hoping to see an update of this thread -- Can you give us some details on how your 75-35 PE reads turned out? I'm interested in knowing what percentage of the 35 (F5) reads mapped to the same chromosome as the 75 (F3) reads.

    The results actually aren't so good. We ended up using LifeScope for the alignment because it gave a much better mapping rate (>70%). However, the reported mapping quality is really low: more than 80% of alignments have a mapping quality of 0. We later found that this is because those reads potentially map to multiple loci. We are still trying to find the actual cause; right now it looks like ribosomal RNA contamination, even though our protocol includes an rRNA removal step.


  • morellr
    replied
    Outcome?

    I was hoping to see an update of this thread -- Can you give us some details on how your 75-35 PE reads turned out? I'm interested in knowing what percentage of the 35 (F5) reads mapped to the same chromosome as the 75 (F3) reads.
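
For anyone wanting to compute that percentage themselves, here is a rough sketch that works on SAM text. It assumes standard SAM flag bits and the RNAME/RNEXT columns, counts each pair once via the first-in-pair record, and uses invented toy records. Whether F3 or F5 is flagged as first in pair depends on the aligner output, so treat that as an assumption to check.

```python
def same_chrom_fraction(sam_lines):
    """Fraction of pairs with both mates mapped whose mates share a reference.

    Uses SAM flags: 0x1 paired, 0x4 read unmapped, 0x8 mate unmapped,
    0x40 first in pair, 0x100/0x800 secondary/supplementary.
    Returns None when no fully mapped pair is found.
    """
    same = total = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if not flag & 0x1 or flag & 0x4 or flag & 0x8:
            continue  # need a pair with both mates mapped
        if not flag & 0x40 or flag & 0x900:
            continue  # count each pair once, primary records only
        total += 1
        rname, rnext = fields[2], fields[6]
        if rnext == "=" or rnext == rname:
            same += 1
    return same / total if total else None


# Toy records (hypothetical): one same-chromosome pair, one split pair,
# and one pair whose mate is unmapped (ignored).
records = [
    "p1\t65\tchr1\t100\t30\t75M\t=\t400\t0\t*\t*",
    "p1\t129\tchr1\t400\t30\t35M\t=\t100\t0\t*\t*",
    "p2\t65\tchr1\t500\t30\t75M\tchr2\t900\t0\t*\t*",
    "p2\t129\tchr2\t900\t30\t35M\tchr1\t500\t0\t*\t*",
    "p3\t73\tchr1\t700\t30\t75M\t=\t700\t0\t*\t*",
]

fraction = same_chrom_fraction(records)
print(fraction)
```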


  • snetmcom
    replied
    This pattern is normal for SOLiD chemistry. If LifeScope is an option, you should always start there.


  • endether
    replied
    Originally posted by colindaven View Post
    You could try a number of things.

    I don't know of any splice aware aligners that will work well with PE SOLiD data.

    Firstly, your PE reads are probably bad quality - try trimming to perhaps 20bp. You can check their quality using FastQC or similar.

    In terms of aligners, you could try LifeScope/Bioscope and CLC trial version. Bioscope has some RNA specific alignment tools which I haven't tried, and CLC seems to do a rather good job with SOLiD on bacterial genomes at least (more alignments than NovoalignCS, and more good SNPs called apparently, but I can't quantify this globally yet).

    NovoalignCS is a capable aligner for SOLiD.

    Also, have a look on this section of the Seqanswers site for further comments on alignment.

    Lastly, can you get a spliced reference dataset of some sort from TopHat to align against?

    Thank you so much for the suggestions.

    I ran FastQC and did indeed find some problems with the quality of our reads.

    Here is what it looks like for the F3 end of one library.

    Basically, the quality seems to drop every 5 bases. I am wondering whether this indicates that something is wrong with the sequencing machine. I randomly picked several libraries and lanes, and they all show the same pattern. Other than discarding this "noisy" data, do you have any suggestions?


    As for aligners, I am wondering why SOLiD reads in general seem to have a lower mapping rate with Bowtie. From what I have read on the forum, the best case with Bowtie is only around 50%. Is this a result of generally low read quality, or of the aligners' "incompatibility" with color-space encoding? Mismatch tolerance seems to be one issue: Bowtie allows at most 3 mismatches, but a single base mismatch in a read translates into 2 mismatches in color space.
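
The point about one base mismatch becoming two color mismatches can be checked with a small sketch of the di-base encoding. It assumes the usual SOLiD color code, in which each color is the XOR of the 2-bit codes of two adjacent bases (A=0, C=1, G=2, T=3). The sequences and function names are made up for illustration.

```python
# 2-bit base codes; the colour of a base pair is the XOR of the two codes.
BASE_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}


def to_colorspace(seq):
    """Return the colours (0-3) for each pair of adjacent bases in seq."""
    return [BASE_BITS[a] ^ BASE_BITS[b] for a, b in zip(seq, seq[1:])]


def color_mismatches(ref, read):
    """Return the colour positions at which two equal-length sequences differ."""
    ref_colors = to_colorspace(ref)
    read_colors = to_colorspace(read)
    return [i for i, (x, y) in enumerate(zip(ref_colors, read_colors)) if x != y]


reference = "ACGTACGT"
snp_read = "ACGTTCGT"  # hypothetical read with one substitution (A -> T at base index 4)

print(to_colorspace(reference))   # colours of the reference
print(to_colorspace(snp_read))    # colours of the read
print(color_mismatches(reference, snp_read))  # → [3, 4]
```

An interior substitution changes both colors that involve the altered base, so it consumes two of the allowed mismatches, whereas a single isolated color difference is more likely a sequencing error than a real variant.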

    I will definitely try LifeScope next. As for the splice junctions, I had not noticed that TopHat can generate the junction database separately. I think there might be some overhead in translating alignments from junction coordinates back to the genome, but if an existing tool can deal with this automatically, I will try not to reinvent the wheel.

    Again, thanks a lot for your help!

    Best regards,
    Zheng


  • colindaven
    replied
    You could try a number of things.

    I don't know of any splice aware aligners that will work well with PE SOLiD data.

    Firstly, your PE reads are probably bad quality - try trimming to perhaps 20bp. You can check their quality using FastQC or similar.

    In terms of aligners, you could try LifeScope/Bioscope and CLC trial version. Bioscope has some RNA specific alignment tools which I haven't tried, and CLC seems to do a rather good job with SOLiD on bacterial genomes at least (more alignments than NovoalignCS, and more good SNPs called apparently, but I can't quantify this globally yet).

    NovoalignCS is a capable aligner for SOLiD.

    Also, have a look on this section of the Seqanswers site for further comments on alignment.

    Lastly, can you get a spliced reference dataset of some sort from TopHat to align against?


  • endether
    started a topic 75+35 Pair-end SOLiD RNA-seq data analysis

    75+35 Pair-end SOLiD RNA-seq data analysis

    Hello Everyone,

    I have recently been working with SOLiD RNA-seq short-read data. The reads are paired-end, with F3 at 75 bp and F5 at 35 bp. I am struggling with the question of which alignment tool to choose.

    Basically our goal is to perform alignment and expression summarization followed by differential expression analysis. In particular, we wish to be able to identify some novel transcripts or novel splicing events.

    I have worked with Illumina short-read data from the same species using TopHat+Cufflinks before, and they gave me reasonable results. TopHat handled the splice junctions successfully, and around 80% of reads could be mapped to the genome.

    With the SOLiD data in TopHat, only around 20% of reads mapped. For example, when I aligned one of my libraries to the transcriptome (the first step in TopHat), only 4.64% of reads mapped. Of the remaining unmapped reads, only 19.92% could be mapped to the genome. The chopped segments had a mapping rate of around 19% on junctions.
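
As a quick check on how the two stage-wise rates combine, the arithmetic can be written out as below. This is a simplification that considers only the transcriptome and genome steps and ignores the segment/junction step. The variable names are just for illustration.

```python
# Stage-wise rates quoted in the post.
transcriptome_rate = 0.0464   # fraction of all reads mapped to the transcriptome
genome_rate_of_rest = 0.1992  # fraction of the remaining reads mapped to the genome

# Reads mapped overall = first-stage hits + second-stage hits among the leftovers.
overall = transcriptome_rate + (1 - transcriptome_rate) * genome_rate_of_rest

print(f"{overall:.1%}")  # → 23.6%
```

So the cumulative rate from these two steps is a little under a quarter of the reads, consistent with the "around 20%" figure.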

    I assumed there must be something wrong with my parameter settings. I searched the forum and tuned a few parameters as suggested (basically allowing more mismatches). However, I got similar results, with a low mapping rate.

    I am wondering whether there are any alternative tools that can perform splice-junction-aware alignment. I have read about BFAST and SHRiMP in other posts, but it seems that neither of them supports splice junctions or novel transcript discovery.


    Thanks a lot,
    Zheng