  • Hobbe
    Member
    • Apr 2010
    • 29

    Splice site prediction with SOLiD RNA-seq data

    Hi all

    We are having problems predicting splice sites from our SOLiD RNA-seq data. We have a draft genome (125 Mb, a eukaryote) assembled from 454 data and are now trying to map our SOLiD reads to this genome to predict splice sites. The idea is to use these predicted splice sites to make intron hints for the gene finder Augustus so that it produces correct gene models.

    We are currently trying Bowtie/TopHat, but get weird results. For example, when working with a subset of our reads we find some splice sites, but these are not found when we add more data. We also tried Corona Lite together with SplitSeek earlier, and Bowtie/TopHat does not find sites that were found with Corona Lite/SplitSeek. On the other hand, Corona Lite/SplitSeek is time-consuming and awkward to run and often reports splice sites that are a few bp off, so that is not an ideal choice either.

    This cannot be an uncommon situation, so what are the rest of you doing in these situations? No closely related genomes have been sequenced.
  • colindaven
    Senior Member
    • Oct 2008
    • 417

    #2
    Another reasonable choice might be hmmSplicer, at least for comparison. I've had what look to be reasonable results from it in the past. I take it you're working in sequence space, not colour space?

    • Hobbe
      Member
      • Apr 2010
      • 29

      #3
      Originally posted by colindaven
      Another reasonable choice might be hmmSplicer, at least for comparison. I've had what look to be reasonable results from it in the past. I take it you're working in sequence space, not colour space?

      Thanks for the reply. No, we are working in color space. Reads converted to sequence space too easily become wrong if the original colorspace reads contain any errors. However, if you or anyone else has had good success with converting to sequence space, I would love to hear about it. The general recommendation seems to be to map in colorspace.
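To illustrate the concern about converting colorspace reads to sequence space, here is a minimal sketch of the standard two-base decoding, in which each colour (0-3) encodes the transition from the previous base. The function name and the example reads are made up for illustration. Because every decoded base depends on the one before it, a single miscalled colour changes every base after it.

```python
# Base codes are ordered so that colour == code(previous base) XOR code(next base).
BASES = "ACGT"

def decode_colorspace(read):
    """Decode a csfasta-style read such as 'T32110' (primer base + colours 0-3).

    Hypothetical helper for illustration; the primer base itself is not emitted.
    """
    code = BASES.index(read[0])          # start from the known primer base
    decoded = []
    for colour in read[1:]:
        if colour not in "0123":         # e.g. '.' for a missing colour call
            raise ValueError(f"cannot decode colour {colour!r}")
        code ^= int(colour)              # apply the transition to the previous base
        decoded.append(BASES[code])
    return "".join(decoded)

good = decode_colorspace("T32110")
bad = decode_colorspace("T30110")        # same read, second colour miscalled (2 -> 0)
mismatches = sum(a != b for a, b in zip(good, bad))
print(good, bad, mismatches)             # AGTGG AACAA 4
```

Mappers that align in colorspace avoid this, since a sequencing error there shows up as one isolated colour mismatch rather than a run of wrong bases.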

      • darked89
        Member
        • Jun 2009
        • 38

        #4
        Originally posted by Hobbe
        Hi all

        We are having problems predicting splice sites from our SOLiD RNA-seq data. We have a draft genome (125 Mb, a eukaryote) assembled from 454 data and are now trying to map our SOLiD reads to this genome to predict splice sites. The idea is to use these predicted splice sites to make intron hints for the gene finder Augustus so that it produces correct gene models.
        Augustus can cope with "hints" created by mapping Illumina reads (converted to fasta) with splice-agnostic blat. So as long as you have some gene models for training, unspliced mappings should work, I hope.

        Originally posted by Hobbe
        We are currently trying Bowtie/TopHat, but get weird results. For example, when working with a subset of our reads we find some splice sites, but these are not found when we add more data. We also tried Corona Lite together with SplitSeek earlier, and Bowtie/TopHat does not find sites that were found with Corona Lite/SplitSeek. On the other hand, Corona Lite/SplitSeek is time-consuming and awkward to run and often reports splice sites that are a few bp off, so that is not an ideal choice either.

        This cannot be an uncommon situation, so what are the rest of you doing in these situations? No closely related genomes have been sequenced.
        I also got strange results comparing TopHat and Bowtie when mapping SOLiD reads without a GFF gene-model guide (draft+ mammalian genome): Bowtie in colorspace mapped _more_ reads than TopHat. I used the latest versions (TopHat 1.3.1 and Bowtie 0.12.7).

        • Hobbe
          Member
          • Apr 2010
          • 29

          #5
          Originally posted by darked89
          Augustus can cope with "hints" created by mapping Illumina reads (converted to fasta) with splice-agnostic blat. So as long as you have some gene models for training, unspliced mappings should work, I hope.

          Blat is the preferred program to use for spliced mapping (see the Augustus RNA-seq instructions). You really need those intron hints to get correct gene models. Blat doesn't work on SOLiD data though.

          Most important in our case was to train Augustus on the actual organism. We did this using our 454 cDNA data, and with this training the number of correctly predicted genes in our small set of 14 known test genes increased from 6 to 9 (compared with using the training files for distantly related organisms that come with Augustus). Adding intron hints, we are now up to 11 out of 14 genes, but this is with only a small part of our SOLiD RNA-seq data, and we are now working on adding more hints. The only solution we have right now is the old Corona Lite pipeline together with SplitSeek by Adam Ameur. Slow, but it seems to work.
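For anyone who does get usable junctions out of a spliced mapper, turning them into intron hints is mostly coordinate bookkeeping. Below is a rough sketch, not the Augustus authors' own tooling, that assumes a TopHat-style junctions.bed (BED12, two blocks per junction, read count in the score column) and writes GFF lines in the layout I understand Augustus to accept for intron hints. The function name, the "b2h" source label, and the default `pri=4;src=E` values are assumptions; check them against your own extrinsic config.

```python
def junctions_to_hints(bed_lines, min_reads=1, priority=4, source="E"):
    """Convert two-block BED12 junction lines into intron hint lines (GFF).

    Hypothetical helper: assumes the score column holds the number of
    supporting reads and that each junction has exactly two blocks.
    """
    hints = []
    for line in bed_lines:
        if not line.strip() or line.startswith(("track", "#")):
            continue                                 # skip headers and blank lines
        f = line.rstrip("\n").split("\t")
        chrom, chrom_start, reads, strand = f[0], int(f[1]), int(f[4]), f[5]
        block_sizes = [int(x) for x in f[10].rstrip(",").split(",")]
        block_starts = [int(x) for x in f[11].rstrip(",").split(",")]
        if len(block_sizes) != 2 or reads < min_reads:
            continue                                 # not a simple junction, or too weak
        # BED is 0-based half-open; GFF is 1-based inclusive.
        intron_start = chrom_start + block_sizes[0] + 1
        intron_end = chrom_start + block_starts[1]
        attrs = f"mult={reads};pri={priority};src={source}"
        hints.append("\t".join([chrom, "b2h", "intron", str(intron_start),
                                str(intron_end), str(reads), strand, ".", attrs]))
    return hints

example = ["track name=junctions\n",
           "chr1\t100\t300\tJUNC1\t7\t+\t100\t300\t255,0,0\t2\t20,30\t0,170\n"]
for hint in junctions_to_hints(example, min_reads=2):
    print(hint)
```

Raising `min_reads` is one simple way to drop junctions supported by only a read or two, which might help when sites appear and disappear between subsets of the data.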

          IMO, there is still a great need for a good spliced mapper for SOLiD data.

          • darked89
            Member
            • Jun 2009
            • 38

            #6
            Originally posted by Hobbe
            Blat is the preferred program to use for spliced mapping (see the Augustus RNA-seq instructions). You really need those intron hints to get correct gene models. Blat doesn't work on SOLiD data though.
            The same goes for FASTQ format. Maybe there is something to be gained from converting colorspace reads to FASTA and mapping them with blat.

            Originally posted by Hobbe
            Most important in our case was to train Augustus on the actual organism. We did this using our 454 cDNA data, and with this training the number of correctly predicted genes in our small set of 14 known test genes increased from 6 to 9 (compared with using the training files for distantly related organisms that come with Augustus). Adding intron hints, we are now up to 11 out of 14 genes, but this is with only a small part of our SOLiD RNA-seq data, and we are now working on adding more hints.
            You could also try CEGMA (http://korflab.ucdavis.edu/Datasets/cegma/) to produce yet another training or testing set. At times there is no way around semi-manual annotation, again for either the training or the testing set. Blastp your Augustus predictions: genes with high conservation and 100% coverage in other species are likely to be real.

            Originally posted by Hobbe
            The only solution we have right now is the old Corona Lite pipeline together with SplitSeek by Adam Ameur. Slow, but it seems to work.
            Is that the setup currently recommended by the SplitSeek author? The SplitSeek 1.3.4 manual recommends the Whole Transcriptome Pipeline.

            Originally posted by Hobbe
            IMO, there is still a great need for a good spliced mapper for SOLiD data.
            Indeed. I have found some other software (X-MATE), but it requires junction libraries and uses yet another pipeline (http://solidsoftwaretools.com/gf/project/mapreads/).
            See:

            • adameur
              Member
              • Nov 2009
              • 23

              #7
              Hi,

              Just a few words about SplitSeek from the author. It only works with the split read mapper from the AB Whole Transcriptome Pipeline, and always has. I'm aware it is awkward, but unfortunately there are currently no good alternatives.

              The good news is that AB WTP actually works fine once you get it to run. I even managed to run some 75bp reads from the SOLiD 5500 through WTP and SplitSeek (using 25bp anchors in the mapping), so it might remain an option in the future.

              /Adam
