A multi-split mapping algorithm for circular RNA, splicing, trans-splicing, and ...

  • A multi-split mapping algorithm for circular RNA, splicing, trans-splicing, and ...

    ABSTRACT:

    Numerous high-throughput sequencing studies focus on detecting conventionally spliced mRNAs in RNA-seq data. However, non-standard RNAs arising through gene fusion, circularization, or trans-splicing are often neglected. We introduce a novel, unbiased algorithm to detect splice junctions from single-end cDNA sequences. In contrast to other methods, our approach accommodates multi-junction structures. Our method compares favorably with competing tools on conventionally spliced mRNAs and, with a gain of up to 40% of recall, systematically outperforms them on reads with multiple splits, trans-splicing and circular products. The algorithm is integrated into our mapping tool segemehl (www.bioinf.uni-leipzig.de/Software/segemehl/).

    Steve Hoffmann, Christian Otto, Gero Doose, Andrea Tanzer, David Langenberger, Sabina Christ, Manfred Kunz, Lesca Holdt, Daniel Teupser, Jörg Hackermüller and Peter F Stadler: 'A multi-split mapping algorithm for circular RNA, splicing, trans-splicing, and fusion detection', Genome Biology, 15:R34, doi:10.1186/gb-2014-15-2-r34 (2014)
    ecSeq Bioinformatics is Europe’s leading provider of hands-on bioinformatics workshops and professional data analysis in the field of Next-Generation Sequencing (NGS).

  • #2
    circular RNA

    dear segemehl programmer,

    which settings are best for finding back-spliced (circular) transcripts from 50 bp paired-end (PE) Illumina reads?

    i would run with the following parameters:

    Code:
    ./segemehl.x -t 20 -T -Y -S -i $index -d $fa -q $fq/15607.1.fq.gz -p $fq/15607.2.fq.gz | gzip > $out/CoCa_15607.sam.gz
    should i change MEDAH?

    how can i use Haarz to extract specifically the back-spliced reads?

    dietmar

    PS: the documentation is for version 0.1.3 - is there a newer one?


    • #3
      Hi Dietmar,

      build the source

      >make
      >make testrealign.x

      to do the mapping:
      >segemehl.x -q file.fq -d hg19.fa -i hg19.idx -S -s -o file.out

      Option -S turns on the splice feature, which also covers all non-standard splicing events. Option -s silences the progress bar.

      to call the junctions:
      >testrealign.x -d hg19.fa -q file.out -n

      Option -n is necessary to stop the program from realigning reads, which takes much longer.
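
To make the term concrete: a back-spliced (circular) junction is one where the read continues at a position that lies upstream of where it left off, relative to the transcript's strand. The following is a minimal sketch of that definition only. The `Junction` record and the `classify` function are hypothetical names invented for illustration and do not reflect the output format of segemehl or testrealign.x. Only same-chromosome and cross-chromosome cases are distinguished, and strand switches are ignored.

```python
from typing import NamedTuple


class Junction(NamedTuple):
    """One split between two fragments of a read (hypothetical layout)."""
    donor_chrom: str
    donor_pos: int       # position where the first read fragment ends
    acceptor_chrom: str
    acceptor_pos: int    # position where the second read fragment starts
    strand: str          # '+' or '-', strand of the transcript


def classify(j: Junction) -> str:
    """Label a junction as 'linear', 'backsplice', or 'trans'."""
    # Fragments on different chromosomes cannot be a regular or circular splice.
    if j.donor_chrom != j.acceptor_chrom:
        return "trans"
    if j.strand == "+":
        # On the plus strand a regular splice moves to a larger coordinate.
        return "linear" if j.acceptor_pos > j.donor_pos else "backsplice"
    # On the minus strand the transcript runs towards smaller coordinates.
    return "linear" if j.acceptor_pos < j.donor_pos else "backsplice"
```

For example, `classify(Junction("chr1", 5000, "chr1", 1000, "+"))` gives `"backsplice"`, because the read jumps back to an earlier position on the plus strand.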

      Hope that helps!


      • #4
        @luitpold

        dear luitpold,

        thank you!

        but i always get this error:
        Code:
        testrealign.x: libs/memory.c:18: bl_realloc: Assertion `ptr != ((void *)0)' failed.
        ./testrealign_CoCa_CoNo.sh: line 11:  5078 Aborted                 (core dumped) ./testrealign.x -d $fa -q $out/CoCa_15607.sam -n -U $out/15607_splitfile.bed -T $out/15607_transsplit.bed
        any hint what could be wrong? is the SAM file too large (42 GB)? i have 96 GB of RAM.

        dietmar


        • #5
          Hi Dietmar,

          seems to be an "out of memory" issue. You might want to test it on a smaller SAM file … otherwise contact the developers directly …


          • #6
            Dietmar,

            one more thought … is your SAM file sorted?
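
A quick way to check this before running testrealign.x is sketched below, assuming that "sorted" means coordinate order (reference sequences in the order of the @SQ header lines, then ascending position). The function name `is_coordinate_sorted` is made up for this example; it reads plain-text SAM lines and does not use any segemehl code.

```python
from typing import Dict, Iterable, Tuple


def is_coordinate_sorted(sam_lines: Iterable[str]) -> bool:
    """Return True if the alignment records are in coordinate order.

    Reference order is taken from the @SQ header lines. References not
    listed there are ranked by first appearance. Unmapped records
    (RNAME '*') are expected at the end.
    """
    ref_rank: Dict[str, int] = {}
    last: Tuple[float, int] = (-1, -1)
    for line in sam_lines:
        if line.startswith("@"):
            # Header line: remember the order of the reference sequences.
            if line.startswith("@SQ"):
                for field in line.rstrip("\n").split("\t")[1:]:
                    if field.startswith("SN:"):
                        ref_rank.setdefault(field[3:], len(ref_rank))
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 4:
            continue  # skip blank or malformed lines
        rname, pos = cols[2], int(cols[3])
        if rname == "*":
            key: Tuple[float, int] = (float("inf"), 0)
        else:
            key = (ref_rank.setdefault(rname, len(ref_rank)), pos)
        if key < last:
            return False  # this record sorts before the previous one
        last = key
    return True
```

It can be fed an open file handle, e.g. `is_coordinate_sorted(open("file.out"))`, and works line by line, so a large SAM file does not have to fit into memory.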


            • #7
              thank you,

              sorting solved the problem.

              dietmar


              • #8
                Dear segemehl development team,

                Using segemehl on the data sets from Memczak et al. (Nature, 2013), I obtained tens of thousands of circular RNA splice junctions. However, when I compared them to the data published by Memczak et al., I found that 61 of the ~250 circular RNAs in the HEK293 cell line were missing from my segemehl results, which differs from what is reported in your manuscript. Do you think adding the trimming options (-Y -T) would make a difference?
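
One way to narrow down such a discrepancy is to compare the two junction lists with a small coordinate tolerance, since lists from different sources may differ by one base (for example 0-based BED versus 1-based coordinates). The sketch below assumes both lists have already been parsed into `(chromosome, start, end)` tuples. The function name and the tuple layout are assumptions for illustration, not a segemehl or published-data format.

```python
from typing import Dict, Iterable, List, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end), hypothetical layout


def missing_junctions(published: Iterable[Interval],
                      detected: Iterable[Interval],
                      tolerance: int = 0) -> List[Interval]:
    """Return published junctions with no detected junction within `tolerance` bases."""
    # Index detected junctions by chromosome to avoid comparing across chromosomes.
    by_chrom: Dict[str, List[Tuple[int, int]]] = {}
    for chrom, start, end in detected:
        by_chrom.setdefault(chrom, []).append((start, end))

    missing: List[Interval] = []
    for chrom, start, end in published:
        found = any(abs(start - s) <= tolerance and abs(end - e) <= tolerance
                    for s, e in by_chrom.get(chrom, []))
        if not found:
            missing.append((chrom, start, end))
    return missing
```

If the number of missing junctions drops sharply when going from `tolerance=0` to `tolerance=1`, a coordinate convention mismatch is a likely cause rather than a mapping difference.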

                Also, I found it difficult to use testrealign.x to look for junction sites in large SAM files. I tried the -B option to split the result by chromosome, but it still does not work: the resulting BED files are empty.

                Thank you


                • #9
                  If you are interested in using segemehl to detect fusion transcripts and/or circularized RNAs, I can recommend the following hands-on course:
                  Discovering standard and non-standard RNA transcripts - How to detect canonical splicing, circular RNAs, trans-splicing, and fusion transcripts

                  The developers of the algorithm will explain step by step how to use segemehl to detect standard and non-standard transcripts.

                  • #10
                    Is there any article, paper, or study in which segemehl has been used to find fusion genes (e.g. one showing a fusion gene found by segemehl)? Has segemehl been compared with other gene fusion finders? On average, how many fusion genes are reported per sample? What is the wet-lab validation rate of the fusions found by segemehl?

                    For my case, reporting hundreds or thousands of candidate fusion genes per sample is useless, because according to the medical/biological literature fusion genes are very rare events (i.e. 98% of all patient samples contain zero fusions). When fusions are indeed found, there are only a few per sample: about 25 per sample is the absolute maximum, and the average is around 1 to 3. Please note that fusion genes are not SNPs, indels, or alternative splicing events. Here the scientific "null" hypothesis is that there are on average 0 to 5 fusion genes per sample! This hypothesis can be rejected only with wet-lab data and NOT with in silico data! If a tool reports over 100 candidate fusion genes per sample, then under this hypothesis at least ~95% of its calls are false positives!
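
The arithmetic behind that last figure, as a small sketch: if at most `max_true` real fusions can exist in a sample and a tool reports `reported` candidates, then at least `reported - max_true` of them must be false. The function name is hypothetical, and strictly speaking the quantity is the fraction of false calls among the reported ones.

```python
def min_false_call_fraction(reported: int, max_true: int) -> float:
    """Lower bound on the fraction of reported candidates that must be false."""
    if reported <= 0:
        raise ValueError("reported must be a positive number of candidates")
    if max_true < 0:
        raise ValueError("max_true cannot be negative")
    # At best, every true fusion is among the reported candidates.
    return max(0, reported - max_true) / reported


# 100 reported candidates, at most 5 real fusions per sample.
print(f"{min_false_call_fraction(100, 5):.0%}")
```

With 100 reported candidates and an upper limit of 5 real fusions, the bound is 95 out of 100 calls.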

                    I would like to use it for finding pathogenic/somatic fusion genes, but I searched very hard and was not able to find anything which suggests that segemehl has ever been used for that purpose.
                    Last edited by ntn12; 05-18-2014, 12:46 AM.


                    • #11
                      Aren't most of these questions answered in the segemehl publication? They compared their tool with 7 other state-of-the-art tools and validated their results on available RNA-seq datasets.

                      As far as I can judge, the group that developed segemehl is a pure bioinformatics group, so they did not perform any wet-lab validation but implemented a tool that does what it should (compared to other algorithms). And since it was published only a few months ago, I think we have to wait until articles appear in which segemehl was used to find fusion genes.

                      I'm curious about these future publications, since the examples shown in the paper are quite impressive. But the future will show if segemehl is really that good.


                      • #12
                        Originally posted by Paul Newport:
                        Aren't most of these questions answered in the segemehl publication? They compared their tool with 7 other state-of-the-art tools and validated their results on available RNA-seq datasets.
                        ...
                        .
                        Could you point to the publication where SEGEMEHL is used for finding fusion genes?

                        If you mean this:
                        http://bioinformatics.oxfordjournals...s.btu146.short

                        then there SEGEMEHL is compared to STAR, BOWTIE2, BWA-MEM, BLAT, etc., and not one of these is a gene fusion finder! The word "fusion" is not mentioned even once in the entire article (except in the references). Examples of fusion gene finders are SOAPfuse, deFuse, FusionHunter, etc. How does SEGEMEHL compare to these? Here is a nice comparison of fusion gene finders: http://code.google.com/p/fusioncatcher/wiki/comparison

                        Did I miss something here?

                        I mean by fusion genes this:
                        http://erc.endocrinology-journals.or.../R143.full.pdf

                        P.S. A split-read mapper is not the same as a fusion gene finder!
                        Last edited by ntn12; 05-18-2014, 10:15 AM.


                        • #13
                          Dear ntn12,

                          thanks for your comments and questions.

                          segemehl itself is not a fusion gene finder. It is a mapping tool that can detect split reads, and the resulting set of split reads can be used to call fusion genes. However, this has to be done in a separate downstream analysis and is not part of the segemehl algorithm. I hope that makes things clearer.
                          Last edited by ecSeq Bioinformatics; 05-19-2014, 12:01 AM.
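
As a rough illustration of what such a downstream step could look like (this is not part of segemehl and not the developers' method), the sketch below counts split reads per gene pair and keeps the pairs with enough support. It assumes each split read has already been annotated with the genes hit by its two fragments. The function name, the input layout, and the `min_support` threshold are all hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple

GenePair = Tuple[str, str]  # (gene of first fragment, gene of second fragment)


def fusion_candidates(split_reads: Iterable[GenePair],
                      min_support: int = 2) -> List[Tuple[GenePair, int]]:
    """Count split reads per gene pair and keep pairs seen at least `min_support` times."""
    support: Counter = Counter()
    for gene_a, gene_b in split_reads:
        # Splits within one gene are ordinary (or circular) splicing, not fusions.
        if gene_a != gene_b:
            support[(gene_a, gene_b)] += 1
    kept = [(pair, n) for pair, n in support.items() if n >= min_support]
    # Strongest support first; ties are broken alphabetically for stable output.
    return sorted(kept, key=lambda item: (-item[1], item[0]))
```

A real fusion caller would apply further filters on top of such counts (for example against paralogous genes or read-through transcripts), which are omitted here.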

                          • #14
                            Originally posted by ntn12:
                            Here is a nice comparison of fusion gene finders: http://code.google.com/p/fusioncatcher/wiki/comparison
                            Sorry, but I don't understand the list shown on the linked page.

                            My questions would be:
                            1. Where do these 40 fusion genes come from?
                            2. Why does only FusionCatcher find all of these?
                            3. Why is this list on the FusionCatcher website?


                            That looks a bit suspicious to me!


                            • #15
                              Originally posted by Paul Newport:
                              Where do these 40 fusion genes come from?
                              I just did some research and found on the FusionCatcher website:

                              FusionCatcher has been used originally for finding novel and known fusion genes in breast tumor cell lines BT474, SKBR3, MCF7, KPL4 as shown in the following articles:
                              • S. Kangaspeska, S. Hultsch, H. Edgren, D. Nicorici, A. Murumägi, O.P. Kallioniemi, Reanalysis of RNA-sequencing data reveals several additional fusion genes with multiple isoforms, PLOS One 2012. http://dx.plos.org/10.1371/journal.pone.0048745
                              • H. Edgren, A. Murumagi, S. Kangaspeska, D. Nicorici, V. Hongisto, K. Kleivi, I.H. Rye, S. Nyberg, M. Wolf, A.L. Borresen-Dale, O.P. Kallioniemi, Identification of fusion genes in breast cancer by paired-end RNA-sequencing, Genome Biology 2011, Vol. 12. http://genomebiology.com/2011/12/1/R6


                              These are the same two publications shown on the "comparison" page. So the 40 genes were predicted using FusionCatcher? Honestly?
