  • junction mapping in color space

    Hi @ all,

    I need to deal with a kind of junction mapping. We need to discover insertion sites of a specific retrovirus in the human genome. To detect these sites, we fragmented human DNA and enriched for fragments that are homologous to the virus's known (LTR) ends. Finally, we did a 50 bp SOLiD run to sequence the DNA junctions (viral DNA / human DNA).

    Now I am looking for an aligner that can report partial-read alignments against a reference genome. I thought BFAST could do something like this, but I don't see how. On the other hand, I'm pretty familiar with Bowtie and considered using its trim-read-prior-to-alignment options, but again I don't know whether that's applicable to color-space reads.

    Does anyone have experience in detecting non-exonic junctions (such as those that arise from transposons, too)? Any ideas?

    Thanks in advance!
    Uwe

  • #2
    Hi Uwe,

    You could try SplitSeek, a method we originally developed for junction mapping in RNA-seq data (http://genomebiology.com/2010/11/3/R34). I think it should work also in your case, at least in theory...

    Adam


    • #3
      Hi Amadeur,

      thanks for your reply. According to figure 1 (and result paragraph 1), SplitSeek splits every read into two related subreads and also leaves a gap between the two. I'm afraid this will not work for color-space data because of the recursive nature of color-space encoding. At the very least, the gap between the subreads would break the recursion, wouldn't it?
      Do you think that SplitSeek can (besides the color-space limitation) detect junctions that arose from recombinations, insertions or the like? I perceive SplitSeek as a highly specialized exon-boundary finder. Do you see a way to set up SplitSeek to look for insertions of certain "contigs/exons" in a huge genome (human, mouse)?

      Uwe


      • #4
        Hi Uwe,

        In fact, it's quite the opposite. The split read mapping is done using the AB WT pipeline, which was developed specifically for SOLiD. So it works for color space, while normal base space is more problematic.

        We have seen that SplitSeek can find small insertions (one or a few bp) and deletions (of varying lengths). I can't see why it shouldn't be possible to also find other types of rearrangements like inversions, translocations and so on... But you'll need sufficient coverage. Also, repeat regions where the split reads can't be mapped uniquely might be a problem.

        You'll find the code at the SOLiD tools webpage if you want to give it a try (http://solidsoftwaretools.com/gf/project/splitseek/). We have tested it on human and mouse so I don't think the genome size will be a problem.

        If you'd like to run this on genomic reads (and not RNA-seq data), I suggest that you first remove the reads that were aligned full-length to the genome with some other mapper (like corona lite). In that way you'll reduce the number of reads in the input file and speed up the program.
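The pre-filtering step suggested above can be sketched in Python as follows. This is a minimal sketch and not part of SplitSeek or Corona Lite: it assumes csfasta input in which each read is a '>' header line followed by one sequence line, and that the IDs of the reads that aligned full-length have already been collected from the other mapper's output. The helper names `read_csfasta` and `drop_mapped` and the example reads are made up for illustration.

```python
def read_csfasta(lines):
    """Yield (read_id, sequence) pairs from csfasta-style lines, skipping '#' comments."""
    read_id = None
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith(">"):
            read_id = line[1:]
        elif read_id is not None:
            yield read_id, line
            read_id = None


def drop_mapped(lines, mapped_ids):
    """Keep only the reads whose IDs are NOT in mapped_ids (assumed full-length hits)."""
    mapped_ids = set(mapped_ids)
    return [(rid, seq) for rid, seq in read_csfasta(lines) if rid not in mapped_ids]


# Made-up example reads; real IDs and sequences come from the csfasta file.
example = [
    "# example header comment",
    ">1_10_100_F3",
    "T1021301230123123012301",
    ">1_10_200_F3",
    "T0012330120031221030112",
    ">1_11_50_F3",
    "T3210012331200123310021",
]
mapped = {"1_10_200_F3"}  # hypothetical: IDs reported as full-length hits by the other mapper

remaining = drop_mapped(example, mapped)
for rid, seq in remaining:
    print(">" + rid)
    print(seq)
```

Only the reads left in `remaining` would then be passed on to the split read mapping, which is where the speed-up comes from.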

        Adam


        • #5
          @adameur,

          I am new to SOLiD data and I am thinking of using your SplitSeek program, since I think it is the one that fits my needs best. It would be very nice if you could help me with these questions:

          1) Is it mandatory to use the "split_read_mapper" from the SOLiD WT Analysis Pipeline before using your program? Could I use other mappers (e.g. BFAST)?

          2) I got the data in an SRF file... How could I obtain the csfasta from it? I'm thinking of using the Staden program srf2fastq and then deriving the csfasta from the fastq... Is that the proper way to do it?

          Thanks in advance for any help


          • #6
            Hi fennan,

            Some quick answers:

            1) Currently the "split_read_mapper" is the only aligner that is directly supported. You could try using some other mapping tool, but then you'll have to do some processing of the output files. It's important to note that the aligner should map sub-parts of each read independently, as the "split_read_mapper" does.

            2) You could try the SRF_Reader in the solid2srf package (http://solidsoftwaretools.com/gf/project/srf/).

            Hope this helps!

            Adam


            • #7
              Hi Adam,

              It does help. Thank you for your quick response! I'll be using your program and I will report my experience here.

              By the way, in case it is useful for someone: I ran into some problems compiling the source code of SRF_Reader (which is sadly not uncommon). It seems to be designed for 32-bit machines (mine is 64-bit). I had to manually include some header files (mainly cstdlib.h and string.h) and then it worked. After that I found the package for 64-bit machines (http://yum.biopackages.net/biopackag...os4.x86_64.rpm). It worked fine for me too...


              • #8
                Hi Adam,

                I've been trying to use SplitSeek but I am having a lot of problems with the "split_read_mapper" program. I have reported my problems to SOLiD support, but they have not been very helpful so far. I saw that I am dealing with the same RNA-seq data that you used in your paper (GSE14605), so I thought you could give me some hints so that I can apply your SplitSeek program.

                The thing is that five days ago I launched "split_read_mapper" for one csfasta file (~600MB) and the mapper.log file says that the program is still "Waiting for mapping jobs to finish...". Three days ago I also launched "split_read_mapper" for a small csfasta file (5000 reads) with the mm9 whole genome as the reference but it is also at the same point ("Waiting for mapping jobs to finish..."). My questions are:

                1) Is this normal? How long did it take for you?

                2) What queue system did you use? I am using SGE (Sun Grid Engine), but I am not sure whether it is properly supported by this program... Any idea where the problem could be?

                Thanks


                • #9
                  Hi fennan,

                  The mapping jobs should only take a few hours, so I think something went wrong. My guess is that it might be a memory issue. Can you try again with increased memory? I'm using the PBS system.

                  Adam


                  • #10
                    Hi Amadeur,

                    I have a similar question to Uwe's. As far as I know, color-space reads depend on their first nucleotide (perhaps from the primer), and the remaining nucleotides are resolved recursively,
                    e.g. T1021301230123123012301

                    As a result, there would be a serious problem when an error occurs in the middle of the read. I know the authors of BFAST have developed a specific alignment algorithm to deal with this problem.

                    I wonder how you split the reads while avoiding this problem. Do you translate the colors to nucleotides first and then split? Or did you use some smart idea to handle this?

                    Bests.
                    -Cuncong
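The error propagation described above can be made concrete with a short sketch. It assumes the standard SOLiD two-base encoding, in which A, C, G, T are numbered 0 to 3 and each color is the XOR of the two neighbouring base codes. `decode_colors` is a hypothetical helper, and the read is just the first ten colors of the example read above with one color deliberately changed.

```python
BASES = "ACGT"  # A=0, C=1, G=2, T=3; color = XOR of adjacent base codes


def decode_colors(read):
    """Translate a color-space read (primer base followed by colors) into bases.

    The primer base only seeds the recursion and is not part of the output.
    """
    base = BASES.index(read[0])
    decoded = []
    for color in read[1:]:
        base ^= int(color)  # each base depends on the previous base and the color
        decoded.append(BASES[base])
    return "".join(decoded)


clean = "T1021301230"
noisy = "T1021101230"  # fifth color miscalled: 3 -> 1

clean_bases = decode_colors(clean)
noisy_bases = decode_colors(noisy)
mismatches = sum(a != b for a, b in zip(clean_bases, noisy_bases))

print(clean_bases)
print(noisy_bases)
print(mismatches)
```

Because each base is derived from the previous one, every base from the miscalled color onwards differs between the two translations, even though only a single color was changed.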


                    • #11
                      Hi cczhong,

                      the smart idea that practically every color-space aligner is built on is not to translate the color-space reads into base space for the mapping/alignment, but to translate the reference genome into color space instead. This way there are no recursions to resolve, because color-space aligners always know how to translate aligned portions of the reference genome back into base space. Using Bowtie (http://bowtie-bio.sourceforge.net/index.shtml), for example, you can build color-space indices of reference genomes and then do the mapping, even truncating reads at both the 5' and 3' ends (see the options '--trim5' and '--trim3' on the manual page) prior to mapping them. However, if you need to search for recombinations, translocations or similar events, SplitSeek can be your best friend.
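A toy illustration of why translating the reference removes the recursion problem, also for split reads: once the reference is in color space, any stretch of colors from a read can be looked up directly, without knowing the bases that precede it. This is only a sketch under assumptions. It uses the standard SOLiD two-base encoding (color = XOR of adjacent base codes with A, C, G, T numbered 0 to 3), exact string matching instead of a real aligner, a made-up reference and junction read, and read colors with the primer base and first color already left out. `encode_colors` is a hypothetical helper.

```python
BASES = "ACGT"  # A=0, C=1, G=2, T=3


def encode_colors(seq):
    """Translate a base-space sequence into colors (one color per adjacent base pair)."""
    codes = [BASES.index(b) for b in seq]
    return "".join(str(a ^ b) for a, b in zip(codes, codes[1:]))


# Made-up reference and a read that joins two distant parts of it.
reference = "ACGGTCATTGCAAGCTTAGGCTA"
ref_colors = encode_colors(reference)

junction_bases = reference[2:10] + reference[15:23]
read_colors = encode_colors(junction_bases)

# Split the read into two anchors and leave a gap around the junction.
left_anchor = read_colors[:6]
right_anchor = read_colors[-6:]

full_hit = ref_colors.find(read_colors)    # -1: the full read does not match anywhere
left_hit = ref_colors.find(left_anchor)    # color offset of the left anchor
right_hit = ref_colors.find(right_anchor)  # color offset of the right anchor

print(ref_colors)
print(read_colors)
print(full_hit, left_hit, right_hit)
```

The full-length read finds no exact match, while both anchors match independently at different reference positions. The right anchor needs no preceding base to be interpreted, which is why a gap between the subreads does not break anything in color space.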

                      For the sake of completeness, it should be mentioned that it is of course possible to translate color-space reads into base space first and then do the whole alignment process in base space. It is, however, a pretty lossy process; that is, you will end up with many reads that simply don't align anywhere. This is mainly because of the 'blind' interpretation of the recursion (blind in the sense of not comparing against any reference that provides hints as to where sequencing errors or SNPs are located). I once used Bowtie and Blat (the latter being a pure base-space aligner) to quantify the loss of alignable reads. Of all reads that could be successfully mapped by Bowtie (in color space), about one third couldn't be aligned using Blat (in base space) after translation into base space. Certainly, this will vary from sample to sample. But it at least suggested that moving away from color-space aligners should usually be the last choice.

                      Best Uwe


                      • #12
                        Has anyone succeeded in using SplitSeek without the SOLiD WT Pipeline?

                        We're trying to analyze some SOLiD transcriptome data, but we want to use an aligner that knows SAM/BAM format since we're more familiar with that.


                        • #13
                          Try NovoalignCS (www.novocraft.com), as it supports direct SAM output. It may also help in aligning the full-length color-space reads. Subtracting these aligned reads yields unmapped reads that could be passed to the splitseq mapper.
                          It takes about 5 minutes to build a full color-space index for human, mouse, etc. using novoindex, and you need a minimum of 8-9 GB of RAM per server. Multithreading, polyclonal filter, CSFASTA/CSQUAL and MPI are supported.


                          What percentage of reads in the run are expected to contain junctions?





                          • #14
                            Hi, thanks for the reply, NovoalignCS looks quite good. It's not an option for me, however, as I prefer Open Source tools such as BWA, Bowtie and BFAST, which also support color space input and SAM output.

                            This isn't exactly what I meant, though. I want to convert the SAM/BAM output of any aligner to the input required by SplitSeek (which seems to be a file in BEDPE format). I guess it isn't difficult to write a script for that, but there may be some pitfalls I don't know about yet.
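A rough sketch of what such a conversion script might look like. Everything here is an assumption rather than SplitSeek's documented input: the column order follows the common bedtools-style BEDPE layout (chrom1, start1, end1, chrom2, start2, end2, name, score, strand1, strand2), the two halves of a read are assumed to have been aligned independently and reported as two separate SAM records with the same read name, and the example records are made up. The helper names are hypothetical.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def reference_span(cigar):
    """Number of reference bases covered by an alignment, taken from its CIGAR string."""
    return sum(int(n) for n, op in CIGAR_RE.findall(cigar) if op in "MDN=X")


def sam_to_interval(line):
    """Extract (name, chrom, start, end, strand) from one SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    name, flag, chrom, pos, cigar = fields[0], int(fields[1]), fields[2], int(fields[3]), fields[5]
    start = pos - 1  # SAM positions are 1-based, BED-style intervals are 0-based half-open
    end = start + reference_span(cigar)
    strand = "-" if flag & 16 else "+"  # flag bit 0x10: read aligned to the reverse strand
    return name, chrom, start, end, strand


def pair_to_bedpe(left_line, right_line):
    """Combine the SAM records of the two read halves into one BEDPE-style line."""
    name, chrom1, start1, end1, strand1 = sam_to_interval(left_line)
    _, chrom2, start2, end2, strand2 = sam_to_interval(right_line)
    fields = [chrom1, start1, end1, chrom2, start2, end2, name, ".", strand1, strand2]
    return "\t".join(str(f) for f in fields)


# Made-up SAM records for the two independently aligned halves of one read.
left = "read1\t0\tchr1\t101\t60\t25M\t*\t0\t0\t*\t*"
right = "read1\t16\tchr1\t5001\t60\t20M\t*\t0\t0\t*\t*"

bedpe_line = pair_to_bedpe(left, right)
print(bedpe_line)
```

One pitfall this does not address is whether the aligner reports both halves of a split read at all. If it only reports full-length alignments, there is nothing to pair up.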


                            • #15
                              Hi mmartin,

                              The AB WT pipeline performs a split read mapping where the two ends of the read are independently aligned to the reference. This type of split read alignment is essential if you want to run SplitSeek, otherwise you risk missing a lot of junctions. As far as I know there are currently no good alternatives to the WT pipeline for running SplitSeek.

                              As for converting SAM/BAM to BEDPE: I suppose this could be done quite easily. For me, the main concern is whether or not most split read alignments are included in the SAM/BAM alignment results. And that will depend on which mapping algorithm was used.

                              By the way, does anyone know how a split read alignment is represented in SAM?

                              /Adam
