Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Aligner capable of correctly aligning a 30-50 bp duplication

    Hi,

    I have a sample (human, germline) amplified with multiplex PCR and sequenced with Ion Torrent and sufficient coverage around 500x there is a known duplication in one of the exons of around 40 base pairs.

    I have tried Bowtie2, Stampy and BWA, none of them correctly detect the duplication, either they throw away the reads, leading to a coverage hole in the duplication region or they try to fit it in create a large number of neighbouring snp's. Just for fun I tried clustalW, clustal Omega and MAFFT with a single perfect read (simulated) and the exon reference sequence. None of them are smart enough to recognize the duplication, they all align the end of the read and generate either SNPs in the beginning or SNPs and deletions.

    Is there any aligner out there, that can correctly align reads across a duplication of 30 to 50 bp?

    thanks for any advice,

    cheers

    David

  • #2
    Aligners really can't do that. They are looking at a single read in the context of a reference genome. How would any software figure out a large duplication like that?

    You could try de novo assembly. That might find it.

    You don't say how long your reads are. If they are only 50 bases long, it will be very very hard for any automated software to see a 40 bp duplication with that. At least the SNPs tell you that there is a discrepancy there, that's probably the best you can hope for.

    Comment


    • #3
      Thanks for the quick reply, the reads are from an Ion Torrent with a read length of 100 to 150bp after some trimming.

      Comment


      • #4
        Hi @david2,

        The in-house 1.4 version of our Subread aligner can detect more than 300bp long insertions or deletions. This version is not publicly available yet, but I'm happy to send it to you if you want to have a try.

        Cheers
        Wei

        Comment


        • #5
          Hi Wei,

          would be great if I could try your Subread aligner and see how it works.

          Cheers
          David

          Comment


          • #6
            Hi @david2,

            Thanks for your interest. I have put the package here - http://bioinf.wehi.edu.au/subread/su...-alpha1.tar.gz

            After downloading it, unzip the tar ball, change to 'src' directory and then build the package by typing the following command if you install it onto a linux/unix machine (you will get some warnings, but dont worry about them):

            make -f Makefile.Linux
            (type 'make -f Makefile.MacOS' if you install it onto a Mac OS X).

            A 'bin' directory will be generated in the home directory of the package and all executables generated from the compilation will be saved to that folder.

            Then you need to build an index for your reference genome first, using a command similar to the command below:

            subread-buildindex -o my_index genome_chr1.fa genome_chr2.fa ...

            Now issue the following command to align reads and discover indels up to 200bp long:

            core -i my_index -r READ_1.fq -R READ_2.fq -I 200 -o my_alignment.sam

            Here I assume you have paired-end reads. Remove '-R READ_2.fq' if your reads are single ended. Here I also assume that the quality score offset in your data is 33. If this is incorrect, please add '-P 6' to the above command.

            You can check the CIGAR strings of the mapped reads to find the discovered indels, or you can use the following command to get the list of indels reported by Subread :

            subindel -g my_index -i my_alignment.sam -o my_indels -I 200

            Hope it will work for you. Let me know if you run into any problems.

            Cheers,
            Wei

            Comment


            • #7
              Thanks Wei, I will download and try it out and let you know!
              cheers
              David

              Comment


              • #8
                Hi Wei,

                I managed to download, compile and run your aligner as well as subindel to list the indels, but I wasn't lucky with the detection of the duplication, the best I could do was detect deletions in the area of the duplication, which at least should hint the user that there is something going on in that region.

                Thanks for letting me try it,

                cheers

                david

                Comment


                • #9
                  Hi David,

                  I might have a incorrect understanding of the problem you got. Is the 'duplication' you mentioned an insertion in your sequencing data that you want to discover (this was my original thought)? Or this is not an insertion, but a repeated region in the reference exon that you want the aligners to map the reads correctly to?

                  Is it possible that you can post the exon reference sequence so I can have a close look at it (and your simulated perfect read)? This will also let more people help you.

                  Also, is it a RNA-seq dataset or a gDNA-seq dataset? The way to analyze it will be totally different if it is an RNA-seq dataset.

                  Cheers,
                  Wei

                  Comment


                  • #10
                    Hi Wei,

                    you understood correctly, I have genomic DNA, germline, human, DNA-Seq by Ion Torrent:

                    This is what the reference looks like (actually part of exon10 of BRCA1)
                    >Reference
                    AAACATTCAAGCAGAACTAGGTAGAAACAGAGGGCCAAAATTGAA
                    TGCTATGCTTAGATTAGGGGTTTTGCAACCTGAGGTCTATAAACAA
                    AGTCTTCCTGGAAGTAATTGTAAGCATCC

                    This is what a read could look like without any read errors
                    >Sample Read without errors/theoretical
                    AAACATTCAAGCAGAACTAGGTAGAAACAGAGGGCCAAAATTGAA
                    TGCTATGCTTAGATTAGGGGTTTTGCAACCTGAGGTCTATAAACAA
                    TGCTATGCTTAGATTAGGGGTTTTGCAACCTGAGGTCTATAAACAA
                    AGTCTTCCTGGAAGTAATTGTAAGCATCC
                    Or
                    >Sample Read without errors/theoretical
                    TGCTATGCTTAGATTAGGGGTTTTGCAACCTGAGGTCTATAAACAA
                    TGCTATGCTTAGATTAGGGGTTTTGCAACCTGAGGTCTATAAACAA
                    AGTCTTCCTGGAAGTAATTGTAAGCATCC
                    TGAAATAAAAAAGCAAGAATATGAAGAAGTAGTTCAGACTGTTAAT


                    The sample has been Sanger sequenced, so I know it has this heterozygous duplication. I also aligned the sample to a modified BRCA1 reference where I added the duplication and aligned sample reads to it, so there are sample reads that cover the duplication in the data set.

                    Cheers,

                    David

                    Comment


                    • #11
                      Hi David,

                      Thanks for the further information.

                      It looks like you do have reads that covered the duplication. But I think the reason why the programs you have tried could not find it is probably because of the low sequencing depth you got from the Ion Torrent sequencing. When detecting insertions as long as this, programs like Subread will perform read assembly to try to join overlapping reads to form contigs and then use these contigs to detect long indels. When the sequencing depth is low, you are more likely to get gaps between reads, which makes it harder to get long configs and to detect long indels.

                      So it would be better to have a higher sequencing depth and to use paired-end reads for this sort of analysis. However, although it seems very hard to find this long duplication using this dataset, this does not mean you can not find other indels, especially those relatively short indels.

                      I am sorry I could not be more helpful for this.

                      Best wishes,
                      Wei

                      Comment


                      • #12
                        Hi Wei,

                        thanks for your quick feedback, it is important for me to understand the limitations of current aligners to detect longer insertions or deletions, and that we might need a second alignment step to detect them.

                        Cheers

                        David

                        Comment


                        • #13
                          @david2: alignment is determined by the scoring matrix you use. Your first sample read can be mapped to the reference as long as you use a very small extension penalty. For example, under matching score 5, mismatching penalty 17, gap open penalty 20 and gap extension penalty 1, the first sample read can be mapped to hg19:

                          bwa mem -A5 -B17 -O20 -E1 hg19.fa q.fa

                          Result:

                          Sample 16 17 41244231 60 29M46I91M * 0 0 GGATGCTTACAATTACTTCCAGGAAGACTTTGTTTATAGACCTCAGGTTGCAAAACCCCTAATCTAAGCATAGCATTGTTTATAGACCTCAGGTTGCAAAACCCCTAATCTAAGCATAGCATTCAATTTTGGCCCTCTGTTTCTACCTAGTTCTGCTTGAATGTTT * NM:i:46 AS:i:1078 XS:i:190

                          EDIT: novoalign gives the correct alignment under the default setting. I am unable to configure bowtie2 to give the intended alignment.
                          Last edited by lh3; 05-16-2013, 04:36 PM.

                          Comment

                          Latest Articles

                          Collapse

                          • seqadmin
                            Essential Discoveries and Tools in Epitranscriptomics
                            by seqadmin




                            The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                            04-22-2024, 07:01 AM
                          • seqadmin
                            Current Approaches to Protein Sequencing
                            by seqadmin


                            Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                            04-04-2024, 04:25 PM

                          ad_right_rmr

                          Collapse

                          News

                          Collapse

                          Topics Statistics Last Post
                          Started by seqadmin, Yesterday, 11:49 AM
                          0 responses
                          15 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 04-24-2024, 08:47 AM
                          0 responses
                          16 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 04-11-2024, 12:08 PM
                          0 responses
                          62 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 04-10-2024, 10:19 PM
                          0 responses
                          60 views
                          0 likes
                          Last Post seqadmin  
                          Working...
                          X