Seqanswers Leaderboard Ad

**swbarnes2** · 05-08-2013, 12:43 PM

Aligners really can't do that. They are looking at a single read in the context of a reference genome. How would any software figure out a large duplication like that?

You could try de novo assembly. That might find it.

You don't say how long your reads are. If they are only 50 bases long, it will be very very hard for any automated software to see a 40 bp duplication with that. At least the SNPs tell you that there is a discrepancy there, that's probably the best you can hope for.

**david2** · 05-08-2013, 01:26 PM

Thanks for the quick reply, the reads are from an Ion Torrent with a read length of 100 to 150bp after some trimming.

**shi** · 05-08-2013, 04:13 PM

Hi @david2,

The in-house 1.4 version of our Subread aligner can detect more than 300bp long insertions or deletions. This version is not publicly available yet, but I'm happy to send it to you if you want to have a try.

Cheers
Wei

**david2** · 05-09-2013, 01:09 AM

Hi Wei,

would be great if I could try your Subread aligner and see how it works.

Cheers
David

**shi** · 05-09-2013, 03:04 AM

Hi @david2,

Thanks for your interest. I have put the package here - http://bioinf.wehi.edu.au/subread/su...-alpha1.tar.gz

After downloading it, unzip the tar ball, change to 'src' directory and then build the package by typing the following command if you install it onto a linux/unix machine (you will get some warnings, but dont worry about them):

make -f Makefile.Linux
(type 'make -f Makefile.MacOS' if you install it onto a Mac OS X).

A 'bin' directory will be generated in the home directory of the package and all executables generated from the compilation will be saved to that folder.

Then you need to build an index for your reference genome first, using a command similar to the command below:

subread-buildindex -o my_index genome_chr1.fa genome_chr2.fa ...

Now issue the following command to align reads and discover indels up to 200bp long:

core -i my_index -r READ_1.fq -R READ_2.fq -I 200 -o my_alignment.sam

Here I assume you have paired-end reads. Remove '-R READ_2.fq' if your reads are single ended. Here I also assume that the quality score offset in your data is 33. If this is incorrect, please add '-P 6' to the above command.

You can check the CIGAR strings of the mapped reads to find the discovered indels, or you can use the following command to get the list of indels reported by Subread :

subindel -g my_index -i my_alignment.sam -o my_indels -I 200

Hope it will work for you. Let me know if you run into any problems.

Cheers,
Wei

**david2** · 05-09-2013, 03:25 AM

Thanks Wei, I will download and try it out and let you know!
cheers
David

**david2** · 05-10-2013, 10:03 AM

Hi Wei,

I managed to download, compile and run your aligner as well as subindel to list the indels, but I wasn't lucky with the detection of the duplication, the best I could do was detect deletions in the area of the duplication, which at least should hint the user that there is something going on in that region.

Thanks for letting me try it,

cheers

david

**shi** · 05-10-2013, 02:55 PM

Hi David,

I might have a incorrect understanding of the problem you got. Is the 'duplication' you mentioned an insertion in your sequencing data that you want to discover (this was my original thought)? Or this is not an insertion, but a repeated region in the reference exon that you want the aligners to map the reads correctly to?

Is it possible that you can post the exon reference sequence so I can have a close look at it (and your simulated perfect read)? This will also let more people help you.

Also, is it a RNA-seq dataset or a gDNA-seq dataset? The way to analyze it will be totally different if it is an RNA-seq dataset.

Cheers,
Wei

**david2** · 05-11-2013, 02:08 AM

Hi Wei,

you understood correctly, I have genomic DNA, germline, human, DNA-Seq by Ion Torrent:

This is what the reference looks like (actually part of exon10 of BRCA1)
>Reference
AAACATTCAAGCAGAACTAGGTAGAAACAGAGGGCCAAAATTGAA
TGCTATGCTTAGATTAGGGGTTTTGCAACCTGAGGTCTATAAACAA
AGTCTTCCTGGAAGTAATTGTAAGCATCC

This is what a read could look like without any read errors
>Sample Read without errors/theoretical
AAACATTCAAGCAGAACTAGGTAGAAACAGAGGGCCAAAATTGAA
TGCTATGCTTAGATTAGGGGTTTTGCAACCTGAGGTCTATAAACAA
TGCTATGCTTAGATTAGGGGTTTTGCAACCTGAGGTCTATAAACAA
AGTCTTCCTGGAAGTAATTGTAAGCATCC
Or
>Sample Read without errors/theoretical
TGCTATGCTTAGATTAGGGGTTTTGCAACCTGAGGTCTATAAACAA
TGCTATGCTTAGATTAGGGGTTTTGCAACCTGAGGTCTATAAACAA
AGTCTTCCTGGAAGTAATTGTAAGCATCC
TGAAATAAAAAAGCAAGAATATGAAGAAGTAGTTCAGACTGTTAAT

The sample has been Sanger sequenced, so I know it has this heterozygous duplication. I also aligned the sample to a modified BRCA1 reference where I added the duplication and aligned sample reads to it, so there are sample reads that cover the duplication in the data set.

Cheers,

David

**shi** · 05-11-2013, 03:08 AM

Hi David,

Thanks for the further information.

It looks like you do have reads that covered the duplication. But I think the reason why the programs you have tried could not find it is probably because of the low sequencing depth you got from the Ion Torrent sequencing. When detecting insertions as long as this, programs like Subread will perform read assembly to try to join overlapping reads to form contigs and then use these contigs to detect long indels. When the sequencing depth is low, you are more likely to get gaps between reads, which makes it harder to get long configs and to detect long indels.

So it would be better to have a higher sequencing depth and to use paired-end reads for this sort of analysis. However, although it seems very hard to find this long duplication using this dataset, this does not mean you can not find other indels, especially those relatively short indels.

I am sorry I could not be more helpful for this.

Best wishes,
Wei

**david2** · 05-11-2013, 03:55 AM

Hi Wei,

thanks for your quick feedback, it is important for me to understand the limitations of current aligners to detect longer insertions or deletions, and that we might need a second alignment step to detect them.

Cheers

David

**lh3** · 05-16-2013, 03:23 PM

@david2: alignment is determined by the scoring matrix you use. Your first sample read can be mapped to the reference as long as you use a very small extension penalty. For example, under matching score 5, mismatching penalty 17, gap open penalty 20 and gap extension penalty 1, the first sample read can be mapped to hg19:

bwa mem -A5 -B17 -O20 -E1 hg19.fa q.fa

Result:

Sample 16 17 41244231 60 29M46I91M * 0 0 GGATGCTTACAATTACTTCCAGGAAGACTTTGTTTATAGACCTCAGGTTGCAAAACCCCTAATCTAAGCATAGCATTGTTTATAGACCTCAGGTTGCAAAACCCCTAATCTAAGCATAGCATTCAATTTTGGCCCTCTGTTTCTACCTAGTTCTGCTTGAATGTTT * NM:i:46 AS:i:1078 XS:i:190

EDIT: novoalign gives the correct alignment under the default setting. I am unable to configure bowtie2 to give the intended alignment.

Topics	Statistics	Last Post
Expanding the Horizons of Cellular Research with the Single Cell Atlas by seqadmin Started by seqadmin, Yesterday, 11:49 AM	0 responses 15 views 0 likes	Last Post by seqadmin Yesterday, 11:49 AM
Genetic Variants and Diabetes Risk in Childhood Cancer Survivors by seqadmin Started by seqadmin, 04-24-2024, 08:47 AM	0 responses 16 views 0 likes	Last Post by seqadmin 04-24-2024, 08:47 AM
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 62 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 60 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM

Seqanswers Leaderboard Ad

Announcement

Aligner capable of correctly aligning a 30-50 bp duplication

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News