**GenoMax** · 03-01-2016, 02:47 PM

I am going to suggest that you try BBMap (I know, yet another aligner) since it makes it easy to write the reads that do not map to a separate file. Leave the settings at their default values. You only need to provide a reference index (which you can build beforehand or at run time), the input files, and the outputs (one for the alignments and one or more for the unmapped reads).

I would also suggest that you grab a sample of reads and do blast @NCBI to figure out if there is a possibility of contamination from an unexpected species.

**Richard Finney** · 03-01-2016, 03:29 PM

Your check for contamination angle is the thing to do.

You can even *randomly* subsample the unmappeds and paste 'em as fastas into NCBI web blast just to get your bearings.

If, for instance, 20% of them map to another bacterial family, then, yeah, it's contamination.

Check out their adapter contamination tool, too.
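
A minimal sketch of the random subsampling step, assuming the unmapped reads are already in a plain 4-line-per-record FASTQ file. The function names (`read_fastq`, `subsample`, `to_fasta`) and the file name in the usage comment are made up for illustration; reservoir sampling is just one convenient way to pick reads uniformly without loading the whole file.

```python
import random
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (read name, sequence)


def read_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (name, sequence) pairs from 4-line FASTQ records."""
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # tolerate blank lines between records
        seq = next(it, "").strip()
        next(it, None)  # '+' separator line
        next(it, None)  # quality line (not needed for BLAST)
        yield header.strip()[1:], seq  # drop the leading '@'


def subsample(records: Iterable[Record], k: int, seed: int = 0) -> List[Record]:
    """Reservoir-sample k records uniformly at random in one pass."""
    rng = random.Random(seed)
    sample: List[Record] = []
    for i, rec in enumerate(records):
        if i < k:
            sample.append(rec)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = rec
    return sample


def to_fasta(records: Iterable[Record]) -> str:
    """Format records as FASTA text that can be pasted into web BLAST."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)


# Hypothetical usage:
# with open("S25_unmapped_bam2fq.fastq") as fh:
#     print(to_fasta(subsample(read_fastq(fh), k=20)))
```

A few dozen reads is usually enough to see whether the hits cluster on the expected organism or somewhere else.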

**seatales** · 03-02-2016, 08:00 AM

Thanks Richard and GenoMax. I will try your suggestions. Somewhere I was hoping that it was not a contamination issue. But let's see. Will post here what I find.

**seatales** · 03-03-2016, 12:15 PM

I have never used BBMap and I am in the process of learning how to install and use it.
But I found and did the following to get the unmapped reads from my bwa output

./samtools view -b -f 4 S25_bwa.bam > S25_unmapped.bam
./samtools bam2fq S25_unmapped.bam > S25_unmapped_bam2fq.fastq
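
For anyone wondering what `-f 4` selects: it keeps records whose SAM FLAG has the 0x4 bit ("segment unmapped") set. Here is a small sketch of the same test on SAM text lines, e.g. the output of `samtools view` without `-b`. The helper names are hypothetical, and note that the fraction is computed over alignment records, so secondary or supplementary lines would be counted too if the aligner emitted them.

```python
from typing import Iterable

UNMAPPED = 0x4  # SAM FLAG bit: segment unmapped


def is_unmapped(sam_line: str) -> bool:
    """True if the record's FLAG (column 2) has the 0x4 bit set."""
    fields = sam_line.rstrip("\n").split("\t")
    return bool(int(fields[1]) & UNMAPPED)


def unmapped_fraction(sam_lines: Iterable[str]) -> float:
    """Fraction of alignment records flagged unmapped; header lines are skipped."""
    total = 0
    unmapped = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # '@' lines are SAM headers, not alignments
        total += 1
        unmapped += is_unmapped(line)
    return unmapped / total if total else 0.0
```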

Fortunately or unfortunately, my unmapped reads blast to the expected genome, i.e. contamination with an unexpected genome does not seem to be the overwhelming issue.

I will now see if I can rerun the analysis after checking for and removing adapter contamination. I also read that unmapped reads are not a problem for SNP analysis (my goal) and that I can proceed with just the mapped reads. But if I do that, doesn't the low alignment rate mean that I would be looking at a very incomplete picture of how SNPs are distributed across the genome?

Could I just skip the reference-based assembly and do a de novo assembly? Because I have genomes from time t=0 and time t=end of experiment, would it be possible to make SNP inferences by comparing the SNPs at the initial vs final timepoints?

I would really appreciate any number of insights!

**GenoMax** · 03-03-2016, 12:39 PM

It is good to know that the possibility of contamination is low/not there. But then why is bowtie2 having trouble aligning the reads? I am not sure whether the bowtie2 defaults are too strict and/or your "reference" is different enough to keep the reads from mapping. If the second possibility is true then you may indeed want to try and assemble your own reference. Lab-specific strains can turn out to be different from the "reference" out in the wild.

**seatales** · 03-10-2016, 12:43 PM

I tried reassembling the reference and then trying reference-based assembly using bowtie2 and bwa-mem. The alignment rates improved to 25% and 70% respectively. FastQC does not flag adapter content as a problem.

When I extract the unmapped reads in fastq format, I see a pattern though. The first and the last few unmapped reads are given below (I inserted the names "identifier 1, 2, 3, 4, 5, 6" for the purpose of this post). Some reads clearly have long strings of NNNNNNs, or little else (identifiers 1, 2, 3), and will be unmappable. But there are reads (identifiers 4, 5, 6) that have no NNNNNs and still don't map. Is the quality too low for these reads, especially if some of them blast to the right genome?

@Identifier1
NNNNNCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGNTNTCCATCTTTTNTTTCCTTCGNTNTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+
#####E#####################################################/#E#E//EEAEEE<#E<EE/A//E#E#/############################################

@Identifier2
AGATCTCCGGCTTGATGNNTCTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGNNGNNCGNATCGCGCNACTTNTNGTCNTGCGCGGCCAGGGCATCCAGCGTCTGCTTGG
+
EEEEEEEEEEEEEEEEE##EEE#######################################E##6##EE#EEEEEEA#EEEE#E#EEE#EEEEEEEEEEEEEEEEEAEEEEEEEEEEEEE

@Identifier3
CCGGGCGCTACCACCGGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNANNTNNTNNCANGNGNNGACNNNNATGNGGGNCAGCTGGCCNCTGACGGAGAGCCCCTGCCAGCCGGCAT
+
EEEEEEEEEEEEEEEEE############################################E##E##A##EA#E#<##/AA####EEE#EEE#<EEEEEE<E#E<AEEEEEE<EEEEEEE6AEA<AAAA/E

@Identifier 4
AACAGCAGGCTCCCGATCGAGTAGCCGGCGCCAAACGAGCAGAGCACGCCGCGGGCCCCCGCGGAAAGGTCGCGATTGTACAGATGGAACGCAATAATGGAACCCGCCGAGCTGGTGTTGCCGTAGCTGTC
+
EAEAEE<EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEAEEEEEEEEEEEAAEEEEEEEAEEEE<AEEEEEEEEEEEEEEEEEAE/EEEEAEEEEEE<EEAAEEEEEEAEAEEEEEAAAEAEEEEE<

@Identifier 5
CCTTGGCGTAGTCGCCAGCCCGATACCAGGCATTGCCCTGCCAGACGGGATCCGCAAAGGCGCGGGCGGCCTCCTTATAGTGCTTTTGCAAGAAGGCCTGCCAGGCCTGATTGTCATGGGTAG
+
EEEEEEEEEEEEEEEEEEAEEEEEAEEEEEE/EEEEEEEEE<EE/EEEAEEEEEEEE6EEEEEEEEEEEA<<EEAEEEAEEEE//<AEEEEEE/EEE6EE<A<EEEA/AA//EE<</EEEEEE

@Identifier 6
CAGGCCTGGCAGGCCTTCTTGCAAAAGCACTATAAGGAGGCCGCCCGCGCCTTTGCGGATCCCGTCTGGCAGGGCAATGCCTGGTATCGGGCTGGCGACTACGCCAAGGCGGTCGCCGCCTAT
+
EEEEEEEEEEEEEEEEEEEEEEEEE/EEEEEEEAEEEEEEEEEAEEEEEEEEEEEEEEE<EEEEEEEAEEEEEEEEEAE<<<EEEEEE<E<EEEEEE/EEEEEEEA/AEEEEAEEAAA<A/A/
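
One way to put numbers on the "too many Ns" versus "too low quality" question is to compute, per read, the fraction of N bases and the mean Phred score. The sketch below assumes Phred+33 encoding (under which `E` is Q36 and `#` is Q2); the cutoffs `max_n` and `min_q` and the function names are arbitrary illustrative choices, not recommended values.

```python
def n_fraction(seq: str) -> float:
    """Fraction of bases in the read that are N."""
    return seq.upper().count("N") / len(seq) if seq else 0.0


def mean_phred(qual: str, offset: int = 33) -> float:
    """Mean Phred quality of a read, assuming Phred+33 ASCII encoding."""
    return sum(ord(c) - offset for c in qual) / len(qual) if qual else 0.0


def classify_read(seq: str, qual: str, max_n: float = 0.1, min_q: float = 20.0) -> str:
    """Label a read 'n_rich', 'low_quality', or 'ok' using illustrative cutoffs."""
    if n_fraction(seq) > max_n:
        return "n_rich"
    if mean_phred(qual) < min_q:
        return "low_quality"
    return "ok"
```

Reads that come back "ok" by a check like this and still fail to map point toward aligner settings or reference divergence rather than read quality.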

Thanks very much for your help!

**Richard Finney** · 03-10-2016, 01:09 PM

Blasting #4 matches 97% to Aeromonas hydrophila YL17 with 4 mismatches.
(via web blast).

#5 matches 98% to several Aeromonas hydrophila strains

Perhaps you can use blast to align your unmapped reads if this is your target genome.

Is this the target? Is it a contaminant species?

I think blast has some support for sam output, and other utilities can likely convert blast output to sam.

UCSC blat is also good for checking on problem reads.

Note also that alignment software often has parameters that you can use which will allow more mismatches and gaps.

**dariober** · 03-11-2016, 05:36 AM

As a side comment, you can make bwa mem more sensitive by lowering the minimum score for a read to be output (-T option, e.g. try -T 20) and/or by making the seed length shorter (-k option). It will be much slower and more memory-demanding, but for a bacterial genome that shouldn't be a problem.

Obviously, if there is a problem with the reads this will not fix it but at least you can get an idea of what's going on...
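
A small sketch of how the more sensitive run could be assembled from a script. It only builds and prints the command line and does not run bwa. The `-T 20` value is the one suggested above; `-k 15` is an arbitrary example of a seed shorter than the bwa mem default of 19 (the `-T` default is 30), and the file names are placeholders.

```python
import shlex
from typing import List


def bwa_mem_command(reference: str, reads: str,
                    min_score: int = 20, seed_len: int = 15) -> List[str]:
    """Build a bwa mem argument list with a lowered -T and a shorter -k."""
    return [
        "bwa", "mem",
        "-T", str(min_score),  # minimum alignment score to output a read
        "-k", str(seed_len),   # minimum seed length
        reference, reads,
    ]


# Placeholder file names; pass the list to subprocess.run(..., stdout=...) to execute.
cmd = bwa_mem_command("reference.fa", "S25_reads.fastq")
print(shlex.join(cmd))
```

Comparing the mapped fraction before and after such a change gives a quick sense of how much of the loss is due to the scoring threshold rather than the reads themselves.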

Low alignment rates with bowtie2 and bwa for a bacterial genome
