Bowtie, an ultrafast, memory-efficient, open source short read aligner

ShaunMahony replied

05-19-2009, 07:28 AM
This has probably been answered already, so apologies in advance.

Does anyone know if Bowtie by default filters the input on the basis of quality? I'm getting a strange result. When I perfectly sample random 32mers from the mouse genome, and then align them back to the same genome, most aligners align ~83% uniquely. However, Bowtie is only aligning ~77%.

Where are the missing reads going? It can't be mismatch qualities, since there are no mismatches in the sampled 'reads'. These are the options I'm using:

./bowtie -q --solexa-quals -m 2 --best -p 2
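
For anyone who wants to reproduce this kind of test, here is a minimal sketch of drawing error-free k-mers from a reference string and writing them as FASTQ records. It is not necessarily how the reads above were generated. The names `sample_kmers` and `to_fastq`, the choice to skip windows containing non-ACGT characters, and the placeholder quality character are all assumptions made for illustration.

```python
import random

def sample_kmers(sequence, k=32, n=5, seed=0):
    """Draw n (offset, kmer) pairs uniformly at random from sequence.

    Windows containing anything other than A, C, G or T are skipped
    (an assumption; the thread does not say how such windows were handled).
    """
    seq = sequence.upper()
    if len(seq) < k:
        raise ValueError("sequence is shorter than k")
    rng = random.Random(seed)
    reads = []
    attempts = 0
    max_attempts = 1000 * n  # give up eventually if valid windows are rare
    while len(reads) < n and attempts < max_attempts:
        attempts += 1
        start = rng.randrange(len(seq) - k + 1)
        kmer = seq[start:start + k]
        if set(kmer) <= set("ACGT"):
            reads.append((start, kmer))
    return reads

def to_fastq(reads, qual_char="h"):
    """Format reads as FASTQ text with a constant placeholder quality.

    qual_char is hypothetical; use a character valid for your encoding.
    """
    records = []
    for start, kmer in reads:
        records.append(f"@read_{start}\n{kmer}\n+\n{qual_char * len(kmer)}\n")
    return "".join(records)
```

Because each read name records its 0-based source offset, reads that fail to align can be traced back to their origin in the reference.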
dara replied

05-07-2009, 07:38 AM
yes that makes sense. Thank you
Ben Langmead replied

05-07-2009, 07:35 AM
Hi dara,

It complained that the total sequence length of all the reference strings was too big to fit in a single index, right? I didn't mean to imply that you can't feed multiple fasta files to bowtie-build; you certainly can. But if the total length of all the sequence you're supplying is too big, you'll have to break the input up into chunks somehow and build separate indexes for each chunk. You might try feeding the fasta files in smaller bundles, or you might redistribute sequences throughout the fasta files, or both. If you've got chromosomes, you probably just want to try bundling together as many chromosome fasta files as you can get away with in a single invocation of bowtie-build.

Does that make sense?

Thanks,
Ben
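
One way to do the bundling described above is a simple greedy pass over the files, sketched below. This is an illustration, not part of bowtie-build: `bundle_by_length` is a hypothetical helper, the per-file lengths are assumed to have been counted beforehand, and the default limit is the "about 3.6 billion characters" figure from the error message quoted in this thread.

```python
MAX_CHARS = 3_600_000_000  # per-index figure quoted in this thread

def bundle_by_length(lengths, limit=MAX_CHARS):
    """Greedily pack (name, length) pairs, in order, into bundles.

    Each bundle's total length stays at or below limit. The packing is
    not guaranteed to use the fewest possible bundles.
    """
    bundles = []
    current = []
    total = 0
    for name, length in lengths:
        if length > limit:
            raise ValueError(f"{name} alone exceeds the limit and must be split")
        if current and total + length > limit:
            # Adding this file would overflow, so close the current bundle.
            bundles.append(current)
            current, total = [], 0
        current.append(name)
        total += length
    if current:
        bundles.append(current)
    return bundles
```

Each resulting bundle can then be joined with commas and passed to one bowtie-build invocation, in the same way the INPUTS line of make_h_sapiens_asm.sh does.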
dara replied

05-07-2009, 07:27 AM
Hello Ben,

Thank you for your quick response. However, I'm a little puzzled because I was looking at the script that comes along with genome index on the Bowtie website (make_h_sapiens_asm.sh) and it seems to build just one index by providing all the chunks to the bowtie-build executable at once. Here's the line I'm talking about:

INPUTS=hs_ref_chr1.fa,hs_ref_chr2.fa,hs_ref_chr3.fa,hs_ref_chr4.fa,hs_ref_chr5.fa,hs_ref_chr6.fa,hs_ref_chr7.fa,hs_ref_chr8.fa,hs_ref_chr9.fa,hs_ref_chr10.fa,hs_ref_chr11.fa,hs_ref_chr12.fa,hs_ref_chr13.fa,hs_ref_chr14.fa,hs_ref_chr15.fa,hs_ref_chr16.fa,hs_ref_chr17.fa,hs_ref_chr18.fa,hs_ref_chr19.fa,hs_ref_chr20.fa,hs_ref_chr21.fa,hs_ref_chr22.fa,hs_ref_chrMT.fa,hs_ref_chrX.fa,hs_ref_chrY.fa

${BOWTIE_BUILD_EXE} ${INPUTS} h_sapiens_asm

I was trying the same thing, providing the individual chromosome splits to the indexer, and it complained.

Thanks again
Ben Langmead replied

05-07-2009, 07:06 AM
Originally posted by dara

Also another question for you:

Any updates on plans for bowtie supporting gapped alignment?

thanks

Now that paired-end is substantially done, we'll be embarking on gapped alignment soon. I'll probably start on that in June. Hopefully by the end of the summer you'll see at least initial gapped-alignment support. That's a guess, though.

Thanks,
Ben
Ben Langmead replied

05-07-2009, 07:04 AM
Hi dara,

Yes, you have to build separate index files and query them separately. You'll have to synthesize the per-index results into an overall set of results, e.g., with some scripts. Bowtie doesn't currently know how to query multiple indexes as part of a single alignment run.

Thanks,
Ben
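
As a rough idea of what such a script might look like, the sketch below merges alignment lines from several per-index runs and keeps, for each read, only the hits with the fewest mismatches. It is a sketch under assumptions, not a Bowtie feature: `merge_hits` is a hypothetical name, and the column layout (tab-separated, read name in column 1, comma-separated mismatch descriptors in column 8) is assumed, so check it against the output of your Bowtie version.

```python
def merge_hits(per_index_lines):
    """Merge alignment lines from several per-index runs.

    per_index_lines: an iterable with one iterable of lines per index.
    Returns {read_name: [lines]} holding the hits with the fewest mismatches.
    Assumed layout: tab-separated, read name in column 1, comma-separated
    mismatch descriptors in column 8 (empty or absent for a perfect hit).
    """
    best = {}  # read name -> (mismatch count, [lines])
    for lines in per_index_lines:
        for raw in lines:
            line = raw.rstrip("\n")
            fields = line.split("\t")
            if len(fields) < 4:
                continue  # skip blank or malformed lines
            name = fields[0]
            descriptors = fields[7] if len(fields) > 7 else ""
            n_mm = len(descriptors.split(",")) if descriptors else 0
            previous = best.get(name)
            if previous is None or n_mm < previous[0]:
                best[name] = (n_mm, [line])
            elif n_mm == previous[0]:
                previous[1].append(line)
    return {name: hits for name, (_, hits) in best.items()}
```

Note that a read reported as unique within each index can still end up with more than one equally good hit after merging; such reads are ambiguous with respect to the whole reference and may need to be filtered separately.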
dara replied

05-07-2009, 06:24 AM
Also another question for you:

Any updates on plans for bowtie supporting gapped alignment?

thanks
dara replied

05-07-2009, 06:05 AM
Hi Ben,

Once the reference file has been split into chunks, do they have to be made into separate indexes? So, for example, if I've split the reference into chrom1, chrom2 and chrom3, would I need to do:

./bowtie-build -f chrom1 indexchrom1
./bowtie-build -f chrom2 indexchrom2
./bowtie-build -f chrom3 indexchrom3

If I build separate indexes, how would I call all of them when mapping with my reads file?

Thanks for your help

Last edited by dara; 05-07-2009, 06:25 AM. Reason: name
dara replied

05-01-2009, 07:11 AM
Hi Ben,

Thank you for your response. The file is a human genome download from BLAST. It's about 8.3 GB in size, and I was using the default 32-bit version of bowtie-build. All right, I will try what you suggested: I will split the genome (by chromosome, maybe) and then feed those splits to bowtie-build.

I will let you know if that causes any issues.

Thanks
Ben Langmead replied

05-01-2009, 06:41 AM
Hi dara,

How large is the human_genomic.fa file? Are you using 32-bit or 64-bit bowtie-build? I've not seen this before. Most versions of Linux and glibc can handle very large files with no problem.

I suspect that once you fix this problem, you'll run into the problem that Bowtie can only index reference sequences in chunks of about 3.6 Gbases or so. When you try to feed bowtie-build an input with too much sequence, it will say "Error: Reference sequence has more than 2^32-1 characters! Please divide the reference into batches or chunks of about 3.6 billion characters or less each and index each independently." This is because Bowtie uses 32-bit ints internally to refer to offsets in the index. We may fix this some day, but until then you'll have to work around this by indexing your reference in chunks.

Ben
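
To check ahead of time whether a set of FASTA files is likely to hit this limit, one could count the sequence characters and compare against 2^32-1, roughly as sketched below. `count_sequence_chars` and `fits_in_one_index` are hypothetical helpers. Exactly which characters bowtie-build counts toward the limit is not stated in this thread, so treat the result as an estimate, and note that the error message recommends chunks of about 3.6 billion characters, which is below the hard ceiling.

```python
LIMIT_32BIT = 2**32 - 1  # 4,294,967,295, the largest 32-bit unsigned value

def count_sequence_chars(lines):
    """Count sequence characters in FASTA lines, ignoring '>' header lines.

    lines can be any iterable of strings, including an open file handle.
    """
    total = 0
    for line in lines:
        if not line.startswith(">"):
            total += len(line.strip())
    return total

def fits_in_one_index(line_sources, limit=LIMIT_32BIT):
    """Return True if the combined sequence length is at or below limit."""
    return sum(count_sequence_chars(src) for src in line_sources) <= limit
```

With real files, each source would be an open handle, for example `with open(path) as handle: count_sequence_chars(handle)`.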
dara replied

04-30-2009, 07:44 AM
BOWTIE_BUILD: Problems when using with large reference genomes?

Hi all,

I've been trying to run bowtie using the human_genomic.fa file from the BLAST db as the reference. When I try to use bowtie-build to index this large file, I keep getting an 'Error: could not open human_genomic.fa' message.
I tried creating a file with just the first 10000 lines of the human genome, and that works fine. I thought bowtie could easily handle such big reference files. Has anyone else faced this issue? Any suggestions on how to overcome it?

Here's what I did: ./bowtie-build -f human_genomic.fa human_genom

thanks
Ben Langmead replied

04-09-2009, 02:16 PM
Hi Ieuan,

Originally posted by ieuanclay

0.9.9.2 does not have the same problem, and has roughly the same footprint for both -a and -a --nostrata. Any idea what the change was? Either way I am happy!

--best mode got an overhaul in 0.9.9.2 such that --best now conducts a best-first search, rather than a depth-first search with buffering and flushing of results, as before. My suspicion is that the old approach was, for some reads, buffering a huge number of results and exhausting memory. I'll take a harder look, though.

Thanks,
Ben
thondeboer replied

04-08-2009, 02:30 PM
Hi Ben,

You can read more on our read structure on our website and on this forum as well:

http://seqanswers.com/forums/showthread.php?t=1307

http://www.completegenomics.com/pages/materials/CompleteGenomicsTechnologyPaper.pdf

But basically we have a gapped read structure of 5 + 10 + 10 + 10 (times two) bases.
The first gap is "negative" that is, has overlap between the 5 and 10 base reads.
The other gaps are positive, that is, gaps in the more classical sense.

You won't know the negative gap value (it can vary from 1 to 3 overlaps) unless you map the data (or unless there is only one way to overlap) onto the reference genome.

Good to hear you are in support of SAM/BAM. We are considering this as our export format as well...

Thon
Complete Genomics
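
To illustrate the "only one way to overlap" case mentioned above, the sketch below checks which overlap sizes are consistent between two adjacent sub-reads, that is, where the end of one matches the start of the next. This is one reading of the description in this post, not Complete Genomics' mapping software; `consistent_overlaps` and `merge_with_overlap` are hypothetical names, and sequencing errors in the overlapping bases are ignored.

```python
def consistent_overlaps(left, right, candidates=(1, 2, 3)):
    """Return the overlap sizes n where left's last n bases equal right's first n."""
    return [n for n in candidates if left[-n:] == right[:n]]

def merge_with_overlap(left, right, n):
    """Join two sub-reads that share n overlapping bases."""
    return left + right[n:]

# A 5-base and a 10-base sub-read; only an overlap of 2 is consistent here.
five, ten = "ACGTA", "TACCGGTTAA"
sizes = consistent_overlaps(five, ten)
if len(sizes) == 1:
    merged = merge_with_overlap(five, ten, sizes[0])
```

When more than one overlap size is consistent (for example in a homopolymer run), the read stays ambiguous until it is mapped against the reference.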
Ben Langmead replied

04-08-2009, 01:26 PM
Hey Thon,

We haven't tried implementing gapped alignment yet, though tools like BWA and SOAP2 show it's doable in this framework. Can you describe the "unusual read structure"?

Yes, we would certainly like to support SAM/BAM output eventually. It's on the TODO list!

Thanks,
Ben
thondeboer replied

04-08-2009, 09:30 AM
Hi Ben,

Complete Genomics here....
Have you tried using our gapped read structure with Bowtie yet? As you may know, we have quite an unusual read structure, so most mapping software cannot use it effectively and we have built our own. Our customers would probably want to use other mapping software as well, if only to compare our mapping to theirs...

The data is available in the SRA under number SRA008092

ftp://ftp.ncbi.nlm.nih.gov/sra/Submi...008/SRA008092/

You can also get a sample data set which is part of the API we have released.

http://www.completegenomics.com/developer/default.aspx

We are considering changing to the SAM/BAM format as the export of our mapping data...Are you considering supporting SAM/BAM as an output format as well?

Thanks!

Thon
