Bowtie, an ultrafast, memory-efficient, open source short read aligner

kerhard replied

04-20-2011, 12:19 PM
limit to bowtie-build fasta input files?

Hi all,

I've been trying to build a bowtie index from a long list of annotated transposons, using them as the input fasta files instead of the reference chromosome files, and bowtie-build does not seem to like it very much.

If I try to use ALL of the fasta files (which is a lot, probably around 1000), I get the error message:

Error: could not open <fileX.fa>

But if I use only a subset of the fasta files (including fileX.fa), it works just fine.

I'm assuming that it's a memory issue, but the combined size of these fasta files is much smaller than that of the fasta files containing the full reference genome sequences, and I can build an index from those just fine.

Has anyone had any experience doing something similar? Is there some limit to the number of input files bowtie-build can take? I imagine that I can just split these files up into smaller groups and make several index files, but it would be nice to be able to have all of them in one index.

Thanks for any help/advice!
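One possible workaround, offered as a sketch and not a confirmed diagnosis: if the failure is tied to the number of input files (for example a limit on simultaneously open files, which is only an assumption here), merging everything into a single multi-FASTA file avoids passing ~1000 separate files. The function name merge_fasta and the file names are hypothetical.

```python
def merge_fasta(input_paths, output_path):
    """Concatenate FASTA files into one multi-FASTA file.

    Returns the number of sequence records (header lines) written.
    """
    n_records = 0
    with open(output_path, "w") as out:
        for path in input_paths:
            last = "\n"
            with open(path) as handle:
                for line in handle:
                    if line.startswith(">"):
                        n_records += 1
                    out.write(line)
                    last = line
            # If a file lacks a trailing newline, add one so the next
            # file's header does not get glued onto the last sequence line.
            if not last.endswith("\n"):
                out.write("\n")
    return n_records
```

The merged file (say all_transposons.fa) could then be given to bowtie-build as the single reference input, e.g. bowtie-build all_transposons.fa transposons.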
gntc replied

04-04-2011, 09:34 AM
mismatches

Originally posted by droog_22

Dear All,
I am using bowtie to align reads to the dm3 genome. I just read that the SAM specification allows for tags such as H0, H1, etc., which count the number of 0-difference hits, 1-difference hits, and so on. I know how to add these tags using awk; I was just wondering if it would be straightforward to modify bowtie so that it outputs these values.

Bowtie does this automatically. The tag is XA:i:0 (for a read with 0 mismatches).
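For anyone who wants to tally these values after the fact, here is a minimal sketch that counts aligned reads per XA:i value in SAM output. It assumes the tag appears on aligned records as described above. As far as I understand the bowtie manual, XA:i reports the mismatch stratum of the reported alignment, which is not quite the same thing as the H0/H1 hit counts in the SAM specification. The function name is hypothetical.

```python
from collections import Counter


def count_xa_strata(sam_lines):
    """Count aligned SAM records per XA:i value."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete SAM record
        if int(fields[1]) & 0x4:
            continue  # FLAG bit 0x4 set: read is unmapped
        # Optional tags start at the 12th column.
        for tag in fields[11:]:
            if tag.startswith("XA:i:"):
                counts[int(tag[5:])] += 1
                break
    return counts
```

Usage would be along the lines of opening the SAM file and passing the file handle to count_xa_strata, then printing the resulting counts.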
droog_22 replied

04-04-2011, 07:26 AM
Counting Hits in a BAM file

Dear All,

I am using bowtie to align reads to the dm3 genome. I just read that the SAM specification allows for tags such as H0, H1, etc., which count the number of 0-difference hits, 1-difference hits, and so on. I know how to add these tags using awk; I was just wondering if it would be straightforward to modify bowtie so that it outputs these values.

Cheers D.
biznatch replied

03-31-2011, 11:22 AM
Originally posted by gntc

The files in chromFa.tar.gz each start out with a large number of 'N's. Is this due to uncertainty near the ends of chromosomes in sequencing?

I think yes, it is because of uncertainty near the ends of the chromosomes. If you look at hg19 in the UCSC Genome Browser and turn on the Gap track, you can see where there are gaps in the sequence on each chromosome. Wherever there is a gap, the sequence contains N's. There are gaps at the ends of each chromosome because telomeres and subtelomeres are repetitive and difficult to sequence and assemble. There are also large gaps at the centromeres for the same reason.
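To look at these gaps in the sequence itself and not only in the browser, a small sketch like the following lists every run of N's in a chromosome. It assumes the chromosome has already been read into one string with the header line and newlines removed; the function name is hypothetical.

```python
import re


def n_runs(sequence):
    """Return (start, end) intervals of consecutive N bases.

    Coordinates are 0-based and half-open, so end - start is the run length.
    """
    return [(m.start(), m.end()) for m in re.finditer(r"[Nn]+", sequence)]
```

A run starting at position 0 corresponds to the block of N's at the beginning of the file, and long internal runs should line up with entries in the Gap track.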
ewels replied

03-31-2011, 08:52 AM
Originally posted by gntc

The files in chromFa.tar.gz each start out with a large number of 'N's. Is this due to uncertainty near the ends of chromosomes in sequencing?

What good timing - I was just searching out the answer to this exact question for that exact file... Does anyone know the answer?
gntc replied

03-30-2011, 10:09 AM
repeats

Originally posted by biznatch

Depends whether the index was made from the masked version of hg19 or not. I'm pretty sure the pre-made index from the Bowtie website is made from the non-masked genome. Both masked and non-masked are available here:

Index of /goldenPath/hg19/bigZips

http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/

"chromFa.tar.gz - The assembly sequence in one file per chromosome.
Repeats from RepeatMasker and Tandem Repeats Finder (with period
of 12 or less) are shown in lower case; non-repeating sequence is
shown in upper case.

chromFaMasked.tar.gz - The assembly sequence in one file per chromosome.
Repeats are masked by capital Ns; non-repeating sequence is shown in
upper case."

The files in chromFa.tar.gz each start out with a large number of 'N's. Is this due to uncertainty near the ends of chromosomes in sequencing?
biznatch replied

03-26-2011, 10:50 PM
Originally posted by gntc

Does the hg19 index mask repeats in the genome?

I have Illumina data that has a large number of repeats. The sequences were mapped using ELAND, and ~30% had >10 matches. When using bowtie, about 10% have >10 matches. What accounts for this difference? Does the hg19 index mask repeats?

Depends whether the index was made from the masked version of hg19 or not. I'm pretty sure the pre-made index from the Bowtie website is made from the non-masked genome. Both masked and non-masked are available here:

Index of /goldenPath/hg19/bigZips

http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/

"chromFa.tar.gz - The assembly sequence in one file per chromosome.
Repeats from RepeatMasker and Tandem Repeats Finder (with period
of 12 or less) are shown in lower case; non-repeating sequence is
shown in upper case.

chromFaMasked.tar.gz - The assembly sequence in one file per chromosome.
Repeats are masked by capital Ns; non-repeating sequence is shown in
upper case."
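If you are not sure which version a given FASTA file came from, a quick composition check can help. This is only a sketch based on the conventions in the README quoted above (lower case for soft-masked repeats, N for hard-masked repeats and gaps); the function name is hypothetical, and the input is assumed to be sequence text with header lines and newlines removed.

```python
def masking_summary(sequence):
    """Count upper-case bases, lower-case (soft-masked) bases, and N's."""
    summary = {"upper": 0, "soft_masked": 0, "n": 0}
    for base in sequence:
        if base in "Nn":
            summary["n"] += 1
        elif base.islower():
            summary["soft_masked"] += 1
        else:
            summary["upper"] += 1
    return summary
```

A file with many lower-case bases would be consistent with chromFa.tar.gz, while a file with no lower-case bases and a much larger share of N's would be consistent with chromFaMasked.tar.gz.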
gntc replied

03-25-2011, 03:28 PM
Repeats

Does the hg19 index mask repeats in the genome?

I have Illumina data that has a large number of repeats. The sequences were mapped using ELAND, and ~30% had >10 matches. When using bowtie, about 10% have >10 matches. What accounts for this difference? Does the hg19 index mask repeats?
nashp replied

03-20-2011, 09:48 PM
question about bowtie's handling of long reads

Hey guys,
I have a question about bowtie's performance as read length increases. Initially I ran bowtie on reads of length 35. Now I am running the experiments with reads of length 51.

When I read bowtie's manual, I noticed it says that bowtie's performance decreases as the read length increases. On the contrary, I am seeing its performance improve since I moved from 35 to 51. Could you guys please tell me why? Is it normal for bowtie to behave this way?

How short is a short read and how long is a long read (in terms of base pairs)?
Xi Wang replied

03-14-2011, 10:13 PM
Nowadays, spliced read mappers first split reads into segments, then apply mappers such as Bowtie to map those segments onto the reference genome. Segments belonging to the same read can map far apart from each other, which is how splice junctions are detected.
What nilshomer described is an older version of Tophat, from when reads were short.
nilshomer replied

03-14-2011, 09:12 PM
Originally posted by tonge

Hi XiWang,
Could you please tell me why bowtie is not suitable for RNA-seq? Especially since Bowtie is utilised by the tophat software.
Thanks, Pete

It is not able to handle spliced reads, as biznatch mentioned.

Originally posted by biznatch

I think it's because Bowtie doesn't recognize splice junctions. I.e., when you sequence RNA, a read often aligns across an intron, so there is a large gap in the alignment. Tophat uses the Bowtie alignment algorithm but can align across splice junctions. ...or something like that.

Just to be clear, Tophat doesn't align anything by itself. Tophat creates a junction reference, where possible splice junctions are represented as contiguous sequences in the FASTA reference, allowing bowtie to map properly to these putative junctions. Take a look at the Tophat paper for more information.
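As a rough illustration of the junction-reference idea (a toy sketch, not Tophat's actual code), the function below joins the exonic bases on either side of a putative intron into one contiguous sequence, which an unspliced aligner could then map reads against. All names and coordinates are hypothetical; positions are 0-based.

```python
def junction_sequence(genome, donor_end, acceptor_start, flank):
    """Join exon flanks on either side of a putative intron.

    genome:         chromosome sequence as a string
    donor_end:      index one past the last base of the upstream exon
    acceptor_start: index of the first base of the downstream exon
    flank:          number of exonic bases to keep on each side
    """
    if not 0 <= donor_end <= acceptor_start <= len(genome):
        raise ValueError("coordinates out of order or out of range")
    left = genome[max(0, donor_end - flank):donor_end]
    right = genome[acceptor_start:acceptor_start + flank]
    return left + right


# Toy example: two 8-base exons separated by an 8-base intron.
genome = "AAAACCCC" + "GTATATAG" + "GGGGTTTT"
junction = junction_sequence(genome, donor_end=8, acceptor_start=16, flank=4)

# A spliced read covering the exon-exon boundary matches the junction
# sequence contiguously, but not the genome itself.
read = "CCGG"
print(read in junction, read in genome)
```

Writing one such sequence per candidate junction into a FASTA file and indexing it gives the kind of junction reference described above.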
biznatch replied

03-14-2011, 07:43 PM
Originally posted by tonge

Hi XiWang,
Could you please tell me why bowtie is not suitable for RNA-seq? Especially since Bowtie is utilised by the tophat software.
Thanks, Pete

I think it's because Bowtie doesn't recognize splice junctions. I.e., when you sequence RNA, a read often aligns across an intron, so there is a large gap in the alignment. Tophat uses the Bowtie alignment algorithm but can align across splice junctions. ...or something like that.
tonge replied

03-14-2011, 06:20 PM
Originally posted by Xi Wang

Generally, no tool is simply better than the others. It depends on what your scientific questions are, what kind of data you have, and what the purpose of the analysis is. For example, Bowtie is not suitable for RNA-seq data.

Hi XiWang,
Could you please tell me why bowtie is not suitable for RNA-seq? Especially since Bowtie is utilised by the tophat software.
Thanks, Pete
gntc replied

02-04-2011, 10:28 AM
hg19 and allocation issues

I am new to bowtie and I am having a couple of problems. First, I downloaded the hg19 ebwt files and attempted to transfer them to the server where I will be running bowtie, but received errors for 5 of the 6 files. Despite the errors, the file names still appeared on the server, so to check whether they were functional I tried a trial run:

./bowtie -c -t hg19 CTGAGCTTGACGCTTTGCTAATATNGTAAGAAGAGAAACTATTAATTATGGCTTTCTAAAATTGAATATCCTTGTACACA

this was the response:

Out of memory allocating plen[] in Ebwt::read() at ebwt.h:3153
Overall time: 00:00:00

What can I do?

Thanks, Greg
david2 replied

02-04-2011, 09:37 AM
Hi jyoshna,
What kind of application do you have in mind? All machines and software packages have advantages and disadvantages depending on what you want to do (re-sequencing? de novo? SNP detection? indels? whole genome or targeted?).