Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Originally posted by Ben Langmead View Post
    That's right; --best does not limit the number of alignments Bowtie reports. If you ask for 1 alignment (default), --best guarantees it's the best. If you ask for -k 4, --best guarantees they're the 4 best, reported in best-to-worst order. If you ask for -a, --best guarantees that you'll get all of them in best-to-worst order.

    Thanks
    Ben
    Thank for quick reply.
    More questions for this topic:
    1. Is it possible some reads only have 2 hits even -k 4 and --best are specified?
    2. How does Bowtie deal with several "equal" best alignments but -k 1 --best were specified? This is the question for how Bowtie deal with repetitive reads.

    Thanks again.
    Last edited by liu3zhen; 11-04-2009, 02:20 PM.

    Comment


    • Originally posted by liu3zhen View Post
      1. Is it possible some reads only have 2 hits even -k 4 and --best are specified?
      Yes.

      Originally posted by liu3zhen View Post
      2. How does Bowtie deal with several "equal" best alignments but -k 1 --best were specified? This is the question for how Bowtie deal with repetitive reads.
      A subset (a subset of 1 in the case of -k 1) is chosen at random.

      Thanks,
      Ben

      Comment


      • Originally posted by Ben Langmead View Post
        A subset (a subset of 1 in the case of -k 1) is chosen at random.
        Thanks, Ben. Bother you again for more questions:
        I think Bowtie is more like BLAST in term of output. The specified maximum alignmens were reported in the order of best-to-worst. Does Bowtie have a way to report only the "equal" best hits. If one best hit is found, only report one alignment. If multiple best hits are found, report up to the specified maximum number of alignments. This is important because I know which reads really hit multiple places in the genome by doing so.

        Comment


        • Originally posted by liu3zhen View Post
          Thanks, Ben. Bother you again for more questions:
          I think Bowtie is more like BLAST in term of output. The specified maximum alignmens were reported in the order of best-to-worst. Does Bowtie have a way to report only the "equal" best hits. If one best hit is found, only report one alignment. If multiple best hits are found, report up to the specified maximum number of alignments. This is important because I know which reads really hit multiple places in the genome by doing so.
          I don't follow your question; please post an example where Bowtie's output is not what you expect/want.

          Ben

          Comment


          • Originally posted by Ben Langmead View Post
            I don't follow your question; please post an example where Bowtie's output is not what you expect/want.

            Ben
            Sorry for the confusing question.
            If I specify -k 4 --best, multiple alignments will be reported even one best hit is found. I'm wondering is it possible only the best hit (only 1) is report for this case, rather than 4 alignments reported in best-to-worst order.

            Comment


            • Originally posted by liu3zhen View Post
              Sorry for the confusing question.
              If I specify -k 4 --best, multiple alignments will be reported even one best hit is found. I'm wondering is it possible only the best hit (only 1) is report for this case, rather than 4 alignments reported in best-to-worst order.
              Please take a look at the documentation for the --strata option in the manual. If that doesn't do what you'd like, please post a small example where Bowtie won't report what you'd like.

              Ben

              Comment


              • Originally posted by liu3zhen View Post
                Sorry for the confusing question.
                If I specify -k 4 --best, multiple alignments will be reported even one best hit is found. I'm wondering is it possible only the best hit (only 1) is report for this case, rather than 4 alignments reported in best-to-worst order.
                Maybe you can specify the option "-m 1". Then all the multiple reads will not be reported.
                Xi Wang

                Comment


                • Originally posted by Ben Langmead View Post
                  Hi Ramzi,

                  The options you're looking for are almost certainly -I/-X and --ff/--fr/--rf. You need to have a reasonably good idea of the expected insert size and specify an appropriate range with -I/-X. You should also confirm that your paired-end protocol produces pairs in the fw/rev orientation. This is the typical configuration for Illumina. If your paired-end data has a different orientation, change it with --ff or --rf.

                  Hope that helps,
                  Ben
                  Hi Ben,
                  thanks for your reply, Indeed setting the -X parameter to 5000 and the orientation to (--rf [although it's the standard illumina paired end protocol]) and clipping 24 pair on the reverse sequence lead to 52.35% :
                  #bowtie -t -p 16 -X 5000 --rf ./ref/h_sapiens_37_asm -1 ./fastq/s_8_1_sequence.fq -2 ./fastq/s_8_2_sequence_50b.fq ./align/pronest_5000_rf_2_50b.bowtie.align --un ./unalign/pronest_5000_rf_2_50b.unalign.fq
                  # reads with at least one reported alignment: 3486518 (52.35%)
                  # reads that failed to align: 3173993 (47.65%)
                  Do you think alignment could be improved further ?
                  thanks again for help and congratulation for the good work.
                  Regards,
                  Ramzi
                  Research Scientist - Bioinformatics
                  Sidra Medical and Research Center

                  Comment


                  • Gzipped files?

                    Hi Ben, Thanks for the update. Do you have an estimate about when Bowtie would be able to handle gzipped files for input and output?

                    Thanks,
                    Andreia

                    Originally posted by Ben Langmead View Post
                    Hi amaer,

                    Perhaps by end-of-year. It's very hard to say because most of my time goes to collaborators, and they don't have predictable schedules . But by end-of-year is a reasonable guess.

                    Thanks,
                    Ben

                    Comment


                    • Hi Andreia,

                      Originally posted by amaer View Post
                      Hi Ben, Thanks for the update. Do you have an estimate about when Bowtie would be able to handle gzipped files for input and output?

                      Thanks,
                      Andreia
                      This is the first time I've heard that requested. Of course, it's easy to pipe or 'tee' bowtie's output directly to gzip; e.g.:

                      $ ./bowtie -u 10 e_coli reads/e_coli_1000.fq | gzip -c > alns.gz
                      # reads processed: 10
                      # reads with at least one reported alignment: 7 (70.00%)
                      # reads that failed to align: 3 (30.00%)
                      Reported 7 alignments to 1 output stream(s)
                      $ gzip -dc alns.gz
                      r0 - gi|110640213|ref|NC_008253.1| 3658049 ATGCTGGAATGGCGATAGTTGGGTGGGTATCGTTC 45567778999:9;;<===>?@@@@AAAABCCCDE 0 32:T>G,34:G>A
                      r1 - gi|110640213|ref|NC_008253.1| 1902085 CGGATGATTTTTATCCCATGAGACATCCAGTTCGG 45567778999:9;;<===>?@@@@AAAABCCCDE 0
                      r2 - gi|110640213|ref|NC_008253.1| 3989609 CATAAAGCAACAGTGTTATACTATAACAATTTTGA 45567778999:9;;<===>?@@@@AAAABCCCDE 0
                      r5 + gi|110640213|ref|NC_008253.1| 4249841 CAGCATAAGTGGATATTCAAAGTTTTGCTGTTTTA EDCCCBAAAA@@@@?>===<;;9:99987776554 0
                      r7 + gi|110640213|ref|NC_008253.1| 4086913 GCATATTGCCAATTTTCGCTTCGGGGATCAGGCTA EDCCCBAAAA@@@@?>===<;;9:99987776554 0
                      r8 + gi|110640213|ref|NC_008253.1| 2679194 GGTTCAGTTCAGTATACGCCTTATCCGGCCTACGG EDCCCBAAAA@@@@?>===<;;9:99987776554 0 14:A>T,33:C>G
                      r9 - gi|110640213|ref|NC_008253.1| 2430559 GCCTGTTCGGCGTTGAGGGTAATGAAATCATCGCC 45567778999:9;;<===>?@@@@AAAABCCCDE 0
                      For unpaired input, it's also possible to pipe 'gzip -dc's output directly to bowtie and specify '-' (meaning "standard in") as the input file. E.g.:

                      $ head -40 reads/e_coli_1000.fq | gzip -c > tmpreads.gz
                      $ gzip -dc tmpreads.gz | ./bowtie e_coli -
                      r0 - gi|110640213|ref|NC_008253.1| 3658049 ATGCTGGAATGGCGATAGTTGGGTGGGTATCGTTC 45567778999:9;;<===>?@@@@AAAABCCCDE 0 32:T>G,34:G>A
                      r1 - gi|110640213|ref|NC_008253.1| 1902085 CGGATGATTTTTATCCCATGAGACATCCAGTTCGG 45567778999:9;;<===>?@@@@AAAABCCCDE 0
                      r2 - gi|110640213|ref|NC_008253.1| 3989609 CATAAAGCAACAGTGTTATACTATAACAATTTTGA 45567778999:9;;<===>?@@@@AAAABCCCDE 0
                      r5 + gi|110640213|ref|NC_008253.1| 4249841 CAGCATAAGTGGATATTCAAAGTTTTGCTGTTTTA EDCCCBAAAA@@@@?>===<;;9:99987776554 0
                      r7 + gi|110640213|ref|NC_008253.1| 4086913 GCATATTGCCAATTTTCGCTTCGGGGATCAGGCTA EDCCCBAAAA@@@@?>===<;;9:99987776554 0
                      r8 + gi|110640213|ref|NC_008253.1| 2679194 GGTTCAGTTCAGTATACGCCTTATCCGGCCTACGG EDCCCBAAAA@@@@?>===<;;9:99987776554 0 14:A>T,33:C>G
                      r9 - gi|110640213|ref|NC_008253.1| 2430559 GCCTGTTCGGCGTTGAGGGTAATGAAATCATCGCC 45567778999:9;;<===>?@@@@AAAABCCCDE 0
                      # reads processed: 10
                      # reads with at least one reported alignment: 7 (70.00%)
                      # reads that failed to align: 3 (30.00%)
                      Reported 7 alignments to 1 output stream(s)
                      But paired-end reads, unfortunately, generally come as two separate files. Bowtie supports a simple alternative format for paired-end reads (1 pair per line, tab-delimited fields are read name, mate 1 sequence, mate 1 qualities, mate 2 sequence, mate 2 qualities) which is enabled with the --12 option. If the reads are in that format and you use the --12 option, you can again pipe the output of 'gzip -dc' and use '-' as the input file. I have a script (that I'll include in the next Bowtie release) that converts a pair of (perhaps gzipped) FASTQ files into that format. That script's output can be conveniently piped to Bowtie e.g.:

                      $ gzip -c reads/e_coli_1000_1.fq > tmp1.fq.gz
                      $ gzip -c reads/e_coli_1000_2.fq > tmp2.fq.gz
                      $ perl scripts/fastq_to_tabbed.pl -1 tmp1.fq.gz -2 tmp2.fq.gz | ./bowtie -u 10 --12 - e_coli
                      r0/2 + gi|110640213|ref|NC_008253.1| 2963701 GAATACTGGCGGATTACCGGGGAAGCTGGAGC EDCCCBAAAA@@@@?>===<;;9:99987776 0
                      r0/1 - gi|110640213|ref|NC_008253.1| 2963907 CTGACCGGCAGGAGTATGAAGGATGCGGAAGAATA 45567778999:9;;<===>?@@@@AAAABCCCDE 0
                      r1/2 + gi|110640213|ref|NC_008253.1| 4118712 AATGTGAAAACGCCATCGATGGAACAGGCAAT EDCCCBAAAA@@@@?>===<;;9:99987776 0
                      r1/1 - gi|110640213|ref|NC_008253.1| 4118838 GGCGTAGATGTCGGCGCGAAAAAAGAGATCTATCA 45567778999:9;;<===>?@@@@AAAABCCCDE 0
                      r2/2 + gi|110640213|ref|NC_008253.1| 4158419 AACGCGCGTTATCGTGCCGGTCCATTACGCGG EDCCCBAAAA@@@@?>===<;;9:99987776 0
                      r2/1 - gi|110640213|ref|NC_008253.1| 4158521 CGCTCAGGGCGTGATGTCCACTTACAAAGGGCGTG 45567778999:9;;<===>?@@@@AAAABCCCDE 0
                      r3/1 + gi|110640213|ref|NC_008253.1| 2116984 CGATGCAGATGCGTACCACCTGGACCAGGCCTTTC EDCCCBAAAA@@@@?>===<;;9:99987776554 2
                      r3/2 - gi|110640213|ref|NC_008253.1| 2117146 CGTTTATCTGGCTGTTTATCCGACGCCCGAAA 67778999:9;;<===>?@@@@AAAABCCCDE 2
                      r4/2 + gi|110640213|ref|NC_008253.1| 1796722 GCACAACATCGGGAGGGTAAGATTTGTGACGA EDCCCBAAAA@@@@?>===<;;9:99987776 0
                      r4/1 - gi|110640213|ref|NC_008253.1| 1796883 CACTTCAGCCGCCACTTCTGGCACCAGCAAAGTCA 45567778999:9;;<===>?@@@@AAAABCCCDE 0
                      r5/2 + gi|110640213|ref|NC_008253.1| 136755 ATGGCCGCACTTGTAGAGCTGCTGAAAAACCC EDCCCBAAAA@@@@?>===<;;9:99987776 0
                      r5/1 - gi|110640213|ref|NC_008253.1| 136879 TGGCTGCTGTCGCCAAAGGCGAAGCTAAATCCCCG 45567778999:9;;<===>?@@@@AAAABCCCDE 0
                      r6/2 + gi|110640213|ref|NC_008253.1| 2717264 CCCATTTCCGCGTCTCAAATAATTTGCTGGAG EDCCCBAAAA@@@@?>===<;;9:99987776 0
                      r6/1 - gi|110640213|ref|NC_008253.1| 2717443 AACAGATAAAAAGCCTGGGTCAGCGCCGTATACGC 45567778999:9;;<===>?@@@@AAAABCCCDE 0
                      r7/1 + gi|110640213|ref|NC_008253.1| 1073149 GTATCGCTGTTTTCCAGTTGTTCAAGATAAGAAAA EDCCCBAAAA@@@@?>===<;;9:99987776554 0
                      r7/2 - gi|110640213|ref|NC_008253.1| 1073350 TGATGTTGCAACTTTGTGCAACCGTGTTAAAT 67778999:9;;<===>?@@@@AAAABCCCDE 0
                      r8/1 + gi|110640213|ref|NC_008253.1| 177630 CCACGGTTGATGCTGGCATCGCTGATTGGTGCGTT EDCCCBAAAA@@@@?>===<;;9:99987776554 0
                      r8/2 - gi|110640213|ref|NC_008253.1| 177789 AGCATTAGCGCGCCGGATATGAAGGTGAACGA 67778999:9;;<===>?@@@@AAAABCCCDE 0
                      r9/1 + gi|110640213|ref|NC_008253.1| 1642197 GCGCCTTAATGCGCTACTCCACCAGCAATTACGTA EDCCCBAAAA@@@@?>===<;;9:99987776554 0
                      r9/2 - gi|110640213|ref|NC_008253.1| 1642268 TCAACTGGCCCTCACCGCCAGATGCCACGCGA 67778999:9;;<===>?@@@@AAAABCCCDE 0
                      # reads processed: 10
                      # reads with at least one reported alignment: 10 (100.00%)
                      # reads that failed to align: 0 (0.00%)
                      Reported 10 paired-end alignments to 1 output stream(s)
                      To me, delegating zipping and unzipping to external programs is vastly preferable to building it into Bowtie, since it enables a multitude of compression formats without unnecessarily complicating Bowtie or detracting from its portability. So I'm unlikely to build that into Bowtie unless I hear many more people ask for it.

                      Let me know if the suggestions don't work or if you'd like be to send you the paired-end script.

                      Thanks,
                      Ben

                      Comment


                      • eliminating redundancy from the output

                        Originally posted by Ben Langmead View Post
                        Hi Andreia,

                        This is the first time I've heard that requested. Of course, it's easy to pipe or 'tee' bowtie's output directly to gzip; (...) To me, delegating zipping and unzipping to external programs is vastly preferable to building it into Bowtie, since it enables a multitude of compression formats without unnecessarily complicating Bowtie or detracting from its portability. So I'm unlikely to build that into Bowtie unless I hear many more people ask for it.

                        Let me know if the suggestions don't work or if you'd like be to send you the paired-end script.

                        Thanks,
                        Ben
                        Thanks, Ben, I would have liked to have the option of reading a zipped/gzipped file since that's how we receive the data from Solexa sequencing provider. Besides, the FASTQ files are quite big, and I'd prefer to leave them uncompressed, rather than decompress them for analysis and then recompress them. But I can certainly understand your point.

                        In the mean time, I have a few more questions/suggestions? about the output. I had 20 million Solexa 36mer reads run against a division of Genbank (lots of redundancy obviously) and got a huge output file - over 100 GB. It took over 3 hrs on 8 CPUs (64 bit) (I ran it with the options -a --best --strata -n 2).

                        1. Could we have the option that in the outfile we don't print the read sequence and/or quality score? or other ways to reduce the size of the output file, while not being quite "concise" style?
                        2. In the 'concise' mode, could we print the NAME (e.g. gi/accession up to the 1st whitespace) of the reference sequence, instead of the index?

                        Thanks,
                        Andreia

                        Comment


                        • @RG Header

                          Hi Ben,
                          Is there a way to have @RG header when generating the sam output file with bowtie ?
                          Thanks in advance.
                          Research Scientist - Bioinformatics
                          Sidra Medical and Research Center

                          Comment


                          • Originally posted by amaer View Post
                            I would have liked to have the option of reading a zipped/gzipped file since that's how we receive the data from Solexa sequencing provider. Besides, the FASTQ files are quite big, and I'd prefer to leave them uncompressed, rather than decompress them for analysis and then recompress them. But I can certainly understand your point.
                            To be clear, I am suggesting a solution to your problem; just not one that builds gzip into Bowtie, which is my preference.

                            Originally posted by amaer View Post
                            In the mean time, I have a few more questions/suggestions? about the output. I had 20 million Solexa 36mer reads run against a division of Genbank (lots of redundancy obviously) and got a huge output file - over 100 GB. It took over 3 hrs on 8 CPUs (64 bit) (I ran it with the options -a --best --strata -n 2).

                            1. Could we have the option that in the outfile we don't print the read sequence and/or quality score? or other ways to reduce the size of the output file, while not being quite "concise" style?
                            2. In the 'concise' mode, could we print the NAME (e.g. gi/accession up to the 1st whitespace) of the reference sequence, instead of the index?
                            If the size of the output is a problem and not all of the information output by Bowtie is needed, I would strongly suggest a strategy of postprocessing bowtie's default output, perhaps by piping its output to a tool like awk or by wrapping the bowtie invocation it in a perl script or something similar. That's far more flexible than relying solely on Bowtie's output mode. Here's a simple example where read sequences and qualities are suppressed but the output is otherwise identical:

                            $ ./bowtie -u 10 e_coli reads/e_coli_1000.fq | awk -v OFS="\t" '{print $1,$2,$3,$4,$7,$8}'
                            # reads processed: 10
                            # reads with at least one reported alignment: 7 (70.00%)
                            # reads that failed to align: 3 (30.00%)
                            Reported 7 alignments to 1 output stream(s)
                            r0 - gi|110640213|ref|NC_008253.1| 3658049 0 32:T>G,34:G>A
                            r1 - gi|110640213|ref|NC_008253.1| 1902085 0
                            r2 - gi|110640213|ref|NC_008253.1| 3989609 0
                            r5 + gi|110640213|ref|NC_008253.1| 4249841 0
                            r7 + gi|110640213|ref|NC_008253.1| 4086913 0
                            r8 + gi|110640213|ref|NC_008253.1| 2679194 0 14:A>T,33:C>G
                            r9 - gi|110640213|ref|NC_008253.1| 2430559 0
                            It's likely that I'll deprecate --concise mode in a future release, so writing a script or a wrapper to postprocess Bowtie's output is highly recommended.

                            Hope that helps,
                            Ben

                            Comment


                            • Originally posted by ramouz87 View Post
                              Is there a way to have @RG header when generating the sam output file with bowtie ?
                              Hi ramouz,

                              I guess I'm not too familiar with how other tools set these fields. Are they typically set by the user? I.e., should I add a command-line option where the user can specify values for ID, SM, etc.?

                              Thanks,
                              Ben

                              Comment


                              • concise mode and speed

                                Originally posted by Ben Langmead View Post
                                (...) piping its output to a tool like awk or by wrapping the bowtie invocation it in a perl script or something similar. That's far more flexible than relying solely on Bowtie's output mode. Here's a simple example where read sequences and qualities are suppressed but the output is otherwise identical:
                                (...)
                                It's likely that I'll deprecate --concise mode in a future release, so writing a script or a wrapper to postprocess Bowtie's output is highly recommended.

                                Hope that helps,
                                Ben
                                Thanks a lot, Ben, that's very helpful. I have noticed that in my case running in the --concise mode is MUCH faster than the full mode (more than twice as fast). I don't mind using indexes instead of real IDs that much, but you do 'lose' the mismatch information. a concise mode with the mismatch information added would be nice.

                                I also noticed that if I have for example a target of 1 million Genbank sequences of 20Kb each, if I concatenate them in a single Fasta sequence before building the index it speeds up the run by about 15-20%.

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Choosing Between NGS and qPCR
                                  by seqadmin



                                  Next-generation sequencing (NGS) and quantitative polymerase chain reaction (qPCR) are essential techniques for investigating the genome, transcriptome, and epigenome. In many cases, choosing the appropriate technique is straightforward, but in others, it can be more challenging to determine the most effective option. A simple distinction is that smaller, more focused projects are typically better suited for qPCR, while larger, more complex datasets benefit from NGS. However,...
                                  10-18-2024, 07:11 AM
                                • seqadmin
                                  Non-Coding RNA Research and Technologies
                                  by seqadmin




                                  Non-coding RNAs (ncRNAs) do not code for proteins but play important roles in numerous cellular processes including gene silencing, developmental pathways, and more. There are numerous types including microRNA (miRNA), long ncRNA (lncRNA), circular RNA (circRNA), and more. In this article, we discuss innovative ncRNA research and explore recent technological advancements that improve the study of ncRNAs.

                                  Nobel Prize for MicroRNA Discovery
                                  This week,...
                                  10-07-2024, 08:07 AM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, Yesterday, 05:31 AM
                                0 responses
                                10 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 10-24-2024, 06:58 AM
                                0 responses
                                20 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 10-23-2024, 08:43 AM
                                0 responses
                                50 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 10-17-2024, 07:29 AM
                                0 responses
                                58 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X