  • Build BLAST db with illumina reads

    I am running BLAST locally and would like to populate a database with a set of paired-end Illumina reads. There are 2,132,034,004 reads that I would like to load into the db. Building the db is taking a very long time. I used the makeblastdb command. Has anyone tried this before? How long did it take? Any general advice or other suggestions? Thank you.

  • #2
    I have done this before and I think it took a day or so, and that was with more like 100M reads, not 2B! Whatever you are trying to do, it will probably be easier a different way. At the very least, you should probably be using mpiblast, as any query against that database will also take a very long time.

    What is it you are trying to do anyway? The community here might have some better suggestions for you.

    • #3
      One aspect of my project is to recover the original read names. I have data from DNASTAR, but it changed the original read names. I also want to identify the unmatched reads and verify the sequences the original analyst found interesting.

      • #4
        Ok, I'm still somewhat confused by what data you have and what you're doing with it.

        Are you trying to, for example, find the reads that support a SNP from an alignment? You could more easily reduce a data set like that with samtools mpileup to just reads that overlap the SNP. Then blasting, or even just inspection, could work a lot faster.

        • #5
          Sorry for not being clear. Thank you for your help and patience. Here is the main purpose of my task:

          I have some Illumina sequences that another analyst identified as chimeric. He came to this conclusion using other software that trimmed and renamed the reads. I want to blast these 'chimeric' reads against a custom blast database populated with all the reads from the entire dataset. My hope is that the chimeric reads from the other analyst will align with the original reads, so that I can view each entire (untrimmed) read along with its header information and recover the original read name. Then I can use the original data for further analysis.
          Last edited by wdemos; 06-27-2012, 09:38 AM. Reason: clarity

          • #6
            Provided you have access to a machine with a good amount of RAM, doing a blat search may also be a possibility. Are your sequences already in a multi-fasta format?

            • #7
              Yes, the files are concatenated into one large fasta file.
              Last edited by wdemos; 06-27-2012, 09:58 AM. Reason: clarification

              • #8
                Hmm, I see. That is a little more problematic then, isn't it?

                It does sound like blat or blast are your only options. You could try splitting that huge file of reads into some number of chunks, running makeblastdb on each chunk individually, and then concatenating the search results from each chunk later. Blat has the advantage that you don't need to format your database. If you have Kent's src, you could use a few of its tools (e.g. faSplit) to speed this job up, and then run multiple queries on the individual chunks.
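
A minimal Python sketch of the chunking idea, for anyone without faSplit at hand. It is not faSplit itself: the function names, the `chunk_00000.fa` naming scheme, and the records-per-chunk parameter are all assumptions made for illustration. It streams the multi-fasta file so the full read set never has to fit in memory.

```python
from pathlib import Path
from typing import Iterator, List, Tuple


def read_fasta(path) -> Iterator[Tuple[str, str]]:
    """Stream (header, sequence) pairs from a FASTA file, one record at a time."""
    header, seq = None, []
    with open(path) as fh:
        for line in fh:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(seq)
                header, seq = line[1:], []
            elif line:
                seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def split_fasta(path, out_dir, records_per_chunk: int) -> List[Path]:
    """Write consecutive records into numbered chunk files (hypothetical naming)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunks: List[Path] = []
    out = None
    for i, (header, seq) in enumerate(read_fasta(path)):
        if i % records_per_chunk == 0:
            # Start a new chunk; zero-padded names sort in numeric order.
            if out is not None:
                out.close()
            chunk_path = out_dir / f"chunk_{len(chunks):05d}.fa"
            chunks.append(chunk_path)
            out = open(chunk_path, "w")
        out.write(f">{header}\n{seq}\n")
    if out is not None:
        out.close()
    return chunks
```

Calling `split_fasta("all_reads.fa", "chunks", 10_000_000)` (file name and chunk size are placeholders) would return the list of chunk paths, each of which could then be passed to makeblastdb or used directly as a blat database.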

                If you have access to a cluster, you could take advantage of hundreds of cores with blat. Just be clever about the faSplit output names, then use qsub -t and the $PBS_ARRAYID variable in your command so that each array task picks up its own chunk. Blat also has the advantage that the resulting .psl files are easy to merge and filter.
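
The array-job trick can be sketched roughly as follows, in Python. Each task reads `PBS_ARRAYID` from its environment, maps it to one chunk file, and runs blat with that chunk as the database. The zero-padded `chunk_00000.fa` naming, the directory names, and the helper functions are assumptions for illustration; the array range given to `qsub -t` would have to match however the chunks were actually numbered.

```python
import os
import subprocess
from pathlib import Path


def chunk_for_task(chunk_dir, task_id: int, prefix: str = "chunk_", width: int = 5) -> Path:
    """Map an array task ID to its chunk file (hypothetical naming scheme)."""
    return Path(chunk_dir) / f"{prefix}{task_id:0{width}d}.fa"


def blat_command(chunk, query, out_dir) -> list:
    """Build a 'blat database query output.psl' command for one chunk."""
    out_psl = Path(out_dir) / (Path(chunk).stem + ".psl")
    return ["blat", str(chunk), str(query), str(out_psl)]


def run_task(query="chimeric.fa", chunk_dir="chunks", out_dir="psl"):
    """Run blat for the chunk selected by $PBS_ARRAYID (paths are placeholders)."""
    task_id = int(os.environ["PBS_ARRAYID"])
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    cmd = blat_command(chunk_for_task(chunk_dir, task_id), query, out_dir)
    subprocess.run(cmd, check=True)
```

A job script submitted with `qsub -t` would then just call `run_task()`, and every task writes its own .psl file, which keeps the merge step simple.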

                It doesn't sound like fun to me, but it might just be done within a day if you have the right resources and a well-thought-out plan.

                • #9
                  Thank you. I was just approved for access to a cluster. I will let you know how it turns out.

                  • #10
                    How about making a bwa or bowtie index of the full reads and aligning to that instead of blasting? After all, that's the virtue of those algorithms: aligning short reads.

                    • #11
                      I left the makeblastdb command running while I tried working with the cluster, and it actually finished populating the database last night. I am currently running my BLAST search. I appreciate all the help, and if my blast fails I may consider the bwa/bowtie suggestion. Thank you.

                      • #12
                        Sounds good, I'm happy to help.
