  • lanner
    Member
    • Apr 2014
    • 29

    Reduce .fa reference file

    I am using the "BWA for SOLiD" tool on Galaxy. It calls for two inputs:

    1) "Reference Genome": I am using mrna.fa.gz - Human mRNA from GenBank (from the website http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/). It is ~532MB.

    2) "FASTQ file (Nucleotide-space recoded from color-space)": I am using a .fastq file of human transcriptome data. It is ~ 1.8GB.

    I was told to go ahead and try running "BWA for SOLiD" with these inputs, but that it would most likely exceed resources with a memory error.

    I am wondering how I can prevent this (without having to reference cloud resources, etc), and just use the normal Galaxy platform. I have already reduced my .fastq file from its original size by 10-fold (I randomly kept only 1 out of every 10 sequences).

    What is the most effective way for me to reduce the process? And how can I do so without introducing more biases? Should I further reduce my .fastq files by another 2 or 5 fold etc.? Or should I reduce my .fa file, and if so, what is the ideal way to accomplish this?

    I am not concerned about quality. This is for a quick course project, not for any publication!

    I am feeling concerned because already, it has been ~2 hours since I submitted the "BWA for SOLiD" job to Galaxy, and it is still "waiting to run", whereas I have since run many other smaller jobs, and have never had to wait for my job to begin on Galaxy, except for a few minutes. Approximately how long would such a job take on Galaxy, given the size of the inputs? I just don't know what to expect, and am feeling concerned about time issues....

    Sorry for a long message. If you have any advice on any of the topics, I would be glad to hear them!
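For readers who want to reproduce the 1-in-10 read subsampling described in this post, here is a minimal sketch. It assumes plain 4-line FASTQ records (no wrapped sequence lines); the function name `subsample_fastq` and its parameters are hypothetical, not part of Galaxy or BWA. Each record is kept independently with the given probability, so the output size is only approximately `fraction` of the input.

```python
import random
from typing import Iterable, Iterator, List


def subsample_fastq(lines: Iterable[str], fraction: float,
                    seed: int = 0) -> Iterator[List[str]]:
    """Yield each 4-line FASTQ record with probability `fraction`.

    Assumes every record is exactly 4 lines (header, sequence, '+', quality).
    """
    rng = random.Random(seed)  # fixed seed makes the subsample reproducible
    record: List[str] = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            if rng.random() < fraction:
                yield record
            record = []


if __name__ == "__main__":
    demo = [
        "@read1", "ACGT", "+", "IIII",
        "@read2", "TTGA", "+", "IIII",
        "@read3", "GGCA", "+", "IIII",
    ]
    for rec in subsample_fastq(demo, fraction=0.5, seed=1):
        print("\n".join(rec))
```

Because every read has the same chance of being kept regardless of its content, this kind of subsampling should not favor any transcript, although rare transcripts may lose all of their reads.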
  • Brian Bushnell
    Super Moderator
    • Jan 2014
    • 2709

    #2
    The amount of time you need to wait is based on other users. It's difficult to predict when your job will run; generally, the fewer resources you need, the sooner it will run, though that's not always true.

    The amount of memory that mapping needs depends on the size of the reference, not on the number of reads (which only affects the run time). I encourage you to avoid Solid (colorspace) data and focus on base-space data, which is more accurate and has far more relevant tools.
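Since memory use follows the reference size, one rough way to shrink the .fa file is to keep only a random fraction of its records. The sketch below assumes a standard multi-line FASTA; the helper names `read_fasta` and `reduce_reference` are hypothetical. Note the likely cost: reads that come from dropped transcripts will either fail to map or map to a similar retained transcript, so this is only reasonable when quality is not a concern.

```python
import random
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Parse FASTA lines into (header, sequence) pairs."""
    header = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def reduce_reference(records: Iterable[Record], fraction: float,
                     seed: int = 0) -> List[Record]:
    """Keep each record with probability `fraction` (an illustrative choice)."""
    rng = random.Random(seed)
    return [(h, s) for h, s in records if rng.random() < fraction]


if __name__ == "__main__":
    demo = [">tx1", "ACGT", "ACGT", ">tx2", "GGGG", ">tx3", "TTTTTT"]
    kept = reduce_reference(read_fasta(demo), fraction=0.5, seed=1)
    for header, seq in kept:
        print(f">{header}\n{seq}")
    print("bases kept:", sum(len(s) for _, s in kept))
```

For a gzipped file such as mrna.fa.gz, the same functions could be fed from `gzip.open(path, "rt")` in the Python standard library.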

    • GenoMax
      Senior Member
      • Feb 2008
      • 7142

      #3
      Originally posted by lanner:
      I am feeling concerned because already, it has been ~2 hours since I submitted the "BWA for SOLiD" job to Galaxy, and it is still "waiting to run", whereas I have since run many other smaller jobs, and have never had to wait for my job to begin on Galaxy, except for a few minutes. Approximately how long would such a job take on Galaxy, given the size of the inputs? I just don't know what to expect, and am feeling concerned about time issues....

      Sorry for a long message. If you have any advice on any of the topics, I would be glad to hear them!
      I assume you are using the public Galaxy instance at PSU. Alignment jobs on public Galaxy go into a separate queue, and since there are many of them from around the world, they generally take longer (possibly up to 24 h). Do not delete and resubmit, because your job will go to the end of the queue.

      I suggest calling it a day and checking back tomorrow morning. Should be done by then.

      • lanner
        Member
        • Apr 2014
        • 29

        #4
        Thank you both for your input!

        Brian: "I encourage you to avoid Solid (colorspace) data and focus on base-space data, which is more accurate and has far more relevant tools."

        Are you suggesting this for after the BWA alignment? I found this advice on another site:

        "You should not convert colorspace to base space prior to aligning reads. The reason for this is that if there is an error in one of the color calls, it will effect all the downstream color calls. Instead, you should use an aligner that will do the assembly in color-space instead."

        GenoMax: Yes, I am just using my (free) account on https://usegalaxy.org/. Glad to know you think it might be done by tomorrow morning, or even done at all, given its memory constraints! It is still waiting.

        I am actually running a similar pipeline on four files in total. Should I submit them all to BWA alignment now, so that they are early in the queue? Or will that cause a memory crash, or push my jobs later in the queue because I, as a single user, am using more memory? I am just wondering how to get them through the BWA process on Galaxy the fastest: serially or in parallel?

        Thanks again!

        • Brian Bushnell
          Super Moderator
          • Jan 2014
          • 2709

          #5
          Originally posted by lanner:
          Thank you both for your input!

          Brian: "I encourage you to avoid Solid (colorspace) data and focus on base-space data, which is more accurate and has far more relevant tools."

          Are you suggesting this for after the BWA alignment? I found this advice on another site:

          "You should not convert colorspace to base space prior to aligning reads. The reason for this is that if there is an error in one of the color calls, it will effect all the downstream color calls. Instead, you should use an aligner that will do the assembly in color-space instead."
          Well, that's accurate... but no, I am suggesting that you avoid colorspace from the start, and use reads from a different platform than Solid. Converting colorspace to base-space is highly subjective (I have written a colorspace to base-space converter) and introduces many biases. And yes, when reads contain errors, it is impossible to usefully convert them from colorspace to base-space before they are mapped; converting after mapping instead introduces a reference bias.
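The error-propagation problem in the quoted advice can be illustrated with a small sketch. It uses the standard two-base color code, in which each color is the XOR of the 2-bit codes of adjacent bases (A=0, C=1, G=2, T=3), and it assumes the first base is known. The function names are hypothetical, and this is not the converter mentioned above.

```python
from typing import List, Tuple

BASES = "ACGT"  # index of each base is its 2-bit code: A=0, C=1, G=2, T=3


def encode(seq: str) -> Tuple[str, List[int]]:
    """Return the first base plus one color per adjacent base pair."""
    codes = [BASES.index(b) for b in seq]
    colors = [a ^ b for a, b in zip(codes, codes[1:])]
    return seq[0], colors


def decode(first_base: str, colors: List[int]) -> str:
    """Rebuild the base sequence by chaining colors from the first base."""
    current = BASES.index(first_base)
    out = [first_base]
    for color in colors:
        current ^= color  # each base depends on every color before it
        out.append(BASES[current])
    return "".join(out)


if __name__ == "__main__":
    first, colors = encode("ATGGCA")
    print(colors)                    # [3, 1, 0, 3, 1]
    print(decode(first, colors))     # ATGGCA

    corrupted = list(colors)
    corrupted[1] = 2                 # a single miscalled color
    print(decode(first, corrupted))  # ATCCGT
```

One wrong color changes every base after it, which is why naive conversion before mapping is unreliable, whereas in color-space alignment a single miscalled color remains a single mismatch.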

          In my opinion, Solid was a poorly-implemented technology, and I believe the world would be better off if everyone pretended it never existed. It is obsolete, but even when it was still being marketed, rival technologies (Illumina, 454, Sanger) were vastly superior in terms of read lengths, error rates, and compatibility with software. As long as people keep publishing things based on Solid data, the signal-to-noise ratio of scientific literature will be adversely affected.

          • lanner
            Member
            • Apr 2014
            • 29

            #6
            Brian: Thank you, that is really helpful for me to know. I will mention this when I present my project, especially to explain questionable results. And I will avoid using SOLiD if I ever perform this sort of analysis in a more serious capacity (for publication purposes)!

            • Brian Bushnell
              Super Moderator
              • Jan 2014
              • 2709

              #7
              You're welcome!

              But please note that Solid can give useful results, and "I used Solid data" can't be used to explain results contradictory to what you were expecting. It's just very unreliable; so while publications have been made in the past using Solid data, I would not personally pass one today.

              • lanner
                Member
                • Apr 2014
                • 29

                #8
                Brian: Okay, thanks for the clarification...
