  • lanner
    Member
    • Apr 2014
    • 29

    Reduce .fa reference file

    I am using the "BWA for SOLiD" tool on Galaxy. It calls for two inputs:

    1) "Reference Genome": I am using mrna.fa.gz - Human mRNA from GenBank (from the website http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/). It is ~532MB.

    2) "FASTQ file (Nucleotide-space recoded from color-space)": I am using a .fastq file of human transcriptome data. It is ~ 1.8GB.

    I was told to go ahead and try running "BWA for SOLiD" with these inputs, but that it would most likely exceed resources with a memory error.

    I am wondering how I can prevent this (without having to reference cloud resources, etc), and just use the normal Galaxy platform. I have already reduced my .fastq file from its original size by 10-fold (I randomly kept only 1 out of every 10 sequences).
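
For reference, the 1-in-10 random subsampling described above could be sketched roughly as follows. This is only a minimal Python sketch, not the tool actually used for the reduction; it assumes standard four-line FASTQ records (no wrapped sequence lines), and the names `read_fastq` and `subsample_fastq` are hypothetical.

```python
import random
from typing import Iterable, Iterator, List


def read_fastq(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield FASTQ records as lists of four lines:
    header, sequence, '+' separator, qualities."""
    record: List[str] = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            yield record
            record = []


def subsample_fastq(lines: Iterable[str], keep_fraction: float,
                    seed: int = 0) -> Iterator[List[str]]:
    """Keep each record independently with probability keep_fraction.

    A fixed seed makes the selection reproducible between runs.
    """
    rng = random.Random(seed)
    for record in read_fastq(lines):
        if rng.random() < keep_fraction:
            yield record
```

Because every read is kept or dropped independently of its content, this kind of subsampling should not favour any particular transcript; it only lowers coverage evenly.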

    What is the most effective way for me to reduce the job's resource requirements? And how can I do so without introducing more biases? Should I further reduce my .fastq files by another 2- or 5-fold? Or should I reduce my .fa file, and if so, what is the ideal way to accomplish this?

    I am not concerned about quality. This is for a quick course project - not for any publication!

    I am feeling concerned because already, it has been ~2 hours since I submitted the "BWA for SOLiD" job to Galaxy, and it is still "waiting to run", whereas I have since run many other smaller jobs, and have never had to wait for my job to begin on Galaxy, except for a few minutes. Approximately how long would such a job take on Galaxy, given the size of the inputs? I just don't know what to expect, and am feeling concerned about time issues....

    Sorry for a long message. If you have any advice on any of the topics, I would be glad to hear them!
  • Brian Bushnell
    Super Moderator
    • Jan 2014
    • 2709

    #2
    The amount of time you need to wait is based on other users. It's difficult to predict when your job will run; generally, the fewer resources you need, the sooner it will run, though that's not always true.

    The amount of memory that mapping needs depends on the size of the reference, not on the number of reads (the number of reads only affects the run time). I encourage you to avoid Solid (colorspace) data and focus on base-space data, which is more accurate and has far more relevant tools.
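
Since memory scales with the reference rather than the reads, one way to shrink the job is to keep only a subset of the reference sequences. Below is a minimal Python sketch of that idea, assuming plain FASTA input; the helper names `read_fasta`, `subset_reference`, and `total_bases` are hypothetical. Note that any read originating from a dropped transcript can no longer map correctly, so this does introduce bias.

```python
import random
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def subset_reference(lines: Iterable[str], keep_fraction: float,
                     seed: int = 0) -> Iterator[Tuple[str, str]]:
    """Keep each reference sequence with probability keep_fraction."""
    rng = random.Random(seed)
    for header, seq in read_fasta(lines):
        if rng.random() < keep_fraction:
            yield header, seq


def total_bases(records: Iterable[Tuple[str, str]]) -> int:
    """Total sequence length, a rough proxy for index size."""
    return sum(len(seq) for _, seq in records)
```

For a gzipped file such as mrna.fa.gz, the lines could be read with `gzip.open(path, "rt")` from the standard library and passed straight to `subset_reference`.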


    • GenoMax
      Senior Member
      • Feb 2008
      • 7142

      #3
      Originally posted by lanner
      I am feeling concerned because already, it has been ~2 hours since I submitted the "BWA for SOLiD" job to Galaxy, and it is still "waiting to run", whereas I have since run many other smaller jobs, and have never had to wait for my job to begin on Galaxy, except for a few minutes. Approximately how long would such a job take on Galaxy, given the size of the inputs? I just don't know what to expect, and am feeling concerned about time issues....

      Sorry for a long message. If you have any advice on any of the topics, I would be glad to hear them!
      I assume you are using the public Galaxy instance at PSU. Alignment jobs on public Galaxy go into a separate queue, and since there are many of them from users around the world, they generally take longer (possibly up to 24 h). Do not delete and resubmit, because your job will then go to the end of the queue.

      I suggest calling it a day and checking back tomorrow morning. Should be done by then.


      • lanner
        Member
        • Apr 2014
        • 29

        #4
        Thank you both for your input!

        Brian: "I encourage you to avoid Solid (colorspace) data and focus on base-space data, which is more accurate and has far more relevant tools."

        Are you suggesting this for after the BWA alignment? I found this advice on another site:

        "You should not convert colorspace to base space prior to aligning reads. The reason for this is that if there is an error in one of the color calls, it will affect all the downstream color calls. Instead, you should use an aligner that will do the assembly in color-space instead."

        GenoMax: Yes, I am just using my (free) account on https://usegalaxy.org/. Glad to know you think it might be done by tomorrow morning, or even done at all, given its memory constraints! It is still waiting.

        I am actually running a similar pipeline on four files in total. Should I submit them all to BWA alignment now, so they are early in the queue? Or will that cause a memory crash, or push my jobs later in the queue because I, as a user, am using more memory? I am just wondering how to get them through the BWA process on Galaxy the fastest: serially or in parallel?

        Thanks again!


        • Brian Bushnell
          Super Moderator
          • Jan 2014
          • 2709

          #5
          Originally posted by lanner
          Thank you both for your input!

          Brian: "I encourage you to avoid Solid (colorspace) data and focus on base-space data, which is more accurate and has far more relevant tools."

          Are you suggesting this for after the BWA alignment? I found this advice on another site:

          "You should not convert colorspace to base space prior to aligning reads. The reason for this is that if there is an error in one of the color calls, it will affect all the downstream color calls. Instead, you should use an aligner that will do the assembly in color-space instead."
          Well, that's accurate... but no, I am suggesting that you avoid colorspace from the start, and use reads from a different platform than Solid. Converting colorspace to base-space is highly subjective (I have written a colorspace to base-space converter) and introduces many biases. And yes, because of errors, reads cannot be usefully converted from colorspace to base-space until after they have been mapped, and converting after mapping introduces a reference bias.
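
To illustrate why a single color error damages everything downstream, here is a minimal Python sketch of naive colorspace decoding. It assumes the usual two-base color code, in which a color is the XOR of the 2-bit codes of adjacent bases (A=0, C=1, G=2, T=3), and a known leading base; the function names `encode` and `decode` are hypothetical, and this is not the converter mentioned above.

```python
BASES = "ACGT"  # index doubles as the 2-bit code: A=0, C=1, G=2, T=3


def encode(bases: str) -> str:
    """Return the leading base followed by one color per adjacent base pair."""
    colors = [
        str(BASES.index(a) ^ BASES.index(b))
        for a, b in zip(bases, bases[1:])
    ]
    return bases[0] + "".join(colors)


def decode(colorspace: str) -> str:
    """Naively decode: each base is the previous base XOR the next color."""
    prev = BASES.index(colorspace[0])
    out = [colorspace[0]]
    for color in colorspace[1:]:
        prev ^= int(color)
        out.append(BASES[prev])
    return "".join(out)


if __name__ == "__main__":
    original = "ACGGTCA"
    colors = encode(original)
    # Flip one color call (the second color) to simulate a sequencing error.
    corrupted = colors[:2] + "0" + colors[3:]
    print(colors, decode(colors))
    print(corrupted, decode(corrupted))
```

Because each decoded base depends on the previous one, the single wrong color changes every base after it, which is why naive conversion before mapping is not useful when the reads contain errors.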

          In my opinion, Solid was a poorly-implemented technology, and I believe the world would be better off if everyone pretended it never existed. It is obsolete, but even when it was still being marketed, rival technologies (Illumina, 454, Sanger) were vastly superior in terms of read lengths, error rates, and compatibility with software. As long as people keep publishing things based on Solid data, the signal-to-noise ratio of scientific literature will be adversely affected.


          • lanner
            Member
            • Apr 2014
            • 29

            #6
            Brian: Thank you, that is really helpful for me to know. I will mention this when I present my project, especially to explain questionable results. And I will avoid using SOLiD if I ever perform this sort of analysis in a more serious capacity (for publication purposes)!!


            • Brian Bushnell
              Super Moderator
              • Jan 2014
              • 2709

              #7
              You're welcome!

              But please note that Solid can give useful results, and "I used Solid data" can't be used to explain results contradictory to what you were expecting. It's just very unreliable; so while publications have been made in the past using Solid data, I would not personally pass one today.


              • lanner
                Member
                • Apr 2014
                • 29

                #8
                Brian: Okay, thanks for the clarification...
