  • Reduce .fa reference file

    I am using the "BWA for SOLiD" tool on Galaxy. It calls for two inputs:

    1) "Reference Genome": I am using mrna.fa.gz - Human mRNA from GenBank (from the website http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/). It is ~532MB.

    2) "FASTQ file (Nucleotide-space recoded from color-space)": I am using a .fastq file of human transcriptome data. It is ~ 1.8GB.

    I was told to go ahead and try running "BWA for SOLiD" with these inputs, but that it would most likely exceed resources with a memory error.

    I am wondering how I can prevent this (without having to resort to cloud resources, etc.) and just use the normal Galaxy platform. I have already reduced my .fastq file 10-fold from its original size (I randomly kept only 1 out of every 10 sequences).
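The random 1-in-10 reduction described above can be sketched in a few lines of Python. This is only an illustration, not the tool the poster used: the function names `read_fastq` and `subsample_fastq` are hypothetical, and the sketch assumes plain four-line FASTQ records (sequence and quality strings not wrapped across lines).

```python
import random
from typing import Iterator, List, TextIO


def read_fastq(handle: TextIO) -> Iterator[List[str]]:
    """Yield FASTQ records as lists of four lines (assumes unwrapped records)."""
    while True:
        record = [handle.readline() for _ in range(4)]
        if not record[0]:  # empty string means end of file
            return
        yield record


def subsample_fastq(src: TextIO, dst: TextIO, fraction: float, seed: int = 0) -> int:
    """Copy each record from src to dst with probability `fraction`.

    Every read is kept or dropped independently of the others. A fixed seed
    makes the selection reproducible. Returns the number of records written.
    """
    rng = random.Random(seed)
    kept = 0
    for record in read_fastq(src):
        if rng.random() < fraction:
            dst.writelines(record)
            kept += 1
    return kept
```

With files opened in text mode, `subsample_fastq(src, dst, 0.1)` would keep roughly one read in ten. The number of reads kept is approximate rather than exact, because each read is sampled independently.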

    What is the most effective way for me to reduce the job's resource requirements? And how can I do so without introducing more biases? Should I further reduce my .fastq files by another 2- or 5-fold, etc.? Or should I reduce my .fa file, and if so, what is the ideal way to accomplish this?

    I am not concerned about quality. This is for a quick course project - not for any publication!

    I am feeling concerned because already, it has been ~2 hours since I submitted the "BWA for SOLiD" job to Galaxy, and it is still "waiting to run", whereas I have since run many other smaller jobs, and have never had to wait for my job to begin on Galaxy, except for a few minutes. Approximately how long would such a job take on Galaxy, given the size of the inputs? I just don't know what to expect, and am feeling concerned about time issues....

    Sorry for a long message. If you have any advice on any of the topics, I would be glad to hear them!

  • #2
    The amount of time you need to wait is based on other users. It's difficult to predict when your job will run; generally, the fewer resources you need, the sooner it will run, though that's not always true.

    The amount of memory that mapping needs depends on the size of the reference, not on the number of reads (which only affects the run time). I encourage you to avoid Solid (colorspace) data and focus on base-space data, which is more accurate and has far more relevant tools.
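Since memory depends on the reference rather than on the reads, one way to shrink a .fa file is to keep only the records of interest. The sketch below is an assumption about how that could be done, not a method recommended in this thread: `read_fasta`, `filter_fasta`, and `total_bases` are hypothetical helper names, and the set of wanted IDs would have to come from the project itself. Reads from any transcript that is dropped can no longer map correctly, so this kind of reduction introduces its own bias.

```python
from typing import Iterable, Iterator, Set, Tuple

Record = Tuple[str, str]  # (header without ">", sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA lines, joining wrapped sequence lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def filter_fasta(lines: Iterable[str], wanted_ids: Set[str]) -> Iterator[Record]:
    """Keep only records whose ID (first word of the header) is in wanted_ids."""
    for header, seq in read_fasta(lines):
        parts = header.split()
        if parts and parts[0] in wanted_ids:
            yield header, seq


def total_bases(records: Iterable[Record]) -> int:
    """Total sequence length, a rough proxy for how large the reference is."""
    return sum(len(seq) for _, seq in records)
```

Comparing `total_bases` before and after filtering gives a quick idea of how much smaller the reduced reference is.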


    • #3
      Originally posted by lanner
      I am feeling concerned because already, it has been ~2 hours since I submitted the "BWA for SOLiD" job to Galaxy, and it is still "waiting to run", whereas I have since run many other smaller jobs, and have never had to wait for my job to begin on Galaxy, except for a few minutes. Approximately how long would such a job take on Galaxy, given the size of the inputs? I just don't know what to expect, and am feeling concerned about time issues....

      Sorry for a long message. If you have any advice on any of the topics, I would be glad to hear them!
      I assume you are using the public Galaxy instance at PSU. Alignment jobs on public Galaxy go into a separate queue, and since there are many from around the world, they generally take longer (possibly up to 24 h). Do not delete and resubmit your job, because it will go to the end of the queue.

      I suggest calling it a day and checking back tomorrow morning. Should be done by then.


      • #4
        Thank you both for your input!

        Brian: "I encourage you to avoid Solid (colorspace) data and focus on base-space data, which is more accurate and has far more relevant tools."

        Are you suggesting this for after the BWA alignment? I found this advice on another site:

        "You should not convert colorspace to base space prior to aligning reads. The reason for this is that if there is an error in one of the color calls, it will effect all the downstream color calls. Instead, you should use an aligner that will do the assembly in color-space instead."

        GenoMax: Yes, I am just using my (free) account on https://usegalaxy.org/. Glad to know you think it might be done by tomorrow morning, or even done at all, given its memory constraints! It is still waiting.

        I am actually running a similar pipeline on four files in total. Should I submit them all to BWA alignment now, so they enter the queue early? Or will that cause a memory failure, or push my jobs later in the queue because I, as a single user, am requesting more memory? I am just wondering what the fastest way is to get them through the BWA step on Galaxy: serially or in parallel?

        Thanks again!


        • #5
          Originally posted by lanner
          Thank you both for your input!

          Brian: "I encourage you to avoid Solid (colorspace) data and focus on base-space data, which is more accurate and has far more relevant tools."

          Are you suggesting this for after the BWA alignment? I found this advice on another site:

          "You should not convert colorspace to base space prior to aligning reads. The reason for this is that if there is an error in one of the color calls, it will effect all the downstream color calls. Instead, you should use an aligner that will do the assembly in color-space instead."
          Well, that's accurate... but no, I am suggesting that you avoid colorspace from the start, and use reads from a different platform than Solid. Converting colorspace to base-space is highly subjective (I have written a colorspace to base-space converter) and introduces many biases. And yes, because of errors, reads cannot be usefully converted from colorspace to base-space before they are mapped; converting them after mapping, in turn, introduces a reference bias.
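The error propagation mentioned in the quoted advice can be illustrated with a small sketch. It assumes the standard two-base colour code, in which A, C, G, T are numbered 0 to 3 and each colour is the XOR of two adjacent bases. The function names are hypothetical, and for simplicity the first base of the sequence stands in for the known primer base that a real read starts from.

```python
from typing import List, Tuple

BASES = "ACGT"  # A=0, C=1, G=2, T=3


def encode(seq: str) -> Tuple[str, List[int]]:
    """Return the first base plus one colour (0-3) per adjacent base pair."""
    colors = [BASES.index(a) ^ BASES.index(b) for a, b in zip(seq, seq[1:])]
    return seq[0], colors


def decode(first: str, colors: List[int]) -> str:
    """Rebuild the bases; each base depends on the previous base and one colour."""
    out = [first]
    for color in colors:
        out.append(BASES[BASES.index(out[-1]) ^ color])
    return "".join(out)


seq = "ACGTACGT"
first, colors = encode(seq)      # colors == [1, 3, 1, 3, 1, 3, 1]

bad = list(colors)
bad[2] = 0                       # a single miscalled colour (was 1)

print(decode(first, colors))     # ACGTACGT
print(decode(first, bad))        # ACGGCATG
```

Because each decoded base feeds into the next one, the single wrong colour changes every base after it, not just one position. An aligner that works in colour space sees only one mismatching colour instead.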

          In my opinion, Solid was a poorly-implemented technology, and I believe the world would be better off if everyone pretended it never existed. It is obsolete, but even when it was still being marketed, rival technologies (Illumina, 454, Sanger) were vastly superior in terms of read lengths, error rates, and compatibility with software. As long as people keep publishing things based on Solid data, the signal-to-noise ratio of scientific literature will be adversely affected.


          • #6
            Brian: Thank you, that is really helpful for me to know. I will mention this when I present my project, especially to explain questionable results. And I will avoid using SOLiD if I ever perform this sort of analysis in a more serious capacity (for publication purposes)!


            • #7
              You're welcome!

              But please note that Solid can give useful results, and "I used Solid data" can't be used to explain results contradictory to what you were expecting. It's just very unreliable; so while publications have been made in the past using Solid data, I would not personally pass one today.


              • #8
                Brian: Okay, thanks for the clarification...
