  • Novoalign Alignment Time

    Hi,

    I am a new member & am new to working with Novoalign. I am trying to align paired end data & was wondering if anyone knows approximately how long this typically takes?

    My data is 25 million reads in each file, so total 50 million. It's a mosquito reference genome.

    I've been running one file for about one week now, but it is still running, so I was wondering if there might be a problem.

    The output SAM file is about 6 GB so far, and the alignment efficiency is not high for our data. If anyone knows why this might be happening as well, I would appreciate your answers.

    The computer I'm running novoalign on is a Mac server with 16 GB RAM & the processor is: 2 X 2.26 GHz Quad-Core Intel Xeon.

    Again, I'd appreciate any help or links to websites that might help me figure out more.

    Thanks, Kristen

  • #2
    Hi,

    Are you using the free version of Novoalign?

    I found that the free version of Novoalign was comparable to Maq for speed. The first alignment I tried ran for about 6 days, and hadn't quite finished after all that time, so your results seem about right.

    The licensed version supports multithreading, but I haven't tried it.

    Maria


    • #3
      Hi Kristen,

      I would definitely recommend that you use the unlocked version for better performance, as more features are available, including multithreading and quality calibration.

      Novoalign is designed to be sensitive and works best with good-quality data. I'm not sure what the quality of your data is, but it would help to know your average per-base quality along the read. If your data is of poor quality, it would be good to check it with a tool like FastQC and remove erroneous read pairs in a preprocessing step.

      If you would like to get more assistance and a license key please visit our website at www.novocraft.com and fill in a request form. We will be happy to assist you to obtain better performance.
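
For readers who want a quick look at average per-base quality without a separate tool, here is a minimal Python sketch. It assumes uncompressed FASTQ with exactly four lines per record and Phred+33 quality encoding; older Illumina pipelines used Phred+64, in which case pass `offset=64`. The function name is made up for illustration and is not part of Novoalign or FastQC.

```python
import io


def mean_quality_per_position(fastq_handle, offset=33):
    """Return the mean Phred quality at each read position.

    Assumes 4-line FASTQ records; `offset` is the ASCII offset of the
    quality encoding (33 assumed here, 64 for some older Illumina data).
    """
    totals = []  # sum of quality scores seen at each position
    counts = []  # number of reads that reach each position
    for i, line in enumerate(fastq_handle):
        if i % 4 != 3:  # the quality string is the 4th line of each record
            continue
        for pos, ch in enumerate(line.rstrip("\n")):
            if pos == len(totals):
                totals.append(0)
                counts.append(0)
            totals[pos] += ord(ch) - offset
            counts[pos] += 1
    return [t / c for t, c in zip(totals, counts)]


# Tiny in-memory example: 'I' encodes Q40 and '5' encodes Q20 under Phred+33.
demo = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACG\n+\n5I5\n")
print(mean_quality_per_position(demo))  # [30.0, 40.0, 30.0, 40.0]
```

A profile that drops sharply towards the 3' end would suggest trimming or filtering before alignment.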



      • #4
        Hi Kristen,

        It should be a lot faster than that, as the mosquito genome isn't that large. Several things can affect performance.
        First, get a trial license so that you have multi-threading; you can request one via the web site www.novocraft.com. But even without this, 5 days is way too long.
        Could you provide the commands that you used to build the index and to run novoalign? Things like setting the threshold too high may slow down the alignments. (If you're using SAM report format, just the output written to the stderr log will do; if Native format, a head of the report.)
        The alignment process may also slow down if the base calls have low qualities. You can filter out low-quality reads using the -l option (set to about 60% of read length) or, in the latest versions, by using the polyclonal filter (-p option).
        You might also use a Unix command like top to see what CPU utilisation Novoalign is getting. If it's running single-threaded it should be 100%; if it's lower than this, then maybe some other processes are competing for resources. If you're using the multi-threaded version, CPU utilisation should be up around 800%.

        Colin
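
The "about 60% of read length" rule for -l can be turned into a small helper. The sketch below is only an illustration: the function names are hypothetical, the 75 bp read length is an assumed example (the thread does not state the actual read length), and the file and index names are the ones from the command posted later in this thread.

```python
def min_good_bases(read_length, percent=60):
    """Value for -l: about `percent`% of the read length, rounded to nearest."""
    return (read_length * percent + 50) // 100


def novoalign_command(read_length, index="m_index",
                      reads=("1.fastq", "2.fastq")):
    """Build an argument list using only options mentioned in this thread."""
    return [
        "novoalign",
        "-o", "SAM",
        "-l", str(min_good_bases(read_length)),
        "-f", *reads,
        "-d", index,
    ]


# Assumed example: 75 bp reads give -l 45.
print(" ".join(novoalign_command(75)))
```

The resulting list could be passed to `subprocess.run` with stdout redirected to the SAM file; check the option against the Novoalign documentation for your version before relying on it.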


        • #5
          Yes, we're currently using the free version, version 2.06. We used FastQC to check the quality of our data. Our data is far above the 20 level. Doesn't this mean our quality should be high enough to work with novoalign?

          Here are the commands used to build the index: novoindex -k 14 -s 1 m_index 2L.fa 2R.fa 3L.fa 3R.fa UNKN.fa X.fa Y_unplaced.fa

          These are the command line arguments to run novoalign: novoalign -o SAM -f 1.fastq 2.fastq -d m_index > control.sam

          Also, is there a way to set novoalign to allow higher mismatches for the reads? We're currently using the default 2; is there a way to set it to 5 or 10?

          Thanks for all of your help & let me know if there's any other information you need. We are going to try it again after downloading the trial.
          Last edited by Kristen; 09-02-2010, 09:31 AM. Reason: Update


          • #6
            Hi Kristen,

            OK, for novoindex you should try without setting the -k and -s options; the default values will give much better performance. With a genome the size of mosquito, most of the 14-mers in the index will not exist in the genome, and this reduces the efficiency of the algorithm. The default k and s are chosen so that each index entry has 5-20 references to the genome, and this gives good efficiency.
            The 14/1 index will also be quite large, and you may have a problem with other processes competing for memory on your server, especially as you are running single-threaded.

            There was a problem in Novoindex on the Mac OS X version in choosing the default k and s. This was fixed in V2.06.00, so you should be OK, and the default should be around -k 12 -s 1.
            With a default index I expect it should take about 10 hrs on a single thread and around 90 mins with multi-threading.

            Colin
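
The "5-20 references per index entry" guideline can be checked with back-of-the-envelope arithmetic. The sketch below uses a deliberately simplified model (genome positions sampled every s bases, spread evenly over all 4^k possible k-mers) and an assumed genome length of 270 Mb, which is a placeholder and not a figure from this thread; substitute the real total length of your FASTA files. It is not how novoindex actually picks its defaults.

```python
def refs_per_index_entry(genome_bases, k, s=1):
    """Average genome positions per k-mer slot under a uniform model.

    genome_bases / s positions are indexed, and there are 4**k possible
    k-mers, so this is the mean number of references per index entry.
    """
    return genome_bases / (s * 4 ** k)


GENOME_BASES = 270_000_000  # assumed placeholder, not taken from the thread

for k in (14, 12):
    avg = refs_per_index_entry(GENOME_BASES, k, s=1)
    print(f"-k {k} -s 1: about {avg:.1f} references per entry")
# -k 14 -s 1: about 1.0 references per entry
# -k 12 -s 1: about 16.1 references per entry
```

Under these assumptions -k 14 leaves roughly one reference per entry, well below the 5-20 range, while -k 12 falls inside it, which is consistent with the default suggested above.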


            • #7
              Thanks for all of your help. We are trying your recommendations.

              Do you also happen to know if there is a way to allow more mismatches for the reads? Are there any parameters that we can use to set this, or is there a default already built in to the code? I am asking because the alignment efficiency is low; only about 10% of reads can be aligned as pairs. For the others, either neither end can be aligned, or one end aligns but the other can't.

              Thanks again for all of your help.


              • #8
                Kristen,

                That is interesting to know. I'm not sure of the source of your reads, but I am guessing that you are trying to align slightly divergent genomes. Novoalign was designed primarily for resequencing, and if there is very little similarity between your reads and the genome then it may not be the best tool to use. Could you provide more information about how these reads are related to the reference?

                Another possible reason the reads do not map well is the presence of an adaptor sequence that the aligner does not know about. Novoalign can do three-prime and five-prime adaptor trimming of the read before it matches it to the genome. Have you checked for some sort of contamination of the library?

                The threshold parameter "-t" controls specificity, and by default it is dynamic with an upper bound of 250. Setting it to 250 will allow more mismatches on the read but could also lead to more repeat or low-scoring alignments. Try the -r All option to get a sense of all the possible locations a short read will map to.





                • #9
                  Hi Kristen,

                  As Zee mentioned, Novoalign doesn't default to 2 mismatches; it defaults to an alignment score threshold that ranges from about 90 for a 32 bp read up to 250 for 75 bp reads.
                  That score threshold will allow up to 8 mismatches, or even more if they are at low-quality bases, so usually it isn't the cause of a 10% alignment rate. It is, however, probably the reason the alignments are slow. Novoalign uses an iterative approach where it tries to align with no mismatches and then gradually increases the mismatch allowance until an alignment is found or it reaches an upper limit that's around 8 mismatches, depending on read length and base qualities. If 90% aren't aligning, it means that at least 90% are getting to the final iteration, which is the slowest, and hence the long run time.
                  A couple of things you could try are:
                  1. Turn on quality calibration (just add option -k) and see if it increases the yield of alignments. It often helps if something went wrong with the sequencing run.
                  2. Turn on adapter trimming (-a). If by chance your fragments were shorter than the read length, you may have adapter on the reads. Trimming it off will improve alignment yield.

                  I also wonder if something is wrong with your reads such as contamination or just a really bad run of the sequencer. Could you send me 10k reads taken from around the middle of the read file. Email to colin at novocraft .... com

                  Thanks, Colin
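
Pulling 10k reads from around the middle of a read file can be done with a short script. This is a sketch under stated assumptions: uncompressed FASTQ with exactly four lines per record, and a hypothetical helper name `middle_records`. Because the start position depends only on the record count, running it on both files of a pair with the same `n_records` keeps mates in sync, provided the two files hold the same number of reads in the same order.

```python
from itertools import islice


def middle_records(fastq_path, out_path, n_records=10_000):
    """Copy `n_records` FASTQ records from the middle of a file.

    Assumes 4 lines per record. Returns the 0-based index of the first
    record copied.
    """
    # First pass: count records so the middle can be located.
    with open(fastq_path) as fh:
        total = sum(1 for _ in fh) // 4
    start = max(0, (total - n_records) // 2)
    # Second pass: copy the chosen slice of lines unchanged.
    with open(fastq_path) as fh, open(out_path, "w") as out:
        out.writelines(islice(fh, start * 4, (start + n_records) * 4))
    return start
```

For example, `middle_records("1.fastq", "1.mid.fastq")` followed by `middle_records("2.fastq", "2.mid.fastq")` would produce a matched pair of 10k-read samples; the output file names are placeholders.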


                  • #10
                    Bowtie

                    Hi,

                    Can anyone tell me how to input a sequence in Bowtie? I don't know what to type on the command line to input the sequence.

                    Akash


                    • #11
                      Novoalign Alignment Time

                      Hi Akash,

                      Have a look at the bowtie website, the Manual and the Getting Started sections are very helpful.



                      Best wishes,
                      Maria
