
  • #16
    Originally posted by RockChalkJayhawk View Post
    Why not just align each of them independently, then merge the BAMs?
    I am also currently trying to figure out the best way to go from multiple fastq files (produced by CASAVA 1.8) to alignment with Novoalign. I am not too familiar with the internal workings of novoalign - would aligning the fastq files separately then merging the BAMs give a different result than merging the FASTQs first then aligning? For example, does Novoalign choose the best hit of a multi-hit read based on other aligned reads?

    thanks!
    Justin



    • #17
      Originally posted by BAMseek View Post
      I am also currently trying to figure out the best way to go from multiple fastq files (produced by CASAVA 1.8) to alignment with Novoalign. I am not too familiar with the internal workings of novoalign - would aligning the fastq files separately then merging the BAMs give a different result than merging the FASTQs first then aligning? For example, does Novoalign choose the best hit of a multi-hit read based on other aligned reads?

      thanks!
      Justin
      Hi Justin,

      You can align multiple paired-end files in one run using named pipes:

      mkfifo read1_pipe read2_pipe
      # Run the writers in the background; each one blocks until novoalign opens its pipe.
      cat lane1_read1.fastq lane2_read1.fastq lane3_read1.fastq > read1_pipe &
      cat lane1_read2.fastq lane2_read2.fastq lane3_read2.fastq > read2_pipe &
      novoalign -f read1_pipe read2_pipe -F STDFQ ....
      rm read1_pipe read2_pipe

      You could also just cat all the fastq files into one real file rather than a named pipe, but that uses extra disk space.
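The named-pipe pattern can be tried without an aligner. The sketch below is a toy: the FASTQ records are made up, and `paste` stands in for novoalign as the program that reads both pipes. It only illustrates that the writers must run in the background and that the reader sees each pipe as one concatenated file.

```shell
# Toy demonstration: `paste` stands in for the aligner as the reader of both pipes.
d=$(mktemp -d)

# Two tiny made-up "lanes", one 4-line FASTQ record per file.
printf '@a/1\nACGT\n+\nIIII\n' > "$d/lane1_read1.fastq"
printf '@b/1\nTTGA\n+\nIIII\n' > "$d/lane2_read1.fastq"
printf '@a/2\nCCAT\n+\nIIII\n' > "$d/lane1_read2.fastq"
printf '@b/2\nGGTA\n+\nIIII\n' > "$d/lane2_read2.fastq"

mkfifo "$d/read1_pipe" "$d/read2_pipe"

# Writers go to the background: opening a pipe for writing blocks until a reader opens it.
cat "$d/lane1_read1.fastq" "$d/lane2_read1.fastq" > "$d/read1_pipe" &
cat "$d/lane1_read2.fastq" "$d/lane2_read2.fastq" > "$d/read2_pipe" &

# The reader sees each pipe as a single concatenated file, with mates still in step.
paste "$d/read1_pipe" "$d/read2_pipe" > "$d/paired.txt"

wait
rm "$d/read1_pipe" "$d/read2_pipe"
cat "$d/paired.txt"
```

Each output line holds the matching line from read 1 and read 2, so the two streams stay synchronised as long as the lane files are listed in the same order on both `cat` commands.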

      Usually we would run a separate novoalign for each file of reads, then sort each report and merge the files. This gives a slight advantage in that multiple sorts can run at the same time. When running samtools sort we usually write uncompressed BAM to save the CPU time that compression would take.
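As a rough sketch of the per-file workflow, the script below only prints the commands it would run, so the plan can be inspected before anything is executed. The file names (`ref.nix`, `lane1_read1.fastq`, `sample.bam`, and so on) are hypothetical, and the exact novoalign options will depend on your data; `-l 0` asks samtools sort for uncompressed output.

```shell
# Hypothetical names: ref.nix (index), laneN_read{1,2}.fastq (inputs), sample.bam (result).
# Nothing is executed here; the functions only print command lines.
plan_lane() {
    name=$1
    printf 'novoalign -d ref.nix -f %s_read1.fastq %s_read2.fastq -F STDFQ -o SAM' "$name" "$name"
    # -l 0 writes uncompressed BAM, saving compression CPU on the intermediate files.
    printf ' | samtools sort -l 0 -o %s.sorted.bam -\n' "$name"
}

plan_merge() {
    out=$1; shift
    printf 'samtools merge %s' "$out"
    for name in "$@"; do printf ' %s.sorted.bam' "$name"; done
    printf '\n'
}

lanes="lane1 lane2 lane3"
plan=$(
    for lane in $lanes; do plan_lane "$lane"; done
    plan_merge sample.bam $lanes
)
printf '%s\n' "$plan"
```

The three per-lane lines are independent of each other, so they can run at the same time; only the final merge has to wait for all of them.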

      The other question is whether the results will be different. Each read is aligned independently, even multi-hit reads. If you use -r Random, the reported alignment is chosen randomly from the alignments for that read only.

      However, it is possible that they will be very slightly different.

      The first issue, for paired-end reads, is that Novoalign adds a fragment length penalty to paired alignments, and this can affect which alignment is reported for some pairs. It's not often that this penalty changes the alignment location, but it can for multi-mapped reads. The initial fragment length penalties in Novoalign are calculated from a normal distribution using the mean and standard deviation entered on the -i option. However, as pairs are aligned, Novoalign builds a histogram of actual proper-pair lengths and then starts to recalculate penalties based on the actual distribution. So if the actual fragment length distribution is not normal, or the initial -i settings are not accurate, then the fragment length penalties can change. Basically, the first few thousand alignments in each run will have slightly different fragment length penalties from the later pairs. Setting -i accurately will minimise any differences. In any case, if you have millions of reads, the start-up effect is probably not noticeable. If you do merge read files, they should all be from the same sample prep and all have the same fragment length distribution.
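One way to get accurate numbers for -i is to measure them from an earlier alignment of the same library. The helper below is a sketch, not part of Novoalign: `frag_stats` and `toy_sam` are made-up names, and the toy records exist only so the example runs on its own. It reads SAM text and reports the mean and standard deviation of the template length (column 9) over properly paired records.

```shell
# frag_stats: read SAM on stdin, print "mean sd" of positive TLEN (column 9)
# for records whose FLAG has the proper-pair bit (0x2) set.
frag_stats() {
    awk -F '\t' '
        /^@/ { next }                       # skip header lines
        int($2 / 2) % 2 == 1 && $9 > 0 {    # proper pair, leftmost mate only
            n++; sum += $9; sumsq += $9 * $9
        }
        END {
            if (n == 0) { print "no proper pairs"; exit 1 }
            mean = sum / n
            var = sumsq / n - mean * mean
            if (var < 0) var = 0
            printf "%.0f %.0f\n", mean, sqrt(var)
        }'
}

# Made-up records; only FLAG (column 2) and TLEN (column 9) matter here.
toy_sam() {
    printf '@HD\tVN:1.6\n'
    printf 'p1\t99\tchr1\t100\t60\t*\t=\t290\t240\t*\t*\n'
    printf 'p1\t147\tchr1\t290\t60\t*\t=\t100\t-240\t*\t*\n'
    printf 'p2\t99\tchr1\t500\t60\t*\t=\t700\t250\t*\t*\n'
    printf 'p3\t99\tchr1\t900\t60\t*\t=\t1110\t260\t*\t*\n'
    printf 'p4\t65\tchr1\t50\t60\t*\tchr2\t80\t0\t*\t*\n'
}

toy_sam | frag_stats    # prints: 250 8
```

On real data the input would come from something like `samtools view trial.bam | frag_stats`, where `trial.bam` is a hypothetical earlier alignment, and the two numbers printed would be the mean and standard deviation to supply to -i.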

      The second issue would be if you used quality calibration, the -k option. This also starts off on the assumption that qualities are accurate and then slowly adjusts as more reads are aligned, so the first few thousand reads in each run will not have the benefit of quality calibration. (You can avoid this by doing a trial run to collect calibration data.) But calibration is also a good reason to run a separate Novoalign for each lane of reads, as each lane should be calibrated separately. I've seen cases where individual lanes had very different calibration profiles.
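A two-pass, per-lane calibration run might look like the sketch below, which only prints the commands. It assumes that `-K file` writes the mismatch counts collected during a run and that `-k file` reads them back, which is how I understand the Novoalign options; please check the manual for your version. All file names are hypothetical.

```shell
# Hypothetical file names throughout; the function only prints command lines.
plan_calibrated_lane() {
    name=$1
    reads="-f ${name}_read1.fastq ${name}_read2.fastq -F STDFQ"
    # Pass 1 (trial): -K collects mismatch counts into a per-lane file; alignments are discarded.
    printf 'novoalign -d ref.nix %s -K %s.calib -o SAM > /dev/null\n' "$reads" "$name"
    # Pass 2 (real): -k with a file name starts from that lane's stored calibration data.
    printf 'novoalign -d ref.nix %s -k %s.calib -o SAM > %s.sam\n' "$reads" "$name" "$name"
}

for lane in lane1 lane2; do
    plan_calibrated_lane "$lane"
done
```

Keeping one calibration file per lane follows the point above that lanes can have very different calibration profiles.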

      So it's possible to process multiple files in this way but I don't see a real benefit.

      Kind Regards, Colin



      • #18
        Hi Colin,

        Thank you for the very helpful information. I was able to get the named pipes up and running, and I think that solution will work for us or at least be the easiest thing to place into the pipeline with minimum changes. I'll also see if aligning them individually provides some performance gains since we are running the alignments on a cluster. Writing the pre-merge files to an uncompressed BAM is also a nice idea. Thanks for the detailed explanations of the differences we may see since we would need to be able to explain the differences if we switch to aligning one big file or multiple small files.

        Thanks for all the work!
        Justin



        • #19
          Originally posted by sparks View Post
          So it's possible to process multiple files in this way but I don't see a real benefit.
          If you have a clustered environment, aligning the files separately lets you spread the processes across the cluster. It is true that novoalign supports this with MPI; nevertheless, I still prefer separate processes in a Map/Reduce-like fashion.
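The shape of that Map/Reduce-like approach can be shown with a toy example. Here `sort` stands in for the per-lane align-and-sort step and `sort -m` for the final BAM merge; the input files are made up, and background jobs on one machine stand in for jobs submitted to a cluster.

```shell
# Toy map/reduce: `sort` stands in for per-lane align+sort, `sort -m` for the merge.
workdir=$(mktemp -d)
printf 'chr1:300\nchr1:100\n' > "$workdir/lane1.txt"
printf 'chr1:200\nchr1:050\n' > "$workdir/lane2.txt"
printf 'chr1:250\nchr1:150\n' > "$workdir/lane3.txt"

process_lane() {
    # Map step: each lane is handled by its own process.
    LC_ALL=C sort "$1" > "${1%.txt}.sorted"
}

for lane in "$workdir"/lane*.txt; do
    process_lane "$lane" &    # on a cluster this would be one submitted job per lane
done
wait                          # block until every lane has finished

# Reduce step: merge the already-sorted per-lane outputs into one sorted file.
LC_ALL=C sort -m "$workdir"/lane*.sorted > "$workdir/merged.txt"
cat "$workdir/merged.txt"
```

The merge only interleaves inputs that are already sorted, which is why the expensive work can be spread over as many processes as there are lanes.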

          d
