  • #16
    Originally posted by RockChalkJayhawk
    Why not just align each of them independently, then merge the BAMs?
    I am also currently trying to figure out the best way to go from multiple fastq files (produced by CASAVA 1.8) to alignment with Novoalign. I am not too familiar with the internal workings of novoalign - would aligning the fastq files separately then merging the BAMs give a different result than merging the FASTQs first then aligning? For example, does Novoalign choose the best hit of a multi-hit read based on other aligned reads?

    thanks!
    Justin


    • #17
      Originally posted by BAMseek
      I am also currently trying to figure out the best way to go from multiple fastq files (produced by CASAVA 1.8) to alignment with Novoalign. I am not too familiar with the internal workings of novoalign - would aligning the fastq files separately then merging the BAMs give a different result than merging the FASTQs first then aligning? For example, does Novoalign choose the best hit of a multi-hit read based on other aligned reads?

      thanks!
      Justin

      Hi Justin,

      You can feed multiple paired-end files to Novoalign through named pipes:

      mkfifo read1_pipe read2_pipe
      # Run the writers in the background; each cat blocks until novoalign opens its pipe
      cat lane1_read1.fastq lane2_read1.fastq lane3_read1.fastq > read1_pipe &
      cat lane1_read2.fastq lane2_read2.fastq lane3_read2.fastq > read2_pipe &
      novoalign -f read1_pipe read2_pipe -F STDFQ ....
      rm read1_pipe read2_pipe

      You could also just cat all the fastq files into one real file rather than a named pipe, but that uses extra disk space.
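The named-pipe pattern can be tried without Novoalign installed. The following is a minimal sketch with toy file names, using `wc -l` as a stand-in for the aligner (both are assumptions for illustration only). The writer is started in the background because opening a FIFO for writing blocks until a reader opens it.

```shell
# Toy demonstration of the named-pipe pattern; wc -l stands in for novoalign.
# All file names here are made up for illustration.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two single-record FASTQ files playing the role of two lanes
printf '@r1\nACGT\n+\nIIII\n' > lane1_read1.fastq
printf '@r2\nTTGA\n+\nIIII\n' > lane2_read1.fastq

mkfifo read1_pipe
# The writer runs in the background: opening a FIFO for writing
# blocks until some process opens it for reading.
cat lane1_read1.fastq lane2_read1.fastq > read1_pipe &

# The reader sees one continuous stream, as if the lanes were a single file
line_count=$(wc -l < read1_pipe | tr -d ' ')
wait    # reap the background cat
rm read1_pipe

echo "lines read through the pipe: $line_count"
```

No concatenated copy of the reads is ever written to disk; the data only passes through the pipe.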

      Usually we would run one novoalign per file of reads, then sort each report and merge the files. This gives a slight advantage in that multiple sorts can run at the same time. When running samtools sort we usually write uncompressed BAM to save the CPU time that compression would cost.
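The per-file workflow could be scripted along these lines. This is a sketch rather than a tested pipeline: `genome.nix`, the lane file names and `align_jobs.sh` are hypothetical, and the `samtools sort -l 0 -o` and `samtools merge` syntax assumes samtools 1.x. The script only writes the commands to a job file, one alignment per line, so that they can be run in parallel or handed to a cluster scheduler.

```shell
# Write (not run) one align-and-sort command per lane, then a final merge.
# genome.nix, the FASTQ names and align_jobs.sh are hypothetical placeholders.
index=genome.nix
jobs_file=align_jobs.sh
merge_inputs=""

: > "$jobs_file"    # start with an empty job file
for lane in lane1 lane2 lane3; do
    # -l 0 makes samtools sort write uncompressed BAM (samtools 1.x syntax)
    echo "novoalign -d $index -f ${lane}_read1.fastq ${lane}_read2.fastq -F STDFQ -o SAM | samtools sort -l 0 -o ${lane}.sorted.bam -" >> "$jobs_file"
    merge_inputs="$merge_inputs ${lane}.sorted.bam"
done

# The merge needs every sorted file, so it goes last
echo "samtools merge merged.bam$merge_inputs" >> "$jobs_file"

cat "$jobs_file"
```

The first three lines of the job file are independent of each other; only the final merge has to wait for all of them.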

      The other question is whether the results will be different. Each read is aligned independently, even multi-hit reads; if you use -r Random, the alignment is chosen randomly from the alignments for that read only.

      However, it is possible that they will be very slightly different.

      The first issue, for paired-end reads, is that Novoalign adds a fragment length penalty to paired alignments, and this can affect which alignment is reported for some pairs. It's not often that this penalty changes the alignment location, but it can for multi-mapped reads. The initial fragment length penalties in Novoalign are calculated from a normal distribution using the mean and standard deviation entered on the -i option. However, as pairs are aligned, Novoalign builds a histogram of actual proper-pair lengths and then starts to recalculate penalties based on the actual distribution. So if the actual fragment length distribution is not normal, or the initial -i settings are not accurate, the fragment length penalties can change.

      Basically, the first few thousand alignments in each run will have slightly different fragment length penalties from the later pairs. Setting -i accurately will minimise any differences. In any case, if you have millions of reads the start-up effect is probably not noticeable. If you do merge read files, they should all be from the same sample prep and all have the same fragment length distribution.
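One way to obtain accurate values for -i is to measure the fragment lengths from a small trial alignment. The sketch below is an illustration, not part of Novoalign: it assumes standard SAM columns (FLAG in column 2, TLEN in column 9) and uses five toy records in place of real output. It keeps proper pairs only and counts each pair once, through the mate with positive TLEN.

```shell
# Estimate fragment-length mean and SD from proper pairs in a SAM file.
# The five records below are toy data standing in for a real trial alignment.
sam=$(mktemp)
{
    printf 'p1\t99\tchr1\t100\t60\t4M\t=\t296\t200\tACGT\tIIII\n'
    printf 'p1\t147\tchr1\t296\t60\t4M\t=\t100\t-200\tACGT\tIIII\n'
    printf 'p2\t99\tchr1\t500\t60\t4M\t=\t796\t300\tACGT\tIIII\n'
    printf 'p2\t147\tchr1\t796\t60\t4M\t=\t500\t-300\tACGT\tIIII\n'
    printf 'p3\t65\tchr1\t900\t60\t4M\t=\t5896\t5000\tACGT\tIIII\n'
} > "$sam"

# Column 2 is FLAG (bit 0x2 = proper pair), column 9 is TLEN.
# Keeping only TLEN > 0 counts each pair once, via its leftmost mate.
stats=$(awk -F'\t' '
    !/^@/ && int($2 / 2) % 2 == 1 && $9 > 0 { n++; sum += $9; sumsq += $9 * $9 }
    END {
        if (n > 1) {
            mean = sum / n
            sd = sqrt((sumsq - n * mean * mean) / (n - 1))
            printf "%d %.1f %.1f", n, mean, sd
        }
    }' "$sam")
rm "$sam"

echo "pairs mean sd: $stats"
```

The improper pair (p3) is excluded, so a few chimeric or mis-mapped pairs do not inflate the estimate. The resulting mean and standard deviation are the two values the -i option expects.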

      The second issue arises if you use quality calibration, the -k option. This also starts off on the assumption that qualities are accurate and then slowly adjusts as more reads are aligned, so the first few thousand reads in each run will not have the benefit of quality calibration. (You can avoid this by doing a trial run to collect calibration data.) But calibration is also a good reason to run a separate Novoalign for each lane of reads, as each lane should be calibrated separately; I've seen cases where individual lanes had very different calibration profiles.

      So it's possible to process multiple files in this way but I don't see a real benefit.

      Kind Regards, Colin


      • #18
        Hi Colin,

        Thank you for the very helpful information. I was able to get the named pipes up and running, and I think that solution will work for us or at least be the easiest thing to place into the pipeline with minimum changes. I'll also see if aligning them individually provides some performance gains since we are running the alignments on a cluster. Writing the pre-merge files to an uncompressed BAM is also a nice idea. Thanks for the detailed explanations of the differences we may see since we would need to be able to explain the differences if we switch to aligning one big file or multiple small files.

        Thanks for all the work!
        Justin


        • #19
          Originally posted by sparks
          So it's possible to process multiple files in this way but I don't see a real benefit.

          If you have a clustered environment, aligning separate files separately allows you to spread the work across the cluster. It is true that Novoalign supports this with MPI; nevertheless, I still prefer separate processes in a Map/Reduce-like fashion.

          d
