  • epi
    Member
    • Jan 2012
    • 38

    aligning multiple fastq for the same sample

    Hi everyone, I am trying to align FASTQ files against a reference with bowtie. The input data contains the same sample run in multiple lanes (2, 3, or 4), as is commonly the case when the expected depth could not be reached with a single lane.

    What is the best strategy for aligning these: align the lanes one by one, or merge the FASTQ files first? The context is ChIP-Seq.

    Thanks for any response.
  • sdriscoll
    I like code
    • Sep 2009
    • 436

    #2
    Since alignments are alignments, you could align them separately, output SAM files, and then use samtools to merge and sort the alignments.
    /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
    Salk Institute for Biological Studies, La Jolla, CA, USA */
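
A minimal sketch of the align-separately-then-merge workflow described above. It only builds the command lines and does not run them. The index name, file names, and thread count are hypothetical, and the samtools syntax assumes a recent 1.x release (older releases used a different `sort` syntax). Each lane is sorted before merging because `samtools merge` expects sorted inputs.

```python
from pathlib import Path


def per_lane_commands(index, lane_fastqs, merged_bam, threads=4):
    """Build (not run) the commands for aligning each lane, then merging."""
    commands = []
    sorted_bams = []
    for fastq in lane_fastqs:
        stem = Path(fastq).name.split(".")[0]
        sam = f"{stem}.sam"
        bam = f"{stem}.sorted.bam"
        # bowtie 1: -S writes SAM output, -p sets the number of threads.
        commands.append(["bowtie", "-S", "-p", str(threads), index, fastq, sam])
        # Sort each lane so the merge step receives sorted inputs.
        commands.append(["samtools", "sort", "-o", bam, sam])
        sorted_bams.append(bam)
    # Combine the sorted per-lane files into one BAM for the sample.
    commands.append(["samtools", "merge", merged_bam, *sorted_bams])
    return commands


if __name__ == "__main__":
    # "hg19" is a placeholder index name, not taken from the thread.
    for cmd in per_lane_commands("hg19", ["lane1.fastq", "lane2.fastq"], "sample.bam"):
        print(" ".join(cmd))
```

Each returned list could be passed to `subprocess.run` once the paths are adjusted to real files.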

    • epi
      Member
      • Jan 2012
      • 38

      #3
      Thanks for the response. But won't the alignments differ depending on whether the lanes are aligned together or in isolation, e.g. for unique matches?

      • dpryan
        Devon Ryan
        • Jul 2011
        • 3478

        #4
        The individual alignments will be the same regardless. Depending on how you made your library, it might make sense to align the lanes separately (for accurate PCR duplicate calling, which is presumably what you meant by "unique match"). Beyond that, the only difference is the number of keystrokes required.

        • epi
          Member
          • Jan 2012
          • 38

          #5
          Your point is well taken. But there is another situation in addition, which is when you want only reads matching to genome at one place, not various. If you align in batches, you cannot have this done accurately.

          • Alex Renwick
            Member
            • Jul 2011
            • 44

            #6
            Originally posted by epi View Post
            Your point is well taken. But there is another situation in addition, which is when you want only reads matching to genome at one place, not various. If you align in batches, you cannot have this done accurately.
            Could you explain more what you mean by this? Typically, each read is aligned independently of others, then the results are merged for subsequent analysis.

            • dpryan
              Devon Ryan
              • Jul 2011
              • 3478

              #7
              Originally posted by epi View Post
              Your point is well taken. But there is another situation in addition, which is when you want only reads matching to genome at one place, not various. If you align in batches, you cannot have this done accurately.
              This is a bit ambiguous in English. "reads matching to genome at one place" can either mean "uniquely mapped reads" (most likely you mean this) or "reads mapping only to a specific region of the genome" (presumably you don't mean that). In neither case will the results differ depending on whether you invoke bowtie once or multiple times. I recall there being auxiliary flags that indicate multiple alignments of which only one was returned and/or a flag to just not return those (something like -m in bowtie1, haven't used it in a while though).

              As you quoted Simon Andrews as saying in another thread, "For straight forward alignments (Bowtie, BWA etc) then the two operations would be the same".
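
A toy illustration of why batching does not change the result: each read is evaluated on its own against the reference, so keeping only uniquely matching reads gives the same answer whether the lanes are processed separately or together. The exact-substring matcher, the reference string, and the read names are made up for this sketch. It is not bowtie's algorithm and is only loosely analogous to filtering out multiply aligned reads.

```python
def find_hits(read, reference):
    """Return every 0-based position where `read` matches `reference` exactly."""
    hits = []
    start = reference.find(read)
    while start != -1:
        hits.append(start)
        start = reference.find(read, start + 1)
    return hits


def align_unique(reads, reference):
    """Keep reads with exactly one hit; no read depends on any other read."""
    result = []
    for name, seq in reads:
        hits = find_hits(seq, reference)
        if len(hits) == 1:
            result.append((name, hits[0]))
    return result


reference = "ACGTACGTTTGACCA"
lane1 = [("r1", "TTGACC"), ("r2", "ACGT")]  # r2 occurs twice, so it is dropped
lane2 = [("r3", "GACCA"), ("r4", "TTGACC")]

# Align the lanes one by one and concatenate, or align the merged reads.
separately = align_unique(lane1, reference) + align_unique(lane2, reference)
together = align_unique(lane1 + lane2, reference)
print(separately == together)
```

Note that r1 and r4 land on the same position. Both are still unique alignments, because uniqueness here is a property of a single read, not of the locus.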

              • sdriscoll
                I like code
                • Sep 2009
                • 436

                #8
                Originally posted by epi View Post
                Your point is well taken. But there is another situation in addition, which is when you want only reads matching to genome at one place, not various. If you align in batches, you cannot have this done accurately.
                You might be thinking of this backwards. Each of your millions of reads is unique, but they could in fact all align to the same genomic region. What is meant by unique alignments in RNA-Seq is that each read can only align in one spot. What you WANT is for reads to align on top of one another... that's how we are able to measure gene expression and do anything, really.

                Just align with bowtie using the -m 1 -k 1 options. That will produce unique alignments per read.

                • epi
                  Member
                  • Jan 2012
                  • 38

                  #9
                  Nice to see the discussion. I guess how much of an issue PCR duplicates are depends on the individual experiment. Wouldn't it be good practice to always merge the FASTQ files before aligning, to remove any possible bias?

                  • Heisman
                    Senior Member
                    • Dec 2010
                    • 534

                    #10
                    Originally posted by epi View Post
                    Nice to see the discussion. I guess how much of an issue PCR duplicates are depends on the individual experiment. Wouldn't it be good practice to always merge the FASTQ files before aligning, to remove any possible bias?
                    If you want to remove PCR duplicates, then you should merge all data before removing PCR duplicates if all of the data comes from the same prepped library. If the data comes from different prepped libraries, you should merge after removing the duplicates.
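
A small sketch of why the order matters for duplicate removal. It assumes a deliberately simplified definition of a duplicate (same chromosome, position, and strand); real duplicate-marking tools consider more than this. The read names and coordinates are invented for illustration.

```python
def remove_duplicates(alignments):
    """Keep the first read seen at each (chrom, pos, strand)."""
    seen = set()
    kept = []
    for name, chrom, pos, strand in alignments:
        key = (chrom, pos, strand)
        if key not in seen:
            seen.add(key)
            kept.append((name, chrom, pos, strand))
    return kept


lane1 = [("a1", "chr1", 100, "+"), ("a2", "chr1", 100, "+"), ("a3", "chr1", 250, "-")]
lane2 = [("b1", "chr1", 100, "+"), ("b2", "chr2", 40, "+")]

# Same library on several lanes: merge first, so cross-lane copies are caught.
merged_then_dedup = remove_duplicates(lane1 + lane2)

# Removing duplicates per lane and merging afterwards keeps b1, which sits at
# the same position as a1. That is the intended outcome when the two inputs
# come from different libraries, but not when they share one library.
dedup_then_merged = remove_duplicates(lane1) + remove_duplicates(lane2)

print(len(merged_then_dedup), len(dedup_then_merged))
```

Either way, the merge in question happens at the alignment level. Nothing in this sketch depends on whether the FASTQ files were aligned together or lane by lane.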

                    • epi
                      Member
                      • Jan 2012
                      • 38

                      #11
                      Originally posted by Heisman View Post
                      If you want to remove PCR duplicates, then you should merge all data before removing PCR duplicates if all of the data comes from the same prepped library. If the data comes from different prepped libraries, you should merge after removing the duplicates.
                      In other words, if library corresponds to sample, which I believe is the case with the data I have, the same sample run in multiple lanes should be merged and then aligned.
                      This clarifies a lot. I have heard some opinions from bioinformaticians that this is immaterial. In fact, even breaking the FASTQ files down further into smaller pieces (for whatever reason) should not matter for alignment.

                      • Alex Renwick
                        Member
                        • Jul 2011
                        • 44

                        #12
                        Originally posted by epi View Post
                        In other words, if library corresponds to sample, which I believe is the case with the data I have, the same sample run in multiple lanes should be merged and then aligned.
                        This clarifies a lot. I have heard some opinions from bioinformaticians that this is immaterial. In fact, even breaking the FASTQ files down further into smaller pieces (for whatever reason) should not matter for alignment.
                        Heisman points out that if you have different samples you should align first, remove duplicates, then merge. You conclude that since you have just one sample, you need to merge first and then align. That conclusion does not logically follow. The fallacy is common enough to have its own name: Denial of the Antecedent.

                        It really sounds like you had your mind made up before coming here with your question. Everyone who responded has told you that it doesn't matter whether you align then merge or vice versa. You don't have to believe them, but if someone takes the time to offer guidance you should at least do them the courtesy of plainly stating the basis of your disagreement.

                        • rnaseek
                          Member
                          • Nov 2011
                          • 22

                          #13
                          I think it is better to do the alignment individually. This will help check for lane-specific biases, if there are any. In addition, aligning individually allows the alignments to run in parallel.
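
One simple per-lane check, sketched under the assumption that each lane was aligned to its own SAM file: compare the fraction of mapped reads across lanes. The function relies only on the SAM FLAG field, where bit 0x4 marks an unmapped read. The lane names and records below are invented for illustration.

```python
def mapped_fraction(sam_lines):
    """Fraction of SAM records whose FLAG lacks the 0x4 (unmapped) bit."""
    total = mapped = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        flag = int(line.split("\t")[1])
        total += 1
        if not flag & 0x4:
            mapped += 1
    return mapped / total if total else 0.0


# Tiny hand-written SAM records standing in for real per-lane output.
lane_sams = {
    "lane1": [
        "@HD\tVN:1.0",
        "r1\t0\tchr1\t100\t255\t6M\t*\t0\t0\tACGTAC\tIIIIII",
        "r2\t4\t*\t0\t0\t*\t*\t0\t0\tTTTTTT\tIIIIII",
    ],
    "lane2": [
        "r3\t16\tchr1\t300\t255\t6M\t*\t0\t0\tGGATCC\tIIIIII",
        "r4\t0\tchr2\t50\t255\t6M\t*\t0\t0\tCCATGG\tIIIIII",
    ],
}

rates = {lane: mapped_fraction(lines) for lane, lines in lane_sams.items()}
print(rates)
```

A lane whose mapped fraction differs markedly from the others would be worth inspecting before the lanes are merged.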

                          • analyst
                            Member
                            • Jan 2011
                            • 18

                            #14
                            When using splice aligners for RNA-Seq, you must merge and then align, for obvious reasons. For regular aligners (bowtie etc.) I still merge first, remove PCR duplicates, and then align. As far as speed goes, it does not bother me, as it takes only a few minutes to align anyway. Also, using a parallelized tool such as bowtie, I would rather dedicate all available nodes to the merged lanes than split them between 2 individual lanes running simultaneously. After all, you have to merge them at some stage anyway for the actual analysis, so file management can be cleaner if you do it right from the beginning. I see from the comments that people do it the other way as well; I guess it's just my preference for the analysis. I also do not understand Alex's comments; epi's interpretation of Heisman's response seems fine.
                            Logically, it should not matter, as long as you can take care of PCR duplicates at some stage in your pipeline. But practically, I have had some strange experiences with combinations of publicly available tools and their behavior. I will have to do a complete analysis myself to believe whether splitting causes any real issue or not. If anyone has already done so, please share here. With all due respect, I am sticking to my approach till then.
                            Last edited by analyst; 05-08-2012, 08:13 AM.
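
For the merge-first route, a minimal sketch: uncompressed FASTQ files from the same library can be merged by plain concatenation. The file names are hypothetical, and the sketch assumes every input file ends with a newline so that records do not run together.

```python
import shutil
import tempfile
from pathlib import Path


def merge_fastq(lane_paths, merged_path):
    """Concatenate lane FASTQ files, in the order given, into one file."""
    with open(merged_path, "wb") as out:
        for path in lane_paths:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)
    return merged_path


# Self-contained demonstration with two one-read lanes in a temp directory.
workdir = Path(tempfile.mkdtemp())
lane1 = workdir / "lane1.fastq"
lane2 = workdir / "lane2.fastq"
lane1.write_text("@r1\nACGT\n+\nIIII\n")
lane2.write_text("@r2\nTTGA\n+\nIIII\n")

merged = merge_fastq([lane1, lane2], workdir / "sample.fastq")
# Each FASTQ record spans four lines.
n_reads = len(Path(merged).read_text().splitlines()) // 4
print(n_reads)
```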

                            • epi
                              Member
                              • Jan 2012
                              • 38

                              #15
                              Thanks for commenting, analyst. I just don't care about responses like Alex's. Unfortunately, he is not the only person in public forums and in the scientific world who likes to get personal in scientific discussion. Basically, it seems they try to push their own agenda and preferences onto others without even understanding what is being discussed, as in this case. Maybe he is a big advocate of one particular strategy and feels insecure if someone even mentions any other. Or maybe he is just looking for places to use the phrase of the day he learnt; this tendency is even more common and has its own name: talking through the hat. Unlike his example, this even fits.
                              But overall this is an excellent forum with a good collection of people and experts. Actually, I am not familiar with the steps upstream of NGS data generation, like sample and library prep, so I feel I am more educated after these discussions. Some people state their opinion and some even give the reasons behind it; both are useful.
