  • aligning multiple fastq for the same sample

    Hi everyone, I am trying to align FASTQ files against a reference with bowtie. The input data contains the same sample run in multiple lanes (2, 3 or 4), as is commonly the case when the expected depth could not be reached with a single lane.

    What would be the best strategy to align these: align them one by one, or merge the FASTQ files first? The context is ChIP-Seq.

    Thanks for any response.

  • #2
    Since alignments are alignments, you could align them separately, output SAM files, and then use samtools to merge and sort the alignments.
    /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
    Salk Institute for Biological Studies, La Jolla, CA, USA */
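The per-lane workflow described in #2 can be sketched as a small helper that only builds the command lines, without running them. This is a minimal sketch under assumptions: the index name `hg19` and the file names are hypothetical, bowtie 1 is assumed to accept the index, reads and output as positional arguments (newer releases may expect the index via `-x`, so check `bowtie --help` for your version), and a recent samtools with `sort -o` is assumed. Each lane is sorted before merging because `samtools merge` expects sorted inputs.

```python
from typing import List


def per_lane_commands(index: str, lane_fastqs: List[str], merged_bam: str) -> List[List[str]]:
    """Build (but do not run) commands to align each lane, sort it, then merge."""
    commands: List[List[str]] = []
    sorted_bams: List[str] = []
    for fastq in lane_fastqs:
        stem = fastq.rsplit(".", 1)[0]      # "lane1.fastq" -> "lane1"
        sam = stem + ".sam"
        bam = stem + ".sorted.bam"
        # bowtie 1: -S requests SAM output (positional index/reads/output assumed)
        commands.append(["bowtie", "-S", index, fastq, sam])
        # coordinate-sort each lane so the merge step receives sorted inputs
        commands.append(["samtools", "sort", "-o", bam, sam])
        sorted_bams.append(bam)
    # a single merge over all sorted per-lane BAM files
    commands.append(["samtools", "merge", merged_bam] + sorted_bams)
    return commands


# Hypothetical index and file names, for illustration only.
for cmd in per_lane_commands("hg19", ["lane1.fastq", "lane2.fastq"], "sample.bam"):
    print(" ".join(cmd))
```

Printing the commands first makes it easy to inspect them before handing them to a shell or a job scheduler.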


    • #3
      Thanks for the response. But won't the alignment be different when the lanes are aligned together rather than in isolation, e.g. for unique matches?


      • #4
        The individual alignments will be the same regardless. Depending on how you made your library, it might make sense to align the lanes separately (for accurate PCR duplicate calling, which is presumably what you meant by "unique match"). Beyond that, the only difference is the number of keystrokes required.


        • #5
          Your point is well taken. But there is another situation in addition, which is when you want only reads matching to genome at one place, not several. If you align in batches, you cannot have this done accurately.


          • #6
            Originally posted by epi
            Your point is well taken. But there is another situation in addition, which is when you want only reads matching to genome at one place, not several. If you align in batches, you cannot have this done accurately.
            Could you explain more what you mean by this? Typically, each read is aligned independently of others, then the results are merged for subsequent analysis.


            • #7
              Originally posted by epi
              Your point is well taken. But there is another situation in addition, which is when you want only reads matching to genome at one place, not several. If you align in batches, you cannot have this done accurately.
              This is a bit ambiguous in English. "reads matching to genome at one place" can either mean "uniquely mapped reads" (most likely you mean this) or "reads mapping only to a specific region of the genome" (presumably you don't mean that). In neither case will the results differ depending on whether you invoke bowtie once or multiple times. I recall there being auxiliary flags that indicate multiple alignments of which only one was returned and/or a flag to just not return those (something like -m in bowtie1, haven't used it in a while though).

              As you quoted Simon Andrews as saying in another thread, "For straight forward alignments (Bowtie, BWA etc) then the two operations would be the same".
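The argument in #6 and #7, that uniqueness is decided per read and so cannot depend on batching, can be illustrated with a toy example. This is only a sketch: the reference string, the reads and the exact-match "aligner" are made up for illustration and are not how bowtie works internally. The point is just that the keep/discard decision for a read looks only at that read and the reference.

```python
from typing import List

REFERENCE = "ACGTACGTTTGACCA"   # toy genome, made up for illustration


def count_hits(read: str, reference: str = REFERENCE) -> int:
    """Count exact (possibly overlapping) occurrences of the read in the reference."""
    return sum(
        1
        for i in range(len(reference) - len(read) + 1)
        if reference.startswith(read, i)
    )


def unique_reads(reads: List[str]) -> List[str]:
    """Keep reads with exactly one hit; each decision uses only that read."""
    return [r for r in reads if count_hits(r) == 1]


lane1 = ["ACGT", "TTGA"]   # "ACGT" occurs twice in the toy genome
lane2 = ["GACC", "CGTA"]

merged_first = unique_reads(lane1 + lane2)
aligned_separately = unique_reads(lane1) + unique_reads(lane2)

# Both strategies keep the same reads in the same order.
print(merged_first == aligned_separately)
```

Because no read's outcome depends on any other read, splitting the input into lanes (or smaller chunks) leaves the set of uniquely mapped reads unchanged in this model.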


              • #8
                Originally posted by epi
                Your point is well taken. But there is another situation in addition, which is when you want only reads matching to genome at one place, not several. If you align in batches, you cannot have this done accurately.
                You might be thinking of this backwards. Each read, of which you have millions, is unique, but they could in fact all align to the same genomic region. What is meant by unique alignments in RNA-Seq is that each read can only align in one spot. What you WANT is for reads to align on top of one another... that's how we are able to measure gene expression and do anything, really.

                Just align with Bowtie using the -m 1 -k 1 options. That will produce unique alignments per read.
                /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                Salk Institute for Biological Studies, La Jolla, CA, USA */


                • #9
                  Nice to see the discussion. I guess it depends on the individual experiment how much of an issue PCR duplicates might be. Wouldn't it be good practice to always merge the FASTQ files before aligning, to remove any possible bias?


                  • #10
                    Originally posted by epi
                    Nice to see the discussion. I guess it depends on the individual experiment how much of an issue PCR duplicates might be. Wouldn't it be good practice to always merge the FASTQ files before aligning, to remove any possible bias?
                    If you want to remove PCR duplicates, then you should merge all data before removing PCR duplicates if all of the data comes from the same prepped library. If the data comes from different prepped libraries, you should merge after removing the duplicates.
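Why the order matters for a single library can be shown with a toy position-based duplicate filter. This is a simplified sketch, not the algorithm of any particular tool: alignments are reduced to hypothetical `(chromosome, start, strand)` tuples, and two alignments count as duplicates when those tuples are identical. Duplicates of the same fragment that landed in different lanes are only caught when the lanes are merged before the filter runs.

```python
from typing import List, Tuple

Alignment = Tuple[str, int, str]   # (chromosome, start position, strand)


def remove_duplicates(alignments: List[Alignment]) -> List[Alignment]:
    """Keep the first alignment seen at each (chromosome, start, strand)."""
    seen = set()
    kept = []
    for aln in alignments:
        if aln not in seen:
            seen.add(aln)
            kept.append(aln)
    return kept


# Same library sequenced on two lanes; chr1:100:+ shows up in both.
lane1 = [("chr1", 100, "+"), ("chr1", 100, "+"), ("chr2", 50, "-")]
lane2 = [("chr1", 100, "+"), ("chr3", 7, "+")]

merge_then_dedup = remove_duplicates(lane1 + lane2)
dedup_then_merge = remove_duplicates(lane1) + remove_duplicates(lane2)

print(len(merge_then_dedup))   # cross-lane duplicate removed
print(len(dedup_then_merge))   # cross-lane duplicate survives
```

For lanes that come from different libraries the second ordering is the intended one, since identical positions in independent libraries are not PCR duplicates of each other.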


                    • #11
                      Originally posted by Heisman
                      If you want to remove PCR duplicates, then you should merge all data before removing PCR duplicates if all of the data comes from the same prepped library. If the data comes from different prepped libraries, you should merge after removing the duplicates.
                      In other words, if library corresponds to sample, which I believe is the case with the data I have, the same sample run in multiple lanes should be merged and then aligned.
                      This clarifies a lot. I have heard some opinions from bioinformaticians that this is immaterial. In fact, even further breaking down the FASTQ files into smaller fragments (for whatever reason) should not matter for alignment.


                      • #12
                        Originally posted by epi
                        In other words, if library corresponds to sample, which I believe is the case with the data I have, the same sample run in multiple lanes should be merged and then aligned.
                        This clarifies a lot. I have heard some opinions from bioinformaticians that this is immaterial. In fact, even further breaking down the FASTQ files into smaller fragments (for whatever reason) should not matter for alignment.
                        Heisman points out that if you have different samples you should align first, remove duplicates, then merge. You conclude that since you have just one sample, you need to merge first and then align. That conclusion does not logically follow. The fallacy is common enough to have its own name: Denial of the Antecedent.

                        It really sounds like you had your mind made up before coming here with your question. Everyone who responded has told you that it doesn't matter whether you align then merge or vice versa. You don't have to believe them, but if someone takes the time to offer guidance you should at least do them the courtesy of plainly stating the basis of your disagreement.


                        • #13
                          I think it is better to do the alignment individually. This will help check for lane-specific biases, if there are any. In addition, aligning individually makes it possible to run the alignments in parallel.
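Running the per-lane alignments in parallel, as suggested in #13, might look like the sketch below. Several things here are assumptions: the helper names `align_lane` and `align_lanes_in_parallel` are hypothetical, the bowtie invocation uses the positional form of bowtie 1 (newer releases may expect `-x` for the index), and the `run` parameter exists only so the command runner can be swapped out, for example for a dry run.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional


def align_lane(fastq: str, index: str, run: Callable[[List[str]], object]) -> str:
    """Align one lane and return the name of the SAM file it writes."""
    sam = fastq.rsplit(".", 1)[0] + ".sam"
    # bowtie 1 with SAM output; positional index/reads/output assumed
    run(["bowtie", "-S", index, fastq, sam])
    return sam


def align_lanes_in_parallel(
    fastqs: List[str],
    index: str,
    run: Optional[Callable[[List[str]], object]] = None,
    workers: int = 4,
) -> List[str]:
    """Start one alignment per lane; results come back in input order."""
    if run is None:
        # default runner: execute the command and raise if it fails
        run = lambda cmd: subprocess.run(cmd, check=True)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda fq: align_lane(fq, index, run), fastqs))


# Dry run with a runner that only records the commands:
# recorded = []
# align_lanes_in_parallel(["lane1.fastq", "lane2.fastq"], "hg19", run=recorded.append)
```

Threads are enough here because each worker just waits on an external process. The per-lane SAM files that come out can then be inspected individually for lane-specific biases before they are merged.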


                          • #14
                            When using splice aligners for RNA-Seq, you must merge and then align, for obvious reasons. For regular aligners (bowtie etc.) I still merge first, remove PCR duplicates and then align. As for speed, it does not bother me, as it takes only a few minutes to align anyway. Also, when using a parallelized tool such as bowtie, I would rather dedicate all available nodes to the merged lanes than split them among 2 individual lanes running simultaneously. After all, you have to merge them at some stage anyway for the actual analysis, and file management can be cleaner if you do it right from the beginning. I see from the comments that people do it the other way as well; I guess it's just my preference for the analysis. I also do not understand Alex's comments; epi's interpretation of Heisman's response seems fine.
                            Logically, it should not matter as long as you take care of PCR duplicates at some stage in your pipeline. But practically, I have had some strange experiences with combinations of publicly available tools and their behavior. I will have to do a complete analysis myself to be convinced whether splitting causes any real issue or not. If anyone has already done this, please share here. With all due respect, I am sticking to my approach till then.
                            Last edited by analyst; 05-08-2012, 08:13 AM.


                            • #15
                              Thanks for commenting, analyst. I just don't care about responses like Alex's. Unfortunately he is not the only person in public forums and in the scientific world who likes to get personal in a scientific discussion. Basically, it seems they try to push their own agenda and preferences onto others without even understanding what is being discussed, as in this case. Maybe he is a big advocate of one particular strategy and feels insecure if someone even mentions any other. Or maybe he is just looking for places to use the phrase of the day he learnt; this tendency is even more common and has its own name: talking through the hat. Unlike his example, this even fits.
                              But overall this is an excellent forum with a good collection of people and experts. Actually, I am not familiar with the steps upstream of NGS data generation, like sample and library prep, so I feel I am more educated after these discussions. Some people state their opinion and some even the reasons behind it; both are useful.
