Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • PratikC
    Member
    • Sep 2011
    • 16

    mapping on to custom reference sequence

    Hi,
    I have sequence read data of 3 particular genes of Human.
    I know their location on reference genome and so I retrieved their sequences and made another fasta file with regular header (>seq_name) and following concatenated three sequence without anything else.

    My raw read file is of 18GB (paired end,Illumina). When I map them to custom reference (using BWA-SAMtools) I get Bam file of 8GB (Forward-Reverse merged). These all steps takes few minutes to few hours on my workstation. My BAM file index is only 36KB. When I input this BAM file in 'samtools pileup -vcf', it keeps running more than 3 days and the output file size is around 1.1GB. Its still running and I dont understand why it takes so much time? Another matter is when I checked output file after first day, it was 900MB, then on second day it was 1 GB and on third day its 1.1GB. I already killed this process before 7-8 days (because of same behavior) and restarted it. Please help me to understand why it is happening so? (there is no any error message on any of the steps)

    Thanking you in advance...
  • aaronh
    Member
    • Sep 2008
    • 46

    #2
    First off, you should most likely be mapping to the whole genome. Even if your data is from something like exon capture, you need to know how likely the mapping is to each position in the whole genome. Second, don't use pileup as this is unsupported now. Use mpileup.

    Comment

    • colindaven
      Senior Member
      • Oct 2008
      • 417

      #3
      You have that much sequence of only a few genes ? That seems like a bizarre form of overkill!

      I would also align to the whole genome and use mpileup as the previous poster said - there are good instructions on the samtools sourceforge website.

      Comment

      • PratikC
        Member
        • Sep 2011
        • 16

        #4
        Sorry friends for late reply. Thnx aaronh and colindaven. Yes my seq read is of 18GB and only 3 genes are there, but keep in mind its pooled sequencing so we got around 25 samples. Now coming to the main question, what I doubt is can pileup -vdf take so much time when rest of the steps are getting done withing few minutes to few hours. Does pileup take most of the ngs data processing time?

        Now as you suggested, I got newer samtool and set up in my workstation. Will keep you posted when done with that one. Maybe mpileup will be more efficient to do the thing...

        Comment

        • Heisman
          Senior Member
          • Dec 2010
          • 534

          #5
          Pileup should not take that much time. Since you are going to try mpileup, as it is running you can check to see how far it has progressed. Assuming you sort your bam file prior to running it you'll be able to guess how long it will take.

          If it's still acting weirdly, write back. I've set up custom reference sequences for this purpose (although I use Novoalign, not BWA, to align) and made it work fine.

          Comment

          • PratikC
            Member
            • Sep 2011
            • 16

            #6
            Originally posted by Heisman View Post
            Pileup should not take that much time. Since you are going to try mpileup, as it is running you can check to see how far it has progressed. Assuming you sort your bam file prior to running it you'll be able to guess how long it will take.

            If it's still acting weirdly, write back. I've set up custom reference sequences for this purpose (although I use Novoalign, not BWA, to align) and made it work fine.
            Hi Heisman, I already started mpileup and hopes it will be completed soon. Meanwhile, if possible, can you plz give me some more details of how you set up custom reference sequences and then how you did SNP annotation? You can also mail me if you want to.
            Thanks.

            Comment

            • Heisman
              Senior Member
              • Dec 2010
              • 534

              #7
              I actually haven't done this for awhile so hopefully I'm not forgetting anything.

              1. Take all of your targeted regions, extend each region by X base pairs, where X is determined on your method of generating data. Put all of this in fasta format.

              2. When you have your fasta file, you will want to generate a couple other files. One is the novoindex file (http://novocraft.com/wiki/tiki-index...ference+Genome). The others are for use with GATK (http://www.broadinstitute.org/gsa/wi...ference_genome).

              3. Then, align your data using novoalign, remove duplicates with Picard, realign around indels and do base quality recalibration if desired.

              4. Then, call SNPs with mpileup.

              I can give more details if desired. If you want to check how mpileup is progressing, assuming you are using a command such as this:

              samtools-0.1.18/samtools mpileup -AB -ugf [reference_seq.fa] [aligned_file.bam] | samtools-0.1.18/bcftools/bcftools view -bvcg -> [intermediate_file] &

              Go ahead and use the following command before the first one finishes. If your reference sequence and aligned files are sorted, then looking at the end of the following file that is generated will let you know how far the variant calling has progressed.

              samtools-0.1.18/bcftools/bcftools view [intermediate_file] | samtools-0.1.18/bcftools/vcfutils.pl varFilter -D99999 >[list_of_variants] &

              Comment

              • PratikC
                Member
                • Sep 2011
                • 16

                #8
                Originally posted by Heisman View Post
                I actually haven't done this for awhile so hopefully I'm not forgetting anything.

                1. Take all of your targeted regions, extend each region by X base pairs, where X is determined on your method of generating data. Put all of this in fasta format.

                2. When you have your fasta file, you will want to generate a couple other files. One is the novoindex file (http://novocraft.com/wiki/tiki-index...ference+Genome). The others are for use with GATK (http://www.broadinstitute.org/gsa/wi...ference_genome).

                3. Then, align your data using novoalign, remove duplicates with Picard, realign around indels and do base quality recalibration if desired.

                4. Then, call SNPs with mpileup.

                I can give more details if desired. If you want to check how mpileup is progressing, assuming you are using a command such as this:

                samtools-0.1.18/samtools mpileup -AB -ugf [reference_seq.fa] [aligned_file.bam] | samtools-0.1.18/bcftools/bcftools view -bvcg -> [intermediate_file] &

                Go ahead and use the following command before the first one finishes. If your reference sequence and aligned files are sorted, then looking at the end of the following file that is generated will let you know how far the variant calling has progressed.

                samtools-0.1.18/bcftools/bcftools view [intermediate_file] | samtools-0.1.18/bcftools/vcfutils.pl varFilter -D99999 >[list_of_variants] &
                Thanks Heisman, I got your point. I am doing that part and going fine with it. My question was like when we do dbSNP annotation, if our seq is mapped to ref genome, program will annotate dbSNPs on the basis of location on genome. But when we have seq reads alligned to custom reference, than how do we use dbSNP annotations?
                By the way, really thans for the reply. Now I got to know how much mpileup process is completed. Can you suggest your or some other articles with similar strategy.

                Thanks.

                Comment

                • Heisman
                  Senior Member
                  • Dec 2010
                  • 534

                  #9
                  Oh, that's a different question, haha.

                  When you create your reference sequence and make your fasta headers, I suggest making the header the genomic coordinates of the specific interval. Then, if one portion of your reference sequence is named "chr1:45673-58472" and you call a SNP at:

                  chr1:45673-58472 564

                  You can know the mutation is at chr1: (45673 + 564 -1) (be careful not to be off by one). So, even if your programming skills aren't good, you can put this in excel and convert everything to the genomic coordinates.
                  Last edited by Heisman; 10-11-2011, 11:58 AM.

                  Comment

                  • husamia
                    Member
                    • Apr 2010
                    • 66

                    #10
                    when you said you have pooled samples, I was suprised. I thought you were supposed to demultiplex your pooled samples before aligning to any reference. Otherwise you don't know which sample is which.

                    on another note, Is it possible you got a never ending loop in the command where you got somehow the input as your output from another process and it keeps going? just a thought like
                    Last edited by husamia; 10-11-2011, 01:11 PM.

                    Comment

                    • PratikC
                      Member
                      • Sep 2011
                      • 16

                      #11
                      Cheers!!!

                      Originally posted by Heisman View Post
                      Oh, that's a different question, haha.

                      When you create your reference sequence and make your fasta headers, I suggest making the header the genomic coordinates of the specific interval. Then, if one portion of your reference sequence is named "chr1:45673-58472" and you call a SNP at:

                      chr1:45673-58472 564

                      You can know the mutation is at chr1: (45673 + 564 -1) (be careful not to be off by one). So, even if your programming skills aren't good, you can put this in excel and convert everything to the genomic coordinates.
                      Thanks Heisman, I got your point and will do as you said.
                      Thank you very much.
                      Again, can you suggest any article of yours or others showing similar kind of custom reference in NGS analysis...

                      Comment

                      • PratikC
                        Member
                        • Sep 2011
                        • 16

                        #12
                        Originally posted by husamia View Post
                        when you said you have pooled samples, I was suprised. I thought you were supposed to demultiplex your pooled samples before aligning to any reference. Otherwise you don't know which sample is which.

                        on another note, Is it possible you got a never ending loop in the command where you got somehow the input as your output from another process and it keeps going? just a thought like
                        Hi Husamia, yes its pooled sample and we are not demultiplexing before alignment. We just want to find out which mutation is there in population, we don't care which mutated gene coming from which patient. Well that is for just this dataset which is a pilot study. In the following main sequencing work, we will do specific patient's sequencing individually with more samples and more number of genes.

                        Now coming to the never ending loop, I checked that and sure there is no loop, I checked the samtools output as described by Heisman and its working properly...

                        Comment

                        • Heisman
                          Senior Member
                          • Dec 2010
                          • 534

                          #13
                          Here's a paper: http://genome.cshlp.org/content/20/12/1711.full

                          They use a positive and negative control for their pooled sequencing analysis. When we do sequencing, Phi-X is spiked into every lane, and that can be used as a negative control. A positive control is not actually necessary to run SPLINTER although it is useful for downstream analysis (similary, a negative control isn't technically necessary as you could artificially create negative control data... I doubt anybody has done that but if you are willing to miss out on the more rare variants I think you could make it work).

                          Comment

                          • aniruddha.otago
                            Member
                            • Jan 2010
                            • 21

                            #14
                            Hi Guys,

                            Recently we have published a paper in Nucleic Acids research on DNA methylation data analysis- "Comparison of alignment software for genome-wide bisulphite sequence data". you might find it useful to have a look at it.. http://www.ncbi.nlm.nih.gov/pubmed/22344695

                            Comment

                            • greekkey
                              A good person
                              • Sep 2012
                              • 11

                              #15
                              I am just looking for the usage of command and saw your data. I'm wondering what's is your final result. I think you want to know the allele frequency of a given SNP in a population. If this is a common SNP (eg. 50%), probably OK, but if the freq. is not so high, in the pooled population, the signal might be masked by sequencing background noise.


                              Originally posted by PratikC View Post
                              Hi Husamia, yes its pooled sample and we are not demultiplexing before alignment. We just want to find out which mutation is there in population, we don't care which mutated gene coming from which patient. Well that is for just this dataset which is a pilot study. In the following main sequencing work, we will do specific patient's sequencing individually with more samples and more number of genes.

                              Now coming to the never ending loop, I checked that and sure there is no loop, I checked the samtools output as described by Heisman and its working properly...

                              Comment

                              Latest Articles

                              Collapse

                              • SEQadmin2
                                From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                                by SEQadmin2


                                Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                                The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                                ...
                                Yesterday, 10:05 AM
                              • SEQadmin2
                                Single-Cell Sequencing at an Inflection Point: Early Impacts of New Platforms and Emerging Trends
                                by SEQadmin2


                                With the launch of new single-cell sequencing platforms in 2026, the field stands at an exciting inflection point. This article surveys the most impactful advances in the field and discusses how they’re reshaping research in cancer, immunology, and beyond.


                                Introduction

                                Single-cell sequencing technologies have undergone remarkable advances over the past decade, transitioning from low-throughput experimental approaches to highly scalable platforms capable of...
                                05-22-2026, 06:42 AM
                              • SEQadmin2
                                Environmental Genomics in the Age of NGS: From Microbes to Conservation Strategies
                                by SEQadmin2

                                Studying ecosystems means dealing with complex, multi-species communities that are hard to observe at scale. This complexity, however, hides many important questions to be answered, from how biogeochemical cycles work and how climate change can affect species distribution to how conservation strategies can work best.


                                Genomics, particularly since the expansion of NGS, has transformed ecosystem ecology. By sequencing environmental DNA, we can now assess biodiversity without direct...
                                05-06-2026, 09:04 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by SEQadmin2, Yesterday, 12:03 PM
                              0 responses
                              19 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, Yesterday, 11:40 AM
                              0 responses
                              14 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 05-28-2026, 11:40 AM
                              0 responses
                              29 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 05-26-2026, 10:12 AM
                              0 responses
                              31 views
                              0 reactions
                              Last Post SEQadmin2  
                              Working...