Header Leaderboard Ad

Collapse

Compile multi-individual SNP tables from bowtie/samtools mappings

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Compile multi-individual SNP tables from bowtie/samtools mappings

    I am sure someone must have solved this before, and I wonder if there is not an obviously easy way to do this:

    I have GAIIx-RAD data of bowtie-reference-mapped individuals. The resulting individual SNP data (produced with samtools in the form of pileups) is easily merged in a single table, but one major problem remains:

    The samtools pileup only reports on SNPs which are different wrt the reference sequence, but not on those SNPs which happen to be identical to the reference (i.e. those which show differences in other individuals).

    Is there any easy way of discriminating the absence of data from DNA-sequence identity of any given SNP with the reference sequence?

    Any pointers most welcome!

  • #2
    We had the same problem in a plant transcriptome, so we created a tool to do just that. It takes the individual bam files, it merges them and then it looks for the SNP with respect to the reference or to between the reads. You could do something like that.
    Alternatively, you could give a spin to our tool, it might be useful to you, it is named ngs_backbone.

    Comment


    • #3
      Dear Jose,
      this is brilliant. I will gove it a shot - no need to reinvent the wheel. Big thanks for your quick reply!

      Comment


      • #4
        consolidating pileup across multiple samples

        Here are steps for using samtools to creating pileup output for three samples holding the sample allele at each site
        Code:
        # assuming for each of three read groups (samples) named s1,s2,s3 that you have one file of reads aligned to genome in $RefseqFasta
        
        # use samtools to produce your sorted bam files {s1,s2,s3}.s.bam
        ...
        
        # build the pileups:
        samtools pileup -vcf $RefseqFasta s1.s.bam > s1.pileup.tab
        samtools pileup -vcf $RefseqFasta s2.s.bam > s2.pileup.tab
        samtools pileup -vcf $RefseqFasta s3.s.bam > s3.pileup.tab
        
        # filter them using samtools.pl (or varfilter.py) (or awk)
        samtools.pl varFilter $varFilterOptions s1.pileup.tab > s1.pileup.filtered.tab
        samtools.pl varFilter $varFilterOptions s2.pileup.tab > s2.pileup.filtered.tab
        samtools.pl varFilter $varFilterOptions s3.pileup.tab > s3.pileup.filtered.tab
        
        # produce consolidated "list of sites" (c.f. [url]http://samtools.sourceforge.net/samtools.shtml[/url]) holding at least one filtered pileup
        # note: the s{1,2,3} relies upon your shell being bash.
        cut -f 1-2  s{1,2,3}.pileup.filtered.tab | sort  --key=1,1 --key=2,2n | uniq > s1-3.sites.tab
        
        # extract the pileup for each sample at each site
        samtools pileup -l s1-3.sites.tab -c -f $RefseqFasta s1.s.bam > s1.pileup.atsites.tab
        samtools pileup -l s1-3.sites.tab -c -f $RefseqFasta s2.s.bam > s2.pileup.atsites.tab
        samtools pileup -l s1-3.sites.tab -c -f $RefseqFasta s3.s.bam > s3.pileup.atsites.tab
        
        # Use whatever tools you have to compare s{1,2,3}.pileup.atsites.tab
        # They all are in the same order and have the same values in the 
        # first three columns (chromosome, position, reference sequence).
        # Put them in a database and write join commands?  
        # Use the unix `paste` command to see them side-by-side?
        # Load them into "R" and compare them there?
        # Use ensembl'e snp_effect_predictor to predict consequences?
        # ...
        I typically craft the above logic using Makefiles....

        Cheers,

        Malcolm

        Comment


        • #5
          Use mpileup

          Comment


          • #6
            Originally posted by lh3 View Post
            Use mpileup
            Using mpileup does a fantastic job of simplifying the workflow. However, the per-sample information (strand bias, sequence depth, etc.) are largely lost. Is there a way to maintain that sample-level information (DP, SB, and potentially others) so that more interesting filtering can be done on individual samples? In the end, we are typically interested in having very high quality variant calls per sample and want to guard against errors at that level and not so much at the level of the population.

            Comment


            • #7
              For now, call candidates first and then run mpileup on the candidates to get the pileup format (i.e., without the "-g" option). From pileup, you will know DP. As to strand bias, you may perform an exact test based on the pileup string, like what bcftools is doing.

              Or when you get the candidate lists, collect information individually by running mpileup on each individual.

              EDIT: in future, pileup lines may be folded into VCF, but this will not be soon.
              Last edited by lh3; 10-29-2010, 04:35 AM.

              Comment


              • #8
                I have just committed a new revision by which you can compute per-sample DP and SP (phred-scaled strand bias P-value), via the -D and -S options of mpileup. They are not the default behavior. In addition, -S may have performance overhead as Fisher's exact test really takes time.

                EDIT: for now, I think DP and SP are the most informative ones.
                Last edited by lh3; 10-29-2010, 08:21 AM.

                Comment


                • #9
                  We keep, sample and library information as well as a mean quality per allele. That's why we use our own snp caller and not pileup. If pileup kept that information we would have use it.

                  Comment


                  • #10
                    mpileup keeps per sample information. this is what it is designed for. the quality of each base is there, so you can also compute a mean. on the other hand, I am not sure how much library information may help. I guess we mostly use one library per sample now.

                    nonetheless, for deep resequencing of a couple of individuals, the model for snp calling is not important. any reasonable snp callers will give sensible results. no need to be bound to samtools. just you should apply the baq thing before snp calling.

                    Comment


                    • #11
                      Hi Ih3

                      Could you write the command for it? Where I put -D and -S?

                      samtools mpileup -EuDSf *.fa s1.sorted.bam s2.sorted.bam
                      bcftools view -bvcg
                      bcftools/vcfutils.pl varFilter -D100

                      Thanks


                      I have just committed a new revision by which you can compute per-sample DP and SP (phred-scaled strand bias P-value), via the -D and -S options of mpileup. They are not the default behavior. In addition, -S may have performance overhead as Fisher's exact test really takes time.

                      EDIT: for now, I think DP and SP are the most informative ones.

                      Comment

                      Working...
                      X