Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • david.tamborero
    Member
    • Feb 2011
    • 60

    How Varscan process the pileup data?

    Hi,

    if any of you has used Varscan to find somatic mutations given normal and tumor pileup files, maybe you can give me some light about how it processes the data.

    For instance, given the following position in the tumor pileup file:

    Code:
    chr1	173824	T	21	GG...*..*.G.......GG.	
    				##9!!!!;!<#;<<=<<<!#<
    Varscan says that 8 tumor reads supports the 'T' and 2 tumor reads supports the 'G'. I do not see how it is fitted!

    Another random example:

    Code:
    chr1	30028	c	9	,tt,,,T,,	
    				5::778?.4
    Varscan says that in the previous position, there are 3 reads for nucleotide 'T' and 5 for 'C' (why not 6??). There are these kind of discordances in almost all the positions I've checked, so I'm pretty confused.

    thanks!
    david
  • dkoboldt
    Member
    • Mar 2009
    • 62

    #2
    VarScan includes a minimum base quality threshold requirement in order to count reads. By default, this is 15 or 20 depending on the command you use.

    In the first example, some bases have a base quality represented by '#', indicating a Phred score of 3. In the second example, one reference base has a quality represented by ".", which is 14. Both of these fall below the minimum base quality threshold in VarScan.

    Comment

    • david.tamborero
      Member
      • Feb 2011
      • 60

      #3
      Thank you for your response, dkoboldt.

      I indeed thought in the possibility that some 'default quality filtering' was being applied, but (for instance) in the first example, all G's should be not taken into account since their qualities are either '#' or '!'. However, Varscan stats say that for that position there are two G's. This does not seem to fit.

      There is plenty of things like that, and I am not able to find a 'pattern' of quality filtering by checking the pileup positions as compared to the corresponding VArscan entry...

      thanks a lot,
      cheers
      David
      Last edited by david.tamborero; 01-04-2012, 06:24 PM.

      Comment

      • david.tamborero
        Member
        • Feb 2011
        • 60

        #4
        Just for those that can be interested in the varscan filtering according to the base quality: apparently Varscan has a bug that arises when certain characters appear in the field containing the bases (for instance, the '*'). This provokes a shift in the 'base-base quality' correspondance and then the counting of bases supporting each allele becomes wrong.

        Therefore, the base quality filtering performed by Varscan fails for those reads containing these characters. I've written to the authors, hope it helps.

        cheers,
        david

        Comment

        • Jane M
          Senior Member
          • Aug 2011
          • 239

          #5
          Hello David,
          I intended to try Varscan with normal and tumoral files, so thanks for mentionning the problem. Please, keep us informed if you got news from the authors. I would like to try the tool when the problem is solved!
          Do you know if the problem is also present when using Varsan with only one file?

          Comment

          • david.tamborero
            Member
            • Feb 2011
            • 60

            #6
            No idea, I only have used the somatic command. You can try the other Varscan commands (as the pileup2snp) with a dummy pileup file and play with the --min-avg-qual parameter to check it out.

            By the way, I need that Varscan performs the quality filtering because the samtools mpileup command also does it wrong (the -Q argument does not work). Seems that the base quality filtering is not that important!

            cheers,
            David

            Comment

            • Jane M
              Senior Member
              • Aug 2011
              • 239

              #7
              Ok, thanks!

              Comment

              • Jane M
                Senior Member
                • Aug 2011
                • 239

                #8
                I tried VarScan in the somatic mode and I also experienced this problem.
                For example, VarScan gives as output the following number of reads:

                - for the normal sample: 183 (reference) and 4 (variant)
                - for the tumor sample: 8 (reference) and 14 (variant)

                This results in a somatic mutation. I checked my bam files in IGV and I found:
                - for the normal sample: 185 (reference) and 165 (variant)
                - for the tumor sample: 8 (reference) and 359 (variant)

                Of course, some reads can be deleted because they have low quality, but I doubt that it could be the cases for so many reads... This variation is rather a LOH rather than a somatic mutation.
                And I've got this kind of results several times...

                I wonder how this problem affect the SNPs detection and how wrong is my analysis...

                Do you know if this problem will be taken into account?

                Comment

                • david.tamborero
                  Member
                  • Feb 2011
                  • 60

                  #9
                  I guess than before running varscan you have converted the bam file to pileup file, right?

                  If so, such 'lost' reads could be explained by the samtools mpileup command. Check its arguments: the mapping_quality filtering works (as far as i remember, and as opposite to the base_quality filtering), therefore some reads can be removed due to this parameter. The other source of reads removal during bam to pileup conversion can be the 'anomalous read pairs': check the '-A' argument of the mpileup command to see if reads counts are more consistent with IGV data (which opens the bam file).

                  Comment

                  • Jane M
                    Senior Member
                    • Aug 2011
                    • 239

                    #10
                    Originally posted by david.tamborero View Post
                    I guess than before running varscan you have converted the bam file to pileup file, right?

                    If so, such 'lost' reads could be explained by the samtools mpileup command. Check its arguments: the mapping_quality filtering works (as far as i remember, and as opposite to the base_quality filtering), therefore some reads can be removed due to this parameter. The other source of reads removal during bam to pileup conversion can be the 'anomalous read pairs': check the '-A' argument of the mpileup command to see if reads counts are more consistent with IGV data (which opens the bam file).
                    Exactly, I have converted my bam files into pileup files to use VarScan.

                    samtools mpileup -f ~/fasta/hg19.fasta /data/patient1/s_garma-296_converted_sorted.bam > /data/patient1/s_garma-296_converted_sorted.pileup
                    The default arguments are:
                    -q INT skip alignments with mapQ smaller than INT [0]
                    -Q INT skip bases with baseQ/BAQ smaller than INT [13]
                    I checked on IGV the PHRED of some of the reads at this position and they were good (>30). So the missing reads are probably not due to base quality. I don't know for the mapping quality... How can I check it?

                    As you suggested me, I re-run samtools mpileup with -A:
                    samtools mpileup -f ~/fasta/hg19.fasta /data/patient1/s_garma-296_converted_sorted.bam > /data/patient1/s_garma-296_converted_sorted.pileup
                    And I have still the same kind of problems: LOH considered as somatic mutations because of missing reads.
                    I am not sure what is doing the -A option... Does it add some anomalous reads?
                    Last edited by Jane M; 02-06-2012, 08:52 AM.

                    Comment

                    • dkoboldt
                      Member
                      • Mar 2009
                      • 62

                      #11
                      Jane,

                      It will also help if you provide the pileup output for 1-2 examples of this base dropout - we can quickly look at what's in it to see if VarScan is not counting anything.

                      Please note that the base quality parsing issue was resolved as of VarScan v2.2.8. If any of you encounter it, please let me know!

                      Yours,

                      Dan Koboldt

                      Comment

                      • Jane M
                        Senior Member
                        • Aug 2011
                        • 239

                        #12
                        Thank you for your answer Dan,

                        Here is the output of the pileup file at the position where I have the problem I mentioned:
                        With VarScan somatic:
                        - for the normal sample: 183 (reference G) and 4 (variant A)
                        - for the tumor sample: 8 (reference) and 14 (variant)

                        With IGV:
                        - for the normal sample: 185 (reference) and 165 (variant)
                        - for the tumor sample: 8 (reference) and 359 (variant)

                        [PCJane patient1]$ grep 186380515 s_garma-fibros_converted_sorted.pileup
                        chr4 186380515 G 350 .$,$a$,$,$a$a$.Aa,aA.aaa,..,a,.....aa,,,a,,AAaaaA.AA,,Aa,,aa...a,..a,,.AAaaa,a...,,a,,,..A,a,a,.,,.Aaaaa,,,.,Aa.,Aa,,Aa,AAa,aa..a,,a..a,aAa,aA.Aaa.aaAAAA.A,a,aA,A,,,..aa,aaa.A.aaa..aa...A,,,aA.AA..Aa,,aa...a,aaaAA.AaAa,a,AAA.,a,AAA....a..,aa,AA,A.a..aA,aaa.,,aA..AAa,aA.,aA,,AA,,,,,AaAAA,A,,,A.aaa..aa,AAa,,,..A,,,,A.a,a,aa,,....aaa,....,,,,..A..,aa......^~A^~, EC!@C!!B(+@+3J585C@IF7BJIGIJ97FDF9FD337774I3:FF75HH55JIJ5GJJ9HH733555H5GII:J3JIJBH4J3J3JIJJJ53333JIJEJ43IJ43JJ43J334F33EJ4JJ3JI3J333J;3J433J333363J4J3F43J6JGIJJ3:J333?3J333JJ33JGJ4JJJ84J33JJ33GJ33JJJ3J:3346J3343J3J>33JJ*J334IJGJ4JJJ34G33F3J4JJ46J433JDJ33IJ333I33JJ33JJ43JIJJJ43334G3FJJ5F333HF73J564IGJFH5JJGG5H3H4I53IIFFFF454JFFFFHHGHFF7CCF67C@CCCC%E
                        ^C
                        [PCJane patient1]$
                        I intend to test Varscan in the "simple mode" today to check if I have this issue too.

                        When using the CASAVA pipeline on my normal sample, I got:

                        186380515 chr4 (G)155 (A) 142
                        By comparison between, IGV, CASAVA and Varscan in the normal sample, I have:
                        -for the reference: 185, 155, 183
                        -for the variant: 165, 142, 4

                        Moreover, I tried JointSNVMix, which let the direct comparison between normal and tumoral sample. At this specific position, I have nothing. But at several sites, where I see this problem in VarScan ,my results (number of read counts) with JointSNV are closed to the ones in IGV...
                        Last edited by Jane M; 03-02-2012, 01:39 AM.

                        Comment

                        • Jane M
                          Senior Member
                          • Aug 2011
                          • 239

                          #13
                          I ran VarScan in the simple mode on the normal sample, using pileup2snp:

                          java -Xmx20g -jar VarScan.v2.2.8.jar pileup2snp /data/patient1/s_garma-fibros_converted_sorted.pileup --min-coverage 10 --min-avg-qual 25 > /data/patient1/output_varscan_snp_garma.snp
                          At the position where I have a problem in read counts, I got:
                          Code:
                          chr4  	186380515	(G)  184	   (A) 5
                          I have the same problem in both mode. It's very strange... It cannot be related to the version of VarScan since I'm using the latest one, v2.2.8...

                          I attach here the screenshots from IGV at this specific position.
                          Attached Files

                          Comment

                          • Jane M
                            Senior Member
                            • Aug 2011
                            • 239

                            #14
                            Hello,

                            I just wanted to let you know that there are buggs in VarScan, in both default and somatic modes. There are other people experiencing the same problem like me.
                            To conclude, I give up VarScan but I hope the problem will be taken into account, because appart this issue, this seems to be an interesting tool!

                            Comment

                            • Jane M
                              Senior Member
                              • Aug 2011
                              • 239

                              #15
                              Hello,

                              If some users are experiencing the same issue and read this subject, the problem comes from samtools. Add the -B or -E option to disable the BAQ computation.

                              Comment

                              Latest Articles

                              Collapse

                              • SEQadmin2
                                From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                                by SEQadmin2


                                Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                                The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                                ...
                                Yesterday, 10:05 AM
                              • SEQadmin2
                                Single-Cell Sequencing at an Inflection Point: Early Impacts of New Platforms and Emerging Trends
                                by SEQadmin2


                                With the launch of new single-cell sequencing platforms in 2026, the field stands at an exciting inflection point. This article surveys the most impactful advances in the field and discusses how they’re reshaping research in cancer, immunology, and beyond.


                                Introduction

                                Single-cell sequencing technologies have undergone remarkable advances over the past decade, transitioning from low-throughput experimental approaches to highly scalable platforms capable of...
                                05-22-2026, 06:42 AM
                              • SEQadmin2
                                Environmental Genomics in the Age of NGS: From Microbes to Conservation Strategies
                                by SEQadmin2

                                Studying ecosystems means dealing with complex, multi-species communities that are hard to observe at scale. This complexity, however, hides many important questions to be answered, from how biogeochemical cycles work and how climate change can affect species distribution to how conservation strategies can work best.


                                Genomics, particularly since the expansion of NGS, has transformed ecosystem ecology. By sequencing environmental DNA, we can now assess biodiversity without direct...
                                05-06-2026, 09:04 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by SEQadmin2, Yesterday, 12:03 PM
                              0 responses
                              19 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, Yesterday, 11:40 AM
                              0 responses
                              14 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 05-28-2026, 11:40 AM
                              0 responses
                              29 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 05-26-2026, 10:12 AM
                              0 responses
                              31 views
                              0 reactions
                              Last Post SEQadmin2  
                              Working...