Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • [Problem] The pileup inconsistence between 0.1.8 and 0.1.16 samtools

    Dear all,

    [I sent the following message to samtools-help mailling list, but no luck. Therefore, I post it here. I am sorry for sending the message twice.]

    I tried to use samtools version 0.1.8 and version 0.1.16 to generate .plp files on the same bam file:

    samtools.0.1.8 pileup -f genome_hg19.fa -c accepted_hits.sorted.bam > normal.raw.plp
    samtools.0.1.16 pileup -f genome_hg19.fa -c accepted_hits.sorted.bam > normal.raw.plp

    However, I found that the plp file generated by 0.1.16 version is several times bigger than the plp file compiled by 0.1.8 version of samtools.

    I tried -B option to disable BAQ, but the result is still the same. I checked the coverage on the genome of the two plp files, and I found they have huge difference.

    What happened? I googled the question and found someone has ever asked the same question recently:
    Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc


    It seemed like not just me suffering this problem.

    Could someone please help me explain the problem? Thanks!

    Bests,

    Woody

  • #2
    Further Information

    I tried to use a relatively small bam file to see how big the difference is:

    ###Generate plp file with different versions of samtools###
    09:19:38 woody@normal:samtools pileup -f ~/Documents/cDNA/chromosome/hg19/tmp/genome_hg19.fa -c normal_chr6.sorted.bam > normal_chr6.raw.plp
    09:20:17 woody@normal:samtools.0.1.8 pileup -f ~/Documents/cDNA/chromosome/hg19/tmp/genome_hg19.fa -c normal_chr6.sorted.bam > normal_chr6.old.plp

    ###Check file size###
    -rw-r--r-- 1 woody staff 17M 5 31 09:20 normal_chr6.old.plp
    -rw-r--r-- 1 woody staff 781M 5 31 09:19 normal_chr6.raw.plp

    ###Count the lines in the files###
    09:22:15 woody@normal:wc -l normal_chr6.raw.plp
    88556 normal_chr6.raw.plp
    09:22:22 woody@normal:wc -l normal_chr6.old.plp
    8872 normal_chr6.old.plp

    As you can see, the file size is a lot bigger for the normal_chr6.raw.plp (compiled by 0.1.16 version of samtools). The number of lines is 10 times larger.

    I have also checked the plp files with my naked eyes, and I found that the alignment results are slightly different. The numbers of hits on the genomic position are different between the two files.

    Could anyone helps? Please.

    Bests,

    Woody

    Comment


    • #3
      mpileup generates different results too

      [Command]
      10:32:54 woody@normal:samtools.0.1.8 mpileup -f genome_hg19.fa normal_chr6.sorted.bam > normal_chr6.old.mplp
      [mpileup] 1 samples in 1 input files
      10:36:57 woody@normal:samtools mpileup -f genome_hg19.fa normal_chr6.sorted.bam > normal_chr6.raw.mplp
      [mpileup] 1 samples in 1 input files
      <mpileup> Set max per-file depth to 8000

      [File sizes]
      -rw-r--r-- 1 woody staff 5.4M 3 9 15:18 normal_chr6.bam
      -rw-r--r-- 1 woody staff 17M 6 1 10:32 normal_chr6.old.mplp
      -rw-r--r-- 1 woody staff 17M 5 31 09:20 normal_chr6.old.plp
      -rw-r--r-- 1 woody staff 781M 6 1 10:30 normal_chr6.raw.mplp
      -rw-r--r-- 1 woody staff 781M 5 31 09:19 normal_chr6.raw.plp

      Even I used mpileup, the difference is still very obvious.

      Please somebody help answering the question.

      Are you sure the latest version is the correct one? or the corrupted one?

      Woody

      Comment


      • #4
        Could post the input BAM for us to take a look at it? Also, you could try to narrow it down for us by:
        A: trying the releases between 0.1.8 and 0.1.16.
        B: after A, finding the commit that produces the increase, if it exists.

        Looking through the commit log, over 9 months of development exists between 0.1.8 and 0.1.16, so I am not shocked if there are differences (improvements), but the behavior you are seeing is rather odd.

        Comment


        • #5
          Thanks for answering

          Thanks, nilshomer.

          I have already tried the step A and posted results previously.

          I wish I could find the commit.
          Last edited by woodydon; 06-01-2011, 04:48 AM. Reason: removed hyperlinks

          Comment


          • #6
            Originally posted by woodydon View Post
            Thanks, nilshomer.

            I have already tried the step A and posted results previously.

            I wish I could find the commit.
            I only see 0.1.8 and 0.1.16, and not 0.1.9 through 0.1.15. What is your impediment to finding the commit? Go through each intermediate commit until you find the difference.

            Comment


            • #7
              Happy now?

              Originally posted by nilshomer View Post
              I only see 0.1.8 and 0.1.16, and not 0.1.9 through 0.1.15. What is your impediment to finding the commit? Go through each intermediate commit until you find the difference.
              I use another Ubuntu Linux server to generate these files since I originally use Mac OS.

              -rw-r--r-- 1 aiia aiia 18M 2011-06-02 17:06 chr6.1.4.plp
              -rw-r--r-- 1 aiia aiia 18M 2011-06-02 17:05 chr6.1.5.plp
              -rw-r--r-- 1 aiia aiia 18M 2011-06-02 16:57 chr6.1.6.plp
              -rw-r--r-- 1 aiia aiia 18M 2011-06-02 16:51 chr6.1.7a.plp
              -rw-r--r-- 1 aiia aiia 18M 2011-06-02 16:35 chr6.1.8.plp
              -rw-r--r-- 1 aiia aiia 797M 2011-06-02 16:53 chr6.1.9.plp
              -rw-r--r-- 1 aiia aiia 782M 2011-06-02 16:45 chr6.1.10.plp
              -rw-r--r-- 1 aiia aiia 782M 2011-06-02 17:14 chr6.1.11.plp
              -rw-r--r-- 1 aiia aiia 782M 2011-06-02 17:15 chr6.1.12a.plp
              -rw-r--r-- 1 aiia aiia 782M 2011-06-02 17:16 chr6.1.13.plp
              -rw-r--r-- 1 aiia aiia 782M 2011-06-02 17:20 chr6.1.14.plp
              -rw-r--r-- 1 aiia aiia 782M 2011-06-02 17:21 chr6.1.15.plp
              -rw-r--r-- 1 aiia aiia 782M 2011-06-02 16:41 chr6.1.16.plp

              I have tested all compilable versions of samtools. My results showed that after version 0.1.9 the file size became very big.

              The 0.1.9 version had the following changes:

              * A reference skip (the N CIGAR operator) is shown as '<' or '>' depending on the strand. Tview is also changed accordingly.
              * Accelerated pileup. The plain pileup is about 50% faster.

              Any ideas?

              I will write a script to generate my own plp-alike file and see what I can find.

              Comment


              • #8
                If you have time to run multiple versions, you should first have a look at the small and big files. It is very easy to figure out what make one file big. I guess you are using RNA-seq.

                Anyway, it does not matter. pileup has gone forever.

                Comment


                • #9
                  Ok, so now we know a change occurred between 0.1.8 and 0.1.9, can you identify the commit (use svn checkout to iterate through the commits)? If not, send us the input file(s) and someone here could try.

                  Like Heng (lh3) said, pileup will be removed in future versions but the size increase seems to be also affecting mpileup, which will not be removed.

                  Comment


                  • #10
                    Identified the source causing the problem

                    Dear all,

                    Sorry about the delay.

                    I have figured out why newer samtools generates larger plp files.

                    I am using RNA-Seq data. It is the main reason why samtools went wrong.

                    Apparently, after 0.1.9, pileup/mpileup will fulfill the gap between the read mates automatically while converting the sorted bam file to plp file.

                    In other words, when dealing with the alignment results of junction reads in RNA-Seq, pileup/mpileup will automatically fill the gap (intron) with a serials of Ns.

                    I guess that the gap fulfilling process is the reason of "50% faster" since the later pileup/mpileup doesn't have to take care about the gap anymore.

                    However, the change will totally mess up the alignment results; especially, for my RNA-Seq data.

                    Here's an illustration:

                    Comment


                    • #11
                      Originally posted by woodydon View Post
                      Dear all,

                      Sorry about the delay.

                      I have figured out why newer samtools generates larger plp files.

                      I am using RNA-Seq data. It is the main reason why samtools went wrong.

                      Apparently, after 0.1.9, pileup/mpileup will fulfill the gap between the read mates automatically while converting the sorted bam file to plp file.

                      In other words, when dealing with the alignment results of junction reads in RNA-Seq, pileup/mpileup will automatically fill the gap (intron) with a serials of Ns.
                      That's interesting. I'd actually wondered about doing coverage calculations with and without the span of Ns between properly mapped forward and reverse reads - in the context of paired end genomic reads, I think you can make a good case either way. Might be worth asking about this change on the 0.1.9 release notes thread:
                      Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc

                      Comment


                      • #12
                        Thank you for replying

                        Originally posted by maubp View Post
                        That's interesting. I'd actually wondered about doing coverage calculations with and without the span of Ns between properly mapped forward and reverse reads - in the context of paired end genomic reads, I think you can make a good case either way. Might be worth asking about this change on the 0.1.9 release notes thread:
                        http://seqanswers.com/forums/showthread.php?t=7542

                        You are right about the coverage calculation. With and without the span of Ns will make quite different coverage calculations based on my own experiments.

                        By the way, from my own experiments, 0.1.9 pileup seemed only take properly mapped pairs.

                        Comment


                        • #13
                          I got the same problem when converting bam to pileup. Did you find a solution?

                          Comment


                          • #14
                            Originally posted by zengzheng123 View Post
                            I got the same problem when converting bam to pileup. Did you find a solution?
                            Hi Zeng Zheng,

                            You could just use the samtools 0.1.8. Sorry for late replying.

                            Woody

                            Comment

                            Latest Articles

                            Collapse

                            • seqadmin
                              Best Practices for Single-Cell Sequencing Analysis
                              by seqadmin



                              While isolating and preparing single cells for sequencing was historically the bottleneck, recent technological advancements have shifted the challenge to data analysis. This highlights the rapidly evolving nature of single-cell sequencing. The inherent complexity of single-cell analysis has intensified with the surge in data volume and the incorporation of diverse and more complex datasets. This article explores the challenges in analysis, examines common pitfalls, offers...
                              06-06-2024, 07:15 AM
                            • seqadmin
                              Latest Developments in Precision Medicine
                              by seqadmin



                              Technological advances have led to drastic improvements in the field of precision medicine, enabling more personalized approaches to treatment. This article explores four leading groups that are overcoming many of the challenges of genomic profiling and precision medicine through their innovative platforms and technologies.

                              Somatic Genomics
                              “We have such a tremendous amount of genetic diversity that exists within each of us, and not just between us as individuals,”...
                              05-24-2024, 01:16 PM

                            ad_right_rmr

                            Collapse

                            News

                            Collapse

                            Topics Statistics Last Post
                            Started by seqadmin, Yesterday, 07:24 AM
                            0 responses
                            10 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 06-13-2024, 08:58 AM
                            0 responses
                            11 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 06-12-2024, 02:20 PM
                            0 responses
                            16 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 06-07-2024, 06:58 AM
                            0 responses
                            184 views
                            0 likes
                            Last Post seqadmin  
                            Working...
                            X