Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • fc35802
    Member
    • Nov 2011
    • 27

    A doubt about mpileup

    Hi,

    I used the command samtools mpileup -f /GenoStorage/Genomas/hg19/hg19RefGenome.fa sample_align_sorted.bam.bam > sample_mpileup

    It was surprisingly quick, then I put

    ls -lah sample02187A_mpileup
    -rw-rw-r-- 1 meusebio meusebio 5.8G Feb 10 15:03 sample_mpileup

    just to see if it looked ok.
    Then:
    [meusebio@IMMGene1 meusebio]$ bcftools view -vcg sample_mpileup > sample.vcf
    [bcf_sync] incorrect number of fields (0 != 5) at 0:0
    [afs] 0:0.000


    What should I do?

    Thank you in advance
  • swbarnes2
    Senior Member
    • May 2008
    • 910

    #2
    The mpileup command you used generates a flat text pileup that is not suitable for import into bcftools, though it should not be quick. Try looking at that pileup file, see if it looks like it should.

    To make a vcf, try something like

    samtools mpileup -ugf ref.fa file.bam | bcftools view -bcvg - > file.bcf
    bcftools view file.bcf > file.vcf
    Personally, I always add a -B to the mpileup command, because I've occasionally found that SNPs got missed if I didn't, and I'd much rather deal with false positives than false negatives.

    Comment

    • fc35802
      Member
      • Nov 2011
      • 27

      #3
      To day I used:

      samtools mpileup -d8000 -B -ugf /GenoStorage/Genomas/hg19/hg19RefGenome.fa sample02187A_align_sorted.bam | bcftools view -bvcg - > varSample02187A.bcf
      [mpileup] 1 samples in 1 input files
      [bcfview] 100000 sites processed.
      [afs] 0:73270.205 1:12061.094 2:14668.700
      [bcfview] 200000 sites processed.
      [afs] 0:93673.767 1:2809.202 2:3517.031
      [bcfview] 300000 sites processed.
      [afs] 0:62908.369 1:16718.513 2:20373.117
      [bcfview] 400000 sites processed.
      [afs] 0:62494.168 1:16735.723 2:20770.109
      [bcfview] 500000 sites processed.
      [afs] 0:67447.351 1:14587.030 2:17965.618
      [bcfview] 600000 sites processed.
      [afs] 0:67444.849 1:14410.756 2:18144.395
      [bcfview] 700000 sites processed.
      [afs] 0:69741.610 1:13586.786 2:16671.604
      [bcfview] 800000 sites processed.
      [afs] 0:66746.668 1:14996.990 2:18256.342
      [afs] 0:7849.588 1:1177.825 2:1518.587


      Only took me 9 minutes. It's really strange

      Comment

      • fc35802
        Member
        • Nov 2011
        • 27

        #4
        I started almost from the beginning.
        I also marked the duplicates.

        My mpileup as 14GB when I use the command -f, and he is smaller when I use -ugf

        The odd thing is when I use -ugf I see some strange symbols.

        In both cases the mpileup goes on for your if I only use the mpileup command, but if I save the output in a file or do the bcftools part, it takes only a few minutes

        Comment

        • swbarnes2
          Senior Member
          • May 2008
          • 910

          #5
          I don't see how marking duplicates matters. I don't think mpileup pays attention to that.

          Using -ug makes a file that is suitable for import into bcftools for SNP calling. Without it, mpileup makes a human readable pileup. The pileup is going to have one line for every base in your reference, so if your reference is large, it'll be a huge file. But you can eyeball it to see how mpileup is interpreting your .bam

          Comment

          • fc35802
            Member
            • Nov 2011
            • 27

            #6
            This was my command:

            samtools mpileup -P ILLUMINA -d8000 -B -ugf /GenoStorage/Genomas/hg19/hg19RefGenome.fa sample02187A_align_sorted.bam | bcftools view -bcvg - > varSample02187A.bcf
            [mpileup] 1 samples in 1 input files
            [bcfview] 100000 sites processed.
            [afs] 0:73270.205 1:12061.094 2:14668.700
            [bcfview] 200000 sites processed.
            [afs] 0:93673.767 1:2809.202 2:3517.031
            [bcfview] 300000 sites processed.
            [afs] 0:62908.369 1:16718.513 2:20373.117
            [bcfview] 400000 sites processed.
            [afs] 0:62494.168 1:16735.723 2:20770.109
            [bcfview] 500000 sites processed.
            [afs] 0:67447.351 1:14587.030 2:17965.618
            [bcfview] 600000 sites processed.
            [afs] 0:67444.849 1:14410.756 2:18144.395
            [bcfview] 700000 sites processed.
            [afs] 0:69741.610 1:13586.786 2:16671.604
            [bcfview] 800000 sites processed.
            [afs] 0:66746.668 1:14996.990 2:18256.342
            [afs] 0:7849.588 1:1177.825 2:1518.587

            It always stops in here, it took 9 minutes to do this and generated a .bcf file empty.
            I really can't understand what is wrong with the command

            Comment

            • fc35802
              Member
              • Nov 2011
              • 27

              #7
              I really can't understand why mpileup is so quick just when I save the output for a file or piping with bcftools.
              While doing the following command I saw

              samtools mpileup -P ILLUMINA -d8000 -A -B -C50 -uf /GenoStorage/Genomas/hg19/hg19RefGenome.fa sample02187A_align_sorted.bam
              [mpileup] 1 samples in 1 input files
              \ufffdBC\ufffd\ufffd9\ufffdBCF\ufffdchr1chr2chr3chr4chr5chr6chr7chr8chr9chr10chr11chr12chr13chr14chr15chr16chr17chr18chr19chr20chr21chr22chrXchrYchrMsample%##samtoolsVersion=0.1.18 (r982:295)

              It's normal?
              Another thing that a thought was if I should need my memory swap to be the double of my ram memory. This is true?

              I know I'm being really annoying. I'm really sorry.

              Comment

              • swbarnes2
                Senior Member
                • May 2008
                • 910

                #8
                For the third time, why don't you look at the human readable pileup output of mpileup?

                Comment

                • fc35802
                  Member
                  • Nov 2011
                  • 27

                  #9
                  It looks like this

                  chr1 10018 c 1 ^!. D
                  chr1 10019 t 1 . F
                  chr1 10020 a 1 . F
                  chr1 10021 a 1 . F
                  chr1 10022 c 1 . H
                  chr1 10023 c 1 . H
                  chr1 10024 c 1 . H
                  chr1 10025 t 1 . H
                  chr1 10026 a 1 . H
                  chr1 10027 a 2 .^!. IF
                  chr1 10028 c 2 .. JD
                  chr1 10029 c 2 .. J+
                  chr1 10030 c 2 .. J2
                  chr1 10031 t 2 .. J=
                  chr1 10032 a 2 .. JC
                  chr1 10033 a 2 .. CD
                  chr1 10034 c 2 .. HC
                  chr1 10035 c 2 .. GC
                  chr1 10036 c 2 .. G@
                  chr1 10037 t 2 .. JF
                  chr1 10038 a 2 .. GH
                  chr1 10039 a 2 .. GG
                  chr1 10040 c 2 .. IH
                  chr1 10041 c 2 .. CI
                  chr1 10042 c 2 .. ED
                  chr1 10043 t 2 .. HG
                  chr1 10044 a 2 .. GC
                  chr1 10045 a 3 ..^2c I@#
                  chr1 10046 c 3 .., JB#
                  chr1 10047 c 3 .., IBC
                  chr1 10048 c 4 ..,^+. I98F

                  Comment

                  • swbarnes2
                    Senior Member
                    • May 2008
                    • 910

                    #10
                    The good news is at least mpileup has figure out that the reference fasta you gave it is the one used to make the .bam file. So it has the reference all squared away.

                    This part has a coverage of 2, and 1 in places, but in isn't this region telomeric junk in the human genome? Is the coverage higher in a more representaitve region of the genome?

                    What % of your reads aligned? How many reads aligned total? Maybe this run was junk, and that's why nothing is working.

                    Comment

                    Latest Articles

                    Collapse

                    • SEQadmin2
                      From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                      by SEQadmin2


                      Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                      The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                      ...
                      Yesterday, 10:05 AM
                    • SEQadmin2
                      Single-Cell Sequencing at an Inflection Point: Early Impacts of New Platforms and Emerging Trends
                      by SEQadmin2


                      With the launch of new single-cell sequencing platforms in 2026, the field stands at an exciting inflection point. This article surveys the most impactful advances in the field and discusses how they’re reshaping research in cancer, immunology, and beyond.


                      Introduction

                      Single-cell sequencing technologies have undergone remarkable advances over the past decade, transitioning from low-throughput experimental approaches to highly scalable platforms capable of...
                      05-22-2026, 06:42 AM
                    • SEQadmin2
                      Environmental Genomics in the Age of NGS: From Microbes to Conservation Strategies
                      by SEQadmin2

                      Studying ecosystems means dealing with complex, multi-species communities that are hard to observe at scale. This complexity, however, hides many important questions to be answered, from how biogeochemical cycles work and how climate change can affect species distribution to how conservation strategies can work best.


                      Genomics, particularly since the expansion of NGS, has transformed ecosystem ecology. By sequencing environmental DNA, we can now assess biodiversity without direct...
                      05-06-2026, 09:04 AM

                    ad_right_rmr

                    Collapse

                    News

                    Collapse

                    Topics Statistics Last Post
                    Started by SEQadmin2, Yesterday, 12:03 PM
                    0 responses
                    19 views
                    0 reactions
                    Last Post SEQadmin2  
                    Started by SEQadmin2, Yesterday, 11:40 AM
                    0 responses
                    14 views
                    0 reactions
                    Last Post SEQadmin2  
                    Started by SEQadmin2, 05-28-2026, 11:40 AM
                    0 responses
                    29 views
                    0 reactions
                    Last Post SEQadmin2  
                    Started by SEQadmin2, 05-26-2026, 10:12 AM
                    0 responses
                    31 views
                    0 reactions
                    Last Post SEQadmin2  
                    Working...