Seqanswers Leaderboard Ad



No announcement yet.
  • Filter
  • Time
  • Show
Clear All
new posts

  • Extracting a subset of bam files

    Hi everyone,

    What I have:
    A bam file with 10kb MP reads mapped back to an in house reference genome assembled in Allpaths-LG. Going to be used for visual inspection in IGV and Geneious to evaluate a possible scaffold break up suggested by linkage map.

    What I want:
    To extract a new bam file that contains the mapped reads for only one of the scaffolds in my reference genome.

    samtools view -b mybamfile.bam 'scaffold000046' > scf000046.bam

    The resulting file lists all the original scaffolds in the header, like this:

    @SQ SN:scaffold_0 LN:21965366
    @SQ SN:scaffold_1 LN:18670197
    @SQ SN:scaffold_2 LN:16657575
    @SQ SN:scaffold_3 LN:15264954
    @SQ SN:scaffold_4 LN:14680565

    I may have misunderstood the output of the above command, but I expected to find only the target scaffold in the new bam-file header.

    Any help would be much appreciated. Feel free to direct me if the solution for my problem has already been posted elsewhere. However, have not succeeded in finding one.

    Last edited by toreoe; 04-14-2014, 10:17 AM.

  • #2
    Index scf000046.bam ('samtools index scf000046.bam') and then see what 'samtools idxstats scf000046.bam' says
    Are there reads reported for all the scaffolds or just scaffold000046?


    • #3
      I extracted another, scaffold0008, from a different bam file. I then indexed scaffold00008.bam as suggested. samtools idxstats reported reads only for the target scaffold, none for the others:

      scaffold00001 21966007 0 0
      scaffold00002 18670670 0 0
      scaffold00003 16657927 0 0
      scaffold00004 15265287 0 0
      scaffold00005 14680965 0 0
      scaffold00006 14166276 0 0
      scaffold00007 14112451 0 0
      scaffold00008 13353334 2289885 55567
      scaffold00009 13059713 0 0

      scaffold00008 loaded properly in IGV, something I couldn't do with my previous extracted file. So it might be that there was something wrong with my original bam file.

      Anyways, it seems like this is simply the way the samtools view extract command outputs the subset bam, i.e. contains all scaffolds from original bam in header. And it works fine.

      Thank you very much for your help, jeales.


      • #4
        Thats good, if there were reads from other scaffolds an awful lot of my work would be wrong as well!

        I don't have a good solution for trimming the header to only include scaffolds/chromosomes that have reads present in the bam, other than manual editing

        But it's not realyl doing any harm as long as the counts are zero


        • #5
          The header is left as-is for technical reasons. In BAM files (and internally by samtools and likely picard too) the chromosome/contig field of an alignment is just a numeric index into this list in the header. Trimming the header, then would result in needing to alter every read as it was output, which would be annoying (this is also why people need to be careful when they reheader BAM files, since changing the order of the contigs/chromosomes in the header will make all of the alignments wrong), though I guess not that hard to implement (it would affect performance, though).


          • #6
            Very informative thanks
            So if all the alignments in a bam are mapped to the 10th chromosome/contig in the header their chromosome/contig will be 10 (or probably 9, if you start counting at 0), irrespective of chromosome name?


            • #7
              Exactly. As you surmised, everything is 0-based internally, so the value is 9. For those curious, the value is unsigned, so a value of -1 used for unmapped reads.


              Latest Articles


              • seqadmin
                Exploring the Dynamics of the Tumor Microenvironment
                by seqadmin

                The complexity of cancer is clearly demonstrated in the diverse ecosystem of the tumor microenvironment (TME). The TME is made up of numerous cell types and its development begins with the changes that happen during oncogenesis. “Genomic mutations, copy number changes, epigenetic alterations, and alternative gene expression occur to varying degrees within the affected tumor cells,” explained Andrea O’Hara, Ph.D., Strategic Technical Specialist at Azenta. “As...
                07-08-2024, 03:19 PM
              • seqadmin
                Exploring Human Diversity Through Large-Scale Omics
                by seqadmin

                In 2003, researchers from the Human Genome Project (HGP) announced the most comprehensive genome to date1. Although the genome wasn’t fully completed until nearly 20 years later2, numerous large-scale projects, such as the International HapMap Project and 1000 Genomes Project, continued the HGP's work, capturing extensive variation and genomic diversity within humans. Recently, newer initiatives have significantly increased in scale and expanded beyond genomics, offering a more detailed...
                06-25-2024, 06:43 AM





              Topics Statistics Last Post
              Started by seqadmin, Yesterday, 05:49 AM
              0 responses
              Last Post seqadmin  
              Started by seqadmin, 07-15-2024, 06:53 AM
              0 responses
              Last Post seqadmin  
              Started by seqadmin, 07-10-2024, 07:30 AM
              0 responses
              Last Post seqadmin  
              Started by seqadmin, 07-03-2024, 09:45 AM
              0 responses
              Last Post seqadmin