Seqanswers Leaderboard Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • jazz
    Member
    • Nov 2008
    • 28

    Split SAM/BAM files into thousands of contigs

    Hi,

    I am new to bioinformatics and programming, so this might be too easy to tackle for some of you. I am working on a newly sequenced genome with thousands of contigs. I have Illumina sequence data that I have mapped using bowtie and BWA. When I try to import that file into my genome viewer, Geneious, it crashes. So, I want to split my BAM files into many, so the program can handle them easily. My question is if there is an easy way to split bam files by each of the contigs. So, in effect, it should split one BAM file into thousands of mini-files, each corresponding to a particular contig. I am familiar with samtools view (e.g. samtools view in.bam chr1 > chr1.bam) and have used it to make it for individual cases. Is there a better way to automate that?
    Any help much appreciated.
  • swbarnes2
    Senior Member
    • May 2008
    • 910

    #2
    No; I'd just write my own script that will read the line of the .bam, and then write it to the proper file for its contig.

    Comment

    • vivek_
      PhD Student
      • Jul 2012
      • 164

      #3
      A unix oneliner should work right?

      Code:
      for i in {1..22};do samtools view -bh input.bam chr$i > chr$i.bam;done
      Last edited by vivek_; 06-05-2013, 01:16 PM.

      Comment

      • dpryan
        Devon Ryan
        • Jul 2011
        • 3478

        #4
        Originally posted by vivek_ View Post
        A unix oneliner should work right?

        Code:
        for i in {1..22};do samtools view -bh input.bam chr$i > chr$i.bam;done
        Depends, I've seen examples where the contigs weren't sequentially numbered (presumably due to some contigs becoming merged as latter data came in)

        Also, for a file with a large number of contigs (have a look at some of the mouse lines from the Sanger Institute), looping over the whole file many many times will get super slow. You could probably write a script to process the whole thing in one go in a fraction of the time.

        Comment

        • vivek_
          PhD Student
          • Jul 2012
          • 164

          #5
          That's why you have the BAM index right, so you are not reading the entire file to export each coordinate?

          for the sequentiality issue, you can extract the contig names from the bam header into a file and loop over them:

          Code:
          samtools view -H input.bam | awk '{print $2}' | awk '{gsub(/SN\:/,""); print}'  > contigs.txt

          Comment

          • syfo
            Just a member
            • Nov 2012
            • 103

            #6
            Originally posted by vivek_ View Post
            That's why you have the BAM index right, so you are not reading the entire file to export each coordinate?

            for the sequentiality issue, you can extract the contig names from the bam header into a file and loop over them:

            Code:
            samtools view -H input.bam | awk '{print $2}' | awk '{gsub(/SN\:/,""); print}'  > contigs.txt
            Watch out, you don't want the first ("@HD ...") nor the last ("@PG ...") line of the header.

            Try this instead:
            Code:
            samtools view -H all.bam | sed '1d;s/.*SN:\(.*\)\t.*/\1/;$d' > contigs.list
            Or, if you prefer awk:
            Code:
            samtools view -H all.bam | awk '/^@SQ/{gsub(/SN\:/,"");print $2}' > contigs.list
            or even (just for fun):
            Code:
            samtools idxstats all.bam | cut -f1 > contigs.list
            All those should give you the same list of contigs.

            Then,
            Code:
            for c in `cat contigs.list` ; do
            echo processing $c
            samtools view -bh all.bam $c > $c.bam
            done
            But I agree it might take a while...

            Comment

            • jazz
              Member
              • Nov 2008
              • 28

              #7
              Thanks everyone. I will give these suggestions a try and let you know how it went.

              Comment

              • jjlaisnoopy
                Junior Member
                • Sep 2011
                • 6

                #8
                This is good method to split bam file. But I got a question.
                The following is idxstats of a bam file:
                chrM 16571 2073252 32042
                chr1 249250621 115733016 1937746
                chr2 243199373 104133908 2244387
                chr3 198022430 96577573 1501432
                chr4 191154276 89582368 1825761
                chr5 180915260 94818025 1486923
                chr6 171115067 84533173 1273600
                chr7 159138663 71186849 1531851
                chr8 146364022 65630236 1315785
                chr9 141213431 59368028 1184324
                chr10 135534747 63503839 2018103
                chr11 135006516 59963670 1373030
                chr12 133851895 63898721 1180836
                chr13 115169878 41939790 616704
                chr14 107349540 43647215 758336
                chr15 102531392 39227879 624791
                chr16 90354753 42298502 991456
                chr17 81195210 49043800 916042
                chr18 78077248 75701725 1614444
                chr19 59128983 26119207 755016
                chr20 63025520 32668117 644231
                chr21 48129895 19226969 547857
                chr22 51304566 15797809 277926
                chrX 155270560 74715396 1365662
                chrY 59373566 2021162 642042
                * 0 0 42640654

                How could I get the "*" 42640654 reads, those were not mapped to any contigs ?

                Comment

                • GenoMax
                  Senior Member
                  • Feb 2008
                  • 7142

                  #9
                  Originally posted by jjlaisnoopy View Post
                  This is good method to split bam file. But I got a question.

                  * 0 0 42640654

                  How could I get the "*" 42640654 reads, those were not mapped to any contigs ?

                  Comment

                  • jjlaisnoopy
                    Junior Member
                    • Sep 2011
                    • 6

                    #10
                    Originally posted by GenoMax View Post
                    I tried the parameter: -f 4
                    And then index the result bam file
                    Here is the idxstats from it:

                    chrM 16571 0 32042
                    chr1 249250621 0 1937746
                    chr2 243199373 0 2244387
                    chr3 198022430 0 1501432
                    chr4 191154276 0 1825761
                    chr5 180915260 0 1486923
                    chr6 171115067 0 1273600
                    chr7 159138663 0 1531851
                    chr8 146364022 0 1315785
                    chr9 141213431 0 1184324
                    chr10 135534747 0 2018103
                    chr11 135006516 0 1373030
                    chr12 133851895 0 1180836
                    chr13 115169878 0 616704
                    chr14 107349540 0 758336
                    chr15 102531392 0 624791
                    chr16 90354753 0 991456
                    chr17 81195210 0 916042
                    chr18 78077248 0 1614444
                    chr19 59128983 0 755016
                    chr20 63025520 0 644231
                    chr21 48129895 0 547857
                    chr22 51304566 0 277926
                    chrX 155270560 0 1365662
                    chrY 59373566 0 642042
                    * 0 0 42640654

                    Any suggestions ?

                    Comment

                    • GenoMax
                      Senior Member
                      • Feb 2008
                      • 7142

                      #11
                      Are you asking about why the number is not adding up to 42640654? See possible explanation here: https://www.biostars.org/p/18949/
                      Last edited by GenoMax; 02-10-2015, 07:05 PM.

                      Comment

                      • jjlaisnoopy
                        Junior Member
                        • Sep 2011
                        • 6

                        #12
                        All I want is the Unmapped reads with no mate or an unmapped mate are assigned to chrom "*" , not include unmapped mate reads which assigned to chr1, chr2, ...
                        just reads assigned to
                        "*" 0 0 42640654

                        Comment

                        • dpryan
                          Devon Ryan
                          • Jul 2011
                          • 3478

                          #13
                          Just write a quick little script in python with pysam to do this. There isn't always a premade program to do everything.

                          Comment

                          • GenoMax
                            Senior Member
                            • Feb 2008
                            • 7142

                            #14
                            A command line solution. See if this works:

                            Code:
                            $ samtools view -h file.bam | awk -F'\t' '{OFS = "\n"; ORS = "\n";}{ if ($3 == "*" ) print "@"$1,$10,"+",$11}' > outfile.fastq
                            Last edited by GenoMax; 02-11-2015, 06:38 AM. Reason: correction

                            Comment

                            • sarvidsson
                              Senior Member
                              • Jan 2015
                              • 137

                              #15
                              Originally posted by GenoMax View Post
                              A command line solution. See if this works:

                              Code:
                              $ samtools view -h file.bam | awk -F'\t' '{OFS = "\n"; ORS = "\n";}{ if ($3 == "*" ) print "@"$1,$10,"+",$11}' > outfile.bam
                              GenoMax, you probably mean to call the output file "outfile.fastq", right?
                              Code:
                              $ samtools view -h file.bam | awk -F'\t' '{OFS = "\n"; ORS = "\n";}{ if ($3 == "*" ) print "@"$1,$10,"+",$11}' > outfile.[B]fastq[/B]

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Pathogen Surveillance with Advanced Genomic Tools
                                by seqadmin




                                The COVID-19 pandemic highlighted the need for proactive pathogen surveillance systems. As ongoing threats like avian influenza and newly emerging infections continue to pose risks, researchers are working to improve how quickly and accurately pathogens can be identified and tracked. In a recent SEQanswers webinar, two experts discussed how next-generation sequencing (NGS) and machine learning are shaping efforts to monitor viral variation and trace the origins of infectious...
                                03-24-2025, 11:48 AM
                              • seqadmin
                                New Genomics Tools and Methods Shared at AGBT 2025
                                by seqadmin


                                This year’s Advances in Genome Biology and Technology (AGBT) General Meeting commemorated the 25th anniversary of the event at its original venue on Marco Island, Florida. While this year’s event didn’t include high-profile musical performances, the industry announcements and cutting-edge research still drew the attention of leading scientists.

                                The Headliner
                                The biggest announcement was Roche stepping back into the sequencing platform market. In the years since...
                                03-03-2025, 01:39 PM
                              • seqadmin
                                Investigating the Gut Microbiome Through Diet and Spatial Biology
                                by seqadmin




                                The human gut contains trillions of microorganisms that impact digestion, immune functions, and overall health1. Despite major breakthroughs, we’re only beginning to understand the full extent of the microbiome’s influence on health and disease. Advances in next-generation sequencing and spatial biology have opened new windows into this complex environment, yet many questions remain. This article highlights two recent studies exploring how diet influences microbial...
                                02-24-2025, 06:31 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 03-20-2025, 05:03 AM
                              0 responses
                              41 views
                              0 reactions
                              Last Post seqadmin  
                              Started by seqadmin, 03-19-2025, 07:27 AM
                              0 responses
                              44 views
                              0 reactions
                              Last Post seqadmin  
                              Started by seqadmin, 03-18-2025, 12:50 PM
                              0 responses
                              35 views
                              0 reactions
                              Last Post seqadmin  
                              Started by seqadmin, 03-03-2025, 01:15 PM
                              0 responses
                              191 views
                              0 reactions
                              Last Post seqadmin  
                              Working...