
**swbarnes2** · 06-05-2013, 12:16 PM

No; I'd just write my own script that reads each record of the .bam and writes it to the proper file for its contig.

**vivek_** · 06-05-2013, 01:13 PM

A unix oneliner should work right?

Code:

for i in {1..22};do samtools view -bh input.bam chr$i > chr$i.bam;done

**dpryan** · 06-05-2013, 01:29 PM

Originally posted by vivek_ View Post

A unix oneliner should work right?

Code:

for i in {1..22};do samtools view -bh input.bam chr$i > chr$i.bam;done

Depends. I've seen examples where the contigs weren't sequentially numbered (presumably because some contigs were merged as later data came in).

Also, for a file with a large number of contigs (have a look at some of the mouse lines from the Sanger Institute), looping over the whole file many many times will get super slow. You could probably write a script to process the whole thing in one go in a fraction of the time.
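A single-pass splitter along those lines could look roughly like the sketch below. It is an illustration only, under a few assumptions: the input is SAM text (for example piped from `samtools view -h input.bam`), contig names are safe to use as file names, and the input is usually coordinate-sorted so that each contig's records are contiguous. The function name `split_sam_stream` and the `unmapped.sam` file name are made up for this example.

```python
import os


def split_sam_stream(lines, outdir):
    """Write each SAM record to <outdir>/<contig>.sam in a single pass.

    Only one output file is open at a time, so thousands of contigs
    do not exhaust file handles. Returns the sorted list of contigs seen.
    """
    header = []      # header lines, copied to the top of every output file
    seen = set()     # contigs that already have an output file
    current = None   # contig whose file is open right now
    out = None
    try:
        for line in lines:
            if line.startswith("@"):
                header.append(line)
                continue
            contig = line.split("\t", 3)[2]   # column 3 = RNAME
            if contig != current:
                if out is not None:
                    out.close()
                # Hypothetical naming: records with RNAME "*" go to unmapped.sam
                name = "unmapped" if contig == "*" else contig
                path = os.path.join(outdir, name + ".sam")
                if contig in seen:
                    out = open(path, "a")     # unsorted input: reopen and append
                else:
                    out = open(path, "w")
                    out.writelines(header)
                    seen.add(contig)
                current = contig
            out.write(line)
    finally:
        if out is not None:
            out.close()
    return sorted(seen)
```

In a script fed by `samtools view -h input.bam`, calling `split_sam_stream(sys.stdin, "out")` would read the file once; each resulting .sam could then be converted back with `samtools view -b`. Keeping a single handle open relies on sorted input for speed, but the append branch keeps the output correct if the input is not sorted.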

**vivek_** · 06-05-2013, 01:37 PM

That's why you have the BAM index right, so you are not reading the entire file to export each coordinate?

For the sequentiality issue, you can extract the contig names from the BAM header into a file and loop over them:

Code:

samtools view -H input.bam | awk '{print $2}' | awk '{gsub(/SN\:/,""); print}'  > contigs.txt

**syfo** · 06-06-2013, 08:23 AM

Originally posted by vivek_ View Post

That's why you have the BAM index right, so you are not reading the entire file to export each coordinate?

For the sequentiality issue, you can extract the contig names from the BAM header into a file and loop over them:

Code:

samtools view -H input.bam | awk '{print $2}' | awk '{gsub(/SN\:/,""); print}'  > contigs.txt

Watch out, you don't want the first ("@HD ...") nor the last ("@PG ...") line of the header.

Try this instead:

Code:

samtools view -H all.bam | sed '1d;s/.*SN:\(.*\)\t.*/\1/;$d' > contigs.list

Or, if you prefer awk:

Code:

samtools view -H all.bam | awk '/^@SQ/{gsub(/SN\:/,"");print $2}' > contigs.list

or even (just for fun):

Code:

samtools idxstats all.bam | cut -f1 | grep -v '^\*$' > contigs.list

All those should give you the same list of contigs.

Then,

Code:

for c in `cat contigs.list` ; do
echo processing $c
samtools view -bh all.bam $c > $c.bam
done

But I agree it might take a while...
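For anyone who would rather do the header-parsing step in a script, here is a minimal Python sketch of the same idea as the awk version: keep only `@SQ` lines and pull out the `SN:` value. It assumes the header arrives as text lines (e.g. from `samtools view -H all.bam`); the function name `contig_names` is just for illustration. Unlike the `1d;$d` trick, it does not depend on how many `@HD`, `@RG` or `@PG` lines the header has.

```python
def contig_names(header_lines):
    """Return the SN value of every @SQ header line, in file order."""
    names = []
    for line in header_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] != "@SQ":
            continue                      # skips @HD, @RG, @PG, @CO, ...
        for field in fields[1:]:
            if field.startswith("SN:"):
                names.append(field[3:])   # drop the "SN:" prefix
                break
    return names


header = [
    "@HD\tVN:1.0\tSO:coordinate\n",
    "@SQ\tSN:chrM\tLN:16571\n",
    "@SQ\tSN:chr1\tLN:249250621\n",
    "@PG\tID:bwa\tPN:bwa\n",
    "@PG\tID:samtools\tPN:samtools\n",
]
print(contig_names(header))
```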

**jazz** · 06-06-2013, 10:59 AM

Thanks everyone. I will give these suggestions a try and let you know how it went.

**jjlaisnoopy** · 02-09-2015, 06:09 PM

This is a good method to split a BAM file, but I have a question.
The following is the idxstats output of a BAM file:
chrM 16571 2073252 32042
chr1 249250621 115733016 1937746
chr2 243199373 104133908 2244387
chr3 198022430 96577573 1501432
chr4 191154276 89582368 1825761
chr5 180915260 94818025 1486923
chr6 171115067 84533173 1273600
chr7 159138663 71186849 1531851
chr8 146364022 65630236 1315785
chr9 141213431 59368028 1184324
chr10 135534747 63503839 2018103
chr11 135006516 59963670 1373030
chr12 133851895 63898721 1180836
chr13 115169878 41939790 616704
chr14 107349540 43647215 758336
chr15 102531392 39227879 624791
chr16 90354753 42298502 991456
chr17 81195210 49043800 916042
chr18 78077248 75701725 1614444
chr19 59128983 26119207 755016
chr20 63025520 32668117 644231
chr21 48129895 19226969 547857
chr22 51304566 15797809 277926
chrX 155270560 74715396 1365662
chrY 59373566 2021162 642042
* 0 0 42640654

How can I get the 42640654 "*" reads, i.e. those that were not mapped to any contig?

**GenoMax** · 02-09-2015, 06:35 PM

Originally posted by jjlaisnoopy View Post

This is a good method to split a BAM file, but I have a question.

* 0 0 42640654

How can I get the 42640654 "*" reads, i.e. those that were not mapped to any contig?

How To Filter Mapped Reads With Samtools

https://www.biostars.org/p/56246/

**jjlaisnoopy** · 02-10-2015, 05:50 PM

Originally posted by GenoMax View Post

https://www.biostars.org/p/56246/

I tried the parameter -f 4 and then indexed the resulting BAM file.
Here is the idxstats output from it:

chrM 16571 0 32042
chr1 249250621 0 1937746
chr2 243199373 0 2244387
chr3 198022430 0 1501432
chr4 191154276 0 1825761
chr5 180915260 0 1486923
chr6 171115067 0 1273600
chr7 159138663 0 1531851
chr8 146364022 0 1315785
chr9 141213431 0 1184324
chr10 135534747 0 2018103
chr11 135006516 0 1373030
chr12 133851895 0 1180836
chr13 115169878 0 616704
chr14 107349540 0 758336
chr15 102531392 0 624791
chr16 90354753 0 991456
chr17 81195210 0 916042
chr18 78077248 0 1614444
chr19 59128983 0 755016
chr20 63025520 0 644231
chr21 48129895 0 547857
chr22 51304566 0 277926
chrX 155270560 0 1365662
chrY 59373566 0 642042
* 0 0 42640654

Any suggestions?

**GenoMax** · 02-10-2015, 06:55 PM

Are you asking about why the number is not adding up to 42640654? See possible explanation here: https://www.biostars.org/p/18949/
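To make the idxstats numbers easier to read: the four columns are reference name, reference length, mapped reads and unmapped reads. The unmapped count on a chromosome row covers reads flagged as unmapped that still carry a chromosome and position (as far as I understand, typically those of a mapped mate), while the "*" row counts reads with no reference at all. The sketch below separates the two; the function name `summarize_idxstats` is made up for the example, and it splits on any whitespace, so it accepts the output as pasted in this thread.

```python
def summarize_idxstats(text):
    """Return (placed_unmapped, unplaced) read counts from idxstats output.

    placed_unmapped: unmapped reads listed under a named reference
    unplaced:        reads listed under "*" (no reference at all)
    """
    placed_unmapped = 0
    unplaced = 0
    for row in text.splitlines():
        cols = row.split()
        if len(cols) != 4:
            continue                      # ignore blank or malformed rows
        name, _length, _mapped, unmapped = cols
        if name == "*":
            unplaced += int(unmapped)
        else:
            placed_unmapped += int(unmapped)
    return placed_unmapped, unplaced


# Three rows copied from the idxstats output posted above
sample = """chrM 16571 0 32042
chrY 59373566 0 642042
* 0 0 42640654"""
print(summarize_idxstats(sample))   # (674084, 42640654)
```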

**jjlaisnoopy** · 02-10-2015, 07:54 PM

All I want are the unmapped reads with no mate or with an unmapped mate, which are assigned to chrom "*". I don't want the unmapped reads that are assigned to chr1, chr2, ...
Just the reads counted in:
"*" 0 0 42640654

**dpryan** · 02-11-2015, 12:31 AM

Just write a quick little script in python with pysam to do this. There isn't always a premade program to do everything.
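A rough idea of what such a script has to do, sketched here without pysam so that it runs on the standard library alone: it filters SAM text (for example from `samtools view -h file.bam`) and keeps the header plus every record whose reference name, column 3, is "*". This is a stand-in for the pysam approach, not the pysam API itself, and the function name `unplaced_records` is hypothetical.

```python
def unplaced_records(lines):
    """Yield header lines plus records whose RNAME (column 3) is "*"."""
    for line in lines:
        if line.startswith("@"):
            yield line                           # keep the header intact
        elif line.split("\t", 3)[2] == "*":
            yield line                           # read has no reference


sam = [
    "@SQ\tSN:chr1\tLN:249250621\n",
    "r1\t73\tchr1\t100\t60\t4M\t=\t100\t0\tACGT\tIIII\n",
    "r1\t133\tchr1\t100\t0\t*\t=\t100\t0\tTTTT\tIIII\n",   # unmapped, placed with its mate
    "r2\t77\t*\t0\t0\t*\t*\t0\t0\tGGGG\tIIII\n",           # unmapped, no reference
]
kept = list(unplaced_records(sam))
print(len(kept))
```

Note that the second `r1` record is flagged unmapped but sits on chr1, so it is dropped here, which is the distinction asked about above.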

**GenoMax** · 02-11-2015, 06:34 AM

A command line solution. See if this works:

Code:

$ samtools view -h file.bam | awk -F'\t' '{OFS = "\n"; ORS = "\n";}{ if ($3 == "*" ) print "@"$1,$10,"+",$11}' > outfile.fastq

**sarvidsson** · 02-11-2015, 06:37 AM

Originally posted by GenoMax View Post

A command line solution. See if this works:

Code:

$ samtools view -h file.bam | awk -F'\t' '{OFS = "\n"; ORS = "\n";}{ if ($3 == "*" ) print "@"$1,$10,"+",$11}' > outfile.bam

GenoMax, you probably mean to call the output file "outfile.fastq", right?

Code:

$ samtools view -h file.bam | awk -F'\t' '{OFS = "\n"; ORS = "\n";}{ if ($3 == "*" ) print "@"$1,$10,"+",$11}' > outfile.fastq
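For reference, the same conversion spelled out in Python, as a sketch rather than a replacement for the one-liner: take SAM text, keep records whose column 3 is "*", and emit the read name, sequence and qualities (columns 1, 10 and 11) as a four-line FASTQ entry. The function names are made up for this example. Like the awk command, it does not mark which mate a read is, so both mates of a pair end up with the same name.

```python
def sam_record_to_fastq(line):
    """Convert one SAM record (a tab-separated line) to a FASTQ entry."""
    fields = line.rstrip("\n").split("\t")
    qname, seq, qual = fields[0], fields[9], fields[10]
    return "@{}\n{}\n+\n{}\n".format(qname, seq, qual)


def unplaced_to_fastq(lines):
    """Yield FASTQ entries for records whose RNAME (column 3) is "*"."""
    for line in lines:
        if line.startswith("@"):
            continue                             # skip SAM header lines
        if line.split("\t", 3)[2] == "*":
            yield sam_record_to_fastq(line)


records = [
    "@SQ\tSN:chr1\tLN:249250621\n",
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\n",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tGGCA\tFFFF\n",
]
print("".join(unplaced_to_fastq(records)), end="")
```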

Split SAM/BAM files into thousands of contigs
