**MaximeOfOslo** · 06-18-2015, 06:34 AM

The sorted and unsorted BAM files are the same size:

Code:

-bash-4.1$ pwd
/usit/abel/u1/maxib/1_data/1_project/1st_assembly_strategy
-bash-4.1$ du -sh *
7,0G	1_align.sam
84K	chrysanthemum_indicum_chloroplast.fasta
3,5K	chrysanthemum_indicum_chloroplast.fasta.fai
15G	contig.fa
1,8G	file.bam
1,8G	file_sorted.bam
6,5K	file_sorted.bam.bai
0	file.vcf.gz
0	out.fa
512	sam.sh
6,3G	scafseq.fa
1,5K	test.vcf.gz
0	vcffile

Running mpileup manually produces the same error:

Code:

-bash-4.1$ /usit/abel/u1/maxib/8_samtools/bin/samtools mpileup  -v -f chrysanthemum_indicum_chloroplast.fasta file_sorted.bam -o file.vcf.gz 
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000
Abandon

**GenoMax** · 06-18-2015, 06:57 AM

Do you by any chance have extremely deep coverage (> 8000x)? That is a small genome, yet the resulting BAM file is large.

**MaximeOfOslo** · 06-18-2015, 07:09 AM

Well, I'm working with WGS data to extract, for the moment, the chloroplast genome.
So I have 307 210 727 reads of mean length 151 which equals 46 696 030 504 base pairs.
The chloroplast I've mapped them to is 86444 bp.
So the coverage is around 540 188...
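For anyone who wants to redo the arithmetic, here is a rough sketch in shell. It assumes every read maps to the chloroplast, so the result is only an upper bound on the real depth. Note that with a read length of 151 the product comes out slightly lower than the total posted above, which corresponds to a length of 152; either way the depth is far beyond the 8000 cap reported by mpileup.

```shell
# Figures taken from the post above
reads=307210727      # number of reads
read_len=151         # mean read length in bp
genome_len=86444     # chloroplast reference length in bp

# Total sequenced bases (needs 64-bit shell arithmetic, the norm on 64-bit systems)
total_bases=$(( reads * read_len ))

# Upper-bound depth: assumes ALL reads map to the chloroplast
coverage=$(( total_bases / genome_len ))

echo "total bases: $total_bases"
echo "upper-bound coverage: ${coverage}x"
```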

Well, I think you found the problem! Thanks for your help. I'll randomly subsample my FASTQ files 1000-fold before alignment.
P.S. For anyone who might be interested, here is a script to do it:

Code:

# Written by  Aaronquinlan
# https://www.biostars.org/p/6544/
# Starting FASTQ files
export FQ1=1.fq
export FQ2=2.fq

# The names of the random subsets you wish to create
export FQ1SUBSET=1.rand.fq
export FQ2SUBSET=2.rand.fq

# How many random pairs do we want?
export N=100

# paste the two FASTQ such that the 
# header, seqs, seps, and quals occur "next" to one another
  paste $FQ1 $FQ2 | \
# "linearize" the two mates into a single record.  Add a random number to the front of each line
  awk 'BEGIN{srand()}; {OFS="\t"; \
                        getline seqs; getline sep; getline quals; \
                        print rand(),$0,seqs,sep,quals}' | \
# sort by the random number
  sort -k1,1 | \
# grab the first N records
  head -n $N | \
# Convert the stream back to 2 separate FASTQ files.
  awk '{OFS="\n"; \
        print $2,$4,$6,$8 >> ENVIRON["FQ1SUBSET"]; \
        print $3,$5,$7,$9 >> ENVIRON["FQ2SUBSET"]}'

**GenoMax** · 06-18-2015, 07:29 AM

540,000x

Reformat.sh from BBMap can also subsample directly from sam/bam.

This can save you some alignment time.

Code:

$ reformat.sh in=in.sam out=out.sam sample=some_number_here

**swbarnes2** · 06-18-2015, 10:02 AM

Picard tools can also randomly downsample a .bam file.
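A minimal sketch of what that could look like with Picard's DownsampleSam, which keeps or discards all reads of a template together. The target depth of 500x, the output file name, and the `PICARD_JAR` location are assumptions for illustration; the current depth is the rough figure quoted earlier in the thread. The block only prints the command so it can be checked before running.

```shell
# Assumed values: adjust to your own data
current_depth=540000   # approximate depth quoted in this thread
target_depth=500       # hypothetical target, well under mpileup's 8000 cap

# Probability of keeping each template = target / current
prob=$(awk -v c="$current_depth" -v t="$target_depth" 'BEGIN { printf "%.6f", t / c }')

# PICARD_JAR is an assumed path to the Picard jar file
PICARD_JAR=${PICARD_JAR:-picard.jar}

# Print the command for review; run it by hand once it looks right
echo "java -jar $PICARD_JAR DownsampleSam I=file_sorted.bam O=file_downsampled.bam P=$prob"
```

The downsampled BAM would still need to be indexed before running mpileup on it.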

And the cheesy way to do it yourself would be to use awk or grep to grab only the reads from a particular tile; one not on the edge would be preferable.
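A rough sketch of the tile approach with awk, assuming Illumina-style headers in which the tile is the fifth colon-separated field (e.g. `@instrument:run:flowcell:lane:tile:x:y`) and strict four-line FASTQ records. The function name `filter_tile` and the tile number in the usage comment are made up for illustration.

```shell
# Keep only the FASTQ records whose header carries the given tile number.
# $1 = tile number, $2 = FASTQ file (4 lines per record, uncompressed)
filter_tile() {
  awk -v tile="$1" '
    # Header line of each record: decide whether to keep the whole record
    NR % 4 == 1 { split($1, f, ":"); keep = (f[5] == tile) }
    # Print the header and the three lines that follow it when keep is set
    keep
  ' "$2"
}

# Usage (hypothetical tile and file names):
#   filter_tile 1210 1.fq > 1.tile.fq
#   filter_tile 1210 2.fq > 2.tile.fq
```

Since both mates of a pair come from the same tile, running the same filter on both files should keep them in sync, provided the two files list the pairs in the same order.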
