  • MaximeOfOslo
    Junior Member
    • Jun 2015
    • 6

    #16
    The sorted and unsorted BAM files are the same size:

    Code:
    -bash-4.1$ pwd
    /usit/abel/u1/maxib/1_data/1_project/1st_assembly_strategy
    -bash-4.1$ du -sh *
    7,0G	1_align.sam
    84K	chrysanthemum_indicum_chloroplast.fasta
    3,5K	chrysanthemum_indicum_chloroplast.fasta.fai
    15G	contig.fa
    1,8G	file.bam
    1,8G	file_sorted.bam
    6,5K	file_sorted.bam.bai
    0	file.vcf.gz
    0	out.fa
    512	sam.sh
    6,3G	scafseq.fa
    1,5K	test.vcf.gz
    0	vcffile
    Running mpileup manually produces the same error:

    Code:
    -bash-4.1$ /usit/abel/u1/maxib/8_samtools/bin/samtools mpileup  -v -f chrysanthemum_indicum_chloroplast.fasta file_sorted.bam -o file.vcf.gz 
    [mpileup] 1 samples in 1 input files
    <mpileup> Set max per-file depth to 8000
    Abandon

    • GenoMax
      Senior Member
      • Feb 2008
      • 7142

      #17
      Do you by any chance have extremely deep coverage (> 8000)? That is a small genome, yet the resulting BAM file is large.

      • MaximeOfOslo
        Junior Member
        • Jun 2015
        • 6

        #18
        Well, I'm working with WGS data to extract, for the moment, the chloroplast genome.
        So I have 307 210 727 reads of mean length 151 which equals 46 696 030 504 base pairs.
        The chloroplast I've mapped them to is 86444 bp.
        So the coverage is around 540 188...
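For anyone who wants to redo this arithmetic, here is a minimal sketch of the naive estimate (total sequenced bases divided by reference length). The function name `naive_coverage` is made up for illustration. The result is only an upper bound, because it assumes every read maps to the chloroplast reference, which will not hold for WGS data.

```shell
#!/usr/bin/env bash
# Naive coverage estimate: (reads * mean read length) / reference length.
# Upper bound only: it assumes every read maps to the reference.
# naive_coverage is a hypothetical helper name, not part of any tool.
naive_coverage() {
  local reads=$1 read_len=$2 ref_len=$3
  # Integer arithmetic; the product fits in bash's 64-bit integers.
  echo $(( reads * read_len / ref_len ))
}

# Figures quoted in the post above
naive_coverage 307210727 151 86444
```

With a read length of 151 this prints 536634. The 46 696 030 504 bp total quoted above corresponds to 152 bp per read, which gives 540188. Either way the estimate is far above the 8000 per-file depth limit that mpileup reports.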

        Well, I think you found the problem! Thanks for your help. I'll randomly subsample my FASTQ files 1000-fold before alignment.
        PS: for anyone who might be interested, here is a script to do it:

        Code:
        # Written by  Aaronquinlan
        # https://www.biostars.org/p/6544/
        # Starting FASTQ files
        export FQ1=1.fq
        export FQ2=2.fq
        
        # The names of the random subsets you wish to create
        export FQ1SUBSET=1.rand.fq
        export FQ2SUBSET=2.rand.fq
        
        # How many random pairs do we want?
        export N=100
        
        # paste the two FASTQ files such that the
        # headers, seqs, seps, and quals occur "next" to one another
          paste "$FQ1" "$FQ2" | \
        # "linearize" the two mates into a single record.  Add a random number to the front of each record
          awk 'BEGIN{srand()}; {OFS="\t"; \
                                getline seqs; getline sep; getline quals; \
                                print rand(),$0,seqs,sep,quals}' | \
        # sort by the random number
          sort -k1,1 | \
        # grab the first N records
          head -n "$N" | \
        # Convert the stream back to 2 separate FASTQ files.
        # Split on tabs only, so read headers that contain spaces stay intact.
          awk -F'\t' '{OFS="\n"; \
                print $2,$4,$6,$8 >> ENVIRON["FQ1SUBSET"]; \
                print $3,$5,$7,$9 >> ENVIRON["FQ2SUBSET"]}'
        Last edited by MaximeOfOslo; 06-18-2015, 07:15 AM.
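A quick sanity check on the two subset files can catch a broken pairing before alignment. The sketch below only compares line counts and checks that they are a multiple of 4 (one FASTQ record is 4 lines); it does not compare read names. The function name `check_fastq_pair` is hypothetical.

```shell
#!/usr/bin/env bash
# Sanity check for two paired FASTQ files: both must have the same number
# of lines, and that number must be a multiple of 4.
# check_fastq_pair is a hypothetical helper name used for illustration.
check_fastq_pair() {
  local n1 n2
  # $(( ... )) strips any padding that wc -l adds on some systems
  n1=$(( $(wc -l < "$1") ))
  n2=$(( $(wc -l < "$2") ))
  if [ "$n1" -ne "$n2" ] || [ $(( n1 % 4 )) -ne 0 ]; then
    echo "mismatch: $1 has $n1 lines, $2 has $n2 lines" >&2
    return 1
  fi
  echo "$(( n1 / 4 )) pairs"
}

# Usage, assuming the file names from the script above:
#   check_fastq_pair 1.rand.fq 2.rand.fq
```

Because the script above appends with `>>`, rerunning it without first deleting the subset files adds another N pairs to each file, so the reported pair count is also a cheap way to notice a stale output file.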

        • GenoMax
          Senior Member
          • Feb 2008
          • 7142

          #19
          540,000x

          Reformat.sh from BBMap can also subsample directly from sam/bam.

          This can save you some alignment time.
          Code:
          $ reformat.sh in=in.sam out=out.sam sample=some_number_here
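To illustrate what subsampling a SAM stream involves, here is a rough awk sketch. It is not how reformat.sh works internally, and the function name `subsample_sam` and its arguments are made up for this example. It keeps every header line and keeps each read name with probability `frac`, caching the decision per name so that both mates of a pair are kept or dropped together.

```shell
#!/usr/bin/env bash
# Subsample a SAM stream read from stdin.
#   $1 = fraction of read names to keep (0..1)
#   $2 = random seed (optional, defaults to 1)
# subsample_sam is a hypothetical helper, not part of BBMap or samtools.
subsample_sam() {
  local frac=$1 seed=${2:-1}
  awk -v frac="$frac" -v seed="$seed" '
    BEGIN { srand(seed); FS = "\t" }
    /^@/  { print; next }                            # always keep header lines
    !($1 in keep) { keep[$1] = (rand() < frac) }     # decide once per read name
    keep[$1]                                         # print the line if its name was kept
  '
}

# Usage sketch: subsample_sam 0.001 < in.sam > out.sam
```

The per-name cache grows with the number of distinct reads, so for a file this size a dedicated tool is the more practical choice; the sketch is only meant to show the idea.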

          • swbarnes2
            Senior Member
            • May 2008
            • 910

            #20
            Picard tools can also randomly downsample a .bam file.

            The cheesy way to do it yourself would be to use awk or grep to grab only the reads from a particular tile, preferably one that is not on the edge.
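A minimal sketch of the awk variant, assuming Illumina-style read names of the form instrument:run:flowcell:lane:tile:x:y, so that the tile is the fifth colon-separated field. The function name `reads_from_tile` and the tile number in the usage line are hypothetical; which tiles sit on the edge depends on the flow cell.

```shell
#!/usr/bin/env bash
# Keep SAM header lines plus alignments whose read name carries the given tile.
# Assumes read names like instrument:run:flowcell:lane:tile:x:y
# reads_from_tile is a hypothetical helper name used for illustration.
reads_from_tile() {
  local tile=$1
  awk -v tile="$tile" -F'\t' '
    /^@/ { print; next }                      # header lines pass through
    { split($1, name, ":")                    # $1 is the read name (QNAME)
      if (name[5] == tile) print }
  '
}

# Usage sketch (2108 is an arbitrary example tile):
#   reads_from_tile 2108 < in.sam > tile_2108.sam
```

This is not a random sample, since reads from one tile share whatever quality characteristics that tile had, but it is fast and keeps mates together because both mates carry the same read name.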
