  • #16
    The sorted and unsorted BAM files are the same size:

    Code:
    -bash-4.1$ pwd
    /usit/abel/u1/maxib/1_data/1_project/1st_assembly_strategy
    -bash-4.1$ du -sh *
    7,0G	1_align.sam
    84K	chrysanthemum_indicum_chloroplast.fasta
    3,5K	chrysanthemum_indicum_chloroplast.fasta.fai
    15G	contig.fa
    1,8G	file.bam
    1,8G	file_sorted.bam
    6,5K	file_sorted.bam.bai
    0	file.vcf.gz
    0	out.fa
    512	sam.sh
    6,3G	scafseq.fa
    1,5K	test.vcf.gz
    0	vcffile
    Running mpileup manually produces the same error:

    Code:
    -bash-4.1$ /usit/abel/u1/maxib/8_samtools/bin/samtools mpileup  -v -f chrysanthemum_indicum_chloroplast.fasta file_sorted.bam -o file.vcf.gz 
    [mpileup] 1 samples in 1 input files
    <mpileup> Set max per-file depth to 8000
    Abandon


    • #17
      Do you by any chance have extremely deep coverage (> 8000x)? That is a small genome, yet the resulting BAM file is large.


      • #18
        Well, I'm working with WGS data and, for the moment, I only want to extract the chloroplast genome.
        I have 307,210,727 reads with a mean length of about 151 bp, which comes to 46,696,030,504 base pairs.
        The chloroplast I mapped them to is 86,444 bp.
        So the coverage is around 540,188x...
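
The arithmetic above can be checked with a small shell sketch. The function name `mean_coverage` is made up for illustration, and the estimate assumes every sequenced base maps to the reference, so it is an upper bound rather than the real depth.

```shell
#!/usr/bin/env bash
# Mean depth = total sequenced bases / reference length (integer division).
# Assumption: all reads map to the reference, which overstates the real depth.
mean_coverage() {
    local total_bases=$1 genome_len=$2
    echo $(( total_bases / genome_len ))
}

# Figures quoted in this post: total base pairs and chloroplast length.
mean_coverage 46696030504 86444   # prints 540188
```

Dividing that figure by 1000 leaves roughly 540x, well under the 8000 per-file depth that mpileup reports.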

        Well, I think you found the problem! Thanks for your help. I'll randomly subsample my FASTQ files 1000-fold before alignment.
        PS: for anyone interested, here is a script to do it:

        Code:
        # Written by  Aaronquinlan
        # https://www.biostars.org/p/6544/
        # Starting FASTQ files
        export FQ1=1.fq
        export FQ2=2.fq
        
        # The names of the random subsets you wish to create
        export FQ1SUBSET=1.rand.fq
        export FQ2SUBSET=2.rand.fq
        
        # How many random pairs do we want?
        export N=100
        
        # paste the two FASTQ such that the 
        # header, seqs, seps, and quals occur "next" to one another
          paste $FQ1 $FQ2 | \
        # "linearize" the two mates into a single record.  Add a random number to the front of each line
          awk 'BEGIN{srand()}; {OFS="\t"; \
                                getline seqs; getline sep; getline quals; \
                                print rand(),$0,seqs,sep,quals}' | \
        # sort by the random number
          sort -k1,1 | \
        # grab the first N records
          head -n $N | \
        # Convert the stream back to 2 separate FASTQ files.
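          # Split on tabs only so headers containing spaces stay intact;
          # ">" starts each subset file fresh instead of appending to an old one.
          awk -F'\t' '{OFS="\n"; \
                print $2,$4,$6,$8 > ENVIRON["FQ1SUBSET"]; \
                print $3,$5,$7,$9 > ENVIRON["FQ2SUBSET"]}'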
        Last edited by MaximeOfOslo; 06-18-2015, 07:15 AM.


        • #19
          540,000x

          Reformat.sh from BBMap can also subsample directly from sam/bam.

          This can save you some alignment time.
          Code:
          $ reformat.sh in=in.sam out=out.sam sample=some_number_here


          • #20
            Picard tools can also randomly downsample a .bam file.

            And the cheesy way to do it yourself would be to use awk or grep to grab only the reads from a particular tile; one not on the edge would be preferable.
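
A rough sketch of the tile approach with awk follows. It assumes Illumina-style read names of the form `instrument:run:flowcell:lane:tile:x:y`, so the tile is the fifth colon-separated field of the read name. The function name `filter_tile` is hypothetical, and other naming schemes would need a different field.

```shell
#!/usr/bin/env bash
# Keep SAM header lines plus the records whose read name carries the given tile.
# Assumption: QNAME looks like instrument:run:flowcell:lane:tile:x:y.
filter_tile() {
    local tile=$1
    awk -F'\t' -v tile="$tile" '
        /^@/ { print; next }                            # pass header lines through
        { split($1, q, ":"); if (q[5] == tile) print }  # tile = 5th name field
    '
}
```

It could be used as `samtools view -h file_sorted.bam | filter_tile 1105 | samtools view -b -o tile.bam -`, where tile 1105 and the output name are only placeholders. Both mates of a pair share a read name, so pairs stay together, but this is not a random sample and the fraction kept depends on how many tiles the run has.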
