Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • lh3
    replied
    The latest SVN parses the header @SQ lines. You need to use the "view" command to do this.

    Leave a comment:


  • jkbonfield
    replied
    Feature request

    I find it cumbersome that samtools import *requires* an external index file to tell it what reference sequences are present. This is even true if there is a header in the .sam file.

    For my own sanity I knocked up this simple, albeit slow, perl script:

    Code:
    #!/usr/bin/perl -w
    
    # Reads a sam file and generates an index suitable for "samtools import".
    
    use strict;
    
    #--- Build index
    my %ref; #key=name, value=length
    while (<>) {
        my @F=split(" ", $_);
        my $rname = $F[2];
    
        next if $rname eq "*";
        my $end = $F[3] + length($F[9]);
    
        $ref{$rname} = $end if (!exists($ref{$rname}) || $ref{$rname} < $end);
    }
    
    #--- Output input to stdout
    foreach (sort keys %ref) {
        print "$_\t$ref{$_}\n";
    }
    It generates a file suitable for passing as the 1st argument after "samtools import". Admittedly it doesn't know the true ending of the reference sequence, but for de-novo assemblies the reference is just the consensus anyway so it's faked.

    Note that the above is an estimate. It doesn't consider the fact that the sequences may have CIGAR string possibly altering the mapped coordinates of the reference, but it's a start at least.

    Leave a comment:


  • zee
    replied
    Originally posted by lh3 View Post
    Please check out the latest version (0.1.3-9 r256) from SVN. I fixed a kind of typo but I am not sure if this also fixes this assertion failure. It is not easy to see this. Please let me know if this works. Many thanks.
    I have checked out the 'latest' version from SVN:

    Program: samtools (Tools for alignments in the SAM format)
    Version: 0.1.3-16 (r267)


    This seems to be a later version, would it contain the fix?

    Leave a comment:


  • lh3
    replied
    Bwa and samtools are two separated packages. I will start a new thread about questions related to bwa.

    I answer your question here. I guess you are using reads shorter than 30bp. BWT based algorithms are generally slow for enumerating multiple hits and very short reads are more likely to have many hits. Therefore, BWA's 'sampe' and 'samse -n' are inefficient for these short reads. Probably non-BWT algorithms are better for such alignment. If you really want to align very short reads, you may consider use smaller -n with samse and smaller -o (say 200) with sampe. Using smaller -o will make pairing less effective, though.

    The memory of bwa is almost independent of the number of input reads and the speed is linear in that. You can separate input into smaller batches.
    Last edited by lh3; 05-04-2009, 11:03 AM.

    Leave a comment:


  • gcrdb
    replied
    Hi,
    The bwa aln is fast (few hours),
    but somehow bwa samse(or sampe) is slow.
    This program have been running for almost a week:
    bwa samse -n 255 mouse_genome.fa data.sai data.fq >data.sam

    My data.fq has about 80 million reads. Is that to much reads for samse(pe)? Should I split them and merge?

    Leave a comment:


  • lh3
    replied
    Please check out the latest version (0.1.3-9 r256) from SVN. I fixed a kind of typo but I am not sure if this also fixes this assertion failure. It is not easy to see this. Please let me know if this works. Many thanks.
    Last edited by lh3; 05-04-2009, 10:22 AM.

    Leave a comment:


  • nilshomer
    replied
    I have also had a lot of problems when viewing locations with many insertions and deletions. I have been trying to find the bug in the source code, but so far no luck.

    Nils

    Leave a comment:


  • zee
    replied
    I was wondering if I could get some help on the SAM specification.

    I get the error:

    samtools: bam_lpileup.c:115: tview_func: Assertion `l == tv->n_pre' failed
    When I page the viewer to my reads containing indels. A sample of my data is

    suis_indels_30749_30917_a0b7e 147 Streptococcus_suis 30782 150 60M7D15M * 0 0 CAGATTGTTGCGGTTTATAAGGATCAGGCGACGGCTTTTAATGGTGGTAAGAAAGAACAGGCAAGGGCCGGCTCA YX[QRQZETFQLZT[FQMX]_``\SZ```_`^``_[``^XZ```^``__``X]T[```_^`_````````````` NM:i:7 MD:Z:60M7D15M
    suis_indels_30862_31024_a3384 131 Streptococcus_suis 30801 150 41M7D34M * 0 0 AGGATCAGGCGACGGCTTTTAATGGTGGTAAGAAAGAACAGGCAAGGGCCGGCTCAATAATCTGATTTCATCTTT `````H^_^^_```Z]_``````]``^]``_]```]]X````]_]^````]Z_URYT]]\\]_X``_^][U^``\ NM:i:7 MD:Z:41M7D34M
    suis_indels_30874_31073_a388e 131 Streptococcus_suis 30813 150 29M7D46M * 0 0 CGGCTTTTAATGGTGGTAAGAAAGAACAGGCAAGGGCCGGCTCAATAATCTGATTTCATCTTTGATTTTTGAAAA ```^`\[`````````````_X][U]RIST\`XDDLIHGK]J\XMFVER^]`````[POHLGTU]^_Z``]]]KV NM:i:7 MD:Z:29M7D46M
    suis_indels_30890_31082_a9685 131 Streptococcus_suis 30829 150 13M7D62M * 0 0 TAAGAAAGAACAGGCAAGGGCCGGCTCAATAATCTGATTTCATCTTTGATTTTTGAAAAATTGAATGAAGCTGGT `]__`ZOY^``ZUV``]TTY`_``````^LU```_`_``XV[````^\`\\^UU[`XHOPKITNEUNIRRNDWW` NM:i:7 MD:Z:13M7D62M
    suis_indels_16281_16438_b9354 131 Streptococcus_suis 16224 150 6M1D69M * 0 0 AAATGAGATATTTACGAATTATCTCACGCAATACAATCACCGGGATTGATAAACGATGAATTATTAATGTTGAG`[``````_`_I_]O[__`__Y__VFMY\````^Z```^Z\U``_``````]S`\Y^[_]^```^]`_\Y^ZQXH NM:i:2 MD:Z:6M1D69M
    suis_indels_2007371_2007505_bf820 147 Streptococcus_suis 2007364 150 71M3D4M * 0 0 GTTCTGACTTTTATTCACATCTTGTGGAAAACTTTTCTTAACAGTGTGGATTTTAAAAATTATCTGTGGAATTTT MRT^\_]RZ__`]]YP[^`ZT`T]VH``^]^TS]]``]^^````````_W_`_\\```_````````````\^`^ NM:i:3 MD:Z:71M3D4M
    suis_indels_30783_30934_c09ed 147 Streptococcus_suis 30799 150 43M7D32M * 0 0 TAAGGATCAGGCGACGGCTTTTAATGGTGGTAAGAAAGAACAGGCAAGGGCCGGCTCAATAATCTGATTTCATCT U]PRKMJTRKIK___`Y`\]`\^```ZXVYMZX]`_```Z``_``WWQ]_````_]``]````````````Z_W` NM:i:7 MD:Z:43M7D32M
    suis_indels_16171_16358_c4122 147 Streptococcus_suis 16227 150 3M1D72M * 0 0 TGAGATATTTACGAATTATCTCACGCAATACAATCACCGGGATTGATAAACGATGAATTATTAATGTTGAGAGGNFMU``_`_`NTPV]^__^]YYQLX^ZK[^[[[^^``]RUX``Z]Y\Z````````^``_`]U\`^_```]```` NM:i:1 MD:Z:3M1D72M
    suis_indels_16164_16353_ce374 147 Streptococcus_suis 16222 150 8M1D67M * 0 0 CCAAATGAGATATTTACGAATTATCTCACGCAATACAATCACCGGGATTGATAAACGATGAATTATTAATGTTG[]]^[]\X[^`\ZZX`^_^\_``````\`_JL[\R]X```_`````_``_Y[]`````[\````Z```````^_H NM:i:2 MD:Z:8M1D67M
    suis_indels_30842_31012_d14fb 131 Streptococcus_suis 30781 150 61M7D14M * 0 0 CCAGATTGTTGCGGTTTATAAGGATCAGGCGACGGCTTTTAATGGTGGTAAGAAAGAACAGGCAAGGGCCGGCTC ```_`_`````]`````_```^_VUW^```_^_ZXZT^`````__]YNXTEKV`_YXNRW\\REHDDLSGMFURD NM:i:7 MD:Z:61M7D14M
    suis_indels_30721_30932_d9806 147 Streptococcus_suis 30797 150 45M7D30M * 0 0 TATAAGGATCAGGCGACGGCTTTTAATGGTGGTAAGAAAGAACAGGCAAGGGCCGGCTCAATAATCTGATTTCAT PPJ[TFRJ``````TVU`__`^^`_]_`````GFPZTWY\VX`_````__``````^`````````````````` NM:i:7 MD:Z:45M7D30M
    suis_indels_30785_30945_e17b9 147 Streptococcus_suis 30810 150 32M7D43M * 0 0 CGACGGCTTTTAATGGTGGTAAGAAAGAACAGGCAAGGGCCGGCTCAATAATCTGATTTCATCTTTGATTTTTGA TFWOOKDUNQZ`^RU^^`STTSVW^_``````^`_V``_``````````\[[`XR_```_W\]O_``__`_T^`` NM:i:7 MD:Z:32M7D43M
    suis_indels_30737_30928_e6336 147 Streptococcus_suis 30793 150 49M7D26M * 0 0 GGTTTATAAGGATCAGGCGACGGCTTTTAATGGTGGTAAGAAAGAACAGGCAAGGGCCGGCTCAATAATCTGATT LMRUPOTQPY[V]]^``]TS\^`\^`UXX^S^```_^Z``\\``_```^Y^Q]```````V\]^``^``K^```` NM:i:7 MD:Z:49M7D26M
    And I'm using novoalign format where I've modified the novo2sam.pl script to produce a 'correct' CIGAR format. Most samtools operation work except for the tview

    Leave a comment:


  • mhc
    replied
    Understanding samtools pileup output

    Hi,

    I'm having trouble trying to parse the samtools output. In the example below, at position 60, I have 108 reads. As I understand it, 8 reads terminate (since there are 8 '$'s), and there are 2 new reads (marked by the '^') on the next line.
    So the next line - line 61 - should have 108-8+2=102 reads.
    Instead, it has 99.
    What am I missing here?
    This is the 40th line of input with 40bp reads, and this is the first instance where '$' appears. Other lines seem to work out fine.


    seq1 60 a 108 .$,$,$,$,$g$....ggt*.G,g,,.,,+2tt,+3agcG.,+4atgc,+4ttgcg.c,c.$.,,,..$,+4aggat+6ccgttt,..,tt,.,,..,.
    +7CTGCCTG,.,.,,.,,..,.,..,.,.,,.,,..,,.,.,.,.,.,,.,.,.,.,,^].^],^], CBB=ABBA>BBCBB7BB<BBBBCBBBBBCBB@BB@BBCB:CBBAACC>ABCBBBBBBBBBCC9BBABB@B
    B<BBB7CBBBBABBBBBCBBBBB@BB;BBBC@CBCBCB
    seq1 61 g 99 A$.$.$.$t$c$t$,..,,a,A,,$..,.a,,.,+4gcag,,.,,$,A.,,+7ctgtttg,t$A,a..,.,.,.,,.,,..,.,..,.,.,,.,,..,,
    .,.,.,.,.,,.,.,.,.,,.,,^].^]t BCAABB@B1BB<BBBBBBBBBBBBBACBB@BBABBBBBBBBBCBBCCBBBBBBBBCCBBB6BBBBBBBBBBBABCBBBBBBBBBBBBCAC?BCCBCBB@
    Last edited by mhc; 05-01-2009, 10:30 AM.

    Leave a comment:


  • lh3
    replied
    The first thing you may want to try is:

    samtools view -h aln.bam | less -S

    Leave a comment:


  • emucaki
    replied
    Hi, I'm a novice geneticist who is interested in using the 1000 Genome project data available on NCBI and I can't quite figure how to obtain sequence information from the BAM file, SAMTools' website is little help. I am wondering if anyone knows a good place to get information for this kind of work.

    (Offtopic, anyone know why the 1000 Genome project has a log-in but no register option?)

    Leave a comment:


  • nilshomer
    replied
    I have a working patch to view ABI SOLiD color space using samtools (http://samtools.sourceforge.net/) text viewer. For example, some of the features using output from BFAST (in SAM format), which includes the "CS" and "CQ" tags, are:

    - option to display colors instead of nucleotides.
    - option to color bases/colors based on color. This is similar if you want to color bases based on the given base.
    - option to color bases/colors based on color quality.
    - the "." (dot) option when displaying color space will only show those colors that were corrected during alignment (i.e. the color errors).
    - option to remove all insertions in the current display (in some regions, spurious insertions can cause a headache when viewing that region).

    PM me and I can supply you with a source version.

    Leave a comment:


  • lh3
    replied
    You are invoking pileup with "-c" and you should also read this page:



    A read base "*" means a deletion. The second line at "chr1 1949878" shows indel call. In principle this is not part of pileup.

    Leave a comment:


  • gcrdb
    replied
    lh3,
    Thanks for quick response, pileup is working now!
    Here is some questions about pileup format when I look at them at first time:
    (1) what is a "*" in read bases , which is not documented in "http://samtools.sourceforge.net/pileup.shtml".
    (2) Is it okay for a base in the same position pile-up twice ? (chr1 1949878 occur twice in my first pileup output)
    thanks,
    Below is the piece pile-up output I found the problem:
    chr1 1949878 A A 142 0 60 55 C$....,,,...,,.C..,.,,..,...,,,.,.,.+1C,,,,..,.,........,^F
    ,^], &5,2IIII5I<II+%5=I+II(II8@CII3*I0I+IIII,I$@IIAAIIDI@I*5
    chr1 1949878 * */+C 38 38 * +C 13 4 30 8
    chr1 1949879 A A 150 0 60 54 ....,G,...,,.-1G..-1G.,.,,..,...,,,.,.,.,,,,..,.,........,,
    , II*IIII;I.II&(&1I3II%9III:II7&I&@&IIII5I"6IIIIG.I33I$6
    chr1 1949879 * -G/* 481 481 -G * 6 9 30 9
    chr1 1949880 G G 25 0 60 55 .$A$..,A,..A,,*.*A,A,,A.,A..,,,.,A,.,,,,.+1A.,.,........A,,
    ^], +(,.III)8-II8%D.I0II#,I@5III.$I+I,IIII$I2EIIIIIIIIIE&?/
    chr1 1949880 * */+A 350 350 * +A 24 3 19 9
    chr1 1949881 A A 162 0 60 53 ..,,,...,,....,.,,..,...,,,.,.,.,,,,..,.,........G,,, 6DI
    II3I$II81D)I%II+.III'III$I2I+IIII+II<II46IIAHI.2IB

    Leave a comment:


  • lh3
    replied
    should be: ../samtools pileup -t ex1.fa.fai ex1.sam or ../samtools pileup ex1.bam. I have added a Makefile.

    Note that there is a companion format called BAM which is the binary representation of SAM. Most of samtools commands work on BAM only. I know having two formats is a bit confusing, but this is necessary for faster parsing.

    Leave a comment:

Latest Articles

Collapse

  • SEQadmin2
    Nine Things a Sample Prep Scientist Thinks About Before Sequencing
    by SEQadmin2


    I’m not a sequencing expert. I’m a purification scientist who uses NGS to evaluate workflows my group develops. With this perspective, we think about the sample first and the NGS workflow second. The sequencer is an exceptionally honest reporter, but it can only report on what you give it, so whether you get clean, interpretable data from an NGS workflow is largely determined before you begin.


    Here are nine questions we think about, in roughly the order they matter, before...
    06-18-2026, 07:11 AM
  • SEQadmin2
    From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
    by SEQadmin2


    Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


    The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
    ...
    06-02-2026, 10:05 AM

ad_right_rmr

Collapse

News

Collapse

Topics Statistics Last Post
Started by SEQadmin2, 06-17-2026, 06:09 AM
0 responses
25 views
0 reactions
Last Post SEQadmin2  
Started by SEQadmin2, 06-09-2026, 11:58 AM
0 responses
42 views
0 reactions
Last Post SEQadmin2  
Started by SEQadmin2, 06-05-2026, 10:09 AM
0 responses
48 views
0 reactions
Last Post SEQadmin2  
Started by SEQadmin2, 06-04-2026, 08:59 AM
0 responses
49 views
0 reactions
Last Post SEQadmin2  
Working...