Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • GenoMax
    replied
    @r.rosati: See if this helps. https://docs.oracle.com/cd/E23824_01...033/glmha.html

    Leave a comment:


  • Brian Bushnell
    replied
    Oh, that's really strange. It's due to your locale settings, I guess. In the US, decimals are printed like "##readLengthAvg=162.590" rather than "##readLengthAvg=162,590". For some reason Java is using commas as the delimiter when printing the floating point numbers, but not when trying to read floating point numbers, which is extremely irritating.

    This will take a little research to fix (other than you changing your computer's locale to US or something like that); I'll get back to you.

    Leave a comment:


  • r.rosati
    replied
    BBmap on Ion data

    Hello,
    I tried to run BBmap to call variants on four BAM files (2 parents, 2 children) generated from exome sequencing on an Ion Proton machine (TMAP software).

    The bbmap command was:
    Code:
    /opt/bbmap/callvariants.sh in=IonXpress_009_rawlib.bam,IonXpress_010_rawlib.bam,IonXpress_011_rawlib.bam,IonXpress_012_rawlib.bam ref=./hg19/hg19.fasta out=vars.vcf multisample ploidy=2 prefilter 1> err.txt 2> out.txt &
    I got an error as following (I'm including also the last lines of output):

    Code:
    Finished second pass.
    Writing output.
    Merging [(IonXpress_009_rawlib, individual_IonXpress_009_rawlib.vcf.gz), (IonXpress_010_rawlib, individual_IonXpress_010_rawlib.vcf.gz), (IonXpress_011_rawlib, individual_IonXpress_011_rawlib.vcf.gz), (IonXpress_012_rawlib, individual_IonXpress_012_rawlib.vcf.gz)]
    Exception in thread "main" java.lang.NumberFormatException: For input string: "162,590"
    	at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
    	at sun.misc.FloatingDecimal.parseFloat(FloatingDecimal.java:122)
    	at java.lang.Float.parseFloat(Float.java:451)
    	at var2.MergeSamples.processHeader(MergeSamples.java:233)
    	at var2.MergeSamples.processRow(MergeSamples.java:177)
    	at var2.MergeSamples.mergeFiles(MergeSamples.java:151)
    	at var2.MergeSamples.mergeSamples(MergeSamples.java:117)
    	at var2.CallVariants2.process(CallVariants2.java:380)
    	at var2.CallVariants2.main(CallVariants2.java:52)
    	at var2.CallVariants.main(CallVariants.java:48)
    I unzipped the vcf files and used grep to find "162,590". I got a match at line 9 of the vcf for one individual:
    Code:
    ##readLengthAvg=162,590
    The other individuals also had similar formatting:
    (IonXpress_010)
    Code:
    ##readLengthAvg=160,055
    (IonXpress_011)
    Code:
    ##readLengthAvg=167,107
    (IonXpress_012)
    Code:
    ##readLengthAvg=169,228

    This is the full header for barcode 009:
    Code:
    ##fileformat=VCFv4.2
    ##BBMapVersion=36.92
    ##ploidy=2
    ##rarity=1,00000
    ##minallelefraction=0,10000
    ##reads=43874924
    ##pairedReads=0
    ##properlyPairedReads=0
    ##readLengthAvg=162,590
    ##properPairRate=0,00000
    ##totalQualityAvg=23,525
    ##mapqAvg=77,583
    ##reference=./hg19/hg19.fasta
    ##contig=<ID=chr1,length=249250621>
    ##contig=<ID=chr2,length=243199373>
    ##contig=<ID=chr3,length=198022430>
    ##contig=<ID=chr4,length=191154276>
    ##contig=<ID=chr5,length=180915260>
    ##contig=<ID=chr6,length=171115067>
    ##contig=<ID=chr7,length=159138663>
    ##contig=<ID=chr8,length=146364022>
    ##contig=<ID=chr9,length=141213431>
    ##contig=<ID=chr10,length=135534747>
    ##contig=<ID=chr11,length=135006516>
    ##contig=<ID=chr12,length=133851895>
    ##contig=<ID=chr13,length=115169878>
    ##contig=<ID=chr14,length=107349540>
    ##contig=<ID=chr15,length=102531392>
    ##contig=<ID=chr16,length=90354753>
    ##contig=<ID=chr17,length=81195210>
    ##contig=<ID=chr18,length=78077248>
    ##contig=<ID=chr19,length=59128983>
    ##contig=<ID=chr20,length=63025520>
    ##contig=<ID=chr21,length=48129895>
    ##contig=<ID=chr22,length=51304566>
    ##contig=<ID=chrX,length=155270560>
    ##contig=<ID=chrY,length=59373566>
    ##contig=<ID=chrM,length=16569>
    ##FORMAT=<ID=FAIL,Description="Fail">
    ##FORMAT=<ID=PASS,Description="Pass">
    ##INFO=<ID=SN,Number=1,Type=Integer,Description="Scaffold Number">
    ##INFO=<ID=STA,Number=1,Type=Integer,Description="Start">
    ##INFO=<ID=STO,Number=1,Type=Integer,Description="Stop">
    ##INFO=<ID=TYP,Number=1,Type=String,Description="Type">
    ##INFO=<ID=R1P,Number=1,Type=Integer,Description="Read1 Plus Count">
    ##INFO=<ID=R1M,Number=1,Type=Integer,Description="Read1 Minus Count">
    ##INFO=<ID=R2P,Number=1,Type=Integer,Description="Read2 Plus Count">
    ##INFO=<ID=R2M,Number=1,Type=Integer,Description="Read2 Minus Count">
    ##INFO=<ID=PPC,Number=1,Type=Integer,Description="Paired Count">
    ##INFO=<ID=LS,Number=1,Type=Integer,Description="Length Sum">
    ##INFO=<ID=MQS,Number=1,Type=Integer,Description="MAPQ Sum">
    ##INFO=<ID=MQM,Number=1,Type=Integer,Description="MAPQ Max">
    ##INFO=<ID=BQS,Number=1,Type=Integer,Description="Base Quality Sum">
    ##INFO=<ID=BQM,Number=1,Type=Integer,Description="Base Quality Max">
    ##INFO=<ID=EDS,Number=1,Type=Integer,Description="End Distance Sum">
    ##INFO=<ID=EDM,Number=1,Type=Integer,Description="End Distance Max">
    ##INFO=<ID=IDS,Number=1,Type=Integer,Description="Identity Sum">
    ##INFO=<ID=IDM,Number=1,Type=Integer,Description="Identity Max">
    ##INFO=<ID=COV,Number=1,Type=Integer,Description="Coverage">
    ##INFO=<ID=MCOV,Number=1,Type=Integer,Description="Minus Coverage">
    ##INFO=<ID=CED,Number=1,Type=Integer,Description="Contig End Distance">
    ##INFO=<ID=HMP,Number=1,Type=Integer,Description="Homopolymer Count">
    ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
    ##INFO=<ID=AF,Number=1,Type=Float,Description="Allele Fraction">
    ##INFO=<ID=RAF,Number=1,Type=Float,Description="Revised Allele Fraction">
    ##INFO=<ID=DP4,Number=4,Type=Integer,Description="Ref+, Ref-, Alt+, Alt-">
    ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
    ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
    ##FORMAT=<ID=AD,Number=1,Type=Integer,Description="Allele Depth">
    ##FORMAT=<ID=AF,Number=1,Type=Float,Description="Allele Fraction">
    ##FORMAT=<ID=SC,Number=1,Type=Float,Description="Score">
    ##FORMAT=<ID=PF,Number=1,Type=String,Description="Pass Filter">
    #CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	IonXpress_009_rawlib
    Anything I should have done differently to not cause the error?
    Thank you!

    Leave a comment:


  • Brian Bushnell
    replied
    Actually, I think the problem in this case is the samtools version. Samtools changed the default command line flags going from 0.x to 1.x, which causes headaches, and I didn't realize this until recently. If you download the latest version of BBMap (36.92), it will look at the version of samtools installed and change the samtools command appropriately.

    Alternately, you can just run these commands manually. For samtools 0.x:
    Code:
    samtools view -bSh1 mapped.sam.gz | samtools sort -m 1G -@ 4 - mapped_sorted
    samtools index mapped_sorted.bam
    For samtools 1.x:
    Code:
    samtools view -bSh1 mapped.sam | samtools sort -m 1G -@ 4 - -o mapped_sorted.bam
    samtools index mapped_sorted.bam
    The exact amount of memory specified (e.g. 1G) depends on your computer but should be fine in this case.

    Leave a comment:


  • GenoMax
    replied
    Do you have samtools installed and available in your $PATH and which version?

    If yes then you can create bam files directly in bbmap command and then sort/index them. Following example is for samtools (v.1.3.1).

    Code:
    bbmap.sh in1=UPWRP0508161KA_TCCGGAGA-ATAGAGGC_L1L2_R1_001.fastq in2=UPWRP0508161KA_TCCGGAGA-ATAGAGGC_L1L2_R2_001.fastq out=mapped.bam -Xmx30g ref=reformatted-contigs.fasta
    
    samtools sort -o sorted.bam mapped.bam
    samtools index sorted.bam

    Leave a comment:


  • shimingt
    replied
    Difficulties using bbmap to generate sorted, indexed bam file

    Hello people,

    I am trying to generate sorted, indexed bam file with the bbmap tool, but I have an error message.

    tanshiming@S620100019205:~/Documents/ACDC-trial/shimingles$ bbmap.sh in1=UPWRP0508161KA_TCCGGAGA-ATAGAGGC_L1L2_R1_001.fastq in2=UPWRP0508161KA_TCCGGAGA-ATAGAGGC_L1L2_R2_001.fastq out=mapped.sam -Xmx30g ref=reformatted-contigs.fasta bamscript=bs.sh; sh bs.sh
    java -Djava.library.path=/home/tanshiming/tools/bbmap/jni/ -ea -Xmx30g -cp /home/tanshiming/tools/bbmap/current/ align2.BBMap build=1 overwrite=true fastareadlen=500 in1=UPWRP0508161KA_TCCGGAGA-ATAGAGGC_L1L2_R1_001.fastq in2=UPWRP0508161KA_TCCGGAGA-ATAGAGGC_L1L2_R2_001.fastq out=mapped.sam -Xmx30g ref=reformatted-contigs.fasta bamscript=bs.sh
    Executing align2.BBMap [build=1, overwrite=true, fastareadlen=500, in1=UPWRP0508161KA_TCCGGAGA-ATAGAGGC_L1L2_R1_001.fastq, in2=UPWRP0508161KA_TCCGGAGA-ATAGAGGC_L1L2_R2_001.fastq, out=mapped.sam, -Xmx30g, ref=reformatted-contigs.fasta, bamscript=bs.sh]

    BBMap version 36.02
    Retaining first best site only for ambiguous mappings.
    Writing reference.
    Executing dna.FastaToChromArrays2 [reformatted-contigs.fasta, 1, writeinthread=false, genscaffoldinfo=true, retain, waitforwriting=false, gz=true, maxlen=536670912, writechroms=true, minscaf=1, midpad=300, startpad=8000, stoppad=8000, nodisk=false]

    Set genScaffoldInfo=true
    Writing chunk 1
    Set genome to 1

    Loaded Reference: 0.084 seconds.
    Loading index for chunk 1-1, build 1
    No index available; generating from reference genome: /home/tanshiming/Documents/ACDC-trial/shimingles/ref/index/1/chr1_index_k13_c7_b1.block
    Indexing threads started for block 0-1
    Indexing threads finished for block 0-1
    Generated Index: 4.146 seconds.
    Analyzed Index: 3.074 seconds.
    Started output stream: 0.015 seconds.
    Cleared Memory: 0.307 seconds.
    Processing reads in paired-ended mode.
    Started read stream.
    Started 32 mapping threads.
    Detecting finished threads: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31

    ------------------ Results ------------------

    Genome: 1
    Key Length: 13
    Max Indel: 16000
    Minimum Score Ratio: 0.56
    Mapping Mode: normal
    Reads Used: 13403154 (2943817149 bases)

    Mapping: 634.915 seconds.
    Reads/sec: 21110.16
    kBases/sec: 4636.56


    Pairing data: pct reads num reads pct bases num bases

    mated pairs: 82.0206% 5496676 85.8858% 2528321686
    bad pairs: 1.6892% 113202 1.6858% 49625732
    insert size avg: 339.43


    Read 1 data: pct reads num reads pct bases num bases

    mapped: 85.2928% 5715965 85.4106% 1312819237
    unambiguous: 85.0524% 5699850 85.1949% 1309504251
    ambiguous: 0.2405% 16115 0.2157% 3314986
    low-Q discards: 0.0000% 0 0.0000% 0

    perfect best site: 56.5028% 3786577 57.4090% 882415075
    semiperfect site: 57.7478% 3870016 58.6870% 902059233
    rescued: 0.7551% 50604

    Match Rate: NA NA 95.4737% 1291210481
    Error Rate: 32.2729% 1844876 3.9447% 53348516
    Sub Rate: 31.0425% 1774536 0.7285% 9852710
    Del Rate: 2.5010% 142970 2.9285% 39606112
    Ins Rate: 4.1940% 239750 0.2876% 3889694
    N Rate: 2.6675% 152488 0.5816% 7866352


    Read 2 data: pct reads num reads pct bases num bases

    mapped: 85.4634% 5727397 85.6329% 1204639477
    unambiguous: 85.1581% 5706936 85.3792% 1201071323
    ambiguous: 0.3053% 20461 0.2536% 3568154
    low-Q discards: 0.0000% 0 0.0000% 0

    perfect best site: 54.6069% 3659525 55.8135% 785155749
    semiperfect site: 55.7207% 3734167 56.9902% 801709710
    rescued: 1.0234% 68587

    Match Rate: NA NA 95.6569% 1184979976
    Error Rate: 34.8039% 1993488 3.7863% 46904205
    Sub Rate: 33.7245% 1931661 0.7674% 9506079
    Del Rate: 2.2309% 127782 2.7561% 34141479
    Ins Rate: 3.7482% 214688 0.2629% 3256647
    N Rate: 2.4683% 141381 0.5567% 6896775

    Total time: 643.704 seconds.
    bs.sh: 2: bs.sh: module: not found
    bs.sh: 3: bs.sh: module: not found
    Note: This script is designed to run with the amount of memory detected by BBMap.
    If Samtools crashes, please ensure you are running on the same platform as BBMap,
    or reduce Samtools' memory setting (the -m flag).
    Note: Please ignore any warnings about 'EOF marker is absent'; this is a bug in samtools that occurs when using piped input.
    [bam_sort] Use -T PREFIX / -o FILE to specify temporary and final output files
    Usage: samtools sort [options...] [in.bam]
    Options:
    -l INT Set compression level, from 0 (uncompressed) to 9 (best)
    -m INT Set maximum memory per thread; suffix K/M/G recognized [768M]
    -n Sort by read name
    -o FILE Write final output to FILE rather than standard output
    -T PREFIX Write temporary files to PREFIX.nnnn.bam
    -@, --threads INT
    Set number of sorting and compression threads [1]
    --input-fmt-option OPT[=VAL]
    Specify a single input file format option in the form
    of OPTION or OPTION=VALUE
    -O, --output-fmt FORMAT[,OPT[=VAL]]...
    Specify output format (SAM, BAM, CRAM)
    --output-fmt-option OPT[=VAL]
    Specify a single output file format option in the form
    of OPTION or OPTION=VALUE
    --reference FILE
    Reference sequence FASTA FILE [null]
    [E::hts_open_format] fail to open file 'mapped_sorted.bam'
    samtools index: failed to open "mapped_sorted.bam": No such file or directory

    Could someone please advice me on this?

    Thank You.

    Leave a comment:


  • GenoMax
    replied
    @Brian: What setting (if any) would prevent the paired reads from aligning to two references (maxindel=0 perhaps) in first place?

    Leave a comment:


  • YeastGuy
    replied
    Originally posted by Brian Bushnell View Post
    Oh, that file should be fine; BBMap sees the individual sequences as unconnected. It's still possible for read 1 to map to one csequence and read 2 to a different sequence, though. The coverage may be different for various reasons, typically due to ambiguity (two different reference sequences that are very similar). You can use "ambig=toss" or "ambig=all" to change the behavior of reads mapping to multiple possible locations. But bear in mind that BBMap is slightly nondeterministic when mapping paired reads.
    Okay then, thank you for your quick help! I will try your suggestions.
    Now that I know what the problem is, it should be possible to find a solution.

    Leave a comment:


  • Brian Bushnell
    replied
    Oh, that file should be fine; BBMap sees the individual sequences as unconnected. It's still possible for read 1 to map to one csequence and read 2 to a different sequence, though. The coverage may be different for various reasons, typically due to ambiguity (two different reference sequences that are very similar). You can use "ambig=toss" or "ambig=all" to change the behavior of reads mapping to multiple possible locations. But bear in mind that BBMap is slightly nondeterministic when mapping paired reads.

    Leave a comment:


  • YeastGuy
    replied
    Originally posted by Brian Bushnell View Post
    How are the ORFs indicated? Do you have them in a separate file (gff/gtf/bed)? BBMap doesn't use any of those, though you could probably use a different tool to mask the reference with them. Generally, I don't recommend constraining alignment to a specific subset of the genome, which incurs confirmation bias.
    I tried to simply use the multi-fasta format with the hope that the sequences would be seen as seperate and not connected:
    >lcl||1|YAL001C|TFC3|1|145
    TAGTTACTATGGTCGTTAACGAAATAATATTTCATCCAGGGA
    >lcl||2|YAL002W|VPS8|1|145
    CTGGTCTGGACCCATTACTTTTTCTAGCTTGGGAAAATGTACAG

    But after randomising the order withing the ref file the coverage changed.

    Leave a comment:


  • Brian Bushnell
    replied
    Originally posted by YeastGuy View Post
    Hi Brian,

    I have a discontinuous reference genome in a single fasta file (6000 coding sequences) and I would like to use bbmap to align my paired reads to the coding sequence only. I want to know how many reads align to each ORF.

    Now I realised that the order of the reference genome affects the alignment because bbmap seems to ignore the ORF borderes in the reference.
    Do you have a suggestion how to constrain the alignment within each ORF?

    Thank you!
    How are the ORFs indicated? Do you have them in a separate file (gff/gtf/bed)? BBMap doesn't use any of those, though you could probably use a different tool to mask the reference with them. Generally, I don't recommend constraining alignment to a specific subset of the genome, which incurs confirmation bias.

    Leave a comment:


  • Brian Bushnell
    replied
    Originally posted by Mat29 View Post
    Thank you for developing BBMap. Could you please advise how to preprocess the output VCF to make it compatible with VCFAnnotator, SNPEff. I recieve "No gene feature" and "Chromososme is missing" errors on annotation. Tried Pilon also VCFs and errors persist.
    Please ensure that your chromosome names do not have spaces in them. If they do, use the "trd" flag when mapping (trim read description), or for simplicity, just process the fasta initially like this:

    reformat.sh in=ref.fa out=fixed.fa trd


    ...then use fixed.fa for mapping and everything else.

    Leave a comment:


  • YeastGuy
    replied
    @GenoMax: Yes, they are.

    Leave a comment:


  • GenoMax
    replied
    @YeastGuy: Are those sequences in multi-fasta format?

    Leave a comment:


  • YeastGuy
    replied
    Hi Brian,

    I have a discontinuous reference genome in a single fasta file (6000 coding sequences) and I would like to use bbmap to align my paired reads to the coding sequence only. I want to know how many reads align to each ORF.

    Now I realised that the order of the reference genome affects the alignment because bbmap seems to ignore the ORF borderes in the reference.
    Do you have a suggestion how to constrain the alignment within each ORF?

    Thank you!

    Leave a comment:

Latest Articles

Collapse

  • seqadmin
    Strategies for Sequencing Challenging Samples
    by seqadmin


    Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
    03-22-2024, 06:39 AM
  • seqadmin
    Techniques and Challenges in Conservation Genomics
    by seqadmin



    The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

    Avian Conservation
    Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
    03-08-2024, 10:41 AM

ad_right_rmr

Collapse

News

Collapse

Topics Statistics Last Post
Started by seqadmin, Yesterday, 06:37 PM
0 responses
10 views
0 likes
Last Post seqadmin  
Started by seqadmin, Yesterday, 06:07 PM
0 responses
9 views
0 likes
Last Post seqadmin  
Started by seqadmin, 03-22-2024, 10:03 AM
0 responses
51 views
0 likes
Last Post seqadmin  
Started by seqadmin, 03-21-2024, 07:32 AM
0 responses
67 views
0 likes
Last Post seqadmin  
Working...
X