**rkanwar** · 12-18-2014, 01:34 PM

I am using version v34.08. Another pair of reads that failed is:

R0266248:292:C5C0TACXX:4:1101:3508:3710 97 3 193355806 42 19=1X77= = 96255219 -97100684 AATATTCCCAAGAGGATCTTGGGAGATGATGACACGTTCTCCAGTTAAGGTAAGAACATAGGCCGTCTCAGTGAGGTTCCTTAGGAGAGTAACTGCT CCCFFFFFHHHHFHIGHIJIIJIIJJJJGIIJJJJJJJJJJJIHHIIGGJFHIIHIJJJGGIJJJJJHHFHHHDF@?AEECCCEDDC??>CCDECDD NM:i:1 AM:i:42 RG:Z:AS_N#1#4#AS_N_1
R0266248:292:C5C0TACXX:4:1101:3508:3710 145 11 96255219 44 98= = 193355806 97100684 GTAGCCATGCAGAGTTCCCAGATTTCTTCCCCTTCAGCCCAGCTTCTGTGTCTTCCCTCTGTCCACTTGGCATCTTCCCTGTGAAGATCTGTTAAAAG DDCDDDDDDDDDDDDBDDEDDDDECDABFHHECAJJIJIJHHHDFIHFJJIIGHIJIIIIJJJIGHIJJHHE8JJIJIJIJJJJIHFGHHFFFFFCCC NM:i:0 AM:i:44 RG:Z:AS_N#1#4#AS_N_1
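In the two records above, each read reports `=` in the mate-reference column (RNEXT) even though one read is on reference 3 and the other on reference 11. A minimal sketch of a check for that kind of inconsistency follows. The function name `mate_inconsistencies` is hypothetical and is not part of BBMap or any other tool; the sketch assumes tab-delimited SAM lines, and the example shortens SEQ and QUAL to `*` for readability.

```python
def mate_inconsistencies(sam_lines):
    """Return (qname, mate_number, message) for reads whose RNEXT disagrees with the mate's RNAME."""
    reads = {}
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        f = line.rstrip("\n").split("\t")
        flag = int(f[1])
        if flag & 0x900:
            continue  # ignore secondary (0x100) and supplementary (0x800) records
        mate_no = 1 if flag & 0x40 else 2  # 0x40 marks the first read in the pair
        reads[(f[0], mate_no)] = (flag, f[2], f[6])  # flag, RNAME, RNEXT

    problems = []
    for (qname, mate_no), (flag, rname, rnext) in sorted(reads.items()):
        mate = reads.get((qname, 3 - mate_no))
        # Only check paired reads (0x1) whose mate is present and mapped (0x8 unset).
        if mate is None or not flag & 0x1 or flag & 0x8:
            continue
        resolved = rname if rnext == "=" else rnext  # "=" means "same reference as this read"
        if resolved != mate[1]:
            problems.append((qname, mate_no, f"RNEXT resolves to {resolved}, mate is on {mate[1]}"))
    return problems


# The two records from this post, with SEQ/QUAL replaced by "*" and optional tags dropped.
qname = "R0266248:292:C5C0TACXX:4:1101:3508:3710"
example = [
    "\t".join([qname, "97", "3", "193355806", "42", "19=1X77=", "=", "96255219", "-97100684", "*", "*"]),
    "\t".join([qname, "145", "11", "96255219", "44", "98=", "=", "193355806", "97100684", "*", "*"]),
]
problems = mate_inconsistencies(example)
for problem in problems:
    print(problem)
```

Both reads are flagged here, since each claims its mate is on its own reference.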

I used the following command to run bbmap:

bbmap_cmd="${bbmap} \
in1=${duk1} \
in2=${duk2} \
ambiguous=toss \
rgid=${ID} \
rgsm=${SM} \
rgpl=${PL} \
rglb=${LB} \
out=${bb_prefix}.sam \
pairlen=1000 \
threads=10 \
k=11 \
trimreaddescriptions=t \
reads=1000 \
slow=t"

Thanks!

Originally posted by Brian Bushnell:

Sounds like a bug, and I agree, it seems as though the mate is mapped to a different (longer) contig. What version of BBMap are you using? Also, can you give me the complete command line and, if possible, the text of the second read? Thanks!

**michaellim** · 12-22-2014, 05:05 PM

Hey Guys,

If I have Java 8, does that mean I can't use BBMap?

Thank you. Merry Christmas

**Brian Bushnell** · 12-22-2014, 05:07 PM

Originally posted by michaellim:

Hey Guys,

If I have Java 8, does that mean I can't use BBMap?

Thank you. Merry Christmas

It works fine with Java 7 or 8.

Java is backwards-compatible.

**rkanwar** · 01-13-2015, 03:23 PM

Thanks for fixing the chimeric alignment issue in v34.30! I now have another issue when running GATK, which quits with the following error:

##### ERROR MESSAGE: SAM/BAM file SAMFileReader{test.bam} is malformed: Read R0266248:292:C5C0TACXX:4:2101:19666:82211 is missing the read group (RG) tag, which is required by the GATK.

The SAM records for the corresponding read are:

R0266248:292:C5C0TACXX:4:2101:19666:82211 73 1 6500339 44 98= = 6500339 0 GGTGCCACGCCCCTGTACCTGGCGTGCCAGGAGGGCCACCTGGAGGTGACCCAGTACCTGGTGCAGGAATGCGGCGCAGACCCGCACGCGCGCGCCCA @@@FFFFFHGHHHIBFHIJIGIJJIIIJJJGIIIIIIJJJJJIGHI@GHHHHHFBDFFE>B.;>CDDDDDDDDDDBDDBDDDDDBDDDDDDDDBDDDD RG:Z:AS_N#1#4#AS_N_1 AM:i:0 NM:i:0
R0266248:292:C5C0TACXX:4:2101:19666:82211 133 1 6500339 0 * = 6500339 0 GAATTGAGGACATACAGCACAAAGAAGTTTCTGAGAATGCTTCTGTCTAGATTTAATATGAAGATAACCCGTTTCCAATGAAATCCTCAAAGCTATCC @CCFFFFFHHHHHIJJIJIJJIJJJJIHHIIJJIJHIGGJJIJJJIJJIJIJJJJIJJJGIIJJIJJJIJIHIJJJHHGEHHFFDEDDCE;ACEDCDD

I am using paired-end reads, and it looks like the read group info is missing for one of the ends of this particular read pair. Can you please fix this small issue too? Thanks a lot for your help!

bye,
Rahul

**Brian Bushnell** · 01-13-2015, 05:21 PM

Rahul,

Thanks again! This is now fixed in 34.33. I was skipping the generation of optional tags for unmapped reads, which caused this problem.

-Brian
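For anyone who wants to screen a SAM file for this problem before handing it to GATK, here is a minimal sketch that lists records lacking an `RG:Z:` tag. The function name `reads_missing_rg` is hypothetical, the sketch assumes tab-delimited SAM text, and the example shortens SEQ and QUAL to `*`.

```python
def reads_missing_rg(sam_lines):
    """Return (qname, flag) for every alignment record that has no RG:Z: optional tag."""
    missing = []
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        optional_tags = fields[11:]  # the first 11 columns are mandatory
        if not any(tag.startswith("RG:Z:") for tag in optional_tags):
            missing.append((fields[0], int(fields[1])))
    return missing


# The read pair from the report above: the mapped read carries RG, the unmapped mate does not.
qname = "R0266248:292:C5C0TACXX:4:2101:19666:82211"
example = [
    "\t".join([qname, "73", "1", "6500339", "44", "98=", "=", "6500339", "0", "*", "*",
               "RG:Z:AS_N#1#4#AS_N_1", "AM:i:0", "NM:i:0"]),
    "\t".join([qname, "133", "1", "6500339", "0", "*", "=", "6500339", "0", "*", "*"]),
]
missing = reads_missing_rg(example)
print(missing)
```

Flag 133 has the 0x4 (unmapped) bit set, which matches the explanation that optional tags were being skipped for unmapped reads.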

**lankage** · 01-23-2015, 10:28 AM

dedupe.sh question

I have recently been experimenting with read clustering tools using PacBio data. When running dedupe.sh without any of the clustering parameters, I noticed it outputs significantly fewer reads than are input, despite saying it found 0 duplicates, 0 contained sequences and 0 invalid entries. I went from a fastq file containing ~3000 reads to a file containing 130. I'm wondering, what filtering is occurring behind the scenes here?

dhowifi-19:pacbio4 mgraham$ wc -l 0002_Forward--0002_Reverse.fastq
12460 0002_Forward--0002_Reverse.fastq

dhowifi-19:shortread mgraham$ bbmap34_26/dedupe.sh in=pacbio4/0002_Forward--0002_Reverse.fastq -Xmx10g
java -Djava.library.path=/Users/mgraham/Desktop/dev/shortread/bbmap34_26/jni/ -ea -Xmx10g -Xms10g -cp /Users/mgraham/Desktop/dev/shortread/bbmap34_26/current/ jgi.Dedupe in=pacbio4/0002_Forward--0002_Reverse.fastq -Xmx10g
Executing jgi.Dedupe [in=pacbio4/0002_Forward--0002_Reverse.fastq, -Xmx10g]

Initial:
Memory: free=10129m, used=161m

Found 0 duplicates.
Finished exact matches. Time: 0.070 seconds.
Memory: free=9807m, used=483m

Found 0 contained sequences.
Finished containment. Time: 0.036 seconds.
Memory: free=9538m, used=752m

Removed 0 invalid entries.
Finished invalid removal. Time: 0.017 seconds.
Memory: free=9538m, used=752m

Input: 130 reads 125193 bases.
Duplicates: 0 reads (0.00%) 0 bases (0.00%) 0 collisions.
Containments: 0 reads (0.00%) 0 bases (0.00%) 1539 collisions.
Result: 130 reads (100.00%) 125193 bases (100.00%)

Time: 0.131 seconds.
Reads Processed: 130 0.99k reads/sec
Bases Processed: 125k 0.96m bases/sec

**Brian Bushnell** · 01-23-2015, 10:41 AM

There should not be any filtering, so I don't know what the problem is. I imagine that all of my tools would perceive that file as containing 130 reads. Would it be possible for you to email me the file, or at least the first 2000 lines of it? It's possible that it is misformatted, or contains a control character in an unexpected place, so that the program quits reading it early.

**sdriscoll** · 01-27-2015, 11:46 AM

Question about bbmapskimmer.sh. The readme says this version is designed to find all mapping locations with an alignment score above a certain threshold. Does that mean that when you run it with, say, minid=0.9, you get all mappings at that score and above, or do you need to set something else? I'm asking because the options appear to be identical to standard bbmap (i.e. the ambig, maxsites2, secondary, maxsites options).

With all of these options identical to bbmap I don't understand how this version can output all mappings above a threshold.

**Brian Bushnell** · 01-27-2015, 12:14 PM

The options are kept as identical as possible, but the code itself is different. bbmapskimmer.sh does not run BBMap; it runs a separate program that is based on BBMap but differs substantially. BBMap gains a lot of speed by only looking for mapping sites similar to or better than the current best discovered site, and by limiting the minimum score in the dynamic programming alignment to a little below the best known score. These heuristics and early exits were removed from skimmer, so it is generally slower than BBMap. I tried to keep the interface the same for convenience.

Someone else recently asked me roughly the same question via email, and I'll post my response here:

Maxsites controls the number of alignments printed for a read that maps to multiple locations, and has no function unless ambig=all. Maxsites2 controls the maximum number of potential mapping locations that are considered for alignment, which is computationally expensive; it's really a heuristic that affects speed versus accuracy. If there are more than "maxsites2" seed hits in the reference, the lowest-scoring ones will be discarded prior to alignment. The default is 800 - normally, there is no reason to change it unless you have a situation where you actually expect reads to have hundreds of correct mapping locations, and need all of them reported (like when aligning a primer sequence to a 16S database, for example). maxsites2 has an effect whether or not ambig=all.
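The maxsites2 behavior described above (keep the best-scoring candidate sites, discard the rest before the expensive alignment step) can be pictured with a small sketch. This only illustrates the idea and is not BBMap's code: the `SeedHit` type, its score values, and the function `limit_candidates` are all hypothetical.

```python
import heapq
from typing import NamedTuple


class SeedHit(NamedTuple):
    site: str        # hypothetical label for a candidate mapping location
    seed_score: int  # hypothetical cheap score available before alignment


def limit_candidates(hits, maxsites2=800):
    """Keep at most maxsites2 hits, dropping the lowest-scoring ones first."""
    return heapq.nlargest(maxsites2, hits, key=lambda h: h.seed_score)


hits = [SeedHit("a", 5), SeedHit("b", 9), SeedHit("c", 1), SeedHit("d", 7)]
kept = limit_candidates(hits, maxsites2=2)
print([h.site for h in kept])
```

With the default of 800, the limit only matters when a read has more candidate sites than that, which is why it rarely needs changing.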

With ambig=all, BBMap will only report alignments that are equal-scoring, or quite close, to the best alignment - so if the best alignment was perfect, and the second best had 4 substitutions, the second-best would NOT be reported. If you actually want every single alignment above some identity threshold, then you should use BBMapSkimmer (bbmapskimmer.sh) instead. The syntax is the same, but rather than looking for the best mapping location and then reporting it and anything else with a similar score, it is designed to report ALL alignments above a minimum quality.

For example:

bbmapskimmer.sh in=reads.fq out=mapped.sam ambig=all idtag minid=0.80 idfilter=0.9 maxsites=40

This will look for alignments with an approximate identity of at least 80%. Once it finds them, anything with an actual identity under 90% will be removed (and the actual identity will be printed in the SAM line). So the net result is that you end up with everything that has at least 90% identity, up to 40 alignments per read. minid is set lower than idfilter because minid is a heuristic based on approximate score, which acts before the true identity is known (computing that requires an expensive alignment), while idfilter is exact and acts AFTER alignment.
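The interplay of minid, idfilter and maxsites in the command above can be sketched as a two-stage filter. This is a toy model under stated assumptions, not the skimmer's implementation: the `Candidate` type and its numbers are invented, `true_identity` stands in for what a real alignment would compute, and sorting by identity before applying the maxsites cap is an assumption (the post does not say which alignments are kept when more than 40 pass).

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    site: str
    approx_identity: float  # cheap estimate, known before alignment
    true_identity: float    # stand-in for the result of an expensive alignment


def skim(candidates, minid=0.80, idfilter=0.90, maxsites=40):
    # Stage 1: heuristic prefilter on the approximate identity (minid).
    survivors = [c for c in candidates if c.approx_identity >= minid]
    # Stage 2: exact filter on the identity known only after alignment (idfilter).
    kept = [c for c in survivors if c.true_identity >= idfilter]
    # Cap the number of reported sites per read (ordering is an assumption).
    kept.sort(key=lambda c: c.true_identity, reverse=True)
    return kept[:maxsites]


candidates = [
    Candidate("A", 0.95, 0.97),
    Candidate("B", 0.82, 0.91),  # estimate is low, but the true identity passes
    Candidate("C", 0.85, 0.86),  # passes the heuristic, fails the exact filter
]
loose = [c.site for c in skim(candidates, minid=0.80, idfilter=0.90)]
strict = [c.site for c in skim(candidates, minid=0.90, idfilter=0.90)]
print(loose, strict)
```

In this toy data, raising minid to match idfilter loses site B, whose approximate identity underestimates its true identity; that is the reason for leaving a margin between the two settings.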

**sdriscoll** · 01-27-2015, 12:42 PM

Nice. See I use tools like RSEM and eXpress and both of those tools like to get all possible alignments up to some allowed alignment threshold. That's sort of the default output of something like bowtie or bowtie2 with the -a flag (all alignments). So it sounds like bbmapskimmer.sh outputs what I'm looking for when it comes to running those expression quantification tools. Thanks.

**lankage** · 01-28-2015, 01:42 PM

BBTools programs stop reading in fastq reads when encountering 0 length sequences

Brian, found the issue. Our PacBio data contained 0 length sequences that were causing dedupe/bbmap to quit reading in sequences. I removed these and everything runs fine.
Example:

@m141217_232404_42136_c100683242550000001823136602221530_s1_X0/45415/ccs 0.9 18

+

@m141217_232404_42136_c100683242550000001823136602221530_s1_X0/12786/ccs 0.99 27
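As a workaround sketch for files like this, the zero-length records can be counted and dropped before running other tools. The functions `read_fastq` and `drop_empty` are hypothetical helpers, not part of BBTools, and the sketch assumes strict four-line FASTQ records. For reference, the `wc -l` count of 12460 lines reported earlier corresponds to 12460 / 4 = 3115 four-line records.

```python
def read_fastq(lines):
    """Yield (header, seq, qual) from strict 4-line FASTQ records, keeping empty sequences."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate stray blank lines between records
        seq = next(it, "").rstrip("\r\n")
        plus = next(it, "").rstrip("\r\n")
        qual = next(it, "").rstrip("\r\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        yield header, seq, qual


def drop_empty(records):
    """Return only the records whose sequence has at least one base."""
    return [r for r in records if len(r[1]) > 0]


# Three records, the second of which has a zero-length sequence and quality line.
lines = [
    "@r1\n", "ACGT\n", "+\n", "IIII\n",
    "@r2\n", "\n", "+\n", "\n",
    "@r3\n", "GG\n", "+\n", "II\n",
]
records = list(read_fastq(lines))
kept = drop_empty(records)
print(len(lines) // 4, len(records), len(kept))
```

Comparing the line count divided by four with the number of reads a tool reports is a quick way to notice that reading stopped early.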

**Brian Bushnell** · 01-28-2015, 01:47 PM

OK, thanks for the report. I will modify it to process those correctly.

**sdriscoll** · 01-30-2015, 11:24 AM

Hey Brian,

Considering the very nice multi-threaded set of tools you have thus far, would it be out of the question to add in a variant caller for SNP/INDEL calling? It seems that those things tend to NOT be multi-threaded.

**Brian Bushnell** · 01-30-2015, 11:32 AM

I wrote a variant caller, which is indeed multithreaded, but I don't think it works right now because of massive changes I've made to related code over the last couple years. But it would be useful in a lot of scenarios (including error-correcting assemblies) so I have been meaning to resurrect it.

I will do so if I have time; that's really the limiting reagent.

**sdriscoll** · 01-30-2015, 11:35 AM

Cool. That would be a useful tool. There is a multithreaded caller in the Subread package; however, it is a little too bare-bones.
