If you want to deduplicate based on mapping, I recommend converting to bam, sorting, and using some other program (Picard, for example). Alternatively, you can use dedupebymapping.sh, which DOES accept sam files. It's much faster than Picard, but it does not take full advantage of sorting or write temp files, so it uses a lot of memory (though no more than Dedupe).
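For example, the Picard route would look something like this (filenames are placeholders, and you may need to adjust paths and memory for your setup):

samtools sort -o mapped_sorted.bam mapped.sam
java -jar picard.jar MarkDuplicates I=mapped_sorted.bam O=deduped.bam M=dup_metrics.txt REMOVE_DUPLICATES=true

The dedupebymapping.sh route would be roughly this (again, filenames are placeholders; -Xmx sets the Java heap, which you should size to your machine):

dedupebymapping.sh in=mapped.sam out=deduped.sam -Xmx20g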
Dedupe is for deduplicating by sequence. So whether you deduplicate the fastq and then map, or map and then deduplicate the sam, depends on your goal - the results will differ slightly. In practice it won't usually matter much for paired reads, but single-end reads are affected much more by the deduplication methodology.
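If you do want to deduplicate by sequence before mapping, a minimal Dedupe command would look something like this (filenames are placeholders; check the script's help output for the exact flags in your version):

dedupe.sh in=reads.fq out=deduped.fq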
What kind of data are you using, and what's your experiment?