Variable Kmer warning when using bbduk
Hello,
I appear to be getting different results when running the same parameters on the same datasets. I used the below commands:
./bbduk.sh in=R1_001.fastq in2=R2_001.fastq minlen=51 mink=11 ktrim=r tbo tpe K=31 hdist=1 hdist2=1 ref=adapters.fa stats=stats_adaptor.txt out=R1_clean.fq out2=R2_clean.fq outm=ER1_adaptormatch.fq outm2=R2_adaptormatch.fq
In the first instance I had a great improvement in the data quality as reported by fastQC with no problems with kmer content in either read. However, when I had to re-run the same data with the same commands, FastQC now flagged up a warning on my R1 reads on the Kmer content. When I ran it a third time it then flagged up a warning on my R2 reads on the Kmer content.
Each time I run the above commands on the dataset I either get:
-No warning on either set of reads for Kmer content
-A warning on R1 Kmer content but not R2
-A warning on R2 Kmer content but not R1
I see two different kmer warnings, either with one or two overrepresented kmers although what kmers and their position in the read differs each time.
Could anyone shed some light on why this is happening? I've attached screenshots of the two-types of Kmer warning that get flagged by fastqc.
Many thanks for your help as always.
Header Leaderboard Ad
Collapse
BBMap (aligner for DNA/RNAseq) is now open-source and available for download.
Collapse
Announcement
Collapse
No announcement yet.
X
-
Hi Brian,
I really like callvariants.sh (simulated di-allelic SNPs and at both low (4x) and moderate (60x) sequencing coverage, filtering to keep only variants with quality score of >=Q27, callvariants.sh is more accurate than using GATK 3.7 with their recommended variant filtering). One thing though that I was wondering about, was whether it would be possible to have an option that sets variants that fail for some individuals (in multisample mode) to a missing genotype call "./." rather than whatever genotype is called by callvariants.sh
Best,
Gopo
Leave a comment:
-
Originally posted by Brian Bushnell View PostI don't like to alter sequence headers by default because that could cause the introduction of non-unique headers
Since spaces in names are not allowed by the sam format, I would expect bbmap either to silently trim at the first space (which is not nice...!) or fail straightaway with noting that if you really want to proceed you have to enable the tdr=t option. Just a thought...
Leave a comment:
-
reformat.sh has an option "underscore" which will change whitespace in sequence headers into underscores, if the extra information is important. Alternatively, as Genomax says, you can use "trimrname". Generally I don't like to alter sequence headers by default because that could cause the introduction of non-unique headers, which would break all tools, rather than just some tools
Leave a comment:
-
You can choose to truncate the names at the first space during mapping using "trd=t".
Since you already have sam/bam files you can use reformat.sh to take care of the spaces using
Code:trimrname=f For sam/bam files, trim rname/rnext fields after the first space.
Leave a comment:
-
Sorry... another issue.
I have a reference genome where the header lines contain spaces, like this:
Code:>chr1 AC:CM000663.2 gi:568336023 LN:248956422 rl:Chromosome M5:6aef897c3d6ff0c78aff06ac189178dd AS:GRCh38 NNNNN...
Code:cat -vet test.sam | head @HD^IVN:1.4^ISO:unsorted$ @SQ^ISN:chr1 AC:CM000663.2 gi:568336023 LN:248956422 rl:Chromosome M5:6aef897c3d6ff0c78aff06ac189178dd AS:GRCh38^ILN:248956422$
I think the printing of the sequence name in the sam header should strip everything after the first blank space?
Leave a comment:
-
Originally posted by Brian Bushnell View PostHi Dario and Genomax,
I though I had gotten rid of that a while ago, but I guess not - I'll investigate and fix it. It's harmless and due to a race condition for a thread finishing after it was prematurely shut down because enough reads were generated, but it looks scary.
Leave a comment:
-
1) I have not looked at Nanopore reads in several years now so I don't have any further experience in mapping them. But from what I read and hear the quality is gradually creeping upward (and my initial exposure was to very low quality data) so some of the settings can probably be relaxed; e.g., increasing to k=10 and minratio=0.25.
2) "Read Depth" (DP field) is the depth of all reads covering (or spanning) the event, while "Allele Depth" (AD field) is the same but only for those reads exhibiting the variation.
Coverage is added to an array tracking per-reference-base coverage at each position when each read is processed (and deleted bases are considered covered by a read containing that deletion). Similarly, coverage is added to each variation when a read containing that variation is processed.
The reference coverage for a variant is calculated by summing the coverage per-base in the array over the length of the event. For substitutions this is straightforward. For deletions, this is the deleted reference bases. For insertions, this is the average coverage of the bases on either side of the insertion.
It is possible for the allele coverage to be slightly higher than the total coverage using this calculation method (for example, as a result of a sudden drop in coverage on one side of an insertion event) so DP reports max(ref coverage, allele coverage) while COV reports the raw calculated coverage (they will virtually always be the same, though). AF is calculated as DP/AD to prevent AF over 1.0.
3) "Allele Fraction" is "Allele Depth"/"Read Depth". As you say, this is straightforward for SNPs but less so for indels, particularly with short reads and long deletions. It's not entirely clear what AF should really mean for a deletion, because it describes a ratio of two classes of reads - those that cover a variant location and do not indicate the variant, and those that cover a variant location and do indicate the variant. But when deletions are longer than reads there is a third case; those that land in the middle of the deletion but do not touch either end. You could, for example, have 100x coverage on either end of a deletion event, with 50x indicating the deletion and 50x indicating no deletion, and calculate an AF of 0.5. But if you had 1000x coverage somewhere in the middle of the deletion event from reads not spanning the junction on either end, that's a 3rd category and really they should not contribute to the AF (since the deleted allele would never be observed in that case), but currently they do. I'll look into changing the way that long deletion AF and DP are calculated; thanks for making me think about it.
For insertions, there is "Alternate Allele Frequency" which is allele frequency adjusted to compensate for the fact that insertions will not be called on reads that do not span the full event. So if you have a 50bp insertion and 100bp reads, only 1/3rd of the reads with bases in the inserted region would actually span the full insertion and thus possibly yield an insertion call, so a 100% true allele frequency might be reported as 33%. "Alternate Allele Frequency" would adjust the 33% observed back up to 100% based on the relative average read length and insertion length.Last edited by Brian Bushnell; 10-04-2017, 01:23 PM.
Leave a comment:
-
Hi Brian,
I have a couple of questions about using callvariants.sh and mapPacbio.sh. I am using mapPacbio.sh to align reads from a Nanopore minion to a viral sequence and looking for deletions with callvariants.sh. My questions are
1. Do you have a suggested method for mapping minion reads. I used a suggestion you provided in 2014 but I don't know if you have since updated this. Here is what I'm running.
mapPacBio.sh k=8 in=reads.fq ref=ref.fa maxreadlen=1000 minlen=200 sam=1.4 idtag ow int=f qin=33 indelhist=indelhist1.txt out=mapped.sam minratio=0.15 ignorequality slow ordered maxindel=12000 outputunmapped=f bs=bs1.sh
2. I am then calling variants with the desire to look for deletions with callvariants.sh.
callvariants.sh in=mapped.sam rarity=0.0001 out=deletions.tsv vcf=deletions.vcf ref=ref.fa extended=t border=20 minscore=10
In the output can you tell me what the meaning of "coverage" or "read Depth" or "allele depth" are? What is the difference between these? Also, what does the allele fraction mean in this case? For a SNP it's self-explanatory but for a indel it seems like it's not as straight forward.
Any insight you can provide would be extremely useful. Thanks again for all you do.
Leave a comment:
-
Hi Dario and Genomax,
I though I had gotten rid of that a while ago, but I guess not - I'll investigate and fix it. It's harmless and due to a race condition for a thread finishing after it was prematurely shut down because enough reads were generated, but it looks scary.
Leave a comment:
-
I would like to add that reformat.sh also exhibits this "issue".
Code:$ reformat.sh in=P1.fastq.gz out=stdout.fa reads=5 Input: 5 reads 1094 bases Output: 5 reads (100.00%) 1094 bases (100.00%) Time: 0.178 seconds. Reads Processed: 5 0.03k reads/sec Bases Processed: 1094 0.01m bases/sec Exception in thread "main" java.lang.RuntimeException: ReformatReads terminated in an error state; the output may be corrupt. at jgi.ReformatReads.process(ReformatReads.java:1148) at jgi.ReformatReads.main(ReformatReads.java:45)
Leave a comment:
-
bbduk.sh: "reads" parameter does not exit clean
Hi- Not sure this thread is the best place to post this...
It seems to me that the `reads` parameter makes bbduk exits with an error when the input is not fully processed.
These runs are both ok (input file has less then 1M reads):
Code:bbduk.sh in=test.fq.gz out=test.out.fq reads=-1 overwrite=true bbduk.sh in=test.fq.gz out=test.out.fq reads=1000000 overwrite=true
Code:bbduk.sh in=test.fq.gz out=test.out.fq reads=10 overwrite=true java -Djava.library.path=/home/db291g/applications/BBmap/bbmap/jni/ -ea -Xmx15023m -Xms15023m -cp /home/db291g/applications/BBmap/bbmap/current/ jgi.BBDukF in=test.fq.gz out=test.out.fq reads=10 overwrite=true Executing jgi.BBDukF [in=test.fq.gz, out=test.out.fq, reads=10, overwrite=true] BBDuk version 37.54 NOTE: No reference files specified, no trimming mode, no min avg quality, no histograms - read sequences will not be changed. Initial: Memory: max=15097m, free=14703m, used=394m Input is being processed as paired Started output streams: 0.082 seconds. Processing time: 0.017 seconds. Input: 20 reads 3011 bases. Total Removed: 0 reads (0.00%) 0 bases (0.00%) Result: 20 reads (100.00%) 3011 bases (100.00%) Time: 0.114 seconds. Reads Processed: 20 0.18k reads/sec Bases Processed: 3011 0.03m bases/sec Exception in thread "main" java.lang.RuntimeException: jgi.BBDukF terminated in an error state; the output may be corrupt. at jgi.BBDukF.process(BBDukF.java:913) at jgi.BBDukF.main(BBDukF.java:72) echo $? 1
Am I missing something here or should this be fixed?
Thanks
Dario
Leave a comment:
-
Originally posted by Brian Bushnell View PostYes, that's correct. This is especially important for BBMerge, where insert size can only be computed from reads that overlap; so if the average calculated insert size is say 270bp for 2x150bp reads, but "PercentOfPairs" is only 20%, then presumably the actual insert size is substantially higher but those pairs did not overlap so it could not be computed. It's less important for BBMap but still useful in some cases - if PercentOfPairs is very low, something probably went wrong and the insert size is not trustworthy. For example, if you have an extremely fragmented reference with contig length shorter than insert size, such that most paired reads map to different contigs, then the pairing rate will be low and the calculated insert size will be untrustworthy (shorter than the actual insert size).
I'm one of the authors of a BBMap module for the MultiQC bioinformatics log aggregation/summarization tool, and we'd love your input/help with small snippets of explanatory text for some of the outputs produced by the different BBMap tools. Would you be willing to help out? The first reasonably finished version of the BBMap module for MultiQC was finished and merged yesterday, but it unfortunately still lacks explanatory help text for most plot types. Information at the level of your reply regarding PercentOfPairs is perfect for the help text in the MultiQC reports, to aid people unfamiliar with BBMap output to interpret the plotted results included in a MultiQC report. Perhaps you even have suggestions on how to better visualize aggregations of some of the different outputs. Let me know what you think, and we'll set up an Issue in the MultiQC Github issue tracker to organize such work if you're interested.
Leave a comment:
-
Originally posted by boulund View PostHi,
I have a question about the PercentOfPairs output found in the insert size histogram output (ihist). What does it mean, percent of pairs? Is it the proportion of read pairs where an insert size could be computed?
Leave a comment:
-
Hi,
I have a question about the PercentOfPairs output found in the insert size histogram output (ihist). What does it mean, percent of pairs? Is it the proportion of read pairs where an insert size could be computed?
Leave a comment:
Latest Articles
Collapse
-
by seqadmin
Amplicon sequencing is a targeted approach that allows researchers to investigate specific regions of the genome. This technique is routinely used in applications such as variant identification, clinical research, and infectious disease surveillance. The amplicon sequencing process begins by designing primers that flank the regions of interest. The DNA sequences are then amplified through PCR (typically multiplex PCR) to produce amplicons complementary to the targets. RNA targets...-
Channel: Articles
03-21-2023, 01:49 PM -
-
by seqadmin
Targeted sequencing is an effective way to sequence and analyze specific genomic regions of interest. This method enables researchers to focus their efforts on their desired targets, as opposed to other methods like whole genome sequencing that involve the sequencing of total DNA. Utilizing targeted sequencing is an attractive option for many researchers because it is often faster, more cost-effective, and only generates applicable data. While there are many approaches...-
Channel: Articles
03-10-2023, 05:31 AM -
ad_right_rmr
Collapse
News
Collapse
Topics | Statistics | Last Post | ||
---|---|---|---|---|
Started by seqadmin, Yesterday, 12:26 PM
|
0 responses
7 views
0 likes
|
Last Post
by seqadmin
Yesterday, 12:26 PM
|
||
Started by seqadmin, 03-17-2023, 12:32 PM
|
0 responses
14 views
0 likes
|
Last Post
by seqadmin
03-17-2023, 12:32 PM
|
||
Started by seqadmin, 03-15-2023, 12:42 PM
|
0 responses
21 views
0 likes
|
Last Post
by seqadmin
03-15-2023, 12:42 PM
|
||
Started by seqadmin, 03-09-2023, 10:17 AM
|
0 responses
68 views
1 like
|
Last Post
by seqadmin
03-09-2023, 10:17 AM
|
Leave a comment: