BBMap (aligner for DNA/RNAseq) is now open-source and available for download.

Brian Bushnell replied

08-25-2017, 10:13 AM
Originally posted by jweger1988 View Post

I'm just curious why Brian said the data was really low quality. We are using a Nextseq 500 and V2 chemistry. The Q30>30 was 80% and it was actually underclustered.

I would not use the "Q30" number for anything; I consider it marketing fluff. The only way to determine data quality is to actually map it and count the number of sequencing errors. You can see from the graph that when you do that, the average starts at Q11 (which is incredibly low and very unusual), goes up to a peak of just under 30, and drops to Q12 at the end, which is also very low. At no point does the average quality even reach Q30.

You can see this visually in IGV - the screen is always littered with random blips indicating indels or mismatches to the reference, which is very unusual for Illumina data (at least, HiSeq 2500 Illumina data). The screen should look very clean except where mismatches to the reference form an obvious column, indicating a real mutation.

We only have access to Nextseq at the moment, is there anything we can do to improve this or is it just how it is with Nextseq?

You might want to make sure you are using the latest chemistry and software, and talk to your Illumina rep about your concerns with the quality. We are still using NextSeq routinely, and the quality is usually much better than that, but I don't know the reason.

I don't think the platform is appropriate for low-frequency variant-calling anyway, but on any platform, I'd recommend paired-end reads for that (preferably overlapping so they can be merged), and definitely not strand-specific protocols, which incur bias (particularly, certain genomic features, read in a certain direction, are prone to miscalls). How did you end up with only minus-strand reads, anyway?
Leave a comment:
jweger1988 replied

08-25-2017, 03:47 AM
Originally posted by GenoMax View Post

A MiSeq/HiSeq generally generates better data so if you do have the option of using one of those then do switch. Especially for lowfreq SNP calls.

Thanks Genomax.

I'm just curious why Brian said the data was really low quality. We are using a Nextseq 500 and V2 chemistry. The Q30>30 was 80% and it was actually underclustered.

We only have access to Nextseq at the moment, is there anything we can do to improve this or is it just how it is with Nextseq?
Leave a comment:
GenoMax replied

08-25-2017, 03:07 AM
A MiSeq/HiSeq generally generates better data so if you do have the option of using one of those then do switch. Especially for lowfreq SNP calls.
Leave a comment:
jweger1988 replied

08-25-2017, 12:28 AM
Brian,

Wow, that's very surprising to me.

things we used to see on our NextSeqs until it eventually got better after various chemistry and software revisions

Are you still currently using a nextseq? Do you recommend a MiSeq or a Hiseq instead? I assume by chemistry and software revisions you mean those put out by Illumina? I wonder if we don't have the latest software?

Any suggestion to improve this in the future?
Leave a comment:
Brian Bushnell replied

08-24-2017, 04:11 PM
Unfortunately, it looks like your data is very low quality and has inaccurate quality scores, things we used to see on our NextSeqs until it eventually got better after various chemistry and software revisions. I recommend:

1) Adapter-trimming (you have >5% of reads with adapters)
2) Mapping
3) Recalibrating the raw reads
4) Adapter-trimming and quality-trimming the recalibrated reads, with "ftm=5"
5) Mapping again
6) Calling variants

E.g.

Code:

bbduk.sh in=reads.fq.gz out=trimmed0.fq.gz ktrim=r k=23 mink=11 hdist=1 minlen=100 ref=adapters bbmap.sh in=trimmed0.fq.gz ref=ref.fa out=mapped0.sam nodisk qahist=qahist.txt qhist=qhist.txt mhist=mhist.txt rm trimmed0.fq.gz calctruequality.sh in=mapped0.sam rm mapped0.sam bbduk.sh in=reads.fq.gz out=recal.fq.gz bbduk.sh in=recal.fq.gz out=trimmed.fq.gz qtrim=rl trimq=14 ktrim=r k=23 mink=11 hdist=1 minlen=100 ref=adapters bbmap.sh in=trimmed.fq ref=ref.fa out=mapped.sam nodisk bs=bs.sh qahist=qahist_rc.txt qhist=qhist_rc.txt mhist=mhist_rc.txt; sh bs.sh

Result of recalibration:
Before:

After:

However, I am not sure you will get anything useful out of this data. We have found that NextSeq runs are just not trustworthy for calling variants at a frequency of below 5%. I looked at lofreq's highest-quality variant, with a Q-score of 1392, and it was seen in 2 reads, with a Q-score of 14 in each case, once within 3 bp of the read end (probably in adapter sequence because there are 2 mismatches and an insertion next to it there) and both reads had a lot of other mismatches. That is the the highest quality variation, and it has zero chance of being correct.

The reason you were not getting any variant calls with CallVariants except at the ends is because for some reason you have stranded data; all the reads only map to the minus strand. In that situation you need to set "usebias=f" and "minstrandratio=0" to turn off strand-bias filters. It still won't call much, though. 95 variants are called using this command (on the trimmed recalibrated reads):

Code:

callvariants.sh in=recal.sam.gz ref=ref.fa rarity=0.01 out=foo3.txt vcf=foo3.vcf minreads=2 minscore=10 usebias=f minstrandratio=0

But only 12 are above Q20 which is my normal cutoff (default is minscore=20, which I recommend), so I'd suggest perhaps looking at those. Maybe they're real. The insertion at 9225 looks plausible.
Attached Files

before.png (21.5 KB, 319 views)

after.png (24.8 KB, 334 views)
Leave a comment:
Brian Bushnell replied

08-24-2017, 10:42 AM
Originally posted by jweger1988 View Post

Brian,

Thanks for the splitsam, that worked great!

As for the other question by the Ns I meant read name I think? Sorry my grasp of nomenclature with these things is not great.

BBmap
NS500697:37:HH5V7BGXY:2:23303:13708:4313 1:N:0:TAAGGCGA+ATATAGAG

Bowtie
NS500697:37:HH5V7BGXY:4:13502:9896:14279

Yep, BBMap uses the full read name whereas Bowtie truncates the read name at the first whitespace. As Genomax noted, you can add the flag "trd" (trimreaddescriptions) to BBMap to produce truncated names, you you can reprocess the already-mapped sam file with Reformat using the "trd" flag to do that also.

Thanks for the feedback on CallVariants! It's certainly odd that there seems to be nothing in common between the two results. Would it be possible for you to email me the bam and reference so I can examine it in more detail?

Last edited by Brian Bushnell; 08-24-2017, 10:46 AM.
Leave a comment:
jweger1988 replied

08-24-2017, 09:31 AM
As for the comparison for lofreq and callvariants, I can never seem to get callvariants to call what look like good snps for my samples. Maybe I am using it wrong in some way.

The sensitivity parameters are a little trick to play with in lofreq. But with callvariants and lofreq with standard settings at rarity=0.01, callvariants calls completely different snps as compared to lofreq. After manual inspection of the bam file it looks like lofreq is more accurate. Could definitely be some paremeter issues though. I'm open to any suggestions at changing these, as I'm a big fan of the bbmap suite.

Here is a link to the vcf files and the script I'm running if you are interested.

Google Drive: Sign-in

https://drive.google.com/drive/folders/0B3EYKskcePi6NWJxUUVSYTk1V1k?usp=sharing

Access Google Drive with a Google account (for personal use) or Google Workspace account (for business use).

Thanks as always.
Leave a comment:
jweger1988 replied

08-24-2017, 09:23 AM
Brian,

Thanks for the splitsam, that worked great!

As for the other question by the Ns I meant read name I think? Sorry my grasp of nomenclature with these things is not great.

BBmap
NS500697:37:HH5V7BGXY:2:23303:13708:4313 1:N:0:TAAGGCGA+ATATAGAG

Bowtie
NS500697:37:HH5V7BGXY:4:13502:9896:14279
Leave a comment:
Brian Bushnell replied

08-23-2017, 02:53 PM
After mapping, you can split the sam file like this:

Code:

splitsam.sh mapped.sam plus.sam minus.sam unmapped.sam reformat.sh in=plus.sam out=plus.fq reformat.sh in=minus.sam out=minus.fq rcomp cat plus.fq minus.fq > all.fq

As for your previous question (sorry, somehow I did not notice it) - I think Genomax may be right that the issue is the read names, but it's hard to say. Also, it's hard to see what's going on here since none of the reads are mapped...

Incidentally, what do you mean by this?

I did notice that bbmap bams have the N and bowtie does not.

Do you mean N in the read name, or in the cigar string, or in the read sequence?

Also, I'd be interested to see how the output of callvariants.sh (with appropriate sensitivity parameters) compares to lofreq, if you want to give it a try.
Last edited by Brian Bushnell; 08-23-2017, 03:06 PM.
Leave a comment:
jweger1988 replied

08-23-2017, 01:05 PM
Hi Brian,

I'm trying to call some kmers with kmercountexact.sh and am running into some troubles.

I am merging paired end reads with bbmerge then mapping to a small viral reference file with bbmap. I am then using reformat.sh to trim to a smaller read length and then calling/counting kmers with kmercountexact.

The problem I am having is that after mapping I'm getting reads that are both in the forward and reverse sense. So when calling kmers I get them from both ends of the sequence I'm interested in. Is there any way to reverse complement just the reads that are in the reverse complement in relation to the reference sequence?

Thanks in advance.
Leave a comment:
jweger1988 replied

08-18-2017, 05:01 AM
Originally posted by GenoMax View Post

I can't think of anything now. Let us see if Brian has any other suggestions.

lofreq viterbi seems to be for re-alignment. I wonder if you can do without that with BBMap generated files.

It seems to end up calling weird indels when viterbi is not run that are clearly not real. Thanks for the help. I'll wait for Brian and see what he thinks.
Leave a comment:
GenoMax replied

08-18-2017, 04:54 AM
I can't think of anything now. Let us see if Brian has any other suggestions.

lofreq viterbi seems to be for re-alignment. I wonder if you can do without that with BBMap generated files.
Leave a comment:
jweger1988 replied

08-18-2017, 04:12 AM
So I tried

reformat.sh in=1_bbmap.sam out=1_bbmap_reform1.bam trd=t sam=1.3

and just

reformat.sh in=1_bbmap.sam out=1_bbmap_reform1.bam trd=t

I then tried running lofreq viterbi either with sorting or without and same error.

Segmentation fault (core dumped)

I did notice that bbmap bams have the N and bowtie does not.

Any more thoughts?

Thanks for the quick reply.
Leave a comment:
GenoMax replied

08-18-2017, 03:30 AM
I wonder if viterbi program is having trouble with (NS500697:37:HH5V7BGXY:2:23303:13708:4313 1:N:0:TAAGGCGA+ATATAGAG ) the tag sequence which BBMap preserves but Bowtie does not.

Can you "reformat" the bbmap alignment using "trd=t" flag and see if that helps?
Leave a comment:

jweger1988 replied

08-17-2017, 10:55 PM

Hi Brian,

I'm having an issue I'm wondering you can help with.

I am using bbmap to align to a small viral genome. The resulting bam or sam file give me an error something like : Segmentation fault (core dumped) - every time I run the program lofreq viterbi, which is used before variant calling with lofreq. When I generate a bam with the same reference file and fastq files with bowtie2 I do not get the same error.

I have tried using reformat.sh to convert to cigar 1.3 but this doesn't seem to work. The fasta file has only one sequence and there are no spaces in the name.

Here is an example of the error with the command -

Code:

lofreq viterbi -f ZIKV_PRVABC59_CDS.fa -o bbmapv.bam bbmap.bam
Segmentation fault (core dumped)

It runs OK with bowtie produced bam.

Here is the output from the command head with either bowtie or bbmap produced sam file.

bbmap

Code:

@HD     VN:1.4  SO:unsorted
@SQ     SN:ZIKV_PRVABC59_CDS    LN:10272
@PG     ID:BBMap        PN:BBMap        VN:37.47        CL:java -Djava.library.path=/home/weger/bb
p/jni/ -ea -Xmx216859m align2.BBMap build=1 overwrite=true fastareadlen=500 in1=1_ctff_R1.fq.gz in
1_ctff_R2.fq.gz out=1_bbmap.sam ref=ZIKV_PRVABC59_CDS.fa maq=25
NS500697:37:HH5V7BGXY:2:23303:13708:4313 1:N:0:TAAGGCGA+ATATAGAG        77      *       0       0
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT   AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
EEEEEEEEEEAE/EEEA/AEEEE/EA
NS500697:37:HH5V7BGXY:2:23303:13708:4313 1:N:0:TAAGGCGA+ATATAGAG        141     *       0       0
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA      AAAE<6<<<66<<<<<6<<6666666666666E66666666E66E666A6666A66AA
E6EAA6EAAAEAEAEAEEAEEAEEEEEAE6EAEAAEAAAEEEAEAEAAEEEEAEAEEAEEEEEEEEAEEEEEEEE
NS500697:37:HH5V7BGXY:2:21207:1798:4356 1:N:0:TAAGGCGA+ATAGAGAC 77      *       0       0       *
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT  AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
EEEEEEEEEEEEEAEEE//EEEEE//E
NS500697:37:HH5V7BGXY:2:21207:1798:4356 1:N:0:TAAGGCGA+ATAGAGAC 141     *       0       0       *
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA 6666666666<<<<<<666666666666666666666666
NS500697:45:HJGYTAFXX:2:11101:23195:4485 1:N:0:TAAGGCGA+ATAGAGAG        77      *       0       0
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
TTTTTTTTTTTTTTTTTTTTTT  AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEAE<EE<A/EA//E
NS500697:45:HJGYTAFXX:2:11101:23195:4485 1:N:0:TAAGGCGA+ATAGAGAG        141     *       0       0
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA A6A6<<<6<<E<<<<<6<<6<6<<<<<6666<<<<6666E66E6666666EE6666E66EEEE666
6AAAAAAAEA6EAAEAAE6E6EAE6EAEEAEAAEAEAEEEAEEEAAEEAEEEEEEAEEEEAEAE
NS500697:37:HH5V7BGXY:2:22101:16007:4646 1:N:0:TAAGGCGA+ATAGAGAG        77      *       0       0
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT      AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
EEEEEEEEEEEEE/AE/EEA//A

bowtie2

Code:

@HD     VN:1.0  SO:unsorted
@SQ     SN:ZIKV_PRVABC59_CDS    LN:10272
@PG     ID:bowtie2      PN:bowtie2      VN:2.2.9        CL:"/apps/bowtie2-2.2.9/bowtie2-align-s --
apper basic-0 -x /data/ebel_lab/Weger/indexed_fasta_bowtie/ZIKV__PRVABC59_CDS -S 1_bowtie.sam -1 /
p/31806.inpipe1 -2 /tmp/31806.inpipe2"
NS500697:37:HH5V7BGXY:4:13502:9896:14279        77      *       0       0       *       *       0
AGCCTAAGTATGTCAATACAACAAATACTTACTGTTTCATTTCTAGTAATGAAAAAAAAAAAAAAAA     AAAAAEEEEEEEEEEEEEEEEEEEEE
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE/EE        YT:Z:UP
NS500697:37:HH5V7BGXY:4:13502:9896:14279        141     *       0       0       *       *       0
CAGAGACATACCCCTCCTTACAAATAAAAATGTTCTTTATAGATGTAAATTTATTTTACAAAAATGTTTCAAAATGACCAGATGAAAATCATCCTTAT
CAGAAAGACTTGTTTTTTTTTTTTTAAT    AAAAAEEEEEEE/EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
EEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEAEEE/AEEEEEEEEEEEEEEEE/<E   YT:Z:UP
NS500697:37:HH5V7BGXY:4:12605:4827:14803        77      *       0       0       *       *       0
AGCCTAAGTATGTCAATACAACAAATACTTACTGTTTCATTTCTAGTAATGAAAAAAAAAAAAACAA     AAAAAEEEEEEEEEEEEEEEEEEEEE
EEEEEEEEEEEEEEEEEAEEAEEEEEEEEEEEEEEEEAEE        YT:Z:UP
NS500697:37:HH5V7BGXY:4:12605:4827:14803        141     *       0       0       *       *       0
CAGAGACATACCCCTCCTTACAAATAAAAATGTTCTTTATAGATGTAAATTTATTTTACAAAAATGTTTCAAAATGACCAGATGAAAATCATCCTTAT
CAGAAAGACTTGTTTTTTTTTTTTTAATT   AAAAAEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEA<AEEEEEEEEEEEEEEEEEEEEEEE<<<E  YT:Z:UP
NS500697:37:HH5V7BGXY:4:22607:4886:15556        77      *       0       0       *       *       0
AGCCTAAGTATGTCAATACAACAAATACTTACTGTTTCATTTCTAGTAATGAAAAAAAAAAAAAAAA     AAAAAEEEEEEEEEAEEAEEEEEEEE
E/EEEAE/AAEEEEEEAAEEEEEAA6EEE6EEEEEEEE/E        YT:Z:UP
NS500697:37:HH5V7BGXY:4:22607:4886:15556        141     *       0       0       *       *       0
AGAGACATACCCCTCCTTACAAATAAAAATGTTCTTTATAGATGTAAATTTATTTTACAAAAATGTTTCAAAATGACCAGATGAAAATCATCCTTATG
AGAAAGACTTGTTTTTTTTTTTTTTAATTA  AAAAEEEEEEA/EAEEEEEEAEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEE
EEEEEEEEEEAEEAEEEEEEEAEEEEEEAEAAE/AEAAEEEEE6E/EEEAEEAEEEEA//<EA YT:Z:UP
NS500697:37:HH5V7BGXY:4:12507:9984:15827        77      *       0       0       *       *       0
AGCCTAAGTATGTCAATACAACAAATACTTACTGTTTCATTTCTAGTAATGAAAAAAAAAAAAAAAAGTCTTTAT     AAAAAEAEEEEEEEEEEE
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE6E6EEE/EEEEE/EE/E6EAE/E        YT:Z:UP

Any ideas you might have would be greatly appreciated.

Thanks in advance.

Last edited by GenoMax; 08-18-2017, 03:23 AM. Reason: Added [CODE] tags

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 31 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 32 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 28 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 53 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Latest Articles

ad_right_rmr

News