Seqanswers Leaderboard Ad

**Brian Bushnell** · 08-09-2017, 02:42 PM

Originally posted by darthsequencer View Post

Thanks that's helpful. On the note of loading references - is there a way to use wildcards with the input and output of bbwrap?

Ah, sorry, but nope. You can, however, do something like...

Code:

ls *.fastq.gz > ls.txt

...then replace '\n' with ',' using a text editor (like Notepad++). I'm sure there's a simpler sed/awk solution, too.

**lucila** · 08-15-2017, 07:18 AM

Hi Brian,
thank you so much for this useful tool. I would like to ask you a question. Does BBMap generate a list of transcripts as a result of the mapping of the reads? What I mean is if this tool generates a fasta file of reconstructed transcripts based on the reference genome and the reads used for mapping.
I need this because the genome that I am using is not annotated (I mean, I do not have a gff) and I want to generate a reference transcriptome using my reads and the genome.

Thank you again!
Best,
Lucila.

**Brian Bushnell** · 08-15-2017, 12:20 PM

Originally posted by lucila View Post

Hi Brian,
thank you so much for this useful tool. I would like to ask you a question. Does BBMap generate a list of transcripts as a result of the mapping of the reads? What I mean is if this tool generates a fasta file of reconstructed transcripts based on the reference genome and the reads used for mapping.
I need this because the genome that I am using is not annotated (I mean, I do not have a gff) and I want to generate a reference transcriptome using my reads and the genome.

Hi Lucila,

No, it does not, but I have been considering adding gff output, at least for exon boundaries, if not full transcripts. For generating a reference transcriptome, you can do either mapping or assembly; so, assemblers like Trinity are also an option.

I have not tried it, but you may also want to examine this:

GitHub - enormandeau/gawn: Genome Annotation Without Nightmares

https://github.com/enormandeau/gawn

Genome Annotation Without Nightmares. Contribute to enormandeau/gawn development by creating an account on GitHub.

**lucila** · 08-16-2017, 10:35 AM

Originally posted by Brian Bushnell View Post

Hi Lucila,

No, it does not, but I have been considering adding gff output, at least for exon boundaries, if not full transcripts. For generating a reference transcriptome, you can do either mapping or assembly; so, assemblers like Trinity are also an option.

I have not tried it, but you may also want to examine this:
https://github.com/enormandeau/gawn

Thank you Brian for the information!!
Cheers,
Lucila.

**jweger1988** · 08-17-2017, 10:55 PM

Hi Brian,

I'm having an issue I'm wondering you can help with.

I am using bbmap to align to a small viral genome. The resulting bam or sam file give me an error something like : Segmentation fault (core dumped) - every time I run the program lofreq viterbi, which is used before variant calling with lofreq. When I generate a bam with the same reference file and fastq files with bowtie2 I do not get the same error.

I have tried using reformat.sh to convert to cigar 1.3 but this doesn't seem to work. The fasta file has only one sequence and there are no spaces in the name.

Here is an example of the error with the command -

Code:

lofreq viterbi -f ZIKV_PRVABC59_CDS.fa -o bbmapv.bam bbmap.bam
Segmentation fault (core dumped)

It runs OK with bowtie produced bam.

Here is the output from the command head with either bowtie or bbmap produced sam file.

bbmap

Code:

@HD     VN:1.4  SO:unsorted
@SQ     SN:ZIKV_PRVABC59_CDS    LN:10272
@PG     ID:BBMap        PN:BBMap        VN:37.47        CL:java -Djava.library.path=/home/weger/bb
p/jni/ -ea -Xmx216859m align2.BBMap build=1 overwrite=true fastareadlen=500 in1=1_ctff_R1.fq.gz in
1_ctff_R2.fq.gz out=1_bbmap.sam ref=ZIKV_PRVABC59_CDS.fa maq=25
NS500697:37:HH5V7BGXY:2:23303:13708:4313 1:N:0:TAAGGCGA+ATATAGAG        77      *       0       0
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT   AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
EEEEEEEEEEAE/EEEA/AEEEE/EA
NS500697:37:HH5V7BGXY:2:23303:13708:4313 1:N:0:TAAGGCGA+ATATAGAG        141     *       0       0
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA      AAAE<6<<<66<<<<<6<<6666666666666E66666666E66E666A6666A66AA
E6EAA6EAAAEAEAEAEEAEEAEEEEEAE6EAEAAEAAAEEEAEAEAAEEEEAEAEEAEEEEEEEEAEEEEEEEE
NS500697:37:HH5V7BGXY:2:21207:1798:4356 1:N:0:TAAGGCGA+ATAGAGAC 77      *       0       0       *
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT  AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
EEEEEEEEEEEEEAEEE//EEEEE//E
NS500697:37:HH5V7BGXY:2:21207:1798:4356 1:N:0:TAAGGCGA+ATAGAGAC 141     *       0       0       *
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA 6666666666<<<<<<666666666666666666666666
NS500697:45:HJGYTAFXX:2:11101:23195:4485 1:N:0:TAAGGCGA+ATAGAGAG        77      *       0       0
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
TTTTTTTTTTTTTTTTTTTTTT  AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEAE<EE<A/EA//E
NS500697:45:HJGYTAFXX:2:11101:23195:4485 1:N:0:TAAGGCGA+ATAGAGAG        141     *       0       0
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA A6A6<<<6<<E<<<<<6<<6<6<<<<<6666<<<<6666E66E6666666EE6666E66EEEE666
6AAAAAAAEA6EAAEAAE6E6EAE6EAEEAEAAEAEAEEEAEEEAAEEAEEEEEEAEEEEAEAE
NS500697:37:HH5V7BGXY:2:22101:16007:4646 1:N:0:TAAGGCGA+ATAGAGAG        77      *       0       0
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT      AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
EEEEEEEEEEEEE/AE/EEA//A

bowtie2

Code:

@HD     VN:1.0  SO:unsorted
@SQ     SN:ZIKV_PRVABC59_CDS    LN:10272
@PG     ID:bowtie2      PN:bowtie2      VN:2.2.9        CL:"/apps/bowtie2-2.2.9/bowtie2-align-s --
apper basic-0 -x /data/ebel_lab/Weger/indexed_fasta_bowtie/ZIKV__PRVABC59_CDS -S 1_bowtie.sam -1 /
p/31806.inpipe1 -2 /tmp/31806.inpipe2"
NS500697:37:HH5V7BGXY:4:13502:9896:14279        77      *       0       0       *       *       0
AGCCTAAGTATGTCAATACAACAAATACTTACTGTTTCATTTCTAGTAATGAAAAAAAAAAAAAAAA     AAAAAEEEEEEEEEEEEEEEEEEEEE
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE/EE        YT:Z:UP
NS500697:37:HH5V7BGXY:4:13502:9896:14279        141     *       0       0       *       *       0
CAGAGACATACCCCTCCTTACAAATAAAAATGTTCTTTATAGATGTAAATTTATTTTACAAAAATGTTTCAAAATGACCAGATGAAAATCATCCTTAT
CAGAAAGACTTGTTTTTTTTTTTTTAAT    AAAAAEEEEEEE/EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
EEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEAEEE/AEEEEEEEEEEEEEEEE/<E   YT:Z:UP
NS500697:37:HH5V7BGXY:4:12605:4827:14803        77      *       0       0       *       *       0
AGCCTAAGTATGTCAATACAACAAATACTTACTGTTTCATTTCTAGTAATGAAAAAAAAAAAAACAA     AAAAAEEEEEEEEEEEEEEEEEEEEE
EEEEEEEEEEEEEEEEEAEEAEEEEEEEEEEEEEEEEAEE        YT:Z:UP
NS500697:37:HH5V7BGXY:4:12605:4827:14803        141     *       0       0       *       *       0
CAGAGACATACCCCTCCTTACAAATAAAAATGTTCTTTATAGATGTAAATTTATTTTACAAAAATGTTTCAAAATGACCAGATGAAAATCATCCTTAT
CAGAAAGACTTGTTTTTTTTTTTTTAATT   AAAAAEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEA<AEEEEEEEEEEEEEEEEEEEEEEE<<<E  YT:Z:UP
NS500697:37:HH5V7BGXY:4:22607:4886:15556        77      *       0       0       *       *       0
AGCCTAAGTATGTCAATACAACAAATACTTACTGTTTCATTTCTAGTAATGAAAAAAAAAAAAAAAA     AAAAAEEEEEEEEEAEEAEEEEEEEE
E/EEEAE/AAEEEEEEAAEEEEEAA6EEE6EEEEEEEE/E        YT:Z:UP
NS500697:37:HH5V7BGXY:4:22607:4886:15556        141     *       0       0       *       *       0
AGAGACATACCCCTCCTTACAAATAAAAATGTTCTTTATAGATGTAAATTTATTTTACAAAAATGTTTCAAAATGACCAGATGAAAATCATCCTTATG
AGAAAGACTTGTTTTTTTTTTTTTTAATTA  AAAAEEEEEEA/EAEEEEEEAEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEE
EEEEEEEEEEAEEAEEEEEEEAEEEEEEAEAAE/AEAAEEEEE6E/EEEAEEAEEEEA//<EA YT:Z:UP
NS500697:37:HH5V7BGXY:4:12507:9984:15827        77      *       0       0       *       *       0
AGCCTAAGTATGTCAATACAACAAATACTTACTGTTTCATTTCTAGTAATGAAAAAAAAAAAAAAAAGTCTTTAT     AAAAAEAEEEEEEEEEEE
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE6E6EEE/EEEEE/EE/E6EAE/E        YT:Z:UP

Any ideas you might have would be greatly appreciated.

Thanks in advance.

**GenoMax** · 08-18-2017, 03:30 AM

I wonder if viterbi program is having trouble with (NS500697:37:HH5V7BGXY:2:23303:13708:4313 1:N:0:TAAGGCGA+ATATAGAG ) the tag sequence which BBMap preserves but Bowtie does not.

Can you "reformat" the bbmap alignment using "trd=t" flag and see if that helps?

**jweger1988** · 08-18-2017, 04:12 AM

So I tried

reformat.sh in=1_bbmap.sam out=1_bbmap_reform1.bam trd=t sam=1.3

and just

reformat.sh in=1_bbmap.sam out=1_bbmap_reform1.bam trd=t

I then tried running lofreq viterbi either with sorting or without and same error.

Segmentation fault (core dumped)

I did notice that bbmap bams have the N and bowtie does not.

Any more thoughts?

Thanks for the quick reply.

**GenoMax** · 08-18-2017, 04:54 AM

I can't think of anything now. Let us see if Brian has any other suggestions.

lofreq viterbi seems to be for re-alignment. I wonder if you can do without that with BBMap generated files.

**jweger1988** · 08-18-2017, 05:01 AM

Originally posted by GenoMax View Post

I can't think of anything now. Let us see if Brian has any other suggestions.

lofreq viterbi seems to be for re-alignment. I wonder if you can do without that with BBMap generated files.

It seems to end up calling weird indels when viterbi is not run that are clearly not real. Thanks for the help. I'll wait for Brian and see what he thinks.

**jweger1988** · 08-23-2017, 01:05 PM

Hi Brian,

I'm trying to call some kmers with kmercountexact.sh and am running into some troubles.

I am merging paired end reads with bbmerge then mapping to a small viral reference file with bbmap. I am then using reformat.sh to trim to a smaller read length and then calling/counting kmers with kmercountexact.

The problem I am having is that after mapping I'm getting reads that are both in the forward and reverse sense. So when calling kmers I get them from both ends of the sequence I'm interested in. Is there any way to reverse complement just the reads that are in the reverse complement in relation to the reference sequence?

Thanks in advance.

**Brian Bushnell** · 08-23-2017, 02:53 PM

After mapping, you can split the sam file like this:

Code:

splitsam.sh mapped.sam plus.sam minus.sam unmapped.sam
reformat.sh in=plus.sam out=plus.fq
reformat.sh in=minus.sam out=minus.fq rcomp
cat plus.fq minus.fq > all.fq

As for your previous question (sorry, somehow I did not notice it) - I think Genomax may be right that the issue is the read names, but it's hard to say. Also, it's hard to see what's going on here since none of the reads are mapped...

Incidentally, what do you mean by this?

I did notice that bbmap bams have the N and bowtie does not.

Do you mean N in the read name, or in the cigar string, or in the read sequence?

Also, I'd be interested to see how the output of callvariants.sh (with appropriate sensitivity parameters) compares to lofreq, if you want to give it a try.

**jweger1988** · 08-24-2017, 09:23 AM

Brian,

Thanks for the splitsam, that worked great!

As for the other question by the Ns I meant read name I think? Sorry my grasp of nomenclature with these things is not great.

BBmap
NS500697:37:HH5V7BGXY:2:23303:13708:4313 1:N:0:TAAGGCGA+ATATAGAG

Bowtie
NS500697:37:HH5V7BGXY:4:13502:9896:14279

**jweger1988** · 08-24-2017, 09:31 AM

As for the comparison for lofreq and callvariants, I can never seem to get callvariants to call what look like good snps for my samples. Maybe I am using it wrong in some way.

The sensitivity parameters are a little trick to play with in lofreq. But with callvariants and lofreq with standard settings at rarity=0.01, callvariants calls completely different snps as compared to lofreq. After manual inspection of the bam file it looks like lofreq is more accurate. Could definitely be some paremeter issues though. I'm open to any suggestions at changing these, as I'm a big fan of the bbmap suite.

Here is a link to the vcf files and the script I'm running if you are interested.

Google Drive: Sign-in

https://drive.google.com/drive/folders/0B3EYKskcePi6NWJxUUVSYTk1V1k?usp=sharing

Access Google Drive with a Google account (for personal use) or Google Workspace account (for business use).

Thanks as always.

**Brian Bushnell** · 08-24-2017, 10:42 AM

Originally posted by jweger1988 View Post

Brian,

Thanks for the splitsam, that worked great!

As for the other question by the Ns I meant read name I think? Sorry my grasp of nomenclature with these things is not great.

BBmap
NS500697:37:HH5V7BGXY:2:23303:13708:4313 1:N:0:TAAGGCGA+ATATAGAG

Bowtie
NS500697:37:HH5V7BGXY:4:13502:9896:14279

Yep, BBMap uses the full read name whereas Bowtie truncates the read name at the first whitespace. As Genomax noted, you can add the flag "trd" (trimreaddescriptions) to BBMap to produce truncated names, you you can reprocess the already-mapped sam file with Reformat using the "trd" flag to do that also.

Thanks for the feedback on CallVariants! It's certainly odd that there seems to be nothing in common between the two results. Would it be possible for you to email me the bam and reference so I can examine it in more detail?

**Brian Bushnell** · 08-24-2017, 04:11 PM

Unfortunately, it looks like your data is very low quality and has inaccurate quality scores, things we used to see on our NextSeqs until it eventually got better after various chemistry and software revisions. I recommend:

1) Adapter-trimming (you have >5% of reads with adapters)
2) Mapping
3) Recalibrating the raw reads
4) Adapter-trimming and quality-trimming the recalibrated reads, with "ftm=5"
5) Mapping again
6) Calling variants

E.g.

Code:

bbduk.sh in=reads.fq.gz out=trimmed0.fq.gz ktrim=r k=23 mink=11 hdist=1 minlen=100 ref=adapters
bbmap.sh in=trimmed0.fq.gz ref=ref.fa out=mapped0.sam nodisk qahist=qahist.txt qhist=qhist.txt mhist=mhist.txt
rm trimmed0.fq.gz
calctruequality.sh in=mapped0.sam
rm mapped0.sam
bbduk.sh in=reads.fq.gz out=recal.fq.gz
bbduk.sh in=recal.fq.gz out=trimmed.fq.gz qtrim=rl trimq=14 ktrim=r k=23 mink=11 hdist=1 minlen=100 ref=adapters
bbmap.sh in=trimmed.fq ref=ref.fa out=mapped.sam nodisk bs=bs.sh  qahist=qahist_rc.txt qhist=qhist_rc.txt mhist=mhist_rc.txt; sh bs.sh

Result of recalibration:
Before:

After:

However, I am not sure you will get anything useful out of this data. We have found that NextSeq runs are just not trustworthy for calling variants at a frequency of below 5%. I looked at lofreq's highest-quality variant, with a Q-score of 1392, and it was seen in 2 reads, with a Q-score of 14 in each case, once within 3 bp of the read end (probably in adapter sequence because there are 2 mismatches and an insertion next to it there) and both reads had a lot of other mismatches. That is the the highest quality variation, and it has zero chance of being correct.

The reason you were not getting any variant calls with CallVariants except at the ends is because for some reason you have stranded data; all the reads only map to the minus strand. In that situation you need to set "usebias=f" and "minstrandratio=0" to turn off strand-bias filters. It still won't call much, though. 95 variants are called using this command (on the trimmed recalibrated reads):

Code:

callvariants.sh in=recal.sam.gz ref=ref.fa rarity=0.01 out=foo3.txt vcf=foo3.vcf minreads=2 minscore=10 usebias=f minstrandratio=0

But only 12 are above Q20 which is my normal cutoff (default is minscore=20, which I recommend), so I'd suggest perhaps looking at those. Maybe they're real. The insertion at 9225 looks plausible.

Attached Files

Topics	Statistics	Last Post
Expanding the Horizons of Cellular Research with the Single Cell Atlas by seqadmin Started by seqadmin, Yesterday, 11:49 AM	0 responses 15 views 0 likes	Last Post by seqadmin Yesterday, 11:49 AM
Genetic Variants and Diabetes Risk in Childhood Cancer Survivors by seqadmin Started by seqadmin, 04-24-2024, 08:47 AM	0 responses 16 views 0 likes	Last Post by seqadmin 04-24-2024, 08:47 AM
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 62 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 60 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News