**Brian Bushnell** · 09-07-2016, 01:03 PM

Originally posted by JVGen

Thanks Brian,

I took the de novo approach to try to eliminate contaminating sequences. We're all working on HIV, and it's highly mutated, so it's hard to remove contaminants by DNA sequence alone. However, since I know the approximate size of my PCR amplicons (it's different in each case because of DNA deletions), I know what size my contig should be after de novo assembly. By using stringent assembly parameters that require a ~30 bp overlap with maybe 1 mismatch, I can force contaminants to be assembled into a separate contig. I can then extract consensus sequences from each of these contigs and filter them so that only contigs greater than ~150 bp are used in a subsequent map to reference, which should remove any minor contaminants that were present. I can then extract the consensus from this alignment, which should represent the original PCR amplicon; any contaminants that might have made it into my contig should be lost, as they are outnumbered by the 100s.

Does this workflow make sense for my application? I'm working on a desktop Mac, so I have limited options. I've been told that Spades might be a better assembler for me, but I think I'd need to purchase a server. With the little coding experience I have, I'm a bit nervous to invest the money, lest we never get the thing to work.

Are you available for hire as a consultant?

Thanks,
Jake

Hi Jake,

It might be helpful in this case if you could generate a kmer frequency histogram to see whether the contaminant and non-contaminant sequences are easily separable by depth alone. If so, there are a couple of easy ways to remove the contaminants. You can generate a kmer-frequency histogram with kmercountexact or BBNorm; just attach the text file to this thread. Normally I look at it in a log-log plot.
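As a rough illustration of the log-log view mentioned above, here is a minimal Python sketch. It assumes the histogram is a tab-separated text file whose first two columns are depth and count, with header lines starting with `#`; that layout is an assumption and should be checked against the actual file. The function names are hypothetical, and the example values are made up rather than real data.

```python
import math

def read_khist(lines):
    """Parse (depth, count) pairs; assumed layout is depth<TAB>count per line."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        fields = line.split("\t")
        depth, count = int(fields[0]), int(fields[1])
        if depth > 0 and count > 0:  # log10 is undefined for zero
            pairs.append((depth, count))
    return pairs

def to_log_log(pairs):
    """Convert (depth, count) pairs to (log10 depth, log10 count) for plotting."""
    return [(math.log10(d), math.log10(c)) for d, c in pairs]

# Tiny made-up example, not real data
example = ["#Depth\tCount", "1\t1000", "10\t100", "100\t10"]
points = to_log_log(read_khist(example))
```

If contaminant and target kmers sit at clearly different depths, they would show up as separate peaks when these points are plotted.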

What assembler are you currently using, by the way? I've had poor results with Spades on viruses and better results with Tadpole. But that was with raw viral sequence; amplicon sequence may give different results.

As for consultancy, I've sent you a PM.

-Brian

**Brian Bushnell** · 11-07-2016, 09:35 AM

Originally posted by moistplus

I've run pileup for contig coverage estimation after assembly.

The output is:

ID Avg_fold Length Ref_GC Base_Coverage Read_GC

1) The coverage is the Avg_fold, right?

Yes, that's correct.

2) If yes, I have some negative or 0 coverage... How can that be possible?

Zero coverage is certainly possible, since mapping and assembly results can differ. Negative coverage should be impossible. Can you send me the output file, and post your exact command line?
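For anyone wanting to check their own file for this, a minimal Python sketch follows. It assumes the pileup output is tab-separated with a single header line (optionally prefixed with `#`) containing the `ID` and `Avg_fold` column names quoted above; the function name and the example rows are hypothetical.

```python
def flag_suspect_coverage(lines):
    """Return (zero, negative) lists of contig IDs based on the Avg_fold column."""
    header = None
    zero, negative = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        if header is None:
            # First non-empty line is assumed to be the header, e.g. "#ID<TAB>Avg_fold<TAB>..."
            header = [name.lstrip("#") for name in fields]
            id_col = header.index("ID")
            fold_col = header.index("Avg_fold")
            continue
        avg_fold = float(fields[fold_col])
        if avg_fold < 0:
            negative.append(fields[id_col])  # should never happen
        elif avg_fold == 0:
            zero.append(fields[id_col])      # possible: no reads mapped to this contig
    return zero, negative

# Made-up rows for illustration only
example = [
    "#ID\tAvg_fold\tLength",
    "contig_1\t35.2\t1200",
    "contig_2\t0.0\t300",
    "contig_3\t-4.1\t500",
]
zero, negative = flag_suspect_coverage(example)
```

Looking columns up by name rather than position keeps the sketch working even if the file has more columns than the ones listed in the question.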

Thanks,
Brian

**Brian Bushnell** · 11-08-2016, 09:58 AM

Indeed, I see 3 entries with negative coverage, which should not happen. However, v32.15 was released about 2.5 years ago and there have been thousands of changes since then, including hundreds of bug fixes. Could you try the latest version and see if that fixes the problem?

**GenoMax** · 11-09-2016, 02:56 PM

Originally posted by moistplus

With the new version, I no longer have any negative coverage!

Another question:
Using this command:

in1 and in2 are the forward and reverse reads, right? So it keeps the mated pairs at the end?

It should do that.

**Brian Bushnell** · 11-16-2016, 10:01 AM

Yes, it's random.

**Dario1984** · 11-29-2016, 10:00 PM

But can it output chimerically mapped reads to a separate output file, like STAR can? There's no mention of chimeric reads in the user guide, so I'm not sure if it's even suitable for that case.

**Brian Bushnell** · 11-30-2016, 09:14 AM

No, BBMap does not output chimerically-mapped reads to a separate file, though it can be used to separate properly-paired reads from improper (possibly chimeric) pairs.
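The post does not say how the separation is done, so the following is only a post-processing sketch over SAM text, not BBMap's own mechanism. It relies on the standard SAM flag bits (0x1 for "paired", 0x2 for "properly paired"); the function name and the example records are hypothetical.

```python
PAIRED = 0x1       # SAM flag bit: template has multiple segments
PROPER_PAIR = 0x2  # SAM flag bit: each segment properly aligned per the aligner

def split_by_proper_pair(sam_lines):
    """Return (proper, improper) lists of read names for paired reads."""
    proper, improper = [], []
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip SAM header lines
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if not flag & PAIRED:
            continue  # unpaired reads are neither proper nor improper pairs
        if flag & PROPER_PAIR:
            proper.append(fields[0])
        else:
            improper.append(fields[0])
    return proper, improper

# Made-up, truncated SAM records (only the first columns matter here)
example = [
    "@HD\tVN:1.4",
    "readA\t99\tref\t100\t40\t150M",
    "readB\t65\tref\t200\t40\t150M",
    "readC\t0\tref\t300\t40\t150M",
]
proper, improper = split_by_proper_pair(example)
```

Improper pairs are only candidates for chimeras; they can also arise from large deletions or mapping errors.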

**GenoMax** · 11-30-2016, 09:25 AM

Originally posted by Brian Bushnell

No, BBMap does not output chimerically-mapped reads to a separate file, though it can be used to separate properly-paired reads from improper (possibly chimeric) pairs.

Is that functionality on the "feature request" list?

**Brian Bushnell** · 11-30-2016, 10:14 AM

Originally posted by GenoMax

Is that functionality on the "feature request" list?

Oh, very well, I'll add it to the list

**JVGen** · 12-05-2016, 03:59 PM

Originally posted by Brian Bushnell

Oh, very well, I'll add it to the list

Hi Brian,

I am trying to use BBMap to align 150 bp paired-end reads to a 10 kb reference. The reference is an HIV genome, and my sequencing input is PCR-amplified HIV proviruses (which means I get lots of coverage).

I used BBDuk to adapter- and quality-trim my reads. I then used BBNorm to normalize coverage to ~150. Then I used BBMap to map the reads to the reference.

Deletions in HIV proviruses are common, and I noticed that BBMap seems to get hung up near the deletions. For instance, if a read spans the deletion, BBMap doesn't seem to insert a gap into the read so that it aligns on the other side of the deletion. I've attached some pictures. The first pic is the full alignment, the second is 5' of the deletion, and the third is 3' of the deletion. Note that the "CTGAGGGGACAGAT" sequence is present on the reads on the 5' side of the deletion, and should extend to the 3' side. In this case, most of these reads were trimmed so that the consensus would reflect the correct deletion, but I do worry that this might not always be the case.

Are there any settings to adjust this to allow the reads to span the deletion?

Thanks,
Jake

**GenoMax** · 12-05-2016, 04:57 PM

@JVGen: Are you using default alignment settings for bbmap? You may need to adjust maxindel (which defaults to 16000) and intronlen settings.

**JVGen** · 12-05-2016, 06:14 PM

Originally posted by GenoMax

@JVGen: Are you using default alignment settings for bbmap? You may need to adjust maxindel (which defaults to 16000) and intronlen settings.

Hi GenoMax,

The entirety of my reference is only 9000 bp, so I think the default maxindel size is appropriate. What does intronlen do?

Thanks!
JV

**Brian Bushnell** · 12-06-2016, 04:06 AM

The default maxindel should be fine in this case. If there are reads spanning the deletion, they will be mapped spanning the deletion. I'm not familiar with the viewer you are using... perhaps you could try IGV?

"intronlen=10" will, for example, replace "D" (deletion) symbols in cigar strings with "N" (skipped) symbols, for deletions of at least 10bp in length. Some programs and viewers prefer N over D for whatever reason. I consider them equivalent. But, it's possible the viewer you are using does not properly display reads with "D" symbols in the cigar string, so using IGV or remapping with the "intronlen=10" flag might be helpful. Or, if you send me the sam file and reference I can look at it.
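To make the effect of that substitution concrete, here is a minimal Python sketch of the D-to-N rewrite described above, applied to a CIGAR string after the fact. It only illustrates the described behavior and is not BBMap's code; the function name `deletions_to_skips` and the example CIGAR strings are hypothetical.

```python
import re

# One CIGAR element: a length followed by a standard SAM operation code
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def deletions_to_skips(cigar, intronlen=10):
    """Rewrite D as N for deletions of at least `intronlen` bases."""
    out = []
    for length, op in CIGAR_OP.findall(cigar):
        if op == "D" and int(length) >= intronlen:
            op = "N"  # long deletion reported as a skipped region
        out.append(length + op)
    return "".join(out)

long_del = deletions_to_skips("60M3500D90M")   # long deletion is rewritten
short_del = deletions_to_skips("70M2D80M")     # short deletion is left alone
```

Either way the read spans the same reference interval; only the symbol a downstream viewer sees changes.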

most of these reads were trimmed so that the consensus would reflect the correct deletion

I'm not sure what you mean by that - can you clarify? Also, can you give the exact command you used for BBMap? If you use the "local" flag, long deletions might get erased.

I've honestly never heard a concern before that BBMap was unwilling to map reads spanning long deletions - only the opposite. In your last picture, it looks to me like all of the reads are mapped with a long deletion extending off the screen to the left; or am I misinterpreting it?

**JVGen** · 12-06-2016, 05:13 AM

Originally posted by Brian Bushnell

The default maxindel should be fine in this case. If there are reads spanning the deletion, they will be mapped spanning the deletion. I'm not familiar with the viewer you are using... perhaps you could try IGV?

Hi Brian, that's Geneious I'm viewing in. I downloaded and tried using IGV, but I get an error when trying to load the SAM file. I shared the SAM file with you on Google Drive. I included the trimmed, normalized, unassembled reads and the reference as a separate FASTA as well, just in case.

Originally posted by Brian Bushnell

"intronlen=10" will, for example, replace "D" (deletion) symbols in cigar strings with "N" (skipped) symbols, for deletions of at least 10bp in length. Some programs and viewers prefer N over D for whatever reason. I consider them equivalent. But, it's possible the viewer you are using does not properly display reads with "D" symbols in the cigar string, so using IGV or remapping with the "intronlen=10" flag might be helpful. Or, if you send me the sam file and reference I can look at it.

This could be, but they replace no coverage with gaps, so I don't know what the program is doing behind the scenes (if it's logging D's or N's). I will try repeating with this flag and see if it changes the outcome.

Originally posted by Brian Bushnell

I'm not sure what you mean by that - can you clarify? Also, can you give the exact command you used for BBMap? If you use the "local" flag, long deletions might get erased.

I'm running it in Geneious with the following parameters (in picture). I'm going to try ticking the "discard trimmed regions", though I think it is unnecessary, because I don't think BBDuk keeps trimmed information (which is how I trimmed the reads). Dissolve contigs is redundant - no contigs have yet been assembled. Quirk of the program.

Originally posted by Brian Bushnell

I've honestly never heard a concern before that BBMap was unwilling to map reads spanning long deletions - only the opposite. In your last picture, it looks to me like all of the reads are mapped with a long deletion extending off the screen to the left; or am I misinterpreting it?

There is a ~3.5kb deletion on the 3' end of the HIV genome. The HIV reference sequence is depicted in the faded yellow box. Reads assembled to the reference are depicted below as black rectangles (which, when I zoom in, show their sequence). A coverage map is shown above the reference in red. A consensus sequence for the assembled reads is provided above the coverage map. Within the consensus, black represents a mismatch to the reference (this many mismatches is not uncommon for HIV, as it's a retrovirus and reverse transcription introduces many mutations).

Thanks for any help!

JV

**Brian Bushnell** · 12-06-2016, 01:21 PM

Originally posted by JVGen

Hi Brian, that's Geneious I'm viewing in. I downloaded and tried using IGV, but I get an error when trying to load the SAM file.

IGV needs a sorted, indexed bam file. It won't accept sam.

I shared the SAM file with you on Google Drive. I included the trimmed, normalized, unassembled reads and the reference as a separate FASTA as well, just in case.

Please send me the links, and I'll look at them.

I'm running it in Geneious with the following parameters (in picture). I'm going to try ticking the "discard trimmed regions", though I think it is unnecessary, because I don't think BBDuk keeps trimmed information (which is how I trimmed the reads). Dissolve contigs is redundant - no contigs have yet been assembled. Quirk of the program.

Hmmm, I'm not really sure what Geneious is doing behind the scenes here with regard to trimming, but it doesn't look like it would have any kind of effect that would suppress long deletions.
