Seqanswers Leaderboard Ad
Collapse
Announcement
Collapse
No announcement yet.
X
-
While Brian will have a more detailed insight it should be ok to use the unmasked genome with BBSplit. Any mapping issues you may have with short reads should be more or less same with genome or just CDS sequences.
-
Biological question about contaminants DB
Dear Brian,
I have created a de novo transcriptome of a non-model organism. After BLASTing the contigs I found some homology with possible contaminants. My question is if I should use DNA genomes (un/masked) or the CDS gene predictions from ENSMBL database.
In the first case, do you recommend the unmasked, the masked version of genomes (ensmbl) or should I use BBMask?
And in the second case, could BBsplit handle with CDS mapping?
I am a bit lost, I think that contigs from de novo transcriptome have to be mapped against CDS sequences.
With thanks,
Xavier
Leave a comment:
-
Thanks Brian, I will look into the options you suggested to handle this use case!
Kate
Leave a comment:
-
Originally posted by GenoMax View PostYou can set nodisk=t to prevent the index from being written to disk (memory will always be used).
Hi Brian,
I am wondering for BBsplit (with ambiguous2=split) if it is expected behavior that if I provide 2 references:
Ref 1 has 1 sequence that is a perfect match to PE sequencing reads
Ref 2 has many unique sequences that are at least 1 MM different from sequencing reads
For BBsplit to think the reads are ambiguously mapped? And for this to be the case for >90% of read pairs but not for every read pair?
Is there a way to force BBsplit to select Ref1 (perfect) over the options in Ref2?
I am trying to remove a nearly-identical putative contaminant sequence from my data...
(FYI I have ambiguous is set to best, and I've also tried setting both ambiguous and ambiguous2 to best, but the reads are almost all considered to be ambiguous in that scenario too)
Thx! Kate
You can certainly set "perfectmode" to only allow mappings that are 100% identity to the reference. For example, you could run once in perfectmode, and then run the remaining unmapped reads normally. But generally, yes, this is intended behavior. BBSplit is intended to do things like separate mouse reads from human reads when a mouse is used as the vector for some human DNA study. Also, it is designed to separate reads belonging to various bacteria in a metagenome, and chloroplast/plant reads in a plant. In any of these cases, if there is a single bp mismatch, you can't unambiguously assign a read to one reference or the other, since it could be a sequencing error or (more importantly) an actual variation.
You might find Seal (seal.sh) useful. It has functionality similar to BBSplit, but is alignment free (meaning it is way faster, but uses more memory). It allows you to specify cutoffs, so you can for example send all of the perfectly-matching reads to one file, regardless of whether they almost match another genome. It decides which file to assign a read to based on the number of kmers (by default, 31-mers) matching that reference. When dealing with bacteria, I always prefer Seal over BBSplit because it is so much faster and easier to use, and bacteria have tiny genomes that are under high evolutionary pressure to avoid low-complexity sequence. Euks have huge genomes with a lot of low-complexity regions so BBSplit is better in that case, since it is more precise.
If you want to remove a particular contaminant from your data, I suggest trying BBDuk. BBDuk does not allow kmers longer than 31. However, it emulates longer kmers. For example, if you set "k=90", then it will consider reads as matching (by default, they get discarded) if there is a 90bp stretch in which all 31-mers match 31-mers in the reference. In practice this is very similar to matching 90-mers.
So, for example:
bbduk.sh in=reads.fq out=filtered.fq ref=contaminant.fa k=90 mm=f
That will remove all reads containing at least 90bp that shares all kmers with your reference (which should be the unwanted sequence). The exact values depend on the length of the sequence in question. Note that read length is very important here; if a read only overlaps 80bp of the 90bp sequence in question, it would not be removed.
Leave a comment:
-
Hi Brian,
I am wondering for BBsplit (with ambiguous2=split) if it is expected behavior that if I provide 2 references:
Ref 1 has 1 sequence that is a perfect match to PE sequencing reads
Ref 2 has many unique sequences that are at least 1 MM different from sequencing reads
For BBsplit to think the reads are ambiguously mapped? And for this to be the case for >90% of read pairs but not for every read pair?
Is there a way to force BBsplit to select Ref1 (perfect) over the options in Ref2?
I am trying to remove a nearly-identical putative contaminant sequence from my data...
(FYI I have ambiguous is set to best, and I've also tried setting both ambiguous and ambiguous2 to best, but the reads are almost all considered to be ambiguous in that scenario too)
Thx! KateLast edited by sk8bro; 10-11-2017, 01:08 PM.
Leave a comment:
-
For some other programs in BBMap suite there has been a "race condition" which produces a nasty (though according to @Brian a harmless) error related to threads.
At least in this case the result appears to be identical. In the last table you had given us the results all looked different.
Not sure why you still have this in the output, if you had deleted the ref folder:
NOTE: Ignoring reference file because it already appears to have been processed.
NOTE: If you wish to regenerate the index, please manually delete ref/genome/1/summary.txtLast edited by GenoMax; 10-05-2017, 04:43 AM.
Leave a comment:
-
40428 reads indeed. The number of threads shouldn't matter for the succes or failure of the process. However, I did run it on 4 threads for you, so here are the results:
Code:BBMap version 37.53 Set MINIMUM_ALIGNMENT_SCORE_RATIO to 0.560 Set threads to 4 Retaining first best site only for ambiguous mappings. NOTE: Ignoring reference file because it already appears to have been processed. NOTE: If you wish to regenerate the index, please manually delete ref/genome/1/summary.txt Set genome to 1 Loaded Reference: 1.647 seconds. Loading index for chunk 1-1, build 1 Generated Index: 1.542 seconds. Analyzed Index: 3.989 seconds. Cleared Memory: 6.978 seconds. Processing reads in single-ended mode. Started read stream. Started 4 mapping threads. Detecting finished threads: 0, 1, 2, 3 ------------------ Results ------------------ Genome: 1 Key Length: 13 Max Indel: 20 Minimum Score Ratio: 0.56 Mapping Mode: normal Reads Used: 40428 (3919820 bases) Mapping: 10.603 seconds. Reads/sec: 3812.77 kBases/sec: 369.68 Read 1 data: pct reads num reads pct bases num bases mapped: 4.5464% 1838 4.3524% 170607 unambiguous: 1.2120% 490 1.2217% 47890 ambiguous: 3.3343% 1348 3.1307% 122717 low-Q discards: 0.0000% 0 0.0000% 0 perfect best site: 0.7173% 290 0.6776% 26561 semiperfect site: 0.7297% 295 0.6905% 27066 Match Rate: NA NA 91.4514% 156445 Error Rate: 42.4134% 1543 8.4773% 14502 Sub Rate: 42.1935% 1535 7.7799% 13309 Del Rate: 2.2265% 81 0.2701% 462 Ins Rate: 4.1506% 151 0.4273% 731 N Rate: 0.2474% 9 0.0713% 122 Total time: 25.038 seconds.
Leave a comment:
-
Do you only have 40428 reads in the file? 48 threads seems like an overkill for this amount of reads. Can you use 4 and see if that helps? You can also delete the previous bbmap index files and force the program to create new ones using ref=.
Adding relative paths before the file names should not cause the program to crash. It never has for me.
Leave a comment:
-
No error messages are produced, but you might be right about it crashing. (It also crashes when I put ./ in front of the input file.) Maybe you can see what goes wrong from the output text, so here it is:
Code:BBMap version 37.53 Set MINIMUM_ALIGNMENT_SCORE_RATIO to 0.560 Retaining first best site only for ambiguous mappings. NOTE: Ignoring reference file because it already appears to have been processed. NOTE: If you wish to regenerate the index, please manually delete ref/genome/1/summary.txt Set genome to 1 Loaded Reference: 1.506 seconds. Loading index for chunk 1-1, build 1 Generated Index: 1.482 seconds. Analyzed Index: 3.715 seconds. Cleared Memory: 6.933 seconds. Processing reads in single-ended mode. Started read stream. Started 48 mapping threads. Detecting finished threads: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47 ------------------ Results ------------------ Genome: 1 Key Length: 13 Max Indel: 20 Minimum Score Ratio: 0.56 Mapping Mode: normal Reads Used: 40428 (3919820 bases) Mapping: 13.209 seconds. Reads/sec: 3060.74 kBases/sec: 296.76 Read 1 data: pct reads num reads pct bases num bases mapped: 4.5464% 1838 4.3524% 170607 unambiguous: 1.2120% 490 1.2217% 47890 ambiguous: 3.3343% 1348 3.1307% 122717 low-Q discards: 0.0000% 0 0.0000% 0 perfect best site: 0.7173% 290 0.6776% 26561 semiperfect site: 0.7297% 295 0.6905% 27066 Match Rate: NA NA 91.4514% 156445 Error Rate: 42.4134% 1543 8.4773% 14502 Sub Rate: 42.1935% 1535 7.7799% 13309 Del Rate: 2.2265% 81 0.2701% 462 Ins Rate: 4.1506% 151 0.4273% 731 N Rate: 0.2474% 9 0.0713% 122 Total time: 27.078 seconds.
Leave a comment:
-
The numbers of mapped reads do not add up to the same record in each case... it seems like the program is crashing and not processing all the reads, or else the input is different. The output is deterministic for single-ended reads so I can't see any other explanation. Does the program print an error message?
Leave a comment:
-
That's odd, if you are not intentionally producing random assignments for sequences. Can you please post your full command line? Also, are the reads paired or unpaired?
Leave a comment:
-
Neither deterministic=t nor ambiguous2=all helped change the results. Unless you or Brian know a way of solving this, I'm sorry to say that I'll try using another tool (or attempt writing something of my own).
Reads are 100bp.
Leave a comment:
-
You could try changing the ambiguous2= parameter and see how that changes the result. "toss/all" may generate results that look more similar. With very similar genomes you are going to have trouble assigning reads reproducibly. How long are these reads?
Leave a comment:
-
@GenoMax: I did not consider the non-deterministic behaviour of mappers, however BBMap should still be deterministic in my case, since I have single-end reads. See copied text from the BBMap help section below. I could still try to explicitly call this though.
Run in deterministic mode. In this case it is good to set averagepairdist. BBMap is deterministic without this flag if using single-ended reads, or run singlethreaded.
Commands:
bash ~/bbmap/bbsplit.sh ref={}
# ref= a folder containing my reference sequences
bash ~/bbmap/bbsplit.sh in={} basename={}/out_%.fq
# in= sample, basename= output folder and output filename
Leave a comment:
Latest Articles
Collapse
-
by seqadmin
The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...-
Channel: Articles
04-22-2024, 07:01 AM -
-
by seqadmin
Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...-
Channel: Articles
04-04-2024, 04:25 PM -
ad_right_rmr
Collapse
News
Collapse
Topics | Statistics | Last Post | ||
---|---|---|---|---|
Started by seqadmin, Yesterday, 11:49 AM
|
0 responses
15 views
0 likes
|
Last Post
by seqadmin
Yesterday, 11:49 AM
|
||
Started by seqadmin, 04-24-2024, 08:47 AM
|
0 responses
16 views
0 likes
|
Last Post
by seqadmin
04-24-2024, 08:47 AM
|
||
Started by seqadmin, 04-11-2024, 12:08 PM
|
0 responses
61 views
0 likes
|
Last Post
by seqadmin
04-11-2024, 12:08 PM
|
||
Started by seqadmin, 04-10-2024, 10:19 PM
|
0 responses
60 views
0 likes
|
Last Post
by seqadmin
04-10-2024, 10:19 PM
|
Leave a comment: