Announcement
Collapse
No announcement yet.
X
-
Any chance someone generated a bacterial genome masked human genome database using the latest human genome?
-
Hello everyone,
I have a small problem in understanding the working of 'removehuman.sh' script from the BBTools (Version 39.01). Maybe I did something wrong.
I was trying to remove human reads from paired-end fastq.gz files with a cmd similar to this:
bash $BBMAP_dir/removehuman.sh in1=${R1_File} in2=${R2_File} out1=${sample_name}_cleaned_1.fastq.gz out2=${sample_name}_cleaned_2.fastq.gz t=12
It invoked various commands in the backend and I got 2 output files as expected. The output files are lesser in size than the original files, so I assumed the difference is due to human read removal.
For eg. one of the file sizes were as was such:
266K SRR1X_1.fastq.gz 252K SRR1X_cleaned_1.fastq.gz
249K SRR1X_2.fastq.gz 235K SRR1X_cleaned_2.fastq.gz
I confirmed the read count with a simple cmd to count the number of lines and divide it by 4. It was the same read count even then.
Cmd to count reads: echo $(zcat ${FILENAME}.fastq.gz|wc -l)/4|bc
a. Is this normal behavior? (Are the reads not removed but retained & just masked in the original file somehow?) OR
b. Did the code not work the way I was expecting it to work? (i.e, to remove human reads and give me all other reads only)
Clearly, I am overthinking this thing and ran the code in a wrong way. But still, can you help me understand what went wrong?
The only difference that I saw that
1. The reads in the final files were in a different order than before
2. The id variable and other things (from the 3rd line of each read) were removed from the output file.
For eg. The green highlighted part was missing form the output file.
One read from the original file:
@SRR13867754.1 SN640:172:HT7GYBCX2:1:1108:7955:3638 length=34
GAACGCTGGGCAGGGGGGGCGTCATGAGGCTACC
+SRR13867754.1 SN640:172:HT7GYBCX2:1:1108:7955:3638 length=34
DDDDDIIIIIIIIIIIIIIIIIIIIIIIIIIIII
The same read from the output file:
@SRR13867754.1 SN640:172:HT7GYBCX2:1:1108:7955:3638 length=34
GAACGCTGGGCAGGGGGGGCGTCATGAGGCTACC
+
DDDDDIIIIIIIIIIIIIIIIIIIIIIIIIIIII
java -ea -Xmx15000m -Xms15000m -cp /home/gpfs/o_shinde/Softwares_Tejus/bbmap/current/ align2.BBMap minratio=0.9 maxindel=3 bwr=0.16 bw=12 quickmatch fast minhits=2 path=$path_human_genome/GRCh38_human_genome pigz unpigz zl=6 qtrim=r trimq=10 untrim idtag usemodulo printunmappedcount ztd=2 kfilter=25 maxsites=1 k=14 bloomfilter in1=$path/SRR1X_1.fastq.gz
in2=$path/SRR1X_2.fastq.gz
out1=SRR1X_cleaned_1.fastq.gz out2=SRR1X_cleaned_2.fastq.gz t=12
Executing align2.BBMap [tipsearch=20, maxindel=80, minhits=2, bwr=0.18, bw=40, minratio=0.65, midpad=150, minscaf=50, quickmatch=t, rescuemismatches=15, rescuedist=800, maxsites=3, maxsites2=100, minratio=0.9, maxindel=3, bwr=0.16, bw=12, quickmatch, minhits=2, path=/home/gpfs/o_shinde/References_local/GRCh38_human_genome, pigz, unpigz, zl=6, qtrim=r, trimq=10, untrim, idtag, usemodulo, printunmappedcount, ztd=2, kfilter=25, maxsites=1, k=14, bloomfilter,
in1=$path/SRR1X_1.fastq.gz,
in2=$path/SRR1X_2.fastq.gz,
out1=SRR1X_cleaned_1.fastq.gz, out2=SRR1X_cleaned_2.fastq.gz, t=12]
Version 39.00
Leave a comment:
-
Hi Everyone!
Thank you for existing masked databases and the guidance on how to create your own database. I think it would be good if there would be a more formalised guide that includes the actual commands to execute so that there are no misunderstandings. I will try to write it up and would be happy if you would correct me if something is missing or wrong.
The goal is to mask all bases from a host that could also come from the microbiome. I will focus only on Archaea and Prokaryotes.
1. Input:
- A genome/gene database with a as complete as possible representation of bacterial/archaeal species. E.g. 100k species from Progenomes2 (now referred to as progrenomes.fasta)
- The host genome (now referred to as host.fasta)
- The silva 138 database (now referred to as silva.fasta)
- bbmap.sh/bbmask.sh
2. Output:
- A masked host genome (now referred to as host.masked.fasta)
How to get there:
- Take progenomes.fasta + silva.fasta and split all sequences into subsequences of 80bp. Should they be overlapping sequences extracted like in a sliding window or do you just chop each sequence into subsequences of equal length? (Output now referred to as progrenomes.silva.chopped.fasta). Does bbmap have a tool for this shredding?
- Align the progenomes.silva.chopped.fasta against the host genome and store as sam file. command: bbmap.sh ref=host.fasta in=progrenomes.silva.chopped.fasta minid=0.85 out=progenomes.silva.chopped.sam mappedonly=t secondary=t idfilter=0.85 cigar=t
- Use bbmask.sh with standard parameters and the progenomes.silva.chopped.sam as input. command: bbmask.sh in=host.fasta sam=progenomes.silva.chopped.sam out=host.masked.fasta
Does this sound about right?
Thanks a lot,
Hans
Leave a comment:
-
@uloeber: @Brian described the method of masking in first post of this thread. It uses "bbmask.sh" which is a component of BBMap suite. Take a look at in-line help for bbmask.sh to get an idea of what parameters you can use.
1) Areas containing multiple repeat short kmers.
2) Windows of low entropy, calculated using pentamer frequencies.
3) Via mapping from sam files:
The mapping is the more interesting approach. I used every fungal assembly on JGI's Mycocosm, every plant genome on JGI's phytozome, and a handful of others, including zebra danio (the closest relative to human I included), and the entire silva ribosomal database. First, I completely removed the assemblies that contained human contamination (a handful). Then, I shredded the remaining data into 70-80bp pieces, mapped it to human requiring a minimum of ~85% identity, and used BBMask to mask everything the shreds hit.
Leave a comment:
-
Originally posted by Brian Bushnell View Post... and the entire silva ribosomal database. .
Thank you so much!
Leave a comment:
-
how may I create a masked genome reference for mice?
Hi Brian,
I would love to have a masked version of Mus musculus genome. Could you probably share the details how you created the human masked genome, such that I can replicate this for mice?
Bests,
Ulrike
Leave a comment:
-
/usr/local/bin/bbmap.sh: line 349: 9039 Killed java ...
Hi Brain,
I am trying to run bbmap.sh with your data like this on my laptop (Linux Fedora 30, memory 8GB). Then I got following error "/usr/local/bin/bbmap.sh: line 349: 9039 Killed java ...". Do you have any idea to fix it? Thank you.
Code:$ bbmap.sh --version java -Djava.library.path=/home/root/local/bbmap/jni/ -ea -Xmx1124m -cp /home/root/local/bbmap/current/ align2.BBMap build=1 overwrite=true fastareadlen=500 --version Executing align2.BBMap [build=1, overwrite=true, fastareadlen=500, --version] BBMap version 37.10 $ java -version openjdk version "1.8.0_232" OpenJDK Runtime Environment (build 1.8.0_232-b09) OpenJDK 64-Bit Server VM (build 25.232-b09, mixed mode) $ bbmap.sh -Xmx24G ref=hg19_main_mask_ribo_animal_allplant_allfungus.fa.gz ... Loaded Reference: 0.020 seconds. Loading index for chunk 1-7, build 1 No index available; generating from reference genome: /home/jaruga/data/yamp/ref/index/1/chr1-3_index_k13_c2_b1.block Indexing threads started for block 0-3 Indexing threads finished for block 0-3 No index available; generating from reference genome: /home/jaruga/data/yamp/ref/index/1/chr4-7_index_k13_c2_b1.block Indexing threads started for block 4-7 /usr/local/bin/bbmap.sh: line 349: 9039 Killed java -Djava.library.path=/home/root/local/bbmap/jni/ -ea -Xmx24G -cp /home/root/local/bbmap/current/ align2.BBMap build=1 overwrite=true fastareadlen=500 -Xmx24G ref=hg19_main_mask_ribo_animal_allplant_allfungus.fa.gz $ echo $? 137
Leave a comment:
-
/usr/local/bin/bbmap.sh: line 349: 9039 Killed
Hi Brain,
I am trying to run bbmap.sh with your data like this on my laptop (Linux Fedora 30, memory 8GB). Then I got following error "/usr/local/bin/bbmap.sh: line 349: 9039 Killed java ...". Do you have any idea to fix it? Thank you.
Code:$ bbmap.sh --version java -Djava.library.path=/home/root/local/bbmap/jni/ -ea -Xmx1124m -cp /home/root/local/bbmap/current/ align2.BBMap build=1 overwrite=true fastareadlen=500 --version Executing align2.BBMap [build=1, overwrite=true, fastareadlen=500, --version] BBMap version 37.10 $ java -version openjdk version "1.8.0_232" OpenJDK Runtime Environment (build 1.8.0_232-b09) OpenJDK 64-Bit Server VM (build 25.232-b09, mixed mode) $ bbmap.sh -Xmx24G ref=hg19_main_mask_ribo_animal_allplant_allfungus.fa.gz ... Loaded Reference: 0.020 seconds. Loading index for chunk 1-7, build 1 No index available; generating from reference genome: /home/jaruga/data/yamp/ref/index/1/chr1-3_index_k13_c2_b1.block Indexing threads started for block 0-3 Indexing threads finished for block 0-3 No index available; generating from reference genome: /home/jaruga/data/yamp/ref/index/1/chr4-7_index_k13_c2_b1.block Indexing threads started for block 4-7 /usr/local/bin/bbmap.sh: line 349: 9039 Killed java -Djava.library.path=/home/root/local/bbmap/jni/ -ea -Xmx24G -cp /home/root/local/bbmap/current/ align2.BBMap build=1 overwrite=true fastareadlen=500 -Xmx24G ref=hg19_main_mask_ribo_animal_allplant_allfungus.fa.gz $ echo $? 137
Leave a comment:
-
Hi Brain,
I am trying to run bbmap.sh with your data like this on my laptop (Linux Fedora 30, memory 8GB). Then I got following error "/usr/local/bin/bbmap.sh: line 349: 9039 Killed java ...". Do you have any idea to fix it? Thank you.
Code:$ bbmap.sh --version java -Djava.library.path=/home/root/local/bbmap/jni/ -ea -Xmx1124m -cp /home/root/local/bbmap/current/ align2.BBMap build=1 overwrite=true fastareadlen=500 --version Executing align2.BBMap [build=1, overwrite=true, fastareadlen=500, --version] BBMap version 37.10 $ java -version openjdk version "1.8.0_232" OpenJDK Runtime Environment (build 1.8.0_232-b09) OpenJDK 64-Bit Server VM (build 25.232-b09, mixed mode) $ bbmap.sh -Xmx24G ref=hg19_main_mask_ribo_animal_allplant_allfungus.fa.gz ... Loaded Reference: 0.020 seconds. Loading index for chunk 1-7, build 1 No index available; generating from reference genome: /home/jaruga/data/yamp/ref/index/1/chr1-3_index_k13_c2_b1.block Indexing threads started for block 0-3 Indexing threads finished for block 0-3 No index available; generating from reference genome: /home/jaruga/data/yamp/ref/index/1/chr4-7_index_k13_c2_b1.block Indexing threads started for block 4-7 /usr/local/bin/bbmap.sh: line 349: 9039 Killed java -Djava.library.path=/home/root/local/bbmap/jni/ -ea -Xmx24G -cp /home/root/local/bbmap/current/ align2.BBMap build=1 overwrite=true fastareadlen=500 -Xmx24G ref=hg19_main_mask_ribo_animal_allplant_allfungus.fa.gz $ echo $? 137
Leave a comment:
-
@jylee: "/global/projectb/sandbox/gaag/bbtools/hg19/ref/genome/1/summary.txt" appears to refer to a location on JGI servers (if that is not your own). You will need to download and provide hg19 reference sequence. You can pre-index the genome with BBMap to use with path= or use ref= option to point to the genome sequence multi-fasta file location.
Leave a comment:
-
Hello Brian,
I am trying to use removehuman.sh on MSU HPCC.
Inputs (*_filtered.fastq.gz files) are phix filtered R1 and R2 files using BBduk as following: [leejooy5@dev-intel18 filtered_reads]$ bbduk.sh -Xmx10g in1=NFW_R1_trimmed.fastq.gz in2=NFW_R2_trimmed.fastq.gz out1=NFW_R1_filtered.fastq.gz out2=NFW_R2_filtered.fastq.gz ref=/opt/software/BBMap/37.93-foss-2018a/resources/phix174_ill.ref.fa.gz k=31 hdist=1 stats=GR25_stats.txt threadS=8)
As shown in below, error message "Exception in thread "main" java.lang.RuntimeException: Can't find file /global/projectb/sandbox/gaag/bbtools/hg19/ref/genome/1/summary.txt
" popped up when I tried to run "removehuman.sh". I tried without additional parameters such as -Xmx and threads, but same error happened. Also, I tried to find the find the
file "/global/projectb/sandbox/gaag/bbtools/hg19/ref/genome/1/summary.txt", but I couldn't. Could you tell me what mistake I did or let me know where I can find a solution? Thank you for your time and consideration.
Cheers,
Joo-Young
=======================
[leejooy5@dev-intel18 filtered_reads]$ removehuman.sh -Xmx10g in1=NFW_R1_filtered.fastq.gz in2=NFW_R2_filtered.fastq.gz out1=NFW_R1_clean.fastq.gz out2=NFW_R2_clean.fastq.gz threads=8
removehuman.sh -Xmx10g in1=NFW_R1_filtered.fastq.gz in2=NFW_R2_filtered.fastq.gz out1=NFW_R1_clean.fastq.gz out2=NFW_R2_clean.fastq.gz threads=8
java -Djava.library.path=/opt/software/BBMap/37.93-foss-2018a/jni/ -ea -Xmx10g -cp /opt/software/BBMap/37.93-foss-2018a/current/ align2.BBMap minratio=0.9 maxindel=3 bwr=0.16 bw=12 quickmatch fast minhits=2 path=/global/projectb/sandbox/gaag/bbtools/hg19 pigz unpigz zl=6 qtrim=r trimq=10 untrim idtag usemodulo printunmappedcount usejni ztd=2 kfilter=25 maxsites=1 k=14 -Xmx10g in1=NFW_R1_filtered.fastq.gz in2=NFW_R2_filtered.fastq.gz out1=NFW_R1_clean.fastq.gz out2=NFW_R2_clean.fastq.gz threads=8
Executing align2.BBMap [tipsearch=20, maxindel=80, minhits=2, bwr=0.18, bw=40, minratio=0.65, midpad=150, minscaf=50, quickmatch=t, rescuemismatches=15, rescuedist=800, maxsites=3, maxsites2=100, minratio=0.9, maxindel=3, bwr=0.16, bw=12, quickmatch, minhits=2, path=/global/projectb/sandbox/gaag/bbtools/hg19, pigz, unpigz, zl=6, qtrim=r, trimq=10, untrim, idtag, usemodulo, printunmappedcount, usejni, ztd=2, kfilter=25, maxsites=1, k=14, -Xmx10g, in1=NFW_R1_filtered.fastq.gz, in2=NFW_R2_filtered.fastq.gz, out1=NFW_R1_clean.fastq.gz, out2=NFW_R2_clean.fastq.gz, threads=8]
Version 37.93 [tipsearch=20, maxindel=80, minhits=2, bwr=0.18, bw=40, minratio=0.65, midpad=150, minscaf=50, quickmatch=t, rescuemismatches=15, rescuedist=800, maxsites=3, maxsites2=100, minratio=0.9, maxindel=3, bwr=0.16, bw=12, quickmatch, minhits=2, path=/global/projectb/sandbox/gaag/bbtools/hg19, pigz, unpigz, zl=6, qtrim=r, trimq=10, untrim, idtag, usemodulo, printunmappedcount, usejni, ztd=2, kfilter=25, maxsites=1, k=14, -Xmx10g, in1=NFW_R1_filtered.fastq.gz, in2=NFW_R2_filtered.fastq.gz, out1=NFW_R1_clean.fastq.gz, out2=NFW_R2_clean.fastq.gz, threads=8]
Set MINIMUM_ALIGNMENT_SCORE_RATIO to 0.650
Set MINIMUM_ALIGNMENT_SCORE_RATIO to 0.900
Set threads to 8
Retaining first best site only for ambiguous mappings.
Exception in thread "main" java.lang.RuntimeException: Can't find file /global/projectb/sandbox/gaag/bbtools/hg19/ref/genome/1/summary.txt
at fileIO.ReadWrite.getRawInputStream(ReadWrite.java:906)
at fileIO.ReadWrite.getInputStream(ReadWrite.java:871)
at fileIO.TextFile.open(TextFile.java:227)
at fileIO.TextFile.<init>(TextFile.java:71)
at dna.Data.setGenome2(Data.java:822)
at dna.Data.setGenome(Data.java:768)
at align2.BBMap.loadIndex(BBMap.java:313)
at align2.BBMap.main(BBMap.java:32)
Leave a comment:
-
"bbsplit.sh" is a general purpose tool that will bin reads into any number of bins (depending on the reference sequences provided, you can provide as many as you want). In this case you would provide human_genome.fa (and any other reference you want to use). If you only use human then reads not mapping to human genome will be collected in other bin.
Leave a comment:
-
I have not - however, bbsplit doesn't really seem to be the right tool for removing human reads?
Leave a comment:
-
Have you tried using "bbsplit.sh" with human genome to see if that works better. If you are interested in non-human data then I would use the non-masked genome and risk losing a few additional reads.
Leave a comment:
-
Hi Brian,
Just reviving an old thread here. I have been testing out a lot of different methods to clean human reads and I really love BBMap because it's such a well thought-out program. However, when I try to clean human reads with the settings you have specified, I routinely get a ton of reads remaining - upwards of 70% (so only 30% are cleaned). I have tried to adjust the various parameters, but the only thing that seems to make a difference for depletion is the 'minid' setting. Setting that at 0.50 (which is *very* low) depletes around 95% of reads. As a comparison, a default run with bwa mem depletes 100%.
Any idea how I might get BBMap to more accurately deplete human reads?
Leave a comment:
Latest Articles
Collapse
-
by seqadmin
The recent pandemic caused worldwide health, economic, and social disruptions with its reverberations still felt today. A key takeaway from this event is the need for accurate and accessible tools for detecting and tracking infectious diseases. Timely identification is essential for early intervention, managing outbreaks, and preventing their spread. This article reviews several valuable tools employed in the detection and surveillance of infectious diseases.
...-
Channel: Articles
11-27-2023, 01:15 PM -
-
by seqadmin
Microbiome research has led to the discovery of important connections to human and environmental health. Sequencing has become a core investigational tool in microbiome research, a subject that we covered during a recent webinar. Our expert speakers shared a number of advancements including improved experimental workflows, research involving transmission dynamics, and invaluable analysis resources. This article recaps their informative presentations, offering insights...-
Channel: Articles
11-09-2023, 07:02 AM -
ad_right_rmr
Collapse
News
Collapse
Topics | Statistics | Last Post | ||
---|---|---|---|---|
Started by seqadmin, Today, 08:23 AM
|
0 responses
1 view
0 likes
|
Last Post
by seqadmin
Today, 08:23 AM
|
||
Started by seqadmin, 12-01-2023, 09:55 AM
|
0 responses
21 views
0 likes
|
Last Post
by seqadmin
12-01-2023, 09:55 AM
|
||
Started by seqadmin, 11-30-2023, 10:48 AM
|
0 responses
20 views
0 likes
|
Last Post
by seqadmin
11-30-2023, 10:48 AM
|
||
Started by seqadmin, 11-29-2023, 08:26 AM
|
0 responses
15 views
0 likes
|
Last Post
by seqadmin
11-29-2023, 08:26 AM
|
Leave a comment: