Good day,
I am trying to use BBDuk to remove adapter and to perform quality trimming. The sequencing facility told me that they used the following:
Adapter (Index)
UDI0003
Read 1
AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
Read 2
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
Can you help me what command should I use to detect these adapters and to trim the reads @Q10?
Thank you!
Header Leaderboard Ad
Collapse
BBMap (aligner for DNA/RNAseq) is now open-source and available for download.
Collapse
Announcement
Collapse
SEQanswers June Challenge Has Begun!
The competition has begun! We're giving away a $50 Amazon gift card to the member who answers the most questions on our site during the month. We want to encourage our community members to share their knowledge and help each other out by answering questions related to sequencing technologies, genomics, and bioinformatics. The competition is open to all members of the site, and the winner will be announced at the beginning of July. Best of luck!
For a list of the official rules, visit (https://www.seqanswers.com/forum/sit...wledge-and-win)
For a list of the official rules, visit (https://www.seqanswers.com/forum/sit...wledge-and-win)
See more
See less
X
-
Hello, I am trying to use bbnorm to normalize a very large dataset - 657G per paired-end file. I've been given access to a node with 208 cores and 5888 GB RAM and I assumed it would be able to process the files, but I keep getting the following error:
Exception in thread "Thread-1407" java.lang.RuntimeException: java.io.IOException: No space left on device at stream.ReadStreamByteWriter.run(ReadStreamByteWriter.java:32)
Caused by: java.io.IOException: No space left on device (more lines follow - I can share them if they'd be helpful.)
It continues to run and later on prints again
Exception in thread "Thread-1408" java.lang.RuntimeException: java.io.IOException: No space left on device at stream.ReadStreamByteWriter.run(ReadStreamByteWriter.java:32)
Caused by: java.io.IOException: No space left on device
Then later on I get the following:
Exception in thread "Thread-1497" java.lang.RuntimeException: Writing to a terminated thread.
I ended the run at this point because I wasn't sure if the output would be viable or what exactly was happening. I was wondering a few things:
1) Should I let it keep running? Will the output be usable?
2) If the node I'm using isn't sufficient, is there a way to figure out what would be? When I ran loglog.sh, I got the following: Cardinality: 45491636111 which I calculate to mean that I'd need roughly 500 GB of RAM with the default parameters. (Assuming I made the correct calculation.)
3) Are there changes I can make to the command to lessen the load? I tried increasing the min from 5 to 10 with no success, and then I tried decreasing the hash number from 3 to 2 and then to 1 - again with no change in result.
Here is the command I used:
HTML Code:bbnorm.sh prefilter=TRUE in1=${WDIR}/${JB}_R1_001.clean.fastq in2=${WDIR}/${JB}_R2_001.clean.fastq out1=${WDIR}/${JB}_R1_001.norm.fastq out2=${WDIR}/${JB}_R2_001.norm.fastq target=100 min=5
Thank you so much!
Leave a comment:
-
Hi Brian,
I ran filterbyname.sh to extract some FastQ reads from a bigger FastQ file and I noticed that on the quality line of the output, the # (hashtag symbol) gets replaced by an ! (exclamation mark symbol). I checked the documentation and the forums about this behaviour but couldn't find anything.
Is this expected? If so, why?
Mine were latest FastQ files produced by BaseSpace from an Illumina MiSeq instrument.
Thanks in advance.
Best,
Santiago
Leave a comment:
-
Hi Brian Bushnell, thank you very much for developing BBMap. It is such a useful tool!
I am trying to map sequencing reads to a metagenome I had previously assembled and I am getting a very strange output. After running the alignment, I also ran pileup.sh to obtain coverage stats and I got the following output:
Code:pileup.sh in=mapped.sam out=stats.txt Reads: 18093134 Mapped reads: 3652730 Mapped bases: 331276393 Ref scaffolds: 134515 Ref bases: 43282856 Percent mapped: 20.188 Percent proper pairs: 12.341 Average coverage: 7.654 Average coverage with deletions: 7.689 Standard deviation: 9.676 Percent scaffolds with any coverage: 99.51 Percent of reference bases covered: 99.34
Thanks in advance with any help you might be able to provide,
Esteban
- Likes 2
Leave a comment:
-
Brian Bushnell Thanks for providing this very useful toolset.
I have a few questions wrt the use of BBSplit, and I guess it's OK to put them in a single post...
[Edited: well, there has been no reply for a while and I figured out some of the answers myself, and I add them below just in case someone else run into the same problems...]
1. Is there any recommendation in terms of computation resource allocation and speeding up a run? For human+mouse genome (a total of 5.6GB .fasta files) and decompressed fastq files at about 50GB each (paired-end), I found that nearly 200GB memory was needed when using 1 thread, and it can take several days or even up to a week to run. Is this normal/expected?
[Edited: in one of my later applications where I only needed to map one of the read-pairs (decompressed at around 50G), I managed to increase the number of threads to 16, with 10G memory per thread, setting java heap size around 50G, and BBSplit finished running overnight.]
2. From the user guide I read that the `maxindel` parameter should be set to a large number for RNA-seq (e.g. 200k for human according to the guide). Another example in the guide for BBMap says:
Code:To map vertebrate RNA-seq reads to a genome: bbmap.sh in=reads.fq out=mapped.sam maxindel=200k ambig=random intronlen=20 xstag=us
3. I have this problem on re-using genome index as follows:
In this post, someone asked:
After running bbsplit once using the syntax: ref=ref1.fa,ref2.fa, is it possible to re-use that index on subsequent runs using the path= parameter?
Yes, it is. Also, with BBSplit, I think it will try to regenerate the index every time as long as "ref=" is specified, even if it already exists, so only do that once.
Let's say I first build an index for two reference genomes, x.fa and y.fa as follows:
Code:bbsplit.sh build=1 ref=x.fa,y.fa path=./test
Code:bbsplit.sh in1=r1.fq in2=r2.fq basename=o%_#.fq build=1 path=./test
Code:NOTE: Deleting contents of ./test/ref/genome/1 because reference is specified and overwrite=true NOTE: Deleting contents of ./test/ref/index/1 because reference is specified and overwrite=true Writing reference. Executing dna.FastaToChromArrays2 [ref/genome/1/merged_ref_1939486580590361.fa.gz, 1, writeinthread=false, genscaffoldinfo=true, retain, waitforwriting=false, gz=true, maxlen=536670912, writechroms=true, minscaf=1, midpad=300, startpad=8000, stoppad=8000, nodisk=false] Set genScaffoldInfo=true Writing chunk 1 Writing chunk 2 ......
Besides, I also have a related question, which is whether I can pre-build indices separately for two genomes x.fa and y.fa, saved in different paths or with different build numbers, and then later reuse them for aligning with BBSplit (potentially with something like `build=1,2` or `path=<first_path>,<second_path>`).
[Edited: I figured this out. I first build the index with `bbsplit.sh ref=x.fa,y.fa`, which generated the index in a folder named `ref`, then in the subsequent BBSplit run for each sample organized in its own sample-specific dir, I first link the `ref` folder to the sample dir, then from the sample dir run with something like `bbsplit.sh ref=x.fa,y.fa in1=r1.fq in2=r2.fq basename=o%_#.fq`, now BBSplit can find the index without any issue.]
I am using BBMap version 39.01, the latest version to date.
I appreciate your kind help in advance.Last edited by KenC; 11-14-2022, 12:43 AM.
Leave a comment:
-
Is there anything else I need to do to test some Java changes to debug where the issue is happening?
For example, I would like to turn on verboseS=t and have it spit out debug info. I think the pairlen (MAX_PAIR_DIST) is just not being applied from what I can tell from verbose=t output.
Leave a comment:
-
Thank you for the response Brian Bushnell !
We are using the "pairedonly=t" flag, however the reads are still returned with PROPER_PAIR in the SAM file (sam flags 83 and 163).
It appears the MAPQ is penalized, probably due to the repetitive sequence in read2, however the reads are still shown as properly paired.
Maybe you'll spot something in the arguments below?
I've tried all combinations and all SAM alignments have the tag PROPERLY_PAIRED, and are included in the mapped output.
You can see I also added killbadpairs=t and pairedonly=t, but maybe something else has gone wrong.
Code:bbmap.sh \ in=testA_1.subset.fq.gz in2=testA_2.subset.fq.gz \ ref=/reference_genomes/mm39/mm39canonical.fa \ path=/reference_genomes/mm39/bbmap_mm39canonical/ \ out=testA.subset.mm39.maxindel20.pairlen35000.matelen33000.mapped.bam \ outu=testA.subset.mm39.maxindel20.pairlen35000.matelen33000.unmapped.bam \ mappedonly=t \ trimreaddescriptions=t \ t=3 \ maxindel=20 \ strictmaxindel=t \ inserttag=t \ matelen=33000 \ pairlen=35000 \ intronlen=999999999 \ ambig=random \ minid=0.76 \ killbadpairs=t \ pairedonly=t \ covstats=testA.subset.mm39.maxindel20.pairlen35000.matelen33000.mapped.bbmap_covstats.txt \ covhist=testA.subset.mm39.maxindel20.pairlen35000.matelen33000.mapped.bbmap_covhist.txt \ ihist=testA.subset.mm39.maxindel20.pairlen35000.matelen33000.mapped.bbmap_inserthist.txt \ statsfile=testA.subset.mm39.maxindel20.pairlen35000.matelen33000.mapped.bbmap_alignmentReport.txt \ 2>&1 | tee -a output_testA.subset.mm39.maxindel20.pairlen35000.matelen33000.mapped_bbmap.txt
Code:testA.subset.mm39.maxindel1400.pairlen1200.matelen1000.mapped.bam QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL FLAGS testA.subset.mm39.maxindel20.pairlen35000.matelen33000.mapped.bam NS500489:195:HLKT3BGXX:1:12109:18917:8143 83 chr5 11805066 2 3X4=1X2=1X4=2X2=3X29= = 11375902 -429215 ACATGCCAAAATAACAACCCACTTTTAAGTGTATAATTCATTACACAATGC //AE//E//E/E/E///EE///EEEEEEEEEEEEEEEEEEEEEEEEAAAAA XT:A:R NM:i:10 AM:i:2 X8:Z:429215 NS500489:195:HLKT3BGXX:1:12109:18917:8143 163 chr5 11375902 3 51= = 11805066 429215 GACAGAGACAGAGAGAGGAACATAGACAGAGACAGAGAGAGGAACATAGAC AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEE/E/E/E/EE/E XT:A:R NM:i:0 AM:i:3 X8:Z:429215 NS500489:195:HLKT3BGXX:1:11102:2182:13318 99 chr17 20345318 42 51= = 20345469 202 ACTGGTGGCCTTATAACTTGTGTAATAAGAAATGTTTGAGTTTAACTATAA AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE NM:i:0 AM:i:42 X8:Z:202 NS500489:195:HLKT3BGXX:1:11102:2182:13318 147 chr17 20345469 42 51= = 20345318 -202 AGCTAAGAAATAGTGAGAGCTGTGAAAATTATTCTTTCCCAGGAAGGGTAG AEEEEAEEEEEEEEEEEEEEEEAEEEEEEEEEEE6EEEEEEE/EEEAAAAA NM:i:0 AM:i:42 X8:Z:202
Leave a comment:
-
Oh, that flag might be a little misleading... it's the maximum distance for reads to be considered as 'proper pairs'. Beyond that the mapq is penalized, alternative sites are prioritized, and the 'proper pair' flag bit is not set. If you want to ban pairs from mapping at all beyond that distance you need to use either the 'killbadpairs' (which will mark one of the two as unmapped) or 'pairedonly' (which will mark them both as unmapped) flag.
- Likes 1
Leave a comment:
-
Hi Brian Bushnell thanks again for the BBTools suite, it's really amazing! Lots of configurable settings, fantastic speed, logic, use of kmers, it's got it all. I'm replacing various bits of our pipelines with BBTools equivalents, and so far really loving the improvements. Also enjoyed reading through the various forum posts and responses, impressive support for the tool over the years. I digress.
I cannot figure out how to get bbmap.sh to filter paired read alignments greater than 32000 distance, which is what I understood the pairlen=32000 argument is supposed to do. It just doesn't filter out alignments. For example, one alignment shows BAM TLEN=429215 for read1 (and -429215 for read2), and there is even the optional tag X8:Z:429215. All consistent. And even with or without read lengths in the distance calculation it should be well over the cutoff (I understood pairlen was distance between read ends pointing toward each other).
Am I using the wrong argument? pairlen=32000 is default, even setting it to pairlen=40000 or pairlen=20000 it doesn't appear to do anything, or not what I expected anyway.
Thanks in advance for the help!
-James
Leave a comment:
-
Originally posted by Thias View PostI think you found a bug ...
When you look at the given line 625 in the file ./current/shared/TrimRead.java, it evaluates the CIGAR string and asserts that it ends with a match (M or =). This should be the case because the trimReadWithMatchFast() function is only invoked for 100% match or unmapped reads:
Code:char c=sl.cigar.charAt(sl.cigar.length()-1); assert(c=='M' || c=='=') : c+"; "+sl.cigar+"\n"+sl;
- Likes 2
Leave a comment:
-
I think you found a bug ...
When you look at the given line 625 in the file ./current/shared/TrimRead.java, it evaluates the CIGAR string and asserts that it ends with a match (M or =). This should be the case because the trimReadWithMatchFast() function is only invoked for 100% match or unmapped reads:
Code:char c=sl.cigar.charAt(sl.cigar.length()-1); assert(c=='M' || c=='=') : c+"; "+sl.cigar+"\n"+sl;
- Likes 1
Leave a comment:
-
Hi Brian Bushnell
I recently installed BBMap_38.98.tar.gz from SourceForge on a machine running Ubuntu 20.04.4 LTS.
Per the usage guide, I checked my current version of Java with "java -Xmx90m -version" which returned the following:openjdk version "11.0.15" 2022-04-19
OpenJDK Runtime Environment (build 11.0.15+10-Ubuntu-0ubuntu0.20.04.1)
OpenJDK 64-Bit Server VM (build 11.0.15+10-Ubuntu-0ubuntu0.20.04.1, mixed mode, sharing)
I tried running reformat on a sam file with a command like:reformat.sh in=in.sam out=out.sam forcetrimleft=10
And got the following error:java -ea -Xms300m -cp /data/usr/croy/tools/bbmap/current/ jgi.ReformatReads in=in.sam out=out.sam forcetrimleft=10
Executing jgi.ReformatReads [in=in.sam, out=out.sam, forcetrimleft=10]
Input is being processed as unpaired
Waiting on header to be read from a sam file.
Exception in thread "main" java.lang.AssertionError: H; 22M123H
FS10001085:123:BRL95608-3321:1:1101:7680:1000 2195 chr3 50608025 18 22M123H = 50607939 -98 AGTGGTTTTCACTGACAGCGTG F,:FFFFFFFF::FFF:F,F,F NM:i:0 MD:Z:22 MC:Z:93M21D52M AS:i:22 XS:i:0 SA:Z:chr3,50608053,-,17S128M,60,1;
at shared.TrimRead.trimReadWithMatchFast(TrimRead.java:625)
at shared.TrimRead.trimReadWithMatch(TrimRead.java:684)
at shared.TrimRead.trimByAmount(TrimRead.java:279)
at shared.TrimRead.trimToPosition(TrimRead.java:240)
at jgi.ReformatReads.process(ReformatReads.java:919)
at jgi.ReformatReads.main(ReformatReads.java:53)
The sam file was aligned to hg38 using a recent version of BWA MEM (v0.7.17-r1188). The CIGAR string seems to follow the SAM specs as far as I can tell, so I'm not sure where things are going wrong.
Please let me know if you need any more information.
Best,
Charles
Leave a comment:
-
Is bbmap missing mappings of bbmapskimmer ?
Hello Brian,
I will be glad to get your help with the next issue.
I mapped a large DB of reads(over 100K) against the human genome, and found different results between bbmap's and bbmapskimmer's mappings.
I used bbmap to map reads against hg19, then I was interested finding all the mappings over the genome- so I used bbmapskimmer.
Except this concept I do not know any main differences between bbmap and bbmapskimmer codes (am I right?).
And still, bbmap mapped 98,425 reads and bbmapskimmer mapped 107,708. While 96,987 reads are shared.
Means bbmap mapped 1,438 reads that wasn't mapped by bbmapskimmer, which did mapped another 10,721 that wasn't mapped by bbmap- some of them with perfect minid score of 1.0.
Is there any possible reason for that?
I attached SAM files, with the relevant mappings for reads with maximum minid score in each algorithem (best minid for 1,438 reads of bbmap is 0.96, and for bbmapskimmer is 1.0).
- I used the next reference genome: hg19_main_mask_ribo_animal_allplant_allfungus.fa.gz
- All mappings were done with next parameters: minid=0.9 ambig=all
Leave a comment:
-
filter reads based on GC% content
Hello Brian and everyone,
I'm wondering whether it could be possible (and with which tool) to perform a read filter based on its GC% content.
For example, working with a metagenome with fungal and bacterial DNA fragments, how can I retain only paired-end reads with a GC > 50% and isolate bacterial reads?
Thanks for the attention
Stefano
Leave a comment:
-
Error running BBMap
Hi!
I already used BBMap successfully in the past to map RNA-seq data to a multi-FASTA transcriptome. Now I was trying to map to the human genome (downloaded from human genome browser of NCBI). I used the following command:
bbmap.sh ref=$mapping_ref build="$BBmap_ref_ID" in=$BBduked_reads1 in2=$BBduked_reads2 \
outu=unmapped_"$name1"_"$date".sam outm=mapped_"$name1"_"$date".sam \
maxindel=200k mdtag=true sam=1.4 subfilter=3 pairedonly=true
Unfortunately I´m running into an error:
[...]
Retaining first best site only for ambiguous mappings.
Writing reference.
Executing dna.FastaToChromArrays2 [/scratch/hpc-prf-ptma2/ptma2001/bbmap/GRCh38_genome_16032022.fna, 6, writeinthread=false, genscaffoldinfo=true, retain, waitforwriting=false, gz=true, maxlen=536670912, writechroms=true, minscaf=1, midpad=300, startpad=8000, stoppad=8000, nodisk=false]
Set genScaffoldInfo=true
Writing chunk 1
Writing chunk 2
Exception in thread "main" java.lang.NumberFormatException: For input string: "ӌ�jHN�H��¬���l��,ΚK��/S��ͩm���T
�!�l��X��o�ã’*s$�s^��m�h
J@�
�m��$֧����@�H,���xI%_ Uz�?� S�B�2[*i�_����)i�NEր�ٜ�?Z�i£�� "
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:580)
at java.lang.Integer.parseInt(Integer.java:615)
at dna.Data.setGenome2(Data.java:1016)
at dna.Data.setGenome(Data.java:769)
at align2.BBMap.loadIndex(BBMap.java:316)
at align2.BBMap.main(BBMap.java:32)
Any idea what causes the problem and how I could fix it?
Thanks a lot in advance and have a great day
Ella
Leave a comment:
Latest Articles
Collapse
-
by seqadmin
Developments in sequencing technologies and methodologies have transformed the field of epigenetics, giving researchers a better way to understand the complex world of gene regulation and heritable modifications. This article explores some of the diverse sequencing methods employed in the study of epigenetics, ranging from classic techniques to cutting-edge innovations while providing a brief overview of their processes, applications, and advances.
Methylation Detect...-
Channel: Articles
05-31-2023, 10:46 AM -
-
Differential Expression and Data Visualization: Recommended Tools for Next-Level Sequencing Analysisby seqadmin
After covering QC and alignment tools in the first segment and variant analysis and genome assembly in the second segment, we’re wrapping up with a discussion about tools for differential gene expression analysis and data visualization. In this article, we include recommendations from the following experts: Dr. Mark Ziemann, Senior Lecturer in Biotechnology and Bioinformatics, Deakin University; Dr. Medhat Mahmoud Postdoctoral Research Fellow at Baylor College of Medicine;...-
Channel: Articles
05-23-2023, 12:26 PM -
-
by seqadmin
Continuing from our previous article, we share variant analysis and genome assembly tools recommended by our experts Dr. Medhat Mahmoud, Postdoctoral Research Fellow at Baylor College of Medicine, and Dr. Ming "Tommy" Tang, Director of Computational Biology at Immunitas and author of From Cell Line to Command Line.
Variant detection and analysis tools
Mahmoud classifies variant detection work into two main groups: short variants (<50...-
Channel: Articles
05-19-2023, 10:03 AM -
ad_right_rmr
Collapse
News
Collapse
Topics | Statistics | Last Post | ||
---|---|---|---|---|
Started by seqadmin, 06-01-2023, 08:56 PM
|
0 responses
11 views
0 likes
|
Last Post
by seqadmin
06-01-2023, 08:56 PM
|
||
Deep Sequencing Unearths Novel Genetic Variants: Enhancing Precision Medicine for Vascular Anomalies
by seqadmin
Started by seqadmin, 06-01-2023, 07:33 AM
|
0 responses
13 views
0 likes
|
Last Post
by seqadmin
06-01-2023, 07:33 AM
|
||
Unveiling Genetic Associations Through Transcription Factor Binding Quantitative Trait Loci
by seqadmin
Started by seqadmin, 05-31-2023, 07:50 AM
|
0 responses
51 views
0 likes
|
Last Post
by seqadmin
05-31-2023, 07:50 AM
|
||
Exploring French-Canadian Ancestry: Insights into Migration, Settlement Patterns, and Genetic Structure
by seqadmin
Started by seqadmin, 05-26-2023, 09:22 AM
|
0 responses
55 views
0 likes
|
Last Post
by seqadmin
05-26-2023, 09:22 AM
|
Leave a comment: