Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Thanks for the fast response. I will try reformat.

    Comment


    • Hi,

      I got the idea to use BBDuk as a tool to filter out kmers that are shared between two samples. Is this a good idea? It's metagenomics, and the concept is that one sample is representing the background composition of bacteria, and I want to remove that background composition from another sample. One challenge to the problem is that the samples are rather big, about 2x35 GB compressed fastq (paired end), each containing about 1 billion reads total (both read pairs combined).

      Comment


      • Originally posted by boulund View Post
        Hi,

        I got the idea to use BBDuk as a tool to filter out kmers that are shared between two samples. Is this a good idea? It's metagenomics, and the concept is that one sample is representing the background composition of bacteria, and I want to remove that background composition from another sample. One challenge to the problem is that the samples are rather big, about 2x35 GB compressed fastq (paired end), each containing about 1 billion reads total (both read pairs combined).
        You can find kmers that are shared between two samples like this (bearing in mind that it may take a lot of memory):

        kcompress.sh in=a.fq.gz out=a_kmers.fa.gz
        kcompress.sh in=b.fq.gz out=b_kmers.fa.gz
        kcompress.sh in=a_kmers.fa.gz,b_kmers.fa.gz out=shared_kmers.fa.gz mincount=2

        However, I think I'd probably tend to just assemble what you consider to be the background, and then map reads to the assembly requiring fairly high identity, keeping the reads that don't map. Either approach works (and also has disadvantages) but whole reads tend to be more specific than kmers.

        Originally posted by EssigSchurke View Post
        Thanks for the fast response. I will try reformat.
        BBDuk is fixed now, by the way

        Comment


        • "just assemble what you consider to be the background"

          Assemble as?

          Comment


          • To clarify, I meant running Megahit (for example) on the reads of the sample considered to be the background, then mapping the reads of the other sample to that assembly.

            Comment


            • Ok, that's interesting. Our current approach was just that; assembly of the background sample with Megahit, and then mapping the sample to be filtered against the background assembly to remove anything that matches. I was hoping it'd be possible to do it without the massive overhead of assembling the background sample, as that's fairly time consuming and memory hungry for these large samples. I will have to evaluate the different approaches against each other to see which one fits our setup the best. Thanks for your input, and thanks for pointing me to kcompress.sh!

              Comment


              • Different output with interleaved input

                Hello again-

                I think there is an inconsistent behavior in how bbduk handles interleaved input depending on whether the interleaved option is set to "true" or "auto".

                Consider the case where only one read in a pair is discarded because too short.
                With "interleaved=auto" we get in (interleaved) output only the read passing the filter, thus appearing as a single-end read. With "interleaved=true" both reads are discarded.

                Is this difference intentional? In my opinion, interleaved=auto does the right thing in discarding only the bad read and keeping the other. However, this creates an interleaved output with single-end reads (which are actually paired-end but with the mate gone) intercalated in pair-end ones. I'm not sure if the interleaved format has ever been defined to allow such a case (for the record, bwa seems to handle it correctly).

                I just thought useful to point this out...

                This should reproduce the issue wih BBDuk version 37.54:

                Code:
                bbduk.sh in=int2.fq out=stdout.fq qtrim=rl minlength=35 trimq=15 interleaved=auto
                bbduk.sh in=int2.fq out=stdout.fq qtrim=rl minlength=35 trimq=15 interleaved=true
                With int2.fq being:

                Code:
                @r1
                ATGGCATGCACCTGTAATCCCGCTACTTGTGAGGCTGAAGCAGGAGAAT
                +
                JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
                @r1
                TAATTATATGTTTAAGTAAATGAGTAAAATTCAAGATTGCTATCGGATT
                +
                JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
                @s1
                ATGGCATGCACCTGTAATCCCGCTACTTGTGAGGCTGAAGCAGGAGAAT
                +
                AAFFFJJJJJFJFJJJJJFAFJJJFJ7<FAJJJJJJJJAAFJJJJJJJJ
                @s1
                TAATTATATGTTTAAGTAAATGAGTAAAATTCAAGATTGCTATCGGATT
                +
                #################################################

                Comment


                • @Dario: Have you tried this with a file that has the (/1 of old Illumina style of 1:N:0 of new read headers)? Interleaved files may need those identifiers to be there. They can be added by reformat.sh.

                  Comment


                  • Originally posted by GenoMax View Post
                    @Dario: Have you tried this with a file that has the (/1 of old Illumina style of 1:N:0 of new read headers)? Interleaved files may need those identifiers to be there. They can be added by reformat.sh.
                    Hi Genomax- These are R1 and R2 real identifiers:

                    Code:
                    @E00295:75:H7LLTALXX:8:1101:4553:1643 1:N:0:8
                    @E00295:75:H7LLTALXX:8:1101:4553:1643 2:N:0:8
                    I think one issue is that the interleaved format has never been formally defined so it's not clear which of option of interleaved, auto or true, is doing the right thing.

                    By the way, with interleaved=auto bbduk gives to stderr the message:

                    Code:
                    Input is being processed as unpaired
                    If I append /1 and /2 to the read names then I get the same output as with interleaved=true (both reads discarded) and the message becomes:

                    Code:
                    Input is being processed as paired
                    so you are right /1 and /2 must be there for interleaved=auto to work as expected. Maybe it would be useful to make this behavior explicit in the documentation (unless I missed it)?

                    Having said that, I think discarding both reads in a pair when only one read fails is unnecessary.

                    Comment


                    • Having said that, I think discarding both reads in a pair when only one read fails is unnecessary.
                      Not having reads in sync causes issues with mapping with most aligners (you note that bwa is an exception). I assume only case where R2 would be eliminated (while R1 remains) would be if you were trimming based on quality. If R2 contains adapter contamination then likely the entire insert is small and may result in elimination of R1 as well. I generally never have to trim data based on quality so have not run into this issue so far.

                      Comment


                      • Hi!

                        Is it possible to get the various quality histograms for both before and after e.g. trimming with BBDuk in a single run, or do I need to run BBDuk it twice to produce metrics for before and after trimming?
                        That is; run once without any trimming, just outputting histograms, and then again to trim and output histograms? Or am I missing something? The histograms output by BBDuk normally show metrics after trimming/contaminant removal, right?

                        By the way, I might mention that I finally tried to assemble my very large background sample using Tadpole, and then align my primary sample to that to remove 'background/contamination' reads. It produced a fairly poor assembly overall, but at least it ran to completion on the 500GB background sample on my memory constrained machine (64GB). The kmer-based approach was just too memory consuming.
                        Last edited by boulund; 10-25-2017, 10:54 PM.

                        Comment


                        • Seal not printing outu file

                          Hi - I'm running seal to map reads to ref genomes. this is the command I ran:

                          Code:
                          seal in="${samplename}_nonribo.fq.gz" ref="${all4genomes}" pattern="${samplename}_out_%.fq.gz" outu="${samplename}_unmapped.fq.gz" ambig=all stats="${samplename}_mapstats.txt"
                          It's outputting the matched reads per reference sequence using the 'pattern', but it's not making the 'outu' file of unmapped reads (and there should be some unmapped reads, according to my stderr output).

                          Is there another trick to making this file?

                          Thanks,
                          MC

                          Comment


                          • Hi I would like to use bbduk to filter the reads that map to a genome. I have 48 pair samples and want to make sure I understand how to input the file.
                            My sample are are Pair end and are labeled as F and R (plus _1 and _2).
                            Should I put them all after in= ? or use in2= for the reverse? Does interleave means that I put each pair together?
                            eg. in= S1_F_paired_1.fq,S1_R_paired_2.fq,S10_F_paired_1.fq,S10_R_paired_2.fq...

                            I have them as two separate lines at the moment:
                            S1_F_paired_1.fq,S10_F_paired_1.fq,S11_F_paired_1.fq,S12_F_paired_1.fq,S13_F_paired_1.fq..
                            S1_R_paired_2.fq,S10_R_paired_2.fq,S11_R_paired_2.fq,S12_R_paired_2.fq,S13_R_paired_2.fq..

                            Thanks,

                            Catalina

                            Comment


                            • java error when trying to filter on entropy

                              Hi,

                              I am getting an error when trying to use BBDuk to filter based on entropy. I have previously filtered the dataset for phix and for adapters with no issues before so I'm a bit confused as to why it won't work now.

                              I'm working on a node with 12 Gb RAM so the 8 Gb called shouldn't be an issue. I got a similar error without the -Xmx flag.

                              Running CentOS Linux release 7.3.1611
                              java version "1.7.0_131"

                              My fq file contains ~135 million 100 bp unpaired sequences.

                              command and error logs as follows:

                              Code:
                              $ bbduk.sh -Xmx8g in=seq.fq out=seq_0-1-entrop-filtered.fq outm=low_complexity-0-1.fq entropy=0.1
                              java -Djava.library.path=/apps/chpc/bio/bbmap/jni/ -ea -Xmx8g -Xms8g -cp /apps/chpc/bio/bbmap/current/ jgi.BBDukF -Xmx8g in=fish-coral_1_filtered_clean.fq out=fish-coral_1_filtered_clean_0-1-entrop-filtered.fq outm=low_complexity-0-1.fq entropy=0.1
                              Executing jgi.BBDukF [-Xmx8g, in=seq.fq, out=seq_0-1-entrop-filtered.fq, outm=low_complexity-0-1.fq, entropy=0.1]
                              Version 37.90 [-Xmx8g, in=seq.fq, out=seq_0-1-entrop-filtered.fq, outm=low_complexity-0-1.fq, entropy=0.1]
                              
                              Initial:
                              Memory: max=8232m, free=8061m, used=171m
                              
                              Input is being processed as unpaired
                              Started output streams: 0.038 seconds.
                              Exception in thread "Thread-6" java.lang.ArrayIndexOutOfBoundsException: 39
                                      at structures.EntropyTracker.averageEntropy(EntropyTracker.java:302)
                                      at structures.EntropyTracker.passes(EntropyTracker.java:348)
                                      at jgi.BBDukF$ProcessThread.run(BBDukF.java:2583)
                              Exception in thread "Thread-28" java.lang.ArrayIndexOutOfBoundsException
                              Exception in thread "Thread-25" java.lang.ArrayIndexOutOfBoundsException
                              Exception in thread "Thread-8" java.lang.ArrayIndexOutOfBoundsException
                              Exception in thread "Thread-17" java.lang.ArrayIndexOutOfBoundsException
                              Exception in thread "Thread-9" java.lang.ArrayIndexOutOfBoundsException
                              Exception in thread "Thread-24" java.lang.ArrayIndexOutOfBoundsException
                              Exception in thread "Thread-13" java.lang.ArrayIndexOutOfBoundsException
                              Exception in thread "Thread-15" java.lang.ArrayIndexOutOfBoundsException
                              Exception in thread "Thread-23" java.lang.ArrayIndexOutOfBoundsException
                              Exception in thread "Thread-11" java.lang.ArrayIndexOutOfBoundsException
                              Exception in thread "Thread-14" java.lang.ArrayIndexOutOfBoundsException
                              Exception in thread "Thread-7" java.lang.ArrayIndexOutOfBoundsException
                              Exception in thread "Thread-10" java.lang.ArrayIndexOutOfBoundsException
                              Exception in thread "Thread-29" java.lang.ArrayIndexOutOfBoundsException
                              Exception in thread "Thread-20" java.lang.ArrayIndexOutOfBoundsException
                              Exception in thread "Thread-22" java.lang.ArrayIndexOutOfBoundsException
                              Exception in thread "Thread-16" java.lang.ArrayIndexOutOfBoundsException
                              Exception in thread "Thread-26" java.lang.ArrayIndexOutOfBoundsException
                              Exception in thread "Thread-19" java.lang.ArrayIndexOutOfBoundsException
                              Exception in thread "Thread-12" java.lang.ArrayIndexOutOfBoundsException
                              Exception in thread "Thread-18" java.lang.ArrayIndexOutOfBoundsException
                              Exception in thread "Thread-27" java.lang.ArrayIndexOutOfBoundsException
                              Exception in thread "Thread-21" java.lang.ArrayIndexOutOfBoundsException
                              Processing time:                0.192 seconds.
                              
                              Input:                          34841 reads             3436691 bases.
                              Low entropy discards:           2157 reads (6.19%)      215168 bases (6.26%)
                              Total Removed:                  2181 reads (6.26%)      216121 bases (6.29%)
                              Result:                         32660 reads (93.74%)    3220570 bases (93.71%)
                              
                              Time:                           0.255 seconds.
                              Reads Processed:       34841    136.55k reads/sec
                              Bases Processed:       3436k    13.47m bases/sec
                              Any suggestions?

                              Thank you.

                              Comment


                              • Entropy filtering: Java ArrayIndexOutOfBoundsException

                                Hi,

                                I'm having an issue while trying to filter a fastq file using an entropy filter. The library protocol used ribozero so there are a lot of poly T sequences that I would like to remove.

                                I have successfully removed adapter and phiX contamination from the file but when I try the entropy filter (with various -Xmx settings or none) I get a java array error.

                                There are ~137 million 100 bp unpaired reads in the fastq file and they have been filtered for adapters, low quality and phiX (using BBDuk).

                                I'm working on a node with 24 cores and 128 GiB of RAM running CentOS Linux release 7.3.1611 and java version "1.7.0_131".

                                Command and error messages follow:

                                $ bbduk.sh -Xmx8g in=seq.fq out=seq_0-1-entrop-filtered.fq outm=low_complexity-0-1.fq entropy=0.1
                                java -Djava.library.path=/apps/chpc/bio/bbmap/jni/ -ea -Xmx8g -Xms8g -cp /apps/chpc/bio/bbmap/current/ jgi.BBDukF -Xmx8g in=fish-coral_1_filtered_clean.fq out=fish-coral_1_filtered_clean_0-1-entrop-filtered.fq outm=low_complexity-0-1.fq entropy=0.1
                                Executing jgi.BBDukF [-Xmx8g, in=seq.fq, out=seq_0-1-entrop-filtered.fq, outm=low_complexity-0-1.fq, entropy=0.1]
                                Version 37.90 [-Xmx8g, in=seq.fq, out=seq_0-1-entrop-filtered.fq, outm=low_complexity-0-1.fq, entropy=0.1]

                                Initial:
                                Memory: max=8232m, free=8061m, used=171m

                                Input is being processed as unpaired
                                Started output streams: 0.038 seconds.
                                Exception in thread "Thread-6" java.lang.ArrayIndexOutOfBoundsException: 39
                                at structures.EntropyTracker.averageEntropy(EntropyTracker.java:302)
                                at structures.EntropyTracker.passes(EntropyTracker.java:348)
                                at jgi.BBDukF$ProcessThread.run(BBDukF.java:2583)
                                Exception in thread "Thread-28" java.lang.ArrayIndexOutOfBoundsException
                                Exception in thread "Thread-25" java.lang.ArrayIndexOutOfBoundsException
                                Exception in thread "Thread-8" java.lang.ArrayIndexOutOfBoundsException
                                Exception in thread "Thread-17" java.lang.ArrayIndexOutOfBoundsException
                                Exception in thread "Thread-9" java.lang.ArrayIndexOutOfBoundsException
                                Exception in thread "Thread-24" java.lang.ArrayIndexOutOfBoundsException
                                Exception in thread "Thread-13" java.lang.ArrayIndexOutOfBoundsException
                                Exception in thread "Thread-15" java.lang.ArrayIndexOutOfBoundsException
                                Exception in thread "Thread-23" java.lang.ArrayIndexOutOfBoundsException
                                Exception in thread "Thread-11" java.lang.ArrayIndexOutOfBoundsException
                                Exception in thread "Thread-14" java.lang.ArrayIndexOutOfBoundsException
                                Exception in thread "Thread-7" java.lang.ArrayIndexOutOfBoundsException
                                Exception in thread "Thread-10" java.lang.ArrayIndexOutOfBoundsException
                                Exception in thread "Thread-29" java.lang.ArrayIndexOutOfBoundsException
                                Exception in thread "Thread-20" java.lang.ArrayIndexOutOfBoundsException
                                Exception in thread "Thread-22" java.lang.ArrayIndexOutOfBoundsException
                                Exception in thread "Thread-16" java.lang.ArrayIndexOutOfBoundsException
                                Exception in thread "Thread-26" java.lang.ArrayIndexOutOfBoundsException
                                Exception in thread "Thread-19" java.lang.ArrayIndexOutOfBoundsException
                                Exception in thread "Thread-12" java.lang.ArrayIndexOutOfBoundsException
                                Exception in thread "Thread-18" java.lang.ArrayIndexOutOfBoundsException
                                Exception in thread "Thread-27" java.lang.ArrayIndexOutOfBoundsException
                                Exception in thread "Thread-21" java.lang.ArrayIndexOutOfBoundsException
                                Processing time: 0.192 seconds.

                                Input: 34841 reads 3436691 bases.
                                Low entropy discards: 2157 reads (6.19%) 215168 bases (6.26%)
                                Total Removed: 2181 reads (6.26%) 216121 bases (6.29%)
                                Result: 32660 reads (93.74%) 3220570 bases (93.71%)

                                Time: 0.255 seconds.
                                Reads Processed: 34841 136.55k reads/sec
                                Bases Processed: 3436k 13.47m bases/sec


                                Any suggestions?

                                Thanks.
                                Dave

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Current Approaches to Protein Sequencing
                                  by seqadmin


                                  Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                  04-04-2024, 04:25 PM
                                • seqadmin
                                  Strategies for Sequencing Challenging Samples
                                  by seqadmin


                                  Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                  03-22-2024, 06:39 AM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, 04-11-2024, 12:08 PM
                                0 responses
                                15 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-10-2024, 10:19 PM
                                0 responses
                                21 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-10-2024, 09:21 AM
                                0 responses
                                16 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-04-2024, 09:00 AM
                                0 responses
                                45 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X