Header Leaderboard Ad
Collapse
Introducing FilterByTile: Remove Low-Quality Reads Without Adding Bias
Collapse
Announcement
Collapse
No announcement yet.
X
-
Sorry if the same post appears many times. I'm having issues with the browser and am not getting feedback after posting. If all the postings appears, just keep the last one. Thanks!
-
Hi Brian,
I have a couple of questions about this script:
1) does it work by default for patterned flowcells, like the ones for a HiSeq 4000? Or do I need to run it with some specific options, like "xsize" or "ysize"?
2) if I only have access to one lane instead of the whole flowcell, would it also be the way to go to create the "dump" file with the samples in it, and then use this profile to process sample by sample?
I have a special case where Read 1 and Read 2 have different behaviours as well as length patterns (R1=26bp; R2=75bp). In one lane, I already know that the whole TOP surface for R2 (only) failed, having those reads looking like this:
Code:@K00150:243:HLG7MBBXX:6:1101:26129:1297 2:N:0:NCGCAGAA NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN + ###########################################################################
a. Without "dump", using only Read 2, with aggresive filtering
Code:filterbytile.sh in=Sample1.R2.fastq.gz out=filtered.Sample1.R2.fq.gz ud=0.75 qd=1 ed=1 ua=.5 qa=.5 ea=.5
Code:Flagged 3358 of 6903 micro-tiles, containing 4747957 reads: 3346 exceeded uniqueness thresholds. 0 exceeded quality thresholds. 0 exceeded error-free probability thresholds. 81 had too few reads to calculate statistics. Reads Discarded: 0 0.000% Bases Discarded: 0 0.000%
Code:filterbytile.sh in=Sample1.R2.fastq.gz out=filtered.Sample1.R2.fq.gz
Code:Flagged 345 of 6903 micro-tiles, containing 144925 reads: 326 exceeded uniqueness thresholds. 0 exceeded quality thresholds. 0 exceeded error-free probability thresholds. 81 had too few reads to calculate statistics. Reads Discarded: 0 0.000% Bases Discarded: 0 0.000%
Code:filterbytile.sh in1=Sample1.R1.fastq.gz in2=Sample1.R2.fastq.gz out1=filtered.Sample1.R1.fq.gz out2=filtered.Sample1.R2.fq.gz
Code:Flagged 878 of 12803 micro-tiles, containing 119766 reads: 0 exceeded uniqueness thresholds. 547 exceeded quality thresholds. 856 exceeded error-free probability thresholds. 95 had too few reads to calculate statistics. Reads Discarded: 119k 0.612% Bases Discarded: 6043k 0.612%
Code:filterbytile.sh in=all.R2.fq.gz dump=dump.lane.R2 filterbytile.sh in=Sample1.R2.fastq.gz out=filtered.Sample1.R2.fq.gz indump=dump.lane.R2
Code:Flagged 11949 of 350360 micro-tiles, containing 7514071 reads: 11612 exceeded uniqueness thresholds. 0 exceeded quality thresholds. 0 exceeded error-free probability thresholds. 882 had too few reads to calculate statistics. Reads Discarded: 0 0.000% Bases Discarded: 0 0.000%
Code:filterbytile.sh in1=all.R1.fq.gz in2=all.R2.fq.gz dump=dump.lane filterbytile.sh in1=Sample1.R1.fastq.gz in2=Sample1.R2.fastq.gz out1=filtered.Sample1.R1.fq.gz out2=filtered.Sample1.R2.fq.gz indump=dump.lane
Code:Flagged 16907 of 691597 micro-tiles, containing 5698570 reads: 0 exceeded uniqueness thresholds. 10436 exceeded quality thresholds. 16794 exceeded error-free probability thresholds. 1192 had too few reads to calculate statistics. Reads Discarded: 173k 0.886% Bases Discarded: 8742k 0.886%
Maybe this is not an issue addressed by the script? Are you considering both surfaces as part of the same tile? Or are you treating them differently?
Alternatively, I've tried filtering the data by quality using "bbduk", but from the two set ups, only one of them work and I'm wondering if that might be a bug?
i. Filtering using "maq" only on Read 2 [didn't work, weird behavior]
Code:bbduk.sh qtrim=f maq=10 in=Sample1.R2.fq out=filtered.Sample1.R2.fq
Code:Input: 9773111 reads 732983325 bases. Low quality discards: 0 reads (0.00%) 0 bases (0.00%) Total Removed: 0 reads (0.00%) 0 bases (0.00%) Result: 9773111 reads (100.00%) 732983325 bases (100.00%)
Code:Input: 9773111 reads 732983325 bases. Low quality discards: 2428888 reads (24.85%) 182166600 bases (24.85%) Total Removed: 2428888 reads (24.85%) 182166600 bases (24.85%) Result: 7344223 reads (75.15%) 550816725 bases (75.15%)
ii. Filtering using "maxns" only on Read 2 [worked as a charm]
Code:bbduk.sh qtrim=f maxns=10 in=Sample1.R2.fq out=filtered.Sample1.R2.fq
Code:Input: 9773111 reads 732983325 bases. Low quality discards: 5013913 reads (51.30%) 376043475 bases (51.30%) Total Removed: 5013913 reads (51.30%) 376043475 bases (51.30%) Result: 4759198 reads (48.70%) 356939850 bases (48.70%)
This is the behaviour I am expecting for both "filterbytile" and "bbduk maq=10".
Please, let me know if I'm doing something wrong or if it's not how you intended it to be.
Thank you very much in advance, and very sorry for the lengthy post.
Cheers,
Santiago
Leave a comment:
-
Thanks! Yes, FilterByTile processes lanes independently, so you can use them all at once.
Leave a comment:
-
Thanks. This was HiSeq 2500 2x250 RapidRun (with both lanes concatenated, which I assume is ok, since you refer to "flowcell" and not "lane"). I used the entire dataset (15 metagenomes) to calc the stats, then used dump.flowcell to filter each sample. Here is one example:
Code:Flagged 36407 of 519552 micro-tiles, containing 50578608 reads: 0 exceeded uniqueness thresholds. 30332 exceeded quality thresholds. 34084 exceeded error-free probability thresholds. 0 had too few reads to calculate statistics. Filtering reads: 988.159 seconds. Time: 988.671 seconds. Reads Processed: 56900k 57.55k reads/sec Bases Processed: 14225m 14.39m bases/sec Reads Discarded: 3094k 5.438% Bases Discarded: 773m 5.438%
Last edited by mcmc; 02-11-2017, 05:01 PM.
Leave a comment:
-
Yeah, sorry about that, it's a known bug when you are using already-created flowcell files. If you do everything in a single pass it won't print that. Do you mind posting the filtering statistics, platform, and read length, just out of curiosity? The defaults generally remove around 2-5% of the reads in my testing, but I've only tested it on HiSeq 2500 and NextSeq data.
Leave a comment:
-
Hello Brian,
I get the message "Warning: Zero reads processed." using indump=dump.flowcell. But it looks like making the dump file worked ok (it says it processed 780m reads). Is it safe to ignore this warning?
Thanks,
mcmc
Leave a comment:
-
Hi MC,
FilterByTile should be run on raw data, before anything else that changes or removes any reads. If you do any quality-related filtering or trimming steps before FilterByTile, you will remove some of the lowest-quality reads, which will disrupt the statistics.
Leave a comment:
-
Brian et al., would you recommend filterbytile.sh be done first, before adapter trimming & quality filtering?
Thanks,
MC
Leave a comment:
-
OK, the new version of Clumpify is out, adding the "dedupe" and "optical" flags (as well as a few other related flags), so you can do optical or full deduplication. Also related to FilterByTile, BBDuk now has xmin, ymin, xmax, and ymax flags for large-scale location-based read filtering; essentially, you can eliminate tile-edge effects using a bounding box. For our NextSeq I was able to eliminate tile-edge duplicates with "xmin=1600 xmax=26300". There did not seem to be any on the Y edges. But, the exact values may vary by machine or run.Last edited by Brian Bushnell; 01-04-2017, 03:11 PM.
Leave a comment:
-
Hi Genomax,
There were plenty of identical pairs. Everything is scaled to 100% in that graph, but here is the raw data:
Code:A B C D DBM 0 3868 3868 3868 3868 3870 1 6260 6316 6402 6454 6470 2 6534 6562 6720 6728 6746 3 6590 6628 6794 6806 6826 4 6622 6662 6832 6840 6858 5 6652 6690 6864 6874 6888
Code:Reads originating from the same fragment, called multiple times despite originating from nearly the same physical flowcell location.
Last edited by Brian Bushnell; 12-29-2016, 03:48 PM.
Leave a comment:
-
Am I reading the graph above correctly in that you were not able to find true optical dups (perfect matches on the read) in data you tested? These should be present in problematic HiSeq 3K, 4K data.
You may also want to grab some HiSeq 4000 (or HiSeq X) data from SRA to test since we expect this to be a problem there.
Leave a comment:
-
I decided it would work best to add deduplication features to Clumpify. All approaches work perfectly for error-free reads, but Clumpify is affected slightly more by errors than mapping-based approaches, for situations when it is desirable to remove "duplicates" with mismatches. Here's a comparison showing Clumpify's paired read deduplication compared to DedupeByMapping on some real HiSeq data, allow reads with various number of mismatches to the reference (DedupeByMapping is considered the gold standard in each case, though that does not necessarily mean it is more correct). Clumpify is run with different settings (C and D have higher removal rates because they use 3 passes; A is at default settings).
I also added in the ability to restrict duplicate removal to only clusters within a specific number pixels of each other on the flowcell, to avoid removal of PCR duplicates or coincidental duplicates due to high coverage. In so doing I noticed some interesting things... firstly, that most optical duplicates are on different tiles (inter-tile duplicates) rather than the same tile, and secondly, that in the data I tested, NextSeq has a WAY higher optical duplicate rate (~1%) than HiSeq 2500/1T (0.05%). The way you can distinguish between an inter-tile optical duplicate and a PCR duplicate is that inter-tile optical duplicates will share an X or Y coordinate (within some number of pixels, typically under 40). Intra-tile optical duplicates will of course share both coordinates as well as the tile number.
I'll release this once I'm done testing.Attached Files
Leave a comment:
-
GenoMax, my first thought was also that an optical dup filter was needed! I would love to remove optical duplicates as a first step in a pipeline (before mapping). Seems like between clumpify and this filterbytile he almost has it written already. (Sorry Brian, I'm sure it is annoying for us to besiege you with requests and then add how it should be easy to do.)
Leave a comment:
-
If you can look in the positional neighborhood for clusters having identical sequence then those would be it.
Leave a comment:
-
Originally posted by GenoMax View Post@Brian: Can this be easily extended to provide "MarkDuplicates" functionality? Currently only Picard tools does this.
I am referring to optical duplicates that form due to "pad hopping" in case of patterned HiSeq 4000 flowcells. Since you are using smaller rectangular tiles to look at reads in the neighborhood would it be possible to identify clusters that may be optical dups and mark them?
Leave a comment:
Latest Articles
Collapse
-
by seqadmin
Cancer research has been transformed through numerous molecular techniques, with RNA sequencing (RNA-seq) playing a crucial role in understanding the complexity of the disease. Maša Ivin, Ph.D., Scientific Writer at Lexogen, and Yvonne Goepel Ph.D., Product Manager at Lexogen, remarked that “The high-throughput nature of RNA-seq allows for rapid profiling and deep exploration of the transcriptome.” They emphasized its indispensable role in cancer research, aiding in biomarker...-
Channel: Articles
09-07-2023, 11:15 PM -
-
by seqadmin
Ribonucleic acid (RNA) represents a range of diverse molecules that play a crucial role in many cellular processes. From serving as a protein template to regulating genes, the complex processes involving RNA make it a focal point of study for many scientists. This article will spotlight various methods scientists have developed to investigate different RNA subtypes and the broader transcriptome.
Whole Transcriptome RNA-seq
Whole transcriptome sequencing...-
Channel: Articles
08-31-2023, 11:07 AM -
ad_right_rmr
Collapse
News
Collapse
Topics | Statistics | Last Post | ||
---|---|---|---|---|
Started by seqadmin, Yesterday, 09:05 AM
|
0 responses
13 views
0 likes
|
Last Post
by seqadmin
Yesterday, 09:05 AM
|
||
Started by seqadmin, 09-21-2023, 06:18 AM
|
0 responses
10 views
0 likes
|
Last Post
by seqadmin
09-21-2023, 06:18 AM
|
||
Started by seqadmin, 09-20-2023, 09:17 AM
|
0 responses
12 views
0 likes
|
Last Post
by seqadmin
09-20-2023, 09:17 AM
|
||
Started by seqadmin, 09-19-2023, 09:23 AM
|
0 responses
28 views
0 likes
|
Last Post
by seqadmin
09-19-2023, 09:23 AM
|
Leave a comment: