
  • #16
    Hi Brian,

    I have a couple of questions about this script:

    1) does it work by default for patterned flowcells, like the ones for a HiSeq 4000? Or do I need to run it with some specific options, like "xsize" or "ysize"?

    2) if I only have access to one lane instead of the whole flowcell, is it still the right approach to create the "dump" file from all the samples in that lane, and then use this profile to process sample by sample?

    I have a special case where Read 1 and Read 2 behave differently and also have different lengths (R1=26bp; R2=75bp). In one lane, I already know that the whole TOP surface failed for R2 (only), and those reads look like this:

    Code:
    @K00150:243:HLG7MBBXX:6:1101:26129:1297 2:N:0:NCGCAGAA
    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
    +
    ###########################################################################
    So, out of roughly 10 million reads, about half look like this and should be discarded. However, I tried running "filterbytile" in five different ways and none of them filters out this data. Here is what I've done:

    a. Without "dump", using only Read 2, with aggressive filtering

    Code:
    filterbytile.sh in=Sample1.R2.fastq.gz out=filtered.Sample1.R2.fq.gz ud=0.75 qd=1 ed=1 ua=.5 qa=.5 ea=.5
    Code:
    Flagged 3358 of 6903 micro-tiles, containing 4747957 reads:
    3346 exceeded uniqueness thresholds.
    0 exceeded quality thresholds.
    0 exceeded error-free probability thresholds.
    81 had too few reads to calculate statistics.
    
    Reads Discarded:           0 	0.000%
    Bases Discarded:           0 	0.000%
    b. Without "dump", using only Read 2

    Code:
    filterbytile.sh in=Sample1.R2.fastq.gz out=filtered.Sample1.R2.fq.gz
    Code:
    Flagged 345 of 6903 micro-tiles, containing 144925 reads:
    326 exceeded uniqueness thresholds.
    0 exceeded quality thresholds.
    0 exceeded error-free probability thresholds.
    81 had too few reads to calculate statistics.
    
    Reads Discarded:           0 	0.000%
    Bases Discarded:           0 	0.000%
    c. Without "dump", using both pairs

    Code:
    filterbytile.sh in1=Sample1.R1.fastq.gz in2=Sample1.R2.fastq.gz out1=filtered.Sample1.R1.fq.gz out2=filtered.Sample1.R2.fq.gz
    Code:
    Flagged 878 of 12803 micro-tiles, containing 119766 reads:
    0 exceeded uniqueness thresholds.
    547 exceeded quality thresholds.
    856 exceeded error-free probability thresholds.
    95 had too few reads to calculate statistics.
    
    Reads Discarded:        119k 	0.612%
    Bases Discarded:       6043k 	0.612%
    d. With "dump" for the whole lane, using only Read 2

    Code:
    filterbytile.sh in=all.R2.fq.gz dump=dump.lane.R2
    filterbytile.sh in=Sample1.R2.fastq.gz out=filtered.Sample1.R2.fq.gz indump=dump.lane.R2
    Code:
    Flagged 11949 of 350360 micro-tiles, containing 7514071 reads:
    11612 exceeded uniqueness thresholds.
    0 exceeded quality thresholds.
    0 exceeded error-free probability thresholds.
    882 had too few reads to calculate statistics.
    
    Reads Discarded:           0 	0.000%
    Bases Discarded:           0 	0.000%
    e. With "dump" for the whole lane, using both pairs

    Code:
    filterbytile.sh in1=all.R1.fq.gz in2=all.R2.fq.gz dump=dump.lane
    filterbytile.sh in1=Sample1.R1.fastq.gz in2=Sample1.R2.fastq.gz out1=filtered.Sample1.R1.fq.gz out2=filtered.Sample1.R2.fq.gz indump=dump.lane
    Code:
    Flagged 16907 of 691597 micro-tiles, containing 5698570 reads:
    0 exceeded uniqueness thresholds.
    10436 exceeded quality thresholds.
    16794 exceeded error-free probability thresholds.
    1192 had too few reads to calculate statistics.
    
    Reads Discarded:        173k 	0.886%
    Bases Discarded:       8742k 	0.886%

    Maybe this is not an issue addressed by the script? Are you considering both surfaces as part of the same tile? Or are you treating them differently?
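For reference, the surface can be read from the tile field of the read header. Below is a minimal sketch, not part of "filterbytile", that counts all-N reads per surface. It assumes the usual Illumina header layout (instrument:run:flowcell:lane:tile:x:y) and 4-digit tile numbers whose first digit is the surface (1 = top, 2 = bottom). The function names and the second record are made up for illustration.

```python
from collections import Counter

def surface_of(header: str) -> str:
    """Return 'top', 'bottom' or 'unknown' for an Illumina FASTQ header line."""
    # Fields before the space: @instrument:run:flowcell:lane:tile:x:y
    name = header.split()[0]
    tile = name.split(":")[4]
    # Assumption: the first digit of the tile number encodes the surface.
    return {"1": "top", "2": "bottom"}.get(tile[:1], "unknown")

def count_all_n_by_surface(fastq_lines):
    """Count reads whose sequence consists only of N, grouped by surface."""
    counts = Counter()
    lines = [line.rstrip("\n") for line in fastq_lines]
    # FASTQ records are four lines each: header, sequence, '+', qualities.
    for i in range(0, len(lines) - 3, 4):
        header, seq = lines[i], lines[i + 1]
        if seq and set(seq) == {"N"}:
            counts[surface_of(header)] += 1
    return counts

records = [
    # Header taken from the failed read shown earlier in this post (tile 1101).
    "@K00150:243:HLG7MBBXX:6:1101:26129:1297 2:N:0:NCGCAGAA",
    "N" * 75,
    "+",
    "#" * 75,
    # Hypothetical bottom-surface read, invented for illustration.
    "@K00150:243:HLG7MBBXX:6:2101:1000:2000 2:N:0:NCGCAGAA",
    "ACGT" * 5,
    "+",
    "F" * 20,
]
print(count_all_n_by_surface(records))
```

Running this over a real file (for example by passing the lines of a gzip-opened FASTQ) would show whether the all-N reads are confined to tiles starting with 1, which is what a failed top surface would look like under the assumption above.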

    Alternatively, I've tried filtering the data by quality using "bbduk", but of the two setups only one works, and I'm wondering whether that might be a bug.

    i. Filtering using "maq" only on Read 2 [didn't work, weird behavior]

    Code:
    bbduk.sh qtrim=f maq=10 in=Sample1.R2.fq out=filtered.Sample1.R2.fq
    Code:
    Input:                  	9773111 reads 		732983325 bases.
    Low quality discards:   	0 reads (0.00%) 	0 bases (0.00%)
    Total Removed:          	0 reads (0.00%) 	0 bases (0.00%)
    Result:                 	9773111 reads (100.00%) 	732983325 bases (100.00%)
    At first I thought the filtering was not active because the documentation says it's applied "after" trimming. However, if I specify a larger "maq" value, like 30, it does do something:

    Code:
    Input:                  	9773111 reads 		732983325 bases.
    Low quality discards:   	2428888 reads (24.85%) 	182166600 bases (24.85%)
    Total Removed:          	2428888 reads (24.85%) 	182166600 bases (24.85%)
    Result:                 	7344223 reads (75.15%) 	550816725 bases (75.15%)
    Is this a bug, then? Could this option be applied when no trimming is defined? That is sometimes the way to go, especially when using alignment software capable of soft-masking ends.
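To make the expectation concrete: with Phred+33 encoding, the "#" character corresponds to Q2, so a read whose quality string is all "#" has a mean quality of 2, well below 10. The sketch below is not BBDuk's implementation; it only shows two common ways of averaging quality (arithmetic mean of the scores, and mean error probability converted back to Phred). The function names are hypothetical, and which method BBDuk uses is not assumed here.

```python
import math

def phred_scores(qual: str, offset: int = 33):
    """Decode a FASTQ quality string into a list of Phred scores."""
    return [ord(c) - offset for c in qual]

def mean_quality_arithmetic(qual: str) -> float:
    """Plain average of the Phred scores."""
    scores = phred_scores(qual)
    return sum(scores) / len(scores)

def mean_quality_from_error_prob(qual: str) -> float:
    """Average the per-base error probabilities, then convert back to Phred."""
    probs = [10 ** (-q / 10) for q in phred_scores(qual)]
    mean_p = sum(probs) / len(probs)
    return -10 * math.log10(mean_p)

failed_read_quals = "#" * 75  # quality string of the failed R2 reads
print(mean_quality_arithmetic(failed_read_quals))
print(mean_quality_from_error_prob(failed_read_quals))
```

Both methods give a mean of about 2 for a uniform quality string, so under either definition these reads would fall below a threshold of 10.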

    ii. Filtering using "maxns" only on Read 2 [worked like a charm]

    Code:
    bbduk.sh qtrim=f maxns=10 in=Sample1.R2.fq out=filtered.Sample1.R2.fq
    Code:
    Input:                  	9773111 reads 		732983325 bases.
    Low quality discards:   	5013913 reads (51.30%) 	376043475 bases (51.30%)
    Total Removed:          	5013913 reads (51.30%) 	376043475 bases (51.30%)
    Result:                 	4759198 reads (48.70%) 	356939850 bases (48.70%)

    This is the behaviour I am expecting for both "filterbytile" and "bbduk maq=10".
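The behaviour expected here can be summarised as a per-read N count compared against a cutoff. This is a minimal sketch of that idea with hypothetical helper names, not BBDuk's code. It assumes that a read is discarded when it has more Ns than the cutoff.

```python
def keep_read(seq: str, max_ns: int) -> bool:
    """Keep a read only if it has at most max_ns ambiguous bases."""
    return seq.upper().count("N") <= max_ns

def filter_by_ns(records, max_ns: int):
    """Yield (header, seq, qual) tuples that pass the N-count cutoff."""
    for header, seq, qual in records:
        if keep_read(seq, max_ns):
            yield header, seq, qual

# Two illustrative records: one all-N read and one made-up normal read.
example = [
    ("@failed", "N" * 75, "#" * 75),
    ("@normal", "ACGT" * 5, "F" * 20),
]
kept = [header for header, _, _ in filter_by_ns(example, max_ns=10)]
print(kept)
```

With a cutoff of 10, the all-N read is dropped and the normal read is kept.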

    Please, let me know if I'm doing something wrong or if it's not how you intended it to be.

    Thank you very much in advance, and very sorry for the lengthy post.

    Cheers,
    Santiago



    • #17
      Sorry if the same post appears many times. I'm having issues with the browser and am not getting feedback after posting. If all the posts appear, just keep the last one. Thanks!
