Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • GenoMax
    replied
    While Brian will have a more detailed insight it should be ok to use the unmasked genome with BBSplit. Any mapping issues you may have with short reads should be more or less same with genome or just CDS sequences.

    Leave a comment:


  • Microalgues
    replied
    Biological question about contaminants DB

    Dear Brian,

    I have created a de novo transcriptome of a non-model organism. After BLASTing the contigs I found some homology with possible contaminants. My question is if I should use DNA genomes (un/masked) or the CDS gene predictions from ENSMBL database.

    In the first case, do you recommend the unmasked, the masked version of genomes (ensmbl) or should I use BBMask?
    And in the second case, could BBsplit handle with CDS mapping?

    I am a bit lost, I think that contigs from de novo transcriptome have to be mapped against CDS sequences.

    With thanks,
    Xavier

    Leave a comment:


  • sk8bro
    replied
    Thanks Brian, I will look into the options you suggested to handle this use case!

    Kate

    Leave a comment:


  • Brian Bushnell
    replied
    Originally posted by GenoMax View Post
    You can set nodisk=t to prevent the index from being written to disk (memory will always be used).
    Actually, "nodisk" does not work with BBSplit... sorry! I'll clarify that in the documentation. It's not like it's impossible to make it work, but it would be pretty complicated; one of those things that I would do if I could clone myself.

    Hi Brian,

    I am wondering for BBsplit (with ambiguous2=split) if it is expected behavior that if I provide 2 references:

    Ref 1 has 1 sequence that is a perfect match to PE sequencing reads
    Ref 2 has many unique sequences that are at least 1 MM different from sequencing reads

    For BBsplit to think the reads are ambiguously mapped? And for this to be the case for >90% of read pairs but not for every read pair?

    Is there a way to force BBsplit to select Ref1 (perfect) over the options in Ref2?

    I am trying to remove a nearly-identical putative contaminant sequence from my data...

    (FYI I have ambiguous is set to best, and I've also tried setting both ambiguous and ambiguous2 to best, but the reads are almost all considered to be ambiguous in that scenario too)

    Thx! Kate
    Hi Kate,

    You can certainly set "perfectmode" to only allow mappings that are 100% identity to the reference. For example, you could run once in perfectmode, and then run the remaining unmapped reads normally. But generally, yes, this is intended behavior. BBSplit is intended to do things like separate mouse reads from human reads when a mouse is used as the vector for some human DNA study. Also, it is designed to separate reads belonging to various bacteria in a metagenome, and chloroplast/plant reads in a plant. In any of these cases, if there is a single bp mismatch, you can't unambiguously assign a read to one reference or the other, since it could be a sequencing error or (more importantly) an actual variation.

    You might find Seal (seal.sh) useful. It has functionality similar to BBSplit, but is alignment free (meaning it is way faster, but uses more memory). It allows you to specify cutoffs, so you can for example send all of the perfectly-matching reads to one file, regardless of whether they almost match another genome. It decides which file to assign a read to based on the number of kmers (by default, 31-mers) matching that reference. When dealing with bacteria, I always prefer Seal over BBSplit because it is so much faster and easier to use, and bacteria have tiny genomes that are under high evolutionary pressure to avoid low-complexity sequence. Euks have huge genomes with a lot of low-complexity regions so BBSplit is better in that case, since it is more precise.

    If you want to remove a particular contaminant from your data, I suggest trying BBDuk. BBDuk does not allow kmers longer than 31. However, it emulates longer kmers. For example, if you set "k=90", then it will consider reads as matching (by default, they get discarded) if there is a 90bp stretch in which all 31-mers match 31-mers in the reference. In practice this is very similar to matching 90-mers.

    So, for example:

    bbduk.sh in=reads.fq out=filtered.fq ref=contaminant.fa k=90 mm=f

    That will remove all reads containing at least 90bp that shares all kmers with your reference (which should be the unwanted sequence). The exact values depend on the length of the sequence in question. Note that read length is very important here; if a read only overlaps 80bp of the 90bp sequence in question, it would not be removed.

    Leave a comment:


  • sk8bro
    replied
    Hi Brian,

    I am wondering for BBsplit (with ambiguous2=split) if it is expected behavior that if I provide 2 references:

    Ref 1 has 1 sequence that is a perfect match to PE sequencing reads
    Ref 2 has many unique sequences that are at least 1 MM different from sequencing reads

    For BBsplit to think the reads are ambiguously mapped? And for this to be the case for >90% of read pairs but not for every read pair?

    Is there a way to force BBsplit to select Ref1 (perfect) over the options in Ref2?

    I am trying to remove a nearly-identical putative contaminant sequence from my data...

    (FYI I have ambiguous is set to best, and I've also tried setting both ambiguous and ambiguous2 to best, but the reads are almost all considered to be ambiguous in that scenario too)

    Thx! Kate
    Last edited by sk8bro; 10-11-2017, 01:08 PM.

    Leave a comment:


  • GenoMax
    replied
    For some other programs in BBMap suite there has been a "race condition" which produces a nasty (though according to @Brian a harmless) error related to threads.

    At least in this case the result appears to be identical. In the last table you had given us the results all looked different.

    Not sure why you still have this in the output, if you had deleted the ref folder:
    NOTE: Ignoring reference file because it already appears to have been processed.
    NOTE: If you wish to regenerate the index, please manually delete ref/genome/1/summary.txt
    You can set nodisk=t to prevent the index from being written to disk (memory will always be used).
    Last edited by GenoMax; 10-05-2017, 04:43 AM.

    Leave a comment:


  • wlokhorst
    replied
    40428 reads indeed. The number of threads shouldn't matter for the succes or failure of the process. However, I did run it on 4 threads for you, so here are the results:
    Code:
    BBMap version 37.53
    Set MINIMUM_ALIGNMENT_SCORE_RATIO to 0.560
    Set threads to 4
    Retaining first best site only for ambiguous mappings.
    NOTE:   Ignoring reference file because it already appears to have been processed.
    NOTE:   If you wish to regenerate the index, please manually delete ref/genome/1/summary.txt
    Set genome to 1
    
    Loaded Reference:       1.647 seconds.
    Loading index for chunk 1-1, build 1
    Generated Index:        1.542 seconds.
    Analyzed Index:         3.989 seconds.
    Cleared Memory:         6.978 seconds.
    Processing reads in single-ended mode.
    Started read stream.
    Started 4 mapping threads.
    Detecting finished threads: 0, 1, 2, 3
    
       ------------------   Results   ------------------
    
    Genome:                 1
    Key Length:             13
    Max Indel:              20
    Minimum Score Ratio:    0.56
    Mapping Mode:           normal
    Reads Used:             40428   (3919820 bases)
    
    Mapping:                10.603 seconds.
    Reads/sec:              3812.77
    kBases/sec:             369.68
    
    
    Read 1 data:            pct reads       num reads       pct bases          num bases
    
    mapped:                   4.5464%            1838         4.3524%             170607
    unambiguous:              1.2120%             490         1.2217%              47890
    ambiguous:                3.3343%            1348         3.1307%             122717
    low-Q discards:           0.0000%               0         0.0000%                  0
    
    perfect best site:        0.7173%             290         0.6776%              26561
    semiperfect site:         0.7297%             295         0.6905%              27066
    
    Match Rate:                   NA               NA        91.4514%             156445
    Error Rate:              42.4134%            1543         8.4773%              14502
    Sub Rate:                42.1935%            1535         7.7799%              13309
    Del Rate:                 2.2265%              81         0.2701%                462
    Ins Rate:                 4.1506%             151         0.4273%                731
    N Rate:                   0.2474%               9         0.0713%                122
    
    Total time:             25.038 seconds.
    This also includes removal of the ref folder (which I do every time before testing the next run) and re-creation with the first command I've given you before.

    Leave a comment:


  • GenoMax
    replied
    Do you only have 40428 reads in the file? 48 threads seems like an overkill for this amount of reads. Can you use 4 and see if that helps? You can also delete the previous bbmap index files and force the program to create new ones using ref=.

    Adding relative paths before the file names should not cause the program to crash. It never has for me.

    Leave a comment:


  • wlokhorst
    replied
    No error messages are produced, but you might be right about it crashing. (It also crashes when I put ./ in front of the input file.) Maybe you can see what goes wrong from the output text, so here it is:
    Code:
    BBMap version 37.53
    Set MINIMUM_ALIGNMENT_SCORE_RATIO to 0.560
    Retaining first best site only for ambiguous mappings.
    NOTE:   Ignoring reference file because it already appears to have been processed.
    NOTE:   If you wish to regenerate the index, please manually delete ref/genome/1/summary.txt
    Set genome to 1
    
    Loaded Reference:       1.506 seconds.
    Loading index for chunk 1-1, build 1
    Generated Index:        1.482 seconds.
    Analyzed Index:         3.715 seconds.
    Cleared Memory:         6.933 seconds.
    Processing reads in single-ended mode.
    Started read stream.
    Started 48 mapping threads.
    Detecting finished threads: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47
    
       ------------------   Results   ------------------
    
    Genome:                 1
    Key Length:             13
    Max Indel:              20
    Minimum Score Ratio:    0.56
    Mapping Mode:           normal
    Reads Used:             40428   (3919820 bases)
    
    Mapping:                13.209 seconds.
    Reads/sec:              3060.74
    kBases/sec:             296.76
    
    
    Read 1 data:            pct reads       num reads       pct bases          num bases
    
    mapped:                   4.5464%            1838         4.3524%             170607
    unambiguous:              1.2120%             490         1.2217%              47890
    ambiguous:                3.3343%            1348         3.1307%             122717
    low-Q discards:           0.0000%               0         0.0000%                  0
    
    perfect best site:        0.7173%             290         0.6776%              26561
    semiperfect site:         0.7297%             295         0.6905%              27066
    
    Match Rate:                   NA               NA        91.4514%             156445
    Error Rate:              42.4134%            1543         8.4773%              14502
    Sub Rate:                42.1935%            1535         7.7799%              13309
    Del Rate:                 2.2265%              81         0.2701%                462
    Ins Rate:                 4.1506%             151         0.4273%                731
    N Rate:                   0.2474%               9         0.0713%                122
    
    Total time:             27.078 seconds.

    Leave a comment:


  • Brian Bushnell
    replied
    The numbers of mapped reads do not add up to the same record in each case... it seems like the program is crashing and not processing all the reads, or else the input is different. The output is deterministic for single-ended reads so I can't see any other explanation. Does the program print an error message?

    Leave a comment:


  • GenoMax
    replied
    @Brian: Reads are single end 100 bp. Commands used are in post #37

    Leave a comment:


  • Brian Bushnell
    replied
    That's odd, if you are not intentionally producing random assignments for sequences. Can you please post your full command line? Also, are the reads paired or unpaired?

    Leave a comment:


  • wlokhorst
    replied
    Neither deterministic=t nor ambiguous2=all helped change the results. Unless you or Brian know a way of solving this, I'm sorry to say that I'll try using another tool (or attempt writing something of my own).

    Reads are 100bp.

    Leave a comment:


  • GenoMax
    replied
    You could try changing the ambiguous2= parameter and see how that changes the result. "toss/all" may generate results that look more similar. With very similar genomes you are going to have trouble assigning reads reproducibly. How long are these reads?

    Leave a comment:


  • wlokhorst
    replied
    @GenoMax: I did not consider the non-deterministic behaviour of mappers, however BBMap should still be deterministic in my case, since I have single-end reads. See copied text from the BBMap help section below. I could still try to explicitly call this though.

    Run in deterministic mode. In this case it is good to set averagepairdist. BBMap is deterministic without this flag if using single-ended reads, or run singlethreaded.
    My references are very similar in some cases (e.g. different strains of the same species) and I'm leaving the ambiguous parameters untouched.

    Commands:
    bash ~/bbmap/bbsplit.sh ref={}
    # ref= a folder containing my reference sequences
    bash ~/bbmap/bbsplit.sh in={} basename={}/out_%.fq
    # in= sample, basename= output folder and output filename

    Leave a comment:

Latest Articles

Collapse

  • seqadmin
    Essential Discoveries and Tools in Epitranscriptomics
    by seqadmin




    The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
    04-22-2024, 07:01 AM
  • seqadmin
    Current Approaches to Protein Sequencing
    by seqadmin


    Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
    04-04-2024, 04:25 PM

ad_right_rmr

Collapse

News

Collapse

Topics Statistics Last Post
Started by seqadmin, Yesterday, 11:49 AM
0 responses
15 views
0 likes
Last Post seqadmin  
Started by seqadmin, 04-24-2024, 08:47 AM
0 responses
16 views
0 likes
Last Post seqadmin  
Started by seqadmin, 04-11-2024, 12:08 PM
0 responses
61 views
0 likes
Last Post seqadmin  
Started by seqadmin, 04-10-2024, 10:19 PM
0 responses
60 views
0 likes
Last Post seqadmin  
Working...
X