  • baseq
    Junior Member
    • Oct 2015
    • 3

    Contamination resources

    Hi All,

    I'm working with a library with significant bacterial contamination, and I've spent a lot of time trying to remove it without much success. The organism of interest is an obligate root pathogen (not bacterial), and I'm afraid I've sequenced many non-target, associated bacterial species. Eventually, I hope to do some de novo assembly of the cleaned reads. Well, I've already done some de novo assembly, but find many bacterial sequences in my blast results. I still have some things I want to try, but wanted to ask for suggestions so that I might optimize my strategy and time. Anyway, so far I have tried:

    - Mapping raw reads with DeconSeq against the included bacterial databases. This hasn't worked particularly well, and the program frequently crashes on our system anyway (even after recompiling as suggested).
    - Mapping raw reads with bwa mem to NCBI's all bacterial genome database. That is, I downloaded the all_fna.tar.gz file for bacterial genomes, concatenated them, split this file into reasonably sized files, and indexed them as references for bwa mem. I then wrote a script to pull out any unmapped sequences from the resulting SAM files. I realize this is a nearly identical approach to DeconSeq, but it seems to work a little better (and is much more stable!)
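
The "pull out any unmapped sequences" step described above could look roughly like the following. This is only a minimal sketch, not the script actually used in the thread: the function names are hypothetical, reads are tracked by name only, and it assumes one SAM file per reference chunk. The point it illustrates is that, when the database is split into several files, a read should be kept only if it is unmapped against every chunk.

```python
import io

UNMAPPED_FLAG = 0x4  # SAM FLAG bit meaning "segment unmapped"


def unmapped_read_names(sam_handle):
    """Return the names of reads flagged as unmapped in one SAM stream."""
    names = set()
    for line in sam_handle:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # not a complete alignment record
            continue
        if int(fields[1]) & UNMAPPED_FLAG:
            names.add(fields[0])
    return names


def reads_unmapped_everywhere(sam_handles):
    """Keep only reads that are unmapped in every SAM file (one per chunk)."""
    result = None
    for handle in sam_handles:
        names = unmapped_read_names(handle)
        result = names if result is None else result & names
    return result or set()


# Tiny in-memory demonstration with two hypothetical reference chunks.
chunk_a = io.StringIO(
    "@HD\tVN:1.6\n"
    "r1\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n"
    "r2\t0\tbactA\t1\t60\t4M\t*\t0\t0\tACGT\tIIII\n"
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n"
)
chunk_b = io.StringIO(
    "r1\t16\tbactB\t5\t60\t4M\t*\t0\t0\tACGT\tIIII\n"
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n"
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n"
)
clean = reads_unmapped_everywhere([chunk_a, chunk_b])
```

For paired-end data one would probably also want to look at the mate-unmapped bit (0x8) and decide whether a pair is discarded when only one mate maps; the sketch does not handle that.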

    Via blast I'm still finding bacterial contamination in my resulting contigs, so whatever I'm doing isn't working well enough. I've checked the forums and it seems like BBMap/split is a logical next step, so I'll be trying that soon. I've got some questions for you:

    - With BBsplit can I use my concatenated NCBI bacterial genome fasta as my reference?
    - I've been using the default algorithm parameters for bwa mem. Is there something that I might change to make that pipeline more effective?
    - Any other suggestions?

    Obviously I've learned my lesson, and I'm trying to acquire some much cleaner template right now. However, I'd like to not waste all the data I've already received.

    Thanks for your help; this website is such a great resource!
  • GenoMax
    Senior Member
    • Feb 2008
    • 7142

    #2
    Originally posted by baseq

    Via blast I'm still finding bacterial contamination in my resulting contigs, so whatever I'm doing isn't working well enough. I've checked the forums and it seems like BBMap/split is a logical next step, so I'll be trying that soon. I've got some questions for you:

    - With BBsplit can I use my concatenated NCBI bacterial genome fasta as my reference?

    You should be able to use the NCBI fasta as your reference but see caveats below.

    Originally posted by baseq

    Obviously I've learned my lesson, and I'm trying to acquire some much cleaner template right now. However, I'd like to not waste all the data I've already received.

    A few questions/observations:
    1. Is there a genome available for a closely related species that could be used as bait with BBSplit to pull out the reads of interest?
    2. Does your organism of interest live on the surface of the roots or inside them? I would have imagined that you went through some sterilization/clean-up step to minimize or remove bacterial contaminants on or near the surface, yet you are still seeing bacterial contamination?
    3. What fraction of reads appear to be of bacterial origin?

    That said, you will need to be careful about labeling "contaminants". Short reads can align to references just by chance or, in other legitimate cases, to (parts of) genes that are conserved across genera. The only way you are going to be absolutely sure that you have data from your own organism is to somehow separate/purify it from other living things before making a library from it.
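
One way to act on this caution is to call a read a contaminant only when most of it aligns with few differences, rather than on any alignment at all. The sketch below is an illustration under assumptions: the function names and the thresholds (90% of the read aligned, at most 5% edits) are arbitrary, and it relies on the aligner writing an `NM:i:` edit-distance tag, as bwa mem does. It cannot tell a conserved gene from true contamination, it only discards weak partial hits.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
QUERY_OPS = set("MIS=XH")  # operations that consume read bases (incl. clipping)
ALIGNED_OPS = set("M=X")   # read bases actually aligned to the reference


def alignment_stats(sam_line):
    """Return (aligned fraction of the read, edits per aligned base)."""
    fields = sam_line.rstrip("\n").split("\t")
    flag, cigar = int(fields[1]), fields[5]
    if flag & 0x4 or cigar == "*":  # unmapped record
        return 0.0, 1.0
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    read_len = sum(n for n, op in ops if op in QUERY_OPS)
    aligned = sum(n for n, op in ops if op in ALIGNED_OPS)
    if read_len == 0 or aligned == 0:
        return 0.0, 1.0
    edits = 0  # assumption: a missing NM tag is treated as zero edits
    for tag in fields[11:]:
        if tag.startswith("NM:i:"):
            edits = int(tag[5:])
    return aligned / read_len, edits / aligned


def is_confident_hit(sam_line, min_fraction=0.9, max_edit_rate=0.05):
    """True only for near-full-length, low-divergence alignments."""
    fraction, edit_rate = alignment_stats(sam_line)
    return fraction >= min_fraction and edit_rate <= max_edit_rate
```

Reads that map but fail this check are the ambiguous ones; whether to keep or drop them depends on how stringent the filtering needs to be.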

    You are obviously on the right track. If the present data bothers you too much, then you could set it aside until you have a new, better-defined dataset that can give you a better reference to use to pull out reads from this first set.


    • Brian Bushnell
      Super Moderator
      • Jan 2014
      • 2709

      #3
      Originally posted by baseq
      - With BBsplit can I use my concatenated NCBI bacterial genome fasta as my reference?
      BBSplit needs multiple reference files as input; one per organism, or one for target and another for everything else. It only outputs one file per reference file.

      Seal, on the other hand, which is similar, can use a single concatenated file, as it (by default) will output one file per reference sequence within a concatenated set of references.


      • baseq
        Junior Member
        • Oct 2015
        • 3

        #4
        Hi Genomax,

        Thanks for the reply. Regarding your questions:

        1. Yes, there is a somewhat closely related organism with some scaffolds available. I tried to assemble the raw reads to it without much success. I'll use the sequences as bait as you've suggested.

        2. It lives inside the roots. I'm working with very robust spores. My method of purification was a DNase treatment after lysing the root tissue using beads. This seems to have worked very well for removing the plant sequences, but not so well for the bacteria. I'm now testing a much more thorough protocol to lyse the bacteria before going through with the DNase treatment.

        3. I lose about 20-25% of the reads in the filtering that I'm currently doing. For the time being, my goal is not to get a nice, clean genome assembly; rather, I'm just trying to get some decent contigs with which we can do some other work. Therefore I'm filtering pretty stringently, even if I lose some of my good sequences.

        Hi Brian,

        Thanks for your reply too. I'm planning to run bbsplit with ref_hostplant=hostplantref.fa, ref_bact=ncbi_concatenated_bacteria.fa, ref_relative=ref_close_relative.fa and will ask it to capture unmapped reads as well. I don't really care how the bacterial reads map (at this point), I just want them gone! Hopefully this will help.
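
As a rough sketch of that plan, the helper below only assembles the command line; it does not run anything. The helper name and the read and output file names are hypothetical. The `ref_<name>=`, `basename=` (with `%` standing for the reference name) and `outu=` parameters are written as I understand them from the BBSplit help text, so check them against the installed version. Each parameter is passed as a single `key=value` token with no spaces around the `=`.

```python
import shlex


def build_bbsplit_command(reads, references, unmapped_out, basename="out_%.fq"):
    """Build a bbsplit.sh argument list from a {name: fasta_path} mapping."""
    if not references:
        raise ValueError("at least one named reference is required")
    args = ["bbsplit.sh", f"in={reads}"]
    for name, path in references.items():
        args.append(f"ref_{name}={path}")   # one named reference per file
    args.append(f"basename={basename}")      # one output file per reference
    args.append(f"outu={unmapped_out}")      # reads matching no reference
    return args


command = build_bbsplit_command(
    reads="raw_reads.fq",  # hypothetical input file name
    references={
        "hostplant": "hostplantref.fa",
        "bact": "ncbi_concatenated_bacteria.fa",
        "relative": "ref_close_relative.fa",
    },
    unmapped_out="unmapped.fq",  # hypothetical output file name
)
print(shlex.join(command))
```

With this layout, reads binned to the `relative` reference plus the unmapped reads would be the candidates for assembly, while the `hostplant` and `bact` bins are discarded.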


        • cliffbeall
          Senior Member
          • Jan 2010
          • 144

          #5
          You might want to try comparing at the protein level to the nr protein database and/or k-mer frequency analysis, since I'm not sure how well bacteria from soil will match known reference genomes at the nucleotide level.
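
A k-mer frequency analysis can start from something as simple as a tetranucleotide profile and GC content per contig, then look for contigs whose composition sits far from the bulk. The sketch below shows those building blocks only; the function names are hypothetical, the distance is a plain L1 difference chosen for illustration, and reverse complements are not merged, which a real binning tool would normally do.

```python
from collections import Counter
from itertools import product


def kmer_profile(seq, k=4):
    """Relative frequency of every A/C/G/T k-mer, in a fixed sorted order."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)  # k-mers containing N are ignored
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


def gc_content(seq):
    """Fraction of G and C among the unambiguous bases."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def profile_distance(a, b):
    """L1 distance between two profiles of equal length."""
    return sum(abs(x - y) for x, y in zip(a, b))
```

Contigs from a different organism often differ in GC content and k-mer usage, so plotting or clustering these values per contig can flag candidates even when no close reference genome exists.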


          • SylvainL
            Senior Member
            • Feb 2012
            • 180

            #6
            Since you already have contigs, why don't you BLAST them and eliminate the bacterial contigs?

            You can even BLAST your contigs against the closely related organism.

            This should allow you to get your contigs of interest. They may not be enough to get the full assembly but at least you won't feel you wasted your time with this first experiment...

            And with your new dataset, you will be able to confirm or not the first contigs.
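
The contig-level filtering suggested above could be sketched as follows, assuming two BLAST runs saved in the default 12-column tabular format (`-outfmt 6`): one against bacterial sequences and one against the related organism. The function names and the thresholds (identity, alignment length, e-value) are illustrative assumptions to be tuned, not recommended values.

```python
def contigs_with_hits(blast_lines, min_pident=90.0, min_length=200, max_evalue=1e-10):
    """Return query IDs having at least one hit that passes the thresholds.

    Expects default BLAST tabular columns: qseqid sseqid pident length
    mismatch gapopen qstart qend sstart send evalue bitscore.
    """
    hits = set()
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 12:
            continue
        pident, length, evalue = float(fields[2]), int(fields[3]), float(fields[10])
        if pident >= min_pident and length >= min_length and evalue <= max_evalue:
            hits.add(fields[0])
    return hits


def select_contigs(all_contigs, bacterial_hits, relative_hits):
    """Drop contigs with bacterial hits; also list those supported by the relative."""
    kept = [c for c in all_contigs if c not in bacterial_hits]
    supported = [c for c in kept if c in relative_hits]
    return kept, supported
```

Contigs in `supported` are the most defensible set to carry forward; contigs that are kept but match neither database remain unclassified and can be revisited once the cleaner dataset is available.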

            s.

