  • danova
    Member
    • Sep 2010
    • 27

    bbduk with a large reference database

    Hi,
    I would like to check for contaminants using both phiX and the human genome. My data is metagenomic, and I want to remove any read that maps to either phiX or the human genome.

    So far, bbduk can handle the phiX part using ref=phiX.fa.
    However, to check for contamination from human samples I would like to use the non-redundant nucleotide database. It is split into small pieces, and I usually access them through BLAST using the reference nt.nal file.

    Is that also feasible with bbduk?
  • GenoMax
    Senior Member
    • Feb 2008
    • 7142

    #2
    I don't completely understand what you mean by "i would like to use the non redundant nucleotide database" to remove contamination from human samples. It may still be easier to do what you have been doing (separate human reads from other stuff).

    You should be able to use BBSplit or seal, which can accept a folder of references. Whether BBSplit can accept a "nr" size folder may need to be experimented with.


    • danova
      Member
      • Sep 2010
      • 27

      #3
      Sorry for the confusion. I was confusing this with large BLAST databases (.nal file). bbduk does its own indexing, so there is no way to use BLAST index databases.

      Which human database do people most frequently use to discard human contamination reads from metagenomes? I thought of using the nt database (nucleotide sequence database, with entries from all traditional divisions of GenBank, EMBL, and DDBJ, excluding the bulk divisions gss, sts, pat, est, and htg).


      • GenoMax
        Senior Member
        • Feb 2008
        • 7142

        #4
        Originally posted by danova
        Sorry for the confusion. I was confusing this with large BLAST databases (.nal file). bbduk does its own indexing, so there is no way to use BLAST index databases.

        Which human database do people most frequently use to discard human contamination reads from metagenomes? I thought of using the nt database (nucleotide sequence database, with entries from all traditional divisions of GenBank, EMBL, and DDBJ, excluding the bulk divisions gss, sts, pat, est, and htg).
        Correct - for first question/comment.

        You can just use the human genome sequence (multi-fasta concatenated chromosomes in single file, from UCSC/Ensembl/NCBI/iGenomes) with bbduk (or bbsplit). BBSplit may be better since you can bin all sequences that align to human in one file and capture the rest of the data in second output file.
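
A rough sketch of the two steps discussed so far, written as Python helpers that only build the command lines (nothing is executed). The option names (in=, ref=, out=, outm=, k=, hdist=, basename=, outu=) are standard BBTools options; the file names and the k=31 hdist=1 settings are illustrative assumptions, not values given in this thread.

```python
import shlex


def bbduk_phix_cmd(reads, phix_ref, clean_out, phix_out):
    """Build a bbduk.sh command that separates reads sharing k-mers with phiX."""
    return [
        "bbduk.sh",
        f"in={reads}",
        f"ref={phix_ref}",
        f"out={clean_out}",   # reads with no matching k-mer
        f"outm={phix_out}",   # reads that matched the reference
        "k=31",               # assumed k-mer length
        "hdist=1",            # assumed: allow one mismatch per k-mer
    ]


def bbsplit_human_cmd(reads, human_ref, binned_pattern, unmapped_out):
    """Build a bbsplit.sh command that bins human reads and keeps the rest."""
    if "%" not in binned_pattern:
        # bbsplit.sh substitutes the reference name for '%' in basename.
        raise ValueError("binned_pattern must contain a '%' placeholder")
    return [
        "bbsplit.sh",
        f"in={reads}",
        f"ref={human_ref}",
        f"basename={binned_pattern}",  # one output file per reference
        f"outu={unmapped_out}",        # reads that mapped to no reference
    ]


# Hypothetical file names, for illustration only.
step1 = bbduk_phix_cmd("reads.fq", "phiX.fa", "no_phix.fq", "phix.fq")
step2 = bbsplit_human_cmd("no_phix.fq", "human.fa", "bin_%.fq", "clean.fq")
print(shlex.join(step1))
print(shlex.join(step2))
```

The printed lines can be pasted into a shell once the file names are replaced with real ones; the output of the first step feeds the second.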


        • danova
          Member
          • Sep 2010
          • 27

          #5
          Great, I'll work on that, combining it with BBSplit.
          Thanks


          • Brian Bushnell
            Super Moderator
            • Jan 2014
            • 2709

            #6
            After using BBDuk for PhiX removal, the protocol JGI uses for human removal is this, with BBMap and a masked human reference. Using BBSplit is strictly better if you know your intended organism's genome. But JGI rarely knows that, which is why we are sequencing it.

            You can download the masked human reference from the link provided. It constitutes around 98% of the human genome. That means some reads will intentionally slip through, in regions that are highly conserved down to early eukaryotes, or those with very low complexity. But, the point is to remove virtually all human contamination with no risk of false positives. If you absolutely need to remove ALL human contamination and don't know the organism's genome, you should use the unmasked reference, and you probably will get some false positive removals.
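
The mapping-based removal described above could be wrapped roughly as follows. This is only a sketch: the exact parameters of the JGI protocol are in the linked post and are not reproduced here, so the minid value, the nodisk choice, and all file names below are assumptions.

```python
def bbmap_human_removal_cmd(reads, masked_ref, clean_out, human_out,
                            min_identity=0.95):
    """Build a bbmap.sh command that maps reads to a masked human reference.

    Reads that map go to human_out (outm); everything else goes to
    clean_out (outu). min_identity is an assumed threshold, not a value
    taken from this thread.
    """
    if not 0.0 < min_identity <= 1.0:
        raise ValueError("min_identity must be in (0, 1]")
    return [
        "bbmap.sh",
        f"in={reads}",
        f"ref={masked_ref}",
        "nodisk",                   # build the index in memory only
        f"minid={min_identity}",    # approximate minimum alignment identity
        f"outu={clean_out}",        # unmapped reads: kept
        f"outm={human_out}",        # mapped reads: treated as human
    ]


# Hypothetical file names, for illustration only.
cmd = bbmap_human_removal_cmd(
    "no_phix.fq", "hg19_masked.fa", "clean.fq", "human.fq"
)
print(" ".join(cmd))
```

Swapping the masked reference for an unmasked one is the only change needed for the more aggressive variant, at the cost of the false positive removals mentioned above.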

            For assembly of a new organism, I think it is best to remove human contaminants using the above very safe procedure, then assemble, then BLAST the assembly and remove anything long (say, >400bp) that hits human with >98% identity, and hits nothing else other than other primates (typically chimp, gorilla, and orangutan).
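
The post-assembly filter described above can be sketched in Python. The sketch assumes BLAST was run with a custom tabular format, -outfmt "6 qseqid pident length sscinames" (which needs the BLAST taxonomy database to be installed), and that the primate name list is adequate for your hits; both are assumptions for illustration, and the function and variable names are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Illustrative list; extend it with whatever primate names appear in your hits.
PRIMATES = {
    "Homo sapiens",
    "Pan troglodytes",
    "Gorilla gorilla",
    "Gorilla gorilla gorilla",
    "Pongo abelii",
}


def human_contigs(blast_tsv, min_len=400, min_ident=98.0):
    """Return contigs with a long, near-identical human hit and only primate hits."""
    hits = defaultdict(list)
    for row in csv.reader(io.StringIO(blast_tsv), delimiter="\t"):
        if len(row) != 4:
            continue  # skip blank or malformed lines
        contig, pident, length, names = row
        # sscinames may list several names separated by ';'
        hits[contig].append((float(pident), int(length), set(names.split(";"))))

    flagged = []
    for contig, contig_hits in hits.items():
        strong_human = any(
            "Homo sapiens" in names and length > min_len and ident > min_ident
            for ident, length, names in contig_hits
        )
        primates_only = all(names <= PRIMATES for _, _, names in contig_hits)
        if strong_human and primates_only:
            flagged.append(contig)
    return sorted(flagged)


# Made-up hits, for illustration only.
example = (
    "contig1\t99.5\t1200\tHomo sapiens\n"
    "contig1\t98.9\t1100\tPan troglodytes\n"
    "contig2\t99.0\t900\tHomo sapiens\n"
    "contig2\t97.0\t500\tEscherichia coli\n"
    "contig3\t99.0\t300\tHomo sapiens\n"
)
print(human_contigs(example))  # prints ['contig1']
```

Here contig2 is kept because it also hits a non-primate, and contig3 is kept because its human hit is shorter than 400 bp.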

            Also, note that I do not recommend using nt/nr in any primary decontamination procedure for which you know the possible contaminants (like determining which reads are, specifically, human) - they are incomplete, poorly-curated, and the process becomes extremely slow because they are huge. Rather, using the references (or masked versions of the references) will give you a better signal-to-noise ratio. nt/nr are much better for diagnosing which things may be present than actually removing them.

            Since you're doing metagenomics, using an unmasked human genome is probably fine, since humans and bacteria are very dissimilar. But unless you are doing a human-related microbiome, you might consider removing common human-associated microbes such as E. coli and Salmonella. They seem to be anywhere humans are. Masking things like ribosomes is probably prudent if you do this. There are also some others, like Delftia and Pseudomonas, that seem to be common sequencing contaminants and cause problems with metagenome analysis, as they show up everywhere, even if human-related DNA is not present, and even in single-cell experiments of other species. Anyway, something to consider.


            • danova
              Member
              • Sep 2010
              • 27

              #7
              Thanks Brian,

              Thanks for the masked version of hg19. Do you also have a masked version of hg38?

              Just another quick question: have you published BBMap, or how should your software be cited?


              • GenoMax
                Senior Member
                • Feb 2008
                • 7142

                #8
                You can use bbmask.sh from BBMap to create a masked version of hg38.

                BBMap has not been published yet. In the past @Brian has asked people to cite the project's SourceForge (http://sourceforge.net/projects/bbmap/) website in publications.


                • Brian Bushnell
                  Super Moderator
                  • Jan 2014
                  • 2709

                  #9
                  I would not worry about HG19 versus HG38 for the purposes of contaminant removal. They mainly differ in their coordinates, not contents.
