Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • felvis56
    Member
    • Dec 2012
    • 11

    Remove identical mapping reads between two indexed samples

    I have 24 indexed samples from a single Miseq run and I want to remove identical reads between different samples i.e. cross-talk or potential lab contamination/carry-over.

    I was trying bedtools intersect -f 1 -r to try and report only unique sequences from bed file but I'm not sure this is the right tool for the job.

    Thanks in advance!
  • GenoMax
    Senior Member
    • Feb 2008
    • 7142

    #2
    See this: http://seqanswers.com/forums/showthread.php?t=39270

    Post #3.

    I think you should be able to provide multiple files as input (separate them by a comma). Brian can confirm.
    Last edited by GenoMax; 02-18-2015, 06:56 AM.

    Comment

    • sarvidsson
      Senior Member
      • Jan 2015
      • 137

      #3
      bedtools intersect -f 1 -r would give you an idea but won't handle paired reads, so you will have high rates of falsely accused carry-over fragments. You would need to filter the input and run it several times, each time exporting the individual IDs of your sequences, and compare those lists in the end. Possible, but tedious.

      Another way to get a rough idea but avoid alignment would be to merge the read pairs (if the fragments you've sequenced are shorter than <580 bp this works well), awk for the sequence lines, sort and get the count of unique sequences per sample. Then write a small script to compare the unique sequences between samples.

      In any case, you will falsely accuse some reads at high coverages. If you are really interested in getting quantitative values for carry-over contamination, you should spike in different synthetic DNA in each sample...

      Comment

      • Brian Bushnell
        Super Moderator
        • Jan 2014
        • 2709

        #4
        Originally posted by GenoMax View Post
        See this: http://seqanswers.com/forums/showthread.php?t=39270

        Post #3.

        I think you should be able to provide multiple files as input (separate them by a comma). Brian can confirm.
        Dedupe can remove duplicate reads from multiple files simultaneously, if they are comma-delimited (e.g. in=file1.fastq,file2.fastq,file3.fastq). And if you set the flag "uniqueonly=t" then ALL copies of duplicate reads will be removed, as opposed to the default behavior of leaving one copy of duplicate reads.

        However, it does not care which file a read came from; in other words, it can't remove only reads that are duplicates across multiple files but leave the ones that are duplicates within a file. That can still be accomplished, though, like this:

        1) Run dedupe on each sample individually, so now there are at most 1 copy of a read per sample.
        2) Run dedupe again on all of the samples together, with "uniqueonly=t". The only remaining duplicate reads will be the ones duplicated between samples, so that's all that will be removed.

        Comment

        • Elsie
          Member
          • Mar 2011
          • 85

          #5
          Thanks Brian, Genomax and sarvidsson for the helpful comments. Do you feel that this methodology would remove all possible cross-talk?
          thanks.

          Comment

          • Brian Bushnell
            Super Moderator
            • Jan 2014
            • 2709

            #6
            Hi Elsie,

            This will only remove PCR duplicates between samples, not all crosstalk. Can you describe your experimental setup? Are you multiplexing different organisms, or multiple individuals of the same organism, or...?

            Comment

            • Elsie
              Member
              • Mar 2011
              • 85

              #7
              Multiplexing ~20 samples of different, complex organisms, on the MiSeq. No red flags but it would be nice to know if there were any good methods for eliminating/investigating possible cross-talk.
              thanks.

              Comment

              • GenoMax
                Senior Member
                • Feb 2008
                • 7142

                #8
                If you have reference genomes available you could split the reads into organism specific files using BBsplit.

                Comment

                • Elsie
                  Member
                  • Mar 2011
                  • 85

                  #9
                  Ah no reference genomes unfortunately, part of the problem.

                  Comment

                  • Brian Bushnell
                    Super Moderator
                    • Jan 2014
                    • 2709

                    #10
                    So, I have a different program for this, called CrossBlock; it's specifically for cross-contamination between multiple organisms (and I specifically designed it for multiplexed single cell bacterial libraries). The usage is like this:

                    First, preprocess (adapter trim, etc) all libraries, and if they are paired in two files, interleave them so you have exactly one file of reads per library (we'll call this "preprocessed"). Then assemble the preprocessed reads into one assembly per library. For single cells, I recommend Spades, though it if fails in some cases you can use Tadpole for those - you need an assembly for every library, but it does not have to be optimal.

                    Now you have one assembly fasta file and one reads fastq file (preprocessed) per library.

                    Prepare two text files, one with the paths to the read files (one per library), and one with the paths to the assemblies, in the same order, one line per file. For example:

                    readlist.txt:
                    Code:
                    lib1.fq.gz
                    lib2.fq.gz
                    lib3.fq.gz
                    reflist.txt:
                    Code:
                    lib1.fasta
                    lib2.fasta
                    lib3.fasta
                    Now run CrossBlock:
                    crossblock.sh readnamefile=readlist.txt refnamefile=reflist.txt out=decontaminated

                    The end result will be clean contigs files for each library, without cross-contaminant contigs. If you want clean reads, you can then map the reads to the clean contigs and only keep the ones that map, then optionally reassemble to get your final assemblies.

                    Comment

                    • Elsie
                      Member
                      • Mar 2011
                      • 85

                      #11
                      Brian you are a genius.
                      I'll give this a go and let you know what the results look like, thank you.

                      Comment

                      • Elsie
                        Member
                        • Mar 2011
                        • 85

                        #12
                        Works a treat.
                        All my transcripts are still remaining, which validates my hypothesis that there is no (or no substantial) cross talk.
                        Got 1 java error running cross block:

                        Rename/Merge Phase Start
                        Input is being processed as paired
                        Writing interleaved.
                        Exception in thread "main" java.lang.NoClassDefFoundError: java/lang/ProcessBuilder$Redirect
                        at fileIO.ReadWrite.getOutputStreamFromProcess(ReadWrite.java:619)
                        at fileIO.ReadWrite.getPigzStream(ReadWrite.java:550)
                        at fileIO.ReadWrite.getGZipOutputStream(ReadWrite.java:521)
                        at fileIO.ReadWrite.getOutputStream(ReadWrite.java:382)
                        at fileIO.ReadWrite.getOutputStream(ReadWrite.java:346)
                        at stream.ReadStreamWriter.<init>(ReadStreamWriter.java:69)
                        at stream.ReadStreamByteWriter.<init>(ReadStreamByteWriter.java:17)
                        at stream.ConcurrentGenericReadOutputStream.<init>(ConcurrentGenericReadOutputStream.java:39)
                        at stream.ConcurrentReadOutputStream.getStream(ConcurrentReadOutputStream.java:53)
                        at stream.ConcurrentReadOutputStream.getStream(ConcurrentReadOutputStream.java:27)
                        at jgi.DecontaminateByNormalization.renameAndMerge(DecontaminateByNormalization.java:298)
                        at jgi.DecontaminateByNormalization.process(DecontaminateByNormalization.java:234)
                        at jgi.DecontaminateByNormalization.main(DecontaminateByNormalization.java:49)
                        Caused by: java.lang.ClassNotFoundException: java.lang.ProcessBuilder$Redirect
                        at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
                        at java.security.AccessController.doPrivileged(Native Method)
                        at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
                        at java.lang.ClassLoader.loadClass(ClassLoader.java:323)
                        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
                        at java.lang.ClassLoader.loadClass(ClassLoader.java:268)
                        ... 13 more


                        But results were still written to the decontaminated folder.
                        Thanks Brian, this is excellent.

                        Comment

                        • Brian Bushnell
                          Super Moderator
                          • Jan 2014
                          • 2709

                          #13
                          Hi Elsie,

                          Thanks for the report! Would you mind running this command:

                          java -Xmx20m -version

                          ...and posting the result?

                          Thanks!

                          P.S. I am not currently confident that your results are correct. You may need to rerun with a different version of Java; that error really should not have happened and I'm not sure what the effects would be.
                          Last edited by Brian Bushnell; 11-12-2015, 04:00 PM.

                          Comment

                          • Elsie
                            Member
                            • Mar 2011
                            • 85

                            #14
                            java -Xmx20m -version

                            java version "1.6.0_34"
                            OpenJDK Runtime Environment (IcedTea6 1.13.6) (rhel-1.13.6.1.el6_6-x86_64)
                            OpenJDK 64-Bit Server VM (build 23.25-b01, mixed mode)

                            We do have other versions of Java - different programs are pedantic about the version of java they run. I'll try a different version and get back to you.
                            thanks.

                            Comment

                            • Elsie
                              Member
                              • Mar 2011
                              • 85

                              #15
                              java -Xmx20m -version
                              java version "1.7.0_67"
                              Java(TM) SE Runtime Environment (build 1.7.0_67-b01)
                              Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode)


                              No java error and a complete list of files, thanks Brian.

                              Comment

                              Latest Articles

                              Collapse

                              • SEQadmin2
                                Nine Things a Sample Prep Scientist Thinks About Before Sequencing
                                by SEQadmin2


                                I’m not a sequencing expert. I’m a purification scientist who uses NGS to evaluate workflows my group develops. With this perspective, we think about the sample first and the NGS workflow second. The sequencer is an exceptionally honest reporter, but it can only report on what you give it, so whether you get clean, interpretable data from an NGS workflow is largely determined before you begin.

                                Here are nine questions we think about, in roughly the order they matter, before...
                                06-18-2026, 07:11 AM
                              • SEQadmin2
                                From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                                by SEQadmin2


                                Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                                The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                                ...
                                06-02-2026, 10:05 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by SEQadmin2, 06-26-2026, 11:10 AM
                              0 responses
                              11 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-17-2026, 06:09 AM
                              0 responses
                              45 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-09-2026, 11:58 AM
                              0 responses
                              105 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-05-2026, 10:09 AM
                              0 responses
                              125 views
                              0 reactions
                              Last Post SEQadmin2  
                              Working...