Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Remove identical mapping reads between two indexed samples

    I have 24 indexed samples from a single Miseq run and I want to remove identical reads between different samples i.e. cross-talk or potential lab contamination/carry-over.

    I was trying bedtools intersect -f 1 -r to try and report only unique sequences from bed file but I'm not sure this is the right tool for the job.

    Thanks in advance!

  • #2
    See this: http://seqanswers.com/forums/showthread.php?t=39270

    Post #3.

    I think you should be able to provide multiple files as input (separate them by a comma). Brian can confirm.
    Last edited by GenoMax; 02-18-2015, 06:56 AM.

    Comment


    • #3
      bedtools intersect -f 1 -r would give you an idea but won't handle paired reads, so you will have high rates of falsely accused carry-over fragments. You would need to filter the input and run it several times, each time exporting the individual IDs of your sequences, and compare those lists in the end. Possible, but tedious.

      Another way to get a rough idea but avoid alignment would be to merge the read pairs (if the fragments you've sequenced are shorter than <580 bp this works well), awk for the sequence lines, sort and get the count of unique sequences per sample. Then write a small script to compare the unique sequences between samples.

      In any case, you will falsely accuse some reads at high coverages. If you are really interested in getting quantitative values for carry-over contamination, you should spike in different synthetic DNA in each sample...

      Comment


      • #4
        Originally posted by GenoMax View Post
        See this: http://seqanswers.com/forums/showthread.php?t=39270

        Post #3.

        I think you should be able to provide multiple files as input (separate them by a comma). Brian can confirm.
        Dedupe can remove duplicate reads from multiple files simultaneously, if they are comma-delimited (e.g. in=file1.fastq,file2.fastq,file3.fastq). And if you set the flag "uniqueonly=t" then ALL copies of duplicate reads will be removed, as opposed to the default behavior of leaving one copy of duplicate reads.

        However, it does not care which file a read came from; in other words, it can't remove only reads that are duplicates across multiple files but leave the ones that are duplicates within a file. That can still be accomplished, though, like this:

        1) Run dedupe on each sample individually, so now there are at most 1 copy of a read per sample.
        2) Run dedupe again on all of the samples together, with "uniqueonly=t". The only remaining duplicate reads will be the ones duplicated between samples, so that's all that will be removed.

        Comment


        • #5
          Thanks Brian, Genomax and sarvidsson for the helpful comments. Do you feel that this methodology would remove all possible cross-talk?
          thanks.

          Comment


          • #6
            Hi Elsie,

            This will only remove PCR duplicates between samples, not all crosstalk. Can you describe your experimental setup? Are you multiplexing different organisms, or multiple individuals of the same organism, or...?

            Comment


            • #7
              Multiplexing ~20 samples of different, complex organisms, on the MiSeq. No red flags but it would be nice to know if there were any good methods for eliminating/investigating possible cross-talk.
              thanks.

              Comment


              • #8
                If you have reference genomes available you could split the reads into organism specific files using BBsplit.

                Comment


                • #9
                  Ah no reference genomes unfortunately, part of the problem.

                  Comment


                  • #10
                    So, I have a different program for this, called CrossBlock; it's specifically for cross-contamination between multiple organisms (and I specifically designed it for multiplexed single cell bacterial libraries). The usage is like this:

                    First, preprocess (adapter trim, etc) all libraries, and if they are paired in two files, interleave them so you have exactly one file of reads per library (we'll call this "preprocessed"). Then assemble the preprocessed reads into one assembly per library. For single cells, I recommend Spades, though it if fails in some cases you can use Tadpole for those - you need an assembly for every library, but it does not have to be optimal.

                    Now you have one assembly fasta file and one reads fastq file (preprocessed) per library.

                    Prepare two text files, one with the paths to the read files (one per library), and one with the paths to the assemblies, in the same order, one line per file. For example:

                    readlist.txt:
                    Code:
                    lib1.fq.gz
                    lib2.fq.gz
                    lib3.fq.gz
                    reflist.txt:
                    Code:
                    lib1.fasta
                    lib2.fasta
                    lib3.fasta
                    Now run CrossBlock:
                    crossblock.sh readnamefile=readlist.txt refnamefile=reflist.txt out=decontaminated

                    The end result will be clean contigs files for each library, without cross-contaminant contigs. If you want clean reads, you can then map the reads to the clean contigs and only keep the ones that map, then optionally reassemble to get your final assemblies.

                    Comment


                    • #11
                      Brian you are a genius.
                      I'll give this a go and let you know what the results look like, thank you.

                      Comment


                      • #12
                        Works a treat.
                        All my transcripts are still remaining, which validates my hypothesis that there is no (or no substantial) cross talk.
                        Got 1 java error running cross block:

                        Rename/Merge Phase Start
                        Input is being processed as paired
                        Writing interleaved.
                        Exception in thread "main" java.lang.NoClassDefFoundError: java/lang/ProcessBuilder$Redirect
                        at fileIO.ReadWrite.getOutputStreamFromProcess(ReadWrite.java:619)
                        at fileIO.ReadWrite.getPigzStream(ReadWrite.java:550)
                        at fileIO.ReadWrite.getGZipOutputStream(ReadWrite.java:521)
                        at fileIO.ReadWrite.getOutputStream(ReadWrite.java:382)
                        at fileIO.ReadWrite.getOutputStream(ReadWrite.java:346)
                        at stream.ReadStreamWriter.<init>(ReadStreamWriter.java:69)
                        at stream.ReadStreamByteWriter.<init>(ReadStreamByteWriter.java:17)
                        at stream.ConcurrentGenericReadOutputStream.<init>(ConcurrentGenericReadOutputStream.java:39)
                        at stream.ConcurrentReadOutputStream.getStream(ConcurrentReadOutputStream.java:53)
                        at stream.ConcurrentReadOutputStream.getStream(ConcurrentReadOutputStream.java:27)
                        at jgi.DecontaminateByNormalization.renameAndMerge(DecontaminateByNormalization.java:298)
                        at jgi.DecontaminateByNormalization.process(DecontaminateByNormalization.java:234)
                        at jgi.DecontaminateByNormalization.main(DecontaminateByNormalization.java:49)
                        Caused by: java.lang.ClassNotFoundException: java.lang.ProcessBuilder$Redirect
                        at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
                        at java.security.AccessController.doPrivileged(Native Method)
                        at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
                        at java.lang.ClassLoader.loadClass(ClassLoader.java:323)
                        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
                        at java.lang.ClassLoader.loadClass(ClassLoader.java:268)
                        ... 13 more


                        But results were still written to the decontaminated folder.
                        Thanks Brian, this is excellent.

                        Comment


                        • #13
                          Hi Elsie,

                          Thanks for the report! Would you mind running this command:

                          java -Xmx20m -version

                          ...and posting the result?

                          Thanks!

                          P.S. I am not currently confident that your results are correct. You may need to rerun with a different version of Java; that error really should not have happened and I'm not sure what the effects would be.
                          Last edited by Brian Bushnell; 11-12-2015, 04:00 PM.

                          Comment


                          • #14
                            java -Xmx20m -version

                            java version "1.6.0_34"
                            OpenJDK Runtime Environment (IcedTea6 1.13.6) (rhel-1.13.6.1.el6_6-x86_64)
                            OpenJDK 64-Bit Server VM (build 23.25-b01, mixed mode)

                            We do have other versions of Java - different programs are pedantic about the version of java they run. I'll try a different version and get back to you.
                            thanks.

                            Comment


                            • #15
                              java -Xmx20m -version
                              java version "1.7.0_67"
                              Java(TM) SE Runtime Environment (build 1.7.0_67-b01)
                              Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode)


                              No java error and a complete list of files, thanks Brian.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Exploring the Dynamics of the Tumor Microenvironment
                                by seqadmin




                                The complexity of cancer is clearly demonstrated in the diverse ecosystem of the tumor microenvironment (TME). The TME is made up of numerous cell types and its development begins with the changes that happen during oncogenesis. “Genomic mutations, copy number changes, epigenetic alterations, and alternative gene expression occur to varying degrees within the affected tumor cells,” explained Andrea O’Hara, Ph.D., Strategic Technical Specialist at Azenta. “As...
                                07-08-2024, 03:19 PM
                              • seqadmin
                                Exploring Human Diversity Through Large-Scale Omics
                                by seqadmin


                                In 2003, researchers from the Human Genome Project (HGP) announced the most comprehensive genome to date1. Although the genome wasn’t fully completed until nearly 20 years later2, numerous large-scale projects, such as the International HapMap Project and 1000 Genomes Project, continued the HGP's work, capturing extensive variation and genomic diversity within humans. Recently, newer initiatives have significantly increased in scale and expanded beyond genomics, offering a more detailed...
                                06-25-2024, 06:43 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 07-16-2024, 05:49 AM
                              0 responses
                              24 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 07-15-2024, 06:53 AM
                              0 responses
                              31 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 07-10-2024, 07:30 AM
                              0 responses
                              40 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 07-03-2024, 09:45 AM
                              0 responses
                              205 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X