  • Avoiding Genome Assembly: Illumina DNA-Seq in-silico Library Subtraction?

    Hey all! I've got (what I consider) an interesting data processing question that I thought someone here may be able to help out with.

    Study System: Single-species, non-model (i.e., no reference genome for it or anything close) vertebrate with a ~2.5 Gb genome that shows large within-species karyotypic differences between geographic isolates.

    Data: Low-coverage approach; 100bp PE DNA-Seq datasets with ~170M PF clusters per isolate genome.

    Study Question: Can we use low-coverage, isolate-specific DNA-seq libraries to identify genomic differences between individuals/populations in a non-model organism?

    SeqAnswers Question: Because our cost-induced low coverage limits our ability to make a decent assembly (Ray worked okay, but inconsistently between samples), we were thinking that maybe there would be a way to work from the reads of each isolate directly to perform a sort of in silico subtraction? Then, a small scale assembly could be done on the isolate-specific reads.

    To rephrase for clarity:

    1) Sequence genome of isolate 1 and isolate 2 at low coverage
    2) Perform normal read QC
    3) Instead of assembling and reciprocal BLASTing (precluded by our low coverage), how could we compare the reads from isolate 1 to the reads of isolate 2 and subset the unique reads in each isolate (which, since they're the same species, should be highly enriched for the chromosomal anomalies)?

    Any thoughts here? There may be a tool for this and I just don't know the right search terms. It also may be bioinformatics sacrilege, and if so, accept my thousand pardons.

    EDIT: I just wanted to add that we have tried to assemble these genomes and do our analyses in a more traditional way, but assembly quality wasn't consistent between samples. I add this as a gesture that I'm not milking SeqAnswers to come up with my whole project pipeline, but rather, to assist in looking for non-traditional methods when the others fall away.
    Last edited by jazz710; 01-28-2015, 02:47 PM. Reason: Addition

  • #2
    If you had a lot of memory, and error-free or low-error data, you could subtract one library from another using BBDuk, based on the presence of shared kmers:

    bbduk.sh in=isolate1.fastq ref=isolate2.fastq out=unique1.fastq k=25 mm=f speed=8 prealloc

    That would certainly work for bacteria, but vertebrate data will have 1000x more kmers. Reducing k would reduce the kmer space, though. The "speed" flag also reduces memory consumption by ignoring part of the kmer space; max is 15, default is 0.

    BBDuk uses around 15 bytes per kmer, so around 37.5 GB if the data were error-free and haploid, but of course more for a diploid with sequencing errors. The "speed=8" flag will cut memory usage in half (speed=12 would cut it by a factor of 4, and speed=14 by a factor of 8). Also, quality-trimming the reads first would reduce the number of errors and thus the number of kmers. So - I think it's probably doable in this case.
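    As a back-of-the-envelope check of those numbers, here is the arithmetic as a small sketch (the 15 bytes/kmer figure and the "unique kmers ≈ genome size" assumption are taken from the post above; real usage will be higher with errors and diploidy):

    ```python
    # Rough BBDuk memory estimate for kmer filtering (a sketch, not exact).
    genome_size = 2.5e9     # ~2.5 Gb genome -> ~2.5e9 unique kmers (idealized)
    bytes_per_kmer = 15     # BBDuk's approximate per-kmer overhead

    base_gb = genome_size * bytes_per_kmer / 1e9
    print(f"error-free haploid estimate: {base_gb:.1f} GB")  # ~37.5 GB

    # The "speed" flag ignores part of the kmer space:
    # speed=8 halves memory, speed=12 quarters it, speed=14 gives 1/8.
    for speed, divisor in [(8, 2), (12, 4), (14, 8)]:
        print(f"speed={speed}: ~{base_gb / divisor:.1f} GB")
    ```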

    Edit:

    It's probably better, though, to do this by mapping the reads to the other sample's assembly and keeping what doesn't map, unless the assembly fails to account for a lot of the reads. This will work even if the assembly is highly fragmented, as long as it covers most of the genome.

    It's also possible to convert the reads to fasta and map to them directly rather than assembling, and again keep what doesn't map.
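    The mapping-based variant might look something like this with BBMap (a sketch; file names are placeholders, and you may want to loosen `minid` since the cross-isolate reference will be divergent):

    ```shell
    # Map isolate 1's reads against isolate 2's (possibly fragmented) assembly
    # and keep only the reads that fail to map (outu = unmapped-read output).
    bbmap.sh ref=isolate2_assembly.fa in=isolate1_1.fastq in2=isolate1_2.fastq \
        outu=isolate1_unique.fastq minid=0.90 nodisk
    ```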
    Last edited by Brian Bushnell; 01-28-2015, 03:40 PM.



    • #3
      Thanks for the tremendously fast response. I actually used BBduk to clean my data with the following command:

      bbduk.sh in1=ISOLATE1_1.fastq in2=ISOLATE1_2.fastq out1=ISOLATE1_1.clean.fastq out2=ISOLATE1_2.clean.fastq minlen=50 qtrim=rl trimq=10

      So now, I will merge my PE data into ISOLATE1.fastq and ISOLATE2.fastq and I'll try to give them a run on our university server.
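    The merge step can be done with BBTools' reformat.sh, which interleaves the pairs into a single file (a sketch with placeholder file names; note bbduk.sh also accepts in1=/in2= directly, so interleaving first is optional):

    ```shell
    # Interleave each isolate's cleaned read pairs into one FASTQ file.
    reformat.sh in1=ISOLATE1_1.clean.fastq in2=ISOLATE1_2.clean.fastq out=ISOLATE1.fastq
    reformat.sh in1=ISOLATE2_1.clean.fastq in2=ISOLATE2_2.clean.fastq out=ISOLATE2.fastq
    ```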

      Thanks again! I'll check in with results later.



      • #4
        From the karyotype differences you see, do you expect there are large deletions in one sample that completely remove genes (both copies)? If the karyotype differences are inversions or balanced translocations then the in silico subtraction wouldn't give the results you want, since one or both copies of the genes would still be present.

        I would hesitate to conclude much from assemblies that lacked some genomic region in one sample, and it sounds like you have the same doubts. Making genetic maps in the different populations would be helpful, but I'm sure it would be difficult in a natural population. When we first developed RAD-Seq and tried it out in a stickleback mapping population, it was pretty clear there was a large inversion on one of the linkage groups we were interested in for one of the populations.

        Let's say you have some enriched set of reads. You'll still be trying to do an assembly from low coverage data, so it may be hard to move from reads to a more satisfying contig/contigs of genes.

        Sorry to ramble. Definitely a hard problem to demonstrate rigorously!
        Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com



        • #5
          Originally posted by jazz710 View Post
          Thanks again! I'll check in with results later.
          OK - just bear in mind that running successfully in this case means collecting the set of reads from one sample that share zero kmers with the reads from the other sample. So this will not catch long deletions (since kmers would still be shared), just sequence unique to one sample or the other for at least a read length.

          You can adjust the settings, though, like running in "ktrim=N" mode, which will N-mask reads wherever kmers are shared (rather than filtering), thus retaining the unique sequence even in reads that do share a kmer with the other dataset. In that case, you would end up with the deletion junctions, including bases up to 1 kmer length out in each direction, for example. You could then run a second pass using the initial unmasked data as "in" and the masked data as "ref" to filter and retain (via the "outm" stream) all reads sharing kmers with the unique sequence. That way, you would end up with all read pairs that contain a single kmer not present in the other dataset. And, now that I think about it, that's probably a better approach than the one I initially suggested, although it's more convoluted.
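          That two-pass masking idea might be sketched like this (file names and k are placeholders; the exact flags are worth double-checking against the BBDuk documentation):

          ```shell
          # Pass 1: N-mask every part of an isolate-1 read covered by a kmer
          # shared with isolate 2, leaving only isolate-1-specific sequence.
          bbduk.sh in=isolate1.fastq ref=isolate2.fastq out=isolate1_masked.fastq \
              k=25 ktrim=N mm=f

          # Pass 2: use the masked reads as the reference and pull out (outm)
          # all original reads sharing at least one kmer with that unique sequence.
          bbduk.sh in=isolate1.fastq ref=isolate1_masked.fastq outm=isolate1_unique.fastq \
              k=25 mm=f
          ```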

          I'm not sure what the best strategy is since it's a novel problem; it may take a little experimentation.



          • #6
            That seems like a slice of fried gold right there! I'll play around with it for sure.



            • #7
              My impression is that many large rearrangement breakpoints are in repetitive sequences that would make identifying an exact breakpoint difficult since the entire read would be a triplet repeat (for example) in both the normal and rearranged chromosome. If the goal is to identify rearrangements then moderate coverage PacBio would be perhaps a way to see something, but I understand you've got what you've got and want to extract something useful from your Illumina data.



              • #8
                Originally posted by SNPsaurus View Post
                My impression is that many large rearrangement breakpoints are in repetitive sequences that would make identifying an exact breakpoint difficult since the entire read would be a triplet repeat (for example) in both the normal and rearranged chromosome. If the goal is to identify rearrangements then moderate coverage PacBio would be perhaps a way to see something, but I understand you've got what you've got and want to extract something useful from your Illumina data.
                Slightly different system: these aren't chromosomal rearrangements. This is hybridization with whole new chromosome arms. The sequence differences should be large and in charge.



                • #9
                  I see. That explains why the assembly has been difficult as well, since the diploid sequences are so different! You will still have some issues with the assembly, since your coverage depth for each haploid chromosome is less than 7X... but you could at least get some short contigs out. Have you tried any haplotype-aware assemblers? At the PAG conference there was talk of a few, like nrgene's http://www.denovomagic.com/



                  • #10
                    Yes, and I'm now seeing that one of the two isolates also has a very high amount of duplication :-/

                    I'll take a look at the haplotype-aware assemblers!

