Yes, and I'm now also seeing that one of the two isolates has a very high amount of duplication :-/
I'll take a look at the haploid assemblers!
-
I see. That explains why the assembly has been difficult as well, since the diploid sequences are so different! You will still have some issues with the assembly, since your coverage depth for each haploid chromosome is less than 7X... but you could at least get some short contigs out. Have you tried any haplotype-aware assemblers? At the PAG conference there was talk of a few, like nrgene's http://www.denovomagic.com/
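For anyone wanting to check the "less than 7X" figure, here is a rough back-of-the-envelope sketch using the numbers from the original post (~170M PF clusters, 100bp paired-end, ~2.5Gb genome). It assumes every base survives QC and that reads split evenly between the two haplotypes, so treat the result as an upper bound. The function name is just for illustration.

```python
def coverage_depth(clusters, read_length_bp, genome_size_bp, paired=True, ploidy=1):
    """Expected average depth, split evenly across `ploidy` haplotypes."""
    reads_per_cluster = 2 if paired else 1
    total_bases = clusters * reads_per_cluster * read_length_bp
    return total_bases / (genome_size_bp * ploidy)


# Numbers taken from the original post; everything else is an assumption.
CLUSTERS = 170_000_000        # ~170M PF clusters per isolate
READ_LENGTH = 100             # 100bp paired-end reads
GENOME_SIZE = 2_500_000_000   # ~2.5Gb genome

total_coverage = coverage_depth(CLUSTERS, READ_LENGTH, GENOME_SIZE)
per_haplotype = coverage_depth(CLUSTERS, READ_LENGTH, GENOME_SIZE, ploidy=2)

print(f"total: {total_coverage:.1f}X, per haplotype: {per_haplotype:.1f}X")
# prints "total: 13.6X, per haplotype: 6.8X"
```

Quality trimming and duplicate removal would push the usable depth lower than this.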
-
Originally posted by SNPsaurus:
"My impression is that many large rearrangement breakpoints are in repetitive sequences that would make identifying an exact breakpoint difficult since the entire read would be a triplet repeat (for example) in both the normal and rearranged chromosome. If the goal is to identify rearrangements then moderate coverage PacBio would be perhaps a way to see something, but I understand you've got what you've got and want to extract something useful from your Illumina data."
-
My impression is that many large rearrangement breakpoints are in repetitive sequences that would make identifying an exact breakpoint difficult since the entire read would be a triplet repeat (for example) in both the normal and rearranged chromosome. If the goal is to identify rearrangements then moderate coverage PacBio would be perhaps a way to see something, but I understand you've got what you've got and want to extract something useful from your Illumina data.
-
That seems like a slice of fried gold right there! I'll play around with it for sure.
-
Originally posted by jazz710:
"Thanks again! I'll check in with results later."
I'm not sure what the best strategy is since it's a novel problem; it may take a little experimentation.
-
From the karyotype differences you see, do you expect there are large deletions in one sample that completely remove genes (both copies)? If the karyotype differences are inversions or balanced translocations then the in silico subtraction wouldn't give the results you want, since one or both copies of the genes would still be present.
I would hesitate to conclude much from assemblies that lacked some genomic region in one sample, and it sounds like you have the same doubts. Making genetic maps in the different populations would be helpful, but I'm sure would be difficult in a natural population. When we first developed RAD-Seq and tried it out in a stickleback mapping population it was pretty clear there was a large inversion on one of the linkage groups we were interested in for one of the populations.
Let's say you have some enriched set of reads. You'll still be trying to do an assembly from low coverage data, so it may be hard to move from reads to a more satisfying contig/contigs of genes.
Sorry to ramble. Definitely a hard problem to demonstrate rigorously!
-
Thanks for the tremendously fast response. I actually used BBduk to clean my data with the following command:
bbduk.sh in1=ISOLATE1_1.fastq in2=ISOLATE1_2.fastq out1=ISOLATE1_1.clean.fastq out2=ISOLATE1_2.clean.fastq minlen=50 qtrim=rl trimq=10
So now, I will merge my PE data into ISOLATE1.fastq and ISOLATE2.fastq and I'll try to give them a run on our university server.
Thanks again! I'll check in with results later.
-
If you had a lot of memory, and error-free or low-error data, you could subtract one library from another using BBDuk, based on the presence of shared kmers:
bbduk.sh in=isolate1.fastq ref=isolate2.fastq out=unique1.fastq k=25 mm=f speed=8 prealloc
That would certainly work for bacteria, but vertebrate data will have 1000x more kmers. Reducing k would reduce the kmer space, though. The "speed" flag also reduces memory consumption by ignoring part of the kmer space; max is 15, default is 0.
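To make the idea concrete, here is a toy Python sketch of subtraction by shared kmers. It is not how BBDuk is implemented, only an illustration of the concept on in-memory strings: index every kmer of the reference library, then keep only the reads that share no kmer with it. The function names, the tiny example reads and k=5 are made up for the demo; the assumption that a single shared kmer is enough to discard a read mirrors my understanding of the default behaviour and should be checked against the tool's help text.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


def canonical_kmers(seq, k):
    """Yield each kmer in strand-independent (canonical) form."""
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        yield min(kmer, reverse_complement(kmer))


def subtract_reads(reads, reference_reads, k=25):
    """Return the reads that share no exact kmer with any reference read."""
    reference_kmers = set()
    for ref in reference_reads:
        reference_kmers.update(canonical_kmers(ref, k))
    return [
        read for read in reads
        if not any(kmer in reference_kmers for kmer in canonical_kmers(read, k))
    ]


# Toy data: the first two reads overlap the isolate 2 read (one of them on
# the opposite strand); the third shares nothing with it.
isolate2 = ["ACGTACGTAC"]
isolate1 = ["GTACGTT", "CGTACGT", "AAAAATTTTT"]

unique1 = subtract_reads(isolate1, isolate2, k=5)
print(unique1)  # prints ['AAAAATTTTT']
```

The set of reference kmers is what costs the memory on real data, which is why the number of distinct kmers matters so much.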
BBDuk uses around 15 bytes per kmer, so around 37.5 gigs if the data was error-free and haploid, but of course higher for a diploid with sequencing errors. The "speed=8" flag will cut memory usage in half (speed=12 would cut it by a factor of 4, and 14 by a factor of 8). Also, quality-trimming the reads first would reduce the number of errors and thus the number of kmers. So - I think it's probably doable in this case.
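The arithmetic above can be sketched as follows. The 15 bytes per kmer figure and the reduction factors come from the post; the `(16 - speed) / 16` formula is an assumption that happens to reproduce those factors (speed=8 gives 1/2, 12 gives 1/4, 14 gives 1/8), and the 2.5 billion distinct kmers is the idealized error-free haploid case for a ~2.5Gb genome. Function names are hypothetical.

```python
BYTES_PER_KMER = 15  # approximate figure quoted in the post


def kept_fraction(speed):
    """Fraction of kmer space still stored at a given speed setting.

    Assumption: each speed unit skips 1/16 of the kmer space, which
    matches the factors quoted in the post (8 -> 1/2, 12 -> 1/4, 14 -> 1/8).
    """
    if not 0 <= speed <= 15:
        raise ValueError("speed must be between 0 and 15")
    return (16 - speed) / 16


def estimate_memory_gb(distinct_kmers, speed=0):
    """Rough memory estimate in GB (1 GB taken as 1e9 bytes)."""
    return distinct_kmers * BYTES_PER_KMER * kept_fraction(speed) / 1e9


# Idealized case: roughly one distinct kmer per base of a 2.5Gb genome.
DISTINCT_KMERS = 2_500_000_000

for speed in (0, 8, 12, 14):
    print(f"speed={speed}: ~{estimate_memory_gb(DISTINCT_KMERS, speed):.1f} GB")
# prints:
# speed=0: ~37.5 GB
# speed=8: ~18.8 GB
# speed=12: ~9.4 GB
# speed=14: ~4.7 GB
```

Real data will need more than this, since heterozygous sites and sequencing errors both add distinct kmers.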
Edit:
Unless the assembly does not account for a lot of the reads, though, it's probably best to do this by mapping reads to the other sample's assembly, and keeping what doesn't map. This will work even if the assembly is highly fragmented, as long as it accounts for most of the genome.
It's also possible to convert the reads to fasta and map to them directly rather than assembling, and again keep what doesn't map.
Last edited by Brian Bushnell; 01-28-2015, 03:40 PM.
-
Avoiding Genome Assembly: Illumina DNA-Seq in-silico Library Subtraction?
Hey all! I've got (what I consider) an interesting data processing question that I thought someone here may be able to help out with.
Study System: Single species, non-model (i.e., no reference genome of anything close) vertebrate with genome ~2.5Gb that shows large within-species karyotypic differences between geographic isolates.
Data: Low-coverage approach; 100bp PE DNA-Seq datasets with ~170M PF clusters per isolate genome.
Study Question: Can we use low-coverage, isolate-specific DNA-seq libraries to identify genomic differences between individuals/populations in a non-model organism?
SeqAnswers Question: Because our cost-induced low coverage limits our ability to make a decent assembly (Ray worked okay, but inconsistently between samples), we were thinking that maybe there would be a way to work from the reads of each isolate directly to perform a sort of in silico subtraction? Then, a small scale assembly could be done on the isolate-specific reads.
To rephrase for clarity:
1) Sequence genome of isolate 1 and isolate 2 at low coverage
2) Perform normal read QC
3) Instead of assembling and reciprocal BLASTing (precluded by our low coverage), how could we compare the reads from isolate 1 to the reads of isolate 2 and subset the unique reads in each isolate (which, since they're the same species, should be highly enriched for the chromosomal anomalies)?
Any thoughts here? There may be a tool for this and I just don't know the right search terms. It also may be bioinformatics sacrilege, and if so, accept my thousand pardons.
EDIT: I just wanted to add that we have tried to assemble these genomes and do our analyses in a more traditional way, but assembly quality wasn't consistent between samples. I add this as a gesture that I'm not milking SeqAnswers to come up with my whole project pipeline, but rather, to assist in looking for non-traditional methods when the others fall away.