I work on an auto-allo hexaploid plant species with a genome size estimated to be about 2.1 Gb. We have an excellent transcriptome assembly from four different experiments that produced about 600,000 contigs with an N50 of over 1600 bases; a CEGMA run showed 98% representation at full length and 100% representation at partial length. We also have PE100 genomic sequence data from 4 different libraries, with fragments 300-400 bases in size, that give us about 15X coverage (based on k-mer representation analysis). For kicks, I used BLAT to see how many genomic fragments would map to my assembled transcriptome, and got what seems to me a surprising number of hits: well over 70% of the genomic fragments had hits to the transcriptome. Is this likely real, or is there something about the BLAT program that would produce a high number of spurious hits? I counted only the unique hits, so this isn't due to genomic fragments having multiple hits to the transcriptome. Thoughts? Is there a better way to determine the percentage of the genome represented in my transcriptome?
-
What is the N50 of your 15X assembly?
I would suggest using BWA or Bowtie2 to map your reads against your transcriptome. See what percent map.
Of course, depending on how "allo" your three sub-genomes are, there could be complexities there.
--
Phillip
Last edited by pmiguel; 01-07-2015, 12:20 PM.
Comment
-
I think there has been a miscommunication. The genomic sequences are not assembled; only the transcriptome is. I am therefore looking to see how many of the genomic fragments contain transcribed sequences.
Incidentally, I had planned to pull out the genomic sequences that had homology to the transcriptome (along with the fragments that contained conserved sequences from related organisms) and do an assembly of the transcribed space from the genome. However, I was expecting those to make up only a few percent of the genomic sequence.
I could see possibly running bowtie2, since that might allow me to map both ends of my reads simultaneously, but I want to collect as many of the intron and promoter sequences from the genome as possible, since those might be useful for other studies.
So, back to my question: Why did I get so many hits?
Comment
-
BLAT is really the wrong tool for short reads (at least without careful parameter tuning and post-filtering), especially if the goal is to decide where these reads are actually coming from. By default, BLAT uses a tile size of 11, needs two tile matches to trigger an alignment, and accepts a minimum identity of 90% and a minimum score of 30; large gaps are also allowed. Something like bowtie would be a lot more stringent. But if you must use BLAT, I’d increase the tile size and require a much higher identity and score. At minimum, you could use the UCSC tool pslCDnaFilter to impose some stricter cutoffs, and you wouldn’t have to rerun the mapping.
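The post-filtering idea above can be sketched in a few lines of Python. This is only an illustration, not what pslCDnaFilter does internally: the function names are made up, the identity measure is a simplification (matching bases over aligned, non-gap bases), the score is an approximation of the nucleotide BLAT score, and the cutoffs of 98% identity and a score of 80 are arbitrary values chosen for PE100 reads.

```python
from typing import Iterable, Iterator, List


def psl_rows(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield the tab-split columns of each PSL alignment row, skipping headers."""
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        # Alignment rows have 21 columns and start with the integer match count.
        if len(cols) >= 21 and cols[0].isdigit():
            yield cols


def simple_identity(cols: List[str]) -> float:
    """Matching bases divided by aligned (non-gap) bases; a simplification."""
    matches, mismatches, rep_matches = int(cols[0]), int(cols[1]), int(cols[2])
    aligned = matches + mismatches + rep_matches
    return (matches + rep_matches) / aligned if aligned else 0.0


def simple_score(cols: List[str]) -> int:
    """Approximate nucleotide score: matches minus mismatches and gap openings."""
    matches, mismatches, rep_matches = int(cols[0]), int(cols[1]), int(cols[2])
    q_gap_count, t_gap_count = int(cols[4]), int(cols[6])
    return matches + rep_matches // 2 - mismatches - q_gap_count - t_gap_count


def filter_psl(
    lines: Iterable[str], min_identity: float = 0.98, min_score: int = 80
) -> Iterator[List[str]]:
    """Keep only rows that pass both (illustrative) cutoffs."""
    for cols in psl_rows(lines):
        if simple_identity(cols) >= min_identity and simple_score(cols) >= min_score:
            yield cols
```

To apply it, pass an open file handle for the existing BLAT output to filter_psl() and write the surviving rows back out joined by tabs. Because the filter only reads the PSL, the mapping itself does not need to be rerun.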
Comment
-
So you think the reason I am getting so many hits is that the homology criteria are too lax? That seems possible given the high number of hits. Bowtie doesn't handle gaps, but bowtie2 does. However, since I am mapping genomic fragments against an assembled transcriptome (rather than the more common case of transcriptome fragments against an assembled genome), would bowtie2 still give me hits in situations where the genomic fragment contained intron sequences? Would it still give me a hit if only one of the paired ends had a hit to the transcript? As I ask these questions, I am guessing I should probably read up on the bowtie program.
Comment
-
Do the more stringent alignment first, but I would also think that your transcriptome assembly is contaminated with genomic sequences, which happens quite easily (especially if you include a lot of sequence in your assembly).
How many of the 600 000 "transcripts" contain ORFs covering most of their length? How many of the 600 000 "transcripts" are suspiciously large (> 10 kb)?
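One rough way to answer both questions is sketched below in Python. The assumptions are mine, not from the thread: an "ORF" here is simply the longest stop-free stretch in any of the six reading frames (no start codon required, since assembled contigs are often partial), "most of their length" is taken as a fraction of 0.5, and all function names are hypothetical. The 10 kb cutoff is the one mentioned above.

```python
from typing import Iterable, Iterator, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Use the first whitespace-delimited token of the header as the name.
            name, chunks = (line[1:].split() or [""])[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def longest_open_frame(seq: str) -> int:
    """Length in bases of the longest stop-free run over all six frames."""
    forward = seq.upper()
    reverse = forward.translate(_COMPLEMENT)[::-1]
    best = 0
    for strand in (forward, reverse):
        for frame in range(3):
            run = 0
            for i in range(frame, len(strand) - 2, 3):
                if strand[i:i + 3] in STOP_CODONS:
                    best = max(best, run)
                    run = 0
                else:
                    run += 3
            best = max(best, run)
    return best


def flag_contigs(
    records: Iterable[Tuple[str, str]],
    min_orf_fraction: float = 0.5,
    max_length: int = 10_000,
) -> Iterator[Tuple[str, int, float, bool, bool]]:
    """Yield (name, length, ORF fraction, mostly_coding, suspiciously_long)."""
    for name, seq in records:
        length = len(seq)
        fraction = longest_open_frame(seq) / length if length else 0.0
        yield name, length, fraction, fraction >= min_orf_fraction, length > max_length
```

Counting the rows where mostly_coding is true, and those where suspiciously_long is true, gives the two numbers asked for. A dedicated ORF finder would be more appropriate for a real analysis; this is only meant to show the shape of the check.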
Comment
-
What method did you use to ensure that you:
counted only the unique hits, so this isn't due to genomic fragments having multiple hits to the transcriptome.
--
Phillip
Comment
-
Thanks for all the replies! I'm working in the iPlant discovery environment, so it is difficult to change the parameters of some of the programs. However, I ran bowtie2 with the default parameters (I am currently checking what was programmed in for these) on just one of my genomic libraries and got the following:
101531102 reads; of these:
  101531102 (100.00%) were paired; of these:
    63600152 (62.64%) aligned concordantly 0 times
    7094920 (6.99%) aligned concordantly exactly 1 time
    30836030 (30.37%) aligned concordantly >1 times
    ----
    63600152 pairs aligned concordantly 0 times; of these:
      164821 (0.26%) aligned discordantly 1 time
    ----
    63435331 pairs aligned 0 times concordantly or discordantly; of these:
      126870662 mates make up the pairs; of these:
        115162777 (90.77%) aligned 0 times
        4017518 (3.17%) aligned exactly 1 time
        7690367 (6.06%) aligned >1 times
43.29% overall alignment rate
So I still had 43% with hits! I am guessing we definitely have some contaminating genomic DNA, but I wouldn't have expected that much. However, my large contigs look real: most encode either proteins or bits of chloroplast or mitochondrial sequence, based on BLAST hits. In answer to sarvidsson's question, about half (300K or so) of my contigs contain open reading frames greater than 300 bases in length. However, quite a number of the non-ORF contigs have reasonable expression values in some experiments. In answer to pmiguel, I ran a sort on column 10 of the psl file that returned only the first hit for any given fragment, and then counted with grep on the @HWI at the start of each fragment name.
For kicks, I'll re-run the Bowtie using just my contigs with long ORFs.
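For anyone reading the bowtie2 summary above, the overall alignment rate is computed per read (mate) rather than per pair. The short Python check below reproduces the reported figure from the counts in the summary; the variable names are mine, and the two extra ratios at the end are simply derived from the same counts.

```python
# Counts copied from the bowtie2 summary in this post.
pairs_total = 101_531_102
concordant_once = 7_094_920
concordant_multi = 30_836_030
discordant_once = 164_821
mates_once = 4_017_518    # single mates from pairs that did not align as pairs
mates_multi = 7_690_367

# Every aligned pair contributes two aligned reads; stray mates contribute one each.
aligned_pairs = concordant_once + concordant_multi + discordant_once
aligned_reads = 2 * aligned_pairs + mates_once + mates_multi
overall_rate = aligned_reads / (2 * pairs_total)

# Derived ratios: pairs with any concordant alignment, and the share of those
# that aligned concordantly to more than one place in the transcriptome.
concordant_pairs = concordant_once + concordant_multi
concordant_rate = concordant_pairs / pairs_total
multi_share = concordant_multi / concordant_pairs

print(f"overall alignment rate: {overall_rate:.2%}")  # bowtie2 reported 43.29%
print(f"concordant pair rate:   {concordant_rate:.2%}")
print(f"multi-mapping share of concordant pairs: {multi_share:.2%}")
```

The last ratio may be worth a look in this context: most concordantly aligned pairs hit more than one contig, which would be expected if the reference contains many related contigs or repeated sequence.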
Comment
-
I don't think there is much of a basis to presume DNA contamination of your transcriptome data. Plant genomes, especially ones with 1C genome sizes around that of sorghum or larger, tend to comprise retrotransposon clusters over a sizable percentage of their length. If one of the cDNA libraries that was used to generate your transcriptome data happened to include tissue that expressed retrotransposons, then that could give you the source of a large percentage of your hits.
Alternatively, it may be that the genome you are sequencing is not as large as you think.
--
Phillip
Comment
-
Good point pmiguel! I hadn't considered the possibility that a large number of the fragments might be transposons. I have a nice fasta file that combined the sequences of known transposons (and other repetitive sequences) from several plant species. I'll run a BLAT of my hits.fasta against it.
Comment
-
So I ran BLAT to see how many of the genomic fragments that map to my transcriptome also map to the fasta file I built from the plant repetitive sequence database. Surprisingly, only 0.8% hit the repE database. I also ran BlastN to identify contigs with similarity to my repE file, and only a bit more than 200 (out of ~560,000) had matches better than an E-value of 1e-5. I really thought I would get more. So, a million or so of my genomic frags (that map to my transcriptome) are from repetitive elements, but that is not nearly enough to explain the large percentage with hits to my transcriptome. My next step will be to address sarvidsson's thought and look at my individual transcripts to see if any have inordinately high representation among the genomic fragments. Does anyone here have a nice script for counting the number of times a ref seq is hit in a psl file? Incidentally, my transcriptome assembly only has 560 M bases (about a quarter of the estimated genome size), and a fair number are related contigs (as the assembly was done using Trinity).
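On the script question, a minimal Python sketch is below. It assumes the standard 21-column PSL layout, in which column 10 is the query (read) name and column 14 is the target (contig) name; the function name and the unique_queries option are my own additions for illustration.

```python
from collections import Counter
from typing import Iterable


def count_target_hits(lines: Iterable[str], unique_queries: bool = False) -> Counter:
    """Count PSL alignment rows per target name (column 14).

    With unique_queries=True, each query/target pair is counted once, so a read
    with several alignments to the same contig does not inflate that contig.
    """
    counts: Counter = Counter()
    seen = set()
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        # Skip header lines: alignment rows have 21 columns and start with a number.
        if len(cols) < 21 or not cols[0].isdigit():
            continue
        q_name, t_name = cols[9], cols[13]
        if unique_queries:
            if (q_name, t_name) in seen:
                continue
            seen.add((q_name, t_name))
        counts[t_name] += 1
    return counts
```

Calling count_target_hits() on an open file handle for the psl file and then .most_common(20) on the result lists the most heavily hit contigs. Note that the unique_queries option keeps every query/target pair in memory, which may be impractical for a very large psl file; sorting the file first and comparing adjacent rows would avoid that.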
Comment
-
Originally posted by horvathdp: So I ran BLAT to see how many of the genomic fragments that map to my transcriptome also map to the fasta file I built from the plant repetitive sequence database. Surprisingly, only 0.8% hit the repE database. I also ran BlastN to identify contigs with similarity to my repE file, and only a bit more than 200 (out of ~560,000) had matches better than an E-value of 1e-5. I really thought I would get more. So, a million or so of my genomic frags (that map to my transcriptome) are from repetitive elements, but that is not nearly enough to explain the large percentage with hits to my transcriptome.
You could gain some more sensitivity by using tblastx (or its equivalent) instead. But large segments of many LTR retrotransposons are not coding sequence, so protein level conservation may not be detectable.
--
Phillip
Comment