**Brian Bushnell** · 01-07-2015, 10:44 AM

I would shred the transcriptome into pieces (~300bp or so) and map them to the genome, then calculate coverage.
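A minimal sketch of the shredding step, assuming the transcriptome is already loaded as name/sequence pairs. The function name `shred` and the `step` and `min_len` parameters are illustrative choices, not something named in the thread; only the ~300 bp piece size comes from the suggestion above.

```python
def shred(name, seq, size=300, step=300, min_len=50):
    """Yield (piece_name, piece_seq) windows of up to `size` bases.

    `step` equal to `size` gives non-overlapping pieces; a smaller
    `step` gives overlapping ones. Trailing pieces shorter than
    `min_len` are dropped (an assumed cutoff, tune as needed).
    """
    for start in range(0, len(seq), step):
        piece = seq[start:start + size]
        if len(piece) < min_len:
            break
        # 1-based inclusive coordinates in the piece name
        yield f"{name}_{start + 1}-{start + len(piece)}", piece


# Example: a 700 bp contig becomes pieces of 300, 300 and 100 bp
for piece_name, piece_seq in shred("contig1", "A" * 700):
    print(piece_name, len(piece_seq))
```

The pieces can then be written out as FASTA and mapped to the genome with whichever aligner is preferred.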

**pmiguel** · 01-07-2015, 12:15 PM

What is the N50 of your 15X assembly?
I would suggest using BWA or Bowtie2 to map your reads against your transcriptome. See what percent map.
Of course, depending on how "allo" your three sub-genomes are, there could be complexities there.
--
Phillip

**horvathdp** · 01-07-2015, 02:28 PM

responses

I think there has been a miscommunication. The genomic sequences are not assembled, only the transcriptome. Thus I am looking to see how many of the genomic fragments contain transcribed sequences.

Incidentally, I had planned to pull out those genomic sequences that had homology to the transcriptome (along with those fragments that contained conserved sequences from related organisms) and do an assembly of the transcribed space from the genome. However, I wasn't expecting that to be more than a few percent of the genomic sequence.

I could see possibly running bowtie2 since that might allow me to map both ends of my reads simultaneously, but I want to collect as many of the intron and promoter sequences from the genome as possible, since those might be useful for other studies.

So, back to my question: Why did I get so many hits?

**Wallysb01** · 01-07-2015, 02:49 PM

BLAT is really the wrong tool for short reads (at least without careful tuning of parameters and post-filtering), especially if the goal is to decide where these reads are actually coming from. By default BLAT uses a tile size of 11, needs two tiles to match, requires an identity of 90% and a minimum score of 30, and allows large gaps. Something like bowtie would be a lot more stringent. But if you must use BLAT, I’d increase the tile size and require a much higher identity and score. At minimum, you could use the UCSC tool pslCDnaFilter to impose some stricter cutoffs, and you wouldn’t have to rerun the mapping.
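As a rough illustration of post-filtering existing BLAT output without rerunning the mapping, here is a small Python sketch that applies stricter cutoffs to PSL lines. It is not pslCDnaFilter and does not reproduce its scoring; the identity and query-coverage definitions below are simplified assumptions, and the thresholds (98% identity, 90% of the read aligned) are hypothetical values chosen for the example.

```python
def parse_psl_line(line):
    """Return the PSL fields used below, or None for header/malformed lines."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 21 or not fields[0].isdigit():
        return None
    return {
        "matches": int(fields[0]),      # column 1
        "mismatches": int(fields[1]),   # column 2
        "rep_matches": int(fields[2]),  # column 3
        "q_name": fields[9],            # column 10
        "q_size": int(fields[10]),      # column 11
        "t_name": fields[13],           # column 14
    }


def passes(hit, min_identity=0.98, min_query_cov=0.90):
    """Simplified filter: identity over aligned bases, plus query coverage."""
    aligned = hit["matches"] + hit["mismatches"] + hit["rep_matches"]
    if aligned == 0:
        return False
    identity = (hit["matches"] + hit["rep_matches"]) / aligned
    query_cov = aligned / hit["q_size"]
    return identity >= min_identity and query_cov >= min_query_cov


def filter_psl(lines, **cutoffs):
    """Yield only the PSL lines whose alignment passes the cutoffs."""
    for line in lines:
        hit = parse_psl_line(line)
        if hit is not None and passes(hit, **cutoffs):
            yield line
```

Because the filter works line by line, it can be run over a very large PSL file without loading it into memory.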

**horvathdp** · 01-07-2015, 03:25 PM

So you think the reason I am getting so many hits is because the homology is too lax? That seems possible given the high number of hits. Bowtie doesn't handle gaps but bowtie2 does. However, since I am mapping a genome fragment against an assembled transcriptome (rather than the more common transcriptome fragment to an assembled genome) would bowtie2 still give me hits in situations where the genomic fragment contained intron sequences? Would it still give me a hit if only one of the PEs had a hit to the transcript? As I ask these questions, I am guessing I should probably read up on the bowtie program.

**sarvidsson** · 01-08-2015, 02:05 AM

Do the more stringent alignment first, but I would also think that your transcriptome assembly is contaminated with genomic sequences, which happens quite easily (especially if you include a lot of sequence in your assembly).

How many of the 600 000 "transcripts" contain ORFs covering most of their length? How many of the 600 000 "transcripts" are suspiciously large (> 10 kb)?
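One way to put numbers on these two questions is to compute, for each contig, the fraction of its length covered by its longest ORF. The sketch below is a simplified assumption of what "ORF" means here: it only counts complete ATG-to-stop ORFs in the six reading frames, so partial ORFs running off the end of a contig are ignored. All function names are illustrative.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def longest_orf_len(seq):
    """Length in nt of the longest ATG..stop ORF in any of six frames.

    The stop codon is included in the length. ORFs without an
    in-frame stop before the end of the sequence are not counted.
    """
    best = 0
    for strand in (seq.upper(), revcomp(seq).upper()):
        for frame in range(3):
            start = None
            for i in range(frame, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOPS:
                    best = max(best, i + 3 - start)
                    start = None
    return best


def orf_fraction(seq):
    """Fraction of the contig covered by its longest complete ORF."""
    return longest_orf_len(seq) / len(seq) if seq else 0.0
```

Contigs longer than 10 kb with a very small `orf_fraction` would be the suspicious ones under this definition.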

**pmiguel** · 01-08-2015, 04:35 AM

What method did you use to ensure that you counted only the unique hits, so that this isn't due to genomic fragments having multiple hits to the transcriptome?

Did you discard any reads that mapped to multiple contigs in the transcriptome? If so, did they get counted as "non-mapping", or were they just not counted at all?
--
Phillip
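To make the distinction between uniquely and multiply hitting fragments concrete, here is a small Python sketch that summarises a PSL file per query. It assumes standard 21-column PSL output with the query name in column 10 and the target name in column 14; the function name and the returned keys are made up for illustration.

```python
from collections import defaultdict


def query_hit_summary(psl_lines):
    """Count queries with hits, split by one target versus several."""
    targets_by_query = defaultdict(set)
    for line in psl_lines:
        fields = line.rstrip("\n").split("\t")
        # Skip the PSL header block and any malformed lines
        if len(fields) < 21 or not fields[0].isdigit():
            continue
        targets_by_query[fields[9]].add(fields[13])
    total = len(targets_by_query)
    multi = sum(1 for targets in targets_by_query.values() if len(targets) > 1)
    return {
        "queries_with_hits": total,
        "single_target": total - multi,
        "multi_target": multi,
    }
```

Each fragment is counted once no matter how many alignments it has, so the totals can be compared directly against the number of input fragments.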

**horvathdp** · 01-08-2015, 08:40 AM

Thanks for all the replies! I'm working in the iPlant discovery environment so it is difficult to change parameters of some of the programs. However, I ran bowtie2 using the default parameters (which I am checking on currently to see what was programmed in for these) on just one of my genomic libraries and got the following:

101531102 reads; of these:
  101531102 (100.00%) were paired; of these:
    63600152 (62.64%) aligned concordantly 0 times
    7094920 (6.99%) aligned concordantly exactly 1 time
    30836030 (30.37%) aligned concordantly >1 times
    ----
    63600152 pairs aligned concordantly 0 times; of these:
      164821 (0.26%) aligned discordantly 1 time
    ----
    63435331 pairs aligned 0 times concordantly or discordantly; of these:
      126870662 mates make up the pairs; of these:
        115162777 (90.77%) aligned 0 times
        4017518 (3.17%) aligned exactly 1 time
        7690367 (6.06%) aligned >1 times
43.29% overall alignment rate

So I still had 43% with hits! I am guessing we definitely have some contaminating genomic DNA, but I wouldn't have expected that much. However, my large contigs look real - most encode either proteins or bits of chloroplast or mitochondria sequences based on Blast hits. In answer to sarvidsson's question, about half (300K or so) of my contigs contain open reading frames greater than 300 bases in length. However, quite a number of the non-ORF contigs have reasonable expression values in some experiments. In answer to pmiguel, I just ran a sort on column 10 of the psl file that only returned the first hit for any given fragment, and then did a count using GREP on the @HWI at the start of each fragment name.

For kicks, I'll re-run the Bowtie using just my contigs with long ORFs.
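For readers checking the arithmetic, the 43.29% overall rate in the bowtie2 summary above can be reproduced from the individual counts. The sketch below assumes the rate is computed per mate rather than per pair: every concordant or discordant pair contributes two aligned mates, and the mates that aligned singly are added on top. The variable and function names are illustrative.

```python
def overall_alignment_rate(total_pairs, concordant_pairs, discordant_pairs, single_mates):
    """Fraction of all mates that aligned in any way."""
    aligned_mates = 2 * (concordant_pairs + discordant_pairs) + single_mates
    return aligned_mates / (2 * total_pairs)


total_pairs = 101_531_102
concordant_pairs = 7_094_920 + 30_836_030   # exactly 1 time + >1 times
discordant_pairs = 164_821
single_mates = 4_017_518 + 7_690_367        # mates aligned exactly 1 time + >1 times

rate = overall_alignment_rate(total_pairs, concordant_pairs, discordant_pairs, single_mates)
print(f"{rate:.2%}")  # 43.29%
```

Note that only about 7% of pairs aligned concordantly exactly once; most of the aligned pairs (30.37%) have more than one concordant placement in the transcriptome.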

**pmiguel** · 01-08-2015, 09:12 AM

I don't think there is much of a basis to presume DNA contamination of your transcriptome data. Plant genomes, especially ones with 1C genome sizes around that of sorghum or larger, tend to comprise retrotransposon clusters over a sizable percentage of their length. If one of the cDNA libraries that was used to generate your transcriptome data happened to include tissue that expressed retrotransposons, then that could give you the source of a large percentage of your hits.
Alternatively, it may be that the genome you are sequencing is not as large as you think.

--
Phillip

**horvathdp** · 01-08-2015, 09:30 AM

Good point pmiguel! I hadn't considered the possibility that a large number of the fragments might be transposons. I have a nice fasta file that combined the sequences of known transposons (and other repetitive sequences) from several plant species. I'll run a BLAT of my hits.fasta against it.

**sarvidsson** · 01-09-2015, 12:30 AM

Originally posted by horvathdp:

... about half (300K or so) of my contigs contain open reading frames greater than 300 bases in length. However, quite a number of the non-ORF contigs have reasonable expression values in some experiments.

For kicks, I'll re-run the Bowtie using just my contigs with long ORFs.

While pmiguel is right concerning the retrotransposons, I was referring to long transcontigs with either a single ORF covering a minor part (single percentage range) of the contig, or several ORFs at several positions. I've seen such transcontigs in extreme-coverage de novo transcriptome assemblies from plants (with mid- to large-sized genomes), and I don't trust them to be single transcripts, or at least not fully processed functional ones.

**horvathdp** · 01-09-2015, 08:01 AM

So I ran BLAT to see how many of my genomic fragments that map to my transcriptome also map to the fasta file I built from the plant repetitive sequence database. Surprisingly, only 0.8% hit the repE database. I also ran a BlastN to identify contigs that had similarity to my repE file, and only a bit more than 200 (out of ~560,000) had matches greater than E-5. I really thought I would get more. So a million or so of my genomic frags (that map to my transcriptome) are from repetitive elements, but that is not nearly enough to explain the large percentage with hits to my transcriptome. My next step will be to address sarvidsson's thought and look at my individual transcripts to see if any have inordinately high representation among the genomic fragments. Anyone here have a nice script for counting the number of times a ref seq is hit in a psl file? Incidentally, my transcriptome assembly only has 560 M bases (about a quarter of the estimated genome size) and a fair number are related contigs (as the assembly was done using Trinity).
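A minimal Python sketch of such a counting script, assuming standard 21-column PSL output (query name in column 10, target name in column 14). The function name, the `distinct_queries` switch and the file name `hits.psl` are hypothetical, introduced only for this example.

```python
from collections import Counter


def count_target_hits(psl_lines, distinct_queries=True):
    """Count how many times each reference (target) sequence is hit.

    With distinct_queries=True, a fragment with several alignments to
    the same target is counted once for that target.
    """
    counts = Counter()
    seen = set()
    for line in psl_lines:
        fields = line.rstrip("\n").split("\t")
        # Skip the PSL header block and any malformed lines
        if len(fields) < 21 or not fields[0].isdigit():
            continue
        query, target = fields[9], fields[13]
        if distinct_queries:
            if (query, target) in seen:
                continue
            seen.add((query, target))
        counts[target] += 1
    return counts


# Usage (hits.psl is a placeholder file name):
# with open("hits.psl") as handle:
#     for target, n in count_target_hits(handle).most_common(20):
#         print(target, n)
```

With `distinct_queries=True` the set of (query, target) pairs is held in memory, which may be large for hundreds of millions of alignments; setting it to `False` counts raw alignment lines instead.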

**pmiguel** · 01-09-2015, 08:14 AM

Originally posted by horvathdp:

So I ran BLAT to see how many of my genomic fragments that map to my transcriptome also map to the fasta file I built from the plant repetitive sequence database. Surprisingly, only 0.8% hit the repE database. I also ran a BlastN to identify contigs that had similarity to my repE file, and only a bit more than 200 (out of ~560,000) had matches greater than E-5. I really thought I would get more. So a million or so of my genomic frags (that map to my transcriptome) are from repetitive elements, but that is not nearly enough to explain the large percentage with hits to my transcriptome.

There is generally little inter-genera nucleotide similarity among most transposable element sequences, at least in plants. There are interesting exceptions, of course, but they are just that: exceptions.

You could gain some more sensitivity by using tblastx (or its equivalent) instead. But large segments of many LTR retrotransposons are not coding sequence, so protein level conservation may not be detectable.

--
Phillip

**horvathdp** · 01-09-2015, 08:34 AM

So actually my count should also turn up sequences that are highly represented in the genome, which might include new repEs or at least leafy spurge-specific repEs. Cool!

Blat of transcriptome to genome gave 70% hits! reality?
