
  • pmiguel
    replied
    Originally posted by horvathdp
    If pmiguel has any good programs for identifying such (scripts for identifying direct or indirect repeats, or other conserved transposases, etc) and would like to collaborate, I am certainly open to that possibility.
    Actually, I haven't done much more than dabble in this field for many years now. It would probably be easier for you to contact someone else from the Bennetzen lab who has done this sort of thing more recently. Feel free to email me at [email protected] if you need some contacts.

    --
    Phillip

  • horvathdp
    replied
    So I ran the count of the contigs and identified quite a few that had high numbers of genomic fragments mapping to them. There was no obvious pattern or sequences among the ones with the most hits (although several had hits to more than a million fragments each). Also, in consideration of sarvidsson's comments, the median size of these highly over-represented contigs is about 400 bases. I do still expect, however, that at least a fair subset of my contigs may well represent sequences that were assembled from hnRNA and span introns or have significant amounts of extended 3'UTRs.

    So, my next step will be to see if I can identify the nature of these over-represented contigs. Towards that end, I generated a list (exactly 6,666 of them, frighteningly enough) of those that had 100X greater coverage than expected (based on my earlier Kmer analysis of the genomic libraries). I am in the process of using BLAT again to generate a set of fragments that map to these transcribed sequences that are highly over-represented in the genome. Once I have these, I will do an assembly and take a closer look to see if I can recognize any obvious transposons or retro elements among them. If pmiguel has any good programs for identifying such (scripts for identifying direct or indirect repeats, or other conserved transposases, etc) and would like to collaborate, I am certainly open to that possibility.
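One way such a list of 100X over-represented contigs could be built is sketched below. The post does not say how the expected coverage was applied, so everything here is an assumption: the function name `overrepresented`, the per-contig depth formula (hits times read length divided by contig length), and the idea that the Kmer analysis yields a single expected genomic depth.

```python
def overrepresented(hits, lengths, read_len, expected_depth, fold=100.0):
    """Return {contig: fold_over_expected} for contigs at or above `fold`.

    hits           -- dict of contig name -> number of genomic fragments mapped
    lengths        -- dict of contig name -> contig length in bases
    read_len       -- genomic fragment length in bases (assumed constant)
    expected_depth -- expected single-copy genomic depth (assumed to come
                      from a separate Kmer analysis)
    """
    flagged = {}
    for name, n_hits in hits.items():
        # Approximate observed depth: total mapped bases over contig length.
        observed_depth = n_hits * read_len / lengths[name]
        ratio = observed_depth / expected_depth
        if ratio >= fold:
            flagged[name] = ratio
    return flagged
```

The depth estimate ignores partial overlaps at contig ends, which matters for contigs as short as the ~400-base median mentioned above, so the cutoff should be read as approximate.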

  • pmiguel
    replied
    Originally posted by sarvidsson
    While pmiguel is right concerning the retrotransposons, I was referring to long transcontigs with either a single ORF covering a minor part (single percentage range) of the contig, or several ORFs at several positions. I've seen such transcontigs in extreme-coverage de novo transcriptome assemblies from plants (with mid- to large-sized genomes), and I don't trust them to be single transcripts, or at least not fully processed functional ones.
    This could also be a characteristic of LTR-retrotransposon-derived cDNA. It is probably the case that few of the LTR-retros that compose the majority of mid- to large-sized plant genomes are functionally autonomous. But this would not necessarily prevent them from being transcribed.

    LTR-retros that have recently transposed will tend to have very long intron-less ORFs encoding GAG-POL. These retros may actually be functionally autonomous -- able to catalyze their own transposition. But over evolutionary time these ORFs sustain mutations that break them up. Some mutations are no doubt due to the heavy cytosine methylation at CG and CNG sites of repetitive sequence in plants hindering repair of 5-methyl-cytosine deamination events. (I.e., C->U: easy to fix; 5MeC->T: hard to fix. I tend to think of this as plants' "slow motion" form of fungal "RIPing"). There also seem to be lots of deletions that gradually "erode" away the elements as the megayears pass. And, of course, as the LTR retros come to compose a larger portion of the genome, the chances of a new transposition occurring into a previously inserted LTR retrotransposon become greater and greater. Hence lots of insertions of elements into other elements, creating nested clusters.

    Anyway, in certain tissues I think LTR-retrotransposons, even ones that are damaged, probably are expressed. Pollen is probably one such tissue, but there may be others.

    --
    Phillip

  • horvathdp
    replied
    So actually my count should also turn up sequences that are highly represented in the genome which thus might include new repEs or at least leafy spurge specific repEs. Cool!

  • pmiguel
    replied
    Originally posted by horvathdp
    So I ran the Blat to see how many of my genomic fragments which map to my transcriptome, also map to the fasta file I built from the plant repetitive sequence database. Surprisingly, only 0.8% hit the repE-database. I also ran a BlastN to identify contigs that had similarity to my repE file, and only came up with a bit more than 200 (out of ~560,000) that had matches better than E-5. I really thought I would get more. So, a million or so of my genomic frags (that map to my transcriptome) are from repetitive elements, but that is not nearly enough to explain the large percentage with hits to my transcriptome.
    There is generally little inter-genera nucleotide similarity among most transposable element sequences. At least in plants. Interesting exceptions to this, of course. But they are just that, exceptions.

    You could gain some more sensitivity by using tblastx (or its equivalent) instead. But large segments of many LTR retrotransposons are not coding sequence, so protein level conservation may not be detectable.

    --
    Phillip

  • horvathdp
    replied
    So I ran the Blat to see how many of my genomic fragments which map to my transcriptome, also map to the fasta file I built from the plant repetitive sequence database. Surprisingly, only 0.8% hit the repE-database. I also ran a BlastN to identify contigs that had similarity to my repE file, and only came up with a bit more than 200 (out of ~560,000) that had matches better than E-5. I really thought I would get more. So, a million or so of my genomic frags (that map to my transcriptome) are from repetitive elements, but that is not nearly enough to explain the large percentage with hits to my transcriptome. My next step will be to address sarvidsson's thought and look at my individual transcripts to see if any have inordinately high representation among the genomic fragments. Anyone here have a nice script for counting the number of times a ref seq is hit in a psl file? Incidentally, my transcriptome assembly only has 560 M bases (about a quarter of the estimated genome size) and a fair number are related contigs (as the assembly was done using Trinity).
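For the question about counting how many times each reference sequence is hit in a psl file, a minimal Python sketch follows. It assumes the standard 21-column, tab-separated PSL layout in which column 14 (`tName`) holds the reference (here, transcriptome contig) name; the function name `count_target_hits` is made up for illustration.

```python
from collections import Counter

def count_target_hits(psl_lines):
    """Count PSL alignment rows per target (reference) sequence."""
    counts = Counter()
    for line in psl_lines:
        fields = line.rstrip("\n").split("\t")
        # Data rows have 21 columns and begin with an integer match count;
        # this test skips the optional "psLayout" header lines.
        if len(fields) < 21 or not fields[0].isdigit():
            continue
        counts[fields[13]] += 1  # column 14 (1-based) is tName
    return counts

def write_counts(psl_path, out_path):
    """Write 'contig<TAB>hits' lines, most-hit contig first."""
    with open(psl_path) as handle:
        counts = count_target_hits(handle)
    with open(out_path, "w") as out:
        for name, n in counts.most_common():
            out.write(f"{name}\t{n}\n")
```

Note that this counts alignment rows, so a fragment that aligns to the same contig in several places is counted more than once; deduplicating on the query name first would give fragments per contig instead.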

  • sarvidsson
    replied
    Originally posted by horvathdp
    ... about half (300K or so) of my contigs contain open reading frames greater than 300 bases in length. However, quite a number of the non-ORF contigs have reasonable expression values in some experiments.

    For kicks, I'll re-run the Bowtie using just my contigs with long ORFs.
    While pmiguel is right concerning the retrotransposons, I was referring to long transcontigs with either a single ORF covering a minor part (single percentage range) of the contig, or several ORFs at several positions. I've seen such transcontigs in extreme-coverage de novo transcriptome assemblies from plants (with mid- to large-sized genomes), and I don't trust them to be single transcripts, or at least not fully processed functional ones.

  • horvathdp
    replied
    Good point pmiguel! I hadn't considered the possibility that a large number of the fragments might be transposons. I have a nice fasta file that combined the sequences of known transposons (and other repetitive sequences) from several plant species. I'll run a BLAT of my hits.fasta against it.

  • pmiguel
    replied
    I don't think there is much of a basis to presume DNA contamination of your transcriptome data. Plant genomes, especially ones with 1C genome sizes around that of sorghum or larger, tend to comprise retrotransposon clusters over a sizable percentage of their length. If one of the cDNA libraries that was used to generate your transcriptome data happened to include tissue that expressed retrotransposons, then that could give you the source of a large percentage of your hits.
    Alternatively, it may be that the genome you are sequencing is not as large as you think.

    --
    Phillip

  • horvathdp
    replied
    Thanks for all the replies! I'm working in the iPlant discovery environment so it is difficult to change parameters of some of the programs. However, I ran bowtie2 using the default parameters (which I am checking on currently to see what was programmed in for these) on just one of my genomic libraries and got the following:

    101531102 reads; of these:
      101531102 (100.00%) were paired; of these:
        63600152 (62.64%) aligned concordantly 0 times
        7094920 (6.99%) aligned concordantly exactly 1 time
        30836030 (30.37%) aligned concordantly >1 times
        ----
        63600152 pairs aligned concordantly 0 times; of these:
          164821 (0.26%) aligned discordantly 1 time
        ----
        63435331 pairs aligned 0 times concordantly or discordantly; of these:
          126870662 mates make up the pairs; of these:
            115162777 (90.77%) aligned 0 times
            4017518 (3.17%) aligned exactly 1 time
            7690367 (6.06%) aligned >1 times
    43.29% overall alignment rate


    So I still had 43% with hits! I am guessing we definitely have some contaminating genomic DNA, but I wouldn't have expected that much. However, my large contigs look real - most encode either proteins or bits of chloroplast or mitochondrial sequences based on Blast hits. In answer to sarvidsson's question, about half (300K or so) of my contigs contain open reading frames greater than 300 bases in length. However, quite a number of the non-ORF contigs have reasonable expression values in some experiments. In answer to pmiguel, I just ran a sort on column 10 of the psl file that only returned the first hit for any given fragment, and then did a count using grep on the @HWI at the start of each fragment name.

    For kicks, I'll re-run the Bowtie using just my contigs with long ORFs.
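As a check on where the 43.29% comes from, the figures in the bowtie2 report above are consistent with counting every aligned mate (two per concordantly or discordantly aligned pair, plus mates that aligned on their own) over the total number of mates. The short sketch below just redoes that arithmetic; the function name is illustrative, not part of bowtie2.

```python
def overall_alignment_rate(pairs, conc_once, conc_multi, disc_once,
                           mates_once, mates_multi):
    """Percentage of individual mates that aligned at least once."""
    # Each aligned pair contributes two mates; leftover mates count singly.
    aligned_mates = 2 * (conc_once + conc_multi + disc_once)
    aligned_mates += mates_once + mates_multi
    return 100.0 * aligned_mates / (2 * pairs)

# Numbers copied from the report above.
rate = overall_alignment_rate(
    pairs=101531102,
    conc_once=7094920,
    conc_multi=30836030,
    disc_once=164821,
    mates_once=4017518,
    mates_multi=7690367,
)
print(f"{rate:.2f}%")  # prints 43.29%
```

Most of the aligned pairs (30.37% of all pairs) fall in the "aligned concordantly >1 times" class, so the headline rate is dominated by fragments with multiple placements in the transcriptome.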

  • pmiguel
    replied
    What method did you use to ensure that you counted only the unique hits, so that this isn't due to genomic fragments having multiple hits to the transcriptome?
    Did you discard any reads that mapped to multiple contigs in the transcriptome? If so, did they get counted as "non-mapping", or were they just not counted at all?
    --
    Phillip
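The distinction raised here, fragments with a hit versus total hits, can be made explicit with a small sketch. It assumes the standard PSL layout, where column 10 (`qName`) is the genomic fragment and column 14 (`tName`) is the contig; `summarize_queries` is a hypothetical helper name.

```python
from collections import defaultdict

def summarize_queries(psl_lines):
    """Return (fragments_with_a_hit, fragments_hitting_multiple_contigs)."""
    targets_by_query = defaultdict(set)
    for line in psl_lines:
        fields = line.rstrip("\n").split("\t")
        # Skip header lines and anything that is not a 21-column data row.
        if len(fields) < 21 or not fields[0].isdigit():
            continue
        # fields[9] is qName (fragment), fields[13] is tName (contig).
        targets_by_query[fields[9]].add(fields[13])
    total = len(targets_by_query)
    multi = sum(1 for targets in targets_by_query.values() if len(targets) > 1)
    return total, multi
```

Dividing the first number by the total number of fragments searched gives the fraction of fragments with any hit, independent of how many alignments each one produced.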

  • sarvidsson
    replied
    Do the more stringent alignment first, but I would also think that your transcriptome assembly is contaminated with genomic sequences, which happens quite easily (especially if you include a lot of sequence in your assembly).

    How many of the 600 000 "transcripts" contain ORFs covering most of their length? How many of the 600 000 "transcripts" are suspiciously large (> 10 kb)?
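A rough way to answer both questions for each contig is sketched below. It is an assumption-laden illustration: an ORF is taken to run from an ATG to the next in-frame stop codon on either strand, ORFs that run off the end of a contig without a stop are ignored, and the function names are invented here.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def longest_orf_length(seq):
    """Length in bases of the longest ATG..stop ORF over all six frames."""
    seq = seq.upper()
    best = 0
    # Forward strand, then reverse complement.
    for strand in (seq, seq.translate(COMPLEMENT)[::-1]):
        for frame in range(3):
            start = None
            for i in range(frame, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOPS:
                    best = max(best, i + 3 - start)  # include the stop codon
                    start = None
    return best

def orf_fraction(seq):
    """Fraction of the contig covered by its longest ORF."""
    return longest_orf_length(seq) / len(seq) if seq else 0.0

def is_suspiciously_large(seq, cutoff=10_000):
    """True for contigs longer than the 10 kb mentioned above."""
    return len(seq) > cutoff
```

Contigs with a low `orf_fraction` despite being long would be the ones to look at first under this definition.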

  • horvathdp
    replied
    So you think the reason I am getting so many hits is because the homology threshold is too lax? That seems possible given the high number of hits. Bowtie doesn't handle gaps but bowtie2 does. However, since I am mapping a genome fragment against an assembled transcriptome (rather than the more common transcriptome fragment to an assembled genome), would bowtie2 still give me hits in situations where the genomic fragment contained intron sequences? Would it still give me a hit if only one of the PEs had a hit to the transcript? As I ask these questions, I am guessing I should probably read up on the bowtie program.

  • Wallysb01
    replied
    BLAT is really the wrong tool for short reads (at least without careful tuning of parameters and post filtering), especially if the goal is to decide where these reads are actually coming from. By default BLAT uses a tile size of 11, needing two tiles to match and an identity of 90%, plus large gaps are allowed and the default score is 30. Something like bowtie would be a lot more stringent. But if you must use blat, I’d increase the tile size, require a much higher identity and score. At minimum, you could use the UCSC tool pslCDnaFilter to impose some stricter cutoffs and you wouldn’t have to rerun the mapping.
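If rerunning the mapping is not practical, stricter cutoffs can also be imposed on an existing psl file with a few lines of Python, in the same spirit as the pslCDnaFilter suggestion. In the sketch below the thresholds (`min_identity=0.98`, `min_score=80`) are arbitrary examples, the identity is a simplified ratio rather than the calculation BLAT itself reports, and the score follows the commonly documented PSL formula for nucleotide alignments (matches plus half of repeat matches, minus mismatches and insert counts).

```python
def psl_score(fields):
    """Nucleotide PSL score from the first columns of a PSL row."""
    matches, mismatches, rep_matches = (int(fields[i]) for i in (0, 1, 2))
    q_inserts, t_inserts = int(fields[4]), int(fields[6])
    return matches + rep_matches // 2 - mismatches - q_inserts - t_inserts

def simple_identity(fields):
    """Simplified identity: matching bases over aligned bases (gaps ignored)."""
    matches, mismatches, rep_matches = (int(fields[i]) for i in (0, 1, 2))
    aligned = matches + mismatches + rep_matches
    return (matches + rep_matches) / aligned if aligned else 0.0

def filter_psl(psl_lines, min_identity=0.98, min_score=80):
    """Keep only data rows passing both (example) thresholds."""
    kept = []
    for line in psl_lines:
        fields = line.rstrip("\n").split("\t")
        # Skip header lines and malformed rows.
        if len(fields) < 21 or not fields[0].isdigit():
            continue
        if (simple_identity(fields) >= min_identity
                and psl_score(fields) >= min_score):
            kept.append(line)
    return kept
```

With 100-base reads, a minimum score of 80 would discard the short 30-base partial matches that BLAT's default score of 30 lets through.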

  • horvathdp
    replied
    responses

    I think there has been a miscommunication. The genomic sequences are not assembled, only the transcriptome. Thus I am looking to see how many of the genomic fragments contain transcribed sequences.

    Incidentally, I had planned to pull out those genomic sequences that had homology to the transcriptome (along with those fragments that contained conserved sequences from related organisms) and do an assembly of the transcribed space from the genome. However, I wasn't expecting that to be more than a few percent of the genomic sequence.

    I could see possibly running bowtie2 since that might allow me to map both ends of my reads simultaneously, but I want to collect as many of the intron and promoter sequences from the genome as possible, since those might be useful for other studies.

    So, back to my question: Why did I get so many hits?
