  • Blat of transcriptome to genome gave 70% hits! reality?

    I work on an auto-allo hexaploid plant species with a genome size estimated to be about 2.1 Gb. We have an excellent transcriptome assembly from four different experiments that produced about 600,000 contigs with an N50 of over 1600 bases and a CEGMA run showed 98% representation at full length and 100% representation at partial. We also have PE100 genomic sequence data from 4 different libraries from 300-400 bases in size that gives us about 15X coverage (based on kmer representation analysis). For kicks, I used BLAT to see how many genomic fragments would map to my assembled transcriptome, and got (what to me seems) a surprising number of hits. Well over 70% of the genomic fragments had hits to the transcriptome. Is this likely real, or is there something about the BLAT program that would produce a high number of spurious hits? I counted only the unique hits, so this isn't due to genomic fragments having multiple hits to the transcriptome. Thoughts? Is there a better way to determine the percentage of the genome represented in my transcriptome?

  • #2
    I would shred the transcriptome into pieces (~300bp or so) and map them to the genome, then calculate coverage.
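The shredding step suggested above could be sketched roughly as follows. This is only an illustration: the function names `shred` and `shred_fasta`, the 300 bp window and the non-overlapping step are assumptions, and contigs are passed in as plain (name, sequence) pairs rather than read with any particular FASTA library.

```python
def shred(seq, size=300, step=300):
    """Yield (start, piece) tuples covering seq in windows of `size` bases."""
    if len(seq) <= size:
        # Short contigs are kept whole rather than discarded.
        yield 0, seq
        return
    for start in range(0, len(seq) - size + 1, step):
        yield start, seq[start:start + size]
    # Add one final window flush with the end so the tail is not dropped.
    if (len(seq) - size) % step != 0:
        yield len(seq) - size, seq[-size:]


def shred_fasta(records, size=300, step=300):
    """records: iterable of (name, sequence); yields FASTA-formatted strings."""
    for name, seq in records:
        for start, piece in shred(seq, size, step):
            # The start offset is appended so pieces can be traced to a contig.
            yield f">{name}_{start}\n{piece}"
```

The last window overlaps its neighbour when the contig length is not a multiple of the step, which slightly inflates coverage at contig ends; whether that matters depends on how the coverage is calculated afterwards.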



    • #3
      What is the N50 of your 15X assembly?
      I would suggest using BWA or Bowtie2 to map your reads against your transcriptome. See what percent map.
      Of course, depending on how "allo" your three sub-genomes are, there could be complexities there.
      --
      Phillip
      Last edited by pmiguel; 01-07-2015, 12:20 PM.



      • #4
        responses

        There seems to be a miscommunication I think. The genomic sequences are not assembled, only the transcriptome. Thus I am looking to see how many of the genomic fragments contain transcribed sequences.

        Incidentally, I had planned to pull out those genomic sequences that had homology to the transcriptome (along with those fragments that contained conserved sequences from related organisms) and do an assembly of the transcribed space from the genome. However, I was expecting that to be only a few percent of the genomic sequence.

        I could see possibly running bowtie2 since that might allow me to map both ends of my reads simultaneously, but I want to collect as many of the intron and promoter sequences from the genome as possible- since those might be useful for other studies.

        So, back to my question: Why did I get so many hits?



        • #5
          BLAT is really the wrong tool for short reads (at least without careful tuning of parameters and post filtering), especially if the goal is to decide where these reads are actually coming from. By default BLAT uses a tile size of 11, needing two tiles to match and an identity of 90%, plus large gaps are allowed and the default score is 30. Something like bowtie would be a lot more stringent. But if you must use blat, I’d increase the tile size, require a much higher identity and score. At minimum, you could use the UCSC tool pslCDnaFilter to impose some stricter cutoffs and you wouldn’t have to rerun the mapping.
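As a rough illustration of the post-filtering idea above (not a replacement for pslCDnaFilter), existing PSL output can be re-filtered without rerunning BLAT. The sketch below assumes the standard 21-column tab-separated PSL layout; the thresholds `min_identity=0.98` and `min_aligned=80` are made-up example values, and the identity used here is a simple matches / aligned-bases ratio, not the UCSC scoring formula.

```python
def psl_rows(lines):
    """Yield split PSL data rows, skipping the optional header block."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Data rows start with the integer 'matches' column.
        if len(fields) >= 21 and fields[0].isdigit():
            yield fields


def passes(fields, min_identity=0.98, min_aligned=80):
    """True if the hit has enough aligned bases at high enough identity."""
    # Columns 1-3 of a PSL row: matches, misMatches, repMatches.
    matches, mismatches, rep_matches = (int(x) for x in fields[:3])
    aligned = matches + mismatches + rep_matches
    if aligned < min_aligned:
        return False
    return (matches + rep_matches) / aligned >= min_identity


def filter_psl(lines, **thresholds):
    """Return the PSL rows (as tab-joined strings) that pass the thresholds."""
    return ["\t".join(f) for f in psl_rows(lines) if passes(f, **thresholds)]
```

With 100 bp reads, requiring 80 aligned bases would throw away reads that only clip the edge of an exon, so the cutoff is a trade-off between specificity and recovering exon/intron junction reads.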



          • #6
            So you think the reason I am getting so many hits is that the match criteria are too lax? That seems possible given the high number of hits. Bowtie doesn't handle gaps but bowtie2 does. However, since I am mapping genomic fragments against an assembled transcriptome (rather than the more common transcriptome fragments against an assembled genome), would bowtie2 still give me hits in situations where the genomic fragment contained intron sequences? Would it still give me a hit if only one of the PEs had a hit to the transcript? As I ask these questions, I am guessing I should probably read up on the bowtie program.



            • #7
              Do the more stringent alignment first, but I would also think that your transcriptome assembly is contaminated with genomic sequences, which happens quite easily (especially if you include a lot of sequence in your assembly).

              How many of the 600 000 "transcripts" contain ORFs covering most of their length? How many of the 600 000 "transcripts" are suspiciously large (> 10 kb)?
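One rough way to put numbers on those two questions is sketched below. It is only a sketch under stated assumptions: an ORF is taken to be ATG through the next in-frame stop (standard stop codons only), ORFs that run off the end of a contig are ignored, and the function names and the `min_fraction=0.5` and `max_len=10_000` cutoffs are illustrative choices, not anything from the thread.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOPS = {"TAA", "TAG", "TGA"}


def longest_orf_len(seq):
    """Length in bases of the longest ATG..stop ORF over all six frames."""
    seq = seq.upper()
    best = 0
    # Scan the forward strand and its reverse complement.
    for strand in (seq, seq.translate(COMPLEMENT)[::-1]):
        for frame in range(3):
            start = None
            for i in range(frame, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if codon == "ATG" and start is None:
                    start = i
                elif codon in STOPS and start is not None:
                    best = max(best, i + 3 - start)
                    start = None
    return best


def orf_fraction(seq):
    """Fraction of the contig covered by its longest complete ORF."""
    return longest_orf_len(seq) / len(seq) if seq else 0.0


def flag_contigs(records, min_fraction=0.5, max_len=10_000):
    """Yield (name, length, fraction) for contigs that look suspicious."""
    for name, seq in records:
        fraction = orf_fraction(seq)
        if fraction < min_fraction or len(seq) > max_len:
            yield name, len(seq), fraction
```

Because partial ORFs at contig ends are ignored, fragmented but genuine transcripts will be flagged too, so the output is a list of contigs to inspect rather than a verdict.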



              • #8
                What method did you use to ensure that you:
                counted only the unique hits, so this isn't due to genomic fragments having multiple hits to the transcriptome.
                Did you discard any reads that mapped to multiple contigs in the transcriptome? If so, did they get counted as "non-mapping", or were they just not counted at all?
                --
                Phillip



                • #9
                  Thanks for all the replies! I'm working in the iPlant discovery environment so it is difficult to change parameters of some of the programs. However, I ran bowtie2 using the default parameters (which I am checking on currently to see what was programmed in for these) on just one of my genomic libraries and got the following:

                  101531102 reads; of these:
                  101531102 (100.00%) were paired; of these:
                  63600152 (62.64%) aligned concordantly 0 times
                  7094920 (6.99%) aligned concordantly exactly 1 time
                  30836030 (30.37%) aligned concordantly >1 times
                  ----
                  63600152 pairs aligned concordantly 0 times; of these:
                  164821 (0.26%) aligned discordantly 1 time
                  ----
                  63435331 pairs aligned 0 times concordantly or discordantly; of these:
                  126870662 mates make up the pairs; of these:
                  115162777 (90.77%) aligned 0 times
                  4017518 (3.17%) aligned exactly 1 time
                  7690367 (6.06%) aligned >1 times
                  43.29% overall alignment rate


                  So I still had 43% with hits! I am guessing we definitely have some contaminating genomic DNA, but I wouldn't have expected that much. However, my large contigs look real - most encode either proteins or bits of chloroplast or mitochondria sequences based on Blast hits. In answer to sarvidsson's question, about half (300K or so) of my contigs contain open reading frames greater than 300 bases in length. However, quite a number of the non-ORF contigs have reasonable expression values in some experiments. In answer to pmiguel, I just ran a sort on column 10 of the psl file that only returned the first hit for any given fragment, and then did a count using GREP on the @HWI at the start of each fragment name.

                  For kicks, I'll re-run the Bowtie using just my contigs with long ORFs.
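For reference, the 43.29% in the bowtie2 summary above is a per-read figure, not a per-pair one: every pair that aligned (concordantly or discordantly) contributes two reads, and mates that only aligned singly are added on top. A small check using the numbers from the summary (the function and argument names are just illustrative):

```python
def overall_alignment_rate(pairs, conc_once, conc_multi, disc_once,
                           mate_once, mate_multi):
    """Fraction of individual reads aligned, from bowtie2 paired-end counts."""
    aligned_pairs = conc_once + conc_multi + disc_once
    # Each aligned pair is two reads; unpaired mate alignments count once.
    aligned_reads = 2 * aligned_pairs + mate_once + mate_multi
    return aligned_reads / (2 * pairs)


rate = overall_alignment_rate(
    pairs=101531102,
    conc_once=7094920,
    conc_multi=30836030,
    disc_once=164821,
    mate_once=4017518,
    mate_multi=7690367,
)
print(f"{rate:.2%}")  # 43.29%
```

Note that most of this comes from the 30.37% of pairs that aligned concordantly more than once; only 6.99% of pairs had a single concordant placement.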



                  • #10
                    I don't think there is much of a basis to presume DNA contamination of your transcriptome data. Plant genomes, especially ones with 1C genome sizes around that of sorghum or larger, tend to comprise retrotransposon clusters over a sizable percentage of their length. If one of the cDNA libraries that was used to generate your transcriptome data happened to include tissue that expressed retrotransposons, then that could give you the source of a large percentage of your hits.
                    Alternatively, it may be that the genome you are sequencing is not as large as you think.

                    --
                    Phillip



                    • #11
                      Good point pmiguel! I hadn't considered the possibility that a large number of the fragments might be transposons. I have a nice fasta file that combined the sequences of known transposons (and other repetitive sequences) from several plant species. I'll run a BLAT of my hits.fasta against it.



                      • #12
                        Originally posted by horvathdp
                        ... about half (300K or so) of my contigs contain open reading frames greater than 300 bases in length. However, quite a number of the non-ORF contigs have reasonable expression values in some experiments.

                        For kicks, I'll re-run the Bowtie using just my contigs with long ORFs.
                        While pmiguel is right concerning the retrotransposons, I was referring to long transcontigs with either a single ORF covering a minor part (single percentage range) of the contig, or several ORFs at several positions. I've seen such transcontigs in extreme-coverage de novo transcriptome assemblies from plants (with mid- to large-sized genomes), and I don't trust them to be single transcripts, or at least not fully processed functional ones.



                        • #13
                          So I ran BLAT to see how many of the genomic fragments that map to my transcriptome also map to the fasta file I built from the plant repetitive sequence database. Surprisingly, only 0.8% hit the repE database. I also ran a BlastN to identify contigs with similarity to my repE file, and only a bit more than 200 (out of ~560,000) had matches better than E-5. I really thought I would get more. So, a million or so of my genomic frags (that map to my transcriptome) are from repetitive elements, but that is not nearly enough to explain the large percentage with hits to my transcriptome. My next step will be to address sarvidsson's thought and look at my individual transcripts to see if any have inordinately high representation among the genomic fragments. Anyone here have a nice script for counting the number of times a ref seq is hit in a psl file? Incidentally, my transcriptome assembly only has 560 M bases (about a quarter of the estimated genome size) and a fair number are related contigs (as the assembly was done using Trinity).
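For the per-reference hit count asked about above, a minimal sketch in Python might look like this. It assumes the standard tab-separated PSL layout, where the query name is column 10 and the target (reference) name is column 14; the function name and the `unique_queries` switch are illustrative, not part of any existing tool.

```python
from collections import Counter


def count_target_hits(lines, unique_queries=True):
    """Count PSL hits per target sequence name (PSL column 14)."""
    counts = Counter()
    seen = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip the optional header block and any malformed rows.
        if len(fields) < 21 or not fields[0].isdigit():
            continue
        q_name, t_name = fields[9], fields[13]
        if unique_queries:
            # Count each fragment at most once per target.
            if (q_name, t_name) in seen:
                continue
            seen.add((q_name, t_name))
        counts[t_name] += 1
    return counts
```

Passing an open file handle as `lines` and then calling `.most_common(20)` on the result lists the most heavily hit contigs. With `unique_queries=True` every (fragment, contig) pair is held in memory, which may be too much for a PSL file with tens of millions of rows; setting it to `False` simply counts rows.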



                          • #14
                            Originally posted by horvathdp
                            So I ran the Blat to see how many of my genomic fragments which map to my transcriptome, also map to the fasta file I built from the plant repetitive sequence database. Surprisingly, only 0.8% hit the repE-database. I also ran a BlastN to identify contigs that had similarity to my repE file, and only came up with a bit more than 200 (out of ~560,000) had matches greater than E-5. I really thought I would get more. So, a million or so of my genomic frags (that map to my transcriptome) that are from repetitive elements, but not nearly enough to explain the large percentage with hits to my transcriptome.
                            There is generally little inter-genera nucleotide similarity among most transposable element sequences. At least in plants. Interesting exceptions to this, of course. But they are just that, exceptions.

                            You could gain some more sensitivity by using tblastx (or its equivalent) instead. But large segments of many LTR retrotransposons are not coding sequence, so protein level conservation may not be detectable.

                            --
                            Phillip



                            • #15
                              So actually my count should also turn up sequences that are highly represented in the genome which thus might include new repEs or at least leafy spurge specific repEs. Cool!
