
  • Trinity paralog filtering

    (cross-posted on BioStars: http://www.biostars.org/p/76694/)

    Greetings Everyone,

    I am working with a group that does population genetics on non-model species (the closest reference genome is usually at least 5%-10% divergent). We are just starting to move into NGS with the following general approach:

    1. Gather transcriptomic data through RNA-seq or EST databases
    2. Using transcriptome data, design hybrid capture bait sets (e.g. MYcroarray MYbaits) for several thousand transcripts
    3. Enrich exons and flanking intronic regions using the above bait set for hundreds of individuals and sequence (HiSeq and/or MiSeq)
    4. SNP calling
    5. Pop gen analyses

    For a particular experiment, I have RNA-seq data from three individuals from which I want to design hybrid capture baits. I've de novo assembled the transcriptomes of these individuals, and I'm now picking transcripts to use for the baits.

    My question for everyone: what can I do at this stage to reduce or eliminate enrichment of paralogous genes? Does anyone have a filtering strategy for this stage that he or she could share (perhaps based on BLAST E-values or sequence similarity)?

    One thing I have considered is taking the final Trinity.fasta files and simply removing all components that have more than one contig/sequence. So for the hypothetical dataset below I would keep component 5 and throw out components 2 and 6.

    Code:
    >comp2_c0_seq1 len=3 path=[354:0-2]
    CAT
    >comp2_c1_seq1 len=6 path=[972:0-5]
    ATTCAC
    >comp5_c0_seq1 len=8 path=[629:0-7]
    GGGCTTGA
    >comp6_c0_seq1 len=5 path=[449:0-4]
    CCAAC
    >comp6_c0_seq2 len=8 path=[225:0-7]
    GATACGGG
    Is this a potentially valid approach? One concern is that, unless I'm mistaken, multiple sequences within a single component may represent allelic variation as well as paralogs and isoforms, so this approach might reduce the number of SNPs recovered after the pulldown experiment.
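    A minimal sketch of the component filter described above, in Python. It assumes headers follow the `compN_cM_seqK` pattern shown in the example and treats the `compN` prefix as the component; the function names are illustrative and not part of Trinity.

```python
import re
from collections import defaultdict

def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def single_sequence_components(records):
    """Keep records whose component (the compN prefix) holds exactly one sequence."""
    by_component = defaultdict(list)
    for header, seq in records:
        # Assumed header layout: compN_cM_seqK followed by optional annotations.
        match = re.match(r"(comp\d+)_c\d+_seq\d+", header)
        if match is None:
            continue  # skip headers that do not fit the assumed layout
        by_component[match.group(1)].append((header, seq))
    return [recs[0] for recs in by_component.values() if len(recs) == 1]

# The hypothetical dataset from the post.
example = """\
>comp2_c0_seq1 len=3 path=[354:0-2]
CAT
>comp2_c1_seq1 len=6 path=[972:0-5]
ATTCAC
>comp5_c0_seq1 len=8 path=[629:0-7]
GGGCTTGA
>comp6_c0_seq1 len=5 path=[449:0-4]
CCAAC
>comp6_c0_seq2 len=8 path=[225:0-7]
GATACGGG
"""

kept = single_sequence_components(parse_fasta(example))
for header, seq in kept:
    print(f">{header}\n{seq}")
```

    On the example this keeps only component 5. Because allelic variants and isoforms also produce multiple sequences per component, the filter is conservative rather than paralog-specific.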

    If the goal is simply to find a bunch of markers to sequence for a bunch of individuals for population genetic analyses (while reducing/eliminating paralogs), what would you do?

    Thanks so much for your help in advance!

  • #2
    There are a few things you could do.

    1. As described here: http://trinityrnaseq.sourceforge.net...stimation.html, you could filter out lowly expressed components, but there will be caveats with that as described.

    2. You could reduce the redundancy of your data set with software such as cd-hit-est, TGICL clustering, fastanrdb from the Exonerate package, or Minimus2, which collapse/cluster similar sequences into a longest representative sequence. I typically run cd-hit-est at 100% identity first, and then use TGICL. This generates a 'unigene' set based on the identity threshold, which should substantially reduce the number of contigs in your assembly. You can set a lower threshold for cd-hit-est if you have a rough idea of the level of similarity between paralogous sequences.

    You can then map the original reads you had back to this 'consensus' assembly, to identify and visualize the polymorphic sites and/or call SNPs (reads which have a couple of mismatches to the consensus assembly will show up as SNPs).

    Note, though, that with method 2 you might cluster wrong/chimeric transcripts and in essence enrich for these wrong sequences. In my experience they are not a high percentage, and you can validate them with blastx for your genes of interest (this won't apply to ncRNA, though).

    You should also filter out transcripts shorter than a certain length (at least 200 nt), but I think Trinity already does this by default.
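    A length filter like this is a one-liner once the assembly is loaded. The sketch below assumes the transcripts are already in memory as (header, sequence) pairs; the record names and the helper `filter_by_length` are made up for illustration.

```python
def filter_by_length(records, min_len=200):
    """Return only the (header, sequence) pairs with at least min_len bases."""
    return [(header, seq) for header, seq in records if len(seq) >= min_len]

# Hypothetical records: one below and one above the 200 nt cutoff.
records = [
    ("short_transcript", "ACGT" * 40),   # 160 nt
    ("long_transcript", "ACGT" * 60),    # 240 nt
]

long_enough = filter_by_length(records, min_len=200)
print([header for header, _ in long_enough])
```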



    • #3
      Thanks very much--I really appreciate your help. I think I almost understand.

      If I collapse similar contigs into representatives using something like cd-hit-est, do I lose information about whether paralogs exist for a particular gene/transcript? Or is this something I could determine during the read-mapping step? My end goal is to select a few thousand single-copy genes/exons to sequence in the future for a few hundred individuals.



      • #4
        Originally posted by atcghelix
        Thanks very much--I really appreciate your help. I think I almost understand.

        If I collapse similar contigs into representatives using something like cd-hit-est, do I lose information about whether paralogs exist for a particular gene/transcript? Or is this something I could determine during the read-mapping step? My end goal is to select a few thousand single-copy genes/exons to sequence in the future for a few hundred individuals.
        No, you won't lose this information. You won't be able to tell how many sequences were clustered into the representative sequence directly from the fasta output, but cd-hit-est writes another file listing all the transcripts (component headers in your case) that were put in each cluster. From this file, you can identify transcripts that did not have any other similar sequences (i.e. 'single' copy within the definition of your clustering). Be aware, though, that a 'single copy' transcript from your RNAseq assembly isn't necessarily truly single copy: other paralogs may simply not have been expressed highly enough or may not have assembled properly.
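        A rough sketch of reading that cluster file in Python to list singleton clusters. It assumes the usual `.clstr` layout (a `>Cluster N` line followed by one member line per sequence, with the name between `>` and `...`); the example contents are invented, and long headers may be truncated in this file depending on the cd-hit settings.

```python
import re

def parse_clstr(text):
    """Return a list of clusters, each a list of member sequence names."""
    clusters = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            clusters.append([])          # start a new cluster
            continue
        # Assumed member line: "<idx>\t<len>nt, ><name>... <status>"
        match = re.search(r">(\S+?)\.\.\.", line)
        if match and clusters:
            clusters[-1].append(match.group(1))
    return clusters

def singletons(clusters):
    """Names of sequences that clustered with nothing else."""
    return [members[0] for members in clusters if len(members) == 1]

# Invented example in the assumed format.
example_clstr = (
    ">Cluster 0\n"
    "0\t8nt, >comp6_c0_seq2... *\n"
    "1\t5nt, >comp6_c0_seq1... at +/100.00%\n"
    ">Cluster 1\n"
    "0\t8nt, >comp5_c0_seq1... *\n"
)

print(singletons(parse_clstr(example_clstr)))
```

        Singletons here are only 'single copy' at the chosen identity threshold, with the expression caveat above still applying.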

        You won't (easily) be able to tell how many paralogs exist from read mapping, just whether they might exist, based on the number of polymorphisms you see. You can probably take the absence of any polymorphisms as a sign that a sequence is 'single' copy, but again you need to be aware of the expression levels of paralogs in your RNAseq data.
        hth



        • #5
          Yes, very helpful--thanks again!
