Filtering out transcripts from non target organism

  • Filtering out transcripts from non target organism

    Hi,

    I am assembling a transcriptome for a Drosophila species without a reference genome (my species diverged from its closest relative with a sequenced genome about 15 mya). I used Trinity for the assembly, which constructed over 65K components (which I assume are roughly equivalent to genes). I'm guessing that a lot of the sequences are from non-target species (e.g. bacteria, yeasts, cactus), as larvae were taken directly from their food source. Is there an easy way to identify and get rid of the bulk of the transcripts that come from non-target species (e.g. using BLAST or something else)? All Trinity transcripts are currently in FASTA format. I'm not particularly savvy with bioinformatics, so I'm not sure if there is an easy pipeline I could use. Thanks!

  • #2
    You might have assembled chimeric transcripts by using all the reads from different sources. This is somewhat like a metagenomic assembly, so you might want to read some papers on handling this sort of data.

    I would set up a 'contaminant' database containing all your non-target species, use a short read mapper (bowtie, bwa) to align your reads against it, keep only the reads that did not align, and rerun Trinity with those reads.
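
The read-filtering step above could be sketched in Python roughly as follows. This is a minimal sketch, not a tested pipeline: it assumes single-end reads, an uncompressed SAM file produced by mapping against the contaminant database, and plain 4-line FASTQ records. The function names `unaligned_read_names` and `filter_fastq` are hypothetical. Some mappers can write unaligned reads directly (Bowtie 2 has a `--un` option, for example), which may make a script like this unnecessary.

```python
def unaligned_read_names(sam_lines):
    """Return names of reads whose SAM FLAG has the 0x4 (unmapped) bit set."""
    names = set()
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue
        if int(fields[1]) & 0x4:  # read did not align to the contaminant database
            names.add(fields[0])
    return names


def filter_fastq(fastq_lines, keep):
    """Yield 4-line FASTQ records whose read name is in `keep`.

    Assumes (hypothetically) that each record is exactly four lines and that
    the read name is the first whitespace-delimited token after '@'.
    """
    record = []
    for line in fastq_lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            name = record[0][1:].split()[0]
            if name in keep:
                yield record
            record = []


# Tiny in-memory example: r1 is unmapped (FLAG 4), r2 hit a contaminant (FLAG 0).
sam = [
    "@HD\tVN:1.0",
    "r1\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r2\t0\tcontam1\t1\t42\t4M\t*\t0\t0\tACGT\tIIII",
]
fastq = ["@r1", "ACGT", "+", "IIII", "@r2", "ACGT", "+", "IIII"]

keep = unaligned_read_names(sam)
clean_records = list(filter_fastq(fastq, keep))
```

In real use the two inputs would be open file handles rather than lists, and the kept records would be written to a new FASTQ file for the Trinity rerun.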

    Otherwise, as you mention, you could make a BLAST database of your non-target sequences, align your current assembly to it, and keep only the transcripts that did not align. I'm not aware of any pipeline that would automate this. You'll need to collect all the component IDs that did align to this non-target database, and then subtract them from your original FASTA file. This would best be done on the command line, using bash or perl.
    The problem with this method is that some of your target sequences might also align to the non-target database, so you'll need to decide on some thresholds.
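
The subtraction step could look something like the sketch below, written in Python rather than bash or perl. It assumes BLAST was run with tabular output (`-outfmt 6`, default column order), and it filters by full transcript ID rather than by component. The function names and the default thresholds (`min_pident`, `min_length`, `max_evalue`) are illustrative placeholders, not recommended values; they would need tuning so that genuine fly transcripts are not discarded.

```python
def contaminant_hits(blast_lines, min_pident=90.0, min_length=100, max_evalue=1e-10):
    """Return query IDs with at least one hit passing all three thresholds.

    Expects default BLAST tabular columns: qseqid, sseqid, pident, length,
    mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore.
    """
    hits = set()
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        pident, length, evalue = float(f[2]), int(f[3]), float(f[10])
        if pident >= min_pident and length >= min_length and evalue <= max_evalue:
            hits.add(f[0])
    return hits


def subtract_fasta(fasta_lines, drop):
    """Yield FASTA lines for every record whose ID is not in `drop`."""
    keep = True
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            parts = line[1:].split()
            seq_id = parts[0] if parts else ""
            keep = seq_id not in drop
        if keep:
            yield line


# Tiny in-memory example: the first hit is strong, the second is weak.
blast = [
    "comp1_c0_seq1\tbact\t98.5\t250\t3\t0\t1\t250\t10\t259\t1e-50\t400",
    "comp2_c0_seq1\tbact\t75.0\t60\t10\t2\t1\t60\t5\t64\t0.001\t40",
]
fasta = [">comp1_c0_seq1 len=4", "ACGT", ">comp2_c0_seq1 len=4", "TTGA"]

drop = contaminant_hits(blast)
filtered = list(subtract_fasta(fasta, drop))
```

Only `comp1_c0_seq1` is removed here, because the weak hit for `comp2_c0_seq1` fails all three thresholds. Stricter thresholds keep more target transcripts at the cost of leaving more contaminants in.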

    • #3
      Thanks for the quick response!
      I think the difficulty is that I have no idea what the non-target organisms are, so I don't think I could easily set up a database (it could be anything in rotting cactus). I assume this would be a typical issue with de novo assemblies, but I haven't been able to find much information on how people are dealing with it, though I am continuing to look. My main goal is to look at differential expression, but I was hoping to create a transcriptome that is mostly free of contaminants before mapping reads back to it.

      • #4
        How close is your target species to D. mel? Alternatively, could you align your reads to it with relaxed parameters, and use those that aligned to do a de novo assembly?

        It could also be that the contamination is at a minimal level. You could pick a few possible non-target organisms, see what % of reads map to each, and decide if this is an acceptable level. Plus, if your contaminant sequences are quite different from your Drosophila (bacteria vs plant vs fly), the assembler can still do a good job of distinguishing and assembling the sequences. I.e. you might not even need to worry about it too much.
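
The "% of reads mapped" check could be computed from a mapper's SAM output roughly as below. This is a sketch under assumptions: one SAM file per candidate non-target organism, and secondary and supplementary alignments skipped so that each read is counted once. The function name `percent_mapped` is hypothetical.

```python
def percent_mapped(sam_lines):
    """Return the percentage of reads that aligned, from SAM records.

    Skips header lines and secondary (0x100) / supplementary (0x800)
    alignments; a read counts as mapped if the 0x4 bit is not set.
    """
    total = 0
    mapped = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            continue
        flag = int(fields[1])
        if flag & 0x100 or flag & 0x800:
            continue
        total += 1
        if not flag & 0x4:
            mapped += 1
    return 100.0 * mapped / total if total else 0.0


# Tiny in-memory example: one of four reads maps to the candidate contaminant.
sam = [
    "@SQ\tSN:yeast_chr1\tLN:1000",
    "r1\t0\tyeast_chr1\t5\t42\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]
pct = percent_mapped(sam)
```

Running this once per candidate organism gives a rough picture of how much each one contributes. What counts as an acceptable level is a judgment call.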
