Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Subsetting a fasta file based on a set of BLAST results

    Hi,

    I have assembled a de novo transcriptome from my species of interest using trinity. The species has an internal symbiont (algae) and those sequences were obviously sequenced along with it. I have done a blast on my assembly which has ID'd a number of contigs corresponding to the symbiont. I would like to use the results of the blast to remove those contigs from the assembly into a new file so I can deal with the two organisms separately. It there an easy way to do this?

    I have cleaned up the trinity fasta output so it has a single unique ID for each contig like so:
    >TR1-c0_g1_i1
    AGCTGTTTGGCCAAGGCTGCGGCCTGGTGGCAGCCTTGCGAGAGCAAGGGCAGCAAGGGC (etc...)

    I have extracted the sequence IDs from the symbiont BLAST flatfile output using cut (id-symb) and then made another file with all the IDs of the symb blast hits removed from the full id file, thus representing the "host" sequences (id-host), using sort and uniq.

    I therefore have the following.
    combined.fasta (trinity output of all the contigs)
    id-symb (single column text file of the IDs extracted from a blast search against the symbiont transcriptome)
    id-host (single column text file of the all the IDs from the combined.fasta file minus the id-symb IDs)

    I would like to generate the following:
    symb.fasta = all those sequences in the combined.fasta from the id-symb list
    and
    host.fasta = the rest (i.e. combined.fasta - symb.fasta) aka all those sequences in the combined.fasta from the id-host list

    I've been trying to use a looped fasgrep (from the FAST perl module) but that is far too slow (has taken more than a day to get through less than 10% of the file) so I'm sure there must be a better way.

    The assembly contains ~250,000 contigs.


    Thanks.

  • #2
    Using R, quite easy and fast...

    library(Biostrings)

    all_fasta <- read.DNAStringSet("combined.fasta") ## You have to give the path to your file combined.fasta

    id_symb <- scan("id_symb", what="character", sep="\n")

    symbFasta <- all_fasta[names(all_fasta) %in% id_symb]
    hostFasta <- all_fasta[! names(all_fasta) %in% id_symb]

    Comment


    • #3
      Another option is Jim Kent's faSomeRecords utility. I am linking the linux version but he has source/OS X executables available as well.



      faSomeRecords - Extract multiple fa records
      usage:
      faSomeRecords in.fa listFile out.fa
      options:
      -exclude - output sequences not in the list file.

      Comment


      • #4
        faSomeRecords = lifesaver

        Thank you so much genomax - that took less than 2 seconds for each file. The other way was still chugging away after two days...

        Comment

        Latest Articles

        Collapse

        • seqadmin
          Recent Advances in Sequencing Analysis Tools
          by seqadmin


          The sequencing world is rapidly changing due to declining costs, enhanced accuracies, and the advent of newer, cutting-edge instruments. Equally important to these developments are improvements in sequencing analysis, a process that converts vast amounts of raw data into a comprehensible and meaningful form. This complex task requires expertise and the right analysis tools. In this article, we highlight the progress and innovation in sequencing analysis by reviewing several of the...
          Yesterday, 07:48 AM
        • seqadmin
          Essential Discoveries and Tools in Epitranscriptomics
          by seqadmin




          The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
          04-22-2024, 07:01 AM

        ad_right_rmr

        Collapse

        News

        Collapse

        Topics Statistics Last Post
        Started by seqadmin, Today, 06:57 AM
        0 responses
        5 views
        0 likes
        Last Post seqadmin  
        Started by seqadmin, Yesterday, 07:17 AM
        0 responses
        13 views
        0 likes
        Last Post seqadmin  
        Started by seqadmin, 05-02-2024, 08:06 AM
        0 responses
        19 views
        0 likes
        Last Post seqadmin  
        Started by seqadmin, 04-30-2024, 12:17 PM
        0 responses
        21 views
        0 likes
        Last Post seqadmin  
        Working...
        X