Compare fasta files

  • Compare fasta files

    Hi there,

    I want to compare several fasta files containing sequences. These sequences are transcripts obtained from RNA-Seq. I want to find out the shared transcripts between samples.
    I cannot use CuffCompare or similar because I have no reference genome. I only have transcripts.

    Thanks in advance,

  • #2
    You could try CD-HIT (cd-hit-est for nucleotide sequences) to cluster the transcripts.
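One way the CD-HIT suggestion might be set up, as a minimal sketch: tag every header with its sample name, concatenate the samples, and cluster the pooled file, so that clusters containing members from more than one sample point to shared transcripts. The function name `tag_headers`, the file names, and the 95% identity threshold are assumptions for illustration, not something from the thread.

```shell
# tag_headers FILE SAMPLE
# Prefix every FASTA header in FILE with "SAMPLE|" so that each sequence
# can still be traced back to its sample after the files are pooled.
# (Hypothetical helper; the name is made up for this sketch.)
tag_headers() {
    awk -v s="$2" '/^>/ { sub(/^>/, ">" s "|") } { print }' "$1"
}

# Assumed usage (requires CD-HIT to be installed; file names are placeholders):
#   tag_headers sample1.fa s1 >  all_samples.fa
#   tag_headers sample2.fa s2 >> all_samples.fa
#   cd-hit-est -i all_samples.fa -o clustered -c 0.95 -n 10 -d 0
# The membership of each cluster is then listed in clustered.clstr.
```

The identity threshold (`-c`) decides how similar two transcripts must be to count as "the same", so it is worth trying more than one value.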


    • #3
      Try using command line utilities
      cat
      sort
      uniq


      Example:
      #get unique reads for 1, filter out read names (lines with >)
      cat 1.fa | grep -v ">" | sort | uniq > 1.tmp
      #get unique reads for 2
      cat 2.fa | grep -v ">" | sort | uniq > 2.tmp
      #get reads common to 1 and 2
      cat 1.tmp 2.tmp | sort | uniq -d


      sort accepts an option for using more RAM (-S / --buffer-size in GNU sort), which helps if the data files are large.
      Check the manual with "man sort" for details.
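As a rough sketch of the memory option mentioned above: GNU sort takes `-S` for the buffer size and `-T` for the temporary directory. The function name `unique_seq_lines`, the `SORT_MEM` variable, and the 512M default are assumptions for illustration. Like the example above, this compares lines, so it only behaves as intended when each sequence sits on a single line.

```shell
# unique_seq_lines FILE
# Print the unique non-header lines of FILE.
# -S sets the sort buffer size and -T the temp directory (GNU sort options);
# SORT_MEM is a made-up variable for this sketch, defaulting to 512M.
unique_seq_lines() {
    grep -v '^>' "$1" | LC_ALL=C sort -u -S "${SORT_MEM:-512M}" -T "${TMPDIR:-/tmp}"
}

# Assumed usage (file names are placeholders):
#   SORT_MEM=4G unique_seq_lines 1.fa > 1.tmp
```

Setting `LC_ALL=C` makes sort compare raw bytes, which is usually faster than locale-aware sorting and keeps the order consistent between files.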


      • #4
        BLAT appears to be the easiest and most straightforward way, right?


        • #5
          Check out bl2seq ...
          http://blast.ncbi.nlm.nih.gov/Blast...._LOC=align2seq

          There's a command line version if you're into that kind of stuff.


          • #6
            Thanks to all, I am very grateful for your help,

            This is my opinion:

            (i) CD-HIT seems interesting, but I have not tested it yet.

            (ii) Using the 'cat', 'sort' and 'uniq' commands on FASTA files is a serious mistake, because they compare each line instead of each sequence (which may span multiple lines).

            (iii) 'BLAT' needs a reference genome, and I do not have such.

            (iv) 'bl2seq' does not support large files, so they suggest using Blast+ instead. That makes it the same as running a local blast with Blast+.

            Is this correct?


            • #7
              Originally posted by Hel
              (iii) 'BLAT' needs a reference genome, and I do not have such.
              BLAT does not need a reference genome. In fact, you run BLAT with just two files (which can be single sequences or multi-FASTA files). The first file on the command line serves as the "database" and the second as the "query". So in your case you will be aligning each sequence (many of them, one after another) against one "database" file (or against all of the files concatenated together). Ideally the sequence itself will be the top hit. You may want to use a tabular output format (for example -out=blast8) to be able to parse the results easily.
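A minimal sketch of how the tabular output could be parsed, assuming the BLAST-style 12-column layout (query, subject, percent identity, alignment length, and so on) that `-out=blast8` produces. The function name `shared_queries`, the file names, and the default thresholds are assumptions for illustration.

```shell
# Assumed invocation (requires BLAT; file names are placeholders):
#   blat -out=blast8 sample1.fa sample2.fa hits.tsv
# Here sample1.fa acts as the "database" and sample2.fa as the "query".

# shared_queries HITS [MIN_IDENTITY] [MIN_ALIGN_LEN]
# Print the IDs of query sequences with at least one hit whose percent
# identity (column 3) and alignment length (column 4) reach the thresholds.
# (Hypothetical helper; defaults of 95 and 0 are arbitrary.)
shared_queries() {
    awk -F'\t' -v min_id="${2:-95}" -v min_len="${3:-0}" \
        '$3 + 0 >= min_id && $4 + 0 >= min_len { print $1 }' "$1" | LC_ALL=C sort -u
}

# Assumed usage:
#   shared_queries hits.tsv 95 200 > shared_ids.txt
```

Requiring a minimum alignment length helps to avoid counting short local matches as shared transcripts; suitable values depend on the data.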


              • #8
                Originally posted by Hel

                (ii) Using the 'cat', 'sort' and 'uniq' commands on FASTA files is a serious mistake, because they compare each line instead of each sequence (which may span multiple lines).
                You could remove the line breaks within the sequences and then continue as Richard advised.

                Code:
                awk '!/^>/ { printf "%s", $0; n = "\n" } /^>/ { print n $0; n = "" } END { printf "%s", n }' file.fa > out.fa
                Last edited by rhinoceros; 05-18-2015, 04:04 AM.
                savetherhino.org
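Putting the two suggestions together, a minimal sketch: first join each sequence onto one line with the awk command above, then compare the unique sequences of two files. The function names `linearize` and `shared_seqs` are made up for this sketch. It only finds sequences that are identical character for character, so transcripts that differ by a single base, by case, or by strand will not be reported as shared.

```shell
# linearize FILE
# Print FILE with headers unchanged and each sequence joined onto one line
# (same awk program as in the post above).
linearize() {
    awk '!/^>/ { printf "%s", $0; n = "\n" } /^>/ { print n $0; n = "" } END { printf "%s", n }' "$1"
}

# shared_seqs A.fa B.fa
# Print the sequences that occur, exactly, in both files.
shared_seqs() {
    ss_a=$(mktemp)
    ss_b=$(mktemp)
    # Drop headers, then keep one copy of every distinct sequence per file.
    linearize "$1" | grep -v '^>' | LC_ALL=C sort -u > "$ss_a"
    linearize "$2" | grep -v '^>' | LC_ALL=C sort -u > "$ss_b"
    # comm -12 keeps only the lines present in both sorted inputs.
    LC_ALL=C comm -12 "$ss_a" "$ss_b"
    rm -f "$ss_a" "$ss_b"
}

# Assumed usage (file names are placeholders):
#   shared_seqs 1.fa 2.fa > shared.txt
```

Using `comm -12` on two de-duplicated files gives the same result as `cat 1.tmp 2.tmp | sort | uniq -d`, without sorting the combined data a second time.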
