**dschika** · 03-13-2015, 09:11 AM

You could try CD-HIT to cluster the reads.

**Richard Finney** · 03-13-2015, 10:24 AM

Try using the command-line utilities cat, sort and uniq.

Example:
# get unique reads for 1, filtering out read names (lines starting with >)
cat 1.fa | grep -v "^>" | sort | uniq > 1.tmp
# get unique reads for 2
cat 2.fa | grep -v "^>" | sort | uniq > 2.tmp
# get reads common to 1 and 2
cat 1.tmp 2.tmp | sort | uniq -d

sort takes a buffer-size ("more RAM") parameter, -S in GNU sort, which helps if the data files are large.
Check out the manual using "man sort" for details.
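
A minimal sketch of the same pipeline with an explicit sort buffer, assuming GNU sort (where `-S` sets the buffer size and `-u` does the work of `uniq`). The toy reads and the `64M` value are placeholders, and the input keeps one sequence per line, which this approach relies on.

```shell
# Work in a scratch directory so nothing real is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Toy input: single-line FASTA records (hypothetical data).
printf '>r1\nACGT\n>r2\nGGCC\n>r3\nACGT\n' > 1.fa
printf '>s1\nTTAA\n>s2\nACGT\n' > 2.fa

# Unique sequences per file; -S sets the sort buffer (GNU sort assumed).
grep -v '^>' 1.fa | sort -S 64M -u > 1.tmp
grep -v '^>' 2.fa | sort -S 64M -u > 2.tmp

# Sequences present in both files: after merging the two unique lists,
# anything that appears twice must come from both.
common=$(sort -S 64M 1.tmp 2.tmp | uniq -d)
echo "$common"
```

With the toy data above, the only shared sequence is ACGT. For real data the buffer size would be raised to whatever the machine can spare.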

**vivek_** · 03-13-2015, 12:35 PM

BLAT appears to be the easiest and most straightforward way, right?

**Richard Finney** · 03-13-2015, 12:44 PM

Check out bl2seq ...

Nucleotide BLAST: Align two or more sequences using BLAST

http://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastSearch&BLAST_SPEC=blast2seq&LINK_LOC=align2seq

blast2seq,Align two sequences using BLAST (bl2seq)

There's a command-line version if you're into that kind of stuff.

**Hel** · 05-18-2015, 01:25 AM

Thanks to all, I am very grateful for your help,

This is my opinion:

(i) CD-HIT seems interesting, but I have not tested it yet.

(ii) Using the 'cat', 'sort' and 'uniq' commands on FASTA files is a serious mistake, because they compare each line instead of each sequence (which can span multiple lines).

(iii) 'BLAT' needs a reference genome, and I do not have one.

(iv) 'bl2seq' does not support large files, so the suggestion is to use BLAST+ instead. That makes it the same as running a local BLAST with BLAST+.

Is this correct?

**GenoMax** · 05-18-2015, 03:11 AM

Originally posted by Hel:

(iii) 'BLAT' needs a reference genome, and I do not have such.

Blat does not need a reference genome. In fact, you use blat with just two files (which can be single sequences or multi-fasta files). The first file on the command line serves as the "database" and the second as the "query". So in your case you will be blatting a sequence (actually many of them, one after another) against one "database" file (or the whole lot of files concatenated together). Ideally the sequence itself will be the top hit. You may want to use tabular format to be able to parse the results easily.
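
As a rough sketch of that invocation, assuming `blat` is on the PATH: the database file comes first, the query second, and `-out=blast8` requests tabular output. The file names and sequences here are made up for illustration, and the script simply skips the run if `blat` is not installed.

```shell
# Work in a scratch directory so nothing real is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical toy sequences, long enough for BLAT's default tile size.
seq_a=ACGTTGCAAGCTTAGGCTAACCGGTTAACGTCAGTCGATCGGATCCATGCA
seq_b=TTGACCGGATACGATTCAGGCATCGATCGGCTAAGCTTGGCATCGATTACG

# First file acts as the "database", second as the "query".
printf '>d1\n%s\n>d2\n%s\n' "$seq_a" "$seq_b" > db.fa
printf '>q1\n%s\n' "$seq_a" > query.fa

if command -v blat >/dev/null 2>&1; then
    # blast8 is BLAT's tab-separated output format.
    blat db.fa query.fa -out=blast8 hits.tsv
    status=ran
else
    echo "blat not found on PATH; skipping the alignment step"
    status=skipped
fi
echo "$status"
```

If the run succeeds, `hits.tsv` holds one tab-separated line per hit, which can be filtered with the usual tools such as `awk` or `sort`.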

**rhinoceros** · 05-18-2015, 04:01 AM

Originally posted by Hel:

(ii) Using the 'cat', 'sort' and 'uniq' commands on FASTA files is a serious mistake, because they compare each line instead of each sequence (which can span multiple lines).

You could remove the line breaks within the sequences and then continue as Richard advised.

Code:

awk '!/^>/ { printf "%s", $0; n = "\n" } /^>/ { print n $0; n = "" } END { printf "%s", n }' file.fa > out.fa
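
Putting the two suggestions together might look like the sketch below: join the wrapped lines with the awk one-liner, then compare the sequences. The toy files, the `linearize` helper name and the use of `comm -12` in place of `uniq -d` are illustrative choices, not something from the posts above.

```shell
# Work in a scratch directory so nothing real is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Toy multi-line FASTA files (hypothetical data). The sequence ACGTACGT
# is present in both files but wrapped differently in each.
printf '>a1\nACGT\nACGT\n>a2\nGGGG\nCC\n' > 1.fa
printf '>b1\nACGTAC\nGT\n>b2\nTTTT\n' > 2.fa

# Join wrapped sequence lines so each record becomes one header line
# followed by one sequence line (same awk program as above).
linearize() {
    awk '!/^>/ { printf "%s", $0; n = "\n" } /^>/ { print n $0; n = "" } END { printf "%s", n }' "$1"
}

# Unique sequences per file, headers dropped.
linearize 1.fa | grep -v '^>' | sort -u > 1.tmp
linearize 2.fa | grep -v '^>' | sort -u > 2.tmp

# comm -12 keeps only the lines found in both sorted inputs.
common=$(comm -12 1.tmp 2.tmp)
echo "common: $common"

# For contrast: the same comparison on the raw, still-wrapped lines.
naive=$( { grep -v '^>' 1.fa | sort -u; grep -v '^>' 2.fa | sort -u; } | sort | uniq -d)
echo "naive: $naive"
```

On this toy input the linearized comparison reports ACGTACGT, while the line-based one reports nothing, which is the problem Hel pointed out with multi-line records.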


Compare fasta files
