**gringer** · 10-31-2013, 01:39 AM

Now the problem is the PC I'm currently testing this process on isn't powerful, and as a results I want to filter the data so that >seq_example which occurs the most times is at the time of a list, with an additional identify in the time telling me how many times its occur, but then to delete the duplicate.

Your sentences get steadily more difficult for me to understand, such that I can't really work out what you're wanting to do here. So I'll have a go with a few example scripts.

My guess is that you're looking for something like the following, which will produce counts for the unique lines in a file:

Code:

cat file.txt | sort | uniq -c

That should be fine (time-wise) for up to about 20M lines sorted. The big choke point is the sort, which will generate loads of temporary files if the files are too large to be sorted entirely in memory.
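To make the output format concrete, here is a minimal sketch on a toy input. The function name `count_lines` and the sample lines are made up for illustration. Note that `uniq -c` only merges identical lines that sit next to each other, which is why the `sort` has to come first.

```shell
# count_lines: hypothetical helper wrapping the pipeline above.
# Reads lines on stdin and prints "<count> <line>" for each distinct line.
count_lines() {
  # LC_ALL=C asks sort for plain byte order, so the result does not
  # depend on the locale of the machine it runs on.
  LC_ALL=C sort | uniq -c
}

# Toy input: "AAA" occurs twice, "CCC" once.
printf 'CCC\nAAA\nAAA\n' | count_lines
```

The counts come out padded with leading spaces, so anything that parses the result should treat runs of spaces as one separator. If the sort spills to disk on a small machine, GNU sort's `-S` (buffer size) and `-T` (temporary directory) options may be worth a look.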

If it's a fasta file with one sequence per line and you just want the sequences, then you can exclude the header lines:

Code:

cat file.fasta | grep -v '^>' | sort | uniq -c

If you just want the sequence names, then it changes to this:

Code:

cat file.fasta | grep '^>' | sort | uniq -c

[i.e. no '-v']
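The commands above assume one sequence per line. If the file wraps sequences over several lines, something like the following sketch could join them first. The helper name `linearize_fasta` is made up here and is not part of any toolkit.

```shell
# linearize_fasta: hypothetical helper.
# Joins wrapped sequence lines so that each record is exactly two lines:
# a ">" header followed by the whole sequence.
linearize_fasta() {
  awk '
    /^>/ { if (seq != "") print seq; print; seq = ""; next }
         { seq = seq $0 }
    END  { if (seq != "") print seq }
  '
}

# Toy input: Seq1 is wrapped over two lines, Seq2 is already on one line.
printf '>Seq1\nATTG\nGAAA\n>Seq2\nGAAAGC\n' | linearize_fasta
```

The output of this can be piped straight into the `grep -v '^>' | sort | uniq -c` step.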

**BioEd1701** · 10-31-2013, 02:01 AM

Thanks for your response, and I'm sorry for my wording, which probably made this more confusing than it should be. But I think you got what I was going for.

So it is currently in fasta format:

>Seq1
ATTGGAAAAATACATAAAAAATAATACATACAAAAAGTCTCTATAGAGTA
>Seq2
GAAAGCCAAAGGTATAATTAGACAAGCAGACGGTTACTCTCACATCCTGA

This file contains a complete list of all the sequences, including their repeats.

So if I use the command:

Code:

cat file.fasta | grep '^>' | sort | uniq -c

Will this tell me which sequences are repeated? Will this also tell me how many times they're repeated?

Because then I want to sort them so that the sequence that occurs the most is at the top of the fasta file, the second one down is the sequence that occurs the second most often, and so on.

That way, when I put the fasta file through meme, I can cut off any sequence that occurs fewer than 40 times (for example).

**yueluo** · 10-31-2013, 02:24 AM

Originally posted by BioEd1701

Will this tell me which sequences are repeated? Will this also tell me how many times they're repeated?

Because then I want to sort them so that the sequence that occurs the most is at the top of the fasta file, the second one down is the sequence that occurs the second most often, and so on.

That way, when I put the fasta file through meme, I can cut off any sequence that occurs fewer than 40 times (for example).

**BioEd1701** · 10-31-2013, 02:44 AM

Thank you for your help guys, but it doesn't seem to work correctly. The output file just comes out like this:

Code:

1 >Seq1
1 >Seq10
1 >Seq100
etc....

Maybe it's something to do with the input fasta file being in this format:

Code:

>Seq1
ATTGGAAAAATACATAAAAAATAATACATACAAAAAGTCTCTATAGAGTA
>Seq2
GAAAGCCAAAGGTATAATTAGACAAGCAGACGGTTACTCTCACATCCTGA
>Seq3
ATTGGAAAAATACATAAAAAATAATAGATACAAAAAGTCTCTATAGAGTA
>Seq4
GAAAGCCAAAGGTATAATTAGACAAGCAGACGGTTACTCTCACATCCTGA

So, for instance, Seq4 is the same as Seq2, so what I would like the output to look like, if possible, is:

Code:

>Seq2 2Occur
GAAAGCCAAAGGTATAATTAGACAAGCAGACGGTTACTCTCACATCCTGA
>Seq1 1Occur
ATTGGAAAAATACATAAAAAATAATACATACAAAAAGTCTCTATAGAGTA
>Seq3 1Occur
ATTGGAAAAATACATAAAAAATAATAGATACAAAAAGTCTCTATAGAGTA

In other words, all the duplicates are deleted (keeping one copy of each), and the sequence that occurs most often is put at the top of the file. The occurrence count in the header isn't a necessity, but it would be nice to have.

I hope I've explained it better this time.
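For what it's worth, the output described above could be produced with a sketch along these lines. It assumes one sequence per line, keeps the header of the first copy of each sequence, and borrows the "Occur" suffix from the example. The function name `dedup_fasta` is made up for illustration.

```shell
# dedup_fasta: hypothetical helper. Reads two-line FASTA records on stdin,
# collapses identical sequences, and prints the most frequent ones first.
dedup_fasta() {
  awk '
    /^$/ { next }                          # ignore blank lines
    /^>/ { name = substr($0, 2); next }    # remember the current header
    {
      if (!($0 in count)) {                # first time this sequence is seen
        first[$0] = name
        order[$0] = ++n
      }
      count[$0]++
    }
    END {
      for (s in count)
        printf "%d\t%d\t%s\t%s\n", count[s], order[s], first[s], s
    }
  ' |
  # Highest count first; ties keep the order of first appearance.
  LC_ALL=C sort -k1,1nr -k2,2n |
  awk -F '\t' '{ print ">" $3 " " $1 "Occur"; print $4 }'
}

# The four records from the post above.
dedup_fasta <<'EOF'
>Seq1
ATTGGAAAAATACATAAAAAATAATACATACAAAAAGTCTCTATAGAGTA
>Seq2
GAAAGCCAAAGGTATAATTAGACAAGCAGACGGTTACTCTCACATCCTGA
>Seq3
ATTGGAAAAATACATAAAAAATAATAGATACAAAAAGTCTCTATAGAGTA
>Seq4
GAAAGCCAAAGGTATAATTAGACAAGCAGACGGTTACTCTCACATCCTGA
EOF
```

This holds every distinct sequence in memory inside awk, so on a low-powered PC with a very large file the `sort | uniq -c` route may be the safer choice.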

Sorry, I've found a way to get what I wanted: I needed to add the -v. Thank you guys, you've all been very helpful!

**BioEd1701** · 11-01-2013, 02:30 AM

Sorry for the double post, but I didn't want to make a new thread.

So I'm happy to report that thanks to yueluo and gringer's help it all works nice and happily, so again I can't thank you guys enough.

I'm currently in the process of writing instructions for a friend who has even less understanding of Linux than I do, and I'm trying to reduce the number of commands to as few as possible so he can just copy and paste them into the terminal.

The data is from an Ion Torrent, I believe. What I do is open it up in UGENE, search for (primerSeq1)(random region)(primerSeq2), and then pull out the annotations as a csv file.

Then in Excel I delete all the primers 1/2, and then delete any sequences that are longer than the random region size, which is 50 in this case.

I then use a ruby command in Windows (as I can't get it to work in Linux):

Code:

ruby -ne 'puts ">" + $_.split(",").first(2).join("\n")' FILE.CSV > FILE.fasta

This then generates a file in fasta format:

Code:

>Seq1
ATTGGAAAAATACATAAAAAATAATACATACAAAAAGTCTCTATAGAGTA 
>Seq2
GAAAGCCAAAGGTATAATTAGACAAGCAGACGGTTACTCTCACATCCTGA 
>Seq3
ATTGGAAAAATACATAAAAAATAATAGATACAAAAAGTCTCTATAGAGTA 
>Seq4
GAAAGCCAAAGGTATAATTAGACAAGCAGACGGTTACTCTCACATCCTGA

I then run your command in Linux on the file:

Code:

cat file.fasta | grep -v '^>' | sort | uniq -c | sort -k1nr > file.csv

I then have to reopen the file in Excel and run a macro I've recorded, because the occurrence count and the sequence are now in the same cell.

The macro copies the list onto two separate sheets: in the first sheet it deletes all of the numbers, and in the second it deletes all the letters. It then combines the two, so you get:

Column 1    Column 2
450         GAAAGCCAAAGGTATAATTAGACAAGCAGACGGTTACTCTCACATCCTGA
210         GAAAGCCAAAGGTATAATTAGACAAGCAGACGGTTACTCTCACATCCTGA

and so on. The macro then gives them all sequence numbers again (Seq1 and so on) in column 3, and in column 4 it just puts "Count =". Then in column 5 it places:

Code:

=(C1&" "&D1&" "&A1)

So in E1 you have "Seq1 Count = 450".

So then I copy this into column A as text and delete columns C and D.

I then run it back through ruby, and finally I'm ready to use it in meme.

I'm looking at Python as a way to do this, but it's not easy to work out. Any suggestions to make this a bit simpler would be so helpful.

**gringer** · 11-01-2013, 03:33 AM

Originally posted by BioEd1701

open it up in ugene and search for (primerSeq1)(random region)(primerSeq2) and then pull out the annotations as a csv file.... I then run your command in linux on the file

I'm going to (for now) step slowly backwards and away from the pre-processing, and just concentrate on the next steps:

Code:

cat file.fasta | grep -v '^>' | sort | uniq -c | sort -k1nr > file.csv

I then have to reopen the file in Excel and run a macro I've recorded, because the occurrence count and the sequence are now in the same cell.

Er, that's weird. Excel should be able to handle this fine if you tell it the file is 'space' delimited, rather than 'comma' delimited. Of course, if you really want a comma-delimited file, then replace spaces with commas:

Code:

# uniq -c pads the counts with leading spaces, so strip those first
cat file.fasta | grep -v '^>' | sort | uniq -c | sort -k1nr | \
  perl -pe 's/^ +//; s/ /,/g' > file.csv

or tabs:

Code:

# uniq -c pads the counts with leading spaces, so strip those first
cat file.fasta | grep -v '^>' | sort | uniq -c | sort -k1nr | \
  perl -pe 's/^ +//; s/ /\t/g' > file.csv

But you're using a macro to do odd stuff (apologies if I don't quite get it right) that is probably quicker and easier to do in awk, something like this:

Code:

cat file.fasta | grep -v '^>' | sort | uniq -c | sort -k1nr | \
  awk '{print "Seq" FNR " Count = " $1, $2}' > ruby.input

That should print "Seq" + <currentLineNumber> + " Count = " + <count>, then <sequence>, with the sequence separated by awk's output field separator (space by default).
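As a variation, awk could also write the FASTA records directly, which might make the second pass through ruby unnecessary. This is only a sketch: the header layout is an assumption based on the "Seq1 Count = 450" label described earlier, and the function name `label_counts` is made up.

```shell
# label_counts: hypothetical helper. Reads "<count> <sequence>" lines
# (uniq -c output, already sorted by count) and writes FASTA records.
label_counts() {
  # awk ignores the leading padding that uniq -c adds, so $1 is the
  # count and $2 is the sequence.
  awk '{ print ">Seq" FNR " Count = " $1; print $2 }'
}

# Toy input standing in for: ... | sort | uniq -c | sort -k1nr
printf '    450 GAAAGC\n    210 ATTGGA\n' | label_counts
```

In the real pipeline this would replace the final awk step, with the output redirected to a .fasta file.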

**gringer** · 11-01-2013, 03:45 AM

Originally posted by BioEd1701

The data is from an ion torrent I believe, and what I do is open it up in ugene and search for (primerSeq1)(random region)(primerSeq2) and then pull out the annotations as a csv file.

Then in excel I delete all the primers 1/2 and then delete any sequences that have more than the random region size which is 50 in this case.

I then use a ruby command in windows, (as i can't get it to work in linux):

Code:

ruby -ne 'puts ">" + $_.split(",").first(2).join("\n")' FILE.CSV > FILE.fasta

Okay, so here's my attempt at that:

Code:

cat input.file | perl -ne 'if(/(<primerSeq1>)(.{1,50})(<primerSeq2>)/){
    print ">Seq".(++$count)."\n$2\n";
  }'

[ruby can probably do something similar to that in a similar amount of code]

In summary, look for lines where primer1 and primer2 are separated by between 1 and 50 characters (should the reverse complement be included as well?). For these lines, print a FASTA record made up of a dummy sequence ID and the middle sequence.

If you're *only* using this for the counting script, then you can save a bit of plastic [piping] by only printing the sequence, and then skip out the 'grep' step in the subsequent stages.


Removing repeats but accounting for them
