**hugorody** · 08-03-2014, 09:30 PM

just use grep

Why not just use grep?

cat filein.fasta | grep '>' --after-context=1 > fileout.fasta

**Shorash** · 08-03-2014, 10:04 PM

Originally posted by hugorody:

Why not just use grep?

cat filein.fasta | grep '>' --after-context=1 > fileout.fasta

Sorry, what exactly can I use this for?

**Shorash** · 08-03-2014, 10:09 PM

Originally posted by Shorash:

Sorry, what exactly can I use this for?

I don't think I could use this for multiple extractions (up to 200-300)? It also doesn't seem to extract the full sequence following the contig name.

**hugorody** · 08-03-2014, 10:12 PM

Originally posted by Shorash:

Sorry, what exactly can I use this for?

Could you explain a bit more about what exactly you need to do?

Based on your post, I think you just want to capture FASTA sequences from a file. Is that right?

**hugorody** · 08-03-2014, 10:15 PM

grep will catch every line that contains the symbol ">", plus one line of context after each match.
You can use this for hundreds, even thousands, of sequences.

**Shorash** · 08-03-2014, 10:17 PM

Originally posted by hugorody:

Could you explain a bit more about what exactly you need to do?

Based on your post, I think you just want to capture FASTA sequences from a file. Is that right?

OK, I have a FASTA file containing nucleotide sequences with two different kinds of heading, e.g.:

Heading 1:
>comp35107_c0_seq1
GTGGCAGGGACCAGAGCAAGCAGTTCTTCACAGACTTGTGGGAGTTCAGCCTGAAGGACC
TTGAGTGGAAGGACAAGAGTCAACTCATCATTAGTGATGTGGCAGGCATGGTGCCCAGTG
GCCGAAGTGGCGCCTCCACATGGGTCGGAAAGGATCAAGCACTCTACATGTTTGGTGGAA
ACACTGTGGTCCGCACAGACTCAGGTCTCCGCAGCGGCATTGGATATGGAGCTGATCTCT
GGAGGATGTCCACAAACAACCACAGCTGGCAGCTTTTGTCAGGCACTACAAAACCTGGGA
CTCCAGCCAAGTTTGGTCGCCTTGGGGAGTACACTATAATGAGTCAGCCTGGCAGTCGGT
GTGGGGCCATCACCTGGGTGGACACAGCCGGCAACCTGTGGATGTTTGGCGGTGATGGCA
CAGACACAAGTCTTCCTTCTCCCTACCACGCATCACTGCTGCTCTCTGACCT

Heading 2:
>BN2_l1_1_(paired)_merged_contig_20016
TCTCTCTCTCTCTCTCTGTGCCTATTCACATATCTCTTTTTTGTGCGTCTTCTCCTCTAA
ACCACTGCAATAAAACTGTCCGAGTGCAGTCTCTGTCGGGACGCTGATGAAGGGAGGCTG
GGGAGATGGAGAGAGGAGATGACACCCCCAGGTCCTGATTAAGCTGAGAGCTATTGCCGT
AATGGACTAAAAGCACACGGGCGCCGTATTTCCGCTCNNCCGCTCAGACTCCATCCGCTT
TATTCGGGACTTCGATGAGATGAAAGTCCTCGTTGCATTACGCCAATTTGATTACGGCAC
TGATTTGACCCTGCAAACGAACCCCTGCAACTTCAGGAGTGCTCGCCCAATTGGGGTTGG
CACGCTGTGGAACGCTCGAGGCACCGTGGGCAGCCGCCAGACCTTCGGTCTCCAATCTGC
AACGCCGTGGCAGGTGGAATTACAAGGAAATGGACACTCGAACCTCTTTGTGTCAGGAGC
AGATTGCTTGCGGCTGTGGGATTTATTGTAGG

I need a script to extract many of these sequences using headers like the two shown above. Basically, I will copy a large number of such headers from Excel, and I want the script to extract all the corresponding sequences. For some reason, my current script, which I outlined in the opening post, only extracts sequences with the first kind of header.

Hope that clarifies a little.

**hugorody** · 08-03-2014, 10:27 PM

OK. You are using Linux, right?

So, create a list.txt file with all the headers you need, one per line:

header1
header2
...

Then type in the shell:

$ cat name_your_fasta_file.fasta | grep --file=list.txt --after-context=1 > my_sequences.fasta

And that's it.
You don't need a script.
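
For sequences that wrap over several lines, or for header names that are prefixes of other names (e.g. `comp35107_c0_seq1` and `comp35107_c0_seq10`), one line of context and grep's substring matching may not be enough. A minimal sketch with awk, assuming list.txt holds one header name per line (with or without the leading `>`); the function name `extract_fasta` is only for illustration:

```shell
# extract_fasta LIST FASTA
# Print every record in FASTA whose header name (the text after ">" up to
# the first whitespace) is listed in LIST, including all sequence lines.
extract_fasta() {
    awk '
        { sub(/\r$/, "") }                  # drop Windows carriage returns, if any
        NR == FNR {                         # first file: the list of wanted headers
            sub(/^>/, "")                   # accept names with or without ">"
            gsub(/^[ \t]+|[ \t]+$/, "")     # trim stray spaces or tabs
            if ($0 != "") wanted[$0] = 1
            next
        }
        /^>/ { keep = (substr($1, 2) in wanted) }   # header line: decide for this record
        keep                                # print lines while the current record is wanted
    ' "$1" "$2"
}
```

It could be called as `extract_fasta list.txt name_your_fasta_file.fasta > my_sequences.fasta`. Header names are compared exactly rather than as patterns, so characters such as parentheses need no escaping.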

**rhinoceros** · 08-03-2014, 11:56 PM

Originally posted by hugorody:

OK. You are using Linux, right?

So, create a list.txt file with all the headers you need, one per line:

header1
header2
...

Then type in the shell:

$ cat name_your_fasta_file.fasta | grep --file=list.txt --after-context=1 > my_sequences.fasta

And that's it.
You don't need a script.

It doesn't work if there are line breaks in the sequences, as in the ones the OP posted. You have to deal with those first.
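
One way to deal with the line breaks first is to join each record's sequence onto a single line, so that every header is followed by exactly one sequence line. A minimal sketch with awk; `unwrap_fasta` is a hypothetical helper name, not an existing tool:

```shell
# unwrap_fasta FASTA
# Rewrite FASTA so that each header is followed by its whole sequence on one line.
unwrap_fasta() {
    awk '
        { sub(/\r$/, "") }                          # drop Windows carriage returns, if any
        /^>/ { if (n++) printf "\n"; print; next }  # end the previous sequence, print the header
        n    { printf "%s", $0 }                    # append sequence lines without a newline
        END  { if (n) printf "\n" }                 # terminate the last sequence
    ' "$1"
}
```

The result can then be piped into the grep command quoted above, e.g. `unwrap_fasta name_your_fasta_file.fasta | grep --file=list.txt --after-context=1`. Note that GNU grep prints a `--` line between non-adjacent groups of matches when context is requested; its `--no-group-separator` option suppresses that.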

**SylvainL** · 08-04-2014, 02:18 AM

Hi, just use R and the seqinr package. You will find the read.fasta function...

**hugorody** · 08-04-2014, 11:21 AM

Originally posted by rhinoceros:

It doesn't work if there are line breaks in the sequences, as in the ones the OP posted. You have to deal with those first.

**Shorash** · 08-04-2014, 04:49 PM

Originally posted by hugorody:

Hi there,

I receive the following error when attempting to remove the line breaks:

:~/Extract_fasta/Orthologs> $cat BN_clc.fasta | sed 's/ //g' | sed 's/$>.*$/\1 /g' | sed ':a;N;s/\n//g;ta' | sed 's/>/\n>/g' | sed 's/ /\n/g' > BN_Fixed.fasta
./BN_clc.fasta: line 1: syntax error near unexpected token `('
'/BN_clc.fasta: line 1: `>BN2_l1_1_(paired)_merged_contig_4
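
One possible reading of this error, offered as an assumption rather than a confirmed diagnosis: the `$` in front of `cat` was meant as the shell prompt, not as part of the command. Typed literally, `$cat` is the expansion of a variable named `cat`; if that variable is unset, it expands to nothing and the next word (here the FASTA file name) becomes the command to run, which would fit an error that reports `line 1` of BN_clc.fasta. The effect can be reproduced harmlessly with `echo` standing in for the file:

```shell
unset cat    # make sure no variable named "cat" is set
# The unquoted, unset $cat expands to nothing, so the shell runs echo instead.
result=$($cat echo "echo ran as the command")
printf '%s\n' "$result"
```

If that is what happened, typing the pipeline without the leading `$` should at least get past this particular error.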

**GenoMax** · 08-04-2014, 05:00 PM

faSomeRecords from the Kent utilities would be the simplest/fastest solution (http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/).

Extract sequence from multi fasta file with PERL - SEQanswers: http://seqanswers.com/forums/showthread.php?t=9498

https://www.biostars.org/p/2822/

https://www.biostars.org/p/1195/

**hugorody** · 08-04-2014, 05:04 PM

Originally posted by Shorash:

Hi there,

I receive the following error when attempting to remove the line breaks:

:~/Extract_fasta/Orthologs> $cat BN_clc.fasta | sed 's/ //g' | sed 's/$>.*$/\1 /g' | sed ':a;N;s/\n//g;ta' | sed 's/>/\n>/g' | sed 's/ /\n/g' > BN_Fixed.fasta
./BN_clc.fasta: line 1: syntax error near unexpected token `('
'/BN_clc.fasta: line 1: `>BN2_l1_1_(paired)_merged_contig_4

Which Linux distribution do you use?
You should try using double quotes ( " ) instead of single quotes ( ' ).

**Shorash** · 08-04-2014, 05:14 PM

Originally posted by hugorody:

Which Linux distribution do you use?
You should try using double quotes ( " ) instead of single quotes ( ' ).

I'm using a portable batch system (PBS).

Extract fasta script