Seqanswers Leaderboard Ad

**SNPsaurus** · 05-12-2016, 03:57 PM

Can you just split on " " and print the second element of the split array? Or split on '|' and take the last element. It just depends on how standard the formatting is of the header.
@header_split = split(" ",$line);
$changed_line = $header_split[1];

**kmcarr** · 05-13-2016, 06:53 AM

Originally posted by Katty1 View Post

I did a script in Perl that breaks several sequences of a multifasta file, but I need remove a part of string of header.

For example:

input file:

Code:

>gi|983431797|ref|NZ_LN868938.1| Nocardia farcinica genome assembly NCTC11134, chromosome : 1
CTGACTGGGAGTACGAAGGCCGCCTGCACAAGACAACGGGGCAGCGAACCTTCTTCTGCACCGGCACGGA
CGACGCCGAGATGCCTCGACCTGGAGAACCTCGGCCGCGGCGAACCGCTCGCCCATGTCCGCGCCGAGTT

Output file:

Code:

>Nocardia farcinica genome assembly NCTC11134, chromosome : 1
CTGACTGGGAGTACGAAGGCCGCCTGCACAAGACAACGGGGCAGCGAACCTTCTTCTGCACCGGCACGGA
CGACGCCGAGATGCCTCGACCTGGAGAACCTCGGCCGCGGCGAACCGCTCGCCCATGTCCGCGCCGAGTT

Katty,

I would caution you that what you plan to do is potentially problematic. The generally accepted format for FASTA file deflines is that the first word after the ">" represents the unique identifier for the sequence. The "first word" is defined as everything up to the first "whitespace" which may be a space or tab character. Everything that comes after that is optional description text. If you also have included in your analysis:

Code:

>gi|873551602|emb|LN868939.1| Nocardia farcinica genome assembly NCTC11134, plasmid : 2
GGCTTTGTGCCCGCCGAAAAAAGGTTGCCTATGTCCAAGCCTGCATTTACCGAAATCGACCGAATGACGG
GCGGAGGGCGGAGTAATCGCACCCGCCCACCGGTCAACTTCCTTCTTCACACCGAGGAAGGAAACTCGAG...

Which you also edit to:

Code:

>Nocardia farcinica genome assembly NCTC11134, plasmid : 2
GGCTTTGTGCCCGCCGAAAAAAGGTTGCCTATGTCCAAGCCTGCATTTACCGAAATCGACCGAATGACGG
GCGGAGGGCGGAGTAATCGCACCCGCCCACCGGTCAACTTCCTTCTTCACACCGAGGAAGGAAACTCGAG...

You have two sequences as part of your analysis which share the same ID, "Nocardia".

**GenoMax** · 05-13-2016, 07:11 AM

One alternative would be to change all spaces to "_" so that you have a long string (that should stay unique) for each fasta header. It would be cumbersome but would at least avoid the problem @kmcarr pointed out.

Topics	Statistics	Last Post
ASHG 2024 Highlights – Part Two by seqadmin Started by seqadmin, Today, 11:09 AM	0 responses 24 views 0 likes	Last Post by seqadmin Today, 11:09 AM
ASHG 2024 Highlights – Part One by seqadmin Started by seqadmin, Today, 06:13 AM	0 responses 20 views 0 likes	Last Post by seqadmin Today, 06:13 AM
Seq-Scope Expands Possibilities for High-Resolution Gene Expression Analysis by seqadmin Started by seqadmin, 11-01-2024, 06:09 AM	0 responses 30 views 0 likes	Last Post by seqadmin 11-01-2024, 06:09 AM
New Model Aims to Explain Polygenic Diseases by Connecting Genomic Mutations and Regulatory Networks by seqadmin Started by seqadmin, 10-30-2024, 05:31 AM	0 responses 21 views 0 likes	Last Post by seqadmin 10-30-2024, 05:31 AM

Seqanswers Leaderboard Ad

Announcement

How do I remove a part of the header of a fasta file in perl?

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News