**Apexy** · 10-30-2012, 04:19 AM

Hello cdlam,
Assuming your sequences are in a file named file.fa, run this at the command prompt (it expects each sequence to sit on a single line right below its header):
perl -e 'while($id = <>){$seq = <>; $seq = substr($seq,0,10); print $id.$seq."\n"}' file.fa
You should get this:
>Fasta format Identification stuff
ACTTGTACAT
>Fasta format Identification stuff
ACTTGTACAT
>Fasta format Identification stuff
ACTTGTACAT
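
If a script is easier to follow than a one-liner, the same idea could be sketched in Python roughly like this. It is only a sketch under the same assumption as the perl command: every record is exactly two lines, a header followed by the whole sequence. The names `first_bases` and `n` are made up for illustration.

```python
def first_bases(lines, n=10):
    """Yield (header, first n bases) for strictly two-line FASTA records.

    Assumes header and sequence lines alternate, as the perl one-liner does.
    """
    it = iter(lines)
    for header in it:
        seq = next(it, "")  # the line right after the header (empty if missing)
        yield header.rstrip("\n"), seq.rstrip("\n")[:n]


# Small in-memory demo; with a real file, pass the open file handle instead.
demo = [">Fasta format Identification stuff\n", "ACTTGTACATGGGTTT\n"]
for header, prefix in first_bases(demo):
    print(header)
    print(prefix)
```

With a real file this would be something like `with open("file.fa") as handle: ...` and looping over `first_bases(handle)`.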

Enjoy!

MBandi

**pbluescript** · 10-30-2012, 05:23 AM

awk solution:

awk '{if (/^>/)
print $0
else
print(substr($1,1,10))
}' file.fasta
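
For comparison, here is a rough Python rendering of what the awk script does line by line. It is a sketch, and the function name `truncate_sequence_lines` is hypothetical. One thing it makes visible: every non-header line is cut to 10 characters, so if a sequence is wrapped over several lines you would get the first 10 bases of each line, not of each record.

```python
def truncate_sequence_lines(lines, n=10):
    """Mimic the awk script: pass headers through, cut other lines to n chars."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            yield line  # header line: printed unchanged
        else:
            fields = line.split()
            first = fields[0] if fields else ""  # awk's $1 (first field)
            yield first[:n]  # awk's substr($1, 1, 10)


# A wrapped record: both sequence lines get truncated independently.
demo = [">seq1\n", "ACTTGTACATGGGTTT\n", "CCCCCCCCCCCCAAAA\n"]
for out_line in truncate_sequence_lines(demo):
    print(out_line)
```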

**cdlam** · 10-30-2012, 07:44 AM

Thanks for the replies. I'm not exactly sure how to do the awk solution; is that run from the command prompt?

Apexy, that seems to have worked. However, there's an unforeseen problem: one of my files is "file.fasta" but apparently isn't in FASTA format. It's like:

scaffold_8 2234626-2234945 (-) GATCCGTAAGgtgaccacttccagcggcctggtgaagcgcctgatcgaggtgcagacgacgaccgtgacgaagaccgtggcgggcgagacgagcacgactgtggagacggagactcgcgaggagatcgatgacagcagtgccacgagctcgaccacgacgacgtcggatgcgactctggttagctcgtcgacgacgcacgagacggacgaggacggcgctgctggcgcgacggtgaagaccgaggtgttccgtacgatgggtgccaacggcagcgtcatcaccaagacggttcgcacgacgatccgtaagGTGACCACTT asmbl_6481

Except it's just in one line when I open it:
scaffold_8 2234626-2234945 (-) GATCCGTAAG.........

It does span multiple lines per entry, but the breaks seem unpredictable, which is weird. Does anyone have any ideas for getting this into the proper format?

**pbluescript** · 10-30-2012, 07:54 AM

cdlam, the awk script should be run from the command line.

Is the format of the weird fasta file consistent? Does the sequence name always start with "scaffold"? Do the sequences always start with capital letters?

**cdlam** · 10-30-2012, 08:03 AM

Hey,

Yes, it seems consistent. The beginning of each one is "scaffold_8 2234626-2234945 (-) " either with a (+) or (-) depending on what strand it came from. So it will always go ( ) GTGAA....and each line ends with "asmbl_####"

So,

scaffold_# ###-##### ( ) SEQUENCE-DATA asmbl_###

It starts a new line for a new "scaffold_#...." and the "asmbl_###" is just on the end of the last line with sequence data.

Edit: Sorry, yes there are always 10 capital letters at the beginning and 10 capital letters at the end of the sequence.

**pbluescript** · 10-30-2012, 08:23 AM

Start with:

Code:

scaffold_8      2234626-2234945 (-) GATCCGTAAGgtgaccacttccagcggcctggtgaagcgcctgatcgaggtgcagacgacgaccgtgacgaagaccgtggcgggcgagacgagcacgactgtggagacggagactcgcgaggagatcgatgacagcagtgccacgagctcgaccacgacgacgtcggatgcgactctggttagctcgtcgacgacgcacgagacggacgaggacggcgctgctggcgcgacggtgaagaccgaggtgttccgtacgatgggtgccaacggcagcgtcatcaccaagacggttcgcacgacgatccgtaagGTGACCACTT asmbl_6481

Run:

Code:

sed 's/^/>/g' input | sed 's/(*)\s/&\n/g' | sed 's/ as.*$//g' | sed 's/\s/ /g' > output

I broke the sed code down into piped steps to make it easier to understand. Note that \s and the \n in the replacement rely on GNU sed.
's/^/>/g' adds a > to the start of each line.
's/(*)\s/&\n/g' inserts a new line after the strand designation. It matches the closing parenthesis plus the whitespace after it, and & puts the match back in front of the new line.
's/ as.*$//g' removes the asmbl_### portion. You could just move it instead if you want to keep it.
's/\s/ /g' replaces each whitespace character with a space. I added this since it seemed like some of the fields in the name line might have been separated by a tab.

End with:

Code:

>scaffold_8 2234626-2234945 (-)
GATCCGTAAGgtgaccacttccagcggcctggtgaagcgcctgatcgaggtgcagacgacgaccgtgacgaagaccgtggcgggcgagacgagcacgactgtggagacggagactcgcgaggagatcgatgacagcagtgccacgagctcgaccacgacgacgtcggatgcgactctggttagctcgtcgacgacgcacgagacggacgaggacggcgctgctggcgcgacggtgaagaccgaggtgttccgtacgatgggtgccaacggcagcgtcatcaccaagacggttcgcacgacgatccgtaagGTGACCACTT
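
If sed isn't available, or you want stricter checking of the layout, the same conversion could be sketched in Python along these lines. This is a sketch, not a drop-in replacement: it assumes each record sits on one line in the layout from the example (name, coordinates, strand in parentheses, sequence, optional asmbl id), and `RECORD` and `to_fasta` are names made up here. Unlike the sed pipeline, it raises an error on lines that don't fit the layout instead of passing them through.

```python
import re

# Assumed layout, taken from the example line in this thread:
#   name  start-end  (strand)  SEQUENCE  [asmbl_id]
RECORD = re.compile(r"^(\S+)\s+(\S+)\s+(\([+-]\))\s+(\S+)(?:\s+(\S+))?\s*$")


def to_fasta(line):
    """Convert one 'scaffold coords (strand) SEQ asmbl_x' line to a FASTA record."""
    match = RECORD.match(line)
    if match is None:
        raise ValueError(f"unexpected line layout: {line!r}")
    name, coords, strand, seq, _asmbl = match.groups()
    # Header fields are joined with single spaces; the asmbl id is dropped.
    return f">{name} {coords} {strand}\n{seq}"


# Shortened demo line (tab after the name, as suspected in the real file).
print(to_fasta("scaffold_8\t2234626-2234945 (-) GATCCGTAAGgtgaccGTGACCACTT asmbl_6481\n"))
```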

**cdlam** · 10-30-2012, 08:55 AM

That worked, thanks a lot!

Hmm, in terms of my original question, when using

perl -e 'while($id = <>){$seq = <>; $seq = substr($seq,0,10); print $id.$seq."\n"}' file.fa

To get the ending sequence, I just used $seq,-20, which gave me the final 20 bases. Is there a way I can start 5 bases from the end and take the preceding 20 bases?

**Apexy** · 10-30-2012, 01:08 PM

I would suggest pragmatism: first extract the last 25 bases and put them in a file, then extract the first 20 from that file.
OR
perl -e 'while($id = <>){$seq = <>; $seq = substr($seq,-26, 20); print $id.$seq."\n"}' file.fa

**cdlam** · 10-30-2012, 02:21 PM

Haha, wow, I really feel like I should have been able to think of that myself.

Oh well.

Thanks a lot!

Extract partial sequence from FASTA record
