**Apexy** · 10-30-2012, 04:19 AM

Hello cdlam,
Assuming your sequences are in a file named file.fa, run this at the command prompt:
perl -e 'while($id = <>){$seq = <>; $seq = substr($seq,0,10); print $id.$seq."\n"}' file.fa
You should get this:
>Fasta format Identification stuff
ACTTGTACAT
>Fasta format Identification stuff
ACTTGTACAT
>Fasta format Identification stuff
ACTTGTACAT

Enjoy!

MBandi

**pbluescript** · 10-30-2012, 05:23 AM

awk solution:

awk '{if (/^>/)
print $0
else
print(substr($1,1,10))
}' file.fasta
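
Both one-liners above appear to assume that each record's sequence sits on a single line: the Perl version reads strictly alternating header and sequence lines, and the awk version trims every non-header line to 10 characters, so a wrapped sequence would give 10 bases per wrapped line. Below is a minimal Python sketch that joins wrapped lines first, assuming standard FASTA with `>` headers. The function names `read_fasta` and `first_bases` are made up for illustration and do not come from the thread.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def first_bases(lines, n=10):
    """Return FASTA text with each sequence cut to its first n bases."""
    return "".join(f"{h}\n{s[:n]}\n" for h, s in read_fasta(lines))
```

To use it on a file, pass an open file handle, for example `first_bases(open("file.fa"))`, and write the returned string wherever it is needed.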

**cdlam** · 10-30-2012, 07:44 AM

Thanks for the replies. I'm not exactly sure how to do the awk solution; is that run from the command prompt?

Apexy, that seems to have worked. However, an unforeseen problem: one of my files is named "file.fasta" but apparently isn't in FASTA format. It looks like this:

scaffold_8 2234626-2234945 (-) GATCCGTAAGgtgaccacttccagcggcctggtgaagcgcctgatcgaggtgcagacgacgaccgtgacgaagaccgtggcgggcgagacgagcacgactgtggagacggagactcgcgaggagatcgatgacagcagtgccacgagctcgaccacgacgacgtcggatgcgactctggttagctcgtcgacgacgcacgagacggacgaggacggcgctgctggcgcgacggtgaagaccgaggtgttccgtacgatgggtgccaacggcagcgtcatcaccaagacggttcgcacgacgatccgtaagGTGACCACTT asmbl_6481

Except it's just in one line when I open it:
scaffold_8 2234626-2234945 (-) GATCCGTAAG.........

It does span multiple lines per entry, but the breaks seem unpredictable; it's weird. I don't suppose anyone has any ideas for getting this into the proper format?

**pbluescript** · 10-30-2012, 07:54 AM

cdlam, the awk script should be run from the command line.

Is the format of the weird fasta file consistent? Does the sequence name always start with "scaffold"? Do the sequences always start with capital letters?

**cdlam** · 10-30-2012, 08:03 AM

Hey,

Yes, it seems consistent. The beginning of each one is "scaffold_8 2234626-2234945 (-) ", with either a (+) or a (-) depending on what strand it came from. So it will always go ( ) GTGAA.... and each entry ends with "asmbl_####".

So,

scaffold_# ###-##### ( ) SEQUENCE-DATA asmbl_###

It starts a new line for a new "scaffold_#...." and the "asmbl_###" is just on the end of the last line with sequence data.

Edit: Sorry, yes, there are always 10 capital letters at the beginning and 10 capital letters at the end of the sequence.

**pbluescript** · 10-30-2012, 08:23 AM

Start with:

Code:

scaffold_8      2234626-2234945 (-) GATCCGTAAGgtgaccacttccagcggcctggtgaagcgcctgatcgaggtgcagacgacgaccgtgacgaagaccgtggcgggcgagacgagcacgactgtggagacggagactcgcgaggagatcgatgacagcagtgccacgagctcgaccacgacgacgtcggatgcgactctggttagctcgtcgacgacgcacgagacggacgaggacggcgctgctggcgcgacggtgaagaccgaggtgttccgtacgatgggtgccaacggcagcgtcatcaccaagacggttcgcacgacgatccgtaagGTGACCACTT asmbl_6481

Run:

Code:

sed 's/^/>/g' input | sed 's/(*)\s/&\n/g' | sed 's/ as.*$//g' | sed 's/\s/ /g' > output

I broke the sed code down into piped steps to make it easier to understand.
's/^/>/g' adds a > to each line.
's/(*)\s/&\n/g' inserts a new line after the strand designation.
's/ as.*$//g' removes the asmbl_### portion. You could just move it instead if you want to keep it.
's/\s/ /g' replaces white space with a space. I added this since it seemed like some of the fields in the name line might have been separated by a tab.

End with:

Code:

>scaffold_8 2234626-2234945 (-)
GATCCGTAAGgtgaccacttccagcggcctggtgaagcgcctgatcgaggtgcagacgacgaccgtgacgaagaccgtggcgggcgagacgagcacgactgtggagacggagactcgcgaggagatcgatgacagcagtgccacgagctcgaccacgacgacgtcggatgcgactctggttagctcgtcgacgacgcacgagacggacgaggacggcgctgctggcgcgacggtgaagaccgaggtgttccgtacgatgggtgccaacggcagcgtcatcaccaagacggttcgcacgacgatccgtaagGTGACCACTT
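
The same conversion can be sketched in Python, which should also cope with an entry whose sequence is broken across several lines (the sed pipeline prefixes `>` to every line, so it expects one entry per line). This assumes each entry starts with `scaffold_`, carries a `(+)` or `(-)` strand field, and ends with an `asmbl_` tag, as described in the thread. The function name `scaffold_to_fasta` is hypothetical.

```python
import re

# One entry: name, coordinates, strand, sequence (possibly wrapped), asmbl tag.
ENTRY = re.compile(
    r"(scaffold_\S+)\s+(\d+-\d+)\s+\(([+-])\)\s+([A-Za-z\s]+?)\s+asmbl_\S+"
)


def scaffold_to_fasta(text):
    """Convert 'scaffold_N start-end (strand) SEQ asmbl_N' entries to FASTA."""
    out = []
    for name, coords, strand, seq in ENTRY.findall(text):
        seq = re.sub(r"\s+", "", seq)  # rejoin a sequence split across lines
        out.append(f">{name} {coords} ({strand})\n{seq}\n")
    return "".join(out)
```

Reading the whole file into one string, for example `scaffold_to_fasta(open("input").read())`, lets the pattern match across line breaks. Entries that deviate from the assumed layout are silently skipped, so it is worth comparing the number of output records with the number of `scaffold_` entries in the input.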

**cdlam** · 10-30-2012, 08:55 AM

That worked, thanks a lot!

Hmm, in terms of my original question, when using the

perl -e 'while($id = <>){$seq = <>; $seq = substr($seq,0,10); print $id.$seq."\n"}' file.fa

To get the ending sequence, I just used substr($seq,-20), which gave me the final 20 bases. Is there a way I can start 5 bases from the end and take the preceding 20 bases?

**Apexy** · 10-30-2012, 01:08 PM

I would suggest pragmatism: first extract the last 25 bases and put them in a file, then extract the first 20 from that file.
OR
perl -e 'while($id = <>){$seq = <>; $seq = substr($seq,-26, 20); print $id.$seq."\n"}' file.fa

**cdlam** · 10-30-2012, 02:21 PM

Haha, wow, I really feel like I should have been able to think of that myself.

Oh well.

Thanks a lot!

Extract partial sequence from FASTA record
