  • cdlam
    replied
    Haha, wow, I really feel like I should have been able to think of that myself.

    Oh well.


    Thanks a lot!


  • Apexy
    replied
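    I would suggest pragmatism: first extract the last 25 bases into a file, then extract the first 20 bases from that file.
    Or do it in one step (the offset is -26 rather than -25 because $seq still ends with its newline, assuming each sequence line ends with a single newline character):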
    perl -e 'while($id = <>){$seq = <>; $seq = substr($seq,-26, 20); print $id.$seq."\n"}' file.fa
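For anyone who prefers Python, the same window (the 20 bases that end 5 bases before the end of the sequence) could be sketched as below. The function name `window_before_tail` and its parameters are made up for illustration. Unlike the Perl one-liner, this version strips the newline first, so the offsets count bases only.

```python
def window_before_tail(seq: str, skip: int = 5, length: int = 20) -> str:
    """Return `length` bases that end `skip` bases before the end of `seq`."""
    seq = seq.strip()             # drop the trailing newline so offsets count bases only
    end = len(seq) - skip         # index just past the last base we want
    if end <= 0:                  # sequence is too short to skip `skip` bases
        return ""
    start = max(end - length, 0)  # clamp so short sequences return what is available
    return seq[start:end]


# Hypothetical example: 10 A's, then 20 C's, then a 5-base tail.
print(window_before_tail("A" * 10 + "C" * 20 + "GGGGG\n"))
```

Sequences shorter than 25 bases return fewer than 20 characters rather than wrapping around, which is a design choice of this sketch and not something the Perl version does.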


  • cdlam
    replied
    That worked, thanks a lot!

    Hmm, in terms of my original question, when using

    perl -e 'while($id = <>){$seq = <>; $seq = substr($seq,0,10); print $id.$seq."\n"}' file.fa
    to get the ending sequence, I just used $seq,-20, which gave me the final 20 bases. Is there a way to start 5 bases from the end and take the preceding 20 bases?


  • pbluescript
    replied
    Originally posted by cdlam
    Hey,

    Yes, it seems consistent. The beginning of each one is "scaffold_8 2234626-2234945 (-) " with either a (+) or a (-) depending on which strand it came from. So it will always go ( ) GTGAA.... and each line ends with "asmbl_####".

    So,

    scaffold_# ###-##### ( ) SEQUENCE-DATA asmbl_###

    It starts a new line for each new "scaffold_#...." and the "asmbl_###" is just on the end of the last line with sequence data.

    Edit: Sorry, yes, there are always 10 capital letters at the beginning and 10 capital letters at the end of the sequence.
    Start with:
    Code:
    scaffold_8      2234626-2234945 (-) GATCCGTAAGgtgaccacttccagcggcctggtgaagcgcctgatcgaggtgcagacgacgaccgtgacgaagaccgtggcgggcgagacgagcacgactgtggagacggagactcgcgaggagatcgatgacagcagtgccacgagctcgaccacgacgacgtcggatgcgactctggttagctcgtcgacgacgcacgagacggacgaggacggcgctgctggcgcgacggtgaagaccgaggtgttccgtacgatgggtgccaacggcagcgtcatcaccaagacggttcgcacgacgatccgtaagGTGACCACTT asmbl_6481
    Run:
    Code:
    sed 's/^/>/g' input | sed 's/(*)\s/&\n/g' | sed 's/ as.*$//g' | sed 's/\s/ /g' > output
    I broke the sed code down into piped steps to make it easier to understand.
    's/^/>/g' adds a > to the start of each line.
    's/(*)\s/&\n/g' inserts a new line after the strand designation.
    's/ as.*$//g' removes the asmbl_### portion. You could just move it instead if you want to keep it.
    's/\s/ /g' replaces each whitespace character with a space. I added this since it seemed like some of the fields in the name line might have been separated by a tab.

    End with:
    Code:
    >scaffold_8 2234626-2234945 (-)
    GATCCGTAAGgtgaccacttccagcggcctggtgaagcgcctgatcgaggtgcagacgacgaccgtgacgaagaccgtggcgggcgagacgagcacgactgtggagacggagactcgcgaggagatcgatgacagcagtgccacgagctcgaccacgacgacgtcggatgcgactctggttagctcgtcgacgacgcacgagacggacgaggacggcgctgctggcgcgacggtgaagaccgaggtgttccgtacgatgggtgccaacggcagcgtcatcaccaagacggttcgcacgacgatccgtaagGTGACCACTT
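Since the line breaks in the original file were described as unpredictable, a variant that ignores line boundaries may be useful. The Python sketch below assumes every record looks like `scaffold_... start-end (strand) SEQUENCE asmbl_...`, as in the example above. The pattern `RECORD` and the function `to_fasta` are illustrative names, and the pattern would need adjusting if real records deviate from that layout.

```python
import re

# One record: name, coordinates, strand, sequence (may contain line breaks), assembly tag.
# The "scaffold_" and "asmbl_" prefixes are assumptions taken from the example record.
RECORD = re.compile(
    r"(scaffold_\S+)\s+(\d+-\d+)\s+\(([+-])\)\s+([A-Za-z\s]+?)\s+(asmbl_\S+)"
)


def to_fasta(text: str) -> str:
    """Convert the one-line-per-record format into two-line FASTA records."""
    out = []
    for name, coords, strand, seq, _asmbl in RECORD.findall(text):
        seq = "".join(seq.split())  # remove any line breaks inside the sequence
        out.append(f">{name} {coords} ({strand})\n{seq}")
    return "\n".join(out) + ("\n" if out else "")
```

To use it, read the whole input file into one string, pass it to `to_fasta`, and write the result to the output file. The assembly tag is captured but dropped, matching the sed version.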


  • cdlam
    replied
    Hey,

    Yes, it seems consistent. The beginning of each one is "scaffold_8 2234626-2234945 (-) " with either a (+) or a (-) depending on which strand it came from. So it will always go ( ) GTGAA.... and each line ends with "asmbl_####".

    So,

    scaffold_# ###-##### ( ) SEQUENCE-DATA asmbl_###

    It starts a new line for each new "scaffold_#...." and the "asmbl_###" is just on the end of the last line with sequence data.

    Edit: Sorry, yes, there are always 10 capital letters at the beginning and 10 capital letters at the end of the sequence.


  • pbluescript
    replied
    cdlam, the awk script should be run from the command line.

    Is the format of the weird fasta file consistent? Does the sequence name always start with "scaffold"? Do the sequences always start with capital letters?


  • cdlam
    replied
    Thanks for the replies. I'm not exactly sure how to do the awk solution. Is that run from the command prompt?

    Apexy, that seems to have worked. However, there's an unforeseen problem: one of my files is named "file.fasta" but apparently isn't in FASTA format. It's like:

    scaffold_8 2234626-2234945 (-) GATCCGTAAGgtgaccacttccagcggcctggtgaagcgcctgatcgaggtgcagacgacgaccgtgacgaagaccgtggcgggcgagacgagcacgactgtggagacggagactcgcgaggagatcgatgacagcagtgccacgagctcgaccacgacgacgtcggatgcgactctggttagctcgtcgacgacgcacgagacggacgaggacggcgctgctggcgcgacggtgaagaccgaggtgttccgtacgatgggtgccaacggcagcgtcatcaccaagacggttcgcacgacgatccgtaagGTGACCACTT asmbl_6481

    Except it's all on one line when I open it:
    scaffold_8 2234626-2234945 (-) GATCCGTAAG.........

    Some entries do span multiple lines, but the line breaks seem unpredictable, which is weird. I don't suppose anyone has any ideas for getting this into the proper format?


  • pbluescript
    replied
    awk solution:

    awk '{
      if (/^>/)
        print $0                  # keep the ID line as is
      else
        print substr($1, 1, 10)   # first 10 bases of the sequence line
    }' file.fasta
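Note that the awk command prints 10 bases for every sequence line, so it fits files where each sequence sits on one line. If a FASTA file wraps sequences over several lines, a rough Python sketch like the following joins each record first. The function names `read_fasta` and `first_bases` are hypothetical, and only the standard library is used.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs, joining sequences wrapped over several lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None and line:
            chunks.append(line.strip())
    if header is not None:          # emit the final record
        yield header, "".join(chunks)


def first_bases(lines: Iterable[str], n: int = 10) -> Iterator[str]:
    """Yield each header line followed by the first n bases of its sequence."""
    for header, seq in read_fasta(lines):
        yield header
        yield seq[:n]
```

An open file handle can be passed directly as `lines`, for example by looping over `first_bases(open("file.fasta"))` and printing each item.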


  • Apexy
    replied
    Hello cdlam,
    Assuming your sequences are in a file named file.fa, with each sequence on a single line after its ID line, run this at the command prompt:
    perl -e 'while($id = <>){$seq = <>; $seq = substr($seq,0,10); print $id.$seq."\n"}' file.fa
    You should get this:
    >Fasta format Identification stuff
    ACTTGTACAT
    >Fasta format Identification stuff
    ACTTGTACAT
    >Fasta format Identification stuff
    ACTTGTACAT

    Enjoy!

    MBandi


  • cdlam
    started a topic Extract partial sequence from FASTA record

    Extract partial sequence from FASTA record

    Hi all,

    So I'm fairly new to anything bioinformatics related and I've been kind of muddling my way through so far.

    I have extracted the introns from three different species and have their sequences stored in three FASTA files. I need a way to extract the first 10 bases from each of these sequences and put them in a new file. I don't know if it helps or not, but the first 10 bases are in caps while the rest of the sequence is in lowercase. I'm not sure if I could use some sort of regex here or something. So for example,

    >Fasta format Identification stuff
    ACTTGTACATatgggtatcataatcagggagatcc
    >Fasta format Identification stuff
    ACTTGTACATatgggtatcataatcagggagatcc
    >Fasta format Identification stuff
    ACTTGTACATatgggtatcataatcagggagatcc

    Ideally I could preserve the ID lines with the extracted 10 base sequences. I have been using UNIX and perl for some of this (doing the actual extractions), but I also have access to Windows with python, biopython, and Emboss. Thanks for any help you guys could give!
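The regex idea mentioned in the question could be sketched in Python roughly as follows. It relies on the stated assumption that the leading bases are uppercase and the rest lowercase, so it returns however many capitals start the sequence instead of a fixed 10. The names `LEADING_CAPS` and `leading_uppercase` are illustrative only.

```python
import re

# Matches the run of capital letters at the very start of a sequence.
LEADING_CAPS = re.compile(r"[A-Z]+")


def leading_uppercase(seq: str) -> str:
    """Return the uppercase run at the start of `seq`, or "" if there is none."""
    match = LEADING_CAPS.match(seq.strip())
    return match.group(0) if match else ""


print(leading_uppercase("ACTTGTACATatgggtatcataatcagggagatcc"))  # prints ACTTGTACAT
```

A fixed-length slice is simpler when the count is known. The regex version only helps if the length of the capitalised region varies between records.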
