  • cdlam
    Junior Member
    • Oct 2012
    • 7

    Extract partial sequence from FASTA record

    Hi all,

    So I'm fairly new to anything bioinformatics related and I've been kind of muddling my way through so far.

    I have extracted the introns from three different species and have their sequences stored in three FASTA files. I need a way to extract the first 10 bases from each of these sequences and put them in a new file. I don't know if it helps or not, but the first 10 bases are in caps while the rest of the sequence is in lowercase. I'm not sure if I could use some sort of regex here or something. So for example,

    >Fasta format Identification stuff
    ACTTGTACATatgggtatcataatcagggagatcc
    >Fasta format Identification stuff
    ACTTGTACATatgggtatcataatcagggagatcc
    >Fasta format Identification stuff
    ACTTGTACATatgggtatcataatcagggagatcc

    Ideally I could preserve the ID lines with the extracted 10 base sequences. I have been using UNIX and perl for some of this (doing the actual extractions), but I also have access to Windows with python, biopython, and Emboss. Thanks for any help you guys could give!
  • Apexy
    Member
    • Apr 2011
    • 62

    #2
    Hello cdlam,
    Assuming your sequences are in a file named file.fa, run this at the command prompt:
    perl -e 'while($id = <>){$seq = <>; $seq = substr($seq,0,10); print $id.$seq."\n"}' file.fa
    you should get this:
    >Fasta format Identification stuff
    ACTTGTACAT
    >Fasta format Identification stuff
    ACTTGTACAT
    >Fasta format Identification stuff
    ACTTGTACAT

    Enjoy!

    MBandi
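The one-liner above reads records as strict header/sequence line pairs. If a sequence is wrapped over several lines, a sketch along these lines may help. It is only an illustration in Python (which the original poster mentions having available); the function name `first_bases` and the demo records are made up for this example.

```python
def first_bases(lines, n=10):
    """Yield (header, first n bases) for each FASTA record in an iterable of lines."""
    header, prefix = None, ""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, prefix  # emit the previous record
            header, prefix = line, ""
        elif header is not None and len(prefix) < n:
            # keep appending until n bases have been collected,
            # so a sequence wrapped over several lines still works
            prefix += line[: n - len(prefix)]
    if header is not None:
        yield header, prefix  # emit the last record


# Hypothetical demo input: the first record is wrapped over two lines.
demo = [
    ">intron_1\n", "ACTTG\n", "TACATatgggtatc\n",
    ">intron_2\n", "ACTTGTACATatgggtatcataatc\n",
]
result = list(first_bases(demo))
for header, bases in result:
    print(header)
    print(bases)
```

For a real file, pass an open file handle instead of the demo list, e.g. `first_bases(open("file.fa"))`.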


    • pbluescript
      Senior Member
      • Nov 2009
      • 224

      #3
      awk solution:

      awk '{if (/^>/)
      print $0
      else
      print(substr($1,1,10))
      }' file.fasta
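Since the original question asked about a regex and notes that the bases of interest are uppercase, another option is to take the leading run of capital letters rather than a fixed count. A minimal Python sketch, assuming the uppercase block always sits at the very start of the sequence (the helper name `leading_uppercase` is invented here):

```python
import re

# One or more capital letters; match() anchors at the start of the string.
LEADING_CAPS = re.compile(r"[A-Z]+")


def leading_uppercase(seq):
    """Return the run of uppercase letters at the start of seq ('' if none)."""
    m = LEADING_CAPS.match(seq)
    return m.group(0) if m else ""


print(leading_uppercase("ACTTGTACATatgggtatcataatcagggagatcc"))
```

This returns however many capitals lead the sequence, so it only agrees with the fixed 10-base cut when the uppercase block is exactly 10 bases long.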


      • cdlam
        Junior Member
        • Oct 2012
        • 7

        #4
        Thanks for the replies. I'm not exactly sure how to do the awk solution. Is that run from the command prompt?

        Apexy, that seems to have worked. However, there's an unforeseen problem: one of my files is named "file.fasta" but apparently isn't in FASTA format. It looks like this:

        scaffold_8 2234626-2234945 (-) GATCCGTAAGgtgaccacttccagcggcctggtgaagcgcctgatcgaggtgcagacgacgaccgtgacgaagaccgtggcgggcgagacgagcacgactgtggagacggagactcgcgaggagatcgatgacagcagtgccacgagctcgaccacgacgacgtcggatgcgactctggttagctcgtcgacgacgcacgagacggacgaggacggcgctgctggcgcgacggtgaagaccgaggtgttccgtacgatgggtgccaacggcagcgtcatcaccaagacggttcgcacgacgatccgtaagGTGACCACTT asmbl_6481

        Except it's just in one line when I open it:
        scaffold_8 2234626-2234945 (-) GATCCGTAAG.........

        It does do multiple lines per entry, but the breaks seem unpredictable, it's weird. I don't suppose anyone has any ideas for getting this into the proper format?


        • pbluescript
          Senior Member
          • Nov 2009
          • 224

          #5
          cdlam, the awk script should be run from the command line.

          Is the format of the weird fasta file consistent? Does the sequence name always start with "scaffold"? Do the sequences always start with capital letters?


          • cdlam
            Junior Member
            • Oct 2012
            • 7

            #6
            Hey,

            Yes, it seems consistent. The beginning of each one is "scaffold_8 2234626-2234945 (-) ", with either a (+) or a (-) depending on which strand it came from. So it will always go ( ) GTGAA.... and each line ends with "asmbl_####".

            So,

            scaffold_# ###-##### ( ) SEQUENCE-DATA asmbl_###

            It starts a new line for a new "scaffold_#...." and the "asmbl_###" is just on the end of the last line with sequence data.

            Edit: Sorry, yes, there are always 10 capital letters at the beginning and 10 capital letters at the end of the sequence.


            • pbluescript
              Senior Member
              • Nov 2009
              • 224

              #7
              Start with:
              Code:
              scaffold_8      2234626-2234945 (-) GATCCGTAAGgtgaccacttccagcggcctggtgaagcgcctgatcgaggtgcagacgacgaccgtgacgaagaccgtggcgggcgagacgagcacgactgtggagacggagactcgcgaggagatcgatgacagcagtgccacgagctcgaccacgacgacgtcggatgcgactctggttagctcgtcgacgacgcacgagacggacgaggacggcgctgctggcgcgacggtgaagaccgaggtgttccgtacgatgggtgccaacggcagcgtcatcaccaagacggttcgcacgacgatccgtaagGTGACCACTT asmbl_6481
              Run:
              Code:
              sed 's/^/>/g' input | sed 's/([+-])\s/&\n/g' | sed 's/ as.*$//g' | sed 's/\s/ /g' > output
              I broke the sed code down into piped steps to make it easier to understand.
              's/^/>/g' adds a > to the start of each line.
              's/([+-])\s/&\n/g' inserts a new line after the strand designation, i.e. after the literal (+) or (-) and the whitespace that follows it.
              's/ as.*$//g' removes the asmbl_### portion. You could just move it instead if you want to keep it.
              's/\s/ /g' replaces each whitespace character with a space. I added this since it seemed like some of the fields in the name line might have been separated by a tab.

              End with:
              Code:
              >scaffold_8 2234626-2234945 (-)
              GATCCGTAAGgtgaccacttccagcggcctggtgaagcgcctgatcgaggtgcagacgacgaccgtgacgaagaccgtggcgggcgagacgagcacgactgtggagacggagactcgcgaggagatcgatgacagcagtgccacgagctcgaccacgacgacgtcggatgcgactctggttagctcgtcgacgacgcacgagacggacgaggacggcgctgctggcgcgacggtgaagaccgaggtgttccgtacgatgggtgccaacggcagcgtcatcaccaagacggttcgcacgacgatccgtaagGTGACCACTT
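The same conversion can be sketched in Python for anyone working on the Windows machine. This is a rough illustration, not a general parser: it assumes every record sits on a single line with exactly five whitespace-separated fields (name, coordinates, strand, sequence, assembly ID), as in the example above. The function name and the shortened demo sequence are made up.

```python
def scaffold_line_to_fasta(line):
    """Convert 'name coords (strand) SEQUENCE asmbl_id' into a two-line FASTA record."""
    fields = line.split()  # splits on any run of spaces or tabs
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
    name, coords, strand, seq, _assembly_id = fields  # assembly ID is dropped
    return f">{name} {coords} {strand}\n{seq}"


# Hypothetical, shortened record in the same layout as the example above.
demo = "scaffold_8\t2234626-2234945 (-) GATCCGTAAGgtgaccGTGACCACTT asmbl_6481"
record = scaffold_line_to_fasta(demo)
print(record)
```

Raising an error on an unexpected field count is a deliberate choice here: a record that was broken across lines would otherwise be turned into a malformed entry without any warning.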


              • cdlam
                Junior Member
                • Oct 2012
                • 7

                #8
                That worked, thanks a lot!

                Hmm in terms of my original question, when using the

                perl -e 'while($id = <>){$seq = <>; $seq = substr($seq,0,10); print $id.$seq."\n"}' file.fa
                To get the ending sequence, I just used $seq,-20, which gave me the final 20 bases. Is there a way I can start 5 bases from the end and take the preceding 20 bases?


                • Apexy
                  Member
                  • Apr 2011
                  • 62

                  #9
                  I would suggest pragmatism: first extract the last 25 bases and put them in a file, then extract the first 20 bases from that file.
                  OR
                  perl -e 'while($id = <>){$seq = <>; $seq = substr($seq,-26, 20); print $id.$seq."\n"}' file.fa
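A note on the offset: in the one-liner, $seq still carries its trailing newline (assuming each sequence line ends with one), so -26 counts that newline plus 25 bases. The Python sketch below shows the same window with the newline stripped first, so the arithmetic is in bases only. The function name `window_before_tail` and the demo sequence are invented for illustration.

```python
def window_before_tail(seq, skip=5, length=20):
    """Return `length` bases ending `skip` bases before the end of seq."""
    seq = seq.rstrip("\r\n")           # drop the line ending before counting
    end = max(0, len(seq) - skip)      # position where the skipped tail begins
    start = max(0, end - length)       # clamp for sequences shorter than the window
    return seq[start:end]


# Hypothetical sequence: 10 filler bases, 20 target bases, 5 tail bases, newline.
demo = "a" * 10 + "C" * 20 + "GGGGG" + "\n"
print(window_before_tail(demo))
```

With the newline still attached, the equivalent Python slice is `demo[-26:-6]`, which mirrors the -26 offset used in the Perl version.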


                  • cdlam
                    Junior Member
                    • Oct 2012
                    • 7

                    #10
                    Haha, wow I really feel like I should have been able to think of that myself

                    Oh well.


                    Thanks a lot!
