  • nucleotide sequence extraction

    I wish to extract part of a sequence from a particular sequence/scaffold ID, for example bases 437 to 959 from a 3 Mb scaffold.

    I am more familiar with grep and have used it before like this:
    grep -A 1 scaffoldID sequencefasta.fa > saveoutput.fa

    but I don't know how to extract a particular part of the sequence.

    Could anyone help me with this, please?

    S

  • #2
    Have a look at Galaxy.

    Alternatively you can use Biopieces like this:

    Code:
    read_fasta -i input.fasta |
    grab -p scaffoldID -k SEQ_NAME |
    extract_seq -b 437 -e 959 |
    write_fasta -o output.fasta -x

    Martin

    • #3
      You could also use bedtools (code.google.com/p/bedtools/). I've used this tool to extract sub-sequences before and I really like it because it's fast and efficient.

      The tool in bedtools is called fastaFromBed (Creates FASTA sequences based on intervals in a BED/GFF/VCF file) and can extract sub-regions of a fasta by specifying those regions in a bed file.

      The manual is present here: http://code.google.com/p/bedtools/do...-Manual.v3.pdf

      Example of the command from the manual:

      fastaFromBed [OPTIONS] -fi <input FASTA> -bed <BED/GFF/VCF> -fo <output FASTA>
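
One detail worth keeping in mind with the BED route: BED intervals use a 0-based start and an exclusive end, so bases 437 to 959 (1-based, inclusive) become 436 and 959 in the BED file. A minimal sketch, assuming the file names from the first post (sequencefasta.fa, saveoutput.fa); to_bed and region.bed are hypothetical names used only for illustration:

```shell
# to_bed ID START END
# Convert a 1-based, inclusive region into a tab-separated BED line
# (0-based start, exclusive end). Hypothetical helper, not part of bedtools.
to_bed() {
    printf '%s\t%d\t%d\n' "$1" "$(( $2 - 1 ))" "$3"
}

# Write the region from the original question to a BED file.
to_bed scaffoldID 437 959 > region.bed

# Run the extraction only if bedtools and the input FASTA are actually present.
if command -v fastaFromBed >/dev/null 2>&1 && [ -f sequencefasta.fa ]; then
    fastaFromBed -fi sequencefasta.fa -bed region.bed -fo saveoutput.fa
fi
```

The first column of the BED line has to match the FASTA header name for the record to be found.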

      • #4
        Thanks Maasha and NextGenGirl,
        I could not install these tools on my system. The scaffold name and the sequence ID are the same. Could you please suggest a solution that uses only Perl or standard tools like grep? I am using BioLinux.
        Regards,
        S
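
For reference, a request like this can be met without installing anything, since awk ships with practically every Linux system. The following is a minimal sketch under stated assumptions, not a tested tool: it assumes the ID is the first word of the FASTA header, that coordinates are 1-based and inclusive, and that the sequence may be wrapped over several lines (which the grep -A 1 approach does not handle). extract_region is a hypothetical function name.

```shell
# extract_region FASTA ID START END   (1-based, inclusive coordinates)
extract_region() {
    awk -v id="$2" -v start="$3" -v end="$4" '
        /^>/ {
            # the wanted record has been read in full: stop scanning
            if (found) exit
            # compare the first word of the header (without ">") to the ID
            name = substr($1, 2)
            found = (name == id)
            next
        }
        # join the wrapped sequence lines of the wanted record
        found { gsub(/\r/, ""); seq = seq $0 }
        END {
            # non-zero exit status if the ID was not found
            if (seq == "") exit 1
            printf ">%s:%d-%d\n%s\n", id, start, end, substr(seq, start, end - start + 1)
        }
    ' "$1"
}
```

With the file names from the first post, it would be called as: extract_region sequencefasta.fa scaffoldID 437 959 > saveoutput.fa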

        • #5
          I assume you are perhaps missing a compiler (gcc)/libraries when you say that you could not install these tools.

          Are you using a "live" image of biolinux to temporarily boot into a unix environment or are you using someone else's biolinux machine?


          • #6
            Thanks for your message.

            This is on my own machine, running under VMware. I guess I can install these tools with sudo. Rather than 'could not', it is more that I was afraid to install them: if anything goes wrong, I don't have the know-how to fix it, so I don't want to tamper with my standard installation.
            Regards,
            S

            • #7
              Give it a try. This is something you need to learn if you are planning to keep using *nix in some form.

              I doubt that you can cause major damage by installing bedtools ... but if you did manage to do that, then perhaps you should not be using *nix in the first place.

              I have not used VMware lately. Are there any tools that let you make a backup of the image, so that if something does go wrong you can revert to the old image?

              Last edited by GenoMax; 05-16-2012, 09:06 AM.

              • #8
                My username says it all about my level of knowledge, but with your encouragement I shall give it a try sometime later.
                Regards,
                S

                • #9
                  Hi struggler,

                  I agree with GenoMax. Try to install these tools. Otherwise, if you are concerned about that, maasha's suggestion of Galaxy is also good. There is a tool there under "Fetch Sequences" called "Extract Genomic DNA"; that is the tool I used before I learned how to use Unix.

                  • #10
                    EMBOSS (http://emboss.sourceforge.net/) is probably the most useful package for basic sequence manipulation/analysis.

                    Note that to use stdin/stdout you need to pass the '-filter' flag; the '-auto' flag disables parameter prompting. The manual on their website is very informative.

                    I hope this helps!

                    • #11
                      @struggler .. try this

                      #fasta file: pa101.fasta
                      >gi|129295|sp|P01013|OVAX_CHICK GENE X PROTEIN (OVALBUMIN-RELATED)
                      QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAE
                      KMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTS
                      VLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHP
                      FLFLIKHNPTNTIVYFGRYWSP
                      #script: sequence_extractor.sh
                      #!/bin/bash

                      # The 1-based sequence extractor - sequence_extractor.sh
                      # Works on a single-record FASTA file. No guarantees offered.

                      # usage:
                      # 1) download the script or copy its contents
                      # and save it as sequence_extractor.sh
                      # 2) make it executable: chmod 755 sequence_extractor.sh
                      # 3) run the script: ./sequence_extractor.sh pa101.fasta 4 6
                      # (for the file above this prints DLL)

                      # drop the header line and merge the sequence lines;
                      # the input file itself is left untouched
                      temp_var1=$(sed -e '1d' "$1" | tr -d '\r\n') || exit $?

                      # length of the region (start and end are 1-based, inclusive)
                      temp_var2=$(( $3 - $2 + 1 ))

                      # display the extracted sequence
                      echo "${temp_var1:$(( $2 - 1 )):$temp_var2}"

                      • #12
                        From the NCBI toolkit, formatdb and fastacmd work nicely.

                        First, format your sequence file:

                        formatdb -i <fasta sequence file> -p F -o T


                        This creates a blastable sequence db (a useful bonus). The "-o T" flag makes it searchable by fastacmd.

                        Then:

                        fastacmd -d <fasta sequence file> -o <output file name> -p F -s <ID of record you want to retrieve> -L <start_position,end_position>

                        Fastacmd can also retrieve many records at once. See the documentation.

                        • #13
                          Dear Mark,
                          Many many thanks! The fastacmd command worked like a bullet!!

                          I am also thankful to all others for their helpful suggestions.

                          Regards,
                          S
