Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • extract subsequence from genomic fasta file

    Hi,

    I am new to this forum. The reason I joined is to ask the following:

    How do I extract a sub-sequence from a fasta file containing the human genome?

    TIRG's cdbyank was good for pulling whole sequences out of a multifasta file, but that's now what I want. I want to extract something like the following:

    "chr6:26795842-26792619"

    ie given chromosome number, start and end points, get the DNA sequence.

    I have BWA indexed fasta files files, samtools indexed fasta files, and another. Can this be done with BWA, or with samtools?

    Cheers,
    Joe White
    Mass. Eye and Ear Infirmary

  • #2
    EMBOSS has tools available to do this

    Comment


    • #3
      samtools can do this too

      Comment


      • #4
        ncbi blast has start & end parameters to retrieve subseuences

        Comment


        • #5
          Originally posted by genericforms View Post
          samtools can do this too
          Which tool?

          jwhite

          Comment


          • #6
            UCSC, of course, has a tool for occasional web based use:



            Hack the url to hg19 or whatever. You can get to it from the link at the top of a browser page.

            If you need something really fast for calling from the command line, compile this file (fetchdna.c"). (I know, "it's not using a nib file" or the like, I got one! ... but this is simpler).
            Code:
            /*
            compile : gcc -Wall -O2 -o fetchdna fetchdna.c
            
            fetchdna - gets dna from canonical genomic fastas with line len = 50
            Usage : fetchdna range directory_path_to_canonical_genomic_fastas\n");
            Example : ./fetchdna chr17:15000-16000 /h1/finneyr/amd64/hg18/ \n");
            
            Instead of typing the location of your path to the genomic fastas, make a bash file like this ...
            
            echo "~/bin/fetchdna $1 /h1/finneyr/amd64/hg18/" > myfetch
            chmod +x myfetch
            use like this :  ./myfetch chr17:15000-16000
            */
            
            #include <stdio.h>
            #include <stdlib.h>
            #include <string.h>
            
            #define MAXBUFF 5000
            #define MAX_FETCH_SIZE 1000002
            char path[MAXBUFF];
            char dna[MAX_FETCH_SIZE ];
            
            static void comma_gin(char *s) // gets rid of commas in a string  - dang commanists !!!
            {
                char *z;
                char tmps[MAXBUFF+10];
                while (1)
                {
                    z = strstr(s,",");
                    if (!z) break;
                    strcpy(tmps,z+1);
                    strcpy(z,tmps);
                }
                return;
            }
            
            static int parse_position(const char argposition[],char chr[],int *start,int *end)
            {
                int i;
                char tmps[1024];
                char t[MAXBUFF];
                char tmps1[MAXBUFF];
                char tmps2[MAXBUFF];
            
                tmps[0] = tmps1[0] = tmps2[0] = t[0] = (char)0;
                strcpy(t,argposition);
                   // un_escape(t);        for when parsing via a URL which often put escape codes in
                strcpy(tmps,t);
                for (i=0 ; tmps[i] ; i++) { if (tmps[i] == ':') { tmps[i] = ' '; } if (tmps[i] == '-') { tmps[i] = ' '; } }
                sscanf(tmps,"%s %s %s",chr,tmps1,tmps2);
                if (strcmp(chr,"chr23") == 0) strcpy(chr,"chrX");
                else if (strcmp(chr,"chr24") == 0) strcpy(chr,"chrY");
                else if (strcmp(chr,"chr25") == 0) strcpy(chr,"chrM");
                comma_gin(tmps1);
                *start = atoi(tmps1);
                comma_gin(tmps2);
                *end = atoi(tmps2);
                return 0;
            }
            
            void fetch(char fn[],long int pos,char dna[],int len)
            {
                char s[MAXBUFF];
                register int i;
                register int k;
                int headlen;
                long spot;
                FILE *fp;
            
                           // printf("fetch : %s %ld %d\n",fn,pos,len);
                dna[0] = (char)0;
                fp = fopen(fn,"r");
                if (fp == (void *)0) { fprintf(stderr,"%s not open error.  Invalid filename. \n",fn); exit(0); }
                fgets(s,1022,fp);  // eat head line
                headlen = strlen(s) ;
                spot = headlen +  pos + (pos/50);
                           // printf("spot = %ld headlen = %d pos = %ld  len=%d\n",spot,headlen,pos,len);
                fseek(fp,spot,SEEK_SET);
                i  = 0;
                while (i < len)
                {
                    k = fgetc(fp);
                    if ((char)k == '\r') continue;
                    if ((char)k == '\n') continue;
                    dna[i++] = (char)k;
                }
                dna[i+1] = (char)0;
                fclose(fp);
                return;
            }
            
            
            int fetchwrap(char *chr,int s, int e)
            {
                int k;
                char fn[512]; //  file name to canonical fasta for a chromsome
                if ((e-s)>MAX_FETCH_SIZE-2)
                {
                    fprintf(stderr,"ERROR- too big %s %d %d , cant be bigger than %d \n",chr,s,e,MAX_FETCH_SIZE);
                    return -1;
                }
                sprintf(fn,"%s/%s.fa",path,chr);
                dna[0] = (char)0;
                fetch(fn,s,dna,e-s);
                for (k=0 ; dna[k] ; k++)
                {
                    printf("%c",dna[k]);
                    if ((k%50) == 49) printf("\n");
                }
                printf("\n");
                return 0;
            }
            
            int main(int argc,char *argv[])
            {
                char position[MAXBUFF];
                char chr[MAXBUFF];
                int start,end;
            
                if (argc != 3)
                {
                    fprintf(stderr,"ERROR: usage example is ./fetchdna chr17:15000-16000 /h1/finneyr/amd64/hg18/ \n");
                    fprintf(stderr,"Usage is fetchdna range directory_path_to_canonical_genomic_fastas\n");
                    fprintf(stderr,"Note: start of chromosome is 1, not 0\n");
                    return 1;
                }
            
                strcpy(position,argv[1]);      // buffer overflow , careful if webizing
                strcpy(path,argv[2]);
                parse_position(position,chr,&start,&end); // eg.: "position=chrX:37301314-37347604"
                start = start - 1;
                end = end ;
                fetchwrap(chr,start,end);
                return 0;
            }
            Last edited by Richard Finney; 06-28-2012, 09:00 AM. Reason: spelling

            Comment


            • #7
              Just to add another to the list, bedtools getfasta will do this, and it is very easy to integrate into a variety of scripts.

              Comment


              • #8
                Thanks for your answers folks.

                The c program and remarks about samtools, bedtools were helpful..

                Just by chance I noticed in the samtools docs that faidx not only indexes sequences, but will retrieve sequences if given a region (eg chr2:1234-1266).
                This works very nicely:

                samtools faidx fasta.fa region

                This can be scripted easily.

                Just wanted to pass this on for future reference.

                Cheers,
                Joe

                Comment

                Latest Articles

                Collapse

                • seqadmin
                  Recent Advances in Sequencing Technologies
                  by seqadmin







                  Innovations in next-generation sequencing technologies and techniques are driving more precise and comprehensive exploration of complex biological systems. Current advancements include improved accessibility for long-read sequencing and significant progress in single-cell and 3D genomics. This article explores some of the most impactful developments in the field over the past year.

                  Long-Read Sequencing
                  Long-read sequencing has...
                  12-02-2024, 01:49 PM
                • seqadmin
                  Genetic Variation in Immunogenetics and Antibody Diversity
                  by seqadmin



                  The field of immunogenetics explores how genetic variations influence immune responses and susceptibility to disease. In a recent SEQanswers webinar, Oscar Rodriguez, Ph.D., Postdoctoral Researcher at the University of Louisville, and Ruben Martínez Barricarte, Ph.D., Assistant Professor of Medicine at Vanderbilt University, shared recent advancements in immunogenetics. This article discusses their research on genetic variation in antibody loci, antibody production processes,...
                  11-06-2024, 07:24 PM

                ad_right_rmr

                Collapse

                News

                Collapse

                Topics Statistics Last Post
                Started by seqadmin, 12-02-2024, 09:29 AM
                0 responses
                139 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 12-02-2024, 09:06 AM
                0 responses
                49 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 12-02-2024, 08:03 AM
                0 responses
                38 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 11-22-2024, 07:36 AM
                0 responses
                69 views
                0 likes
                Last Post seqadmin  
                Working...
                X