  • looking for a simple script to pull a subset of contigs from an assembly

    I'm sure this is a simple enough task, but I'm just an end user with no scripting experience at all.

    I'm looking for a script that pulls the contigs listed in a .txt file from assembly.fa and writes the results to a new .fa file.

    Any help would be much appreciated. Thanks.

  • #2
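    Use seqtk subseq (https://github.com/lh3/seqtk), e.g. seqtk subseq assembly.fa contigs.txt > subset.fa, where contigs.txt lists one contig name per line (the file names here are placeholders).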



    • #4
      Out of curiosity, GenoMax, how does the second perl function handle the searching of the sample name file?
      I wrote something similar in bash under Linux and it went bonkers with the RAM. I think the issue was that whenever it found an ID in the fasta file, it did not remove that ID from the file containing the query IDs, and then restarted the search from the beginning. Either way, my scripting abilities produced something that seemed unfeasible to run on a large query ID file and a large fasta file.

      Cheers,

      J



      • #5
        Thanks all!



        • #7
          awk 'BEGIN{while((getline x<ARGV[1])>0){a[i++]=x;}while((getline y<ARGV[2])>0){if(substr(y,1,1)==">"){m=0;for(j=0;j<i;j++){if(y==a[j])m=1;}}if(m==1)print y;}}' "$1" "$2"


          $1 is the match file: one full header line per line, including the leading ">"
          $2 is the fasta file
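
The awk one-liner above walks the whole list of wanted headers for every header line in the fasta file, which gets slow when both files are large. Below is a minimal Python sketch of the same idea that keeps the wanted headers in a set, so each lookup is a single hash probe and the fasta file is read once, line by line. The function name filter_fasta is made up for illustration, and it assumes (as the awk version does) that the match file holds full header lines including the leading ">".

```python
def filter_fasta(id_path, fasta_path, out):
    """Copy records from fasta_path to out when their header is in id_path.

    Hypothetical helper for illustration. Returns the number of records kept.
    """
    # Assumption: the match file holds complete header lines, ">" included,
    # exactly as they appear in the fasta file.
    with open(id_path) as ids:
        wanted = {line.rstrip("\n") for line in ids if line.strip()}

    keep = False  # whether the record currently being read is wanted
    kept = 0
    with open(fasta_path) as fasta:
        for line in fasta:
            if line.startswith(">"):
                # A header line starts a new record; decide once per record.
                keep = line.rstrip("\n") in wanted
                if keep:
                    kept += 1
            if keep:
                # Header and sequence lines of a wanted record are copied as-is.
                out.write(line)
    return kept
```

Something like filter_fasta("contigs.txt", "assembly.fa", sys.stdout) (after import sys) would then print the matching records. Only the set of wanted headers is held in memory, never the fasta file itself.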



          • #8
            Originally posted by JackieBadger
            Out of curiosity, GenoMax, how does the second perl function handle the searching of the sample name file?
            I wrote something similar in bash under Linux and it went bonkers with the RAM. I think the issue was that whenever it found an ID in the fasta file, it did not remove that ID from the file containing the query IDs, and then restarted the search from the beginning. Either way, my scripting abilities produced something that seemed unfeasible to run on a large query ID file and a large fasta file.

            Cheers,

            J
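            @JackieBadger: The second perl function uses the -n and -e switches. -n wraps a while (<>) loop around the program, so each input line is placed in $_ in turn, and -e supplies the program text on the command line. (-p behaves like -n but also prints $_ after each iteration.)

            A nice example that illustrates this (equivalent to the unix 'cat' command):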

            Code:
            $ perl -ne 'print $_' filename
            or
            Code:
            $ perl -ne 'print' filename
            Last edited by GenoMax; 02-08-2014, 06:05 PM.
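
For anyone more at home in Python, here is a rough sketch of what the -n switch does, using the standard-library fileinput module. The names run_like_perl_n and cat are invented for this illustration; it is an analogy, not how perl implements the switch.

```python
import fileinput
import sys


def run_like_perl_n(body, files, out=sys.stdout):
    """Call body(line, out) once per input line, in the spirit of perl -n."""
    # fileinput.input() reads the named files one after another, line by line,
    # much as perl's while (<>) loop does.
    with fileinput.input(files=files) as lines:
        for line in lines:
            body(line, out)


def cat(line, out):
    # Counterpart of perl -ne 'print': write each line back out unchanged.
    out.write(line)
```

Calling run_like_perl_n(cat, ["filename"]) copies the file to standard output. The loop holds one line at a time, so memory use does not grow with the size of the input.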



            • #9
              This little BioPython script will nicely do the job:

              Code:
              from Bio import SeqIO
              import sys

              # Usage: filter_fasta_per_ids.py input.fasta filter_ids.txt output.fasta

              input_file = sys.argv[1]
              id_file = sys.argv[2]
              output_file = sys.argv[3]

              # Take the first whitespace-delimited token of each non-blank line as an ID
              # (IDs are given without the leading ">", matching record.id)
              with open(id_file) as handle:
                  wanted = set(line.split(None, 1)[0] for line in handle if line.strip())
              print("Found %i unique identifiers in %s" % (len(wanted), id_file))

              # Generator expression: records are streamed, not loaded into memory at once
              records = (r for r in SeqIO.parse(input_file, "fasta") if r.id in wanted)
              count = SeqIO.write(records, output_file, "fasta")
              print("Saved %i records from %s to %s" % (count, input_file, output_file))
              if count < len(wanted):
                  print("Warning %i IDs not found in %s" % (len(wanted) - count, input_file))
