Header Leaderboard Ad

Collapse

Script for extracting random sub-set of sequences

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Script for extracting random sub-set of sequences

    I have sequence data that was created using several different primers. I found that the samples have large variation in sequences, depending on the primer used. I therefore wanted to create a file that randomly selects 1000 sequences, and then put these in a new FASTA file.

    My strategy is to put the sequences in a hash, split by the '>'. Then, assign each a number, and then randomly select the numbers that are assigned to the sequences, and then put this in a new FASTA file.
    I haven't gotten this to work yet, but does anyone know a simpler method for doing this?

  • #2
    With Biopieces you can do:

    Code:
    read_fasta -i input.fna | random_records -n 1000 | write_fasta -o random.fna -x

    Comment


    • #3
      cat data | shuf | head -NUMBEROFLINES

      If the input is fasta, you'll have to join every other line with the next line and undo it. Example:

      cat test.fa| awk '{if ((NR%2)==0)print prev"XXXXXX"$0;prev=$0;}' | shuf | head -1000 | sed 's/XXXXXX/\n/'

      The shuf may not hang out in your /usr/sbin/ , if not, try

      sort -R file.txt | head -NUMBER OF LINES

      yep, "sort by random" !!!

      Comment


      • #4
        http://seqanswers.com/forums/showthread.php?t=16505
        --------------
        Ethan

        Comment


        • #5
          Originally posted by Richard Finney View Post
          cat data | shuf | head -NUMBEROFLINES

          If the input is fasta, you'll have to join every other line with the next line and undo it. Example:

          cat test.fa| awk '{if ((NR%2)==0)print prev"XXXXXX"$0;prev=$0;}' | shuf | head -1000 | sed 's/XXXXXX/\n/'

          The shuf may not hang out in your /usr/sbin/ , if not, try

          sort -R file.txt | head -NUMBER OF LINES

          yep, "sort by random" !!!
          Beware that shuf and sort -R are GNU. If you have a BSD system (OS X, FreeBSD...) those won't work.

          Comment


          • #6
            I think this technique would work on BSD systems (and GNU systems) ...

            cat file.txt | awk '{print rand()" "$0}' | sort -n | head -1000 | cut -f2-9999 -d" "

            Comment


            • #7
              Suppose we want to sample n elements from a pool of N. The space complexity of the Biopieces solution is O(N) as it loads all sequences into memory. I guess shuf is no better. The optimal algorithm is to use reservoir sampling. The space complexity is O(n) instead of O(N). Of course, if N is not so large, it does not matter.

              The following is an awk snippet that randomly samples k=10 lines from a text file. Note that this program maximally keeps k=10 lines in memory.

              Code:
              cat file.txt|awk -v k=10 '{y=x++<k?x-1:int(rand()*x);if(y<k)a[y]=$0}END{for(z in a)print a[z]}'
              With bioawk, you can process fasta files this way:

              Code:
              awk -c fastx -v k=10 '{y=x++<k?x-1:int(rand()*x);if(y<k)a[y]=">"$name"\n"$seq}END{for(z in a)print a[z]}' seq.fa.gz
              Last edited by lh3; 01-14-2012, 08:47 PM.

              Comment


              • #8
                Re: ETHANol's post: Here's a link to a script we made that will subsample any fastq or fasta file:

                http://cl.ly/3Q2Y1Z222M0J220w3c1I

                Type subsampler.py -h for instructions - can do SE and PE reads.

                Comment


                • #9
                  Originally posted by maasha View Post
                  With Biopieces you can do:

                  Code:
                  read_fasta -i input.fna | random_records -n 1000 | write_fasta -o random.fna -x
                  Thanks Maasha. I've had some difficulty installing Biopieces. Particularly, I haven't been able to install the required perl modules. I get errors like this:
                  ERROR: Can't create '/Library/Perl/Updates/5.12.3/Module'
                  mkdir /Library/Perl/Updates/5.12.3/Module: Permission denied at /System/Library/Perl/5.12/ExtUtils/Install.pm line 494

                  Do you know how to get around this?

                  Comment


                  • #10
                    You need the correct permissions. Try using "sudo". Also, you should probably not contaminate this thread with Perl support request. Try stack-exchange or the Biopieces google group.


                    Cheers


                    Martin

                    Comment

                    Working...
                    X