Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #16
    I think a simpler way to do it...just get all the reads from a particular set of tiles (I don't know how that translates on HiSeq machines) Since the first few and last few tiles are bad, pick some tiles in the middle, by grepping the read names, which have teh cluster cooridnates embedded in them. Or zgrep, if the fastqs are already zipped. That should be random with respect to where the read align to the genome, with better quality than picking the top and bottom tiles with head or foot.

    But yes, this won't work on multi-line fastqs.

    Comment


    • #17
      Hey, just notices this thread. This Python solution may be a bit late but it will work on both multi-line fasta and fastq files. Set up is such that all files in same dir will be subsampled, but this can be changed. Also three subsamplings taken, but you can change this easily enough or just use one of them.

      Code:
      import random
      import glob
      inFiles = glob.glob('*.fasta') + glob.glob('*.fastq')
      outFiles = []
      num = int(raw_input("Enter number of random sequences to select:\n"))
      
      for i in range(len(inFiles)):
      
              for k in range(3):
                      fNames = []
                      fSeqs = []
                      same = 0
                      outFiles.append(file(str(inFiles[i])+'_'+'Rand_'+str(num)+'-'+str(k+1)+'.fa', 'wt'))
                      for line in open(inFiles[i]):
                              if ((line[0] == '>') or (line[0] == '@')):
                                      fNames.append(line)
                                      same = 0
                              else:
                                      if (same != 1):
                                              fSeqs.append(line)
                                              same = 1
                                      else:
                                              fSeqs[(len(fSeqs)-1)] += line
                      curr = (len(outFiles)-1)
                      for j in range(num):
                              a = random.randint(0, (len(fNames)-1))
                              outFiles[curr].write(fNames.pop(a))
                              outFiles[curr].write(fSeqs.pop(a))
      raw_input("Done.")

      Comment


      • #18
        This Python solution may be a bit late but it will work on both multi-line fasta and fastq files.
        A few warnings about this Python code:
        • ASCII quality scores in fastq files can contain arbitrary characters like '>' and '@'. The only way to protect against this for multi-line fastq files is to keep a record of the sequence length.
        • This code reads the entire set of reads into memory, then outputs random sequences -- this will not work on fasta/fastq files that are larger than the memory limit. While this is the only way to do it for streamed input, the code here uses file names, so can do a two-pass operation and reopen a file (see my previous post).

        Comment


        • #19
          A simple script extracting random fastq records using reservoir sampling ...

          # usage: program inputfastq numsubsamples outputsubampfqfile
          # example: ./job.fqreservoir SRR451888.fq 1000 subsamp.fq

          cat $1 | \
          awk '{if ((p%4)<3) printf("%s\t",$0); else printf("%s\n",$0); p++;}' | \
          awk -v k=$2 'BEGIN{srand(systime() + PROCINFO["pid"]);}{y=x++<k?x-1:int(rand()*x);if(y<k)a[y]=$0}END{for(z in a)print a[z]}' | \
          tr '\t' '\n' > $3


          See previous discussion here: http://seqanswers.com/forums/showthread.php?t=16845

          Note the acutal line #3 is a hack of lh3's script WITH seeding, i.e. BEGIN{srand(systime() + PROCINFO["pid"]);}
          Remove the seed code to get same results every time you run it. Otherwise leave it in to get different results.

          This program joins the 4 fastq lines together into one line , spearating them with a tab character.
          These combined lines are subsampled.
          At the end, the "tr" command seperates the selected combined lines back into 4 lines.
          This script used the UNIX/LINUX/BSD standard utilities "tr", "cat", and "awk".
          Last edited by Richard Finney; 06-17-2012, 08:39 AM.

          Comment


          • #20
            This program joins the 4 fastq lines together into one line , spearating them with a tab character.
            That particular code won't work with multi-line fasta/fastq files (the linked subsampler.py script has the same issue, as well as pre-reading all the lines to determine the total number of lines, and some modulus code that may influence probabilities in a bad way), but the reservoir sampling method used would provide a way to process files from standard input in a single pass, and use multi-line files from standard input. However, you need to be a bit careful with the selection procedure to ensure a uniform chance of selection for every record. After a bit of hunting, I see some reasonable pseudo-code here (linked from a stackoverflow thread):

            http://blogs.msdn.com/b/spt/archive/...-sampling.aspx

            The modified pseudo-code for a fasta/fastq file using reservoir sampling would be as follows:
            1. Look for magic record start sequences. If found, increment the record counter and clear 'current record', otherwise print out the output array and finish
            2. If a fasta file, read through and store up to the next record start as the 'current record', then go to step 6
            3. If a fastq file, read up to the next '+' to work out sequence length (store in 'current record')
            4. read the next line and store in 'current record'
            5. read as many quality values as were in the sequence, store in 'current record'
            6. if record counter is less than desired number of records, add current record to end of output array, return to step 1
            7. generate a random number between 0 and the total number of records read
            8. if this number is less than the number of desired records, replace the array element indexed by that number by the current record, return to step 1

            The selection procedure here is (I think) the same as in the awk script that Richard Finney wrote / copied, which is (I think) the same as in the msdn pseudo-code.

            Comment


            • #21
              I had this problem too. I actually built a sub sampler that works on fragment and paired-end fastq (gzip'ed or not) data. Its written in Java and based on the Picard framework. https://github.com/golharam/FastqSubsample.

              Benefits of using this:
              1) No external Python packages to install. All you need is a JRE and the JAR file.
              2) Doesn't load everything into memory.

              Comment


              • #22
                Originally posted by gringer View Post
                A few warnings about this Python code:
                • ASCII quality scores in fastq files can contain arbitrary characters like '>' and '@'. The only way to protect against this for multi-line fastq files is to keep a record of the sequence length.
                • This code reads the entire set of reads into memory, then outputs random sequences -- this will not work on fasta/fastq files that are larger than the memory limit. While this is the only way to do it for streamed input, the code here uses file names, so can do a two-pass operation and reopen a file (see my previous post).
                This revised version addresses both these issues:
                Code:
                import random
                import glob
                inFiles = glob.glob('*.fasta') + glob.glob('*.fastq')
                outFiles = []
                num = int(raw_input("Enter number of random sequences to select:\n"))
                
                
                for i in range(len(inFiles)):
                        bookMarks = []
                        fNames = []
                        fSeqs = []
                        start = 0
                        BkMk = 0                                             
                        outFiles.append(file(str(inFiles[i])+'_'+'Rand_'+str(num)+'.fa', 'wt'))
                
                        for line in open(inFiles[i]):
                                if ((line[0] == '>') or (line[0] == '@')):
                                        start = 1        
                                        bookMarks.append(BkMk)
                                else:
                                        if (start == 1):
                                                if (((line.count('A') + line.count('a')+ line.count('C') + line.count('c') +
                                                line.count('G') + line.count('g') + line.count('T') + line.count('t')+
                                                line.count('N')+ line.count('n')+ line.count('\n')+ line.count('\r')+
                                                line.count('\t')) == len(line)) and ((len(line))>1)):
                                                        start = 0
                                                else:
                                                        bookMarks.pop()
                                                        start = 0         
                                BkMk+=1
                        size = (len(bookMarks))
                        print str(size) +" sequences identified in " + inFiles[i] + "\n"
                
                        randomMarks = range(size)
                        for j in range((len(bookMarks)-num-1)):
                                a = random.randint(0, (len(randomMarks)-1))
                                randomMarks.pop(a)
                
                        j = 0
                        k = 0
                        write = 0
                        for line in open(inFiles[i]):
                                if (write == 1):
                                        if ((randomMarks[k]+1) >= size):
                                                outFiles[i].write(line)
                                                break
                                        elif(j == (bookMarks[randomMarks[k]+1])):
                                               write = 0
                                               k+=1
                                               if (k >= num):
                                                        break
                                        else:
                                                outFiles[i].write(line)
                                if (j == bookMarks[randomMarks[k]]):
                                       write = 1
                                       outFiles[i].write(line)
                                j+=1
                raw_input("Done.")

                Comment


                • #23
                  It would have been less effort to reuse Biopython's FASTQ parser which already deals with the nasty line wrapping cases. e.g.

                  Comment


                  • #24
                    Thanks for posting this code. I really need it and I really hope it works?

                    Comment


                    • #25
                      Ad:random sampling in forward and reverse

                      Thanks for posting this code. I really need it and I really hope it works.

                      Comment


                      • #26
                        Thanks. This is what I'd need. Hope it works :-)

                        Comment


                        • #27
                          Originally posted by Richard Finney View Post
                          A simple script extracting random fastq records using reservoir sampling ...

                          # usage: program inputfastq numsubsamples outputsubampfqfile
                          # example: ./job.fqreservoir SRR451888.fq 1000 subsamp.fq

                          cat $1 | \
                          awk '{if ((p%4)<3) printf("%s\t",$0); else printf("%s\n",$0); p++;}' | \
                          awk -v k=$2 'BEGIN{srand(systime() + PROCINFO["pid"]);}{y=x++<k?x-1:int(rand()*x);if(y<k)a[y]=$0}END{for(z in a)print a[z]}' | \
                          tr '\t' '\n' > $3


                          See previous discussion here: http://seqanswers.com/forums/showthread.php?t=16845

                          Note the acutal line #3 is a hack of lh3's script WITH seeding, i.e. BEGIN{srand(systime() + PROCINFO["pid"]);}
                          Remove the seed code to get same results every time you run it. Otherwise leave it in to get different results.

                          This program joins the 4 fastq lines together into one line , spearating them with a tab character.
                          These combined lines are subsampled.
                          At the end, the "tr" command seperates the selected combined lines back into 4 lines.
                          This script used the UNIX/LINUX/BSD standard utilities "tr", "cat", and "awk".
                          Reservoir sampling! Thank you so much, this is exactly what I was looking for!

                          Comment


                          • #28
                            hii all
                            how can i use the above mentioned scripts to randomly extract a fixed % of output from .sff file, cause am working of a iontorrent denovo assembly with MIRA and i need to extract both the fastq as well as traceinfo.xml... (as mentioned by the sequence provider as well as Bastein).. thanx for any help in advance

                            Regards
                            Chayan

                            Comment


                            • #29
                              Originally posted by chayan View Post
                              hii all
                              how can i use the above mentioned scripts to randomly extract a fixed % of output from .sff file, cause am working of a iontorrent denovo assembly with MIRA and i need to extract both the fastq as well as traceinfo.xml... (as mentioned by the sequence provider as well as Bastein).. thanx for any help in advance

                              Regards
                              Chayan
                              SFF is a binary format with a non-trivial header. Unix text tools like grep are not going to be any help. I would suggest using the Roche 'off instrument' applications which include an SFF tool which can select random reads (Linux only, search for how to get Newbler on these forums for more information).

                              Alternatively, if you program in Python, this would be pretty trivial using the Biopython SFF support. http://biopython.org Likewise BioHaskell has SFF support and command line tools for working with these files. http://biohaskell.org

                              Comment


                              • #30
                                Thanx maubp, i am not sure but probably those Roche applications are not free to access..still i ll search those and surely ll give a try to your other suggestions..again thnx for your kind help

                                Regards

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Strategies for Sequencing Challenging Samples
                                  by seqadmin


                                  Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                  03-22-2024, 06:39 AM
                                • seqadmin
                                  Techniques and Challenges in Conservation Genomics
                                  by seqadmin



                                  The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                  Avian Conservation
                                  Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                  03-08-2024, 10:41 AM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, Yesterday, 06:37 PM
                                0 responses
                                8 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, Yesterday, 06:07 PM
                                0 responses
                                8 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 03-22-2024, 10:03 AM
                                0 responses
                                49 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 03-21-2024, 07:32 AM
                                0 responses
                                67 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X