Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • swbarnes2
    Senior Member
    • May 2008
    • 910

    #16
    I think a simpler way to do it...just get all the reads from a particular set of tiles (I don't know how that translates on HiSeq machines) Since the first few and last few tiles are bad, pick some tiles in the middle, by grepping the read names, which have teh cluster cooridnates embedded in them. Or zgrep, if the fastqs are already zipped. That should be random with respect to where the read align to the genome, with better quality than picking the top and bottom tiles with head or foot.

    But yes, this won't work on multi-line fastqs.

    Comment

    • jmun7616
      Junior Member
      • Jun 2012
      • 4

      #17
      Hey, just notices this thread. This Python solution may be a bit late but it will work on both multi-line fasta and fastq files. Set up is such that all files in same dir will be subsampled, but this can be changed. Also three subsamplings taken, but you can change this easily enough or just use one of them.

      Code:
      import random
      import glob
      inFiles = glob.glob('*.fasta') + glob.glob('*.fastq')
      outFiles = []
      num = int(raw_input("Enter number of random sequences to select:\n"))
      
      for i in range(len(inFiles)):
      
              for k in range(3):
                      fNames = []
                      fSeqs = []
                      same = 0
                      outFiles.append(file(str(inFiles[i])+'_'+'Rand_'+str(num)+'-'+str(k+1)+'.fa', 'wt'))
                      for line in open(inFiles[i]):
                              if ((line[0] == '>') or (line[0] == '@')):
                                      fNames.append(line)
                                      same = 0
                              else:
                                      if (same != 1):
                                              fSeqs.append(line)
                                              same = 1
                                      else:
                                              fSeqs[(len(fSeqs)-1)] += line
                      curr = (len(outFiles)-1)
                      for j in range(num):
                              a = random.randint(0, (len(fNames)-1))
                              outFiles[curr].write(fNames.pop(a))
                              outFiles[curr].write(fSeqs.pop(a))
      raw_input("Done.")

      Comment

      • gringer
        David Eccles (gringer)
        • May 2011
        • 845

        #18
        This Python solution may be a bit late but it will work on both multi-line fasta and fastq files.
        A few warnings about this Python code:
        • ASCII quality scores in fastq files can contain arbitrary characters like '>' and '@'. The only way to protect against this for multi-line fastq files is to keep a record of the sequence length.
        • This code reads the entire set of reads into memory, then outputs random sequences -- this will not work on fasta/fastq files that are larger than the memory limit. While this is the only way to do it for streamed input, the code here uses file names, so can do a two-pass operation and reopen a file (see my previous post).

        Comment

        • Richard Finney
          Senior Member
          • Feb 2009
          • 701

          #19
          A simple script extracting random fastq records using reservoir sampling ...

          # usage: program inputfastq numsubsamples outputsubampfqfile
          # example: ./job.fqreservoir SRR451888.fq 1000 subsamp.fq

          cat $1 | \
          awk '{if ((p%4)<3) printf("%s\t",$0); else printf("%s\n",$0); p++;}' | \
          awk -v k=$2 'BEGIN{srand(systime() + PROCINFO["pid"]);}{y=x++<k?x-1:int(rand()*x);if(y<k)a[y]=$0}END{for(z in a)print a[z]}' | \
          tr '\t' '\n' > $3


          See previous discussion here: http://seqanswers.com/forums/showthread.php?t=16845

          Note the acutal line #3 is a hack of lh3's script WITH seeding, i.e. BEGIN{srand(systime() + PROCINFO["pid"]);}
          Remove the seed code to get same results every time you run it. Otherwise leave it in to get different results.

          This program joins the 4 fastq lines together into one line , spearating them with a tab character.
          These combined lines are subsampled.
          At the end, the "tr" command seperates the selected combined lines back into 4 lines.
          This script used the UNIX/LINUX/BSD standard utilities "tr", "cat", and "awk".
          Last edited by Richard Finney; 06-17-2012, 08:39 AM.

          Comment

          • gringer
            David Eccles (gringer)
            • May 2011
            • 845

            #20
            This program joins the 4 fastq lines together into one line , spearating them with a tab character.
            That particular code won't work with multi-line fasta/fastq files (the linked subsampler.py script has the same issue, as well as pre-reading all the lines to determine the total number of lines, and some modulus code that may influence probabilities in a bad way), but the reservoir sampling method used would provide a way to process files from standard input in a single pass, and use multi-line files from standard input. However, you need to be a bit careful with the selection procedure to ensure a uniform chance of selection for every record. After a bit of hunting, I see some reasonable pseudo-code here (linked from a stackoverflow thread):



            The modified pseudo-code for a fasta/fastq file using reservoir sampling would be as follows:
            1. Look for magic record start sequences. If found, increment the record counter and clear 'current record', otherwise print out the output array and finish
            2. If a fasta file, read through and store up to the next record start as the 'current record', then go to step 6
            3. If a fastq file, read up to the next '+' to work out sequence length (store in 'current record')
            4. read the next line and store in 'current record'
            5. read as many quality values as were in the sequence, store in 'current record'
            6. if record counter is less than desired number of records, add current record to end of output array, return to step 1
            7. generate a random number between 0 and the total number of records read
            8. if this number is less than the number of desired records, replace the array element indexed by that number by the current record, return to step 1

            The selection procedure here is (I think) the same as in the awk script that Richard Finney wrote / copied, which is (I think) the same as in the msdn pseudo-code.

            Comment

            • golharam
              Member
              • Dec 2009
              • 55

              #21
              I had this problem too. I actually built a sub sampler that works on fragment and paired-end fastq (gzip'ed or not) data. Its written in Java and based on the Picard framework. https://github.com/golharam/FastqSubsample.

              Benefits of using this:
              1) No external Python packages to install. All you need is a JRE and the JAR file.
              2) Doesn't load everything into memory.

              Comment

              • jmun7616
                Junior Member
                • Jun 2012
                • 4

                #22
                Originally posted by gringer View Post
                A few warnings about this Python code:
                • ASCII quality scores in fastq files can contain arbitrary characters like '>' and '@'. The only way to protect against this for multi-line fastq files is to keep a record of the sequence length.
                • This code reads the entire set of reads into memory, then outputs random sequences -- this will not work on fasta/fastq files that are larger than the memory limit. While this is the only way to do it for streamed input, the code here uses file names, so can do a two-pass operation and reopen a file (see my previous post).
                This revised version addresses both these issues:
                Code:
                import random
                import glob
                inFiles = glob.glob('*.fasta') + glob.glob('*.fastq')
                outFiles = []
                num = int(raw_input("Enter number of random sequences to select:\n"))
                
                
                for i in range(len(inFiles)):
                        bookMarks = []
                        fNames = []
                        fSeqs = []
                        start = 0
                        BkMk = 0                                             
                        outFiles.append(file(str(inFiles[i])+'_'+'Rand_'+str(num)+'.fa', 'wt'))
                
                        for line in open(inFiles[i]):
                                if ((line[0] == '>') or (line[0] == '@')):
                                        start = 1        
                                        bookMarks.append(BkMk)
                                else:
                                        if (start == 1):
                                                if (((line.count('A') + line.count('a')+ line.count('C') + line.count('c') +
                                                line.count('G') + line.count('g') + line.count('T') + line.count('t')+
                                                line.count('N')+ line.count('n')+ line.count('\n')+ line.count('\r')+
                                                line.count('\t')) == len(line)) and ((len(line))>1)):
                                                        start = 0
                                                else:
                                                        bookMarks.pop()
                                                        start = 0         
                                BkMk+=1
                        size = (len(bookMarks))
                        print str(size) +" sequences identified in " + inFiles[i] + "\n"
                
                        randomMarks = range(size)
                        for j in range((len(bookMarks)-num-1)):
                                a = random.randint(0, (len(randomMarks)-1))
                                randomMarks.pop(a)
                
                        j = 0
                        k = 0
                        write = 0
                        for line in open(inFiles[i]):
                                if (write == 1):
                                        if ((randomMarks[k]+1) >= size):
                                                outFiles[i].write(line)
                                                break
                                        elif(j == (bookMarks[randomMarks[k]+1])):
                                               write = 0
                                               k+=1
                                               if (k >= num):
                                                        break
                                        else:
                                                outFiles[i].write(line)
                                if (j == bookMarks[randomMarks[k]]):
                                       write = 1
                                       outFiles[i].write(line)
                                j+=1
                raw_input("Done.")

                Comment

                • maubp
                  Peter (Biopython etc)
                  • Jul 2009
                  • 1544

                  #23
                  It would have been less effort to reuse Biopython's FASTQ parser which already deals with the nasty line wrapping cases. e.g.

                  Comment

                  • JahnDavik
                    Junior Member
                    • Aug 2012
                    • 8

                    #24
                    Thanks for posting this code. I really need it and I really hope it works?

                    Comment

                    • JahnDavik
                      Junior Member
                      • Aug 2012
                      • 8

                      #25
                      Ad:random sampling in forward and reverse

                      Thanks for posting this code. I really need it and I really hope it works.

                      Comment

                      • JahnDavik
                        Junior Member
                        • Aug 2012
                        • 8

                        #26
                        Thanks. This is what I'd need. Hope it works :-)

                        Comment

                        • syfo
                          Just a member
                          • Nov 2012
                          • 103

                          #27
                          Originally posted by Richard Finney View Post
                          A simple script extracting random fastq records using reservoir sampling ...

                          # usage: program inputfastq numsubsamples outputsubampfqfile
                          # example: ./job.fqreservoir SRR451888.fq 1000 subsamp.fq

                          cat $1 | \
                          awk '{if ((p%4)<3) printf("%s\t",$0); else printf("%s\n",$0); p++;}' | \
                          awk -v k=$2 'BEGIN{srand(systime() + PROCINFO["pid"]);}{y=x++<k?x-1:int(rand()*x);if(y<k)a[y]=$0}END{for(z in a)print a[z]}' | \
                          tr '\t' '\n' > $3


                          See previous discussion here: http://seqanswers.com/forums/showthread.php?t=16845

                          Note the acutal line #3 is a hack of lh3's script WITH seeding, i.e. BEGIN{srand(systime() + PROCINFO["pid"]);}
                          Remove the seed code to get same results every time you run it. Otherwise leave it in to get different results.

                          This program joins the 4 fastq lines together into one line , spearating them with a tab character.
                          These combined lines are subsampled.
                          At the end, the "tr" command seperates the selected combined lines back into 4 lines.
                          This script used the UNIX/LINUX/BSD standard utilities "tr", "cat", and "awk".
                          Reservoir sampling! Thank you so much, this is exactly what I was looking for!

                          Comment

                          • chayan
                            Member
                            • Nov 2012
                            • 52

                            #28
                            hii all
                            how can i use the above mentioned scripts to randomly extract a fixed % of output from .sff file, cause am working of a iontorrent denovo assembly with MIRA and i need to extract both the fastq as well as traceinfo.xml... (as mentioned by the sequence provider as well as Bastein).. thanx for any help in advance

                            Regards
                            Chayan

                            Comment

                            • maubp
                              Peter (Biopython etc)
                              • Jul 2009
                              • 1544

                              #29
                              Originally posted by chayan View Post
                              hii all
                              how can i use the above mentioned scripts to randomly extract a fixed % of output from .sff file, cause am working of a iontorrent denovo assembly with MIRA and i need to extract both the fastq as well as traceinfo.xml... (as mentioned by the sequence provider as well as Bastein).. thanx for any help in advance

                              Regards
                              Chayan
                              SFF is a binary format with a non-trivial header. Unix text tools like grep are not going to be any help. I would suggest using the Roche 'off instrument' applications which include an SFF tool which can select random reads (Linux only, search for how to get Newbler on these forums for more information).

                              Alternatively, if you program in Python, this would be pretty trivial using the Biopython SFF support. http://biopython.org Likewise BioHaskell has SFF support and command line tools for working with these files. http://biohaskell.org

                              Comment

                              • chayan
                                Member
                                • Nov 2012
                                • 52

                                #30
                                Thanx maubp, i am not sure but probably those Roche applications are not free to access..still i ll search those and surely ll give a try to your other suggestions..again thnx for your kind help

                                Regards

                                Comment

                                Latest Articles

                                Collapse

                                • SEQadmin2
                                  From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                                  by SEQadmin2


                                  Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                                  The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                                  ...
                                  06-02-2026, 10:05 AM
                                • SEQadmin2
                                  Single-Cell Sequencing at an Inflection Point: Early Impacts of New Platforms and Emerging Trends
                                  by SEQadmin2


                                  With the launch of new single-cell sequencing platforms in 2026, the field stands at an exciting inflection point. This article surveys the most impactful advances in the field and discusses how they’re reshaping research in cancer, immunology, and beyond.


                                  Introduction

                                  Single-cell sequencing technologies have undergone remarkable advances over the past decade, transitioning from low-throughput experimental approaches to highly scalable platforms capable of...
                                  05-22-2026, 06:42 AM
                                • SEQadmin2
                                  Environmental Genomics in the Age of NGS: From Microbes to Conservation Strategies
                                  by SEQadmin2

                                  Studying ecosystems means dealing with complex, multi-species communities that are hard to observe at scale. This complexity, however, hides many important questions to be answered, from how biogeochemical cycles work and how climate change can affect species distribution to how conservation strategies can work best.


                                  Genomics, particularly since the expansion of NGS, has transformed ecosystem ecology. By sequencing environmental DNA, we can now assess biodiversity without direct...
                                  05-06-2026, 09:04 AM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by SEQadmin2, Today, 08:59 AM
                                0 responses
                                9 views
                                0 reactions
                                Last Post SEQadmin2  
                                Started by SEQadmin2, 06-02-2026, 12:03 PM
                                0 responses
                                21 views
                                0 reactions
                                Last Post SEQadmin2  
                                Started by SEQadmin2, 06-02-2026, 11:40 AM
                                0 responses
                                17 views
                                0 reactions
                                Last Post SEQadmin2  
                                Started by SEQadmin2, 05-28-2026, 11:40 AM
                                0 responses
                                30 views
                                0 reactions
                                Last Post SEQadmin2  
                                Working...