Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • dnusol
    Senior Member
    • Jul 2009
    • 136

    random subset paired-end fastq

    Hi,
    I found this page where they discussed different methods for selecting pairs of reads randomly from a set of fastq files.



    I have tried Aaron's and and Pierre's bash options but I run out of memory. I also tried Martin's R commands but I also run out of memory.

    I wanted to try Brent's python script but I cannot understand whether I have to use only the bit shown in the link or I have to insert this bit in the original script he mentions wrote for single-end reads (link in the same answer: https://github.com/brentp/bio-playgr...mples/bench.py)

    Can anyone help me with that?

    Cheers,

    Dave
  • Simon Anders
    Senior Member
    • Feb 2010
    • 995

    #2
    This sounds like a job for HTSeq:

    Code:
    import sys, random
    import HTSeq
    
    fraction = float( sys.argv[1] )
    in1 = iter( HTSeq.FastqReader( sys.argv[2] ) )
    in2 = iter( HTSeq.FastqReader( sys.argv[3] ) )
    out1 = open( sys.argv[4], "w" )
    out2 = open( sys.argv[5], "w" )
    
    while True:
       read1 = next( in1 )
       read2 = next( in2 )
       if random.random() < fraction:
          read1.write_to_fastq_file( out1 )
          read2.write_to_fastq_file( out2 )
          
    out1.close()
    out2.close()
    Save this as subsample.py and call it as
    Code:
    python subsample.py <fraction> <input file 1> <input file 2> <output file 1> <output file 2>
    (where <fraction> is a number between 0 and 1, giving the sampling faction)

    Simon

    Comment

    • dnusol
      Senior Member
      • Jul 2009
      • 136

      #3
      Hi Simon, thanks for your help. I run the script and it seemed to work, the new files are generated, but I get this stderr:

      Traceback (most recent call last):
      File "/home/david/pyscripts/fastqPairedSubsample.py", line 15, in <module>
      read1 = next( in1 )
      File "/usr/local/lib/python2.6/dist-packages/HTSeq/__init__.py", line 381, in __iter__
      id1 = fin.next()
      StopIteration

      Is that something I should worry about? What does it mean?

      Cheers,

      Dave

      Comment

      • Simon Anders
        Senior Member
        • Feb 2010
        • 995

        #4
        Hi Dave

        should be fine, the error just indicates that it finished reading the input, and I forgot to catch it. Correction (untested, sorry):

        Code:
         
        import sys, random, itertools
        import HTSeq
        
        fraction = float( sys.argv[1] )
        in1 = iter( HTSeq.FastqReader( sys.argv[2] ) )
        in2 = iter( HTSeq.FastqReader( sys.argv[3] ) )
        out1 = open( sys.argv[4], "w" )
        out2 = open( sys.argv[5], "w" )
        
        for read1, read2 in itertools.izip( in1, in2 ):
           if random.random() < fraction:
              read1.write_to_fastq_file( out1 )
              read2.write_to_fastq_file( out2 )
              
        out1.close()
        out2.close()
        Last edited by Simon Anders; 06-16-2011, 07:09 AM. Reason: made code nicer

        Comment

        • sheng
          Junior Member
          • Apr 2011
          • 5

          #5
          Single-end version of random sampling?

          Hi Simon,

          Thanks a lot for presenting this python code for pair-end seq data random sampling. I am new to python and the code is very helpful.
          I was wondering for single-end data, can I also use similar function and script? I tried to change your code into a single-end version, but it got an error:
          Traceback (most recent call last):
          File "subsample_se.py", line 10, in <module>
          read1.write_to_fastq_file( out1 )
          AttributeError: 'tuple' object has no attribute 'write_to_fastq_file'

          Any idea about that? Thanks in advance!

          The code is:
          Code:
          import sys, random, itertools
          import HTSeq
          
          fraction = float( sys.argv[1] )
          in1 = iter( HTSeq.FastqReader( sys.argv[2] ) )
          out1 = open( sys.argv[3], "w" )
          
          for read1 in itertools.izip( in1 ):
             if random.random() < fraction:
                read1.write_to_fastq_file( out1 )
          
          out1.close()

          Originally posted by Simon Anders View Post
          Hi Dave

          should be fine, the error just indicates that it finished reading the input, and I forgot to catch it. Correction (untested, sorry):

          Code:
           
          import sys, random, itertools
          import HTSeq
          
          fraction = float( sys.argv[1] )
          in1 = iter( HTSeq.FastqReader( sys.argv[2] ) )
          in2 = iter( HTSeq.FastqReader( sys.argv[3] ) )
          out1 = open( sys.argv[4], "w" )
          out2 = open( sys.argv[5], "w" )
          
          for read1, read2 in itertools.izip( in1, in2 ):
             if random.random() < fraction:
                read1.write_to_fastq_file( out1 )
                read2.write_to_fastq_file( out2 )
                
          out1.close()
          out2.close()

          Comment

          • Simon Anders
            Senior Member
            • Feb 2010
            • 995

            #6
            Try replacing this line

            Code:
            for read1 in itertools.izip( in1 ):
            with

            Code:
            for read1 in in1:
            and maybe have a look at the Python Tutorial; you'll see that it doesn't take that much time to learn Python. ;-)

            S

            Comment

            • sheng
              Junior Member
              • Apr 2011
              • 5

              #7
              Cool! Thanks a lot for your reply and suggestions! lol
              Will do!

              Originally posted by Simon Anders View Post
              Try replacing this line

              Code:
              for read1 in itertools.izip( in1 ):
              with

              Code:
              for read1 in in1:
              and maybe have a look at the Python Tutorial; you'll see that it doesn't take that much time to learn Python. ;-)

              S

              Comment

              • brentp
                Member
                • Apr 2010
                • 72

                #8
                Originally posted by dnusol View Post
                Hi,
                I found this page where they discussed different methods for selecting pairs of reads randomly from a set of fastq files.


                I wanted to try Brent's python script but I cannot understand whether I have to use only the bit shown in the link or I have to insert this bit in the original script he mentions wrote for single-end reads (link in the same answer: https://github.com/brentp/bio-playgr...mples/bench.py)


                Dave
                Dave, you can use just the bit posted in that answer. You dont need the linked original script.

                Comment

                • jbar
                  Junior Member
                  • Jun 2011
                  • 1

                  #9
                  Hi Simon,

                  Thank you very much for a nice tool. Is there any possibility to modify the script? I will apreciate, if I can set number of reads in output rather than fraction of original file.

                  Best regards.
                  Jan

                  Comment

                  • JahnDavik
                    Junior Member
                    • Aug 2012
                    • 8

                    #10
                    sampling fw and rev

                    Originally posted by Simon Anders View Post
                    Hi Dave

                    should be fine, the error just indicates that it finished reading the input, and I forgot to catch it. Correction (untested, sorry):

                    Code:
                     
                    import sys, random, itertools
                    import HTSeq
                    
                    fraction = float( sys.argv[1] )
                    in1 = iter( HTSeq.FastqReader( sys.argv[2] ) )
                    in2 = iter( HTSeq.FastqReader( sys.argv[3] ) )
                    out1 = open( sys.argv[4], "w" )
                    out2 = open( sys.argv[5], "w" )
                    
                    for read1, read2 in itertools.izip( in1, in2 ):
                       if random.random() < fraction:
                          read1.write_to_fastq_file( out1 )
                          read2.write_to_fastq_file( out2 )
                          
                    out1.close()
                    out2.close()

                    Another one from a biologist: Will this routing sample the pairs belonging to each other? As opposed to a random selection from each of the fw and rev files, I mean.

                    Comment

                    • Simon Anders
                      Senior Member
                      • Feb 2010
                      • 995

                      #11
                      Of course. Otherwise, it would be a bit pointless.

                      Comment

                      • haripriya
                        Junior Member
                        • Jun 2013
                        • 2

                        #12
                        Originally posted by Simon Anders View Post
                        This sounds like a job for HTSeq:

                        Code:
                        import sys, random
                        import HTSeq
                        
                        fraction = float( sys.argv[1] )
                        in1 = iter( HTSeq.FastqReader( sys.argv[2] ) )
                        in2 = iter( HTSeq.FastqReader( sys.argv[3] ) )
                        out1 = open( sys.argv[4], "w" )
                        out2 = open( sys.argv[5], "w" )
                        
                        while True:
                           read1 = next( in1 )
                           read2 = next( in2 )
                           if random.random() < fraction:
                              read1.write_to_fastq_file( out1 )
                              read2.write_to_fastq_file( out2 )
                              
                        out1.close()
                        out2.close()
                        Save this as subsample.py and call it as
                        Code:
                        python subsample.py <fraction> <input file 1> <input file 2> <output file 1> <output file 2>
                        (where <fraction> is a number between 0 and 1, giving the sampling faction)

                        Simon
                        Hi Simon,

                        When you say a number between 0 and 1, do you mean 0 and 100 instead, for fraction?

                        I am trying to extract a random sample of SE fastq files and am going to try your script.

                        thanks!

                        Comment

                        • Simon Anders
                          Senior Member
                          • Feb 2010
                          • 995

                          #13
                          Originally posted by haripriya View Post
                          ]
                          When you say a number between 0 and 1, do you mean 0 and 100 instead, for fraction?
                          Sorry, I don't understand your question. Why should a fraction be between 0 and 100? Or do you mean a percentage?

                          If you want to sub-sample, e.g., a quarter of the reads, you use 0.25 as fraction.

                          Comment

                          • haripriya
                            Junior Member
                            • Jun 2013
                            • 2

                            #14
                            Yes sorry my bad I was thinking about percentages.

                            Comment

                            • everestial
                              Member
                              • Feb 2015
                              • 31

                              #15
                              This is still helpful !

                              Comment

                              Latest Articles

                              Collapse

                              • SEQadmin2
                                From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                                by SEQadmin2


                                Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                                The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                                ...
                                06-02-2026, 10:05 AM
                              • SEQadmin2
                                Single-Cell Sequencing at an Inflection Point: Early Impacts of New Platforms and Emerging Trends
                                by SEQadmin2


                                With the launch of new single-cell sequencing platforms in 2026, the field stands at an exciting inflection point. This article surveys the most impactful advances in the field and discusses how they’re reshaping research in cancer, immunology, and beyond.


                                Introduction

                                Single-cell sequencing technologies have undergone remarkable advances over the past decade, transitioning from low-throughput experimental approaches to highly scalable platforms capable of...
                                05-22-2026, 06:42 AM
                              • SEQadmin2
                                Environmental Genomics in the Age of NGS: From Microbes to Conservation Strategies
                                by SEQadmin2

                                Studying ecosystems means dealing with complex, multi-species communities that are hard to observe at scale. This complexity, however, hides many important questions to be answered, from how biogeochemical cycles work and how climate change can affect species distribution to how conservation strategies can work best.


                                Genomics, particularly since the expansion of NGS, has transformed ecosystem ecology. By sequencing environmental DNA, we can now assess biodiversity without direct...
                                05-06-2026, 09:04 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by SEQadmin2, 06-02-2026, 12:03 PM
                              0 responses
                              19 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-02-2026, 11:40 AM
                              0 responses
                              14 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 05-28-2026, 11:40 AM
                              0 responses
                              29 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 05-26-2026, 10:12 AM
                              0 responses
                              31 views
                              0 reactions
                              Last Post SEQadmin2  
                              Working...