Announcement

Collapse
No announcement yet.

random subset paired-end fastq

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • random subset paired-end fastq

    Hi,
    I found this page where they discussed different methods for selecting pairs of reads randomly from a set of fastq files.

    http://biostar.stackexchange.com/que...irs-from-fastq

    I have tried Aaron's and and Pierre's bash options but I run out of memory. I also tried Martin's R commands but I also run out of memory.

    I wanted to try Brent's python script but I cannot understand whether I have to use only the bit shown in the link or I have to insert this bit in the original script he mentions wrote for single-end reads (link in the same answer: https://github.com/brentp/bio-playgr...mples/bench.py)

    Can anyone help me with that?

    Cheers,

    Dave

  • #2
    This sounds like a job for HTSeq:

    Code:
    import sys, random
    import HTSeq
    
    fraction = float( sys.argv[1] )
    in1 = iter( HTSeq.FastqReader( sys.argv[2] ) )
    in2 = iter( HTSeq.FastqReader( sys.argv[3] ) )
    out1 = open( sys.argv[4], "w" )
    out2 = open( sys.argv[5], "w" )
    
    while True:
       read1 = next( in1 )
       read2 = next( in2 )
       if random.random() < fraction:
          read1.write_to_fastq_file( out1 )
          read2.write_to_fastq_file( out2 )
          
    out1.close()
    out2.close()
    Save this as subsample.py and call it as
    Code:
    python subsample.py <fraction> <input file 1> <input file 2> <output file 1> <output file 2>
    (where <fraction> is a number between 0 and 1, giving the sampling faction)

    Simon

    Comment


    • #3
      Hi Simon, thanks for your help. I run the script and it seemed to work, the new files are generated, but I get this stderr:

      Traceback (most recent call last):
      File "/home/david/pyscripts/fastqPairedSubsample.py", line 15, in <module>
      read1 = next( in1 )
      File "/usr/local/lib/python2.6/dist-packages/HTSeq/__init__.py", line 381, in __iter__
      id1 = fin.next()
      StopIteration

      Is that something I should worry about? What does it mean?

      Cheers,

      Dave

      Comment


      • #4
        Hi Dave

        should be fine, the error just indicates that it finished reading the input, and I forgot to catch it. Correction (untested, sorry):

        Code:
         
        import sys, random, itertools
        import HTSeq
        
        fraction = float( sys.argv[1] )
        in1 = iter( HTSeq.FastqReader( sys.argv[2] ) )
        in2 = iter( HTSeq.FastqReader( sys.argv[3] ) )
        out1 = open( sys.argv[4], "w" )
        out2 = open( sys.argv[5], "w" )
        
        for read1, read2 in itertools.izip( in1, in2 ):
           if random.random() < fraction:
              read1.write_to_fastq_file( out1 )
              read2.write_to_fastq_file( out2 )
              
        out1.close()
        out2.close()
        Last edited by Simon Anders; 06-16-2011, 07:09 AM. Reason: made code nicer

        Comment


        • #5
          Single-end version of random sampling?

          Hi Simon,

          Thanks a lot for presenting this python code for pair-end seq data random sampling. I am new to python and the code is very helpful.
          I was wondering for single-end data, can I also use similar function and script? I tried to change your code into a single-end version, but it got an error:
          Traceback (most recent call last):
          File "subsample_se.py", line 10, in <module>
          read1.write_to_fastq_file( out1 )
          AttributeError: 'tuple' object has no attribute 'write_to_fastq_file'

          Any idea about that? Thanks in advance!

          The code is:
          Code:
          import sys, random, itertools
          import HTSeq
          
          fraction = float( sys.argv[1] )
          in1 = iter( HTSeq.FastqReader( sys.argv[2] ) )
          out1 = open( sys.argv[3], "w" )
          
          for read1 in itertools.izip( in1 ):
             if random.random() < fraction:
                read1.write_to_fastq_file( out1 )
          
          out1.close()

          Originally posted by Simon Anders View Post
          Hi Dave

          should be fine, the error just indicates that it finished reading the input, and I forgot to catch it. Correction (untested, sorry):

          Code:
           
          import sys, random, itertools
          import HTSeq
          
          fraction = float( sys.argv[1] )
          in1 = iter( HTSeq.FastqReader( sys.argv[2] ) )
          in2 = iter( HTSeq.FastqReader( sys.argv[3] ) )
          out1 = open( sys.argv[4], "w" )
          out2 = open( sys.argv[5], "w" )
          
          for read1, read2 in itertools.izip( in1, in2 ):
             if random.random() < fraction:
                read1.write_to_fastq_file( out1 )
                read2.write_to_fastq_file( out2 )
                
          out1.close()
          out2.close()

          Comment


          • #6
            Try replacing this line

            Code:
            for read1 in itertools.izip( in1 ):
            with

            Code:
            for read1 in in1:
            and maybe have a look at the Python Tutorial; you'll see that it doesn't take that much time to learn Python. ;-)

            S

            Comment


            • #7
              Cool! Thanks a lot for your reply and suggestions! lol
              Will do!

              Originally posted by Simon Anders View Post
              Try replacing this line

              Code:
              for read1 in itertools.izip( in1 ):
              with

              Code:
              for read1 in in1:
              and maybe have a look at the Python Tutorial; you'll see that it doesn't take that much time to learn Python. ;-)

              S

              Comment


              • #8
                Originally posted by dnusol View Post
                Hi,
                I found this page where they discussed different methods for selecting pairs of reads randomly from a set of fastq files.

                http://biostar.stackexchange.com/que...irs-from-fastq
                I wanted to try Brent's python script but I cannot understand whether I have to use only the bit shown in the link or I have to insert this bit in the original script he mentions wrote for single-end reads (link in the same answer: https://github.com/brentp/bio-playgr...mples/bench.py)


                Dave
                Dave, you can use just the bit posted in that answer. You dont need the linked original script.

                Comment


                • #9
                  Hi Simon,

                  Thank you very much for a nice tool. Is there any possibility to modify the script? I will apreciate, if I can set number of reads in output rather than fraction of original file.

                  Best regards.
                  Jan

                  Comment


                  • #10
                    sampling fw and rev

                    Originally posted by Simon Anders View Post
                    Hi Dave

                    should be fine, the error just indicates that it finished reading the input, and I forgot to catch it. Correction (untested, sorry):

                    Code:
                     
                    import sys, random, itertools
                    import HTSeq
                    
                    fraction = float( sys.argv[1] )
                    in1 = iter( HTSeq.FastqReader( sys.argv[2] ) )
                    in2 = iter( HTSeq.FastqReader( sys.argv[3] ) )
                    out1 = open( sys.argv[4], "w" )
                    out2 = open( sys.argv[5], "w" )
                    
                    for read1, read2 in itertools.izip( in1, in2 ):
                       if random.random() < fraction:
                          read1.write_to_fastq_file( out1 )
                          read2.write_to_fastq_file( out2 )
                          
                    out1.close()
                    out2.close()

                    Another one from a biologist: Will this routing sample the pairs belonging to each other? As opposed to a random selection from each of the fw and rev files, I mean.

                    Comment


                    • #11
                      Of course. Otherwise, it would be a bit pointless.

                      Comment


                      • #12
                        Originally posted by Simon Anders View Post
                        This sounds like a job for HTSeq:

                        Code:
                        import sys, random
                        import HTSeq
                        
                        fraction = float( sys.argv[1] )
                        in1 = iter( HTSeq.FastqReader( sys.argv[2] ) )
                        in2 = iter( HTSeq.FastqReader( sys.argv[3] ) )
                        out1 = open( sys.argv[4], "w" )
                        out2 = open( sys.argv[5], "w" )
                        
                        while True:
                           read1 = next( in1 )
                           read2 = next( in2 )
                           if random.random() < fraction:
                              read1.write_to_fastq_file( out1 )
                              read2.write_to_fastq_file( out2 )
                              
                        out1.close()
                        out2.close()
                        Save this as subsample.py and call it as
                        Code:
                        python subsample.py <fraction> <input file 1> <input file 2> <output file 1> <output file 2>
                        (where <fraction> is a number between 0 and 1, giving the sampling faction)

                        Simon
                        Hi Simon,

                        When you say a number between 0 and 1, do you mean 0 and 100 instead, for fraction?

                        I am trying to extract a random sample of SE fastq files and am going to try your script.

                        thanks!

                        Comment


                        • #13
                          Originally posted by haripriya View Post
                          ]
                          When you say a number between 0 and 1, do you mean 0 and 100 instead, for fraction?
                          Sorry, I don't understand your question. Why should a fraction be between 0 and 100? Or do you mean a percentage?

                          If you want to sub-sample, e.g., a quarter of the reads, you use 0.25 as fraction.

                          Comment


                          • #14
                            Yes sorry my bad I was thinking about percentages.

                            Comment


                            • #15
                              This is still helpful !

                              Comment

                              Working...
                              X