Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • random subset paired-end fastq

    Hi,
    I found this page where they discussed different methods for selecting pairs of reads randomly from a set of fastq files.



    I have tried Aaron's and and Pierre's bash options but I run out of memory. I also tried Martin's R commands but I also run out of memory.

    I wanted to try Brent's python script but I cannot understand whether I have to use only the bit shown in the link or I have to insert this bit in the original script he mentions wrote for single-end reads (link in the same answer: https://github.com/brentp/bio-playgr...mples/bench.py)

    Can anyone help me with that?

    Cheers,

    Dave

  • #2
    This sounds like a job for HTSeq:

    Code:
    import sys, random
    import HTSeq
    
    fraction = float( sys.argv[1] )
    in1 = iter( HTSeq.FastqReader( sys.argv[2] ) )
    in2 = iter( HTSeq.FastqReader( sys.argv[3] ) )
    out1 = open( sys.argv[4], "w" )
    out2 = open( sys.argv[5], "w" )
    
    while True:
       read1 = next( in1 )
       read2 = next( in2 )
       if random.random() < fraction:
          read1.write_to_fastq_file( out1 )
          read2.write_to_fastq_file( out2 )
          
    out1.close()
    out2.close()
    Save this as subsample.py and call it as
    Code:
    python subsample.py <fraction> <input file 1> <input file 2> <output file 1> <output file 2>
    (where <fraction> is a number between 0 and 1, giving the sampling faction)

    Simon

    Comment


    • #3
      Hi Simon, thanks for your help. I run the script and it seemed to work, the new files are generated, but I get this stderr:

      Traceback (most recent call last):
      File "/home/david/pyscripts/fastqPairedSubsample.py", line 15, in <module>
      read1 = next( in1 )
      File "/usr/local/lib/python2.6/dist-packages/HTSeq/__init__.py", line 381, in __iter__
      id1 = fin.next()
      StopIteration

      Is that something I should worry about? What does it mean?

      Cheers,

      Dave

      Comment


      • #4
        Hi Dave

        should be fine, the error just indicates that it finished reading the input, and I forgot to catch it. Correction (untested, sorry):

        Code:
         
        import sys, random, itertools
        import HTSeq
        
        fraction = float( sys.argv[1] )
        in1 = iter( HTSeq.FastqReader( sys.argv[2] ) )
        in2 = iter( HTSeq.FastqReader( sys.argv[3] ) )
        out1 = open( sys.argv[4], "w" )
        out2 = open( sys.argv[5], "w" )
        
        for read1, read2 in itertools.izip( in1, in2 ):
           if random.random() < fraction:
              read1.write_to_fastq_file( out1 )
              read2.write_to_fastq_file( out2 )
              
        out1.close()
        out2.close()
        Last edited by Simon Anders; 06-16-2011, 07:09 AM. Reason: made code nicer

        Comment


        • #5
          Single-end version of random sampling?

          Hi Simon,

          Thanks a lot for presenting this python code for pair-end seq data random sampling. I am new to python and the code is very helpful.
          I was wondering for single-end data, can I also use similar function and script? I tried to change your code into a single-end version, but it got an error:
          Traceback (most recent call last):
          File "subsample_se.py", line 10, in <module>
          read1.write_to_fastq_file( out1 )
          AttributeError: 'tuple' object has no attribute 'write_to_fastq_file'

          Any idea about that? Thanks in advance!

          The code is:
          Code:
          import sys, random, itertools
          import HTSeq
          
          fraction = float( sys.argv[1] )
          in1 = iter( HTSeq.FastqReader( sys.argv[2] ) )
          out1 = open( sys.argv[3], "w" )
          
          for read1 in itertools.izip( in1 ):
             if random.random() < fraction:
                read1.write_to_fastq_file( out1 )
          
          out1.close()

          Originally posted by Simon Anders View Post
          Hi Dave

          should be fine, the error just indicates that it finished reading the input, and I forgot to catch it. Correction (untested, sorry):

          Code:
           
          import sys, random, itertools
          import HTSeq
          
          fraction = float( sys.argv[1] )
          in1 = iter( HTSeq.FastqReader( sys.argv[2] ) )
          in2 = iter( HTSeq.FastqReader( sys.argv[3] ) )
          out1 = open( sys.argv[4], "w" )
          out2 = open( sys.argv[5], "w" )
          
          for read1, read2 in itertools.izip( in1, in2 ):
             if random.random() < fraction:
                read1.write_to_fastq_file( out1 )
                read2.write_to_fastq_file( out2 )
                
          out1.close()
          out2.close()

          Comment


          • #6
            Try replacing this line

            Code:
            for read1 in itertools.izip( in1 ):
            with

            Code:
            for read1 in in1:
            and maybe have a look at the Python Tutorial; you'll see that it doesn't take that much time to learn Python. ;-)

            S

            Comment


            • #7
              Cool! Thanks a lot for your reply and suggestions! lol
              Will do!

              Originally posted by Simon Anders View Post
              Try replacing this line

              Code:
              for read1 in itertools.izip( in1 ):
              with

              Code:
              for read1 in in1:
              and maybe have a look at the Python Tutorial; you'll see that it doesn't take that much time to learn Python. ;-)

              S

              Comment


              • #8
                Originally posted by dnusol View Post
                Hi,
                I found this page where they discussed different methods for selecting pairs of reads randomly from a set of fastq files.


                I wanted to try Brent's python script but I cannot understand whether I have to use only the bit shown in the link or I have to insert this bit in the original script he mentions wrote for single-end reads (link in the same answer: https://github.com/brentp/bio-playgr...mples/bench.py)


                Dave
                Dave, you can use just the bit posted in that answer. You dont need the linked original script.

                Comment


                • #9
                  Hi Simon,

                  Thank you very much for a nice tool. Is there any possibility to modify the script? I will apreciate, if I can set number of reads in output rather than fraction of original file.

                  Best regards.
                  Jan

                  Comment


                  • #10
                    sampling fw and rev

                    Originally posted by Simon Anders View Post
                    Hi Dave

                    should be fine, the error just indicates that it finished reading the input, and I forgot to catch it. Correction (untested, sorry):

                    Code:
                     
                    import sys, random, itertools
                    import HTSeq
                    
                    fraction = float( sys.argv[1] )
                    in1 = iter( HTSeq.FastqReader( sys.argv[2] ) )
                    in2 = iter( HTSeq.FastqReader( sys.argv[3] ) )
                    out1 = open( sys.argv[4], "w" )
                    out2 = open( sys.argv[5], "w" )
                    
                    for read1, read2 in itertools.izip( in1, in2 ):
                       if random.random() < fraction:
                          read1.write_to_fastq_file( out1 )
                          read2.write_to_fastq_file( out2 )
                          
                    out1.close()
                    out2.close()

                    Another one from a biologist: Will this routing sample the pairs belonging to each other? As opposed to a random selection from each of the fw and rev files, I mean.

                    Comment


                    • #11
                      Of course. Otherwise, it would be a bit pointless.

                      Comment


                      • #12
                        Originally posted by Simon Anders View Post
                        This sounds like a job for HTSeq:

                        Code:
                        import sys, random
                        import HTSeq
                        
                        fraction = float( sys.argv[1] )
                        in1 = iter( HTSeq.FastqReader( sys.argv[2] ) )
                        in2 = iter( HTSeq.FastqReader( sys.argv[3] ) )
                        out1 = open( sys.argv[4], "w" )
                        out2 = open( sys.argv[5], "w" )
                        
                        while True:
                           read1 = next( in1 )
                           read2 = next( in2 )
                           if random.random() < fraction:
                              read1.write_to_fastq_file( out1 )
                              read2.write_to_fastq_file( out2 )
                              
                        out1.close()
                        out2.close()
                        Save this as subsample.py and call it as
                        Code:
                        python subsample.py <fraction> <input file 1> <input file 2> <output file 1> <output file 2>
                        (where <fraction> is a number between 0 and 1, giving the sampling faction)

                        Simon
                        Hi Simon,

                        When you say a number between 0 and 1, do you mean 0 and 100 instead, for fraction?

                        I am trying to extract a random sample of SE fastq files and am going to try your script.

                        thanks!

                        Comment


                        • #13
                          Originally posted by haripriya View Post
                          ]
                          When you say a number between 0 and 1, do you mean 0 and 100 instead, for fraction?
                          Sorry, I don't understand your question. Why should a fraction be between 0 and 100? Or do you mean a percentage?

                          If you want to sub-sample, e.g., a quarter of the reads, you use 0.25 as fraction.

                          Comment


                          • #14
                            Yes sorry my bad I was thinking about percentages.

                            Comment


                            • #15
                              This is still helpful !

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM
                              • seqadmin
                                Techniques and Challenges in Conservation Genomics
                                by seqadmin



                                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                Avian Conservation
                                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                03-08-2024, 10:41 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, Yesterday, 06:37 PM
                              0 responses
                              8 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, Yesterday, 06:07 PM
                              0 responses
                              8 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-22-2024, 10:03 AM
                              0 responses
                              49 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-21-2024, 07:32 AM
                              0 responses
                              66 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X