Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Interleaved mate pair fastq after quality filtering

    I am looking for a script that can produce an interleaved fastq file and a file of orphaned [single end] reads from two input fastq files. The two read pairs have been quality filtered using fastx-toolkit 'fastq_quality_filter', which breaks the exact pairing that previously existed.

    Has anyone come across a script to do this rather than just to interleave to perfectly paired sets of fastq files?

    This is a cross-post (apologies) but I wasn't sure where it fitted best.

  • #2
    I once wrote a script to do just that a while back using Biopython's Bio.SeqIO.index() function for random access to FASTQ files. This holds the offsets in memory so there is a limit on how many reads you can work with before having to store the index on disk, e.g. using SQLite.

    How many reads do you have?

    You'd also need to specify how to match up the read pairs, for example are you using the /1 and /2 suffix convention?

    Comment


    • #3
      #Reads

      I have about 250 millions reads. I concatenated the reads from all lanes into a single file but I could go back and work on the per-lane files. My machine has 96 GB RAM so it should be able to handle it I think.

      The pairs are specified by /1 /2 for the Illumina data I have and _F _R for the SOLiD data - but it's mainly the Illumina data I'm interested in doing this for.
      Last edited by natstreet; 07-28-2010, 11:41 PM.

      Comment


      • #4
        Just to be clear, your two input files will be for the forward and reverse reads?

        Comment


        • #5
          Yes. Ideally I want to input read1 and read2 (forward and reverse reads) that have each been pre-filtered using the fastx fastq_quality_filter tool and to then output an interleaved file of proper pairs and a file of the reads that have been orphaned by the quality filtering.

          Comment


          • #6
            Personally I would process the reads in per-lane batches, which will help greatly with reducing the scale of the problem (and let you spread the work over multiple machines or cores).

            I couldn't find the original script I used, so I rewrote one using a straightforward but fairly memory naive approach:
            Code:
            from Bio import SeqIO #Biopython 1.54 or later needed
            
            #######################################################
            #
            # Change the following settings to suit your needs
            #
            
            input_forward_filename = "SRR001666_1_rnd.fastq"
            input_reverse_filename = "SRR001666_2_rnd.fastq"
            
            output_pairs_filename = "out_interleaved_pairs.fastq"
            output_orphan_filename = "out_unpaired_orphans.fastq"
            
            f_suffix = "/1"
            r_suffix = "/2"
            
            #######################################################
            
            if f_suffix:
                f_suffix_crop = -len(f_suffix)
                def f_name(name):
                    """Remove the suffix from a forward read name."""
                    assert name.endswith(f_suffix), name
                    return name[:f_suffix_crop]
            else:
                #No suffix, don't need a function to fix the name
                f_name = None
            
            if r_suffix:
                r_suffix_crop = -len(r_suffix)
                def r_name(name):
                    """Remove the suffix from a reverse read name."""
                    assert name.endswith(r_suffix), name
                    return name[:r_suffix_crop]
            else:
                #No suffix, don't need a function to fix the name
                r_name = None
                
            print "Indexing forward file..."
            forward_dict = SeqIO.index(input_forward_filename, "fastq", key_function=f_name)
            
            print "Indexing reverse file..."
            reverse_dict = SeqIO.index(input_reverse_filename, "fastq", key_function=r_name)
            
            print "Ouputing pairs and forward orphans..."
            pair_handle = open(output_pairs_filename, "w")
            orphan_handle = open(output_orphan_filename, "w")
            for key in forward_dict:
                if key in reverse_dict:
                     pair_handle.write(forward_dict.get_raw(key))
                     pair_handle.write(reverse_dict.get_raw(key))
                else:
                     orphan_handle.write(forward_dict.get_raw(key))
            pair_handle.close()
            print "Ouputing reverse orphans..."
            for key in reverse_dict:
                if key not in forward_dict:
                     orphan_handle.write(reverse_dict.get_raw(key))
            orphan_handle.close()
            print "Done"
            Given you have a big memory machine, this may work fine as is.

            Alternatively, this second version should need much less memory but produces three files forward/reverse/orphaned, so you'd need to interleave the forward/reverse files afterwards. This makes three passes though the files and just keeps a list of IDs in memory (as a Python set for speed) rather what the first script does which holds a map of IDs to file offsets (as a Python dictionary). The records are output in the order found in the input files, so as long as the input files follow the same sorting, doing the interleaving step is trivial (and easy enough to add on in Python).

            Code:
            from Bio.SeqIO.QualityIO import FastqGeneralIterator #Biopython 1.51 or later
            
            ##########################################################
            #
            # Change the following settings to suit your needs
            #
            
            input_forward_filename = "SRR001666_1_rnd.fastq"
            input_reverse_filename = "SRR001666_2_rnd.fastq"
            
            output_paired_forward_filename = "out_forward_pairs.fastq"
            output_paired_reverse_filename = "out_reverse_pairs.fastq"
            output_orphan_filename = "out_unpaired_orphans.fastq"
            
            f_suffix = "/1"
            r_suffix = "/2"
            
            ##########################################################
            
            if f_suffix:
                f_suffix_crop = -len(f_suffix)
                def f_name(title):
                    """Remove the suffix from a forward read name."""
                    name = title.split()[0]
                    assert name.endswith(f_suffix), name
                    return name[:f_suffix_crop]
            else:
                def f_name(title):
                    return title.split()[0]
            
            if r_suffix:
                r_suffix_crop = -len(r_suffix)
                def r_name(title):
                    """Remove the suffix from a reverse read name."""
                    name = title.split()[0]
                    assert name.endswith(r_suffix), name
                    return name[:r_suffix_crop]
            else:
                def r_name(title):
                    return title.split()[0]
            
            print "Scaning reverse file to build list of names..."    
            reverse_ids = set()
            paired_ids = set()
            for title, seq, qual in FastqGeneralIterator(open(input_reverse_filename)):
                reverse_ids.add(r_name(title))
            
            print "Processing forward file..."
            forward_handle = open(output_paired_forward_filename, "w")
            orphan_handle = open(output_orphan_filename, "w")
            for title, seq, qual in FastqGeneralIterator(open(input_forward_filename)):
                name = f_name(title)
                if name in reverse_ids:
                    #Paired
                    paired_ids.add(name)
                    reverse_ids.remove(name) #frees a little memory
                    forward_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
                else:
                    #Orphan
                    orphan_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
            forward_handle.close()
            del reverse_ids #frees memory, although we won't need more now
            
            print "Processing reverse file..."
            reverse_handle = open(output_paired_reverse_filename, "w")
            for title, seq, qual in FastqGeneralIterator(open(input_reverse_filename)):
                name = r_name(title)
                if name in paired_ids:
                    #Paired
                    reverse_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
                else:
                    #Orphan
                    orphan_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
            orphan_handle.close()
            reverse_handle.close()
            print "Done"
            Last edited by maubp; 08-10-2010, 03:17 AM. Reason: Added second script; correction from hl450; show name in asserts

            Comment


            • #7
              That's brilliant thanks! It's also great to have some example python scripts to help me learn from

              I really appreciate this and hopefully others will also.

              Comment


              • #8
                wow I think it's time someone really sat down and made Fastx-toolkit "pair-safe"
                --
                Jeremy Leipzig
                Bioinformatics Programmer
                --
                My blog
                Twitter

                Comment


                • #9
                  Originally posted by maubp View Post
                  Code:
                  from Bio.SeqIO.QualityIO import FastqGeneralIterator #Biopython 1.51 or later
                  
                  ##########################################################
                  #
                  # Change the following settings to suit your needs
                  #
                  
                  input_forward_filename = "SRR001666_1_rnd.fastq"
                  input_reverse_filename = "SRR001666_2_rnd.fastq"
                  
                  output_paired_forward_filename = "out_forward_pairs.fastq"
                  output_paired_reverse_filename = "out_reverse_pairs.fastq"
                  output_orphan_filename = "out_unpaired_orphans.fastq"
                  
                  f_suffix = "/1"
                  r_suffix = "/2"
                  
                  ##########################################################
                  
                  if f_suffix:
                      f_suffix_crop = -len(f_suffix)
                      def f_name(title):
                          """Remove the suffix from a forward read name."""
                          name = title.split()[0]
                          assert name.endswith(f_suffix)
                          return name[:f_suffix_crop]
                  else:
                      def f_name(title):
                          return title.split()[0]
                  
                  if r_suffix:
                      r_suffix_crop = -len(r_suffix)
                      def r_name(title):
                          """Remove the suffix from a reverse read name."""
                          name = title.split()[0]
                          assert name.endswith(r_suffix)
                          return name[:r_suffix_crop]
                  else:
                      def r_name(title):
                          return title.split()[0]
                  
                  print "Scaning reverse file to build list of names..."    
                  reverse_ids = set()
                  paired_ids = set()
                  for title, seq, qual in FastqGeneralIterator(open(input_reverse_filename)):
                      reverse_ids.add(r_name(title))
                  
                  print "Processing forward file..."
                  forward_handle = open(output_paired_forward_filename, "w")
                  orphan_handle = open(output_orphan_filename, "w")
                  for title, seq, qual in FastqGeneralIterator(open(input_forward_filename)):
                      name = f_name(title)
                      if name in reverse_ids:
                          #Paired
                          paired_ids.add(name)
                          reverse_ids.remove(name) #frees a little memory
                          forward_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
                      else:
                          #Orphan
                          orphan_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
                  forward_handle.close()
                  del reverse_ids #frees memory, although we won't need more now
                  
                  print "Processing reverse file..."
                  reverse_handle = open(output_paired_reverse_filename, "w")
                  for title, seq, qual in FastqGeneralIterator(open(input_forward_filename)):
                      name = r_name(title)
                      if name in paired_ids:
                          #Paired
                          reverse_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
                      else:
                          #Orphan
                          orphan_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
                  orphan_handle.close()
                  reverse_handle.close()
                  print "Done"

                  I think there is an error in the code it should be input_reverse_filename rather than input_foward_filename

                  Code:
                  print "Processing reverse file..."
                  reverse_handle = open(output_paired_reverse_filename, "w")
                  for title, seq, qual in FastqGeneralIterator(open([COLOR="Red"]input_reverse_filename[/COLOR])):

                  Comment


                  • #10
                    Originally posted by hl450 View Post
                    I think there is an error in the code it should be input_reverse_filename rather than input_foward_filename
                    Yes, that does look better. I've updated the example. Thanks

                    Comment


                    • #11
                      casava 1.8

                      Hello,
                      I have been using this script with great ease and pleasure, till Casava 1.8 update. Now the read id structure is changed, the script no longer recognizes the pair information. Do you have an update for this script?

                      Comment


                      • #12
                        Originally posted by menenuh View Post
                        Hello,
                        I have been using this script with great ease and pleasure, till Casava 1.8 update. Now the read id structure is changed, the script no longer recognizes the pair information. Do you have an update for this script?
                        You've run into the very problem I predicted on this thread announcing the Casava 1.8 changes:
                        Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc


                        Simple answer, just look at the name (first word before the space) and they should match. i.e. There is no suffix - change these lines:
                        Code:
                        f_suffix = "/1"
                        r_suffix = "/2"
                        to:
                        Code:
                        f_suffix = ""
                        r_suffix = ""
                        and the script should work.

                        A slightly more robust script would look at the read number field in the second word.

                        Comment


                        • #13
                          Hi,
                          @maubp: thank you very much for the script! I have no experience in programming but I guess that this script https://github.com/lexnederbragt/den...leave_pairs.py is e is exactly what I need (produce paired/singleton file from trimmed/filtered paired-end reads). However, my data is also based on CASAVA 1.8. Thus, I guess that I will need to change the script as recommended in the previous post. But my question is: How does the script identify the pairs? Based on the file they come from (forward or reverse)?
                          Thank you very much!!

                          Comment


                          • #14
                            If your FASTQ pairs come in separate forward and reverse files that tells you which is which. However, for the CASAVA 1.8 naming you could/should also look for 1 or 2 after the first space.

                            Comment


                            • #15
                              Thanks guys for these scripts! I was struggling trying to create one from scratch and these have been a huge help!

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Current Approaches to Protein Sequencing
                                by seqadmin


                                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                04-04-2024, 04:25 PM
                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 04-11-2024, 12:08 PM
                              0 responses
                              26 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 10:19 PM
                              0 responses
                              29 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 09:21 AM
                              0 responses
                              25 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-04-2024, 09:00 AM
                              0 responses
                              52 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X