Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #31
    Originally posted by MattB View Post
    Just wondering if anyone has a script to do the opposite to the scripts in this thread, ie. split an already interleaved forward and reverse file into separate forward and reverse files (for Casava 1.8 and beyond read naming)?
    I've got a tool (fastq_paired_unpaired) in the Galaxy Tool Shed to do just that (also handling the older Illumina /1 and /2 names, Sanger naming, etc). It is just a Python script internally but since it uses the Galaxy FASTQ library perhaps not ideal for your needs? http://toolshed.g2.bx.psu.edu/

    Comment


    • #32
      Thanks, I had actually planned to use the Galaxy option but our local installation is undergoing some maintenance at the moment hence I was after a simple script alternative.

      The Perl script that Wallysb01 pointed me to seems to work nicely in any case!

      Matt

      Comment


      • #33
        Hi, maubp
        Thank you for your python code. Just a question, if the sequences order in the 2 input fastq files are randomized, then the sequences order in the output two paired fastq files are not the same. The two output fastq files are indeed paired files, just the sequence orders are not the same. Can you please update your code to make the output paired fastq files with the same sequence order if the input fastq files are randomized? Thank you.

        Originally posted by maubp View Post
        Alternatively, this second version should need much less memory but produces three files forward/reverse/orphaned, so you'd need to interleave the forward/reverse files afterwards. This makes three passes though the files and just keeps a list of IDs in memory (as a Python set for speed) rather what the first script does which holds a map of IDs to file offsets (as a Python dictionary). The records are output in the order found in the input files, so as long as the input files follow the same sorting, doing the interleaving step is trivial (and easy enough to add on in Python).

        Code:
        from Bio.SeqIO.QualityIO import FastqGeneralIterator #Biopython 1.51 or later
        
        ##########################################################
        #
        # Change the following settings to suit your needs
        #
        
        input_forward_filename = "SRR001666_1_rnd.fastq"
        input_reverse_filename = "SRR001666_2_rnd.fastq"
        
        output_paired_forward_filename = "out_forward_pairs.fastq"
        output_paired_reverse_filename = "out_reverse_pairs.fastq"
        output_orphan_filename = "out_unpaired_orphans.fastq"
        
        f_suffix = "/1"
        r_suffix = "/2"
        
        ##########################################################
        
        if f_suffix:
            f_suffix_crop = -len(f_suffix)
            def f_name(title):
                """Remove the suffix from a forward read name."""
                name = title.split()[0]
                assert name.endswith(f_suffix), name
                return name[:f_suffix_crop]
        else:
            def f_name(title):
                return title.split()[0]
        
        if r_suffix:
            r_suffix_crop = -len(r_suffix)
            def r_name(title):
                """Remove the suffix from a reverse read name."""
                name = title.split()[0]
                assert name.endswith(r_suffix), name
                return name[:r_suffix_crop]
        else:
            def r_name(title):
                return title.split()[0]
        
        print "Scaning reverse file to build list of names..."    
        reverse_ids = set()
        paired_ids = set()
        for title, seq, qual in FastqGeneralIterator(open(input_reverse_filename)):
            reverse_ids.add(r_name(title))
        
        print "Processing forward file..."
        forward_handle = open(output_paired_forward_filename, "w")
        orphan_handle = open(output_orphan_filename, "w")
        for title, seq, qual in FastqGeneralIterator(open(input_forward_filename)):
            name = f_name(title)
            if name in reverse_ids:
                #Paired
                paired_ids.add(name)
                reverse_ids.remove(name) #frees a little memory
                forward_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
            else:
                #Orphan
                orphan_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
        forward_handle.close()
        del reverse_ids #frees memory, although we won't need more now
        
        print "Processing reverse file..."
        reverse_handle = open(output_paired_reverse_filename, "w")
        for title, seq, qual in FastqGeneralIterator(open(input_reverse_filename)):
            name = r_name(title)
            if name in paired_ids:
                #Paired
                reverse_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
            else:
                #Orphan
                orphan_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
        orphan_handle.close()
        reverse_handle.close()
        print "Done"

        Comment


        • #34
          Originally posted by cwzkevin View Post
          Hi, maubp
          Thank you for your python code. Just a question, if the sequences order in the 2 input fastq files are randomized, then the sequences order in the output two paired fastq files are not the same. The two output fastq files are indeed paired files, just the sequence orders are not the same. Can you please update your code to make the output paired fastq files with the same sequence order if the input fastq files are randomized? Thank you.
          If the order is randomised, you can either presort both files (and then you are back in the original situation), or use an indexing strategy (you could use the SeqIO indexing functions in Biopython for this if you wanted to).

          One way to do the sorting is to convert your FASTQ file into a tab-separated file, use the Unix sort command, and then turn the data back into FASTQ. The trick here is the Unix sort command sorts at the line level, so we must transform the data to be one line per record.

          There is probably an 'off the shelf' solution to this, perhaps using biopieces?

          Or go back and find out why your FASTQ files are out of order, and fix that

          Comment


          • #35
            how about fasta files?

            Hi,
            could anyone please modify the script so that it works with fasta files instead of fastq files?

            many many thanks!

            Comment


            • #36
              Error: Sequence and quality captions differ.

              I ran the script but its giving me the error:
              Sequence and quality captions differ.

              Can anyone help me with this error for the script posted above.???

              Comment


              • #37
                Originally posted by safina View Post
                I ran the script but its giving me the error:
                Sequence and quality captions differ.

                Can anyone help me with this error for the script posted above.???
                In FASTQ, the @ lines and + lines should be consistent (or the plus line can leave out the caption completely which is preferable as it otherwise is a waste of disk space).

                It sounds like somehow your FASTQ files have been messed up

                Comment


                • #38
                  Im sending you some of my fastq file reads. pls see.

                  ==> forward_sequences.fastq <==
                  @SRR1562087.10.1/1
                  GAGCTAGATCAGCACCATATATTACACGATGATCAGCTGTAACATTTACCTGCATCTGGTTCTTCATTCCTATCCGACCATCCTTGG
                  +
                  JJJJJJIIJJJJJJJJIJJJJJJJJJJJJJJJIJJJJJJJGIIJJJJIJJJJJJJJJIJJJJDHIHHHHHHHFDFFDDDDDDDDD>C
                  @SRR1562087.11.1/1
                  AGGTTGACTATGGTCCAGGCCATGCCAGGAGAGCAACCGAAAACAGAGAGAACGGTAAGCCAGGAGAAGAACAGTATGAGTATATAG
                  +
                  IJJGHIJIIIFIBHHGAFHGGIHJIJGJEGIGGGHGIJJJJHHGFEFEDACEEDDBDBCCCDDDDDDBDDDCDDCADDDCCCDDDDD
                  @SRR1562087.15.1/1
                  TAACATCCACAATCTCCTTCTACCCAAGAAGTCTGGAACTTCAGCATCAAAGGCTGGTGATGACGACAACTAATCCATTTACTGAAT

                  ==> reverse_sequences.fastq <==

                  @SRR1562087.7.2/2
                  CCTGTAGATATACGTACTGCCAAAGGGTAGATAGTTGCCCATCTCAGAAAACACAACTTCAACAGCCAAGATTAATATCCATGTGAT
                  +
                  IJJJGGJBHIJJGHHHIIHJJGJGJIIDFHIJIJJJGHJJJJJJJIJGIGH@FHJIJIHIIIHHH=BDFFAEECCEEFDEDDCDCA>
                  @SRR1562087.9.2/2
                  GTAATCCAAATAAGGTATACTCACTCATCGGAGGATTTTGTGCTTCCCCTGTGAATTTCCACGCTAAGGATGGCTCCGGCTATAAAT
                  +
                  JIJIIJJJGGIIJIBC@FH@HHJGIJGCHGIEGIFHDFHJIJIJIHHIIIIJGGHHHHHCDDFDDDBDDDDDDDCDBDDBD@CDCEE
                  @SRR1562087.11.2/2
                  GAAACACTGATTGGTTCACGTATCCAGGTGTATGGACCACCTATATACTCATACTGTTCTTCTCCTGGCTTACCGTTCTCTCTGTTT

                  Comment


                  • #39
                    Your FASTQ files look fine from those snippets. Which script exactly are you using, what was the command line you used, and most importantly, the exact error message?

                    Update: Hang on, the names don't match, e.g. SRR1562087.10.1/1 vs SRR1562087.7.2/2
                    Last edited by maubp; 04-01-2015, 12:20 AM.

                    Comment


                    • #40
                      Originally posted by MattB View Post
                      Thanks for quick reply. Actually, I'm looking for something even simpler, where I just have one interleaved fastq file that I want to split into separate forward and reverse. Probably should have posted in a separate thread since I'm not so worried about trimming in this case.

                      I'm sure it is some very simple python that I could manage in the end, such a script is probably already out there.
                      This is a late reply, but you can split interleaved files with Pairfq like so:

                      Code:
                      curl -sL git.io/pairfq_lite | perl - splitpairs -i interl.fq.gz -f forward.fq -r reverse.fq
                      Interleaving the pairs is also possible with a slightly different command:

                      Code:
                      curl -sL git.io/pairfq_lite | perl - joinpairs -f forward.fq -r reverse.fq -o interl.fq
                      Originally posted by Daniel1977 View Post
                      Hi,
                      could anyone please modify the script so that it works with fasta files instead of fastq files?

                      many many thanks!
                      These commands work with FASTA/Q, compressed or uncompressed. There is more information about the commands on the Pairfq wiki. The 'splitpairs' command should work on any system and has a tiny memory footprint, but the 'joinpairs' and 'makepairs' (for fixing broken paired-end files) require more memory. If you have something like 250 million reads, as described in the original post, I would follow the install instructions for this tool and use the "--index" flag to reduce the memory usage (for pairing files, that is).

                      Comment


                      • #41
                        Hi. Im not getting any eroor infact the pairfq program putting all my reads in single.fq where as 1-paired.fq and 2_paired.fq remains empty. the link to the script is below:


                        the cammand was:

                        $ pairfq makepairs -f s_1_1_trimmed.fq \
                        -r s_1_2_trimmed.fq \
                        -fp s_1_1_trimmed_p.fq \
                        -rp s_1_2_trimmed_p.fq \
                        -fs s_1_1_trimmed_s.fq \
                        -rs s_1_2_trimmed_s.fq \
                        --index

                        Comment


                        • #42
                          Hi i tries pairfq but it giving me empty paired files whereas putting all reads in unique. Can anyone help?

                          Comment


                          • #43
                            Hi. Im not getting any eroor infact the pairfq program putting all my reads in single.fq where as 1-paired.fq and 2_paired.fq remains empty. the link to the script is below:


                            the cammand was:

                            $ pairfq makepairs -f s_1_1_trimmed.fq \
                            -r s_1_2_trimmed.fq \
                            -fp s_1_1_trimmed_p.fq \
                            -rp s_1_2_trimmed_p.fq \
                            -fs s_1_1_trimmed_s.fq \
                            -rs s_1_2_trimmed_s.fq \
                            --index

                            Comment


                            • #44
                              Originally posted by maubp View Post
                              Your FASTQ files look fine from those snippets. Which script exactly are you using, what was the command line you used, and most importantly, the exact error message?
                              Hi. Im not getting any eroor infact the pairfq program putting all my reads in single.fq where as 1-paired.fq and 2_paired.fq remains empty. the link to the script is below:


                              the cammand was:

                              $ pairfq makepairs -f s_1_1_trimmed.fq \
                              -r s_1_2_trimmed.fq \
                              -fp s_1_1_trimmed_p.fq \
                              -rp s_1_2_trimmed_p.fq \
                              -fs s_1_1_trimmed_s.fq \
                              -rs s_1_2_trimmed_s.fq \
                              --index

                              Comment


                              • #45
                                i got this error while running the above script:


                                File "new_pairing.py", line 51, in <module>
                                forward_handle = open(output_paired_forward_filename, "w")
                                NameError: name 'output_paired_forward_filename' is not defined

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Essential Discoveries and Tools in Epitranscriptomics
                                  by seqadmin




                                  The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                                  04-22-2024, 07:01 AM
                                • seqadmin
                                  Current Approaches to Protein Sequencing
                                  by seqadmin


                                  Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                  04-04-2024, 04:25 PM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, 04-25-2024, 11:49 AM
                                0 responses
                                20 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-24-2024, 08:47 AM
                                0 responses
                                20 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-11-2024, 12:08 PM
                                0 responses
                                62 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-10-2024, 10:19 PM
                                0 responses
                                61 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X