Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • matching up paired-end reads after fastx-toolkit filtering

    Hi,

    I just learned that most genome assemblers rely on the order of sequences within paired read files to maintain read pairings. Unfortunately, I have already gone and processed each of my paired-end read files separately using fastx-toolkit and now sequences within each file are not paired up correctly. I've searched for ways to compare the two files and extract matching sequences into two new files and put all the orphaned reads into a third file, but so far I haven't found anything that has worked. Does anyone have any simple workarounds for this (eg. perl scripts) other than starting over using other filtering software (eg. Trimmomatic)? I'm working with RADseq data (fastq = phred33).

    Anyhelp would be great!

    Thanks,
    Bryan

  • #2
    maybe try FixMateInformation of picard tools.
    Not sure whether it will work.

    Comment


    • #3
      You can easily install a local instance of Galaxy and edit the start up script to give you admin rights. Then you can add the tool via Galaxy toolshed Fastq interlacer/deinterlacer. this groups by sequence ID into pairs i.e. cluster coordinates/1 and /2.

      These guys also have a perl script for the same function...http://sfg.stanford.edu/
      I noticed the scrit in question was not working properly and the authors sent me an updated working version. They may or may not have updated the download

      Comment


      • #4
        Originally posted by bmtb View Post
        Does anyone have any simple workarounds for this (eg. perl scripts) other than starting over using other filtering software (eg. Trimmomatic)?
        Hi- Sure you files can be fixed. However, I suspect that the easiest and fastest solution is to go back to the original files and use some tool that preserves the pairing order. trim_galore for example is very easy to use and does a great job.

        All the best
        Dario

        Comment


        • #5
          Thanks Jackie! I downloaded the scripts available from the link you posted and checked the PEcombine.sh script, and it appears to be exactly what I need. I'll post again to let those interested know if it worked or not.
          Cheers,
          Bryan

          Comment


          • #6
            Originally posted by JackieBadger View Post
            You can easily install a local instance of Galaxy and edit the start up script to give you admin rights. Then you can add the tool via Galaxy toolshed Fastq interlacer/deinterlacer. this groups by sequence ID into pairs i.e. cluster coordinates/1 and /2.

            These guys also have a perl script for the same function...http://sfg.stanford.edu/
            I noticed the scrit in question was not working properly and the authors sent me an updated working version. They may or may not have updated the download
            Hello again, so I made several attempts to use the PECombiner.sh/fastqcombinepairedreads.py script to work on my paired reads but Jackie previously noted the script is faulty.

            Comment


            • #7
              I was sent the following fixed script from the authors with instructions. Reference them if you use it. I haven't tested it so contact the authors if it fails again

              "You can either replace the name in the PECombiner.sh script with fastqcombinepairedend_revision.py or delete the fastqcombinepairedend.py that you downloaded with the SFG and remove the _revision from the revised script name to match the old one"



              #!/usr/bin/env python

              import sys
              #####Authors: Team Ladner-Barshis-Tepolt
              ######Usage########
              ######This script takes a set of separate quality trimmed paired end .fastq files
              ######and pairs the reads that still have a mate and combined the solitary reads
              ######into a single file of orphans for input into denovo assembly software (e.g. CLC Genomics).
              ######Command line usage: fastqcombinepairedend.py 'the seqheader text' 'the delimiter' infiledirection1.fastq infiledirection2.fastq


              seqheader=sys.argv[1]
              paireddelim=sys.argv[2]
              in1=sys.argv[3]
              in2=sys.argv[4]
              # filecount=0
              # pairedlist=[]
              # for file in infilelist:
              # if filecount==0:
              # firstname=file
              # filecount+=1
              # else:
              # secondname=file
              # filecount=0
              # pairedlist.append((firstname,secondname))

              #Opens an infile specified by the user.
              IN1 = open(sys.argv[3], 'r')
              IN2 = open(sys.argv[4], 'r')
              setoffiles=[in1, in2]
              #Opens an output text file as specified by user
              SIN = open('%s_trimmed_clipped_singles.fastq'%(sys.argv[3][:-23]), 'w') #Unpaired reads from both directions for CLC
              PAIR1 = open('%s_trimmed_clipped_stillpaired.fastq'%(sys.argv[3][:-6]), 'w') #Read one of paired reads for CLC
              PAIR2 = open('%s_trimmed_clipped_stillpaired.fastq'%(sys.argv[4][:-6]), 'w') #Read two of paired reads for CLC


              names1=[]
              names2=[]
              seqs1={}
              seqs2={}
              quals1={}
              quals2={}
              linenum1=0

              #Starts a for loop to read through the infile line by line
              for line in IN1:

              line = line.rstrip()
              linenum1+=1

              #sets count to 1 when it finds the header before a sequence
              if line[0:4]==seqheader:
              line=line.split(paireddelim)
              name=line[0][1:]
              # print name
              names1.append(name)
              count = 0
              else:
              if count==1:
              seqs1[name]=line
              if count==3:
              quals1[name]=line

              count=count+1
              # print linenum1
              print 'finished reading in: %s' %(setoffiles[0])
              for line in IN2:

              # print 2

              line = line.rstrip()

              #sets count to 1 when it finds the header before a sequence
              if line[0:4]=='@HWI':
              line=line.split(paireddelim)
              name=line[0][1:]
              names2.append(name)
              count = 0
              else:

              if count==1:
              seqs2[name]=line
              if count==3:
              quals2[name]=line

              count=count+1

              # print 'here'
              print 'finished reading in: %s'%(setoffiles[0])
              paired = list(set(names1) & set(names2))
              # print len(paired)
              print 'number of paired reads: %d'%(len(paired))
              single = list(set(names1) ^ set(names2))
              # print len(single)
              single1 = list(set(single) & set(names1))
              # print len(single1)
              print 'number of single reads left from %s: %d'%(setoffiles[0],len(single1))
              single2 = list(set(single) & set(names2))
              # print len(single2)
              print 'number of single reads left from %s: %d'%(setoffiles[1],len(single2))
              # for label in paired:
              # PAIR.write('@' + str(label) + '1\n' + str(seqs1[label]) + '\n+' + str(label) + '1\n' + str(quals1[label]) + '\n@' + str(label) + '2\n' + str(seqs2[label]) + '\n+' + str(label) + '2\n' + str(quals2[label]) + '\n')

              for label in single1:
              SIN.write('@' + str(label) + '1\n' + str(seqs1[label]) + '\n+' + str(label) + '1\n' + str(quals1[label]) + '\n')

              for label in single2:
              SIN.write('@' + str(label) + '2\n' + str(seqs2[label]) + '\n+' + str(label) + '2\n' + str(quals2[label]) + '\n')

              for label in paired:
              PAIR1.write('@' + str(label) + '1\n' + str(seqs1[label]) + '\n+' + str(label) + '1\n' + str(quals1[label]) + '\n')
              PAIR2.write('@' + str(label) + '2\n' + str(seqs2[label]) + '\n+' + str(label) + '2\n' + str(quals2[label]) + '\n')

              IN1.close()
              IN2.close()
              SIN.close()
              PAIR1.close()
              PAIR2.close()

              Comment


              • #8
                Thanks Jackie! I tried the revised script but it still didn't work correctly. Only after I went through the script re-setting all the indents and after correcting some lines did I manage to get it working. My only concern now is that I have to ramp up the requested RAM on the cluster i'm using just to use it with my data. So far 8GB of RAM won't cut it as I keep getting 'MemoryErrors'. It works fine on smaller datasets though. Anyway, here is the corrected code (again reference the authors if you use it):


                ####################################################
                #!/usr/bin/env python

                import sys
                #####Authors: Team Ladner-Barshis-Tepolt
                ######Usage########
                ######This script takes a set of separate quality trimmed paired end .fastq files
                ######and pairs the reads that still have a mate and combined the solitary reads
                ######into a single file of orphans for input into denovo assembly software (e.g. CLC Genomics).
                ######Command line usage: fastqcombinepairedend.py 'the delimiter' 'the seqheader text' infiledirection1.fastq infiledirection2.fastq


                seqheader=sys.argv[1]
                paireddelim=sys.argv[2]
                in1=sys.argv[3]
                in2=sys.argv[4]

                ###This line had to be added 15Oct2012 for script to work properly
                setoffiles=[in1, in2]

                #filecount=0
                #pairedlist=[]
                #for file in infilelist:
                #if filecount==0:
                #firstname=file
                #filecount+=1
                #else:
                #secondname=file
                #filecount=0
                #pairedlist.append((firstname,secondname))

                #Opens an infile specified by the user.
                IN1 = open(sys.argv[3], 'r')
                IN2 = open(sys.argv[4], 'r')

                #Opens an output text file as specified by user
                SIN = open('%s_trimmed_clipped_singles.fastq'%(sys.argv[3][:-23]), 'w') #Unpaired reads from both directions for CLC
                PAIR1 = open('%s_trimmed_clipped_stillpaired.fastq'%(sys.argv[3][:-6]), 'w') #Read one of paired reads for CLC
                PAIR2 = open('%s_trimmed_clipped_stillpaired.fastq'%(sys.argv[4][:-6]), 'w') #Read two of paired reads for CLC


                names1=[]
                names2=[]
                seqs1={}
                seqs2={}
                quals1={}
                quals2={}
                linenum1=0

                #Starts a for loop to read through the infile line by line
                for line in IN1:

                line = line.rstrip()
                linenum1+=1

                #sets count to 1 when it finds the header before a sequence
                if line[0:4]==seqheader:
                line=line.split(paireddelim)
                name=line[0][1:]
                # print name
                names1.append(name)
                count = 0
                else:
                if count==1:
                seqs1[name]=line
                if count==3:
                quals1[name]=line

                count=count+1
                # print linenum1
                print 'finished reading in: %s' %(setoffiles[0])

                for line in IN2:

                # print 2

                line = line.rstrip()

                #sets count to 1 when it finds the header before a sequence
                if line[0:4]==seqheader: #change to seqheader from '@HWI' 15oct2012 - I'm not sure why they didn't just use seqheader to begin with as done in previous for loop.
                line=line.split(paireddelim)
                name=line[0][1:]
                names2.append(name)
                count = 0
                else:
                if count==1:
                seqs2[name]=line
                if count==3:
                quals2[name]=line

                count=count+1
                # print linenum1
                # print 'here'
                print 'finished reading in: %s'%(setoffiles[1])

                paired = list(set(names1) & set(names2))
                #print paired
                # print len(paired)
                print 'number of paired reads: %d'%(len(paired))
                single = list(set(names1) ^ set(names2))
                # print len(single)
                single1 = list(set(single) & set(names1))
                # print len(single1)
                print 'number of single reads left from %s: %d'%(setoffiles[0],len(single1))
                single2 = list(set(single) & set(names2))
                # print len(single2)
                print 'number of single reads left from %s: %d'%(setoffiles[1],len(single2))

                # for label in paired:
                # PAIR.write('@' + str(label) + '1\n' + str(seqs1[label]) + '\n+' + str(label) + '1\n' + str(quals1[label]) + '\n@' + str(label) + '2\n' + str(seqs2[label]) + $

                for label in single1:
                SIN.write('@' + str(label) + '1\n' + str(seqs1[label]) + '\n+' + str(label) + '1\n' + str(quals1[label]) + '\n')

                for label in single2:
                SIN.write('@' + str(label) + '2\n' + str(seqs2[label]) + '\n+' + str(label) + '2\n' + str(quals2[label]) + '\n')

                for label in paired:
                PAIR1.write('@' + str(label) + '1\n' + str(seqs1[label]) + '\n+' + str(label) + '1\n' + str(quals1[label]) + '\n')
                PAIR2.write('@' + str(label) + '2\n' + str(seqs2[label]) + '\n+' + str(label) + '2\n' + str(quals2[label]) + '\n')

                IN1.close()
                IN2.close()
                SIN.close()
                PAIR1.close()
                PAIR2.close()
                ####################################################

                Comment


                • #9
                  Hi again, I just realized my last post with the correct code didn't keep the indents. so I've attached a file containing the corrected script. To use it remove the .txt extension.

                  Cheers,
                  Bryan
                  Attached Files

                  Comment


                  • #10
                    Just thought I'd let you know I needed ~227GB of RAM in order to use this script on my data which consisted of two paired-end files 58GB and 29GB in size (after filtering for quality).

                    Comment


                    • #11
                      bmtb,

                      Thanks for the corrected script and the info on memory usage.

                      I'd like to ask what the 'the delimiter' 'the seqheader text' refer to in the program usage.

                      Command line usage: fastqcombinepairedend.py 'the delimiter' 'the seqheader text' infiledirection1.fastq infiledirection2.fastq

                      Carmen

                      Comment


                      • #12
                        Hi! Follow-up / bump to my question above:

                        Does i have to be the entire line, ie,

                        @HWI-M00149:16:000000000-A12VK:1:1101:17796:2300 1:N:0:

                        , or just @HWI ?

                        I'm guessing for delimiter I should use " : " ?

                        Thanks!

                        Comment


                        • #13
                          Dear btmb,

                          I have tried the code and it seems not to be working 'name' and 'count' are not defined before they are evaluated it seems, and the 'count' block seems to me as it sets to 0 when the header is encountered, but I'm not sure. Are there any python-saavy bioinformaticians that would care to help us out with this code?

                          Thanks!

                          Comment


                          • #14
                            Hi carmeyeii,

                            Sorry it took so long for me to reply. If you open the script in a text editor there should be a description in the commented out section at the top of the page. If I recall correctly, the seqheader text should just be "@HWI" and the delimiter should be what ever separates the string of characters adjacent to the seqheader text from the read identifier part of the header line. In your case, you would use " " as the delimiter since there is a space that separates the read identifier (i.e. 1:N:0) from the rest of the line.

                            Cheers,

                            Bryan





                            Originally posted by carmeyeii View Post
                            Hi! Follow-up / bump to my question above:

                            Does i have to be the entire line, ie,

                            @HWI-M00149:16:000000000-A12VK:1:1101:17796:2300 1:N:0:

                            , or just @HWI ?

                            I'm guessing for delimiter I should use " : " ?

                            Thanks!

                            Comment


                            • #15
                              Hi again,

                              I just tested the attached copy of the script on 2500 paired seqs and it worked as expected using the following command:

                              python fastqcombinepairedend.py "@HWI" " " fastq1 fastq2

                              Make sure to change the .py.txt extension to just .py.

                              Bryan



                              Originally posted by carmeyeii View Post
                              Dear btmb,

                              I have tried the code and it seems not to be working 'name' and 'count' are not defined before they are evaluated it seems, and the 'count' block seems to me as it sets to 0 when the header is encountered, but I'm not sure. Are there any python-saavy bioinformaticians that would care to help us out with this code?

                              Thanks!
                              Attached Files

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Genetic Variation in Immunogenetics and Antibody Diversity
                                by seqadmin



                                The field of immunogenetics explores how genetic variations influence immune responses and susceptibility to disease. In a recent SEQanswers webinar, Oscar Rodriguez, Ph.D., Postdoctoral Researcher at the University of Louisville, and Ruben Martínez Barricarte, Ph.D., Assistant Professor of Medicine at Vanderbilt University, shared recent advancements in immunogenetics. This article discusses their research on genetic variation in antibody loci, antibody production processes,...
                                11-06-2024, 07:24 PM
                              • seqadmin
                                Choosing Between NGS and qPCR
                                by seqadmin



                                Next-generation sequencing (NGS) and quantitative polymerase chain reaction (qPCR) are essential techniques for investigating the genome, transcriptome, and epigenome. In many cases, choosing the appropriate technique is straightforward, but in others, it can be more challenging to determine the most effective option. A simple distinction is that smaller, more focused projects are typically better suited for qPCR, while larger, more complex datasets benefit from NGS. However,...
                                10-18-2024, 07:11 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, Today, 11:09 AM
                              0 responses
                              22 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, Today, 06:13 AM
                              0 responses
                              20 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 11-01-2024, 06:09 AM
                              0 responses
                              30 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 10-30-2024, 05:31 AM
                              0 responses
                              21 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X