Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • bmtb
    replied
    Hi again,

    I just tested the attached copy of the script on 2500 paired seqs and it worked as expected using the following command:

    python fastqcombinepairedend.py "@HWI" " " fastq1 fastq2

    Make sure to change the .py.txt extension to just .py.

    Bryan



    Originally posted by carmeyeii View Post
    Dear btmb,

    I have tried the code and it seems not to be working 'name' and 'count' are not defined before they are evaluated it seems, and the 'count' block seems to me as it sets to 0 when the header is encountered, but I'm not sure. Are there any python-saavy bioinformaticians that would care to help us out with this code?

    Thanks!
    Attached Files

    Leave a comment:


  • bmtb
    replied
    Hi carmeyeii,

    Sorry it took so long for me to reply. If you open the script in a text editor there should be a description in the commented out section at the top of the page. If I recall correctly, the seqheader text should just be "@HWI" and the delimiter should be what ever separates the string of characters adjacent to the seqheader text from the read identifier part of the header line. In your case, you would use " " as the delimiter since there is a space that separates the read identifier (i.e. 1:N:0) from the rest of the line.

    Cheers,

    Bryan





    Originally posted by carmeyeii View Post
    Hi! Follow-up / bump to my question above:

    Does i have to be the entire line, ie,

    @HWI-M00149:16:000000000-A12VK:1:1101:17796:2300 1:N:0:

    , or just @HWI ?

    I'm guessing for delimiter I should use " : " ?

    Thanks!

    Leave a comment:


  • carmeyeii
    replied
    Dear btmb,

    I have tried the code and it seems not to be working 'name' and 'count' are not defined before they are evaluated it seems, and the 'count' block seems to me as it sets to 0 when the header is encountered, but I'm not sure. Are there any python-saavy bioinformaticians that would care to help us out with this code?

    Thanks!

    Leave a comment:


  • carmeyeii
    replied
    Hi! Follow-up / bump to my question above:

    Does i have to be the entire line, ie,

    @HWI-M00149:16:000000000-A12VK:1:1101:17796:2300 1:N:0:

    , or just @HWI ?

    I'm guessing for delimiter I should use " : " ?

    Thanks!

    Leave a comment:


  • carmeyeii
    replied
    bmtb,

    Thanks for the corrected script and the info on memory usage.

    I'd like to ask what the 'the delimiter' 'the seqheader text' refer to in the program usage.

    Command line usage: fastqcombinepairedend.py 'the delimiter' 'the seqheader text' infiledirection1.fastq infiledirection2.fastq

    Carmen

    Leave a comment:


  • bmtb
    replied
    Just thought I'd let you know I needed ~227GB of RAM in order to use this script on my data which consisted of two paired-end files 58GB and 29GB in size (after filtering for quality).

    Leave a comment:


  • bmtb
    replied
    Hi again, I just realized my last post with the correct code didn't keep the indents. so I've attached a file containing the corrected script. To use it remove the .txt extension.

    Cheers,
    Bryan
    Attached Files

    Leave a comment:


  • bmtb
    replied
    Thanks Jackie! I tried the revised script but it still didn't work correctly. Only after I went through the script re-setting all the indents and after correcting some lines did I manage to get it working. My only concern now is that I have to ramp up the requested RAM on the cluster i'm using just to use it with my data. So far 8GB of RAM won't cut it as I keep getting 'MemoryErrors'. It works fine on smaller datasets though. Anyway, here is the corrected code (again reference the authors if you use it):


    ####################################################
    #!/usr/bin/env python

    import sys
    #####Authors: Team Ladner-Barshis-Tepolt
    ######Usage########
    ######This script takes a set of separate quality trimmed paired end .fastq files
    ######and pairs the reads that still have a mate and combined the solitary reads
    ######into a single file of orphans for input into denovo assembly software (e.g. CLC Genomics).
    ######Command line usage: fastqcombinepairedend.py 'the delimiter' 'the seqheader text' infiledirection1.fastq infiledirection2.fastq


    seqheader=sys.argv[1]
    paireddelim=sys.argv[2]
    in1=sys.argv[3]
    in2=sys.argv[4]

    ###This line had to be added 15Oct2012 for script to work properly
    setoffiles=[in1, in2]

    #filecount=0
    #pairedlist=[]
    #for file in infilelist:
    #if filecount==0:
    #firstname=file
    #filecount+=1
    #else:
    #secondname=file
    #filecount=0
    #pairedlist.append((firstname,secondname))

    #Opens an infile specified by the user.
    IN1 = open(sys.argv[3], 'r')
    IN2 = open(sys.argv[4], 'r')

    #Opens an output text file as specified by user
    SIN = open('%s_trimmed_clipped_singles.fastq'%(sys.argv[3][:-23]), 'w') #Unpaired reads from both directions for CLC
    PAIR1 = open('%s_trimmed_clipped_stillpaired.fastq'%(sys.argv[3][:-6]), 'w') #Read one of paired reads for CLC
    PAIR2 = open('%s_trimmed_clipped_stillpaired.fastq'%(sys.argv[4][:-6]), 'w') #Read two of paired reads for CLC


    names1=[]
    names2=[]
    seqs1={}
    seqs2={}
    quals1={}
    quals2={}
    linenum1=0

    #Starts a for loop to read through the infile line by line
    for line in IN1:

    line = line.rstrip()
    linenum1+=1

    #sets count to 1 when it finds the header before a sequence
    if line[0:4]==seqheader:
    line=line.split(paireddelim)
    name=line[0][1:]
    # print name
    names1.append(name)
    count = 0
    else:
    if count==1:
    seqs1[name]=line
    if count==3:
    quals1[name]=line

    count=count+1
    # print linenum1
    print 'finished reading in: %s' %(setoffiles[0])

    for line in IN2:

    # print 2

    line = line.rstrip()

    #sets count to 1 when it finds the header before a sequence
    if line[0:4]==seqheader: #change to seqheader from '@HWI' 15oct2012 - I'm not sure why they didn't just use seqheader to begin with as done in previous for loop.
    line=line.split(paireddelim)
    name=line[0][1:]
    names2.append(name)
    count = 0
    else:
    if count==1:
    seqs2[name]=line
    if count==3:
    quals2[name]=line

    count=count+1
    # print linenum1
    # print 'here'
    print 'finished reading in: %s'%(setoffiles[1])

    paired = list(set(names1) & set(names2))
    #print paired
    # print len(paired)
    print 'number of paired reads: %d'%(len(paired))
    single = list(set(names1) ^ set(names2))
    # print len(single)
    single1 = list(set(single) & set(names1))
    # print len(single1)
    print 'number of single reads left from %s: %d'%(setoffiles[0],len(single1))
    single2 = list(set(single) & set(names2))
    # print len(single2)
    print 'number of single reads left from %s: %d'%(setoffiles[1],len(single2))

    # for label in paired:
    # PAIR.write('@' + str(label) + '1\n' + str(seqs1[label]) + '\n+' + str(label) + '1\n' + str(quals1[label]) + '\n@' + str(label) + '2\n' + str(seqs2[label]) + $

    for label in single1:
    SIN.write('@' + str(label) + '1\n' + str(seqs1[label]) + '\n+' + str(label) + '1\n' + str(quals1[label]) + '\n')

    for label in single2:
    SIN.write('@' + str(label) + '2\n' + str(seqs2[label]) + '\n+' + str(label) + '2\n' + str(quals2[label]) + '\n')

    for label in paired:
    PAIR1.write('@' + str(label) + '1\n' + str(seqs1[label]) + '\n+' + str(label) + '1\n' + str(quals1[label]) + '\n')
    PAIR2.write('@' + str(label) + '2\n' + str(seqs2[label]) + '\n+' + str(label) + '2\n' + str(quals2[label]) + '\n')

    IN1.close()
    IN2.close()
    SIN.close()
    PAIR1.close()
    PAIR2.close()
    ####################################################

    Leave a comment:


  • JackieBadger
    replied
    I was sent the following fixed script from the authors with instructions. Reference them if you use it. I haven't tested it so contact the authors if it fails again

    "You can either replace the name in the PECombiner.sh script with fastqcombinepairedend_revision.py or delete the fastqcombinepairedend.py that you downloaded with the SFG and remove the _revision from the revised script name to match the old one"



    #!/usr/bin/env python

    import sys
    #####Authors: Team Ladner-Barshis-Tepolt
    ######Usage########
    ######This script takes a set of separate quality trimmed paired end .fastq files
    ######and pairs the reads that still have a mate and combined the solitary reads
    ######into a single file of orphans for input into denovo assembly software (e.g. CLC Genomics).
    ######Command line usage: fastqcombinepairedend.py 'the seqheader text' 'the delimiter' infiledirection1.fastq infiledirection2.fastq


    seqheader=sys.argv[1]
    paireddelim=sys.argv[2]
    in1=sys.argv[3]
    in2=sys.argv[4]
    # filecount=0
    # pairedlist=[]
    # for file in infilelist:
    # if filecount==0:
    # firstname=file
    # filecount+=1
    # else:
    # secondname=file
    # filecount=0
    # pairedlist.append((firstname,secondname))

    #Opens an infile specified by the user.
    IN1 = open(sys.argv[3], 'r')
    IN2 = open(sys.argv[4], 'r')
    setoffiles=[in1, in2]
    #Opens an output text file as specified by user
    SIN = open('%s_trimmed_clipped_singles.fastq'%(sys.argv[3][:-23]), 'w') #Unpaired reads from both directions for CLC
    PAIR1 = open('%s_trimmed_clipped_stillpaired.fastq'%(sys.argv[3][:-6]), 'w') #Read one of paired reads for CLC
    PAIR2 = open('%s_trimmed_clipped_stillpaired.fastq'%(sys.argv[4][:-6]), 'w') #Read two of paired reads for CLC


    names1=[]
    names2=[]
    seqs1={}
    seqs2={}
    quals1={}
    quals2={}
    linenum1=0

    #Starts a for loop to read through the infile line by line
    for line in IN1:

    line = line.rstrip()
    linenum1+=1

    #sets count to 1 when it finds the header before a sequence
    if line[0:4]==seqheader:
    line=line.split(paireddelim)
    name=line[0][1:]
    # print name
    names1.append(name)
    count = 0
    else:
    if count==1:
    seqs1[name]=line
    if count==3:
    quals1[name]=line

    count=count+1
    # print linenum1
    print 'finished reading in: %s' %(setoffiles[0])
    for line in IN2:

    # print 2

    line = line.rstrip()

    #sets count to 1 when it finds the header before a sequence
    if line[0:4]=='@HWI':
    line=line.split(paireddelim)
    name=line[0][1:]
    names2.append(name)
    count = 0
    else:

    if count==1:
    seqs2[name]=line
    if count==3:
    quals2[name]=line

    count=count+1

    # print 'here'
    print 'finished reading in: %s'%(setoffiles[0])
    paired = list(set(names1) & set(names2))
    # print len(paired)
    print 'number of paired reads: %d'%(len(paired))
    single = list(set(names1) ^ set(names2))
    # print len(single)
    single1 = list(set(single) & set(names1))
    # print len(single1)
    print 'number of single reads left from %s: %d'%(setoffiles[0],len(single1))
    single2 = list(set(single) & set(names2))
    # print len(single2)
    print 'number of single reads left from %s: %d'%(setoffiles[1],len(single2))
    # for label in paired:
    # PAIR.write('@' + str(label) + '1\n' + str(seqs1[label]) + '\n+' + str(label) + '1\n' + str(quals1[label]) + '\n@' + str(label) + '2\n' + str(seqs2[label]) + '\n+' + str(label) + '2\n' + str(quals2[label]) + '\n')

    for label in single1:
    SIN.write('@' + str(label) + '1\n' + str(seqs1[label]) + '\n+' + str(label) + '1\n' + str(quals1[label]) + '\n')

    for label in single2:
    SIN.write('@' + str(label) + '2\n' + str(seqs2[label]) + '\n+' + str(label) + '2\n' + str(quals2[label]) + '\n')

    for label in paired:
    PAIR1.write('@' + str(label) + '1\n' + str(seqs1[label]) + '\n+' + str(label) + '1\n' + str(quals1[label]) + '\n')
    PAIR2.write('@' + str(label) + '2\n' + str(seqs2[label]) + '\n+' + str(label) + '2\n' + str(quals2[label]) + '\n')

    IN1.close()
    IN2.close()
    SIN.close()
    PAIR1.close()
    PAIR2.close()

    Leave a comment:


  • bmtb
    replied
    Originally posted by JackieBadger View Post
    You can easily install a local instance of Galaxy and edit the start up script to give you admin rights. Then you can add the tool via Galaxy toolshed Fastq interlacer/deinterlacer. this groups by sequence ID into pairs i.e. cluster coordinates/1 and /2.

    These guys also have a perl script for the same function...http://sfg.stanford.edu/
    I noticed the scrit in question was not working properly and the authors sent me an updated working version. They may or may not have updated the download
    Hello again, so I made several attempts to use the PECombiner.sh/fastqcombinepairedreads.py script to work on my paired reads but Jackie previously noted the script is faulty.

    Leave a comment:


  • bmtb
    replied
    Thanks Jackie! I downloaded the scripts available from the link you posted and checked the PEcombine.sh script, and it appears to be exactly what I need. I'll post again to let those interested know if it worked or not.
    Cheers,
    Bryan

    Leave a comment:


  • dariober
    replied
    Originally posted by bmtb View Post
    Does anyone have any simple workarounds for this (eg. perl scripts) other than starting over using other filtering software (eg. Trimmomatic)?
    Hi- Sure you files can be fixed. However, I suspect that the easiest and fastest solution is to go back to the original files and use some tool that preserves the pairing order. trim_galore for example is very easy to use and does a great job.

    All the best
    Dario

    Leave a comment:


  • JackieBadger
    replied
    You can easily install a local instance of Galaxy and edit the start up script to give you admin rights. Then you can add the tool via Galaxy toolshed Fastq interlacer/deinterlacer. this groups by sequence ID into pairs i.e. cluster coordinates/1 and /2.

    These guys also have a perl script for the same function...http://sfg.stanford.edu/
    I noticed the scrit in question was not working properly and the authors sent me an updated working version. They may or may not have updated the download

    Leave a comment:


  • bioliyezhang
    replied
    maybe try FixMateInformation of picard tools.
    Not sure whether it will work.

    Leave a comment:


  • matching up paired-end reads after fastx-toolkit filtering

    Hi,

    I just learned that most genome assemblers rely on the order of sequences within paired read files to maintain read pairings. Unfortunately, I have already gone and processed each of my paired-end read files separately using fastx-toolkit and now sequences within each file are not paired up correctly. I've searched for ways to compare the two files and extract matching sequences into two new files and put all the orphaned reads into a third file, but so far I haven't found anything that has worked. Does anyone have any simple workarounds for this (eg. perl scripts) other than starting over using other filtering software (eg. Trimmomatic)? I'm working with RADseq data (fastq = phred33).

    Anyhelp would be great!

    Thanks,
    Bryan

Latest Articles

Collapse

  • seqadmin
    Strategies for Sequencing Challenging Samples
    by seqadmin


    Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
    03-22-2024, 06:39 AM
  • seqadmin
    Techniques and Challenges in Conservation Genomics
    by seqadmin



    The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

    Avian Conservation
    Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
    03-08-2024, 10:41 AM

ad_right_rmr

Collapse

News

Collapse

Topics Statistics Last Post
Started by seqadmin, Yesterday, 06:37 PM
0 responses
8 views
0 likes
Last Post seqadmin  
Started by seqadmin, Yesterday, 06:07 PM
0 responses
8 views
0 likes
Last Post seqadmin  
Started by seqadmin, 03-22-2024, 10:03 AM
0 responses
49 views
0 likes
Last Post seqadmin  
Started by seqadmin, 03-21-2024, 07:32 AM
0 responses
66 views
0 likes
Last Post seqadmin  
Working...
X