Seqanswers Leaderboard Ad

**bioliyezhang** · 10-12-2012, 04:38 PM

maybe try FixMateInformation of picard tools.
Not sure whether it will work.

**JackieBadger** · 10-12-2012, 06:16 PM

You can easily install a local instance of Galaxy and edit the start up script to give you admin rights. Then you can add the tool via Galaxy toolshed Fastq interlacer/deinterlacer. this groups by sequence ID into pairs i.e. cluster coordinates/1 and /2.

These guys also have a perl script for the same function...http://sfg.stanford.edu/
I noticed the scrit in question was not working properly and the authors sent me an updated working version. They may or may not have updated the download

**dariober** · 10-13-2012, 01:40 AM

Originally posted by bmtb View Post

Does anyone have any simple workarounds for this (eg. perl scripts) other than starting over using other filtering software (eg. Trimmomatic)?

Hi- Sure you files can be fixed. However, I suspect that the easiest and fastest solution is to go back to the original files and use some tool that preserves the pairing order. trim_galore for example is very easy to use and does a great job.

All the best
Dario

**bmtb** · 10-14-2012, 04:03 PM

Thanks Jackie! I downloaded the scripts available from the link you posted and checked the PEcombine.sh script, and it appears to be exactly what I need. I'll post again to let those interested know if it worked or not.
Cheers,
Bryan

**bmtb** · 10-14-2012, 06:23 PM

Originally posted by JackieBadger View Post

You can easily install a local instance of Galaxy and edit the start up script to give you admin rights. Then you can add the tool via Galaxy toolshed Fastq interlacer/deinterlacer. this groups by sequence ID into pairs i.e. cluster coordinates/1 and /2.

These guys also have a perl script for the same function...http://sfg.stanford.edu/
I noticed the scrit in question was not working properly and the authors sent me an updated working version. They may or may not have updated the download

Hello again, so I made several attempts to use the PECombiner.sh/fastqcombinepairedreads.py script to work on my paired reads but Jackie previously noted the script is faulty.

**JackieBadger** · 10-14-2012, 06:45 PM

I was sent the following fixed script from the authors with instructions. Reference them if you use it. I haven't tested it so contact the authors if it fails again

"You can either replace the name in the PECombiner.sh script with fastqcombinepairedend_revision.py or delete the fastqcombinepairedend.py that you downloaded with the SFG and remove the _revision from the revised script name to match the old one"

#!/usr/bin/env python

import sys
#####Authors: Team Ladner-Barshis-Tepolt
######Usage########
######This script takes a set of separate quality trimmed paired end .fastq files
######and pairs the reads that still have a mate and combined the solitary reads
######into a single file of orphans for input into denovo assembly software (e.g. CLC Genomics).
######Command line usage: fastqcombinepairedend.py 'the seqheader text' 'the delimiter' infiledirection1.fastq infiledirection2.fastq

seqheader=sys.argv[1]
paireddelim=sys.argv[2]
in1=sys.argv[3]
in2=sys.argv[4]
# filecount=0
# pairedlist=[]
# for file in infilelist:
# if filecount==0:
# firstname=file
# filecount+=1
# else:
# secondname=file
# filecount=0
# pairedlist.append((firstname,secondname))

#Opens an infile specified by the user.
IN1 = open(sys.argv[3], 'r')
IN2 = open(sys.argv[4], 'r')
setoffiles=[in1, in2]
#Opens an output text file as specified by user
SIN = open('%s_trimmed_clipped_singles.fastq'%(sys.argv[3][:-23]), 'w') #Unpaired reads from both directions for CLC
PAIR1 = open('%s_trimmed_clipped_stillpaired.fastq'%(sys.argv[3][:-6]), 'w') #Read one of paired reads for CLC
PAIR2 = open('%s_trimmed_clipped_stillpaired.fastq'%(sys.argv[4][:-6]), 'w') #Read two of paired reads for CLC

names1=[]
names2=[]
seqs1={}
seqs2={}
quals1={}
quals2={}
linenum1=0

#Starts a for loop to read through the infile line by line
for line in IN1:

line = line.rstrip()
linenum1+=1

#sets count to 1 when it finds the header before a sequence
if line[0:4]==seqheader:
line=line.split(paireddelim)
name=line[0][1:]
# print name
names1.append(name)
count = 0
else:
if count==1:
seqs1[name]=line
if count==3:
quals1[name]=line

count=count+1
# print linenum1
print 'finished reading in: %s' %(setoffiles[0])
for line in IN2:

# print 2

line = line.rstrip()

#sets count to 1 when it finds the header before a sequence
if line[0:4]=='@HWI':
line=line.split(paireddelim)
name=line[0][1:]
names2.append(name)
count = 0
else:

if count==1:
seqs2[name]=line
if count==3:
quals2[name]=line

count=count+1

# print 'here'
print 'finished reading in: %s'%(setoffiles[0])
paired = list(set(names1) & set(names2))
# print len(paired)
print 'number of paired reads: %d'%(len(paired))
single = list(set(names1) ^ set(names2))
# print len(single)
single1 = list(set(single) & set(names1))
# print len(single1)
print 'number of single reads left from %s: %d'%(setoffiles[0],len(single1))
single2 = list(set(single) & set(names2))
# print len(single2)
print 'number of single reads left from %s: %d'%(setoffiles[1],len(single2))
# for label in paired:
# PAIR.write('@' + str(label) + '1\n' + str(seqs1[label]) + '\n+' + str(label) + '1\n' + str(quals1[label]) + '\n@' + str(label) + '2\n' + str(seqs2[label]) + '\n+' + str(label) + '2\n' + str(quals2[label]) + '\n')

for label in single1:
SIN.write('@' + str(label) + '1\n' + str(seqs1[label]) + '\n+' + str(label) + '1\n' + str(quals1[label]) + '\n')

for label in single2:
SIN.write('@' + str(label) + '2\n' + str(seqs2[label]) + '\n+' + str(label) + '2\n' + str(quals2[label]) + '\n')

for label in paired:
PAIR1.write('@' + str(label) + '1\n' + str(seqs1[label]) + '\n+' + str(label) + '1\n' + str(quals1[label]) + '\n')
PAIR2.write('@' + str(label) + '2\n' + str(seqs2[label]) + '\n+' + str(label) + '2\n' + str(quals2[label]) + '\n')

IN1.close()
IN2.close()
SIN.close()
PAIR1.close()
PAIR2.close()

**bmtb** · 10-16-2012, 12:07 PM

Thanks Jackie! I tried the revised script but it still didn't work correctly. Only after I went through the script re-setting all the indents and after correcting some lines did I manage to get it working. My only concern now is that I have to ramp up the requested RAM on the cluster i'm using just to use it with my data. So far 8GB of RAM won't cut it as I keep getting 'MemoryErrors'. It works fine on smaller datasets though. Anyway, here is the corrected code (again reference the authors if you use it):

####################################################
#!/usr/bin/env python

import sys
#####Authors: Team Ladner-Barshis-Tepolt
######Usage########
######This script takes a set of separate quality trimmed paired end .fastq files
######and pairs the reads that still have a mate and combined the solitary reads
######into a single file of orphans for input into denovo assembly software (e.g. CLC Genomics).
######Command line usage: fastqcombinepairedend.py 'the delimiter' 'the seqheader text' infiledirection1.fastq infiledirection2.fastq

seqheader=sys.argv[1]
paireddelim=sys.argv[2]
in1=sys.argv[3]
in2=sys.argv[4]

###This line had to be added 15Oct2012 for script to work properly
setoffiles=[in1, in2]

#filecount=0
#pairedlist=[]
#for file in infilelist:
#if filecount==0:
#firstname=file
#filecount+=1
#else:
#secondname=file
#filecount=0
#pairedlist.append((firstname,secondname))

#Opens an infile specified by the user.
IN1 = open(sys.argv[3], 'r')
IN2 = open(sys.argv[4], 'r')

#Opens an output text file as specified by user
SIN = open('%s_trimmed_clipped_singles.fastq'%(sys.argv[3][:-23]), 'w') #Unpaired reads from both directions for CLC
PAIR1 = open('%s_trimmed_clipped_stillpaired.fastq'%(sys.argv[3][:-6]), 'w') #Read one of paired reads for CLC
PAIR2 = open('%s_trimmed_clipped_stillpaired.fastq'%(sys.argv[4][:-6]), 'w') #Read two of paired reads for CLC

names1=[]
names2=[]
seqs1={}
seqs2={}
quals1={}
quals2={}
linenum1=0

#Starts a for loop to read through the infile line by line
for line in IN1:

line = line.rstrip()
linenum1+=1

#sets count to 1 when it finds the header before a sequence
if line[0:4]==seqheader:
line=line.split(paireddelim)
name=line[0][1:]
# print name
names1.append(name)
count = 0
else:
if count==1:
seqs1[name]=line
if count==3:
quals1[name]=line

count=count+1
# print linenum1
print 'finished reading in: %s' %(setoffiles[0])

for line in IN2:

# print 2

line = line.rstrip()

#sets count to 1 when it finds the header before a sequence
if line[0:4]==seqheader: #change to seqheader from '@HWI' 15oct2012 - I'm not sure why they didn't just use seqheader to begin with as done in previous for loop.
line=line.split(paireddelim)
name=line[0][1:]
names2.append(name)
count = 0
else:
if count==1:
seqs2[name]=line
if count==3:
quals2[name]=line

count=count+1
# print linenum1
# print 'here'
print 'finished reading in: %s'%(setoffiles[1])

paired = list(set(names1) & set(names2))
#print paired
# print len(paired)
print 'number of paired reads: %d'%(len(paired))
single = list(set(names1) ^ set(names2))
# print len(single)
single1 = list(set(single) & set(names1))
# print len(single1)
print 'number of single reads left from %s: %d'%(setoffiles[0],len(single1))
single2 = list(set(single) & set(names2))
# print len(single2)
print 'number of single reads left from %s: %d'%(setoffiles[1],len(single2))

# for label in paired:
# PAIR.write('@' + str(label) + '1\n' + str(seqs1[label]) + '\n+' + str(label) + '1\n' + str(quals1[label]) + '\n@' + str(label) + '2\n' + str(seqs2[label]) + $

for label in single1:
SIN.write('@' + str(label) + '1\n' + str(seqs1[label]) + '\n+' + str(label) + '1\n' + str(quals1[label]) + '\n')

for label in single2:
SIN.write('@' + str(label) + '2\n' + str(seqs2[label]) + '\n+' + str(label) + '2\n' + str(quals2[label]) + '\n')

for label in paired:
PAIR1.write('@' + str(label) + '1\n' + str(seqs1[label]) + '\n+' + str(label) + '1\n' + str(quals1[label]) + '\n')
PAIR2.write('@' + str(label) + '2\n' + str(seqs2[label]) + '\n+' + str(label) + '2\n' + str(quals2[label]) + '\n')

IN1.close()
IN2.close()
SIN.close()
PAIR1.close()
PAIR2.close()
####################################################

**bmtb** · 10-16-2012, 12:15 PM

Hi again, I just realized my last post with the correct code didn't keep the indents. so I've attached a file containing the corrected script. To use it remove the .txt extension.

Cheers,
Bryan

Attached Files

fastqcombinepairedend.py.txt (3.6 KB, 733 views)

**bmtb** · 10-18-2012, 09:04 AM

Just thought I'd let you know I needed ~227GB of RAM in order to use this script on my data which consisted of two paired-end files 58GB and 29GB in size (after filtering for quality).

**carmeyeii** · 12-28-2012, 12:00 PM

bmtb,

Thanks for the corrected script and the info on memory usage.

I'd like to ask what the 'the delimiter' 'the seqheader text' refer to in the program usage.

Command line usage: fastqcombinepairedend.py 'the delimiter' 'the seqheader text' infiledirection1.fastq infiledirection2.fastq

Carmen

**carmeyeii** · 01-03-2013, 01:02 PM

Hi! Follow-up / bump to my question above:

Does i have to be the entire line, ie,

@HWI-M00149:16:000000000-A12VK:1:1101:17796:2300 1:N:0:

, or just @HWI ?

I'm guessing for delimiter I should use " : " ?

Thanks!

**carmeyeii** · 01-03-2013, 02:35 PM

Dear btmb,

I have tried the code and it seems not to be working

'name' and 'count' are not defined before they are evaluated it seems, and the 'count' block seems to me as it sets to 0 when the header is encountered, but I'm not sure. Are there any python-saavy bioinformaticians that would care to help us out with this code?

Thanks!

**bmtb** · 01-03-2013, 07:09 PM

Hi carmeyeii,

Sorry it took so long for me to reply. If you open the script in a text editor there should be a description in the commented out section at the top of the page. If I recall correctly, the seqheader text should just be "@HWI" and the delimiter should be what ever separates the string of characters adjacent to the seqheader text from the read identifier part of the header line. In your case, you would use " " as the delimiter since there is a space that separates the read identifier (i.e. 1:N:0) from the rest of the line.

Cheers,

Bryan

Originally posted by carmeyeii View Post

Hi! Follow-up / bump to my question above:

Does i have to be the entire line, ie,

@HWI-M00149:16:000000000-A12VK:1:1101:17796:2300 1:N:0:

, or just @HWI ?

I'm guessing for delimiter I should use " : " ?

Thanks!

**bmtb** · 01-03-2013, 07:38 PM

Hi again,

I just tested the attached copy of the script on 2500 paired seqs and it worked as expected using the following command:

python fastqcombinepairedend.py "@HWI" " " fastq1 fastq2

Make sure to change the .py.txt extension to just .py.

Bryan

Originally posted by carmeyeii View Post

Dear btmb,

I have tried the code and it seems not to be working

'name' and 'count' are not defined before they are evaluated it seems, and the 'count' block seems to me as it sets to 0 when the header is encountered, but I'm not sure. Are there any python-saavy bioinformaticians that would care to help us out with this code?

Thanks!

Attached Files

fastqcombinepairedend.py.txt (3.5 KB, 428 views)

Topics	Statistics	Last Post
ASHG 2024 Highlights – Part Two by seqadmin Started by seqadmin, Today, 11:09 AM	0 responses 22 views 0 likes	Last Post by seqadmin Today, 11:09 AM
ASHG 2024 Highlights – Part One by seqadmin Started by seqadmin, Today, 06:13 AM	0 responses 20 views 0 likes	Last Post by seqadmin Today, 06:13 AM
Seq-Scope Expands Possibilities for High-Resolution Gene Expression Analysis by seqadmin Started by seqadmin, 11-01-2024, 06:09 AM	0 responses 30 views 0 likes	Last Post by seqadmin 11-01-2024, 06:09 AM
New Model Aims to Explain Polygenic Diseases by Connecting Genomic Mutations and Regulatory Networks by seqadmin Started by seqadmin, 10-30-2024, 05:31 AM	0 responses 21 views 0 likes	Last Post by seqadmin 10-30-2024, 05:31 AM

Seqanswers Leaderboard Ad

Announcement

matching up paired-end reads after fastx-toolkit filtering

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News