Unconfigured Ad

**maubp** · 07-28-2010, 06:46 AM

I once wrote a script to do just that a while back using Biopython's Bio.SeqIO.index() function for random access to FASTQ files. This holds the offsets in memory so there is a limit on how many reads you can work with before having to store the index on disk, e.g. using SQLite.

How many reads do you have?

You'd also need to specify how to match up the read pairs, for example are you using the /1 and /2 suffix convention?

**natstreet** · 07-28-2010, 11:38 PM

#Reads

I have about 250 millions reads. I concatenated the reads from all lanes into a single file but I could go back and work on the per-lane files. My machine has 96 GB RAM so it should be able to handle it I think.

The pairs are specified by /1 /2 for the Illumina data I have and _F _R for the SOLiD data - but it's mainly the Illumina data I'm interested in doing this for.

**maubp** · 07-29-2010, 03:01 AM

Just to be clear, your two input files will be for the forward and reverse reads?

**natstreet** · 07-29-2010, 03:03 AM

Yes. Ideally I want to input read1 and read2 (forward and reverse reads) that have each been pre-filtered using the fastx fastq_quality_filter tool and to then output an interleaved file of proper pairs and a file of the reads that have been orphaned by the quality filtering.

**maubp** · 07-29-2010, 05:10 AM

Personally I would process the reads in per-lane batches, which will help greatly with reducing the scale of the problem (and let you spread the work over multiple machines or cores).

I couldn't find the original script I used, so I rewrote one using a straightforward but fairly memory naive approach:

Code:

from Bio import SeqIO #Biopython 1.54 or later needed

#######################################################
#
# Change the following settings to suit your needs
#

input_forward_filename = "SRR001666_1_rnd.fastq"
input_reverse_filename = "SRR001666_2_rnd.fastq"

output_pairs_filename = "out_interleaved_pairs.fastq"
output_orphan_filename = "out_unpaired_orphans.fastq"

f_suffix = "/1"
r_suffix = "/2"

#######################################################

if f_suffix:
    f_suffix_crop = -len(f_suffix)
    def f_name(name):
        """Remove the suffix from a forward read name."""
        assert name.endswith(f_suffix), name
        return name[:f_suffix_crop]
else:
    #No suffix, don't need a function to fix the name
    f_name = None

if r_suffix:
    r_suffix_crop = -len(r_suffix)
    def r_name(name):
        """Remove the suffix from a reverse read name."""
        assert name.endswith(r_suffix), name
        return name[:r_suffix_crop]
else:
    #No suffix, don't need a function to fix the name
    r_name = None
    
print "Indexing forward file..."
forward_dict = SeqIO.index(input_forward_filename, "fastq", key_function=f_name)

print "Indexing reverse file..."
reverse_dict = SeqIO.index(input_reverse_filename, "fastq", key_function=r_name)

print "Ouputing pairs and forward orphans..."
pair_handle = open(output_pairs_filename, "w")
orphan_handle = open(output_orphan_filename, "w")
for key in forward_dict:
    if key in reverse_dict:
         pair_handle.write(forward_dict.get_raw(key))
         pair_handle.write(reverse_dict.get_raw(key))
    else:
         orphan_handle.write(forward_dict.get_raw(key))
pair_handle.close()
print "Ouputing reverse orphans..."
for key in reverse_dict:
    if key not in forward_dict:
         orphan_handle.write(reverse_dict.get_raw(key))
orphan_handle.close()
print "Done"

Given you have a big memory machine, this may work fine as is.

Alternatively, this second version should need much less memory but produces three files forward/reverse/orphaned, so you'd need to interleave the forward/reverse files afterwards. This makes three passes though the files and just keeps a list of IDs in memory (as a Python set for speed) rather what the first script does which holds a map of IDs to file offsets (as a Python dictionary). The records are output in the order found in the input files, so as long as the input files follow the same sorting, doing the interleaving step is trivial (and easy enough to add on in Python).

Code:

from Bio.SeqIO.QualityIO import FastqGeneralIterator #Biopython 1.51 or later

##########################################################
#
# Change the following settings to suit your needs
#

input_forward_filename = "SRR001666_1_rnd.fastq"
input_reverse_filename = "SRR001666_2_rnd.fastq"

output_paired_forward_filename = "out_forward_pairs.fastq"
output_paired_reverse_filename = "out_reverse_pairs.fastq"
output_orphan_filename = "out_unpaired_orphans.fastq"

f_suffix = "/1"
r_suffix = "/2"

##########################################################

if f_suffix:
    f_suffix_crop = -len(f_suffix)
    def f_name(title):
        """Remove the suffix from a forward read name."""
        name = title.split()[0]
        assert name.endswith(f_suffix), name
        return name[:f_suffix_crop]
else:
    def f_name(title):
        return title.split()[0]

if r_suffix:
    r_suffix_crop = -len(r_suffix)
    def r_name(title):
        """Remove the suffix from a reverse read name."""
        name = title.split()[0]
        assert name.endswith(r_suffix), name
        return name[:r_suffix_crop]
else:
    def r_name(title):
        return title.split()[0]

print "Scaning reverse file to build list of names..."    
reverse_ids = set()
paired_ids = set()
for title, seq, qual in FastqGeneralIterator(open(input_reverse_filename)):
    reverse_ids.add(r_name(title))

print "Processing forward file..."
forward_handle = open(output_paired_forward_filename, "w")
orphan_handle = open(output_orphan_filename, "w")
for title, seq, qual in FastqGeneralIterator(open(input_forward_filename)):
    name = f_name(title)
    if name in reverse_ids:
        #Paired
        paired_ids.add(name)
        reverse_ids.remove(name) #frees a little memory
        forward_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
    else:
        #Orphan
        orphan_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
forward_handle.close()
del reverse_ids #frees memory, although we won't need more now

print "Processing reverse file..."
reverse_handle = open(output_paired_reverse_filename, "w")
for title, seq, qual in FastqGeneralIterator(open(input_reverse_filename)):
    name = r_name(title)
    if name in paired_ids:
        #Paired
        reverse_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
    else:
        #Orphan
        orphan_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
orphan_handle.close()
reverse_handle.close()
print "Done"

**natstreet** · 07-29-2010, 10:38 AM

That's brilliant thanks! It's also great to have some example python scripts to help me learn from

I really appreciate this and hopefully others will also.

**Zigster** · 08-02-2010, 08:06 AM

wow I think it's time someone really sat down and made Fastx-toolkit "pair-safe"

**hl450** · 08-09-2010, 01:12 PM

Originally posted by maubp View Post

Code:

from Bio.SeqIO.QualityIO import FastqGeneralIterator #Biopython 1.51 or later

##########################################################
#
# Change the following settings to suit your needs
#

input_forward_filename = "SRR001666_1_rnd.fastq"
input_reverse_filename = "SRR001666_2_rnd.fastq"

output_paired_forward_filename = "out_forward_pairs.fastq"
output_paired_reverse_filename = "out_reverse_pairs.fastq"
output_orphan_filename = "out_unpaired_orphans.fastq"

f_suffix = "/1"
r_suffix = "/2"

##########################################################

if f_suffix:
    f_suffix_crop = -len(f_suffix)
    def f_name(title):
        """Remove the suffix from a forward read name."""
        name = title.split()[0]
        assert name.endswith(f_suffix)
        return name[:f_suffix_crop]
else:
    def f_name(title):
        return title.split()[0]

if r_suffix:
    r_suffix_crop = -len(r_suffix)
    def r_name(title):
        """Remove the suffix from a reverse read name."""
        name = title.split()[0]
        assert name.endswith(r_suffix)
        return name[:r_suffix_crop]
else:
    def r_name(title):
        return title.split()[0]

print "Scaning reverse file to build list of names..."    
reverse_ids = set()
paired_ids = set()
for title, seq, qual in FastqGeneralIterator(open(input_reverse_filename)):
    reverse_ids.add(r_name(title))

print "Processing forward file..."
forward_handle = open(output_paired_forward_filename, "w")
orphan_handle = open(output_orphan_filename, "w")
for title, seq, qual in FastqGeneralIterator(open(input_forward_filename)):
    name = f_name(title)
    if name in reverse_ids:
        #Paired
        paired_ids.add(name)
        reverse_ids.remove(name) #frees a little memory
        forward_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
    else:
        #Orphan
        orphan_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
forward_handle.close()
del reverse_ids #frees memory, although we won't need more now

print "Processing reverse file..."
reverse_handle = open(output_paired_reverse_filename, "w")
for title, seq, qual in FastqGeneralIterator(open(input_forward_filename)):
    name = r_name(title)
    if name in paired_ids:
        #Paired
        reverse_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
    else:
        #Orphan
        orphan_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
orphan_handle.close()
reverse_handle.close()
print "Done"

I think there is an error in the code it should be input_reverse_filename rather than input_foward_filename

Code:

print "Processing reverse file..."
reverse_handle = open(output_paired_reverse_filename, "w")
for title, seq, qual in FastqGeneralIterator(open([COLOR="Red"]input_reverse_filename[/COLOR])):

**maubp** · 08-10-2010, 03:13 AM

Originally posted by hl450 View Post

I think there is an error in the code it should be input_reverse_filename rather than input_foward_filename

Yes, that does look better. I've updated the example. Thanks

**menenuh** · 07-05-2011, 01:05 AM

casava 1.8

Hello,
I have been using this script with great ease and pleasure, till Casava 1.8 update. Now the read id structure is changed, the script no longer recognizes the pair information. Do you have an update for this script?

**maubp** · 07-05-2011, 01:40 AM

Originally posted by menenuh View Post

Hello,
I have been using this script with great ease and pleasure, till Casava 1.8 update. Now the read id structure is changed, the script no longer recognizes the pair information. Do you have an update for this script?

You've run into the very problem I predicted on this thread announcing the Casava 1.8 changes:

Upcoming changes in CASAVA - SEQanswers

http://seqanswers.com/forums/showthread.php?t=8895

Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc

Simple answer, just look at the name (first word before the space) and they should match. i.e. There is no suffix - change these lines:

Code:

f_suffix = "/1"
r_suffix = "/2"

to:

Code:

f_suffix = ""
r_suffix = ""

and the script should work.

A slightly more robust script would look at the read number field in the second word.

**nposnien** · 03-06-2012, 06:07 AM

Hi,
@maubp: thank you very much for the script! I have no experience in programming but I guess that this script https://github.com/lexnederbragt/den...leave_pairs.py is e is exactly what I need (produce paired/singleton file from trimmed/filtered paired-end reads). However, my data is also based on CASAVA 1.8. Thus, I guess that I will need to change the script as recommended in the previous post. But my question is: How does the script identify the pairs? Based on the file they come from (forward or reverse)?
Thank you very much!!

**maubp** · 03-07-2012, 08:15 AM

If your FASTQ pairs come in separate forward and reverse files that tells you which is which. However, for the CASAVA 1.8 naming you could/should also look for 1 or 2 after the first space.

**odoyle81** · 03-19-2012, 09:22 PM

Thanks guys for these scripts! I was struggling trying to create one from scratch and these have been a huge help!

Topics	Statistics	Last Post
UC San Diego Bioengineers Map Gene Function in Human Stem Cells by SEQadmin2 Started by SEQadmin2, 07-13-2026, 10:26 AM	0 responses 15 views 0 reactions	Last Post by SEQadmin2 07-13-2026, 10:26 AM
New Analysis Splits Leukemia Into 16 Epigenomic Subgroups by SEQadmin2 Started by SEQadmin2, 07-09-2026, 10:04 AM	0 responses 29 views 0 reactions	Last Post by SEQadmin2 07-09-2026, 10:04 AM
Genome-Wide CRISPR Screen Uncovers Unlikely Psoriasis Target by SEQadmin2 Started by SEQadmin2, 07-08-2026, 10:08 AM	0 responses 16 views 0 reactions	Last Post by SEQadmin2 07-08-2026, 10:08 AM
Engineered Protein Motor Takes Its First Steps Along DNA Track by SEQadmin2 Started by SEQadmin2, 07-07-2026, 11:05 AM	0 responses 34 views 0 reactions	Last Post by SEQadmin2 07-07-2026, 11:05 AM

Unconfigured Ad

Interleaved mate pair fastq after quality filtering

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News