Unconfigured Ad

gringer · 06-17-2012, 06:30 AM

This Python solution may be a bit late but it will work on both multi-line fasta and fastq files.

A few warnings about this Python code:

ASCII quality scores in fastq files can contain arbitrary characters like '>' and '@'. The only way to protect against this for multi-line fastq files is to keep a record of the sequence length.
This code reads the entire set of reads into memory, then outputs random sequences -- this will not work on fasta/fastq files that are larger than the memory limit. While this is the only way to do it for streamed input, the code here uses file names, so can do a two-pass operation and reopen a file (see my previous post).

jmun7616 · 06-15-2012, 04:40 PM

Hey, just notices this thread. This Python solution may be a bit late but it will work on both multi-line fasta and fastq files. Set up is such that all files in same dir will be subsampled, but this can be changed. Also three subsamplings taken, but you can change this easily enough or just use one of them.

Code:

import random
import glob
inFiles = glob.glob('*.fasta') + glob.glob('*.fastq')
outFiles = []
num = int(raw_input("Enter number of random sequences to select:\n"))

for i in range(len(inFiles)):

        for k in range(3):
                fNames = []
                fSeqs = []
                same = 0
                outFiles.append(file(str(inFiles[i])+'_'+'Rand_'+str(num)+'-'+str(k+1)+'.fa', 'wt'))
                for line in open(inFiles[i]):
                        if ((line[0] == '>') or (line[0] == '@')):
                                fNames.append(line)
                                same = 0
                        else:
                                if (same != 1):
                                        fSeqs.append(line)
                                        same = 1
                                else:
                                        fSeqs[(len(fSeqs)-1)] += line
                curr = (len(outFiles)-1)
                for j in range(num):
                        a = random.randint(0, (len(fNames)-1))
                        outFiles[curr].write(fNames.pop(a))
                        outFiles[curr].write(fSeqs.pop(a))
raw_input("Done.")

swbarnes2 · 01-16-2012, 11:07 AM

I think a simpler way to do it...just get all the reads from a particular set of tiles (I don't know how that translates on HiSeq machines) Since the first few and last few tiles are bad, pick some tiles in the middle, by grepping the read names, which have teh cluster cooridnates embedded in them. Or zgrep, if the fastqs are already zipped. That should be random with respect to where the read align to the genome, with better quality than picking the top and bottom tiles with head or foot.

But yes, this won't work on multi-line fastqs.

gringer · 01-16-2012, 09:36 AM

Edit: Just noticed that it won't work with multi-line FASTA or FASTQ files, but for most NGS things you don't need to worry about that.

So we have a situation that the script wasn't designed for. If you have fastx-toolkit handy, you can convert multi-line fasta files into single-line fasta files using the fasta-formatter tool, and then run through the script. Changing this script to work in a memory-efficient manner for multi-line fasta and fastq files would increase the complexity a reasonable amount.

A rough pseudo-code for this is as follows:

First pass -- counting records

Look for magic record start sequences, if found, increment the record counter
If a fasta file, read through and discard up to the next record start, return to the first step
If a fastq file, read up to the next '+' to work out sequence length
read the next line and discard
read as many quality values as were in the sequence, return to the first step

Second pass -- printing sequences

Generate n random numbers between 1 and the number of records
iterate through the sequences as before, but print the entire sequence record if the current record number matches to the random list

silver_steve · 01-16-2012, 09:22 AM

Originally posted by kga1978 View Post

Sorry, yes, that was a typo - there's a gunzip in front of that.

So, a talented student wrote up a quick script in python for me and it works really well. It appears that it is about as fast as writing to disk, so overall what I was looking for. It takes the following argument:

Code:

python subsampler.py <filename> <number of sequences to extract>

The script is as follows:

Code:

#!/usr/bin/env python
import random
import sys

#name of the input file (fasta or fastq)
#assumes input file is standard fasta/fastq format
fileName = sys.argv[1]
#number of sequences to subsample
numSeq = int(sys.argv[2])
increment = 0

#if it's a fasta file
if (fileName.find(".fasta") != -1):
  increment = 2
#else if it's a fastq file
elif (fileName.find(".fastq") != -1):
  increment = 4
#quit if neither
else:
  sys.stdout.write("not a fasta/fastq file\n")
  sys.exit()

FILE = open(fileName, 'r')
reads = list()
index = 0
#put the entire file in a list
for line in FILE:
  if(index % increment == 0):
    reads.append(line)
  else:
    reads[(index/increment)] += line
  index += 1
FILE.close()

#figure out total number of reads, error if applicable
totalReads = index/increment
if(totalReads < numSeq):
  sys.stdout.write("You only have "+str(totalReads)+" reads!\n")
  sys.exit()

#shuffle the reads!
random.shuffle(reads)
#output the reads to stdout
for i in range(0, numSeq):
  sys.stdout.write(reads[i])

Thank you kga1978. I ran this script successfully, yet the output was 1000 lines, not 1000 random sequences. I believe this is because my file is formatted like:
>File_Name
GACGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGAGCGAGTCTGCCTTGAAGATCGG
AGTGCTTGCACTCTGTGAAACAAGATACAGGCTAGCGGCGGACGGGTGAGTAACACGTGG...
i.e., the sequence is not one continuous string, but rather a set of lines.
Do you know how to modify the code or modify my files to work with subsampler.py?

Adjuvant · 01-03-2012, 03:01 PM

Nobody asked for it, but I wanted it so maybe somebody else can use it, too.

Here's a rewrite of the python script that can now take two paired files as input and maintain pairing in the subsets. It can also still take just a single-end file as input. It's not pretty, but it does the job.

Changes: Now have to enter the number of reads desired first, then the input file name(s). Instead of going to stdout, the reads are output to out_1.fastq (or .fasta) and out_2.fastq (or .fasta). It would be relatively simple to put in an output prefix parameter, but this was good enough for my purposes.

Code:

#!/usr/bin/env python
import random
import sys

random.seed()
#number of sequences to subsample
numSeq = int(sys.argv[1])
#name of the input file (fasta or fastq)
#assumes input file is standard fasta/fastq format
fileName1 = sys.argv[2]
fileName2 = sys.argv[3]
increment = 0

#if it's a fasta file
if (fileName1.find(".fasta") != -1):
  increment = 2
#else if it's a fastq file
elif (fileName1.find(".fastq") != -1):
  increment = 4
#quit if neither
else:
  sys.stdout.write("not a fasta/fastq file\n")
  sys.exit()

FILE1 = open(fileName1, 'r')
totalReads1 = list()
index = 0
for line in FILE1:
  if(index % increment == 0):
    totalReads1.append(index/increment)
  index += 1
FILE1.close()
if(len(totalReads1) < numSeq):
  sys.stdout.write("You only have "+str(len(totalReads))+" reads!\n")
  sys.exit()
  
if (fileName2):
  FILE2 = open(fileName2, 'r')
  totalReads2 = list()
  index = 0
  for line in FILE2:
    if (index % increment == 0):
      totalReads2.append(index/increment)
    index += 1
  FILE2.close()
  if (len(totalReads1) != len(totalReads2)):
    sys.stdout.write("read counts do not match\n")
    sys.exit()

ttl = len(totalReads1)
random.shuffle(totalReads1)
totalReads1 = totalReads1[0: numSeq]
totalReads1.sort()

FILE1 = open(fileName1, 'r')
if (fileName2):
  FILE2 = open(fileName2, 'r')
listIndex = 0

if (increment == 4):
  OUTFILE1 = open('out_1.fastq', 'w')
  if (fileName2):
    OUTFILE2 = open('out_2.fastq', 'w')
else:
  OUTFILE1 = open('out_1.fasta', 'w')
  if (fileName2):
    OUTFILE2 = open('out_2.fasta', 'w')

for i in range(0, ttl):
  curRead1 = ""
  curRead2 = ""
  for j in range(0, increment):
    curRead1 += FILE1.readline()
    if (fileName2):
      curRead2 += FILE2.readline()
  if (i == totalReads1[listIndex]):
    OUTFILE1.write(curRead1)
    if (fileName2):
      OUTFILE2.write(curRead2)
    listIndex += 1
    if(listIndex == numSeq):
      break
FILE1.close()
if (fileName2):
  FILE2.close()
OUTFILE1.close()
if (fileName2):
  OUTFILE2.close()

kga1978 · 12-29-2011, 10:16 AM

Also, following gringer's suggestion, here's a less memory-hungry version of the downsampling script:

Code:

#!/usr/bin/env python
import random
import sys

random.seed()
#name of the input file (fasta or fastq)
#assumes input file is standard fasta/fastq format
fileName = sys.argv[1]
#number of sequences to subsample
numSeq = int(sys.argv[2])
increment = 0

#if it's a fasta file
if (fileName.find(".fasta") != -1):
  increment = 2
#else if it's a fastq file
elif (fileName.find(".fastq") != -1):
  increment = 4
#quit if neither
else:
  sys.stdout.write("not a fasta/fastq file\n")
  sys.exit()

FILE = open(fileName, 'r')
totalReads = list()
index = 0
for line in FILE:
  if(index % increment == 0):
    totalReads.append(index/increment)
  index += 1
FILE.close()
if(len(totalReads) < numSeq):
  sys.stdout.write("You only have "+str(len(totalReads))+" reads!\n")
  sys.exit()

ttl = len(totalReads)
random.shuffle(totalReads)
totalReads = totalReads[0: numSeq]
totalReads.sort()

FILE = open(fileName, 'r')
listIndex = 0

for i in range(0, ttl):
  curRead = ""
  for j in range(0, increment):
    curRead += FILE.readline()
  if (i == totalReads[listIndex]):
    sys.stdout.write(curRead)
    listIndex += 1
    if(listIndex == numSeq):
      break
FILE.close()

kga1978 · 12-29-2011, 10:14 AM

Just to follow up on the first part of my question, I now ran a couple of subsample experiments and compared them to the entire dataset as well as to 'head -n' and 'tail -n'. As noted by other posters, there are definitely biases when doing 'head' and 'tail'.

I subsampled 1 million reads from a total of 50 million and ran FastQC (to calculate avg. Q-score) and aligned the reads to hg19 using BWA. This is what I got:

kga1978 · 12-29-2011, 05:43 AM

Hey gringer,

Thanks for your comments - very helpful and I'll make sure to pass them on. For now though, I don't really mind holding everything in memory - should make it faster and memory isn't a problem (for now....).

And actually, I won't work on stdin at the moment as it requires the .fasta or .fastq extension to work (this could be easily fixed by specifying file-type though...).

ETHANol · 12-29-2011, 03:30 AM

kga1978, thanks for posting the script. I was just working on a perl script to do the same, but being a beginner at scripting it was going to take me an annoying amount of time. So thanks and now I can do something more productive today!!!

gringer · 12-29-2011, 01:59 AM

Originally posted by kga1978 View Post

So, a talented student wrote up a quick script in python for me and it works really well. It appears that it is about as fast as writing to disk, so overall what I was looking for.

This code is fine, but bear in mind that while this process will work on piped input (possibly with minor modifications to file opening), it requires your entire input file to be stored in memory (in the 'reads' list). If you don't care about piped input and can afford a second sweep through the file, you can reduce the memory footprint substantially by only using the first sweep to count records, then the second sweep to print out the randomly selected sequences.

Edit: Just noticed that it won't work with multi-line FASTA or FASTQ files, but for most NGS things you don't need to worry about that.

kga1978 · 12-28-2011, 07:36 PM

I assume that is a typo

Sorry, yes, that was a typo - there's a gunzip in front of that.

So, a talented student wrote up a quick script in python for me and it works really well. It appears that it is about as fast as writing to disk, so overall what I was looking for. It takes the following argument:

Code:

python subsampler.py <filename> <number of sequences to extract>

The script is as follows:

Code:

#!/usr/bin/env python
import random
import sys

#name of the input file (fasta or fastq)
#assumes input file is standard fasta/fastq format
fileName = sys.argv[1]
#number of sequences to subsample
numSeq = int(sys.argv[2])
increment = 0

#if it's a fasta file
if (fileName.find(".fasta") != -1):
  increment = 2
#else if it's a fastq file
elif (fileName.find(".fastq") != -1):
  increment = 4
#quit if neither
else:
  sys.stdout.write("not a fasta/fastq file\n")
  sys.exit()

FILE = open(fileName, 'r')
reads = list()
index = 0
#put the entire file in a list
for line in FILE:
  if(index % increment == 0):
    reads.append(line)
  else:
    reads[(index/increment)] += line
  index += 1
FILE.close()

#figure out total number of reads, error if applicable
totalReads = index/increment
if(totalReads < numSeq):
  sys.stdout.write("You only have "+str(totalReads)+" reads!\n")
  sys.exit()

#shuffle the reads!
random.shuffle(reads)
#output the reads to stdout
for i in range(0, numSeq):
  sys.stdout.write(reads[i])

gringer · 12-28-2011, 04:33 PM

Originally posted by kga1978 View Post

gringer, as for shuf, I don't think that will work? Reason being that each read spans 2 (fasta) or 4 (fastq) lines?

You're correct... it won't work because of the serial nature of the files. Yet another case of me not thinking enough before posting.

The only ways I can think of at the moment are fairly complex piped shell commands, or small programs.

I guess one "simple" way to do it would be to seek to a random file position, then read forward until you find a start of sequence marker, but that's a lot of work for too little benefit. FWIW, doing a strict random seek on a fastq file is not possible because both @ and + can appear in the quality strings, and you can get multi-line fastq files.

Edit: I've just thought of another way that would work for a single pass (i.e. could work for piped input), assuming you knew the approximate number of records in the file. You decide in advance what records will be selected (i.e. generate 1M numbers between 1 and 100M), sort the list, then run through the file and print only the records that appear in the list.

maubp · 12-28-2011, 03:56 PM

Originally posted by kga1978 View Post

I then subsample 25.000 reads from this file using the following command:

Code:

head -n 100000 file.fastq.gz | <downstream analysis, e.g. blastn>

I assume that is a typo, since you must unzip the gripped FASTQ before taking the first 1000000 lines.

Originally posted by kga1978 View Post

IMy worry is that by doing that, I will get some sort of bias in my analysis as I am only taking the 'head' of the first part of all my reads. The question is - is this a valid concern?

Yes, since the read order produced by Illumina reflects the slide positions, so the first and last reads will be from the edges and generally of poorer quality (so I've read).

kga1978 · 12-28-2011, 03:15 PM

Thanks for the replies - that's very helpful.

I have only been running this over the last couple of days (but we have loads where I need to do it) and didn't notice anything majorly wrong, other than some weird QC scores for a couple of files (gringer, I guess that is what you have seen as well).

We have a script to subsample correctly, so I will probably just use that - the main problem being that it needs to read the entire file before being able to output a subsample. In that sense 'head' was a lot faster.

gringer, as for shuf, I don't think that will work? Reason being that each read spans 2 (fasta) or 4 (fastq) lines? Otherwise, this is exactly what I would be looking for - given a file of n lines, randomly take a read from line k (so 4 lines, always starting at a position that can be divided by 4). But I haven't been able to find a script or command that will do that.

Topics	Statistics	Last Post
New Analysis Splits Leukemia Into 16 Epigenomic Subgroups by SEQadmin2 Started by SEQadmin2, Yesterday, 10:04 AM	0 responses 11 views 0 reactions	Last Post by SEQadmin2 Yesterday, 10:04 AM
Genome-Wide CRISPR Screen Uncovers Unlikely Psoriasis Target by SEQadmin2 Started by SEQadmin2, 07-08-2026, 10:08 AM	0 responses 9 views 0 reactions	Last Post by SEQadmin2 07-08-2026, 10:08 AM
Engineered Protein Motor Takes Its First Steps Along DNA Track by SEQadmin2 Started by SEQadmin2, 07-07-2026, 11:05 AM	0 responses 16 views 0 reactions	Last Post by SEQadmin2 07-07-2026, 11:05 AM
High-Resolution Sequencing Exposes Hidden Toxoplasma Diversity by SEQadmin2 Started by SEQadmin2, 07-02-2026, 11:08 AM	0 responses 31 views 0 reactions	Last Post by SEQadmin2 07-02-2026, 11:08 AM

Unconfigured Ad

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Latest Articles

ad_right_rmr

News