Subsampling using 'head -n #"?

chayan replied

05-04-2013, 12:22 PM
Additionally i want to know is it possible to extract fastq based on the read length (may or may not be random)??? what i mean actually, if i extract 40% reads from a fastq those will represent only the longest 40% of the total...not sure whether MIRA convert_project can do this or not..any help??

Regrads
Leave a comment:
chayan replied

05-04-2013, 12:13 PM
Hmm encouraging line..thnx a ton
Leave a comment:
kmcarr replied

05-04-2013, 11:54 AM
Originally posted by chayan View Post

Thanx maubp, i am not sure but probably those Roche applications are not free to access..still i ll search those and surely ll give a try to your other suggestions..again thnx for your kind help

Regards

While not open source (free as in speech) they are made available free (as in beer) of charge. Fill out the reqest form here.
Leave a comment:
chayan replied

05-04-2013, 11:08 AM
Thanx maubp, i am not sure but probably those Roche applications are not free to access..still i ll search those and surely ll give a try to your other suggestions..again thnx for your kind help

Regards
Leave a comment:
maubp replied

05-04-2013, 06:46 AM
Originally posted by chayan View Post

hii all
how can i use the above mentioned scripts to randomly extract a fixed % of output from .sff file, cause am working of a iontorrent denovo assembly with MIRA and i need to extract both the fastq as well as traceinfo.xml... (as mentioned by the sequence provider as well as Bastein).. thanx for any help in advance

Regards
Chayan

SFF is a binary format with a non-trivial header. Unix text tools like grep are not going to be any help. I would suggest using the Roche 'off instrument' applications which include an SFF tool which can select random reads (Linux only, search for how to get Newbler on these forums for more information).

Alternatively, if you program in Python, this would be pretty trivial using the Biopython SFF support. http://biopython.org Likewise BioHaskell has SFF support and command line tools for working with these files. http://biohaskell.org
Leave a comment:
chayan replied

05-04-2013, 05:31 AM
hii all
how can i use the above mentioned scripts to randomly extract a fixed % of output from .sff file, cause am working of a iontorrent denovo assembly with MIRA and i need to extract both the fastq as well as traceinfo.xml... (as mentioned by the sequence provider as well as Bastein).. thanx for any help in advance

Regards
Chayan
Leave a comment:
syfo replied

04-18-2013, 10:02 AM
Originally posted by Richard Finney View Post

A simple script extracting random fastq records using reservoir sampling ...

# usage: program inputfastq numsubsamples outputsubampfqfile
# example: ./job.fqreservoir SRR451888.fq 1000 subsamp.fq

cat $1 | \
awk '{if ((p%4)<3) printf("%s\t",$0); else printf("%s\n",$0); p++;}' | \
awk -v k=$2 'BEGIN{srand(systime() + PROCINFO["pid"]);}{y=x++<k?x-1:int(rand()*x);if(y<k)a[y]=$0}END{for(z in a)print a[z]}' | \
tr '\t' '\n' > $3

See previous discussion here: http://seqanswers.com/forums/showthread.php?t=16845

Note the acutal line #3 is a hack of lh3's script WITH seeding, i.e. BEGIN{srand(systime() + PROCINFO["pid"]);}
Remove the seed code to get same results every time you run it. Otherwise leave it in to get different results.

This program joins the 4 fastq lines together into one line , spearating them with a tab character.
These combined lines are subsampled.
At the end, the "tr" command seperates the selected combined lines back into 4 lines.
This script used the UNIX/LINUX/BSD standard utilities "tr", "cat", and "awk".

Reservoir sampling! Thank you so much, this is exactly what I was looking for!
Leave a comment:
JahnDavik replied

04-18-2013, 12:30 AM
Thanks. This is what I'd need. Hope it works :-)
Leave a comment:
JahnDavik replied

04-18-2013, 12:27 AM
Ad:random sampling in forward and reverse

Thanks for posting this code. I really need it and I really hope it works.
Leave a comment:
JahnDavik replied

04-18-2013, 12:25 AM
Thanks for posting this code. I really need it and I really hope it works?
Leave a comment:
maubp replied

06-20-2012, 05:08 AM
It would have been less effort to reuse Biopython's FASTQ parser which already deals with the nasty line wrapping cases. e.g.

OBF » Working with FASTQ files in Biopython when speed matters » Working with FASTQ files in Biopython when speed matters

http://news.open-bio.org/news/2009/09/biopython-fast-fastq/

Open Bioinformatics Foundation Homepage
Leave a comment:

jmun7616 replied

06-20-2012, 03:24 AM

Originally posted by gringer View Post

A few warnings about this Python code:

ASCII quality scores in fastq files can contain arbitrary characters like '>' and '@'. The only way to protect against this for multi-line fastq files is to keep a record of the sequence length.
This code reads the entire set of reads into memory, then outputs random sequences -- this will not work on fasta/fastq files that are larger than the memory limit. While this is the only way to do it for streamed input, the code here uses file names, so can do a two-pass operation and reopen a file (see my previous post).

This revised version addresses both these issues:

Code:

import random
import glob
inFiles = glob.glob('*.fasta') + glob.glob('*.fastq')
outFiles = []
num = int(raw_input("Enter number of random sequences to select:\n"))


for i in range(len(inFiles)):
        bookMarks = []
        fNames = []
        fSeqs = []
        start = 0
        BkMk = 0                                             
        outFiles.append(file(str(inFiles[i])+'_'+'Rand_'+str(num)+'.fa', 'wt'))

        for line in open(inFiles[i]):
                if ((line[0] == '>') or (line[0] == '@')):
                        start = 1        
                        bookMarks.append(BkMk)
                else:
                        if (start == 1):
                                if (((line.count('A') + line.count('a')+ line.count('C') + line.count('c') +
                                line.count('G') + line.count('g') + line.count('T') + line.count('t')+
                                line.count('N')+ line.count('n')+ line.count('\n')+ line.count('\r')+
                                line.count('\t')) == len(line)) and ((len(line))>1)):
                                        start = 0
                                else:
                                        bookMarks.pop()
                                        start = 0         
                BkMk+=1
        size = (len(bookMarks))
        print str(size) +" sequences identified in " + inFiles[i] + "\n"

        randomMarks = range(size)
        for j in range((len(bookMarks)-num-1)):
                a = random.randint(0, (len(randomMarks)-1))
                randomMarks.pop(a)

        j = 0
        k = 0
        write = 0
        for line in open(inFiles[i]):
                if (write == 1):
                        if ((randomMarks[k]+1) >= size):
                                outFiles[i].write(line)
                                break
                        elif(j == (bookMarks[randomMarks[k]+1])):
                               write = 0
                               k+=1
                               if (k >= num):
                                        break
                        else:
                                outFiles[i].write(line)
                if (j == bookMarks[randomMarks[k]]):
                       write = 1
                       outFiles[i].write(line)
                j+=1
raw_input("Done.")

Leave a comment:

golharam replied

06-19-2012, 12:45 PM
I had this problem too. I actually built a sub sampler that works on fragment and paired-end fastq (gzip'ed or not) data. Its written in Java and based on the Picard framework. https://github.com/golharam/FastqSubsample.

Benefits of using this:
1) No external Python packages to install. All you need is a JRE and the JAR file.
2) Doesn't load everything into memory.
Leave a comment:
gringer replied

06-17-2012, 02:59 PM
This program joins the 4 fastq lines together into one line , spearating them with a tab character.

That particular code won't work with multi-line fasta/fastq files (the linked subsampler.py script has the same issue, as well as pre-reading all the lines to determine the total number of lines, and some modulus code that may influence probabilities in a bad way), but the reservoir sampling method used would provide a way to process files from standard input in a single pass, and use multi-line files from standard input. However, you need to be a bit careful with the selection procedure to ensure a uniform chance of selection for every record. After a bit of hunting, I see some reasonable pseudo-code here (linked from a stackoverflow thread):

http://blogs.msdn.com/b/spt/archive/...-sampling.aspx

The modified pseudo-code for a fasta/fastq file using reservoir sampling would be as follows:
Look for magic record start sequences. If found, increment the record counter and clear 'current record', otherwise print out the output array and finish

If a fasta file, read through and store up to the next record start as the 'current record', then go to step 6

If a fastq file, read up to the next '+' to work out sequence length (store in 'current record')

read the next line and store in 'current record'

read as many quality values as were in the sequence, store in 'current record'

if record counter is less than desired number of records, add current record to end of output array, return to step 1

generate a random number between 0 and the total number of records read

if this number is less than the number of desired records, replace the array element indexed by that number by the current record, return to step 1

The selection procedure here is (I think) the same as in the awk script that Richard Finney wrote / copied, which is (I think) the same as in the msdn pseudo-code.
Leave a comment:
Richard Finney replied

06-17-2012, 07:46 AM
A simple script extracting random fastq records using reservoir sampling ...

# usage: program inputfastq numsubsamples outputsubampfqfile
# example: ./job.fqreservoir SRR451888.fq 1000 subsamp.fq

cat $1 | \
awk '{if ((p%4)<3) printf("%s\t",$0); else printf("%s\n",$0); p++;}' | \
awk -v k=$2 'BEGIN{srand(systime() + PROCINFO["pid"]);}{y=x++<k?x-1:int(rand()*x);if(y<k)a[y]=$0}END{for(z in a)print a[z]}' | \
tr '\t' '\n' > $3

See previous discussion here: http://seqanswers.com/forums/showthread.php?t=16845

Note the acutal line #3 is a hack of lh3's script WITH seeding, i.e. BEGIN{srand(systime() + PROCINFO["pid"]);}
Remove the seed code to get same results every time you run it. Otherwise leave it in to get different results.

This program joins the 4 fastq lines together into one line , spearating them with a tab character.
These combined lines are subsampled.
At the end, the "tr" command seperates the selected combined lines back into 4 lines.
This script used the UNIX/LINUX/BSD standard utilities "tr", "cat", and "awk".

Last edited by Richard Finney; 06-17-2012, 08:39 AM.
Leave a comment:

Previous 1 2 3 template Next

Current Approaches to Protein Sequencing

by seqadmin

Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
- Channel: Articles
04-04-2024, 04:25 PM
Strategies for Sequencing Challenging Samples

by seqadmin

Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
- Channel: Articles
03-22-2024, 06:39 AM

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 31 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 33 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 28 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 53 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Latest Articles

ad_right_rmr

News