Additionally i want to know is it possible to extract fastq based on the read length (may or may not be random)??? what i mean actually, if i extract 40% reads from a fastq those will represent only the longest 40% of the total...not sure whether MIRA convert_project can do this or not..any help??
Regrads
Seqanswers Leaderboard Ad
Collapse
Announcement
Collapse
No announcement yet.
X
-
Originally posted by chayan View PostThanx maubp, i am not sure but probably those Roche applications are not free to access..still i ll search those and surely ll give a try to your other suggestions..again thnx for your kind help
Regards
Leave a comment:
-
Thanx maubp, i am not sure but probably those Roche applications are not free to access..still i ll search those and surely ll give a try to your other suggestions..again thnx for your kind help
Regards
Leave a comment:
-
Originally posted by chayan View Posthii all
how can i use the above mentioned scripts to randomly extract a fixed % of output from .sff file, cause am working of a iontorrent denovo assembly with MIRA and i need to extract both the fastq as well as traceinfo.xml... (as mentioned by the sequence provider as well as Bastein).. thanx for any help in advance
Regards
Chayan
Alternatively, if you program in Python, this would be pretty trivial using the Biopython SFF support. http://biopython.org Likewise BioHaskell has SFF support and command line tools for working with these files. http://biohaskell.org
Leave a comment:
-
hii all
how can i use the above mentioned scripts to randomly extract a fixed % of output from .sff file, cause am working of a iontorrent denovo assembly with MIRA and i need to extract both the fastq as well as traceinfo.xml... (as mentioned by the sequence provider as well as Bastein).. thanx for any help in advance
Regards
Chayan
Leave a comment:
-
Originally posted by Richard Finney View PostA simple script extracting random fastq records using reservoir sampling ...
# usage: program inputfastq numsubsamples outputsubampfqfile
# example: ./job.fqreservoir SRR451888.fq 1000 subsamp.fq
cat $1 | \
awk '{if ((p%4)<3) printf("%s\t",$0); else printf("%s\n",$0); p++;}' | \
awk -v k=$2 'BEGIN{srand(systime() + PROCINFO["pid"]);}{y=x++<k?x-1:int(rand()*x);if(y<k)a[y]=$0}END{for(z in a)print a[z]}' | \
tr '\t' '\n' > $3
See previous discussion here: http://seqanswers.com/forums/showthread.php?t=16845
Note the acutal line #3 is a hack of lh3's script WITH seeding, i.e. BEGIN{srand(systime() + PROCINFO["pid"]);}
Remove the seed code to get same results every time you run it. Otherwise leave it in to get different results.
This program joins the 4 fastq lines together into one line , spearating them with a tab character.
These combined lines are subsampled.
At the end, the "tr" command seperates the selected combined lines back into 4 lines.
This script used the UNIX/LINUX/BSD standard utilities "tr", "cat", and "awk".
Leave a comment:
-
Ad:random sampling in forward and reverse
Thanks for posting this code. I really need it and I really hope it works.
Leave a comment:
-
Thanks for posting this code. I really need it and I really hope it works?
Leave a comment:
-
It would have been less effort to reuse Biopython's FASTQ parser which already deals with the nasty line wrapping cases. e.g.
Leave a comment:
-
Originally posted by gringer View PostA few warnings about this Python code:- ASCII quality scores in fastq files can contain arbitrary characters like '>' and '@'. The only way to protect against this for multi-line fastq files is to keep a record of the sequence length.
- This code reads the entire set of reads into memory, then outputs random sequences -- this will not work on fasta/fastq files that are larger than the memory limit. While this is the only way to do it for streamed input, the code here uses file names, so can do a two-pass operation and reopen a file (see my previous post).
Code:import random import glob inFiles = glob.glob('*.fasta') + glob.glob('*.fastq') outFiles = [] num = int(raw_input("Enter number of random sequences to select:\n")) for i in range(len(inFiles)): bookMarks = [] fNames = [] fSeqs = [] start = 0 BkMk = 0 outFiles.append(file(str(inFiles[i])+'_'+'Rand_'+str(num)+'.fa', 'wt')) for line in open(inFiles[i]): if ((line[0] == '>') or (line[0] == '@')): start = 1 bookMarks.append(BkMk) else: if (start == 1): if (((line.count('A') + line.count('a')+ line.count('C') + line.count('c') + line.count('G') + line.count('g') + line.count('T') + line.count('t')+ line.count('N')+ line.count('n')+ line.count('\n')+ line.count('\r')+ line.count('\t')) == len(line)) and ((len(line))>1)): start = 0 else: bookMarks.pop() start = 0 BkMk+=1 size = (len(bookMarks)) print str(size) +" sequences identified in " + inFiles[i] + "\n" randomMarks = range(size) for j in range((len(bookMarks)-num-1)): a = random.randint(0, (len(randomMarks)-1)) randomMarks.pop(a) j = 0 k = 0 write = 0 for line in open(inFiles[i]): if (write == 1): if ((randomMarks[k]+1) >= size): outFiles[i].write(line) break elif(j == (bookMarks[randomMarks[k]+1])): write = 0 k+=1 if (k >= num): break else: outFiles[i].write(line) if (j == bookMarks[randomMarks[k]]): write = 1 outFiles[i].write(line) j+=1 raw_input("Done.")
Leave a comment:
-
I had this problem too. I actually built a sub sampler that works on fragment and paired-end fastq (gzip'ed or not) data. Its written in Java and based on the Picard framework. https://github.com/golharam/FastqSubsample.
Benefits of using this:
1) No external Python packages to install. All you need is a JRE and the JAR file.
2) Doesn't load everything into memory.
Leave a comment:
-
This program joins the 4 fastq lines together into one line , spearating them with a tab character.
http://blogs.msdn.com/b/spt/archive/...-sampling.aspx
The modified pseudo-code for a fasta/fastq file using reservoir sampling would be as follows:- Look for magic record start sequences. If found, increment the record counter and clear 'current record', otherwise print out the output array and finish
- If a fasta file, read through and store up to the next record start as the 'current record', then go to step 6
- If a fastq file, read up to the next '+' to work out sequence length (store in 'current record')
- read the next line and store in 'current record'
- read as many quality values as were in the sequence, store in 'current record'
- if record counter is less than desired number of records, add current record to end of output array, return to step 1
- generate a random number between 0 and the total number of records read
- if this number is less than the number of desired records, replace the array element indexed by that number by the current record, return to step 1
The selection procedure here is (I think) the same as in the awk script that Richard Finney wrote / copied, which is (I think) the same as in the msdn pseudo-code.
Leave a comment:
-
A simple script extracting random fastq records using reservoir sampling ...
# usage: program inputfastq numsubsamples outputsubampfqfile
# example: ./job.fqreservoir SRR451888.fq 1000 subsamp.fq
cat $1 | \
awk '{if ((p%4)<3) printf("%s\t",$0); else printf("%s\n",$0); p++;}' | \
awk -v k=$2 'BEGIN{srand(systime() + PROCINFO["pid"]);}{y=x++<k?x-1:int(rand()*x);if(y<k)a[y]=$0}END{for(z in a)print a[z]}' | \
tr '\t' '\n' > $3
See previous discussion here: http://seqanswers.com/forums/showthread.php?t=16845
Note the acutal line #3 is a hack of lh3's script WITH seeding, i.e. BEGIN{srand(systime() + PROCINFO["pid"]);}
Remove the seed code to get same results every time you run it. Otherwise leave it in to get different results.
This program joins the 4 fastq lines together into one line , spearating them with a tab character.
These combined lines are subsampled.
At the end, the "tr" command seperates the selected combined lines back into 4 lines.
This script used the UNIX/LINUX/BSD standard utilities "tr", "cat", and "awk".Last edited by Richard Finney; 06-17-2012, 08:39 AM.
Leave a comment:
Latest Articles
Collapse
-
by seqadmin
Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...-
Channel: Articles
04-04-2024, 04:25 PM -
-
by seqadmin
Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...-
Channel: Articles
03-22-2024, 06:39 AM -
ad_right_rmr
Collapse
News
Collapse
Topics | Statistics | Last Post | ||
---|---|---|---|---|
Started by seqadmin, 04-11-2024, 12:08 PM
|
0 responses
31 views
0 likes
|
Last Post
by seqadmin
04-11-2024, 12:08 PM
|
||
Started by seqadmin, 04-10-2024, 10:19 PM
|
0 responses
33 views
0 likes
|
Last Post
by seqadmin
04-10-2024, 10:19 PM
|
||
Started by seqadmin, 04-10-2024, 09:21 AM
|
0 responses
28 views
0 likes
|
Last Post
by seqadmin
04-10-2024, 09:21 AM
|
||
Started by seqadmin, 04-04-2024, 09:00 AM
|
0 responses
53 views
0 likes
|
Last Post
by seqadmin
04-04-2024, 09:00 AM
|
Leave a comment: