Subsampling using 'head -n #'?

  • chayan
    replied
    Additionally, I want to know: is it possible to extract FASTQ reads based on read length (which may or may not be random)? What I mean is, if I extract 40% of the reads from a FASTQ file, can those be just the longest 40% of the total? I'm not sure whether MIRA's convert_project can do this. Any help?

    Regards



  • chayan
    replied
    Hmm, an encouraging line... thanks a ton.



  • kmcarr
    replied
    Originally posted by chayan View Post
    Thanks maubp. I am not sure, but those Roche applications are probably not free to access. Still, I'll search for them and will surely give your other suggestions a try. Again, thanks for your kind help.

    Regards
    While not open source (free as in speech), they are made available free of charge (free as in beer). Fill out the request form here.



  • chayan
    replied
    Thanks maubp. I am not sure, but those Roche applications are probably not free to access. Still, I'll search for them and will surely give your other suggestions a try. Again, thanks for your kind help.

    Regards



  • maubp
    replied
    Originally posted by chayan View Post
    Hi all,
    How can I use the above-mentioned scripts to randomly extract a fixed percentage of reads from an .sff file? I am working on an Ion Torrent de novo assembly with MIRA, and I need to extract both the FASTQ and the traceinfo.xml (as mentioned by the sequence provider as well as Bastien). Thanks for any help in advance.

    Regards
    Chayan
    SFF is a binary format with a non-trivial header, so Unix text tools like grep are not going to be any help. I would suggest using the Roche 'off-instrument' applications, which include an SFF tool that can select random reads (Linux only; search these forums for how to get Newbler for more information).

    Alternatively, if you program in Python, this would be pretty trivial using Biopython's SFF support. http://biopython.org Likewise, BioHaskell has SFF support and command-line tools for working with these files. http://biohaskell.org
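    To make the Biopython route concrete, here is a minimal sketch (mine, not code from the thread): a generic helper that subsamples any collection of records, with a comment showing how it would plug into Bio.SeqIO's SFF reader/writer. The filenames reads.sff and subset.sff are placeholders.

```python
import random

def sample_fraction(records, fraction, seed=None):
    """Return a uniform random subset of roughly `fraction` of `records`.

    Loads everything into memory, which is usually fine for a single SFF file.
    """
    records = list(records)
    rng = random.Random(seed)  # fixed seed makes the subsample reproducible
    return rng.sample(records, int(len(records) * fraction))

# With Biopython installed, this plugs straight into its SFF support
# ("reads.sff" / "subset.sff" are placeholder filenames):
#
#   from Bio import SeqIO
#   subset = sample_fraction(SeqIO.parse("reads.sff", "sff"), 0.40, seed=1)
#   with open("subset.sff", "wb") as handle:  # SFF is binary, so "wb"
#       SeqIO.write(subset, handle, "sff")
```

    The same helper works unchanged for FASTQ records parsed with SeqIO.parse(handle, "fastq").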



  • chayan
    replied
    Hi all,
    How can I use the above-mentioned scripts to randomly extract a fixed percentage of reads from an .sff file? I am working on an Ion Torrent de novo assembly with MIRA, and I need to extract both the FASTQ and the traceinfo.xml (as mentioned by the sequence provider as well as Bastien). Thanks for any help in advance.

    Regards
    Chayan



  • syfo
    replied
    Originally posted by Richard Finney View Post
    A simple script extracting random fastq records using reservoir sampling ...

    # usage: program inputfastq numsubsamples outputsubsampfqfile
    # example: ./job.fqreservoir SRR451888.fq 1000 subsamp.fq

    cat $1 | \
    awk '{if ((p%4)<3) printf("%s\t",$0); else printf("%s\n",$0); p++;}' | \
    awk -v k=$2 'BEGIN{srand(systime() + PROCINFO["pid"]);}{y=x++<k?x-1:int(rand()*x);if(y<k)a[y]=$0}END{for(z in a)print a[z]}' | \
    tr '\t' '\n' > $3


    See previous discussion here: http://seqanswers.com/forums/showthread.php?t=16845

    Note the actual line #3 is a hack of lh3's script WITH seeding, i.e. BEGIN{srand(systime() + PROCINFO["pid"]);}
    Remove the seed code to get the same results every time you run it; otherwise leave it in to get different results.

    This program joins the 4 fastq lines together into one line, separating them with a tab character.
    These combined lines are subsampled.
    At the end, the "tr" command separates the selected combined lines back into 4 lines.
    This script uses the UNIX/Linux/BSD standard utilities "tr", "cat", and "awk".
    Reservoir sampling! Thank you so much, this is exactly what I was looking for!



  • JahnDavik
    replied
    Thanks. This is what I'd need. Hope it works :-)



  • JahnDavik
    replied
    Ad: random sampling in forward and reverse

    Thanks for posting this code. I really need it and I really hope it works.




  • maubp
    replied
    It would have been less effort to reuse Biopython's FASTQ parser, which already deals with the nasty line-wrapping cases, e.g.
    http://news.open-bio.org/news/2009/0...on-fast-fastq/



  • jmun7616
    replied
    Originally posted by gringer View Post
    A few warnings about this Python code:
    • ASCII quality scores in fastq files can contain arbitrary characters like '>' and '@'. The only way to protect against this for multi-line fastq files is to keep a record of the sequence length.
    • This code reads the entire set of reads into memory, then outputs random sequences -- this will not work on fasta/fastq files that are larger than the memory limit. While this is the only way to do it for streamed input, the code here uses file names, so can do a two-pass operation and reopen a file (see my previous post).
    This revised version addresses both these issues:
    Code:
    import random
    import glob

    inFiles = glob.glob('*.fasta') + glob.glob('*.fastq')
    outFiles = []
    num = int(input("Enter number of random sequences to select:\n"))

    for i in range(len(inFiles)):
        bookMarks = []  # line numbers where accepted record headers start
        start = 0
        BkMk = 0
        outFiles.append(open(inFiles[i] + '_Rand_' + str(num) + '.fa', 'wt'))

        # First pass: bookmark each record start, checking the line after
        # each candidate header so that '>'/'@' characters inside quality
        # strings are not mistaken for new records.
        for line in open(inFiles[i]):
            if line[0] == '>' or line[0] == '@':
                start = 1
                bookMarks.append(BkMk)
            elif start == 1:
                if ((line.count('A') + line.count('a') + line.count('C') + line.count('c') +
                     line.count('G') + line.count('g') + line.count('T') + line.count('t') +
                     line.count('N') + line.count('n') + line.count('\n') + line.count('\r') +
                     line.count('\t')) == len(line)) and len(line) > 1:
                    start = 0
                else:
                    bookMarks.pop()  # false header: discard the bookmark
                    start = 0
            BkMk += 1
        size = len(bookMarks)
        print(str(size) + " sequences identified in " + inFiles[i] + "\n")

        # Randomly discard bookmarks until num (plus one sentinel) remain.
        randomMarks = list(range(size))
        for j in range(len(bookMarks) - num - 1):
            a = random.randint(0, len(randomMarks) - 1)
            randomMarks.pop(a)

        # Second pass: write out the lines belonging to the selected records.
        j = 0
        k = 0
        write = 0
        for line in open(inFiles[i]):
            if write == 1:
                if randomMarks[k] + 1 >= size:
                    # Selected record is the last in the file: no following
                    # bookmark, so keep writing until end of file.
                    outFiles[i].write(line)
                elif j == bookMarks[randomMarks[k] + 1]:
                    write = 0
                    k += 1
                    if k >= num:
                        break
                else:
                    outFiles[i].write(line)
            if write == 0 and j == bookMarks[randomMarks[k]]:
                write = 1
                outFiles[i].write(line)
            j += 1
    input("Done.")



  • golharam
    replied
    I had this problem too. I actually built a subsampler that works on fragment and paired-end FASTQ data (gzipped or not). It's written in Java and based on the Picard framework: https://github.com/golharam/FastqSubsample.

    Benefits of using this:
    1) No external Python packages to install. All you need is a JRE and the JAR file.
    2) It doesn't load everything into memory.



  • gringer
    replied
    This program joins the 4 fastq lines together into one line, separating them with a tab character.
    That particular code won't work with multi-line fasta/fastq files. (The linked subsampler.py script has the same issue, and in addition it pre-reads all the lines to determine the total line count, and has some modulus code that may skew the probabilities.) However, the reservoir sampling method it uses would provide a way to process files from standard input in a single pass, including multi-line files. You do need to be a bit careful with the selection procedure to ensure a uniform chance of selection for every record. After a bit of hunting, I see some reasonable pseudo-code here (linked from a stackoverflow thread):

    http://blogs.msdn.com/b/spt/archive/...-sampling.aspx

    The modified pseudo-code for a fasta/fastq file using reservoir sampling would be as follows:
    1. Look for a magic record-start sequence. If found, increment the record counter and clear the 'current record'; otherwise print out the output array and finish.
    2. If a fasta file, read through and store everything up to the next record start as the 'current record', then go to step 6.
    3. If a fastq file, read up to the next '+' to work out the sequence length (store in 'current record').
    4. Read the next line and store it in 'current record'.
    5. Read as many quality values as there were sequence characters; store them in 'current record'.
    6. If the record counter is less than the desired number of records, add the current record to the end of the output array and return to step 1.
    7. Generate a random number between 0 and the total number of records read.
    8. If this number is less than the number of desired records, replace the array element indexed by that number with the current record, then return to step 1.

    The selection procedure here is (I think) the same as in the awk script that Richard Finney wrote / copied, which is (I think) the same as in the msdn pseudo-code.



  • Richard Finney
    replied
    A simple script extracting random fastq records using reservoir sampling ...

    # usage: program inputfastq numsubsamples outputsubsampfqfile
    # example: ./job.fqreservoir SRR451888.fq 1000 subsamp.fq

    cat $1 | \
    awk '{if ((p%4)<3) printf("%s\t",$0); else printf("%s\n",$0); p++;}' | \
    awk -v k=$2 'BEGIN{srand(systime() + PROCINFO["pid"]);}{y=x++<k?x-1:int(rand()*x);if(y<k)a[y]=$0}END{for(z in a)print a[z]}' | \
    tr '\t' '\n' > $3


    See previous discussion here: http://seqanswers.com/forums/showthread.php?t=16845

    Note the actual line #3 is a hack of lh3's script WITH seeding, i.e. BEGIN{srand(systime() + PROCINFO["pid"]);}
    Remove the seed code to get the same results every time you run it; otherwise leave it in to get different results.

    This program joins the 4 fastq lines together into one line, separating them with a tab character.
    These combined lines are subsampled.
    At the end, the "tr" command separates the selected combined lines back into 4 lines.
    This script uses the UNIX/Linux/BSD standard utilities "tr", "cat", and "awk".
    Last edited by Richard Finney; 06-17-2012, 08:39 AM.

