  • gwilymh
    Member
    • Dec 2011
    • 72

    Tips on using nested for statements in Python to maximize program efficiency

    I am developing a Python script that will parse data from two input files into new output files using two nested for loops. One of the input files is a list of gene locations on a chromosome, while the other is a list of SNP locations on that same chromosome. The data in both files is ordered by position on the chromosome. The output files contain a list of SNPs which are located within each gene on the chromosome being analyzed.


    The first input file is read line by line into Python using a for loop. Within this for loop, the second input file is read line by line. Once certain criteria are met between the first and second sets of input files, the second for loop is closed with a break statement. The next iteration of the first for loop then begins.

    The problem with this script is that for each iteration of the first for loop (i.e. each line of the first input file), the second for loop starts reading the second input file from the very first line. This wastes a lot of time, as the second input file contains millions of lines of data. Does anyone know a technique to begin the next iteration of the first loop without beginning from the very first line of the second input file, i.e. a way to ‘save’ the iterator position on the second for loop?

    My script is below.

    import fileinput
    import shlex

    nSNPsPerGene = open("C:/Users/gwilymh/Desktop/Python/SNPsPerGene/nSNPs per gene.txt", 'a')

    for i in fileinput.input("Gene Coordinates_full list.csv"):
        # Split the comma-separated gene record into fields
        gene = shlex.shlex(i, posix=True)
        gene.whitespace += ','
        gene.whitespace_split = True
        gene = list(gene)
        geneStart = int(gene[2])
        geneStop = int(gene[3])
        nSNPs = 0  # reset the SNP count for each gene
        output = open("C:/Users/gwilymh/Desktop/Python/SNPsPerGene/%s.txt" % str(geneStart), 'a')

        # The SNP file is reopened and rescanned from its first line for every gene
        for line in fileinput.FileInput("SNPs-1.txt"):
            SNP = shlex.shlex(line, posix=True)
            SNP.whitespace += '\t'
            SNP.whitespace_split = True
            SNP = list(SNP)
            SNPlocation = int(SNP[0])
            if SNPlocation < geneStart:
                continue
            if SNPlocation <= geneStop:
                output.write("%s\n" % str(SNP))
                nSNPs = nSNPs + 1
            else:
                nSNPsPerGene.write("%s\t%s\t%s\n" % (geneStart, nSNPs, geneStop - geneStart))
                break
        output.close()

    nSNPsPerGene.close()
  • sBeier
    Member
    • Jan 2013
    • 41

    #2
    If memory is not a constraint, why not read in both files first, map them in a way that is useful to you (in a dict or OrderedDict), and then iterate once over the first map?
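
A minimal sketch of the in-memory approach, assuming the SNP positions fit in RAM and are already sorted (as the original post states). It swaps the suggested dict for a sorted list searched with `bisect`, which is one way to avoid scanning every SNP for every gene. The function name `snps_in_gene` and the sample coordinates are made up for illustration.

```python
from bisect import bisect_left, bisect_right


def snps_in_gene(snp_positions, gene_start, gene_stop):
    """Return the SNP positions p with gene_start <= p <= gene_stop.

    snp_positions must be sorted in ascending order.
    """
    lo = bisect_left(snp_positions, gene_start)   # first index with p >= gene_start
    hi = bisect_right(snp_positions, gene_stop)   # first index with p > gene_stop
    return snp_positions[lo:hi]


# Hypothetical data standing in for the two parsed input files.
snp_positions = [5, 12, 18, 25, 31, 47]
genes = [(10, 20), (24, 31), (40, 45)]

snps_per_gene = {gene: snps_in_gene(snp_positions, *gene) for gene in genes}
```

Each lookup is a binary search, so this also copes with overlapping genes, at the cost of holding every SNP position in memory.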

    • dariober
      Senior Member
      • May 2010
      • 311

      #3
      Originally posted by gwilymh
      Does anyone know a technique to begin the next iteration of the first loop without beginning from the very first line of the second input file, i.e. a way to ‘save’ the iterator position on the second for loop?
      What about using the tell() and seek() file methods? They seem to do what you need. From the Python docs (http://docs.python.org/2/tutorial/inputoutput.html):

      f.tell() returns an integer giving the file object’s current position in the file, measured in bytes from the beginning of the file. To change the file object’s position, use f.seek(offset, from_what)
      (However, depending on exactly what you need to do, a better data structure such as an interval tree might scale better...)

      Best
      Dario
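
A rough sketch of the tell()/seek() idea, written for Python 3 and under some assumptions that are not in the original script: genes are sorted and do not overlap, gene lines are `start,stop`, and SNP lines begin with the position followed by a tab. The function name `snps_per_gene` is hypothetical. It reads SNPs with readline() rather than a for loop, because Python 3 text files refuse tell() while they are being iterated.

```python
import io


def snps_per_gene(gene_handle, snp_handle):
    """Yield (gene_start, gene_stop, [snp positions]) for each gene line."""
    for gene_line in gene_handle:
        gene_start, gene_stop = (int(x) for x in gene_line.split(",")[:2])
        hits = []
        while True:
            offset = snp_handle.tell()      # position before reading this SNP line
            line = snp_handle.readline()
            if not line:                    # end of the SNP file
                break
            position = int(line.split("\t")[0])
            if position < gene_start:
                continue                    # SNP lies before this gene
            if position > gene_stop:
                snp_handle.seek(offset)     # un-read it so the next gene sees it
                break
            hits.append(position)
        yield gene_start, gene_stop, hits


# In-memory stand-ins for the two input files (made-up data).
genes = io.StringIO("10,20\n24,31\n40,45\n")
snps = io.StringIO("5\tA\n12\tC\n18\tG\n25\tT\n31\tA\n47\tC\n")
result = list(snps_per_gene(genes, snps))
```

The SNP handle is opened once and only ever moves back by one line, so the total work is roughly one pass over each file.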

      • ECO
        --Site Admin--
        • Oct 2007
        • 1360

        #4
        Here's a tip, post in the correct forum!

        Moving to Bioinfx.

        • Luyi Tian
          Member
          • Mar 2012
          • 15

          #5
          First, there are two ways to read lines from a file that remember the 'position':
          Code:
              file = open('yourfile', 'r')
              file.readline()  # read one line; calling it again returns the next line
              file.next()      # Python 2 iterator method (next(file) in Python 3); also returns the next line
              # In Python 2, don't mix next() and readline() on one file: next() uses a read-ahead buffer
          Second, you could use PyPy to accelerate your script. If your script contains a lot of 'for' and 'while' loops, PyPy can make it much faster, perhaps 10 times, though the gain depends on the workload. You could also use file.readlines(10000) to read lines in batches and save I/O time; note that the argument is an approximate size in bytes, not a number of lines.
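
To illustrate the 'position' point, here is a small Python 3 sketch using an in-memory `io.StringIO` object as a stand-in for an opened file (the contents are made up). A file object keeps its place, so a later loop over the same handle resumes where the previous read stopped instead of starting over.

```python
import io

handle = io.StringIO("line1\nline2\nline3\nline4\n")  # stand-in for open('yourfile')

first = handle.readline()               # reads "line1\n"
second = next(handle)                   # Python 3 spelling of file.next(); reads "line2\n"
remaining = [line for line in handle]   # iteration resumes at line3, not at the start
```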

          • maubp
            Peter (Biopython etc)
            • Jul 2009
            • 1544

            #6
            It sounds like you can just open file 1 and file 2 once BEFORE starting the nested loops, but perhaps I've not understood your problem fully.

            Based on the filenames you might have one line per gene in both files, so a loop iterating over both files together could work. For example something like this:

            Code:
            import itertools
            handle1 = open(...)
            handle2 = open(...)
            # itertools.izip is the lazy Python 2 form; on Python 3 use the built-in zip()
            for line1, line2 in itertools.izip(handle1, handle2):
                pass  # assert line1 and line2 are for the same gene
            handle1.close()
            handle2.close()

            • rflrob
              Member
              • May 2010
              • 50

              #7
              Originally posted by maubp
              It sounds like you can just open file 1 and file 2 once BEFORE starting the nested loops, but perhaps I've not understood your problem fully.

              Based on the filenames you might have one line per gene in both files, so a loop iterating over both files together could work. For example something like this:

              Code:
              import itertools
              handle1 = open(...)
              handle2 = open(...)
              # itertools.izip is the lazy Python 2 form; on Python 3 use the built-in zip()
              for line1, line2 in itertools.izip(handle1, handle2):
                  pass  # assert line1 and line2 are for the same gene
              handle1.close()
              handle2.close()

              Even if you don't have one line per gene, you can still use the same trick of opening the handles once:

              Code:
              handle1 = open(...)
              handle2 = open(...)
              
              for gene in handle1:
                  # do stuff
                  for snp in handle2:
                      # do stuff
                      if condition:
                          break
              You'd have to be careful not to lose the first snp for each gene, of course.
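
One way to avoid losing that first SNP, sketched for Python 3 under assumptions that go beyond the original script: genes are sorted and non-overlapping, gene lines are `start,stop`, and SNP lines begin with the position followed by a tab. The SNP that ended the previous gene's scan is held in a `pending` variable and re-examined for the next gene. The name `assign_snps` is hypothetical.

```python
import io


def assign_snps(gene_handle, snp_handle):
    """Yield (gene_start, gene_stop, snp_count) in one pass over both files."""
    pending = None                      # SNP read for an earlier gene but not yet used
    for gene_line in gene_handle:
        fields = gene_line.rstrip("\n").split(",")
        gene_start, gene_stop = int(fields[0]), int(fields[1])
        count = 0
        while True:
            if pending is None:
                line = snp_handle.readline()
                if not line:            # SNP file exhausted
                    break
                pending = int(line.split("\t")[0])
            if pending > gene_stop:
                break                   # keep it in `pending` for the next gene
            if pending >= gene_start:
                count += 1
            pending = None              # consumed: counted, or located before the gene
        yield gene_start, gene_stop, count


# In-memory stand-ins for the two input files (made-up data).
genes = io.StringIO("10,20\n24,31\n40,45\n")
snps = io.StringIO("5\tA\n12\tC\n18\tG\n25\tT\n31\tA\n47\tC\n")
counts = list(assign_snps(genes, snps))
```

Because nothing is ever re-read, this variant also works on inputs that cannot seek, such as pipes.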

              As a hint, there are code tags you can use that preserve the indentation in your post, which makes your Python code much easier to understand.

              • brentp
                Member
                • Apr 2010
                • 72

                #8
                Use bedtools, or pybedtools in Python. If your data is in BED format, this will make your script much faster and much simpler.
