I am developing a Python script that will parse data from two input files into new output files using two nested for loops. One of the input files is a list of gene locations on a chromosome, while the other is a list of SNP locations on that same chromosome. The data in both files is ordered by position on the chromosome. The output files contain a list of SNPs which are located within each gene on the chromosome being analyzed.
The first input file is read line by line using a for loop. Within this loop, the second input file is also read line by line. Once a SNP is found that lies beyond the end of the current gene, the inner for loop is exited with a break statement, and the next iteration of the outer for loop begins.
The problem with this script is that for each iteration of the first for loop (i.e. each line of the first input file), the second for loop starts reading the second input file from the very first line. This wastes a lot of time, as the second input file contains millions of lines of data. Does anyone know a technique to begin the next iteration of the first loop without beginning from the very first line of the second input file, i.e. a way to ‘save’ the iterator position on the second for loop?
My script is below.
import sys
import fileinput
import shlex

nSNPsPerGene = open("C:/Users/gwilymh/Desktop/Python/SNPsPerGene/nSNPs per gene.txt", 'a')
for i in fileinput.input("Gene Coordinates_full list.csv"):
    # Split the comma-separated gene record into fields
    gene = shlex.shlex(i, posix=True)
    gene.whitespace += ','
    gene.whitespace_split = True
    gene = list(gene)
    geneStart = int(gene[2])
    geneStop = int(gene[3])
    nSNPs = 0  # reset the SNP count for each gene
    output = open("C:/Users/gwilymh/Desktop/Python/SNPsPerGene/%s.txt" % geneStart, 'a')
    # The SNP file is reopened and read from its first line for every gene
    for line in fileinput.FileInput("SNPs-1.txt"):
        SNP = shlex.shlex(line, posix=True)
        SNP.whitespace += '\t'
        SNP.whitespace_split = True
        SNP = list(SNP)
        SNPlocation = int(SNP[0])
        if SNPlocation < geneStart:
            continue
        if SNPlocation >= geneStart and SNPlocation <= geneStop:
            output.write("%s\n" % SNP)
            nSNPs = nSNPs + 1
        else:
            # First SNP past the end of the gene: record the totals and stop
            nSNPsPerGene.write("%s\t%s\t%s\n" % (geneStart, nSNPs, geneStop - geneStart))
            break
    output.close()
nSNPsPerGene.close()
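For reference, here is a minimal sketch of what 'saving' the iterator position could look like. It is not the script above: the function name `snps_per_gene`, its arguments, and the `pending` variable are hypothetical, and it assumes that both inputs are sorted by position and that genes do not overlap. The idea is to create the SNP iterator once, outside the gene loop, and to keep the one SNP that has been read but not yet assigned, so that it can still be tested against the next gene.

```python
def snps_per_gene(genes, snp_positions):
    """Yield (gene_start, gene_stop, snps_inside) for each gene in one pass.

    Assumptions (not taken from the original script): `genes` is an iterable
    of (start, stop) pairs sorted by position and non-overlapping, and
    `snp_positions` is an iterable of integers sorted in ascending order.
    """
    snps = iter(snp_positions)      # created once, so its position is kept
    pending = next(snps, None)      # SNP read but not yet assigned to a gene
    for start, stop in genes:
        hits = []
        # Consume SNPs up to the end of this gene; anything before `start`
        # lies between genes and is skipped.
        while pending is not None and pending <= stop:
            if pending >= start:
                hits.append(pending)
            pending = next(snps, None)
        # `pending` is now past `stop` (or None) and is kept for the next gene
        yield start, stop, hits


if __name__ == "__main__":
    example_genes = [(10, 20), (30, 40), (50, 60)]
    example_snps = [5, 10, 15, 20, 25, 30, 45, 55, 70]
    for start, stop, hits in snps_per_gene(example_genes, example_snps):
        print(start, stop, len(hits), hits)
```

With real data, `genes` could be a generator that yields `(geneStart, geneStop)` from each line of the gene file and `snp_positions` a generator that yields `int(SNP[0])` from each line of the SNP file, so neither file needs to be held in memory. If genes can overlap, a single `pending` value is not enough, because one SNP may belong to more than one gene.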