I am running the preseq c_curve and lc_extrap commands on a few fastq files.
(Q: does it make sense at all to run preseq on the fastq files, or would it be more accurate to run it on the mapped files?)
I was wondering, though, how to interpret the results I am getting. For example, I have a fastq file with these summarized results.
I am using the following script to run multiple files through preseq:
Code:
for file in *.fastq.gz; do
    base=$(basename $file .fastq.gz)
    # Take the first 20 bp of each read as its "identity" and count copies
    zcat ${base}.fastq.gz | awk '{if (NR%4==2) print substr($0,1,20);}' | sort | uniq -c | awk '{print $1,$2}' > ${base}.counts
    preseq c_curve -v -V -o ${base}.preseq.complexity ${base}.counts 2> ${base}.complexitySummary.text
    preseq lc_extrap -v -V -o ${base}.preseq.yields ${base}.counts 2> ${base}.yieldsSummary.text
done
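Regarding the fastq-vs-mapped question above: as far as I know, preseq can also read a coordinate-sorted BAM file directly (the -B flag, if I recall the option correctly; please check preseq c_curve --help), in which case duplicates are defined by mapping position rather than by the first 20 bp of the read. A sketch of that variant, assuming an alignment file sample.bam:

```shell
# Hypothetical BAM-based run; flag names from memory, verify against your preseq version.
samtools sort -o sample.sorted.bam sample.bam
preseq c_curve   -B -v -o sample.preseq.complexity sample.sorted.bam 2> sample.complexitySummary.text
preseq lc_extrap -B -v -o sample.preseq.yields     sample.sorted.bam 2> sample.yieldsSummary.text
```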
Code:
[B]c_curve[/B]:
VALS_INPUT
TOTAL READS     = 3130582   [I]- how many reads I have in the library[/I]
COUNTS_SUM      = 3130582   [I]- how many reads were counted in the run[/I]
DISTINCT READS  = 513863    [I]- that many distinct reads were found[/I]
DISTINCT COUNTS = 197       [B]- what does that mean?[/B]
MAX COUNT       = 1131097   [I]- the sequence with the highest copy number[/I]
COUNTS OF 1     = 254836    [I]- number of unique reads in the library[/I]
OBSERVED COUNTS (1131098)   [B]- what does that mean?[/B]

[B]lc_extrap[/B]:
VALS_INPUT
TOTAL READS     = 3130582   [I]- same as above[/I]
DISTINCT READS  = 513863    [I]- same as above[/I]
DISTINCT COUNTS = 197       [B]- what does that mean?[/B]
MAX COUNT       = 1131097   [I]- same as above[/I]
COUNTS OF 1     = 254836    [I]- same as above[/I]
MAX TERMS       = 100       [B]- what does that mean?[/B]
OBSERVED COUNTS (1131098)   [B]- what does that mean?[/B]
Code:
[B]c_curve[/B]:
total_reads  distinct_reads
0            0
1000000      294482
2000000      414438
3000000      503167

[B]lc_extrap[/B]:
TOTAL_READS   EXPECTED_DISTINCT  LOWER_0.95CI  UPPER_0.95CI
0             0                  0             0
1000000.0     294958.5           210028.8      414231.3
2000000.0     414987.0           304029.5      566439.1
3000000.0     503926.0           362806.3      699936.5
4000000.0     583253.5           410283.3      829145.8
5000000.0     658204.7           453968.8      954324.2
...
9996000000.0  8300141.2          538673.3      127892632.4
9997000000.0  8300152.7          538664.3      127895121.7
9998000000.0  8300164.1          538655.3      127897610.5
9999000000.0  8300175.6          538646.3      127900099.0
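To put some rough numbers on how duplicated this library is, the summary values above can be turned into simple ratios with awk (the three inputs are copied from the c_curve summary; substitute your own library's values):

```shell
# Quick complexity ratios from the c_curve summary values above.
total=3130582      # TOTAL READS
distinct=513863    # DISTINCT READS
singles=254836     # COUNTS OF 1
maxcount=1131097   # MAX COUNT

# Fraction of all reads that are distinct at the current depth
awk -v t=$total -v d=$distinct 'BEGIN{printf "distinct/total = %.3f\n", d/t}'
# prints: distinct/total = 0.164

# Fraction of the distinct reads seen exactly once
awk -v d=$distinct -v s=$singles 'BEGIN{printf "singletons/distinct = %.3f\n", s/d}'
# prints: singletons/distinct = 0.496

# Share of the whole library taken by the single most abundant sequence
awk -v t=$total -v m=$maxcount 'BEGIN{printf "maxcount/total = %.3f\n", m/t}'
# prints: maxcount/total = 0.361
```

So the most abundant sequence alone is about 36% of the data, which matches my impression below that one read takes about a third of the library.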
How should I understand the two confidence intervals? (see image below)
Q: Is there a way to tell when a library is of such low quality/complexity that it is not worth investigating it further?
I have given here an example of what in my opinion is not such a good library, as I have a lot of repeats (one read takes up as much as a third of the data). I know there is probably no black and white in such experiments, but a rule of thumb would be nice :-)
Q: What should the curve of the plot look like for a "good" and for a "bad" library?
Below are the plots I get for this library (done fast with Excel):
[image: c_curve and lc_extrap plots]
thanks
Assa