  • frymor
    started a topic interpreting preseq results

    interpreting preseq results

    Hi,

    I am running preseq c_curve and lc_extrap on a few fastq files using the option -V.
    (Q: Does it make sense at all to run preseq on the fastq files, or would it be more accurate to run it on the mapped files?)

    I was wondering, though, how to interpret the results I am getting.
    For example, I have a fastq file with the summarized results shown below.
    I am using the following script to run multiple files through preseq:
    Code:
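    for file in *.fastq.gz
    do
        base=$(basename "$file" .fastq.gz)
        # Sequence lines are every 4th line starting at line 2; count how often the first 20 bases of each read occur
        zcat "$file" | awk 'NR % 4 == 2 { print substr($0, 1, 20) }' | sort | uniq -c | awk '{ print $1, $2 }' > "${base}.counts"
        # Run both preseq commands on the counts file; the summary printed to stderr is saved separately
        preseq c_curve   -v -V -o "${base}.preseq.complexity" "${base}.counts" 2> "${base}.complexitySummary.text"
        preseq lc_extrap -v -V -o "${base}.preseq.yields"     "${base}.counts" 2> "${base}.yieldsSummary.text"
    done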
    I get these values in the output files from the two commands (including my interpretations of the specific rows):
    Code:
    c_curve:
    VALS_INPUT
    TOTAL READS     = 3130582 - how many reads I have in the library
    COUNTS_SUM      = 3130582 - how many reads were counted in the run
    DISTINCT READS  = 513863 - that many distinct reads were found
    DISTINCT COUNTS = 197 - what does that mean?
    MAX COUNT       = 1131097 - the sequence with the highest copy number
    COUNTS OF 1     = 254836 - number of unique reads in the library
    OBSERVED COUNTS (1131098) - what does that mean?
    
    lc_extrap:
    VALS_INPUT
    TOTAL READS     = 3130582 - same as above
    DISTINCT READS  = 513863 - same as above
    DISTINCT COUNTS = 197 - what does that mean?
    MAX COUNT       = 1131097 - same as above
    COUNTS OF 1     = 254836 - same as above
    MAX TERMS       = 100 - what does that mean?
    OBSERVED COUNTS (1131098) - what does that mean?
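
As a cross-check on the interpretations above, the summary values can be recomputed from the counts file itself. This is a minimal sketch, assuming the first column of each line holds the count (as in the .counts files written by the script). `summarize_counts` is a hypothetical helper, not part of preseq, and reading DISTINCT COUNTS as "the number of different count values that occur" is an assumption to compare against the preseq summary, not a confirmed definition.

```shell
# Hypothetical helper (not part of preseq): recompute summary values
# from a file whose first column is the number of times a read was seen.
summarize_counts() {
    awk '
        {
            total += $1                  # sum of all counts
            distinct++                   # one line per distinct read
            values[$1] = 1               # remember each count value seen
            if ($1 > max) max = $1       # largest single count
            if ($1 == 1) singletons++    # reads seen exactly once
        }
        END {
            for (v in values) nvalues++  # number of different count values (assumed meaning of DISTINCT COUNTS)
            printf "total_reads=%d\n", total
            printf "distinct_reads=%d\n", distinct
            printf "distinct_count_values=%d\n", nvalues
            printf "max_count=%d\n", max
            printf "counts_of_1=%d\n", singletons
        }
    ' "$1"
}
```

Calling `summarize_counts sample.counts` (with `sample.counts` standing in for one of the generated files) prints one `name=value` line per quantity, which can be compared line by line with the VALS_INPUT block.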
    The results from the two runs are as follows:
    Code:
    c_curve:
    total_reads	distinct_reads
    0	0
    1000000	294482
    2000000	414438
    3000000	503167
    
    lc_extrap:
    TOTAL_READS	EXPECTED_DISTINCT	LOWER_0.95CI	UPPER_0.95CI
    0	0	0	0
    1000000.0	294958.5	210028.8	414231.3
    2000000.0	414987.0	304029.5	566439.1
    3000000.0	503926.0	362806.3	699936.5
    4000000.0	583253.5	410283.3	829145.8
    5000000.0	658204.7	453968.8	954324.2
    ...
    9996000000.0	8300141.2	538673.3	127892632.4
    9997000000.0	8300152.7	538664.3	127895121.7
    9998000000.0	8300164.1	538655.3	127897610.5
    9999000000.0	8300175.6	538646.3	127900099.0
    Q: Do I understand correctly that in my experiment I have ~3.1M reads, of which ~255K are unique? And if I take the same library and sequence it deeper, to a depth of 9999M reads, will I have ~8.3M unique reads?
    How should I understand the two confidence bounds (LOWER_0.95CI and UPPER_0.95CI)? (see image below)
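
One way to read the extrapolation table is to look at how many additional distinct reads each further step of sequencing is expected to add. The following is a minimal sketch, assuming a whitespace-separated lc_extrap output file with one header line and TOTAL_READS and EXPECTED_DISTINCT in the first two columns, as in the output shown above. `marginal_yield` is a hypothetical helper name, not a preseq command.

```shell
# Hypothetical helper: for each row of an lc_extrap output file, print
# TOTAL_READS and the gain in EXPECTED_DISTINCT relative to the previous row.
marginal_yield() {
    awk '
        NR == 1 { next }                                   # skip the header line
        NR > 2  { printf "%.1f\t%.1f\n", $1, $2 - prev }   # gain since the previous row
                { prev = $2 }                              # remember this row for the next one
    ' "$1"
}
```

For the rows listed above, this gain falls from about 295K distinct reads for the first million reads to about 75K for the fifth million (658204.7 - 583253.5 = 74951.2).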

    Q: Is there a way to tell when a library is of such low quality / complexity that it is not worth investigating further?
    I have given here an example of what, in my opinion, is not a very good library, since it has a lot of repeats (a single read accounts for as much as a third of the data). I know there is probably no black and white in such experiments, but a rule of thumb would be nice :-)
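
The "a third of the data" observation can be put into numbers directly from the VALS_INPUT summary. This is a minimal sketch that only does the arithmetic: `library_fractions` is a hypothetical helper taking total reads, distinct reads and max count as arguments, and it does not imply any threshold for a good or bad library.

```shell
# Hypothetical helper: express distinct reads and the most frequent read
# as fractions of the total read count.
# Usage: library_fractions TOTAL_READS DISTINCT_READS MAX_COUNT
library_fractions() {
    awk -v total="$1" -v distinct="$2" -v max="$3" 'BEGIN {
        printf "distinct_fraction=%.3f\n", distinct / total
        printf "top_read_fraction=%.3f\n", max / total
    }'
}
```

With the values from this library, `library_fractions 3130582 513863 1131097` gives a distinct fraction of about 0.164 and a top-read fraction of about 0.361.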

    Q: What should the curve in the plot look like for a "good" and for a "bad" library?
    Below are the plots I get for this library (made quickly in Excel):
    [Image: c_curve and lc_plot]

    thanks
    Assa
    Last edited by frymor; 04-04-2016, 02:09 AM.
