Hi everyone!
I have a couple of simple questions about the "Per sequence GC content" results from FastQC. After running FastQC on my raw data, the mean of my curve falls at the same place as the computed theoretical value. I have a single peak, as expected, but it is much higher than the theoretical one, which (I guess) is why the module is flagged as "failed". My data contain over-represented sequences, so I guess the higher peak can be attributed to them, right?
My other question: after running Trimmomatic to clip the 3' and 5' ends with a Phred quality threshold of 25, the theoretical and experimental peaks still fall at the same position (as expected), but the theoretical curve is now wider. Is this because, after filtering, FastQC recalculates the theoretical curve from the sequences that passed? The experimental curve is still higher than the theoretical one, which I believe is because the over-represented sequences were not removed. Am I on the right track?
I am having trouble understanding how the theoretical GC curve is computed. Does it take the length of the reads and perform the calculation based on a biological model of GC distribution? I mean, the GC abundance cannot be taken from my data; otherwise both peaks would look identical.
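To make the question concrete, here is a minimal Python sketch of the alternative I am unsure about: building the "theoretical" curve as a normal distribution centred on the mode of the observed per-read GC values, with the spread also estimated from the data. This is only an assumption for illustration and not FastQC's actual code; the function names `gc_percent`, `observed_distribution` and `fitted_normal` are hypothetical.

```python
import math
from collections import Counter


def gc_percent(read):
    """Return the GC percentage of one read, rounded to an integer."""
    read = read.upper()
    gc = sum(1 for base in read if base in "GC")
    return round(100 * gc / len(read))


def observed_distribution(reads):
    """Count how many reads fall at each GC percentage from 0 to 100."""
    counts = Counter(gc_percent(r) for r in reads if r)
    return [counts.get(p, 0) for p in range(101)]


def fitted_normal(observed):
    """Normal curve centred on the observed mode (an assumed model).

    The curve is scaled to the same total number of reads, and its
    standard deviation is measured around the mode of the observed data.
    """
    total = sum(observed)
    mode = max(range(101), key=lambda p: observed[p])
    variance = sum(n * (p - mode) ** 2 for p, n in enumerate(observed)) / total
    sd = math.sqrt(variance) or 1.0  # avoid dividing by zero if all reads agree
    scale = total / (sd * math.sqrt(2 * math.pi))
    return [scale * math.exp(-((p - mode) ** 2) / (2 * sd * sd)) for p in range(101)]


# Toy reads, made up for illustration only.
reads = ["ATGC"] * 5 + ["AATG", "GGCA"]
observed = observed_distribution(reads)
theoretical = fitted_normal(observed)
```

If something like this is what happens, the two curves would share a peak position by construction but would not need to have the same height or width, which is what I am trying to confirm.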
Thanks for any clarification on this!