To complement our cuffdiff analysis we thought to take a parallel route using Homer:
To do so we basically followed this protocol:
makeTagDirectory pairend.sam -format sam -fragLength 250 -removeSpikes 10000 5 -genome /data//projects/mm9/ref.fasta -checkGC -tbp 1 -sspe -flip
Then, we run analyzeRepeats.pl to make read summaries across genes:
analyzeRepeats.pl rna mm9 -noadj -count genes -strand + -d MIa9_1 MIa9_2 > Scramble_vs_Target_analyzeRepeats.txt
We run this step as recommended by the Homer tutorial with the unnormalized data so the edgeR analysis, which is the following step, would be able to use an unnormalized data.
Next step we run:
getDiffExpression.pl Scramble_vs_Target_analyzeRepeats.txt MIa9_1 MIa9_2 -repeats -edgeR > Scramble_vs_Target_analyzeRepeats_edgeR.txt
After testing the execution of this command and checking closely the perl script for getDiffExpression we run into the following understanding:
The getDiffExpression step takes unnormalized data and transmits it to edgeR so edgeR can calculate log2FDR, p-value, and FDR. off the original unnormalized data. edgeR creates a model for the data and uses pseudo values in cases of zero values. edgeR finally outputs the log2FDR, p-value, and FDR as calculated after transferring the original data into a modeled space.
Now, the output of getDiffExpression.pl is composed of two parts:
The three columns at the very right end, which contain the edgeR's output (Log2FC; p-value, FDR).
AND the output also contains - the rest of the table (all the left side-end columns) which forms a matrix. The matrix is of normalized values. BUT these are not normalized values created by edgeR, but instead - normalized values that were created as part of getDiffExpression.pl script execution.
This data is only presented in the final output table next to the edgeR output (three right columns) and is aimed only for the user to see, but this normalized values were NOT actually been used as part of the edgeR analysis (unlike what I originally thought...).
My issue is that we are currently not sure in what manner the data which presented in this table was actually normalized.
Please see this link for relevant indications:
Look for explanatory indication such as the following which exactly states what I've described above:
"This command will basically add 3 columns to the end of your original file per pairwise comparison. Most important are the log2 fold change and the p-value and FDR values. The raw read counts will be normalized in this file based on the total reads in each column so that the values are more easily to compare when looking through the results."
However, our attempts to re-run the previous step again - (run analyzeRepeats.pl) - using this time the normalization options in order to obtain the normalized values that appear in the output from getDiffExpression.pl did not prove to be successful...
We tried:
analyzeRepeats.pl rna mm9 -noadj -count genes -strand + -d MIa9_1 MIa9_2 -normMatrix 1e6
And also,
analyzeRepeats.pl rna mm9 -noadj -count genes -strand + -d MIa9_1 MIa9_2 -norm 1e7
but the normalized values that we obtained were quite different from the ones that were obtained after running the getDiffExpression.pl script...
We think that it would become important to know how the data that appears in the output table from getDiffExpression.pl was actually normalized. Was it per all reads? was it per all reads in each individual library? ...
If any of the users have some more information and ideas regarding this matter we'll be glad to hear!
Alternatively, we could ignore the normalized values content of the output table, and refer only to the results of edgeR that show the statistics.
It would be interesting to hear how other users decided to normalize their rna-seq data using this approach. Which selections were finally chosen.
Thanks a lot!
Roy
To do so we basically followed this protocol:
makeTagDirectory pairend.sam -format sam -fragLength 250 -removeSpikes 10000 5 -genome /data//projects/mm9/ref.fasta -checkGC -tbp 1 -sspe -flip
Then, we run analyzeRepeats.pl to make read summaries across genes:
analyzeRepeats.pl rna mm9 -noadj -count genes -strand + -d MIa9_1 MIa9_2 > Scramble_vs_Target_analyzeRepeats.txt
We run this step as recommended by the Homer tutorial with the unnormalized data so the edgeR analysis, which is the following step, would be able to use an unnormalized data.
Next step we run:
getDiffExpression.pl Scramble_vs_Target_analyzeRepeats.txt MIa9_1 MIa9_2 -repeats -edgeR > Scramble_vs_Target_analyzeRepeats_edgeR.txt
After testing the execution of this command and checking closely the perl script for getDiffExpression we run into the following understanding:
The getDiffExpression step takes unnormalized data and transmits it to edgeR so edgeR can calculate log2FDR, p-value, and FDR. off the original unnormalized data. edgeR creates a model for the data and uses pseudo values in cases of zero values. edgeR finally outputs the log2FDR, p-value, and FDR as calculated after transferring the original data into a modeled space.
Now, the output of getDiffExpression.pl is composed of two parts:
The three columns at the very right end, which contain the edgeR's output (Log2FC; p-value, FDR).
AND the output also contains - the rest of the table (all the left side-end columns) which forms a matrix. The matrix is of normalized values. BUT these are not normalized values created by edgeR, but instead - normalized values that were created as part of getDiffExpression.pl script execution.
This data is only presented in the final output table next to the edgeR output (three right columns) and is aimed only for the user to see, but this normalized values were NOT actually been used as part of the edgeR analysis (unlike what I originally thought...).
My issue is that we are currently not sure in what manner the data which presented in this table was actually normalized.
Please see this link for relevant indications:
Look for explanatory indication such as the following which exactly states what I've described above:
"This command will basically add 3 columns to the end of your original file per pairwise comparison. Most important are the log2 fold change and the p-value and FDR values. The raw read counts will be normalized in this file based on the total reads in each column so that the values are more easily to compare when looking through the results."
However, our attempts to re-run the previous step again - (run analyzeRepeats.pl) - using this time the normalization options in order to obtain the normalized values that appear in the output from getDiffExpression.pl did not prove to be successful...
We tried:
analyzeRepeats.pl rna mm9 -noadj -count genes -strand + -d MIa9_1 MIa9_2 -normMatrix 1e6
And also,
analyzeRepeats.pl rna mm9 -noadj -count genes -strand + -d MIa9_1 MIa9_2 -norm 1e7
but the normalized values that we obtained were quite different from the ones that were obtained after running the getDiffExpression.pl script...
We think that it would become important to know how the data that appears in the output table from getDiffExpression.pl was actually normalized. Was it per all reads? was it per all reads in each individual library? ...
If any of the users have some more information and ideas regarding this matter we'll be glad to hear!
Alternatively, we could ignore the normalized values content of the output table, and refer only to the results of edgeR that show the statistics.
It would be interesting to hear how other users decided to normalize their rna-seq data using this approach. Which selections were finally chosen.
Thanks a lot!
Roy