Hi, everyone. This is my first thread on this forum, so please forgive me if I ask some naive questions.
My question is at the bottom of this thread.
I am currently working on RNA-seq data, using the TopHat + Cufflinks pipeline from this paper. I ran FastQC on my RNA-seq data and have pasted some pictures from the FastQC report here. The experiment is designed to compare differential gene expression and splicing isoforms.
I ran some tests, preprocessing the data separately for each one. (I extracted 500,000 sequences from the forward FASTQ file and 500,000 from the reverse one, so 500,000 × 2 = 1,000,000 reads in total.) For each test, the heading gives the number of sequences after preprocessing and the percentage of them that mapped, followed by the samtools flagstat output.
1. Do nothing (1,000,000 sequences : 82.2%)
822406 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
822406 + 0 mapped (100.00%:-nan%)
822406 + 0 paired in sequencing
413414 + 0 read1
408992 + 0 read2
718716 + 0 properly paired (87.39%:-nan%)
776272 + 0 with itself and mate mapped
46134 + 0 singletons (5.61%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
2. Only Trim first 15 bases (1,000,000 sequences : 82.6%)
826128 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
826128 + 0 mapped (100.00%:-nan%)
826128 + 0 paired in sequencing
414760 + 0 read1
411368 + 0 read2
721394 + 0 properly paired (87.32%:-nan%)
777296 + 0 with itself and mate mapped
48832 + 0 singletons (5.91%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
3. Only Remove adapter (424612+470223=894,835 sequences : 29.6%)
264949 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
264949 + 0 mapped (100.00%:-nan%)
264949 + 0 paired in sequencing
140743 + 0 read1
124206 + 0 read2
42 + 0 properly paired (0.02%:-nan%)
50210 + 0 with itself and mate mapped
214739 + 0 singletons (81.05%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
4. Trim first 15 bases and remove adapter (454648+468550=923,198 sequences : 29.4%)
271326 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
271326 + 0 mapped (100.00%:-nan%)
271326 + 0 paired in sequencing
131398 + 0 read1
139928 + 0 read2
98 + 0 properly paired (0.04%:-nan%)
48556 + 0 with itself and mate mapped
222770 + 0 singletons (82.10%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
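The percentages in the test headings appear to be the flagstat "mapped" count divided by the number of input sequences, not the "100.00%" that flagstat itself prints (presumably because the BAM file contains only aligned reads). The following is a minimal Python sketch of that arithmetic for test 1. The helper names `flagstat_count` and `percent` are made up for illustration, and the flagstat text is abbreviated from the output above.

```python
import re

def flagstat_count(text, label):
    """Return the QC-passed count from the flagstat line whose description starts with `label`."""
    for line in text.splitlines():
        m = re.match(r"\s*(\d+) \+ \d+ (.*)", line)
        if m and m.group(2).startswith(label):
            return int(m.group(1))
    raise KeyError(label)

def percent(part, whole):
    """Express part/whole as a percentage."""
    return 100.0 * part / whole

# Test 1 ("Do nothing"), abbreviated from the flagstat output above.
run1 = """\
822406 + 0 in total (QC-passed reads + QC-failed reads)
822406 + 0 mapped (100.00%:-nan%)
718716 + 0 properly paired (87.39%:-nan%)
46134 + 0 singletons (5.61%:-nan%)
"""
input_reads = 1_000_000  # number of sequences given to the aligner (from FastQC)

mapped = flagstat_count(run1, "mapped")
paired = flagstat_count(run1, "properly paired")
print(f"mapped vs. input reads: {percent(mapped, input_reads):.1f}%")
print(f"properly paired vs. mapped reads: {percent(paired, mapped):.2f}%")
```

Repeating the calculation for the other tests with their own input counts (1,000,000, 894,835 and 923,198) gives the figures quoted in the headings.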
My question is: given overall good quality scores at each position, is it really necessary to remove adapters or to trim the first 15 bases (to deal with the bias caused by random hexamer priming)? The samtools flagstat results show that if I remove adapters, I lose a large amount of data: the mapped fraction drops from 82.2% to 29.6%, and properly paired reads drop from 87.39% to 0.02%. If I trim the first 15 bases, I get a slightly better mapping result (only a 0.4% improvement, from 82.2% to 82.6%), but I lose 15 bases from every read!
A few things to mention here:
1. Adapter removal was performed with fastx_clipper. The adapter sequences were taken from the corresponding entries in the FastQC contaminants.txt. I discarded sequences shorter than 20 bp after clipping.
2. I used fastx_trimmer to trim the first 15 bases of each read.
3. For the last test ("Trim first 15 bases and remove adapter"), trimming was done first and adapter removal second.
4. The number of sequences is taken from the FastQC "basic statistics" table.
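To make the preprocessing order in test 4 concrete, here is a rough Python sketch of what the two steps do to a single read. It is an illustration only, not how fastx_trimmer or fastx_clipper are implemented: the adapter is located by exact string match, the function names are invented, and the adapter sequence is a placeholder rather than the one actually used.

```python
def trim_front(seq, qual, n=15):
    """Drop the first n bases and their quality values from a read."""
    return seq[n:], qual[n:]

def clip_adapter(seq, qual, adapter, min_len=20):
    """Cut the read at the first exact adapter match.

    Returns None when the remaining read is shorter than min_len,
    meaning the read is discarded.
    """
    pos = seq.find(adapter)
    if pos != -1:
        seq, qual = seq[:pos], qual[:pos]
    if len(seq) < min_len:
        return None
    return seq, qual

def preprocess(seq, qual, adapter, n=15, min_len=20):
    """Test 4 order: trim the first n bases, then clip the adapter."""
    seq, qual = trim_front(seq, qual, n)
    return clip_adapter(seq, qual, adapter, min_len)

ADAPTER = "AGATCGGAAGAGC"  # placeholder adapter, an assumption for this example

# A read that keeps 30 bases after trimming and clipping is retained.
long_read = "A" * 15 + "C" * 30 + ADAPTER + "T" * 5
print(preprocess(long_read, "I" * len(long_read), ADAPTER))

# A read that keeps only 10 bases is discarded (None).
short_read = "A" * 15 + "C" * 10 + ADAPTER
print(preprocess(short_read, "I" * len(short_read), ADAPTER))
```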
Regards
Lynn