This is a different issue than: http://seqanswers.com/forums/showthread.php?t=47604
FastQC shows all bases are > Q30 on both forward and reverse fastq files. (See attached)
I have some older 50 bp PE human data from a collaborator. They used Clontech's SMARTer primers for polyA enrichment, and the data contain a lot of contaminating sequences: Illumina adapters, Clontech oligo-dT primers, and polyA and polyT reads (all 50 bases are As or Ts). After using Trimmomatic to remove all 4 types of contaminants, only about half of my read pairs remain, while ~40% of reads are now unpaired.
Note: No quality trimming was done. However, prior to trimming, I removed rRNA sequences using SortMeRNA (~2-3% of reads removed).
Here are the results:
TrimmomaticPE: Started with arguments: -threads 12 -phred33 <2 inputs> <4 outputs>
ILLUMINACLIP:/usr/local/bioinf/trimmomatic/adapters/custom_a.fa:2:30:10 MINLEN:36
Input Read Pairs: 55,723,516
Both Surviving: 29,553,189 (53.04%)
Forward Only Surviving: 19,691,236 (35.34%)
Reverse Only Surviving: 2,350,462 (4.22%)
Dropped: 4,128,629 (7.41%)
It seems strange that the reverse-only surviving reads are so few compared to the forward-only surviving reads... If I remove only the Clontech primers and adapters, and leave the polyA and polyT sequences out of the specified custom.fa for ILLUMINACLIP:
Input Read Pairs: 55,723,516
Both Surviving: 29,583,888 (53.09%)
Forward Only Surviving: 22,506,092 (40.39%)
Reverse Only Surviving: 3,170,437 (5.69%)
Dropped: 463,099 (0.83%)
Fewer reads are dropped and more reads end up in the forward-only surviving category, but the number of surviving pairs remains almost the same.
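To make the comparison concrete, here is a quick Python sketch that diffs the two Trimmomatic summaries. The counts are copied from the logs above; the dictionary names and the `shift` helper are just my own labels for illustration, not anything Trimmomatic provides.

```python
# Counts copied from the two Trimmomatic summaries in this post.
TOTAL_PAIRS = 55_723_516

# Run 1: custom adapter file including the polyA/polyT sequences.
with_poly = {
    "both": 29_553_189,
    "forward_only": 19_691_236,
    "reverse_only": 2_350_462,
    "dropped": 4_128_629,
}

# Run 2: same file without the polyA/polyT sequences.
without_poly = {
    "both": 29_583_888,
    "forward_only": 22_506_092,
    "reverse_only": 3_170_437,
    "dropped": 463_099,
}


def shift(before, after):
    """Per-category change in read-pair counts between two runs."""
    return {key: after[key] - before[key] for key in before}


# Sanity check: each summary should account for every input pair.
assert sum(with_poly.values()) == TOTAL_PAIRS
assert sum(without_poly.values()) == TOTAL_PAIRS

delta = shift(with_poly, without_poly)
for category, change in delta.items():
    pct = 100 * change / TOTAL_PAIRS
    print(f"{category:>13}: {change:+,} pairs ({pct:+.2f}% of input)")
```

Leaving out polyA/polyT rescues 3,665,530 pairs from "Dropped", but only 30,699 of them land in "Both Surviving"; the rest move into the two unpaired categories, mostly forward-only.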
What could lead to so many unpaired reads, and mostly on the forward read? I'd hate to toss out 35%-40% of my data if I can help it. If I have to, is there any way I could incorporate these unpaired reads into my downstream analysis? Or is there something seriously wrong with my pipeline? Thanks in advance.