Dear seqanswers users,
I am working with RNAseq data from a Illumina HiSeq and I like to perform a comparative analysis using trinity embedded perl scripts with edgeR. The RNA was prepped as TrueSeq library (newest version whatever that is right now).
To do that I need a estimation of the abundance of HiSeq reads in comparison to the reference transcriptome (which I created ab-initio using trinity version 2014/07/17). For this approach I use bowtie2 (version 2.2.3
64-bit) for the inital mapping step and samtools (version 0.2.0-rc12-1-gbbe85a9, 64bit self-compiled).
During a follow-up step the script from trinity/util "rsem-run-em" is used which throws the following error:
rsem-run-em: QualDist.h:39: int QualDist::c2q(char): Assertion `c >= 33 && c <= 126' failed.
of course that error infers that some sequences in my 22Gb reads have quality scores which are out of the range of ASCII dec 33 (equalling !) to 126 (equalling ~). By simple grep analysis for non-ASCII characters between 33 to 126 I could not get any lines out of the fastq read files which would contain such "illegal" characters. Form that I infer the problem lies within the bam files I got from bowtie2.
So, I used samtools with the following command to get rid of all lines which would not have acceptable quality-scores for the phred33 quality column in the bam file by using the following command:
samtools view -bq 2 bowtie2.bam > bowtie2.filtered.bam
But still the same error persists.
So to my question: Does anybody have a clue how one could identify the single "wrong" entry in a 2,3Gb bam file to get rid of low quality sequence entries with illegal characters?
Any help is highly appreciated.
Chris
PS: For further information I run a i5 intel machine (64bit) and use ubuntu 14.04 LTS.
I am working with RNAseq data from a Illumina HiSeq and I like to perform a comparative analysis using trinity embedded perl scripts with edgeR. The RNA was prepped as TrueSeq library (newest version whatever that is right now).
To do that I need a estimation of the abundance of HiSeq reads in comparison to the reference transcriptome (which I created ab-initio using trinity version 2014/07/17). For this approach I use bowtie2 (version 2.2.3
64-bit) for the inital mapping step and samtools (version 0.2.0-rc12-1-gbbe85a9, 64bit self-compiled).
During a follow-up step the script from trinity/util "rsem-run-em" is used which throws the following error:
rsem-run-em: QualDist.h:39: int QualDist::c2q(char): Assertion `c >= 33 && c <= 126' failed.
of course that error infers that some sequences in my 22Gb reads have quality scores which are out of the range of ASCII dec 33 (equalling !) to 126 (equalling ~). By simple grep analysis for non-ASCII characters between 33 to 126 I could not get any lines out of the fastq read files which would contain such "illegal" characters. Form that I infer the problem lies within the bam files I got from bowtie2.
So, I used samtools with the following command to get rid of all lines which would not have acceptable quality-scores for the phred33 quality column in the bam file by using the following command:
samtools view -bq 2 bowtie2.bam > bowtie2.filtered.bam
But still the same error persists.
So to my question: Does anybody have a clue how one could identify the single "wrong" entry in a 2,3Gb bam file to get rid of low quality sequence entries with illegal characters?
Any help is highly appreciated.
Chris
PS: For further information I run a i5 intel machine (64bit) and use ubuntu 14.04 LTS.
Comment