I'm analyzing my first set of RNA-seq data, ~33.7 million 100bp PE reads.
Here's the quality from FastQC: reads1 reads2. I haven't done any trimming for adapters or low-quality sequence; I was just letting Tophat deal with that, although I'll admit I don't know exactly how Tophat deals with it. I guess adapter sequences and sequences below a certain quality threshold just wouldn't align? From FastQC and from looking at some of the quality scores, the data is Illumina 1.5, although I see lots of "i" characters, so I guess it has scores up to 41. I used the "--solexa1.3-quals" option during alignment.
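To double-check the encoding guess, here is a minimal sketch of how I reason about it. The helper name and the example quality strings are my own (hypothetical, not from FastQC or Tophat), and it assumes scores never go above 41:

```python
def guess_phred_offset(quality_strings):
    """Guess the ASCII offset (33 or 64) from the observed character range.

    Assumes scores never exceed 41, so Phred+33 data stays at or below 'J'.
    Returns None when the observed range fits both encodings.
    """
    codes = [ord(ch) for qual in quality_strings for ch in qual]
    lowest, highest = min(codes), max(codes)
    if lowest < 59:    # below ';' is impossible with an offset of 64
        return 33
    if highest > 74:   # above 'J' would mean a score over 41 with offset 33
        return 64
    return None


# Made-up quality strings for illustration, not taken from my reads.
example = ["BBBcdefghi", "ghiihgfeBB"]
offset = guess_phred_offset(example)
# With an offset of 64, the character 'i' decodes to ord('i') - 64.
print(offset, ord("i") - offset)
```

With an offset of 64, "i" (ASCII 105) decodes to a score of 41, which is why I think the scores go up to 41.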
I was thinking of using cutadapt to trim both adapters and low-quality sequence, and maybe also the first ~10bp. Would this be a good idea?
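For reference, this is roughly what I understand 3' quality trimming to do. It is only a sketch of a BWA-style running-sum rule, which I believe is what cutadapt's quality cutoff is based on; the function name and the example qualities are my own assumptions, not cutadapt code:

```python
def quality_trim_3prime(quals, cutoff):
    """Return the number of bases to keep after 3' quality trimming.

    Walks from the 3' end, summing (quality - cutoff). The read is cut at
    the position where that running sum is lowest, and the scan stops once
    the sum turns positive.
    """
    running = 0
    lowest = 0
    keep = len(quals)
    for i in range(len(quals) - 1, -1, -1):
        running += quals[i] - cutoff
        if running > 0:
            break
        if running < lowest:
            lowest = running
            keep = i
    return keep


# Illustrative Phred scores (already decoded), not taken from my data.
quals = [42, 40, 26, 27, 8, 7, 11, 4, 2, 3]
keep = quality_trim_3prime(quals, cutoff=10)
print(quals[:keep])
```

Note that a single decent base inside a bad tail (the 11 here) does not stop the trimming, because the running sum is still negative at that point.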
Without any trimming I aligned with Tophat to the mouse genome. If I use the instructions here to determine the number of reads in the original fastq file (33716355) and the number of unique reads in the accepted_hits.bam file (21284339), I get 63.1% aligned.
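The arithmetic behind that figure, as a quick sketch (the variable names are mine; the two counts are the ones quoted above):

```python
total_reads = 33716355      # reads in the original fastq file
unique_aligned = 21284339   # unique reads in accepted_hits.bam

# Fraction of input reads that appear at least once in the BAM file.
percent_aligned = 100 * unique_aligned / total_reads
print(f"{percent_aligned:.1f}% aligned")
```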
Do these percent alignments seem low?
I also get tons of "malformed closure" and "multiple closures" warnings in the long_spanning_reads.log file, about 225,000 of them. What do these warnings mean?
Any advice is appreciated, thank you!
Here are the percent alignments from the Tophat log files:
Code:
left_kept_reads.fixmap = 47.18%
left_kept_reads_seg1.fixmap = 25.65%
left_kept_reads_seg2.fixmap = 26.01%
left_kept_reads_seg3.fixmap = 25.57%
left_kept_reads_seg4.fixmap = 14.40%
right_kept_reads.fixmap = 41.37%
right_kept_reads_seg1.fixmap = 22.03%
right_kept_reads_seg2.fixmap = 23.46%
right_kept_reads_seg3.fixmap = 25.51%
right_kept_reads_seg4.fixmap = 15.52%
Here also is the output from samtools flagstat, in case it helps:
Code:
44245668 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
44245668 + 0 mapped (100.00%:-nan%)
44245668 + 0 paired in sequencing
23568872 + 0 read1
20676796 + 0 read2
31410058 + 0 properly paired (70.99%:-nan%)
32757724 + 0 with itself and mate mapped
11487944 + 0 singletons (25.96%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
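As a sanity check on these numbers, here is a small sketch that parses the flagstat lines and recomputes the percentages. The parsing helper is my own (hypothetical, not part of samtools), and it only reads the QC-passed column:

```python
import re

# Lines copied from the flagstat output above.
flagstat_text = """\
44245668 + 0 mapped (100.00%:-nan%)
44245668 + 0 paired in sequencing
23568872 + 0 read1
20676796 + 0 read2
31410058 + 0 properly paired (70.99%:-nan%)
32757724 + 0 with itself and mate mapped
11487944 + 0 singletons (25.96%:-nan%)
"""


def parse_flagstat(text):
    """Map each flagstat label to its QC-passed count."""
    counts = {}
    for line in text.splitlines():
        match = re.match(r"(\d+) \+ (\d+) (.+?)(?: \(.*\))?$", line)
        if match:
            counts[match.group(3)] = int(match.group(1))
    return counts


counts = parse_flagstat(flagstat_text)
paired = counts["paired in sequencing"]

# read1 + read2 and (both mapped + singletons) should each add up to the total.
read_sum = counts["read1"] + counts["read2"]
mate_sum = counts["with itself and mate mapped"] + counts["singletons"]
properly_pct = 100 * counts["properly paired"] / paired
singleton_pct = 100 * counts["singletons"] / paired

print(read_sum == paired, mate_sum == paired)
print(f"properly paired: {properly_pct:.2f}%  singletons: {singleton_pct:.2f}%")
```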