The problem is that I cannot get a large part of my RNA-seq data aligned!
Why do 8254156 pairs (48.98%) end up in the Bowtie2 category "aligned concordantly 0 times"?
These are samples from the liver, intestine and colon of mice, and we are analyzing the expression levels of their genes. The libraries were sequenced paired-end on a NextSeq 500 (2x75 bp). The BioAnalyzer trace of the library before sequencing is in the attachment.
In the trace, the fragment sizes range from approximately 190 to 650 bp, with an average somewhere between 300 and 350 bp. Of each fragment, 130 bp belong to the upper and lower primers, which will not be sequenced. So with 2x75 bp reads the two mates will overlap in some cases (in the one sample I checked, about 30% of the fragments should give overlapping reads), while most pairs will have a positive inner distance. The following table summarizes this:
| Fragment length (bp) | Primer length (bp) | Insert length (bp) | Sequenced length, 2x75 (bp) | Overlap (bp) | Inner distance (bp) |
|---|---|---|---|---|---|
| 190 | 130 | 60 | 150 | 90 | 0 |
| 300 | 130 | 170 | 150 | 0 | 20 |
| 350 | 130 | 220 | 150 | 0 | 70 |
| 650 | 130 | 520 | 150 | 0 | 370 |
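As a quick check of the numbers in the table, the arithmetic can be sketched in Python. This is only an illustration: the 130 bp primer length and the 75 bp read length come from the description above, while the function name `layout` and the choice to floor both overlap and inner distance at 0 (as the table does) are my assumptions.

```python
# Sketch of the arithmetic behind the table; names are illustrative only.
PRIMER_BP = 130  # primer sequence per fragment, as stated in the post
READ_BP = 75     # read length per mate (2x75)


def layout(fragment_bp, primer_bp=PRIMER_BP, read_bp=READ_BP):
    """Return (insert, overlap, inner_distance) in bp for one fragment size."""
    insert = fragment_bp - primer_bp       # part of the fragment that can be sequenced
    sequenced = 2 * read_bp                # bases covered by both mates together
    overlap = max(0, sequenced - insert)   # mates overlap when the insert is shorter than 150 bp
    inner = max(0, insert - sequenced)     # unsequenced gap between the mates otherwise
    return insert, overlap, inner


rows = {fragment: layout(fragment) for fragment in (190, 300, 350, 650)}
for fragment, (insert, overlap, inner) in rows.items():
    print(fragment, insert, overlap, inner)
```

For a 300 bp fragment this gives an insert of 170 bp and an inner distance of 20 bp, matching the second row of the table.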
Now the problem is choosing values for the inner distance and standard deviation options (-r/--mate-inner-dist and --mate-std-dev) when aligning the data with TopHat2. To get an estimate, I thought I could first align the data with Bowtie2 and derive the inner distance from the result. Unfortunately, about 50% of my read pairs do not align concordantly.
Here is the command I use for that:
Code:
bowtie2 -q --phred33 \
  -D 20 -R 3 -N 1 -L 20 -i S,1,0.50 \
  --n-ceil L,0,0.15 --end-to-end --score-min L,-0.6,-0.6 \
  -I 45 -X 900 \
  -t --met-file bowtie_align_metrix.txt --met-stderr bowtie_stderr.txt \
  --no-unal --al ~/al --un ~/un --un-conc ~/un_conc --al-conc ~/all_conc \
  -p 8 --non-deterministic \
  -x ~/ref-files/mm37 -1 R1.paired.fq -2 R1.unpaired.fq \
  -S result.sam >& bowtie_log_file
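Once result.sam exists, the inner distance could in principle be estimated from the TLEN column (field 9) of the concordant pairs. The following is a minimal sketch, not a tested pipeline: it assumes every mate is exactly 75 bp long (trimmed reads would break that assumption), and the helper names `inner_distances` and `summarize` as well as the demo records are made up for illustration.

```python
import statistics


def inner_distances(sam_lines, read_len=75):
    """Yield the inner distance (TLEN minus both read lengths) once per concordant pair."""
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        tlen = int(fields[8])
        # 0x2 = pair aligned properly; tlen > 0 keeps only the upstream mate,
        # so each pair is counted once.
        if flag & 0x2 and tlen > 0:
            yield tlen - 2 * read_len


def summarize(sam_lines, read_len=75):
    """Return (mean, population standard deviation, number of pairs)."""
    distances = list(inner_distances(sam_lines, read_len))
    return statistics.mean(distances), statistics.pstdev(distances), len(distances)


# Tiny made-up SAM records (not real data): one header line and two pairs.
demo = [
    "@HD\tVN:1.0\tSO:unsorted",
    "r1\t99\tchr1\t100\t42\t75M\t=\t195\t170\t*\t*",
    "r1\t147\tchr1\t195\t42\t75M\t=\t100\t-170\t*\t*",
    "r2\t99\tchr1\t500\t42\t75M\t=\t645\t220\t*\t*",
    "r2\t147\tchr1\t645\t42\t75M\t=\t500\t-220\t*\t*",
]
mean_inner, sd_inner, n_pairs = summarize(demo)
print(mean_inner, sd_inner, n_pairs)
```

For the real file, passing an open file handle for result.sam to `summarize` should work the same way, as long as at least one concordant pair is present (otherwise `statistics.mean` raises an error). The resulting mean and standard deviation are the kind of values the TopHat2 inner distance options expect.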
And here is the alignment summary reported by Bowtie2:
Code:
16852758 reads; of these:
  16852758 (100.00%) were paired; of these:
    8254156 (48.98%) aligned concordantly 0 times
    6476798 (38.43%) aligned concordantly exactly 1 time
    2121804 (12.59%) aligned concordantly >1 times
    ----
    8254156 pairs aligned concordantly 0 times; of these:
      974394 (11.80%) aligned discordantly 1 time
    ----
    7279762 pairs aligned 0 times concordantly or discordantly; of these:
      14559524 mates make up the pairs; of these:
        10751027 (73.84%) aligned 0 times
        3072453 (21.10%) aligned exactly 1 time
        736044 (5.06%) aligned >1 times
68.10% overall alignment rate
Time searching: 05:33:24
Overall time: 05:33:24
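To see how the 68.10% overall rate relates to the 48.98% of pairs without a concordant alignment, the summary can be recomputed by hand. This sketch only uses the counts from the report above; the variable names are mine, and it assumes the overall rate is counted per mate (aligned mates divided by all mates), which is consistent with the reported figure.

```python
# Counts copied from the Bowtie2 summary above.
total_pairs = 16852758
conc_once = 6476798     # pairs aligned concordantly exactly 1 time
conc_multi = 2121804    # pairs aligned concordantly >1 times
disc_once = 974394      # pairs aligned discordantly 1 time
mates_once = 3072453    # single mates aligned exactly 1 time
mates_multi = 736044    # single mates aligned >1 times

# Every aligned pair contributes two mates; the leftover mates count singly.
aligned_mates = 2 * (conc_once + conc_multi + disc_once) + mates_once + mates_multi
overall_rate = 100.0 * aligned_mates / (2 * total_pairs)
concordant_rate = 100.0 * (conc_once + conc_multi) / total_pairs

print(f"{overall_rate:.2f}% overall, {concordant_rate:.2f}% concordant")
```

So the overall rate includes discordant pairs and mates that aligned on their own, whereas only about 51% of the pairs aligned concordantly.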