We have RNA-seq data from an insect that was sequenced in 2012 and assembled at the time with one of the 2011 Trinity versions (which gave us trinity.fasta). I have now analyzed the sequence length distribution in this file and got the following result:
Code:
kurban@kurban-X550VC:~/Downloads/bbmap$ sh stats.sh in=~/Downloads/gene.fa
stats.sh: 52: stats.sh: Bad substitution
stats.sh: 59: stats.sh: [[: not found
stats.sh: 59: stats.sh: [[: not found
stats.sh: 65: stats.sh: source: not found
stats.sh: 66: stats.sh: parseXmx: not found
A       C       G       T       N       IUPAC   Other   GC      GC_stdev
0.2875  0.2118  0.2067  0.2940  0.0000  0.0000  0.0000  0.4186  0.0894

Main genome scaffold total:           144777
Main genome contig total:             144777
Main genome scaffold sequence total:  67.067 MB
Main genome contig sequence total:    67.067 MB    0.000% gap
Main genome scaffold N/L50:           15033/1.075 KB
Main genome contig N/L50:             15033/1.075 KB
Max scaffold length:                  24.081 KB
Max contig length:                    24.081 KB
Number of scaffolds > 50 KB:          0
% main genome in scaffolds > 50 KB:   0.00%

 Minimum          Number          Number           Total           Total  Scaffold
Scaffold              of              of        Scaffold          Contig    Contig
  Length       Scaffolds         Contigs          Length          Length  Coverage
--------  --------------  --------------  --------------  --------------  --------
     All         144,777         144,777      67,066,997      67,066,997   100.00%
     100         144,777         144,777      67,066,997      67,066,997   100.00%
     250          56,929          56,929      53,670,774      53,670,774   100.00%
     500          30,137          30,137      44,518,044      44,518,044   100.00%
    1 KB          16,207          16,207      34,757,505      34,757,505   100.00%
  2.5 KB           4,183           4,183      15,894,549      15,894,549   100.00%
    5 KB             588             588       3,942,668       3,942,668   100.00%
   10 KB              28              28         353,549         353,549   100.00%
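As a cross-check on the stats.sh summary, the same kind of numbers can be recomputed directly from the FASTA file. The following is only a rough sketch and not how stats.sh itself works: the function names `read_fasta_lengths` and `summarize` are made up for illustration. It assumes that each table row counts sequences of at least the given length, and that the second number in the "N/L50" line is the contig length at which half of the assembled bases are reached.

```python
def read_fasta_lengths(path):
    """Yield the length of every sequence in a FASTA file (hypothetical helper)."""
    length = None  # None until the first '>' header has been seen
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if length is not None:
                    yield length
                length = 0
            elif length is not None:
                length += len(line)
    if length is not None:
        yield length


def summarize(lengths, thresholds=(100, 250, 500, 1000, 2500, 5000, 10000, 25000)):
    """Summarize sequence lengths (hypothetical helper, not part of BBMap)."""
    lengths = sorted(lengths, reverse=True)
    total = sum(lengths)

    # Walk from the longest sequence down until half of all bases are covered.
    running = 0
    half_count = 0
    half_length = 0
    for index, n in enumerate(lengths, start=1):
        running += n
        if running * 2 >= total:
            half_count, half_length = index, n
            break

    return {
        "count": len(lengths),
        "total_bp": total,
        "min": lengths[-1] if lengths else 0,
        "max": lengths[0] if lengths else 0,
        "half_count": half_count,    # assumed to match the first number of "N/L50"
        "half_length": half_length,  # assumed to match the second number of "N/L50"
        # Assumption: a "Minimum Scaffold Length" row counts sequences >= threshold.
        "at_least": {t: sum(1 for n in lengths if n >= t) for t in thresholds},
    }
```

For example, `summarize(read_fasta_lengths("Trinity.fasta"))["min"]` would report the shortest sequence in a file of that name, which is the quickest way to confirm the minimum length of an assembly.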
Over the past several days I used the latest Trinity version, trinityrnaseq-2.0.6, to assemble the raw data once again (after trimming the low-quality reads, of course). This time the length distribution of the file is:
Code:
kurban@kurban-X550VC:~/Downloads/bbmap$ sh stats.sh in=~/Desktop/data_from_server/2015_6_04_assembled_CD_and_CK/Trinity.fasta
stats.sh: 52: stats.sh: Bad substitution
stats.sh: 59: stats.sh: [[: not found
stats.sh: 59: stats.sh: [[: not found
stats.sh: 65: stats.sh: source: not found
stats.sh: 66: stats.sh: parseXmx: not found
A       C       G       T       N       IUPAC   Other   GC      GC_stdev
0.2932  0.2083  0.2114  0.2871  0.0000  0.0000  0.0000  0.4197  0.0823

Main genome scaffold total:           56130
Main genome contig total:             56130
Main genome scaffold sequence total:  57.963 MB
Main genome contig sequence total:    57.963 MB    0.000% gap
Main genome scaffold N/L50:           9036/1.861 KB
Main genome contig N/L50:             9036/1.861 KB
Max scaffold length:                  30.733 KB
Max contig length:                    30.733 KB
Number of scaffolds > 50 KB:          0
% main genome in scaffolds > 50 KB:   0.00%

 Minimum          Number          Number           Total           Total  Scaffold
Scaffold              of              of        Scaffold          Contig    Contig
  Length       Scaffolds         Contigs          Length          Length  Coverage
--------  --------------  --------------  --------------  --------------  --------
     All          56,130          56,130      57,962,594      57,962,594   100.00%
     100          56,130          56,130      57,962,594      57,962,594   100.00%
     250          50,921          50,921      56,731,956      56,731,956   100.00%
     500          29,025          29,025      49,248,962      49,248,962   100.00%
    1 KB          18,003          18,003      41,494,038      41,494,038   100.00%
  2.5 KB           5,541           5,541      21,499,015      21,499,015   100.00%
    5 KB             900             900       5,895,754       5,895,754   100.00%
   10 KB              35              35         466,389         466,389   100.00%
   25 KB               1               1          30,733          30,733   100.00%
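To make the difference between the two runs concrete, the "All" and "250" rows of the two stats.sh tables can be subtracted from each other. This is just arithmetic on the numbers already shown, written as a small sketch; the names `OLD_2011`, `NEW_2_0_6` and `short_contigs` are invented for illustration, and it assumes the "250" row counts contigs of at least 250 bp. Under that reading, about 61% of the contigs in the old assembly are shorter than 250 bp but they hold only about 20% of the bases, while in the new assembly the figures are about 9% of the contigs and 2% of the bases.

```python
# Values copied from the "All" and "250" rows of the two stats.sh tables:
# (number of contigs, total contig length in bp).
OLD_2011 = {"all": (144_777, 67_066_997), "ge_250": (56_929, 53_670_774)}
NEW_2_0_6 = {"all": (56_130, 57_962_594), "ge_250": (50_921, 56_731_956)}


def short_contigs(table):
    """Count contigs below 250 bp, assuming the '250' row means length >= 250."""
    n_all, bp_all = table["all"]
    n_long, bp_long = table["ge_250"]
    n_short = n_all - n_long
    bp_short = bp_all - bp_long
    return {
        "contigs": n_short,
        "bases": bp_short,
        "contig_fraction": n_short / n_all,
        "base_fraction": bp_short / bp_all,
    }


for label, table in (("2011 Trinity", OLD_2011), ("Trinity 2.0.6", NEW_2_0_6)):
    s = short_contigs(table)
    print(
        f"{label}: {s['contigs']:,} contigs < 250 bp "
        f"({s['contig_fraction']:.1%} of contigs, {s['base_fraction']:.1%} of bases)"
    )
```

The same subtraction works for any other pair of rows, for example "250" minus "500" for the contigs between 250 and 499 bp.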
My questions are:
1. Why are the two assembly results different? For example, the older version assembled lots of sequences in the length range from 101 to ~200, but the shortest sequence assembled by the latest version of Trinity is 224.
2. Which trinity.fasta file should I use in the downstream analysis, and why?
Could you please give me a reasonably detailed explanation?
Thanks.