Thank you for reading this question. In general I understand how Velvet works, but I cannot explain a roughly 9-fold decrease in N50 (1948774 to 225231) after adding more reads to the dataset.
DETAILS
MiSeq v3, ~300 bp reads, mate-pair libraries with 3-12 kb inserts.
(About 5% of the reads come from a paired-end library with 800 bp inserts, which was used in both assemblies.)
Assembly 1.
Nextclip -> A only files (Junction Adapter in both reads) -> RevCompl -> Velvet k=91
results Assembly 1
Estimated Coverage = 36.798895
Pre-graph has 623415 nodes and 21026066 sequences 53626987 kmers found
Final graph has 3170 nodes and n50 of 1948774, max 3890197, total 35875507, using 14957532/21026066 reads
Assembly 2.
Nextclip -> A, B (JA in read 2), C (JA in read 1), E (JA in both, with relaxed conditions) -> join A, B, C, E by "cat" -> RevCompl -> Velvet k=91
results Assembly 2
Pre-graph has 1937511 nodes and 31230404 sequences 110486096 kmers found
Estimated Coverage = 44.494662
Final graph has 8571 nodes and n50 of 225231, max 909759, total 35949035, using 21595103/31230404 reads
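To put the two runs side by side, here is a minimal Python sketch that parses the two "Final graph" lines quoted above and computes the N50 fold change and the fraction of reads Velvet actually used. The function name `parse_final_graph` and the regular expression are my own, written only to match the log lines as they appear in this post; other Velvet versions may format the line differently.

```python
import re

# Pattern written against the "Final graph" lines quoted in this post (an assumption,
# not an official Velvet log specification).
FINAL_GRAPH = re.compile(
    r"Final graph has (\d+) nodes and n50 of (\d+), max (\d+), "
    r"total (\d+), using (\d+)/(\d+) reads"
)

def parse_final_graph(line):
    """Return the numbers from a Velvet 'Final graph' log line as a dict of ints."""
    m = FINAL_GRAPH.search(line)
    if m is None:
        raise ValueError("not a Velvet 'Final graph' line")
    keys = ("nodes", "n50", "max", "total", "reads_used", "reads_total")
    return dict(zip(keys, map(int, m.groups())))

a1 = parse_final_graph(
    "Final graph has 3170 nodes and n50 of 1948774, max 3890197, "
    "total 35875507, using 14957532/21026066 reads"
)
a2 = parse_final_graph(
    "Final graph has 8571 nodes and n50 of 225231, max 909759, "
    "total 35949035, using 21595103/31230404 reads"
)

# How much the N50 dropped between Assembly 1 and Assembly 2.
n50_fold = a1["n50"] / a2["n50"]

# Fraction of input reads that ended up in each final graph.
used1 = a1["reads_used"] / a1["reads_total"]
used2 = a2["reads_used"] / a2["reads_total"]

print(f"N50 fold decrease: {n50_fold:.2f}")
print(f"Reads used, Assembly 1: {used1:.1%}")
print(f"Reads used, Assembly 2: {used2:.1%}")
print(f"Total assembled length change: {a2['total'] - a1['total']} bp")
```

The total assembled length is almost the same in both runs while the node count more than doubles, so the extra reads seem to fragment the graph rather than add new sequence.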
PS I've checked that no JA sequences are left in the final assemblies.
PPS My current guess: adding many low-quality reads complicates the graph, so I am now experimenting with read filtering (using Trimmomatic).