
**GenoMax** · 09-22-2014, 01:51 PM

Can you provide a clear definition/limits of what you consider a primary/secondary "hit"? Is the secondary "hit" referring to a case where two proteins share a domain but are otherwise unrelated? A protein like that may have to be considered unique for that organism.

**bastianwur** · 09-22-2014, 11:14 PM

Nope, easier, I mean simply orthologues.
Let's say e.g. (because I currently have that) that all proteins are some sort of kinases, e.g. acetate kinase, and in organism A there had been a duplication event a long time ago, whereas organism B + C only have one of them.
Due to that the acetate kinase A1 has high homology with B, which has high homology with C (due to the lack of any other orthologues in B + C; hypothetical case), which has high homology with A2, but lower homology with A1, due to the hypothetical evolutionary distance of these proteins.

That doesn't make all these proteins 1 entity, so no idea where to put them in the diagramm :/.
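
To make the problem concrete, here is a minimal sketch (not anyone's actual pipeline) that treats each best-hit relationship as a graph edge and groups proteins by connected component. The protein names `A1`, `A2`, `B1`, `C1` and the convention that the first letter names the organism are assumptions for illustration only.

```python
from collections import Counter, defaultdict

def connected_components(edges):
    """Group nodes that are linked, directly or indirectly, by any edge."""
    graph = defaultdict(set)
    for u, v in edges:
        graph[u].add(v)
        graph[v].add(u)
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(graph[node] - comp)
        seen |= comp
        components.append(comp)
    return components

# Hypothetical chain from the post: A1 ~ B1, B1 ~ C1, C1 ~ A2 (no direct A1 ~ A2 link).
edges = [("A1", "B1"), ("B1", "C1"), ("C1", "A2")]
components = connected_components(edges)

# Count proteins per organism in the single resulting group
# (the organism is assumed to be the first letter of the protein name).
per_organism = Counter(protein[0] for protein in components[0])
```

The chain collapses into one group that contains two proteins from organism A, so it cannot be counted as a single entity in any one region of a Venn diagram.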

**Loris** · 11-11-2014, 10:38 AM

I think I've got the same sort of issue. I'm still trying to figure out what I've got, though.

For the case you give, I think it may be reasonable to split it up into a triplet and a singleton - assuming the two proteins in organism A are indeed similar. If the only output from the work is a Venn diagram with counts, then this is straightforward in the sense that no further action need be taken; you don't need to decide which of protein A1 and A2 is the full match and which is 'unique'.

When generating RBHs it seems to be standard to accept any hit scoring over 99% of the bit-score of the highest hit as an RBH. This caters to detecting near-identical duplications, which otherwise might be rejected due to random vagaries in the matching algorithm.
This finds (in my case very slightly) more RBHs in the pairwise comparison than otherwise. However, I haven't yet looked at the effect of propagating all near-best matches to the case of more than two genomes.
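
A minimal sketch of that near-best rule, assuming the hits have already been parsed into dictionaries mapping each query to a list of `(subject, bit_score)` tuples. The function names, the `frac` parameter, and the example scores are all hypothetical, and the comparison uses `>=` so that `frac=1.0` reproduces the strict best-hit behaviour.

```python
def near_best(hits, frac=0.99):
    """Map each query to the subjects scoring at least frac * its top bit-score."""
    result = {}
    for query, scored in hits.items():
        if not scored:
            continue
        top = max(score for _, score in scored)
        result[query] = {subj for subj, score in scored if score >= frac * top}
    return result

def reciprocal_best_hits(hits_ab, hits_ba, frac=0.99):
    """Return (a, b) pairs where each is a near-best hit of the other."""
    best_ab = near_best(hits_ab, frac)
    best_ba = near_best(hits_ba, frac)
    return {
        (a, b)
        for a, subjects in best_ab.items()
        for b in subjects
        if a in best_ba.get(b, set())
    }

# Made-up scores: A1 and A2 are near-identical duplicates that both hit B1.
hits_ab = {"A1": [("B1", 500.0)], "A2": [("B1", 498.0)]}
hits_ba = {"B1": [("A1", 500.0), ("A2", 498.0)]}

relaxed = reciprocal_best_hits(hits_ab, hits_ba)            # 99% rule
strict = reciprocal_best_hits(hits_ab, hits_ba, frac=1.0)   # single best hit only
```

With these made-up scores the relaxed rule keeps both `("A1", "B1")` and `("A2", "B1")`, while the strict rule keeps only `("A1", "B1")`.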

**bastianwur** · 11-12-2014, 12:55 AM

triplet + singleton...well...the problem is then still that the Venn diagram doesn't properly reflect your data. While it makes logical sense, it is not what you have.

The final answer for my data was: If your data doesn't fit your approach, or your approach doesn't fit your data, then one has to change.
A Venn diagram just isn't really suitable.

I was looking around how to make a properly weighted Venn diagramm for >3 organisms (because we had 4 at the end), and it turned out that this is apparently not really possible (at least with nice circles, and there are no packages for it, etc).
In the same thread on Stackoverflow (link) someone mentioned some alternative visualization ways, and we finally went for a weighted network graph, which for the 4 organisms only displays RBHs between all 4 organisms, RBHs between each pair, and separate notes for the amount of unique proteins, avoiding the problems with inconsistent hits, without mis-representing the data. It gives also a nice overview, and if you play a bit with colour gradients, then it also looks really nice (I'd show what my coworker did with it, but I'd fear that someone steals the design).

I doubt though that it's nicely applicable for >4 organisms, because then you'll probably get spaghetti.
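
For anyone wanting to try the same idea, the numbers behind such a graph could be tallied roughly as below. This is a guess at the bookkeeping, not the actual code used for the figure: the input layout, the function name `summarize`, and the example proteins are assumptions, and RBH edges are assumed to link proteins from different organisms only.

```python
from itertools import combinations, product

def summarize(proteins, rbh_edges):
    """proteins: organism -> set of protein ids; rbh_edges: set of (p, q) RBH pairs."""
    organisms = sorted(proteins)
    organism_of = {p: org for org, ids in proteins.items() for p in ids}
    pair_weights = {pair: 0 for pair in combinations(organisms, 2)}
    partners = {p: set() for p in organism_of}
    for p, q in rbh_edges:
        key = tuple(sorted((organism_of[p], organism_of[q])))
        pair_weights[key] += 1          # edge weight between two organisms
        partners[p].add(q)
        partners[q].add(p)
    # Proteins without any RBH partner are reported as unique to their organism.
    unique = {org: sum(1 for p in ids if not partners[p])
              for org, ids in proteins.items()}
    # "Core" groups: one protein per organism, with every pair being an RBH.
    core = 0
    first, rest = organisms[0], organisms[1:]
    for p in proteins[first]:
        candidates = [[q for q in partners[p] if organism_of[q] == org]
                      for org in rest]
        for combo in product(*candidates):
            if all(b in partners[a] for a, b in combinations(combo, 2)):
                core += 1
    return pair_weights, core, unique

# Made-up example with three organisms to keep it short.
proteins = {"A": {"A1", "A2", "A3"}, "B": {"B1", "B2"}, "C": {"C1"}}
rbh_edges = {("A1", "B1"), ("B1", "C1"), ("A1", "C1"), ("A2", "B2")}
pair_weights, core, unique = summarize(proteins, rbh_edges)
```

The pair weights become edge thicknesses, the core count goes in the middle, and the unique counts are the separate notes per organism. Inconsistent chains simply contribute to the pairwise edges without being forced into a single region.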

**Loris** · 11-12-2014, 04:55 AM

Originally posted by bastianwur:

> triplet + singleton...well...the problem is then still that the Venn diagram doesn't properly reflect your data. While it makes logical sense, it is not what you have.

Provided you're willing to accept the 99% of best hit as a RBH (which I already have in the creation of it) and the group fits within that, I think this doesn't damage the data. Whether the data is actually a useful reflection of the situation is rather a different matter - the issue is really that RBH don't capture the duplications in the first place.
On the other hand, in your original post one idea you rejected was to "make it 3 different hits, going in each of the overlaps between 2 genomes". I agree that would be bad, because it would mask the duplication.

> Venn diagramm

I know it's a typo, but I really like this

> ... alternative visualization ways, and we finally went for a weighted network graph, which for the 4 organisms only displays RBHs between all 4 organisms, RBHs between each pair, and separate notes for the amount of unique proteins, avoiding the problems with inconsistent hits, without mis-representing the data.

I like this idea, but I'm having trouble 'seeing' it. I get the RBH pairs between each genome, but you say you are showing proteins in common between all somehow but can't show uniques. Did you mean the opposite way round?

Here is a sketch I made of what I think you are describing. Please forgive the rough nature and general ugliness. Is a pretty version of that what you meant?

I'm now wondering whether it's worth the pain of getting circos to render every RBH. It's kind of what it's for, and doing it would perhaps be interesting.

**bastianwur** · 11-12-2014, 05:17 AM