6

I have a set of data carrier objects represented using vertices in a strict tree hierarchy. I also have the following requirements:

  • some vertices should be "linked" to other vertices directly (think undirected edge)
  • the "links" themselves should be completely separate from the tree hierarchy
  • "links" can be either 1-1 or 1-many
  • each link has a "source" and a "target". 1-1 relationships have a single target, while 1-many have a list of targets.
  • a "source" vertex must be able to reach the linked "target" vertex (by querying the link), and vice versa
  • the setup needs to work in a multithreaded environment where different threads can work on different parts of the tree
  • threads can freely add/modify/delete these "links". The 1-1 links only permit changing of targets, but the 1-many links allow addition/removal of specific targets. I've been assured that such ops will be atomic so I'm not worried about race conditions.
  • I am required to ensure that link addition/modification/removals are resolved (e.g. if a "target" in a 1-many link is removed, the affected "target" should not have a "link" back to the "source", but the other links should remain unaffected )
  • the overall structure will undergo serialization/deserialization. As such, the tree hierarchy as well as "link" connectivity must either be maintained (via state) or inferrable (via graph traversal or some other means). I am only looking at the "link" part of the equation here.
  • both the tree itself, and the number of "links" could become quite large (around 100,000 or more vertices)

Here's an image showing what an example might look like: Example of tree hierarchy with proposed links

I thought up of three different ways to do this:

  1. have "links" be separate data carrier objects that hold "source" and "targets". This would require a separate graph structure of sorts for quick lookup
  2. "links" only store target references and get stored on the "source" node directly. Traversal from the "target" to "source" nodes is handled by a separate helper class (so reverse connectivity is inferred)
  3. "links" only store target references and get stored on the "source" as well as the "target" nodes. So the source node stores a reference to a target node, and the target node stores a reference to the source node.

Approach 1 is a non-starter for several reasons I can't share here. That leaves me with approaches 2 and 3. I've drawn up an image below showing how they might look like:

approach 2 and approach 3

I'm trying to suss out which of these approaches might be better overall. Some of my thoughts:

  • Approach 3 is good for speed as connectivity is stored in state. But this approach also increases the amount of state in the system which isn't good given it's multithreaded.
  • Approach 2 is better from a state pov (less state overall), but it shifts complexity to the helper method/class. The "reverse" link of approach 3 still needs to be inferred if not explicitly stored.
  • In both approaches I need to resolve link addition/modification/removals. Since both approaches require me to infer the "inverse" relationship, it's basically the same level of complexity (just handled in different places).

My questions (finally) are:

  • which of the two would be the better approach given my constraints ?
  • what would be the issues (if any)
  • is there a better approach than the 3 discussed here ?

Edit :

Re these points:

  • threads can freely add/modify/delete these "links". The 1-1 links only permit changing of targets, but the 1-many links allow addition/removal of specific targets. I've been assured that such ops will be atomic so I'm not worried about race conditions.
  • I am required to ensure that link addition/modification/removals are resolved (e.g. if a "target" in a 1-many link is removed, the affected "target" should not have a "link" back to the "source", but the other links should remain unaffected )

From what I've been told - before any mutative op (including "link" ops), each thread will bring relevant vertices into its scope and those parts of the tree that are within the thread's scope will be locked. So if a thread wishes to add a link between A and B, both of them would be brought into scope and locked before adding said link.

Also, I forgot to mention this - I'm allowed to traverse both the tree and the links for the purposes of link generation/mutation. Approach 2 did hint at this, but I should have made it explicit in the OP.

Finally, re the "helper" in approach 2 - My thinking was to have it internally use a threadsafe hashmap and maintain mappings between vertices and their backreferenced targets.

14
  • 1
    Can two nodes be linked by multiple distinct links at the same time? Commented Aug 8 at 9:10
  • 1
    Given that you couldn't give all of the rationale here, I think its highly likely there's not an answer we can give. I think the answer for you will involve really understanding how the algorithms are expected to operate on the tree and links. In my head, when I read this, I think of semantic graphs like in RDF, which keep track of links in a triple-store, a naturally serializable structure. But there's no guarantees that your particular algorithms fit well with that construct. Commented Aug 8 at 22:21
  • 1
    @JonasH no, any two nodes can only ever have one link between them Commented Aug 9 at 12:15
  • 2
    @Andorrax You've mentioned an atomicity requirement. A single (or pair) of independently concurrent hashmaps will need further synchrsation. To make sure other threads can never find A->B without (reverse) B->A (or vice versa). Gettting very technical if the number of links becomes large (relative to the platform) you may also need to reduce lock granularity. Commented Aug 9 at 12:45
  • 2
    Could you please elaborate on your dismissal of approach 1? This is the common solution for bidirectional links, and without elaboration as to why you think approach 1 is not viable, I cannot confirm that it actually doesn't work for you or if this is an XY problem that leads you to wrongly discount it as a solution. Commented Aug 11 at 5:07

6 Answers 6

7

None of the 3 approaches has a real advantage over the others - given you manage it to implement changes to the graph and links in an atomic and consistent (a.k.a. transactional) manner.

In the in-memory representation of your graph, for performant access, you will need explicit references and backreferences somewhere, and they must be consistent to each other (except for the short time while these links are manipulated, probably in some locked code section). This problem remains the same, regardless whether you

  • explicitly store references plus backreferences inside the tree nodes (approach 3),

  • store references inside the tree nodes, but backreferences outside (approach 2)

  • or store the both outside (approach 1)

In the serialized representation it is probably best not to store redundant backlinks, but I guess you will infer the backlinks of the in-memory representation during deserialization.

Of course, I am talking here about the conceptional level. When it comes to implementation details, the 3 approaches may show different performance characteristics. Still I cannot tell you which one will work "best" for your use case`s specific access patterns, that is something you will only find out by yourself by profiling your code. There may be also differences in regards to which graph library or framework you intend to use, in case that's your plan. But this is something you will also have to try out by yourself - there is no "best" solution for this.

3
  • 1
    "the 3 approaches may show different performance characteristics...this is something you will also have to try out by yourself - there is no "best" solution for this" - the more I think on this problem, the more I find myself agreeing with you. I suppose I won't really know for sure until I put in a test implementation and profile the heck out of the two options. Commented Aug 9 at 12:20
  • I've also added some clarifying details in an edit - not sure if they help swing things one way or the other but would be keen to have your thoughts it they do ! Commented Aug 9 at 12:22
  • @Andorrax: it seems the additional information you added just confirms my answer, so I think there is not much to add. Commented Aug 13 at 12:37
2
  • threads can freely add/modify/delete these "links". The 1-1 links only permit changing of targets, but the 1-many links allow addition/removal of specific targets. I've been assured that such ops will be atomic so I'm not worried about race conditions.
  • I am required to ensure that link addition/modification/removals are resolved (e.g. if a "target" in a 1-many link is removed, the affected "target" should not have a "link" back to the "source", but the other links should remain unaffected )

These two requirements can easily come into conflict, so you have to be careful here.

Specifically, it's not enough that a single operation against a single list is atomic. You need to make sure the entire state change is atomic.

Think about what needs to happen when you remove a node. This removal must be an atomic operation that leaves the tree in a consistent state afterwards. But this consistent state must not have any dangling links, which necessitates updating any links that may refer to the removed node within the single atomic operation.

Effectively, this means you need some form of lock to prevent simultaneous reading and updating of the tree. An entire node removal operation needs to be done within a lock.

Going back to your options, consider what needs to be done:

  1. <N/A>
  2. A node removal needs to scan every other node to detect any possible links to the removed node. This can potentially touch any node, therefore the entire tree needs to be locked for the (potentially lengthy) scan. This also means reads must also take a whole-tree lock, though if you can use a reader-writer lock you can mitigate this somewhat. As the updated question clarifies that this stores the backreference in the separate helper object, scanning would not be necessary. It is still important that you lock around the appropriate objects to ensure there cannot be a read in the middle of update: you may need to ensure both that a read cannot find a dangling reference, and that the reference is not removed before the node so that the node appears unlinked and present. In other words, you can only safely remove the node and references after obtaining locks on both the node itself and any references to it.
  3. In this option you already know which other nodes will be impacted by the removal of this node. Therefore you can potentially lock just these affected nodes and update them to remove the links, without needing to lock the entire tree. Tree operations that do not need to touch these affected nodes may proceed concurrently. Again a reader-writer lock may be beneficial if reads occur significantly more often than updates/removals.

So if:

  • You have many(!) threads concurrently accessing the tree
  • Removal is a frequent operation
  • Multi-threaded performance is important

Then I think you necessarily must use option 3. Using option 2 would require a global tree lock, and could significantly limit concurrency in a worst-case frequent-remove scenario.

Then you should consider how you can accomplish fine-grained locking so only the affected structures are locked, but also ensure that all affected structures are locked before any updates are made.

There are also lock-free ways to do this that may be better for performance, but they will have different constraints. To be lock-free, you will likely need to accept/handle in other code that dangling references may happen, and ignore them at read time rather than trying to guarantee consistency at write time.

4
  • 3
    My understanding of the question is not that the backreferences have always to be inferred in option 2 - my understanding is that the backreferences are stored in the helper class (or helper object) mentioned by the OP, just outside the nodes. Hence your assessment for #2 "node removal needs to scan every other node to detect any possible links to the removed node" does not apply, the affected nodes can be determined with the same effort as in option #3. Commented Aug 9 at 9:28
  • @Bob basically what Doc brown said - the backreferences would be stored outside of the tree (I've added details in an edit to the OP). But I do appreciate the response ! Commented Aug 9 at 12:24
  • 2
    @Andorrax Ah sorry, I misunderstood your proposal. I will say that you should still be careful since you mention a threadsafe hashmap: that only guarantees internal consistency of the hashmap but you still need to make sure you have appropriate locking to ensure consistency of the tree. Commented Aug 9 at 13:10
  • @Bob that's a very good point. Thinking it through - there are also more complex thread interleavings to having an external map since the map's lock acquisition will have permutations with the graph vertices' locks... which may or may not cause consistency issues. I'll think on this a bit more. Thanks for pointing this out ! Commented Aug 9 at 14:14
2

NB: You say 'think undirected edge' but then refer to source and target which appears to be a directed relationship. I see there are reverse (target to source) accesses (the helper class) but an undirected graph is one where when ever A->B exists B->A exists. However your terminology suggests the if B is a Target of a Source A it's not in general the case that A is automatically a Target for B as a Source. The symmetry of undirected graphs doesn't appear to be present.

That may or may not be relevant.

You're assured 'all operations will be atomic' and 'reverse connectivity will be inferred'. I think this must be a theoretical exercise because I'd say achieving those two requirements cannot normally be easily separated from the question you've asked in a performant way and certainly not easily in a concurrent environment.

The (mysterious) helper class handling reverse-connectivity implies you don't need to implement Approach 3 because reverse-connectivity is specifically outside the you scope of your requirements (a 'you don't need to worry about that'). If you do need to worry about that you may be being asked to resolve a Dining Philosopher's Problem if that isn't covered by the 'all operations are atomic' provision.

At it's simplest all you need is a Link structure and a collection of them (pseudo-code).

Link {
   SourceNodeID NodeID;
   TargetNodeID NodeID;
};

Links Array<Link>;

NB: This is a question in Software Engineering and no implementation language is offered or assumed. However it's clear from the question and logically necessary that there's a Node Reference Type (NodeID). In C or C++ that might be a pointer, in Java an Object Reference, and in a Database it might be surrogate key. How it's implemented is unspecified.

As per the requirements, these links are held outside the tree structure accessing one (or all) the links associated with a given Source will require a linear search of the Links array or further cleverness if that won't show adequate performance.

What I think you're being led to may be an Associative Array which is indexed (keyed) by the Source NodeID and holding an associated target NodeID (in the 1-1 model) or further collection (possibly an Array) of Target NodeIDs.

The common go-to Associative Arrays are a Hash Map or Binary Search Tree. To implement either of those we need to assume (respectively) there's a suitable Hash Function for NodeID or that NodeIDs can be absolutely ordered (at least within a given Tree).

Unless you need to store additional Data with the Source to Target Links an Associative Array Implementation makes the Link type (above) redundant.

I think the question as asked is only part of a bigger challenge for which a suitable solution requires a more wholistic design. But given the 'let offs' (everything's atomic, there's a reverse reference helper-class) an Associative Array fits the limited bill.

Footnote: There's a possible edge case that may be relevant. If it's possible that a Node have a reflexive Source-Target relation (i.e. A->A for some Node A) a lock based implementation may need to handle 'self deadlock'. Unless a Lock is recursive or otherwise specified it's common that a single thread attempting to lock a Lock it already holds causes the thread to dead-lock itself. The thread will wait indefinitely for the Lock to be released that it already holds. The same situation may occur (be more likely) if lock granularity is not one-for-one with Nodes/Links. Depending on the platform and context, the suggested scale of around 100,000 nodes and or links may mean that even 100,000 Locks is impossible or an impractical drain on resouces.

1

Frankly i've always found the best way to store/represent this type of hierarchical data is to simply use two lists

Node
{
  Id
  //Other properties
}

Link
{
   Id
   SourceNodeId
   TargetNodeId
   //Other properties
}

Graph
{
   Nodes[]
   Links[]
}

In your case you have the two types of relation so..

Graph
{
   Nodes[]
   ParentChildLinks[]
   Links[]
   //any other relationships that come up
}

For speed of access you should use HashMap/Dictionary(s) for the nodes and links so you can look up the links for a source or target by nodeId.

Of course the lookup will be slower than having the a list of object references on the Node, but you code will greatly be simplified in not having to traverse or pass around a huge object graph. Or hold onto references to objects rather than just their Ids.

eg. Display Node 123's name

graph.Nodes["123"].Name

vs

getById(n, id) {
  while(n.id != id)
  {
    forach(var child in n.Links) {result.add(recurse(child)
  }
  ///..etc
}
getById(rootNode, "123").Name

eg. Get all the Nodes with no links

graph.Nodes.Where(n => graph.Links.Where(l => l.sourceId == n.Id).Count() == 0);
  • Adding and removing or editing links is naturally atomic
  • You can display the Nodes without the links
  • You can different types of links without changing the node
  • You can convert to a tree structure easily if required.
  • Orphans, duplicates and islands are easily dealt with
2
  • 1
    Adding a new link requires to add a link object to Graph.Links, to add an entry to the Dictionary which maps SourceNodeId to the list of related links and to add an entry to the Dictionary which maps TargetNodeId to the list of related links. I fail to see why this is "naturally atomic". Commented Aug 8 at 22:59
  • true, if you add both maps. Commented Aug 9 at 10:56
1

In short

Option 3 is the way to go, as you have already eliminated option 1.

More explanations

The links are undirected. I therefore assume that you must be able to efficiently navigate in both directions:

some vertices should be "linked" to other vertices directly (think undirected edge)

Considering that you have already eliminated option 1 for undisclosed reasons, the following requirement is key:

both the tree itself, and the number of "links" could become quite large (around 100,000 or more vertices)

Indeed, with option 2 you cannot easily navigate from target to source, since the reverse "connectivity is inferred" by the helper class. This would mean that the helper class would have to go through a large number of source nodes to verify the targets. Maybe there are some ways to prune the potential sources that you didn't tell. But with a large graph, this seems a terrible waste of computing capacity, especially if you have to done this inference very often, e.g. if you use some graph traversal algorithms.

With option 3, you can easily and efficiently navigate from target to source. If you need to get link properties for the traversal (e.g. cost, speed or other attributes), all you'd need to do is to search the in the list of links of the source node for the best link to the target. So it's a fraction of the computing time that is used.

5
  • 1
    thanks for the reply ! With option 2, I don't think I'd need to do a traversal with each backref query if I simply store the backrefs in a map (I've added more details to the edit to my post in the OP). There would be some compute effort during first generation of this map (particularly during/after deserialization) but any subsequent ops could probably be fairly similarly performant between approaches 2 and 3 (though the additional cost of querying a threadsafe map in approach 2 might add up...) Commented Aug 9 at 12:38
  • @Andorrax I see. So your helper class constructs the back refs. This sounds like storing the back refs just somewhere else, and looks like a slight variant of 1, i.e. having a special structure for the links. But if you do so, how will your helper class update (efficiently) the map when the links change (or even, how will it know that the links changed)? Commented Aug 9 at 12:58
  • yep it's basically just offloading the backrefs to a different place. Re updating - the state stored in vertices is the source of truth for links. Since there can only ever be 1 "link" between any two vertices (i.e. one pair of ref + backref ... I forgot to mention this in the original OP) I can use the ref stored on the vertex to initiate mutations. Backref mutations follow from vertex mutations and should be consistent as they are performed by the same thread (though I'm rethinking this bit due to a point raised by Bob above) Commented Aug 9 at 14:21
  • Let me take another angle. The helper class is not linked to the graph, so you would need to check for changes, and update or reconstruct the helper class. The option 3 is a little bit like spreading the helper class in the graph itself. The advantage is that you can easily mutate the backlinks/helper pieces together with the graph. Updates will get smoother and be handled right at the source. What I suggest is that you compare the pros and cons of the helper class with option 3. Maybe there are other things that you didn't mention that speaks in its favour. But otherwise I'd go for 3. Commented Aug 9 at 14:41
  • Also, the helper class looks pretty much like the option 1 with the difference that the links themselves are stored in the nodes and the separate structure is an accelerator to find the links. So also cross-check if the reasons you eliminated 1 are not also applying to your helper class approach. Commented Aug 9 at 14:43
0

Approach 2 seem much simpler and safer, since you don't have to keep multiple (redundant) mutable data structures in sync, which is difficult problem in a multi-threaded environments.

Sure, approach 2 would require more work to find links which target a given node, but it is still much simpler than supporting transactional updates of multiple nodes. For example you have to take deadlocks into consideration: If one thread wants to add a link from A to B and another wants to add a link from B to A, they might each lock the source node and then wait indefinitely for the target node to become unlocked. Sure, this is a solvable problem, but still much more complex than just avoiding the issue altogether.

If you are in doubt, chose the simplest solution.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.