I started from @vindobona's answer and tried improving on it by leaning on some
matrix ops and mathematica's inbuilt graph programming algorithms to speed
things up.
This serves double benefit as these operations are accelerated not only with
SIMD but also utilise all available threads even when only using a single
kernel (which is massively beneficial for me as I only have a non-commercial
license and am limited to 4 kernels). IDK if slapping a Parallelize[] on this
would improve things by running on multiple machines but these improvements
have gotten me pretty acceptable speeds for pruning the graph.
For reference I am running on an AMD Ryzen 9 5950X (16 Core, 32 Thread w/ AVX2)
and 128GB of RAM (however Mathematica uses comparatively little RAM here).
Explanation
I'm taking roughly the same strategy as @vindobona and am condension all
straight-line paths inside the graph down into single edges but as an addition
I'm encoding the "length" of the edge into the graph's edge weights.
I started out by topologically sorting the graph prior to converting to indices
so that the adjacency matrix is triangular and the index of vertices along the
directed edges are always monotonically increasing.
From there we can get the adjacency matrix and build out boolean masks of the
vertex/index list that categorise vertices:
- root vertex: 0 incoming edges.
- terminator vertex: 0 outgoing edges.
- inline vertex: exactly 1 incoming edge, 1 outgoing edge.
- fork vertex: 2 or more incoming or outgoing edges.
We can then use these masks to find any vertex that could be the start point
for a path, the end point for a path, and the vertices that are not inline (and
therefore will be considered part of the final graph). And we can then take
those masks and use PositionIndex to convert these masks into lists. On a
normal array this is O(n) but luckily with sparse arrays this is only
O(density*n).
From there we can mask out the adjacency matrix and get the first edges on any
path in the graph as well as the last edges on any path in the graph.
And we can further narrow this down by separating out the edges that are both
first and last edges (as they don't either start on or end on an inline
vertex). We call these the "continuing" and "concluding" edges/segments.
Now we can construct a graph of all the inline paths without any of the forks
connecting them and then immediately extract a list containing the sets of
vertices in each path subgraph using WeaklyConnectedComponents[] which can be
performed in O(n log n) or potentially even O(log2 n) depending on the
algorithm.
A number of these paths will be single points with no edges so we can partition
the list to separate these out however while writing this up I've come to the
conclusion that separating these out is not actually required. However I did
find that execution was faster doing so (presumably because the array could be
packed for the single vertex components as it is no longer ragged when
separated out).
We then sort the component sets (each a list of vertices). Since these vertex
indices are monotonically increasing due to the topological sort in the
beginning, it is guaranteed that the first vertex in the sorted list is the
first inline vertex in the path and the last vertex in the sorted list is the
last inline vertex in the path.
Now that we know the first and last inline vertices that correspond to each
path, we can match those to the start and end points of each path using
associations created from the continuing and concluding segment lists.
From there we can get the length of each path using the length of the component
list and construct our list of edges and weights for all our paths (including
our single segment paths from earlier).
Now we can prepare our final lists of edges and weights, topologically sort our
vertices once more using these new edges for good measure (to potentially speed
up plotting), and prepare our final graph.
From here we can finally plot our graph.
Evaluation
computeResult = AbsoluteTiming@QuietEcho[
gPreSort = csv // RightComposition[
MapApply[{c, p} |->
Map[(c -> # &)@*(If[StringQ[#] && # != c, #,
Throw[{"Bad value", #}]] &)]@*StringSplit@p],
Flatten,
DeleteDuplicates,
Graph
];
shas = TopologicalSort[gPreSort];
gSha = Graph[shas, EdgeList@gPreSort];
vSz = Length@shas;
idxs = Range@vSz;
ToIdx = Dispatch@*MapThread[Rule]@{shas, idxs};
ToSha = Dispatch@*MapThread[Rule]@{idxs, shas};
rootSha = "7c32d0e3cc51c18f8da8991bf3f2a23a84bf7e90";
rootIdx = rootSha /. ToIdx;
Echo [rootSha, "Root Vertex SHA:"];
Echo [rootIdx, "Root Vertex Index:"];
g = Graph[idxs, EdgeList[gSha] /. ToIdx];
gEdge = EdgeList@g;
adj = AdjacencyMatrix@g;
ones = ConstantArray[1, vSz, SparseArray];
outs = adj . ones;
ins = ones . adj;
Assert[VertexOutDegree[g] == outs, "Out Degree is m.v"];
Assert[VertexInDegree[g] == ins, "In Degree is v.m"];
Assert[
ConnectedGraphQ@*Graph@*ReplaceAll[DirectedEdge -> UndirectedEdge]@
gEdge, "Graph is Connected:"];
Assert[AcyclicGraphQ@g, "Graph is Acyclic:"];
Assert[SimpleGraphQ@g, "Graph is Simple:"];
Assert[UpperTriangularMatrixQ@adj, "Graph is Topologically Sorted:"];
EqualToBool[x_] := Boole@*EqualTo[x];
SetAttributes[EqualToBool, {Flat, Listable, OneIdentity, Orderless}];
UnequalToBool[x_] := Boole@*UnequalTo[x];
SetAttributes[
UnequalToBool, {Flat, Listable, OneIdentity, Orderless}];
GreaterThanBool[x_] := Boole@*GreaterThan[x];
AdjMatrixToRules = RightComposition[
ArrayRules,
Most,
Association,
PositionIndex,
Values,
Flatten[#, 1] &,
MapApply[Rule]
];
MaskAdjIn = mask |-> matrix |-> Dot[DiagonalMatrix[mask], matrix];
MaskAdjOut =
mask |->
matrix |-> Transpose@Dot[DiagonalMatrix[mask], Transpose@matrix];
roots = EqualToBool[0] /@ ins;
terms = EqualToBool[0] /@ outs;
inlines = BitAnd[EqualToBool[1] /@ outs, EqualToBool[1] /@ ins];
forks :=
BitOr[GreaterThanBool[1] /@ ins, GreaterThanBool[1] /@ outs];
pathEnds = BitXor[roots, BitOr[forks, terms]];
pathStarts = BitOr[roots, forks];
nonPruned = BitXor[inlines, 1];
vInlines = PositionIndex[inlines]@1;
vStarts = PositionIndex[pathStarts]@1;
vEnds = PositionIndex[pathEnds]@1;
vForks = PositionIndex[forks]@1;
vFinal = PositionIndex[nonPruned]@1;
startAdj = MaskAdjIn[pathStarts]@adj;
endAdj = MaskAdjOut[pathEnds]@adj;
pathAdj = MaskAdjOut[inlines]@*MaskAdjIn[inlines]@adj;
initialSegments := AdjMatrixToRules@startAdj;
finalSegments := AdjMatrixToRules@endAdj;
inlineSegments := AdjMatrixToRules@pathAdj;
singleSegments = AdjMatrixToRules@*MaskAdjOut[pathEnds]@startAdj;
continuingSegments = AdjMatrixToRules@*MaskAdjOut[inlines]@startAdj;
concludingSegments = AdjMatrixToRules@*MaskAdjIn[inlines]@endAdj;
Echo[Length@gSegments, "Total Number of segments:"];
Echo[Length@inlineSegments, "Number of inline segments:"];
Echo[Length@singleSegments, "Number of one-step segments:"];
Echo[Length@continuingSegments,
"Number of continuing (non-one-stop initial) segments:"];
Echo[Length@concludingSegments,
"Number of concluding (non-one-stop final) segments:"];
gSegments := gEdge /. DirectedEdge -> Rule;
segmentSets := {inlineSegments, singleSegments, continuingSegments,
concludingSegments};
reconstructedSegments := Join @@ segmentSets;
Assert[
ContainsExactly[initialSegments]@
Join[singleSegments, continuingSegments],
"Initial segments should have no mismatches"
];
Assert[
ContainsExactly[finalSegments]@
Join[singleSegments, concludingSegments],
"Final segments should have no mismatches"
];
Assert[
ContainsExactly[gSegments]@reconstructedSegments,
"All segments should correspond to original edges"
];
Assert[
AllTrue[Apply@DisjointQ]@Subsets[segmentSets, {2}],
"Segment sets should have no overlap"
];
Assert[ContainsOnly[gSegments]@singleSegments,
"One-step segments are all valid edges"];
Assert[
ContainsNone[PositionIndex[inlines]@1]@*
DeleteDuplicates@(singleSegments /. Rule[x_, y_] -> y),
"One-step segments should not contain inline vertices"
];
Assert[
ContainsOnly[PositionIndex[pathStarts]@1]@*
DeleteDuplicates@(continuingSegments /. Rule[x_, y_] -> x),
"Continuing (non-one-stop initial) segments should only point \
from fork vertices or roots"
];
Assert[
ContainsNone[PositionIndex[pathEnds]@1]@*
DeleteDuplicates@(continuingSegments /. Rule[x_, y_] -> y),
"Continuing (non-one-stop initial) segments should not point \
towards any fork vertices or terminators"
];
Assert[
ContainsNone[PositionIndex[pathStarts]@1]@*
DeleteDuplicates@(concludingSegments /. Rule[x_, y_] -> x),
"Concluding (non-one-stop final) segments should not point from \
any fork vertices or roots"
];
Assert[
ContainsOnly[PositionIndex[pathEnds]@1]@*
DeleteDuplicates@(concludingSegments /. Rule[x_, y_] -> y),
"Concluding (non-one-stop final) segments should only point \
towards fork vertices or terminators"
];
gPaths = Graph[vInlines, inlineSegments];
vComponents =
SplitBy[WeaklyConnectedComponents@gPaths, GreaterThan[1]@*Length];
singleInlines = Flatten@*Last@vComponents;
multiInlines = Map[Sort]@*First@vComponents;
headMap = Association@*Map[Reverse]@continuingSegments;
tailMap = Association@concludingSegments;
weightedSingleInlines =
Map[{DirectedEdge[#[[1]], #[[2]]], 2} &]@*Transpose@
Lookup[{headMap, tailMap}, singleInlines];
weightedMultiInlines = multiInlines // RightComposition[
Map[{First[#], Last[#], Length[#] + 1, {First@#, Last@#}} &],
Transpose,
{
Lookup[headMap, #[[1]]],
Lookup[tailMap, #[[2]]],
#[[3]], #[[4]]
} &,
Transpose,
Map[{DirectedEdge[#[[1]], #[[2]]], #[[3]]} &],
Sort
];
weightedSingleSteps =
singleSegments /. Rule[x_, y_] -> {DirectedEdge[x, y], 1};
finalSegments =
Join[weightedSingleInlines, weightedMultiInlines,
weightedSingleSteps];
finalEdges = Transpose[finalSegments][[1]];
finalWeights = Transpose[finalSegments][[2]];
finalVertices = TopologicalSort[finalEdges];
gFinal =
Graph[finalVertices, finalEdges, EdgeWeight -> finalWeights];
Assert[
ConnectedGraphQ@*Graph@*ReplaceAll[DirectedEdge -> UndirectedEdge]@
finalEdges, "Graph is Connected:"];
Assert[AcyclicGraphQ@gFinal, "Graph is Acyclic:"];
Assert[UpperTriangularMatrixQ@*WeightedAdjacencyMatrix@gFinal,
"Graph is Topologically Sorted:"];
]
Execution Time 8.81156s
Layer Digraph Embedding Render
{diGraphTime, diGraphRender} = AbsoluteTiming@GraphPlot[
gFinal,
VertexShapeFunction -> (Tooltip[{PointSize[Small], Point[#]},
ToString@#2 <> " - " <> ToString[#2 /. ToSha]] &),
EdgeShapeFunction -> "CurvedArc",
GraphLayout -> {"LayeredDigraphEmbedding", "RootVertex" -> 1},
PlotTheme -> "LargeGraph"
];
diGraphTime = Quantity[diGraphTime, "Seconds"];
Echo[diGraphTime, "Execution Time:"];
diGraphRender
Execution Time: 611.237s

Spring Electrical Embedding Render
{springElecGraphTime, springElecGraphRender} =
AbsoluteTiming@GraphPlot[
gFinal,
VertexShapeFunction -> (Tooltip[{PointSize[Small], Point[#]},
ToString@#2 <> " - " <> ToString[#2 /. ToSha]] &),
EdgeShapeFunction -> "CurvedArc",
GraphLayout -> {"SpringElectricalEmbedding",
"EdgeWeighted" -> True},
PerformanceGoal -> "Speed",
PlotTheme -> "LargeGraph"
];
springElecGraphTime = Quantity[springElecGraphTime, "Seconds"];
Echo[springElecGraphTime, "Execution Time:"];
springElecGraphRender
Execution Time: 7.95155s

Notes
I found a spring embedding graph seems to work well for visualising the data at
a distance (and it executes far faster than a digraph embedding) however for
looking at the data up-close, the digraph embedding is still by far the best
tool for visualising it.