1
$\begingroup$

I have a large DAG that I'm trying to visualise. It's the history of a large (~350k+ commits) git repository with multiple separate origin points. I've tried using a handful of different graph layout modes but the ones that do work present extraordinarily unusual graphs (ex: a giant opaque sphere) and the rest take long enough that I give up executing them. One of the ones that hasn't presented me a solution yet is LayeredDigraphEmbedding which uses most of my CPU, uses ~64GB of ram, and runs for hours without termination.

I'm assuming I'm doing something wrong but as far as I can tell the rules for the graph are at minimum constructed correctly. I thought maybe the size of the data was causing problems so I tried mapping the SHAs to indices for the vertices instead but that only seemed like a minor improvement (debugging was at least faster).

My ideal outcome is either a layered graph or a radial graph of some kind (with the separate origins forking at their merge points away from the origin) but at this point I'll settle for any graph shape as long as it's legible.

Mathematica version: 14.2.1

Git SHA Dump Command:

git --no-pager log --format="format:%H, %P" > out.csv

Mathematica Code:

csv = Import["/path/to/opb-s-r-hashes.csv", "CSV"];
gSha = csv // RightComposition[
    MapApply[{c, p} |-> 
      Map[(c -> # &)@*(If[StringQ[#] && # != c, #, 
             Throw[{"Bad value", #}]] &)]@*StringSplit@p],
    Flatten,
    DeleteDuplicates
    ];

shas = DeleteDuplicates@*Flatten@{Keys@gSha, Values@gSha};
idxs = Range@Length@shas;
ToIdx = Dispatch@*MapThread[Rule]@{shas, idxs};
ToSha = Dispatch@*MapThread[Rule]@{idxs, shas};

rootSha = "7c32d0e3cc51c18f8da8991bf3f2a23a84bf7e90";
rootIdx = rootSha /. ToIdx;

Echo [rootSha, "Root Vertex SHA:"];
Echo [rootIdx, "Root Vertex Index:"];

g = gSha /. ToIdx;
Echo[ConnectedGraphQ@*Graph@*ReplaceAll[Rule -> UndirectedEdge]@g, 
  "Graph is Connected:"];
Echo[AcyclicGraphQ@Graph@g, "Graph is Acyclic:"];
Echo[SimpleGraphQ@Graph@g, "Graph is Simple:"];

graph = Graph[g,
  VertexLabels -> None,
  EdgeLabels -> None,
  GraphLayout -> {
    "LayeredDigraphEmbedding",
    "RootVertex" -> rootIdx
  }
]

The git repo in question

$\endgroup$
1
  • $\begingroup$ Sorry for not linking the CSV directly. I provided the command and the repo used to generate it but GitHub Gists refuses to store it and I didn't see an easy way to upload it to StackExchange directly. $\endgroup$ Commented Oct 14 at 2:07

2 Answers 2

2
$\begingroup$

An idea about removing non-branching vertices (having both In/Out VertexDegree equal 1) in order to improve performance and still preserving the graph main structure.

Test dataset

(first 25000 rows)

img = Import["https://i.sstatic.net/6nws94BM.png", "PNG"];
input = Uncompress@
   FromCharacterCode[
    DeleteCases[Flatten[ImageData[img, "Byte"]], 0], "UTF-8"];
Take[input, 2]
(*
  {{"7c32d0e3cc51c18f8da8991bf3f2a23a84bf7e90"   
    , " 4cb335befcb4a700322786690fdd8712bd8f3910 99ec2db64b869d0582f578f2feffe80e9fc18569"} 
  , {"4cb335befcb4a700322786690fdd8712bd8f3910" 
    , " a2e5454abbef065a63dc34b55d423a2ac6631863 1266dbb576db0f434d29c5b9a51e4267afd05c46"}}
*)

Initial data load

gSha = MapApply[Splice@Thread[#1 -> Flatten@DeleteDuplicates@StringSplit[#2, " "]] &]@input;
shas = DeleteDuplicates@*Flatten@{Keys@gSha, Values@gSha};
idxs = Range@Length@shas;
ToIdx = Dispatch@*MapThread[Rule]@{shas, idxs};
ToSha = Dispatch@*MapThread[Rule]@{idxs, shas};

rootSha = "7c32d0e3cc51c18f8da8991bf3f2a23a84bf7e90";
rootIdx = rootSha /. ToIdx;

Echo[rootSha, "Root Vertex SHA:"];
Echo[rootIdx, "Root Vertex Index:"];

g = gSha /. ToIdx;
Echo[ConnectedGraphQ@*Graph@*
  ReplaceAll[Rule -> UndirectedEdge]@gSha, "Graph is Connected:"];
Echo[AcyclicGraphQ@Graph@gSha, "Graph is Acyclic:"];
Echo[SimpleGraphQ@Graph@gSha, "Graph is Simple:"];

graph = Graph[g, VertexLabels -> None, EdgeLabels -> None]

Vertex pruning

vertexOut = 
  GroupBy[Thread[VertexOutDegree[#] -> VertexList[#]], First -> Last] &@graph;
vertexIn = 
  GroupBy[Thread[VertexInDegree[#] -> VertexList[#]], First -> Last] &@graph;
vertexNonBranching = 
  AssociationThread[Intersection[vertexOut[1], vertexIn[1]] -> 0];
vertexTerminators = 
  AssociationThread[Map[First]@Select[EdgeList[graph]
     , KeyExistsQ[AssociationThread[vertexOut[0] -> 0], #[[2]]] &] -> 0];
pruningPairs = 
   Select[EdgeList[graph], 
    KeyExistsQ[vertexNonBranching, #[[2]]] && !KeyExistsQ[vertexTerminators, #[[2]]] &]; 
pruningRules = Dispatch[Rule @@@ Map[Reverse]@pruningPairs];


ParallelEvaluate[pruningRules = Dispatch[Rule @@@ Map[Reverse]@pruningPairs]];
prune[edges_] := FixedPoint[# /. pruningRules &, edges];
partitions = Partition[List @@@ EdgeList[graph], UpTo[100]];
edges = ParallelMap[prune, partitions];(* ~10' on my machine (16x CPU, 32GB) *)
prunedGraph = 
  DirectedEdge @@@ Select[Join @@ edges, #[[1]] != #[[2]] &] // 
    Union // Graph[#, VertexLabels -> None] &

Graph plotting

(* pruned graph - test dataset (first 25000 input rows) *)
GraphPlot[Graph[prunedGraph, VertexStyle -> {1 -> Green}]
  , VertexShapeFunction -> (Tooltip[{PointSize[Small], Point[#]}, 
      ToString@#2 <> " - " <> ToString[#2 /. ToSha]] &) 
  , EdgeShapeFunction -> "CurvedArc"
  , GraphLayout -> {"LayeredDigraphEmbedding", "RootVertex" -> 1}
  , PerformanceGoal -> "Speed"
  , PlotTheme -> "LargeGraph"
  ] // AbsoluteTiming

pruned graph - test version

(* original graph - test dataset (first 25000 input rows) *)
GraphPlot[Graph[graph, VertexStyle -> {1 -> Green}]
  , VertexShapeFunction -> (Tooltip[{PointSize[Small], Point[#]}, 
      ToString@#2 <> " - " <> ToString[#2 /. ToSha]] &) 
  , EdgeShapeFunction -> "CurvedArc"
  , GraphLayout -> {"LayeredDigraphEmbedding", "RootVertex" -> 1}
  , PerformanceGoal -> "Speed"
  , PlotTheme -> "LargeGraph"
  ] // AbsoluteTiming

original graph - test version

(* pruned graph - full dataset *)

pruned graph - full version

Encoded test dataset

input dataset

$\endgroup$
3
$\begingroup$

I started from @vindobona's answer and tried improving on it by leaning on some matrix ops and mathematica's inbuilt graph programming algorithms to speed things up.

This serves double benefit as these operations are accelerated not only with SIMD but also utilise all available threads even when only using a single kernel (which is massively beneficial for me as I only have a non-commercial license and am limited to 4 kernels). IDK if slapping a Parallelize[] on this would improve things by running on multiple machines but these improvements have gotten me pretty acceptable speeds for pruning the graph.

For reference I am running on an AMD Ryzen 9 5950X (16 Core, 32 Thread w/ AVX2) and 128GB of RAM (however Mathematica uses comparatively little RAM here).

Explanation

I'm taking roughly the same strategy as @vindobona and am condension all straight-line paths inside the graph down into single edges but as an addition I'm encoding the "length" of the edge into the graph's edge weights.

I started out by topologically sorting the graph prior to converting to indices so that the adjacency matrix is triangular and the index of vertices along the directed edges are always monotonically increasing.

From there we can get the adjacency matrix and build out boolean masks of the vertex/index list that categorise vertices:

  • root vertex: 0 incoming edges.
  • terminator vertex: 0 outgoing edges.
  • inline vertex: exactly 1 incoming edge, 1 outgoing edge.
  • fork vertex: 2 or more incoming or outgoing edges.

We can then use these masks to find any vertex that could be the start point for a path, the end point for a path, and the vertices that are not inline (and therefore will be considered part of the final graph). And we can then take those masks and use PositionIndex to convert these masks into lists. On a normal array this is O(n) but luckily with sparse arrays this is only O(density*n).

From there we can mask out the adjacency matrix and get the first edges on any path in the graph as well as the last edges on any path in the graph.

And we can further narrow this down by separating out the edges that are both first and last edges (as they don't either start on or end on an inline vertex). We call these the "continuing" and "concluding" edges/segments.

Now we can construct a graph of all the inline paths without any of the forks connecting them and then immediately extract a list containing the sets of vertices in each path subgraph using WeaklyConnectedComponents[] which can be performed in O(n log n) or potentially even O(log2 n) depending on the algorithm.

A number of these paths will be single points with no edges so we can partition the list to separate these out however while writing this up I've come to the conclusion that separating these out is not actually required. However I did find that execution was faster doing so (presumably because the array could be packed for the single vertex components as it is no longer ragged when separated out).

We then sort the component sets (each a list of vertices). Since these vertex indices are monotonically increasing due to the topological sort in the beginning, it is guaranteed that the first vertex in the sorted list is the first inline vertex in the path and the last vertex in the sorted list is the last inline vertex in the path.

Now that we know the first and last inline vertices that correspond to each path, we can match those to the start and end points of each path using associations created from the continuing and concluding segment lists.

From there we can get the length of each path using the length of the component list and construct our list of edges and weights for all our paths (including our single segment paths from earlier).

Now we can prepare our final lists of edges and weights, topologically sort our vertices once more using these new edges for good measure (to potentially speed up plotting), and prepare our final graph.

From here we can finally plot our graph.

Evaluation

computeResult = AbsoluteTiming@QuietEcho[
   gPreSort = csv // RightComposition[
      MapApply[{c, p} |->
        Map[(c -> # &)@*(If[StringQ[#] && # != c, #,
               Throw[{"Bad value", #}]] &)]@*StringSplit@p],
      Flatten,
      DeleteDuplicates,
      Graph
      ];

   shas = TopologicalSort[gPreSort];
   gSha = Graph[shas, EdgeList@gPreSort];

   vSz = Length@shas;

   idxs = Range@vSz;
   ToIdx = Dispatch@*MapThread[Rule]@{shas, idxs};
   ToSha = Dispatch@*MapThread[Rule]@{idxs, shas};

   rootSha = "7c32d0e3cc51c18f8da8991bf3f2a23a84bf7e90";
   rootIdx = rootSha /. ToIdx;

   Echo [rootSha, "Root Vertex SHA:"];
   Echo [rootIdx, "Root Vertex Index:"];

   g = Graph[idxs, EdgeList[gSha] /. ToIdx];
   gEdge = EdgeList@g;

   adj = AdjacencyMatrix@g;

   ones = ConstantArray[1, vSz, SparseArray];
   outs = adj . ones;
   ins = ones . adj;

   Assert[VertexOutDegree[g] == outs, "Out Degree is m.v"];
   Assert[VertexInDegree[g] == ins, "In Degree is v.m"];

   Assert[
    ConnectedGraphQ@*Graph@*ReplaceAll[DirectedEdge -> UndirectedEdge]@
     gEdge, "Graph is Connected:"];
   Assert[AcyclicGraphQ@g, "Graph is Acyclic:"];
   Assert[SimpleGraphQ@g, "Graph is Simple:"];
   Assert[UpperTriangularMatrixQ@adj, "Graph is Topologically Sorted:"];

   EqualToBool[x_] := Boole@*EqualTo[x];
   SetAttributes[EqualToBool, {Flat, Listable, OneIdentity, Orderless}];

   UnequalToBool[x_] := Boole@*UnequalTo[x];
   SetAttributes[
    UnequalToBool, {Flat, Listable, OneIdentity, Orderless}];

   GreaterThanBool[x_] := Boole@*GreaterThan[x];

   AdjMatrixToRules = RightComposition[
     ArrayRules,
     Most,
     Association,
     PositionIndex,
     Values,
     Flatten[#, 1] &,
     MapApply[Rule]
     ];

   MaskAdjIn = mask |-> matrix |-> Dot[DiagonalMatrix[mask], matrix];
   MaskAdjOut =
    mask |->
     matrix |-> Transpose@Dot[DiagonalMatrix[mask], Transpose@matrix];

   roots = EqualToBool[0] /@ ins;
   terms = EqualToBool[0] /@ outs;
   inlines = BitAnd[EqualToBool[1] /@ outs, EqualToBool[1] /@ ins];

   forks :=
    BitOr[GreaterThanBool[1] /@ ins, GreaterThanBool[1] /@ outs];

   pathEnds = BitXor[roots, BitOr[forks, terms]];
   pathStarts = BitOr[roots, forks];
   nonPruned = BitXor[inlines, 1];

   vInlines = PositionIndex[inlines]@1;
   vStarts = PositionIndex[pathStarts]@1;
   vEnds = PositionIndex[pathEnds]@1;
   vForks = PositionIndex[forks]@1;

   vFinal = PositionIndex[nonPruned]@1;

   startAdj = MaskAdjIn[pathStarts]@adj;
   endAdj = MaskAdjOut[pathEnds]@adj;
   pathAdj = MaskAdjOut[inlines]@*MaskAdjIn[inlines]@adj;

   initialSegments := AdjMatrixToRules@startAdj;
   finalSegments := AdjMatrixToRules@endAdj;
   inlineSegments := AdjMatrixToRules@pathAdj;

   singleSegments = AdjMatrixToRules@*MaskAdjOut[pathEnds]@startAdj;
   continuingSegments = AdjMatrixToRules@*MaskAdjOut[inlines]@startAdj;
   concludingSegments = AdjMatrixToRules@*MaskAdjIn[inlines]@endAdj;

   Echo[Length@gSegments, "Total Number of segments:"];
   Echo[Length@inlineSegments, "Number of inline segments:"];
   Echo[Length@singleSegments, "Number of one-step segments:"];
   Echo[Length@continuingSegments,
    "Number of continuing (non-one-stop initial) segments:"];
   Echo[Length@concludingSegments,
    "Number of concluding (non-one-stop final) segments:"];


   gSegments := gEdge /. DirectedEdge -> Rule;
   segmentSets := {inlineSegments, singleSegments, continuingSegments,
      concludingSegments};
   reconstructedSegments := Join @@ segmentSets;

   Assert[
    ContainsExactly[initialSegments]@
     Join[singleSegments, continuingSegments],
    "Initial segments should have no mismatches"
    ];
   Assert[
    ContainsExactly[finalSegments]@
     Join[singleSegments, concludingSegments],
    "Final segments should have no mismatches"
    ];
   Assert[
    ContainsExactly[gSegments]@reconstructedSegments,
    "All segments should correspond to original edges"
    ];
   Assert[
    AllTrue[Apply@DisjointQ]@Subsets[segmentSets, {2}],
    "Segment sets should have no overlap"
    ];
   Assert[ContainsOnly[gSegments]@singleSegments,
    "One-step segments are all valid edges"];
   Assert[
    ContainsNone[PositionIndex[inlines]@1]@*
      DeleteDuplicates@(singleSegments /. Rule[x_, y_] -> y),
    "One-step segments should not contain inline vertices"
    ];
   Assert[
    ContainsOnly[PositionIndex[pathStarts]@1]@*
      DeleteDuplicates@(continuingSegments /. Rule[x_, y_] -> x),
    "Continuing (non-one-stop initial) segments should only point \
from fork vertices or roots"
    ];
   Assert[
    ContainsNone[PositionIndex[pathEnds]@1]@*
      DeleteDuplicates@(continuingSegments /. Rule[x_, y_] -> y),
    "Continuing (non-one-stop initial) segments should not point \
towards any fork vertices or terminators"
    ];
   Assert[
    ContainsNone[PositionIndex[pathStarts]@1]@*
      DeleteDuplicates@(concludingSegments /. Rule[x_, y_] -> x),
    "Concluding (non-one-stop final) segments should not point from \
any fork vertices or roots"
    ];
   Assert[
    ContainsOnly[PositionIndex[pathEnds]@1]@*
      DeleteDuplicates@(concludingSegments /. Rule[x_, y_] -> y),
    "Concluding (non-one-stop final) segments should only point \
towards fork vertices or terminators"
    ];


   gPaths = Graph[vInlines, inlineSegments];
   vComponents =
    SplitBy[WeaklyConnectedComponents@gPaths, GreaterThan[1]@*Length];
   singleInlines = Flatten@*Last@vComponents;
   multiInlines = Map[Sort]@*First@vComponents;

   headMap = Association@*Map[Reverse]@continuingSegments;
   tailMap = Association@concludingSegments;

   weightedSingleInlines =
    Map[{DirectedEdge[#[[1]], #[[2]]], 2} &]@*Transpose@
     Lookup[{headMap, tailMap}, singleInlines];
   weightedMultiInlines = multiInlines // RightComposition[
      Map[{First[#], Last[#], Length[#] + 1, {First@#, Last@#}} &],
      Transpose,
      {
        Lookup[headMap, #[[1]]],
        Lookup[tailMap, #[[2]]],
        #[[3]], #[[4]]
        } &,
      Transpose,
      Map[{DirectedEdge[#[[1]], #[[2]]], #[[3]]} &],
      Sort
      ];
   weightedSingleSteps =
    singleSegments /. Rule[x_, y_] -> {DirectedEdge[x, y], 1};

   finalSegments =
    Join[weightedSingleInlines, weightedMultiInlines,
     weightedSingleSteps];
   finalEdges = Transpose[finalSegments][[1]];
   finalWeights = Transpose[finalSegments][[2]];
   finalVertices = TopologicalSort[finalEdges];
   gFinal =
    Graph[finalVertices, finalEdges, EdgeWeight -> finalWeights];

   Assert[
    ConnectedGraphQ@*Graph@*ReplaceAll[DirectedEdge -> UndirectedEdge]@
     finalEdges, "Graph is Connected:"];
   Assert[AcyclicGraphQ@gFinal, "Graph is Acyclic:"];
   Assert[UpperTriangularMatrixQ@*WeightedAdjacencyMatrix@gFinal,
    "Graph is Topologically Sorted:"];
   ]

Execution Time 8.81156s

Layer Digraph Embedding Render

{diGraphTime, diGraphRender} = AbsoluteTiming@GraphPlot[
    gFinal,
    VertexShapeFunction -> (Tooltip[{PointSize[Small], Point[#]},
        ToString@#2 <> " - " <> ToString[#2 /. ToSha]] &),
    EdgeShapeFunction -> "CurvedArc",
    GraphLayout -> {"LayeredDigraphEmbedding", "RootVertex" -> 1},
    PlotTheme -> "LargeGraph"
    ];

diGraphTime = Quantity[diGraphTime, "Seconds"];
Echo[diGraphTime, "Execution Time:"];
diGraphRender

Execution Time: 611.237s

Layered Digraph Embedding Graph Render

Spring Electrical Embedding Render

{springElecGraphTime, springElecGraphRender} =
  AbsoluteTiming@GraphPlot[
    gFinal,
    VertexShapeFunction -> (Tooltip[{PointSize[Small], Point[#]},
        ToString@#2 <> " - " <> ToString[#2 /. ToSha]] &),
    EdgeShapeFunction -> "CurvedArc",
    GraphLayout -> {"SpringElectricalEmbedding",
      "EdgeWeighted" -> True},
    PerformanceGoal -> "Speed",
    PlotTheme -> "LargeGraph"
    ];

springElecGraphTime = Quantity[springElecGraphTime, "Seconds"];
Echo[springElecGraphTime, "Execution Time:"];
springElecGraphRender

Execution Time: 7.95155s

Spring Electrical Embedding Graph Render

Notes

I found a spring embedding graph seems to work well for visualising the data at a distance (and it executes far faster than a digraph embedding) however for looking at the data up-close, the digraph embedding is still by far the best tool for visualising it.

$\endgroup$

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.