I need some help optimizing the code below; I can't work out how to rewrite it to avoid the deoptimizations. The code runs very slowly on Node. I took it from the benchmarksgame-team binary-trees benchmark and made minor changes.

When it is run with --trace-deopt, it shows that the functions in the hot path are deoptimized, e.g.:

[bailout (kind: deopt-lazy, reason: (unknown)): begin. deoptimizing 0x02117de4aa11 <JSFunction bottomUpTree2 (sfi = 000002F24C7D7539)>, opt id 3, bytecode offset 9, deopt exit 17, FP to SP delta 80, caller SP 0x00807d7fe6f0, pc 0x7ff6afaca78d]

Here is the benchmark; run it using node --trace-deopt a 20:

function mainThread() {
    const maxDepth = Math.max(6, parseInt(process.argv[2]));

    const stretchDepth = maxDepth + 1;
    const check = itemCheck(bottomUpTree(stretchDepth));
    console.log(`stretch tree of depth ${stretchDepth}\t check: ${check}`);

    const longLivedTree = bottomUpTree(maxDepth);

    for (let depth = 4; depth <= maxDepth; depth += 2) {
        const iterations = 1 << maxDepth - depth + 4;
        work(iterations, depth);
    }

    console.log(`long lived tree of depth ${maxDepth}\t check: ${itemCheck(longLivedTree)}`);
}

function work(iterations, depth) {
    let check = 0;
    for (let i = 0; i < iterations; i++) {
        check += itemCheck(bottomUpTree(depth));
    }
    console.log(`${iterations}\t trees of depth ${depth}\t check: ${check}`);
}

function TreeNode(left, right) {
    return {left, right};
}

function itemCheck(node) {
    if (node.left === null) {
        return 1;
    }
    return 1 + itemCheck2(node);
}

function itemCheck2(node) {
    return itemCheck(node.left) + itemCheck(node.right);
}

function bottomUpTree(depth) {
    return depth > 0
        ? bottomUpTree2(depth)
        : new TreeNode(null, null);
}

function bottomUpTree2(depth) {
    return new TreeNode(bottomUpTree(depth - 1), bottomUpTree(depth - 1));
}

console.time();
mainThread();
console.timeEnd();
  • I've found that new TreeNode can be replaced with an in-place {left, right} literal, which gives about a 15-20% boost.
  • Don't use new on functions that return objects. Either write a constructor (function TreeNode(l, r) { this.left = l; this.right = r; }) and use new, or use a factory function (like you have) and call it without new (both sketched below).
  • @Bergi +1 to that. While that's not the primary issue here, it's (1) generally good practice, and (2) explains the "15-20% boost": new + a factory function essentially means double allocation, which is extra wasteful (and hurts especially on a benchmark that already bottlenecks on allocations + GC).
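
For illustration, here is a minimal sketch of the two alternatives that comment describes (the names viaConstructor, treeNode, and viaFactory are hypothetical, chosen to avoid clashing with the question's TreeNode):

// Option 1: a real constructor. `new` allocates once, and the
// constructor fills in the fields on `this`.
function TreeNode(left, right) {
    this.left = left;
    this.right = right;
}
const viaConstructor = new TreeNode(null, null);

// Option 2: a factory function (as in the question), called WITHOUT
// `new`, so only the object literal itself is allocated.
function treeNode(left, right) {
    return {left, right};
}
const viaFactory = treeNode(null, null);

Combining both -- calling new on a factory that returns an object -- allocates a throwaway `this` object and then discards it in favor of the returned literal, which is the double allocation the last comment refers to.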

1 Answer

(V8 developer here.)

The premise of this question is incorrect: a few deopts don't matter, and don't move the needle regarding performance. Trying to avoid them is an exercise in futility.

The first step when trying to improve the performance of something is to profile it (one way to collect such a profile is sketched after the following list). In this case, a profile reveals that the benchmark is spending:

  • about 46.3% of the time in optimized code (about 4/5 of that for tree creation and 1/5 for tree iteration)
  • about 0.1% of the time in unoptimized code
  • about 52.8% of the time in the garbage collector, tracing and freeing all those short-lived objects.
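
For reference, a profile like the one above can be collected with Node's built-in sampling profiler; the exact invocation below is my assumption, not something this answer specifies (a is the benchmark file, 20 its argument):

# run with V8's sampling profiler enabled
node --prof a 20
# post-process the generated isolate-*-v8.log into a readable summary
node --prof-process isolate-*-v8.log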

This is as artificial a microbenchmark as they come. 50% GC time never happens in real-world code that does useful things aside from allocating multiple gigabytes of short-lived objects as fast as possible.

In fact, calling them "short-lived objects" is a bit inaccurate in this case. While the vast majority of the individual trees being constructed are indeed short-lived, the code allocates one super-large long-lived tree early on. That fools V8's adaptive mechanisms into assuming that all future TreeNodes will be long-lived too, so it allocates them in "old space" right away -- which would save time if the guess was correct, but ends up wasting time because the TreeNodes that follow are actually short-lived and would be better placed in "new space" (which is optimized for quickly freeing short-lived objects). So just by reshuffling the order of operations, I can get a 3x speedup.
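
A minimal sketch of that reshuffling, based on my reading of the clarification in the comments below (allocate the long-lived tree after the main loop, so it no longer teaches the allocator that TreeNodes tend to survive):

function mainThread() {
    const maxDepth = Math.max(6, parseInt(process.argv[2]));

    const stretchDepth = maxDepth + 1;
    const check = itemCheck(bottomUpTree(stretchDepth));
    console.log(`stretch tree of depth ${stretchDepth}\t check: ${check}`);

    // Hot loop first: every TreeNode allocated here is short-lived,
    // so V8 keeps allocating them in "new space".
    for (let depth = 4; depth <= maxDepth; depth += 2) {
        const iterations = 1 << (maxDepth - depth + 4);
        work(iterations, depth);
    }

    // The formerly "long-lived" tree is allocated last.
    const longLivedTree = bottomUpTree(maxDepth);
    console.log(`long lived tree of depth ${maxDepth}\t check: ${itemCheck(longLivedTree)}`);
}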

This is a typical example of one of the general problems with microbenchmarks: by doing something extreme and unrealistic, they often create situations that are not at all representative of typical real-world scenarios. If engine developers optimized for such microbenchmarks, engines would perform worse for real-world code. If JavaScript developers try to derive insights from microbenchmarks, they'll write code that performs worse under realistic conditions.

Anyway, if you want to optimize this code, avoid as many of those object allocations as you can.


Concretely:

An artificial microbenchmark like this, by its nature, intentionally does useless work (such as: computing the same value a million times). You said you wanted to optimize it, which means avoiding useless work, but you didn't specify which parts of the useless work you'd like to preserve, if any. In the absence of a preference, I'll assume that all useless work is useless. So let's optimize!

Looking at the code, it creates perfect binary trees of a given depth and counts their nodes. In other words, it sums up the 1s in these examples:
depth=0:

1

depth=1:

  1
 / \
1   1

depth=2:

     1
   /   \
  1     1
 / \   / \
1   1 1   1

and so on. Level i of such a tree contains 2 ** i nodes, and 1 + 2 + 4 + ... + 2 ** N = (2 ** (N+1)) - 1, so a tree of depth N has (2 ** (N+1)) - 1 nodes in total. So we can replace:

itemCheck(bottomUpTree(depth));

with

(2 ** (depth+1)) - 1

(and analogously for the "stretchDepth" line).
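
Applied to the stretchDepth line, the same replacement would look like this (a sketch, reusing the names from the question):

const check = (2 ** (stretchDepth + 1)) - 1;
console.log(`stretch tree of depth ${stretchDepth}\t check: ${check}`);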

Next, we can take care of the useless repetitions. Since x + x + x + x + ... N times is the same as x*N, we can replace:

let check = 0;
for (let i = 0; i < iterations; i++) {
  check += (2 ** (depth + 1)) - 1;
}

with just:

let check = ((2 ** (depth + 1)) - 1) * iterations;

With that, we're down from 12 seconds to about 0.1 seconds. Not bad for five minutes of work, eh?
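
Putting both replacements together, the entire work function collapses to a single expression while printing the same output (a sketch of the result so far):

function work(iterations, depth) {
    // x + x + ... (iterations times) is the same as x * iterations.
    const check = ((2 ** (depth + 1)) - 1) * iterations;
    console.log(`${iterations}\t trees of depth ${depth}\t check: ${check}`);
}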

And that remaining time is almost entirely due to the longLivedTree. To apply the same optimizations to the operations creating and iterating that tree, we'd have to move them together, getting rid of its "long-livedness". Would you find that acceptable? You could get the overall time down to less than a millisecond! Would that make the benchmark useless? Not actually any more useless than it was to begin with, just more obviously so.


2 Comments

Could you explain a bit what "by reshuffling the order of operations, I can get a 3x speedup" means? Could you provide an example?
I mean allocating the "long-lived tree" after the main loop, which means it won't be "long-lived" any more.
