In Python 3.9, nested functions are surprisingly slower than module-level functions: around 10% slower in my example.

from timeit import timeit

def f():
    return 0

def factory():
    def g():
        return 0

    return g

g = factory()

print(timeit("f()", globals=globals()))
#> 0.074835498
print(timeit("g()", globals=globals()))
#> 0.08470309999999998

dis.dis shows the same bytecode for both functions, and the only difference I've found is in the functions' internal flags: dis.show_code reveals that g has the NESTED flag while f does not.
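
This comparison can also be made programmatically instead of reading the dis output by eye. The following is a minimal sketch, assuming it runs as a top-level script so that f is compiled at module level; the helper name is_nested is made up for illustration and is not part of any library.

```python
import inspect


def f():
    return 0


def factory():
    def g():
        return 0

    return g


g = factory()


def is_nested(func):
    """Hypothetical helper: True if CO_NESTED is set on func's code object."""
    return bool(func.__code__.co_flags & inspect.CO_NESTED)


# Raw bytecode of both functions, compared byte for byte.
same_bytecode = f.__code__.co_code == g.__code__.co_code

print("same bytecode:", same_bytecode)
print("f nested:", is_nested(f))
print("g nested:", is_nested(g))
```

With the setup from the question, the bytecode should compare equal and the flag should be reported only for g.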

However, the flag can be removed, and doing so makes g as fast as f.

import inspect
# Clear the CO_NESTED bit on g's code object
g.__code__ = g.__code__.replace(co_flags=g.__code__.co_flags & ~inspect.CO_NESTED)
print(timeit("f()", globals=globals()))
#> 0.07321161100000001
print(timeit("g()", globals=globals()))
#> 0.07439838800000001

I've looked through the CPython source to understand how the CO_NESTED flag could affect function execution, but I found nothing. Is there an explanation for this performance difference tied to the CO_NESTED flag?

EDIT: Removing the CO_NESTED flag also seems to have no effect on function execution other than removing the overhead, even when the function has a captured variable.

import inspect
global_var = 40
def factory():
    captured_var = 2
    def g():
        return global_var + captured_var
    return g
g = factory()
assert g() == 42

# Clear the CO_NESTED bit on g's code object
g.__code__ = g.__code__.replace(co_flags=g.__code__.co_flags & ~inspect.CO_NESTED)
assert g() == 42  # function still works as expected

1 Answer

I may be wrong, but I think the difference comes from the fact that g can potentially reference variables local to factory, and so needs access to two scopes for variable lookup: the globals and factory's locals. Securing this additional scope (or merging the scopes of factory and the globals) may well be the cause of the overhead you observe. A hint that this is happening is what you see when you nest another level of functions:

def factory():
    def ff():
        def g():
            return 0

        return g
    return ff()

g = factory()  # note: equivalent to the original setup as far as time measurement goes

Timings:

print(timeit("f()", globals=globals(), number=100000000))
# > 6.792911
print(timeit("g()", globals=globals(), number=100000000))
# > 7.8184555

With your first timing case I get +5.7% (it was +13.2% with your numbers); with my second example, +15.1%.
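
Single timeit runs this short are noisy, so percentages like these are easier to compare when each figure is the minimum of several repeats. The following is a minimal sketch of that approach; the helper names best_time and relative_overhead, the repeat count, and the loop count are all assumptions chosen for illustration, not what either of us actually ran.

```python
from timeit import repeat


def f():
    return 0


def factory():
    def g():
        return 0

    return g


g = factory()


def best_time(stmt, namespace, number=200_000, rounds=5):
    """Hypothetical helper: fastest of several timeit runs of stmt."""
    return min(repeat(stmt, globals=namespace, number=number, repeat=rounds))


def relative_overhead(base, other):
    """Percentage by which `other` exceeds `base`."""
    return (other / base - 1) * 100


t_f = best_time("f()", {"f": f})
t_g = best_time("g()", {"g": g})
overhead_pct = relative_overhead(t_f, t_g)
print(f"f: {t_f:.4f}s  g: {t_g:.4f}s  overhead: {overhead_pct:+.1f}%")
```

Applied to the two-level timings above, relative_overhead(6.792911, 7.8184555) comes out at about 15.1, which matches the figure quoted. The measured overhead itself will differ between machines and Python versions.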

4 Comments

Thank you for your answer. However, I tested it, and adding a second factory layer adds no overhead compared with a one-layer factory. I know these timeit results are hard to discuss: the test is so fast that it is heavily affected by interference from other processes.
Concerning scope, I don't think any merging can happen, because the global and nested scopes are handled differently. The global scope is stored in the __globals__ attribute and accessed with the LOAD_GLOBAL bytecode, whereas the nested scope is "captured" in an internal structure, with names in __code__.co_freevars and values in __closure__, and is accessed with the LOAD_DEREF bytecode.
I've edited my question to clarify that removing the flag doesn't affect execution when there are captured variables, so it doesn't seem that any additional processing tied to CO_NESTED concerns scope resolution. But I may be wrong too.
Good point about checking the opcodes, etc. Please note that the exact behavior may differ between machines (I ran my experiments on Wintel); there can also be effects such as cache alignment that depend on the amount of code in the whole script. Many variables influence the execution here.
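
The scope mechanics described in the second comment can be inspected directly. The following is a minimal sketch reusing the captured-variable example from the question, assuming it runs as a top-level script so that global_var really is a module global; opcode names vary between CPython versions, so the printed set is only illustrative.

```python
import dis

global_var = 40


def factory():
    captured_var = 2

    def g():
        return global_var + captured_var

    return g


g = factory()

# Names of the variables g captures from enclosing function scopes.
freevars = g.__code__.co_freevars
# Values currently held in the matching closure cells.
cell_values = [cell.cell_contents for cell in g.__closure__]
# Distinct opcode names used by g.
opnames = {ins.opname for ins in dis.get_instructions(g)}

print("co_freevars:", freevars)
print("closure values:", cell_values)
print("global_var in __globals__:", "global_var" in g.__globals__)
print("opcodes:", sorted(opnames))
```

The captured name shows up in co_freevars with its value in __closure__, and the opcode set should contain LOAD_DEREF for the captured variable alongside a separate global lookup.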
