How to validate Python bytecode?

Question

I'm thinking of doing some bytecode manipulation (think genetic programming) in Python.

I came across a test case in the crashers test section of the Python source tree that states:

Broken bytecode objects can easily crash the interpreter. This is not going to be fixed.

Hence the question: how can I validate a given piece of tweaked bytecode to be sure it will not crash the interpreter? Is it even possible?

Test source, after http://nedbatchelder.com/blog/201206/eval_really_is_dangerous.html

cc = (lambda fc=(
    lambda n: [
        c for c in
            ().__class__.__bases__[0].__subclasses__()
            if c.__name__ == n
        ][0]
    ):
    fc("function")(
        fc("code")(
            0, 0, 0, 0, "KABOOM", (), (), (), "", "", 0, ""
        ), {}
    )()
)

This module defines cc; calling mymod.cc() crashes the interpreter. Granted, this is a very tricky example: it creates a new code object with the custom bytecode "KABOOM" and then runs it.

I'd accept something that verifies predefined bytecode, e.g. from a .pyc file.
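As a starting point for checking bytecode from a .pyc file, the code objects have to be extracted first. Below is a minimal sketch, assuming CPython 3.7 or later, where a .pyc file starts with a 16-byte header. The helper names load_pyc_code and iter_code_objects are made up for illustration. Note that the marshal module is documented as not secure against maliciously constructed data, so this only makes sense for files from a source you trust to be well-formed.

```python
import importlib.util
import marshal
import types


def load_pyc_code(path):
    """Return the top-level code object stored in a .pyc file."""
    with open(path, "rb") as f:
        header = f.read(16)  # assumption: 16-byte header (CPython 3.7+)
        if len(header) < 16 or header[:4] != importlib.util.MAGIC_NUMBER:
            raise ValueError("not a .pyc file for this interpreter version")
        code = marshal.load(f)
    if not isinstance(code, types.CodeType):
        raise ValueError("file does not contain a code object")
    return code


def iter_code_objects(code):
    """Yield a code object and every code object nested in its constants."""
    yield code
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            yield from iter_code_objects(const)
```

Each yielded object exposes the raw bytes as co_code, which is what any checker would have to inspect; this sketch does not validate anything by itself.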

I know of no method that'll validate bytecode, no. This is a hard task; better just produce valid bytecode.
– Martijn Pieters, Commented Apr 24, 2014 at 11:47
I think this may be undecidable. Suppose you have bytecode equivalent to: if method_that_may_loop_forever(): crash(). You would have to solve the Halting Problem to determine whether it will crash or not.
– Kevin, Commented Apr 24, 2014 at 11:53
@Kevin I surely don't want to solve the halting problem. I only want to determine whether a particular bytecode sequence is guaranteed safe or is potentially unsafe. Similar to what the JVM does.
– Dima Tisnek, Commented Apr 24, 2014 at 13:57
Why would you want to generate bytecode directly, if one can generate Python source code and execute it instead? The first approach is not well documented, lacks tools, etc... Are there serious disadvantages of source code generation for your case?
– Tim, Commented Sep 27, 2014 at 7:37
In genetic programming a quality or fitness is being optimized. If the ratio of invalid candidates is too high, genetic algorithms are ineffective. Better to ensure candidates are correct by construction, so that a fitness can be calculated. Difficult, though!
– cfi, Commented Sep 27, 2014 at 20:12

devst3r · Accepted Answer · 2014-09-22 13:17:52Z

3

+200

A bytecode assembler tracks the stack across jumps, globally verifies that the predicted stack levels are consistent, and automatically rejects attempts to generate dead code. With one, it is virtually impossible to accidentally generate bytecode that can crash the interpreter.

This link might help you: pypi.python.org/pypi/BytecodeAssembler
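The stack tracking described above can be illustrated with a small sketch. It deliberately uses a toy instruction set invented for this example (PUSH, POP, ADD, JUMP, JUMP_IF_FALSE, RETURN), not CPython's real opcodes and not the BytecodeAssembler API. The idea is to follow every control-flow path, record the stack depth on entry to each instruction, and reject the program if a path underflows the stack, jumps outside the code, or reaches the same instruction with a different depth.

```python
from collections import deque

# Toy instruction set (hypothetical, NOT CPython opcodes):
# name -> (values popped, values pushed)
EFFECTS = {
    "PUSH": (0, 1),
    "POP": (1, 0),
    "ADD": (2, 1),
    "JUMP": (0, 0),           # unconditional jump to index `arg`
    "JUMP_IF_FALSE": (1, 0),  # pops a condition, may jump to index `arg`
    "RETURN": (1, 0),         # pops the result and ends execution
}


class VerificationError(Exception):
    pass


def verify(program):
    """Check a list of (op, arg) tuples; return entry depth per instruction."""
    depth_at = {}           # instruction index -> stack depth on entry
    todo = deque([(0, 0)])  # (index, depth) pairs still to explore
    while todo:
        index, depth = todo.popleft()
        if not 0 <= index < len(program):
            raise VerificationError("control flow leaves the code at %d" % index)
        if index in depth_at:
            # Join point: every path must agree on the stack depth.
            if depth_at[index] != depth:
                raise VerificationError("inconsistent stack depth at %d" % index)
            continue
        depth_at[index] = depth
        op, arg = program[index]
        if op not in EFFECTS:
            raise VerificationError("unknown opcode %r at %d" % (op, index))
        popped, pushed = EFFECTS[op]
        if depth < popped:
            raise VerificationError("stack underflow at %d" % index)
        depth = depth - popped + pushed
        if op == "RETURN":
            continue
        if op in ("JUMP", "JUMP_IF_FALSE"):
            todo.append((arg, depth))
        if op != "JUMP":
            todo.append((index + 1, depth))
    return depth_at
```

A check like this errs on the side of caution: it rejects any program in which some path could misbehave, whether or not that path is ever taken at run time, which is how it sidesteps the halting problem.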


3 Comments

Dima Tisnek Over a year ago

Good library/link, I have to verify it does indeed validate byte code to the degree I want. I suspect that validation of arbitrary code jumps is technically undecidable, thus the question is whether byte code assembler errs on the side of caution (rejects potentially bad bytecode) or on the side of user (allows potentially good code).

devst3r Over a year ago

If you use the BytecodeAssembler module (pypi.python.org/pypi/BytecodeAssembler), you won't need to figure this stuff out. For that matter, it has lots of support for labels, block handling, etc. The full manual for it is at peak.telecommunity.com/DevCenter/BytecodeAssembler.

Dima Tisnek Over a year ago

I'm afraid this statement only applies to generating code using BytecodeAssembler and not when parsing existing byte code. It may prove possible to map existing byte code to sequences of API calls though, I'm trying to hack something up...

Alex · Accepted Answer · 2014-09-27 22:21:20Z

1

Both are outdated, and the first one comes without code (at least I can't find any), but they may be useful to give an idea of what can be done, how, and what the limitations are.

perfectly valid bytecode can still do horrible things


1 Comment

Dima Tisnek Over a year ago

At least the simple "KABOOM" case is caught by Python-Bytecode-Verifier, which reports verifier.VerificationError: Unverifiable code: Stack underflow. Offset: 0 Stack: 0 Boundary: 0 Required: 2. Of course, the package is hopelessly outdated.

Mikko Ohtamaa · Accepted Answer · 2014-09-22 12:33:12Z

1

Python might not be an ideal language for such tasks, for the reasons stated in the question.

One approach: don't create or accept raw bytecode; accept only Python source code and compile it yourself.
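A minimal sketch of that approach, using only the built-in compile(); the function name compile_candidate is made up for illustration. A candidate that does not compile is simply discarded, and whatever does compile is bytecode produced by the interpreter's own compiler rather than by hand.

```python
def compile_candidate(source, name="<candidate>"):
    """Return a code object for `source`, or None if it does not compile."""
    try:
        return compile(source, name, "exec")
    except (SyntaxError, ValueError):
        # SyntaxError: malformed source; ValueError: e.g. embedded null bytes.
        return None


code = compile_candidate("result = 6 * 7")
namespace = {}
if code is not None:
    exec(code, namespace)  # runs the candidate with full privileges
```

This only guards against malformed bytecode. The compiled code still runs with the full power of the interpreter, so it is not a sandbox.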

Further, there exist libraries (e.g. RestrictedPython) that manipulate Python at the AST level to provide some security guarantees, e.g. to prevent sandbox escapes.
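To give an idea of what an AST-level check looks like, here is a toy sketch built on the standard ast module. It is not RestrictedPython's API and is nowhere near a real sandbox; the policy (reject imports and any attribute whose name starts with an underscore) and the function name find_violations are assumptions chosen for illustration.

```python
import ast


def find_violations(source):
    """Return (line, reason) pairs for constructs this toy policy rejects."""
    violations = []
    # ast.parse raises SyntaxError if the source is not valid Python.
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            violations.append((node.lineno, "import statement"))
        elif isinstance(node, ast.Attribute) and node.attr.startswith("_"):
            violations.append((node.lineno, "attribute " + node.attr))
    return violations


# The attribute chain used by the crasher in the question is flagged.
trick = "().__class__.__bases__[0].__subclasses__()"
for line, reason in find_violations(trick):
    print(line, reason)
```

A source string with no violations could then be passed on to compile(); anything else would be rejected before it is ever turned into bytecode.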


2 Comments

Dima Tisnek Over a year ago

Please correct me if I'm wrong, but RestrictedPython requires Python source as input, does it not?

Mikko Ohtamaa Over a year ago

Yes. That's the AST (Abstract Syntax Tree). If you want to be pedantic, that's not the source code itself. Hence the disclaimer (it might not suit your approach).
