6

I'm thinking of doing some bytecode manipulation (think genetic programming) in Python.

I came across a test case in the crashers section of the Python source tree that states:

Broken bytecode objects can easily crash the interpreter. This is not going to be fixed.

Hence the question: how can I validate tweaked bytecode to make sure it will not crash the interpreter? Is it even possible?

Test source, after http://nedbatchelder.com/blog/201206/eval_really_is_dangerous.html

cc = (lambda fc=(
    lambda n: [
        c for c in
            ().__class__.__bases__[0].__subclasses__()
            if c.__name__ == n
        ][0]
    ):
    fc("function")(
        fc("code")(
            0, 0, 0, 0, "KABOOM", (), (), (), "", "", 0, ""
        ), {}
    )()
)

Here, this module defines cc; calling mymod.cc() crashes the interpreter. Granted, this is a very tricky example: it creates a new code object with the custom bytecode "KABOOM" and then runs it.

I'd accept something that verifies predefined bytecode, e.g. from a .pyc file.

7
  • 4
    I know of no method that'll validate bytecode, no. This is a hard task; better just produce valid bytecode. Commented Apr 24, 2014 at 11:47
  • 3
    I think this may be undecidable. Suppose you have bytecode equivalent to: if method_that_may_loop_forever(): crash(). you would have to solve the Halting Problem to determine whether it will crash or not. Commented Apr 24, 2014 at 11:53
  • 3
    @Kevin I certainly don't want to solve the halting problem. I only want to determine whether a particular bytecode sequence is guaranteed safe or is potentially unsafe, similar to what the JVM does. Commented Apr 24, 2014 at 13:57
  • 1
    Why would you want to generate bytecode directly, if one can generate python source code and execute it instead? First approach is not well documented, lacks tools, etc... Are there serious disadvantages of the source code generation for your case? Commented Sep 27, 2014 at 7:37
  • 2
    In genetic programming a quality or fitness is being optimized. If the ratio of invalid candidates is too high, genetic algorithms are ineffective. Better ensure candidates are correct by construction, so that a fitness can be calculated. Difficult, though! Commented Sep 27, 2014 at 20:12

3 Answers

3
+200

A bytecode assembler tracks the stack across jumps, globally verifies that its stack-level predictions are consistent, and automatically rejects attempts to generate dead code. With one, it is virtually impossible to accidentally generate bytecode that can crash the interpreter.

This Link might help you.
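To give an idea of what stack tracking involves, here is a minimal sketch using the standard dis module. It only walks the instructions in order and ignores jumps, exception handlers, and operand validity, so it is far weaker than what an assembler or a real verifier does. The name straight_line_stack_check is my own and not part of any library, and I'm assuming Python 3.8+ for code.replace().

```python
import dis


def straight_line_stack_check(code):
    """Crude check: walk instructions in order and track the net stack depth.

    Assumption: straight-line code only. Jumps, exception handlers and
    operand validity are ignored, so passing this check does NOT prove
    that the code is safe to run.
    """
    depth = 0
    for instr in dis.get_instructions(code):
        try:
            # Net push/pop effect of this single instruction.
            effect = dis.stack_effect(instr.opcode, instr.arg)
        except ValueError as exc:
            return False, "offset %d: %s" % (instr.offset, exc)
        depth += effect
        if depth < 0:
            return False, "offset %d: stack underflow at %s" % (
                instr.offset, instr.opname)
        if depth > code.co_stacksize:
            return False, "offset %d: exceeds co_stacksize at %s" % (
                instr.offset, instr.opname)
    return True, "no underflow or overflow on the straight-line path"


good = (lambda a, b: a + b).__code__

# A lone POP_TOP pops from an empty stack. The code object is only built
# here for inspection; it must never be executed.
bad = good.replace(co_code=bytes([dis.opmap["POP_TOP"], 0]))

print(straight_line_stack_check(good))
print(straight_line_stack_check(bad))
```

A real verifier would have to follow every branch and check that all paths reaching an instruction agree on the stack depth, which is the part an assembler does for you.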


3 Comments

Good library/link, I have to verify it does indeed validate byte code to the degree I want. I suspect that validation of arbitrary code jumps is technically undecidable, thus the question is whether byte code assembler errs on the side of caution (rejects potentially bad bytecode) or on the side of user (allows potentially good code).
If you use the BytecodeAssembler module (pypi.python.org/pypi/BytecodeAssembler), you won't need to figure this stuff out. For that matter, it has lots of support for labels, block handling, etc. The full manual for it is at (peak.telecommunity.com/DevCenter/BytecodeAssembler)
I'm afraid this statement only applies to generating code using BytecodeAssembler and not when parsing existing byte code. It may prove possible to map existing byte code to sequences of API calls though, I'm trying to hack something up...
1

Both are outdated, and the first one comes without code (at least I can't find any), but they may be useful to give an idea of what can be done, how, and what the limitations are.

perfectly valid bytecode can still do horrible things

1 Comment

At least a simple "KABOOM" is caught by Python-Bytecode-Verifier with verifier.VerificationError: Unverifiable code: Stack underflow. Offset: 0 Stack: 0 Boundary: 0 Required: 2; of course the package is hopelessly outdated.
1

Python might not be an ideal language for such tasks, for the reasons stated in the question.

One approach: don't create or accept raw bytecode; accept only Python source code and compile it yourself.
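A minimal sketch of that idea, assuming candidates arrive as source strings. The helper names compile_candidate and run_candidate are made up for illustration. Note that this only guarantees the bytecode is well formed; running untrusted source is still dangerous.

```python
def compile_candidate(source, filename="<candidate>"):
    """Compile Python source; return a code object, or None if rejected."""
    try:
        return compile(source, filename, "exec")
    except (SyntaxError, ValueError):
        # Malformed candidates fail here, in the compiler, instead of
        # reaching the interpreter loop as broken bytecode.
        return None


def run_candidate(source):
    """Compile and execute a candidate in a fresh namespace."""
    code = compile_candidate(source)
    if code is None:
        return None
    namespace = {}
    exec(code, namespace)  # still executes arbitrary code: not a sandbox
    return namespace


print(run_candidate("x = 1 + 2")["x"])
print(compile_candidate("x = = 1"))
```

For genetic programming, a candidate that fails to compile can simply be given the worst fitness.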

Furthermore, there exist libraries (e.g. RestrictedPython) that manipulate Python at the AST level to provide some security guarantees, e.g. to prevent sandbox escapes.
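To illustrate what an AST-level check looks like, here is a rough sketch with the standard ast module. This is not RestrictedPython's API and nowhere near a complete sandbox. The function find_suspicious_nodes and its rules (no imports, no underscore-prefixed names or attributes) are my own assumptions, chosen because they happen to catch the `__class__`/`__subclasses__` trick from the question.

```python
import ast


def find_suspicious_nodes(source):
    """Return a list of (line, description) for constructs we refuse.

    Hypothetical rule set for illustration only: imports and any
    underscore-prefixed name or attribute are rejected.
    """
    problems = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            problems.append((node.lineno, "import statement"))
        elif isinstance(node, ast.Attribute) and node.attr.startswith("_"):
            problems.append((node.lineno, "attribute " + node.attr))
        elif isinstance(node, ast.Name) and node.id.startswith("_"):
            problems.append((node.lineno, "name " + node.id))
    return problems


print(find_suspicious_nodes("x = 1 + 2"))
print(find_suspicious_nodes("().__class__.__bases__[0].__subclasses__()"))
```

Only source that comes back with an empty list would then be passed on to compile().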

2 Comments

Please correct me if I'm wrong, but RestrictedPython requires python source as input, does it not?
Yes. That's the AST (Abstract Syntax Tree). If you want to be pedantic, that's not the source code itself. Hence the disclaimer (it might not suit your approach).
