Write demonstrator of Python bytecode → Python AST #81
(Oh, I can't assign it to myself, but feel free to assign it to me. I'll do the work in a fork and make it a draft PR, which will probably never get out of draft status or be merged.)
@jpivarski I have assigned you. BTW: the current numba-rvsdg 0.0.2 only supports transformation from Python bytecode to SCFG (structured control flow graph). The documentation is still a bit sparse at present.
Thanks! I figured that at present, there would only be conversions from Python bytecode to something new, such as this SCFG class. Through this project, I'm hoping to learn about the new code representation, even if the API changes and I'll have to rewrite it later.
I started investigating it, and learned that not just any bytecode will produce a graph—only bytecode with control flow.

```python
from numba_rvsdg.core.datastructures.byte_flow import ByteFlow

def foo(x, y):
    return x + y

byteflow = ByteFlow.from_bytecode(foo.__code__)
print(byteflow.scfg.to_dict())
```

produces
but

```python
def foo(num, start):
    out = start
    for i in range(num):
        out += i**2
    return out
```

produces
It mainly seems to be grouping bytecodes in a useful way, perhaps indicating data dependencies. Still going with that second example,

```python
bcmap = byteflow.scfg.bcmap_from_bytecode(byteflow.bc)

for i in range(4):
    print(f"python_bytecode_block_{i}")
    python_bytecode_block = byteflow.scfg.graph[f"python_bytecode_block_{i}"]
    for instruction in python_bytecode_block.get_instructions(bcmap):
        print(f"    {instruction.opname} {instruction.arg} {instruction.argval}")
```

produces
while

```python
import dis

print(dis.dis(foo))
```

produces
Looking at it with

```python
from numba_rvsdg.rendering.rendering import ByteFlowRenderer

bfr = ByteFlowRenderer()
bfr.bcmap_from_bytecode(byteflow.bc)
byteflow.scfg.view("scfg")
```

produces

I know that I can convert expressions from bytecode to AST; the control flow was always the hard part. This graph must be capturing the control flow in some way, though I'll need to play with it and maybe ask some specific questions to understand how. Now I'm going to try running through a complete set of Python expressions (no statements) to see if this always gives me single-node graphs, as I would expect.
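The observation that only bytecode with control flow yields a multi-node graph can be checked with the standard-library `dis` module alone. Here's a small sketch of my own (not part of numba-rvsdg) that tests whether a function's bytecode contains any jump instruction:

```python
import dis

def has_control_flow(func):
    """Return True if the function's bytecode contains a jump instruction."""
    jump_opcodes = set(dis.hasjabs) | set(dis.hasjrel)
    return any(
        instruction.opcode in jump_opcodes
        for instruction in dis.get_instructions(func)
    )

def straight_line(x, y):
    return x + y

def looping(num, start):
    out = start
    for i in range(num):
        out += i**2
    return out

print(has_control_flow(straight_line))  # False: a single basic block
print(has_control_flow(looping))        # True: FOR_ITER and backward jumps
```

This mirrors the two `foo` examples above: the straight-line function has no jumps, the loop does.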
Hypothesis: all Python expressions produce an SCFG graph of length 1. Result: yup, they do.

```python
import ast
from numba_rvsdg.core.datastructures.byte_flow import ByteFlow

# all of the expression AST node types
expression_types = [
    x
    for x in dir(ast)
    if (
        isinstance(getattr(ast, x), type)
        and issubclass(getattr(ast, x), ast.expr)
        and x not in (
            "expr",
            "Num",           # deprecated since Python 3.8
            "Str",           # |
            "Bytes",         # |
            "NameConstant",  # |
            "Ellipsis",      # V
        )
    )
]

# remove an expression node type when we see one
class RemoveFromList(ast.NodeVisitor):
    def visit(self, node):
        if type(node).__name__ in expression_types:
            expression_types.remove(type(node).__name__)
        return super().visit(node)

remove_from_list = RemoveFromList()

# examples of all expression types in Python
for string_expression in [
    "x.y",                                     # Attribute
    "await f()",                               # Await
    "x + y",                                   # BinOp
    "x and y",                                 # BoolOp
    "f()",                                     # Call
    "x < y",                                   # Compare
    "123",                                     # Constant
    "{1: 2}",                                  # Dict
    "{x: x for x in collection}",              # DictComp
    "f'{x}'",                                  # FormattedValue
    "(x for x in collection)",                 # GeneratorExp
    "consequent if predicate else alternate",  # IfExp
    "f'x'",                                    # JoinedStr
    "lambda: None",                            # Lambda
    "[1, 2, 3]",                               # List
    "[x for x in collection]",                 # ListComp
    "x",                                       # Name
    "(x := 4)",                                # NamedExpr
    "{1, 2, 3}",                               # Set
    "{x for x in collection}",                 # SetComp
    "x[1:2]",                                  # Slice
    "(x, *y)",                                 # Starred
    "x[1]",                                    # Subscript
    "(1, 2, 3)",                               # Tuple
    "~x",                                      # UnaryOp
    "yield x",                                 # Yield
    "yield from x",                            # YieldFrom
]:
    # has to be wrapped in a function so that Await is valid
    asyncy = "async " if "await " in string_expression else ""
    syntax_tree = ast.parse(f"{asyncy}def _(): {string_expression}")

    # remove from the list of expression types
    remove_from_list.visit(syntax_tree)

    code = compile(syntax_tree, "<string>", "exec")

    # hypothesis: SCFG for all bare expressions is a one-node graph
    byteflow = ByteFlow.from_bytecode(code)
    assert len(byteflow.scfg.graph) == 1

# did we miss any expression types?
assert len(expression_types) == 0
```
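To build intuition for how the SCFG groups instructions, the classic "leaders" computation can be reproduced with nothing but `dis`. This is my own sketch, not numba-rvsdg's algorithm, assuming a block starts at every jump target and immediately after every jump:

```python
import dis

def basic_blocks(code):
    """Split a code object's instructions into basic blocks (leaders algorithm)."""
    instructions = list(dis.get_instructions(code))
    jump_opcodes = set(dis.hasjabs) | set(dis.hasjrel)
    leaders = {0}
    for i, instruction in enumerate(instructions):
        if instruction.is_jump_target:  # a block starts at every jump target...
            leaders.add(i)
        if instruction.opcode in jump_opcodes and i + 1 < len(instructions):
            leaders.add(i + 1)          # ...and right after every jump
    bounds = sorted(leaders) + [len(instructions)]
    return [instructions[a:b] for a, b in zip(bounds, bounds[1:])]

def add(x, y):
    return x + y

def foo(num, start):
    out = start
    for i in range(num):
        out += i**2
    return out

print(len(basic_blocks(add.__code__)))  # 1: no control flow
print(len(basic_blocks(foo.__code__)))  # several blocks for the loop
```

The exact block count for the loop depends on the Python version, but a jump-free expression always gives one block, consistent with the one-node SCFG graphs above.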
@jpivarski what you have discovered is correct. Any linear sequence of bytecodes without branching forms a single block.
@jpivarski you can also run the restructuring. This will apply the transformation from https://dl.acm.org/doi/pdf/10.1145/2693261 and then render the graph to see what it looks like. The main goal of the RVSDG package right now is to transform Python bytecode into an SCFG which can be restructured. Note also that some of the top-level classes are scheduled for removal in an upcoming PR, but it shouldn't be that big of a change.
For the following statement types, here's what I think about whether they'll split into multiple graph nodes. (YES = you said that they split, MAYBE = I think they would, and NO = I don't think they would; I think several of these can be part of the same graph node.)
It's probably the case that I'll only want to look at reconstructed graphs, then. Right now, I'm still getting a basic understanding of these things. I wonder if I should first focus on the parts that aren't spread across graph nodes. The parsing rules from uncompyle6 would still be useful here, applied to each graph node as an individual sequence of instructions, since they still need to be turned into AST trees. It would be like this, which targets Python expressions only, but making

As for the control flow parts that are spread across graph nodes, they could be reconstructed into

And maybe I shouldn't be thinking of this project as using uncompyle6, but as being part of uncompyle6, as a way to help @rocky get that tool up to date with recent Pythons.

(Hi, @rocky, bringing you into this conversation for the first time. I've been thinking about the problem of converting bytecode → AST for recent versions of Python, which is also a problem for Numba, and this RVSDG formalism promises to fix it. I've been doing these tests to see if the bytecode interpretation part can remain independent of LLVM/pure Python. Should we talk on https://github.com/rocky/python-uncompyle6/discussions about whether you'd want to fold that into your project? This is all very early/preliminary/just seeing what's possible.)
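As a toy version of that per-node parsing, a linear (jump-free) instruction sequence can be turned back into an AST by simulating the value stack, pushing `ast` nodes instead of values. This is my own minimal sketch, not uncompyle6's grammar-driven approach, and it only handles a few load and binary opcodes (whose names differ before and after the 3.11 switch to `BINARY_OP`):

```python
import ast
import dis

def expression_to_ast(func):
    """Rebuild an AST for a function body of the form `return <expression>`."""
    stack = []
    for ins in dis.get_instructions(func):
        if ins.opname in ("LOAD_FAST", "LOAD_NAME", "LOAD_GLOBAL"):
            stack.append(ast.Name(id=ins.argval, ctx=ast.Load()))
        elif ins.opname == "LOAD_FAST_LOAD_FAST":  # 3.13+ superinstruction
            for name in ins.argval:
                stack.append(ast.Name(id=name, ctx=ast.Load()))
        elif ins.opname == "LOAD_CONST":
            stack.append(ast.Constant(value=ins.argval))
        elif ins.opname == "BINARY_ADD" or (
            ins.opname == "BINARY_OP" and ins.argrepr == "+"
        ):
            right, left = stack.pop(), stack.pop()
            stack.append(ast.BinOp(left=left, op=ast.Add(), right=right))
        elif ins.opname == "BINARY_MULTIPLY" or (
            ins.opname == "BINARY_OP" and ins.argrepr == "*"
        ):
            right, left = stack.pop(), stack.pop()
            stack.append(ast.BinOp(left=left, op=ast.Mult(), right=right))
        elif ins.opname == "RETURN_VALUE":
            return ast.fix_missing_locations(stack.pop())
        # other opcodes (RESUME, etc.) carry no stack effect we need here

def f(x, y):
    return x + y * 2

print(ast.unparse(expression_to_ast(f)))  # x + y * 2
```

Operator precedence comes out right for free, because the stack order encodes the tree structure.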
I understand that it's in flux. I'll keep doing these experiments.
@jpivarski - I am interested, but it will take me some time to get up to speed.
Some initial comments. Overall I think the approach or idea is sound, although this is going to be a lot of work. (Believe it or not, the approach being sound is a big deal. I have read about other approaches that feel a little whacky and not that sound.)

A couple of observations, though, when I look at https://github.com/diana-hep/rejig/blob/master/reinterpreted-python/rejig/pybytecode.py

If this is to be able to work on more than one Python bytecode, you will probably want to structure it in a way that segregates versions. There are basically two approaches: one that treats each Python version as its own separate entity, and one that accommodates sharing between versions. For accommodating sharing between versions, the approach I have been using has been something like: Python 3.9 is like 3.8 except here and here and here. In other words, we say that for this kind of rule or action, change it in such and such a way.

Personally, I have been moving away from the shared approach on the grammar or parse side because it was just too complicated to keep track of things. So here I just cut and paste. If I then need to adjust, okay. On the semantic-action side, though, there is still sharing going on. See https://github.com/rocky/python-uncompyle6/blob/master/uncompyle6/semantics/customize3.py#L354-L376

One other important thing to understand and keep in mind is the difference between uncompyle6, decompyle3, and yet another not-public fork called decompile-cfg. In theory and ideally, there is no reason there could not be one code base for all versions. The problem has always been the amount of work needed to back-port stuff. There is still a lot of work needed to backport decompyle3 code to uncompyle6. And although you can't see decompile-cfg, the same is true there to backport to decompyle3. (And the decompile-cfg code is still changing a bit on its own.)
So for my own use, such as in the trepan3k debugger, I import the right code base depending on version: for 3.7 and 3.8 I use decompyle3, otherwise uncompyle6.

One recent and big change in the way I think about this is that I am now doing more in the way of abstracting the parse tree that I get from a parse of tokenized bytecode instructions. See rocky/python-uncompyle6#412 (comment)

This has a big and useful impact in that when we get to the semantic actions there is less junk in the tree, and that means things will be more the same across different Python versions (and Python bytecode versions). In the semantic actions we have to go by position in the tree - 0th item, 1st item, and so on. Again, reducing the noise or pseudo-operations that don't have any real bytecode in them makes different versions, or different derivations in the same Python version, look more similar.

You will find a little bit of this in https://github.com/rocky/python-decompile3/blob/master/decompyle3/semantics/transform.py#L30-L67 and https://github.com/rocky/python-decompile3/blob/master/decompyle3/semantics/transform.py#L558
In the list above for branching ops there are:
Having looked around the uncompyle6 codebase (though not the other ones), I'm aware of how big a project it is. What I'm hoping is that we can take advantage of the modularization of Numba to get some parts from shared effort, in particular the hardest parts. After all, Numba has to unravel the control structures, too, though Numba does not need to parse instructions into an AST. (I think Numba may be able to do a Python instruction → LLVM instruction mapping without knowing the tree structure of expressions.)

Separating by Python versions sounds like a good idea to me. Maybe the parsing code should be the same and each Python version should have its own set of grammar rules, copying where necessary? I say that because it's how I had thought that uncompyle6 works. I saw that there was a lot of separation between the Python versions already, and I had assumed that it went all the way down to the grammar rules.
That's a good point. In #81 (comment), these formed a single graph node, but they ought to be control flow (including `and`/`or`). Maybe that's still to be addressed.
I think I am confused. What would you like to do, and how do you propose doing it? I see code, but perhaps I don't understand the bigger picture.
Python bytecode to Python AST (or any sort of equivalent AST) is about as hard as Python bytecode to source code - no easier and no more difficult. The reality is that, with respect to source code, I have not been able to keep up. I have been doing as a hobby something that would keep a couple of people fully occupied full time as a paid activity.

So I am confused about what you expect to see. A proof of concept? Being satisfied working in some limited context, e.g. for basic blocks only? (Then why RVSDG?) If a limited context, what limited context?
Yes, that is how uncompyle6 works. But not in the newer, not-public decompiler that works for 3.8-3.10: there, there is no grammar sharing. I mention this as kind of an aside. decompyle3 is in the middle - I started moving in this direction. The main thing on parsing that is important is that, over time, there is too much drift in bytecode to have one piece of code that has tests for all the variations, or a grammar that is a superset of all the versions. Instead, it has been conceptually easier to separate these into separate pieces of code for a single bytecode version.
The bigger picture isn't clear yet. As a first step, I want to see if this is going to be possible, so it's a proof of concept. Beyond that, if it does work, it could be
I think it would be better overall if there's only one package that provides Python bytecode → AST, though option (1) would have to be something you're interested in, too, if it's going to go forward. Still, the first thing (for me) to do is the proof of concept, to know if this is even a good approach. What scope? Definitely more than basic blocks only, though I was thinking of starting with that and using uncompyle6 for it, exactly because it's an already-solved problem. Then, when addressing something like an
Especially because
That's interesting, and it's the reason I added option (4) above. I didn't know that you were ready to make this leap.
That's why I thought that using numba-rvsdg was a good idea: it will be kept up-to-date with Python bytecode changes, and the Numba team is large enough that I know this will happen.
For a prototype, I'd start with https://github.com/rocky/python-decompile3 which handles 3.7 and 3.8 code only. The code style is generally more modern Python, and it better follows the ideas in the branch that I believe addresses control-flow issues. I have written a bit about recent thoughts and work on how to do decompilation in rocky/python-uncompyle6#412
Oh, so sorry for the misunderstanding.
numba-rvsdg supports 3.11 only.
I just cloned numba-rvsdg and built it. The project seems to be pretty new - less than a year old. It seems similar to https://github.com/rocky/python-control-flow , although the end goal is to produce an RVSDG rather than a control-flow graph with dominator and reverse-dominator information.

If the future of Python is like its past, then if this project is to be long-lived it will have to come up with a strategy for supporting more Python versions than 3.11. Initially, the path of least resistance is to assume that the next version's bytecode won't be that different from the last. That generally happens, but periodically it doesn't. Python went from a variable 1- or 3-byte bytecode format to a fixed 2-byte "wordcode" format, I think between 3.5 and 3.6. Nowadays this kind of change wouldn't be a problem, because I think Python now has abstracted instructions from bytecode. But back in the 3.5-to-3.6 change, I don't believe that was the case. And although I haven't looked at the details all that much, Python 3.11 is a bit different from 3.10.

Assuming 3.12 will be not that different from 3.11, there is the question of whether you'll allow handling a different bytecode from the bytecode used by the Python interpreter running numba-rvsdg. If so, then I believe you will need to use something other than the Python `dis` module. But as I said, this kind of complication and decision on how to handle more than one bytecode is definitely something that can be put off for later.

Now that I understand numba-rvsdg better, let me close with how its overall approach to understanding control flow is different from the decompiling approach in uncompyle6 and its descendants. uncompyle6 leverages a couple of features of the high-level bytecode interpreter approach (also found in Ruby, Smalltalk, Lua, various Lisps like Emacs Lisp). First, for all practical purposes the translation to bytecode is done by one kind of parsing system for any given version. (What happens afterwards can vary with, say, pyston or pypy.)
The fact that there is just one source-code-to-bytecode translator reduces variation in how a given construct might conceivably be represented in bytecode: the interpreter largely picks one way and uses that. Given this, it becomes easier to use parsing technology that is similar (but not the same) as in conventional compilers. Because the translation is like DNA and has its quirks and leaves fingerprints all over the place, we can use those quirks to more often retrieve the exact Python construct that was used, rather than some other construct or sequence that is semantically equivalent. The control-flow-to-higher-level-construct approach used in numba-rvsdg, while more conventional and general-purpose, does not make use of these kinds of clues that specific compilers put in.

Python, in particular, has this rich and whacky set of control-flow structures, such as the "else" branches on "for", "while", and "try" statements. I doubt numba-rvsdg would bother to go through the effort needed to look for these, as opposed to trying to express them using simpler and more basic control-flow structures involving ifs wrapped around something else. Should it try to tackle "try ... else ... finally", I think you'll find the control-flow algorithm getting a bit more complicated.

In sum, right now the two approaches to handling control flow, while each is valid, are a bit different from each other.
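As a concrete instance of those whacky constructs: `for ... else` runs its `else` suite only when the loop finishes without `break`, and a decompiler has to recover that distinction from jump targets alone. A small runnable reminder of the semantics (my own example, not from either codebase):

```python
def index_of(needle, haystack):
    """Linear search using for/else: the else clause runs only if no break."""
    for position, item in enumerate(haystack):
        if item == needle:
            break          # jumps past the else clause
    else:
        position = -1      # reached only when the loop completes normally
    return position

print(index_of(3, [1, 2, 3]))  # 2
print(index_of(9, [1, 2, 3]))  # -1
```

In bytecode, the `else` suite is just the code between the end of the loop and the `break` target, which is why it is easy to confuse with a plain statement following the loop.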
Thanks @rocky for the detailed insights.
Yes, unlike uncompyle6, the goal of numba-rvsdg is not to provide a syntactic equivalent (AST recovery) but instead a semantic equivalent. For Python compilers (particularly Numba), syntactic information is not important. The semantic equivalent in RVSDG will give compilers a nice data-centric view (where control flow is more or less implied) that makes compiler passes a lot easier to write. Also, we don't want to rely on the bytecode "clues" and "quirks" because they change over time. The heuristics-based approach is a huge burden in the current Numba bytecode frontend, and we anticipate more bytecode changes due to the faster-cpython effort, which will be adding more bytecode-level optimizations.
We will need to tackle all the whacky Python control flow to reach parity with what the Numba bytecode frontend currently supports. The good thing is that
FWIW, the following PR does go from SCFG -> AST -- but the SCFG must be generated from a Python AST, not bytecode: I suppose however that the
@esc I had forgotten about this PR.

Since 2023, I have been thinking and working on this, and after a few false starts, I have come to believe that the way to handle control flow in high-level bytecode languages, whether grammar-based or LLM-based, is not by selecting the control flow but instead by marking dominator regions in basic blocks and letting the grammar or LLM do the actual language-specific control-flow selection.

Basically, this is what https://github.com/rocky/python-control-flow does. Most of it is generic in terms of computing the basic blocks and dominator regions. There is a step at the end that is very specific to marking up the bytecode so that a grammar-based approach can use it.

Looking again at the PR based on my current understanding, I am somewhat negatively disposed to the kind of approach in this PR. From a code-size perspective alone, this is too small. It is like trying to fit 100 cubic meters of information in a 10-cubic-meter box.
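The dominator-region idea can be illustrated with the textbook iterative data-flow computation. This is a generic sketch of mine, not the actual python-control-flow implementation, over a CFG given as a dict of successor lists:

```python
def dominators(successors, entry):
    """Compute the dominator sets of a CFG by iterating to a fixed point."""
    nodes = set(successors)
    predecessors = {n: set() for n in nodes}
    for node, succs in successors.items():
        for s in succs:
            predecessors[s].add(node)
    dom = {n: set(nodes) for n in nodes}  # start from "everything dominates"
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            preds = predecessors[n]
            # a node is dominated by itself plus whatever dominates all predecessors
            incoming = set.intersection(*(dom[p] for p in preds)) if preds else set()
            new_dom = {n} | incoming
            if new_dom != dom[n]:
                dom[n] = new_dom
                changed = True
    return dom

# A diamond CFG: entry A branches to B or C, which rejoin at D.
cfg = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
dom = dominators(cfg, "A")
print(sorted(dom["D"]))  # ['A', 'D']: the join point is dominated only by the entry
```

Marking such regions in the bytecode, rather than naming the construct, is exactly what leaves the language-specific control-flow selection to the grammar.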
Also, I forgot to mention that I recently gave a talk at BlackHat Asia 2024 that, briefly at the end, explained some of the thinking here. A link to the slide fragments and text can be found at https://rocky.github.io/blackhat-asia-2024-additional/all-notes-print.html, specifically https://rocky.github.io/blackhat-asia-2024-additional/all-notes-print.html#cfg . Eventually, the video of the talk should be publicly available. (Right now I think you can get it if you were registered for the conference, or pay some fee to see videos.)
From #80 (comment), I noticed that it's possible to make RVSDG from Python bytecode already. (I hadn't been paying close attention to all the progress that was being made here, but I just noticed it now.)
This is a note to myself (which I will assign to myself) to write a demonstrator of Python bytecode → Python AST, at least for a subset of simple expressions, to get a sense of how it will go. I understand that the RVSDG development is very much in flux and that the demonstrator will get out of date quickly, but it would be good for me to do a first test-run, to get familiar with RVSDG as a concept and discover what the fundamental issues are.