The other answers are very good, but I'd like to come at this from a few other angles that you may not have considered.
A single-pass compiler needs a language designed for that
If you compile in a single pass over the code, from source text to generated code, then you need all of the information to generate that code at the point where you parse it.
Such compilers have been extremely successful; the Borland "Turbo" compilers are famous examples. But they do require one important property that not all languages have: everything must be declared before it is used. This is why C has forward declarations for functions that are called before they are defined.
While C and Pascal have this property, many modern languages do not. Java and C# allow you to use a class member before it is declared, without forward declarations, so a single-pass compiler is simply not possible. And many other language features, such as type inference or C++ templates, effectively require retaining some of the earlier source code.
But even so, there is always some tree (or more general graph) structure required, if only to store things like type information. If you have structs/records with structs/records as members, then you have a recursive structure, and the best way to represent that is as a tree of some kind.
Character-by-character processing is expensive
Consider a modestly-sized identifier, like print. Even though that's only 5 bytes long, anything that has to iterate over the characters in the identifier's name is always going to be more expensive than manipulating a single pointer. And it adds up; the pass over the program that has to deal with individual characters is usually one of the more expensive linear-time passes that a compiler has to do.
Program text is not only expensive to manipulate, but also expensive to store. Now it's true that a lot of thinking on compiler frontend design is partly informed by the more limited resources on a typical computer of the past, but even today, this is still a consideration.
In many popular interpreted languages of today, like JavaScript or Python, the compiler frontend runs on the same machine, even in the same process, as the program it is compiling. And that could be a device with limited performance, such as a mobile device, or a virtual machine provisioned from a cloud provider, where more RAM or more CPU speed means more cost at scale.
It's true that today, virtual address space is extremely cheap, even on quite modest machines, so you can memory-map input files instead of storing them, assuming the source program comes from a file. Nonetheless, the less time a compiler spends scanning characters, the better.
A compiler is one part of a toolchain
The final thing you might want to consider is all of the other software development tools that a programmer might want or expect. Even leaving JIT compilation or linking to one side, consider debuggers, profilers, coverage analysers, and so forth.
Maybe you can generate code in a single pass from the input string more or less directly. But a compiler is, realistically, only one product in a product line of tools.
This goes double if you want to reuse existing tools. I mean, maybe you are happy with the existing state of debuggers. Fine. But you still have to generate whatever those tools need as output, as well as generating code. And that will either complicate your code generator or, more usually, require retaining some data about the structure of the program for use in a subsequent pass.