How do Symbol Tables, Lexers and Parsers work together in a modern design? [closed]

Question

Closed. This question needs to be more focused. It is not currently accepting answers.

Closed 10 years ago.

I'm working on creating my own scripting language for learning purposes. I've been reading through the Dragon Book and some things are a little unclear to me regarding the Symbol Table, as well as where the strings that constitute an identifier actually reside and who owns them. Most of this confusion, I think, is due to the book using older techniques whereas I'm looking to use C++11.

As I understand it:

The SymbolTable would be a hash_map of some kind using the lexemes (i.e. strings) as the keys. I've determined on my own that its value would then be a tuple containing the SYMBOL_TYPE and a pointer/reference/lookup to a second array that contains the additional info needed for this SYMBOL_TYPE.

I.e. an entry in the SymbolTable could be <SYMBOL_TYPE::VARIABLE,3>, which would mean that the Variables array at index 3 would hold this variable's type + value + attributes.

I would then do the same for other identifiers like procedures etc.

The Lexer goes through characters until it has produced the longest possible lexeme each time the next Token is requested.

Question 1: Where do these pre-existing lexemes reside? For example, if the symbol table is pre-populated with keywords, punctuation, operators etc, who pre-populates it? Does the SymbolTable itself pre-populate itself with a list of keywords + punctuation + operators? Would I then just attempt to match each entry in the symbol table (starting with longest first)?

Question 2: When the Lexer has produced a token, does it add new identifiers to the symbol table? How does it know if this should be a SYMBOL_TYPE::FUNCTION or SYMBOL_TYPE::VARIABLE? Or does it just populate it with SYMBOL_TYPE::ID and allow the Parser to later update the entry once it has determined the identifier's usage?

I think it can't be left to the Parser because when a new scope is started by the occurrence of a {, the Lexer would need to add a new SymbolTable to a stack otherwise the Token it returns to the Parser would point to the wrong entry... But then the Lexer can't be expected to also add the type and value information to the Variables or Procedures arrays.

The Parser requests each Token in turn from the Lexer.

Question 3: What exactly is needed in this Token type that's returned? At the moment it's just the SYMBOL_TYPE + the index into the SymbolTable. The SymbolTable then links to the corresponding additional list for that SYMBOL_TYPE. But should it be something other than SYMBOL_TYPE, like TOKEN_TYPE, and allow the Parser to add to the SymbolTable rather than the Lexer? After all, the Lexer can't determine whether an identifier is a variable or a procedure. And if the Lexer needs to add them because of the different scopes, does the Parser then discard what the Lexer added and add its own?

I'm confused about who owns the creation and management of the SymbolTable and whether the identifiers in it should be the same as what's passed between the Lexer and Parser.

It seems to me like it would be simpler to have the Lexer only return Tokens which determine whether it's a known keyword/operator/punctuation (each with its own TOKEN_TYPE) + the string that makes it up (if needed), and have the Parser then add entries to the SymbolTable and subsequent arrays (like the Variables array) when it needs to. It would also see a { and add a new SymbolTable to the stack when necessary.

Am I misinterpreting the goal of a SymbolTable?

amon · Accepted Answer · 2015-11-22 01:26:29Z

Usually, the symbol table does not include all strings of the source code. Instead, the symbol table represents the static environment of a program, i.e. the visible variables and possibly their types. During parsing and compilation, the current state of the symbol table indicates which variables are visible.

Certain languages such as C++ can have symbols of different types – variables and type names. These would usually be kept in different symbol tables. However, this is a fairly rare case. Most implementations only track variable names in a symbol table.

As an example, assume this C-like program:

1: int i = 42;
2: {
3:   char c = 'a';
4:   print_int_char(i, c);
5: }

The symbol table starts with a print_int_char function: { "print_int_char": void (*) (int, char) }. Let's assume that this symbol table maps names to types. During parsing, the state is changed as follows:

Line 1: { "print_int_char": void (*) (int, char), "i": int } – the i symbol was added.
Line 2: { "print_int_char": void (*) (int, char), "i": int } – a nested scope is entered.
Line 3: { "print_int_char": void (*) (int, char), "c": char, "i": int } – the c symbol was added.
Line 4: { "print_int_char": void (*) (int, char), "c": char, "i": int } – the symbols print_int_char, c, and i are retrieved from the symbol table to facilitate type checking.
Line 5: { "print_int_char": void (*) (int, char), "i": int } – the nested scope is left, all symbols defined in that scope are discarded.

Operators, keywords, and syntax fragments such as curly braces are not part of the symbol table, which is used to implement language semantics, such as lexical scopes or static typing. When a variable is encountered that is not found in the symbol table, we can immediately issue a compiler error.

These properties make symbol tables particularly popular in single-pass languages such as C, but when every symbol must be found in the symbol table at parse time, we cannot have functions without a pre-declaration. Therefore, many more modern languages defer symbol table construction until the whole compilation unit has been parsed, and then do a second pass to assert that all symbols can be resolved and to do type checking.
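The two-pass idea can be sketched roughly as follows. This is only an illustration under simplifying assumptions: FunctionDecl and unresolved_calls are hypothetical names, and the "AST" is reduced to a list of functions and the names they call.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical, heavily simplified AST node: a function and the names it calls.
struct FunctionDecl {
    std::string name;
    std::vector<std::string> calls;
};

// Returns every called name that is not declared anywhere in the unit.
std::vector<std::string> unresolved_calls(const std::vector<FunctionDecl>& unit) {
    // Pass 1: collect all declared functions, regardless of their order.
    std::unordered_set<std::string> declared;
    for (const auto& fn : unit) declared.insert(fn.name);

    // Pass 2: each call must resolve, even if the callee is defined later.
    std::vector<std::string> missing;
    for (const auto& fn : unit)
        for (const auto& callee : fn.calls)
            if (declared.count(callee) == 0) missing.push_back(callee);
    return missing;
}

// Usage: main calls helper before helper is defined, which is fine here.
inline void two_pass_example() {
    const std::vector<FunctionDecl> unit = {{"main", {"helper"}}, {"helper", {}}};
    assert(unresolved_calls(unit).empty());
}
```

A single-pass design would have to reject the call to helper at the point where main is parsed, unless a pre-declaration had been seen first.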

Nested scopes can be implemented with two strategies:

The symbol table is copied when a new scope is entered. When the scope is left, the old symbol table is restored.
The symbol table is a linked list or stack of tables. During symbol resolution, the scope chain is walked upwards until the symbol is found, or the end of the list is reached. Each scope chain entry will only contain the symbols that were defined in that scope. Since it is more elegant and avoids unnecessary copies, this approach is generally more common.
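A minimal C++11 sketch of the second strategy, assuming the table only maps names to types (written as plain strings here). ScopeChain and its member names are made up for illustration and are not taken from any particular compiler.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical scope chain: a stack of per-scope tables mapping name -> type.
class ScopeChain {
public:
    ScopeChain() { enter_scope(); }  // the global scope

    void enter_scope() { scopes_.emplace_back(); }
    // Discards every symbol defined in the innermost scope.
    // Assumes a matching enter_scope() call happened before.
    void leave_scope() { scopes_.pop_back(); }

    // Declares a name in the innermost scope; returns false on redeclaration.
    bool declare(const std::string& name, const std::string& type) {
        return scopes_.back().emplace(name, type).second;
    }

    // Walks from the innermost scope outwards; nullptr means "undeclared".
    const std::string* lookup(const std::string& name) const {
        for (auto scope = scopes_.rbegin(); scope != scopes_.rend(); ++scope) {
            auto found = scope->find(name);
            if (found != scope->end()) return &found->second;
        }
        return nullptr;
    }

private:
    std::vector<std::unordered_map<std::string, std::string>> scopes_;
};

// Replays the five-line example program from above.
inline void replay_example() {
    ScopeChain table;
    table.declare("print_int_char", "void (*)(int, char)");
    table.declare("i", "int");                       // line 1
    table.enter_scope();                             // line 2
    table.declare("c", "char");                      // line 3
    assert(table.lookup("i") && table.lookup("c"));  // line 4
    table.leave_scope();                             // line 5
    assert(table.lookup("c") == nullptr);
}
```

Entering a scope costs one empty map and leaving it is a single pop, which is why no copying is needed. The price is that a lookup may have to visit several tables.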

How is the type known when a symbol is added?

Some languages do not have static type information, e.g. Python. Here, a symbol table would only track existence of symbols.
In other languages such as C, the precise type is always known at declaration: the statement int x; declares a variable x of type int.
In languages such as ML/OCaml/Haskell with sophisticated type inference rules, the type of a variable is not immediately known at declaration. However, the type is guaranteed to be known at the end of the compilation unit. Here, the symbol table would map variable names to globally unique type variables that will be assigned once their concrete types are known. There might be another table mapping type variables to types. Read up on Algorithm W to learn more about this.
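For the type-inference case, the two tables could be laid out roughly like this. It is a sketch of the data layout only: InferenceTables is a hypothetical name, types are plain strings, and the actual inference (unification as in Algorithm W) is left out entirely.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical layout: names map to type variables, and a second table
// maps type variables to concrete types once they have been determined.
class InferenceTables {
public:
    // Declares a name with a fresh, still unsolved type variable.
    std::size_t declare(const std::string& name) {
        const std::size_t typevar = solutions_.size();
        solutions_.push_back("");  // empty string = not inferred yet
        typevar_of_[name] = typevar;
        return typevar;
    }

    // Records the concrete type once it is known.
    void solve(std::size_t typevar, const std::string& type) {
        solutions_.at(typevar) = type;
    }

    // Returns the type of a declared name, or an empty string while unknown.
    const std::string& type_of(const std::string& name) const {
        return solutions_.at(typevar_of_.at(name));
    }

private:
    std::unordered_map<std::string, std::size_t> typevar_of_;
    std::vector<std::string> solutions_;
};

// Usage: the type of x is unknown at declaration and filled in later.
inline void inference_example() {
    InferenceTables tables;
    const std::size_t x = tables.declare("x");
    assert(tables.type_of("x").empty());
    tables.solve(x, "int");
    assert(tables.type_of("x") == "int");
}
```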

The symbol table is used as part of semantic analysis and not as part of lexical analysis. However, some languages or implementations might mix these. In particular, C was designed so that the whole compilation, including parsing, semantic analysis (e.g. type checking) and code generation, could happen in a single pass. The grammar of C++ requires that a symbol table is available during parsing to disambiguate types from variables. In other languages, it's not necessary that the parser creates a symbol table. Instead, the table could be generated by a later compiler pass from the resulting abstract syntax tree.

A lexer (if present) would never have to touch the symbol table. Lexers are only used as a simplification for parsers, but are never a necessary part of a language implementation. A lexer in a C-like language would usually return tokens such as <KW_FOR> or <IDENT:int> or <IDENT:i> rather than <TYPENAME:int> or <VARIABLE:i> which are then further categorized by a parser according to their position in a syntax. As an example why this is necessary, consider:

typedef int III;
III III = 42;
printf("%d", III);

Here, the symbol III refers to both a type and a variable, but these are in two different namespaces (= two different symbol tables). However, each reference to this symbol is unambiguous due to its syntactic context.
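A lexer of that kind might look roughly like the following C++11 sketch. The token categories, the keyword list and the function names are assumptions made for this example. The point is that it only consults a fixed keyword set, never a symbol table, and that every token owns a copy of its lexeme.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

enum class TokenType { Keyword, Identifier, Number, Punctuation };

struct Token {
    TokenType type;
    std::string text;  // the token owns a copy of its lexeme
};

// True for characters that may appear inside an identifier or keyword.
inline bool is_word_char(char c) {
    return std::isalnum(static_cast<unsigned char>(c)) != 0 || c == '_';
}

// Hypothetical lexer for a tiny C-like subset (no comments, strings, etc.).
std::vector<Token> tokenize(const std::string& source) {
    static const std::unordered_set<std::string> keywords = {"for", "if", "typedef", "while"};
    std::vector<Token> tokens;
    std::size_t pos = 0;
    while (pos < source.size()) {
        const char c = source[pos];
        if (std::isspace(static_cast<unsigned char>(c)) != 0) { ++pos; continue; }
        std::size_t end = pos + 1;
        TokenType type = TokenType::Punctuation;  // one character by default
        if (std::isdigit(static_cast<unsigned char>(c)) != 0) {
            while (end < source.size() &&
                   std::isdigit(static_cast<unsigned char>(source[end])) != 0) ++end;
            type = TokenType::Number;
        } else if (is_word_char(c)) {
            while (end < source.size() && is_word_char(source[end])) ++end;  // longest match
            type = keywords.count(source.substr(pos, end - pos)) != 0
                       ? TokenType::Keyword : TokenType::Identifier;
        }
        tokens.push_back(Token{type, source.substr(pos, end - pos)});
        pos = end;
    }
    return tokens;
}

// Usage: both occurrences of III come back as plain identifiers.
inline void lexer_example() {
    const std::vector<Token> tokens = tokenize("III III = 42;");
    assert(tokens.size() == 5);
    assert(tokens[0].type == TokenType::Identifier);
    assert(tokens[1].type == TokenType::Identifier);
}
```

Deciding which III is a type name and which is a variable is then left to the parser, which knows the syntactic position of each token.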

Thanks so much for the detailed answer. Follow-up questions: Where is the current value for each symbol stored? Alongside the type info in the SymbolTable? Second question, you have <IDENT:int>, which I assume would be used for a variable name as well e.g. <IDENT:myVariable>. Why wouldn't a known fundamental type like int be passed more like <KW_INT>? Does this have to do with types that could be structs or classes?
– D.G. Redd, Commented Nov 22, 2015 at 3:53
@NeomerArcana Yes, the C11 grammar treats int as a keyword, and the lexer determines token types from the symbol table. We have a rule type_specifier ::= VOID | CHAR | SHORT | INT | ... | struct_or_union_specifier | enum_specifier | TYPEDEF_NAME, where the lexer returns a TYPEDEF_NAME token instead of an IDENTIFIER if that symbol was found in the typedef symbol table. If I had designed the language, I would not have special-cased primitive types since it complicates the grammar, instead: type_spec ::= struct_spec | enum_spec | IDENT.
– amon, Commented Nov 22, 2015 at 9:56
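The feedback from the typedef table to the lexer described in that comment could be sketched like this. TypedefTable, record and classify are hypothetical names, and scoping of typedef names is ignored for brevity.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>

enum class TokenKind { Identifier, TypedefName };

// Hypothetical feedback table: the parser records each typedef name it has
// seen, and the lexer consults the table when classifying an identifier.
struct TypedefTable {
    std::unordered_set<std::string> names;

    void record(const std::string& name) { names.insert(name); }

    TokenKind classify(const std::string& lexeme) const {
        return names.count(lexeme) != 0 ? TokenKind::TypedefName
                                        : TokenKind::Identifier;
    }
};

// Usage: the same lexeme is classified differently before and after
// the typedef has been processed.
inline void typedef_example() {
    TypedefTable table;
    assert(table.classify("III") == TokenKind::Identifier);
    table.record("III");
    assert(table.classify("III") == TokenKind::TypedefName);
}
```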

Jörg W Mittag · Accepted Answer · 2015-11-22 10:26:30Z

For a refreshingly different take on how to structure a modern compiler, see Martin Odersky's talk Compilers are Databases at the JVM Language Summit 2015 about the design of the dotty compiler. (Dotty is a language intended to be an experimentation area for future directions of Scala; likewise, the dotty compiler is an experimentation area for future directions of the Scala compiler.)

The Scala compiler tries to be a single canonical one-stop-shop modern compiler for traditional batch compilation, incremental batch compilation, interactive compilation (REPL, workbook, etc.), IDEs (everything from syntax coloring to autocompletion to refactoring), macros (letting the user run user code in the compiler during compilation), reflection (letting the user run compiler code in the user program during runtime), documentation formatting, static analysis, linting, pretty printing, and a whole lot of other things.

The basic idea of the dotty compiler is that there is no mutable state. Everything is fully immutable and purely functional. This is achieved by taking ideas from purely functional (aka "temporal") databases. Data that in a traditional compiler would change over time (such as a symbol table) is instead represented as a pair of (timestamp, current_value), i.e. as values indexed by time. (They don't use actual time, though, rather a notion of time internal to the compiler, based on the run number and the compiler phase.)
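To make "values indexed by time" a little more concrete, here is a small sketch in C++11. It is not how dotty implements it: TimeIndexed is a made-up class, phases are plain integers, and the value is just a string.

```cpp
#include <cassert>
#include <iterator>
#include <map>
#include <string>

// Hypothetical time-indexed value: instead of overwriting a symbol's
// information, each update yields a new object with one more history entry.
class TimeIndexed {
public:
    // Returns a copy that additionally holds `value` from `phase` onwards.
    // An existing entry for the same phase is kept, never overwritten.
    TimeIndexed with(int phase, const std::string& value) const {
        TimeIndexed copy(*this);
        copy.history_.emplace(phase, value);
        return copy;
    }

    // Value as of `phase`: the latest entry whose phase is <= the requested
    // one, or nullptr if nothing was known yet at that time.
    const std::string* at(int phase) const {
        auto next = history_.upper_bound(phase);
        if (next == history_.begin()) return nullptr;
        return &std::prev(next)->second;
    }

private:
    std::map<int, std::string> history_;
};

// Usage: earlier phases still see the earlier value.
inline void time_indexed_example() {
    TimeIndexed none;
    const TimeIndexed parsed = none.with(1, "untyped");
    const TimeIndexed typed = parsed.with(3, "Int");
    assert(*typed.at(2) == "untyped");
    assert(*typed.at(3) == "Int");
}
```

Copying the whole history on every update is wasteful; a real implementation would presumably share structure between versions.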

In particular, this means that there is no symbol table. Instead, the role that the symbol table plays in a traditional compiler is split across multiple data structures, all of which are immutable; some are time-invariant, some are time-varying. These are References, Denotations, and Symbols; the discussion of References starts around 30:30, the discussion of Denotations around 34:26, and the discussion of Symbols around 37:30.
