I'm working on creating my own scripting language for learning purposes. I've been reading through the Dragon Book and some things are a little unclear to me regarding the symbol table: where the strings that make up an identifier actually reside, and who owns them. I think most of this confusion is due to the book using older techniques, whereas I'm looking to use C++11.
As I understand it:
- The SymbolTable would be a hash_map of some kind using the lexemes as the keys, i.e. string types. I've determined on my own that its value would then be a tuple containing the SYMBOL_TYPE and a pointer/reference/lookup to a second array that contains the additional info needed for this SYMBOL_TYPE.

  I.e. an entry in the SymbolTable could be <SYMBOL_TYPE::VARIABLE, 3>, which would mean that the Variables array at index 3 would hold this variable's type + value + attributes.

  I would then do the same for other identifiers like procedures etc.

- The Lexer goes through characters until it has produced the longest possible lexeme each time the next Token is requested.

  Question 1: Where do these pre-existing lexemes reside? For example, if the symbol table is pre-populated with keywords, punctuation, operators etc., who pre-populates it? Does the SymbolTable pre-populate itself with a list of keywords + punctuation + operators? Would I then just attempt to match each entry in the symbol table (starting with the longest first)?

  Question 2: When the Lexer has produced a token, does it add new identifiers to the symbol table? How does it know if this should be a SYMBOL_TYPE::FUNCTION or a SYMBOL_TYPE::VARIABLE? Or does it just populate it with SYMBOL_TYPE::ID and allow the Parser to later update the entry once it has determined the identifier's usage?

  I think it can't be left to the Parser, because when a new scope is started by the occurrence of a {, the Lexer would need to add a new SymbolTable to a stack, otherwise the Token it returns to the Parser would point to the wrong entry... But then the Lexer can't be expected to also add the type and value information to the Variables or Procedures arrays.

- The Parser requests each Token in turn from the Lexer.

  Question 3: What exactly is needed in this Token type that's returned? At the moment it's just the SYMBOL_TYPE + the index into the SymbolTable. The SymbolTable then links to the corresponding additional list for that SYMBOL_TYPE. But should it be something other than SYMBOL_TYPE, like TOKEN_TYPE, allowing the Parser to add to the SymbolTable rather than the Lexer? After all, the Lexer can't determine whether an identifier is a variable or a procedure. And if the Lexer has to add identifiers because of the different scopes, does the Parser then discard what the Lexer added and add its own?
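
To make the layout from the first bullet concrete, here is a rough sketch of what I have in mind. Only SYMBOL_TYPE and the <type, index> idea are from my description above; SymbolEntry, VariableInfo, declareVariable and the other names are placeholders I made up for illustration, and I'm assuming std::unordered_map as the hash_map.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

enum class SYMBOL_TYPE { VARIABLE, FUNCTION };

// Hypothetical contents of one slot in the Variables array.
struct VariableInfo {
    std::string type;
    std::string value;
};

// The "tuple" stored in the table: which kind of symbol, plus an index
// into the secondary array that matches that kind.
struct SymbolEntry {
    SYMBOL_TYPE kind;
    std::size_t index;
};

class SymbolTable {
public:
    // Appends the info to the Variables array and maps the name to
    // <VARIABLE, index>. Redeclaring a name simply remaps it; the old
    // slot is left in place (good enough for a sketch).
    std::size_t declareVariable(const std::string& name, const VariableInfo& info) {
        variables_.push_back(info);
        const std::size_t index = variables_.size() - 1;
        entries_[name] = SymbolEntry{SYMBOL_TYPE::VARIABLE, index};
        return index;
    }

    // Returns nullptr when the name is not in the table.
    const SymbolEntry* find(const std::string& name) const {
        auto it = entries_.find(name);
        return it == entries_.end() ? nullptr : &it->second;
    }

    const VariableInfo& variable(std::size_t index) const {
        assert(index < variables_.size());
        return variables_[index];
    }

private:
    // The map's keys own a copy of each lexeme string.
    std::unordered_map<std::string, SymbolEntry> entries_;
    std::vector<VariableInfo> variables_;
};
```

In this version the table's keys own the identifier strings, so a Token would only need to carry a copy of the name (or a pointer to the entry) rather than owning anything itself. Whether that is the right ownership model is part of what I'm asking.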
I'm confused about who owns the creation and management of the SymbolTable and whether the identifiers in it should be the same as what's passed between the Lexer and Parser.
It seems to me like it would be simpler to have the Lexer only return Tokens that say whether something is a known keyword/operator/punctuation (each with its own TOKEN_TYPE) + the string that makes it up (if needed), and have the Parser add entries to the SymbolTable and the subsequent arrays (like the Variables array) when it needs to. The Parser would also see a { and add a new SymbolTable to the stack when necessary.
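
Here is a minimal sketch of that split, just to show what I mean. Everything in it is an assumption for illustration: the keyword set ("var", "func"), the ScopeStack class, and the findUndeclared function are made-up names, the "parser" only understands `var name` declarations and braces, and anything that isn't a word or a brace is lumped into a single-character OTHER token.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

enum class TOKEN_TYPE { KEYWORD, IDENTIFIER, LBRACE, RBRACE, OTHER };
enum class SYMBOL_TYPE { VARIABLE, FUNCTION };

struct Token {
    TOKEN_TYPE type;
    std::string text;  // each token owns a copy of its lexeme
};

// Lexer side: knows the keywords, knows nothing about scopes or symbol kinds.
std::vector<Token> tokenize(const std::string& src) {
    static const std::unordered_set<std::string> keywords = {"var", "func"};
    std::vector<Token> out;
    std::size_t i = 0;
    while (i < src.size()) {
        const unsigned char c = static_cast<unsigned char>(src[i]);
        if (std::isspace(c)) { ++i; continue; }
        if (std::isalpha(c) || c == '_') {
            // Consume the longest run of identifier characters.
            const std::size_t start = i;
            while (i < src.size() &&
                   (std::isalnum(static_cast<unsigned char>(src[i])) || src[i] == '_')) {
                ++i;
            }
            const std::string word = src.substr(start, i - start);
            const TOKEN_TYPE t = keywords.count(word) ? TOKEN_TYPE::KEYWORD
                                                      : TOKEN_TYPE::IDENTIFIER;
            out.push_back(Token{t, word});
            continue;
        }
        const TOKEN_TYPE t = c == '{' ? TOKEN_TYPE::LBRACE
                           : c == '}' ? TOKEN_TYPE::RBRACE
                                      : TOKEN_TYPE::OTHER;
        out.push_back(Token{t, std::string(1, src[i])});
        ++i;
    }
    return out;
}

// Parser side: a stack of per-scope tables, innermost scope at the back.
class ScopeStack {
public:
    ScopeStack() : scopes_(1) {}  // start with the global scope
    void push() { scopes_.emplace_back(); }
    void pop() { assert(scopes_.size() > 1); scopes_.pop_back(); }
    std::size_t depth() const { return scopes_.size(); }
    void declare(const std::string& name, SYMBOL_TYPE kind) { scopes_.back()[name] = kind; }

    // Searches from the innermost scope outwards; nullptr if not found.
    const SYMBOL_TYPE* lookup(const std::string& name) const {
        for (auto it = scopes_.rbegin(); it != scopes_.rend(); ++it) {
            auto found = it->find(name);
            if (found != it->end()) return &found->second;
        }
        return nullptr;
    }

private:
    std::vector<std::unordered_map<std::string, SYMBOL_TYPE>> scopes_;
};

// Toy "parser": opens/closes scopes on braces, declares on `var name`,
// and collects identifiers used without a visible declaration.
std::vector<std::string> findUndeclared(const std::vector<Token>& tokens) {
    ScopeStack scopes;
    std::vector<std::string> undeclared;
    for (std::size_t i = 0; i < tokens.size(); ++i) {
        const Token& t = tokens[i];
        if (t.type == TOKEN_TYPE::LBRACE) {
            scopes.push();
        } else if (t.type == TOKEN_TYPE::RBRACE) {
            if (scopes.depth() > 1) scopes.pop();  // ignore unbalanced '}'
        } else if (t.type == TOKEN_TYPE::KEYWORD && t.text == "var" &&
                   i + 1 < tokens.size() &&
                   tokens[i + 1].type == TOKEN_TYPE::IDENTIFIER) {
            scopes.declare(tokens[++i].text, SYMBOL_TYPE::VARIABLE);
        } else if (t.type == TOKEN_TYPE::IDENTIFIER && !scopes.lookup(t.text)) {
            undeclared.push_back(t.text);
        }
    }
    return undeclared;
}
```

With this arrangement the Lexer never touches the SymbolTable at all: calling findUndeclared(tokenize("var a { var b a b } b")) would have the Parser side declare a globally and b inside the braces, and only the Parser side decides what kind of symbol each identifier is.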
Am I misinterpreting the goal of a SymbolTable?