This is part of a series of questions that focus on the sister project to the Abstraction Project, which aims to abstract the concepts used in language design in the form of a framework. The sister project is called OILexer, which aims to construct a parser from grammar files without the use of code injection on matches.

Some other pages associated with these questions, related to structural typing, can be viewed here, and ease of use, found here. The meta-topic associated with an inquiry about the framework and the proper place to post can be found here.

I'm getting to the point where I'm about to start extracting the parse tree from a given grammar, followed by a recursive-descent parser that uses DFAs to discern forward paths (similar to ANTLR 4's LL(*)), so I figured I'd open this up to get insight.
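
Since DFA-driven prediction is central to this approach, here's a minimal sketch of the idea; the token names, table layout, and alternative labels are hypothetical placeholders I invented for illustration, not OILexer's actual generated output.

```python
# Minimal sketch of DFA-driven prediction for a recursive-descent parser.
# Token names and alternatives are hypothetical, invented for illustration.
# The DFA maps state -> {token: next_state}; a tuple marks an accepting
# state naming the alternative the parser should commit to.
PREDICT_DFA = {
    0: {"ID": 1, "LPAREN": 2},
    1: {"ASSIGN": ("alt_assignment",), "LPAREN": ("alt_call",)},
    2: {"ID": ("alt_parenthesized",)},
}

def predict(tokens, pos):
    """Scan ahead from `pos` until the DFA resolves which alternative applies."""
    state = 0
    while True:
        move = PREDICT_DFA[state].get(tokens[pos])
        if move is None:
            raise SyntaxError(f"no viable alternative at {tokens[pos]!r}")
        if isinstance(move, tuple):   # accepting state reached
            return move[0]
        state, pos = move, pos + 1

# The parser consults predict() only where alternatives overlap, then
# descends into the chosen rule without further backtracking.
assert predict(["ID", "ASSIGN", "ID"], 0) == "alt_assignment"
assert predict(["ID", "LPAREN"], 0) == "alt_call"
```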

In a parser compiler, what kinds of features are ideal?

Here is a brief overview of what's implemented so far:

  1. Templates
  2. Look-ahead prediction: knowing what's valid at a given point.
  3. Rule 'deliteralization': taking the literals within rules and resolving which token they come from.
  4. Nondeterministic Automata
  5. Deterministic Automata
  6. Simple lexical state machine for token recognition
  7. Token automaton methods:
    • Scan - useful for comments: Comment := "/*" Scan("*/");
    • Subtract - useful for identifiers: Identifier := Subtract(IdentifierBody, Keywords);
      • Ensures the identifier doesn't accept keywords (see the set-difference sketch after this list).
    • Encode - encodes an automaton as a series of X base-N transitions (see the second sketch after this list).
      • UnicodeEscape := "\\u" BaseEncode(IdentifierCharNoEscape, 16, 4);
        • Makes a Unicode escape in hexadecimal, with four hex transitions. The difference between this and [0-9A-Fa-f]{4} is that the automaton Encode produces limits the allowed hexadecimal values to the scope of IdentifierCharNoEscape. So if you give it \u005C, the Encode version will not accept the value. Things like this come with a serious caveat: use sparingly, as the resulting automaton could be quite complex.
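
To make Subtract concrete, below is a minimal sketch of the underlying set-difference (product) construction on DFAs; the tuple/dict DFA encoding is something I invented for the example, not OILexer's internal representation, and both inputs are assumed to be complete DFAs.

```python
# Minimal sketch of Subtract as a product construction on complete DFAs:
# the result accepts a string iff `a` accepts it and `b` does not.
# DFAs here are (start, delta, accepting) with delta[state][symbol] total
# over the alphabet (use a dead state to complete partial machines).
def subtract(a, b, alphabet):
    a_start, a_delta, a_accept = a
    b_start, b_delta, b_accept = b
    start = (a_start, b_start)
    delta, accept, todo = {}, set(), [start]
    while todo:
        state = todo.pop()
        if state in delta:
            continue                      # already expanded this pair
        sa, sb = state
        if sa in a_accept and sb not in b_accept:
            accept.add(state)             # accepted by a, rejected by b
        delta[state] = {}
        for sym in alphabet:
            nxt = (a_delta[sa][sym], b_delta[sb][sym])
            delta[state][sym] = nxt
            todo.append(nxt)
    return start, delta, accept
```

In the Identifier example, `a` would be the IdentifierBody machine and `b` the Keywords machine: every keyword drives both machines into accept states simultaneously, so those pairs drop out of the result's accept set.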

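And here's a behavioral sketch of what BaseEncode appears to do, judging by the description above: accept exactly the X-digit base-N strings that spell a code point inside the given character set. A real implementation would compile this into a digit-by-digit automaton; the direct membership test below is just the simplest equivalent formulation, and the character set is a hypothetical stand-in for IdentifierCharNoEscape.

```python
import string

# Behavioral sketch of BaseEncode as described above: accept exactly the
# X-digit base-N strings spelling a code point inside a character set.
def base_encode_acceptor(char_set, base, digits):
    def accepts(s):
        if len(s) != digits:
            return False
        try:
            code_point = int(s, base)
        except ValueError:                # not a valid base-N numeral
            return False
        return chr(code_point) in char_set
    return accepts

# Hypothetical stand-in for IdentifierCharNoEscape: ASCII letters/digits.
ident_chars = set(string.ascii_letters + string.digits)
unicode_escape = base_encode_acceptor(ident_chars, 16, 4)
assert unicode_escape("0041")         # 'A' is in the set
assert not unicode_escape("005c")     # '\' is excluded, unlike [0-9A-Fa-f]{4}
```
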
What isn't implemented is CST generation; I need to adjust the deterministic automata to carry over the proper context to get this working.
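
As a purely hypothetical illustration of what "carrying over the proper context" might look like (my guess, not the project's actual design): tag each accept state with the grammar element it completes, so matches can be folded directly into CST nodes.

```python
# Hypothetical illustration only: tagging DFA accept states with the rule
# they complete, so a match can be turned into a CST node. All names here
# are invented; this is not OILexer's actual design.
from dataclasses import dataclass, field

@dataclass
class CstNode:
    rule: str                          # grammar element that produced the node
    text: str = ""
    children: list = field(default_factory=list)

# accept_tags maps accept states to the rule context they carry.
accept_tags = {3: "Identifier", 7: "Keyword"}

def on_accept(state, matched_text, pending_children):
    """Fold a completed match into a CST node using the state's rule tag."""
    return CstNode(rule=accept_tags[state], text=matched_text,
                   children=pending_children)
```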

For anyone interested, I've uploaded a pretty-printed version of the original form of the T*y♯ project. Each file should link to every other file. I started linking in the individual rules so they can be followed, but it would've taken far too long (it would've been simpler to automate!).

If more context is needed, please post accordingly.

Edit 5-14-2013: I've written code to create GraphViz graphs for the state machines within a given language. Here is a GraphViz digraph of the AssemblyPart. The members linked in the language description should have a rulename.txt in their relative folder with the digraph for that rule. Some of the language description has changed since I posted the example; this is due to simplifying parts of the grammar. Here's an interesting GraphViz image.
