This is part of a series of questions that focus on the sister project to the Abstraction Project, which aims to abstract the concepts used in language design in the form of a framework. The sister project is called OILexer, which aims to construct a parser from grammar files without the use of code injection on matches.

Some other pages associated with these questions, related to structural typing, can be viewed here, and ease of use, found here. The meta-topic associated with an inquiry about the framework and the proper place to post can be found here.

I'm getting to the point where I'm about to start extracting the parse tree from a given grammar, followed by a recursive-descent parser that uses DFAs to discern forward paths (similar to ANTLR 4's LL(*)), so I figured I'd open this up to get insight.
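
Since DFA-driven prediction is central to this approach, here's a minimal sketch of the idea; the token names, table layout, and alternative labels are hypothetical placeholders I invented for illustration, not OILexer's actual generated output.

```python
# Minimal sketch of DFA-driven prediction for a recursive-descent parser.
# Token names and alternatives are hypothetical, invented for illustration.
# The DFA maps state -> {token: next_state}; a tuple marks an accepting
# state naming the alternative the parser should commit to.
PREDICT_DFA = {
    0: {"ID": 1, "LPAREN": 2},
    1: {"ASSIGN": ("alt_assignment",), "LPAREN": ("alt_call",)},
    2: {"ID": ("alt_parenthesized",)},
}

def predict(tokens, pos):
    """Scan ahead from `pos` until the DFA resolves which alternative applies."""
    state = 0
    while True:
        move = PREDICT_DFA[state].get(tokens[pos])
        if move is None:
            raise SyntaxError(f"no viable alternative at {tokens[pos]!r}")
        if isinstance(move, tuple):   # accepting state reached
            return move[0]
        state, pos = move, pos + 1

# The parser consults predict() only where alternatives overlap, then
# descends into the chosen rule without further backtracking.
assert predict(["ID", "ASSIGN", "ID"], 0) == "alt_assignment"
assert predict(["ID", "LPAREN"], 0) == "alt_call"
```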

In a parser compiler, what kinds of features are ideal?

Here is a brief overview of what's implemented so far:

  1. Templates
  2. Look-ahead prediction: knowing what's valid at a given point.
  3. Rule 'deliteralization': taking the literals within rules and resolving which token they come from.
  4. Nondeterministic Automata
  5. Deterministic Automata
  6. Simple lexical state machine for token recognition
  7. Token automaton methods:
    • Scan - useful for comments: Comment := "/*" Scan("*/");
    • Subtract - useful for identifiers: Identifier := Subtract(IdentifierBody, Keywords);
      • Ensures the identifier doesn't accept keywords (see the set-difference sketch after this list).
    • Encode - encodes an automaton as a series of X base-N transitions (see the second sketch after this list).
      • UnicodeEscape := "\\u" BaseEncode(IdentifierCharNoEscape, 16, 4);
        • Makes a Unicode escape in hexadecimal, with four hex transitions. The difference between this and [0-9A-Fa-f]{4} is that the automaton Encode produces limits the allowed hexadecimal values to the scope of IdentifierCharNoEscape. So if you give it \u005C, the Encode version will not accept the value. Things like this come with a serious caveat: use sparingly, as the resulting automaton could be quite complex.
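
To make Subtract concrete, below is a minimal sketch of the underlying set-difference (product) construction on DFAs; the tuple/dict DFA encoding is something I invented for the example, not OILexer's internal representation, and both inputs are assumed to be complete DFAs.

```python
# Minimal sketch of Subtract as a product construction on complete DFAs:
# the result accepts a string iff `a` accepts it and `b` does not.
# DFAs here are (start, delta, accepting) with delta[state][symbol] total
# over the alphabet (use a dead state to complete partial machines).
def subtract(a, b, alphabet):
    a_start, a_delta, a_accept = a
    b_start, b_delta, b_accept = b
    start = (a_start, b_start)
    delta, accept, todo = {}, set(), [start]
    while todo:
        state = todo.pop()
        if state in delta:
            continue                      # already expanded this pair
        sa, sb = state
        if sa in a_accept and sb not in b_accept:
            accept.add(state)             # accepted by a, rejected by b
        delta[state] = {}
        for sym in alphabet:
            nxt = (a_delta[sa][sym], b_delta[sb][sym])
            delta[state][sym] = nxt
            todo.append(nxt)
    return start, delta, accept
```

In the Identifier example, `a` would be the IdentifierBody machine and `b` the Keywords machine: every keyword drives both machines into accept states simultaneously, so those pairs drop out of the result's accept set.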

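And here's a behavioral sketch of what BaseEncode appears to do, judging by the description above: accept exactly the X-digit base-N strings that spell a code point inside the given character set. A real implementation would compile this into a digit-by-digit automaton; the direct membership test below is just the simplest equivalent formulation, and the character set is a hypothetical stand-in for IdentifierCharNoEscape.

```python
import string

# Behavioral sketch of BaseEncode as described above: accept exactly the
# X-digit base-N strings spelling a code point inside a character set.
def base_encode_acceptor(char_set, base, digits):
    def accepts(s):
        if len(s) != digits:
            return False
        try:
            code_point = int(s, base)
        except ValueError:                # not a valid base-N numeral
            return False
        return chr(code_point) in char_set
    return accepts

# Hypothetical stand-in for IdentifierCharNoEscape: ASCII letters/digits.
ident_chars = set(string.ascii_letters + string.digits)
unicode_escape = base_encode_acceptor(ident_chars, 16, 4)
assert unicode_escape("0041")         # 'A' is in the set
assert not unicode_escape("005c")     # '\' is excluded, unlike [0-9A-Fa-f]{4}
```
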
What isn't implemented is CST generation; I need to adjust the deterministic automata to carry over the proper context to get this working.
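
As a purely hypothetical illustration of what "carrying over the proper context" might look like (my guess, not the project's actual design): tag each accept state with the grammar element it completes, so matches can be folded directly into CST nodes.

```python
# Hypothetical illustration only: tagging DFA accept states with the rule
# they complete, so a match can be turned into a CST node. All names here
# are invented; this is not OILexer's actual design.
from dataclasses import dataclass, field

@dataclass
class CstNode:
    rule: str                          # grammar element that produced the node
    text: str = ""
    children: list = field(default_factory=list)

# accept_tags maps accept states to the rule context they carry.
accept_tags = {3: "Identifier", 7: "Keyword"}

def on_accept(state, matched_text, pending_children):
    """Fold a completed match into a CST node using the state's rule tag."""
    return CstNode(rule=accept_tags[state], text=matched_text,
                   children=pending_children)
```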

For anyone interested, I've uploaded a pretty-printed version of the original form of the T*y♯ project. Each file should link to every other file. I started linking in the individual rules so they can be followed, but it would've taken far too long (it would've been simpler to automate!).

If more context is needed, please post accordingly.

Edit 5-14-2013: I've written code to create GraphViz graphs for the state machines within a given language. Here is a GraphViz digraph of the AssemblyPart. The members linked in the language description should have a rulename.txt in their relative folder with the digraph for that rule. Some of the language description has changed since I posted the example; this is due to simplifying parts of the grammar. Here's an interesting GraphViz image.
