LRSTAR: LR(*) parser generator for C++

A.M.D.G.

Home Downloads Feedback Comparison Theoretical Documentation Contact
Acknowledgments Installation LRSTAR DFA Definitions

DFA 11.0 User's Manual

Section 1. Introduction

Introduction.

DFA is a lexer generator which creates small and very fast lexers, which use a compressed-matrix structure. Only direct-code structures are faster, but not much faster. Compressed-matrix structures are smaller and compile faster. DFA is designed to do keyword recognition and <identifier> recognition at the same time. No separate function is needed to see if an <identifier> is a keyword.

DFA works with LRSTAR, which assigns token numbers when reading the .grm file. LRSTAR writes the .lex which contains the token numbers. DFA reads the .lex first and then the .lgr file. This makes life easy for you, because you don't need to use defined constantd (e.g. IF, ELSE, SEMI) in your .grm file. You can use the exact representation (e.g. if, else, ';') instead. Less trouble and easier for other people. Not everybody knows what LCURLY means, but everybody knows what '{' means.


Section 2. Program Options

DFA 11.0.001 32b Copyright 2018 Paul B Mann.
|
|   DFA LEXER GENERATOR
|                               
|   dfa <grammar> [(/ or -)<option>...]
|
|   OPTION  DEFAULT  DESCRIPTION
|   c           1    Conflict report (all conflicts)
|   crr         0    Conflict report for Reduce-Reduce conflicts
|   csr         0    Conflict report for Shift-Reduce conflicts
|   d           0    Debug option for generated lexer
|   g           0    Grammar listing
|   m           1    Minimize lexer-table size
|   st          0    State machine for conflicts report
|   sto         0    State machine optimized
|   v           0    Verbose mode
|   vv          0    Verbose mode (more verbose)
|   w           1    Print warnings on screen
|_

'c' option: Conflict report for all conflicts.

This option tells DFA to report all conflicts, shift-reduce and reduce-reduce conflicts.

'crr' option: Conflict report for reduce-reduce conflicts.

This option tells DFA to report only RR conflicts.

'csr' option: Conflict report for shift-reduce conflicts.

This option tells DFA to report SR conflicts.

'd' option: Debug option.

This option tells DFA to put a defined constant in the lexer, which will make the lexer print the input file on the screen and in the output file.

'g' option: Create a grammar listing file.

This option tells DFA to generate a grammar file: ?.lgr.grammar.txt.

'm' option: Minimize lexer table size.

This option tells DFA to minimize the size of the lexer tables.

'st' option: State machine listing.

This option tells DFA to list the original unoptimized state machine. This state machine is the one that you might want to look at when trying to figure out why your grammar has a conflict. This state machine listing can be very large.

'sto' option: State machine optimized listing.

This option tells DFA to list the optimized state machine. The optimized state machine is much smaller, but the state numbers in the optimized one do not correspond to the state numbers in the conflict report.

'v' option: Verbose mode.

This option tells DFA to display more information on the screen.

'vv' option: Verbose mode (more verbose).

This option tells DFA to display more information on the screen (more than the vv option).

'w' option: Warnings messages option.

This option tells DFA to display warning messages on the screen. The warning messages are always listed in the 'warnings.txt' file.


Section 3. Command-Line Syntax.

DFA is a command-line (or console-mode) program. The command-line syntax is:

dfa <grammar> [<option>]...

dfa   is the program name.
<grammar>   is the input grammar, (e.g. 'calc' or 'calc.lgr'). Must have a filetype of '.lgr'.
[<option>]...   means zero or more program options.

An <option> must start with a '/' or '-' or '!'. '/' means yes, '-' means yes, '!' means no, (e.g. !st and /st=0 means no state machine).

Here are some examples:

dfa ansic
dfa ansic /v /c /d /st
dfa ansic /vv /m !g !st

Section 4. Output Files from Lexer Generation.

DFA is capable of creating five output files when generating a lexer. The information in the files depends on what options you specify. For example, if your lexical-grammar is calc.lgr, you will get these five output files:

calc.lgr.conflicts.txt   - A conflicts file, showing the conflicts in your grammar.
calc.lgr.grammar.txt   - A grammar file, showing your grammar and the rule numbers.
calc.lgr.log.txt   - A log file, showing a log of what happened.
calc.lgr.states.txt   - A state-machine file, showing the lexer state machine.
calc.lgr.warnings.txt   - A file with warning messages.


Section 5. Lexical-Grammar Symbols

Lexical-Grammar Symbols

Lexical-grammar symbols are the "words" of your grammar which define the tokens of your language. Here is a list of the different types of symbols allowed in a lexical grammar:

Alpha Symbol        letter, digit, any, neol, not_asterisk

Lexical Symbol      <identifier>, <number>, <eof>, <string>

Ignore Symbol       {whitespace}, {ommentline}, {commentblock}

Literal Symbol      'program', 'if', 'else', 'while', ' ', '->', '"', '1', '9'

Integer (0..255)    0, 127, 32, 255, 26, 10, 13

Escape Symbol       \n, \z, \t, \r, \b, \f

Note 1. ''' is a single quote and '\'' is not valid. This makes for better readability. A backslash is '\', a double quote is '"'.

Note 2. You create your own escape symbols. There are no predefined escape symbols. You may include escape symbols that are not used in your lexical grammar and they will be ignored (no error message given).

Note 3. Ignore symbols, e.g. {whitespace} and {comment}, are bypassed and not given to the parser.


Section 6. Lexical-Grammar Syntax

Grammar Operators and Punctuators

Grammar operator and punctuators are used to further define the syntax of your computer language. Here is a list of the different types of operators and punctuators allowed in a grammar:

Lexical Grammar Notation

->    means "is a" (for a rule which defines a token).
:     means "is a" (for a the first rule which defines a token).
|     means "alternate choice" (another rule or character expression (e.g. 'a'..'z')).
;     means "end of rules" for this token. 

+     means "one or more occurrences".
*     means "zero or more occurrences".
?     means "optional occurrence".

(     means "group start".
)     means "group end".

=     means "character-set definition" which is different from a rule definition.
..    means "up to and including", a range of characters, (e.g 'A'..'Z').
-     means "subtract" the next character or range from the set being defined.

Comments

Comments are allowed in the same style as C++. Line comments start with '//' and end with the end-of- line character, for example:

// C++ Grammar.
// by John Smith.
// in August 2007.

Block comments start with '/*' and end with '*/'. These cannot be nested (one inside of another). Here is an example:

/* Removed 10-15-04 NKP
-> CREATE TableName UniqueStuff ';' 
*/

Section 7. Lexical Input Files

When using both LRSTAR and DFA, you only need to create an .lgr file, which contains the rules for making the tokens, as follows:

/* Token Rules  */

   <identifier>         -> letter (letter|digit)*
   <integer>            -> digit+
   <eof>                -> \z

   {whitespace}         -> (\t | \n | ' ')+
   {commentline}        -> '/' '/' neol*
   {commentblock}       -> '/' '*' na* '*'+ (nans na* '*'+)* '/'

/* Character Sets */

   letter               = 'a'..'z' | 'A'..'Z' | '_'
   digit                = '0'..'9'

   any                  = 0..255 - \z       // any character except EOF
   na                   = any - '*'         // not asterisk
   nans                 = any - '*' - '/'   // not asterisk not slash
   neol                 = any - \n          // not end of line

   \t                   =  9                // tab
   \n                   = 10                // newline
   \f                   = 12                // form feed
   \r                   = 13                // return
   \z                   = 26                // end of file

/* End */

If you are using DFA only, to make a lexer, you will need to add the list of tokens at the top of the .lgr file as follows:

/* Token List */
                  		
   <eof>             EOFCODE        
   <identifier>      IDENTIFIER
   <integer>         INTEGER

   'else'            ELSE
   'endif'           ENDIF
   'if'              IF
   'program'         PROGRAM
   'then'            THEN

   '!='              NOTEQ
   '('               LP
   ')'               RP
   '*'               MUL
   '+'               PLUS
   '-'               MINUS
   '/'               DIV
   ';'               SEMI
   '='               EQ
   '=='              EQTO
   '{'               LB
   '}'               RB

/* Token Rules */

   <identifier>         -> letter (letter|digit)*
   <integer>            -> digit+
   <eof>                -> \z

   {whitespace}         -> (\t | \n | ' ')+
   {commentline}        -> '/' '/' neol*
   {commentblock}       -> '/' '*' na* '*'+ (nans na* '*'+)* '/'

/* Character Sets */

   letter               = 'a'..'z' | 'A'..'Z' | '_'
   digit                = '0'..'9'

   any                  = 0..255 - \z       // any character except EOF
   na                   = any - '*'         // not asterisk
   nans                 = any - '*' - '/'   // not asterisk not slash
   neol                 = any - \n          // not end of line

   \t                   =  9                // tab
   \n                   = 10                // newline
   \f                   = 12                // form feed
   \r                   = 13                // return
   \z                   = 26                // end of file

/* End */

Section 8. Lexical Defined Constants

When using both LRSTAR and DFA and you really need the defined constants (for use in your code), then you should put them at the top of the .grm file as shown below:

IDENTIFIER     <identifier>;
CONSTANT       <constant>;
STRING_LITERAL <string_literal>;

TYPE_NAME      {type_name};

AUTO           'auto';
BOOL           '_Bool';
BREAK          'break';
CASE           'case';
CHAR           'char';
COMPLEX        '_Complex';
CONST          'const';
CONTINUE       'continue';
DEFAULT        'default';
DO             'do';
DOUBLE         'double';
ELSE           'else';
ENUM           'enum';
EXTERN         'exterm';
FLOAT          'float';
FOR            'for';
GOTO           'goto';
IF             'if';
IMAGINARY      '_Imaginary';
INLINE         'inline';
INT            'int';
LONG           'long';
REGISTER       'register';
RETURN         'return';
RESTRICT       'restrict';
SHORT          'short';
SIGNED         'signed';
SIZEOF         'sizeof';
STATIC         'static';
STRUCT         'struct';
SWITCH         'switch';
TYPEDEF        'typedef';
UNION          'union';
UNSIGNED       'unsigned';
VOID           'void';
VOLATILE       'volatile';
WHILE          'while';

PTR_OP         '->';
INC_OP         '++';
DEC_OP         '--';
LEFT_OP        '<<';
RIGHT_OP       '>>';
LE_OP          '<=';
GE_OP          '>=';
EQ_OP          '==';
NE_OP          '!=';
AND_OP         '&&';
OR_OP          '||';
MUL_ASSIGN     '*=';
DIV_ASSIGN     '/=';
MOD_ASSIGN     '%=';
ADD_ASSIGN     '+=';
SUB_ASSIGN     '-=';
LEFT_ASSIGN    '<<=';
RIGHT_ASSIGN   '>>=';
AND_ASSIGN     '&=';
XOR_ASSIGN     '^=';
OR_ASSIGN      '|=';
ELLIPSIS       '...';
(c) Copyright Paul B Mann 2020.  All rights reserved.