This repository has been archived by the owner on Nov 26, 2023. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 11
/
Report
167 lines (131 loc) · 9.02 KB
/
Report
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
Step I: Scanner
The scanner opens a file written with valid symbols in the tiny grammar, and writes a file containing a space separated tokenized version of the input to stdout.
The scanner implements a DFA, similar to the one found in the Louden text, that has been adjusted to follow the grammar rules specified in the assignment. Internally, the program scans a line of text a character at a time. The staring character specifies the type of token that will be returned, either control, identification, or number. It continues adding characters to a buffer until the max token length has been reached, which is an error, or a character of a different type is found. This buffer is matched to known tokens, control symbols, and the ID and INTEGER type. The correct token is then written to stdout. This process continues until the end of the input file is reached, or an error state is encountered.
Step II: Parser Generator
First let me go into the assumptions we made for the parser generator:
* EPSILON is what we used for the epsilon symbol.
* $ is a reserved symbol and cannot be in the input grammar or an error will be thrown.
* Input is case sensitive i.e. different case = different symbol
* Nonterminal rules do NOT have to have <> around them; they only have to be declared in the nonterminal section.
* Nonterminals cannot contain a space inside <> i.e. you can't have <EXP LIST> as a nonterminal
The example we will use for the explanations is a simple grammar describing a list separated by semicolons:
<EXP> : ( <INNERLIST> ) | ( )
<INNERLIST> : <INNERLIST> ; INT | INT
If you want to check each of these steps for some other langauge, you can run the parser generator in verbose mode (see README).
Step II(a): Removal of Left Recursion and Common Prefixes
For this step we transfered the pseudocode from the book to a more object oriented syntax. The parser
generator (ParserGenerator.java) has functions removeSelfLeftRecursion and removeCommonPrefixes which
iterates over each nonterminal grammar rule (NonterminalRule.java) and calls their encapsulated functions
of the same name (removeSelfLeftRecursion and removeCommonPrefixes). Within each nonterminal rule
rucursion removal and common prefix removal is executed and the functions return a new rule if one was
created (ie if run on A we return A'). The parser then adds the new rule-prime to the nonterminal
rule list and continues.
Let us consider what would happen if we ran parserGenerator.removeSelfRecursion() on our example grammer.
parserGenerator would iterate through each nonterminal grammar rule, nonterm, calling nonterm.removeSelfRecursion().
parserGenerator would first call removeSelfRecursion on <EXP>.
Since there is no self-left-recursion in <EXP>
it would return null because we don't need <EXP'>.
parserGenerator would then call removeSelfRecursion on <INNERLIST>.
This would adjust <INNERLIST>'s rule to be INT <INNERLIST'>
and would return <INNERLIST'>: ; INT <INNERLIST'> | EPSILON
which would be added to our rules by parserGenerator.
parserGenerator would first call removeSelfRecursion on <INNERLIST'>.
Since there is no self-left-recursion in <INNERLIST'>
it would return null because we don't need <INNERLIST''>.
Our new grammar rules are:
<EXP>: ( <INNERLIST> ) | ( )
<INNERLIST>: INT <INNERLIST'>
<INNERLIST'>: ; INT <INNERLIST'> | EPSILON
Let us consider what would happen if we ran parserGenerator.removeCommonPrefixes() on our example grammer:
parserGenerator would iterate through each nonterminal grammar rule, nonterm, calling nonterm.removeCommonPrefixes().
parserGenerator would first call removeCommonPrefixes on <EXP>.
This would adjust <EXP>'s rule to be ( <EXP'>
and would return <EXP'>: <INNERLIST> ) | )
which would be added to our rules by parserGenerator.
parserGenerator would then call removeCommonPrefixes on <EXP'>.
Since there are no common prefixes in <EXP'>
it would return null because we don't need <EXP''>.
parserGenerator would then call removeCommonPrefixes on <INNERLIST>.
Since there are no common prefixes in <INNERLIST>
it would return null because we don't need <INNERLIST'>.
parserGenerator would first call removeCommonPrefixes on <INNERLIST'>.
Since there are no common prefixes in <INNERLIST'>
it would return null because we don't need <INNERLIST''>.
Our new grammar rules are:
<EXP>: ( <EXP'>
<EXP'>: <INNERLIST> ) | )
<INNERLIST>: INT <INNERLIST'>
<INNERLIST'>: ; INT <INNERLIST'> | EPSILON
Note the lack of self-left-recursion and common prefixes.
StepII(b): Building First Set, Follow Set, and Parse Table
Like the previous step, we simply implemented the pseudocode from the book in a more OO way.
We calculate the first and follow set using the parser generator's constructFirst and constructFollow
functions. These function iterate over each nonterminal grammar rule and calls their encapsulated
functions of the same name (constructFirst and constructFollow). Within the nonterminal rule the
algorithm in the Louden Textbook is executed and a boolean representing whether the individual
nonterminal rules' first or follow set was updated is returned. The parser continues to iterate through the
nonterminals until no changes to their first/follow sets occur.
Let us consider what happens if we run parserGenerator.constructFirst() on our example grammar:
parserGenerator loop through each of the nonterminal rules calling their encapsulated
constructFirst method until none of them report a change in their first sets.
(I will spare the grammar rule level detail since this can iterate for quite a while)
Our new first set is formed correctly:
First set for <EXP> -> ( <EXP'>: (
First set for <EXP'> -> <INNERLIST> ): INT
First set for <EXP'> -> ): )
First set for <INNERLIST> -> INT <INNERLIST'>: INT
First set for <INNERLIST'> -> ; INT <INNERLIST'>: ;
First set for <INNERLIST'> -> EPSILON: EPSILON
Let us consider what happens if we run parserGenerator.constructFollow() on our example grammar:
parserGenerator loop through each of the nonterminal rules calling their encapsulated
constructFollow method until none of them report a change in their follow sets.
(I will spare the grammar rule level detail since this can iterate for quite a while)
Our new follow set is formed correctly:
Follow set for <EXP>: $
Follow set for <EXP'>: $
Follow set for <INNERLIST>: )
Follow set for <INNERLIST'>: )
Forming the parse table is quite simple after we have the first and follow sets. We form the parse
table by calling the parser generator's createParseTable function. It simply loops through all the
nonterminals and loops through each of the terminals. If the nonterminal contains the terminal in
its first set then the correlating production rule is added to the parse table. If the nonterminal's
first set contains epsilon then epsilon is added to the parse table for all of the follow terminal entries.
Here's what the parse table looks like before we consider EPSILON:
| ( | ) | INT | ; | $ |
<EXP> | ( <EXP'> | | | | |
<EXP'> | | ) | <INNERLIST> ) | | |
<INNERLIST> | | | INT <INNERLIST'> | | |
<INNERLIST'> | | | | ; INT <INNERLIST'> | |
Now when we consider the first set of <INNERLIST'> contains EPSILON, the resulting parse table looks like this:
| ( | ) | INT | ; | $ |
<EXP> | ( <EXP'> | | | | |
<EXP'> | | ) | <INNERLIST> ) | | |
<INNERLIST> | | | INT <INNERLIST'> | | |
<INNERLIST'> | | EPSILON | | ; INT <INNERLIST'> | |
Step II(c)
Once we have the parse table parsing is quite simple. The parser (Parser.java) starts out with
an input list, a parse stack that contains only the start symbol, and the parse table. The parser
loops until the stack and input is empty or an error has occured doing the following:
if the top of the stack is a nonterminal
then we add the parse table entry for that nonterminal and the current input symbol to the stack. (generate)
if the top of the stack is a terminal
then we make sure that it matches the current input symbol
and we pop off the top of the stack and get the next input symbol. (match)
if any entry in the parse table is empty or the input symbol and stack don't match
then we throw an error.
For example for the input ( INT ; INT )
INPUT STACK ACTION
( INT ; INT ) $ <EXP> $ generate
( INT ; INT ) $ ( <EXP'> $ match
INT ; INT ) $ <EXP'> $ generate
INT ; INT ) $ <INNERLIST> ) $ generate
INT ; INT ) $ INT <INNERLIST'> ) $ match
; INT ) $ <INNERLIST'> ) $ generate
; INT ) $ ; INT <INNERLIST'> ) $ match
) $ <INNERLIST'> ) $ match
) $ <INNERLIST'> ) $ generate
) $ EPSILON ) $ match (our implementation can put epsilon on the stack, but pops it off when detected)
) $ ) $ match
$ $ match
match
SUCCESSFUL PARSE