Let's write a compiler, part 3: A parser
musicale 2021-08-16 22:16:15 +0000 UTC [ - ]
Since the implementation is in C, perhaps it might make sense to compile a version of PL/0 ("C/0") with C syntax rather than Pascal syntax. It might be possible to get something close to a C/0 compiler that could compile its own source code.
Though one issue with C is you also need a cpp implementation.
kazinator 2021-08-16 16:03:13 +0000 UTC [ - ]
static void
factor(void)
{
switch (type) {
case TOK_IDENT:
case TOK_NUMBER:
next();
break;
case TOK_LPAREN:
expect(TOK_LPAREN);
expression();
expect(TOK_RPAREN);
}
}
The TOK_ enumeration has a good many more members than just those three. If any other one occurs in the code, or an outright corrupt type value occurs, it silently falls through. If the error is handled at all, it will result in some strange diagnostic, in some higher-level code that thinks it has seen an expression with a factor in it.

Adding a break after the last block in a switch is a good idea; someone adding a new case might forget. (With diagnostic support, like GCC requiring a fallthrough comment, this is less important nowadays.)
If a switch deliberately falls through for unmatched cases, have
default:
break;
to express that explicitly.

This whole approach of the parser consisting of void functions is poor for 2021. It's as if the author is deliberately trolling everyone who has ever taken any level of interest in functional programming.
The author will have a hard time fitting the necessary semantic actions into this code, like constructing an abstract syntax tree, and might succumb to the temptation to do it with some hacks which build the structure procedurally using global variables.
The parser would be more future-proof if it anticipated the need for backtracking. That is to say, a function such as factor() above could be considered a pattern-matching construct, and return a useful value indicating whether it matched or failed to match. Furthermore, if it failed to match, it would backtrack, leaving the state of the input stream as it was prior to entry.
You can then do interesting things, because you're essentially LL(k) now, like speculatively parse the input using several possible constructs in priority order, until one matches.
The code inconsistently mixes the use of concrete character constants like '.' and '{' and #define symbols that expand to character constants like #define TOK_PLUS '+'.
Using the character constants in the code as in expect('+') or whatever is clearer. The Yacc approach of using constants for abstract tokens, (which are enumerated starting at some value above 256), and concrete character constants for single-character tokens that denote themselves, is worth mimicking.
You're not going to redefine TOK_PLUS to something other than '+' in the future, and doing so would be like changing #define FOUR 4 to #define FOUR 5. TOK_PLUS doesn't inform any better than '+', and isn't any easier to find with grep.
wott 2021-08-16 16:58:52 +0000 UTC [ - ]
It's a declared non-goal.
> The code inconsistently mixes the use of concrete character constants like '.' and '{' and #define symbols that expand to character constants like #define TOK_PLUS '+'.
> Using the character constants in the code as in expect('+') or whatever is clearer.
That's not how it works: a TOK_* represents a whole element. The element may happen to be a single character, but it can be a word. It would be highly confusing to pass single characters to expect(), as you wouldn't know at first sight whether it expects a simple character or an element. The TOK_* constants could just as well be assigned random numbers, as you say in your following sentence. And they were until yesterday :-) I guess it is easier to debug with mnemonic characters.
https://github.com/ibara/pl0c/commit/aba92627deac3f8cfe08679...
kazinator 2021-08-16 17:25:33 +0000 UTC [ - ]
E.g. GNU Awk:
http://git.savannah.gnu.org/cgit/gawk.git/tree/awkgram.y?h=g...
This is readable; it says "I am just what I look like: a terminal symbol corresponding to the actual +".
p4bl0 2021-08-16 17:57:07 +0000 UTC [ - ]
Then what will be going on after parsing? What will be compiled?
kazinator 2021-08-16 18:56:43 +0000 UTC [ - ]
Output can be produced procedurally in a single pass. However, it will still require refactoring to work that in. For instance, say we go to a register machine (one that has an unlimited number of registers). The expression() function needs to know the destination register of the calculation. If it returns void as before, then it has to take it as an argument. Or else, it has to come up with the register and return it.
I think that if the target is a stack-based machine, that would be the easiest to work into the parsing scheme, because stack operations require no context at all. Let's use term() as an example:
Original recognizer skeleton:
static void
term(void)
{
factor();
while (type == TOK_MULTIPLY || type == TOK_DIVIDE) {
next();
factor();
}
}
Stack-based output:
static void
term(void)
{
factor(); // this now outputs the code which puts the factor on the stack
while (type == TOK_MULTIPLY || type == TOK_DIVIDE) {
next();
factor(); // of course, likewise outputs code.
output_stack_operation_type(type); // we just add this
}
}
What is output_stack_operation_type:
static void
output_stack_operation_type(int type)
{
switch (type) {
case TOK_MULTIPLY: output_line("mul"); break;
case TOK_DIVIDE: output_line("div"); break;
// ...
}
}
For instance, if we see 3 * 4 / 6
we have the grammar symbols factor TOK_MULTIPLY factor TOK_DIVIDE factor
the term function calls factor() first, and that function produces output, which might be the line: push 3
the term function calls factor() again, so this time the output: push 4
is produced. Then it calls output_stack_operation_type(type) where type is TOK_MULTIPLY. This outputs: mul
The loop continues and further produces push 6
div
Because stack code implicitly accesses and returns operands using the stack, the syntax-directed translation for it doesn't have to pass around environmental contexts like register allocation info and whatnot.

The stack-based operations do not take any arguments, and therefore a parser whose functions don't return anything and don't take any arguments can accommodate stack-based code generation.
Stack-based code can be an intermediate representation that is parsed again, and turned into something else, like register-based.
If the plan is to go to stack-based output, the code can work.
p4bl0 2021-08-16 16:25:24 +0000 UTC [ - ]
The thing is that building an AST is typically the kind of thing where "do it with some hacks which build the structure procedurally using global variables" as kazinator said is a sure way to have horrible bugs. The inherently recursive nature of an AST doesn't mix very well with procedural manipulation of a global state.
Someone 2021-08-16 20:10:52 +0000 UTC [ - ]
That “as soon as” implies code will be generated before the entire program has been parsed.
Also, for me single-pass implies “no AST”, as you would need at least one pass to construct one, and iterating over an AST counts as another pass for me.
pwdisswordfish8 2021-08-16 21:39:37 +0000 UTC [ - ]
kazinator 2021-08-17 19:37:32 +0000 UTC [ - ]
When we consider a compiler that puts out assembly language: if it works without producing an AST, just by generating assembly in a syntax-directed way as it follows the phrase structure of the code, then we call that a one-pass compiler. What it makes is assembly language, and it does that in one pass.
Producing executable code from the assembly requires another pass, but that pass belongs to the assembler. The compiler is not even running, and so it can't be a pass of that compiler.
userbinator 2021-08-17 04:27:32 +0000 UTC [ - ]
There's already plenty of bikeshed here but that just takes the cake. The only code that's truly "future proof" is code that you're willing to rewrite; which I'm sure the author is capable of doing if/when the need arises.
fishmaster 2021-08-16 14:48:02 +0000 UTC [ - ]
BazookaMusic 2021-08-16 20:15:53 +0000 UTC [ - ]
the_benno 2021-08-16 19:24:43 +0000 UTC [ - ]
pfdietz 2021-08-16 20:44:37 +0000 UTC [ - ]
the_benno 2021-08-16 21:37:45 +0000 UTC [ - ]
Your comment sent me reading the HOPL paper about SML -- I think I (based on folklore knowledge/informal conversations) conflate the original development of ML with subsequent development and standardization around SML. All of this was well before my time so it's quite interesting to read about some of the details.
nikkinana 2021-08-16 15:53:55 +0000 UTC [ - ]
chrisseaton 2021-08-16 16:38:44 +0000 UTC [ - ]
arcticbull 2021-08-16 19:23:46 +0000 UTC [ - ]
In many cases, though, I think parser combinators thread the needle elegantly - nom in Rust for instance. Or even just a hand-rolled LL(1).
eatonphil 2021-08-16 16:48:27 +0000 UTC [ - ]
The Wikipedia list of Bison users even calls out that a number of them switched from Bison to handwritten parsers.
p4bl0 2021-08-16 16:50:39 +0000 UTC [ - ]
What are the arguments in favor of manually writing parsers?
chrisseaton 2021-08-16 17:27:45 +0000 UTC [ - ]
Since most languages are context-sensitive, you almost always have to bend Yacc, which is designed for context-free languages, out of shape to apply it.
It's a hammer, but we hardly have any nails, and the nails can just be pushed in by hand without a hammer, and they're the wrong type of nails anyway so that hammer isn't really compatible, and on top of that it's a really expensive hammer.
rightbyte 2021-08-16 18:28:32 +0000 UTC [ - ]
chrisseaton 2021-08-16 18:32:08 +0000 UTC [ - ]
And actions often do things like side-effect changes to the lexer state, so it gets worse from there.
codr7 2021-08-16 17:12:44 +0000 UTC [ - ]
Since it's all regular code, you can use the full power of the host language to deal with the problem.
https://github.com/codr7/swifties/blob/main/Sources/Swifties...
p4bl0 2021-08-16 17:22:19 +0000 UTC [ - ]
brundolf 2021-08-16 18:30:35 +0000 UTC [ - ]
ibains 2021-08-16 18:07:26 +0000 UTC [ - ]
Also, you can do this much more easily in Scala using parser combinators, if the goal is time to market and the code to be compiled is not very large. Here is an example to see how the approaches compare: https://www.prophecy.io/blogs/scala-packrat-parser-combinato...
macintux 2021-08-17 02:56:04 +0000 UTC [ - ]
I agree with the first, but not the second. C is a valuable language to know, but for security’s sake we need fewer complex programs in it, not more.
adrian_b 2021-08-17 10:59:03 +0000 UTC [ - ]
On most modern CPUs, checking for addressing bounds or for overflow is considered too expensive, and software developers almost always choose speed over correctness.
There have been a few C compilers with optional run-time checks for bounds and overflow, but almost nobody used those options for production code.
Unlike the Intel/AMD ISA, some other instruction sets include a variety of exception conditions that allow a cheap implementation of run-time checks (e.g. the IBM POWER ISA), but even there I do not know whether the most recent implementations of those architectures have efficient exceptions.
userbinator 2021-08-17 04:22:46 +0000 UTC [ - ]
"Trusted" computing, DRM, and all its ugly ilk are premised on making things "secure". Maybe its time we opened our eyes to realise that uncomfortable truth. If the government and the megacorps are scared enough by encryption to want backdoors into our lives, maybe we should want ones into them too.
"Those who give up freedom for security deserve neither."
d110af5ccf 2021-08-17 07:15:44 +0000 UTC [ - ]
C being low level means you're forced to do many tedious and complicated things manually. This makes the code you write prone to errors which (in the worst case) can be exploited by attackers to run arbitrary code on your system.
... but I'm sure you're already well aware of this which leaves me wondering why you wrote what you did.
eatonphil 2021-08-16 16:35:47 +0000 UTC [ - ]
If you are writing a compiler, you'll just be emitting some other language anyway so the parser/emitter being in C doesn't really give you anything even if you are emitting C.
If you are writing an interpreter and you want performance, you're going to have to emit bytecode and have a bytecode interpreter. Wanting your bytecode _interpreter_ to be in C could make sense but your frontend could just emit the bytecode to a file and your interpreter written in C could read from that file.
It's a handwritten parser too so it's not like you picked the language so you could use a certain complex or nice parsing library.
And it's not like the performance (or binary size) of the parser/emitter typically matters so much that you need even that part to be so much faster (or smaller) than a parser/emitter in Go/Java/C#/whatever.
The overhead to writing C makes it a less friendly introduction to the topic if you have never written a compiler or interpreter (or parser, specifically) before.
Again, you can do what you want. And if you must do it in C, you will probably value any posts like this. Just wanted to throw out reasons you may not want to do this.
foxfluff 2021-08-16 18:49:48 +0000 UTC [ - ]
"I am thinking that we should choose C to implement our PL/0 compiler. Eventually we will want to have this compiler run on many different platforms, including Unix, MS-DOS, and potentially even CP/M. C is a language that will run on every platform I would want to run our PL/0 compiler on, so C works for me. C also requires no special setup so our usual development platform of vim on OpenBSD will work just fine. Finally, it will help with bootstrapping the self-hosting version of the PL/0 compiler, since we in theory could natively compile the C implementation on our target platform then use that to compile the PL/0 implementation, also on the target platform."
The reader may learn something from these posts whether they choose to write a compiler in C or not.
eatonphil 2021-08-16 19:03:02 +0000 UTC [ - ]
fishmaster 2021-08-16 22:20:00 +0000 UTC [ - ]
david2ndaccount 2021-08-16 18:15:16 +0000 UTC [ - ]
I don’t know why they’re writing the parser using globals though instead of passing a pointer to a parser state struct.
arcticbull 2021-08-16 19:17:50 +0000 UTC [ - ]
retrac 2021-08-16 19:39:13 +0000 UTC [ - ]
arcticbull 2021-08-16 19:41:44 +0000 UTC [ - ]
C is complex because of what it doesn't do, not what it does. That's IMO worse.
retrac 2021-08-16 19:49:56 +0000 UTC [ - ]
It's very easy to implement, though. A compiler fits in 32 kilobytes of RAM on some architectures. Only a handful of control structures. Limited types. That's the sense of simple that I meant, and which most people meant. It's not a compliment. It's a neutral observation. It's actually C's greatest weakness. It's too simple.
arcticbull 2021-08-16 22:08:13 +0000 UTC [ - ]
IdiocyInAction 2021-08-16 21:00:53 +0000 UTC [ - ]
But writing good, safe code in C can be very complex. That doesn't mean the language itself is complex.
userbinator 2021-08-17 04:09:21 +0000 UTC [ - ]
Because there's only one. Why obfuscate that fact? Don't be tempted by the cargo-cult to make things more complex than they need to be just to appease some dogmatic "best" practices that actually aren't.
throwaway17_17 2021-08-17 04:37:46 +0000 UTC [ - ]
int_19h 2021-08-16 21:02:05 +0000 UTC [ - ]
gnubison 2021-08-17 00:03:04 +0000 UTC [ - ]
fishmaster 2021-08-16 17:58:06 +0000 UTC [ - ]
That's true and was my first thought. C seems like an arcane choice for an introduction course.
arkj 2021-08-16 17:16:48 +0000 UTC [ - ]
The advantage of writing in C is that the language provides very few high-level features. All you have is functions and memory, so you end up doing everything yourself, which is laborious but rewarding to the mind, especially if you have never done low-level programming.