Hugo Hacker News

Let's write a compiler, part 3: A parser

eatonphil 2021-08-16 16:35:47 +0000 UTC [ - ]

Anyone can do whatever they want, and if you do want to write a parser in C this may be a good guide for you. But it's hard to understand the benefit to writing a language frontend in C, generally speaking. (An obvious exception might be if you are writing one of the first languages for a new architecture/operating system or one where only a C compiler exists.)

If you are writing a compiler, you'll just be emitting some other language anyway so the parser/emitter being in C doesn't really give you anything even if you are emitting C.

If you are writing an interpreter and you want performance, you're going to have to emit bytecode and have a bytecode interpreter. Wanting your bytecode _interpreter_ to be in C could make sense but your frontend could just emit the bytecode to a file and your interpreter written in C could read from that file.

It's a handwritten parser too so it's not like you picked the language so you could use a certain complex or nice parsing library.

And it's not like the performance (or binary size) of the parser/emitter typically matters so much that you need even that part to be so much faster (or smaller) than a parser/emitter in Go/Java/C#/whatever.

The overhead to writing C makes it a less friendly introduction to the topic if you have never written a compiler or interpreter (or parser, specifically) before.

Again, you can do what you want. And if you must do it in C, you will probably value any posts like this. Just wanted to throw out reasons you may not want to do this.

foxfluff 2021-08-16 18:49:48 +0000 UTC [ - ]

Brian's choice to use C is perfectly well justified:

"I am thinking that we should choose C to implement our PL/0 compiler. Eventually we will want to have this compiler run on many different platforms, including Unix, MS-DOS, and potentially even CP/M. C is a language that will run on every platform I would want to run our PL/0 compiler on, so C works for me. C also requires no special setup so our usual development platform of vim on OpenBSD will work just fine. Finally, it will help with bootstrapping the self-hosting version of the PL/0 compiler, since we in theory could natively compile the C implementation on our target platform then use that to compile the PL/0 implementation, also on the target platform."

The reader may learn something from these posts whether they choose to write a compiler in C or not.

eatonphil 2021-08-16 19:03:02 +0000 UTC [ - ]

I don't mean to debate Brian's choice (your quote shows it does fall into my exception anyway) but mention to readers here why they may be more interested in following guides on compiling/interpreting implemented in other languages than C.

fishmaster 2021-08-16 22:20:00 +0000 UTC [ - ]

C is actually a huge disadvantage here for me, since I have no use for it outside of this tutorial which makes it hard to justify learning it on top of learning how a compiler works...

david2ndaccount 2021-08-16 18:15:16 +0000 UTC [ - ]

C is a simple language and at least used to be a language almost everyone knew.

I don’t know why they’re writing the parser using globals though instead of passing a pointer to a parser state struct.

arcticbull 2021-08-16 19:17:50 +0000 UTC [ - ]

C is an incredibly complex language, it just appears simple. It's simple to get started and incredibly hard to do correctly, like skiing.

retrac 2021-08-16 19:39:13 +0000 UTC [ - ]

No, those things are not a complexity of C. They're just complex to implement at a lower level. They're complicated in any language, often, if approached at the level C provides. You have to hand-hold the machine through it. A C-like implementation of linked lists in an array of memory using pointers, for example, looks just as bad if written in that way in unsafe {} Rust, or even some mutating thing in Haskell. You can do that if you want, but you don't have to. Those languages allow you to abstract that away very quickly (often already pre-abstracted for you with default approaches). That's what people mean when they say C is simple. Plain and simple. Minimal. You have to do much of the heavy lifting.

arcticbull 2021-08-16 19:41:44 +0000 UTC [ - ]

Respectfully disagree. A language that fails to adequately provide safeguards to a user is a complex language because it forces all the complexity of dealing with the underlying machine onto the user. These are things a user needs to know to be effective in the language, which in turn makes using the language complex. It's a deceptively simple language.

C is complex because of what it doesn't do, not what it does. That's IMO worse.

retrac 2021-08-16 19:49:56 +0000 UTC [ - ]

It's not so much that we're disagreeing as arguing different points about the language, I think. You are right that it is not a simple language to use, in many ways. Archaic, unnecessarily complicated to implement many standard approaches, and very repetitive. The header system is just the wrong approach. Etc. I wouldn't pick it for the main language for a new project, if at all avoidable personally.

It's very easy to implement, though. A compiler fits in 32 kilobytes of RAM on some architectures. Only a handful of control structures. Limited types. That's the sense of simple that I meant, and which most people meant. It's not a compliment. It's a neutral observation. It's actually C's greatest weakness. It's too simple.

arcticbull 2021-08-16 22:08:13 +0000 UTC [ - ]

I think we’re in violent agreement :)

IdiocyInAction 2021-08-16 21:00:53 +0000 UTC [ - ]

C itself is not particularly complex, especially compared to other languages, like C++. It does have a ton of footguns and nasty UB things though.

But writing good, safe code in C can be very complex. That doesn't make the language itself is complex.

userbinator 2021-08-17 04:09:21 +0000 UTC [ - ]

I don’t know why they’re writing the parser using globals though instead of passing a pointer to a parser state struct.

Because there's only one. Why obfuscate that fact? Don't be tempted by the cargo-cult to make things more complex than they need to be just to appease some dogmatic "best" practices that actually aren't.

throwaway17_17 2021-08-17 04:37:46 +0000 UTC [ - ]

I think this is a really good example to point to in discussions of ‘cargo-cult’-ing. GP is merely reciting the standard ‘best’ practice for development in C, and on a more complex project or in another use case, they may be correct to do so. However, as you pointed out, in this case the globals are true single instance ‘objects’ and the use is semantically sensible and justified. It dovetails nicely with some of the DoD vs ECS discussion from the front page earlier today. Do what makes the most sense for your use case for reasons that apply to your project, everything else is abstraction.

int_19h 2021-08-16 21:02:05 +0000 UTC [ - ]

Parsers involve a lot of string processing, which is really inconvenient and easy to get wrong in C.

gnubison 2021-08-17 00:03:04 +0000 UTC [ - ]

Well, only the lexer deals with strings, and even then… it’s not really creating strings, just extracting some information from them. Creating strings efficiently and safely is IMO the hard part of string management in C.

fishmaster 2021-08-16 17:58:06 +0000 UTC [ - ]

> The overhead to writing C makes it a less friendly introduction to the topic if you have never written a compiler or interpreter (or parser, specifically) before.

That's true and was my first thought. C seems like an arcane choice for an introduction course.

arkj 2021-08-16 17:16:48 +0000 UTC [ - ]

> But it's hard to understand the benefit to writing a language frontend in C, generally speaking.

The advantage of writing in C is that the language provides very little “high” level features. All you have is just functions and memory so you end up doing everything which is a laborious process but rewarding to the mind especially if you are someone who never did low level programming.

musicale 2021-08-16 22:16:15 +0000 UTC [ - ]

I expect this series is similar to Wirth's own chapter on writing a PL/0 compiler. It seems to be following the same recursive descent approach, but using C rather than Pascal.

Since the implementation is in C, perhaps it might make sense to compile a version of PL/0 ("C/0") with C syntax rather than Pascal syntax. It might be possible to get something close to a C/0 compiler that could compile its own source code.

Though one issue with C is you also need a cpp implementation.

kazinator 2021-08-16 16:03:13 +0000 UTC [ - ]

The author forgot default cases in various switch (tok) statements:

  static void
  factor(void)
  {
     switch (type) {
     case TOK_IDENT:
     case TOK_NUMBER:
       next();
       break;
     case TOK_LPAREN:
       expect(TOK_LPAREN);
       expression();
       expect(TOK_RPAREN);
     }
   }
The TOK_ enumeration has a good many more members than just those three. If any other one occurs in the code, or an outright corrupt type value occurs, it silently falls through. If the error is handled at all, it will result in some strange diagnostic, in some higher level code that thinks it has seen an expression with a factor in it.

Adding a break after the last block in a switch is a good idea; someone adding a new case might forget. (Wit diagnostic support, like GCC requiring a fallthrough comment, this is less important nowadays).

If a switch deliberately falls through for unmatched cases, have

   default:
     break;
to express that explicitly.

This whole approach of the parser consisting of void functions is poor for 2021. It's as if the author is deliberately trolling everyone who has ever taken any level of interest in functional programming.

The author will have a hard time fitting the necessary semantic actions into this code, like constructing an abstract syntax tree, and might succumb to the temptation to do it with some hacks which build the structure procedurally using global variables.

The parser would be a more future proof if it anticipated the need for backtracking. So that is to say, a function such as factor() above could be considered to be a pattern matching construct, and return a useful value indicating whether it matched or failed to match. Furthermore, if it it failed to match, then it would backtrack, leaving the state of the input stream as it was prior to entry.

You can then do interesting things, because you're essentially LL(k) now, like speculatively parse the input using several possible constructs in priority order, until one matches.

The code inconsistently mixes the use of concrete character constants like '.' and '{' and #define symbols that expand to character constants like #define TOK_PLUS '+'.

Using the character constants in the code as in expect('+') or whatever is clearer. The Yacc approach of using constants for abstract tokens, (which are enumerated starting at some value above 256), and concrete character constants for single-character tokens that denote themselves, is worth mimicking.

You're not going to redefine TOK_PLUS to something other than '+' in the future, and doing so would be like like changing #define FOUR 4 to #define FOUR 5. TOK_PLUS doesn't inform any better than '+', and isn't any easier to find with grep.

wott 2021-08-16 16:58:52 +0000 UTC [ - ]

> The author will have a hard time fitting the necessary semantic actions into this code, like constructing an abstract syntax tree,

It's a declared non-goal.

> The code inconsistently mixes the use of concrete character constants like '.' and '{' and #define symbols that expand to character constants like #define TOK_PLUS '+'. > Using the character constants in the code as in expect('+') or whatever is clearer.

That's not how it works, TOK_* represent whole elements, the element may happen to be a single character, but it can be a word. It would be highly confusing to pass single characters to expect() as you wouldn't know at first sight whether it expects a simple character or an element. The TOK_* could as well be assigned a random number as you say in your following sentence. And they were until yesterday :-) I guess it is easier to debug with mnemonic characters.

https://github.com/ibara/pl0c/commit/aba92627deac3f8cfe08679...

kazinator 2021-08-16 17:25:33 +0000 UTC [ - ]

Using character constants for denoting one-character tokens is a time-honored tradition that everyone understands.

E.g. GNU Awk:

http://git.savannah.gnu.org/cgit/gawk.git/tree/awkgram.y?h=g...

This is readable; it says "I am just what I look like: a terminal symbol corresponding to the actual +".

2021-08-16 17:14:19 +0000 UTC [ - ]

p4bl0 2021-08-16 17:57:07 +0000 UTC [ - ]

> [building an AST is] a declared non-goal.

Then what will be going on after parsing? What will be compiled?

kazinator 2021-08-16 18:56:43 +0000 UTC [ - ]

Compilers can go straight to output without building an AST, which is just one form of output. I should have more generally written about output.

Output can be produced procedurally in a single pass. However, it will still require refactoring to work that in. For instance, say we go to a register machine (one that has an unlimited number of registers). The expression() function needs to know the destination register of the calculation. If it returns void as before, then it has to take it as an argument. Or else, it has to come up with the register and return it.

I think that if the target is a stack-based machine, that would be the easiest to work into the parsing scheme. This is because without any context at all. Let's use term() as an example:

Original recognizer skeleton:

  static void
  term(void)
  {
    factor();

    while (type == TOK_MULTIPLY || type == TOK_DIVIDE) {
      next();
      factor();
    }
  }
Stack-based output:

  static void
  term(void)
  {
    factor(); // this now outputs the code which puts the factor on the stack

    while (type == TOK_MULTIPLY || type == TOK_DIVIDE) {
      next();
      factor(); // of course, likewise outputs code.

      output_stack_operation(type); // we just add this
    }
  }
What is output_stack_operation_type:

  static void
  output_stack_operation_type(int type)
  {
     switch (type) {
     case TOK_MULTIPLY: output_line("mul"); break;
     case TOK_DIVIDE: output_line("div"); break;
       // ...
     }
  }
For instance if we see

  3 * 4 / 6
we have the grammar symbols

  factor TOK_MULTIPLY factor TOK_DIVIDE factor
the term function calls factor() first, and that function produces output, which might be the line:

  push 3
the term function calls factor() again, so this time the output:

  push 4
is produced. Then it calls output_stack_operation_type(type) where type is TOK_MULTIPLY. This outputs:

  mul
The loop continues and further produces

  push 6
  div
Because stack code implicitly accesses and returns operands using the stack, the syntax-directed translation for it doesn't have to pass around environmental contexts like register allocation info and whatnot.

The stack-based operations do not take any argument, and therefore a parser whose functions don't return anything and don't take any arguments can accommodate stack-based code generation.

Stack-based code can be an intermediate representation that is parsed again, and turned into something else, like register-based.

If the plan is to go to stack-based output, the code can work.

p4bl0 2021-08-16 16:25:24 +0000 UTC [ - ]

I second all this. Reading the blogpost I was wondering how the author will do with this parser to actually build an AST rather than just validating syntax.

The thing is that building an AST is typically the kind of thing where "do it with some hacks which build the structure procedurally using global variables" as kazinator said is a sure way to have horrible bugs. The inherently recursive nature of an AST doesn't mix very well with procedural manipulation of a global state.

Someone 2021-08-16 20:10:52 +0000 UTC [ - ]

I don’t think there will be an AST. https://briancallahan.net/blog/20210814.html: “We will be writing a single-pass compiler for a simple language and immediately output our final output code as soon as our compiler has enough information to do so”

That “as soon as” implies code will be generated before the entire program has been parsed.

Also, for me single-pass implies “no AST”, as you would need at least one pass to construct one, and iterating over an AST counts as another pass for me.

pwdisswordfish8 2021-08-16 21:39:37 +0000 UTC [ - ]

Single-pass to me also implies a code generator that writes the object file directly, rather than compiling to C or assembly (or some other language) as an intermediate format. But words don't mean anything anymore, so long as people can plausibly convince others that making them feel bad by calling out improper use means that they should be allowed to use words however they want, like a child owed the opportunity to exercise their unbridled spirit.

kazinator 2021-08-17 19:37:32 +0000 UTC [ - ]

Note that loading the program into memory is also a pass. Producing an installation image is also a pass, as is installing the program into the target. When we are talking about compiler passes, we don't count these, because they are outside the compiler.

When we consider a compiler that puts out assembly language, if that works without producing an AST, just by generating assembly in a syntax-directed way as it followed the phrase structure of the code, then we call that a one-pass compiler. What it makes is assembly language, and it does that in one pass.

Producing executable code from the assembly requires another pass, but that pass belongs to the assembler. The compiler is not even running, and so it can't be a pass of that compiler.

userbinator 2021-08-17 04:27:32 +0000 UTC [ - ]

The parser would be a more future proof

There's already plenty of bikeshed here but that just takes the cake. The only code that's truly "future proof" is code that you're willing to rewrite; which I'm sure the author is capable of doing if/when the need arises.

fishmaster 2021-08-16 14:48:02 +0000 UTC [ - ]

I'm thinking about following the tutorial in Julia or Kotlin, but it would be interesting how the code looks in a different language like F#.

BazookaMusic 2021-08-16 20:15:53 +0000 UTC [ - ]

The code isn't perfect, it's a learning project and still WIP, but you can check the code for a parser for the LOX language ( c-like) from 'Crafting Interpreters' in this project. 'Ast.fs' and 'Parser.fs' should be enough to get an idea of what it's doing.You can also google the grammar of the language which is quite simple:

https://github.com/BazookaMusic/FLOX

the_benno 2021-08-16 19:24:43 +0000 UTC [ - ]

F# is an ML-family language, and ML stands for meta-language -- it was designed for language tools and compilers! Andrew Appel's Modern Compiler Implementation in ML is a great resource for code examples that aren't exactly F#, but will look damn close. The book has also been published with C and Java, if you want side-by-side comparisons

pfdietz 2021-08-16 20:44:37 +0000 UTC [ - ]

ML was designed to be the metalanguage for the LCF proof system, used to write proof tactics. It was then recognized to be interesting in its own right as a programming language.

the_benno 2021-08-16 21:37:45 +0000 UTC [ - ]

Oh, thanks for the correction and apologies for the inadvertent misinfo.

Your comment sent me reading the HOPL paper about SML -- I think I (based on folklore knowledge/informal conversations) conflate the original development of ML with subsequent development and standardization around SML. All of this was well before my time so it's quite interesting to read about some of the details.

nikkinana 2021-08-16 15:53:55 +0000 UTC [ - ]

yacc is key

chrisseaton 2021-08-16 16:38:44 +0000 UTC [ - ]

Never use Yacc. Just write your own parser.

arcticbull 2021-08-16 19:23:46 +0000 UTC [ - ]

Lemon is an infinitely better parser generator than Yacc; it solves a lot of the issues that make Yacc terrible to work with: named non-terminals, non-terminal destructors, re-entrancy, etc. It's part of the SQLite codebase. Highly recommended.

In many cases, though, I think parser combinators thread the needle elegantly - nom in Rust for instance. Or even just a hand-rolled LL(1).

eatonphil 2021-08-16 16:48:27 +0000 UTC [ - ]

Adding on, the fraction of "real" language implementations using handwritten parsers vs. parser generators basically seems to be half and half.

The Wikipedia list of Bison users even calls out that a number of them switched from Bison to handwritten parsers.

https://en.wikipedia.org/wiki/GNU_Bison#Use

p4bl0 2021-08-16 16:50:39 +0000 UTC [ - ]

Why would you say that? Yacc and other parser generators exist for a good reason: hand written parsers can be quite hairy to debug and extend, while parser generators offer a domain specific language to specify your grammar and can generate efficient parser code based on it.

What are the arguments in favor of manually writing parsers?

chrisseaton 2021-08-16 17:27:45 +0000 UTC [ - ]

Fundamentally I don't think parsing is a problem that's complex enough to warrant a custom tool and language. And even if it was Yacc is the wrong tool for the job in most cases.

Since most languages are context-sensitive, you almost always have to bend Yacc, which is designed for context-free languages, out of shape to apply it.

It's a hammer but we hardly have any nails, and the nails can just be pushed in by hand without a hammer, and they're the wrong type of nails anyway so that hammer isn't really compatible, and top of that it's a really expensive hammer.

rightbyte 2021-08-16 18:28:32 +0000 UTC [ - ]

Isn't the context sensitivity handled when you build an AST with the action statements in Yacc?

chrisseaton 2021-08-16 18:32:08 +0000 UTC [ - ]

Yes - so straight away you have to leave the DSL and bypass the formal mode of Yacc. At which point why bother with it? If the first thing you need to do with your tool is hack around it, maybe the tool isn’t so well-designed?

And actions often do things like side-effect changes to the lexer state, so it gets worse from there.

codr7 2021-08-16 17:12:44 +0000 UTC [ - ]

What few seem to realize is that it's perfectly possible to abstract out some of the work and create an extensible foundation for manual parsers.

Since it's all regular code, you can use the full power of the host language to deal with the problem.

https://github.com/codr7/swifties/blob/main/Sources/Swifties...

p4bl0 2021-08-16 17:22:19 +0000 UTC [ - ]

I understand that, but I still find it easier to write/maintain/extend a grammar than a parser for that grammar.

brundolf 2021-08-16 18:30:35 +0000 UTC [ - ]

If you've got the right utility functions, a handwritten parser visually maps very directly to its grammar. It won't be as concise of course, but cognitively you can work with it almost as if it were just a grammar.

ibains 2021-08-16 18:07:26 +0000 UTC [ - ]

This post is good for learning, we have too many “systems“ developed in Java (Hadoop et al - I’m looking at you - you sluggish bugs in the rug), we need more programmers to learn C and build systems in it.

Also, you can do this so much easier in scala using parser combinators, if the goal is time to market and the code to be compiled is not very large. Here is an example to see how the approaches compare: https://www.prophecy.io/blogs/scala-packrat-parser-combinato...

macintux 2021-08-17 02:56:04 +0000 UTC [ - ]

> …we need more programmers to learn C and build systems in it.

I agree with the first, but not the second. C is a valuable language to know, but for security’s sake we need fewer complex programs in it, not more.

adrian_b 2021-08-17 10:59:03 +0000 UTC [ - ]

While C may be criticized for making it too easy to misuse pointers, for other features that are usually mentioned as security problems for C programs, e.g. out-of-bounds addressing and numeric overflow, the culprit is not the C language, but the manufacturers of the most popular CPUs, e.g. the Intel/AMD CPUs.

On most modern CPUs, checking for addressing bounds or for overflow is too expensive and the software developers almost always choose speed over correctness.

There have been a few C compilers with optional run-time checks for bounds and overflow, but almost nobody used those options for production code.

Unlike the Intel/AMD ISA, there are other instruction sets which include a variety of exception conditions, for a cheap implementation of the run-time checks (e.g. the IBM POWER ISA), but even there I do not know if the most recent implementations of those architectures have efficient exceptions.

userbinator 2021-08-17 04:22:46 +0000 UTC [ - ]

"security"? You mean the paranoia FUD of Big Tech justifying their enslavement of users and locking them out of their own bought general-purpose-computers --- I mean devices --- to further the authoritarian corporatocracy?

"Trusted" computing, DRM, and all its ugly ilk are premised on making things "secure". Maybe its time we opened our eyes to realise that uncomfortable truth. If the government and the megacorps are scared enough by encryption to want backdoors into our lives, maybe we should want ones into them too.

"Those who give up freedom for security deserve neither."

d110af5ccf 2021-08-17 07:15:44 +0000 UTC [ - ]

The comment you replied to had absolutely nothing to do with encryption or DRM.

C being low level means you're forced to do many tedious and complicated things manually. This makes the code you write prone to errors which (in the worst case) can be exploited by attackers to run arbitrary code on your system.

... but I'm sure you're already well aware of this which leaves me wondering why you wrote what you did.