Handmade WDL Lexer and Parser with Rust
First published: January 24, 2024
Last updated: July 5, 2025
I’ve been reading a lot about how parsers work, and I decided to try to handwrite a parser for a simple domain-specific language (DSL) I’ve worked with called Workflow Description Language (WDL). WDL is common in bioinformatics pipelines, and I come across it often in my scientific consulting work.
The existing tooling for WDL is written in Java and Python, and while it works well, it is quite heavy and slow. I’d love a simple binary to lint my WDL workflows and add extra features for validating them. In an effort to learn how to make things fast and to better understand how these tools work, I started working on a handmade lexer/parser for WDL using Rust.
WDL Lexer ¶
Generally, the first step when parsing is building a lexer (also known as a scanner) to transform the incoming text (WDL) into a series of tokens that we will then use to build an abstract syntax tree.
I did some reading on how lexers are built, and came across a lot of info on parser generators and EBNF notation, but by far the most helpful way to learn was to try to build one and dive into the code of other lexers to see how they handle it.
The Zig and Go lexers were the most helpful for getting started. Zig in particular, paired with Mitchell Hashimoto’s write-up on the Zig compiler internals, is a fantastic resource. Even for a novice like me who has never written a line of Zig, the code is extremely straightforward and easy to understand1.
Tokenization ¶
A tokenizer is a state machine that iterates through the file and outputs tokens. The first step was deciding how to iterate through the source: I chose to use Rust to read the file into a string and then split it into a vector of chars. The primary data structure is a struct that holds the tokenizer state, the vector of chars, and the current position in the source. The tokens themselves are also structs that currently hold the start and end index of the token, and the “tag” (defined as a Rust enum), or name, of the token.
struct Tokenizer {
source: Vec<char>,
// Keep track of the char index in the buffer
index: usize,
state: Vec<State>,
cur_invalid_token: Option<Token>,
}
struct Token {
tag: Tag,
begin: usize,
end: usize,
}
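For context, here is roughly how a Tokenizer might be constructed; this is a minimal sketch of my own, not the project’s code, and the file name and initial Normal state on the stack are assumptions:
use std::fs;

fn main() -> std::io::Result<()> {
    // Read the whole WDL file into a String, then split it into chars
    // so the tokenizer can index individual characters.
    let source: Vec<char> = fs::read_to_string("workflow.wdl")?.chars().collect();
    let tokenizer = Tokenizer {
        source,
        index: 0,
        // Assumption: lexing starts in the Normal state
        // (the State enum is shown further down).
        state: vec![State::Normal],
        cur_invalid_token: None,
    };
    // Hypothetical driver loop:
    // while let Some(token) = tokenizer.next() { ... }
    Ok(())
}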
Similar to Zig, the primary method on the Tokenizer struct is next(), which scans the string and outputs the next token. The body of next is primarily a large if..else if block that calls separate lexing functions based on the character seen. Most tokens are easy to lex; the one- and two-character tokens (e.g., =, !=, ==) are detected in the main loop and don’t have their own functions. I call builtin functions/types and user variables Idents, and they are similarly straightforward: you detect whether you are at a legal character, then call a function that iterates through the characters until you reach a space or an illegal character (e.g., (, =, etc.), and then you look up what you have so far to see whether it is a known identifier or a user-defined one.
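As a rough illustration, the dispatch inside next() might look something like the sketch below. This is my own sketch, not the actual implementation; the Tag::Equal and Tag::EqualEqual variant names and the whitespace handling are assumptions.
fn next(&mut self) -> Option<Token> {
    // Skip any whitespace between tokens.
    while matches!(self.peek(), Some(c) if c.is_whitespace()) {
        self.index += 1;
    }
    let c = *self.peek()?;
    // Tag::Ident is just a placeholder; it gets overwritten below.
    let mut token = Token { tag: Tag::Ident, begin: self.index, end: self.index };
    if c.is_alphabetic() || c == '_' {
        // Identifiers, keywords, and builtin types are handed off
        // to a helper (lex_ident, shown just after this sketch).
        self.lex_ident(token)
    } else if c == '=' {
        // One- and two-character tokens are handled inline:
        // peek one more char to distinguish `=` from `==`.
        self.index += 1;
        token.tag = if self.peek() == Some(&'=') {
            self.index += 1;
            Tag::EqualEqual // hypothetical variant name
        } else {
            Tag::Equal // hypothetical variant name
        };
        token.end = self.index;
        Some(token)
    } else {
        // ... remaining one- and two-character tokens elided ...
        None
    }
}
The actual lex_ident the project uses looks like this: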
fn lex_ident(&mut self, mut cur_token: Token) -> Option<Token> {
    // Consume characters while they are legal in an ident
    // (digits are legal after the first character of a WDL ident).
    while let Some(c) = self.peek() {
        if c.is_alphanumeric() || c == &'_' {
            self.index += 1;
        } else {
            // The next character is a terminal or not allowed in an
            // ident, so stop without consuming it and return whatever
            // we were parsing; the main loop handles the terminator.
            // This also covers EOF: the loop simply ends at the last char.
            break;
        }
    }
    // Look up the span to decide between a known name and a generic Ident.
    let tag = ident_tag(self.source.get(cur_token.begin..self.index)).unwrap();
    cur_token.tag = tag;
    cur_token.end = self.index;
    Some(cur_token)
}
The ident_tag function simply matches against known identifier and type names, and either returns that name’s tag or the generic Ident tag.
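A minimal sketch of what that lookup could look like; the specific Tag variant names here are my guesses, with a few real WDL keywords and types standing in for the full table:
fn ident_tag(chars: Option<&[char]>) -> Option<Tag> {
    // Collect the span into a String so we can match on it.
    let s: String = chars?.iter().collect();
    Some(match s.as_str() {
        // Hypothetical variant names for a few WDL keywords and types.
        "workflow" => Tag::KeywordWorkflow,
        "task" => Tag::KeywordTask,
        "Int" => Tag::TypeInt,
        "String" => Tag::TypeString,
        // Anything else is a user-defined identifier.
        _ => Tag::Ident,
    })
}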
State ¶
The most challenging part of the WDL lexer has been handling strings and string interpolation. WDL has something similar to Python’s f-strings: placeholder syntax such as ~{expr} lets you insert arbitrary expressions directly into strings (e.g., "hello ~{name}").
The tokenizer first reads the current state, which is stored as a stack in Tokenizer::state. There are currently three states:
enum State {
    // Ordinary tokenizing, outside any string
    Normal,
    // Inside a string literal; the char is the opening quote
    InString(char),
    // Inside a ~{ ... } interpolation within a string
    Interp,
}
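Here is a rough sketch of how those states might drive the transitions. This is my guess at the mechanics, not the actual code; the peek_next helper and the Copy derive on State are assumptions:
impl Tokenizer {
    // Hypothetical helper: look one char past the current index.
    fn peek_next(&self) -> Option<&char> {
        self.source.get(self.index + 1)
    }

    // Sketch of the string/interpolation transitions; assumes
    // State derives Clone + Copy + PartialEq.
    fn update_state(&mut self, c: char) {
        match self.state.last().copied() {
            Some(State::InString(quote)) if c == quote => {
                // Closing quote: pop back to the enclosing state.
                self.state.pop();
            }
            Some(State::InString(_)) if c == '~' && self.peek_next() == Some(&'{') => {
                // Start of a ~{ ... } placeholder: lex expressions now.
                self.state.push(State::Interp);
            }
            Some(State::Interp) if c == '}' => {
                // End of the placeholder: resume lexing the string.
                self.state.pop();
            }
            Some(State::Normal) | Some(State::Interp) | None if c == '"' || c == '\'' => {
                // Opening quote: remember which quote char opened it.
                self.state.push(State::InString(c));
            }
            _ => {}
        }
    }
}
The reason this is a stack rather than a single flag is that interpolations can nest: an expression inside ~{ ... } can itself contain a string literal, which can in turn contain another interpolation.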
To be continued… Update: if you’re interested in where I ended up going with this project, it wasn’t very far. I did figure out how to parse f-strings using techniques from the Python interpreter and Zig parser. Send me an email if you have any questions.
While even intermediate Rust code is deep in traits and abstractions, Zig feels extremely no-nonsense and to the point. More than anything else, this inspired me to start exploring Zig. ↩︎