Dumb way to parse
This is bad advice; use it at your own risk.
The smart way to parse text is to use a tool like ANTLR, yacc, a parser combinator library or even Xtext. However, being smart means studying those tools and frameworks. What if you don’t want to study? What if you just want to write a parser without investing hours in learning the “right” tool? Then you should continue reading, as this is exactly what this post is about.
Look ma I can parse text!!!
Normally we don’t parse text just for kicks. We parse because we need to extract some data out of a text representation. In this article I will use https://github.com/mzaks/FlatBuffersSwiftCodeGen as an example of how you can roll your own parser in Swift. My code is not super robust and clean, but hey, this is a “dumb way to parse” article. The intention is to provoke you to learn from my mistakes and do better 😉.
Also, if you are interested in a full dive into text parsing terminology and techniques, I can recommend the article “A Guide To Parsing”.
Anyways, let’s start with the dumb way of parsing. First of all we need to understand that a text is a one-dimensional array of numbers. Depending on the encoding, different numbers can represent the same text. So in order to be able to parse text we need to agree on one encoding. I define that the text I want to parse is encoded in UTF-8.
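To make the “array of numbers” idea concrete, here is a tiny sketch that prints the UTF-8 numbers behind a piece of text. The sample strings are my own and not taken from the project.

```swift
// The same text, seen as the UTF-8 numbers a parser walks over.
let text = "enum A"
let bytes = Array(text.utf8)
print(bytes) // [101, 110, 117, 109, 32, 65]

// Characters outside of ASCII take more than one byte in UTF-8.
let eAcute = Array("\u{E9}".utf8)
print(eAcute) // [195, 169]
```

Note that the space shows up as 32, which matters once we start skipping white space.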
Now, the correct way of parsing text is based on two phases, lexing and parsing (for more details read the “A Guide To Parsing” article I mentioned before). I say screw it, this is the dumb way to parse, so we will have only one phase, which we call eat 🤤.
Please have a look at the following file:
In there we have an eat function which takes:
- a static string: the string we want to eat
- a pointer into the memory where we want to start eating from
- and the length, which tells us how far we might want to eat
It returns an optional pointer: if we were able to eat the string, we return the pointer to where we stopped eating; if we were not, we return nil.
However, before we start eating the string itself, we eat white space. White space characters are characters which carry no semantic meaning for humans. If we have a look at the character encoding table for the first 128 characters in ASCII, which are encoded identically in UTF-8:
We see that the first 0..<33 are non-visible characters and can be considered white space. Hence the dumb implementation of the eatWhiteSpace function:
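public func eatWhiteSpace(
    _ p: UnsafePointer<UInt8>,
    length: Int
) -> UnsafePointer<UInt8>? {
    var p1 = p
    // Check the bound before each read, so we never look past `length` bytes.
    while p.distance(to: p1) < length {
        // First byte that is not white space: this is where eating continues.
        if p1.pointee >= 33 {
            return p1
        }
        p1 = p1.advanced(by: 1)
    }
    // Nothing but white space in the whole range.
    return nil
}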
It advances through numbers which are smaller than 33 and returns the pointer to the first number which is greater than or equal to 33. It also checks that we don’t eat too much. Eating should be a safe experience!
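With white space out of the way, the eat function described earlier can be sketched roughly like this. This is my reconstruction from the description, not the code from the repository: the parameter labels and the consumed helper are assumptions made for illustration.

```swift
// Skips bytes below 33 without reading past `length` bytes from `p`.
func eatWhiteSpace(_ p: UnsafePointer<UInt8>, length: Int) -> UnsafePointer<UInt8>? {
    var p1 = p
    while p.distance(to: p1) < length {
        if p1.pointee >= 33 { return p1 }
        p1 = p1.advanced(by: 1)
    }
    return nil
}

// Hypothetical sketch of `eat`; the real signature may differ.
func eat(_ s: StaticString, from p: UnsafePointer<UInt8>, length: Int) -> UnsafePointer<UInt8>? {
    guard let start = eatWhiteSpace(p, length: length) else { return nil }
    let count = s.utf8CodeUnitCount
    // The whole expected string has to fit into what is left.
    guard p.distance(to: start) + count <= length else { return nil }
    let expected = s.utf8Start
    for i in 0..<count where start[i] != expected[i] {
        return nil
    }
    return start.advanced(by: count)
}

// Helper for the example only: how many bytes were consumed, or nil if eating failed.
func consumed(_ s: StaticString, in text: String) -> Int? {
    let bytes = Array(text.utf8)
    return bytes.withUnsafeBufferPointer { buffer -> Int? in
        guard let base = buffer.baseAddress,
              let next = eat(s, from: base, length: buffer.count) else { return nil }
        return base.distance(to: next)
    }
}

print(consumed("enum", in: "  enum Color") == 6)    // true: two spaces plus four letters
print(consumed("table", in: "  enum Color") == nil) // true: nothing to eat here
```

Returning nil instead of crashing is what lets a caller try a different string at the same pointer.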
Speaking of safety first: if you have already decided to go with the dumb way of parsing, you should at least unit test your parser, which is, by the way, a really pleasant experience. Here are some unit tests for our eating logic:
As I mentioned earlier, we don’t parse text just for kicks. We want to extract data out of the textual representation. But before we do the data extraction, let’s define the data structures which will hold the data. In my example I am parsing the IDL (interface definition language) of FlatBuffers. The grammar for the FlatBuffers IDL can be found here:
It’s quite a mouthful and also not always up to date. So let’s pick a somewhat simple declaration out of it in order to see how we can parse it the dumb way.
Let’s pick enum_decl, the declaration of an enum type in FlatBuffers. First we want to define the data model for the enum declaration:
struct Enum {
    let name: Ident
    let type: Type
    let cases: [EnumCase]
    let metaData: MetaData?
    let comments: [Comment]
}
Here we see that an Enum has a name, which is an Ident, a type, an array of cases, metadata and an array of comments.
Let’s start with the latter and see what Comment is all about:
Comment is a simple struct which holds only a string value. It implements the ASTNode protocol, which forces us to implement the parsing logic as a static method. This method receives a pointer and a length and returns an optional tuple:
- an instance of the comment
- a pointer to the place where we stopped eating characters
It is an optional tuple because it returns nil if we were not able to produce the comment. Now, not being able to produce a comment might be a legitimate thing or an exception. I guess I would be better off if this method threw, but I decided to just raise and crash internally. It’s up to you to make your implementation better.
Inside of the function we eat the // string (don’t forget that eat internally also eats the white space before the string) and then we advance in the string until we hit 10, 13, 0 or the end of the string. The numbers represent the end of the line (line feed and carriage return) or the end of a zero-terminated string. Now we have two pointers, one pointing to the position right after // and one pointing to where we stopped consuming. The numbers in between need to be transformed into a String instance and will become the value of our Comment.
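The comment logic described above could look roughly like the following. Treat it as a sketch under assumptions: the method name with(pointer:length:) and the parseComment helper are made up for illustration, and the white space skipping is inlined instead of going through eat so that the example stands on its own.

```swift
protocol ASTNode {
    // Assumed shape of the static parsing method.
    static func with(pointer: UnsafePointer<UInt8>, length: Int) -> (Self, UnsafePointer<UInt8>)?
}

struct Comment: ASTNode {
    let value: String

    static func with(pointer: UnsafePointer<UInt8>, length: Int) -> (Comment, UnsafePointer<UInt8>)? {
        let end = pointer.advanced(by: length)
        // Skip leading white space (bytes below 33), staying inside the buffer.
        var p = pointer
        while p < end && p.pointee < 33 { p = p.advanced(by: 1) }
        // Eat the two slashes (47 is "/" in ASCII).
        guard p.distance(to: end) >= 2, p[0] == 47, p[1] == 47 else { return nil }
        let start = p.advanced(by: 2)
        // Advance until line feed (10), carriage return (13), zero (0) or the end.
        var stop = start
        while stop < end && stop.pointee != 10 && stop.pointee != 13 && stop.pointee != 0 {
            stop = stop.advanced(by: 1)
        }
        // Everything between the two pointers becomes the comment value.
        let bytes = UnsafeBufferPointer(start: start, count: start.distance(to: stop))
        return (Comment(value: String(decoding: bytes, as: UTF8.self)), stop)
    }
}

// Helper for the example only: parse a comment out of a Swift string.
func parseComment(_ text: String) -> String? {
    let bytes = Array(text.utf8)
    return bytes.withUnsafeBufferPointer { buffer -> String? in
        guard let base = buffer.baseAddress else { return nil }
        return Comment.with(pointer: base, length: buffer.count)?.0.value
    }
}

print(parseComment("  // hello\nenum") == " hello") // true
print(parseComment("enum X") == nil)                // true: not a comment
```

Unlike the version I describe, this sketch never crashes; it simply returns nil when there is no comment at the pointer.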
Here are the unit tests which check if we are able to produce a Comment instance to our liking.
Let’s have a quick look at the other model types we reference in Enum:
struct Ident {
    let value: String
}
A very simple struct, just capturing the name as a string. An ident can start with a lowercase or uppercase letter or the _ character, followed by letters, _ characters or digits. The parsing logic can be found here:
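As a rough illustration of the ident rule, here is a sketch written as a free function. The names parseIdent, isIdentStart, isIdentBody and identValue are hypothetical and only exist to keep the example short and self-contained.

```swift
struct Ident {
    let value: String
}

// ASCII letters (65...90, 97...122) and underscore (95) may start an ident.
func isIdentStart(_ b: UInt8) -> Bool {
    let isUpper = b >= 65 && b <= 90
    let isLower = b >= 97 && b <= 122
    return isUpper || isLower || b == 95
}

// After the first byte, digits (48...57) are allowed as well.
func isIdentBody(_ b: UInt8) -> Bool {
    let isDigit = b >= 48 && b <= 57
    return isIdentStart(b) || isDigit
}

// Hypothetical sketch: skip white space, then eat one ident.
func parseIdent(_ pointer: UnsafePointer<UInt8>, length: Int) -> (Ident, UnsafePointer<UInt8>)? {
    let end = pointer.advanced(by: length)
    var p = pointer
    while p < end && p.pointee < 33 { p = p.advanced(by: 1) }
    guard p < end, isIdentStart(p.pointee) else { return nil }
    let start = p
    while p < end && isIdentBody(p.pointee) { p = p.advanced(by: 1) }
    let bytes = UnsafeBufferPointer(start: start, count: start.distance(to: p))
    return (Ident(value: String(decoding: bytes, as: UTF8.self)), p)
}

// Helper for the example only.
func identValue(_ text: String) -> String? {
    let bytes = Array(text.utf8)
    return bytes.withUnsafeBufferPointer { buffer -> String? in
        guard let base = buffer.baseAddress else { return nil }
        return parseIdent(base, length: buffer.count)?.0.value
    }
}

print(identValue(" my_Enum2 : byte") == "my_Enum2") // true
print(identValue("9lives") == nil)                  // true: idents cannot start with a digit
```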
struct Type {
    let scalar: Scalar?
    let vector: Bool
    let ref: Ident?
    let string: Bool
}
Type is a relatively complex construct which represents the value type. In the case of Enum in FlatBuffers we define what number type the enum corresponds to:
- byte or uint8
- short or uint16
- etc…
Scalar types are represented as a Swift enum. A reference to another complex type like Table, Struct, Enum or Union is represented through its Ident. I defined string as a special case, and every type can be a vector in addition to being a scalar, string or ref. As I mentioned before, this is a somewhat more involved example, but it is not that important in regard to parsing.
struct EnumCase {
    let ident: Ident
    let value: ValueLiteral?
}
EnumCase has a name/ident and an optional ValueLiteral. When you define enum cases in the FlatBuffers IDL you can optionally add an = 4 expression, which says that this case equals 4. This is what ValueLiteral is all about. It just stores the string after the = character.
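To show how an optional value after = can be eaten, here is a simplified sketch. It uses plain strings instead of Ident and ValueLiteral, and it takes an end pointer instead of a length to keep the bookkeeping short. All names in it (SimpleEnumCase, eatWhile, parseEnumCase, enumCase) are hypothetical.

```swift
typealias Bytes = UnsafePointer<UInt8>

// Simplified stand-in for EnumCase: plain strings instead of Ident and ValueLiteral.
struct SimpleEnumCase {
    let ident: String
    let value: String?
}

func skipWhiteSpace(_ p: Bytes, end: Bytes) -> Bytes {
    var p1 = p
    while p1 < end && p1.pointee < 33 { p1 += 1 }
    return p1
}

func isWordByte(_ b: UInt8) -> Bool {
    let isLetter = (b >= 65 && b <= 90) || (b >= 97 && b <= 122)
    let isDigit = b >= 48 && b <= 57
    return isLetter || isDigit || b == 95
}

// Skips white space, then reads bytes while `accept` returns true.
func eatWhile(_ p: Bytes, end: Bytes, _ accept: (UInt8) -> Bool) -> (String, Bytes)? {
    let start = skipWhiteSpace(p, end: end)
    var p1 = start
    while p1 < end && accept(p1.pointee) { p1 += 1 }
    guard p1 > start else { return nil }
    let bytes = UnsafeBufferPointer(start: start, count: start.distance(to: p1))
    return (String(decoding: bytes, as: UTF8.self), p1)
}

func parseEnumCase(_ p: Bytes, end: Bytes) -> (SimpleEnumCase, Bytes)? {
    guard let (ident, afterIdent) = eatWhile(p, end: end, isWordByte) else { return nil }
    // 61 is "=" in ASCII; without it the case has no explicit value.
    let q = skipWhiteSpace(afterIdent, end: end)
    guard q < end, q.pointee == 61 else {
        return (SimpleEnumCase(ident: ident, value: nil), afterIdent)
    }
    // The value runs until "," (44), "}" (125) or white space.
    guard let (value, afterValue) = eatWhile(q + 1, end: end, { $0 >= 33 && $0 != 44 && $0 != 125 }) else {
        return nil
    }
    return (SimpleEnumCase(ident: ident, value: value), afterValue)
}

// Helper for the example only.
func enumCase(_ text: String) -> SimpleEnumCase? {
    let bytes = Array(text.utf8)
    return bytes.withUnsafeBufferPointer { buffer -> SimpleEnumCase? in
        guard let base = buffer.baseAddress else { return nil }
        return parseEnumCase(base, end: base + buffer.count)?.0
    }
}

if let c = enumCase("Green = 4, Blue") {
    print(c.ident, c.value ?? "no value") // Green 4
}
```

Storing the value as a string keeps the parser dumb: interpreting the number is left to whoever uses the model.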
struct MetaData {
    let values: [(Ident, ValueLiteral?)]
}
MetaData is a way to add some meta information to your definitions in the FlatBuffers IDL. It consists of an array of key-value pairs, where the value is optional. The key is represented by an Ident and the value by a ValueLiteral.
To complete the deep dive, I want to briefly explain the logic behind converting text into an instance of Enum, which you can find here:
As always we receive a pointer and the length.
First we try to construct as many Comment objects as we can, putting them all into an array. This means that comment statements before the actual enum belong to the enum definition.
Then we try to eat the enum string. If we are not successful, we return nil. This tells whoever wanted the enum to be constructed that there is no Enum definition at the pointer it provided to the Enum construction function, so it can try to use the same pointer to construct something else.
In case we could eat the enum string, we need to construct an Ident as the enum name, then eat the : character and construct the Type. Then we try to construct MetaData. In this case it is fine if we get nil, because MetaData is optional.
We continue by eating the { character and constructing an array of EnumCase objects.
Then we eat the } character and we are basically done. If all this yummy byte eating was successful, we can construct an Enum object and return it together with the pointer to where we stopped. This way the caller gets the Enum instance and a pointer to the place where it can continue parsing.
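The whole sequence can be sketched end to end in a heavily simplified form. This is not the code from the repository: it skips comments, metadata and case values, stores names as plain strings, and passes an end pointer instead of a length. SimpleEnum, eatWord, parseEnum and parse are hypothetical names.

```swift
typealias Bytes = UnsafePointer<UInt8>

// Simplified model: no comments, no metadata, no case values.
struct SimpleEnum {
    let name: String
    let typeName: String
    let cases: [String]
}

// Skips white space, then eats `s`; returns the position after it or nil.
func eat(_ s: String, from p: Bytes, end: Bytes) -> Bytes? {
    var p1 = p
    while p1 < end && p1.pointee < 33 { p1 += 1 }
    for b in s.utf8 {
        guard p1 < end, p1.pointee == b else { return nil }
        p1 += 1
    }
    return p1
}

func isWordByte(_ b: UInt8) -> Bool {
    let isLetter = (b >= 65 && b <= 90) || (b >= 97 && b <= 122)
    let isDigit = b >= 48 && b <= 57
    return isLetter || isDigit || b == 95
}

// Skips white space, then eats letters, digits and underscores.
func eatWord(from p: Bytes, end: Bytes) -> (String, Bytes)? {
    var start = p
    while start < end && start.pointee < 33 { start += 1 }
    var p1 = start
    while p1 < end && isWordByte(p1.pointee) { p1 += 1 }
    guard p1 > start else { return nil }
    let bytes = UnsafeBufferPointer(start: start, count: start.distance(to: p1))
    return (String(decoding: bytes, as: UTF8.self), p1)
}

func parseEnum(from p: Bytes, end: Bytes) -> (SimpleEnum, Bytes)? {
    // No "enum" keyword here: the caller may try another declaration at `p`.
    guard let afterKeyword = eat("enum", from: p, end: end) else { return nil }
    guard let (name, afterName) = eatWord(from: afterKeyword, end: end),
          let afterColon = eat(":", from: afterName, end: end),
          let (typeName, afterType) = eatWord(from: afterColon, end: end),
          var cursor = eat("{", from: afterType, end: end) else { return nil }
    var cases: [String] = []
    while let (caseName, afterCase) = eatWord(from: cursor, end: end) {
        cases.append(caseName)
        cursor = afterCase
        // A comma means another case may follow.
        guard let afterComma = eat(",", from: cursor, end: end) else { break }
        cursor = afterComma
    }
    guard let afterBrace = eat("}", from: cursor, end: end) else { return nil }
    return (SimpleEnum(name: name, typeName: typeName, cases: cases), afterBrace)
}

// Helper for the example only.
func parse(_ text: String) -> SimpleEnum? {
    let bytes = Array(text.utf8)
    return bytes.withUnsafeBufferPointer { buffer -> SimpleEnum? in
        guard let base = buffer.baseAddress else { return nil }
        return parseEnum(from: base, end: base + buffer.count)?.0
    }
}

if let e = parse("enum Color : byte { Red, Green, Blue }") {
    print(e.name, e.typeName, e.cases) // Color byte ["Red", "Green", "Blue"]
}
```

Every step either hands back a new pointer or gives up with nil, so the function reads almost like the grammar rule it implements.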
If you want to see it all work, please have a look at the tests for Enum:
So this is all I have to share with you. I don’t think this way of parsing is smart and I also can’t recommend it to everyone. But it does have a shallow learning curve and the abstraction is quite thin: you have a stream of bytes and you eat them in order to produce instances of your model. Your model is your grammar; there is no code generation, no monads and no library to learn. It’s about you figuring out what your model is and how you can eat text in order to produce it. I still would not recommend it for complex parsing tasks, but for the “simpler” ones it is quite fun.
PS: The example is in Swift but I also used this technique in a C# project 😉