Dumb way to parse

This is a bad advice, use it on your own risk

Maxim Zaks
7 min readAug 27, 2018

The smart way to parse text is to use a tool like ANTLR, or yacc or parser combinator or even Xtext. However, being smart means studying those tools and frameworks. What if you don’t want to study, what if you just want to write a parser without investing hours on learning the “right” tool. Than you should continue reading as this is exactly what this post is about.

Look ma I can parse text!!!

Normally we don’t parse text just for kicks. We parse, because we need to extract some data out of a text representation. In this article I will use: https://github.com/mzaks/FlatBuffersSwiftCodeGen

as an example of how you can role your own parser in Swift. My code is not super robust and clean, but hey, it’s dumb way to parse article. The intention is to provoke you to learn from my mistakes and do better 😉.

Also if you are interested in a full dive into text parsing terminology and techniques, I can recommend following read:

Anyways, let’s start with the dumb way of parsing. First of all we need to understand that a text is a one dimensional array of numbers. Based on different encodings it can be different numbers representing the same text. So in order to be able to parse text we need to agree on one encoding. I define that text I want to parse is represented in UTF-8.

Now, the correct way of parsing text is based on two phases, lexing and parsing (for more details read “A Guide To Parsing” article I mentioned before). I say screw it, we are in the dumb way to parse, so we will have only one phase which we call eat 🤤.

Please have a look at following file:

In there we have an eat function which takes:

  • a static string — the string we want to eat
  • a pointer into the memory where we want to start eating from
  • and the length, which tells us how far we might want to eat

It returns an optional pointer, meaning that if we were able to eat the string, we return the pointer where we stopped eating, if we were not able to eat, we return a nil.

However before we start eating the string itself, we eat white space. White space are characters, which do not have any semantical meaning for humans. If we have a look at character encoding table for first 128 characters in ASCII, which is equal to UTF8:

ascii-code-tabelle-new-excellent-ascii-home-design-symbols-generator-code-table-of-ascii-code-tabelle.png

We see that first 0..<33 are non visible characters and can be considered as white space. Hence the dumb implementation of the eatWhiteSpace function:

public func eatWhiteSpace(
_ p: UnsafePointer<UInt8>,
length: Int
) -> UnsafePointer<UInt8>? {
var p1 = p
while p1.pointee < 33 {
p1 = p1.advanced(by: 1)
if p.distance(to: p1) > length {
return nil
}
}
return p1
}

It advances through numbers which are smaller than 33 and returns the pointer to the first number which is higher or equal 33. It also checks that we don’t eat too much. Eating should be a safe experience!

Speaking of safety first, if you already decided to go with dumb way of parsing, you should at least unit test your parser, which is btw. a really pleasant experience. Here are some unit tests for our eating logic:

As I mentioned earlier, we don’t parse text just for kicks. We want to extract data out of the textual representation. But before we do the data extraction, let’s define the data structures which will hold the data. In my example I am parsing the IDL (interface definition language) of FlatBuffers. The grammer for FlatBuffers IDL can be found here:

It’s quite a mouthful and also not always up to date. So let’s pick a somewhat simple declaration out of it in order to see how we can parse it in the dumb way.

Let’s pick the enum_decl — it’s the declaration of an enum type in FlatBuffers. First we want to define the data model for the enum declaration:

struct Enum {    
let name: Ident
let type: Type
let cases: [EnumCase]
let metaData: MetaData?
let comments: [Comment]
}

Here we see that an Enum has a name which is Ident it has a type, an array of cases, metadata and array of comments.

Let’s start with the later, let’s see what Comment is all about:

Comment is a simple struct which hold only a string value. It implements the ASTNode protocol which forces us to implement the parsing logic as a static method. This method receives pointer and length and returns an optional tuple:

  • instance of the comment
  • a pointer to the place where we stoped eating characters

It is an optional tuple because it returns nil if we were not able to produce the comment. Now not being able to produce a comment might be a legitimate thing or an exception. I guess I would be better of, if this method would throw, but I decided to just raise and crash internally. It’s up to you, to make your implementation better.

Inside of the function we eat // string (don’t forget that eat internally also eats the white space before the string) and than we will advance in the string until we hit 10 13 0 or the end of the string. The numbers represent end of the line or end of the zero terminated string. Now we have two pointers, one pointing to the string after // and one when we stoped consuming. The numbers in between need to be transformed into a String instance and will become the value of our Comment.

Here are the unit tests which check if we are able to produce a Comment instance to our liking.

Let’s have a quick look at other model classes we reference in Enum:

struct Ident {
let value: String
}

Very simple struct, just captioning the name as a string. An ident can start with a lower / higher case letter or _ character, followed by letters, _ characters, or numbers. The parsing logic can be found here:

struct Type {    
let scalar: Scalar?
let vector: Bool
let ref: Ident?
let string: Bool
}

Type is a relativley complex construct which represents the value type. In case of Enum in FlatBuffers we define what number type the enum corresponds to:

  • byte or uint8
  • short or uint16
  • etc…

Scalar types are represented as a swift enum. A reference to another complex type like Table, Struct Enum or Union are represent through their Ident. I defined string as special case and every type can be a vector, additionally to being scalar, string or ref. As I mentioned before a bit more involved example, but it is not that important in regards to parsing.

struct EnumCase {    
let ident: Ident
let value: ValueLiteral?
}

EnumCase has a name/ident and an optional ValueLiteral. When you define enum cases in FlatBuffers IDL you can optionally add an = 4 expression which says that this case equals to 4. This is what ValueLiteral is all about. It just stores the string after = character.

struct MetaData {
let values: [(Ident, ValueLiteral?)]
}

MetaData is a way to add some meta information to your definitions in FlatBuffers IDL. It consists of an array of key value pairs, where value is optional. The key is represented with Ident and the value with ValueLiteral.

To complete the deep dive, I want to briefly explain the logic behind converting text into an instance of Enum which you can find here:

As always we receive a pointer and the length.

First we will try to construct as many Comment objects as we can, putting them all into an array. This means that comment statements before the actual enum, will belong to the enum definition.

Than we try to eat enum string. If we are not successful, we return nil. Meaning that who ever wanted the enum to be constructed knows that there is no Enum definition at the pointer it provided to the Enum construction function. So it can try to use this same pointer to construct something else.

In case we could eat enum string, now we need to construct an Ident as an enum name, than we need to eat : character and construct the Type. Than we try to construct MetaData. In this case it is fine if we get a nil, because MetaData is optional.

We continue with eating { character. And constructing an array of EnumCase objects.

Than we eat } character and we are basically done. If all this yummy byte eating was successful, we can construct an Enum object and return it with the pointer where we stoped. This way the caller will get the Enum instance and a pointer to the place it can continue parsing.

So this is all I have to share with you. I don’t think this way of parsing is smart and I also can’t recommend it to everyone. But it does have a shallow learning curve, the abstraction is quite thin — you have a stream of bytes and you are eating those in order to produce instance of your model. Your model is your grammar, there is also no code generation, no monads and no library to learn. It’s about you figuring out what’s your model and how you can eat text in order to produce it. I still would not recommend it for a complex parsing tasks, but for the “simpler” ones, it is quite fun.

PS: The example is in Swift but I also used this technique in a C# project 😉

--

--

Maxim Zaks
Maxim Zaks

Written by Maxim Zaks

Tells computers how to waste electricity. Hopefully in efficient, or at least useful way.

No responses yet