Data Structures First

programming opinion archive 16 Dec 2015

Update: As of January 2019, my views on this topic have changed a lot. This post is still quite valid though: the visualization of the data structures is the stepping stone that makes one able to better grasp the problems, solutions and goals at hand.

These statements are profound. They are usually quoted as pieces of wisdom, in which the authors provide sufficient authority to prevent challenging.

git actually has a simple design, with stable and reasonably well-documented data structures. In fact, I’m a huge proponent of designing your code around the data, rather than the other way around, and I think it’s one of the reasons git has been fairly successful […] I will, in fact, claim that the difference between a bad programmer and a good one is whether he considers his code or his data structures more important. – Linus Torvalds on Git.

Show me your flowcharts [code], and conceal your tables [schema], and I shall continue to be mystified; show me your tables [schema] and I won’t usually need your flowcharts [code]: they’ll be obvious. – Fred Brooks, “The Mythical Man Month”, ch 9.

I want to provide at least one line of reasoning that proves them right. In the process I will avoid simplistic examples - it will make it harder to understand the concepts I put forward, but it will not give the feeling of having understood a concept that you never manage to put in practice by yourself.

So, what do I mean by “data structures first”?

I mean that data structures, their types, values, and interdependence define the core of your application.

God module anti-pattern is a perfect example of what could happen when you don’t think your data structures first: because most developers code by modifying code instead of extending code, they tend to keep the changes close to each other. Regular modifications, such one condition here and one loop there.

Also, they tend to just complement classes and variables structures to accommodate just that small change.

That’s when inheritance comes and bites us back. A deep inheritance tree, in the name of Code Reuse, actually is God module anti-pattern in disguise. Children classes not only implement their specifics, but bring along with them all the code and data from the parent classes.

The intuitive approach to the concept

Thinking about data structures first means to catalog into a set programming language data constructs all the necessary knowledge (sometimes referred to as “state”, but this name is wrong) to execute the application.

It does not mean design up-front. Design up-front means to know all about that’s to know about an application, and start the Death March of Implementation.

Data structures first aims to fold the current knowledge of the application into a set of data structures, and whenever you have a better knowledge of your application, you reset this catalog - either appending new variables, fields or attributes, or removing them because they are not needed to express the knowledge needed to achieve its goals.

The resulting code executes itself accessing and modifying this catalog of data structures - and except for local state - it constrains itself to this dataset.

This paradigm blurs the line between OO and Procedural programming. What is the difference between a Class which keeps track of its internal state, bounded by a set of attributes, from a set of procedures and functions that constrain themselves a closed dataset?

A combinatorial approach to the concept

The problem with the intuitive approach, is that is does not demonstrate what happens if you just keep adding execution paths to your code and, along the way, just extending data structures with state variables - state, not knowledge.

Question: What is it easier to think about?

A) 5 items

B) 125 items

This is easy. Due to brain limitations, we can remember easily 5 items, but not 125 of them.

Using 5 names, you can name 25 people by combining pairs of names. You can name 125 people by combining names in three. And so on.

That’s the problem. Source code is little more than just an exercise of combining variables to produce behavior and output.

If you ever thought why you were always advised against long functions, that’s the reason. Longer codes means you are combining more variables, and more combinations you get, harder it is to deduce which combinations (“execution paths”) are actually taking place.

On the other hand, if you know that you have five variables, describing the world the way they do, you do know beforehand which part of the world they are referring to.

That’s why variable naming is important, whenever we are confronted with code we can’t understand, we fallback to variable names to deduce what’s the code is all about. We don’t know what it does, but we should be able to extract some meaning out of variable names.

As a matter of fact, just extracting variable names, from the innermost loop outwards, of an existing function will immediately give an idea of what it does.

Few examples for fun. Can you deduce what these functions do?

A) i, j, k, token, valid

B) age, gender, person, people, query, dbh

C) ptr, i, maxphysaddr

D) x, y, lesser, p, greater

(FP hackers will deduce D very easily)

Individual variables convey very little, however when they are packed in structs or classes, they together convey a lot of the knowledge they are supposed to represent. “ptr” means almost nothing. But “ptr, i, maxphysaddr” clearly makes you understand that a ptr (pointer) is being iterated (iota) up to the maxphysaddr.

Code is data too

This is particularly obvious for LISPers. Consider this LISP example

(sum A B C)

Can you figure out what types are for A, B and C? Of course, it is impossible without more context. But the great aspect here, namely in LISP, is that these parameters of sum, can also be functions.

Thus, in a function comprising of ptr, i, maxphysaddr and action - action can be either dictate the execution path, or it can be the execution itself (through induction).

Data structures can also be made of code. The core of your application can be described also in terms of codes: given a set of code as data, and a very strict context around them, it will be trivial to reason what they are about and what they should be doing.

Implicit and Anonymous data structures

If you are following the reasoning so far, you might feel tempted to dismiss a whole lot of what has been discussed here by noting that data structures are not only global-scoped entities. That’s true. Data structures are not only global-scoped entities, nor also are always named.

Take this PHP code:

for ($allValues as $category => $values){
  for ($values as $k => $v) {
    $storage[$category][$k] = sum($v);

Is it possible to tell apart, clean and without any possibility of mistake, when $storage structures ends and its content starts? Is it $category referring to potential attributes of $storage? Or is it $category just a value in an vectorial attribute of $storage?

Implicit data structures are a menace to the transcription of an application knowledge into code. They are cheap and handy, and it feels just way too comfortable not use them: make a new struct or class just for a loop seems like an exaggeration.

That’s when anonymous data structures come to rescue, if they are available - in this example, Go:

items := []struct{
        Category string
        Total float
        {"Category A", 10.0},
        {"Category B", 20.0},
        // so on

How to do it in practice?

Unfortunately, there is not a single method that will universally work. There is one however that work in most situations, and from which you can try to derive your own.

Step 1: think about a functionality of your application.

Step 2: write down, directly into code, all the needed variables to execute it. Ensure these variable have meaningful names.

Step 3: check how these variables can be grouped together into data structures and how they interact with each other.

Step 4: document these data structures, and their interactions.

Rinse and repeat until all known functionalities are thought through.

Finally, implement your code using them.

Some caveats:

  • We don’t care about local variables. They are short lived, and very likely they are only needed to address some branching in code.

  • We do care about global values. If you have some global variable or structure, double check your knowledge of the application. They can exist, but also they might create a big problem of what’s changing them.

  • We do care about visibility. If a part of a data structure is made accessible to others, others will access it. Essentially corrupting the knowledge it encapsulates.

  • Synchronization and Coordination are knowledge too. So mutexes, conditional variables, wait groups and whatnots are also important to be written down.