Oversight: supervision trees for Go

programming go erlang 25 Nov 2018

I have a hidden love for the Erlang ecosystem. Erlang was talking about concurrency and distributed programming back when processors were still evolving according to Moore’s law. People often resist Erlang because of its syntax - but once you learn to grok its pattern matching, it is a fascinating language for building critical software.

By the way, if the syntax seems strange to you, take a look at Elixir. It offers all the niceties of Erlang with the readability of Ruby. Erlang’s creator himself has spoken approvingly of Elixir.

Of all the things the Erlang standard library offers, the one I am most fascinated with is supervision trees. It is one of those features where the Erlang runtime shines at its best: it is robust, it is simple, and it makes you wonder how the rest of the world can live without it.

So what are Erlang’s supervision trees? From the documentation:

A supervisor is responsible for starting, stopping, and monitoring its child processes. The basic idea of a supervisor is that it is to keep its child processes alive by restarting them when necessary.

There are many similarities between Go and Erlang - both support CSP-style concurrency and green threads. More often than not, I found myself gluing green threads together when creating pipelines of channels (a.k.a. mailboxes). I decided to take on the task of porting Erlang’s supervision trees to Go after I read Jeremy Bowers’s post on supervision trees, in which he implemented a port of them in Go. Inspired by his post and implementation, I wrote cirello.io/supervisor - a free-form implementation of the concepts, deeply influenced by the public interface of Jeremy’s package.

Oversight is my second attempt at porting this feature to Go. This time around, I decided to be stricter in the port, keeping the same terminology and staying as close as possible to Erlang’s original implementation. Like the original, it supports:

  • multiple restart strategies - one for one, one for all, rest for one and simple one for one (see the sketch right after this list)
  • shutdown tactics - wait until done or wait until timeout
  • correctly handles the internal configuration to support a forest of supervision trees.
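
For instance, here is a rough sketch of how one of the other strategies would presumably be wired up, using the same oversight.Oversight, oversight.Processes and oversight.WithRestartStrategy calls that appear later in this post; oversight.OneForAll() and the two worker names are assumptions of mine, not the package’s documented example:

// Hypothetical sketch: restart every child whenever any one of them fails
// (one_for_all). oversight.OneForAll() is assumed to mirror the
// oversight.RestForOne() constructor used further down in this post.
supervise := oversight.Oversight(
	oversight.Processes(workerA, workerB), // workerA and workerB are illustrative child processes
	oversight.WithRestartStrategy(oversight.OneForAll()),
)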

Unfortunately, Go does not handle the lifecycle of goroutines the way Erlang handles processes. You must account for the differences between the two runtimes. In practice, this means a few things:

  • Oversight uses the standard library’s context package to convey cancelation to the goroutines, and it expects them to honor the context cancelation channel (<-ctx.Done()) - see the sketch after this list.
  • The Go scheduler does not expose an interface for brutally killing goroutines. When a terminated child process takes too long to stop, Oversight detaches from it and moves on.
  • Oversight signals each child process to stop in the reverse order in which they were started. However, it cannot guarantee that the runtime will schedule their shutdowns in the order requested.
  • Unlike suture and like Erlang’s supervisor, cirello.io/oversight does not apply any time jitter on restarts.
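
To make the first point concrete, here is a minimal sketch of a well-behaved child process. The func(ctx context.Context) error shape is my reading of the snippets in this post, and the ticker loop is purely illustrative:

package pipeline

import (
	"context"
	"time"
)

// worker does one unit of work per tick and returns promptly once the
// supervisor cancels the context - which is all Oversight can ask of it.
func worker(ctx context.Context) error {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			// The supervisor signaled a stop: release resources and return.
			return ctx.Err()
		case <-ticker.C:
			// do one unit of work here
		}
	}
}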


Why another supervision tree library?

At first, both suture and my supervisor package seemed to serve me well. However, in a particular scenario I needed the rest for one restart strategy, and suddenly I found myself extending the business logic to stop itself on errors.

┌────────┐       ┌────────┐       ┌────────┐        ┌────────┐        ┌────────┐
│ Step 1 │──────▶│ Step 2 │──────▶│ Step 3 │───────▶│ Step 4 │───────▶│ Step N │
└────────┘       └────────┘       └────────┘        └────────┘        └────────┘

This graph is a simple representation of a waterfall-like pipeline I am implementing. For rather dull reasons, I wanted to make sure that each step would reset its internal state on failure. The nature of the task at hand means that I have to read data, process it, and keep (commit) the state in memory. So if step 2 hits an error, all following steps also have a dirty state in their caches. The right thing to do would be for step 2 to stash the entry that went wrong, choose another algorithm, and try again. At that point, all steps downstream would have to discard their temporary state, load the last known good one, and be ready to process the upstream data again.

Neither of them supported rest for one: github.com/thejerf/suture only supported the “one_for_one” strategy, and cirello.io/supervisor supported the “one_for_one” and “one_for_all” strategies (supervisor.Group). This scenario is a perfect use case for the “rest_for_one” restart strategy.

supervise := oversight.Oversight(
	oversight.Processes(step1, step2, step3, step4, ..., stepN),
	oversight.WithRestartStrategy(oversight.RestForOne()),
)

With this specific configuration, if step 2 fails, Oversight stops the processes from N down to 2, makes sure they have all stopped, and then restarts them in their original start order, giving each of them the chance of a clean restart.
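
And, for completeness, a sketch of how this tree might be driven, continuing from the snippet above and assuming - as in the package’s published examples - that the value returned by oversight.Oversight can itself be called with a context:

// Run the supervision tree until the surrounding context is canceled.
// supervise comes from the snippet above; ctx, cancel and the log call
// are only illustrative.
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
if err := supervise(ctx); err != nil {
	log.Println("pipeline supervisor exited:", err)
}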


Conclusion

Even with the limitations of the Go runtime (the lack of control over running goroutines), I decided that I could still get some of the benefits of supervision trees. These limitations are not deal-breakers, and they have mattered very little in the real-world cases I have faced so far.

Also, it is a good idea to have two implementations of a complex design. I expect that at some point in the future, I am going to be able to engage in some cross-pollination with other implementations of supervision trees.

This library is not a piece of shiny new technology that you should rush to use everywhere. Instead, it is a Go incarnation of a concept that, once you truly understand it, makes it evident when to use it and when not to.