Chris Dzombak

Failing Actors Reading Series

Toward the end of Fatal Error episode 42 (teaser), Soroush and I discussed the fault isolation section of Chris Lattner’s Swift concurrency manifesto.

I recently reread an older post about how Elixir (via the Erlang VM) provides fault tolerance, and I think there are some useful ideas here when thinking about how the “reliable” actor model Chris proposes should handle failures.

The manifesto proposes two options to handle an actor’s failure or crash:

Option 1. Provide a standard library API to register failure handlers for actors, allowing higher level reasoning about how to process and respond to those failures. …

Option 2. Force all actor methods to throw, with the semantics that they only throw if the actor has crashed. …

We plan to revisit this in detail in an upcoming episode of Fatal Error, but before we record it we’d like interested listeners to read the following articles and send us their thoughts and questions:

Please read those links and let us know your thoughts. We record tomorrow evening.

My gut feeling, having read these articles, is that something like Option 1 is the right way to handle actor failures. Option 2 shifts too much potentially complex responsibility to call sites, and those call sites are likely to delegate up to some more central, supervisor-like object to handle failures anyway.
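To make that concern concrete, here is a minimal sketch of what an Option 2 call site might look like. Everything in it is an assumption for illustration: plain classes stand in for actors, and `ActorCrashed`, `ImageCache`, `FailureCenter`, and `thumbnail(for:cache:failures:)` are hypothetical names, not anything from the manifesto or the Swift standard library.

```swift
// Hypothetical sketch: plain classes stand in for actors, and none of these
// names come from the manifesto or the Swift standard library.

/// Error signalling that the target actor has crashed (assumed shape).
struct ActorCrashed: Error {
    let actorName: String
}

/// Option 2 flavor: every method throws, and only because of a crash.
final class ImageCache {
    var hasCrashed = false

    func lookup(_ key: String) throws -> String {
        if hasCrashed { throw ActorCrashed(actorName: "ImageCache") }
        return "image-for-\(key)"
    }
}

/// Option 1 flavor: one central place that is told about failures.
final class FailureCenter {
    private(set) var reported: [String] = []

    func report(_ failure: ActorCrashed) {
        reported.append(failure.actorName)
    }
}

/// A call site under Option 2. It cannot do much locally, so it forwards
/// the crash to the central handler and falls back to a default value.
func thumbnail(for key: String, cache: ImageCache, failures: FailureCenter) -> String {
    do {
        return try cache.lookup(key)
    } catch let crash as ActorCrashed {
        failures.report(crash)
        return "placeholder"
    } catch {
        return "placeholder"
    }
}
```

Even in this toy version, the `do`/`catch` at the call site does little more than hand the failure to a central object, which is roughly the shape Option 1 starts from.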

But the exact proposal in Option 1 is too simplistic, especially for server-side code which may use distributed actors; we need either language-level or stdlib-level tools which allow linking dependencies between actors and specifying respawn/retry behaviors for actor groups. If the language or standard library doesn’t provide these tools, every application will end up reimplementing some subset of them, probably poorly.
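As a rough illustration of the kind of tooling I mean, here is a small sketch loosely modeled on Erlang/OTP supervisors, with a restart strategy per group and a cap on restarts. It is not a proposed API: `RestartStrategy`, `Worker`, and `Supervisor` are hypothetical names, plain classes stand in for actors, and a real design would also need to deal with concurrency, time windows for the restart budget, and escalation to a parent supervisor.

```swift
// Hypothetical sketch, loosely modeled on Erlang/OTP supervisors. Plain classes
// stand in for actors; none of these types exist in Swift or its standard library.

/// How a supervisor reacts when one child fails.
enum RestartStrategy {
    case oneForOne   // restart only the failed child
    case oneForAll   // restart every child in the group (they are linked)
}

/// A supervised unit of work, identified by name.
final class Worker {
    let name: String
    private(set) var generation = 0   // how many times it has been (re)started

    init(name: String) { self.name = name }

    func start() { generation += 1 }
}

final class Supervisor {
    private let strategy: RestartStrategy
    private let maxRestarts: Int
    private var restartCount = 0
    private(set) var children: [String: Worker] = [:]

    init(strategy: RestartStrategy, maxRestarts: Int) {
        self.strategy = strategy
        self.maxRestarts = maxRestarts
    }

    /// Adds a worker to the group and starts it.
    func supervise(_ worker: Worker) {
        children[worker.name] = worker
        worker.start()
    }

    /// Called when a child crashes. Returns false for an unknown child or once
    /// the restart budget is spent; a real system would escalate at that point.
    @discardableResult
    func childDidFail(named name: String) -> Bool {
        guard let failed = children[name] else { return false }
        guard restartCount < maxRestarts else { return false }
        restartCount += 1

        switch strategy {
        case .oneForOne:
            failed.start()
        case .oneForAll:
            children.values.forEach { $0.start() }
        }
        return true
    }
}
```

With `.oneForAll`, a crash in one worker restarts its linked siblings too, which is the behavior you would want when, say, an API worker holds state derived from a database worker. The point of the sketch is only that these policies are declared once, at the group level, rather than re-decided at every call site.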

As always, I welcome discussion and feedback; I’m @cdzombak on Twitter.