How I met Ecto's dynamic repos

#elixir #programming #sql #ecto #dynamic_repo #umbrella

Read time: about 10 minutes.
Certified AI-free. All mistakes were generated by hand!

Encounter number 1

The year is 2022. My team at the time is working on an upcoming functionality called Spaces. In a nutshell, Spaces will allow our customers to organise their contract templates into workspaces that they can then self-manage and share with other users on our platform, making collaboration much simpler. The secret sauce behind it is this new permissions system we've been working on. It essentially allows us to grant the given performer (in our case, a user) a set of abilities on a particular entity (in our case, a space), as illustrated below:

And because we plan on reusing the same permissions system for other areas of our platform (it is a generic mechanism, so we can relatively easily introduce new types of entities, performers, and abilities to it), we've decided to extract it into its own "library".

Umbrella apps enter the chat

At that time, our main backend codebase is organised as an Elixir Umbrella Project. In short, an umbrella project is a collection of Elixir applications that can be started and run together, and that can share some common configuration. Think of it as using Docker Compose to start and run a bunch of your apps at once. But without Docker. And using an Elixir mechanism instead. Kind of. If that sounds confusing, it's probably because it is. And judging by the large number of puzzled discussions about umbrella projects like this one, I guess a lot of people, like us, fell into the umbrella trap. We already had a couple of apps under our umbrella, and the decision quickly gets made to create a new one for our new permissions "library". I recall my spidey sense was tingling at the time, but having already way too much work on my plate, I didn't feel like digging further and pushing back. That was a mistake, and in hindsight, I should've. More about it later.

So here we are, with a brand new app under our umbrella (brella, brella, eh, oh). This is what it looks like, high-level:

Job done 🤝, time to go on with our lives and start using the new library.

Houston, we have a problem

A few weeks pass, the feature is shaping up nicely and the new permissions system is being put to good use. The integration between the APIs to manage Spaces and the permissions library is well underway.

As part of this effort, I spend the afternoon updating the create space backend API so that the user who initiated the creation flow is granted a set of default permissions as part of the operation. Here is, in very broad lines, what the code of that endpoint looks like:

# 'user_profile' represents the user who initiated
# the creation of the Space in the UI.

MainRepo.transaction(fn ->
  {:ok, space} = Spaces.Api.create_space(user_profile, params)
  {:ok, permissions} = Permissions.Api.grant(
    user_profile,
    space,
    [:manage, :manage_templates, :view_templates, :edit_templates]
  )
end)

The actual code is doing more stuff, like validation, error handling, and publishing a bunch of events on RabbitMQ, but those are not important here. What matters is, we create the space and grant permissions to the user within the same SQL transaction so that the operation is atomic from the database perspective. It either succeeds and gets committed, or fails and gets fully rolled back.

One thing worth mentioning is that both Spaces.Api.create_space() and Permissions.Api.grant() are already fully covered by their respective test suites. Both also create their own SQL transactions internally, mainly so that they can be called on their own in an atomic fashion, for example from the Elixir REPL. Nothing to worry about here though, as nested transactions are fully supported by our tools (Ecto + Postgres), as per the documentation:

If transaction/2 is called inside another transaction, the function is simply executed, without wrapping the new transaction call in any way. If there is an error in the inner transaction and the error is rescued, or the inner transaction is rolled back, the whole outer transaction is aborted, guaranteeing nothing will be committed.
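To make that quote concrete, here is a minimal sketch of the documented behaviour, assuming MainRepo is a started Ecto repo backed by Postgres. The :inner_failure and :done atoms are made up for illustration:

```elixir
# Sketch only: assumes MainRepo is a started Ecto repo backed by Postgres.
result =
  MainRepo.transaction(fn ->
    # The inner call does not open a second database transaction;
    # its function simply runs inside the outer one.
    inner =
      MainRepo.transaction(fn ->
        MainRepo.rollback(:inner_failure)
      end)

    # The inner rollback is reported to the caller as an error tuple...
    {:error, :inner_failure} = inner
    :done
  end)

# ...and the outer transaction can no longer commit, so `result`
# is an error tuple rather than {:ok, :done}.
{:error, _reason} = result
```

Note that both calls go through the same repo here, which is exactly the assumption the rest of this story ends up breaking.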

This is a behaviour I have tested several times in Elixir already. I still remember being bitten by a nasty piece of C# code involving nested transactions a few years back, and since then I have been overly cautious with these. Anyway, our code seems to be running properly, including error handling, and all tests are green. Time to push to main and call it a day, right?

Now, if you and I have worked together long enough, you'll know that I'm kinda paranoid when it comes to programming. By default, my brain is wired in the "it's broken unless proven otherwise" mode. I could have tested absolutely everything thrice, and still need to see the code run on production for at least a week or two before I can half-confidently acknowledge "OK, it seems to be working fine". Did you just feel it, that spidey sense tingling again?

"What if, hypothetically, an unexpected error was thrown right before the very end of the transaction?", I ask myself. Right here 👇:

MainRepo.transaction(fn ->
  {:ok, space} = Spaces.Api.create_space(...)
  {:ok, permissions} = Permissions.Api.grant(...)
  
  # An unexpected error occurs here 👈
  1 = 2 # boom 💥
end)

I mean, what could realistically go wrong? But "you don't know until you know", so I quickly modify the code and run the error handling tests again.

The transaction was rolled back... Check.
No trace of the space in the database... Check.
"Mmm that's weird". The permission data is still here! How is this possible 🤔?

"Let's run the permissions API test suite again, surely there must be something wrong there!" Green, and no data is left behind this time 🤯.

Around one hour of ~~cursing~~ probing later, I have the riddle partially solved. Remember that illustration from earlier?

Let's zoom in a bit to see what's really going on. Computer, enhance 🔍!

Figured it out already? The first illustration may have given us the false impression that those apps are just a bunch of domains from the same monolith accessing the central database. In fact, "monolith" was the very word we usually used to describe our backend within the company.

However, although neatly located next to each other in the file system, those apps, once started, are still independent OTP applications, each with its own supervision tree. Their underlying Ecto Repos are just as independent. Yes, all of them share the same database connection string. Yet at runtime each repo has its own, separate connection pool, meaning that a transaction opened on one repo does not cover queries sent through another.

Let's try to illustrate our previous code snippet. We have 3 transactions in total:

the top level transaction, created by MainRepo.
two nested transactions. The first one also created by MainRepo, while the second is created by PermissionsRepo.

So when things went boom 💥 right before the end of the top level transaction, the second nested transaction, handled by a different connection pool, was already complete and committed, thus not affected by the rollback! That explains why the permissions data remained in the database while all the rest was gone.

Plugging the hole

With our mystery now solved, one of my teammates and I set out to hunt for a solution. Not having the transactional guarantees we had taken for granted was obviously bad (isn't it one of the compelling reasons to go with a monolithic approach?). In reality, since the problem was impacting both our production environment and our automated tests, which were set up in a slightly different way, we were actually hunting for two solutions.

If my memory serves me well, it took us one full day of banging our heads and running in circles before we got there. Each attempt would end up with things straight up not working and throwing a bunch of SQL- or process-related errors. For our automated tests, we tried to mess with our test_helper.exs file and force the Ecto.Adapters.SQL.Sandbox utility to check out connections and roll back transactions from multiple repos at once, without success. For our production environment, we tried in vain to reuse the same MainRepo Ecto Repo across all our umbrella apps, but because of the way the code was set up, it resulted in circular dependencies. Working around that would have required a significant and risky refactor.

Hours passed by and I was personally starting to lose hope in finding a decent, cost-effective solution. This is when I accidentally stumbled upon this documentation: "Replicas and dynamic repositories".

Our lord and saviour, Ecto's dynamic repo

Turns out the solution to our problems was in the title of this post all along 😉. What a coincidence! And on top of that, a very simple solution to implement, although this was not necessarily obvious at first glance.

In essence, dynamic repos are a mechanism offered by Ecto to instruct a given repo to use the connection pool of another repo instead of relying on its own pool. And as the name says, this configuration can be done dynamically, at runtime. It is, for instance, entirely possible to set this behaviour for a specific piece of code only, like this:

PermissionsApp.Repo.put_dynamic_repo(AppA.MainRepo)
# PermissionsApp.Repo now uses AppA.MainRepo's connection pool.
PermissionsApp.Repo.update(workspace_changeset, opts)

PermissionsApp.Repo.put_dynamic_repo(AppB.OtherRepo)
# PermissionsApp.Repo now uses AppB.OtherRepo's connection pool.
PermissionsApp.Repo.delete(workspace_changeset, opts)

As you can see, it's very easy to tell our repo to use a different connection pool. And thanks to that change, transactions across those repos start to work as expected again ✨!
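One detail worth keeping in mind: put_dynamic_repo/1 only affects the calling process, and it stays in effect until it is changed again. If you only want the override for one piece of code, it can help to wrap it in a small helper that restores the previous value afterwards. Here is a minimal sketch, assuming a hypothetical DynamicRepoScope module (not part of Ecto) that relies only on the get_dynamic_repo/0 and put_dynamic_repo/1 functions every Ecto repo defines:

```elixir
defmodule DynamicRepoScope do
  @moduledoc """
  Hypothetical helper (not part of Ecto): runs a function with `repo`
  temporarily pointed at another repo's connection pool.
  """

  def with_dynamic_repo(repo, target, fun) when is_function(fun, 0) do
    # Remember what `repo` currently points at in this process.
    previous = repo.get_dynamic_repo()
    repo.put_dynamic_repo(target)

    try do
      fun.()
    after
      # Restore the previous target, even if `fun` raises.
      repo.put_dynamic_repo(previous)
    end
  end
end

# Possible usage, with the repos from the snippet above:
#
#   DynamicRepoScope.with_dynamic_repo(PermissionsApp.Repo, AppA.MainRepo, fn ->
#     PermissionsApp.Repo.update(workspace_changeset, opts)
#   end)
```

The try/after block is the important part: without it, an exception inside the function would leave the process pointing at the other pool.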

Now, as you can imagine, it would be tedious to manually call Repo1.put_dynamic_repo(Repo2) across our whole codebase. It would also be something developers would easily forget to do. Thankfully, Ecto thought about everything and provides a very convenient configuration option called default_dynamic_repo. So in our scenario, the solution proved to be a simple one-liner:

defmodule PermissionsApp.Repo do
  use Ecto.Repo,
    otp_app: :permissions,
    adapter: Ecto.Adapters.Postgres,
    # The magic one-liner that solved all our issues!
    default_dynamic_repo: AppA.MainRepo
end

default_dynamic_repo saved the day and we were able to enjoy the benefits of transactions once again, without having to go through a painful refactor. Going back to our original example, all transactions would now be rolled back and no data would be left behind ✅.

Encounter number 2

Fast-forward to November 2023. Another company, another team, another codebase. Rest assured, this one is going to be a short one 😅.

I am now working on an internal web app written with Phoenix LiveView. The app serves as a back-office to our main product for our Support and Customer Success teams. The initial approach was to grant the back-office app read/write access to the database. Over time the design was updated so that the app only has read access to the database, and any action or side effect must be performed by RPC-ing (remote procedure call) the main API. This has several benefits: not only do we avoid significant code duplication by having all actions (and related tooling) in one codebase, but the back-office app is also more secure as it cannot directly interfere with the database state.

And my current task is precisely to make our back-office Ecto Repo read-only, which can be easily achieved with another one-liner:

defmodule BackOffice.Repo do
  use Ecto.Repo,
    otp_app: :back_office,
    adapter: Ecto.Adapters.Postgres,
    # This is all it takes to make the repo read-only!
    read_only: true
end

To make things easier we'd still like to be able to perform some write operations against the database in our automated tests. Things like setting up some test data so that we can assert it is properly displayed by the back-office app without having to rely on mocked data.

Again, nothing too difficult: with Ecto we can simply define a new Repo solely dedicated to our tests, this time without declaring it as read_only: true, obviously:

# A new Repo dedicated to our automated tests.
defmodule BackOffice.TestRepo do
  use Ecto.Repo,
    otp_app: :back_office,
    adapter: Ecto.Adapters.Postgres
end

But wait, do you see the problem now? Exactly! We end up once again with two distinct connection pools, leaving us open to data not being fully rolled back in our tests if we are not careful. Luckily, armed with my past experience with dynamic repos, this time around the fix only took me a couple of minutes to apply and test:

defmodule BackOffice.TestRepo do
  use Ecto.Repo,
    otp_app: :back_office,
    adapter: Ecto.Adapters.Postgres,
    # Dynamic repos to the rescue!
    default_dynamic_repo: BackOffice.Repo
end

# in our "test_helper.exs" file:
Ecto.Adapters.SQL.Sandbox.mode(BackOffice.Repo, :manual)

This means that every time the sandbox rolls back a test transaction created against BackOffice.Repo at the end of a test, it also rolls back any nested transaction created against BackOffice.TestRepo, leaving no test data behind.

The cool thing is that the dynamic repo setup respects the read_only: true set on BackOffice.Repo, meaning that, although both repos use the same connection pool, only BackOffice.TestRepo is able to perform write operations 👌.
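For the curious, here is roughly what a test relying on this setup could look like. Treat it as a sketch under assumptions rather than our actual code: the test module name and the "customers" table (with a "name" column) are made up for illustration, and it assumes BackOffice.Repo is configured with the Ecto.Adapters.SQL.Sandbox pool in the test environment.

```elixir
defmodule BackOffice.SharedPoolTest do
  use ExUnit.Case, async: true

  import Ecto.Query, only: [from: 2]

  setup do
    # Check out one sandboxed connection from BackOffice.Repo's pool.
    # BackOffice.TestRepo defaults to that same pool, so its writes go
    # through this connection and are rolled back when the test ends.
    :ok = Ecto.Adapters.SQL.Sandbox.checkout(BackOffice.Repo)
  end

  test "data written through TestRepo is readable through Repo" do
    # "customers" is a hypothetical table used only for this example.
    BackOffice.TestRepo.insert_all("customers", [%{name: "Ada"}])

    names =
      BackOffice.Repo.all(
        from(c in "customers", where: c.name == "Ada", select: c.name)
      )

    assert "Ada" in names
  end
end
```

Writes go through BackOffice.TestRepo and reads go through BackOffice.Repo, yet both share the single connection checked out in setup.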

How I met Ecto's dynamic repos

We've seen how Ecto's dynamic repos were able to save our bacon in two different situations by allowing us to ~~bend the space time continuum~~ instruct different repos to use the same connection pool at runtime. This is definitely a very useful tool to have under your belt. Maybe this will save you some trouble the next time you are in a similar situation.

PS: You remember those pesky umbrella apps? A few months later a colleague of ours went on a mission to completely remove them in favour of... simple folders! This meant that we were back to a true monolith, and as a result we were able to remove all instances of dynamic_repo from the codebase 🙂!