Authzed builds SpiceDB, an open source fine-grained permissions database inspired by Google’s Zanzibar paper. In addition to developing SpiceDB, we also operate managed SpiceDB services: fully isolated, dedicated clusters or permission systems in our Serverless platform. The SpiceDB serverless platform is powered by CockroachDB, which fulfills the role played by Spanner in the paper: it’s the foundation that makes it possible to run geographically distributed workloads with linearizable consistency to prevent the new-enemy problem.
In this post we discuss database connection draining, a topic that is often overlooked but has a real impact on your application's availability. We will answer what connection draining is, why you should care about it, and how to handle it gracefully in your applications.
The examples focus on Go applications and CockroachDB in particular, but the fundamentals apply equally to other programming languages and database vendors.
Connection draining is not exclusive to databases, either: any backend system running in modern infrastructure should handle it.
Applications perform operations on databases over connections: they are the “pipe” through which applications “talk” to the database, and so are a critical part of every interaction.
Connections are typically a limited (and scarce) resource: most databases ship with a default limit on how many can be open at a given time, due not only to the network sockets required but also to the database’s own connection-management overhead. Keeping a large number of connections open can also degrade database performance, with throughput dropping as the connection count grows.
Opening a new database connection is often not cheap either, mainly due to network latency, the TCP/TLS handshake, and so on. Combined with the aforementioned limits, this forces developers to come up with clever tricks to amortize the time spent on the connection lifecycle and make more efficient use of connections that are already established.
Connection pooling is a well-known strategy where clients “lease” connections to perform an operation and return them to the pool as soon as they are done. The demand for efficient connection use is so common that proxies have emerged to offload this responsibility from client applications, like ProxySQL for MySQL or PgBouncer for Postgres.
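To make this concrete, here is a minimal sketch of client-side pooling with Go's standard database/sql package. The connection string, driver choice (pgx's stdlib adapter), and pool sizes are illustrative assumptions, not recommendations.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"
	"time"

	// Postgres-wire-compatible driver; also works with CockroachDB.
	_ "github.com/jackc/pgx/v5/stdlib"
)

func main() {
	// sql.DB is itself a connection pool: it opens connections lazily,
	// leases one per operation, and returns it when the operation ends.
	db, err := sql.Open("pgx", "postgresql://app@localhost:26257/appdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Bound the pool so we stay well under the database's connection limit
	// and keep a few idle connections warm to amortize connection setup.
	db.SetMaxOpenConns(20)
	db.SetMaxIdleConns(10)

	// This query transparently leases a connection from the pool and
	// returns it once the row has been scanned.
	var now time.Time
	if err := db.QueryRow("SELECT now()").Scan(&now); err != nil {
		log.Fatal(err)
	}
	fmt.Println("database time:", now)
}
```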
It turns out that handling connections in client applications is not trivial. When applications don’t do connection pooling:
On the other hand, while connection pooling can mitigate these problems, it comes with its own set of challenges:
Unsurprisingly, a lot of energy is spent on properly handling database connections. The goal of this post is not to define the ultimate “how-to database connection guide” (you’ll have to look elsewhere for that!) but to focus on a particular segment of this complex topic that does not get enough attention: connection draining.
Databases are not systems that you spin up and forget about:
These are just some examples. The gist is that the database process will eventually need to be terminated. So what does the DB do about all its precious connections upon termination?
Databases typically implement some sort of connection-draining phase during termination. This is a procedure put in place to allow client applications to gracefully handle connections being closed by the server. This grace period is needed because database communication protocols laid over TCP/IP are generally frontend-initiated (e.g. the Postgres spec), and the full-duplex nature of TCP means one side of the connection may be closed while the other remains open, without the other side knowing. The great Vicent Martí wrote at length about the perils of the database connection lifecycle when using the Go MySQL driver.
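The half-open scenario is easy to reproduce with plain TCP in Go. The sketch below (no database involved, purely illustrative) shows a client whose write appears to succeed even though the server has already closed its side; only the subsequent read surfaces the closure.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// A "server" that accepts a connection and immediately closes it,
	// standing in for a database that has shut down its side.
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	go func() {
		conn, _ := ln.Accept()
		conn.Close()
	}()

	conn, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		panic(err)
	}
	time.Sleep(100 * time.Millisecond) // give the server time to close

	// The write usually reports no error: the kernel buffers the bytes
	// even though the peer is gone, so the client is none the wiser.
	_, werr := conn.Write([]byte("SELECT 1"))
	fmt.Println("write error:", werr)

	// Only when reading the response does the closure become visible
	// (io.EOF or a reset), which is how "unexpected EOF" style errors
	// show up in database drivers.
	buf := make([]byte, 16)
	_, rerr := conn.Read(buf)
	fmt.Println("read error:", rerr)
}
```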
How connection draining works varies across vendors, but it typically involves the same phases:
Fortunately for us, CockroachDB offers an exemplary connection-draining procedure described in their docs.
Ignoring database connection draining procedures is a fairly common source of availability issues. Depending on the nature of the workload, problems can range from intermittent errors and small blips in availability to partial unavailability and breached SLOs.
From the application’s perspective, what we typically see is an error when performing a query on a connection (e.g. unexpected EOF in Go), even on a freshly leased connection.
This can make the situation quickly snowball:
Depending on the environment, connection draining can be relatively frequent, and these intermittent problems may start to eat up our error budget.
The first step is to look into your database instance configuration and determine what the connection drain timeout is. This parameter is the grace period databases give clients before starting to forcefully close connections, and it could be one or multiple knobs.
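For CockroachDB specifically, these knobs are cluster settings and can be inspected with SHOW CLUSTER SETTING. The sketch below assumes a locally reachable, insecure cluster and the pgx stdlib driver; adapt the connection string to your setup.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/jackc/pgx/v5/stdlib"
)

func main() {
	db, err := sql.Open("pgx", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// The drain-related settings discussed later in this post.
	for _, setting := range []string{
		"server.shutdown.drain_wait",
		"server.shutdown.connection_wait",
		"server.shutdown.query_wait",
	} {
		var value any
		if err := db.QueryRow("SHOW CLUSTER SETTING " + setting).Scan(&value); err != nil {
			log.Fatalf("showing %s: %v", setting, err)
		}
		fmt.Printf("%s = %v\n", setting, value)
	}
}
```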
Load balancers also need to be taken into consideration: as part of the request path, they too play a role in connection draining. Whether you run your own load balancers like HAProxy or use a managed service from AWS, Azure, or GCP, a drain timeout parameter will most likely be available. The timeout starts counting when the load balancer recognizes the backend as unhealthy, after which it stops forwarding new requests and lets in-flight requests complete for the duration of the timeout before forcefully closing them.
Naturally, the load balancer and backend drain timeouts must be aligned: the drain timeout of the backend database should not exceed that of the load balancer, or the latter may prematurely terminate connections while the backend is still draining.
Defining the drain timeout is an exercise in balance, as there is no single value that works for every application:
From the application-developer perspective, you probably want to favor a large drain timeout, but you’ll also likely have to dial it back depending on the limitations of the DBaaS of your choice, or requirements imposed by your database team.
Once you’ve settled on a connection drain value, you’ll need to configure your application’s database driver/pooler of choice.
Most mature implementations offer such APIs, like Go’s DB.SetConnMaxLifetime, which is part of the standard database/sql package.
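As a sketch (connection string and durations are placeholders, not recommendations), wiring this up in Go looks roughly like this; the idea is simply to keep every pooled connection younger than the server-side drain grace period so the pool recycles connections before the server starts closing them.

```go
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/jackc/pgx/v5/stdlib"
)

func main() {
	db, err := sql.Open("pgx", "postgresql://app@db.internal:26257/appdb")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Recycle connections before the server's connection drain timeout
	// elapses, so drains don't catch long-lived connections by surprise.
	db.SetConnMaxLifetime(4 * time.Minute)

	// Optionally also cap how long idle connections linger in the pool.
	db.SetConnMaxIdleTime(1 * time.Minute)
}
```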
You will want to validate your configuration in production by running a connection draining drill. This is not always possible (e.g. managed databases that do not offer this option), so you may have to wait until the next update (bummer!) or run a simulation in a local environment. For example, you could set up a docker-compose configuration that spins up several replicas of a database alongside your application and simulate the process by running the drain inside the container. We did exactly this to validate our drain configuration locally, and you can find it in our examples repository.
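A drill can be as simple as hammering the database with queries while one node drains and watching for errors. The sketch below assumes a local docker-compose setup with a CockroachDB service named crdb1 running in insecure mode; the service name, connection string, and cockroach node drain flags are assumptions, so check cockroach node drain --help for your version.

```go
package main

import (
	"context"
	"database/sql"
	"log"
	"os/exec"
	"time"

	_ "github.com/jackc/pgx/v5/stdlib"
)

func main() {
	db, err := sql.Open("pgx", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	db.SetConnMaxLifetime(4 * time.Minute)

	// Continuously issue queries so any error caused by the drain is visible.
	go func() {
		for {
			if err := db.PingContext(context.Background()); err != nil {
				log.Printf("query failed during drain: %v", err)
			}
			time.Sleep(100 * time.Millisecond)
		}
	}()

	// Trigger the drain on one node of the local cluster.
	out, err := exec.Command("docker-compose", "exec", "-T", "crdb1",
		"cockroach", "node", "drain", "--self", "--insecure").CombinedOutput()
	log.Printf("drain output: %s (err: %v)", out, err)

	// Keep observing for a while after the drain command returns.
	time.Sleep(30 * time.Second)
}
```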
In CockroachDB connection draining is handled by multiple parameters that configure the duration of the different phases of the drain procedure. Cockroach Labs offers excellent documentation on the topic, so definitely check it out.
server.shutdown.drain_wait is the parameter used to have CRDB nodes cooperate with the load balancer: it marks the node as unhealthy so the LB stops sending it new requests.

server.shutdown.connection_wait is the actual connection drain timeout, and the value you should align your driver/pool max connection lifetime with. It is a relatively recent addition to CRDB, only available from CRDB 22.x, so you may need to update your clusters.

server.shutdown.query_wait is the time the node waits for any in-flight queries to complete before forcefully closing their connections. Ideally, the application would have closed all connections by itself before we reach this phase.

One may be tempted to set these to relatively large values for the reasons discussed earlier, but there is a catch: the whole drain procedure must not exceed the drain timeout limit. This value hasn’t been well documented until recently and largely depends on how CRDB is operated. For CockroachDB Dedicated, the limit is 5 minutes and is not configurable. If the sum of the various parameters exceeds that, chances are connections will be terminated ungracefully.
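If you operate the cluster yourself, these are regular cluster settings and can be changed over SQL. A minimal sketch, assuming an insecure local cluster; the values are illustrative and discussed further below.

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/jackc/pgx/v5/stdlib"
)

func main() {
	db, err := sql.Open("pgx", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// SET CLUSTER SETTING does not accept placeholders for the setting
	// name, so the statements are assembled as plain strings here.
	settings := []struct{ name, value string }{
		{"server.shutdown.drain_wait", "30s"},
		{"server.shutdown.connection_wait", "4m"},
		{"server.shutdown.query_wait", "10s"},
	}
	for _, s := range settings {
		stmt := "SET CLUSTER SETTING " + s.name + " = '" + s.value + "'"
		if _, err := db.Exec(stmt); err != nil {
			log.Fatalf("applying %s: %v", s.name, err)
		}
	}
	log.Println("drain settings applied")
}
```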
In Go we should set DB.SetConnMaxLifetime to a value less than the configured drain timeout, specifically the value chosen for connection_wait.
And it is important that the sum of all parameters never exceeds the global drain timeout limit.
The recommendation is:
SetConnMaxLifetime < server.shutdown.connection_wait < sum(server.shutdown.*) < drain timeout limit
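Since the whole chain is just duration arithmetic, it can be worth encoding it as a small sanity check, for example in a test or a startup assertion. The values below are placeholders; the 5m drain limit is the CockroachDB Dedicated limit mentioned above.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Driver-side setting.
	connMaxLifetime := 4 * time.Minute // DB.SetConnMaxLifetime

	// Server-side settings.
	drainWait := 30 * time.Second     // server.shutdown.drain_wait
	connectionWait := 4 * time.Minute // server.shutdown.connection_wait
	queryWait := 10 * time.Second     // server.shutdown.query_wait

	// Global limit imposed by how the cluster is operated.
	drainLimit := 5 * time.Minute

	sum := drainWait + connectionWait + queryWait

	// The two constraints that can actually be violated: the driver must
	// recycle connections no later than connection_wait, and the whole
	// server-side drain must fit under the global limit.
	ok := connMaxLifetime <= connectionWait && sum < drainLimit
	fmt.Printf("server.shutdown.* sum: %s, constraints hold: %v\n", sum, ok)
}
```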
A few things to note:
We’ve experimented with different values and we found the sweet spot for our production workloads was:
drain_wait: 30s
connection_wait: 4m
query_wait: 10s
The above configuration gives Cockroach Labs’ managed load balancers a generous 30s to react to nodes being marked as unhealthy.
Given the upper boundary of 5m, we assigned most of the available “drain budget” to connection_wait and made sure to leave some padding (the sum of the three flags is less than 5m).
Finally, we configured SpiceDB with the flags that control SetConnMaxLifetime and SetConnMaxIdleTime (datastoreConnMaxLifetime and datastoreConnMaxIdletime, respectively) to the value of connection_wait, in this case 4m.
Remember to test the draining process. Cockroach Labs unfortunately does not offer a means to drain your dedicated Cockroach instances, so we had to rely on simulations: in our local environments via the aforementioned docker-compose example, and in our stage environment, where we operate our own CRDB cluster using their open source Kubernetes operator. This worked for the most part, except for the misleading documentation of the global 5m drain limit, which has recently been fixed in the docs.
Have a question? Check out the Discord, where we and the community are discussing all things SpiceDB