Surviving Complex Database Schema Migrations

One of our current development efforts requires substantial changes to our PostgreSQL schema. This is tricky to handle when serving millions of requests each day. I’ll explain our current development project and what we’re doing to minimize service interruptions and overhead.

At Userlike we offer live chat for embedding into websites. When no operators (people who answer chats coming in from web visitors) are available, the chat turns into a “leave a message” function. In our existing implementation, these messages are stored separately from chats.

To improve code reuse and to pave the way for a truckload of new features we’re planning, it makes more sense to combine the data representation. The current two separate tables are joined into one new table. Another new table represents some of the chat conversation’s structure.

Schema woes

I like enforcing a fixed schema like that: less potential for surprises and PostgreSQL can enforce a certain level of data consistency right in the database.

What I don’t like, though, is writing backend code that has many conditionals for different versions of the schema. Changing all the queries referring to our old tables is bad enough already. Adding this kind of conditional code affects many moving parts at many different times and that’s just asking for trouble. Off the top of my head:

  • While developing the new feature/code, I can’t just wildly replace queries. Instead, I have to keep the old code and add extra code to account for the new schema. In schemaless databases I have no choice but to do this dance, by the way. For instance, if there’s a new field that is mandatory for new records, I still have to handle old records that are missing this field.
  • If something is working as intended, I don’t know whether it’s working as intended for the right reasons. Maybe the data record got updated correctly, or maybe it’s still the same and the transition code is hiding that from me. In short, adding more branches to my code compounds the state space explosion problem. On the other hand, changing the schema at a fixed point in time keeps the state space smaller. The migration logic can be tested individually.
  • Finally, to prevent code bloat, the old code needs to be cleaned out in a separate step, once all the data has been updated. Even if you’re a wizard of revision control, this is a tedious and/or error-prone task.

If we want to avoid this transition approach, we’re facing a dilemma: you can’t change the schema before you update the code, and you can’t update the code before you change the schema. What to do?

The slow but safe way

Traditionally, we would handle schema migrations by taking all services down during the upgrade and running a migration script to apply the schema and convert our data before bringing the site back online. This is fairly easy to implement and perfectly safe… but transforming the data can take a lot of time and that means a lot of downtime.

Case in point: recently we introduced a new Organizations feature for our customers on our top-tier subscription plan. It allows them to subdivide their account into largely independent units. This way, subsidiaries can be managed in a single account but each gets its own dashboard. To achieve this, many of our tables needed to start referencing an organization. The migration affected almost all of our tables and it took well over 45 minutes to finish (on fairly powerful servers).

We did choose a time of generally low traffic, so that we’d minimize service interruptions for our customers. Still, they’re spread over many different time zones and you can’t please everyone. So, we decided to get rid of such extreme downtimes altogether. We turned that goal into some first principles for future migrations:

  • It must be possible to perform the bulk of the migration live.
  • Adding transition code to the live version prior to deploying the actual upgrade should be avoided. Firstly, we don’t have to bother removing it again and secondly, it reduces the risk of creating regressions. We do a lot of QA for releases, keeping unplanned changes to a minimum because we like our updates boring: deploy it and it works. That’s a worthy ideal if you ask me, and we’re not all that far away from it.

We decided on a three-stage design to make this possible: extend schema – iteratively process – constrain schema.

1
Adding new schema elements

The first step is to make room for the new data without affecting anything used by the existing code, i.e. adding new columns and tables but not modifying or removing existing columns. We can do this at any point before we touch the bulk of the upgrade, if we hold off on adding data constraints for now that would interfere with production use.

For instance, when implementing a feature like Organizations, we would add an organization_id column to many tables but defer the NOT NULL constraint to a later stage, making the column’s values optional for now. We are also leaving out default values when creating new columns in an existing table, so that the database does not have to update all records when adding the new column, making the operation complete much faster.

If we want to change a column’s type or restructure its contents, we add a temporary column for the new data in this stage so we can populate it without affecting our production code. Suppose we wanted to change an extra_data column’s internal format from XML to JSON, then we would create a new extra_data_json column that we can populate with the converted data. In the last stage, we can remove the original column and rename the new one to extra_data. Fortunately we’ve never actually used XML in our database, so our sanity remains intact.

We are going to need two more columns: a general update timestamp and a migration timestamp.

  • The update timestamp refers to the last time a record was changed by anything in our system. The schema and code for this is deployed once, before any of the subsequent steps, and it can be reused for any future migrations.
    We generally want to avoid these kinds of changes, but this is a particularly safe change: Django’s ORM allows us to automatically keep the column up-to-date by adding a DateTimeField definition and a pre_save hook that injects the current time. Nothing else needs to touch or depend on the new column. Since we can reuse it in future migrations and perhaps even use it for new features later on, it’s not exactly “dead code”.
  • The migration timestamp refers to the last time we updated the new columns or associated new tables. It’s temporary and never referenced by anything except our migration script. It will be removed again in stage 3.

One interesting case is adding tables that will contain required data related to existing records in another table.

When we added Organizations, we created a shiny new organizations table. We wanted to avoid separate code for customers with vs. without organizations, so we also created a “default” organization for each existing customer and referenced its ID in all of their data.

Each customer record refers to the corresponding default organization, via a new organization_id column, so there is a nice one-to-one mapping we could use in our migration scripts: if a customer did not have a value for organization_id, we created an organization and added the ID to the customer record in the same transaction. If it did have a value, we simply updated the existing organization as necessary.

Our current project, merging our chat and offline tables, is a similar situation. In stage 1, we created the new table and added a migration timestamp. This table has a column containing the old primary key and a type indicator so that we know which record used to belong to which of our existing tables. In stage 2 we will use that information to match them up and treat them as a unit. For example, given a chat, if we find a record in the new table with type “chat” and the ID of our chat, we can re-migrate that record in the new table, otherwise we create one there.

Finally, new indexes on columns slated to become NOT NULL-constrained should be added now, before any data exists in their corresponding column(s), as partial indexes covering only non-NULL values. Since we don’t have any actual values yet, this way the create operation will consume much less memory and won’t have to actually sort and write millions of index entries. The database will probably still do a full table scan, though, to find any non-NULL values (even if we know there aren’t any).

Should one use CREATE INDEX CONCURRENTLY to avoid a table lock? For any crucial or very large table, probably yes, even though concurrent indexing has a lot more overhead than the normal message. In our case, the chat and message tables are read and updated approximately everywhere in our system and we probably don’t want all transactions on them to block while we wait for index creation to scan millions of records.

2
Incrementally updating data

In the time between stage 1 and the final deploy, we can periodically update data that has accumulated up to that point. The idea here is simple: we have the update timestamp to remember when a record was last touched by our application, and we know when we created the migrated/process data for it thanks to the migration timestamp. Now we can always easily query for all the data that has never been converted before, or has had a new update in the meantime.

Here’s how the incremental migration works, in batches of 1,000 or so to avoid RAM overload:

  1. For any table that needs migrating, fetch all records that have no migration timestamp or one that is older than the update timestamp of the record. Use joins if it’s a new table that is related to an existing table.
  2. Process the records. Many operations are too complex to be done directly in the database. This is why we are fetching the data instead of updating it directly.
  3. Update the records, including a new migration timestamp. To avoid a race condition, it’s important to use the time of requesting the data in step 1, not the time of updating it: if our application updates the data after we fetch it but before we update it, it would get an update timestamp that is lower than the migration timestamp we create afterwards, so we would be fooled into thinking our migration data is up-to-date when it clearly isn’t.

The nice thing about this is that we can stop the script at any time, for example if we notice load issues on the server, and resume it later. The timestamps ensure that we can always skip over the bulk of the data that is already migrated and up-to-date. Occasionally we will process a record a second time even though it was not updated, if the backend updated the record between the start time of our transaction and the time we fetched the record in step 1. Better safe than sorry, though!

We can run this whenever we like, as often as we like, plus a final run during the brief downtime. This last run will be very short and so won’t really affect our downtime because virtually all of the data has already been migrated by now.

3
Cleaning up the schema

Once the new code has been deployed, any temporary and obsolete columns can be removed. If we created a temporary column slated to replace an existing column, we do that now, too. Deleting and renaming columns is done much faster than copying them because the database doesn’t have to write new data blocks, but only mark blocks as freed and update table metadata. The column containing the migration timestamp can now also be deleted.

Finally, we can apply all those data constraints we left out in stage 1 and generally massage the schema until it matches the goal state. Technically all of this can be done after the downtime, but since these operations tend to make liberal use of table locks, that probably isn’t the best of ideas.

Result

We have tested this approach and are fairly happy with it. The only real overhead is that the migration code needs to be structured a little differently. Now that we’ve got that figured out, it shouldn’t be a big deal, though.

The only downside is that because we don’t have the transition code, this still requires some downtime (we can’t use blue-green deployment). On the other hand, we have other factors in our infrastructure making this very complicated, so it should be okay for now.