Data migration is a common part of working with databases, and Ruby on Rails developers have many tools at their disposal. Still, it can be a challenge to get it right in every situation. Here are some tips for making sure your migrations go smoothly.
The easiest way is just to have data migrations alongside with schema migrations. But as your code base grows this might become a problem.
Problems
Code coupling
Data migrations depend on the existing code(models, service objects, etc.) over time code changes potentially breaking the migrations. Let's say you have a Order
model and want to update a recently added attribute fulfilled
to true
for all orders created last year. You can do it with a simple migration:
class UpdateFulfilledIn2021 < ActiveRecord::Migration
def up
Order.all.where("created_at < '2022-01-01'")
.update_all(fulfilled: true)
end
end
The above migration might fail:
- If
Order
model gets renamed or removed completely. - If a new validation was added to the model.
- A new side effect via callback were added to the model. For example if an email gets sent to a customer after the
Order
record is updated withfulfilled = 't'
, and you definitely don't want to bother customers from last year with emails they don't need.
Possibly long execution times
Data migrations might take a lot of time to execute, depending on the amount of records you need to update, and the complexity of calculations to perform for each record. Long execution times can potentially break the deployment pipeline, or lead to some down time of the application.
Hard to test
It's hard to write tests for migrations ensuring that the data is updated in a correct way, since a single error might lead to whole application data corruption requiring some down time to recover DB back to correct state.
Possible solutions
Inline model stubs
Just define models class stubs with minimal required definitions. That way original model changes won't affect the migration. For the example from above:
class UpdateFulfilledIn2021 < ActiveRecord::Migration
class OrderStub < ApplicationRecord
self.table_name = 'orders'
end
def up
OrderStub.all.where("created_at < '2022-01-01'")
.update_all(fulfilled: true)
end
end
Pros:
- Is not affected by actual model changes;
Cons:
- Model logic and association definitions are duplicated;
Pure SQL
Alternatively you can use pure SQL to modify your data.
def up
execute(
"UPDATE user_orders SET fulfilled = 't'
WHERE created_at < '2022-01-01'"
)
end
Pros:
- Does not depend on the code;
- Does not require any additional setup;
Cons:
- Get complex fast for non-trivial data changes;
- Duplicates association logic when need to join associated records;
- No progress indication for long running updates;
Rake tasks
task update_fulfilled_in_2021: :environment do
OrderStub.all.where("created_at < '2022-01-01'")
.update_all(fulfilled: true)
end
Pros:
- Can be run on demand any time in background without affecting deployment process;
- Testable;
Cons:
- No clear indication that migration has already performed, so it's better be idempotent;
Data migration gem
There are a couple of gems that helps organize your data migrations in the same way as schema migrations, one of the most known data-migrate. You can simply generate a new migration:
rails g data_migration update_fulfilled_in_2021
Generated data migrations are stored in db/data
and then could be run with:
rake data:migrate
Pros:
- Has versioning the same way as schema migrations;
- Can be easily run during deployment;
- Testable: via either extracting logic into dedicated classes, or setting up a test helper to run the actual migration during tests;
Cons:
- Additional setup;
Conclusion
Each solution is good in it's own circumstances.
For easy one time updates you can utilize Rails schema migrations or execute pure SQL.
If you need to perform more complex data updates once in a month or so, then rake tasks might help you with that.
But as your code base grows and migrations get bigger & more complex it's better to utilize full data migration automation gem with version control and a way to test the updates before they hit the production database.