NoSQL & Data Versioning

After I posted the slides for my talk at MongoSF, a lot of people asked me: What’s with all this NoSQL talk? Isn’t OffScale for relational databases?

There are a lot of misconceptions about NoSQL, and one of the biggest is confusing having no schema with having no need to version data. The reasoning goes that since the database is schemaless, I can ignore the structure of the objects stored in it. After all, I won’t get any “column not found” errors anymore, right?

Right?

Wrong. Errr… Right, but only to a point. From there on, you are on your own.

With great power comes great responsibility

Imagine you are working on a new SaaS product. You are adding features and releasing them. One of your customers from Japan complains that they can’t put in their billing address because in Japan, streets have no names. So you add a new class to deal with Japanese addresses. At this point, you are really happy that you are using MongoDB as a NoSQL database, because the new addresses can reside in the same collection as the old ones.

Skip a few versions forward. Things have really picked up. One of the new developers you have hired is writing some code that will suggest billing addresses according to billing history. At first, he thought he’d migrate all the objects to contain the addresses. But with millions of past transactions and only a few tens of thousands of monthly active users, it makes more sense to find the addresses and add them to the user when the user logs on, so as not to have to migrate all the objects. That would just take too long.


This new developer writes some tests with a few addresses. They seem to pass, so the new feature gets deployed. And it all works perfectly well, until someone from Japan tries to complete a transaction, and errors start flying all over the place.

What happened?

Japanese addresses.

They weren’t tested, because they didn’t exist in the new developer’s database. They do exist in the production database. Without a schema, it’s much harder for everybody on the team to keep track of all the object structures that accumulate in the database.
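To make the failure concrete, here is a minimal Python sketch of what might have happened. The document shapes, the field names (`street`, `prefecture`, `ward`, `block`) and both functions are hypothetical illustrations, not the actual code or data from the story.

```python
# Two hypothetical address documents living side by side in one collection.
western_address = {"street": "1 Main St", "city": "Springfield", "country": "US"}
japanese_address = {"prefecture": "Tokyo", "ward": "Chiyoda", "block": "1-2-3", "country": "JP"}


def suggest_address_naive(address):
    # Written and tested only against western-style documents:
    # it assumes every address has a "street" field.
    return f"{address['street']}, {address['city']}"


def suggest_address(address):
    # Handles every address shape known to exist in production,
    # and fails loudly on shapes nobody has accounted for.
    if "street" in address:
        return f"{address['street']}, {address['city']}"
    if "block" in address:
        return f"{address['prefecture']} {address['ward']} {address['block']}"
    raise ValueError(f"unknown address shape: {sorted(address)}")
```

The naive version passes any test that only uses western-style documents, and raises a `KeyError` the first time it meets a Japanese address. The database never complained, because there was no schema to violate.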

The best way to solve this?

Automated Tests With Data

It’s time to make databases first-class citizens of the development environment. They should be integrated with the source control system, and it should work the same way: you should be able to commit data sets with the code, so that code and databases stay in sync.

But where OffScale will really save you is when you add it to the build cycle.

There are two types of automated tests you need to do if you want to make sure your application will survive in the wild:

  • Migrations: that is, the change of objects from one version of the app to the next. In schemaless databases, the structure is implied by the objects themselves, and you can’t trust a migration until you have actually run it against real data.
  • Integration tests / regression tests: if you don’t migrate the objects, and you allow objects from different versions of your code to live next to each other, the new code has to support the older versions of the objects. The example above falls in this bucket. In this case, the best way to make sure your code works is to run tests against data sets of old objects and check that they still work. The plus side is that you don’t have to keep the code that creates old objects anymore. All you need are the data sets you used for your old tests.
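Both kinds of test can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the `schema_version` and `kind` fields, the `migrate` function, and the inline `LEGACY_DATASET`, which stands in for a data set committed alongside the code. It does not show how OffScale itself works.

```python
import json

# Hypothetical data set kept under source control: one document per
# object shape that has ever been written to production.
LEGACY_DATASET = json.loads("""
[
  {"_id": 1, "street": "1 Main St", "city": "Springfield"},
  {"_id": 2, "schema_version": 2, "kind": "japanese",
   "prefecture": "Tokyo", "ward": "Chiyoda", "block": "1-2-3"}
]
""")

CURRENT_VERSION = 2


def migrate(doc):
    """Return a copy of doc upgraded to CURRENT_VERSION."""
    doc = dict(doc)  # leave the caller's document untouched
    version = doc.get("schema_version", 1)
    if version == 1:
        # Assumption: version 1 documents predate Japanese addresses,
        # so they are all western-style and carry no "kind" field.
        doc["kind"] = "western"
        version = 2
    doc["schema_version"] = version
    return doc


def test_migration_covers_every_legacy_shape():
    # Migration test: every old shape must reach the current version.
    for doc in LEGACY_DATASET:
        migrated = migrate(doc)
        assert migrated["schema_version"] == CURRENT_VERSION
        assert migrated["kind"] in ("western", "japanese")


def test_new_code_reads_old_objects():
    # Regression test: new code must cope with unmigrated documents too.
    for doc in LEGACY_DATASET:
        kind = doc.get("kind", "western")  # the default covers version 1
        assert kind in ("western", "japanese")
```

A test runner such as pytest would pick up the two `test_` functions on every build. Whenever a new object shape ships, a sample document is added to the data set, so a developer who has never seen a Japanese address still gets a failing test instead of a production error.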

Basically, going NoSQL will give you a lot of flexibility, but it also makes it much easier to shoot yourself in the foot if you don’t put some process in place to protect you from human error.


About Omer Gertel

CTO @ Offscale