Scylla migration

How a zero-downtime migration was carried out and RBAC policies were put in place.

Scylla is an Apache Cassandra-compatible NoSQL data store with significantly better performance. I recently migrated a self-managed Scylla cluster to managed Scylla Cloud. Here are a few takeaways.

First, from a business perspective, the migration made sense: the team was too small to justify spending further engineering time on (safely) keeping up with the required upgrades. The business had predictable usage patterns but couldn't afford any downtime, and the data was large enough that a stop-and-copy migration would have meant hours of downtime. So we migrated in three steps: a transitional period of dual writes to both clusters, a forklift copy of the older data with scylla-migrator, and finally a verification period of dual reads, during which the old cluster was kept running.
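To make the dual-write and dual-read phases concrete, here is a minimal Python sketch. It is not the code used in the migration: the `Store` interface, the `InMemoryStore` stand-in and the `DualStore` wrapper are hypothetical names, and the choice to serve reads from the new cluster while comparing against the old one is an assumption. A real service would wrap its driver sessions instead of dictionaries.

```python
from typing import Dict, List, Optional, Protocol


class Store(Protocol):
    """Hypothetical minimal interface; a real service would wrap a driver session."""

    def put(self, key: str, value: str) -> None: ...

    def get(self, key: str) -> Optional[str]: ...


class InMemoryStore:
    """Stand-in for a cluster so the sketch runs without a database."""

    def __init__(self) -> None:
        self.rows: Dict[str, str] = {}

    def put(self, key: str, value: str) -> None:
        self.rows[key] = value

    def get(self, key: str) -> Optional[str]:
        return self.rows.get(key)


class DualStore:
    """Writes to both clusters and cross-checks reads (assumed policy)."""

    def __init__(self, old: Store, new: Store) -> None:
        self.old = old
        self.new = new
        self.mismatches: List[str] = []

    def put(self, key: str, value: str) -> None:
        # Dual-write phase: every write lands on both clusters.
        self.old.put(key, value)
        self.new.put(key, value)

    def get(self, key: str) -> Optional[str]:
        # Dual-read phase: serve from the new cluster, compare with the old
        # one, and record any key whose values differ for later inspection.
        new_value = self.new.get(key)
        old_value = self.old.get(key)
        if new_value != old_value:
            self.mismatches.append(key)
        return new_value


old, new = InMemoryStore(), InMemoryStore()
dual = DualStore(old, new)
dual.put("user:1", "alice")
print(dual.get("user:1"), dual.mismatches)
```

Rows written before the dual-write period exist only on the old cluster, which is the gap the scylla-migrator copy closes. Until that copy finishes, a check like this would report those keys as mismatches.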

Meanwhile, I also updated the services that use Scylla so that they autodiscover the nodes, courtesy of a particular client-side library. On the cluster side, I enacted RBAC for both 1) the backend services and 2) the QA and developer teams. Previously they all shared a single, unrestricted superuser role; I moved them to roles that follow the principle of least privilege:

For backend services:

create role app1 with password = 'app1' and login = true;
create role app2 with password = 'app2' and login = true;

-- Only grant the needed rights to each app.
grant select on keyspace akeyspace to app1;
grant select on keyspace akeyspace to app2;
grant modify on akeyspace.atable   to app2;

For QA team and developers:

-- Implies `SUPERUSER = false AND LOGIN = false`.
create role qa;
grant select on keyspace akeyspace to qa;
grant modify on akeyspace.atable to qa;

create role alice with password = 'alice' and login = true;
grant qa to alice;

create role readonly;
grant select on keyspace akeyspace to readonly;

create role bob with password = 'bob' and login = true;
grant readonly to bob;

This removes the risk that a fat-fingered QA team member, or a developer acting indirectly through one of the apps, executes a DROP, ALTER, or other dangerous command. Unfortunately, MODIFY cannot be made more granular so as to allow a DELETE but not a TRUNCATE. This limitation is unlikely to disappear, since Cassandra compatibility must be kept intact.
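The effect of these grants can be illustrated with a toy Python model. This is only a sketch of the permission semantics, not how Scylla implements authorization: the `STATEMENTS` table, the `GRANTS` list and the `is_allowed` helper are hypothetical. The model assumes that a keyspace-level grant covers the tables inside it and that MODIFY covers INSERT, UPDATE, DELETE and TRUNCATE, as in Cassandra's permission model.

```python
from typing import Dict, List, Set, Tuple

# Statement kinds covered by each permission (simplified).
STATEMENTS: Dict[str, Set[str]] = {
    "SELECT": {"SELECT"},
    "MODIFY": {"INSERT", "UPDATE", "DELETE", "TRUNCATE"},
    "DROP": {"DROP"},
    "ALTER": {"ALTER"},
}

# (role, permission, resource) triples mirroring some of the GRANT statements.
GRANTS: List[Tuple[str, str, str]] = [
    ("app1", "SELECT", "akeyspace"),
    ("qa", "SELECT", "akeyspace"),
    ("qa", "MODIFY", "akeyspace.atable"),
]

# Role membership: alice was granted the qa role.
MEMBER_OF: Dict[str, List[str]] = {"alice": ["qa"]}


def roles_of(role: str) -> Set[str]:
    """Return the role plus every role it inherits from, transitively."""
    seen: Set[str] = set()
    stack = [role]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(MEMBER_OF.get(current, []))
    return seen


def covers(granted_on: str, target: str) -> bool:
    """A keyspace-level grant also applies to the tables in that keyspace."""
    return target == granted_on or target.startswith(granted_on + ".")


def is_allowed(role: str, statement: str, resource: str) -> bool:
    """Check whether any grant held by the role (or inherited) permits the statement."""
    roles = roles_of(role)
    return any(
        holder in roles and statement in STATEMENTS[perm] and covers(on, resource)
        for holder, perm, on in GRANTS
    )


for stmt in ("DELETE", "TRUNCATE", "DROP"):
    print("alice", stmt, is_allowed("alice", stmt, "akeyspace.atable"))
```

In this model alice, through qa, may DELETE and TRUNCATE but not DROP, which matches the limitation described above: the two destructive row-level operations cannot be separated.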

A nice touch was to also enable workload prioritization, available only with Scylla Enterprise:

CREATE SERVICE_LEVEL olap WITH SHARES = 100;
CREATE SERVICE_LEVEL oltp WITH SHARES = 1000;

ATTACH SERVICE_LEVEL olap TO qa;
ATTACH SERVICE_LEVEL olap TO readonly;
ATTACH SERVICE_LEVEL oltp TO app1;
ATTACH SERVICE_LEVEL oltp TO app2;
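As a rough guide to what these numbers mean, the short Python calculation below assumes that, under contention, resources are split between service levels in proportion to their shares. That is a simplification of the scheduler's behavior, and `contention_fractions` is a hypothetical helper, not part of any Scylla tooling.

```python
from typing import Dict


def contention_fractions(shares: Dict[str, int]) -> Dict[str, float]:
    """Split resources proportionally to shares (simplified model)."""
    total = sum(shares.values())
    return {name: value / total for name, value in shares.items()}


fractions = contention_fractions({"olap": 100, "oltp": 1000})
for name, fraction in fractions.items():
    print(f"{name}: {fraction:.1%}")
```

With 100 and 1000 shares, the backend services on `oltp` would get about ten elevenths of the contended resources, while QA and read-only users on `olap` share the remainder. When there is no contention, shares should not cap what a workload can use.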

Finally, to review the changes made:

LIST ALL PERMISSIONS [OF <role>];
LIST ALL ATTACHED SERVICE_LEVELS;