Investigating CockroachDB Efficiency and Stability With 22k Databases and 300k Tables

When it comes to managing multi-tenancy, there are a number of architectural routes you can take. A couple of the most popular are:

  1. Create a database per tenant.
  2. Use a single database, but with tenant IDs to separate out data.

Both of these options have pros and cons, but getting into that is not the goal of this blog. Today we’ll be exploring what option 1 might look like on a single CockroachDB cluster.

Is this a sustainable model on CockroachDB? How many databases/tables can we add? I’ll try to answer these questions and more as I add around 22,000 databases and 300,000 tables to a cluster.

High-Level Steps

  1. Create a 9-node CockroachDB cluster.
  2. Load x instances of the TPC-C database.
  3. Use the built-in cockroach workload command and a script to execute x workloads against x number of databases.
  4. Monitor stability and performance throughout the exercise.

Set up a CockroachDB Cluster

For this experiment, I created a cluster in Google Cloud Platform (GCP) with the below specs.

Nodes           9
Region(s)       us-east1
Machine Type    n2-standard-8
vCPU            8
Memory (GB)     32
Storage (GB)    375

I won’t be doing a how-to on the CockroachDB setup in this blog, but below are some links to get you started. I ended up deploying all my nodes in a single AZ since I didn’t care about survivability. I also did not create a load balancer in GCP, but rather used HAProxy on the VM I used to run the workload.
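As an aside, if you want a quick way to get HAProxy pointed at the cluster, the cockroach binary can generate a starter config for you. A rough sketch, assuming an insecure cluster to match the sslmode=disable connection strings used later:

# Generate haproxy.cfg against any node in the cluster, then start HAProxy with it
cockroach gen haproxy --insecure --host=<address of any CockroachDB node> --port=26257
haproxy -f haproxy.cfg &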

Cluster Settings

There are a few cluster settings I enabled to improve performance. The main setting we’ll keep throughout the duration of the cluster is:

set cluster setting spanconfig.host_coalesce_adjacent.enabled = true;

This setting allows multiple tables per range, rather than a table per range, which can come in handy when adding as many tables as I plan to.

The below settings are only used for initial schema and data load. More can be read about their purpose here.

set cluster setting kv.range_merge.queue_interval = "50ms";
set cluster setting jobs.registry.interval.gc = "30s";
set cluster setting jobs.registry.interval.cancel = "180s";
set cluster setting jobs.retention_time = "15s";
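Since these settings are only meant to help during the load phase, I’d plan to put them back to their defaults once the data is in, along the lines of:

reset cluster setting kv.range_merge.queue_interval;
reset cluster setting jobs.registry.interval.gc;
reset cluster setting jobs.registry.interval.cancel;
reset cluster setting jobs.retention_time;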

Load Databases/Tables

The ultimate goal is to load a lot of databases and tables – but I also want to easily run workloads against those databases once they are in the cluster. The easiest way to accomplish that for me was to leverage one of the built-in workloads in CockroachDB.

While there are a number of them available, I selected the TPC-C workload since it’s a good test of a transactional workload and it can also report back efficiency numbers. TPC-C consists of 9 tables in total. I’d like to see at least 300k tables total, so we have a ways to go.

If you are unfamiliar with TPC-C, more information is available here.

Base workload init Command

The base command for initializing a schema is quite simple with the built-in command. It looks like the following, where localhost is ideally a load balancer instead.

cockroach workload init tpcc 'postgresql://root@localhost:26257?sslmode=disable'

This command will create 9 tables in a database named tpcc and load 1 warehouse’s worth of data, which comes out to ~200MB. The number of warehouses is configurable via a flag, but we’ll keep the default of 1.
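For reference, this is what a hypothetical 10-warehouse variant of the same command would look like; we’ll stay with 1 for this exercise:

cockroach workload init tpcc --warehouses=10 'postgresql://root@localhost:26257?sslmode=disable'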

Once the data is compressed and replicated it comes out to around 70MB in CockroachDB. We’ll have to keep this number in mind since we currently have 3.1TB of storage available. The final database result for 1 warehouse looks like this.

schema_name   table_name   type    owner   estimated_row_count
public        customer     table   root    30000
public        district     table   root    10
public        history      table   root    30000
public        item         table   root    100000
public        new_order    table   root    9000
public        order        table   root    30000
public        order_line   table   root    300343
public        stock        table   root    100000
public        warehouse    table   root    1

Expanding Base Command

There are a few things we need to do to expand on the base command. The first one should be obvious: we can’t use the name tpcc if we are loading tens of thousands of databases.

Depending on how much time you have, this second tip may be equally important. The workload init command does an INSERT by default. While this is ok and may be something you want to test, I’d like to switch to IMPORT and make things a bit faster.

With these two thoughts in mind, our base command now looks something more like this:

cockroach workload init tpcc --db database_$i --data-loader import 'postgresql://root@localhost:26257?sslmode=disable'

Throw in a Bash Script

You’ll notice the command above specifies database_$i since we want to load i instances. Feel free to get as fancy as you want here with your scripting abilities, but in its simplest form, we can do something like the following.

#!/usr/bin/env bash
for i in $(seq 1 20000)
do
  cockroach workload init tpcc --db database_$i --data-loader import 'postgresql://root@localhost:26257?sslmode=disable'
done

This will admittedly take a while, as we are loading 20,000 databases, 180,000 tables, and ~1.5TB of data after compression (20,000 databases × ~70MB each). As I said, feel free to write a more impressive script that parallelizes this logic and speeds up the process.
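As one rough sketch of that (not what I actually ran), you could hand the loop to xargs and run several init jobs at a time:

#!/usr/bin/env bash
# Run 8 IMPORT-based init jobs concurrently; tune -P to what your cluster and VM can handle
seq 1 20000 | xargs -P 8 -I {} \
  cockroach workload init tpcc --db database_{} --data-loader import \
  'postgresql://root@localhost:26257?sslmode=disable'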

Bonus: Speed up via a Backup/Restore

If you want to speed things up beyond the IMPORT command, you can leverage a BACKUP and RESTORE process that I found to be even faster.

Back up a TPC-C database into cloud storage. This example shows S3, but other cloud storage options are available as well. These commands are run just once, directly in the SQL shell.

BACKUP DATABASE tpcc INTO 's3://BUCKET NAME?AWS_ACCESS_KEY_ID=KEY ID&AWS_SECRET_ACCESS_KEY=SECRET ACCESS KEY' AS OF SYSTEM TIME '-10s';

Restore the tpcc database into a new database with a different name.

RESTORE DATABASE tpcc FROM LATEST IN 's3://bucket_name?AWS_ACCESS_KEY_ID=key_id&AWS_SECRET_ACCESS_KEY=access_key' WITH DETACHED, new_db_name="database_$i";

We can also loop through this command in the language of our choice to achieve somewhat faster results than the IMPORT.
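A minimal Bash sketch of that loop, reusing the placeholder bucket and credentials from above, might look like this:

#!/usr/bin/env bash
# Restore the single tpcc backup into database_1 ... database_20000
# WITH DETACHED returns immediately, so the restores run as background jobs
for i in $(seq 1 20000)
do
  cockroach sql --url 'postgresql://root@localhost:26257?sslmode=disable' -e \
    "RESTORE DATABASE tpcc FROM LATEST IN 's3://bucket_name?AWS_ACCESS_KEY_ID=key_id&AWS_SECRET_ACCESS_KEY=access_key' WITH DETACHED, new_db_name = 'database_$i';"
done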

Adding More Tables: No Data

As I mentioned above, 20k iterations only gets us to 180k tables. I’d like to get closer to 300k tables under management, but I’m not super interested in adding more data as we’re already at 50% capacity.

At this point, I added ~700 more databases, each with 150 tables. This schema was quite different from the TPC-C tables already loaded. How, or if, you choose to get to this higher number of tables, I’ll leave up to you.
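If you just want something that pads out the table count, a purely hypothetical option (my actual schema was different) is to generate empty tables in a loop:

#!/usr/bin/env bash
# Create 700 databases, each with 150 empty two-column tables
for d in $(seq 1 700)
do
  {
    echo "CREATE DATABASE IF NOT EXISTS extra_$d;"
    for t in $(seq 1 150)
    do
      echo "CREATE TABLE IF NOT EXISTS extra_$d.t_$t (id INT PRIMARY KEY, val STRING);"
    done
  } | cockroach sql --url 'postgresql://root@localhost:26257?sslmode=disable'
done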

Proof

Just a quick confirmation of the cluster size and the objects added!

First, we’ll see the total database count is just over 22.5k. Then we’ll do a count on the descriptors. Descriptors include tables, among a few other things.
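One way to pull these numbers from the SQL shell (not necessarily the exact queries behind the screenshots below; system.descriptor is the internal table backing the descriptor count):

cockroach sql --url 'postgresql://root@localhost:26257?sslmode=disable' \
  -e "SELECT count(*) FROM [SHOW DATABASES];" \
  -e "SELECT count(*) FROM system.descriptor;"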

Total databases count

Here is the Cluster Overview.

Cluster overview

Running a Workload

Now that we have everything loaded, we can go through a similar process to run workloads against some of these same databases. While I don’t have the infrastructure in place to run 22.5k concurrent TPC-C workloads, hopefully I can get something respectable going.

Base workload run Command

Just like the initial load, CockroachDB makes actually running the load simple as well. The example below will run a TPC-C workload for 60 minutes.

cockroach workload run tpcc --duration=60m 'postgresql://root@localhost:26257?sslmode=disable'

Expanding the Base Command

There are a few modifications I made here as well, which ends up looking quite similar to the initial load.

cockroach workload run tpcc --db database_$i --duration=60m --conns=1 --display-every=1m 'postgres://root@localhost:26257?sslmode=disable'

Once again, we’ve added a database flag so we can pick among the databases we created. I also added a connections flag and set it to 1. By default, the workload will create 2 connections per run.

Lastly, I added a --display-every flag. This simply changes how often the workload prints out its per-operation statistics. The default is 1s, which prints out quite a lot, and I wasn’t really planning on digesting all that data for 500+ workloads. Later on, we’ll grab the final workload statistics that are printed out at the end.

Throw in a Bash Script

Now that we have our base command, we can create another script to loop through and start these processes. I used a separate virtual machine, outside of the cluster, with HAProxy installed. FYI, lower these numbers quite a bit if you don’t have a powerful VM. I was using an n2-standard-32.

Below is the simple script I used that loops through starting the workload 500 times.

#!/usr/bin/env bash
for i in $(seq 1 500)
do
  nohup cockroach workload run tpcc --db database_$i --duration=60m --conns=1 --display-every=1m 'postgres://root@localhost:26257?sslmode=disable' > results/$i.txt 2>&1 &
done

The redirection at the end sends the output of each workload to its own file inside a results folder.
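Since each of those files ends with the workload’s final summary (the table whose header includes the efc column), a quick grep across the folder is one way to collect the numbers afterward, assuming GNU grep:

# Pull the final summary (header plus result line) out of every output file
grep -A 1 "efc" results/*.txt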

For my actual tests, I leveraged the following syntax so I wasn’t always selecting just the first 500. (I realize there’s a semi-decent chance of dupes with this logic.)

nohup cockroach workload run tpcc --db database_$(( ( RANDOM % 20000 ) + 1 )) --conns=1

Results

Below are the results for workloads of sizes 500, 750, and 1,000. Each ran for 1 hour.

Workloads   CPU Utilization   QPS        P99 Latency         TPC-C efc (avg)
500         ~25-50%           ~1,800/s   ~50ms               97%
750         ~25-50%           ~2,200/s   ~58ms               97%
1000        ~45-60%           ~3,300/s   ~70ms (w/ spikes)   97%

Conclusion and Next Steps

Overall, I thought CockroachDB handled these tests well, maintaining a TPC-C efc percentage of 97% across all tests. Mileage will vary depending on the number of nodes and vCPUs, but this could be a viable option if your current solution today is creating a physical database per tenant.

There are a few things I would expand on when testing this model further:

  1. Impact and speed of making a schema change across every database
  2. Implementing role-based access to each database
  3. QPS against a non-TPC-C workload
  4. Even more tables! Though not mentioned in the blog, I did test 200k vs. 300k and saw no difference in performance.