
Investigating CockroachDB Performance and Stability With 22k Databases and 300k Tables
When it comes to managing multi-tenancy, there are a number of architectural routes you can take. A few of the most popular are:
- Create a database per tenant.
- Use a single database, but with tenant IDs to separate out data.
Both of these options have pros and cons, but getting into that is not the intention of this blog. Today we'll be exploring what option 1 might look like on a single CockroachDB cluster.
Is this a sustainable model on CockroachDB? How many databases/tables can we add? I'll attempt to answer these questions and more as I add around 22,000 databases and 300,000 tables to a cluster.
High-Level Steps
- Create a 9-node CockroachDB cluster.
- Load x instances of the TPC-C database.
- Use the built-in `cockroach workload` command and a script to execute x workloads against x number of databases.
- Monitor stability and performance throughout the exercise.
Set up a CockroachDB Cluster
For this experiment, I created a cluster in Google Cloud Platform (GCP) with the below specs.
| Spec | Value |
|---|---|
| Nodes | 9 |
| Region(s) | us-east1 |
| Machine Type | n2-standard-8 |
| vCPU | 8 |
| Memory (GB) | 32 |
| Storage (GB) | 375 |
I won't be doing a how-to on the CockroachDB setup in this blog, but below are some links to get you started. I ended up deploying all my nodes in a single AZ since I didn't care about survivability. I also did not create a load balancer in GCP, but rather used HAProxy on the VM I used to run a workload.
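As a side note, CockroachDB can generate a starter HAProxy config for you. A minimal sketch, assuming an insecure cluster and a placeholder node address:

```
# Writes an haproxy.cfg that load-balances across the cluster's nodes.
# The --host value is a placeholder; point it at one of your own nodes.
cockroach gen haproxy --insecure --host=10.142.0.2 --port=26257
```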
Cluster Settings
There are a few cluster settings I enabled to improve performance. The main setting we’ll keep throughout the duration of the cluster is:
```
set cluster setting spanconfig.host_coalesce_adjacent.enabled = true;
```
This setting allows multiple tables per range, rather than a table per range, which can come in handy when adding as many tables as I plan to.
The below settings are only used for initial schema and data load. More can be read about their purpose here.
```
set cluster setting kv.range_merge.queue_interval = "50ms";
set cluster setting jobs.registry.interval.gc = "30s";
set cluster setting jobs.registry.interval.cancel = "180s";
set cluster setting jobs.retention_time = "15s";
```
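Since those four settings are only meant for the load phase, you'd presumably revert them once the load completes. A minimal sketch, assuming an insecure connection (host/port are placeholders):

```
# Revert the load-time settings back to their defaults after the
# initial schema/data load finishes.
cockroach sql --insecure --host=localhost --port=26257 \
  -e "RESET CLUSTER SETTING kv.range_merge.queue_interval;
      RESET CLUSTER SETTING jobs.registry.interval.gc;
      RESET CLUSTER SETTING jobs.registry.interval.cancel;
      RESET CLUSTER SETTING jobs.retention_time;"
```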
Load Databases/Tables
The ultimate goal is to load a lot of databases and tables – but I also want to easily run workloads against those databases once they are in the cluster. The easiest way to accomplish that for me was to leverage one of the built-in workloads in CockroachDB.
While there are a number of them available, I selected the TPC-C workload since it's a good test of a transactional workload, and it can also report back efficiency numbers. TPC-C consists of 9 tables in total. I'd like to see at least 300k tables in total, so we have a ways to go.
If you are unfamiliar with TPC-C, more information is available here.
Base `workload init` Command
The base command for initializing a schema is quite simple with the built-in command. It looks like the following, where `localhost` is ideally a load balancer instead.
```
cockroach workload init tpcc 'postgresql://root@localhost:26257?sslmode=disable'
```
This command will create 9 tables in a database named `tpcc` and load 1 warehouse worth of data, which comes out to ~200MB. The number of warehouses is configurable via a flag, but we'll keep it with the default of 1.
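For illustration, a larger load with that same flag might look like this (ten warehouses is an arbitrary example, roughly 10x the data; everything else in this post assumes the single-warehouse default):

```
# --warehouses scales the TPC-C dataset; the default is 1 (~200MB raw).
cockroach workload init tpcc --warehouses=10 'postgresql://root@localhost:26257?sslmode=disable'
```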
Once the data is compressed and replicated it comes out to around 70MB in CockroachDB. We’ll have to keep this number in mind since we currently have 3.1TB of storage available. The final database result for 1 warehouse looks like this.
| schema_name | table_name | type | owner | estimated_row_count |
|---|---|---|---|---|
| public | customer | table | root | 30000 |
| public | district | table | root | 10 |
| public | history | table | root | 30000 |
| public | item | table | root | 100000 |
| public | new_order | table | root | 9000 |
| public | order | table | root | 30000 |
| public | order_line | table | root | 300343 |
| public | stock | table | root | 100000 |
| public | warehouse | table | root | 1 |
Expanding the Base Command
There are a few things we need to do to expand on the base command. The first one should be obvious: we can't use the name `tpcc` if we are loading tens of thousands of databases.
Depending on how much time you have, this second tip may be equally important. The `workload init` command does an `INSERT` by default. While this is OK and may be something you want to test, I'd like to switch to `IMPORT` and make things a bit faster.
With these two thoughts in mind, our base command now looks something more like this:
```
cockroach workload init tpcc --db database_$i --data-loader import 'postgresql://root@localhost:26257?sslmode=disable'
```
Throw in a Bash Script
You'll notice the command above specifies `database_$i` since we want to load `i` instances. Feel free to get as fancy as you want here with your scripting abilities, but in its simplest form, we can do something like the following.
```
#!/usr/bin/env bash
for i in {1..20000}
do
  cockroach workload init tpcc --db database_$i --data-loader import 'postgresql://root@localhost:26257?sslmode=disable'
done
```
This will admittedly take a while as we are loading 20,000 databases, 180,000 tables, and ~1.5TB of data (after compression). As I said, feel free to write a more impressive script to parallelize this logic and speed up the process.
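As a rough sketch of what that parallelization could look like, assuming GNU xargs is available and the cluster can absorb eight concurrent imports:

```
#!/usr/bin/env bash
# Fan the 20,000 imports out across 8 concurrent workers; each number
# emitted by seq becomes one `workload init` invocation.
seq 1 20000 | xargs -P 8 -I {} \
  cockroach workload init tpcc --db database_{} --data-loader import \
  'postgresql://root@localhost:26257?sslmode=disable'
```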
Bonus: Speed up via a Backup/Restore
If you want to speed past the `IMPORT` command, you can leverage a `BACKUP` and `RESTORE` process that I found to be even faster.
Back up a TPC-C database into cloud storage. This example shows `s3`, but other cloud storage options are available as well. These commands are run just once within the SQL shell.
```
BACKUP DATABASE tpcc INTO 's3://BUCKET NAME?AWS_ACCESS_KEY_ID=KEY ID&AWS_SECRET_ACCESS_KEY=SECRET ACCESS KEY' AS OF SYSTEM TIME '-10s';
```
Restore the `tpcc` database into a new database with a different name.
```
RESTORE DATABASE tpcc FROM LATEST IN 's3://bucket_name?AWS_ACCESS_KEY_ID=key_id&AWS_SECRET_ACCESS_KEY=access_key' WITH DETACHED, new_db_name="database_$i";
```
We could also loop through this command in the language of our choice to achieve somewhat faster results than the `IMPORT`, as sketched below.
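A minimal bash sketch of that loop, reusing the `RESTORE` above (bucket and credentials are placeholders, and the cluster is assumed to be insecure):

```
#!/usr/bin/env bash
# Restore the same backup into 20,000 differently-named databases.
for i in {1..20000}
do
  cockroach sql --insecure --host=localhost --port=26257 -e \
    "RESTORE DATABASE tpcc FROM LATEST IN 's3://bucket_name?AWS_ACCESS_KEY_ID=key_id&AWS_SECRET_ACCESS_KEY=access_key' WITH DETACHED, new_db_name = 'database_$i';"
done
```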
Adding More Tables: No Data
As I mentioned above, 20k iterations only gets us to 180k tables. I'd like to get closer to 300k tables under management, but I'm not super interested in adding more data, as we're already at 50% capacity.
At this point, I added ~700 more databases, each with 150 tables. This schema was quite different from the TPC-C tables already loaded. How, or if, you choose to get to this higher number of tables, I'll leave up to you; one possible approach is sketched below.
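If you want something concrete, here's one hypothetical way to stamp out empty schemas in bulk. The table definition below is an arbitrary placeholder, not the schema I actually used:

```
#!/usr/bin/env bash
# Create 700 databases, each with 150 empty tables, issuing one SQL
# batch per database. Swap in whatever table shape you like.
for i in $(seq 1 700)
do
  stmts="CREATE DATABASE extra_$i;"
  for j in $(seq 1 150)
  do
    stmts+=" CREATE TABLE extra_$i.t_$j (id INT PRIMARY KEY, val STRING);"
  done
  cockroach sql --insecure --host=localhost --port=26257 -e "$stmts"
done
```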
Proof
Just a quick confirmation of the cluster size and objects added!
First, we'll see the total database count is just over 22.5k. Then we'll do a count on the descriptors. Descriptors include tables, among a few other things.
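The confirmation in the original post is a pair of screenshots; the queries behind them would look roughly like this (connection flags are assumptions):

```
# Total databases (just over 22.5k at this point in the test).
cockroach sql --insecure --host=localhost --port=26257 \
  -e "SELECT count(*) FROM [SHOW DATABASES];"
# Total descriptors, which include tables among a few other things.
cockroach sql --insecure --host=localhost --port=26257 \
  -e "SELECT count(*) FROM system.descriptor;"
```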
Here is the Cluster Overview.
Running a Workload
Now that we have everything loaded, we can go through a similar process to run workloads against some of these same databases. While I don't have the infrastructure in place to run 22.5k concurrent TPC-C workloads, hopefully I can get something respectable going.
Base `workload run` Command
Just like the initial load, Cockroach makes actually running the load simple as well. The example below will run a TPC-C workload for 60 minutes.
```
cockroach workload run tpcc --duration=60m 'postgresql://root@localhost:26257?sslmode=disable'
```
Expanding the Base Command
There are a few modifications I made here as well, which end up being quite similar to the initial load.
```
cockroach workload run tpcc --db database_$i --duration=60m --conns=1 --display-every=1m 'postgres://root@localhost:26257?sslmode=disable'
```
Once again, we've added a database flag so we can select among the databases we created. I also added a connections flag and set it to 1. By default, the workload will create 2 connections per run.
Lastly, I added a `--display-every` flag. This simply changes how often the workload prints out its per-operation statistics. The default is 1s, which prints out a lot, and I wasn't really planning on digesting all that data for 500+ workloads. Later on, we will grab the final workload statistics that are printed out at the end.
Throw in a Bash Script
Now that we have our base command, we can create another script to loop through and start these processes. I used a separate virtual machine, outside of the cluster, with HAProxy installed. FYI, lower these numbers quite a bit if you don't have a powerful VM. I was using an n2-standard-32.
Below is the simple script I used that loops through starting the workload 500 times.
```
#!/usr/bin/env bash
for i in {1..500}
do
  nohup cockroach workload run tpcc --db database_$i --duration=60m --conns=1 --display-every=1m 'postgres://root@localhost:26257?sslmode=disable' > results/$i.txt 2>&1 &
done
```
The redirect at the end throws the output of each workload into its own file inside a `/results` folder.
For my actual tests, I leveraged the following syntax so I wasn't always selecting just the first 500. (I realize there's a semi-decent chance of dupes with this logic.)
```
nohup cockroach workload run tpcc --db database_$(( ( RANDOM % 20000 ) + 1 )) --conns=1
```
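If the dupes bother you, a small variation picks 500 distinct database indexes up front (this sketch assumes `shuf` from GNU coreutils):

```
#!/usr/bin/env bash
# Sample 500 unique indexes from 1-20000, then start one workload each.
for i in $(shuf -i 1-20000 -n 500)
do
  nohup cockroach workload run tpcc --db database_$i --duration=60m --conns=1 \
    --display-every=1m 'postgres://root@localhost:26257?sslmode=disable' \
    > results/$i.txt 2>&1 &
done
```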
Results
Below are the results for workloads of sizes 500, 750, and 1,000. Each ran for 1 hour.
| Workloads | CPU Utilization | QPS | P99 Latency | TPC-C efc (avg) |
|---|---|---|---|---|
| 500 | ~25-50% | ~1,800/s | ~50ms | 97% |
| 750 | ~25-50% | ~2,200/s | ~58ms | 97% |
| 1000 | ~45-60% | ~3,300/s | ~70ms (w/ spikes) | 97% |
Conclusion and Next Steps
Overall, I thought CockroachDB handled these tests well while maintaining a TPC-C efc percentage of 97% throughout all tests. Mileage will vary depending on the number of nodes and vCPUs, but this could be a viable option if your current solution today is creating a physical database per tenant.
There are a few things I would expand on when testing out this model further:
- Impact and speed of making a schema change across every database
- Implementing role-based access to each database
- QPS against a non-TPC-C workload
- Even more tables! Though not mentioned in the blog, I did test 200k vs. 300k and saw no differences in performance.