How to calculate data warehouse cost and prevent it from exploding
A realistic way to estimate warehouse cost and techniques to prevent the warehouse cost from getting out of hand.
Last month, I spoke with a Head of Data who was paying over $60k/mo for BigQuery.
We investigated the situation and found a couple of things:
- Getting a realistic estimation of warehouse cost is difficult
- Warehouse cost starts cheap, but then grows exponentially
Continuing with the theme of helping you reduce warehouse cost (see my earlier post: The cost of product analytics data in your data warehouse), the goal of this post is to build a realistic cost estimator so you can be prepared for warehouse cost in the future, and also to present a few techniques to prevent the warehouse cost from getting out of hand.
Building a realistic cost estimator
Before we start evaluating how cost grows exponentially, we need to understand how you get charged for warehouse usage. Working with our clients, we’ve found the total cost across warehouses to roughly average out for equivalent size and usage. Therefore, I’m using BigQuery as a proxy for this post because it’s one of the simpler pricing models.
BigQuery query cost: $5 per TB of data scanned
This cost is super misleading. A terabyte seems really large, and $5 isn’t that much. When I first saw this, I thought “I won’t have much data for years, and even when I do it’ll still be super cheap”. I was…wrong.
The cost of a data warehouse doesn’t increase at the same rate you increase data volume. It increases exponentially as you increase data volume. This is a common misconception because we don’t intuitively understand how we put load on a warehouse. Below, I tried to come up with a good mental model for this.
A few assumptions
To start, I created a rubric for different company stages and sizes, breaking it down by data sources, tracked users, and data questions:
Now that we have it broken down by company size, I also needed to make a few assumptions on # of tables, usage, and data size:
- For each data source, companies will create 10 tables to capture the base data models
- 10% of new questions will require a new materialized view
- Expect 100 rows per tracked user in an average table.
- The 100 row average is attempting to capture the wide range given email usage and session tables will have many rows, while user tables will have way fewer for example.
- 1 row = 1.2 bytes
- I averaged the size of a table/number of rows in that table and when I averaged this across the warehouse I got 1.2 bytes)
Usage to warehouse load
Given these assumptions we will begin computing the actual load on the data warehouse.
# of Materialized Views = 10 * # of data sources + (10% * new questions/mo * 12 mo/year)
Average Table Size = 100 rows/table * # of tracked users * 1.2 Bytes
Average # of Tables per Query = 3 for small, 5 for medium and large (as your company grows, the questions you ask will become more complex, therefore requiring more complex joins)
Both the number of tables per query and the average size of each table are both increasing. This makes intuitive sense because as you continue to collect data from your customers, the data will always be growing. And as your company asks more questions, we will always need to build new materialized views to answer those questions.
Calculating the cost
Now that we have the number of tables per query and the average size of each table, we account for the fact that data is processed every hour, and multiply it by the cost to scan that amount of data:
Total Query Load per Month = # of Materialized Views * Average Table Size * Average # of tables per Query * 24 hours/day * 30 days/mo
TOTAL MONTHLY COST = $5/TB * Total Query Load per Month
How do I stop the data warehouse cost from growing exponentially?
Consider using views instead of materialized views
It is very easy to just create another materialized view but not everything deserves it. Every time a materialized view is updated on it’s schedule, it requires a scan (costing you money). You can save a lot of money by answering ad-hoc questions with views that are not scheduled. We found that many ad-hoc questions start as super critical, but over time become less valuable once the stakeholder receives an answer. So instead of materializing that data, make it a view and just pay when people actually use the data. All those dashboards that no one looks at will now not cost so much.
Leverage BI Caching
Most BI tools or query tools will have a caching layer. This is like a materialized view but is saved per query. Keep in mind the first time it loads it will be really slow, but on subsequent runs will be really fast.
Use incremental materialization
When dealing with larger data, try to break up your materialized views into a couple of queries that are incrementally updated individually. This means that you will process only the diff per update (an hour of data for example) instead of the whole history of data.
Choose complexity over dependencies
This is a bit counterintuitive. We often think about building tables on top of other tables so each query is simpler. Yes, dependencies can sometimes make things easier to debug. However, we’ve seen queries that depend on 20 nested parents (ones that aren’t being used by any other child queries). This means you are paying 20x the cost for the one table. You can save a lot of money by letting the query get complex. BE VERY CAREFUL with this. This is helpful for larger datasets and remember you will need to maintain it.
Conclusion and the Activity Schema
The reality is, the amount of data will continue to increase exponentially over time, and the amount of data questions as your company grows will continue to increase exponentially over time. There’s no way to truly get around an increase in warehouse cost. However, using some of the techniques mentioned above will definitely mitigate the expense.
And, I’d be remiss if I didn’t at least mention how our clients are using the Activity Schema approach with Narrator. It’s an alternative approach to data modeling that we’ve seen drastically reduce warehouse cost across all of our clients, given that it’s a single incrementally updated table for all analytics vs many tables and nested dependencies that come with a star schema.