Top 10 Signs You Have a Data Modeling Problem

Top 10 Signs You Have a Data Modeling Problem

1. Your warehouse is really slow

Situation

“I just hit Run Query. I am going to get a cup of coffee and hopefully it’s done when I get back.”
— Data Analyst

Problem

  • Every time a transformation runs in your warehouse, it drops and recreates every table, thus processing time only grows.
  • Table locks can add up, causing your table to need to be reprocessed before it can be used.

2. You have changed your BI tool multiple times

Situation

“We just invested in Looker and it will solve all of our problems. So far everything is really fast and matches, unlike Chartio which was slow and nothing matched.”
— BI Team

6 months later…

“I don’t know why Looker is super slow and none of the metrics match. I think Tableau Online solves this problem really well” — BI Team

Problem

  • Only a subset of data models make it into your BI tool when it’s first implemented. Then, as time goes on, you require more data models as your business evolves, and soon you’ve recreated all the issues that you experienced with your last BI tool. It’s slow and there are inconsistencies in metrics.
  • BI tools have to query your warehouse and cache the results of those queries. If a BI tool is accessing a query with multiple tables, but some of those tables still need to update, then it will cache the wrong results. This will also give you inconsistent metrics across dashboards.

3. Your transformations are more than 30 lines

Situation

“I just need to add sales_location to this table but the SQL (and python, sometimes) to generate it is literally 1200 lines. I will just create a new table with a cleaner query."
— Data Engineer

Problem

  • Each query needs to bridge multiple sources — and because foreign keys are often missing — you write insane sub queries to try to get a clean join. This leads to queries that are really long and complicated.
  • With complexity comes lack of understanding and so when something needs to change, your team will elect to either: (1) rewrite the data model with only what they need or (2) wrap the entire query in a CTE (Common Table Expression) and add to the query, increasing the length of the query and its complexity.

4. Reporting takes too long to set up

Situation

“I just spent my whole day debugging the sales_data table and I’m pretty sure the query is actually wrong."
— Data Analyst
“Oh yeah, you should use sales_data_v2.”
— Data Engineer
“What? Why do we still have sales_data then?"
— Data Analyst
“Some dashboard depends on it and we can’t deprecate it because it is used by executives”
— Data Engineer

Problem

  • Every reporting request requires a long process to find the right data, validate it, and then assemble the visualization.
  • Finding, validating, and assembling the data is what takes forever. Your BI Engineer or Analyst has to parse 1000+ line SQL queries and run sanity checks to compare your aggregations with widely used dashboards.
  • As your data models evolve, you often don’t delete older data models because some executive (or stubborn person) at your company depends on it for a dashboard and you’re nervous about migrating it. This leads to weeks of wasted work of re-debugging a table that someone already knew was wrong.

5. Data engineering has an endless backlog

Situation

“We are currently blocked by Data Engineering.”
— Analytics Team

Problem

  • As more requests come in each Data Engineer has to make a very complex decision: build a new data model or update an old model. Creating a new model adds another model to maintain for life. Updating a model means queries get insanely long or introduces a bug that leads to broken dashboards or numbers not matching.

6. Dashboards have numbers that don’t match

Situation

“Why does the Sales Dashboard say we have 12,034 sales but when I calculate the sum of sales in Sales by Ad Source dashboard, I see 13,193. The data is wrong.”
— Marketing Team

Problem

  • Dashboards visualize your data models. If numbers don’t match in a dashboard, it’s an issue with your data model. If you have multiple data models with similar data then you’ll end up with two dashboards using two different data models but computing the same metrics.
  • This leads to two sources of truth for one metric.

7. No one knows where to find the data they need

Situation

“Why don’t you use the data dictionary?”
— Data Engineer
“It’s out of date so I just look at the transformation SQL”
— Data Analyst.

Problem

  • Data Dictionaries are practically impossible to maintain so they go out of date the moment they are deployed. Automated Data Dictionaries usually just tell you the column names and types, but not the assumptions made.
  • As soon as one table’s documentation is out of date, the data team will know that they cannot depend on the documentation so they will go back to the method they can trust — parsing the complex SQL transformations.

8. The data team is not trusted

Situation

“The data is always wrong”
— Stakeholder

Problem

  • With complex data models, even simple data tasks are complicated, risky, and error-prone. Stakeholders only see the data integrity issues when asking “simple” questions and stop believing in the team’s ability to answer more questions.
  • Data integrity issues lead to a mistrust in the data team.

9. Other teams are buying their own data tools to do their own analyses

Situation

“We set up Heap so we could do our own analytics.”
— Website Team

Problem

  • Engineering teams are constantly told that they don’t need data analysts or to do analyses. They’ve been told that these point and click tools can replace the need for data teams. THIS IS NOT TRUE!
  • Data teams now have to do twice the work: re-do the analysis and then educate the team on why the numbers they pulled are different than the numbers the data team got.

10. You’re constantly explaining why “simple” questions are more complex than they seem

Situation

“Well, figuring out how many people called us from the website is really hard.”
— Data Analyst
“Why? People come to the site then call us. I just want to know the % of people. Why is this complicated?”
— Stakeholder

Problem

  • Simple questions become very complex when they bridge data systems because foreign keys are usually not available.
  • Even if we could relate the data, there are many dimensions to each question, so a data analyst needs to consider all the edge cases before answering. Each edge case takes hours to test and, as a stakeholder, it is hard to see the rigor that goes into answering a simple question.

My talk about answering data questions goes into a bit more depth about this: How to Answer Data Questions Without Being Miserable


So you have a data modeling problem. Where should you go from here?

It is not your fault! You’re probably using a Star Schema and that is the status quo “best practice” so you think these problems are simply the nature of the beast. We have good news for you — they’re not.

Image for post

There is a better way.

Check out Narrator: After suffering through these same problems, we came up with a data model that reflects today’s reality. It is a single, 11-column data model that can be used to assemble any table.

See for yourself: Narrator.ai


What are other problems you’ve experienced working with data? Could they be caused by bad data modeling?

Comment with your problems and we can help you identify the problem.


Check us out on the Data Engineering Podcast

Find it on the podcast page or stream it below