Labs4Change

Looker Caching and Datagroups: Stop Hitting Your Database on Every Query

How Looker caching works, how to configure datagroups for optimal performance, and the caching patterns that align refresh with your ETL schedule.

Every time a user clicks "Run" in Looker, a SQL query goes to your database. If 50 users run the same dashboard every morning, that's 50 identical queries — unless you've set up caching properly.

Most Looker instances we audit either have no caching strategy (every query hits the database) or overly aggressive caching (users see stale data and don't trust the numbers). Here's how to get it right.


How Looker Caching Works

When a query runs, Looker stores the results in an encrypted cache file. On the next identical query — same fields, filters, parameters, row limit, and (where user attributes appear in the generated SQL) the same user attribute values — Looker serves the cached results instead of querying the database.

You can see this in the Explore interface:

  • "Results from 2:30 PM" = cached results
  • "Results from Now" = fresh query

By default, Looker invalidates the cache after 1 hour. For most production workloads, this default is wrong in both directions — too fresh for historical dashboards (unnecessary database load) and too stale for operational dashboards (users see old data).
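If all you need is a plain time-to-live with no trigger logic, Looker's older persist_for parameter overrides that default at the model or Explore level. A minimal sketch (the Explore name is illustrative); datagroups, covered next, are the recommended replacement:

# Legacy TTL-only caching — a timer, no trigger condition
explore: orders {
  persist_for: "4 hours"
}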


Datagroups: The Right Way to Control Caching

Datagroups are Looker's mechanism for defining cache policies. They answer two questions:

  1. How long should cached results stay valid?
  2. What condition should trigger a cache refresh?

Basic Time-Based Datagroup

datagroup: daily_refresh {
  max_cache_age: "24 hours"
  sql_trigger: SELECT CURRENT_DATE() ;;
}

The max_cache_age is a safety net — cache expires after 24 hours regardless. The sql_trigger is the primary mechanism — Looker runs this query periodically, and when the result changes (a new date), all caches tied to this datagroup are invalidated.

ETL-Aligned Datagroup

datagroup: etl_refresh {
  max_cache_age: "24 hours"
  sql_trigger: SELECT MAX(last_updated) FROM etl_log ;;
}

This is the gold standard. Your ETL pipeline writes a timestamp when it finishes. Looker checks that timestamp. When it changes, caches refresh. Your users always see the latest data, and queries never hit the database unnecessarily between ETL runs.
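Concretely, the hand-off can be a single row written at the end of the pipeline. A sketch in SQL; the table and column names mirror the sql_trigger above, and the job name is illustrative:

-- Final step of the ETL pipeline: record the finish time.
-- The datagroup's sql_trigger reads MAX(last_updated) from this table.
INSERT INTO etl_log (job_name, last_updated)
VALUES ('nightly_orders_load', CURRENT_TIMESTAMP);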

Data-Based Datagroup

datagroup: orders_refresh {
  max_cache_age: "4 hours"
  sql_trigger: SELECT MAX(modified_at) FROM orders ;;
}

For tables that update throughout the day (streaming data, real-time order feeds), trigger cache refresh when new data arrives.


Applying Datagroups

At the Model Level

Applies to all Explores in the model:

model: ecommerce {
  persist_with: daily_refresh
}

At the Explore Level

Overrides the model default for specific Explores:

explore: orders {
  persist_with: orders_refresh  # Refreshes more frequently than the model default
}

explore: historical_analysis {
  persist_with: daily_refresh  # Historical data doesn't change intra-day
}

Multiple Datagroups for Different Cadences

Most production instances need 2-3 datagroups:

# For slowly-changing data (products, customers, reference tables)
datagroup: daily_refresh {
  max_cache_age: "24 hours"
  sql_trigger: SELECT CURRENT_DATE() ;;
}

# For transactional data updated by ETL
datagroup: etl_refresh {
  max_cache_age: "12 hours"
  sql_trigger: SELECT MAX(completed_at) FROM etl_runs ;;
}

# For near-real-time data
datagroup: frequent_refresh {
  max_cache_age: "1 hour"
  sql_trigger: SELECT MAX(created_at) FROM events ;;
}

SQL Trigger Patterns

Time-Based Triggers

# Refresh every hour
sql_trigger: SELECT DATE_TRUNC('hour', CURRENT_TIMESTAMP) ;;

# Refresh daily at midnight
sql_trigger: SELECT CURRENT_DATE ;;

# Refresh weekly
sql_trigger: SELECT DATE_TRUNC('week', CURRENT_DATE) ;;

Data-Based Triggers

# When new records arrive
sql_trigger: SELECT MAX(updated_at) FROM orders ;;

# When row count changes (useful for append-only tables)
sql_trigger: SELECT COUNT(*) FROM daily_summary ;;

# Checksum approach (detects any change)
sql_trigger: SELECT MD5(CAST(COUNT(*) AS STRING) || CAST(MAX(updated_at) AS STRING)) FROM orders ;;

BigQuery-Specific Optimization

On BigQuery, sql_trigger queries count toward your query costs. Use lightweight queries:

# Good — scans minimal data
sql_trigger: SELECT MAX(updated_at) FROM orders ;;

# Better — reads table metadata rather than table data
sql_trigger: SELECT MAX(last_modified_time)
  FROM `project.dataset.__TABLES__`
  WHERE table_id = 'orders' ;;

Caching Strategy by Dashboard Type

  • Executive summary: 24 hours / ETL-aligned. Data doesn't need to be real-time; consistency matters more.
  • Operational monitoring: 15 min to 1 hour. Users need near-current data for decisions.
  • Historical analysis: 24-48 hours. Data never changes; no reason to re-query.
  • Ad-hoc exploration: 1 hour (default). Users expect fresh data when actively exploring.
  • Embedded analytics: ETL-aligned. Customer-facing dashboards must show the latest data after each refresh.
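In LookML, that mapping becomes per-Explore persist_with assignments. The Explore names below are illustrative; the datagroups are the ones defined earlier:

explore: executive_summary {
  persist_with: etl_refresh       # consistency matters more than freshness
}

explore: ops_monitoring {
  persist_with: frequent_refresh  # near-current data for intraday decisions
}

explore: historical_analysis {
  persist_with: daily_refresh     # historical data doesn't change intra-day
}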

Common Caching Mistakes

1. No Datagroups At All

The default 1-hour cache means every dashboard refreshes hourly regardless of whether the underlying data changed. For a BigQuery-backed instance with 50 dashboards, this can cost thousands per month in unnecessary query costs.

2. Cache Too Short

Setting max_cache_age: "5 minutes" on a dashboard that shows monthly revenue means the same expensive query runs 12 times per hour per user. The monthly revenue number won't change — you're just burning money.

3. Cache Too Long

Setting max_cache_age: "48 hours" on an operational dashboard means the morning shift sees yesterday's data. They lose trust in the numbers and start running their own queries in BigQuery directly — defeating the purpose of Looker.

4. Not Aligning With ETL

If your ETL finishes at 6 AM and your cache refreshes at 7 AM, there's a one-hour window where users see stale data. Use an ETL-aligned datagroup to eliminate this gap.

5. Expensive sql_trigger Queries

The trigger query runs every 5 minutes by default (configurable per connection in the admin settings). A SELECT COUNT(*) FROM billion_row_table as your trigger will cost more than the cached queries it's trying to save.


Monitoring Cache Effectiveness

In Looker's Admin panel:

  • System Activity > Query Performance — check cache hit rates
  • System Activity > Data Connections — monitor database query volume
  • i button on dashboard tiles — shows whether results are cached

If your cache hit rate is below 50%, your datagroup strategy needs work. Well-configured instances typically see 70-90% cache hit rates.


Key Takeaways

  1. Always configure datagroups. The default 1-hour cache is rarely right for production workloads.
  2. Align with your ETL. The best trigger is SELECT MAX(completed_at) FROM etl_runs.
  3. Use 2-3 datagroups with different cadences for different data freshness requirements.
  4. Keep trigger queries cheap — especially on BigQuery where they count toward costs.
  5. Monitor cache hit rates and adjust based on actual usage patterns.

Get the LookML Best Practices Guide + AI Skill

Our guide covers caching, datagroups, project structure, permissions, and more — the patterns we've refined across 200+ Looker implementations.


Labs4Change has trained 16,000+ analytics engineers on Looker. Book a free performance review — we'll check your caching strategy, query performance, and datagroup configuration.