Pre-aggregations
You can use thepre_aggregations parameter within cubes to define
pre-aggregations.
Pre-aggregations must have, at minimum, a name and a type.
Pre-aggregations must include all dimensions and measures you will query with.
A pre-aggregation is typically placed in a cube that defines most of dimensions
and measures that this pre-aggregation includes. If a pre-aggregation doesn’t
include any dimensions or measures from a cube it’s defined in, this
pre-aggregation is ignored.
Parameters
name
The name parameter serves as the identifier of a pre-aggregation. It must be
unique among all pre-aggregations within a cube and follow the naming
conventions.
name, along with the name of the cube, will be used as a prefix for
pre-aggregation tables created in the database.
type
Cube supports the following types of pre-aggregations:
The default type is rollup.
rollup
Rollup pre-aggregations are the most effective way to boost performance of any
analytical application. The blazing fast performance of tools like Google
Analytics or Mixpanel are backed by a similar concept. The theory behind it lies
in multi-dimensional analysis, and a rollup pre-aggregation is the result of a
roll-up operation on an OLAP cube. A rollup pre-aggregation is
essentially the summarized data of the original cube grouped by any selected
dimensions of interest.
The most performant kind of rollup pre-aggregation is an additive rollup:
all measures of which are based on decomposable aggregate
functions. Additive measure types are: count, sum,
min, max or count_distinct_approx. The performance boost in this case is
based on two main properties of additive rollup pre-aggregations:
- A rollup pre-aggregation table usually contains many fewer rows than its’ corresponding original fact table. The fewer dimensions that are selected for roll-up means fewer rows in the materialized result. A smaller number of rows therefore means less time to query rollup pre-aggregation tables.
- If your query is a subset of dimensions and measures of an additive rollup, then it can be used to calculate a query without accessing the raw data. The more dimensions and measures are selected for roll-up, the more queries can use this particular rollup.
original_sql
As the name suggests, it persists the results of the sql property of the cube
in the data source (rather than Cube Store). original_sql pre-aggregations should
only be used when the cube’s sql is a complex query (i.e., nested subqueries,
window functions, and/or multiple joins).
They often do not provide much in the way of performance directly, but there are
two specific applications:
-
They can be used in tandem with the
use_original_sql_pre_aggregationsoption in other rollup pre-aggregations. -
Situations where it is not possible to use a
rolluppre-aggregations, such as funnels.
rollup_join
Cube is capable of performing joins between pre-aggregations from different
data sources to avoid making excessive queries
to them.
In the following example, we have a users cube with a users_rollup
pre-aggregation, and an orders cube with an orders_rollup pre-aggregation,
and an orders_with_users_rollup pre-aggregation. Note the following:
- Both cubes have different values for
data_source. - The type of
orders_with_users_rollupisrollup_join. - This pre-aggregation has a special property
rollupswhich is an array containing references to both “source” rollups.
rollup_join is not required to join cubes from the same data source; instead,
include the foreign cube’s dimensions/measures in the rollup definition
directly:rollup_lambda
A rollup_lambda pre-aggregation is a special type of pre-aggregation that can
combine data from data sources and other rollups. It is extremely useful in
scenarios where real-time data is required.
Lambda pre-aggregations can be used to combine
data from a data source and a pre-aggregation, or even from multiple
pre-aggregations across different cubes that share the same dimensions and
measures.
measures
The measures property is an array of measures from the
cube that should be included in the pre-aggregation:
dimensions
The dimensions property is an array of dimensions from the
cube that should be included in the pre-aggregation:
time_dimension
The time_dimension property can be any dimension of
type time. All other measures and dimensions in
the data model are aggregated. This property is an extremely useful tool for
improving performance with massive datasets.
granularity must also be included in the
pre-aggregation definition.
granularity
The granularity property defines the time dimension granularity of data
within the pre-aggregation. If set to week, for example, then Cube will
pre-aggregate the data by week and persist it to Cube Store.
second, minute, hour,
day, week, month, quarter, or year) or a custom
granularity. This property is required when using
time_dimension.
segments
The segments property is an array of segments from the
cube that can target the pre-aggregation:
partition_granularity
The partition_granularity defines the granularity for each
partition of the pre-aggregation:
hour, day, week, month, quarter, year. A
time_dimension and granularity
must also be included in the pre-aggregation definition. This property is
required when using partitioned pre-aggregations.
Number of partitions to be built per cube is calculated as
build_range divided by partition_granularity. Number of
partitions to be built per cube is multiplied by the count of time zones and
tenants in case different tenants have different pre-aggregation SQL.
refresh_key
Cube can also take care of keeping pre-aggregations up to date with the
refresh_key property. By default, it is set to every: '1 hour',
if neither of the cubes’ pre-aggregation references don’t override refresh_key.
When using partitioned pre-aggregations, the refresh
key is evaluated for each partition separately.
sql
You can set up a custom refresh check strategy by using the sql property:
every is not defined. If the results of the SQL
refresh key differ from the last execution, then the pre-aggregation will be
refreshed.
every
The refresh_key can define an every property which can be used to refresh
pre-aggregations based on a time interval. By default, it is set to 1 hour
unless the sql property is also defined in any of cubes pre-aggregation references, in which case
it is set to 10 seconds. For example:
every parameter, please refer to the
refreshKey documentation.
You can also use every with sql:
incremental
You can incrementally refresh partitioned rollups by setting
incremental: true. This option defaults to false.
update_window
The incremental: true flag generates a special refresh_key SQL query which
triggers a refresh for partitions where the end date lies within the
update_window from the current time.
In the example below, it will refresh today’s and the last 7 days of partitions
once a day. Partitions before the 7 day interval will not be refreshed
once they are built unless the rollup SQL is changed.
incremental
refreshes.
allow_non_strict_date_range_match
The allow_non_strict_date_range_match parameter is used to allow queries to match a
pre-aggregation even if a query contains a non-strict date range. It is set to true
by default via the [CUBEJS_PRE_AGGREGATIONS_ALLOW_NON_STRICT_DATE_RANGE_MATCH]ref-env-allow-non-strict
environment variable.
If this flag is set to false, Cube would check if requested date range exactly matches
pre-aggregation granularity. For example, if you’re requesting half of a day or your date
range filter is just one millisecond off for a pre-aggregation with the daily granularity,
Cube would not use such a pre-aggregation.
With this flag set to true, that strict check is lifted. It allows queries from BI tools
to still match pre-aggregations at the cost of a slight potential data discrepancy.
This is convenient when using Cube with visualization tools such as Tableau
or Apache Superset that use loose date ranges.
use_original_sql_pre_aggregations
Cube supports multi-stage pre-aggregations by reusing original SQL
pre-aggregations in rollups through the use_original_sql_pre_aggregations
property. It is helpful in cases where you want to re-use a heavy SQL query
calculation in multiple rollup pre-aggregations. Without
use_original_sql_pre_aggregations enabled, Cube will always re-execute all
underlying SQL calculations every time it builds new rollup tables.
scheduled_refresh
To always keep pre-aggregations up-to-date, you can set
scheduled_refresh: true. This option defaults to true.
In production mode, pre-aggregations with scheduled_refresh: false will not be
built automatically and require external orchestration to trigger their refresh.
Additionally, any scheduled_refresh: false pre-aggregations that were built manually or on-demand will be considered
stale and orphaned over time, and will be dropped by the refresh worker during
cleanup processes.
The refresh_key is used to determine if there’s a need to
update specific pre-aggregations on each scheduled refresh run. For partitioned
pre-aggregations, min and max dates for
time_dimension are checked to determine range for the
refresh.
Each time a scheduled refresh is run, it takes every pre-aggregation partition
starting with most recent ones in time and checks if the
refresh_key has changed. If a change was detected, then
that partition will be refreshed.
In development mode, Cube runs the background refresh by default and will
refresh all pre-aggregations which have scheduled_refresh: true.
Please consult Production Checklist for best
practices on running background refresh in production environments.
build_range_start and build_range_end
The build range defines what partitions should be built by a scheduled refresh.
By default, the build range is defined as the minimum and maximum values
possible for the time_dimension used in the rollup.
refresh_key to define
granular update settings. Set refresh_key.update_window to the interval in
which your data can change and build_range_start to the earliest point of time
when history should be available.
In the following example, refresh_key.update_window is 1 week and
build_range_start is SELECT NOW() - INTERVAL '365 day', so the scheduled
refresh will build historical partitions for 365 days in the past and will only
refresh last week’s data.
union_with_source_data
This option allows combining a pre-aggregation with fresh data retrieved from
the data source; this is extremely useful in scenarios where latency can be
sacrificed for accuracy.
To configure a pre-aggregation to behave in this way, ensure the pre-aggregation
is of type rollup_lambda, and then set union_with_source_data to true:
indexes
This option allows to define indexes. Indexes are used to
fine-tune pre-aggregation performance when pre-aggregations have significant
cardinality. (For aggregating indexes, see the
type option below.)
Here’s how you can define an index:
<cube name>__<time dimension name>_<granularity>:
original_sql pre-aggregations:
In some cases, indexes would not work with
original_sql pre-aggregations.
Please track this issue.