Lakeflow Declarative Pipelines event log schema

The Lakeflow Declarative Pipelines event log contains all information related to a pipeline, including audit logs, data quality checks, pipeline progress, and data lineage.

The following tables describe the event log schema. Some of these fields, such as the details field, contain JSON data that requires parsing before some queries can be performed. Azure Databricks supports the : operator to parse JSON fields. See : (colon sign) operator.
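
For example, a query along the following lines uses the : operator to read fields out of the details column. This is a minimal sketch: the table name my_catalog.my_schema.pipeline_event_log is a placeholder for wherever your pipeline's event log is exposed as a table, not a name from this article.

```sql
-- Minimal sketch: list recorded user actions.
-- my_catalog.my_schema.pipeline_event_log is a placeholder table name.
SELECT
  timestamp,
  details:user_action.action    AS action,     -- parsed from the details JSON string
  details:user_action.user_name AS user_name
FROM my_catalog.my_schema.pipeline_event_log
WHERE event_type = 'user_action'
ORDER BY timestamp DESC;
```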

Note

Some fields in the event log are for internal use by Azure Databricks. The following documentation describes the fields that are intended for customer consumption.

For details about using the Lakeflow Declarative Pipelines event log, see Lakeflow Declarative Pipelines event log.

PipelineEvent object

Represents a single pipeline event in the event log.

Field Description
id A unique identifier for the event log record.
sequence A JSON string containing metadata to identify and order events.
origin A JSON string containing metadata for the origin of the event, for example, the cloud provider, the cloud provider region, user, and pipeline information. See Origin object.
timestamp The time the event was recorded, in UTC.
message A human-readable message describing the event.
level The severity level of the event. The possible values are:
  • INFO: Informational events
  • WARN: Unexpected, but non-critical issues
  • ERROR: Event failure that might need user attention
  • METRICS: Used for high-volume events stored only in the Delta table, and not shown in the pipelines UI.
maturity_level The stability of the event schema. The possible values are:
  • STABLE: The schema is stable and will not change.
  • NULL: The schema is stable and will not change. The value might be NULL if the record was created before the maturity_level field was added (release 2022.37).
  • EVOLVING: The schema is not stable and might change.
  • DEPRECATED: The schema is deprecated and the Lakeflow Declarative Pipelines runtime might stop producing this event at any time.

Building monitoring or alerts based on EVOLVING or DEPRECATED events is not recommended.
error If an error occurred, details describing the error.
details A JSON string containing structured details of the event. This is the primary field used for analyzing events. The JSON string format depends on the event_type. See The details object for more information.
event_type The event type. For a list of event types, and what details object type they create, see The details object.
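
For example, the following sketch (using the same placeholder event log table as the earlier example) surfaces recent warnings and errors while skipping events whose schema is EVOLVING or DEPRECATED:

```sql
-- Minimal sketch: recent WARN/ERROR events with a stable schema.
SELECT timestamp, level, event_type, message, error
FROM my_catalog.my_schema.pipeline_event_log
WHERE level IN ('WARN', 'ERROR')
  AND (maturity_level = 'STABLE' OR maturity_level IS NULL)  -- NULL can appear on records older than release 2022.37
ORDER BY timestamp DESC
LIMIT 100;
```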

The details object

Each event has different details properties in the JSON object, based on the event_type of the event. This table lists each event_type and its associated details type. The details properties are described in the Details types section.

Details type by event_type Description
create_update Captures the complete configuration that is used to start a pipeline update. Includes any configuration set by Databricks. For details, see Details for create_update.
user_action Provides details on any user action on the pipeline (including creating a pipeline, as well as starting or canceling an update). For details, see Details for user_action event.
flow_progress Describes the lifecycle of a flow, from starting, through running, to completed or failed. For details, see Details for flow_progress event.
update_progress Describes the lifecycle of a pipeline update, from starting, through running, to completed or failed. For details, see Details for update_progress event.
flow_definition Defines the schema and query plan for any transformations occurring in a given flow. Can be thought of as the edges of the Dataflow DAG. It can be used to calculate the lineage for each flow as well as to see the explained query plan. For details, see Details for flow_definition event.
dataset_definition Defines a dataset, which is either the source or the destination for a given flow. For details, see Details for dataset_definition event.
sink_definition Defines a given sink. For details, see Details for sink_definition event.
deprecation Lists features that are soon to be or currently deprecated that this pipeline uses. For examples of the values, see Details enum for deprecation event.
cluster_resources Includes information about cluster resources for pipelines that are running on classic compute. These metrics are only populated for classic compute pipelines. For details, see Details for cluster_resources event.
autoscale Includes information about autoscaling for pipelines that are running on classic compute. These metrics are only populated for classic compute pipelines. For details, see Details for autoscale event.
planning_information Represents planning information related to materialized view incremental vs. full refresh. Can be used to get more details on why a materialized view is fully recomputed. For details, see Details for planning_information event.
hook_progress An event to indicate the current status of a user hook during the pipeline run. Used for monitoring the status of event hooks, for example, to send to external observability products. For details, see Details for hook_progress event.
operation_progress Includes information about the progress of an operation. For details, see Details for operation_progress event.

Details types

The following objects describe the details field for each event type in the PipelineEvent object.

Details for create_update

The details for the create_update event.

Field Description
dbr_version The version of the Databricks Runtime.
run_as The user ID that the update will run on behalf of. Typically this is either the owner of the pipeline or a service principal.
cause The reason for the update. Typically either JOB_TASK if run from a job, or USER_ACTION when run interactively by a user.

Details for user_action event

The details for the user_action event. Includes the following fields:

Field Description
user_name The name of the user that triggered a pipeline update.
user_id The ID of the user that triggered a pipeline update. This is not always the same as the run_as user, which could be a service principal or other user.
action The action the user took, including START and CREATE.

Details for flow_progress event

The details for a flow_progress event.

Field Description
status The new status of the flow. Can be one of:
  • QUEUED
  • STARTING
  • RUNNING
  • COMPLETED
  • FAILED
  • SKIPPED
  • STOPPED
  • IDLE
  • EXCLUDED
metrics Metrics about the flow. For details, see FlowMetrics.
data_quality Data quality metrics about the flow and associated expectations. For details, see DataQualityMetrics.
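
For example, a sketch along these lines tracks flow status over time. It uses the same placeholder event log table as the earlier examples and assumes the origin column can be read with dot notation (use the : operator instead if your event log stores origin as a JSON string):

```sql
-- Minimal sketch: status and output row counts reported by each flow.
SELECT
  origin.flow_name,                                             -- assumes origin is a struct
  timestamp,
  details:flow_progress.status                  AS status,
  details:flow_progress.metrics.num_output_rows AS num_output_rows
FROM my_catalog.my_schema.pipeline_event_log
WHERE event_type = 'flow_progress'
ORDER BY origin.flow_name, timestamp DESC;
```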

Details for update_progress event

The details for an update_progress event.

Field Description
state The new state of the update. Can be one of:
  • QUEUED
  • CREATED
  • WAITING_FOR_RESOURCES
  • INITIALIZING
  • RESETTING
  • SETTING_UP_TABLES
  • RUNNING
  • STOPPING
  • COMPLETED
  • FAILED
  • CANCELED

Useful for calculating the duration of various stages of a pipeline update, for example, the total duration or the time spent waiting for resources (see the sketch after this table).
cancellation_cause The reason why an update entered the CANCELED state. Includes reasons such as USER_ACTION or WORKFLOW_CANCELLATION (the workflow that triggered the update was canceled).
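
For example, the following sketch estimates the time each update spent waiting for resources, assuming INITIALIZING is the state entered after WAITING_FOR_RESOURCES. It uses the same placeholder table and the same origin caveat as the earlier examples:

```sql
-- Minimal sketch: seconds spent in WAITING_FOR_RESOURCES per update.
SELECT
  origin.update_id,                                             -- assumes origin is a struct
  TIMESTAMPDIFF(
    SECOND,
    MIN(CASE WHEN details:update_progress.state = 'WAITING_FOR_RESOURCES' THEN timestamp END),
    MIN(CASE WHEN details:update_progress.state = 'INITIALIZING' THEN timestamp END)
  ) AS seconds_waiting_for_resources
FROM my_catalog.my_schema.pipeline_event_log
WHERE event_type = 'update_progress'
GROUP BY origin.update_id;
```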

Details for flow_definition event

The details for a flow_definition event.

Field Description
input_datasets The inputs read by this flow.
output_dataset The output dataset this flow writes to.
output_sink The output sink this flow writes to.
explain_text The explained query plan.
schema_json Spark SQL JSON schema string.
schema Schema of this flow.
flow_type The type of flow. Can be one of:
  • COMPLETE: Streaming table writes to its destination in complete (streaming) mode.
  • CHANGE: Streaming table using APPLY CHANGES INTO.
  • SNAPSHOT_CHANGE: Streaming table using APPLY CHANGES INTO ... FROM SNAPSHOT ....
  • APPEND: Streaming table writes to its destination in append (streaming) mode.
  • MATERIALIZED_VIEW: Outputs to a materialized view.
  • VIEW: Outputs to a view.
comment User comment or description about the dataset.
spark_conf Spark confs set on this flow.
language The language used to create this flow. Can be SCALA, PYTHON, or SQL.
once Whether this flow was declared to run once.
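
For example, the following sketch builds simple lineage edges (inputs and output per flow), using the same placeholder table and origin caveat as the earlier examples:

```sql
-- Minimal sketch: one row per flow definition with its inputs and output.
SELECT
  origin.flow_name,                                    -- assumes origin is a struct
  details:flow_definition.input_datasets AS input_datasets,
  details:flow_definition.output_dataset AS output_dataset
FROM my_catalog.my_schema.pipeline_event_log
WHERE event_type = 'flow_definition';
```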

Details for dataset_definition event

The details for a dataset_definition event. Includes the following fields:

Field Description
dataset_type Differentiates between materialized views and streaming tables.
num_flows The number of flows writing to the dataset.
expectations The expectations associated with the dataset.

Details for sink_definition event

The details for a sink_definition event.

Field Description
format The format of the sink.
options The key-value options associated with the sink.

Details enum for deprecation event

The deprecation event has a message field. The possible values for the message include the following. This is a partial list that grows over time.

Value Description
TABLE_MANAGED_BY_MULTIPLE_PIPELINES A table is managed by multiple pipelines.
INVALID_CLUSTER_LABELS Using cluster labels that are not supported.
PINNED_DBR_VERSION Using dbr_version instead of channel in pipeline settings.
PREVIOUS_CHANNEL_USED Using the release channel PREVIOUS, which might go away in a future release.
LONG_DATASET_NAME Using a dataset name longer than the supported length.
LONG_SINK_NAME Using a sink name longer than the supported length.
LONG_FLOW_NAME Using a flow name longer than the supported length.
ENHANCED_AUTOSCALING_POLICY_COMPLIANCE Cluster policy only complies when Enhanced Autoscaling uses fixed cluster size.
DATA_SAMPLE_CONFIGURATION_KEY Using the configuration key to configure data sampling is deprecated.
INCOMPATIBLE_CLUSTER_SETTINGS Current cluster settings or cluster policy are no longer compatible with Lakeflow Declarative Pipelines.
STREAMING_READER_OPTIONS_DROPPED Using streaming reader options that are dropped.
DISALLOWED_SERVERLESS_STATIC_SPARK_CONFIG Setting static Spark configs through pipeline configuration for serverless pipelines is not allowed.
INVALID_SERVERLESS_PIPELINE_CONFIG The pipeline configuration provided for a serverless pipeline is invalid.
UNUSED_EXPLICIT_PATH_ON_UC_MANAGED_TABLE Specifying unused explicit table paths on UC managed tables.
FOREACH_BATCH_FUNCTION_NOT_SERIALIZABLE The provided foreachBatch function is not serializable.
DROP_PARTITION_COLS_NO_PARTITIONING Dropping the partition_cols attribute results in no partitioning.
PYTHON_CREATE_TABLE Using @dlt.create_table instead of @dlt.table.
PYTHON_CREATE_VIEW Using @dlt.create_view instead of @dlt.view.
PYTHON_CREATE_STREAMING_LIVE_TABLE Using create_streaming_live_table instead of create_streaming_table.
PYTHON_CREATE_TARGET_TABLE Using create_target_table instead of create_streaming_table.
FOREIGN_KEY_TABLE_CONSTRAINT_CYCLE The set of tables managed by the pipeline has a cycle in its foreign key constraints.
PARTIALLY_QUALIFIED_TABLE_REFERENCE_INCOMPATIBLE_WITH_DEFAULT_PUBLISHING_MODE A partially qualified table reference that has different meanings in default publishing mode and legacy publishing mode.

Details for cluster_resources event

The details for a cluster_resources event. Only applicable for pipelines running on classic compute.

Field Description
task_slot_metrics The task slot metrics of the cluster. For details, see TaskSlotMetrics object.
autoscale_info The state of autoscalers. For details, see AutoscaleInfo object.
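
For example, the following sketch (same placeholder table as earlier examples) charts task slot utilization over time for a classic compute pipeline:

```sql
-- Minimal sketch: task slot utilization reported by cluster_resources events.
SELECT
  timestamp,
  details:cluster_resources.task_slot_metrics.avg_task_slot_utilization AS avg_task_slot_utilization,
  details:cluster_resources.task_slot_metrics.num_task_slots            AS num_task_slots
FROM my_catalog.my_schema.pipeline_event_log
WHERE event_type = 'cluster_resources'
ORDER BY timestamp;
```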

Details for autoscale event

The details for an autoscale event. Autoscale events are only applicable when the pipeline uses classic compute.

Field Description
status Status of this event. Can be one of:
  • SUCCEEDED
  • RESIZING
  • FAILED
  • PARTIALLY_SUCCEEDED
optimal_num_executors The optimal number of executors suggested by the algorithm before applying min_workers and max_workers bounds.
requested_num_executors The number of executors after truncating the optimal number of executors suggested by the algorithm to min_workers and max_workers bounds.
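
For example, the following sketch (same placeholder table as earlier examples) compares the requested executor count against the algorithm's optimum for each autoscale event:

```sql
-- Minimal sketch: autoscaling decisions over time.
SELECT
  timestamp,
  details:autoscale.status                  AS status,
  details:autoscale.optimal_num_executors   AS optimal_num_executors,
  details:autoscale.requested_num_executors AS requested_num_executors
FROM my_catalog.my_schema.pipeline_event_log
WHERE event_type = 'autoscale'
ORDER BY timestamp DESC;
```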

Details for planning_information event

The details for a planning_information event. Useful for seeing details related to the chosen refresh type for a given flow during an update. Can be used to help debug why an update is fully refreshed rather than incrementally refreshed. For more details on incremental refreshes, see Incremental refresh for materialized views.

Field Description
technique_information Refresh-related information. It includes both information on what refresh methodology was chosen and the possible refresh methodologies that were considered. Useful for debugging why a materialized view failed to incrementalize. For more details, see TechniqueInformation.
source_table_information Source table information. Can be useful for debugging why a materialized view failed to incrementalize. For details, see TableInformation object.
target_table_information Target table information. For details, see TableInformation object.
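
For example, the following sketch (same placeholder table as earlier examples) pulls the most recent planning decisions, including the raw technique_information, which you can inspect to see why a full refresh was chosen:

```sql
-- Minimal sketch: recent planning decisions per target table.
SELECT
  timestamp,
  details:planning_information.target_table_information.table_name AS target_table,
  details:planning_information.technique_information               AS technique_information
FROM my_catalog.my_schema.pipeline_event_log
WHERE event_type = 'planning_information'
ORDER BY timestamp DESC
LIMIT 10;
```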

Details for hook_progress event

The details of a hook_progress event. Includes the following fields:

Field Description
name The name of the user hook.
status The status of the user hook.

Details for operation_progress event

The details of an operation_progress event. Includes the following fields:

Field Description
type The type of operation being tracked. One of:
  • AUTO_LOADER_LISTING
  • AUTO_LOADER_BACKFILL
  • CONNECTOR_FETCH
  • CDC_SNAPSHOT
status The status of the operation. One of:
  • STARTED
  • COMPLETED
  • CANCELED
  • FAILED
  • IN_PROGRESS
duration_ms The total elapsed time of the operation in milliseconds. Only included in the end event (where status is COMPLETED, CANCELED, or FAILED).
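
For example, the following sketch (same placeholder table as earlier examples) lists finished operations and their durations:

```sql
-- Minimal sketch: completed, canceled, or failed operations and how long they took.
SELECT
  timestamp,
  details:operation_progress.type        AS operation_type,
  details:operation_progress.status      AS status,
  details:operation_progress.duration_ms AS duration_ms
FROM my_catalog.my_schema.pipeline_event_log
WHERE event_type = 'operation_progress'
  AND details:operation_progress.status IN ('COMPLETED', 'CANCELED', 'FAILED');
```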

Other objects

The following objects represent additional data or enums within the event objects.

AutoscaleInfo object

The autoscale metrics for a cluster. Only applicable for pipelines running on classic compute.

Field Description
state The Autoscaling status. Can be one of:
  • SUCCEEDED
  • RESIZING
  • FAILED
  • PARTIALLY_SUCCEEDED
optimal_num_executors The optimal number of executors. This is the optimal size suggested by the algorithm before being truncated by the user-specified min/max number of executors.
latest_requested_num_executors The number of executors requested from the cluster manager by the state manager in the latest request. This is the number of executors the state manager is trying to scale to, and is updated when the state manager attempts to exit the scaling state in the event of timeouts. This field is not populated if there is no pending request.
request_pending_seconds The length of time the scaling request has been pending. This is not populated if there is no pending request.

CostModelRejectionSubType object

An enum of reasons that incrementalization is rejected, based on the cost of a full refresh versus an incremental refresh, in a planning_information event.

Value Description
NUM_JOINS_THRESHOLD_EXCEEDED Fully refresh because the query contains too many joins.
CHANGESET_SIZE_THRESHOLD_EXCEEDED Fully refresh because too many rows in the base tables changed.
TABLE_SIZE_THRESHOLD_EXCEEDED Fully refresh because the base table size exceeded the threshold.
EXCESSIVE_OPERATOR_NESTING Fully refresh because the query definition is complex and has many levels of operator nesting.
COST_MODEL_REJECTION_SUB_TYPE_UNSPECIFIED Fully refresh for any other reason.

DataQualityMetrics object

Metrics about how expectations are being met within the flow. Used in the details of a flow_progress event.

Field Description
dropped_records The number of records that were dropped because they failed one or more expectations.
expectations Metrics for expectations added to any dataset in the flow's query plan. When there are multiple expectations, this can be used to track which expectations were met or failed. For details, see ExpectationMetrics object.

ExpectationMetrics object

Metrics about a specific expectation.

Field Description
name The name of the expectation.
dataset The name of the dataset to which the expectation was added.
passed_records The number of records that pass the expectation.
failed_records The number of records that fail the expectation. Tracks whether the expectation was met, but does not describe what happens to the records (warn, fail, or drop the records).
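
For example, the following sketch (same placeholder table as earlier examples) explodes the expectations array out of the data quality metrics to total passed and failed records per expectation:

```sql
-- Minimal sketch: pass/fail totals per expectation across flow_progress events.
SELECT
  flow_events.expectation.name                AS expectation_name,
  flow_events.expectation.dataset             AS dataset,
  SUM(flow_events.expectation.passed_records) AS passed_records,
  SUM(flow_events.expectation.failed_records) AS failed_records
FROM (
  SELECT
    explode(
      from_json(
        details:flow_progress.data_quality.expectations,
        'array<struct<name: string, dataset: string, passed_records: bigint, failed_records: bigint>>'
      )
    ) AS expectation
  FROM my_catalog.my_schema.pipeline_event_log
  WHERE event_type = 'flow_progress'
    AND details:flow_progress.data_quality.expectations IS NOT NULL
) AS flow_events
GROUP BY flow_events.expectation.name, flow_events.expectation.dataset;
```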

FlowMetrics object

Metrics about the flow, including totals for the flow and metrics broken out by source. Used in the details of a flow_progress event.

Each streaming source supports only specific flow metrics. The supported streaming sources are Kafka, Kinesis, Delta, Auto Loader, and Google Pub/Sub; each source reports only the subset of backlog metrics (backlog bytes, backlog records, backlog seconds, and backlog files) that it supports.

Field Description
num_output_rows Number of output rows written by an update of this flow.
backlog_bytes Total backlog as bytes across all input sources in the flow.
backlog_records Total backlog records across all input sources in the flow.
backlog_files Total backlog files across all input sources in the flow.
backlog_seconds Maximum backlog seconds across all input sources in the flow.
executor_time_ms Sum of all task execution times in milliseconds of this flow over the reporting period.
executor_cpu_time_ms Sum of all task execution CPU times in milliseconds of this flow over the reporting period.
num_upserted_rows Number of output rows upserted into the dataset by an update of this flow.
num_deleted_rows Number of existing output rows deleted from the dataset by an update of this flow.
num_output_bytes Number of output bytes written by an update of this flow.
source_metrics Metrics for each input source in the flow. Useful for monitoring ingestion progress from sources outside Lakeflow Declarative Pipelines (like Apache Kafka, Pulsar, or Auto Loader). Includes the fields:
  • source_name: The name of the source.
  • backlog_bytes: Backlog as bytes for this source.
  • backlog_records: Backlog records for this source.
  • backlog_files: Backlog files for this source.
  • backlog_seconds: Backlog seconds for this source.
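
For example, the following sketch (same placeholder table as earlier examples) pulls the backlog metrics reported in flow_progress events:

```sql
-- Minimal sketch: backlog reported by each flow_progress event.
SELECT
  timestamp,
  details:flow_progress.metrics.backlog_bytes   AS backlog_bytes,
  details:flow_progress.metrics.backlog_records AS backlog_records,
  details:flow_progress.metrics.backlog_seconds AS backlog_seconds,
  details:flow_progress.metrics.backlog_files   AS backlog_files
FROM my_catalog.my_schema.pipeline_event_log
WHERE event_type = 'flow_progress'
  AND details:flow_progress.metrics IS NOT NULL
ORDER BY timestamp DESC;
```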

IncrementalizationIssue object

Represents issues with incrementalization that could cause a full refresh when planning an update.

Field Description
issue_type An issue type that could prevent the materialized view from incrementalizing. For details, see IssueType object.
prevent_incrementalization Whether this issue prevented the incrementalization from happening.
table_information Table information associated with issues like CDF_UNAVAILABLE, INPUT_NOT_IN_DELTA, DATA_FILE_MISSING.
operator_name Plan-related information. When the issue type is PLAN_NOT_DETERMINISTIC or PLAN_NOT_INCREMENTALIZABLE, this is set to the operator or expression that causes the non-determinism or non-incrementalizability.
expression_name The expression name.
join_type Auxiliary information when the operator is a join. For example, JOIN_TYPE_LEFT_OUTER or JOIN_TYPE_INNER.
plan_not_incrementalizable_sub_type Detailed category when the issue type is PLAN_NOT_INCREMENTALIZABLE. For details, see PlanNotIncrementalizableSubType object.
plan_not_deterministic_sub_type Detailed category when the issue type is PLAN_NOT_DETERMINISTIC. For details, see PlanNotDeterministicSubType object.
fingerprint_diff_before The diff from the fingerprint before.
fingerprint_diff_current The diff from the current fingerprint.
cost_model_rejection_subtype Detailed category when the issue type is INCREMENTAL_PLAN_REJECTED_BY_COST_MODEL. For details, see CostModelRejectionSubType object.

IssueType object

An enum of issue types that could cause a full refresh.

Value Description
CDF_UNAVAILABLE CDF (Change Data Feed) is not enabled on some base tables. The table_information field identifies which tables do not have CDF enabled. Use ALTER TABLE <table-name> SET TBLPROPERTIES ('delta.enableChangeDataFeed' = true) to enable CDF for the base table. If the source table is a materialized view, CDF is enabled by default.
DELTA_PROTOCOL_CHANGED Fully refresh because some base tables (details in the table_information field) had a Delta protocol change.
DATA_SCHEMA_CHANGED Fully refresh because some base tables (details in the table_information field) had a data schema change in the columns used by the materialized view definition. Not relevant if a column that the materialized view does not use has been changed or added to the base table.
PARTITION_SCHEMA_CHANGED Fully refresh because some base tables (details in the table_information field) had a partition schema change.
INPUT_NOT_IN_DELTA Fully refresh because the materialized view definition involves some non-Delta input.
DATA_FILE_MISSING Fully refresh because some base table files are already vacuumed due to their retention period.
PLAN_NOT_DETERMINISTIC Fully refresh because some operators or expressions in the materialized view definition are not deterministic. The operator_name and expression_name fields give information on which operator or expression caused the issue.
PLAN_NOT_INCREMENTALIZABLE Fully refresh because some operators or expressions in the materialized view definition are not incrementalizable.
SERIALIZATION_VERSION_CHANGED Fully refresh because there was a significant change in the query fingerprinting logic.
QUERY_FINGERPRINT_CHANGED Fully refresh because the materialized view definition changed, or Lakeflow Declarative Pipelines releases caused a change in the query evaluation plans.
CONFIGURATION_CHANGED Fully refresh because key configurations (for example, spark.sql.ansi.enabled) that might affect query evaluation have changed. Full recompute is required to avoid inconsistent states in the materialized view.
CHANGE_SET_MISSING Fully refresh because it is the first compute of the materialized view. This is expected behavior for initial materialized view computation.
EXPECTATIONS_NOT_SUPPORTED Fully refresh because the materialized view definition includes expectations, which are not supported for incremental updates. Remove expectations or handle them outside of the materialized view definition if incremental support is needed.
TOO_MANY_FILE_ACTIONS Fully refresh because the number of file actions exceeded the threshold for incremental processing. Consider reducing file churn in base tables or increasing thresholds.
INCREMENTAL_PLAN_REJECTED_BY_COST_MODEL Fully refresh because the cost model determined that a full refresh is more efficient than incremental maintenance. Review the cost model behavior or complexity of the query plan to allow incremental updates.
ROW_TRACKING_NOT_ENABLED Fully refresh because row tracking is not enabled on one or more base tables. Enable row tracking using ALTER TABLE <table-name> SET TBLPROPERTIES ('delta.enableRowTracking' = true).
TOO_MANY_PARTITIONS_CHANGED Fully refresh because too many partitions changed in the base tables. Try to limit the number of partition changes to stay within incremental processing limits.
MAP_TYPE_NOT_SUPPORTED Fully refresh because the materialized view definition includes a map type, which is not supported for incremental updates. Consider restructuring the data to avoid map types in the materialized view.
TIME_ZONE_CHANGED Fully refresh because the session or system time zone setting changed.
DATA_HAS_CHANGED Fully refresh because the data relevant to the materialized view changed in a way that prevents incremental updates. Evaluate the data changes and structure of the view definition to ensure compatibility with incremental logic.
PRIOR_TIMESTAMP_MISSING Fully refresh because the timestamp of the last successful run is missing. This can occur after metadata loss or manual intervention.

MaintenanceType object

An enum of maintenance types that might be chosen during a planning_information event. If the type is not MAINTENANCE_TYPE_COMPLETE_RECOMPUTE or MAINTENANCE_TYPE_NO_OP, the type is an incremental refresh.

Value Description
MAINTENANCE_TYPE_COMPLETE_RECOMPUTE Full recompute; always shown.
MAINTENANCE_TYPE_NO_OP When base tables do not change.
MAINTENANCE_TYPE_PARTITION_OVERWRITE Incrementally refresh affected partitions when the materialized view is co-partitioned with one of the source tables.
MAINTENANCE_TYPE_ROW_BASED Incrementally refresh by creating modular changesets for various operations, such as JOIN, FILTER, and UNION ALL, and composing them to calculate complex queries. Used when Row tracking for the source tables is enabled, and there is a limited number of joins for the query.
MAINTENANCE_TYPE_APPEND_ONLY Incrementally refresh by only computing new rows because there were no upserts or deletes in the source tables.
MAINTENANCE_TYPE_GROUP_AGGREGATE Incrementally refresh by calculating changes for each aggregate value. Used when associative aggregates, such as count, sum, mean, and stddev, are at the topmost level of the query.
MAINTENANCE_TYPE_GENERIC_AGGREGATE Incrementally refresh by calculating only the affected aggregate groups. Used when aggregates like median (not just associative ones) are at the topmost level of the query.
MAINTENANCE_TYPE_WINDOW_FUNCTION Incrementally refresh queries with window functions like PARTITION BY by recomputing only the changed partitions. Used when all of the window functions have a PARTITION BY or JOIN clause and are at the topmost level of the query.

Origin object

Where the event originated.

Field Description
cloud The cloud provider. The possible values are:
  • AWS
  • Azure
  • GCP
region The cloud region.
org_id The org id or workspace ID of the user. Unique within a cloud. Useful to identify the workspace, or to join with other tables, such as system billing tables.
pipeline_id The id of the pipeline. A unique identifier for the pipeline. Useful to identify the pipeline, or to join with other tables, such as system billing tables.
pipeline_type The type of the pipeline, which indicates where the pipeline was created. The possible values are:
  • DBSQL: A pipeline created via Databricks SQL.
  • WORKSPACE: An ETL pipeline created via Lakeflow Declarative Pipelines.
  • MANAGED_INGESTION: A Lakeflow Connect managed ingestion pipeline.
  • BRICKSTORE: A pipeline to update an online table for real-time feature serving.
  • BRICKINDEX: A pipeline to update a vector database. For more details, see vector search.
pipeline_name The name of the pipeline.
cluster_id The id of the cluster where an execution happens. Globally unique.
update_id The id of a single execution of the pipeline. This is equivalent to run ID.
table_name The name of the (Delta) table being written to.
dataset_name The fully qualified name of a dataset.
sink_name The name of a sink.
flow_id The id of the flow. It tracks the state of the flow being used across multiple updates. As long as the flow_id stays the same, the flow is refreshing incrementally. The flow_id changes when the materialized view is fully refreshed, the checkpoint is reset, or a full recomputation occurs within the materialized view.
flow_name The name of the flow.
batch_id The id of a microbatch. Unique within a flow.
request_id The id of the request that caused an update.
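
For example, the following sketch groups events on origin fields to count events per pipeline and update, with the same placeholder table and origin caveat as the earlier examples:

```sql
-- Minimal sketch: event counts per pipeline, update, and level.
SELECT
  origin.pipeline_name,    -- assumes origin is a struct
  origin.update_id,
  level,
  COUNT(*) AS num_events
FROM my_catalog.my_schema.pipeline_event_log
GROUP BY origin.pipeline_name, origin.update_id, level
ORDER BY origin.pipeline_name, origin.update_id;
```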

PlanNotDeterministicSubType object

An enum of non-deterministic cases for a planning_information event.

Value Description
STREAMING_SOURCE Fully refresh because the materialized view definition includes a streaming source, which is not supported.
USER_DEFINED_FUNCTION Fully refresh because the materialized view includes an unsupported user-defined function. Only deterministic Python UDFs are supported. Other UDFs might prevent incremental updates.
TIME_FUNCTION Fully refresh because the materialized view includes a time-based function such as CURRENT_DATE or CURRENT_TIMESTAMP. The expression_name property provides the name of the unsupported function.
NON_DETERMINISTIC_EXPRESSION Fully refresh because the query includes a non-deterministic expression such as RANDOM(). The expression_name property indicates the non-deterministic function that prevents incremental maintenance.

PlanNotIncrementalizableSubType object

An enum of reasons an update plan might not be incrementalizable.

Value Description
OPERATOR_NOT_SUPPORTED Fully refresh because the query plan includes an unsupported operator. The operator_name property provides the name of the unsupported operator.
AGGREGATE_NOT_TOP_NODE Fully refresh because an aggregate (GROUP BY) operator is not at the top level of the query plan. Incremental maintenance supports aggregates only at the top level. Consider defining two materialized views to separate the aggregation.
AGGREGATE_WITH_DISTINCT Fully refresh because the aggregation includes a DISTINCT clause, which is not supported for incremental updates.
AGGREGATE_WITH_UNSUPPORTED_EXPRESSION Fully refresh because the aggregation includes unsupported expressions. The expression_name property indicates the problematic expression.
SUBQUERY_EXPRESSION Fully refresh because the materialized view definition includes a subquery expression, which is not supported.
WINDOW_FUNCTION_NOT_TOP_LEVEL Fully refresh because a window function is not at the top level of the query plan.
WINDOW_FUNCTION_WITHOUT_PARTITION_BY Fully refresh because a window function is defined without a PARTITION BY clause.

TableInformation object

Represents details of a table considered during a planning_information event.

Field Description
table_name Table name used in the query from Unity Catalog or Hive metastore. Might not be available in case of path-based access.
table_id Required. Table ID from the Delta log.
catalog_table_type Type of the table as specified in the catalog.
partition_columns Partition columns of the table.
table_change_type Change type in the table. One of: TABLE_CHANGE_TYPE_UNKNOWN, TABLE_CHANGE_TYPE_APPEND_ONLY, TABLE_CHANGE_TYPE_GENERAL_CHANGE.
full_size The full size of the table in number of bytes.
change_size Size of the changed rows in changed files. It is calculated using change_file_read_size * num_changed_rows / num_rows_in_changed_files.
num_changed_partitions Number of changed partitions.
is_size_after_pruning Whether full_size and change_size represent data after static file pruning.
is_row_id_enabled Whether row ID is enabled on the table.
is_cdf_enabled Whether CDF is enabled on the table.
is_deletion_vector_enabled Whether deletion vector is enabled on the table.
is_change_from_legacy_cdf Whether the table change is from legacy CDF or row-ID-based CDF.

TaskSlotMetrics object

The task slot metrics for a cluster. Only applies to pipeline updates running on classic compute.

Field Description
summary_duration_ms The duration in milliseconds over which aggregate metrics (for example, avg_num_task_slots) are calculated.
num_task_slots The number of Spark task slots at the reporting instant.
avg_num_task_slots The average number of Spark task slots over the summary duration.
avg_task_slot_utilization The average task slot utilization (number of active tasks divided by number of task slots) over the summary duration.
num_executors The number of Spark executors at the reporting instant.
avg_num_queued_tasks The average task queue size (number of total tasks minus number of active tasks) over the summary duration.

TechniqueInformation object

Refresh methodology information for a planning event.

Field Description
maintenance_type The maintenance type related to this piece of information. If the type is not MAINTENANCE_TYPE_COMPLETE_RECOMPUTE or MAINTENANCE_TYPE_NO_OP, the flow was incrementally refreshed. For details, see MaintenanceType object.
is_chosen True for the technique that was chosen for the refresh.
is_applicable Whether the maintenance type is applicable.
incrementalization_issues Incrementalization issues that might cause an update to fully refresh. For details, see IncrementalizationIssue object.
change_set_information Information about the final produced change set. Values are one of:
  • CHANGE_SET_TYPE_APPEND_ONLY
  • CHANGE_SET_TYPE_GENERAL_ROW_CHANGE