Simplify Amazon Redshift monitoring using the new unified SYS views

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud, providing up to five times better price-performance than any other cloud data warehouse, with performance innovation out of the box at no additional cost to you. Tens of thousands of customers use Amazon Redshift to process exabytes of data every day to power their analytics workloads.

In this post, we discuss Amazon Redshift SYS monitoring views and how they simplify the monitoring of your Amazon Redshift workloads and resource usage.

Overview of SYS monitoring views

SYS monitoring views are system views in Amazon Redshift that can be used to monitor query and workload resource usage for provisioned clusters as well as for serverless workgroups. They offer the following benefits:

They’re categorized based on functional alignment, considering query state, performance metrics, and query types
We have introduced new performance metrics like planning_time, lock_wait_time, remote_read_io, and local_read_io to aid in performance troubleshooting
It improves the usability of monitoring views by logging the user-submitted query instead of the Redshift optimizer-rewritten query
It provides more troubleshooting metrics using fewer views
It enables unified Amazon Redshift monitoring by enabling you to use the same query across provisioned clusters or serverless workgroups

Let’s look at some of the features of SYS monitoring views and how they can be used for monitoring.

Unify various query-level monitoring metrics

The following table shows how you can unify various metrics and information for a query from multiple system tables & views into one SYS monitoring view.

STL/SVL/STV	Information element	SYS Monitoring View	View columns
STL_QUERY	elapsed time, query label, user ID, transaction, session, label, stopped queries, database name	SYS_QUERY_HISTORY	user_id query_id query_label transaction_id session_id database_name query_type status result_cache_hit start_time end_time elapsed_time queue_time execution_time error_message returned_rows returned_bytes query_text redshift_version usage_limit compute_type compile_time planning_time lock_wait_time
STL_WLM_QUERY	queue time, runtime
SVL_QLOG	result cache
STL_ERROR	error code, error message
STL_UTILITYTEXT	non-SELECT SQL
STL_DDLTEXT	DDL statements
SVL_STATEMENTEXT	all types of SQL statements
STL_RETURN	return rows and bytes
STL_USAGE_CONTROL	usage limit
STV_WLM_QUERY_STATE	current state of WLM
STV_RECENTS	recent and in-flight queries
STV_INFLIGHT	in-flight queries
SVL_COMPILE	compilation

For additional information on SYS to STL/SVL/STV mapping, refer to Migrating to SYS monitoring views.

User query-level logging

To enhance query performance, the Redshift query engine can rewrite user-submitted queries. The user-submitted query identifier is different than the rewritten query identifier. We refer to the user-submitted query as the parent query and the rewritten query as the child query in this post.

The following diagram illustrates logging at the parent query level and child query level. The parent query identifier is 1000, and the child query identifiers are 1001, 1002, and 1003.

Query lifecycle timings

SYS_QUERY_HISTORY has an enhanced list of columns to provide granular time metrics relating to the different query lifecycle phases. Note all times are recorded in microseconds. The following table summarizes these metrics.

Time metrics	Description
planning_time	The time the query spent prior to running the query, which typically includes query lifecycle phases like parse, analyze, planning and rewriting.
lock_wait_time	The time the query spent on acquiring the locks on the required database objects referenced.
queue_time	The time the query spent in the queue waiting for resources to be available to run.
compile_time	The time the query spent compiling.
execution_time	The time the query spent running. In the case of a SELECT query, this also includes the return time.
elapsed_time	The end-to-end time of the query run.

Solution overview

We discuss the following scenarios to help gain familiarity with the SYS monitoring views:

Workload and query lifecycle monitoring
Data ingestion monitoring
External query monitoring
Slow query performance troubleshooting

Prerequisites

You should have the following prerequisites to follow along with the examples in this post:

Additionally, download all the SQL queries that are referenced in this post as Redshift Query Editor v2 SQL notebooks.

Workload and query lifecycle monitoring

In this section, we discuss how to monitor the workload and query lifecycle.

Identify in-flight queries

SYS_QUERY_HISTORY provides a singular view to look at all the in-flight queries as well as historical runs. See the following example query:

SELECT *
FROM sys_query_history
WHERE status IN ('planning', 'queued', 'running', 'returning')
ORDER BY start_time;

We get the following output.

Identify top long-running queries

The following query helps retrieve the top 100 queries that are taking the longest to run. Analyzing (and, if feasible, optimizing) these queries can help improve overall performance. These metrics are accumulated statistics across all runs of the query. Note that all the time values are in microseconds.

--top long running query by elapsed_time
SELECT user_id , transaction_id , query_id , database_name , query_type , query_text::VARCHAR(100) , lock_wait_time , planning_time , compile_time , execution_time , elapsed_time
FROM sys_query_history
ORDER BY elapsed_time DESC
LIMIT 100;

We get the following output.

Gather daily counts of queries by query types, period, and status

The following query provides insight into the distribution of different types of queries across different days and helps evaluate and track any changes in the workload:

--daily breakdown of workload by query types and status
SELECT DATE_TRUNC('day', start_time) period_daily , query_type , status , COUNT(*)
FROM sys_query_history
GROUP BY period_daily , query_type , status
ORDER BY period_daily , query_type , status;

We get the following output.

Gather run details of an in-flight query

To determine the run-level details of a query that is in-flight, you can use the is_active = ‘t’ filter when querying the SYS_QUERY_DETAIL table. See the following example:

SELECT query_id , child_query_sequence , stream_id , segment_id , step_id , step_name , table_id , coalesce(table_name,'')|| coalesce(source,'') as table_name , start_time , end_time , duration , blocks_read , local_read_io , remote_read_io
FROM sys_query_detail
WHERE is_active = 't'
ORDER BY query_id , child_query_sequence , stream_id , segment_id , step_id;

To view the latest 100 COPY queries run, use the following code:

SELECT session_id , transaction_id , query_id , database_name , table_name , data_source , loaded_rows , loaded_bytes , duration / 1000.00 duration_ms
FROM sys_load_history
ORDER BY start_time DESC LIMIT 100;

We get the following output.

Gather transaction-level details for commits and undo

SYS_TRANSACTION_HISTORY provides transaction-level logging by providing insights into committed transactions with details like blocks committed, status, and isolation level (serializable or snapshot used). It also logs details about the rolled back or undo transactions.

The following screenshots illustrate fetching details about a transaction that was committed successfully.

The following screenshots illustrate fetching details about a transaction that was rolled back.

Stats and vacuum

The SYS_ANALYZE_HISTORY monitoring view provides details like the last timestamp of analyze queries, the duration for which a particular analyze query ran, the number of rows in the table, and the number of rows modified. The following example query provides a list of the latest analyze queries that ran for all the permanent tables:

SELECT TRIM(schema_name) schema_name , TRIM(table_name) table_name , table_id , status , COUNT(*) times_analyze_was_triggered , MAX(last_analyze_time) last_analyze_time , MAX(end_time) end_time , AVG(ROWS) "rows" , AVG(modified_rows) modified_rows
FROM sys_analyze_history
WHERE status != 'Skipped'
GROUP BY schema_name , table_name , table_id , status
ORDER BY schema_name , table_name , table_id , status , end_time;

We get the following output.

The SYS_VACUUM_HISTORY monitoring view provides a complete set of details on VACUUM in a single view. For example, see the following code:

SELECT user_id , transaction_id , query_id , TRIM(database_name) as database_name , TRIM(schema_name) as schema_name , TRIM(table_name) table_name , table_id , vacuum_type , is_automatic as is_auto , duration , rows_before_vacuum , size_before_vacuum , reclaimable_rows , reclaimed_rows , reclaimed_blocks , sortedrows_before_vacuum , sortedrows_after_vacuum
FROM sys_vacuum_history
WHERE status LIKE '%Finished%'
ORDER BY start_time;

We get the following output.

Data ingestion monitoring

In this section, we discuss how to monitor data ingestion.

Summary of ingestion

SYS_LOAD_HISTORY provides details into the statistics of COPY commands. Use this view for summarized insights into your ingestion workload. The following example query provides an hourly summary of ingestion broken down by tables in which data was ingested:

SELECT date_trunc('hour', start_time) period_hourly , database_name , table_name , status , file_format , SUM(loaded_rows) total_rows_ingested , SUM(loaded_bytes) total_bytes_ingested , SUM(source_file_count) num_of_files_to_process , SUM(file_count_scanned) num_of_files_processed , SUM(error_count) total_errors
FROM sys_load_history
GROUP BY period_hourly , database_name , table_name , status , file_format
ORDER BY table_name , period_hourly , status;

We get the following output.

File-level ingress logging

SYS_LOAD_DETAIL provides more granular insights into how ingestion is performed at the file level. For example, see the following query using sys_load_history:

SELECT *
FROM sys_load_history
WHERE table_name = 'catalog_sales'
ORDER BY start_time;

We get the following output.

The following example shows what detailed file-level monitoring looks like:

 SELECT user_id , query_id , TRIM(file_name) file_name , bytes_scanned , lines_scanned , splits_scanned , record_time , start_time , end_time
FROM sys_load_detail
WHERE query_id = 1824870
ORDER BY start_time;

Check for errors during ingress process

SYS_LOAD_ERROR_DETAIL enables you to track and troubleshoot errors that may have occurred during the ingestion process. This view logs details for the file that encountered the error during the ingestion process along with the line number at which the error occurred and column details within that line. See the following code:

select * from sys_load_error_detail order by start_time limit 100;

We get the following output.

External query monitoring

SYS_EXTERNAL_QUERY_DETAIL provides run details for external queries, which includes Amazon Redshift Spectrum and federated queries. This view logs details at the segment level and provides useful insights to troubleshoot and monitor performance of external queries in a single monitoring view. The following are a few useful metrics and data points this monitoring view provides:

Number of external files scanned (scanned_files) and format of external files (file_format) such as Parquet, text file, and so on
Data scanned in terms of rows (returned_rows) and bytes (returned_bytes)
Usage of partitioning (total_partitions and qualified_partitions) by external queries and tables
Granular insights into time taken in listing (s3list_time) and qualifying partitions (get_partition_time) for a given external object
External file location (file_location) and external table name (table_name)
Type of external source (source_type), such as Amazon Simple Storage Service (Amazon S3) for Redshift Spectrum, or federated
Recursive scan for subdirectories (is_recursive) or access of nested column data type (is_nested)

For example, the following query shows the daily summary of the number of external queries run and data scanned:

SELECT DATE_TRUNC('hour', start_time) period_hourly , user_id , TRIM(source_type) source_type , COUNT (DISTINCT query_id) query_counts , SUM(returned_rows) returned_rows , ROUND(SUM(returned_bytes) / 1024^3,2) returned_gb
FROM sys_external_query_detail
GROUP BY period_hourly , user_id , source_type
ORDER BY period_hourly , user_id , source_type;

We get the following output.

Usage of partitions

You can verify whether the external queries scanning large sums of data and files are partitioned or not. When you use partitions, you can restrict the amount of data that your external query has to scan by pruning based on the partition key. See the following code:

SELECT file_location , CASE WHEN NVL(total_partitions,0) = 0 THEN 'No' ELSE 'Yes' END is_partitioned , SUM(scanned_files) total_scanned_files , COUNT(DISTINCT query_id) query_count
FROM sys_external_query_detail
GROUP BY file_location , is_partitioned
ORDER BY total_scanned_files DESC;

We get the following output.

For any errors encountered with external queries, look into SYS_EXTERNAL_QUERY_ERROR, which logs details at the granularity of file_location, column, and rowid within that file.

Slow query performance troubleshooting

Refer to the sysview_slow_query_performance_troubleshooting SQL notebook downloaded as part of the prerequisites for a step-by-step guide on how to perform query-level troubleshooting using SYS monitoring views and find answers to the following questions:

Do the queries being compared have similar query text?
Did the query use the result cache?
Which parts of the query lifecycle (queuing, compilation, planning, lock wait) are contributing the most to query runtimes?
Has the query plan changed?
Is the query reading more data blocks?
Is the query spilling to disk? If so, is it spilling to local or remote storage?
Is the query highly skewed with respect to data (distribution) and time (runtime)?
Do you see more rows processed in join steps or nested loops?
Are there any alerts indicating staleness in statistics?
When was the last vacuum and analyze performed for the tables involved in the query?

Clean up

If you created any Redshift provisioned clusters or Redshift Serverless workgroups as part of this post and no longer need them for your workloads, you can delete them to avoid incurring additional costs.

Conclusion

In this post, we explained how you can use the Redshift SYS monitoring views to monitor workloads of provisioned clusters and serverless workgroups. The SYS monitoring views provide simplified monitoring of the workloads, access to various query-level monitoring metrics from a unified view, and the ability to use the same SYS monitoring view query to run across both provisioned clusters and serverless workgroups. We also covered some key monitoring and troubleshooting scenarios using SYS monitoring views.

We encourage you to start using the new SYS monitoring views for your Redshift workloads. If you have any feedback or questions, please leave them in the comments.

About the authors

Urvish Shah is a Senior Database Engineer at Amazon Redshift. He has more than a decade of experience working on databases, data warehousing and in analytics space. Outside of work, he enjoys cooking, travelling and spending time with his daughter.

Ranjan Burman is a Analytics Specialist Solutions Architect at AWS. He specializes in Amazon Redshift and helps customers build scalable analytical solutions. He has more than 15 years of experience in different database and data warehousing technologies. He is passionate about automating and solving customer problems with the use of cloud solutions.

Source: https://aws.amazon.com/blogs/big-data/simplify-amazon-redshift-monitoring-using-the-new-unified-sys-views/

Simplify Amazon Redshift monitoring using the new unified SYS views

Overview of SYS monitoring views

Unify various query-level monitoring metrics

User query-level logging

Query lifecycle timings

Solution overview

Prerequisites

Workload and query lifecycle monitoring

Identify in-flight queries

Identify top long-running queries

Gather daily counts of queries by query types, period, and status

Gather run details of an in-flight query

Gather transaction-level details for commits and undo

Stats and vacuum

Data ingestion monitoring

Summary of ingestion

File-level ingress logging

Check for errors during ingress process

External query monitoring

Usage of partitions

Slow query performance troubleshooting

Clean up

Conclusion

About the authors

Welcome to

Accessibility Dashboard