Monitor Apache Spark applications on Amazon EMR with Amazon CloudWatch

To improve a Spark application’s efficiency, it’s essential to monitor its performance and behavior. In …

Orca Security’s journey to a petabyte-scale data lake with Apache Iceberg and AWS Analytics

This post is co-written with Eliad Gat and Oded Lifshiz from Orca Security. With data …

Enable remote reads from Azure ADLS with SAS tokens using Spark in Amazon EMR

Organizations use data from many sources to understand, analyze, and grow their business. These data …

Cost monitoring for Amazon EMR on Amazon EKS

Amazon EMR is the industry-leading cloud big data solution, providing a collection of open-source frameworks …

Introducing Amazon EMR on EKS job submission with Spark Operator and spark-submit

Amazon EMR on EKS provides a deployment option for Amazon EMR that allows organizations to …

Improve operational efficiencies of Apache Iceberg tables built on Amazon S3 data lakes

Apache Iceberg is an open table format for large datasets in Amazon Simple Storage Service …

How Zoom implemented streaming log ingestion and efficient GDPR deletes using Apache Hudi on Amazon EMR

In today’s digital age, logging is a critical aspect of application development and management, but …

Build, deploy, and run Spark jobs on Amazon EMR with the open-source EMR CLI tool

Today, we’re pleased to introduce the Amazon EMR CLI, a new command line tool to …

Accelerate HiveQL with Oozie to Spark SQL migration on Amazon EMR

Many customers run big data workloads such as extract, transform, and load (ETL) on Apache …

Connect Amazon EMR and RStudio on Amazon SageMaker

RStudio on Amazon SageMaker is the industry’s first fully managed RStudio Workbench integrated development environment …

How CyberSolutions built a scalable data pipeline using Amazon EMR Serverless and the AWS Data Lab

This post is co-written by Constantin Scoarță and Horațiu Măiereanu from CyberSolutions Tech. CyberSolutions is …

Accelerate time to insight with Amazon SageMaker Data Wrangler and the power of Apache Hive

Amazon SageMaker Data Wrangler reduces the time it takes to aggregate and prepare data for …

Build incremental data pipelines to load transactional data changes using AWS DMS, Delta 2.0, and Amazon EMR Serverless

Building data lakes from continuously changing transactional data of databases and keeping data lakes up …

Use Apache Iceberg in a data lake to support incremental data processing

Apache Iceberg is an open table format for very large analytic datasets, which captures metadata …

Reduce Amazon EMR cluster costs by up to 19% with new enhancements in Amazon EMR Managed Scaling

In June 2020, AWS announced the general availability of Amazon EMR Managed Scaling. With EMR Managed …