How Presto & Spark Are Paving the Way for an Efficient ETL Approach


Artificial Intelligence (AI) has had a major influence on business intelligence and operational processing. Since AI systems depend upon large datasets to deliver results, the IT infrastructure must support fast loading of new data through extract, transform and load (ETL) processes, low-latency data retrieval, and the ability to deploy to the cloud. The latest combination of infrastructure and database engine is Presto on Apache Spark. In this article, we look at what a modern organization needs in its IT footprint to enable efficient ETL and AI processes to co-exist.

The New Kids on the Block

Apache Spark is currently one of the most common choices for combined ETL and AI workloads. The original Spark software was developed at the University of California, Berkeley's AMPLab beginning in 2009. It was built as open-source software, with the resilient distributed dataset (RDD) as its primary abstraction. Spark was developed in response to the limitations of the MapReduce cluster computing paradigm in Hadoop, which we discuss below.

Presto is an open-source, low-latency SQL query engine for analytical processing that runs interactive analytic queries against data sources of all sizes, from gigabytes to petabytes. It was designed and written from the ground up for interactive analytics, approaches the speed of commercial data warehouses, and scales to serve large organizations like Facebook.
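To make this concrete, here is a minimal sketch of what an interactive Presto query might look like from Python, using the presto-python-client package. The coordinator host, catalog, schema, and table names are placeholders rather than references to any particular deployment:

# Minimal interactive query against a Presto coordinator, using the
# presto-python-client package (pip install presto-python-client).
# Host, catalog, schema, and table names are placeholders.
import prestodb

conn = prestodb.dbapi.connect(
    host="presto-coordinator.example.com",  # placeholder coordinator host
    port=8080,
    user="analyst",
    catalog="hive",
    schema="default",
)
cur = conn.cursor()
# An ad hoc analytical query; results stream back interactively.
cur.execute("""
    SELECT region, SUM(order_total) AS revenue
    FROM orders
    WHERE order_date >= DATE '2021-01-01'
    GROUP BY region
    ORDER BY revenue DESC
    LIMIT 10
""")
for region, revenue in cur.fetchall():
    print(region, revenue)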

As Facebook Data Platform Team engineer Wenlei Xie notes, “Presto is designed for low latency and follows the classic MPP architecture. It uses in-memory streaming shuffle to achieve low latency.” Presto is a common cloud-based engine used by cloud providers such as Amazon AWS. But first, let’s discuss what choices are available for ETL and AI.

Learn More: Moving to the Cloud: 7 Workloads That Are Ready for Cloud Transition

A Dive Into the History of Analytics

A discussion of the latest trends in ETL and AI processing begins with big data and the evolution of analytics. The first big data applications were for forecasting based on historical data. IT extracted operational data on a regular basis (usually nightly) to store within a stand-alone big data solution. Such processing became known as extract, transform and load, or ETL.

Data tended to be well-structured and well-understood. Most data items were either text-based or numeric, such as product names, quantities, and prices. These data were extracted from operational databases and loaded into data marts and data warehouses. Analysts understood these data types and how to access and analyze them using SQL.

ETL jobs tend to be resource-intensive, especially in terms of data storage and network channels. Operational data may begin as a single relational table or file that is then moved and altered through several jobs and intermediate files. All of this takes disk space, CPU, and I/O resources. Also, since operational systems may be running around the clock, ETL jobs tend to execute once per day during slow periods. This results in latency issues (where big data is constantly a day old or more) and consistency issues (where, for example, today’s customer data has been loaded but the order data has not).
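As a rough illustration, the sketch below shows a simplified nightly batch ETL job in PySpark. The storage paths, table names, and columns are illustrative only; a real pipeline would typically involve many such jobs and intermediate datasets, which is exactly where the storage and I/O costs accumulate:

# Simplified nightly ETL job in PySpark: extract yesterday's operational
# data, apply a transformation, and load it into an analytics table.
# Paths, table names, and columns are illustrative placeholders.
from datetime import date, timedelta
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("nightly-orders-etl").getOrCreate()

run_date = date.today() - timedelta(days=1)  # the batch is always a day behind

# Extract: read the raw operational export for the run date.
orders = spark.read.parquet(f"s3://ops-exports/orders/{run_date}/")

# Transform: clean types and derive the columns the warehouse expects.
transformed = (
    orders
    .withColumn("order_total", F.col("quantity") * F.col("unit_price"))
    .withColumn("load_date", F.lit(str(run_date)))
    .dropDuplicates(["order_id"])
)

# Load: append the day's partition to the warehouse table.
(transformed.write
    .mode("append")
    .partitionBy("load_date")
    .saveAsTable("warehouse.orders_fact"))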

Finally, even when these various processes are standardized, a delay still ensues when DBAs attempt to add a new data source. For example, when a new operational system moves to production, it may consist of hundreds of tables. Each of these may require a set of several ETL jobs to load the big data application properly. Creating and testing these jobs takes time, causing delays.

Along with ETL and SQL-based analytics queries, IT architectures need to handle secondary functions, including archive or purge of old or stale data, automated data archive, and multi-site instantiation and synchronization of critical tables to drive faster performance.
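As an example of the first of these housekeeping tasks, the following sketch archives warehouse partitions older than a retention window to low-cost storage and then purges them. The table name, archive path, and retention period are placeholders, and the approach assumes a partitioned Hive-style table:

# Archive-then-purge housekeeping job in PySpark. Table, path, and
# retention values are placeholders.
from datetime import date, timedelta
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("purge-stale-partitions").getOrCreate()

cutoff = str(date.today() - timedelta(days=365))  # keep one year online

stale = spark.table("warehouse.orders_fact").where(F.col("load_date") < cutoff)

# Archive: copy the stale rows to low-cost storage before removing them.
stale.write.mode("append").parquet("s3://warehouse-archive/orders_fact/")

# Purge: drop the now-archived partitions from the live table.
for (load_date,) in stale.select("load_date").distinct().collect():
    spark.sql(
        f"ALTER TABLE warehouse.orders_fact DROP IF EXISTS PARTITION (load_date='{load_date}')"
    )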

Today’s Solutions for Data Management and Analytics

Businesses typically perceive a need to understand data relationships or correlations that may not be obvious. Third-party vendors offer a variety of plug-and-play solutions that usually require a specific hardware platform and special software for data management and analytics. There are two main players in this space: Hadoop and NoSQL databases. These technologies are used to store and retrieve massive amounts of data using parallel I/O channels and proprietary hardware.

Hadoop

Apache Hadoop solutions use a network of computer nodes. Each node participates in data storage, and the nodes combine their efforts during query processing. Hadoop provides a specialized file management system that gives one or more central nodes the ability to control the others as a group and coordinate their efforts. This works best for business problems where analytics is done iteratively; that is, business analysts run one or more base queries, store the results, and then run or re-run additional queries against those results, as sketched below.
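That iterative pattern can be sketched in PySpark as well: run an expensive base query once, keep the result, and point follow-up queries at that result rather than at the raw data. Table and column names here are placeholders:

# The iterative analytics pattern: cache a base result, then query it again.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iterative-analytics").getOrCreate()

# Base query: an expensive aggregation over the full history.
base = spark.sql("""
    SELECT customer_id, product_category, SUM(order_total) AS spend
    FROM warehouse.orders_fact
    GROUP BY customer_id, product_category
""")
base.cache()                       # keep the intermediate result in memory
base.createOrReplaceTempView("spend_by_customer")

# Follow-up queries reuse the cached result rather than rescanning the table.
top_categories = spark.sql(
    "SELECT product_category, SUM(spend) AS total FROM spend_by_customer "
    "GROUP BY product_category ORDER BY total DESC LIMIT 5"
)
big_spenders = spark.sql(
    "SELECT customer_id, SUM(spend) AS total FROM spend_by_customer "
    "GROUP BY customer_id HAVING SUM(spend) > 10000"
)
top_categories.show()
big_spenders.show()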

NoSQL Databases

NoSQL database solutions depend upon a technique called graph analysis, a method of finding relationships among data elements. Using NoSQL requires an experienced data scientist to support the underlying database and business analysts familiar with this analytic technique.

Relational Database Engines

Relational database engines such as Db2 and Oracle, along with SQL query engines such as Presto, rely on data stored in relational tables that are queried using Structured Query Language (SQL). While originally designed to handle operational processing such as financial and transactional applications, these database engines have matured to the point that they can be used effectively for analytics as well.

Learn More: Cloud Migration Planning: Top 3 Challenges and Best Practices 

The Current State of Analytics Infrastructure

From the outside, an analytics environment looks simple: a large data store containing (usually) historical data in a proprietary format that provides fast query execution times. From the database administrator’s (DBA’s) perspective, however, there is much more. DBAs must implement several sets of processes, including backup and recovery, purge and archive of unneeded or unused data, and performance analysis and tuning. To keep the DBA’s job manageable, it helps to use an open-source infrastructure, which minimizes the effects of hidden or proprietary processes. One of the advantages of an open-source infrastructure is its flexibility in how it is rolled out across geographies, such as multiple cloud instances.

Presto on Spark: An Industry Example

At the Spark + AI Summit in 2020, Facebook Data Platform Team members Wenlei Xie and Andrii Rosa demonstrated the tradeoffs between using a MapReduce infrastructure on Hadoop and Presto. They found “significant value in coupling Presto’s state-of-art low-latency evaluation with Spark’s robust and fault tolerant execution engine.” Refer to this presentation, and others from the conference, for more information.

The combination of Presto and Spark is an interesting choice. The Facebook team noted that they avoided the problems of MapReduce paradigms while using a standard SQL-based database engine. In particular, they believed their work was leading towards “more confluence between the Spark and the Presto communities, and a major step towards enabling unified SQL experience between interactive and batch use cases.”

This is an interesting contrast with typical cloud implementations of database engines on public cloud platforms. IBM Cloud, Amazon AWS, and Microsoft Azure permit the use of the open-source Presto. Facebook also uses Presto for interactive queries against several internal data stores, including its 300PB data warehouse. Other companies, such as Airbnb and Dropbox, also use Presto.

One caveat: Presto is currently at version 0.250 and has not yet reached a 1.0 release. Also, since it is open-source, organizations may wish to avoid using it for mission-critical work. That said, Presto is an intriguing low-latency database engine that companies are increasingly adopting for ETL + AI use cases.
