Big Push for Big Data Processing Performance and Speed with Google’s Latest Dataproc 2.0

essidsolutions

Google’s launch of Spark 3 and Hadoop 3 on Dataproc image version 2.0 adds to an increasingly sophisticated open source ecosystem and lets enterprise customers focus on data workloads rather than infrastructure.

For data scientists, exploding data volumes are business as usual. As data usage and processing have boomed, data professionals have had to switch from traditional servers to more flexible, open source, distributed cluster environments such as Apache Spark and Hadoop, which offer Python, Java, Scala, and R interfaces and scale to data of any size. Recognizing this growing need for a capable cloud environment, Google has introduced Spark 3 and Hadoop 3 on its latest Dataproc image version 2.0.

What is Dataproc?

Part of the Google Cloud portfolio, Dataproc is a fully managed service for running data processing workloads securely in the cloud. Data engineers and scientists can use it to run Apache Spark, Hadoop, Hive, and other open source software (OSS) clusters at scale without worrying about the infrastructure. Last year, Google combined cloud and open source by offering Cloud Dataproc on Google Kubernetes Engine (GKE), which lets data professionals unify resource management and build resilient infrastructure across environments at a lower price.

Classic use cases for Dataproc include processing data from Internet of Things (IoT) devices, analyzing business data for sales prospects, and identifying security challenges. Prominent Google Cloud customers that have moved their on-premises Apache Hadoop deployments to Google Cloud include Twitter, Vodafone, and Pandora. Explaining the cost benefits and flexibility of Cloud Dataproc, James Malone, Product Manager at Google Cloud, says, “Customers who are migrating to the cloud from on-premises data centers often share a common complaint: their uncertainty around the costs invested in and benefits derived from their existing investments in Spark and Hadoop. Cloud economics can mitigate some of these concerns—Cloud Dataproc is specifically designed to stabilize pricing, even when you use your cluster ephemerally.”

Google Cloud Dataproc uses image versions to bundle the operating system, big data components, and Google Cloud Platform (GCP) connectors into a single package that is then deployed on a cluster. The images are updated regularly with new features and enhancements. The latest, Dataproc image version 2.0 (currently in preview), is a step-function improvement over previous image versions and runs the latest releases of Apache Spark and Hadoop.

Ilias Papachristos, a data analyst and Lead Volunteer at the Google Developer Group, shared this command for creating a Dataproc image version 2.0 cluster:

To get started with Spark 3 and Hadoop 3, simply run the following command to create a Dataproc image version 2.0 cluster:

gcloud dataproc clusters create ${CLUSTER_NAME} \
    --region=${REGION} \
    --image-version=preview

Could it be simpler than that?

— Elias Papachristos (@elias_ronin) June 17, 2020
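
Once a cluster like the one in the tweet exists, one way to confirm which image it is running is to read the version back with `gcloud dataproc clusters describe`. The sketch below is illustrative only: the cluster name and region are hypothetical placeholders, and the `config.softwareConfig.imageVersion` field path is an assumption about the describe output. By default the script only prints the command; set `DRY_RUN=0` to execute it.

```shell
#!/bin/sh
# Sketch: read back the image version of an existing Dataproc cluster.
# CLUSTER_NAME and REGION defaults are hypothetical placeholders.
set -eu

CLUSTER_NAME="${CLUSTER_NAME:-spark3-preview}"
REGION="${REGION:-us-central1}"

# Print the command when DRY_RUN=1 (the default); execute it when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

show_image_version() {
  # Assumption: the image version lives at config.softwareConfig.imageVersion.
  run gcloud dataproc clusters describe "${CLUSTER_NAME}" \
    --region="${REGION}" \
    --format="value(config.softwareConfig.imageVersion)"
}

show_image_version
```

Running it with `DRY_RUN=0` against a real cluster should print a single version string, which can then be compared with the image version requested at creation time.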


Exploring Spark and Hadoop

A single computer cannot practically process petabytes of data, so data processing increasingly relies on clusters of machines. The tricky question is how the machines in a cluster work together on an analytics job. Meet Spark and Hadoop.

Like Hadoop, Spark is an Apache Software Foundation project. It is an open-source, distributed, parallel data processing framework that runs big data and machine learning applications on scalable clusters of servers. It offers libraries for Java, Scala, and Python and can process data from repositories such as the Hadoop Distributed File System (HDFS), NoSQL databases, and Apache Hive.

Spark has become more popular than Hadoop, mainly for its speed, performance, and quick feedback loop. Bernard Marr, business influencer and bestselling author, explained Spark's growing popularity in a blog post that highlights its machine learning compatibility: “Spark has proven very popular and is used by many large companies for huge, multi-petabyte data storage and analysis. This has partly been because of its speed. Additionally, Spark has proven itself to be highly suited to machine learning applications.”

Spark 3, the latest iteration of Apache Spark, is currently in preview mode, and the highlight of the new release is performance optimization. Spark 3 aims to run end-to-end machine learning pipelines (data ingest, model training, and visualization) faster and at lower infrastructure cost. With Spark 3, data engineers can use adaptive query execution (re-optimizing a query plan at runtime based on statistics gathered during execution), dynamic partition pruning (skipping partitions that a query's filters make irrelevant instead of reading them), and GPU acceleration. Spark 3 also deprecates support for Python 2.7; Resilient Distributed Datasets (RDDs) and GraphX, by contrast, remain part of the release.
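
As a rough illustration of how the adaptive query and pruning features mentioned above might be switched on for a whole cluster, the sketch below passes two Spark SQL settings through the `--properties` flag of `gcloud dataproc clusters create`, using the `spark:` prefix for Spark defaults. The cluster name and region are hypothetical placeholders, and whether either setting is already on by default depends on the Spark version in the image. The script only prints the command unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Sketch: create a cluster with Spark 3 optimizer settings made explicit.
# CLUSTER_NAME and REGION defaults are hypothetical placeholders.
set -eu

CLUSTER_NAME="${CLUSTER_NAME:-spark3-tuned}"
REGION="${REGION:-us-central1}"

# Adaptive query execution and dynamic partition pruning, as cluster-wide defaults.
SPARK_PROPS="spark:spark.sql.adaptive.enabled=true"
SPARK_PROPS="${SPARK_PROPS},spark:spark.sql.optimizer.dynamicPartitionPruning.enabled=true"

# Print the command when DRY_RUN=1 (the default); execute it when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

create_tuned_cluster() {
  run gcloud dataproc clusters create "${CLUSTER_NAME}" \
    --region="${REGION}" \
    --image-version=preview \
    --properties="${SPARK_PROPS}"
}

create_tuned_cluster
```

Setting the properties at cluster creation keeps individual jobs free of tuning flags; the same keys could instead be set per job if only some workloads should use them.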

Hadoop 3, for its part, adds features such as native GPU support in the Yet Another Resource Negotiator (YARN) scheduler and YARN containerization. Christopher Crosbie, Product Manager, and Igor Dvorzhak, Software Engineer at Google Cloud, explain, “In cloud-based deployments of Hadoop, there tends to be less reliance on Hadoop Distributed File System (HDFS) and YARN. HDFS storage will be substituted for Cloud Storage in most situations. YARN is still used for scheduling resources within a cluster, but in the cloud, Hadoop customers start to think about job and resource management at the cluster or VM level. Dataproc offers job-scoped clusters that are right-sized for the task at hand instead of being limited to just configuring a single cluster’s YARN queues with complex workload management policies.”
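
The job-scoped pattern described in the quote can be sketched as a create, submit, delete sequence, with input and output kept in Cloud Storage (`gs://` paths) rather than HDFS so that nothing is lost when the cluster goes away. This is a minimal sketch, not Google's recommended workflow: the cluster name, region, bucket, and script paths are all hypothetical placeholders. The script only prints the commands unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Sketch: a job-scoped (ephemeral) Dataproc cluster that exists for one job.
# Every name and gs:// path below is a hypothetical placeholder.
set -eu

CLUSTER_NAME="${CLUSTER_NAME:-ephemeral-job}"
REGION="${REGION:-us-central1}"
JOB_URI="${JOB_URI:-gs://example-bucket/jobs/wordcount.py}"
INPUT_URI="${INPUT_URI:-gs://example-bucket/input/}"
OUTPUT_URI="${OUTPUT_URI:-gs://example-bucket/output/}"

# Print the command when DRY_RUN=1 (the default); execute it when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

job_scoped_run() {
  run gcloud dataproc clusters create "${CLUSTER_NAME}" \
    --region="${REGION}" \
    --image-version=preview

  # Arguments after the bare "--" are passed to the PySpark script itself.
  status=0
  run gcloud dataproc jobs submit pyspark "${JOB_URI}" \
    --cluster="${CLUSTER_NAME}" \
    --region="${REGION}" \
    -- "${INPUT_URI}" "${OUTPUT_URI}" || status=$?

  # Delete the cluster even if the job failed, so no idle VMs are left running.
  run gcloud dataproc clusters delete "${CLUSTER_NAME}" \
    --region="${REGION}" \
    --quiet

  return "${status}"
}

job_scoped_run
```

Because the data lives in Cloud Storage, the cluster can be sized for this one job and discarded afterwards, which is the trade-off the quote contrasts with tuning YARN queues on a single long-lived cluster.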

Furthermore, Dataproc image version 2.0 adjusts existing configuration settings to optimize the bundled OSS components and upgrades software and shared libraries to avoid runtime incompatibilities.

We think it’s a welcome addition to an increasingly sophisticated open-source environment!