The influx of data from the edge necessitates the move away from traditional, manual data processing in favor of automated data pipelines that can deliver real-time intelligence and results. Michael St-Jean, principal marketing manager for cloud storage and data services at Red Hat, reveals ways to build an automated data pipeline to manage real-time processing of large amounts of data, right from ingestion to consumption.
As companies continue to grapple with massive amounts of data being captured at the network’s edge, data managers are encountering a new challenge: how to properly ingest and process that data to derive actionable intelligence in real-time.
Traditional data management involves regularly scanning repositories for updates and new data, then processing, transforming, storing, and distributing that information. Managers must manually provision data and move it through the pipeline, and doing so in real-time becomes nearly impossible when dealing with petabytes of raw, unstructured data destined for different applications and processes.
This approach is not scalable; it is time-consuming and resource-intensive. It also prevents organizations from processing data in a timely manner, a benefit that is becoming increasingly important for several use cases, from hospitals keeping inventory of supplies to autonomous vehicles reacting to road conditions.
Today, effective management of data services starts with implementing automated data pipelines that continually ingest, prepare, and manage data from the moment it is captured. Automated data pipelines provide the means to automatically analyze data and share actionable recommendations and information with those who need it in real-time.
Let’s take a closer look at how you can implement an automated data ingestion pipeline to manage your data from edge point to endpoint.
Start With Object Storage
Object storage is a data storage architecture that provides many advantages for organizations harnessing massive amounts of data. As images, media, sensor data, statistical results, texts, and more proliferate, the need for cost-effective, highly scalable storage with predictable performance grows.
Unlike file or block storage, object storage bundles data with a unique identifier and embedded metadata, allowing for easy search capabilities and virtually unlimited scalability. Object storage is ideal for managing very large amounts of data, particularly unstructured and semi-structured data with high ingest throughput rates.
A recent report from the Evaluator Group commissioned by Red Hat found that Red Hat Ceph Storage, which provides an object storage interface, can scale up to 10 billion objects with predictable performance.
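As a rough illustration, here is a minimal sketch of ingesting an object into an S3-compatible store such as Ceph’s RADOS Gateway using boto3. The endpoint URL, credentials, bucket name, and metadata tags are placeholders, not values from any particular deployment.

```python
import boto3

# Connect to an S3-compatible object storage endpoint (e.g., a Ceph RADOS
# Gateway). The endpoint URL and credentials below are placeholders.
s3 = boto3.client(
    "s3",
    endpoint_url="http://rgw.example.com:8080",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
    region_name="default",
)

s3.create_bucket(Bucket="xray-ingest")

# Store an image as an object; the key is its unique identifier, and the
# metadata travels with the object so downstream consumers can search on it.
with open("scan-0001.png", "rb") as f:
    s3.put_object(
        Bucket="xray-ingest",
        Key="xray-scan-0001.png",
        Body=f,
        ContentType="image/png",
        Metadata={"modality": "xray", "department": "radiology"},
    )
```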
Create Buckets, Topics, and Notification Configurations
As objects are ingested, a key identifier is created to make them easier to retrieve. These objects are then stored in containers or “buckets.” In typical data pipelines, a manual or scripted process then finds the data needed in these buckets for a particular application workflow.
Usually, this is completed periodically as a batch process. However, by incorporating object bucket notifications, you can create an event whenever a new object arrives in the bucket, or an existing object is changed or deleted. Event topics can be devised to look for specific qualifiers in the ingested data, which could be as simple as the first three letters of the object name, the object type, or a metadata tag.
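Here is a hedged sketch of what wiring this up might look like against a Ceph RADOS Gateway or another S3-compatible store that supports bucket notifications and an SNS-like topic API. The bucket name, topic name, Kafka broker address, and filter values are assumptions for illustration.

```python
import boto3

# Placeholder endpoint and credentials for an S3-compatible gateway
# (e.g., Ceph RADOS Gateway) that supports bucket notifications.
endpoint = "http://rgw.example.com:8080"
creds = dict(aws_access_key_id="ACCESS_KEY", aws_secret_access_key="SECRET_KEY")

# Ceph RGW exposes an SNS-like API for creating notification topics.
# This topic pushes matching events to a Kafka broker (address is a placeholder).
sns = boto3.client("sns", endpoint_url=endpoint, region_name="default", **creds)
topic_arn = sns.create_topic(
    Name="xray-ingest-topic",
    Attributes={"push-endpoint": "kafka://kafka.example.com:9092"},
)["TopicArn"]

# Attach a notification configuration to the bucket: fire on object creation,
# but only for keys matching the chosen qualifiers, here a prefix (the first
# letters of the object name) and a suffix (the object type).
s3 = boto3.client("s3", endpoint_url=endpoint, region_name="default", **creds)
s3.put_bucket_notification_configuration(
    Bucket="xray-ingest",
    NotificationConfiguration={
        "TopicConfigurations": [
            {
                "Id": "new-xray-images",
                "TopicArn": topic_arn,
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {
                        "FilterRules": [
                            {"Name": "prefix", "Value": "xra"},
                            {"Name": "suffix", "Value": ".png"},
                        ]
                    }
                },
            }
        ]
    },
)
```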
Each topic notification points to a specified endpoint, such as a message queue, a Kafka cluster, or an HTTP endpoint. This essentially indicates where you want the data to move next in the process. Notifications could be as basic as notifying a person or team that the data is now available, or they could push the data into a stream that partitions it for processing by one or more applications.
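On the receiving side, a stream consumer can pick up those notification events and hand them to the next stage of the pipeline. The following minimal sketch uses the kafka-python package; the topic name and broker address match the placeholders above, and the payload is assumed to follow the S3-style notification record format.

```python
import json

from kafka import KafkaConsumer  # pip install kafka-python

# Consume bucket-notification events pushed to Kafka (placeholder broker/topic).
consumer = KafkaConsumer(
    "xray-ingest-topic",
    bootstrap_servers="kafka.example.com:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    # Each event carries S3-style notification records identifying the
    # bucket and object key that triggered it.
    for record in message.value.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        print(f"New object ready for processing: {bucket}/{key}")
        # Hand off to the next stage of the pipeline here.
```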
Build on a Serverless Environment
Serverless computing is critical to the process. In a Kubernetes serverless environment, like the one provided by the open-source Knative project, application pods are created, deployed, and scaled in response to events.
Developers define the next step in the workflow, and as events are triggered, the serverless platform launches applications to process the data. These applications are automatically scaled up as more data is processed and scaled down to zero when changes cease to occur in the bucket. The platform provisions pods only as necessary, so resources are not wasted on idle capacity.
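To make this concrete, a serverless function in such a pipeline can be as simple as a small HTTP service that Knative scales on demand. The sketch below uses Flask and assumes Knative Eventing (for example, a KafkaSource) POSTs each bucket-notification event to the service; the handler and processing step are illustrative placeholders.

```python
import os

from flask import Flask, request  # pip install flask

app = Flask(__name__)


@app.route("/", methods=["POST"])
def handle_event():
    # Knative Eventing delivers each event as an HTTP POST; the body is
    # assumed here to be an S3-style bucket-notification record.
    event = request.get_json(force=True)
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        process_object(bucket, key)  # placeholder for the real work
    return "", 204


def process_object(bucket: str, key: str) -> None:
    # Fetch the object and run whatever transformation or analysis this
    # stage of the pipeline is responsible for.
    print(f"Processing {bucket}/{key}")


if __name__ == "__main__":
    # Knative injects the port the container must listen on.
    app.run(host="0.0.0.0", port=int(os.environ.get("PORT", 8080)))
```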
Consider a hypothetical clinical example where digital x-ray images are ingested into a hospital network’s data stream. Instead of waiting for medical staff to review those images, a bucket notification can kick off an event as each image arrives, partitioning and streaming that data for processing. A serverless function can then spawn an AI application that detects anomalies using machine learning models. The images can be tagged with metadata that helps doctors prioritize reviewing high-risk results.
Further, if the inference detects a high correlation with the search criteria (pneumonia detection in pulmonary x-rays, for example), a serverless function can spawn another process that anonymizes that data, thereby protecting patient privacy. The image can then be sent to a research facility where analysts are watching for trends. Those images can also be forwarded to the data scientists responsible for the AI detection algorithm so they can continuously update and improve their machine learning models.
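Sketched in code, such a chain of steps might look like the following. The model call is stubbed out, the bucket names and metadata fields are hypothetical, and the anonymization is reduced to forwarding a copy without the original user metadata; a real pipeline would need far more rigorous de-identification.

```python
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://rgw.example.com:8080",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
    region_name="default",
)


def score_image(image_bytes: bytes) -> float:
    """Placeholder for the ML inference step (e.g., pneumonia detection)."""
    return 0.0


def handle_new_xray(bucket: str, key: str) -> None:
    obj = s3.get_object(Bucket=bucket, Key=key)
    score = score_image(obj["Body"].read())

    # Tag the image with the inference result so clinicians can prioritize
    # high-risk studies; copying the object onto itself rewrites its metadata.
    s3.copy_object(
        Bucket=bucket,
        Key=key,
        CopySource={"Bucket": bucket, "Key": key},
        Metadata={"anomaly-score": f"{score:.2f}"},
        MetadataDirective="REPLACE",
    )

    # If the result is significant, forward a copy stripped of the original
    # user metadata to a bucket the research team watches.
    if score > 0.9:
        s3.copy_object(
            Bucket="research-anonymized",
            Key=key,
            CopySource={"Bucket": bucket, "Key": key},
            Metadata={"anomaly-score": f"{score:.2f}"},
            MetadataDirective="REPLACE",
        )
```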
You’ve now created an efficient and automated system for object ingestion, storage, processing, transformation, and ultimately, consumption. Since it is automated, you no longer have to manually provision storage and data management resources or attempt to predict future load needs. The necessary components — and only the necessary components — are created on-demand.
More importantly, actionable intelligence can be available to end-users in real-time. There’s no need to wait for a time-based scheduler (or “cron job”) to launch, for batch data processing to run, or for any of the manual processes that take time and eat away at the value of data services.
Users can get results and insights as the data comes in, a powerful benefit for anyone who relies on the fast delivery of information and recommendations. They’ll be able to make more accurate and informed decisions when they matter most.