Powering Data Analytics with Data as a Service Platforms: Q&A With Tomer Shiran of Dremio


“For most companies, especially in the cloud, a lot of the data is in data lake storage so data-as-a-service first and foremost means being able to achieve high query speeds directly on that system.”

Tomer Shiran, CEO and Founder, Dremio, talks about how organizations can manage data lakes with the help of Data as a Service (DaaS) in this informative Tech Talk Interview with Toolbox. Tomer discloses a few best practices companies can follow to create a cohesive data strategy.

Tomer’s expertise in product development and strategy enables him to empower his team at Dremio to create fast query speed data solutions. Toolbox caught up with Tomer to gain insights on bringing agility to enterprise analytics, processing large payloads of data and securing big data.

Tomer also answers questions on:

  • How does Data-as-a-Service work?
  • How can data scientists speed up information processing?
  • What can companies do to secure their network and data?

Key takeaways from this Tech Talk interview on Data as a Service Platforms:

  • Top 5 big data vulnerabilities organizations could be overlooking
  • Best practices to consider for creating a consolidated data strategy
  • Trends to follow in big data as a service for 2020

Here’s what Tomer shares on how data analytics can be powered with DaaS Platforms:

Tomer, to set the stage, tell us about your career path so far and what your role at Dremio entails.

I started out as a software engineer, first at IBM Research and then at Microsoft working on security products. From there, I moved into product management at Microsoft, and eventually I ended up as one of the first employees at a company called MapR. I was the VP of Product there and helped grow the company from 5 to over 300 employees and 700 enterprise customers.

When I was at MapR, we wanted to help businesses create a single platform for their internal data, the data lake, just as Google has done for search. But as hard as companies tried, there are still lots of data sources at every organization, all containing valuable data. And of course, query performance on top of data lakes is not good. And you still need IT to help you get at the data.

So that was the reason for starting Dremio, that goal of having self-service data.

I’m CEO at Dremio, which of course means I’m responsible for delivering on our vision. We have a massive opportunity in front of us as far as being the data lake engine, and we’re working with companies across all industries, ranging from Microsoft to UBS to Royal Caribbean Cruise Lines. It’s fun. Companies realize more and more that they need to be data-driven, and we make that possible for them.

How does Data-as-a-Service work?

Data-as-a-Service means having the ability to ask any question on the data and receive a fast response.

A platform that delivers Data-as-a-Service is about letting data stay where it is, rather than moving or copying it. For most companies, especially in the cloud, a lot of the data is in data lake storage so data-as-a-service first and foremost means being able to achieve high query speeds directly on that system.

In addition to speed, having a self-service semantic layer, composed of virtual datasets, where IT can apply business policies and data consumers can curate new datasets and collaborate, is critical. Data as a service cannot be achieved without such an abstraction layer, because dealing in physical datasets is too difficult and expensive. Of course, it needs to bake in a lot of other things as well: simplifying access, as I said, but also accelerating analytical processing, securing and masking data, curating datasets, providing a unified catalog, and so on.

To make Data-as-a-Service really work, you need a tool that’s capable of connecting to all these different sources and then providing a single, very fast interface.

And I’ll add one other thing: realistically, there are some cases in which not all data can be loaded into the data lake for one reason or another. In that case, you still need to make it easy to perform joins across the lake and other sources.
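To make that concrete, here is a minimal sketch of the pattern Tomer describes, not Dremio's specific API: it assumes a query engine reachable over ODBC that supports ANSI-style SQL views over data lake files and can join them to an external source. The DSN, schemas, tables, and columns are all hypothetical.

```python
# Minimal sketch, assuming a DaaS/query engine reachable over ODBC
# (DSN "daas_engine") that exposes SQL views over data lake files and
# can join them to an external database. All names are hypothetical.
import pyodbc

conn = pyodbc.connect("DSN=daas_engine", autocommit=True)
cur = conn.cursor()

# IT defines a virtual dataset: a governed view over raw data lake files,
# with the email column masked before analysts ever see it.
cur.execute("""
    CREATE OR REPLACE VIEW curated.orders AS
    SELECT order_id,
           customer_id,
           '***' AS email,          -- masked in the semantic layer
           order_total,
           order_date
    FROM lake."orders_parquet"
""")

# An analyst then joins the virtual dataset to a table that still lives
# in an operational database, without copying either side.
cur.execute("""
    SELECT c.region, SUM(o.order_total) AS revenue
    FROM curated.orders o
    JOIN postgres_src.public.customers c ON o.customer_id = c.customer_id
    GROUP BY c.region
""")
for region, revenue in cur.fetchall():
    print(region, revenue)
```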

Once you have a Data-as-a-Service solution deployed, analysts and BI users are given access to the platform and can use it as the place to go for whatever data they’re looking for – just as they might connect to a database.

Companies are facing difficulties in data aggregation with the onslaught of big data. What are the 3 best practices for companies to create a cohesive data strategy?

Data strategy is about making data strategically valuable, which means accessible for the business. To us, that implies:

  • Enable self-service. In today’s organizations, data consumers spend huge amounts of time just waiting for access to the data they need. The costs are huge, including time that data engineers and IT could otherwise spend on other projects, time that the business spends waiting, and the cost of out-of-date or incomplete data. Instead, organizations should look for tools that give analysts and consumers a single source, complete with a semantic layer and a catalog that guide them toward the information they’re looking for.
  • Look for open solutions. There are lots of solutions out there that store data in proprietary formats, or even require you to centralize your data on someone else’s cloud. That reduces architectural flexibility and keeps organizations from mixing and matching solutions to build the best stack, not to mention the incredible expense involved when someone else controls your data. Instead, we recommend open source solutions, open data formats, and robust connector ecosystems.
  • Focus on performance. No matter how elegant your data solutions are, if they’re slow, people will ask fewer questions because of the time it takes to answer each. High performance is critical.

Learn More: Hadoop and Spark Team Up to Tackle Big Data

What are your rules of thumb for processing large payloads of data into a central data repository?

We think the data lake model works well if it’s coupled with something that can deliver high-performance queries on top of it. So, I guess our main rule of thumb there is to put everything in your data lake first, then answer the question about where it goes from there. But hopefully, you can find a system that sits on top of your data lake and gives you everything else you need, along with high performance.

Another thought is to make the distinction between your heavy-lifting, long-haul ETL and your last-mile ETL, the work analysts do to make the data useful for themselves, mostly curating datasets. That last piece dominates the cost and time involved in ETL, so you should try to minimize it and, where it makes sense, put it into the hands of people who are close to the end users.

Can you tell us how data scientists can speed up information processing without Extract-Transform-Load (ETL) processes and information consolidation?

The main project, which we open sourced, is Apache Arrow. The goal of Arrow is to provide a common way to represent data and a way to process data very efficiently. It’s a framework for columnar, in-memory processing, and it’s been incorporated into many popular projects, including Dremio, Python, Spark and NVIDIA’s RAPIDS initiative.

Arrow combines the benefits of columnar data structures with in-memory computing. It provides the performance benefits of these modern techniques while also providing the flexibility of complex data and dynamic schemas. And it does all of this in an open source and standardized way, right on all the underlying data sources.
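For readers who want a feel for what that columnar, in-memory model looks like, here is a small illustrative example using the pyarrow library; the sample data is made up.

```python
# Illustrative only: Arrow's columnar, in-memory model via pyarrow.
import pyarrow as pa
import pyarrow.compute as pc

# Columns are contiguous Arrow arrays rather than rows of Python objects,
# which is what enables vectorized, cache-friendly processing.
table = pa.table({
    "order_id": pa.array([1, 2, 3, 4], type=pa.int64()),
    "region":   pa.array(["emea", "amer", "amer", "apac"]),
    "total":    pa.array([120.5, 99.0, 310.0, 42.0]),
})

# Vectorized kernels operate on whole columns at once.
amer_only = table.filter(pc.equal(table["region"], "amer"))
print(amer_only.num_rows)                  # 2
print(pc.sum(amer_only["total"]).as_py())  # 409.0

# The same buffers can be handed to other Arrow-aware systems
# (e.g. pandas, if installed) without re-serializing the data.
df = table.to_pandas()
```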

The other critical piece for us is Data Reflections. Basically, we can create data structures like indexes or cubes that run behind the scenes and transparently accelerate queries. So, you can get orders of magnitude better performance than just having a SQL engine. We incorporate those reflections into query plans as needed, but we also push down processing to underlying data sources as needed as well.
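Reflections are managed inside Dremio’s engine, so the snippet below is not its API; it is only a conceptual sketch, in pyarrow, of why a pre-aggregated structure can answer a matching query far faster than scanning the raw data.

```python
# Conceptual sketch only, not Dremio's reflection API: maintain a
# pre-aggregated structure behind the scenes and answer matching
# queries from it instead of scanning the raw table.
import pyarrow as pa

raw = pa.table({
    "region": ["emea", "amer", "amer", "apac", "emea"],
    "total":  [120.5, 99.0, 310.0, 42.0, 18.0],
})

# "Reflection": an aggregation materialized once, ahead of time.
reflection = raw.group_by("region").aggregate([("total", "sum")])

# A query asking for revenue by region can now be satisfied from the
# small materialization rather than the full raw table.
print(reflection.to_pydict())
```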

Learn More: Is Cloud-based Data Warehousing as a Service a Good Idea?

What solutions framework do you recommend for bringing agility to enterprise analytics when processing big data?

Generally, we’re committed to self-service data, and we think that can be accomplished effectively with a data lake at the center, especially in the cloud.

But in this new world, data no longer can realistically live in one place, one relational database, or even a single data lake. And at the same time, you have this growing demand for self-service access from everyone, including data scientists, business analysts, and so on. Our goal at Dremio is for these people to be self-sufficient and do what they want with the data, no matter where that data is, how big it is, or what structure it’s in.

Which are the top 5 big data vulnerabilities organizations could be overlooking today? What can they do to secure their network and data?

One of the really fundamental ones is the fact that it’s hard for analysts to get access to data today – which means that a lot of people move data into spreadsheets, or local copies, or Dropbox, or somewhere else that is invisible to IT and therefore can’t be secured. The best thing to do is to keep data in a protected environment, but to make that environment very easy to access and to use.

And the same thing is true, by the way, for ETL and other copies that are made (for example, extracts and cubes). If the data is spread across S3, data warehouses, cubes, and extracts, then you don’t know where all the copies are, and it’s hard to make sure that only the people who are allowed access actually have it.

Of course, we also think that restricting access to data is critical: you need to be able to apply masking and fine-grained access control across all access methods, record the actions of users accessing systems, and encrypt data at rest and in transit. It’s easier to achieve all of these objectives with a single platform that provides authentication, role-based access control, encryption, and so on.

Tell us about the upcoming projects in big data at Dremio that you are excited about.

We continue to be very excited about the development of the open source Apache Arrow project, its new execution engine Gandiva, and Apache Arrow Flight, which is an alternative to ODBC / JDBC for exchanging data between systems in a standard, and much faster, way.

Gandiva is at the heart of Dremio’s execution engine, providing efficient, high-performance processing of Apache Arrow data, and users are seeing up to a 70x performance improvement from it. And once Arrow Flight is generally available, applications that implement Arrow will be able to consume Arrow buffers directly, which delivers 100x+ efficiency improvements compared to ODBC/JDBC interfaces.
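Here is a minimal sketch of what an Arrow Flight consumer looks like from Python, using pyarrow.flight. The endpoint, the ticket contents, and the absence of authentication are all simplifications; real deployments (Dremio’s included) add their own auth and ticketing scheme.

```python
# Minimal Arrow Flight client sketch; endpoint and ticket are placeholders.
import pyarrow.flight as flight

client = flight.FlightClient("grpc://localhost:32010")

# A ticket identifies the stream to fetch; here it is just a placeholder.
ticket = flight.Ticket(b"SELECT * FROM curated.orders")

# do_get streams Arrow record batches straight into client memory,
# with no row-by-row serialization as in ODBC/JDBC.
reader = client.do_get(ticket)
table = reader.read_all()
print(table.num_rows, table.schema)
```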

All of this is open source, of course, and will be helpful for all the different platforms and frameworks that already use Arrow: pandas, Spark, and Parquet, just to name a few.

We also have some cool stuff coming out that will allow Dremio to connect more easily and flexibly to many more sources, and that will deliver more flexibility for organizations that want to deploy us in the cloud. We are also providing additional acceleration for S3 and ADLS specifically, through a system that transparently predicts the access pattern to columnar files and maximizes utilization of the network. Stay tuned on that.

Learn More: Your Ideal Roadmap to Big Data Implementation

Which trends are you tracking in this space as we approach 2020?

I’ve already spoken about Apache Arrow and Arrow Flight, and I think you’re going to see more and more around this technology. Adoption of Arrow has increased dramatically in the past six months, with over 4 million downloads a month in the Python community alone.

Part of the drive to adopt Arrow and Flight comes from speed and efficiency, but also from the fact that systems implementing Arrow can exchange data essentially for free, with no serializing and deserializing.

Self-service is getting more and more important. Companies want an “on-demand” experience for data, provisioned for the specific needs of an individual user, instantly, with great performance, ease of use, compatibility with their favorite tools, and without waiting months for IT.

And the big one is cloud data lakes. We think that the cloud data lake will emerge as a common platform underlying the cloud data warehouse and cloud data science environments. As companies move their analytics workloads to the cloud, the cloud data lake is where data will land first, be transformed, enriched and blended, and be served.

Companies are building cloud data lakes with S3, ADLS, or GCS, then adding things like Glue or Spark, and of course we’ll see tighter integration with streaming platforms, data catalogs, and data prep tools. Even in a basic form, the cloud data lake will become a foundational system for companies moving to the cloud.
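As a simple illustration of the "data lands in object storage and is queried in place" pattern, here is a sketch that reads Parquet files directly from S3 with pyarrow, using column projection and filter pushdown. The bucket, path, columns, and region are hypothetical, and credentials are assumed to come from the environment.

```python
# Hypothetical bucket/path; a sketch of querying Parquet files in cloud
# object storage (here S3) in place, rather than loading them into a
# warehouse first.
import pyarrow.dataset as ds
from pyarrow import fs

s3 = fs.S3FileSystem(region="us-east-1")  # credentials from the environment
lake = ds.dataset("my-company-lake/events/", filesystem=s3, format="parquet")

# Column projection and predicate pushdown keep the scan small.
recent = lake.to_table(
    columns=["event_type", "user_id", "ts"],
    filter=ds.field("event_type") == "purchase",
)
print(recent.num_rows)
```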

Neha: Thank you, Tomer, for sharing your invaluable insights on how companies can use DaaS platforms to power high-speed queries. We hope to talk to you again soon.

About Tomer Shiran:

Tomer Shiran is the CEO and co-founder of Dremio. Prior to Dremio, he was VP Product and employee number five at MapR, where he was responsible for product strategy, roadmap and new feature development. As a member of the executive team, Tomer helped grow the company from five employees to over 300 employees and 700 enterprise customers. Prior to MapR, Tomer held numerous product management and engineering positions at Microsoft and IBM Research. He holds an MS in electrical and computer engineering from Carnegie Mellon University and a BS in computer science from Technion, Israel Institute of Technology.

About Dremio:

Dremio’s Data Lake Engine delivers lightning fast query speed and a self-service semantic layer operating directly against your data lake storage. No moving data to proprietary data warehouses or creating cubes, aggregation tables and BI extracts. Just flexibility and control for Data Architects, and self-service for Data Consumers.

About Tech Talk:

Tech Talk is a Toolbox Interview Series with notable CTOs from around the world. Join us to share your insights and research on where technology and data are heading in the future. This interview series focuses on integrated solutions, research and best practices in the day-to-day work of the tech world.

What are your tips for the future of DaaS platforms? Share your views and opinions with us on Twitter, Facebook, and LinkedIn. We’re always listening.