Decoding Cloud Best Practices for Big Data Management


More and more organizations are utilizing cloud services for data storage and retrieval. What are the advantages of using cloud services from the perspective of IT efficiency and cost optimization?

The obvious advantage is economy of scale: your cloud vendor can negotiate discounts from hardware vendors and run infrastructure more efficiently than all but the largest IT organizations. But there is also a subtler advantage, which is that the cloud makes it easier to use the right tool for the job. If you have to operate and troubleshoot everything in-house, you’re going to be very cautious about what you certify and allow your lines of business to deploy. In the cloud, you can be more flexible. If you don’t need random access to a set of files, maybe you store them in more cost-effective object storage like S3 rather than on your SAN or its cloud equivalent, EBS. The same goes for third-party services.
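The storage-tiering point is easy to make concrete with a back-of-the-envelope calculation. The per-GB prices below are illustrative assumptions for this sketch, not current AWS list prices; check your provider’s pricing page before deciding.

```python
# Rough monthly cost comparison for keeping rarely accessed files on
# object storage vs. general-purpose block storage.
# NOTE: prices are illustrative assumptions, not current list prices.
PRICE_PER_GB_MONTH = {
    "s3_standard": 0.023,  # object storage (no random block access)
    "ebs_gp3": 0.08,       # block storage (random access, SAN-like)
}

def monthly_cost(gb: float, tier: str) -> float:
    """Return the assumed monthly storage cost in USD for `gb` gigabytes."""
    return gb * PRICE_PER_GB_MONTH[tier]

data_gb = 10_000  # 10 TB of infrequently accessed files
for tier in PRICE_PER_GB_MONTH:
    print(f"{tier}: ${monthly_cost(data_gb, tier):,.2f}/month")
```

Under these assumed prices, the object-storage tier is several times cheaper per month, which is exactly the kind of per-workload decision that is easier to make in the cloud than against a single certified in-house SAN.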

What are some typical missteps you have seen CTOs make when it comes to data management? How can these missteps be avoided by others?

Do pay attention to the fine print. In my own space of databases, a lot of people think that if they use a cloud database, they get scale and geo-distribution and performance out of the box and don’t have to think about it anymore. But it’s not magic, and the CAP theorem still applies. You still need to understand your partitioning model and what limits your database imposes there, or you’re going to get burned. Read the FAQ, read third-party postmortems, and don’t assume that because someone else is operating a service for you, your team doesn’t need to understand how it works.
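The partitioning point can be illustrated with a toy sketch. This is an invented example, not Cassandra or DataStax internals: it simply hashes a partition key to one of a few nodes, the way most distributed databases bucket rows, and shows how a low-cardinality key creates a hot spot.

```python
# Toy illustration (assumed example): rows are routed to nodes by
# hashing the partition key, so a key with few distinct values piles
# most of the data onto a single node.
import hashlib
from collections import Counter

NUM_NODES = 4

def node_for(partition_key: str) -> int:
    """Map a partition key to a node by hashing (greatly simplified)."""
    digest = hashlib.md5(partition_key.encode()).hexdigest()
    return int(digest, 16) % NUM_NODES

# 900 events from US users, 100 from NZ users.
events = [("US", f"user{i}") for i in range(900)] + \
         [("NZ", f"user{i}") for i in range(100)]

# Bad key: country alone -- nearly all traffic lands on one node.
by_country = Counter(node_for(country) for country, _ in events)
# Better key: country plus user -- load spreads across the cluster.
by_user = Counter(node_for(f"{country}:{user}") for country, user in events)

print("partition by country:", dict(by_country))
print("partition by (country, user):", dict(by_user))
```

The first distribution uses at most two nodes no matter how large the cluster is; the second spreads writes across all of them. That is the kind of limit you need to understand before trusting a managed database to “just scale.”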

Data analytics is a significant component of all data-driven decision models currently used by management. How can a CTO ensure that the insights provided by data analytics tools are – a) meaningful in the process of decision making, and b) accurate?

Charles Babbage, creator of the Difference Engine (a mechanical calculator that is regarded as a forerunner of digital computers), wrote in the mid-nineteenth century: “On two occasions I have been asked, — ‘Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?’ … I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.”

Today, we still struggle with asking the right questions. There is a strong temptation to look for answers where we have convenient data to query, instead of first asking where and how we can get the most relevant data. Start by defining the questions and gathering the data needed to answer them – resist the temptation to skip to approximating answers immediately.

What are the challenges that CTOs typically face while working with Big Data? How can they overcome these challenges?

I’m still seeing a lot of confusion around analytical big data vs. operational big data. It helps to compare with traditional technologies: analytical big data plays the role the data warehouse used to, answering questions about the business after the fact, while operational big data plays the role of the OLTP relational database, serving live requests from your applications.

The biggest challenge we see today in operational big data is around hybrid cloud. It’s possible to have the best of both worlds – the flexibility of your own infrastructure and the scale and cost benefits of public cloud – but once you start talking about data and state and not just web servers, it can get tricky. As the analogy puts it, data has “gravity,” and many enterprises underestimate the impact that going hybrid has on things like security, data movement, and data governance.

We are fortunate enough to witness the growth of AI, IoT, machine learning and RPA (Robotic Process Automation). Does this advancement in technology mean the role of human beings in the workforce is shrinking?

AI is to workers in the 21st century what automation was in the 20th, and while there are both positive and negative connotations there, I see the positive dominating. Yes, people lost their jobs when assembly lines replaced blacksmiths. But, for example, washing machines turned a chore that consumed basically an entire day each week into almost an afterthought. Washing machines probably cost some jobs, but most people couldn’t afford to hire a laundry service; they did it themselves. So, automation made their lives immensely better. And I think that’s going to be the dominant effect as intelligent assistants mature – taking some of the drudgery out of knowledge work, making lives better a little at a time.

What’s coming up that you’re excited about in two areas: in the market in general—perhaps a trend or tool; and within DataStax – any new features or upcoming innovations?

I think an up-and-coming area in the data space is graph technology. Of course, we’ve had graph databases for years, but they’ve never quite found a killer app, and I think a lot of that is due to early graph databases being limited in scale in a lot of the same ways relational databases were.

Now we’re getting to an inflection point: the technology’s ability to deliver is catching up to its promise, and at the same time enterprises are increasingly encountering the need to manage large, complex, relationship-heavy data sets at massive scale, i.e., the sweet spot for graph models. Our product, DSE Graph, can handle billions of items and their relationships spanning hundreds of machines across multiple datacenters with no single point of failure. This scale makes it significantly more useful for applications like real-time fraud detection.
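To see why fraud detection is such a natural fit for graph models, consider a hypothetical sketch: flag accounts that are linked through shared devices or payment cards. All names and data here are invented, and a system like DSE Graph operates on billions of vertices across many machines, but the traversal idea is the same.

```python
# Hypothetical illustration of graph-shaped fraud detection: accounts
# connected through shared devices/cards form a ring worth reviewing.
from collections import defaultdict, deque

edges = [
    ("acct:alice",   "device:123"),
    ("acct:bob",     "device:123"),  # bob shares alice's device
    ("acct:bob",     "card:999"),
    ("acct:mallory", "card:999"),    # mallory shares bob's card
    ("acct:carol",   "device:456"),  # carol is unconnected
]

# Build an undirected adjacency map.
graph = defaultdict(set)
for a, b in edges:
    graph[a].add(b)
    graph[b].add(a)

def linked_accounts(start: str) -> set:
    """Breadth-first traversal: every account reachable from `start`."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nbr in graph[node]:
            if nbr not in seen:
                seen.add(nbr)
                queue.append(nbr)
    return {n for n in seen if n.startswith("acct:")}

print(linked_accounts("acct:alice"))
# alice, bob, and mallory form one linked ring; carol is not flagged
```

In a relational database this multi-hop “who shares anything with whom” query requires a join per hop; in a graph model it is a single traversal, which is why scale-limited early graph databases held the category back.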

I’m excited about seeing where graph goes now that we have an implementation that can handle the largest, most demanding data sets.

About DataStax

DataStax powers the Right-Now Enterprise with the always-on, distributed cloud database, built on Apache Cassandra™ and designed for the hybrid cloud. The foundation for real-time applications at massive scale, DataStax Enterprise makes it possible for companies to exceed expectations through consumer and enterprise applications that provide responsive and meaningful engagement to each customer wherever they go. For more information, visit DataStax.com and follow us on Twitter @DataStax.