Does Your Unstructured Data Spark Joy?


Like ‘Holmes’ and ‘Watson’ or ‘Torvill’ and ‘Dean,’ the words ‘spring’ and ‘cleaning’ seem to go naturally together. Spring cleaning used to refer to thoroughly cleaning a house in the springtime. Nowadays, it’s used as a metaphor for any kind of cleaning and tidying that involves hard work, and that can include your unstructured data.

Spring cleaning has become fashionable through the work of Marie Kondo. She’s written a book, The Life-Changing Magic of Tidying Up (2011), and has a Netflix TV series, Tidying Up with Marie Kondo, where she visits families to help them organize and tidy their homes. Marie Kondo developed the KonMari method. For those unfamiliar with it, the method consists of gathering together all of your belongings, one category at a time, keeping only those things that “spark joy,” and giving every kept item a place where it will live from then on.

So, cleaning gives people a sense of satisfaction, which could put them in a good mood. Perhaps the same applies to unstructured data.

McMains and Kastner, in 2011, found that more clutter significantly limits the brain’s processing capacity. Decreasing clutter can decrease distractions and increase a person’s overall productivity. They used an MRI scanner to see what was happening in people’s brains. So, tidying up helps people to focus better. Similarly, America’s Anxiety Disorder Center’s founder carried out a study, which concluded that if you get rid of clutter, you free up your brain for more essential decision-making. That could be a metaphor for what we should do with unstructured data.

Marie Kondo-ing Unstructured Data

As any database administrator (DBA) knows, the world is divided up into two types of data. There’s data that is clearly defined and searchable, i.e., structured data. And there’s the rest. It’s unstructured because it has no predefined data model and no predefined organization of any kind.

According to IDC, unstructured data is estimated to make up around 80% of all the data that an organization stores. Structured data usually fits in fields in databases or is tagged in documents, making it easy to analyze. Unstructured data has, traditionally, been difficult for software to extract useful information from. Structured data can be thought of as quantitative, whereas unstructured data is qualitative. And unstructured data is often stored in what’s called a data lake (or, sometimes, a data swamp!), whereas structured data can often be found in a data warehouse.

Another problem with unstructured data is the rate at which it grows. Seagate has predicted that connected IoT devices will create over 90 ZB of data by 2025, most of that being unstructured. Add to that, 30% of the world’s data will need real-time processing.

Examples of unstructured data include text files, emails, graphics, photos, videos, presentations, audio files, data from sensors, blog posts, social media posts, call center recordings, log files, etc. And unstructured data is usually stored in its native format – whatever that might be. Not only is unstructured data challenging to search, it’s difficult to sort, organize, and manage. Another issue with unstructured data is that there may be duplicate (or near duplicate) versions stored in addition to the original file.
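To make the duplicate problem concrete, here is a minimal sketch of one common way to spot exact and near-duplicate text files. It is an illustration only, not a method the sources above describe: the function names, the three-word shingle size, and the 0.5 similarity threshold are all assumptions chosen for the example.

```python
import hashlib
import re


def fingerprint(text: str) -> str:
    """SHA-256 of the raw text; equal fingerprints mean identical content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def shingles(text: str, size: int = 3) -> set:
    """Set of overlapping word n-grams ("shingles") from the lower-cased text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: shared shingles divided by all distinct shingles."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def classify_pair(doc_a: str, doc_b: str, threshold: float = 0.5) -> str:
    """Label a pair of documents; the threshold is an assumed, tunable value."""
    if fingerprint(doc_a) == fingerprint(doc_b):
        return "duplicate"
    if jaccard(shingles(doc_a), shingles(doc_b)) >= threshold:
        return "near-duplicate"
    return "distinct"


if __name__ == "__main__":
    original = "The quarterly sales report shows strong growth in the northern region"
    edited = "The quarterly sales report shows strong growth in the southern region"
    print(classify_pair(original, edited))
```

Hashing catches byte-identical copies cheaply, while the shingle comparison tolerates small edits. Comparing every pair does not scale to a whole data lake, so a real tool would likely add an indexing step on top of this idea.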

Also, unstructured data is sometimes referred to as dark data because it’s digital data that all too often goes unused. In KonMari terms, it’s like having a room in your house that you never go in – let alone tidy up!


Real-World Machine Learning Models Are Driven By Unstructured Data

Just to complicate things, there is such a thing as semi-structured data, which is unstructured data that carries some organizing information, such as files grouped into folders by topic or tweets organized by hashtag. Sounds like people are taking the first steps in the KonMari method!

The KonMari discussion above shows why people feel spring cleaning is good for them and good for their home, but what’s the point of tidying unstructured data? The real point is that unstructured data contains lots of information that is of value to the organization. There may be notes about customers that actually hide purchasing trends. There may be information about which TV shows people like to watch all the way through and which they fast forward. There may be health information about customers. But none of that (and much more) is being harvested because, for many years, it was too hard: there was no way to access the information easily. Once that can be done, organizations can use it to make business decisions. And without utilizing that data, they are missing out on insights that may be available to their competitors.

So, unstructured data provides a wealth of marketing intelligence. It can be used to identify patterns in the way customers behave. Organizations can use this information when planning future products that they hope to bring to market, or when targeting existing products more precisely at consumers.

In the past, enterprise content management (ECM) systems have been adopted by most organizations looking to derive value from their data, both structured and unstructured. The problem has always been that these ECM systems have been difficult to implement, haven’t integrated well with other applications, and cannot be described as cheap.


Unstructured Data Key to Meeting Business Objectives

So, what is the best way to unlock the value of unstructured data to meet business objectives? Clearly, manually sorting unstructured data will not work. It’s time-consuming, error-prone, and doesn’t scale. The solution – the equivalent of the KonMari fold – is machine-learning (ML) tools that use natural language processing (NLP). Machine learning is a branch of artificial intelligence (AI) in which software learns automatically from experience rather than following explicitly programmed rules. With NLP, it’s possible to understand, process, and analyze human language and extract value from the unstructured data.

We are all, probably, familiar with Alexa, Cortana, and Google’s voice search. They use NLP to make sense of what people are saying and to then respond appropriately. NLP is also used in many other situations because it can identify concepts within quite complex contexts. Where the language used is ambiguous, it can decipher it, making it possible to access the key facts and relationships.

In its July 2018 Magic Quadrant for Insight Engines, Gartner predicted that by 2022, “information will proactively find more employees more often, thereby providing the insight needed to progress decisions and actions and reducing reactive searching by 20%.”

NLP is being enhanced all the time, as are the ML algorithms working with it, and companies like Amazon, Google, and Microsoft are working on this. The result is that it’s now possible to automate routine tasks, scale operations, sort data, and analyze data in real time.

The process of converting the unstructured text in documents, emails, social media posts, etc., into normalized and structured data for ML algorithms is called text mining or text analytics. NLP reads through the text to find new information or answer research questions. This might be by:

    • Topic modeling, which identifies the topics in the text. It uses algorithms such as Correlated Topic Model (CTM), Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), and Probabilistic Latent Semantic Analysis (PLSA). Keywords can also be extracted, so that reviews, for example, can be sorted by whether their main subject is price, features, or ease of use.
    • Sentiment analysis, which can be used on customer surveys or feedback to see how customers feel about a company. The text is analyzed for the polarity of the opinion expressed: positive, neutral, or negative.
    • Aspect mining, which often involves part-of-speech tagging and can be combined with sentiment analysis to gain more information about customer opinions.
    • Named entity recognition (NER), which picks out things like dates, people, organizations, and locations.
    • Summarization, which condenses large amounts of text.
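As a rough illustration of how free text can become a structured record, the sketch below combines three of the ideas in the list: a polarity score, aspect tagging, and a very narrow form of entity extraction (ISO-style dates only). It is a toy under stated assumptions: the word lists, the aspect names, and the `mine` function are invented for this example, and a keyword lexicon plus a regular expression stand in for the trained NLP models a real text-mining tool would use.

```python
import re

# Assumed, hand-written lexicons; real tools learn or license far larger ones.
POSITIVE = {"great", "love", "excellent", "easy", "fast"}
NEGATIVE = {"poor", "hate", "slow", "expensive", "broken"}
ASPECTS = {
    "price": {"price", "cost", "expensive", "cheap"},
    "features": {"feature", "features", "battery", "screen"},
    "ease of use": {"easy", "setup", "intuitive", "confusing"},
}
# Matches dates written as YYYY-MM-DD only; other formats are ignored.
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def mine(text: str) -> dict:
    """Turn one piece of free text into a small structured record."""
    tokens = re.findall(r"[a-z']+", text.lower())
    # Polarity score: positive word count minus negative word count.
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        sentiment = "positive"
    elif score < 0:
        sentiment = "negative"
    else:
        sentiment = "neutral"
    token_set = set(tokens)
    aspects = sorted(name for name, words in ASPECTS.items() if words & token_set)
    return {
        "sentiment": sentiment,
        "score": score,
        "aspects": aspects,
        "dates": DATE_PATTERN.findall(text),
    }


if __name__ == "__main__":
    review = ("Bought on 2021-03-04. Setup was easy and the screen is great, "
              "but the price is expensive.")
    print(mine(review))
```

The returned dictionary has fixed fields, so records like this can be loaded into a database table or a data warehouse and queried alongside existing structured data.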

Once the information is extracted, it can be converted into a structured format that is then available for any further analysis required. NLP can also be enhanced by the use of ontologies, vocabularies, and custom dictionaries. They are important to the success of text mining tools. A list of key concepts can be prepared to contain both names and synonyms to look for. These are often presented in a hierarchy.

In normal speech or written text, the same thing may be said in many different ways. Ontologies are used so that the software understands the real meaning. For example, different words may refer to the same thing, a word might be abbreviated, or there may be different spellings for the same word (e.g., U.S./U.K. spellings).
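The sketch below shows one simple way a custom dictionary with a hierarchy might be applied: each concept lists its parent and the surface forms (synonyms, abbreviations, alternative spellings) that should map to it. The concepts, parents, and the `find_concepts` helper are hypothetical, made up purely to illustrate the idea; production ontologies are much larger and usually maintained in dedicated formats.

```python
import re

# Hypothetical mini-ontology: concept -> parent concept and known surface forms.
ONTOLOGY = {
    "color": {"parent": "appearance", "terms": ["color", "colour"]},
    "united_states": {"parent": "country", "terms": ["united states", "u.s.", "usa"]},
    "organization": {"parent": "entity", "terms": ["organization", "organisation", "org"]},
}

# Reverse lookup: surface form -> concept identifier.
LOOKUP = {
    term: concept
    for concept, entry in ONTOLOGY.items()
    for term in entry["terms"]
}

# Longest terms first so multi-word forms win; the lookarounds stop a term
# from matching inside a longer word (e.g. "org" inside "organ").
PATTERN = re.compile(
    r"(?<!\w)(?:"
    + "|".join(re.escape(term) for term in sorted(LOOKUP, key=len, reverse=True))
    + r")(?!\w)"
)


def find_concepts(text: str) -> list:
    """Return (concept, parent) pairs in the order they appear in the text."""
    found = []
    for match in PATTERN.finditer(text.lower()):
        concept = LOOKUP[match.group(0)]
        found.append((concept, ONTOLOGY[concept]["parent"]))
    return found


if __name__ == "__main__":
    print(find_concepts("The U.S. organisation changed the colour"))
    print(find_concepts("USA org color"))
```

Both example sentences resolve to the same three concepts even though they share no spellings, which is the normalization effect described above.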

Machine learning for NLP can be supervised or unsupervised. With supervised learning, people tag example text, and the software is trained on those examples to know what to look for and how to interpret it. Examples of supervised machine learning algorithms are:

    • Bayesian Networks
    • Conditional Random Fields
    • Maximum Entropy
    • Neural Networks/Deep Learning
    • Support Vector Machines
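To show what training on tagged example text can look like, here is a minimal naive Bayes text classifier written from scratch. Naive Bayes is one of the simplest Bayesian network models, so it is used here as a stand-in for the family of algorithms listed above, not as a recommendation. The class name, the four training sentences, and the labels are all invented for the example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text: str) -> list:
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.label_counts = Counter()            # tagged documents per label
        self.word_counts = defaultdict(Counter)  # word frequencies per label
        self.vocabulary = set()

    def fit(self, examples):
        """Learn from (text, label) pairs, i.e. the tagged example text."""
        for text, label in examples:
            self.label_counts[label] += 1
            for word in tokenize(text):
                self.word_counts[label][word] += 1
                self.vocabulary.add(word)
        return self

    def predict(self, text: str) -> str:
        """Return the label with the highest log-probability for the text."""
        total_docs = sum(self.label_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.label_counts.items():
            score = math.log(doc_count / total_docs)  # prior for this label
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                if word not in self.vocabulary:
                    continue  # words never seen in training carry no evidence
                count = self.word_counts[label][word] + 1
                score += math.log(count / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


if __name__ == "__main__":
    tagged = [
        ("The delivery was fast and the product is great", "positive"),
        ("Excellent support, I love this service", "positive"),
        ("The app is slow and keeps crashing", "negative"),
        ("Terrible experience, the device arrived broken", "negative"),
    ]
    model = NaiveBayes().fit(tagged)
    print(model.predict("great service and fast delivery"))
```

Four sentences are far too few for a usable model; the point is only that the labels supplied by people are what the algorithm learns from.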

Unsupervised machine learning doesn’t use pre-tagging or annotating to create the model; instead, it may use techniques such as clustering, Latent Semantic Indexing, or matrix factorization. With ML, the tools can automatically learn from past samples to make predictions about new data. As the models see more samples, extraction tends to become more accurate, which leads to better information being available and makes it easier to meet the business’s objectives.
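For contrast with the supervised case, the following sketch groups documents without any tags at all. It uses a deliberately simple greedy clustering pass over word-count vectors compared by cosine similarity, which is an assumed stand-in for the clustering techniques mentioned above rather than any specific product's algorithm. The stop-word list, the 0.3 threshold, and the sample messages are illustrative choices.

```python
import math
import re
from collections import Counter

# Assumed minimal stop-word list; real tools use much longer ones.
STOPWORDS = {"the", "a", "an", "is", "and", "my", "to", "of", "for", "i", "in", "on"}


def vectorize(text: str) -> Counter:
    """Bag-of-words vector: word -> count, with stop words removed."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(docs, threshold: float = 0.3) -> list:
    """Greedy single pass: join the most similar cluster or start a new one."""
    clusters = []  # each cluster holds summed word counts and member indices
    for index, doc in enumerate(docs):
        vector = vectorize(doc)
        best = max(clusters, key=lambda c: cosine(vector, c["centroid"]), default=None)
        if best is not None and cosine(vector, best["centroid"]) >= threshold:
            best["members"].append(index)
            best["centroid"].update(vector)  # summed counts act as the centroid
        else:
            clusters.append({"centroid": Counter(vector), "members": [index]})
    return [c["members"] for c in clusters]


if __name__ == "__main__":
    messages = [
        "Invoice payment overdue for account",
        "Password reset link not working",
        "Second reminder: invoice payment overdue",
        "Cannot reset my password",
    ]
    print(cluster(messages))
```

No one told the code that two messages were about billing and two about passwords; the grouping emerges from shared vocabulary alone. The result depends on document order and on the threshold, which is one reason iterative methods such as k-means are often preferred in practice.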

When working with ML tools, it’s useful to have some form of dashboard and an easy-to-use user interface so that IT teams can visualize trends and, perhaps, identify areas for further research in the future. And the whole process should fit into an organization’s current workflows rather than disrupt them.

In Conclusion

Unstructured data has been lying around data centres like clothes on the floor of a teenager’s bedroom. Until recently, there was no successful way to access this data and tidy it up, that is, to derive useful business information from it. Now there is, which means DBAs are beginning to feel that their unstructured data is sparking joy.
