Since data is at the heart of AI, it should come as no surprise that AI and ML systems need enough good quality data to “learn”. In general, a large volume of good quality data is needed, especially for supervised learning approaches, in order to properly train the AI or ML system. The exact amount of data needed may vary depending on which pattern of AI you’re implementing, the algorithm you’re using, and other factors such as in house versus third party data. For example, neural nets need a lot of data to be trained while decision trees or Bayesian classifiers don’t need as much data to still produce high quality results.
So you might think more is better, right? Well, think again. Organizations with lots of data, even exabytes, are realizing that having more data is not the solution to their problems as they might expect. Indeed, more data, more problems. The more data you have, the more data you need to clean and prepare. The more data you need to label and manage. The more data you need to secure, protect, mitigate bias, and more. Small projects can rapidly turn into very large projects when you start multiplying the amount of data. In fact, many times, lots of data kills projects.
Clearly the missing step between identifying a business problem and getting the data squared away to solve that problem is determining which data you need and how much of it you really need. You need enough, but not too much. “Goldilocks data” is what people often say: not too much, not too little, but just right. Unfortunately, far too often, organizations are jumping into AI projects without first addressing an understanding of their data. Questions organizations need to answer include figuring out where the data is, how much of it they already have, what condition it is in, what features of that data are most important, use of internal or external data, data access challenges, requirements to augment existing data, and other crucial factors and questions. Without these questions answered, AI projects can quickly die, even drowning in a sea of data.
Getting a better understanding of data
In order to understand just how much data you need, you first need to understand how and where data fits into the structure of AI projects. One visual way of understanding the increasing levels of value we get from data is the “DIKUW pyramid” (sometimes also referred to as the “DIKW pyramid) which shows how a foundation of data helps build greater value with Information, Knowledge, Understanding and Wisdom.
With a solid foundation of data, you can gain additional insights at the next information layer which helps you answer basic questions about that data. Once you have made basic connections between data to gain informational insight, you can find patterns in that information to gain understanding of the how various pieces of information are connected together for greater insight. Building on a knowledge layer, organizations can get even more value from understanding why those patterns are happening, providing an understanding of the underlying patterns. Finally, the wisdom layer is where you can gain the most value from information by providing the insights into the cause and effect of information decision making.
This latest wave of AI focuses most on the knowledge layer, since machine learning provides the insight on top of the information layer to identify patterns. Unfortunately, machine learning reaches its limits in the understanding layer, since finding patterns isn’t sufficient to do reasoning. We have machine learning, not but the machine reasoning required to understand why the patterns are happening. You can see this limitation in effect any time you interact with a chatbot. While the Machine learning-enabled NLP is really good at understanding your speech and deriving intent, it runs into limitations rying to understand and reason.For example, if you ask a voice assistant if you should wear a raincoat tomorrow, it doesn’t understand that you’re asking about the weather. A human has to provide that insight to the machine because the voice assistant doesn’t know what rain actually is.
Avoiding Failure by Staying Data Aware
Big data has taught us how to deal with large quantities of data. Not just how it’s stored but how to process, manipulate, and analyze all that data. Machine learning has added more value by being able to deal with the wide range of different types of unstructured, semi-structured or structured data collected by organizations. Indeed, this latest wave of AI is really the big data-powered analytics wave.
But it’s exactly for this reason why some organizations are failing so hard at AI. Rather than run AI projects with a data-centric perspective, they are focusing on the functional aspects. To gain a handle of their AI projects and avoid deadly mistakes, organizations need a better understanding not only of AI and machine learning but also the “V’s” of big data. It’s not just about how much data you have, but also the nature of that data. Some of those V’s of big data include:
- Volume: The sheer quantity of big data that you have
- Velocity: The rate at which that big data is changing. Successfully applying AI means applying AI to high velocity data.
- Variety: Data can come in a variety of different formats including structured data such as databases, semi-structured data such as invoices, and unstructured data such as emails, image and video files. Successful AI systems can deal with this variety.
- Veracity: This refers to the quality and accuracy of your data and how much you trust that data. Garbage in is garbage out, especially in the case of data-driven AI systems. As such, successful AI systems need to be able to deal with the high variability in data quality.
With decades of experience managing big data projects, organizations that are successful with AI are primarily successful with big data. The ones that are seeing their AI projects die are the ones who are coming at their AI problems with application development mindsets.
Too Much of the Wrong Data, and Not Enough of the Right Data is Killing AI Projects
While AI projects start off on the right foot, the lack of the necessary data and the lack of understanding and then solving real problems are killing AI projects. Organizations are powering forward without actually having a real understanding of the data that they need and the quality of that data. This poses real challenges.
One of the reasons why organizations are making this data mistake is that they are running their AI projects without any real approach to doing so, other than using Agile or app dev methods. However, successful organizations have realized that using data-centric approaches focus on data understanding as one of the first phases of their project approaches. The CRISP-DM methodology, which has been around for over two decades, specifies data understanding as the very next thing to do once you determine your business needs. Building on CRISP-DM and adding Agile methods, the Cognitive Project Management for AI (CPMAI) Methodology requires data understanding in its Phase II. Other successful approaches likewise require a data understanding early in the project, because after all, AI projects are data projects. And how can you build a successful project on a foundation of data without running your projects with an understanding of data? That’s surely a deadly mistake you want to avoid.