Natural Language Processing Is Impossible Without Humans
By Aaron Bianchi
Jan 15, 2022
Computer vision dominates the popular imagination. Use cases like driverless cars, facial recognition, and drone deliveries – machines navigating the three-dimensional world – are compelling and easy to grasp, even if the technology behind them is not well understood.
But in reality, the holy grail of AI is natural language processing (NLP). Teaching machines to accurately and reliably understand and generate human language would usher in a revolution whose boundaries are hard to envision.
In theory, machines can be perfect listeners that, unlike humans, never get bored or distracted. They can also consume and respond to content far, far faster than any human, at any time of day or night. The implications of these capabilities are staggering.
This assumes, of course, that we really can teach algorithms to understand what they are “hearing” and build into them the judgment required to communicate on our behalf. And that is what makes NLP such an elusive holy grail: doing it is hard on so many levels. Sure, helping machines make sense of two- and three-dimensional images is an enormous challenge, and headlines describing autonomous vehicle crashes and facial recognition mistakes hint at the complexity of computer vision. But human language is orders of magnitude more complex.
Here are five ways that we humans struggle with our own natural language processing:
You misinterpret sarcasm in a text message
You hear a pun and you don’t get it
You overhear a conversation between experts and get lost in their specialized vocabulary
You struggle to understand accented speech
You yearn for context when you come up against semantic, syntactic, or verbal ambiguity (“He painted himself,” or “What a waste/waist!”)
Obviously, processing and interpreting language can be a challenge even for humans, even though language is our principal form of communication. Language is complex and chock-full of ambiguity and nuance. We begin processing language in the womb and spend our whole lives getting better at it. And we still make mistakes all the time.
And here are ways that humans and machines struggle to process each other’s language:
Comprehending not just content, but also context
Processing language in the context of personal vocabularies and modes of speech
Seeing beyond content to intent and sentiment
Detecting and adjusting for errors in spoken or written content
Interpreting dialects, accents, and regionalisms
Understanding humor, sarcasm, misdirection
Keeping up with evolving usage, new words, and slang
Mastering specialized vocabularies
These challenges have not deterred NLP pioneers, and NLP remains an extremely fast-growing sector of machine learning. These pioneers have made great progress with use cases like:
Document classification – building models that assign content-driven labels and categories to documents to support search and management (a minimal sketch follows this list)
Named entity recognition – constructing and training models that identify particular categories of content in text, such as names of people, organizations, and places, so as to understand what the text is about
Chatbots – replacing human operators with models that can ascertain a customer’s problem and direct the customer to the right resource
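To make the first of these use cases concrete, here is a minimal document-classification sketch in Python. It assumes scikit-learn is available, and the tiny hand-labeled dataset, the category names, and the example sentences are all invented for illustration; a real model would need vastly more human-labeled text.

# Minimal document-classification sketch (illustrative only).
# Assumes scikit-learn is installed; the labeled examples below are made up.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# A hypothetical human-labeled training set: each document gets a category.
documents = [
    "Invoice attached for last month's consulting services.",
    "Patient presented with acute chest pain and shortness of breath.",
    "Quarterly earnings exceeded analyst expectations.",
    "MRI results show no abnormalities in the lumbar spine.",
]
labels = ["finance", "medical", "finance", "medical"]

# TF-IDF features feeding a linear classifier: a common baseline pipeline.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(documents, labels)

# Assign a category to an unseen document.
print(model.predict(["Blood pressure readings were elevated at follow-up."]))

Even in this toy sketch, the model can only be as good as the labels humans attached to each document, which is exactly the lesson the pioneers describe below.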
Of course, even these NLP applications are complex, and the pioneers have taken away three lessons that anyone interested in NLP should heed:
Algorithms require enormous volumes of labeled and annotated training data. The complexity and nuance of language processing mean that much of what we think of as natural language is full of edge cases. And as we all know, training algorithms on edge cases can demand many orders of magnitude more training data than routine cases do. Because algorithms have not yet overcome the barriers to machine/human communication outlined above, training data must come from humans.
Only humans can label and annotate text and speech data in ways that highlight nuance and context (a sample labeled record is sketched after these lessons).
Relying on commercial and open-source NLP training data is a dead end. Getting your model to the confidence levels you need demands training data that matches your specific context, industry, use case, vocabulary, and region.
The hard lesson that the pioneers learned is that NLP invariably demands custom-labeled datasets.
The humans who prepare your datasets must be qualified. If you are dealing with a healthcare use case, your human specialists must be fluent in medical terminology and processes. If the audience for your application is global, the training data cannot be prepared by specialists in a single geography. If the model will encounter slang and idiomatic content, the specialists must be able to label your training data appropriately.
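To illustrate what custom labeling looks like in practice, here is a hypothetical example of a single annotated record for a healthcare entity-recognition task. The schema (raw text plus character-offset spans) and the label names are assumptions made for this sketch; real projects define their own annotation guidelines.

# Hypothetical human-labeled record for a healthcare entity-recognition task.
# Offsets are zero-based character positions, end-exclusive.
record = {
    "text": "Patient was prescribed 20 mg of atorvastatin for hyperlipidemia.",
    "entities": [
        {"start": 23, "end": 28, "label": "DOSAGE"},      # "20 mg"
        {"start": 32, "end": 44, "label": "MEDICATION"},  # "atorvastatin"
        {"start": 49, "end": 63, "label": "CONDITION"},   # "hyperlipidemia"
    ],
}

# Sanity check: each span should reproduce the labeled string exactly.
for ent in record["entities"]:
    print(record["text"][ent["start"]:ent["end"]], "->", ent["label"])

Only a labeler who knows that atorvastatin is a medication and hyperlipidemia is a condition can draw those spans correctly, which is why domain-qualified specialists matter.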
Given the volume of training data NLP requires and the complexity and nuance that surround these models, look for a data labeling partner with a sizable, diverse, distributed workforce of labeling specialists.