The type of AI we currently have is narrow AI. In two posts, I will try to explain what that means in practice for anyone who wants to use deep learning.
Narrow AI for language
The state of AI right now is what we call narrow AI: models trained on limited data in a small niche to solve one task. An example might be classifying the sentiment of movie reviews (was the movie good or bad according to the reviewer?). A few years ago it was common to learn word representations from movie reviews alone and then train a classifier on top.
That means that if you did a project 1-3 years ago, your model is most likely incredibly narrow. The old models had horrible accuracy on new data. Now we have transfer learning, where transformer models learn adaptable representations of words and sentences from a vast amount of text: blog posts, corporate documents, forums, news, books and research papers. This means that models generalize far better. With the older way of doing things, models had no "knowledge" of the world outside your training data, so when one of the old models processes real-world data that differs from its training data, chances are it will not work at all. Here's why:
Example: classifying the sentiment of movie reviews:
It was an amazing movie
It was a great movie
It was a terrible movie
It was a horrible movie
If your dataset does not contain all the positive words, such as "great" and "fantastic", the model has no idea that they could make a text more positive. So most likely someone built a list from their imagination, or tried to use statistics to extract the most common words from the data samples where the sentiment was positive. But what if you do have the greatest list ever built? What about misspelt words such as "terriibl"? If you count similar letters, "terrific" looks a lot like "terrible", yet it means the opposite.
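To make the word-list problem concrete, here is a minimal Python sketch. The word lists, the function names and the 0.5 threshold are assumptions made up for illustration, and the standard library's difflib stands in for "counting similar letters"; it is not how any particular product works.

```python
from difflib import SequenceMatcher

# Hypothetical, deliberately tiny word lists (not from any real system).
POSITIVE = {"amazing", "great"}
NEGATIVE = {"terrible", "horrible"}


def lexicon_sentiment(text):
    """Score a text by counting exact matches against the word lists."""
    score = 0
    for word in text.lower().split():
        if word in POSITIVE:
            score += 1
        elif word in NEGATIVE:
            score -= 1
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "unknown"


def closest_label(word, threshold=0.5):
    """Label a word by its most similar list entry, judged on characters only."""
    best_label, best_score = "unknown", 0.0
    for label, words in (("positive", POSITIVE), ("negative", NEGATIVE)):
        for known in words:
            score = SequenceMatcher(None, word, known).ratio()
            if score >= threshold and score > best_score:
                best_label, best_score = label, score
    return best_label


print(lexicon_sentiment("It was a great movie"))      # word is on the list
print(lexicon_sentiment("It was a fantastic movie"))  # word is missing from the list
print(closest_label("terriibl"))  # the misspelling is rescued by fuzzy matching
print(closest_label("terrific"))  # but "terrific" is pulled towards "terrible"
```

Fuzzy matching rescues the misspelling, but it also drags "terrific" over to the negative side, because character overlap says nothing about meaning.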
I really thought it was an amazing movie until i saw it and changed my mind
I really thought it was a terrible movie until i saw it and changed my mind
You have double negatives and other constructions that reverse the meaning of a sentence. You need to look at the whole text.
The popcorn was very bad, it was one of my worst experiences ever. I would never eat that crap again. The movie was good though.
Judging the movie by the whole text does not work either.
This movie was shit
This movie was the shit
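A quick way to see why these two reviews are hard for word-counting methods: in the sketch below, a bag-of-words representation with a small stop-word list (hypothetical, chosen for this example) gives both sentences exactly the same features.

```python
from collections import Counter

# Hypothetical stop-word list; real lists are longer but usually include "the".
STOP_WORDS = {"this", "was", "the", "a", "it"}


def bag_of_words(text):
    """Count the words in a text, ignoring order and dropping stop words."""
    return Counter(w for w in text.lower().split() if w not in STOP_WORDS)


negative_review = bag_of_words("This movie was shit")
positive_review = bag_of_words("This movie was the shit")

# Both reviews collapse to the same features, so no classifier
# trained on these features can tell them apart.
print(negative_review == positive_review)  # True
```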
The same words can mean very different things in almost identical settings. You need context to judge what a word means. So older methods like word2vec, which tried to compress synonyms into a vector space to free you from lists of words, did not really work either, although they were a great start for something new. To combat these problems, people spent years building syntactic parsers and huge lists of things, never really getting there. Those systems worked somewhat well in a very limited domain and required PhDs to build. There are a lot of products out there today that claim to be doing AI but do nothing more than count characters. There is no such thing as AI yet, but the closest thing we have is deep neural networks.
So how do we do it?
We use the latest architectures designed to combat these problems, and we have developed our own method for getting started really fast. We will not go into details, but we use our own custom transformer-based model. Transformer models have taken the world by storm and are the main reason researchers surpassed the human baseline on the SuperGLUE language understanding benchmark. We have a very efficient model that works right away. Together with our recommendation system we can solve most classification problems incredibly fast.
So what does it take to get here and further?
Models still benefit from "domain adaptation", where you continue training the model on, say, legal documents before using it for a legal task downstream. However, one might think that a general enough dataset, given a truly capable model, could help solve generalizability. One such corpus is the Pile, with its 800GB of English text from a very well thought-out mix of domains.
Can models benefit from knowing more languages? Other languages offer new words for different concepts, with slightly different uses. Besides being incredibly useful in real-world settings, advances in multilingual language models could be a way forward.
How can we make these models more general besides showing them more robust data?
Larger and better model architectures
The larger the model, the more parameters ("storage", "space", "reasoning" capacity) it has, and the better it generalizes across domains and tasks. I guess that AI adoption will accelerate further as larger models become usable, and as methods such as retrievers, where a model can "search" a known, verified bank of facts, and graph neural networks make their way into transformer models. More efficient models allow larger sizes. Better-performing architectures allow for better "understanding". One example is DeBERTa from Microsoft, which is currently at the top of the SuperGLUE leaderboard.
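The retriever idea mentioned above can be sketched in a few lines. This is a toy under loud assumptions: the fact bank, the query and the function names are invented for illustration, and real retrievers typically compare learned dense vectors rather than raw word overlap.

```python
# Hypothetical "verified bank of facts" for illustration only.
FACTS = [
    "the pile is a large corpus of english text",
    "deberta is a transformer model from microsoft",
    "electra is a pretraining method",
]


def tokenize(text):
    """Split a text into a set of lowercase words."""
    return set(text.lower().split())


def retrieve(query, facts):
    """Return the fact whose words overlap most with the query (Jaccard similarity)."""
    query_words = tokenize(query)

    def jaccard(fact):
        fact_words = tokenize(fact)
        return len(query_words & fact_words) / len(query_words | fact_words)

    return max(facts, key=jaccard)


# The retrieved fact would be handed to the model as extra context.
print(retrieve("which company is deberta from", FACTS))
```

The point of the design is that the facts live outside the model's parameters, so they can be checked and updated without retraining.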
Better unsupervised and supervised training tasks might get models to generalize better. One example is research into tasks such as ELECTRA, where models learn faster from new pretraining techniques: instead of guessing which token would best fill the gap left by a removed word, the model guesses for every token in a sentence whether it has been replaced or not. Another is more unsupervised tasks where models learn to "play" with data that is already easily available. I'm really looking forward to what might happen in these areas in the coming years.
At Labelf we do our best to utilize all of these things, and that's why we are able to offer a solution that works for over 100 languages.
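The difference between the two pretraining tasks is easiest to see in how the training targets are built. The sketch below is a simplified illustration, not ELECTRA's actual code: the function names are made up, and the replaced tokens are passed in by hand, whereas ELECTRA samples them from a small generator model.

```python
MASK = "[MASK]"


def masked_lm_example(tokens, masked_positions):
    """Masked language modelling: hide some tokens and predict only those."""
    inputs = [MASK if i in masked_positions else t for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in masked_positions}
    return inputs, targets


def replaced_token_example(tokens, replacements):
    """Replaced token detection: swap some tokens and label every position."""
    inputs = [replacements.get(i, t) for i, t in enumerate(tokens)]
    # 1 = replaced, 0 = original; a swap that reproduces the original counts as original.
    labels = [int(new != old) for new, old in zip(inputs, tokens)]
    return inputs, labels


tokens = ["it", "was", "a", "great", "movie"]

mlm_inputs, mlm_targets = masked_lm_example(tokens, {3})
rtd_inputs, rtd_labels = replaced_token_example(tokens, {3: "terrible"})

print(mlm_inputs, mlm_targets)  # one training signal, at the masked position
print(rtd_inputs, rtd_labels)   # a training signal at every position
```

With masking, the model only gets feedback at the hidden positions; with replaced token detection it gets a label for every token, which is one reason given for the faster learning.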
The elephant in the room...
Training one system on several types of domains, such as vision, language and audio, is called multi-modal training. Training one model to understand a text might require "understanding" of visual or audio concepts, maybe even human sensory feedback such as touch and smell. Does it need to have feelings such as grief, anxiety and pain to fully understand our world? Even now, just from reading random texts from the internet, AI outperforms human baselines on reading comprehension and language understanding tasks.
Most models used in production and research are single-domain (single-modal) only. It is very possible that this vast quantity of language data contains different ways of describing visual scenes, such as "a pony on a green field" and "one small horse on pasture", and that later down the road, when using this "knowledge", the model might "know" that these are similar. Still, if the model could also see and not only read, I predict that it would be more general and have a wider domain of "understanding". We are currently experimenting with this, and so are others. This is incredibly interesting, and some large-scale experiments indicate that a vision model performs even better if you incorporate a language description of the image during training.