Google Advances AI With ‘One Model To Learn Them All’
Recent successes of deep neural networks have spanned many domains, from computer vision to speech recognition and many other tasks. Convolutional networks excel at vision-related tasks, while recurrent neural networks have proven successful at natural language processing tasks such as machine translation.
But in each case, the network was designed and tuned specifically for the problem at hand. This limits the impact of deep learning, as this effort needs to be repeated for each new task. It is also very different from the general nature of the human brain, which is able to learn many different tasks and benefit from transfer learning.
The natural question that then arises is: Can we create a unified deep learning model that is able to solve tasks that range across multiple domains?
Deep learning is known to have yielded great results across many fields, be it image classification, speech recognition or translation. But for each new problem, getting a deep model to work well involves a long period of tuning and careful research into the architecture.
One Such Model: A New Work.
Here, a single model is presented that yields good results on a number of problems spanning multiple domains. In particular, this single model has been trained concurrently on image captioning (COCO dataset), multiple translation tasks, a speech recognition corpus, an English parsing task and ImageNet.
This model architecture incorporates building blocks from multiple domains and consists of an attention mechanism, convolutional layers, and sparsely-gated layers. Each of these computational blocks is crucial for a subset of the tasks trained on. Interestingly, even when a block is not crucial for a given task, adding it never degrades performance and in most cases improves it on all tasks. The work also shows that tasks with less data benefit greatly from joint training with other tasks, while performance on the large tasks degrades only slightly, if at all.
The MultiModel Architecture:
The work takes a step towards answering the above question by introducing the MultiModel architecture: a single deep-learning model that can simultaneously learn multiple tasks from various domains. Concretely, the MultiModel is trained simultaneously on the following 8 corpora (a toy sketch of what such joint training can look like follows the list):
- WSJ speech corpus
- ImageNet dataset
- COCO image captioning dataset
- WSJ parsing dataset
- WMT English-German translation corpus
- The reverse of the above: German-English translation
- WMT English-French translation corpus
- The reverse of the above: French-English translation
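As a rough illustration of what joint training on several corpora can look like (not the authors' implementation), the Python sketch below samples one hypothetical task per step and updates a single shared parameter set; the task names, the stand-in model, and the fake data loader are assumptions made purely for illustration.

```python
# Toy sketch of joint multi-task training with one shared model body and
# small per-task heads. Task names, shapes and data are illustrative only.
import random
import torch
import torch.nn as nn

TASKS = ["translation_en_de", "speech", "image_captioning"]  # hypothetical subset

shared_body = nn.Linear(64, 64)  # stand-in for the shared encoder/decoder
heads = nn.ModuleDict({t: nn.Linear(64, 10) for t in TASKS})  # tiny per-task heads
optimizer = torch.optim.Adam(
    list(shared_body.parameters()) + list(heads.parameters()), lr=1e-3
)

def fake_batch(task):
    # Placeholder for a real per-task data loader: random features and labels.
    x = torch.randn(8, 64)
    y = torch.randint(0, 10, (8,))
    return x, y

for step in range(100):
    task = random.choice(TASKS)            # sample a task for this step
    x, y = fake_batch(task)
    logits = heads[task](shared_body(x))   # shared parameters + small task head
    loss = nn.functional.cross_entropy(logits, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The point of the sketch is simply that every step, whatever the task, flows through the same shared parameters, which is what lets small-data tasks benefit from the large ones.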
The model learns all of the above tasks and achieves good performance: not state-of-the-art at present, but better than many task-specific models studied in the recent past.
After training the model simultaneously on the above-mentioned eight tasks, the authors set out to determine the following:
- How close the MultiModel gets to state-of-the-art results in each task.
- How training on 8 tasks simultaneously compares to training on each task separately.
- How the different computational blocks influence different tasks.
Now, what exactly is the MultiModel Architecture?
The MultiModel consists of an encoder, an I/O mixer, a few small modality-nets and an autoregressive decoder, as depicted in the paper. The encoder and decoder are constructed from 3 key computational blocks in order to get good performance across a range of different problems (a minimal sketch of each follows the list below):
- Convolutions allow the model to detect local patterns and generalize across space.
- Attention layers allow the model to focus on specific elements and thereby improve its performance.
- Sparsely-gated mixture-of-experts layers give the model capacity without excessive computation cost.
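To make these three block types concrete, here is a minimal, self-contained PyTorch sketch of an illustrative convolution block, attention block, and sparsely-gated mixture-of-experts layer; the sizes, module names and the top-1 gating are assumptions for illustration, not the exact blocks used in the paper.

```python
# Illustrative versions of the three block types (not the paper's exact blocks).
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model = 64

# 1. Convolution: detects local patterns along the sequence dimension.
conv_block = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1)

# 2. Attention: lets each position focus on the most relevant other positions.
attn_block = nn.MultiheadAttention(embed_dim=d_model, num_heads=4, batch_first=True)

# 3. Sparsely-gated mixture-of-experts: adds capacity, but each token only
#    uses its top-k experts, so the extra compute stays small.
class SparseMoE(nn.Module):
    def __init__(self, d_model, num_experts=4, k=1):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(num_experts))
        self.gate = nn.Linear(d_model, num_experts)
        self.k = k

    def forward(self, x):                      # x: (batch, seq, d_model)
        scores = self.gate(x)                  # gating scores per token
        top_val, top_idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(top_val, dim=-1)
        out = torch.zeros_like(x)
        # For clarity every expert is evaluated and masked; a real sparse layer
        # dispatches each token only to its selected experts.
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = (top_idx[..., slot] == e).unsqueeze(-1)
                out = out + mask * weights[..., slot:slot + 1] * expert(x)
        return out

x = torch.randn(2, 16, d_model)                       # (batch, seq, features)
y = conv_block(x.transpose(1, 2)).transpose(1, 2)     # Conv1d expects (batch, channels, seq)
y, _ = attn_block(y, y, y)
y = SparseMoE(d_model)(y)
print(y.shape)                                        # torch.Size([2, 16, 64])
```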
In the paper, the authors start by describing the architecture of each of these 3 blocks and then introduce the encoder, the decoder and the architecture of the modality-nets.
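The modality-nets are the small task-specific front-ends that turn raw inputs such as word tokens or image pixels into the common representation size used by the shared body. The snippet below is a hypothetical sketch of two such modality-nets; the layer choices and dimensions are assumptions, not those of the paper.

```python
# Hypothetical modality-nets: each maps its input type into the same d_model space.
import torch
import torch.nn as nn

d_model = 64

text_modality = nn.Embedding(num_embeddings=10000, embedding_dim=d_model)  # token ids -> vectors
image_modality = nn.Sequential(                                            # small conv stack for images
    nn.Conv2d(3, d_model, kernel_size=4, stride=4),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d((8, 8)),
    nn.Flatten(2),                        # (batch, d_model, 64)
)

tokens = torch.randint(0, 10000, (2, 16))
images = torch.randn(2, 3, 64, 64)

text_repr = text_modality(tokens)                    # (2, 16, 64)
image_repr = image_modality(images).transpose(1, 2)  # (2, 64, 64): a sequence of image "patches"
print(text_repr.shape, image_repr.shape)             # both end in d_model, ready for a shared body
```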
"And for the very first time, we demonstrate, that a single deep learning model can jointly learn a number of large-scale tasks that range from multiple domains. The key that leads to its success comes from designing a multi-modal architecture in which parameters as many as possible are shared and from using computational blocks from different domains together. This, we believe, puts forward a path towards interesting future work on more general deep learning architectures, especially since this model from tasks with a large amount of available data to ones where data is limited shows transfer learning." Says the team.
For more insight into the work and the experiments conducted, the link is provided below:
Source And Link To The PDF: Click Here.