Do visual tasks have a relationship, or are they unrelated? For instance, could having surface normals simplify estimating the depth of an image? Intuition answers these questions positively, implying the existence of a "structure" among visual tasks. Knowing this structure has notable value; it is the concept underlying transfer learning and provides a principled way of identifying redundancies across tasks, e.g., to seamlessly reuse supervision among related tasks or solve many tasks in one system without piling up the complexity.
We propose a fully computational approach for modeling the structure of the space of visual tasks. This is done by finding (first and higher-order) transfer learning dependencies across a dictionary of twenty-six 2D, 2.5D, 3D, and semantic tasks in a latent space. The product is a computational taxonomic map for task transfer learning. We study the consequences of this structure, e.g., nontrivial emerged relationships, and exploit them to reduce the demand for labeled data. For example, we show that the total number of labeled datapoints needed for solving a set of 10 tasks can be reduced by roughly 2/3 (compared to training independently) while keeping the performance nearly the same. We provide a set of tools for computing and probing this taxonomical structure, including a solver that users can employ to devise efficient supervision policies for their use cases.
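To make the idea of a supervision policy concrete, here is a minimal sketch of choosing which tasks to label under a fixed budget. It is not the provided solver: the function name `best_supervision_policy`, the four-task dictionary, and every affinity number below are hypothetical and chosen only for illustration. It uses a brute-force search and considers only first-order transfers (one source per target).

```python
from itertools import combinations

# Hypothetical transfer affinities: AFFINITY[source][target] in [0, 1].
# Higher means the source task's representation transfers better to the
# target. These values are invented for illustration, not measured results.
AFFINITY = {
    "normals":      {"normals": 1.0, "depth": 0.8, "edges": 0.6, "segmentation": 0.5},
    "depth":        {"normals": 0.7, "depth": 1.0, "edges": 0.5, "segmentation": 0.4},
    "edges":        {"normals": 0.3, "depth": 0.3, "edges": 1.0, "segmentation": 0.2},
    "segmentation": {"normals": 0.4, "depth": 0.4, "edges": 0.3, "segmentation": 1.0},
}


def best_supervision_policy(affinity, budget):
    """Pick at most `budget` source tasks to fully supervise.

    Every task is treated as a target and is served by whichever chosen
    source transfers to it best. The search is exhaustive, so it is only
    practical for a small number of tasks.
    """
    tasks = sorted(affinity)
    best_score, best_sources = float("-inf"), ()
    for k in range(1, budget + 1):
        for sources in combinations(tasks, k):
            # Total quality: each target takes its best available source.
            score = sum(max(affinity[s][t] for s in sources) for t in tasks)
            if score > best_score:
                best_score, best_sources = score, sources
    # Record which chosen source each target would transfer from.
    assignment = {
        t: max(best_sources, key=lambda s: affinity[s][t]) for t in tasks
    }
    return best_sources, assignment, best_score


if __name__ == "__main__":
    sources, assignment, score = best_supervision_policy(AFFINITY, budget=2)
    print("supervise:", sources)
    for target, source in assignment.items():
        print(f"  {target} <- {source}")
```

With these made-up numbers and a budget of two, the search labels two of the four tasks and serves the remaining two by transfer. Exhaustive search grows combinatorially with the number of tasks, so a dictionary of twenty-six tasks with higher-order transfers would need a proper optimization formulation rather than this enumeration.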
Process overview. The steps involved in creating the taxonomy.
The provided API uses our results to recommend a superior set of transfers. By using these transfers, we can get results close to those of a fully supervised network while using substantially less data.
Example taxonomies. Generated from the API.
To evaluate the quality of the learned transfer functions, we ran the transfer networks on a random YouTube video. Visit the Transfer Visualization page to analyze how well different sources transfer to a target, or how well a source transfers to different targets. You can compare the results to a fully supervised network as well as to baselines trained on ImageNet or trained without any transfer learning.
We provide a large and high-quality dataset of varied indoor scenes.
Complete pixel-level geometric information via aligned meshes.
Semantic information via knowledge distillation from ImageNet, MS COCO, and MIT Places.
Globally consistent camera poses. Complete camera intrinsics.
3x as big as ImageNet.
* If you are interested in using the full dataset (12 TB), then please contact the authors.
Taskonomy: Disentangling Task Transfer Learning.
Zamir, Sax*, Shen*, Guibas, Malik, Savarese.
CVPR 2018 [Oral]