Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics

Alex Kendall
University of Cambridge
[email protected]

Yarin Gal
University of Oxford
[email protected]

Roberto Cipolla
University of Cambridge
[email protected]

Abstract

Numerous deep learning applications benefit from multi-task learning with multiple regression and classification objectives. In this paper we make the observation that the performance of such systems is strongly dependent on the relative weighting between each task's loss. Tuning these weights by hand is a difficult and expensive process, making multi-task learning prohibitive in practice. We propose a principled approach to multi-task deep learning which weighs multiple loss functions by considering the homoscedastic uncertainty of each task. This allows us to simultaneously learn various quantities with different units or scales in both classification and regression settings. We demonstrate our model learning per-pixel depth regression, semantic and instance segmentation from a monocular input image. Perhaps surprisingly, we show our model can learn multi-task weightings and outperform separate models trained individually on each task.

1. Introduction

Multi-task learning aims to improve learning efficiency and prediction accuracy by learning multiple objectives from a shared representation [7]. Multi-task learning is prevalent in many applications of machine learning – from computer vision [27] to natural language processing [11] to speech recognition [23].

We explore multi-task learning within the setting of visual scene understanding in computer vision. Scene understanding algorithms must understand both the geometry and semantics of the scene at the same time. This forms an interesting multi-task learning problem because scene understanding involves joint learning of various regression and classification tasks with different units and scales. Multi-task learning of visual scene understanding is of crucial importance in systems where long computation run-time is prohibitive, such as the ones used in robotics.
Combining all tasks into a single model reduces computation and allows these systems to run in real-time.

Prior approaches to simultaneously learning multiple tasks use a naïve weighted sum of losses, where the loss weights are uniform or manually tuned [38, 27, 15]. However, we show that performance is highly dependent on an appropriate choice of weighting between each task's loss. Searching for an optimal weighting is prohibitively expensive and difficult to resolve with manual tuning. We observe that the optimal weighting of each task is dependent on the measurement scale (e.g. meters, centimetres or millimetres) and ultimately the magnitude of the task's noise.

In this work we propose a principled way of combining multiple loss functions to simultaneously learn multiple objectives using homoscedastic uncertainty. We interpret homoscedastic uncertainty as task-dependent weighting and show how to derive a principled multi-task loss function which can learn to balance various regression and classification losses. Our method can learn to balance these weightings optimally, resulting in superior performance compared with learning each task individually.

Specifically, we demonstrate our method in learning scene geometry and semantics with three tasks. Firstly, we learn to classify objects at a pixel level, also known as semantic segmentation [32, 3, 42, 8, 45]. Secondly, our model performs instance segmentation, which is the harder task of segmenting separate masks for each individual object in an image (for example, a separate, precise mask for each individual car on the road) [37, 18, 14, 4]. This is more difficult than semantic segmentation, as it requires not only an estimate of each pixel's class, but also which object that pixel belongs to. It is also more complicated than object detection, which often predicts object bounding boxes alone [17]. Finally, our model predicts pixel-wise metric depth.
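The task-dependent weighting described above can be sketched in a few lines. This is a simplified illustration (the function and parameter names are our own, not from the paper), using the common parameterisation s_i = log σ_i², so each task loss L_i is scaled by exp(−s_i) and regularised by an additive s_i term; in practice the scalars s_i would be trained jointly with the network weights by gradient descent.

```python
import math

def uncertainty_weighted_loss(task_losses, log_vars):
    """Combine per-task losses with learned homoscedastic uncertainty.

    A simplified sketch: each task i has a learnable scalar
    s_i = log(sigma_i^2), and the combined objective is
        sum_i exp(-s_i) * L_i + s_i.
    A large task variance down-weights that task's loss, while the
    additive s_i term stops the variance growing without bound.
    """
    total = 0.0
    for loss, s in zip(task_losses, log_vars):
        total += math.exp(-s) * loss + s
    return total

# With all log-variances at zero, this reduces to a plain sum of losses.
print(uncertainty_weighted_loss([1.0, 2.0], [0.0, 0.0]))  # → 3.0
```

Note that with this parameterisation the effective task weights exp(−s_i) are always positive, and the optimisation trades off lowering a task's weight against paying the regularisation cost s_i.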
Depth by recognition has been demonstrated using dense prediction networks with supervised [15] and unsupervised [16] deep learning. However, it is very hard to estimate depth in a way which generalises well. We show that we can improve our estimation of geometry and depth by using semantic labels and multi-task deep learning.

arXiv:1705.07115v3 [cs.CV] 24 Apr 2018

Figure 1: Multi-task deep learning. We derive a principled way of combining multiple regression and classification loss functions for multi-task learning. Our architecture takes a single monocular RGB image as input and produces a pixel-wise classification, an instance semantic segmentation and an estimate of per-pixel depth. Multi-task learning can improve accuracy over separately trained models because cues from one task, such as depth, are used to regularize and improve the generalization of another domain, such as segmentation.

In existing literature, separate deep learning models would be used to learn depth regression, semantic segmentation and instance segmentation to create a complete scene understanding system. Given a single monocular input image, our system is the first to produce a semantic segmentation, a dense estimate of metric depth and an instance-level segmentation jointly (Figure 1). While other vision models have demonstrated multi-task learning, we show how to learn to combine semantics and geometry. Combining these tasks into a single model ensures that the model agrees between the separate task outputs while reducing computation. Finally, we show that using a shared representation with multi-task learning improves performance on various metrics, making the models more effective.

In summary, the key contributions of this paper are:

1. 
a novel and principled multi-task loss to simultaneously learn various classification and regression losses of varying quantities and units using homoscedastic task uncertainty,

2. a unified architecture for semantic segmentation, instance segmentation and depth regression,

3. demonstrating the importance of loss weighting in multi-task deep learning and how to obtain superior performance compared to equivalent separately trained models.

2. Related Work

Multi-task learning aims to improve learning efficiency and prediction accuracy for each task, when compared to training a separate model for each task [40, 5]. It can be considered an approach to inductive knowledge transfer which improves generalisation by sharing domain information between complementary tasks. It does this by using a shared representation to learn multiple tasks – what is learned from one task can help learn other tasks [7].

Fine-tuning [1, 36] is a basic example of multi-task learning, where we can leverage different learning tasks by considering them as a pre-training step. Other models alternate learning between each training task, for example in natural language processing [11]. Multi-task learning can also be used in a data streaming setting [40], or to prevent forgetting previously learned tasks in reinforcement learning [26]. It can also be used to learn unsupervised features from various data sources with an auto-encoder [35].

In computer vision there are many examples of methods for multi-task learning. Many focus on semantic tasks, such as classification and semantic segmentation [30] or classification and detection [38]. MultiNet [39] proposes an architecture for detection, classification and semantic segmentation. Cross-stitch networks [34] explore methods to combine multi-task neural activations. Uhrig et al. [41] learn semantic and instance segmentations under a classification setting.
Multi-task deep learning has also been used for geometry and regression tasks. [15] show how to learn semantic segmentation, depth and surface normals. PoseNet [25] is a model which learns camera position and orientation. UberNet [27] learns a number of different regression and classification tasks under a single architecture. In this work we are the first to propose a method for jointly learning depth regression, semantic and instance segmentation. Like the model of [15], our model learns both semantic and geometry representations, which is important for scene understanding. However, our model learns the much harder task of instance segmentation, which requires knowledge of both semantics and geometry: the model must determine the class and spatial relationship of each pixel in each object.