Apolloscape Dataset
A dataset captured on public roads.
• pedestrian
• Additional information: vehicle information, infrastructure

TITAN Dataset
Figure 3. Example scenarios of the TITAN dataset: pedestrian bounding boxes with tracking IDs, vehicle bounding boxes with IDs, and future locations are shown; action labels are drawn in different colors following Figure 2.

The TITAN dataset consists of ego-centric views captured from a mobile platform. In the TITAN dataset, every participant (individuals, vehicles, cyclists, etc.) in each frame is localized using a bounding box. We annotated 3 object labels (person, 4-wheeled vehicle, 2-wheeled vehicle), 3 age groups for persons (child, adult, senior), 3 motion-status labels for both 2- and 4-wheeled vehicles, and door/trunk status labels for 4-wheeled vehicles. For action labels, we created 5 mutually exclusive person action sets organized hierarchically (Figure 2). In the first action set of the hierarchy, the annotator is instructed to assign exactly one class label among 9 atomic whole-body actions/postures that describe primitive action poses such as sitting, standing, bending, etc. The second action set includes 13 actions that involve single atomic actions with simple scene context, such as jaywalking or waiting to cross. The third action set includes 7 complex contextual actions that involve a sequence of atomic actions with higher contextual understanding, such as getting in/out of a 4-wheeled vehicle or loading/unloading. The fourth action set includes 4 transportive actions that describe the act of manually transporting an object.

For prediction, each agent i is observed through its bounding box x^i_t at each past time step from 1 to T_obs, where (c_u, c_v) and (l_u, l_v) represent the center and the dimensions of the bounding box, respectively. The proposed TITAN framework requires three inputs: the image sequence I^i_{t=1:T_obs} for the action detector, x^i_t for both the interaction encoder and the past object location encoder, and e_t = {α_t, ω_t} for the ego-motion encoder, where α_t and ω_t correspond to the acceleration and yaw rate of the ego-vehicle at time t, respectively. During inference, multiple modes of future bounding box locations are sampled from a bivariate Gaussian generated by the noise parameters, and the future ego-motions ê_t are predicted accordingly, reflecting the multi-modal nature of the future prediction problem.

Henceforth, the notation for the feature embedding functions implemented with multi-layer perceptrons (MLPs) is as follows: Φ has no activation, while Φ_r, Φ_t, and Φ_s use ReLU, tanh, and sigmoid activations, respectively.

4.1. Action Recognition
We use an existing state-of-the-art method as the backbone.

PIE Dataset

Table 3: Location (bounding box) prediction errors over varying future time steps. MSE in pixels is calculated over all predicted time steps; CMSE and CFMSE are the MSEs calculated over the centers of the bounding boxes for the entire predicted sequence and for only the last time step, respectively.

              ---------------- PIE ----------------    ---------------- JAAD ---------------
Method        MSE 0.5s  MSE 1s  MSE 1.5s  CMSE  CFMSE  MSE 0.5s  MSE 1s  MSE 1.5s  CMSE  CFMSE
Linear           123      477     1365     950   3983     223      857     2303    1565   6111
LSTM             172      330      911     837   3352     289      569     1558    1473   5766
B-LSTM [5]       101      296      855     811   3259     159      539     1535    1447   5615
PIEtraj           58      200      636     596   2477     110      399     1248    1183   4780

Table 4: Speed prediction errors over varying future time steps on the PIE dataset; "last" stands for the last time step. The results are reported in km/h.

Method      MSE 0.5s  MSE 1s  MSE 1.5s  MSE last
Linear        0.87     2.28     4.27     10.76
LSTM          1.50     1.91     3.00      6.89
PIEspeed      0.63     1.44     2.65      6.77

Prediction is generally better on bounding box centers due to the fewer degrees of freedom.
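To make the metrics in Table 3 concrete, the following is a minimal sketch of how MSE, CMSE, and CFMSE can be computed for predicted box sequences, assuming boxes are stored as [center_x, center_y, width, height] arrays in pixels; the function name, the array layout, and the exact averaging order are illustrative assumptions rather than the PIE authors' evaluation code.

```python
import numpy as np

def bbox_prediction_errors(pred, gt):
    """MSE, CMSE and CFMSE for predicted bounding-box sequences.

    pred, gt: arrays of shape (N, T, 4) holding [center_x, center_y, width, height]
    in pixels for N trajectories over T predicted time steps (assumed layout).
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)

    # MSE: squared error over all four box coordinates, averaged over every
    # predicted time step.
    mse = np.mean((pred - gt) ** 2)

    # CMSE: squared error of the box centers only, averaged over the whole
    # predicted sequence.
    cmse = np.mean((pred[..., :2] - gt[..., :2]) ** 2)

    # CFMSE: center error at the final predicted time step only.
    cfmse = np.mean((pred[:, -1, :2] - gt[:, -1, :2]) ** 2)

    return mse, cmse, cfmse
```

In this layout, CMSE and CFMSE fall out of the same array by restricting the error to the two center coordinates and, for CFMSE, to the last time step.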
We first evaluate the proposed speed prediction stream, PIEspeed, by comparing it with two baseline models: a linear Kalman filter and a vanilla LSTM. We use the MSE metric and report the results in km/h. Table 4 shows the results of our experiments; the linear model achieves reasonable performance.

Figure 5: Illustration of our TrafficPredict (TP) method on camera-based images. There are six scenarios with different road conditions and traffic situations; we only show the trajectories of several instances in each scenario. The ground truth (GT) is drawn in green and the prediction results of the other methods (ED, SL, SA) are shown with different dashed lines. The predicted trajectories of our TP algorithm (pink lines) are the closest to the ground truth in most cases.

TrafficPredict uses an instance layer to capture the trajectories and interactions of instances and a category layer to summarize the similarities among instances of the same category.

A dataset captured on public roads:
• Samples: 81K
• Scenes: 100,000
• Object classes: pedestrian, car, cyclist

A dataset captured on public roads:
• Samples: 645K
• Scenes: 700
• Object classes: pedestrian, car, cyclist
• Additional information: action labels, pedestrian age

Y. Ma, et al., “TrafficPredict: Trajectory Prediction for Heterogeneous Traffic-Agents,” AAAI, 2019.
A. Rasouli, et al., “PIE: A Large-Scale Dataset and Models for Pedestrian Intention Estimation and Trajectory Prediction,” ICCV, 2019.
S. Malla, et al., “TITAN: Future Forecast using Action Priors,” CVPR, 2020.
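As a concrete reference for the speed-prediction comparison above, the "Linear" row in Table 4 corresponds to a linear Kalman filter; the sketch below shows one way such a baseline can be set up, filtering the observed ego speed with a constant-acceleration model and rolling the model forward to produce future speeds. The state layout, noise covariances, sampling rate, and function name are assumptions for illustration, not the PIE authors' exact configuration.

```python
import numpy as np

def kalman_speed_baseline(observed_speeds, n_future, dt=1.0):
    """Filter an observed ego-speed sequence (km/h) with a constant-acceleration
    Kalman filter, then roll the motion model forward to predict future speeds.
    State = [speed, acceleration]; noise levels are illustrative assumptions.
    """
    F = np.array([[1.0, dt],          # speed_{t+1} = speed_t + dt * accel_t
                  [0.0, 1.0]])
    H = np.array([[1.0, 0.0]])        # only speed is observed
    Q = np.eye(2) * 1e-2              # process noise (assumed)
    R = np.array([[1e-1]])            # measurement noise (assumed)

    x = np.array([observed_speeds[0], 0.0])
    P = np.eye(2)

    # Filtering pass over the observation window.
    for z in observed_speeds[1:]:
        x = F @ x                     # predict
        P = F @ P @ F.T + Q
        y = z - H @ x                 # update with the new measurement
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)
        x = x + (K @ y).ravel()
        P = (np.eye(2) - K @ H) @ P

    # Open-loop prediction of the future speeds.
    preds = []
    for _ in range(n_future):
        x = F @ x
        preds.append(x[0])
    return np.array(preds)

# Example: 0.5 s of history at an assumed 30 Hz, predicting the next 1.5 s.
history = np.linspace(30.0, 32.0, 15)                     # km/h
future = kalman_speed_baseline(history, n_future=45, dt=1.0 / 30)
```

The MSE between such predictions and the ground-truth speeds, in km/h, is what Table 4 compares across methods.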
[email protected] Abstract Robust detection and tracking of objects is crucial for the deployment of autonomous vehicle technology. Image based benchmark datasets have driven development in com- puter vision tasks such as object detection, tracking and seg- mentation of agents in the environment. Most autonomous vehicles, however, carry a combination of cameras and range sensors such as lidar and radar. As machine learn- ing based methods for detection and tracking become more prevalent, there is a need to train and evaluate such meth- ods on datasets containing range sensor data along with im- ages. In this work we present nuTonomy scenes (nuScenes), the first dataset to carry the full autonomous vehicle sensor suite: 6 cameras, 5 radars and 1 lidar, all with full 360 de- gree field of view. nuScenes comprises 1000 scenes, each 20s long and fully annotated with 3D bounding boxes for 23 classes and 8 attributes. It has 7x as many annotations and 100x as many images as the pioneering KITTI dataset. We define novel 3D detection and tracking metrics. We also provide careful dataset analysis as well as baselines for li- dar and image based detection and tracking. Data, devel- opment kit and more information are available online1. 1. Introduction Figure 1. An example from the nuScenes dataset. We see 6 dif- ferent camera views, lidar and radar data, as well as the human annotated semantic map. At the bottom we show the human writ- ten scene description. Multimodal datasets are of particular importance as no single type of sensor is sufficient and the sensor types are complementary. Cameras allow accurate measurements of edges, color and lighting enabling classification and local- ization on the image plane. However, 3D localization from images is challenging [13, 12, 57, 80, 69, 66, 73]. Lidar pointclouds, on the other hand, contain less semantic infor- nuScenes Dataset ҰൠಓΛࡱӨͨ͠σʔληοτ • αϯϓϧɿ645K • γʔϯɿ700 • ରछྨ • Ճใ - truck, bicycle, car, etc. - ηϯαʔใɼਤσʔλɼ܈ใɼΤΰϞʔγϣϯ H. Caesar, et al., “nuScenes: A multimodal dataset for autonomous driving,” CVPR, 2020.