


Posted by Tali Dekel, Research Scientist and Forrester Cole, Software Engineer, Machine Perception

The human visual system has a remarkable ability to make sense of our 3D world from its 2D projection. Even in complex environments with multiple moving objects, people are able to maintain a feasible interpretation of the objects’ geometry and depth ordering. The field of computer vision has long studied how to achieve similar capabilities by computationally reconstructing a scene’s geometry from 2D image data, but robust reconstruction remains difficult in many cases.

A particularly challenging case occurs when both the camera and the objects in the scene are freely moving. This confuses traditional 3D reconstruction algorithms that are based on triangulation, which assumes that the same object can be observed from at least two different viewpoints at the same time. Satisfying this assumption requires either a multi-camera array (like Google’s Jump), or a scene that remains stationary as the single camera moves through it. As a result, most existing methods either filter out moving objects (assigning them “zero” depth values), or ignore them (resulting in incorrect depth values).

Left: The traditional stereo setup assumes that at least two viewpoints capture the scene at the same time. Right: We consider the setup where both camera and subject are moving.
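To make the triangulation assumption concrete, here is a minimal sketch (illustrative values only, not part of our method) of classic two-view depth from disparity for a rectified stereo pair. Depth follows from depth = focal_length × baseline / disparity, which is only valid when both views capture the scene at the same instant; if the subject moves between the two exposures, its motion is indistinguishable from disparity and corrupts the recovered depth.

```python
# Minimal sketch of classic two-view triangulation (rectified stereo).
# Illustrative values only -- not the method described in this post.

def depth_from_disparity(focal_px: float, baseline_m: float,
                         x_left: float, x_right: float) -> float:
    """Depth of a point seen at column x_left in the left image and
    x_right in the right image, assuming BOTH views capture the scene
    at the same instant."""
    disparity = x_left - x_right  # pixel shift between the two views
    if disparity <= 0:
        raise ValueError("point must be in front of both cameras")
    return focal_px * baseline_m / disparity

# A static point: disparity comes purely from the camera baseline.
print(depth_from_disparity(1000.0, 0.1, 420.0, 400.0))  # 5.0 m

# If the subject also moved ~10 px between the two exposures, that
# motion is indistinguishable from disparity and the estimate is wrong:
print(depth_from_disparity(1000.0, 0.1, 420.0, 390.0))  # ~3.3 m
```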
In “Learning the Depths of Moving People by Watching Frozen People”, we tackle this fundamental challenge by applying a deep learning-based approach that can generate depth maps from an ordinary video, where both the camera and subjects are freely moving. The model avoids direct 3D triangulation by learning priors on human pose and shape from data. While there is a recent surge in using machine learning for depth prediction, this work is the first to tailor a learning-based approach to the case of simultaneous camera and human motion. In this work, we focus specifically on humans because they are an interesting target for augmented reality and 3D video effects.

Our model predicts the depth map (right; brighter = closer to the camera) from a regular video (left), where both the people in the scene and the camera are freely moving.

We train our depth-prediction model in a supervised manner, which requires videos of natural scenes, captured by moving cameras, along with accurate depth maps. The key question is where to get such data. Generating data synthetically requires realistic modeling and rendering of a wide range of scenes and natural human actions, which is challenging. Further, a model trained on such data may have difficulty generalizing to real scenes.
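As a rough illustration of this supervised setup, the sketch below runs one training step of a toy depth-regression network in PyTorch. Everything here is a placeholder assumption (the tiny network, the random frames and depth maps, the plain L1 loss on log depth); the actual model’s architecture, inputs, and losses are more elaborate.

```python
# Minimal sketch of one supervised training step for depth prediction.
# The tiny network, random tensors, and log-depth L1 loss are stand-ins,
# not the architecture or objectives from the paper.
import torch
import torch.nn as nn

model = nn.Sequential(                       # placeholder depth network
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 1, 3, padding=1), nn.Softplus(),  # positive depths
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

frames = torch.rand(4, 3, 128, 160)          # batch of video frames
gt_depth = torch.rand(4, 1, 128, 160) + 0.1  # ground-truth depth maps

pred = model(frames)                         # predicted depth maps
loss = (pred.clamp_min(1e-3).log() - gt_depth.log()).abs().mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```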
