The goal of this project is to develop a new approach to the fully automatic 3D modelling of architecture from a video sequence.
The recovery of 3D structure (3D shape and 3D motion) from a video sequence has been widely addressed by the computer vision community. The strongest cue for estimating 3D structure from a video clip is the 2D motion of the brightness pattern in the image plane; for this reason, the problem is generally referred to as structure from motion (SFM). Early approaches to SFM processed a single pair of consecutive frames, but two-frame algorithms are highly sensitive to image noise. More recent research has therefore turned to longer image sequences. Estimating 3D structure from multiple frames involves a larger number of unknowns (the 3D shape and the set of 3D camera positions), but the problem is more constrained than two-frame SFM because of the rigidity of the scene. The usual approach to multi-frame SFM relies on matching a set of feature points along the image sequence. Dense 3D shape estimates typically require hundreds of features that are difficult to track and that lead to a complex correspondence problem. Owing to this difficulty, automatic 3D modelling from video remains an open research problem.
This project attempts to overcome the difficulty outlined above by exploiting the most distinctive characteristic of common buildings – the flatness of their walls. The methods and algorithms to be developed within this project consider scenes whose 3D shape is well described by a piecewise planar model. Under this scenario, instead of tracking pointwise features, one can track larger regions whose 2D motion is described by a single set of parameters. The 3D structure of the scene is then computed from the 2D motion parameters. This approach avoids the correspondence problem and is particularly suited to constructing 3D models of buildings and urban scenes, which are well described by piecewise flat surfaces.
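The planar-scene assumption can be made concrete with a small numerical sketch (all values below – intrinsics, motion, plane – are hypothetical): points on a single 3D plane seen from two camera positions move in the image according to a single 3×3 homography, so one small set of parameters captures the 2D motion of the entire region.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical camera intrinsics for illustration.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])

# Points on the plane z = 5 (in the first camera frame), i.e. n.X = d
# with n = [0, 0, 1] and d = 5.
pts_3d = np.column_stack([rng.uniform(-1, 1, 20),
                          rng.uniform(-1, 1, 20),
                          np.full(20, 5.0)])
n, d = np.array([0.0, 0.0, 1.0]), 5.0

# Second camera: a small rotation about the y axis plus a translation.
theta = 0.05
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
t = np.array([0.2, 0.0, 0.1])

def project(P, R, t):
    """Project 3D points into pixel coordinates for camera pose (R, t)."""
    x = (R @ P.T).T + t          # points in the camera frame
    x = x / x[:, 2:3]            # perspective division
    return (K @ x.T).T[:, :2]    # pixel coordinates

u1 = project(pts_3d, np.eye(3), np.zeros(3))
u2 = project(pts_3d, R, t)

# The homography induced by the plane: H = K (R + t n^T / d) K^-1.
H = K @ (R + np.outer(t, n) / d) @ np.linalg.inv(K)

# Mapping the first-view pixels through H reproduces the second view.
u1_h = np.column_stack([u1, np.ones(len(u1))])
mapped = (H @ u1_h.T).T
mapped = mapped[:, :2] / mapped[:, 2:3]
print(np.max(np.abs(mapped - u2)))  # tiny: H captures the 2D motion exactly
```

The eight free parameters of H (nine entries, up to scale) thus replace the per-point correspondences that dense SFM would otherwise require for this region.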
The proposed project will lead to a method that is simultaneously a powerful tool to “virtualize” buildings and urban scenes and a further step in the development of artificial vision systems. Constructing 3D scene descriptions suitable for virtual manipulation usually requires extensive human interaction. The usefulness of the proposed method lies in replacing this interaction with a procedure that recovers 3D models from a video clip fully automatically. The method is also a step toward more general artificial vision systems, because the piecewise planar assumption remains a valid approximation of the shape of the environment in broader scenarios.
The approach to be followed in this project is summarized in the following two steps:
Step i) From the video sequence, estimate the set of parameters describing the 2D motion of the image brightness pattern. The 2D displacement between two perspective views of the points that lie on a plane is given by a homography. The first part of the project will be devoted to the development of a new method to robustly estimate homographies from pairs of images.
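As an illustration of step i), the sketch below fits a homography with the standard direct linear transform (DLT) inside a RANSAC loop on synthetic correspondences. The point coordinates, inlier threshold, and iteration count are arbitrary choices for the example, and a production implementation would normalize the coordinates before the DLT for numerical stability; this is not the new estimation method the project proposes, only the classical baseline.

```python
import numpy as np

def dlt_homography(src, dst):
    """Estimate H (dst ~ H src) from >= 4 correspondences via the DLT."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The solution is the right singular vector of the smallest singular value.
    _, _, Vt = np.linalg.svd(np.asarray(A))
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]

def apply_h(H, pts):
    """Apply a homography to an (N, 2) array of points."""
    p = np.column_stack([pts, np.ones(len(pts))]) @ H.T
    return p[:, :2] / p[:, 2:3]

def ransac_homography(src, dst, iters=500, thresh=2.0, rng=None):
    """Robustly fit H by random 4-point sampling; returns H and inlier mask."""
    if rng is None:
        rng = np.random.default_rng(0)
    best_inliers = None
    for _ in range(iters):
        idx = rng.choice(len(src), 4, replace=False)
        H = dlt_homography(src[idx], dst[idx])
        err = np.linalg.norm(apply_h(H, src) - dst, axis=1)
        inliers = err < thresh
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Refit on all inliers of the best model.
    return dlt_homography(src[best_inliers], dst[best_inliers]), best_inliers

# Synthetic test: a known homography plus 20% gross outliers.
rng = np.random.default_rng(1)
H_true = np.array([[1.0, 0.02, 5.0],
                   [-0.01, 1.1, -3.0],
                   [1e-4, 0.0, 1.0]])
src = rng.uniform(0, 640, size=(100, 2))
dst = apply_h(H_true, src)
dst[:20] += rng.uniform(-50, 50, size=(20, 2))  # corrupt 20 matches

H_est, inliers = ransac_homography(src, dst)
```

The robust loop recovers the true homography from the 80 uncorrupted matches while rejecting the gross outliers, which is the behaviour a homography estimator must have on real image pairs.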
Step ii) Given the set of parameters describing the 2D motion, compute the 3D shape of the scene and the 3D motion of the camera. The second part of the project concerns solving this large non-linear problem by using linear subspace constraints that have proved effective in related problems.
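One linear subspace constraint relevant to step ii) is the known rank-4 property of plane-induced homographies: calibrated homographies H_k = R + t n_kᵀ/d_k generated by a single camera motion (R, t) and several planes (n_k, d_k), stacked as 9-vectors, span a linear subspace of dimension at most four. A minimal numpy check on synthetic data (all motion and plane parameters below are made up for illustration):

```python
import numpy as np

def rotation(ax, ay, az):
    """Rotation matrix from three Euler angles (x, then y, then z)."""
    cx, sx = np.cos(ax), np.sin(ax)
    cy, sy = np.cos(ay), np.sin(ay)
    cz, sz = np.cos(az), np.sin(az)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

rng = np.random.default_rng(2)

# One rigid camera motion observed through eight different scene planes.
R = rotation(0.02, -0.05, 0.03)
t = np.array([0.3, -0.1, 0.2])

rows = []
for _ in range(8):
    n = rng.normal(size=3)
    n /= np.linalg.norm(n)        # unit plane normal
    d = rng.uniform(2.0, 10.0)    # plane distance
    rows.append((R + np.outer(t, n) / d).ravel())

# Stack the eight homographies as 9-vectors and inspect the rank.
M = np.array(rows)
s = np.linalg.svd(M, compute_uv=False)
print(np.round(s, 8))  # only the first four singular values are significant
```

Exploiting such rank constraints turns part of the large non-linear shape-and-motion problem into linear algebra, which is the kind of structure the second part of the project intends to use.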