Acronym ARGUS
Funding Reference FCT - PTDC/EEA-CRO/098550/2008
URL http://users.isr.ist.utl.pt/~jsm/ARGUS/
Dates 2010-01|2013-06
Summary

Hundreds of cameras are deployed in public places (e.g., streets, shopping malls, airports). However, only a small fraction of the camera outputs (<5%) is regularly observed by humans. Most of the video information is stored or destroyed without being watched or processed. As a result, current surveillance systems are not able to detect abnormal events in real time.
Considerable effort has recently been devoted to developing algorithms that track objects (persons, vehicles) in video sequences and recognize activities and unusual behaviors. Despite this effort, the problem is still far from being solved. Most current techniques consider only isolated persons and try to characterize the activity based on the trajectory in the video sequence or on shape features (silhouette, gait, gestures). Shape parameters may be useful when the camera is close to the object being observed, but trajectories are more important when the object is far from the camera.
In the case of small/far objects, the surveillance system should be able to learn typical trajectories from the video signal and to compare new trajectories with the learned ones in order to recognize activities and detect unusual events. This can be done by clustering the observed trajectories into a set of typical clusters, each characterized by a “mean” trajectory or by a statistical model.
The comparison of trajectories is not a simple task. Some methods involve the temporal alignment of sequences (e.g., using dynamic time warping). Other methods avoid explicit alignment by resorting to generative models (e.g., dynamic Bayesian networks). In the latter approach, checking whether a new trajectory is well described by a model corresponds to computing its likelihood under that model. These generative models, however, are difficult to train and non-intuitive, since their structure and parameters have no physical meaning.
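As an illustration of the alignment-based approach mentioned above, the following is a minimal sketch of dynamic time warping between two 2-D trajectories. It is not taken from the project; the function name `dtw_distance`, the use of Euclidean point distances, and the representation of a trajectory as a list of (x, y) tuples are assumptions made for this example.

```python
import math


def dtw_distance(a, b):
    """Accumulated dynamic time warping cost between two 2-D trajectories.

    a, b: non-empty lists of (x, y) tuples (an assumed representation).
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])  # Euclidean point distance
            # Extend the cheapest of the three admissible predecessor alignments
            cost[i][j] = d + min(cost[i - 1][j],      # a advances, b repeats
                                 cost[i][j - 1],      # b advances, a repeats
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]
```

Because the alignment may repeat points, two trajectories that follow the same path at different speeds obtain a small cost, which is what makes this kind of comparison usable for clustering trajectories of different lengths.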
This project will develop a new representation for activity recognition which tries to overcome some of the previous difficulties. We assume that the trajectory (of a person, vehicle, or animal) is generated by a set of space-varying velocity fields learned from the video data. Different velocity fields correspond to different motion regimes. Consider, for example, the crossing of two streets. We may have one velocity field describing the pedestrian motion in the first street and a second velocity field in the second street. Switching between the driving fields is possible, and the switching probabilities are also space-varying (they depend on the person's position in the scene): it is possible to switch at the crossing, but not far from it. Therefore, switching is described by a field of space-varying stochastic matrices.
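The generative idea in this paragraph can be sketched as a small simulation of the street-crossing example. The sketch is illustrative only: the two constant fields, the crossing placed at the origin, the `radius` threshold, the 0.5 switching probabilities, the additive Gaussian noise, and all function names are assumptions made for this example, not the project's learned model.

```python
import random


def field_east(pos):
    """Hypothetical velocity field for the first street (motion along x)."""
    return (1.0, 0.0)


def field_north(pos):
    """Hypothetical velocity field for the second street (motion along y)."""
    return (0.0, 1.0)


FIELDS = [field_east, field_north]


def switching_matrix(pos, radius=1.5):
    """Space-varying stochastic matrix B(pos); row k gives P(next field | field k).

    Assumption: switching is only possible within `radius` of the crossing,
    which is placed at the origin.
    """
    x, y = pos
    if x * x + y * y <= radius * radius:
        return [[0.5, 0.5], [0.5, 0.5]]
    return [[1.0, 0.0], [0.0, 1.0]]  # far from the crossing: keep the field


def simulate(start, k0, steps, noise_std=0.05, seed=0):
    """Sample a trajectory and its sequence of active-field labels."""
    rng = random.Random(seed)
    pos, k = start, k0
    trajectory, labels = [pos], [k]
    for _ in range(steps):
        row = switching_matrix(pos)[k]
        # Draw the next active field from the row of the local matrix
        k = rng.choices(range(len(FIELDS)), weights=row)[0]
        vx, vy = FIELDS[k](pos)
        # Displacement = active velocity field at the current position + noise
        pos = (pos[0] + vx + rng.gauss(0.0, noise_std),
               pos[1] + vy + rng.gauss(0.0, noise_std))
        trajectory.append(pos)
        labels.append(k)
    return trajectory, labels
```

For example, `simulate((-5.0, 0.0), 0, 12)` starts a pedestrian on the first street heading toward the crossing; the active field can only change while the position is within the assumed radius of the origin.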
This model will be learned from a set of observed trajectories using the expectation-maximization (EM) algorithm. This is a natural choice, since we do not know which velocity field is driving the motion (the active field); these active field labels are thus treated as missing data. The estimation of the space-varying matrices will be performed in an information-theoretic framework using tools from differential geometry (natural gradient based on a Riemannian metric). This will be done in an unsupervised way. The number of models, the velocity fields and the field of switching matrices will all be learned from unlabeled data (video sequences).
The proposed model is intuitive and easily interpretable, since each velocity field can be observed and describes a different type of motion in the scene. This information can be used by the manager of the infrastructure to characterize the typical ways in which people move in that place. Furthermore, it is also a good starting point for activity recognition based on the sequence of active models (switching sequence) and on the computation of the sequence probability in order to detect abnormal behaviors.
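One way to read "computation of the sequence probability" is as a forward recursion over the hidden active-field labels, sketched below. This is an assumption-laden illustration, not the project's method: the isotropic Gaussian displacement noise with standard deviation `std`, the uniform prior over fields, the two hypothetical fields, and the switching rule around a crossing at the origin are all invented for the example.

```python
import math

NEG_INF = float("-inf")


def safe_log(p):
    return math.log(p) if p > 0.0 else NEG_INF


def logsumexp(values):
    m = max(values)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(v - m) for v in values))


def gaussian_logpdf_2d(residual, std):
    """Log-density of an isotropic 2-D Gaussian evaluated at `residual`."""
    rx, ry = residual
    return -(rx * rx + ry * ry) / (2.0 * std * std) - math.log(2.0 * math.pi * std * std)


def trajectory_loglik(traj, fields, switching_matrix, prior, std=0.1):
    """Log-likelihood of a trajectory, summing over all label sequences."""
    num_fields = len(fields)
    log_alpha = None
    for t in range(1, len(traj)):
        prev, cur = traj[t - 1], traj[t]
        # Log-probability of the observed displacement under each field
        emit = []
        for k in range(num_fields):
            vx, vy = fields[k](prev)
            residual = (cur[0] - prev[0] - vx, cur[1] - prev[1] - vy)
            emit.append(gaussian_logpdf_2d(residual, std))
        if log_alpha is None:
            log_alpha = [safe_log(prior[k]) + emit[k] for k in range(num_fields)]
        else:
            b = switching_matrix(prev)  # transitions depend on position
            log_alpha = [
                logsumexp([log_alpha[j] + safe_log(b[j][k]) for j in range(num_fields)])
                + emit[k]
                for k in range(num_fields)
            ]
    return logsumexp(log_alpha)


# Hypothetical scene: two constant fields, switching only near the origin.
example_fields = [lambda pos: (1.0, 0.0), lambda pos: (0.0, 1.0)]


def example_switching(pos, radius=1.5):
    near = pos[0] ** 2 + pos[1] ** 2 <= radius ** 2
    return [[0.5, 0.5], [0.5, 0.5]] if near else [[1.0, 0.0], [0.0, 1.0]]


typical = [(10.0 + i, 0.0) for i in range(6)]  # follows the first field
unusual = [(10.0 - i, 0.0) for i in range(6)]  # moves against it
score_typical = trajectory_loglik(typical, example_fields, example_switching, [0.5, 0.5])
score_unusual = trajectory_loglik(unusual, example_fields, example_switching, [0.5, 0.5])
```

Under these assumptions, a trajectory that moves against every field receives a much lower log-likelihood than one that follows a field, so thresholding the score is one possible way to flag unusual motion. The same forward quantities are also the kind of term an EM E-step would need when the active-field labels are treated as missing data.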
We also propose to develop an extension of the activity recognition system to the case of pairs of interacting pedestrians or even larger groups. The movement of each pedestrian will also be modeled using the set of velocity fields as before. The dependence between different pedestrians will have to be modeled since we can no longer assume that they are independent. We will model the pairs (triplets, etc.) of labels in order to describe common activities (e.g., walking together, pursuing).
Although the project is focused on human activity recognition, the proposed methods can also be applied in other contexts (e.g., surveillance of highways, analysis of the motion of isolated animals or groups of animals).
The project team has worked together in the past, and that work led to several joint publications. It includes 4 PhD researchers from 3 research institutes and will also involve post-graduate students. The team has significant experience in all the topics involved in the project: surveillance, object tracking, pattern recognition, computer vision, dynamical systems, and differential geometry.

Research Groups Signal and Image Processing Group (SIPG)
Project Partners Instituto de Telecomunicações (PT), INESC-ID Lisboa (PT)
ISR/IST Responsible
Jorge S. Marques
People
Jacinto Nascimento