Affordances are defined as action opportunities that an environment offers an agent. This concept, introduced by the American psychologist J. J. Gibson in the 1970s, provides a non-reductionist account of visual perception in animals by placing their motor capabilities at the core of their perceptual faculties. Beyond its profound impact on psychology and neuroscience, action-derived perception naturally facilitates the design of artificial agents that are expected to operate in unstructured and diverse environments; as a result, the computational modelling of affordances has become an important research topic in robotics.

Despite this potential, robotic perceptual systems designed around the concept of affordances still suffer from limited applicability and scope. The goal of this Thesis is to incrementally extend the computational formulation of affordances in humanoid manipulation tasks, in order to arrive at more general solutions. Ultimately, we strive to design an embodied artificial agent (i.e. a robot) that looks at a generic indoor scene and either knows how the objects in the scene would react to some of its actions, or can perform simple experiments and learn these affordances.

We start by identifying one of the important limitations of state-of-the-art computational models of object and tool affordances, namely their reliance on expert-defined or data-driven discretisation. Our first contribution is to propose alternative models that operate on the unprocessed, continuous data and achieve better performance. We then extend these models to quantify the uncertainty associated with their estimates and to scale up to larger datasets that we collected using the iCub humanoid robot.
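
As a generic illustration of this second point (not the model actually proposed in the Thesis; the descriptors and values below are made up), a Gaussian process regressor can map continuous, unprocessed interaction features to a predicted effect while also returning a per-query uncertainty estimate:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)

# Hypothetical data: rows are interaction trials, columns are continuous
# descriptors (e.g. object size, tool curvature, push angle); the target is
# an observed effect such as the induced displacement.
X = rng.uniform(0.0, 1.0, size=(200, 3))
y = 0.5 * X[:, 0] - 0.3 * X[:, 1] ** 2 + 0.05 * rng.standard_normal(200)

# RBF kernel over the continuous inputs plus a white-noise term;
# no discretisation of the inputs is required.
gp = GaussianProcessRegressor(
    kernel=RBF(length_scale=0.5) + WhiteKernel(noise_level=1e-2),
    normalize_y=True,
)
gp.fit(X, y)

# Each prediction comes with a standard deviation, i.e. a quantified uncertainty.
mean, std = gp.predict(np.array([[0.4, 0.7, 0.2]]), return_std=True)
print(f"predicted effect: {mean[0]:.3f} +/- {std[0]:.3f}")
```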

Our experiments indicated that large datasets are required for training scalable and flexible models, yet acquiring large interaction datasets with research robots is a costly and time-consuming endeavour. Robotic physics simulators can generate large datasets quickly and in a controlled way, but the discrepancy between the real and simulated domains hinders the application of models trained on synthetic data in the real world.

Once again, inspired by what Gibson considered the constituents of affordances, we investigate two sources of this discrepancy: 1) the difference between synthetic renders and real-world images, and 2) the material properties of real and simulated objects.
For the first case, our contribution is to show that, in a challenging object category detection scenario, one can significantly improve the accuracy of a model trained on a small real dataset by adding synthetic renders that are not necessarily photo-realistic, as long as the synthetic data incorporates enough variation in texture, lighting, size, and other visual properties.
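
The underlying idea, commonly known as domain randomisation, can be sketched as follows (a toy illustration only, not the rendering pipeline used in the Thesis; the crude "renderer" below is entirely hypothetical):

```python
import numpy as np

rng = np.random.default_rng(42)

def render_synthetic(img_size=128):
    """Render a crude 'object' on a random background, with randomised
    texture noise, global lighting and object size (all hypothetical)."""
    img = np.ones((img_size, img_size, 3)) * rng.uniform(0.0, 1.0, size=3)  # background colour
    size = int(rng.integers(16, 64))                    # randomised object size
    top, left = rng.integers(0, img_size - size, size=2)
    colour = rng.uniform(0.0, 1.0, size=3)              # randomised base colour
    texture = rng.normal(0.0, rng.uniform(0.02, 0.2),   # randomised texture strength
                         size=(size, size, 3))
    img[top:top + size, left:left + size] = colour + texture
    img *= rng.uniform(0.5, 1.5)                        # randomised global lighting
    return np.clip(img, 0.0, 1.0)

# A large, cheap synthetic set like this would be mixed with a small set of
# real images when training the detector.
synthetic_batch = np.stack([render_synthetic() for _ in range(8)])
print(synthetic_batch.shape)  # (8, 128, 128, 3)
```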

Regarding the material properties, we design and evaluate novel experiments in which the robot estimates consistent physical parameters of objects (mass and friction coefficient) from real and synthetic interactions.
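
The intuition behind such an estimation can be illustrated with a simple sliding model, m*a = F - mu*m*g, in which a linear fit of observed accelerations against applied push forces recovers both parameters. The sketch below is a back-of-the-envelope example with made-up numbers, not the estimator used in the Thesis:

```python
import numpy as np

g = 9.81
true_mass, true_mu = 0.35, 0.4                        # hypothetical ground truth

rng = np.random.default_rng(1)
forces = rng.uniform(2.0, 6.0, size=30)               # applied push forces [N]
accels = forces / true_mass - true_mu * g             # ideal sliding dynamics
accels += rng.normal(0.0, 0.2, size=accels.shape)     # measurement noise

# Least-squares fit: a = slope*F + intercept, with slope = 1/m and intercept = -mu*g
slope, intercept = np.polyfit(forces, accels, deg=1)
est_mass = 1.0 / slope
est_mu = -intercept / g
print(f"estimated mass: {est_mass:.2f} kg, friction coefficient: {est_mu:.2f}")
```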

Lastly, we consider defining effects merely as 2D displacements of objects to be a limitation. We propose novel deep-learning architectures that allow a robot to predict its whole visual sensory input as a function of its actions and the current state of the world (a forward model). We then show how the same architecture can be used to retrodict the initial state of the world in image space, given the current observation and knowledge of the executed action (a backward model). We demonstrate the application of this model in a planning scenario and compare it to similar state-of-the-art models.
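
For concreteness, a minimal action-conditioned encoder-decoder of this kind can be sketched as follows (an illustrative toy model, not the architecture proposed in the Thesis; the network sizes and action dimensionality are arbitrary):

```python
import torch
import torch.nn as nn

class ForwardModel(nn.Module):
    def __init__(self, action_dim=4):
        super().__init__()
        self.encoder = nn.Sequential(                       # 3x64x64 -> 64x8x8
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.fuse = nn.Linear(64 * 8 * 8 + action_dim, 64 * 8 * 8)
        self.decoder = nn.Sequential(                       # 64x8x8 -> 3x64x64
            nn.ConvTranspose2d(64, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, image, action):
        z = self.encoder(image).flatten(1)                  # image embedding
        z = self.fuse(torch.cat([z, action], dim=1))        # condition on the action
        return self.decoder(z.view(-1, 64, 8, 8))           # predicted next image

# A backward model could reuse the same structure, taking the current image and
# the executed action and predicting the initial image instead.
model = ForwardModel()
pred = model(torch.rand(2, 3, 64, 64), torch.rand(2, 4))
print(pred.shape)  # torch.Size([2, 3, 64, 64])
```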

As evidenced by experiments with real robots, throughout this Thesis we show how increasingly sophisticated solutions allow the knowledge about affordances to generalise to broader scenarios that become progressively more similar to humans' everyday environments.