Image classification is a fundamental Computer Vision task that has become increasingly relevant in our information-driven society. Despite the recent leap forward brought to the field by deep learning approaches, we identify two challenging problems that remain unresolved and are active research topics: multi-label image classification and classification in the presence of noisy data.
To make decisions, for instance when purchasing a product, people rely on rich and accurate descriptions, which entail multi-label retrieval processes. However, multi-label classification is challenged by high-dimensional and complex feature spaces and by its dependency on large and accurately annotated datasets. In today's large-scale online shopping applications, the daily insertion of new data requires an overwhelming annotation effort. Annotation is usually done by humans, which comes at a huge cost and yet yields high rates of noisy or missing labels.
In this thesis we focus on the classification of fashion images, using deep learning approaches to tackle the multi-class/multi-label problems in order to generate rich image descriptions. Fashion datasets are generally challenging because (1) they include a vast number of similar-looking images, and (2) they are annotated with a large diversity of attributes but with few labels per exemplar.
To address these issues we explore the use of domain knowledge to constrain the (otherwise completely data-driven) solutions. Specifically, we first show how to incorporate knowledge about the structure of the annotations. Secondly, we use context and semantic localization to guide an attention mechanism that shapes the feature space by focusing on visually meaningful regions.
Finally, considering that missing and noisy supervision seriously hinders the effectiveness of deep learning approaches, we develop a framework that is able to relabel noisy annotations by learning a sparse reconstruction of image features as a combination of very few correctly annotated images, thereby increasing the number of correct labels in multi-label datasets.
We show the performance gains achieved with all our contributions through thorough experimentation. Moreover, as a result of their design, the last two contributions afford model interpretability, a key aspect for inspection, validation and usability of deep learning models.
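As a rough illustration of the relabeling idea described above, the sketch below reconstructs an image feature vector as a sparse combination of features from correctly annotated images and transfers their labels. It is only a minimal sketch under stated assumptions, not the framework developed in the thesis: the L1-penalized least-squares objective, the ISTA solver, the weighted label vote, and the names `sparse_codes`, `relabel`, `lam` and `threshold` are all hypothetical choices made for illustration.

```python
import numpy as np

def sparse_codes(x, D, lam=0.1, n_iter=500):
    """Approximately solve min_w 0.5*||x - D w||^2 + lam*||w||_1 with ISTA.

    x: (d,) feature vector of a possibly mislabeled image.
    D: (d, n) matrix whose columns are features of trusted images.
    The objective and solver are assumptions made for this sketch.
    """
    step_bound = np.linalg.norm(D, 2) ** 2  # Lipschitz constant of the gradient
    w = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ w - x)            # gradient of the quadratic term
        z = w - grad / step_bound           # gradient step
        # Soft-thresholding drives most coefficients to exactly zero.
        w = np.sign(z) * np.maximum(np.abs(z) - lam / step_bound, 0.0)
    return w

def relabel(x, D, Y, lam=0.1, threshold=0.5):
    """Propose labels for x from the few trusted images that reconstruct it.

    Y: (n, c) binary label matrix of the trusted images.
    The weighted vote and the threshold are hypothetical choices.
    """
    w = np.abs(sparse_codes(x, D, lam))
    if w.sum() == 0:
        return np.zeros(Y.shape[1], dtype=int)  # nothing reconstructs x
    scores = (w / w.sum()) @ Y                  # label vote weighted by the codes
    return (scores >= threshold).astype(int)
```

In such a sketch, the non-zero entries of the code vector also indicate which trusted images support each proposed label, which is one way a sparse reconstruction can lend itself to inspection.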