Development and Learning of Kernels for Text and Images

Acronym	DELKETI
Name	Development and Learning of Kernels for Text and Images
Funding Reference	FCT - PTDC/EEA-TEL/72572/2006
Dates	2008-01\|2010-12
Summary	Kernel-based methods caused a true revolution in the theory and practice of machine/statistical learning, namely because they allowed adapting classical linear algorithms to the non-linear realm. Recently, kernel methods were further extended to non-vectorial data (strings, trees, graphs, images) with great success. A well known fact is that the performance of kernel-based methods depends heavily on the adequacy of the kernel to the particular problem at hand. This has caused a recent research trend, aiming at approaches which are able to learn “good” kernels directly from training data. This project proposes to contribute to the area of kernel development and learning, with special emphasis on kernels for structured data, such as text or images. More specifically, the proposed project will contribute in the following topics: 1. Universal dissimilarity/distance functions, based on Kolmogorov and Shannon information theories, have been recently proposed. In principle, these dissimilarities will enable defining “universal kernels”. We aim at developing techniques to approximate the universal dissimilarity for several types of structured data, such as text and images. This goal will be pursued by using the known ability of compression algorithms (of the Lempel-Ziv type, in the case of text) to approximate the universal distance. 2. “Kernelized” versions of most algorithms for statistical supervised, non-supervised, and semi-supervised learning tasks have been proposed (e.g., support vector machines, kernel Fisher discriminants, kernel principal component analysis). Accordingly, we will be able to plug the (approximate) universal kernels mentioned above into any kernel-based algorithm, thus extending their applicability to a wide range of structured data types. 3. Kernels implemented via (universal) compression algorithms are necessarily approximations to the theoretical universal kernels. In some cases, these compression-based kernels converge (asymptotically in the sequence length) to the theoretical counterpart, but it is not clearly understood how they behave for finite (small) samples. In this project we will study this problem, aiming to derive concentration bounds for these kernels. 4. An alternative to tailoring kernels to specific problems or using universal-type kernels is to develop algorithms that learn “good” kernels directly from training data. In this project, we will work along this research direction. One of the main goals here, will be to learn optimal combinations of universal kernels for heterogenous data (e.g., classification of web pages, which involve multiple data types: text, links, images). 5. At a final stage of the project, the several kernel-based methods mentioned above will be applied to address problems in text and image analysis. Namely we will target central language processing problems, such as categorization, disambiguation, clustering, and summarization, as well as image analysis tasks, such as image/video classification and content-based image retrieval.
Research Groups	Signal and Image Processing Group (SIPG)
Project Partners	Instituto de Telecomunicações (PT), Priberam Informática (PT)
ISR/IST Responsible	Pedro Aguiar