Summary 
Kernelbased methods caused a true revolution in the theory and practice of machine/statistical learning, namely because they allowed adapting classical linear algorithms to the nonlinear realm. Recently, kernel methods were further extended to nonvectorial data (strings, trees, graphs, images) with great success. A well known fact is that the performance of kernelbased methods depends heavily on the adequacy of the kernel to the particular problem at hand. This has caused a recent research trend, aiming at approaches which are able to learn “good” kernels directly from training data. This project proposes to contribute to the area of kernel development and learning, with special emphasis on kernels for structured data, such as text or images. More specifically, the proposed project will contribute in the following topics:
1. Universal dissimilarity/distance functions, based on Kolmogorov and Shannon information theories, have been recently proposed. In principle, these dissimilarities will enable defining “universal kernels”. We aim at developing techniques to approximate the universal dissimilarity for several types of structured data, such as text and images. This goal will be pursued by using the known ability of compression algorithms (of the LempelZiv type, in the case of text) to approximate the universal distance.
2. “Kernelized” versions of most algorithms for statistical supervised, nonsupervised, and semisupervised learning tasks have been proposed (e.g., support vector machines, kernel Fisher discriminants, kernel principal component analysis). Accordingly, we will be able to plug the (approximate) universal kernels mentioned above into any kernelbased algorithm, thus extending their applicability to a wide range of structured data types.
3. Kernels implemented via (universal) compression algorithms are necessarily approximations to the theoretical universal kernels. In some cases, these compressionbased kernels converge (asymptotically in the sequence length) to the theoretical counterpart, but it is not clearly understood how they behave for finite (small) samples. In this project we will study this problem, aiming to derive concentration bounds for these kernels.
4. An alternative to tailoring kernels to specific problems or using universaltype kernels is to develop algorithms that learn “good” kernels directly from training data. In this project, we will work along this research direction. One of the main goals here, will be to learn optimal combinations of universal kernels for heterogenous data (e.g., classification of web pages, which involve multiple data types: text, links, images).
5. At a final stage of the project, the several kernelbased methods mentioned above will be applied to address problems in text and image analysis. Namely we will target central language processing problems, such as categorization, disambiguation, clustering, and summarization, as well as image analysis tasks, such as image/video classification and contentbased image retrieval.
