LAMBDA creates a shared culture of research and innovation in crucial areas of Machine Learning and Data Mining, and focuses on a small number of critical applications in analysing complex data. These range from visual objects such as 3D models, to data with temporal aspects such as videos or time series, up to complex personas that combine various sources, such as text, images, audio and video data, aimed at the insurance and financial business. These types of data, multifaceted and heterogeneous, appear frequently in industrial applications and, when endowed with pairwise similarities, can be naturally interpreted as a geometric space. Under this interpretation, several important data analytic questions can be understood as geometric computational problems. A unique feature of LAMBDA is the mathematically rigorous geometric approach we bring into Machine Learning and Data Mining. The cross-fertilisation between the theory of algorithms, specifically geometric algorithms, on the one hand, and Machine Learning on the other, is a major source of interdisciplinarity.
Complex data are usually first mapped to points in a Euclidean space before being processed. This standard approach has two drawbacks. The first is an inherent complexity in representing the data: when the mapping is done in a straightforward way, the dimensionality of the target space is typically extremely high, which raises the issue of the curse of dimensionality. The second is that, while research in these areas is extensive and the current state of the art shows impressive results on standard academic benchmarks, the industrial sector struggles to catch up. This is mostly because these approaches are either too focused on specific domains or too generic, in which case their tuning requires a lot of expertise, especially when dealing with complex data.
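One symptom of the curse of dimensionality is that distances concentrate: the nearest and the farthest point from a query become almost equally far away as the dimension grows. The following is a minimal sketch of that effect on uniformly random points, not a LAMBDA method; the function name relative_contrast and all parameter values are illustrative assumptions.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """Return (farthest - nearest) / nearest for a random query.

    Points and query are drawn uniformly from the unit cube [0, 1]^dim.
    A small value means that all points look almost equally distant.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    nearest, farthest = min(dists), max(dists)
    return (farthest - nearest) / nearest


if __name__ == "__main__":
    # The contrast shrinks as the dimension grows.
    for dim in (2, 10, 100, 1000):
        print(dim, round(relative_contrast(dim), 3))
```

On data of this kind the contrast drops sharply between 2 and 1000 dimensions, which is why exact nearest-neighbour search loses much of its meaning and efficiency in a naive high-dimensional embedding.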
LAMBDA seeks to use diverse algorithmic tools in the setting of geometric data analysis, forging new connections between mathematics and machine learning. We expect this novel geometric approach in Machine Learning to improve the quality of results offered by our industrial partners, who are currently using “state-of-the-art” techniques. Specifically, to address the issues mentioned above, LAMBDA adopts and implements approximation algorithms for two major questions in large-scale data science, namely searching and learning in high dimensions. There are several connections between the two problems, which we plan to exploit on the algorithmic level as well as on the software development level. Our methods shall be validated on real-world data from our European industrial participants, namely AXA and 3DI.
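As an illustration of what an approximation approach to high-dimensional search can look like, the sketch below uses a Johnson-Lindenstrauss style random projection: points are mapped to a much lower dimension with a random Gaussian matrix, and the nearest neighbour is then searched in the reduced space. This is one well-known technique chosen for illustration, not necessarily the one used in LAMBDA; the names random_projection, project and nearest_index, and the dimensions used, are assumptions.

```python
import math
import random


def random_projection(dim_in, dim_out, seed=0):
    """Build a dim_out x dim_in Gaussian matrix scaled by 1/sqrt(dim_out).

    With this scaling, Euclidean distances are preserved in expectation.
    """
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(dim_out)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(dim_in)]
            for _ in range(dim_out)]


def project(matrix, vector):
    """Multiply the projection matrix by one vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def nearest_index(points, query):
    """Exact linear scan; cheap once the points live in a low dimension."""
    return min(range(len(points)), key=lambda i: math.dist(points[i], query))


if __name__ == "__main__":
    rng = random.Random(1)
    data = [[rng.random() for _ in range(500)] for _ in range(100)]
    query = [rng.random() for _ in range(500)]

    matrix = random_projection(500, 64)
    reduced = [project(matrix, p) for p in data]
    reduced_query = project(matrix, query)

    # The answer is approximate: it may differ from the exact neighbour.
    print("exact:", nearest_index(data, query))
    print("approximate:", nearest_index(reduced, reduced_query))
```

The trade-off is the usual one for approximation algorithms: the reduced search is cheaper, while the returned neighbour is only guaranteed to be close to optimal with high probability, up to a distortion that shrinks as the target dimension grows.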
LAMBDA’s first objective is software innovation, as a means to support a dynamic pipeline for transferring cutting-edge results from academia to European industry. Our software development shall be based on the highly optimised platform BIDMach, so as to ensure interoperability between modules, facilitate integration into the companies’ software, and promote dissemination of our methods. We realise our innovation activities by concretely applying them to two different domains, represented by the two industrial participants. The first concerns 3D shape retrieval and analysis, while the second concerns the management of, and the construction of models on, insurance data. Retrieval systems are software products designed for searching for information in data of a specific domain, most commonly text. We develop next-generation 3D shape retrieval technology, available through different platforms and benefiting various sectors. Our second major focus concerns complex data from insurance reports and financial transactions, as well as video and audio data. Such data are unstructured and require the development of adapted methods. One focus is on content-based retrieval of financial reports. Another is on pricing services under several constraints. For AXA it is moreover indispensable that our clustering and data mining software methods be highly scalable.
Our second objective concerns data (pre-)processing and curation. We employ methods for mining large text corpora and for extracting knowledge from unstructured data. We take into account privacy concerns and ensure that confidential data are protected. To that end we will use anonymisation techniques, as well as the generation of synthetic datasets that share similar statistical properties with the original data. We shall make our (non-confidential) datasets publicly available; one of our goals is to create open benchmark datasets.
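To make the idea of a synthetic dataset with similar statistical properties concrete, here is a deliberately simple sketch: it estimates the mean and standard deviation of each numeric column and samples new rows from independent Gaussians. This is an assumption made for illustration, not the project's method; it ignores correlations between columns and gives no formal privacy guarantee. The function names and the example rows are hypothetical.

```python
import random
import statistics


def fit_column_stats(rows):
    """Estimate (mean, standard deviation) for every numeric column."""
    columns = list(zip(*rows))
    return [(statistics.mean(c), statistics.stdev(c)) for c in columns]


def sample_synthetic(stats, n_rows, seed=0):
    """Draw rows whose columns are independent Gaussians with those stats."""
    rng = random.Random(seed)
    return [[rng.gauss(mu, sigma) for mu, sigma in stats]
            for _ in range(n_rows)]


if __name__ == "__main__":
    # Hypothetical records: (age, yearly premium). Not real data.
    original = [[30, 1200.0], [45, 1500.0], [52, 900.0], [38, 1100.0]]
    stats = fit_column_stats(original)
    synthetic = sample_synthetic(stats, n_rows=5)
    for row in synthetic:
        print([round(value, 1) for value in row])
```

A realistic pipeline would also need to model dependencies between attributes and to check that no synthetic record is close enough to an original one to reveal it.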
An important aspect is dissemination and networking. By strengthening existing links and creating new synergies, we shall support this two-way knowledge sharing far beyond the lifetime of LAMBDA, with the goal of commercialising some of our results and of launching a wider follow-up project. LAMBDA and the consortium built by this Project shall stay at the scientific and technical forefront by means of a strong international dimension, relying on three partners from the USA. Last but not least, training is a related aspect, critical for the younger members of the consortium, including the ESRs. LAMBDA establishes various training structures aimed at educating the new generation in machine learning, namely local courses, general workshops and focused meetings.
WP1: Rigorous methods and software
Design of general methods, for fundamental problems needed throughout the Project, which possess rigorous guarantees regarding their performance; prototype implementation. Data anonymisation, synthetic data, generation and collection of data for benchmarking, open data. Integration of software acceleration techniques, use of BIDMach as the software platform of LAMBDA.
WP2: Retrieval and Shape analysis
Efficient and compact representation of complex objects, including shapes with attributes, in high-dimensional space. Data structures and methods for efficient search and retrieval. New methods for matching and analysing 3D shapes.
WP3: Unsupervised and semi-supervised learning
Dimension reduction is employed for clustering of complex data. Data mining, in particular text mining, developed on top of BIDMach. BIDMach’s modules adapted to industrial needs, including image analysis in the insurance business, and analysis of car traffic.
WP4: Training and Dissemination
Offer intersectoral training, and create awareness of the role of businesses in technology and of the contribution of research organisations to innovation. International training at world-leading institutions. Technology transfer to and exploitation by the industrial participants; transfer of knowledge to the wider industrial community. Dissemination of LAMBDA’s results to the scientific community. Communication of scientific and technological advances to the general public.
WP5: Project Management
Consortium management according to the declared goals, within the given time frame and resources: financial management and distribution of funds; supervising the dissemination of results; overseeing and aligning the research activities with the Project’s goals, and promoting collaboration between members; coordinating reporting, self-evaluation and communication with the EC; managing risks and addressing ethical and gender issues.
WP6: Ethics requirements
This work package sets out the ‘ethics requirements’ that the project must comply with.