The Trentino Knowledge Base
The Trentino Knowledge Base project aims at constructing a large knowledge base of facts about the Trentino territory and its demography, dubbed TNKB. The goal of the TNKB is to serve as a platform for the development of services of interest for the public administration and other parties.
While the range of services that can be built on it is large, the project focuses on three:
- Evasion detection, as seen through the lens of anomaly detection and statistical rule mining.
- Decision support, viewed as constructive preference elicitation with extensive background knowledge and logical constraints.
- Intelligent development, as a form of large-scale constructive learning with distributed multi-user feedback.
See below for more details. These services all share the potential for a large benefit to the territory.
All the software produced within the project is (or will be made) available under open source licenses. See below for links to the sources.
The Knowledge Base
Target data sources include:
- Data from the public administration (registry office, cadastre, etc.).
- Open Data such as geographical and topographical repositories, currently OpenStreetMap.
- Social networks, currently Google+ and Twitter.
The TNKB leverages Semantic Web technologies and standards for maximum interoperability, pairing them with thorough data cleaning and data linking practices.
Note: due to its sensitive content, the TNKB is only available to selected third parties.
Evasion Detection
Tax evasion is an ancient practice and accounts for a large portion of the GDP of contemporary nations – especially Italy. Our goal is to design an automated system able to identify undutiful citizens among the rest.
Due to its illicit nature, the occurrence of tax evasion is kept well hidden, sometimes through ingenious processes. The availability of large-scale aggregated demographic repositories, such as the TNKB, enables the development of methods that capture statistical regularities (or patterns) that characterize legitimate versus illicit tributary behaviors.
Several factors add to the difficulty of the problem:
- Few manually annotated examples are available, if any: the majority of tax evaders are bound to be unknown.
- Irreversible annotation errors in the data make it hard to estimate statistics about the wealth a citizen owns and the taxes a citizen owes.
- The normality model must be able to deal with numerical quantities, for instance monetary amounts.
From an Artificial Intelligence perspective, it is fruitful to cast the detection problem as a specific instance of anomaly detection. The idea is to mine from the TNKB a model of normality that characterizes the behavior of the majority of citizens. Citizens whose behavior is poorly explained by the model are deemed candidate evaders and flagged for further inspection.
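As a minimal illustration of the "model of normality" idea, the sketch below scores each citizen by how far a single numerical quantity lies from the majority, using the median and the median absolute deviation (MAD). This is a generic robust outlier test, not the mining method developed in the project. The record fields (`declared_income`, `estimated_wealth`), the sample data, and the threshold of 3.5 are all assumptions made for the example.

```python
from statistics import median

def robust_scores(values):
    """Score each value by its distance from the median, in MAD units."""
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        return [0.0] * len(values)  # no spread: nothing stands out
    # 0.6745 rescales the MAD so scores are comparable to z-scores under normality
    return [0.6745 * (v - center) / mad for v in values]

def flag_candidates(records, threshold=3.5):
    """Return the citizens whose income/wealth ratio is poorly explained by the majority."""
    ratios = [r["declared_income"] / r["estimated_wealth"] for r in records]
    scores = robust_scores(ratios)
    return [r["citizen"] for r, s in zip(records, scores) if abs(s) > threshold]

# Hypothetical records; a real system would draw these from the knowledge base.
records = [
    {"citizen": "A", "declared_income": 30000, "estimated_wealth": 100000},
    {"citizen": "B", "declared_income": 28000, "estimated_wealth": 100000},
    {"citizen": "C", "declared_income": 33000, "estimated_wealth": 110000},
    {"citizen": "D", "declared_income": 25000, "estimated_wealth": 90000},
    {"citizen": "E", "declared_income": 31000, "estimated_wealth": 100000},
    {"citizen": "F", "declared_income": 2000, "estimated_wealth": 400000},
]

flagged = flag_candidates(records)
print(flagged)  # citizens flagged for further inspection
```

Because the median and MAD are computed from the data itself, the notion of "normal" is re-estimated whenever the data changes, and a flagged citizen is only a candidate for inspection, not a confirmed evader.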
This approach is more flexible than the standard measures used to assess fraud, as the normality model is mined from the knowledge base in an automated fashion, so as to explain the tributary behavior of the majority of the citizens in the knowledge base in a given time frame and under the laws in force at that time. This is necessary due to the dynamic nature of undutiful behavior.
We have been designing and experimenting with automated mining methods for extracting the model of normality. For a first peek at our findings, see:
- Stefano Teso, Andrea Passerini – Inducing Sparse Programs for Learning Modulo Theories, In NIPS Workshop on Constructive Machine Learning 2015.
More will come in the upcoming months!
Decision Support
A human decision maker (DM) is tasked with deciding among a large number of complex alternatives, possibly having uncertain outcomes. Since the DM has limited resources (e.g. memory, attention, knowledge), she may be overwhelmed by information overload and make erroneous, suboptimal decisions.
Instances of this scenario arise, for example, in the medical domain (which treatments should be suggested to a patient affected by certain symptoms?) and the administrative domain (which shop in a town should be selected for a fiscal check?).
At a high level, the idea of automated decision support systems (DSSs) is to simplify the decision making process by focusing the attention of the DM on a few promising alternatives. DSSs have been studied for decades and successfully applied to several practical areas of interest, for instance healthcare and business intelligence, and have recently found their way to distributed online sales with recommender systems and automatic advertising.
In a public administration setting, it is unrealistic to expect a fully automated decision making procedure to make perfectly acceptable decisions without any kind of supervision. Our goal is rather to develop an interactive decision support system, designed around the intervention of an expert decision maker, i.e. the public administrator.
Our current take on the subject is based on preference elicitation, an umbrella term encompassing a number of different methods for learning the preferences of the DM in an interactive fashion.
We are particularly interested in leveraging the information held by the TNKB as prior knowledge (e.g. to capture the online or tributary behavior of citizens) and/or as factual constraints on the feasible decision space (e.g. some decisions may be meaningless given the facts reported in the TNKB). Additionally, constraints can be used to mimic the application of regulations in force.
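To make the interaction loop concrete, here is a minimal sketch of preference elicitation over a constrained decision space. It is not the project's algorithm: it uses plain perceptron-style updates on a linear utility, chosen only because they are short to write down. The decision space (number of shops to check, number of inspectors), the `feasible` rule, and the simulated decision maker `dm_prefers` with its hidden weights are all hypothetical.

```python
from itertools import product

def feasible(option):
    """Hypothetical hard constraint: each inspector can check at most two shops."""
    shops, inspectors = option
    return shops <= 2 * inspectors

def utility(weights, option):
    """Linear utility of an option under the current weight estimate."""
    return sum(w * x for w, x in zip(weights, option))

def elicit(options, prefers, max_queries=100):
    """Learn utility weights from pairwise queries answered by the decision maker."""
    weights = [0.0] * len(options[0])
    queries = 0
    best = options[0]
    while queries < max_queries:
        best = max(options, key=lambda o: utility(weights, o))
        corrected = False
        for challenger in options:
            if challenger == best:
                continue
            queries += 1
            if prefers(challenger, best):
                # The DM disagrees with the model: move the weights toward the challenger.
                weights = [w + c - b for w, c, b in zip(weights, challenger, best)]
                corrected = True
                break
        if not corrected:
            break  # the DM prefers the recommendation to every alternative
    return best, queries

def dm_prefers(a, b):
    """Simulated DM with hidden preferences: more shops checked, fewer inspectors used."""
    hidden = (3.0, -4.0)
    return utility(hidden, a) > utility(hidden, b)

# Only decisions that satisfy the constraint are ever shown to the DM.
options = [o for o in product(range(5), range(1, 3)) if feasible(o)]
recommendation, num_queries = elicit(options, dm_prefers)
print(recommendation, num_queries)
```

The constraint filters the decision space before any query is asked, so the DM never has to reject a meaningless alternative. Each answer either confirms the current recommendation or corrects the learned weights.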
Current state-of-the-art systems are based on costly probabilistic (more specifically, Bayesian) algorithms, which can hardly scale to decision spaces of realistic sizes. Further, they do not support logical constraints on the decision space.
We have designed an efficient preference elicitation system, dubbed setmargin, that scales much better than Bayesian alternatives while offering comparable (or better) performance. For a detailed explanation, see:
- Stefano Teso, Andrea Passerini, Paolo Viappiani – Constructive Preference Elicitation by Setwise Max-margin Learning, In IJCAI 2016.
- Stefano Teso, Paolo Dragone, Andrea Passerini – Structured Feedback for Preference Elicitation in Complex Domains, BeyondLabeler workshop at IJCAI 2016.
The next steps involve extending setmargin to more complex scenarios. More information will be released as our new results mature.
Intelligent Development
More info will be published shortly. Stay tuned!
Publications
- Stefano Teso, Andrea Passerini, Paolo Viappiani – Constructive Preference Elicitation by Setwise Max-margin Learning, In IJCAI 2016.
- Stefano Teso, Andrea Passerini – Inducing Sparse Programs for Learning Modulo Theories, In NIPS Workshop on Constructive Machine Learning 2015.
Software
Currently, the following packages are available:
- pylmt: an implementation of Learning Modulo Theories for learning in hybrid relational domains.
- setmargin: an implementation of constructive preference elicitation via setwise max-margin learning.
- frm: an implementation of rule mining in feature space. Work in progress!
Acknowledgments
The TNKB project is a joint effort of:
- University of Trento, Deep and Structured Machine group.
- Fondazione Bruno Kessler, Data & Knowledge Management unit.
- Okkam s.r.l.
The TNKB project is co-financed by Fondazione Cassa di Risparmio di Trento e Rovereto (CARITRO).