Portale per l'Accesso alle Risorse Linguistiche per l'Italiano
PARLI DIT-PRJ-10-029
Status NOT active project
DISI role Partner
Project type Research Project
Dimension National
Acquisition date 2010-01-20
Start date 2010-01-20
End date 2012-01-20
Project details
Project abstract In a society that is increasingly knowledge-oriented, it is of paramount importance that the information present in electronic documents (especially on the Internet) be accessible in an automatic and effective way. Most of those documents, however, have textual content: their main part consists of texts written in some "natural" language (such as Italian or English). Making this information accessible therefore requires software tools for accessing and indexing it; these tools must be able to process the textual content and must go beyond simple keyword search, whose limitations are well known.
Research in the field of Natural Language Processing has led, in recent years, to the development of text processing methodologies that range from simple lexical labeling to more complex operations such as text categorization and query answering. In the near future, the growth of this area appears to be of fundamental importance, from both the scientific and the economic point of view. Gartner Group, one of the main firms tracking the development of the IT market (http://www.gartner.com/), lists among the 7 "Grand Challenges" of IT for the next 25 years "non-tactile natural interfaces", which concern the study of "natural language processing, which include speech synthesis, speech recognition, natural language understanding, natural language generation, machine translation and translating one natural language into another" (April 2008).
Since linguistic technologies are closely tied to the specific language under examination, it is essential that the resources existing for Italian be carefully monitored and that their harmonious growth be coordinated at the national level.
By "resource" we mean, here and throughout the project, both static resources (collections of texts labeled from the syntactic and/or semantic point of view, and linguistic knowledge bases such as dictionaries and grammars) and dynamic resources (syntactic analyzers, semantic interpreters, extractors of specific types of information, etc.). The importance of the dynamic tools is obvious; as for the static ones, their relevance is twofold: first, they provide evidence of the language "in use", rather than the language as encoded in a grammar; second, most knowledge about language is today learned automatically from annotated textual corpora. While a reasonable amount of such corpora exists for English, the situation for other languages, and for Italian in particular, is more critical.
The present project intends to develop a portal enabling users to access the linguistic resources existing for Italian. During the implementation of the portal, a substantial scientific activity will be the comparative study of the annotation formats of the currently existing resources and the design of methods for their development and extension.
In particular, the project will be organized in work packages whose objectives, beyond the implementation of the portal, are:
- The study of the principles and annotation schemes of the existing resources, with the proposal of mapping methods that make them more homogeneous
- The study of the software tools currently available for Italian and of their characteristics in terms of tasks and linguistic coverage
- The implementation of tools that support the creation of new annotated resources, based on cooperative annotation methods
- The use of these tools to extend the existing resources
- The comparison and evaluation of Natural Language Processing tools in monitored contests
The overview of the existing material will take into account the presence, at the international level, of various initiatives addressing the collection and distribution of resources. In most cases, however, they manage proprietary data whose access is governed by licenses, and this reduces their usability for research purposes. By contrast, the resources created within the PARLI project or made available by the partner Units will be freely distributed. The PARLI Units have already tried to move in this direction in the past, but the absence of central coordination, such as that proposed in this Project, led to collections of data and software of great scientific value but with heterogeneous formats that prevented them from being easily aggregated. The Project therefore intends to overcome these limits, with the aim of providing the scientific community with a first core of resources that can be enriched in the future through the contributions of other research centers and institutions.
Keywords ARTIFICIAL INTELLIGENCE, NATURAL LANGUAGE PROCESSING, LINGUISTIC RESOURCES
Funding 50 €
Partners
- DIT - UniTN
- University of Turin
- University of Pisa
- University of Rome
- University of Venice
DISI Sub-project details
Project abstract Thanks to its collaboration (described in the initial proposal) with the CLIC group of Cognitive Sciences in Rovereto, UNITN will be able to produce a coreference resolution (CR) system for Italian. As described in the initial proposal, the BART CR system, originally designed for English, will be adapted.
To this end, the Texpro system (a system for extracting complex nominals, e.g. proper names), provided by the Trento research institute FBK, will be used to generate coreference hypotheses on Italian texts, on which BART can then be trained. Since some important features used by BART's machine learning model are extracted from the syntactic trees of the sentences in the text under consideration, syntactic parsers are required. These will be the core of the collaboration with the other units of the PARLI project, which will produce parsers usable by BART.
In addition, annotated corpora will be needed to train the coreference system for Italian. We will therefore use two corpora: the LiveMemories project corpus for Italian (based on Wikipedia articles and blogs annotated with nominal entities and coreference) and the VENEX corpus (based on articles extracted from Repubblica), created together with the Venice unit (it will be converted to the MMAX2 format used by BART).
Another interesting part of UNITN's work will concern, as described in the project proposal, the production of additional CR annotations by exploiting the game Phrase Detectives for Italian.
From the machine learning point of view, kernel-based techniques will be studied to implement and improve the automatic systems described above.
Funding 10000 €
Manager Alessandro Moschitti
Participating RP

