CSI Research and Internal Development
Start date
End date
Project type
Smart Data Platform

The goal of this project is to develop solutions for processing and analysing textual content from different streams and channels. Such content can include, for example, documents produced by internal processes or received from citizens and businesses, e-mails sent by citizens via certified or non-certified mail, Web and social network texts, internal e-mails, and portal access logs.

This research activity has two objectives:

  • Identify and implement the techniques, methodologies and technologies needed to enrich the logic in Yucca - Smart Data Platform (SDP), in line with the available Big Data framework.
  • Verify the feasibility of the identified use cases, including in particular:

    • Indexing of textual and documentary content in the Big Data Hub.
    • Full-text and indexed-content search.
    • Semantic processing of full-text searches.
    • Extraction of information and textual content from documents.
    • Text mining on content from digital documents or textual data streams.
    • Automatic classification of texts applied to e-mails that administrations receive from taxpayers.
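
To make the indexing and search use cases above more concrete, the following is a minimal sketch of an inverted index with facet counts, written in plain Python. It is not the project's implementation, which targets a Big Data cluster; the class `TinyIndex`, the `channel` facet and the sample documents are hypothetical and serve only to illustrate the idea.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"\w+", text.lower())


class TinyIndex:
    """Toy in-memory inverted index with one facet value per document (hypothetical)."""

    def __init__(self):
        self.postings = defaultdict(set)  # token -> ids of documents containing it
        self.facets = {}                  # document id -> facet value (source channel)

    def add(self, doc_id, text, channel):
        """Index the tokens of a document and remember its facet value."""
        for token in tokenize(text):
            self.postings[token].add(doc_id)
        self.facets[doc_id] = channel

    def search(self, query):
        """Return ids of documents that contain every query token (AND semantics)."""
        tokens = tokenize(query)
        if not tokens:
            return set()
        result = set(self.postings.get(tokens[0], set()))
        for token in tokens[1:]:
            result &= self.postings.get(token, set())
        return result

    def facet_counts(self, doc_ids):
        """Count the matching documents per facet value, as used for faceted navigation."""
        return Counter(self.facets[doc_id] for doc_id in doc_ids)


# Invented sample documents, one per channel.
index = TinyIndex()
index.add(1, "Request for car tax refund", "certified e-mail")
index.add(2, "Car tax payment reminder", "e-mail")
index.add(3, "Portal access log summary", "log")

hits = index.search("car tax")
print(sorted(hits))                       # [1, 2]
print(dict(index.facet_counts(hits)))     # one hit per e-mail channel
```

A production search engine would add ranking, stemming and distributed storage; the sketch only shows how a token-to-document mapping supports both full-text lookup and facet counts.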

Specifically, this research activity is aimed at:

  • Testing the feasibility, potential, advantages, limitations and implications of using Big Data technologies to store and process unstructured data, in particular digital documents and textual content.

  • Developing one or more prototypes, based on scalable open source technologies, that provide both a full-text search engine over textual content and text analytics, including innovative approaches such as machine learning and advanced text mining techniques.

  • Extracting information and analysing data from documents and textual content previously treated as closed boxes that could not be used to understand phenomena.
  • Building a search engine with semantic functionality on a Big Data platform.
  • Exploiting a cluster of distributed, open source, low-cost systems (Hadoop) for indexing and analysing documents and text.
  • Acquiring skills in machine learning techniques for the automatic classification of digital content and for sentiment analysis.
  • Full-text search engine based on Big Data technologies.
  • Advanced indexing for faceted navigation of results.
  • Classification of e-mail texts using machine learning.
  • Creation of two prototypes: a search engine and text analytics on e-mails about taxation issues.
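
As an illustration of automatic e-mail classification, the sketch below trains a small multinomial Naive Bayes classifier with add-one smoothing in plain Python. The source does not state which learning algorithm the prototype used, so Naive Bayes is an assumption chosen for brevity; the labels `refund` and `exemption` and the training sentences are invented.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"\w+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.class_docs = Counter()              # label -> number of training e-mails
        self.word_counts = defaultdict(Counter)  # label -> token -> occurrences
        self.vocabulary = set()                  # all tokens seen during training

    def train(self, text, label):
        self.class_docs[label] += 1
        for token in tokenize(text):
            self.word_counts[label][token] += 1
            self.vocabulary.add(token)

    def classify(self, text):
        """Return the label with the highest log-probability for the text."""
        total_docs = sum(self.class_docs.values())
        tokens = tokenize(text)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            score = math.log(n_docs / total_docs)  # class prior
            total_words = sum(self.word_counts[label].values())
            denominator = total_words + len(self.vocabulary)
            for token in tokens:
                count = self.word_counts[label][token]
                score += math.log((count + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented training e-mails about car tax, two per hypothetical category.
classifier = NaiveBayes()
classifier.train("I paid the car tax twice and ask for a refund", "refund")
classifier.train("Please refund the amount paid in excess", "refund")
classifier.train("I would like an exemption for my historic vehicle", "exemption")
classifier.train("Request for exemption due to disability", "exemption")

print(classifier.classify("refund of car tax paid twice"))
print(classifier.classify("exemption for a historic vehicle"))
```

With a realistic volume of e-mails, a library implementation and a held-out evaluation set would be preferable; the sketch only shows the principle of learning word statistics per category.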


  • YUCCA - Smart Data Platform: search function on the metadata of the services and on the data displayed in the Web Store.
  • Enhancement of the regional health information assets through the use of the SDP for better regional governance.
  • Text analytics for automatic classification of e-mails on car tax.
  • Document management systems.
  • Infodir evolution.
  • Use of 100% open source technologies.
  • Faceted search indexing.
  • More skills gained in the integration and analysis of textual content on Big Data platforms.