Distributing Text Mining tasks with librAIry

We present librAIry, a novel architecture to store, process and analyze large collections of textual resources, integrating existing algorithms and tools into a common, distributed, high-performance workflow. Available text mining techniques can be incorporated into the framework as independent plug&play modules that work collaboratively. In the absence of a pre-defined flow, librAIry relies on the aggregation of operations executed by different components in response to an emergent chain of events. Linked Data (LD) and Representational State Transfer (REST) principles are used extensively to provide individually addressable resources from textual documents. We describe the architecture design and its implementation, and test its effectiveness in real-world scenarios such as collections of research papers, patents or ICT aids, with the objective of providing solutions for decision makers and experts in those domains. Major advantages of the framework and lessons learned from these experiments are reported.


INTRODUCTION
Given the huge amount of textual data that is produced or captured daily in any imaginable domain, it becomes crucial to provide mechanisms for programmatically processing this raw data so we can make sense of it: discarding noisy, non-relevant information and keeping only the data that can bring value to the agents involved (general consumers, experts, companies, investors…). While some specific tools already allow for advanced sense-making operations, others opt for composing a solution where different analysis techniques are integrated under a uniform data schema. However, this integration involves significant efforts on reconciling data sources, coordinating processing operations, and efficiently exploiting results from the execution of those techniques. There is a need for a more flexible paradigm where tools and algorithms for textual document analysis, from different programming languages and technologies, can operate independently and collaboratively, creating a common document-oriented workflow through their actions.
In the context of scientific publications, the personalized recommendation of research papers based on their content is a key novel feature for performing a smart selection of relevant resources over very large collections of scientific content. From the set of values and different attributes extracted from the papers, and by generating advanced knowledge models about the information they contain, we can bridge across the different relevant pieces of information and allow users to navigate them in a more efficient and powerful way. This knowledge about a specific document is frequently acquired by different techniques, each focused on revealing certain aspects of it, that are later combined to achieve one particular task.
The architecture presented in this paper aims to ease the way different software modules work together and lays the foundation for efficiently processing big volumes of textual documents in a distributed, decoupled manner.

RELATED WORK
The annotation of human-readable documents is a well-known problem in the Artificial Intelligence domain in general and in the Information Retrieval and Natural Language Processing fields in particular.
There already exists a broad set of tools and frameworks able to analyze text for automatically producing such annotations, at very different levels of granularity: from minimal units such as terms and entities, to descriptors at the level of the entire collection such as topics or summaries. For example, the StanfordNLP framework [7] supports operations such as part-of-speech (PoS) tagging or Named Entity Recognition in various languages. Others, like Mallet or SparkLDA, perform topic modeling and clustering. The system we propose looks at the transversal problem of making those standalone tools coexist under the same solution. Being able to effectively integrate them under a common ecosystem helps to seamlessly obtain different kinds of annotations and boosts the way those solutions can make sense of document collections.
Certain systems in the research and industrial communities have already integrated some of the annotation tools introduced above. For example, [2] works with records from the biomedical domain, where robustness and high precision are prioritized. Therefore they rely on techniques supported by the GATE framework, which widely supports hand-crafted, domain-specific techniques such as rules or finite state transducers. On the other side of the spectrum we find [6], where the authors try to annotate text from a much noisier, sparser and more error-prone medium: a tweet stream. Therefore they do not rely on any linguistic feature, due to the unpredictable way short social media posts are written. We observe how each of those examples has very specific needs and relies on certain annotation tools in order to accomplish the tasks it was originally created for. In both systems the involved components are highly coupled, so they cannot be easily extended to accommodate complementary annotation tools or alternative modules. In contrast, librAIry advocates loosely interconnected components that make the architecture more reusable and expandable in other systems across domains.
One crucial problem regarding the re-usability and expansion possibilities of those systems, and of the tools they rely on, is the language they have been developed in. For example, Mallet uses Java, but others like spaCy are Python-based. To the best of our knowledge, there have not been any significant efforts on reconciling such a heterogeneous set of tools into a single architecture, thereby minimizing the engineering effort and maximizing the scalability of the system so it can be applied to very different domains and textual annotation tasks.
In addition, available annotation systems rely on certain storage solutions that are suited for some tasks but are less adequate for others. For example, [5] uses a relational database (MySQL) to ensure reliability and speed in managing the indexed information. In [8], the authors rely on the Virtuoso triple-store to provide native graph operations over the data. But new requirements may be considered for those systems, so different storage needs can come into play. For example, column-oriented databases (Cassandra) can help to better handle high-volume queries on specific data fields. The same goes for text-oriented indexes such as ElasticSearch, which can provide customized text-based search operations over the available information. librAIry directly supports the coexistence of different storage solutions, so it can be agnostic to the kind of underlying storage modules implemented. Thanks to the distributed nature of the proposed architecture, different databases can be synchronized under the same common environment, working together to store and deliver results in a more efficient manner.

LIBRAIRY
librAIry is a framework where different text mining tools, available in various languages and technologies, can operate in a distributed, high-performance and isolated manner, creating a common workflow through their actions. Instead of working towards a pre-defined sequence of actions, synchronization across modules is achieved through the aggregation of the operations executed by them in response to an emergent chain of events. This raises both technical and functional challenges to coordinate multiple executions. From the technical point of view, isolated environments and communication mechanisms are provided so that initially dissimilar tools can be executed with maximum guarantees. From the functional point of view, all executions are coordinated to reach a final result as the aggregation of the partial results derived from each execution.

Functional Features
The architecture is articulated around three main concepts: (1) the resource, such as a document, a part of a document, or a domain; (2) the actions performed over them: create, update or delete a resource; and (3) the new state that is reached by the resource after an action is performed, such as created, updated or deleted. An event is a message containing details about those three aspects, published on a shared event-bus available to all the modules deployed in the framework. This, in turn, allows any module to perform actions on one or more resources in response to a new state reached by a given resource. Actions are executed in parallel from distributed environments.
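The three concepts above can be sketched as a small message type. This is a minimal illustration, not the framework's actual message format: the class name `Event`, its field names and the example URI are assumptions; only the actions (create, update, delete) and the resulting states come from the text.

```python
from dataclasses import dataclass

# Mapping from an action to the state the resource reaches afterwards,
# as listed in the text.
STATE_AFTER = {"create": "created", "update": "updated", "delete": "deleted"}


@dataclass(frozen=True)
class Event:
    """Hypothetical event message: which resource, and which state it reached."""
    resource_type: str   # e.g. "document", "part", "domain"
    resource_uri: str    # identifier of the affected resource
    state: str           # "created", "updated" or "deleted"

    def __post_init__(self):
        if self.state not in STATE_AFTER.values():
            raise ValueError(f"unknown state: {self.state}")

    @classmethod
    def from_action(cls, resource_type, resource_uri, action):
        """Build the event that reports the outcome of an action."""
        if action not in STATE_AFTER:
            raise ValueError(f"unknown action: {action}")
        return cls(resource_type, resource_uri, STATE_AFTER[action])


# Example: a document was created (the URI is a placeholder).
event = Event.from_action("document", "http://example.org/documents/1", "create")
```

Keeping the message immutable (`frozen=True`) reflects that an event reports something that has already happened; modules react to it rather than modify it.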
3.1.1 Resources. Two main kinds of resources are considered: those derived from external sources, such as (1) documents from textual files (e.g. a research paper), (2) parts from logical divisions of a document (e.g. rhetorical classes or sections), and (3) domains from sets of documents (e.g. a conference or journal); and those derived from processing the previous ones, such as annotations.
To better illustrate this model, consider exploring the research papers published at the SIGGRAPH conference in 2016. First, every paper will be materialized as a new document containing the full text. Immediately after, the document will be automatically associated with several parts, each of them grouping sentences by rhetorical class (e.g. approach, background, challenge, future work and outcome) and by section (e.g. abstract, introduction). Finally, a new domain will be created grouping all these documents. Different analyses will then be performed, extending the initial set of resources with more annotations at several representational levels. At document level, full-text based annotations are provided, such as named entities, compounds and descriptive tags. At relational level, connections between resources are found (e.g. semantic similarity-based relationships). And finally, at domain level, annotations such as tags and summaries are composed describing the corpus of documents.

3.1.2 Event-based Paradigm. An event illustrates a performed action, i.e. a resource and its new state. It follows the Representational State Transfer (REST) [4] paradigm, but taking into account [...]
3.1.3 Linked Data Principles. Data in librAIry is individually addressable and linkable [9] following the Linked Data principles defined by T. Berners-Lee [1]. Thus, resources (i.e. a domain, a document, a part or an annotation) have: (1) a URI as name, (2) a retrievable (or dereferenceable) HTTP URI so that it can be looked up, (3) useful information provided in a standard notation (e.g. JavaScript Object Notation (JSON)) when it is looked up by URI, and (4) links to other URIs so that other resources can be discovered from it.
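As an illustration of the four Linked Data properties, the following sketch builds the JSON description that might be returned when a document URI is looked up. The base URI, the path layout and the field names (`uri`, `title`, `links`) are assumptions made for the example; the paper does not specify the actual representation.

```python
import json

BASE = "http://example.org/api"  # hypothetical base URI, not the real service


def resource_uri(kind, identifier):
    """(1)+(2): every resource gets an HTTP URI that can be dereferenced."""
    return f"{BASE}/{kind}s/{identifier}"


def describe_document(doc_id, title, domain_id, part_ids):
    """(3)+(4): a JSON-serializable description with links to other URIs."""
    return {
        "uri": resource_uri("document", doc_id),
        "title": title,
        "links": {
            "domain": resource_uri("domain", domain_id),
            "parts": [resource_uri("part", p) for p in part_ids],
        },
    }


description = describe_document("d1", "A sample paper", "siggraph2016", ["p1", "p2"])
payload = json.dumps(description, indent=2)  # what a lookup by URI could return
```

A client that receives this payload can discover the enclosing domain and the parts of the document by following the URIs under `links`, without knowing the URI scheme in advance.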

Framework Architecture
Following a publisher/subscriber approach, all the modules in the framework can publish and read events to notify, and to be notified, about the state of a resource. Therefore, the system flow is not unique and is not explicitly implemented; instead, distributed and emergent flows can appear according to particular actions on resources.
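The idea of an emergent flow can be sketched with a toy, in-process event bus. This is an illustration only: the class `EventBus`, the handler names and the URIs are hypothetical, delivery here is synchronous, and the real framework uses a message broker with modules running as separate processes.

```python
from collections import defaultdict


class EventBus:
    """Toy in-memory bus: handlers subscribe to a key and are called on publish."""

    def __init__(self):
        self._subscribers = defaultdict(list)
        self.log = []  # every (key, uri) published, in order

    def subscribe(self, key, handler):
        self._subscribers[key].append(handler)

    def publish(self, key, uri):
        self.log.append((key, uri))
        for handler in list(self._subscribers[key]):
            handler(self, uri)


def annotator(bus, document_uri):
    """Hypothetical module: reacts to new documents by creating an annotation."""
    bus.publish("annotation.created", document_uri + "/annotations/1")


def run_demo():
    bus = EventBus()
    bus.subscribe("document.created", annotator)
    # No module declares the overall flow; it emerges from the reactions.
    bus.publish("document.created", "http://example.org/documents/1")
    return bus.log
```

Calling `run_demo()` publishes a single `document.created` event, and the log afterwards also contains the `annotation.created` event produced in reaction to it. Adding another subscriber changes the resulting flow without touching existing modules.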

3.2.1 Event-Bus. We use the Advanced Message Queuing Protocol (AMQP) as the messaging standard in librAIry to avoid any cross-platform problem and any dependency on the selected message broker. This protocol defines exchanges, queues, routing-keys and binding-keys to connect publishers and consumers. A message sent by a publisher to an exchange is tagged with a routing-key. Consumers whose binding-key (the key used to link their queue to that exchange) matches that routing-key will receive the message. In librAIry this key follows the structure resource.status. Since a wildcard-based definition can be used to set the key, this paradigm allows modules to listen either to events of an individual type (e.g. domains.created for new domains) or to events of multiple types (e.g. #.created for all new resources).
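The matching between binding-keys and routing-keys can be sketched in a few lines. The function below is an illustrative re-implementation of AMQP topic-exchange semantics (`*` stands for exactly one dot-separated word, `#` for zero or more words); in a deployment this matching is done by the message broker, and the function name `topic_matches` is an assumption.

```python
def topic_matches(binding_key: str, routing_key: str) -> bool:
    """Return True if routing_key is matched by binding_key (topic semantics)."""
    pattern = binding_key.split(".")
    words = routing_key.split(".")

    def match(p: int, w: int) -> bool:
        if p == len(pattern):
            # Pattern exhausted: succeed only if all words were consumed too.
            return w == len(words)
        if pattern[p] == "#":
            # '#' may absorb zero or more words.
            return any(match(p + 1, k) for k in range(w, len(words) + 1))
        if w == len(words):
            return False
        if pattern[p] == "*" or pattern[p] == words[w]:
            return match(p + 1, w + 1)
        return False

    return match(0, 0)


# A module interested in every newly created resource:
interested = [k for k in ("document.created", "part.created", "document.deleted")
              if topic_matches("#.created", k)]
```

With keys of the form resource.status, the binding-key `#.created` selects every creation event regardless of resource type, while `document.*` would select every state change of documents.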

3.2.2 API. An HTTP REST Application Programming Interface (API) was designed for interaction with end-users. Any external operation initiated by a user is handled here. Some of them, usually read operations, are completely managed by this module, which gets all the data from the internal storage. However, operations that modify the status of some resource (e.g. creation of a document) may also be handled asynchronously by other modules listening for that type of event. This module publishes to the following routing-keys: domain.(created;updated;deleted), document.(created;updated;deleted), part.(created;updated;deleted), and annotation.(created;updated;deleted).
The three types of databases considered for storage (3.2.3) are:
• column-oriented database: Focused on uniquely identified and/or structured data. This storage allows us to search key elements across resources.
• document-oriented database: Focused on indexing raw text. This storage allows us to execute advanced search operations over all the information gathered about a textual resource.
• graph database: Focused on relations. This storage allows us to explore resources through the relationships between them.
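The role of the Unified Data Manager (UDM) described in the Storage section can be sketched as a single access object that fans each operation out to the three kinds of stores. This is a rough illustration under strong assumptions: the class and method names are hypothetical, and plain in-memory dictionaries stand in for the column-oriented, document-oriented and graph databases.

```python
class UnifiedDataManager:
    """Hypothetical DAO-style facade over three in-memory stand-in stores."""

    def __init__(self):
        self.columns = {}  # uri -> structured fields (column-oriented stand-in)
        self.texts = {}    # uri -> raw text (document-oriented stand-in)
        self.edges = {}    # uri -> set of related uris (graph stand-in)

    def save(self, uri, fields, text="", related=()):
        """Write one resource to all three stores."""
        self.columns[uri] = dict(fields)
        self.texts[uri] = text
        self.edges.setdefault(uri, set())
        for other in related:
            self.edges[uri].add(other)
            self.edges.setdefault(other, set()).add(uri)

    def delete(self, uri):
        """Remove one resource from all three stores."""
        self.columns.pop(uri, None)
        self.texts.pop(uri, None)
        for other in self.edges.pop(uri, set()):
            self.edges.get(other, set()).discard(uri)

    def find_by_field(self, name, value):
        """Key-element lookup, the use case of the column-oriented store."""
        return sorted(u for u, f in self.columns.items() if f.get(name) == value)

    def search_text(self, term):
        """Text search, the use case of the document-oriented store."""
        return sorted(u for u, t in self.texts.items() if term.lower() in t.lower())

    def neighbours(self, uri):
        """Relationship exploration, the use case of the graph store."""
        return sorted(self.edges.get(uri, set()))
```

Modules that only call `save`, `delete` and the query methods stay independent of which concrete databases sit behind the manager, which is the point of the DAO-inspired design.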

3.2.4 Modules. The modules composing librAIry have been designed following the microservices architectural style. A module is a cohesive (i.e. it implements only functionalities strongly related to the concern that it is meant to model [3]) and independent process working on the framework with a specific purpose. This purpose is defined by both the routing-key and the binding-key associated with the events handled by the module.
These are the main types of modules identified in librAIry:
• Harvester: creates system resources such as documents, parts and domains, from local or remote textual files.

EXPERIMENTS AND LESSONS-LEARNED
librAIry has been used in some real scenarios, such as a research-paper repository for the European project DrInventor, a support tool for decision makers analyzing patents and public aids for the ICT sector, and a book recommender for an online content platform. This has allowed us to identify some weak and strong points of the framework and to iterate over the architecture to arrive at the described solution. The following modules have been developed: (1) a general-purpose harvester which retrieves text and meta-information from PDF files in a local or remote file-system; (2) a research paper-oriented harvester focused on collecting and processing more specific textual files (e.g. scientific papers), creating both documents and parts inferred from the rhetorical classes of the paper; (3) a Stanford CoreNLP-based Annotator which discovers named entities, compounds and lemmas from documents and parts; (4) a Topic Modeler based on Latent Dirichlet Allocation (LDA), using the Spark implementation of the algorithm, which creates probabilistic topic models for each domain in the framework. Domains are annotated with the set of topics (i.e. ranked lists of words) discovered from the corpus, and both documents and parts of that domain are also annotated with the vector of probabilities of belonging to these topics; and (5) a Word Embedding Modeler which creates a word2vec model from the documents contained in a domain.
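Since documents and parts are annotated with a vector of topic probabilities, two resources can be compared through those vectors. The text does not state which measure is used, so the sketch below swaps in one common choice, a similarity based on the Jensen-Shannon divergence, purely as an assumption; the function names are illustrative.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def topic_similarity(p, q):
    """1 - Jensen-Shannon divergence between two topic distributions.

    With base-2 logarithms the divergence lies in [0, 1], so the result is
    1.0 for identical distributions and 0.0 for disjoint ones.
    """
    if len(p) != len(q):
        raise ValueError("topic vectors must have the same length")
    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    jsd = 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)
    return 1.0 - jsd


# Hypothetical topic vectors for two documents of the same domain.
doc_a = [0.7, 0.2, 0.1]
doc_b = [0.6, 0.3, 0.1]
score = topic_similarity(doc_a, doc_b)
```

The measure is symmetric and never divides by zero, because the mixture is positive wherever either input is positive.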
Due to its linear scalability and high-performance features, Cassandra has been used to support the column-oriented storage functionality, Elasticsearch as document-oriented storage and Neo4j as graph-oriented storage.
All modules in librAIry have been packaged as Docker containers and uploaded to Docker-Hub to facilitate the installation of the system.
Maximizing information re-usability and minimizing irrelevant data becomes especially important when the system handles large collections of data (around a million documents). Fine-grained resource definitions have been key to achieving this, so that modules execute actions only when really necessary. When a new domain is created, for instance, a new Topic Model is trained for that domain and is used to calculate the semantic similarity between the documents (and the parts) in that domain. If a new document (or part) is added to that domain, the model is trained again and the semantic similarities are re-calculated. However, this becomes unfeasible when the domain is frequently updated and is composed of a large number of documents. One solution has been to define a new type of resource between domains and documents, models, that describes the representational state (e.g. topic model) of a collection of documents. Thus the model is only re-trained when a significant number of documents are added to the sampling data set, and not whenever a document is added to the entire domain. This less transient model is used to calculate semantic similarities between the documents (and parts) inside a domain in a more efficient way. Following this more precise execution of tasks, the routing-keys should include the URI of the affected resource in their definition, not only in the content of the message. This would allow modules to listen either to a type of resource or to a specific resource (or subsets, via regular expressions).
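The re-training rule described above can be sketched as a simple policy object. The text only says that the model is re-trained when a significant number of documents has been added; the concrete criterion used here (a fixed fraction of the size of the current training sample) and all names are assumptions made for illustration.

```python
class RetrainPolicy:
    """Decide when a domain model should be re-trained.

    Hypothetical rule: re-train once the documents added since the last
    training reach `min_ratio` of the size of the current training sample.
    """

    def __init__(self, trained_size=0, min_ratio=0.25):
        self.trained_size = trained_size  # documents in the current sample
        self.min_ratio = min_ratio        # what counts as "significant"
        self.pending = 0                  # documents added since last training

    def document_added(self) -> bool:
        """Register one new document; return True if re-training is due."""
        self.pending += 1
        if self.trained_size == 0 or self.pending / self.trained_size >= self.min_ratio:
            # The new documents join the sample and the counter restarts.
            self.trained_size += self.pending
            self.pending = 0
            return True
        return False


# A model trained on 100 documents tolerates 24 additions before re-training.
policy = RetrainPolicy(trained_size=100, min_ratio=0.25)
decisions = [policy.document_added() for _ in range(25)]
```

In this sketch only the 25th addition triggers a re-training, so similarities for the first 24 new documents would be computed against the existing, less transient model.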
While the storage modules are always used to save/update/delete a resource, they are not always required by the end-user. The graph storage, for instance, makes sense when a path between two documents or parts is requested for a given domain. However, some domains are not intended to be explored through their linked resources. A more fine-grained definition of resources will allow the graph storage to be used only when necessary.
On the other hand, distributed execution of NLP tasks (not only across threads, but also across machines) has proved to be especially useful for handling large collections of documents. It requires less processing time than a monolithic solution (e.g. the CoreNLP application) and it also provides dynamic load balancing between modules.

CONCLUSIONS AND FUTURE WORK
In librAIry, existing algorithms and tools coming from different technologies can work collaboratively to process and analyze large collections of textual resources. The framework has been successfully applied to some real scenarios.
A new model definition, based on the previously mentioned principle of maximizing information re-usability and minimizing irrelevant data, is being studied to create a more fine-grained resource design. New domains, in the sense of particular vocabularies or specific textual formats, are also being analyzed for inclusion in the system via specific harvesters and/or more precise annotators. Moreover, a template-based mechanism to facilitate the integration of new tools and techniques into the system is being built, to make it easier to develop new modules and to increase the number of modules available at Docker-Hub.

3.2.3 Storage. Multiple types of data can be handled in this ecosystem. Inspired by the Data Access Object (DAO) pattern, we have created a Unified Data Manager (UDM) providing access to any type of data used in the system. Three types of databases have been considered: column-oriented, document-oriented and graph databases.