Due to the increasing online availability of biomedical data sources, the ability to federate heterogeneous and distributed data sources is becoming critical to support multi-centric studies and translational research in medicine. The CrEDIBLE project organized three thematic working days on October 15-17, 2012 in Sophia Antipolis (France), where experts were invited to present their latest work and discuss their approaches. The aim was to gather scientists from all disciplines involved in the setup of distributed and heterogeneous medical image data sharing systems, to provide an overview of this broad and complex area, to assess the state-of-the-art methods and technologies addressing it, and to discuss the open scientific questions it raises.
The methods for biomedical data distribution considered in the context of CrEDIBLE are presented in the sections below.
On Monday afternoon, Tuesday afternoon, and Wednesday morning, the workshop will be held in the conference room of the I3S laboratory in Sophia Antipolis (France). See the following map to locate the laboratory. The conference room is on the ground floor of I3S (room 007).
On Tuesday morning, the workshop will be held in room 101-001, on the first floor of the “Templiers Ouest” building. Please refer to the map below to identify the building.
Semantic Web technologies play a critical role in representing, interpreting, and querying data, and ultimately in achieving data mediation. The themes of knowledge modeling through ontologies and the reuse of existing ontologies will be addressed more specifically. The challenges related to the integration and federation of heterogeneous databases, including the data representation models and their impact on system performance, will also be studied. Downstream, the study will consider the exploitation and production of new knowledge in the context of data processing workflows. Feedback on existing tools and their capabilities and limitations is also expected.
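As a minimal illustration of this role, the sketch below uses Python's rdflib library to describe a medical image with RDF triples and retrieve it through a SPARQL query. The ex: namespace and all class and property names are invented for the example; a real deployment would reuse a shared vocabulary.

```python
from rdflib import Graph, Namespace, Literal, RDF

# Invented vocabulary; a real deployment would reuse a shared ontology.
EX = Namespace("http://example.org/credible#")

g = Graph()
g.bind("ex", EX)

# Describe one image and part of its acquisition context as RDF triples.
g.add((EX.img001, RDF.type, EX.MRImage))
g.add((EX.img001, EX.acquiredIn, EX.study42))
g.add((EX.img001, EX.bodyPart, Literal("brain")))

# Once data are described this way, a SPARQL query retrieves all brain MR
# images regardless of the file or database that originally stored them.
results = g.query("""
    PREFIX ex: <http://example.org/credible#>
    SELECT ?img WHERE {
        ?img a ex:MRImage ;
             ex:bodyPart "brain" .
    }
""")
for row in results:
    print(row.img)
```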
The data sources to be integrated are related yet heterogeneous: they use different semantic references (vocabularies, etc.), different representations (files; relational, triple, or XML databases; etc.), and even different data models (relational, knowledge graphs, etc.). Data integration is also constrained by medical application requirements, in particular the setup of multi-centric studies and the support of translational research and medical applications. Data security and fine-grained access control are other important related problems: the need to simultaneously handle different data representation models makes data security particularly challenging.
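The sketch below illustrates one common way to bridge such representations: lifting the rows of a relational table into RDF triples, in the spirit of R2RML mappings, though hand-coded here. The table schema and all URIs are hypothetical.

```python
import sqlite3
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/credible#")

# Toy relational source (the schema is invented for the example).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE exam (id INTEGER, patient TEXT, modality TEXT)")
db.execute("INSERT INTO exam VALUES (1, 'P007', 'MR')")

# Lift each row into RDF so relational data can be queried alongside
# sources that are natively stored as triples or XML.
g = Graph()
for exam_id, patient, modality in db.execute("SELECT id, patient, modality FROM exam"):
    exam = EX[f"exam{exam_id}"]
    g.add((exam, RDF.type, EX.Examination))
    g.add((exam, EX.patientId, Literal(patient)))
    g.add((exam, EX.modality, Literal(modality)))

print(g.serialize(format="turtle"))
```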
The ontology defines conceptual primitives that represent the semantics of data (images, test and questionnaire results) by integrating their production context (study, examination, subject, medical practitioner, data acquisition protocol, processing, acquisition device, parameterization, scientific publications). Such an ontology spans different domains (different entity classes) and includes hundreds of concepts. It is structured into modules at different abstraction levels, leveraging generic primitives that can formalize several domains, for practical reasons related to ontology maintenance.
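A minimal sketch of such a modular structure is given below, assuming invented namespaces for a generic upper module and an imaging-specific domain module; a real project might anchor the upper module in a foundational ontology such as DOLCE.

```python
from rdflib import Graph, Namespace, RDF, RDFS
from rdflib.namespace import OWL

# Invented namespaces standing for two modules of the ontology:
# a generic upper module and an imaging-specific domain module.
CORE = Namespace("http://example.org/core#")
IMG = Namespace("http://example.org/imaging#")

g = Graph()

# Upper module: abstract primitives shared across domains.
g.add((CORE.Process, RDF.type, OWL.Class))
g.add((CORE.Device, RDF.type, OWL.Class))

# Domain module: imaging-specific classes anchored under the generic ones,
# so queries and tools written against the upper module also cover imaging.
g.add((IMG.ImageAcquisition, RDFS.subClassOf, CORE.Process))
g.add((IMG.MRScanner, RDFS.subClassOf, CORE.Device))
```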
The ontology design involves: reusing (completely or partially) existing ontological modules (at different abstraction levels); designing new modules (in particular to represent knowledge related to specific medical domains); managing the module life cycle; and documenting modules to ease reuse. There are different means of exploitation: ontology alignment to federate data relying on different semantics (addressing problems related to differing levels of detail or even discrepancies in the entities considered); data processing assistance (checking the compatibility of data with processing tools, producing data provenance information); and query-based and/or visualization-based data access. Each usage scenario might involve adapting the ontology representation to the tool being used (inference engine, visualizer) and its language.
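As an illustration of the alignment scenario, the sketch below declares an owl:equivalentClass mapping between two invented site vocabularies and materializes its consequences with the owlrl OWL-RL reasoner, so that data annotated with one site's terms becomes visible to queries written against the other's.

```python
from rdflib import Graph, Namespace, RDF
from rdflib.namespace import OWL
import owlrl  # OWL-RL rule-based reasoner for rdflib

# Two invented site vocabularies describing the same notion differently.
A = Namespace("http://siteA.example.org/onto#")
B = Namespace("http://siteB.example.org/onto#")

g = Graph()
g.add((A.scan1, RDF.type, A.MRExam))                # data in site A's terms
g.add((A.MRExam, OWL.equivalentClass, B.MRIStudy))  # the alignment itself

# Materialize the consequences of the alignment: scan1 is now also a
# B:MRIStudy, so a query written against site B's vocabulary finds it.
owlrl.DeductiveClosure(owlrl.OWLRL_Semantics).expand(g)
print((A.scan1, RDF.type, B.MRIStudy) in g)  # True
```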
Medical data are usually stored in relational databases, which allow fast access to the data, while metadata are formalized through graph-based knowledge representation models designed for the Semantic Web, which enable reasoning through inferences based on the ontologies used to model this knowledge. The main challenges are the mixed use of different representations and the scalability of data storage and reasoners. The scalability problem is well known in the Web of Data community. Promising approaches rely on graph-oriented databases, on adapting the inferences performed to the size of the manipulated data stores, and on querying and reasoning techniques adapted to distributed stores.
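A naive sketch of querying distributed stores is given below, assuming each site exposes its metadata through a SPARQL endpoint (the endpoint URLs are hypothetical). A real mediator would additionally rewrite queries per site vocabulary and plan joins across sources.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Hypothetical endpoints; each site would expose its metadata store this way.
ENDPOINTS = [
    "http://siteA.example.org/sparql",
    "http://siteB.example.org/sparql",
]

QUERY = """
PREFIX ex: <http://example.org/credible#>
SELECT ?img WHERE { ?img a ex:MRImage . }
"""

# Naive federation: send the same query to every site, merge the bindings.
images = set()
for url in ENDPOINTS:
    sparql = SPARQLWrapper(url)
    sparql.setQuery(QUERY)
    sparql.setReturnFormat(JSON)
    for b in sparql.query().convert()["results"]["bindings"]:
        images.add(b["img"]["value"])
```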
The acquisition and representation of knowledge about the manipulated data are tightly linked to the data processing and transformation tools applied. Knowledge acquired about data may be used to validate or filter the processing tools applied to it. Conversely, knowledge acquired about processing tools can be used to infer new knowledge about data, in particular the data produced by this processing. Knowledge exploitation can happen at different stages of the life cycle of scientific processing pipelines: at design time, through editing assistance (static validation, assisted composition), and at run time (dynamic validation, creation of new knowledge).
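The sketch below illustrates the static-validation case under invented names: a tool declares the class of input it expects, and a dataset is accepted only if its type specializes that class according to the ontology's subclass hierarchy.

```python
from rdflib import Graph, Namespace, RDF, RDFS

EX = Namespace("http://example.org/credible#")

g = Graph()
# Ontology fragment: T1-weighted images are a kind of MR image.
g.add((EX.T1Image, RDFS.subClassOf, EX.MRImage))
# Knowledge about one dataset and one tool (all names are invented).
g.add((EX.data1, RDF.type, EX.T1Image))
g.add((EX.segmenter, EX.expectsInput, EX.MRImage))

def compatible(graph, data, tool):
    """Static check: does the data's type specialize the tool's input class?"""
    expected = graph.value(tool, EX.expectsInput)
    for cls in graph.objects(data, RDF.type):
        # transitive_objects yields cls itself and all of its superclasses
        if expected in graph.transitive_objects(cls, RDFS.subClassOf):
            return True
    return False

print(compatible(g, EX.data1, EX.segmenter))  # True
```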
Knowledge about both data and processing tools is also often used to describe data provenance. Provenance is then described as semantic annotations tracing the execution path. Provenance is tightly related to the nature of the data processed. It facilitates the reuse and interconnection of data from different sources. It can make use of several domain ontologies and facilitates interoperability between different data processing engines.
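As an example of such annotations, the sketch below records one hypothetical tool execution using the W3C PROV-O vocabulary, a standard choice for this kind of provenance tracing; the ex: resources are invented.

```python
from rdflib import Graph, Namespace, RDF

EX = Namespace("http://example.org/credible#")
PROV = Namespace("http://www.w3.org/ns/prov#")  # W3C PROV-O vocabulary

g = Graph()
g.bind("prov", PROV)

# One execution of a (hypothetical) segmentation tool, traced with PROV-O:
# the output entity was generated by an activity that used the input entity.
g.add((EX.run7, RDF.type, PROV.Activity))
g.add((EX.run7, PROV.used, EX.rawImage))
g.add((EX.mask, RDF.type, PROV.Entity))
g.add((EX.mask, PROV.wasGeneratedBy, EX.run7))
g.add((EX.mask, PROV.wasDerivedFrom, EX.rawImage))

print(g.serialize(format="turtle"))
```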
The list of scientific questions of interest for the workshop, classified by session, is shown below.
1) Foundational ontologies (DOLCE)
2) Application ontology (overall project ontology)
3) Ontology alignment