Abstract
The current practice of data dissemination via data portals exhibits scalability limitations. Technically, the use of keywords and themes for the description of data semantics cannot keep up with the growing number and topical diversity of published datasets. At the same time, the publish-find-bind model of data dissemination decouples data discovery from data access, which makes it increasingly difficult to find data that can be (re)used in combination with other data for the construction of tailored data products. Yet data application developers increasingly need to see the data of several providers, not just one, as a whole at runtime.
In this context, Linked Data (LD), a technique for publishing structured, semantically rich data on the Web, is able to open up the data silos of traditional data portals. Being based on well-known Web technologies (such as HTTP), interoperable semantic standards and a graph data model, LD makes it possible to perform data discovery, access and integration in one step. Apart from that, LD creates a Web-scale semantic infrastructure that enables interoperability not only between separate databases but also between organisations.
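The idea of discovery, access and integration happening in one step can be illustrated with a minimal sketch. It assumes a graph is simply a set of (subject, predicate, object) triples; the two providers, all `ex:` identifiers and the helper functions `merge`, `describe` and `follow` are hypothetical and are not taken from the thesis.

```python
# Minimal sketch: graphs as sets of (subject, predicate, object) triples.
# All identifiers below are invented for illustration only.
Triple = tuple[str, str, str]

# Hypothetical provider A publishes buildings and links them to a municipality.
provider_a: set[Triple] = {
    ("ex:building/1", "rdf:type", "ex:Building"),
    ("ex:building/1", "ex:locatedIn", "ex:municipality/Enschede"),
}

# Hypothetical provider B describes the same municipality identifier.
provider_b: set[Triple] = {
    ("ex:municipality/Enschede", "rdf:type", "ex:Municipality"),
    ("ex:municipality/Enschede", "rdfs:label", "Enschede"),
}


def merge(*graphs: set[Triple]) -> set[Triple]:
    """Integrate graphs by set union; shared identifiers connect them."""
    return set().union(*graphs)


def describe(graph: set[Triple], subject: str) -> set[tuple[str, str]]:
    """Return every (predicate, object) pair known about a subject."""
    return {(p, o) for s, p, o in graph if s == subject}


def follow(graph: set[Triple], subject: str, predicate: str) -> list[str]:
    """Follow one link type from a subject to the objects it points at."""
    return sorted(o for s, p, o in graph if s == subject and p == predicate)


merged = merge(provider_a, provider_b)

# Starting from a building, follow its link and read data from the other provider.
for target in follow(merged, "ex:building/1", "ex:locatedIn"):
    print(target, describe(merged, target))
```

Because both providers use the same identifier for the municipality, the union of the two sets already yields a connected graph; no join key has to be negotiated separately. Real LD deployments would use dereferenceable HTTP URIs and an RDF store rather than in-memory tuples.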
However, converting existing datasets into LD as-is does not realise these properties. Original data sources were created with the closed world assumption in mind and, therefore, lack connections both to each other and to the existing semantic infrastructure. For this reason, realising the advantages of LD requires data to be semantically enriched, which is a knowledge- and labour-demanding task. Additionally, data providers face an organisational challenge. Investment in semantic enrichment (SE) of provider data to leverage LD requires evaluation of benefits that are difficult to quantify. The common and well-understood approach of use-case-driven development is too narrow to cover all business requirements and, therefore, is not adequate to demonstrate the added value of LD.
Instead, this thesis proposes enabling exploratory search (ES) of data as an integrated approach towards developing LD in an organisation. This proposition is supported by two major research contributions presented in this work, namely (1) a business rationale for LD implementation and (2) technical requirements of SE instantiated by a number of developed data and software artefacts.
Technical requirements of SE are framed as four data enrichment levels: (1) ontological, (2) conceptual, (3) instantial and (4) literal. Semantic enrichment design principles for each of these levels are created based on the analysis of the data-related requirements and functional limitations of existing ES systems. The analysed systems are grouped into nine categories according to the exploratory techniques they facilitate. This makes it possible to abstract from particular implementations and formulate requirements for each of the categories.
The business rationale for LD implementation is based on the analysis of the roles that LD plays in the realisation of the business vision of a large public data provider. Drawing on the experience of developing a data platform at the Netherlands' Cadastre, Land Registry and Mapping Agency (Kadaster), it is concluded that LD directly contributes to the operationalisation of three out of four business ambitions, namely (1) enabling a use-case-oriented vision, (2) increasing the data value, and (3) ensuring the certainty of the data and the legitimacy of the organisation. The remaining ambition of being a spatial data provider can be fulfilled by conventional technologies; however, LD indirectly contributes to reaching users outside of the GIS community. These findings are used to prioritise the previously identified technical requirements of SE to achieve alignment with the business vision. As a result of this prioritisation, two semantic enrichment scenarios that group design principles according to their advancement level are formulated, implemented, and evaluated. The driving reason for the division into scenarios is to differentiate design principles that are easy to implement but lead to high gains in terms of ES (the first scenario) from those that require greater effort but lead to the maximum possible results (the second scenario).
The first scenario focuses on ES-enabling semantic enrichment of datasets that have thematically similar content but are curated by several independent data providers. The design principles covered in the first scenario make it possible to create and disseminate an LD representation of existing data on top of existing spatial data infrastructures (SDIs) without altering already implemented data pipelines and structures. They are put to the test during the development of the Open European Location Services (OpenELS) project, a collaborative effort aimed at the construction of a European-level SDI for the provision of geospatial data within the Infrastructure for Spatial Information in the European Community (INSPIRE) initiative.
The second scenario, in contrast, covers the creation of explorable Knowledge Graphs, voluminous data sources that combine datasets with a wide range of topics from a multitude of knowledge domains. Apart from the topical diversity, the design principles of this scenario also cater for a large volume of spatio-temporal data by utilising topological relations for the creation of hierarchical spatial partitioning. The construction of the Kadaster Knowledge Graph (KKG), which combines 12 datasets curated by 8 organisations, served as the evaluation of the second scenario's design principles. The usability of the graph was assessed during the creation of three applications, namely data browsing, urban planning, and the development of a chatbot.
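How a topological relation can yield a hierarchical spatial partitioning may be sketched as follows. This is only an illustration under strong assumptions: regions are reduced to axis-aligned bounding boxes, the only relation used is a strict "within", and each region's parent is taken to be its smallest container. The `Region` class, the region names and the coordinates are hypothetical and do not describe how the KKG was actually built.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Region:
    """A hypothetical region approximated by an axis-aligned bounding box."""
    name: str
    min_x: float
    min_y: float
    max_x: float
    max_y: float

    def area(self) -> float:
        return (self.max_x - self.min_x) * (self.max_y - self.min_y)

    def within(self, other: "Region") -> bool:
        """Strict containment: inside the other box and smaller than it."""
        return (
            other.min_x <= self.min_x
            and other.min_y <= self.min_y
            and self.max_x <= other.max_x
            and self.max_y <= other.max_y
            and self.area() < other.area()
        )


def build_hierarchy(regions: list[Region]) -> dict[str, Optional[str]]:
    """Map each region to its smallest containing region (None for roots)."""
    parents: dict[str, Optional[str]] = {}
    for region in regions:
        containers = [c for c in regions if region.within(c)]
        parent = min(containers, key=Region.area, default=None)
        parents[region.name] = parent.name if parent else None
    return parents


def ancestors(parents: dict[str, Optional[str]], name: str) -> list[str]:
    """Walk up the hierarchy from a region to its root."""
    chain = []
    current = parents.get(name)
    while current is not None:
        chain.append(current)
        current = parents[current]
    return chain


# Invented example data: nested boxes standing in for administrative units.
regions = [
    Region("country", 0, 0, 100, 100),
    Region("province", 10, 10, 60, 60),
    Region("municipality", 20, 20, 30, 30),
    Region("island", 70, 70, 80, 80),
]
parents = build_hierarchy(regions)
```

With such a hierarchy in place, a query about a small area only needs to inspect the regions on its ancestor chain instead of comparing against every geometry, which is one reason hierarchical partitioning helps with large spatial datasets. Real geometries would need proper polygon containment tests rather than bounding boxes.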
Several characteristics of the research influence the applicability of its contributions. The first limitation stems from the fact that a major part of the research was conducted in a governmental institution, namely Kadaster. Therefore, for for-profit organisations, the business rationale for LD implementation is relevant mainly in the parts related to revenue-focused business ambitions, such as use-case orientation and increasing data value. Another limitation is that the rationale is based on qualitative reasoning, which implies that it does not indicate how many resources an LD implementation requires. For the data and software artefacts created as a part of this research, it is important to emphasise that their evaluation is based on providing semantic enrichment and exploratory search capabilities and does not include usability testing. However, the design principles of both scenarios were discussed within a large group of specialists involved in their implementation, then taken into use within their organisations, and considered applicable and useful.
The approach presented in this thesis is thought to be beneficial for the following application domains and target groups. First of all, the research contributes to the topics of data dissemination and the development of data and semantic infrastructures. Therefore, it is of interest to data owners, data custodians and data providers, including but not limited to National Mapping and Cartographic Agencies (NMCAs), since the research is greatly concerned with the utilisation of spatial and temporal data components for linking and consolidating data assets into one virtual data source. This can be particularly appealing for data scientists and engineers who are confronted with tailored data product development, as well as application developers who work with heterogeneous cross-domain data to create novel applications. From the viewpoint of software development companies, this work can be considered a blueprint for semantic enrichment scenarios to be supported by data management tools and, as such, it can help in foreseeing market demands in terms of functional requirements. Apart from that, the research findings add to the business rationalisation of semantic enrichment in organisations. Therefore, managers who are responsible for data desilofication strategies and business developers who are concerned with the development and operationalisation of the business vision are among the target groups of this work. Exploratory search over big and distributed data is another application domain of the thesis; therefore, the analysis of the data quality requirements of ES systems is of particular interest to the community of ES researchers and software developers. Finally, the elaboration on Knowledge Graph development presented in this work contributes to the ongoing discussion in the Semantic Web and Linked Data communities and is therefore of interest to them.
Original language | English |
---|---|
Qualification | Doctor of Philosophy |
Awarding Institution | |
Supervisors/Advisors | |
Award date | 8 May 2023 |
Place of Publication | Enschede |
Publisher | |
Print ISBNs | 978-90-365-5617-0 |
DOIs | |
Publication status | Published - 8 May 2023 |