Marcin Szymczak, Ph.D.

The broad goal of the project was to support the exchange of information, which is both extremely important and challenging in a rapidly and unpredictably changing world. The amount of data is growing very fast, and data are often distributed over heterogeneous systems or databases. As a consequence, the same piece of information can be represented in different ways; such representations are called coreferent data. This can be a serious problem in data processing, hampering the interoperability of distributed systems. Due to the volume of data processed and the multilevel analysis required, it is usually very difficult, and often simply impossible, to remedy this problem “manually”. Thus, it is important to identify coreference automatically on different levels in order to secure interoperability, i.e. the ability of systems and organizations to work together.

One can distinguish two major levels in coreference detection, namely the metadata level and the data level, which are strongly related to each other. On the one hand, metadata, e.g. a knowledge base (such as an ontology or taxonomy) or a database schema that defines structure and properties, provide additional information about the data and can support coreference detection on the data level. On the other hand, the data described by the metadata can be used to construct metadata or to detect coreference in metadata.

Within the framework of the project, two novel schema matching techniques have been proposed. The first technique is based only on XML schema information, more specifically on the names (tags) of schema elements and on their sequences, called here paths, as elements may be nested in other elements. This method compares element names lexically and takes their relative importance into account. Because it works on schema information alone, it is a very efficient solution. The second schema matching technique is based only on content data and is a composition of a vertical and a horizontal schema matching. First, attribute domains are compared statistically and lexically in the vertical matching. Second, a horizontal matching is applied, which is based on detecting coreferent tuples. This makes it possible to address the attribute granularity and coverage problems.
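The idea behind the first, path-based technique can be illustrated with a minimal sketch. It is not the method proposed in the project: the function names, the use of Python's `difflib.SequenceMatcher` as the lexical measure, the geometric `decay` weighting (leaf element counts most, ancestors progressively less) and the `threshold` value are all assumptions made for illustration only.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Lexical similarity of two element names, in [0, 1] (assumed measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def path_similarity(path_a: str, path_b: str, decay: float = 0.5) -> float:
    """Compare two element paths such as 'person/address/city'.

    Names are aligned from the leaf upwards. The leaf gets weight 1 and each
    ancestor level gets the previous weight times `decay` (an assumption that
    stands in for the "relative importance" of element names).
    """
    steps_a = path_a.strip("/").split("/")[::-1]
    steps_b = path_b.strip("/").split("/")[::-1]
    total = 0.0
    weight_sum = 0.0
    weight = 1.0
    for i in range(max(len(steps_a), len(steps_b))):
        # A level missing from the shorter path contributes zero similarity.
        if i < len(steps_a) and i < len(steps_b):
            total += weight * name_similarity(steps_a[i], steps_b[i])
        weight_sum += weight
        weight *= decay
    return total / weight_sum


def match_schemas(paths_a, paths_b, threshold: float = 0.7):
    """Map each path of schema A to its most similar path of schema B."""
    matches = {}
    for pa in paths_a:
        best = max(paths_b, key=lambda pb: path_similarity(pa, pb), default=None)
        if best is not None and path_similarity(pa, best) >= threshold:
            matches[pa] = best
    return matches
```

Calling `match_schemas(["person/name"], ["person/name", "order/id"])` would pair the identical paths. A real matcher would also have to decide how to resolve conflicts when several paths of one schema prefer the same path of the other, which this sketch leaves open.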

Moreover, a novel automated method, called DOC, is proposed to construct a knowledge base with semantic information on the domain of an attribute. Such a knowledge base then supports the semantic comparison of domain values that can be sorted by means of an order relation reflecting a notion of generality. The use and impact of this method on the mapping and transformation of attribute values across heterogeneous data collections, the detection of coreferent tuples and data fusion (merging coreferent representations of an entity into a single representation) are investigated. Our novel technique has the advantage that there is no need for a priori taxonomical knowledge on the attribute domains. Instead, this knowledge is dynamically constructed and hence only depends on the content data, which means that it can be automatically reconstructed when these data change.

All proposed methods have been extensively evaluated on large real-life data collections. The results have been documented in a number of published papers, including two in JCR-listed journals, Information Sciences and Information Fusion. Information on further publications can be found here.
