Text summarization based on a new conceptual information retrieval model

G. De Tré (Faculty of Engineering, Department of Telecommunications and Information Processing, Ghent University, Gent, Belgium)    Guy.DeTre >at>

Description: The exponential growth of the World Wide Web means that an enormous amount of data is published every day. This has led to a particular branch of data mining called text mining. One of the problems that text mining deals with is the clustering of a set of documents according to the topic the documents describe. Most existing solutions for clustering texts apply regular data mining approaches by transforming the space of documents into a vector space. However, the vector space model has some important disadvantages, such as high dimensionality and the loss of semantics. In this project, fundamental research will be conducted into a novel model for the representation of documents. This model will be based on the automated parsing of concepts and relations from texts. One advantage of this approach is that the parsing can be done in a context-free way, which makes the use of taxonomies redundant. Considering concepts, rather than words, increases the semantic power of our model. Based on the novel model, methods for determining the topic of a text will be investigated, taking advantage of the rich structure in which documents can be represented. The assignment of topics to documents will then allow for fast and accurate clustering algorithms. In the last stage, the problem of multi-document summarization will be investigated.
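The two disadvantages of the vector space model mentioned above can be illustrated with a minimal sketch. It assumes a plain bag-of-words representation with whitespace tokenization; the example sentences and the helper names `bag_of_words` and `to_vector` are made up for illustration and are not part of the project.

```python
from collections import Counter

def bag_of_words(text):
    """Count the words of a text, ignoring their order (hypothetical helper)."""
    return Counter(text.lower().split())

def to_vector(counts, vocabulary):
    """Lay the counts out along one axis per vocabulary word."""
    return [counts[word] for word in vocabulary]

# Made-up example documents.
docs = [
    "the dog bit the man",
    "the man bit the dog",
    "parliament passed the budget",
]

bags = [bag_of_words(d) for d in docs]
# One dimension per distinct word: the space grows with the vocabulary.
vocabulary = sorted(set().union(*bags))
vectors = [to_vector(b, vocabulary) for b in bags]

print(len(vocabulary))           # number of dimensions for three short texts
print(vectors[0] == vectors[1])  # opposite meanings, same vector
```

The first two sentences describe opposite events but receive the same vector, because the relation between the concepts is discarded. Every new word in the collection also adds a dimension, so a realistic corpus quickly reaches a very large number of dimensions.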
Research goals. The goal of this project is to develop a formal model for co-reference of semi-structured and unstructured entities, starting from a fundamentally new representation model for text that avoids the well-known disadvantages of the vector space model. The problem of text co-reference deals with identifying texts that have the same topic. Note that the term ‘topic’ is interpreted very specifically here: it is the entity that is described in the text. It is thus assumed that the topic of a text is always clearly determined. This assumption is essential for the methodology that is followed. The project does not deal with the classification of texts according to category (politics, sports, ...). A typical application of the considered problem is to cluster (on-line) news articles according to the news they describe.
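To show why a clearly determined topic makes clustering straightforward, the sketch below groups articles by a topic label in a single pass. It assumes the topic of each article has already been determined by some earlier step, which is the actual research question and is not shown here; the function name `cluster_by_topic`, the article identifiers and the topic labels are all hypothetical.

```python
from collections import defaultdict

def cluster_by_topic(labelled_articles):
    """Group article ids by their (already determined) topic entity."""
    clusters = defaultdict(list)
    for article_id, topic in labelled_articles:
        clusters[topic].append(article_id)
    return dict(clusters)

# Made-up input: (article id, topic entity) pairs.
labelled_articles = [
    ("a1", "budget vote"),
    ("a2", "cup final"),
    ("a3", "budget vote"),
]

clusters = cluster_by_topic(labelled_articles)
print(clusters)
```

Here the topics are specific entities (one particular vote, one particular match) rather than broad categories such as politics or sports, in line with the interpretation of ‘topic’ given above.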