In the LAUNDRY project, a data cleaning framework with a special focus on data extracted from the World Wide
Web is designed and implemented as a prototype. Components for structural, semantic, and syntactic normalization,
tokenization, de-duplication, cleaning of inconsistencies, and data fusion are developed. Techniques, methods and
tools for data cleaning are studied in the context of efficiency and performance and used accordingly; moreover, new
techniques for particular components are developed. The advantages of the LAUNDRY system lie in its open,
pluggable, and modular framework, in the interactive generation of cleaning components, and in the data
cleaning extensions that plug directly into the Lixto Suite, a sophisticated software suite for web data
extraction and processing. The LAUNDRY system supports all phases of data cleaning based on efficient
algorithms, and can be extended with new algorithms.
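To illustrate what an open, pluggable framework of cleaning components can look like, the following is a minimal sketch in Python. It is not the LAUNDRY API: the names Pipeline, plug, normalize_whitespace, and drop_exact_duplicates are hypothetical and chosen only for illustration, and each component is assumed to be a function that maps a list of records to a list of records.

```python
from typing import Callable, Dict, List

# Hypothetical types: a record is a flat mapping of field name to string value.
Record = Dict[str, str]
Component = Callable[[List[Record]], List[Record]]


class Pipeline:
    """Chains cleaning components; new ones can be plugged in at any time."""

    def __init__(self) -> None:
        self._components: List[Component] = []

    def plug(self, component: Component) -> "Pipeline":
        # Returning self allows calls to be chained.
        self._components.append(component)
        return self

    def run(self, records: List[Record]) -> List[Record]:
        # Each component receives the output of the previous one.
        for component in self._components:
            records = component(records)
        return records


def normalize_whitespace(records: List[Record]) -> List[Record]:
    """Syntactic normalization: collapse runs of whitespace in every field."""
    return [{k: " ".join(v.split()) for k, v in r.items()} for r in records]


def drop_exact_duplicates(records: List[Record]) -> List[Record]:
    """De-duplication: keep the first occurrence of each identical record."""
    seen = set()
    unique = []
    for r in records:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique


pipeline = Pipeline().plug(normalize_whitespace).plug(drop_exact_duplicates)
cleaned = pipeline.run([
    {"name": "Acme  Phone ", "price": "199"},
    {"name": "Acme Phone", "price": "199"},
])
```

In this sketch the order of the components matters: the two input records only become identical after normalization, so de-duplication is plugged in second.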
The LAUNDRY data cleaning framework will primarily be used for cleaning web data that has been extracted
with the Lixto Suite. The Lixto Suite offers an interactive configuration and runtime environment for data extraction
from the web. In numerous application scenarios, such as competitive intelligence, it has turned out that web data are
very heterogeneous. Hence, besides sophisticated techniques for extraction from semistructured data, methods for
data cleaning, in particular normalization, record linkage, and data fusion, are also required. LAUNDRY
addresses these problems, so that web data become easier and more efficient to use in enterprise
applications such as competitive intelligence scenarios (price comparison, product comparison).
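The interplay of normalization, record linkage, and data fusion in a price comparison scenario can be sketched as follows. This is a deliberately simplified illustration under stated assumptions, not the method used in LAUNDRY: records are linked by exact match on a normalized product name (real record linkage typically uses similarity measures), prices are assumed to contain no thousands separators, and all function names and sample offers are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, List


def normalize_name(name: str) -> str:
    """Normalization: lowercase and replace punctuation runs with one space."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()


def parse_price(text: str) -> float:
    """Extract the first number; accepts ',' or '.' as the decimal separator.

    Assumption: no thousands separators occur in the input.
    """
    match = re.search(r"\d+(?:[.,]\d+)?", text)
    if match is None:
        raise ValueError(f"no price found in {text!r}")
    return float(match.group().replace(",", "."))


def link_records(records: List[Dict[str, str]]) -> Dict[str, List[Dict[str, str]]]:
    """Record linkage (simplified): group records sharing a normalized name."""
    groups: Dict[str, List[Dict[str, str]]] = defaultdict(list)
    for record in records:
        groups[normalize_name(record["name"])].append(record)
    return groups


def fuse(groups: Dict[str, List[Dict[str, str]]]) -> List[Dict[str, object]]:
    """Data fusion: merge each group into one record with the lowest price."""
    fused = []
    for product, members in groups.items():
        prices = [parse_price(m["price"]) for m in members]
        fused.append({
            "product": product,
            "min_price": min(prices),
            "offers": len(members),
        })
    return fused


# Hypothetical offers as they might be extracted from different web shops.
offers = [
    {"name": "ACME Phone-X", "price": "EUR 199,00"},
    {"name": "acme phone x", "price": "189.50 EUR"},
    {"name": "Other Tablet", "price": "EUR 299"},
]
result = fuse(link_records(offers))
```

Here the first two offers differ in case, punctuation, and price format, yet end up in the same group after normalization and are fused into a single product entry.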