Overview
ABSTRACT
This synthesis presents the recent evolution of machine learning-based techniques for evaluating and improving data quality in databases. It describes recent solutions, proposed mainly by academia, and approaches implemented to detect and correct the main data quality problems, such as outliers, inconsistent or missing data, and duplicates.
AUTHOR
Laure BERTI-ÉQUILLE: Research Director - Development Research Institute - ESPACE-DEV - Montpellier, France
INTRODUCTION
Significant progress has been made in recent years in the design of tools that automate the evaluation, monitoring and improvement of data quality, driven largely by advances in Artificial Intelligence and, in particular, machine learning (ML). Machine learning techniques are now operational at large scale and widely deployed across all sectors of activity to automate prediction and classification tasks for decision support in many application fields (health, finance, marketing, etc.). However, the reliability of their results remains highly dependent on the quality of the data fed to the learning models. Data is often imperfect, and optimal data quality is rarely achieved. Two complementary approaches are therefore commonly proposed: one, from the data management research community, aims to correct data upstream of the analysis pipeline (by cleaning or repairing it); the other, from the community of machine learning researchers and practitioners (data scientists), aims to develop models that are more efficient and more robust to noise, with greater emphasis on transforming and preparing data for a particular predictive task.
For decades, the data management community has treated data cleansing as correcting and transforming data using declarative ETL (Extraction-Transformation-Loading) approaches, detecting inconsistencies in relational databases in the form of constraint violations, to "repair" them
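As an illustration of detecting inconsistencies as constraint violations, the sketch below checks a functional dependency (here, a postal code determining a city) over a small in-memory table and then applies a naive majority-vote repair. The function names, column names and sample records are hypothetical, and the majority-vote heuristic is only one simple repair strategy chosen for illustration; it is not presented as the method described in the article.

```python
from collections import Counter, defaultdict


def find_fd_violations(rows, lhs, rhs):
    """Return {lhs value: sorted distinct rhs values} for every lhs value
    that maps to more than one rhs value, i.e. violates lhs -> rhs."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[lhs]].add(row[rhs])
    return {key: sorted(vals) for key, vals in seen.items() if len(vals) > 1}


def repair_by_majority(rows, lhs, rhs):
    """Return a repaired copy of rows in which each rhs value is replaced
    by the most frequent rhs value observed for the same lhs value.
    (Illustrative heuristic only: the majority is assumed to be correct.)"""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[lhs]][row[rhs]] += 1
    return [
        {**row, rhs: counts[row[lhs]].most_common(1)[0][0]}
        for row in rows
    ]


# Hypothetical sample records; one city name is misspelled.
records = [
    {"zip": "34000", "city": "Montpellier"},
    {"zip": "34000", "city": "Montpelier"},
    {"zip": "34000", "city": "Montpellier"},
    {"zip": "75001", "city": "Paris"},
]

violations = find_fd_violations(records, "zip", "city")
repaired = repair_by_majority(records, "zip", "city")
```

In this toy table, the dependency zip -> city is violated for a single postal code, and the repair rewrites the minority spelling to the majority one. Real repair systems must also handle ties, multiple interacting constraints and cases where the majority value is itself wrong.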
KEYWORDS
machine learning | data quality | data science | anomaly detection | data cleaning | data quality management | data repair
Detecting and correcting data quality problems using machine learning
Article included in this offer: "Software technologies and System architectures" (227 articles)