Detecting and correcting data quality problems using machine learning

ABSTRACT

This synthesis presents the recent evolution of machine learning-based techniques for evaluating and improving data quality in databases. It describes recent solutions, proposed mainly by academia, as well as approaches implemented to detect and correct the main data quality problems, such as outlying, inconsistent or missing data, and duplicates.

AUTHOR

Laure BERTI-ÉQUILLE: Research Director - Development Research Institute - ESPACE-DEV - Montpellier, France

INTRODUCTION

Significant progress has been made in recent years in the design of tools that automate the evaluation, monitoring and improvement of data quality, thanks largely to advances in Artificial Intelligence and, in particular, machine learning (ML). Machine learning techniques have been made operational on a large scale and are now widely deployed across all sectors to automate prediction and classification tasks for decision support in numerous application fields (health, finance, marketing, etc.). However, the reliability of their results remains highly dependent on the quality of the data fed to the learning models. Data is often imperfect, and optimal data quality is rarely achieved. Two complementary approaches are therefore commonly proposed: one, from the data management research community, aims to correct data upstream of analysis pipelines (by cleaning or repairing it); the other, from machine learning researchers and practitioners (data scientists), aims to develop models that are more robust to noise and more efficient, with greater emphasis on transforming and preparing data for a particular predictive task.

For decades, data cleansing in the data management community has consisted of correcting and transforming data using declarative ETL (Extraction-Transformation-Loading) approaches, detecting inconsistencies in relational databases in the form of constraint violations, in order to "repair" them
