Machine learning on non curated data

Dirty data made easy

Gael Varoquaux

Big Data, Data, Data Science, Machine-Learning, Scientific Libraries (Numpy/Pandas/SciKit/...)

According to industry surveys [1], the number one hassle for data scientists is cleaning data before analyzing it. Textbook statistical modeling is sufficient for noisy signals, but errors of a discrete nature break standard machine-learning tools. I will discuss how to easily run machine learning on data tables with two common dirty-data problems: missing values and non-normalized entries. For both problems, I will show how to run standard machine-learning tools such as scikit-learn in the presence of such errors. The talk will be didactic and will discuss simple software solutions. It will build on the latest improvements to scikit-learn for missing values and on the DirtyCat package [2] for non-normalized entries. I will also summarize theoretical analyses from recent machine-learning publications.

This talk targets data practitioners. Its goal is to help data scientists analyze data with such errors more efficiently and understand the impact of these errors.

With missing values, I will use simple arguments and examples to outline how to obtain asymptotically good predictions [3]. Two components are key: imputation and adding an indicator of missingness. I will explain theoretical guidelines for these, and I will show how to implement these ideas in practice, with scikit-learn as a learner or as a preprocessor.
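As an illustration of the two components named above, here is a minimal sketch that combines imputation with a missingness indicator using scikit-learn's `SimpleImputer(add_indicator=True)`. The synthetic data, the mean strategy, and the `Ridge` learner are assumptions chosen for brevity, not necessarily what the talk uses.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# Synthetic data (an assumption for illustration): 200 samples, 3 features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Remove some entries in the first two columns; the third stays complete
X_missing = X.copy()
X_missing[::5, 0] = np.nan
X_missing[1::7, 1] = np.nan

# Imputation plus indicator: each column that had missing values at fit
# time gets an extra binary column flagging where values were missing
imputer = SimpleImputer(strategy="mean", add_indicator=True)
X_imputed = imputer.fit_transform(X_missing)

# The same preprocessing plugged in front of a standard learner
model = make_pipeline(SimpleImputer(strategy="mean", add_indicator=True), Ridge())
model.fit(X_missing, y)
predictions = model.predict(X_missing)
```

Placing the imputer inside the pipeline means the imputation statistics are learned on the training data only and reused at prediction time.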

For non-normalized categories, I will show that using their string representations to “vectorize” them, that is, to create vector representations, gives a simple but powerful solution that can be plugged into standard statistical analysis tools [4].
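To give a feel for string-based vectorization, the sketch below uses character 3-gram counts from scikit-learn's `CountVectorizer`. This is a stand-in technique chosen to keep the example dependency-free; it is not the DirtyCat encoder itself. The category strings are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical dirty entries: variants of two underlying categories
entries = [
    "Police Officer III",
    "police officer II",
    "Polce Officer",
    "Fire Fighter",
    "firefighter",
]

# Represent each string by the counts of its character 3-grams
vectorizer = CountVectorizer(analyzer="char_wb", ngram_range=(3, 3), lowercase=True)
X = vectorizer.fit_transform(entries)  # sparse matrix, one row per entry

# Entries that are spelled similarly end up with similar vectors
similarity = cosine_similarity(X)
print(similarity.round(2))
```

The resulting matrix `X` can be fed to a standard learner in place of a one-hot encoding, so that variants of the same category share most of their features.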

[1] Kaggle, the state of ML and data science 2017
[3] Josse Julie, Prost Nicolas, Scornet Erwan, and Varoquaux Gaël (2019). “On the consistency of supervised learning with missing values”.
[4] Cerda Patricio, Varoquaux Gaël, and Kégl Balázs. "Similarity encoding for learning with dirty categorical variables." Machine Learning 107.8-10 (2018): 1477

Type: Talk (45 mins); Python level: Intermediate; Domain level: Intermediate

Gaël Varoquaux is an Inria faculty researcher working on data science and brain imaging. He is also a historical contributor to the scientific Python and pydata ecosystems. His academic research focuses on using data and machine learning for scientific inference, applying it to brain-imaging data to understand cognition, as well as developing tools that make it easier for non-specialists to use machine learning. Years before the NSA, he was hoping to make bleeding-edge data processing available across new fields, and he has been working on a mastermind plan building easy-to-use open-source software in Python. He is one of the core developers and originators of scikit-learn, joblib, Mayavi and nilearn, a nominated member of the PSF, and often teaches scientific computing with Python using the scipy lecture notes.