Practical Data Cleaning with Python Resources

Posted on Wed 03 May 2017 in trainings

Practical Data Cleaning Resources

(O'Reilly Live Online Training)

This week I will be giving my first O'Reilly Live Online Training via the Safari platform. I'm pretty excited to share some of my favorite data cleaning libraries and tips for validating and testing your data workflows.

This post is meant as a resource for those attending the class, as well as for anyone interested in practical data cleaning with Python. If you have tips, ideas for extra content, or links to add, feel free to comment or reach out via Twitter or email.

Hope you enjoy!

Libraries / Repositories

  • Course Repository:


  • Dedupe:
  • CSV Dedupe:
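Dedupe does full statistical record linkage, but exact and near-exact duplicates can often be caught first with a normalized key. A minimal stdlib sketch of that idea (the function names here are my own, not part of the Dedupe API):

```python
import re

def normalize(record):
    """Build a crude matching key: lowercase, strip punctuation and whitespace."""
    return tuple(re.sub(r"[^a-z0-9]", "", str(field).lower()) for field in record)

def find_duplicates(records):
    """Group records that share the same normalized key."""
    seen = {}
    for rec in records:
        seen.setdefault(normalize(rec), []).append(rec)
    return [group for group in seen.values() if len(group) > 1]

rows = [
    ("Jane Doe", "123 Main St."),
    ("jane doe", "123 main st"),
    ("John Smith", "9 Elm Ave"),
]
dupes = find_duplicates(rows)  # the two "Jane Doe" rows collapse to one group
```

Anything this simple key misses (typos, reordered tokens) is where Dedupe's learned similarity models earn their keep.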

String Matching

  • Fuzzy Wuzzy:
  • TextaCy:
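To show the shape of fuzzy string matching without any third-party installs, here is a stdlib sketch using `difflib` (which uses Ratcliff/Obershelp matching; FuzzyWuzzy uses Levenshtein distance, so scores will differ, but the 0–100 scoring idea is the same):

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Return a 0-100 similarity score, roughly analogous to fuzz.ratio."""
    return int(round(SequenceMatcher(None, a.lower(), b.lower()).ratio() * 100))

score = similarity("New York City", "new york city")  # identical after lowercasing
```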

Managing Nulls

  • Pandas functions:
  • Dora:
  • Badfish:
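The core pandas null-handling functions referenced above are `isnull`, `dropna`, and `fillna`; a small sketch with made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "city": ["Berlin", None, "Paris"],
    "temp": [20.5, np.nan, 18.0],
})

null_counts = df.isnull().sum()   # count missing values per column
dropped = df.dropna()             # drop any row containing a null
filled = df.fillna({              # or fill per-column with sensible defaults
    "city": "unknown",
    "temp": df["temp"].mean(),
})
```

Whether to drop or fill (and with what) is a judgment call per column, which is exactly what libraries like Dora and Badfish help you reason about.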

Normalization & Preprocessing

  • Scikit-learn preprocessing:
  • Pandas stats:
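The workhorse of scikit-learn's preprocessing module is standardization to zero mean and unit variance; a NumPy-only sketch of what `StandardScaler` does per column:

```python
import numpy as np

def zscore(column):
    """Standardize a column: subtract the mean, divide by the std deviation."""
    col = np.asarray(column, dtype=float)
    return (col - col.mean()) / col.std()

scaled = zscore([10.0, 20.0, 30.0])  # mean ~0, std ~1 after scaling
```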

Specific data cleaning topics

  • Privacy?
  • Measurements?
  • Versioning ML Data?
  • Dates?
  • AutoClean?
  • DIY Parser?

Simple pipelines / graphs, task processing

  • Dask:
  • Distributed:
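A cleaning pipeline is just a graph of load → clean → summarize steps. Here is a stdlib sketch of that shape using `concurrent.futures` (the step functions are hypothetical); Dask's `delayed` wraps steps like these into a lazy graph and runs independent ones in parallel across cores or a cluster:

```python
from concurrent.futures import ThreadPoolExecutor

def load():
    return [1, 2, 3, None, 4]

def clean(rows):
    return [r for r in rows if r is not None]

def total(rows):
    return sum(rows)

# Each step is submitted as a task; here they run sequentially because
# each depends on the previous one's result.
with ThreadPoolExecutor() as pool:
    rows = pool.submit(load).result()
    cleaned = pool.submit(clean, rows).result()
    result = pool.submit(total, cleaned).result()
```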

Schema Validation

  • Voluptuous:
  • Validr:
  • With Serialization:
  • For JVM / Apache:
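The idea behind all of these libraries is the same: declare the shape your records must have, then collect violations instead of letting bad rows flow downstream. A tiny stdlib sketch in the spirit of Voluptuous (this is my own toy validator, not the Voluptuous API):

```python
def validate(schema, record):
    """Check a record against a {key: expected_type} schema, returning errors."""
    errors = []
    for key, expected_type in schema.items():
        if key not in record:
            errors.append(f"missing key: {key}")
        elif not isinstance(record[key], expected_type):
            errors.append(f"{key}: expected {expected_type.__name__}")
    return errors

schema = {"user_id": int, "email": str}
errors = validate(schema, {"user_id": "42", "email": "a@example.com"})
```

Real schema libraries add nested schemas, coercion, and custom validators on top of this core loop.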

Dataframe Validation

  • Engarde:
  • Validada:
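Engarde's approach is decorator-based: wrap the functions that produce DataFrames so bad output fails loudly at the source. A hypothetical re-implementation of that pattern (not Engarde's actual code), assuming pandas:

```python
import functools
import pandas as pd

def none_missing(func):
    """Fail fast if the wrapped function returns a DataFrame with nulls."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        df = func(*args, **kwargs)
        assert not df.isnull().any().any(), "DataFrame contains missing values"
        return df
    return wrapper

@none_missing
def load_clean():
    # A stand-in for your real loading/cleaning step.
    return pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0]})

df = load_clean()  # passes the check; a null anywhere would raise
```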

Constraint Detection

  • TDDA: Test-Driven Data Analysis:
  • SciPy:

Property-based Testing

  • Hypothesis:
  • Haskell's Quickcheck:
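Property-based testing asserts invariants over many generated inputs instead of hand-picked examples. A naive stdlib sketch of the idea (the function under test is hypothetical); Hypothesis generates inputs far more cleverly and shrinks failures to minimal cases:

```python
import random

def clean_whitespace(s):
    """Function under test: collapse runs of whitespace to single spaces."""
    return " ".join(s.split())

# Property: cleaning twice equals cleaning once (idempotence).
random.seed(0)
alphabet = "ab \t\n"
for _ in range(200):
    s = "".join(random.choice(alphabet) for _ in range(20))
    assert clean_whitespace(clean_whitespace(s)) == clean_whitespace(s)
```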

More Validation and Testing

  • Model Cross Validation:
  • Testing ML features:
  • Built-in Stats:

Unit Testing Basics

  • PyTest:
  • Mocking:
  • Faking Data with Faker:
  • Faker CSVs:
  • Watch Ned Batchelder’s testing talk
  • Continuous Integration: TravisCI, Jenkins, TeamCity and many more
  • Better Code Reviews:
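For cleaning code, the unit-testing basics look like this: small pure functions, plus `test_*` functions that pytest collects and runs automatically (the function under test here is a hypothetical example, not from the course repo):

```python
# test_cleaning.py -- run with: pytest test_cleaning.py

def strip_currency(value):
    """Parse a currency string like '$1,234.50' into a float."""
    return float(value.replace("$", "").replace(",", ""))

def test_strip_currency():
    assert strip_currency("$1,234.50") == 1234.5

def test_plain_number():
    assert strip_currency("99") == 99.0
```

Wire tests like these into CI (TravisCI, Jenkins, etc.) so every push re-validates your cleaning logic.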

Testing Pipelines

  • Data Quality Checks with Spark DataFrames
  • Drunken Data Quality (Spark DF):
  • Apache Beam:
  • Tip: Check your framework first!

Open Datasets (to try out your skills!)


That's all for now! Check back as I plan to update and evolve this list with more libraries and examples.