Practical Data Cleaning with Python Resources

(O'Reilly Live Online Training)

This week I will be giving my first O'Reilly Live Online Training via the Safari platform. I'm pretty excited to share some of my favorite data cleaning libraries and tips for validating and testing your data workflows.

This post is meant as a resource for those attending the class, and for anyone else interested in practical data cleaning with Python. If you have tips, ideas for extra content, or links to add, feel free to comment or reach out via Twitter or email.

Hope you enjoy!

Libraries / Repositories

Deduplication

String Matching

Managing Nulls

Normalization & Preprocessing

Specific data cleaning topics

Simple pipelines / graphs, task processing

Schema Validation

Dataframe Validation

Constraint Detection

Property-based Testing

More Validation and Testing

Unit Testing Basics

Testing Pipelines

Research

That's all for now! Check back, as I plan to keep updating this list with more libraries and examples.