Practical Data Cleaning with Python Resources
Posted on Wed 03 May 2017 in trainings
Practical Data Cleaning Resources
(O'Reilly Live Online Training)
This week I will be giving my first O'Reilly Live Online Training via the Safari platform. I'm pretty excited to share some of my favorite data cleaning libraries and tips for validating and testing your data workflows.
This post is meant as a resource for those attending the class, and for anyone else interested in practical data cleaning with Python. If you have tips, ideas for extra content, or links to add, feel free to comment or reach out via Twitter or email.
Hope you enjoy!
Libraries / Repositories
- Course Repository: https://github.com/kjam/data-cleaning-101
Deduplication
- Dedupe: https://github.com/dedupeio/dedupe
- CSV Dedupe: https://github.com/dedupeio/csvdedupe
String Matching
- Fuzzy Wuzzy: https://github.com/seatgeek/fuzzywuzzy
- TextaCy: https://github.com/chartbeat-labs/textacy
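
If you want a feel for fuzzy string matching before installing anything, here is a minimal sketch using the standard library's difflib as a stand-in. This is not the Fuzzy Wuzzy API; the canonical list, the messy values, the helper names and the 0.8 cutoff are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher, get_close_matches


def normalize(text):
    """Lowercase and collapse runs of whitespace before comparing."""
    return " ".join(text.lower().split())


def similarity(a, b):
    """Return a similarity ratio between 0 and 1 for two strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def best_match(value, choices, cutoff=0.8):
    """Return the closest canonical choice, or None if nothing clears the cutoff."""
    lookup = {normalize(choice): choice for choice in choices}
    hits = get_close_matches(normalize(value), list(lookup), n=1, cutoff=cutoff)
    return lookup[hits[0]] if hits else None


# Hypothetical data: a canonical list and some messy user-entered values.
canonical = ["New York", "Los Angeles", "San Francisco"]
messy = ["new york ", "Los Angelos", "San Fransisco", "Berlin"]

cleaned = {value: best_match(value, canonical) for value in messy}
for raw, match in cleaned.items():
    print(repr(raw), "->", match)
```

Values with no close match come back as None, so they can be reviewed by hand rather than silently mapped to the wrong thing. The cutoff is worth tuning on a sample of your own data.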
Managing Nulls
- Pandas functions: http://pandas.pydata.org/pandas-docs/stable/missing_data.html
- Dora: https://github.com/NathanEpstein/Dora
- Badfish: https://github.com/harshnisar/badfish
Normalization & Preprocessing
- Scikit-learn preprocessing: http://scikit-learn.org/stable/modules/preprocessing.html
- Pandas stats: http://pandas.pydata.org/pandas-docs/stable/basics.html#descriptive-statistics
Specific data cleaning topics
- Privacy? https://github.com/datascopeanalytics/scrubadub
- Measurements? http://pint.readthedocs.io/
- Versioning ML Data? https://github.com/NathanEpstein/Dora
- Dates? http://arrow.readthedocs.io/en/latest/ or https://github.com/kennethreitz/maya
- AutoClean? https://github.com/rhiever/datacleaner
- DIY Parser? https://github.com/datamade/parserator
Simple pipelines / graphs, task processing
- Dask: https://github.com/dask/dask
- Distributed: https://github.com/dask/distributed
Schema Validation
- Voluptuous: https://github.com/alecthomas/voluptuous
- Validr: https://github.com/guyskk/validr
- With Serialization: https://marshmallow.readthedocs.io/en/latest/
- For JVM / Apache: https://avro.apache.org/
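
To show the basic idea behind schema validation, here is a hand-rolled sketch in plain Python. It is not how Voluptuous, Validr or marshmallow work internally, and the schema format (field name mapped to a type and a required flag) is an assumption I made up for this example. The libraries above handle far more cases, such as nested data, coercion and custom validators.

```python
def validate(record, schema):
    """Return a list of error strings; an empty list means the record passed."""
    errors = []
    for field, (expected_type, required) in schema.items():
        value = record.get(field)
        if value is None:
            # Missing keys and explicit None are treated the same way here.
            if required:
                errors.append(f"{field}: missing")
            continue
        if not isinstance(value, expected_type):
            errors.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(value).__name__}"
            )
    return errors


# Hypothetical schema: field name -> (expected type, required?)
schema = {"id": (int, True), "name": (str, True), "age": (int, False)}

good = {"id": 1, "name": "Ada", "age": 36}
bad = {"id": "1", "age": None}

print(validate(good, schema))
print(validate(bad, schema))
```

Collecting every error instead of raising on the first one makes it easier to report on a whole batch of bad rows at once.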
Dataframe Validation
- Engarde: https://github.com/TomAugspurger/engarde
- Validada: https://github.com/jnmclarty/validada
Constraint Detection
- TDDA: Test-Driven Data Analysis: https://github.com/tdda/tdda
- SciPy: https://docs.scipy.org/doc/scipy-0.19.0/reference/stats.html#statistical-functions
Property-based Testing
- Hypothesis: https://hypothesis.readthedocs.io/
- Haskell's QuickCheck: https://hackage.haskell.org/package/QuickCheck
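
The idea behind property-based testing is to assert properties that should hold for any input, then throw many generated inputs at the code. Below is a rough sketch of that idea using only the random module. It is a stand-in, not Hypothesis: there is no shrinking of failing examples and no smart input strategies. The normalize_whitespace function and the chosen properties are assumptions for illustration.

```python
import random
import string


def normalize_whitespace(text):
    """Collapse any run of whitespace to one space and trim the ends."""
    return " ".join(text.split())


def random_text(rng, max_len=30):
    """Generate a random string of letters mixed with spaces, tabs and newlines."""
    alphabet = string.ascii_letters + " \t\n"
    return "".join(rng.choice(alphabet) for _ in range(rng.randint(0, max_len)))


def check_properties(trials=500, seed=0):
    """Run the property checks on generated inputs; return the number of trials."""
    rng = random.Random(seed)  # fixed seed keeps failures reproducible
    for _ in range(trials):
        raw = random_text(rng)
        once = normalize_whitespace(raw)
        # Property 1: cleaning already-clean text changes nothing (idempotence).
        assert normalize_whitespace(once) == once
        # Property 2: no leading or trailing whitespace, no doubled spaces.
        assert once == once.strip()
        assert "  " not in once
        # Property 3: no non-whitespace characters are lost or reordered.
        assert once.replace(" ", "") == "".join(raw.split())
    return trials


print(check_properties(), "random inputs passed")
```

With Hypothesis, the generator and the loop would be replaced by its strategies and test decorator, and a failing input would be reduced to a minimal example for you.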
More Validation and Testing
- Model Cross Validation: http://scikit-learn.org/stable/modules/cross_validation.html
- Testing ML features: https://github.com/machinalis/featureforge
- Built-in Stats: https://docs.python.org/3/library/statistics.html
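
As a small example of what the built-in statistics module can do for validation, here is a sketch that flags outliers with a modified z-score based on the median absolute deviation. The readings, the function name and the 3.5 threshold are assumptions for illustration; the right threshold depends on your data.

```python
import statistics


def flag_outliers(values, threshold=3.5):
    """Return values whose modified z-score exceeds the threshold."""
    med = statistics.median(values)
    # Median absolute deviation: a spread measure that outliers barely affect.
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        # Over half the values are identical; this score is undefined.
        return []
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


# Hypothetical sensor readings with one suspicious value.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 55.0, 10.2]
print(flag_outliers(readings))
```

I use the median here rather than the mean and standard deviation because a single extreme value can inflate the standard deviation enough to hide itself.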
Unit Testing Basics
- PyTest: https://docs.pytest.org/en/latest/
- Mocking: https://docs.python.org/3/library/unittest.mock-examples.html
- Faking Data with Faker: https://faker.readthedocs.io/en/master/
- Faker CSVs: https://github.com/pereorga/csvfaker
- Watch Ned Batchelder’s testing talk
- Continuous Integration: TravisCI, Jenkins, TeamCity and many more
- Better Code Reviews: http://www.bettercode.reviews/
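
To make the mocking item above a bit more concrete, here is a minimal sketch of testing a cleaning function without touching a real data source. The clean_emails function and the row format are hypothetical; the point is only that a Mock from unittest.mock can stand in for whatever normally fetches the rows.

```python
from unittest.mock import Mock


def clean_emails(fetch_rows):
    """Strip, lowercase, and drop empty or missing emails from fetched rows."""
    cleaned = []
    for row in fetch_rows():
        email = (row.get("email") or "").strip().lower()
        if email:
            cleaned.append(email)
    return cleaned


# The Mock replaces a real database or API call with canned, messy rows.
fake_fetch = Mock(
    return_value=[
        {"email": " Ada@Example.com "},
        {"email": None},
        {"name": "row without an email key"},
        {"email": "bob@example.com"},
    ]
)

result = clean_emails(fake_fetch)
fake_fetch.assert_called_once_with()  # the source was queried exactly once
print(result)
```

Passing the fetcher in as an argument keeps the example simple. For code where the dependency is imported rather than passed in, the mock examples in the Python docs linked above show how to use patch instead.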
Testing Pipelines
- Data Quality Checks with Spark DataFrames
- Drunken Data Quality (Spark DF): https://github.com/FRosner/drunken-data-quality
- Apache Beam: https://beam.apache.org/documentation/pipelines/test-your-pipeline/
- Tip: Check your framework first!
Open Datasets (to try out your skills!)
- Kaggle Datasets: beyond just competition data, Kaggle also has shared datasets curated by users.
- Awesome Datasets GitHub List
- Quora: Where can I find large public datasets?
- Scikit-learn datasets
- Dataquest.io: 17 places to find open datasets for projects
- NLTK Data: NLP data such as books, scripts, articles and poems
Research
- Towards Reliable Interactive Data Cleaning: A User Survey and Recommendations, S. Krishnan, D. Haas, M. J. Franklin, 2016
- Continuous Data Cleaning, M. Volkovs, F. Chiang, J. Szlichta, and R. J. Miller. ICDE, 2014
- ActiveClean: Krishnan, Franklin, Goldberg, Wang, Wu, 2016
- Katara: A Data Cleaning System Powered by Knowledge Bases and Crowdsourcing, X. Chu, Morcos, Ilyas et al., 2015
- Test-driven Evaluation of Linked Data Quality, D. Kontokostas, P. Westphal, S. Auer, S. Hellmann, J. Lehmann, R. Cornelissen, and A. Zaveri, 2015
- Unsupervised Anomaly Detection with Generative Adversarial Networks to Guide Marker Discovery, T. Schlegl, P. Seeböck, S. M. Waldstein, U. Schmidt-Erfurth, and G. Langs, 2017
That's all for now! Check back later, as I plan to update and expand this list with more libraries and examples.