kjam's blog

AI Risk and Threat Taxonomies

katharine — Tue, 05 Aug 2025 00:00:00 +0200

It seems like every week my LinkedIn feed is filled with new just released AI risk taxonomies, threat models or AI governance handbooks. Usually these taxonomies come from governance consultants or standards authorities and are a great reference for understanding the wide variety of risks AI systems¹ bring with …

Algorithmic-based Guardrails: External guardrail models and alignment methods

katharine — Mon, 28 Jul 2025 00:00:00 +0200

You've probably at some point heard the term "guardrails" when talking about security or safety in AI systems like LLMs or multi-modal models (i.e. models that include and produce multiple modalities, like speech and image, videos, image and text).

Are you a visual learner? There's a YouTube video for …

Blocking AI/ML Memorization with Software Guardrails

katharine — Fri, 11 Jul 2025 00:00:00 +0200

One common way to control memorization in today's deep learning systems is to fix the problem by building software around it. This software can also be used to deal with other undesired behavior, like producing hate speech or mentioning criminal activities.

Are you a visual learner? There's a YouTube video …

Defining Privacy Attacks in AI and ML

katharine — Thu, 12 Jun 2025 00:00:00 +0200

In this article series, you've been able to investigate memorization in AI/deep learning systems -- often via interesting attack vectors. In security modeling, it's useful to explicitly define the threats you are defending against, so you can both discuss and address them and compare potential interventions.

Prefer to learn by …

Priveedly: your private and personal content reader and recommender

katharine — Thu, 23 Jan 2025 00:00:00 +0100

I'm excited to open-source a project that I've been using for the past 2 and a half years: a private/personal reader and recommender.

It works with:

and comes with an example Jupyter Notebook for training your own text-based recommendation model once you have …

Adversarial Examples Demonstrate Memorization Properties

katharine — Wed, 15 Jan 2025 00:00:00 +0100

In this article, the last in the problem exploration section of the series, you'll explore adversarial machine learning - or how to trick a deep learning system.

Adversarial examples demonstrate a different way to look at deep learning memorization and generalization. They can show us how important the learned decision space …

Differential Privacy as a Counterexample to AI/ML Memorization

katharine — Thu, 02 Jan 2025 00:00:00 +0100

At this point in reading the article series on AI/ML memorization you might be wondering, how did the field get so far without addressing the memorization problem? How did seminal papers like Zhang et al's Understanding Deep Learning Requires Rethinking Generalization not fundamentally change machine learning research? And maybe …

How Memorization Happens: Overparametrized Models

katharine — Wed, 18 Dec 2024 00:00:00 +0100

You've heard claims that we will "run out of data" to train AI systems. Why is that? In this article in the series on machine learning memorization you'll explore model size as a factor in memorization and the trend for bigger models as a general problem in machine learning.

Prefer …

How memorization happens: Novelty

katharine — Mon, 09 Dec 2024 00:00:00 +0100

So far in this series on memorization in deep learning, you've learned how massively repeated text and images incentivize training data memorization, but that's not the only training data that machine learning models memorize. Let's take a look at another proven memorization: novel examples.

Prefer to learn by video? This …

How memorization happens: Repetition

katharine — Tue, 03 Dec 2024 00:00:00 +0100

In this article in the deep learning memorization series, you'll learn how one part of memorization happens -- highly repeated data from the "head" of the long-tailed distribution.

Prefer to learn by video? This post is summarized on Probably Private's YouTube.

Recall from the data collection article that some examples are …

Gaming Evaluation - The evolution of deep learning training and evaluation

katharine — Tue, 26 Nov 2024 00:00:00 +0100

In this article in the series on machine learning memorization, you'll dive deeper into how typical machine learning training and evaluation happens, a crucial step in ensuring the machine learning model actually "learns" something. Let's review the steps that lead up to training a deep learning model.

High-level steps to …

Exploring new meadows

katharine — Wed, 20 Nov 2024 00:00:00 +0100

Hello!

We may not know each other, but here you are on my website -- perhaps because you saw a post or someone shared a link. I'm resourceful, determined, intelligent and looking for new challenges. Welcome!

Wenn Deutsch einfacher ist, schreiben Sie mir bitte per Email (katharine at kjamistan punkt com …

Private and Personalized AI

katharine — Tue, 19 Nov 2024 00:00:00 +0100

I recently had the wonderful experience of keynoting PyData Paris, thanks again for the invite! When deciding on a topic, I was considering my recent research about how AI/ML systems memorize data. As I've mentioned in a few talks, if we indeed embraced the fact that machine learning systems …

Encodings and embeddings: How does data get into machine learning systems?

katharine — Mon, 18 Nov 2024 00:00:00 +0100

In this series, you've learned a bit about how data is collected for machine learning, but what happens next? You need to turn the collected data -- images, text, video, audio or even just a spreadsheet -- into numbers that can be learned by a model. How does this happen?

TLDR (too …

Machine Learning dataset distributions, history, and biases

katharine — Wed, 13 Nov 2024 00:00:00 +0100

You probably are already aware that many machine learning datasets come from scraped internet data. Maybe you received the infamous GPT response: "Please note that my knowledge is limited to information available up until September 2021." You might have also read fear-mongering opinions and articles that companies will "run out …

Deep learning memorization, and why you should care

katharine — Mon, 04 Nov 2024 00:00:00 +0100

When's the last time that ChatGPT parroted someone else's words to you? Or the last time a diffusion model you used recreated someone's art, someone's photo, someone's face? Has Copilot given you someone else's code without permission or attribution? If this happened, how would you know for sure?

In this …

A Deep Dive into Memorization in Deep Learning

katharine — Sun, 03 Nov 2024 00:00:00 +0100

Want to learn more about how, when and why machine learning, particularly deep learning systems memorize data? By studying memorization, you'll learn more about how machine learning systems really function, along with how privacy works from a technical point-of-view. You'll also be better able to decide how, when and where …

Building a Privacy-First Newsletter

katharine — Sun, 12 Mar 2023 09:00:00 +0100

Building a newsletter is a fairly common activity these days, with many creators, writers and thinkers making part of their living via subscribers willing to give small amounts of money out per year or month to get exclusive access. Beyond the paid subscriptions, there's an increasing demand for free, or …

Joining Dropout Labs!

katharine — Sat, 23 Nov 2019 00:00:00 +0100

After months of searching, lots of fun (and some less fun) interviews and hours of self-reflection, I am excited to announce I am the new Head of Product at Dropout Labs! 🎉

The interview and decision process was quite iterative and disruptive! I am somewhat to blame for this as I …

Let's Get Together: More Details on Me, You and My Dream Gig

katharine — Thu, 06 Jun 2019 00:00:00 +0200

Hello!

Here's more about me, in case it is news to you:

[About Me]

Co-founder of …

Adversarial Learning for Good: My Talk at #34c3 on Deep Learning Blindspots

katharine — Thu, 28 Dec 2017 00:00:00 +0100

When I first was introduced to the idea of adversarial learning for security purposes by Clarence Chio's 2016 DEF CON talk and his related open-source library deep-pwning, I immediately started wondering about applications of the field to both make robust and well-tested models, but also as a preventative measure against …

Towards Interpretable Reliable Models

katharine — Sun, 29 Oct 2017 00:00:00 +0200

I presented a keynote at PyData Warsaw on moving toward interpretable reliable models. The talk was inspired by some of the work I admire in the field as well as a fear that if we do not address interpretable models as a community, we will be factors in our own …

GDPR & You: My Talk at Cloudera Sessions München

katharine — Wed, 11 Oct 2017 00:00:00 +0200

Unless you have been avoiding all news, you have likely heard of the coming changes in European privacy regulations which go into effect in May 2018. The changes are covered under the General Data Privacy Regulation Directive, whose final text was made available in May 2016.

I presented a talk …

Algorithmic Art and "Künstliche Kunst"

katharine — Sat, 07 Oct 2017 00:00:00 +0200

I was invited to give a talk at 404 Dublin, a really cool conference joining community groups w/ tech folks and art installations. When thinking of what topics might be of interest to the audience, I selfishly went to one of my (side) passions.. following artists who are doing amazing …

Comparing scikit-learn Text Classifiers on a Fake News Dataset

katharine — Mon, 28 Aug 2017 00:00:00 +0200

Finding ways to determine fake news from real news is a challenge most Natural Language Processing folks I meet and chat with want to solve. There is significant difficulty in doing this properly and without penalizing real news sources.

I was discussing this problem with Miguel Martinez-Alvarez on my last …

Data Unit Testing: EuroPython Tutorial

katharine — Fri, 14 Jul 2017 00:00:00 +0200

I gave a long and opinionated tutorial at EuroPython 2017 about how we should do unit testing and validation within a data science scope. The GitHub repository for the course (which is part of my O'Reilly Live Online training) is https://github.com/kjam/data-cleaning-101. I will continue editing and …

if Ethics is not None

katharine — Fri, 14 Jul 2017 00:00:00 +0200

This past Wednesday, I had the pleasure of giving a keynote at EuroPython 2017. I covered a historical view of ethics in computing. The slides are shared here, but it was also recorded so I will post a video when it is available. (Updated: video added!)

In addition, a series …

Practical Data Cleaning with Python Resources

katharine — Wed, 03 May 2017 00:00:00 +0200

Practical Data Cleaning Resources

(O'Reilly Live Online Training)

This week I will be giving my first O'Reilly Live Online Training via the Safari platform. I'm pretty excited to share some of my favorite data cleaning libraries and tips for validating and testing your data workflows.

This post hopes to be …

PyData Amsterdam Keynote on Ethical Machine Learning

katharine — Fri, 07 Apr 2017 00:00:00 +0200

I was kindly asked by the PyData Amsterdam organizers to keynote the conference. As a passionate fan of ethical machine learning and the great research being done by data scientists and academics around the world -- I am very enthused to present the topic to the conference.

My slides are currently …

Ten Tips for First-Time Conference Speakers

katharine — Sat, 11 Feb 2017 00:00:00 +0100

The saddest moment for me at conferences is when I'm in the middle of an interesting conversation with a bright person and I ask her when her talk is and she says, "Who me?"

The number of folks I speak with every year at conferences who have amazing stories to …

The Practice of Programming: 18 Years Later

katharine — Fri, 20 Jan 2017 00:00:00 +0100

Over the new year holiday time I had a chance to get away from it all, and snuck up to Finland to sit in a lodge on the Gulf of Finland, sip coffee, take saunas and read. I brought along a few books, the only programming one being Brian W …

New O'Reilly Video Training: Data Pipelines with Python

katharine — Tue, 13 Dec 2016 00:00:00 +0100

I'm really excited to announce a new Python video course with O'Reilly on data pipelines. If you are interested in learning some of the popular options available for workflow automation and management in Python, take a look!

In the course, I cover:

Using Celery for simple automation
Setting up Hadoop …

DAGs & Dask: How and When to Accelerate your Data Analysis

katharine — Sat, 29 Oct 2016 00:00:00 +0200

I gave a talk about Directed Acyclic Graphs (DAGs) and Dask at PyConCZ 2016. It was super fun and I had a great time at the conference. If you want to read my slides below, here they are! There will be videos available later, so I'll post the link / video …

Introduction to Data Wrangling @ PyConCZ

katharine — Sat, 29 Oct 2016 00:00:00 +0200

PyConCZ 2016 was such a fun conference! First off, it was the first time I got to see Jackie Kazil since we started writing our O'Reilly book Data Wrangling with Python together, HOORAYYYY!

OMG PYTHONISTAS! @JackieKazil & I are together for the first time since we started the @OReillyMedia Data Wrangling …

Chatbot Scraper: Europarl Scraper: 24 Languages of Politics, at your fingertips

katharine — Thu, 20 Oct 2016 00:00:00 +0200

I participated in a two-day PyDataBerlin Hackathon event in early-October and decided to build a scraper for European Parliament. This was after I found the Europarl parallel corpus a bit underwhelming as it is messy and not tagged for party, speakers or topic (this is understandable, as it is primarily …

Chatbot Scraper: Using (today's) IRC logs as your NLP datasets

katharine — Thu, 29 Sep 2016 00:00:00 +0200

I dunno about you, but I often find myself bored with NLP (natural language processing) datasets. Too often they are older, based around something that is not particularly interesting to me or something I've analyzed or used before.

For me, IRC has often been a source of community, fun, sometimes …

Automating your Data Cleanup with Python

katharine — Sat, 17 Sep 2016 00:00:00 +0200

I gave a talk at PyCon UK 2016 on automating your data cleanup with Python. I want to again thank the organizers for having me and thank the folks who attended. If you have any questions or are interested in talking about data cleaning problems, feel free to reach out …

Embedded *isms in Vector-Based Natural Language Processing

katharine — Fri, 16 Sep 2016 00:00:00 +0200

You may have read recently about machine learning's bias problem particularly in word embeddings and vectors. It's a massive problem. If you are using word embeddings to generate associative words, phrases or to do comparisons, you should be aware of the biases you are introducing into your work. In preparation …

Obligatory Women In Tech Post

katharine — Fri, 16 Sep 2016 00:00:00 +0200

Question: How does it feel to be a woman in tech?

Answer:

via GIPHY

see also: OG PyLadies Interview

I Hate You, NLP ;)

katharine — Thu, 21 Jul 2016 00:00:00 +0200

"I had a great time talking about Sentiment Analysis and Natural Language processing at EuroPython 2016. Here are my slides for your review, feel free to reach out on Twitter or email if you'd like to chat further about NLP, machine learning and sentiment. I look forward to starting more …

Python Flight Search

katharine — Tue, 29 Mar 2016 00:00:00 +0200

Like many people, I enjoy travel. With family and friends all across the United States and a home base in Berlin, it's fairly easy to find a reason to travel -- either globally or within the EU. That said, what I find more difficult is to determine what's the best way …

Data Wrangling with Python Course

katharine — Mon, 29 Feb 2016 00:00:00 +0100

I'll be in New York on July 13th and 14th, teaching how to "big data" with Python. We'll cover Pandas, Hadoop, PySpark and more on automation, acquisition and managing your data.

Next Course: New York City, July 13-14

Tickets are available on Eventbrite with a special Early Bird and Student …

Data Wrangling with Python

katharine — Sun, 01 Nov 2015 00:00:00 +0100

Just a quick note that my book: Data Wrangling with Python is available for prepurchase on Amazon as well as in early release on O'Reilly's web site.

Pick up a copy for less than full amount now. I'll be posting some examples of problems we work through in the book …

Europython 2015

katharine — Thu, 23 Jul 2015 00:00:00 +0200

Introduction to Data Analysis Tutorial

Want to learn how to analyze data using Python? If you're at #europycon you should drop by my course! If not, watch the video online later today (will post link!)