<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>kjam's blog</title><link href="https://blog.kjamistan.com/" rel="alternate"></link><link href="https://blog.kjamistan.com/feeds/atom.xml" rel="self"></link><id>https://blog.kjamistan.com/</id><updated>2026-04-07T00:00:00+02:00</updated><entry><title>Using Claude Code with Locally-Hosted models</title><link href="https://blog.kjamistan.com/using-claude-code-with-locally-hosted-models.html" rel="alternate"></link><published>2026-04-07T00:00:00+02:00</published><updated>2026-04-07T00:00:00+02:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2026-04-07:/using-claude-code-with-locally-hosted-models.html</id><summary type="html">&lt;p&gt;I've been exploring privacy and security aspects of AI-assisted coding and also experimenting with those workflows for my own work. In doing so, I've got a pretty robust setup for using Claude Code with both the Anthropic backend and a locally hosted GPU machine in my &lt;a href="https://blog.kjamistan.com/building-out-my-home-ai-lab-for-private-and-local-ai.html"&gt;at home AI lab …&lt;/a&gt;&lt;/p&gt;</summary><content type="html">&lt;p&gt;I've been exploring privacy and security aspects of AI-assisted coding and also experimenting with those workflows for my own work. In doing so, I've got a pretty robust setup for using Claude Code with both the Anthropic backend and a locally hosted GPU machine in my &lt;a href="https://blog.kjamistan.com/building-out-my-home-ai-lab-for-private-and-local-ai.html"&gt;at home AI lab&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I thought it would be useful to put together a guide for others on how to get started using Claude Code with a local-first setup.&lt;/p&gt;
&lt;div class="toc"&gt;&lt;span class="toctitle"&gt;Table of Contents&lt;/span&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#why-though"&gt;Why, though?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#system-setup"&gt;System setup&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#claude-code-intricacies"&gt;Claude Code intricacies&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#change-your-environment-variable-for-anthropic_"&gt;Change your environment variable for ANTHROPIC_*&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#set-the-model-name-when-launching-claude-code"&gt;Set the model name when launching Claude Code&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#other-settings-that-can-come-in-handy"&gt;Other settings that can come in handy&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#model-serving-llamacpp-or-vllm"&gt;Model serving: llama.cpp or vllm?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#model-serving-which-models"&gt;Model serving: which models?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#putting-it-all-together"&gt;Putting it all together&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#privacy-and-security-advice"&gt;Privacy and Security Advice&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#sandboxing-101"&gt;Sandboxing 101&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#synthetic-data-generation"&gt;Synthetic Data Generation&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#fully-synthetic-data"&gt;Fully synthetic data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#statistical-methods-for-synthetic-data"&gt;Statistical methods for synthetic data&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#spying-on-claude-code"&gt;Spying on Claude Code&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;h2 id="why-though"&gt;Why, though?&lt;/h2&gt;
&lt;p&gt;Let's just cover why first, because I'm not here to tell you what model you should or shouldn't use. :) I am here, however, to tell you that Claude uses A LOT more tokens than the other models I'm using locally and takes about the same time to respond.&lt;/p&gt;
&lt;p&gt;Of course, if you are already paying for Claude and you like it, keep doing you! But, if you are curious about local-first design or want to experiment as you are reaching your monthly token-limit, I think it's always a good idea to try out new things and see if they work for you.&lt;/p&gt;
&lt;p&gt;In addition, being a privacy researcher, there are serious privacy benefits to keeping your data local and choosing what data goes into Anthropic (or other) servers. Testing this workflow out might help you determine when sending the data to Anthropic is worth it and when not. :)&lt;/p&gt;
&lt;h2 id="system-setup"&gt;System setup&lt;/h2&gt;
&lt;p&gt;I have a &lt;a href="https://blog.kjamistan.com/building-out-my-home-ai-lab-for-private-and-local-ai.html"&gt;much longer article&lt;/a&gt; and &lt;a href="https://youtu.be/3h_JCBVnHBI"&gt;YouTube video&lt;/a&gt; on getting your local AI lab set up, but you'll need to have a pretty beefy GPU or similar compute if you want your local models to compete with the rapid response and code quality of the cloud-based models.&lt;/p&gt;
&lt;p&gt;For my setup I have 32GB of GPU memory. This means I can load several of the larger quantized models directly onto the GPU without a problem. My favorite coding buddy right now is Qwen-3.5-35B (quantized), but I've also tested out Qwen3-Coder, DeepSeek3 and GLM-Flash. You can have a look on &lt;a href="https://huggingface.co/models?other=code"&gt;HuggingFace&lt;/a&gt; to get an idea of what models might fit on your computer.&lt;/p&gt;
&lt;p&gt;I'm pretty certain that with a smaller card (or chip) you'll run into latency issues or even be unable to use certain models. I don't think that should deter you from giving it a try, if only to build a workflow you can use once you get a bigger machine or when models get even more task-specific and smaller.&lt;/p&gt;
&lt;p&gt;I will also be getting some other chips later this year, looking at you &lt;a href="https://tenstorrent.com/"&gt;Tenstorrent&lt;/a&gt; 😏. These will be cheaper than my current GPU and have about the same amount of memory. The catch will be that the setup might be a bit harder... Wanna get updates on how it goes? Give me a &lt;a href="https://www.youtube.com/@ProbablyPrivate"&gt;follow on YouTube&lt;/a&gt;, &lt;a href="https://probablyprivate.com/"&gt;join my newsletter&lt;/a&gt; or &lt;a href="https://www.linkedin.com/in/katharinejarmul/"&gt;find me on LinkedIn&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="claude-code-intricacies"&gt;Claude Code intricacies&lt;/h2&gt;
&lt;p&gt;The initial catches for getting Claude Code to communicate with a local model were not well documented, so here are the steps:&lt;/p&gt;
&lt;h3 id="change-your-environment-variable-for-anthropic_"&gt;Change your environment variable for ANTHROPIC_*&lt;/h3&gt;
&lt;p&gt;Here's a brief breakdown of the settings you should set, either in your Claude settings.json file or in your environment. I use environment variables because then I can switch models quickly. I point the BASE_URL to my local GPU server instead of localhost, but adjust it for your setup. For example, you can also point it to an ollama server running locally.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="go"&gt;ANTHROPIC_BASE_URL=http://YOUR_IP_ADDRESS:PORT&lt;/span&gt;
&lt;span class="go"&gt;ANTHROPIC_DEFAULT_OPUS_MODEL=my-model&lt;/span&gt;
&lt;span class="go"&gt;ANTHROPIC_DEFAULT_SONNET_MODEL=my-model&lt;/span&gt;
&lt;span class="go"&gt;ANTHROPIC_DEFAULT_HAIKU_MODEL=my-model&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;In most shells, you can do so by running the following:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="go"&gt;export ANTHROPIC_BASE_URL=http://localhost:8000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;And reset back to the Anthropic backend by running:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="go"&gt;unset ANTHROPIC_BASE_URL&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3 id="set-the-model-name-when-launching-claude-code"&gt;Set the model name when launching Claude Code&lt;/h3&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="go"&gt;claude --model qwen35_35&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Note: this is tricky when using models from HuggingFace because Claude Code deliberately doesn't allow slashes in the model name. This means that you also have to set up your GPU server with special model names. More on that in the next section.&lt;/p&gt;
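&lt;p&gt;For example, assuming you've given the model a slash-free alias on the server side (like the &lt;code&gt;glm_4_flash&lt;/code&gt; alias in the llama.cpp config further down), you'd pass that alias to Claude Code instead of the full HuggingFace repo name:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;# the server maps this alias to unsloth/GLM-4.7-Flash-GGUF:UD-Q4_K_XL (see config below)
claude --model glm_4_flash
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
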
&lt;h3 id="other-settings-that-can-come-in-handy"&gt;Other settings that can come in handy&lt;/h3&gt;
&lt;p&gt;You may want to setup specific things on your server, such as thinking through caching&lt;sup id="fnref:1"&gt;&lt;a class="footnote-ref" href="#fn:1"&gt;1&lt;/a&gt;&lt;/sup&gt;, context size and tool call support. Much of this will depend on how you serve, what type of machine you have and what models you want to use.&lt;/p&gt;
&lt;p&gt;That said, Claude Code is VERY verbose, so if you are intent on using the Claude Code interface, I recommend you increase context size so that you have more flexibility. So far, setting context at 131072 tokens (server-side setting) has worked well for me, but for a few longer-running tasks, I've bumped it up to 256000.&lt;/p&gt;
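&lt;p&gt;Where that context size gets set depends on your server. As one example with the llama.cpp server used later in this post, you can pass it on the command line; the same value can also live in the config file as &lt;code&gt;ctx-size&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;./llama.cpp/build/bin/llama-server -m PATH_TO_MODEL.gguf --ctx-size 131072
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
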
&lt;h2 id="model-serving-llamacpp-or-vllm"&gt;Model serving: llama.cpp or vllm?&lt;/h2&gt;
&lt;p&gt;For most of my work with LLMs and VLMs I use vllm (check out &lt;a href="https://www.youtube.com/watch?v=k930Mtf_rLk"&gt;my video on using vllm&lt;/a&gt;). It's easy to get started with and it has a bunch of out-of-the-box performance upgrades. vllm has a guide on &lt;a href="https://docs.vllm.ai/en/latest/serving/integrations/claude_code/#configuring-claude-code"&gt;getting started with Claude Code&lt;/a&gt; which is pretty straightforward to use for your own setup.&lt;/p&gt;
&lt;p&gt;However, I had heard really good things about serving quantized models with llama.cpp. I also know that llama.cpp powers a good part of the local model serving ecosystem (ollama, for example, builds on it), so why not go to the source?&lt;/p&gt;
&lt;p&gt;Getting llama.cpp compiled for my GPU was a bit of a challenge, but I eventually found a guide with the right flags for my GPU&lt;sup id="fnref:2"&gt;&lt;a class="footnote-ref" href="#fn:2"&gt;2&lt;/a&gt;&lt;/sup&gt;. There are also instructions on how to just run pre-built binaries &lt;a href="https://github.com/ggml-org/llama.cpp?tab=readme-ov-file"&gt;in the llama.cpp repository&lt;/a&gt;. However, I wanted to compile it to make sure it was using my GPU architecture to the best of its abilities.&lt;/p&gt;
&lt;p&gt;This is &lt;a href="https://www.glukhov.org/llm-hosting/llama-cpp/"&gt;a pretty good walkthrough on steps to get started&lt;/a&gt; should you run into issues or want to follow a step-by-step guide.&lt;/p&gt;
&lt;p&gt;Once you have llama.cpp compiled and running, I recommend setting up a config file, so you can run llama-server once and serve multiple models.&lt;/p&gt;
&lt;p&gt;Example config.ini file with one model:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;[*]&lt;/span&gt;
&lt;span class="c1"&gt;# Global settings&lt;/span&gt;
&lt;span class="na"&gt;jinja&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;

&lt;span class="k"&gt;[glm_4_flash]&lt;/span&gt;
&lt;span class="na"&gt;hf-repo&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;unsloth/GLM-4.7-Flash-GGUF:UD-Q4_K_XL&lt;/span&gt;
&lt;span class="na"&gt;jinja&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;
&lt;span class="na"&gt;temp&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;0.7&lt;/span&gt;
&lt;span class="na"&gt;ctx-size&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;131072&lt;/span&gt;
&lt;span class="na"&gt;top-p&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;0.9&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;# originally 1&lt;/span&gt;
&lt;span class="na"&gt;min-p&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;0.01&lt;/span&gt;
&lt;span class="na"&gt;fa&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;auto&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;To run the server, run the following, with the file path updated to point to your configuration file:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="go"&gt;./llama.cpp/build/bin/llama-server --models-preset PATH_TO_CONFIG_FILE --sleep-idle-seconds 300 --host 0.0.0.0 --port 9999&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2 id="model-serving-which-models"&gt;Model serving: which models?&lt;/h2&gt;
&lt;p&gt;Whether you're using vllm or llama.cpp (or something else), you'll need to choose what models to use. Here are some models I've tried so far that I can recommend:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href="https://huggingface.co/unsloth/Qwen3.5-35B-A3B-GGUF"&gt;Qwen 3.5 - 35B quantized by Unsloth&lt;/a&gt;: This has become my current goto coding companion. I would say it is certainly more generalist and relatively good at initial planning. Comparing it to Claude Opus is of course a stretch (it's probably a 10th of the size with a 10th of the information!), but if you're up for iterating and clarifying, you can get the same end result with a bit more of your own effort.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href="https://unsloth.ai/docs/models/glm-4.7-flash"&gt;GLM Flash&lt;/a&gt;: I'm just getting started integrating this one into my workflows. So far I really like it for debugging, but it's probably powerful at other things that I haven't tested it on yet. Will keep you updated as I get to know the performance better via more testing.  &lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href="https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF"&gt;Qwen3-Coder-Next (quantized)&lt;/a&gt;: This was the model I started with, but for my workflows it seemed like it didn't match as well. However, I am not a software engineer! So I wanted to mention it because I think if you are writing software maybe this model is worth testing out.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href="https://huggingface.co/google/gemma-3-27b-it-qat-q4_0-gguf"&gt;Gemma3-27B (quantized)&lt;/a&gt;: This one I've used for doing things like rewriting documentation, texts and project history/planning. I really like pairing with Gemma on text-workflows, and this is a pretty powerful text-to-text model.  &lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;You'll note right now that I've leaned heavily on Unsloth quantized models. I will definitely be comparing them with other quantized models soon, but I had to start somewhere and I thought I'd do so methodically. If you have any quantized models you prefer, please feel free to reach out.&lt;/p&gt;
&lt;p&gt;Another good way to get to know what models are useful is to take a look at what's trending or what has a lot of downloads on HuggingFace. Just note that ANYONE can upload a model to HuggingFace, so just like you wouldn't install a random software package off the internet without verifying it's not malware, don't install a random HuggingFace model without verifying who built it (!!).&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;If you are serving models on the open internet (i.e. not across your local network), use the --api-key flag to authenticate your requests. Otherwise you're just giving people trolling the internet free compute and potentially asking for much more serious privacy problems. :)&lt;/p&gt;
&lt;/blockquote&gt;
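&lt;p&gt;As a rough sketch with the llama.cpp server (check your own server's documentation for the equivalent), that means starting the server with a key and sending it as a bearer token with each request:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;# server side: require a token for all requests
./llama.cpp/build/bin/llama-server -m PATH_TO_MODEL.gguf --api-key YOUR_SECRET_KEY

# client side: requests without the token should now be rejected
curl http://YOUR_IP_ADDRESS:PORT/v1/models -H "Authorization: Bearer YOUR_SECRET_KEY"
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
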
&lt;p&gt;I'll try to keep this list updated, but if you see a model that you think would fit well, feel free to reach out and tell me about it.&lt;/p&gt;
&lt;h2 id="putting-it-all-together"&gt;Putting it all together&lt;/h2&gt;
&lt;p&gt;Here are the steps to get it all running:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Decide on a model you'd like to start with that also fits on your machine.&lt;/li&gt;
&lt;li&gt;Decide between vllm or llama.cpp and get that installed and running with your model of choice.&lt;/li&gt;
&lt;li&gt;Test out the connection and model name with a simple curl request to the serving machine (&lt;a href="https://gist.github.com/kjam/9f39cd69f340832d322ea27e10fabb35"&gt;example&lt;/a&gt;; see also the sketch after this list).&lt;/li&gt;
&lt;li&gt;If all worked so far, set your environment variables and get Claude Code running with --model [YOUR MODEL HERE].&lt;/li&gt;
&lt;li&gt;Send over your first prompt!&lt;/li&gt;
&lt;/ol&gt;
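&lt;p&gt;For step 3, here's a minimal sketch of such a test against the OpenAI-compatible chat endpoint that both llama.cpp and vllm expose; the model name assumes the &lt;code&gt;glm_4_flash&lt;/code&gt; alias from the config above, so swap in your own:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;curl http://YOUR_IP_ADDRESS:PORT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm_4_flash",
    "messages": [{"role": "user", "content": "Say hello in one short sentence."}]
  }'
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
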
&lt;p&gt;I hope you enjoy getting everything set up and running to try out local-only Claude Code usage. But don't stop there, because even though it's local, that doesn't mean it's always secure!&lt;/p&gt;
&lt;h2 id="privacy-and-security-advice"&gt;Privacy and Security Advice&lt;/h2&gt;
&lt;p&gt;It wouldn't be very on-brand of me not to talk about the privacy and security of using an AI coding assistant (even if it is using a local model), so let's dive into some basics that are useful to know.&lt;/p&gt;
&lt;h3 id="sandboxing-101"&gt;Sandboxing 101&lt;/h3&gt;
&lt;p&gt;Claude Code ships with a sandbox, but so far I've been very underwhelmed by its ability to block commands.&lt;sup id="fnref:3"&gt;&lt;a class="footnote-ref" href="#fn:3"&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Tip: For security best practices, check out Rich Harang's article on &lt;a href="https://developer.nvidia.com/blog/practical-security-guidance-for-sandboxing-agentic-workflows-and-managing-execution-risk/"&gt;Sandboxing&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I've outlined what the documentation says in this section, but so far some of the deny rules seem to be ignored. I will keep this section here because I hope that Claude Code eventually fixes the bugs, but my real advice is to launch Claude Code within a VM and only put files there that it's fine to read/write/manipulate at will.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Look into tool and file permissions.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;To check what your default permissions are, you can run:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="go"&gt;/permissions&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;You can change these in your Claude settings.json file. It is useful to start with more restrictive file settings (e.g. allowRead and allowWrite with a small list). You can combine allow/deny for more granularity.&lt;/p&gt;
&lt;p&gt;For example, the following settings have a mixture of allow and deny, which clarifies &lt;em&gt;exactly&lt;/em&gt; what local files in the project folder can be written, read and used.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;sandbox&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;more_sandbox_settings_here&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;the following is just a snippet !&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;filesystem&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;allowWrite&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;.&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;denyWrite&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;.env&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;settings/config.json&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;denyRead&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;./secrets/*&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;.env*&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Warning! This only denies the Read tool, but Claude Code + friends can still run &lt;code&gt;cat&lt;/code&gt;, &lt;code&gt;less&lt;/code&gt;, etc. I would not rely on this as full protection.&lt;/p&gt;
&lt;p&gt;You can also add tool-specific permissions, which can help especially if you want to expand the tools used. That syntax looks similar and has similar specificity rules. Note that these permissions currently live on the same key/value level as sandbox, but that can change so please check the latest documentation.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;quot;permissions&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;allow&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Bash(python *)&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Bash(ls *)&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;deny&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Bash(curl *)&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Bash(cat *)&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Bash(aws *)&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Read(**/.env)&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Read(**/.env.*)&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Read(**/secrets/**)&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Again, warning! So far I've been able to get around some of these deny lists by cleverly asking for other tools. This is by no means foolproof or actually sandboxed. &amp;gt;.&amp;lt;&lt;/p&gt;
&lt;p&gt;To state again, I believe a strong VM solution or file encryption is probably the only way to actually block reads or other bash commands.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Because sandbox settings and defaults might change, I recommend checking any of this advice against &lt;a href="https://code.claude.com/docs/en/sandboxing"&gt;Claude Code documentation&lt;/a&gt; on sandboxing.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;ol start="2"&gt;
&lt;li&gt;Check out the operating system, network and managed settings.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Outside of tool calls and file permissions, you might want to ensure that operating system and networking controls are in place. By default, processes that Claude Code spawns within a sandbox inherit the same sandbox properties; however, there is one caveat:&lt;/p&gt;
&lt;p&gt;&lt;img alt="A warning note from Claude Code documentation that explicitly states by default if something isn't working that might be related to the sandbox, that Claude will send a request to violate sandbox policies. You have to change your settings to disable this." src="/images/2026/claude_escape_hatch.png"&gt;&lt;/p&gt;
&lt;p&gt;So you might want to change that setting immediately. While you're at it, you can also decide whether you want to send telemetry data by setting CLAUDE_CODE_ENABLE_TELEMETRY to 0 or 1.&lt;/p&gt;
&lt;p&gt;In addition, there are networking rules you can set, such as approved or denied domains, and whether the sandbox can connect to local hosts and Unix sockets. Note that the more you allow for networking, the more security risk you open up. Of course, you need to find a balance between denying everything and allowing everything, and if you are running this on an organization laptop/computer, I recommend bothering the security team to take a look and make some recommendations (if they haven't already).&lt;/p&gt;
&lt;p&gt;If you are working in a security or engineering leadership team, you're probably interested in both the &lt;a href="https://code.claude.com/docs/en/permissions#managed-settings"&gt;managed settings&lt;/a&gt; and &lt;a href="https://code.claude.com/docs/en/devcontainer"&gt;devcontainers&lt;/a&gt;. For organization-wide settings, these can be used to override local settings and hopefully create a secure and private baseline across the organization.  &lt;/p&gt;
&lt;ol start="3"&gt;
&lt;li&gt;Containerize (or use VMs, jails or similar controls especially when running in a secure environment)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The Claude Code sandbox is not a true sandbox, so containerize Claude Code (or any AI system) when running it in a secure, production or server environment that may hold sensitive data or other valuable targets.&lt;/p&gt;
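&lt;p&gt;As a rough starting point (the image, paths and model alias here are placeholders for whatever your setup uses), you could run Claude Code inside a throwaway container that only sees a single project directory and talks to your local model server:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;docker run --rm -it \
  -v "$PWD/my-project:/workspace" \
  -w /workspace \
  node:22-bookworm bash

# inside the container:
npm install -g @anthropic-ai/claude-code
export ANTHROPIC_BASE_URL=http://YOUR_IP_ADDRESS:PORT
claude --model glm_4_flash
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This doesn't make the container escape-proof, but it does limit which files are even visible to the agent.&lt;/p&gt;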
&lt;p&gt;This also means starting to test your security assumptions about your environment, for example via a security audit, pen-testing or threat modeling plus red teaming, for more trust and an overall better security posture.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;If you are interested in starting to learn security vulnerabilities in AI systems, check out my free YouTube course on &lt;a href="https://www.youtube.com/playlist?list=PLJkNSeYcYBlC88vkG58yx3fHSobmCmDw_"&gt;Purple Teaming AI Systems&lt;/a&gt;. I offer internal trainings and hackdays, so &lt;a href="mailto:katharine@kjamistan.com"&gt;email me&lt;/a&gt; if your team might want in-house training.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;For my setup I went so far as to get a separate computer to run Claude Code. I connect to it from my main work machine and it connects to my GPU machine over the local network. I move files that I've thoroughly tested and reviewed back to my main machine. Of course, this is my setup because I am also testing how to break security controls, and that's not a very good idea to do on your main machine. :)&lt;/p&gt;
&lt;h3 id="synthetic-data-generation"&gt;Synthetic Data Generation&lt;/h3&gt;
&lt;p&gt;In many coding situations you might need example data to complete the exercise. Too often, organizations' test data is real data sampled from a production environment. Instead, it's advisable to build synthetic data that meets your testing and LLM requirements.&lt;/p&gt;
&lt;h4 id="fully-synthetic-data"&gt;Fully synthetic data&lt;/h4&gt;
&lt;p&gt;The safest option is to build out fully synthetic data using deterministic libraries to do so. For example, in Python the &lt;a href="https://faker.readthedocs.io/en/master/"&gt;Faker library&lt;/a&gt; is a common choice for building out fully synthetic data.&lt;/p&gt;
&lt;p&gt;To build out data like this, you only need to know the general data types and any relationships that must be maintained, and then follow the library documentation on how to build out such datasets. If you're also looking to more thoroughly test your code, you might want to add in &lt;a href="https://en.wikipedia.org/wiki/Property_testing"&gt;property-based testing&lt;/a&gt;.&lt;/p&gt;
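&lt;p&gt;Here's a minimal sketch of what that looks like with Faker; the schema (name, email, city, signup date, monthly spend) is made up for illustration, so swap in whatever your tests actually need:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;# Fully synthetic rows built with Faker -- a hypothetical customer schema.
import random
from faker import Faker

fake = Faker()
Faker.seed(42)   # make the synthetic data reproducible
random.seed(42)

def synthetic_customer() -&gt; dict:
    return {
        "name": fake.name(),
        "email": fake.email(),
        "city": fake.city(),
        "signup_date": fake.date_between(start_date="-2y", end_date="today").isoformat(),
        "monthly_spend": round(random.uniform(5.0, 500.0), 2),
    }

rows = [synthetic_customer() for _ in range(100)]
print(rows[0])
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
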
&lt;p&gt;Why is it fully synthetic? Well, it only uses the data types and potential business logic, and no other information from real records, to generate data. However, this might not be what you need depending on what you're doing with your feature or software.&lt;/p&gt;
&lt;p&gt;Sometimes you might need data that's informed by the actual statistics and distributions in your real data. If that's the case, you'll want to choose a statistical method for producing synthetic data.&lt;/p&gt;
&lt;h4 id="statistical-methods-for-synthetic-data"&gt;Statistical methods for synthetic data&lt;/h4&gt;
&lt;p&gt;If you need data that has particular properties related to your real production data, you'll most likely choose a statistical method for generating your data.&lt;/p&gt;
&lt;p&gt;First and foremost, it's important to create an understanding of the privacy, testing and statistical requirements for the data. You'll want to engage anyone at your organization who might lead such efforts before building out modeling for the data.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I offer internal workshops and trainings on &lt;a href="https://kjamistan.com"&gt;evaluating privacy enhancing technologies for synthetic data generation&lt;/a&gt; should you want to roll out such programs organization-wide.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;There's always a privacy tradeoff between properly representing the data and leaking information that could lead to reidentification of real individuals. Not convinced? Check out &lt;a href="https://www.usenix.org/conference/pepr24/presentation/desfontaines"&gt;Damien Desfontaines' USENIX talk&lt;/a&gt; on privacy leakage in synthetic data generator products. For this reason, it's important to gradually build out synthetic data and do so informed by real privacy requirements and dangers.&lt;/p&gt;
&lt;p&gt;If you just need to make sure that two fields are appropriately linked (e.g. regional address and phone number matching), or if you need a number or attribute to lie in a particular distribution range (e.g. bounded within your real distributions), I would err on the side of choosing a simple method like the fully synthetic approach and then building out modifications to alter any rows or entries that violate those rules.&lt;/p&gt;
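&lt;p&gt;As a tiny illustration of that "simple method plus modifications" idea (the field name and bounds here are hypothetical, agreed-upon requirements rather than raw production statistics), you can post-process fully synthetic rows so that anything outside the allowed range gets clamped back in:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;# Post-processing step: the rows could come from the Faker sketch above.
rows = [
    {"name": "Alex Example", "monthly_spend": 999.0},
    {"name": "Sam Sample", "monthly_spend": 42.0},
]

ALLOWED_SPEND_RANGE = (10.0, 300.0)  # agreed-upon business bounds, not raw production stats

def enforce_bounds(row: dict) -&gt; dict:
    low, high = ALLOWED_SPEND_RANGE
    row["monthly_spend"] = min(max(row["monthly_spend"], low), high)
    return row

rows = [enforce_bounds(row) for row in rows]
print(rows)  # the 999.0 entry is clamped to 300.0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
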
&lt;p&gt;However, if your requirements for linkage and attributes become more intertwined, there are some interesting methods that have been developed. In &lt;a href="https://arxiv.org/abs/2108.04978"&gt;research that won a synthetic data generation challenge from NIST&lt;/a&gt;, the authors took certain attribute-based samples to develop a graph of data relationships. They then applied differential privacy to their distribution samples and were able to produce well-performing synthetic data with very strong guarantees.&lt;/p&gt;
&lt;p&gt;There are also deep-learning based synthetic generation libraries that can "learn" or identify your data properties and generate data based on those properties. I want to remind you of the &lt;a href="https://blog.kjamistan.com/a-deep-dive-into-memorization-in-deep-learning.html"&gt;memorization problems&lt;/a&gt; related to deep learning models, especially ones that fine-tune on small datasets.&lt;/p&gt;
&lt;p&gt;However, if you do want to build a deep learning setup for synthetic generation, I recommend looking into libraries or setups that also use differential privacy as part of their training. &lt;a href="https://blog.kjamistan.com/differential-privacy-in-deep-learning.html"&gt;Differential privacy in deep learning&lt;/a&gt; combats pesky memorization and helps introduce measurable privacy into your data generation.&lt;/p&gt;
&lt;h3 id="spying-on-claude-code"&gt;Spying on Claude Code&lt;/h3&gt;
&lt;p&gt;I've been investigating the inner workings of Claude Code software and prompt infrastructure for a few weeks now. Want to also spy on your "coding assistant"? Let me show you how.&lt;/p&gt;
&lt;p&gt;I have a two-tiered setup at present because it helps me better analyze different parts of the data flows.&lt;/p&gt;
&lt;p&gt;First, I am using &lt;a href="https://github.com/mitmproxy/mitmproxy"&gt;mitmproxy&lt;/a&gt; to proxy traffic from Claude Code and log it into easily parseable JSON files. I have &lt;a href="https://gist.github.com/kjam/4959e6a52e07e20224d56bf1fe0cd92f"&gt;an example as a Gist to get you started&lt;/a&gt;.&lt;/p&gt;
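&lt;p&gt;If you want to start from scratch instead of the Gist, a bare-bones mitmproxy addon might look roughly like the sketch below (the file name and output path are made up). You'd run it with &lt;code&gt;mitmdump -s log_claude_traffic.py&lt;/code&gt; and point Claude Code at the proxy, for example via the HTTPS_PROXY environment variable plus trusting the mitmproxy CA certificate.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;# log_claude_traffic.py -- write each intercepted request/response pair as one JSON line
import json

from mitmproxy import http

OUT_FILE = "claude_traffic.jsonl"

def response(flow: http.HTTPFlow) -&gt; None:
    record = {
        "url": flow.request.pretty_url,
        "method": flow.request.method,
        "request_body": flow.request.get_text(strict=False),
        "status_code": flow.response.status_code,
        "response_body": flow.response.get_text(strict=False),
    }
    with open(OUT_FILE, "a") as f:
        f.write(json.dumps(record) + "\n")
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
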
&lt;p&gt;Then I'm also using &lt;a href="https://github.com/eunomia-bpf/agentsight/tree/master"&gt;AgentSight&lt;/a&gt;, an eBPF-based tool with an interface that specifically looks at the Claude Code binary and follows process forks, system calls and sequences of both. You can read more about the design and usage &lt;a href="https://arxiv.org/abs/2508.02736"&gt;in their ArXiv paper&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A view from the AgentSight dashboard that shows processes, http events, system calls and SSL requests." src="/images/2026/agentsight.png"&gt;&lt;/p&gt;
&lt;p&gt;I built AgentSight from source (it took a bit of manipulating the Makefile in my setup), and it's useful to set CLAUDE_BIN as an environment variable when you call the built binary. They also have prebuilt binaries, but those didn't work for my setup.&lt;/p&gt;
&lt;p&gt;I think in the future it might be useful for you to design your own Agent-Spyware by building specific ebpf libraries and packages, but I am by no means an expert on that. A tip from someone more informed says &lt;a href="https://github.com/iovisor/bcc"&gt;bcc is a good starting point&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;If you build something and want any practitioner feedback, I'd be very interested.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I'll be posting more on agentic security and privacy and testing out alternatives like &lt;a href="https://opencode.ai/"&gt;opencode&lt;/a&gt;. So far I really like the ease of tool calling via opencode in comparison, but I'm just getting started on my investigation...&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I'll be updating this post as I iron out my workflows and gain more experience. I'll also be releasing a longer series on what I find from a privacy and security perspective later this year, so stay tuned! If you have burning questions or any interesting research you are working on in the space, feel free to reach out.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;If this post helped you, consider &lt;a href="https://probablyprivate.com/subscribe"&gt;subscribing to my newsletter&lt;/a&gt; or &lt;a href="https://www.youtube.com/@ProbablyPrivate?sub_confirmation=1"&gt;my YouTube&lt;/a&gt; and sharing my work! I also offer &lt;a href="https://kjamistan.com"&gt;advisory and workshops&lt;/a&gt; and a new &lt;a href="https://maven.com/katharine-jarmul/practical-ai-privacy/"&gt;Maven Practical AI Privacy course&lt;/a&gt; on topics like security and privacy in AI/ML and personal AI.&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;There are several ways to control caching in both llama.cpp and vllm. Here's a useful &lt;a href="https://github.com/ggml-org/llama.cpp/discussions/20574"&gt;starting point for llama.cpp&lt;/a&gt; and for &lt;a href="https://docs.vllm.ai/en/latest/design/prefix_caching/"&gt;vllm&lt;/a&gt;. Again, I would first use the system for a while before implementing caching so you can diagnose any potential caching pros and cons based on your initial experience and observations.&amp;#160;&lt;a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;For my NVIDIA Blackwell architecture + Debian setup &lt;a href="https://forums.developer.nvidia.com/t/tutorial-build-llama-cpp-from-source-and-run-qwen3-235b/352604"&gt;this guide&lt;/a&gt; did the trick. But your architecture and the flags you need will differ, so try following the initial README or looking around with your GPU name + OS + compile llama.cpp.&amp;#160;&lt;a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:3"&gt;
&lt;p&gt;Even &lt;a href="https://claude.ai/share/795fcbe8-bc8f-4c6d-8230-1bef84465f0f"&gt;Claude&lt;/a&gt; can't figure out how to block reads from files in the working directory.&amp;#160;&lt;a class="footnote-backref" href="#fnref:3" title="Jump back to footnote 3 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="personal-ai"></category></entry><entry><title>Practical AI Privacy: A 6-week online Maven masterclass</title><link href="https://blog.kjamistan.com/practical-ai-privacy-a-6-week-online-maven-masterclass.html" rel="alternate"></link><published>2026-03-30T00:00:00+02:00</published><updated>2026-03-30T00:00:00+02:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2026-03-30:/practical-ai-privacy-a-6-week-online-maven-masterclass.html</id><summary type="html">&lt;p&gt;AI usage (both intended and not) is increasing in our products, software and lives. For some, this is a welcome way to automate tedious tasks; for others, an intrusion that doesn't seem to end. For everyone, AI changes how you might think about and evaluate data privacy.&lt;/p&gt;
&lt;p&gt;At workplaces, there's …&lt;/p&gt;</summary><content type="html">&lt;p&gt;AI usage (both intended and not) is increasing in our products, software and lives. For some, this is a welcome way to automate tedious tasks; for others, an intrusion that doesn't seem to end. For everyone, AI changes how you might think about and evaluate data privacy.&lt;/p&gt;
&lt;p&gt;At workplaces, there's often top-down and bottom-up incentivization to automate and manage your work with AI. Increasingly, these tasks touch sensitive data, documents and workflows. How can you automate these workflows safely? What are the best practices with regard to privacy and security?&lt;/p&gt;
&lt;p&gt;&lt;img alt="Katharine on stage presenting about privacy in AI systems" src="/images/2026/kj_on_stage.jpg"&gt;
&lt;em&gt;Katharine on Practical Data Privacy at GOTO Amsterdam 2023&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;I've been considering these questions long before LLMs came on the scene. My book &lt;a href="https://practicaldataprivacybook.com/"&gt;Practical Data Privacy (O'Reilly 2023)&lt;/a&gt; is one of the most recommended introductions to privacy controls in data, AI and ML workflows.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://maven.com/katharine-jarmul/practical-ai-privacy/"&gt;Practical AI Privacy masterclass&lt;/a&gt; is my 2026 update for that book. The course focuses on inference instead of training and information-heavy workloads, like LLMs, diffusion models and advanced setups like agentic workflows. It aims to give AI model users (not model developers) the ability to better understand and control their privacy and security.&lt;/p&gt;
&lt;p&gt;By creating a safe place to experiment and learn about AI privacy risks and controls, you'll learn real skills you can use both at work and in your own personal AI usage. You'll leave the course able to assess, evaluate and address privacy risks presented by large models and AI workflows. You'll have code you wrote, analyzed and tested at your fingertips, for work and personal projects.&lt;/p&gt;
&lt;h3 id="who-is-the-class-for-data-software-and-ai-engineering-but-not-only"&gt;Who is the class for? Data, Software and AI Engineering, but not only&lt;/h3&gt;
&lt;p&gt;Since &lt;a href="https://maven.com/katharine-jarmul/practical-ai-privacy/"&gt;the course&lt;/a&gt; is quite hands-on, the target audience is someone who is comfortable reading and writing some code and who wants to build out privacy engineering in AI/ML workflows. I chose this style because I think there's an immediate need for people to do these tasks at work, and I want to create a safe environment where you can practice and learn.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Want a taste of what we'll cover and how I teach? Check out &lt;a href=""&gt;my Maven Lightning Lesson on why privacy is such a hard problem to solve in AI systems&lt;/a&gt; and &lt;a href="https://www.youtube.com/playlist?list=PLJkNSeYcYBlC88vkG58yx3fHSobmCmDw_"&gt;my Probably Private YouTube mini-course on security&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;If you're not an engineer it doesn't necessarily mean the course isn't for you. If you wear a hat helping shape policy and architecture (like privacy analysts, privacy and security architects) or if you are a privacy professional or product owner, I think your expertise will help drive different aspects of how privacy fits into AI systems.&lt;/p&gt;
&lt;p&gt;Multidisciplinary teams are essential to any successful privacy program because it's often non-engineering roles that drive policy and process at an organization.&lt;/p&gt;
&lt;p&gt;If you want to join the course but are intimidated by coding, know that there will be ways to partner and team up with others; as well as ways to contribute your knowledge as part of the larger conversations as to how we build these systems.&lt;/p&gt;
&lt;h3 id="what-youll-learn-and-why"&gt;What You'll Learn and Why&lt;/h3&gt;
&lt;p&gt;I've broken down the main concepts you'll learn with some details on what each concept is and why it's important. Several concepts build on previous ones, so I've tried to keep them in the general order in which they will be taught.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;What you'll learn&lt;/th&gt;
&lt;th&gt;Details&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Getting Started with Local AI&lt;/td&gt;
&lt;td&gt;Create a safe experimentation environment for testing new ideas.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Architecting an AI product workflow&lt;/td&gt;
&lt;td&gt;Practice architecture decisions, learn how AI products work&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Generating synthetic data and initial evaluations&lt;/td&gt;
&lt;td&gt;Evaluate synthetic data and generic evals as building block for privacy evaluations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Privacy attacks on AI systems&lt;/td&gt;
&lt;td&gt;Learn how to run attacks focused on extracting confidential or sensitive information&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reviewing and changing architectures&lt;/td&gt;
&lt;td&gt;Based on what you've learned, re-evaluate your architecture choices and make new designs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Evaluating basic protections (permissions, pseudonymization, input sanitization)&lt;/td&gt;
&lt;td&gt;Determine when and how basic protections address the privacy and security issues exposed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Guardrails&lt;/td&gt;
&lt;td&gt;Learn what guardrails can and cannot do and get hands-on practice using them in your setup&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Advanced protections (other than guardrails): Prompt minimization, routing and local options&lt;/td&gt;
&lt;td&gt;Experiment with more advanced protections for privacy and update your architectures accordingly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Privacy testing and evaluations&lt;/td&gt;
&lt;td&gt;Build evaluation and testing suites around your use case, tuned to the privacy requirements&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Privacy monitoring and observability&lt;/td&gt;
&lt;td&gt;Establish observability and monitoring best practices with privacy as the focus&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;All of this will happen in hands-on labs with extra office hours for questions, deeper dives and experimentation. The course will have two main lessons per week along with time for questions. There will also be some at-home assignments, team projects and asynchronous conversations (via the Maven platform and Mattermost).&lt;/p&gt;
&lt;h3 id="feedback-questions-ideas-very-welcome"&gt;Feedback, Questions, Ideas very welcome!&lt;/h3&gt;
&lt;p&gt;I'd be happy to answer any questions you have. I'll try to keep this post updated as questions emerge to ensure the course is clear.&lt;/p&gt;
&lt;p&gt;If you have feedback on the topics, or wish I covered something you expected to see, feel free to write me. You can &lt;a href="mailto:katharine@kjamistan.com"&gt;email me&lt;/a&gt; or reach out &lt;a href="https://linkedin.com/in/katharinejarmul"&gt;on LinkedIn&lt;/a&gt;.&lt;/p&gt;</content><category term="classes"></category></entry><entry><title>Differential Privacy Parameters, Accounting and Auditing in Deep Learning and AI</title><link href="https://blog.kjamistan.com/differential-privacy-parameters-accounting-and-auditing-in-deep-learning-and-ai.html" rel="alternate"></link><published>2026-02-06T00:00:00+01:00</published><updated>2026-02-06T00:00:00+01:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2026-02-06:/differential-privacy-parameters-accounting-and-auditing-in-deep-learning-and-ai.html</id><summary type="html">&lt;p&gt;You've learned in the last few articles about &lt;a href="https://blog.kjamistan.com/differential-privacy-in-deep-learning.html"&gt;how differential privacy works&lt;/a&gt; and some of the &lt;a href="https://blog.kjamistan.com/differential-privacy-in-todays-ai-whats-so-hard.html"&gt;common pitfalls&lt;/a&gt; of actually using it in deep learning scenarios.&lt;/p&gt;
&lt;p&gt;In this article, you'll learn about tracking differential privacy: through parameter choice, accounting and auditing. If done well, these choices and methods reduce memorization …&lt;/p&gt;</summary><content type="html">&lt;p&gt;You've learned in the last few articles about &lt;a href="https://blog.kjamistan.com/differential-privacy-in-deep-learning.html"&gt;how differential privacy works&lt;/a&gt; and some of the &lt;a href="https://blog.kjamistan.com/differential-privacy-in-todays-ai-whats-so-hard.html"&gt;common pitfalls&lt;/a&gt; of actually using it in deep learning scenarios.&lt;/p&gt;
&lt;p&gt;In this article, you'll learn about tracking differential privacy: through parameter choice, accounting and auditing. If done well, these choices and methods reduce memorization in deep learning systems.&lt;/p&gt;
&lt;h3 id="how-do-the-privacy-parameters-in-differential-privacy-work"&gt;How do the privacy parameters in differential privacy work?&lt;/h3&gt;
&lt;p&gt;As you likely recall &lt;a href="https://blog.kjamistan.com/differential-privacy-in-deep-learning.html"&gt;from the previous article&lt;/a&gt;, differential privacy has privacy parameters in the definition. Reasoning about these parameters is something a team really has to do in practice to develop an understanding of their meaning and what they measure. As a starting point, however, I recommend &lt;a href="https://desfontain.es/blog/differential-privacy-in-more-detail.html"&gt;this useful chart generated by Damien Desfontaines&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A graph showing how an attacker can update their suspicion of a person in the dataset based on a variety of epsilon values given an epsilon-based differential privacy query interface. On the x-axis there is initial suspicion with a range from 0 to 1. The y-axis is shown as updated suspicion with the same range. There is a legend on the right hand side showing possible epsilon ranges, from 0 to 7 with different bounds. The graph itself has radiating ranges that begin along the diagonal between 0 and 1 where initial suspicion equals updated suspicion. These ranges look like radiating concentric oblong shapes -- where lower values of epsilon are wrapped in larger values. These begin to look more like logarithmic curves the further out you go -- where epsilon is larger than 3." src="./images/2026/epsilon_graph.png"&gt;
&lt;em&gt;Updated suspicion based on epsilon choices&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;This chart assumes a sophisticated attacker who attempts to determine which database they are interacting with and update their knowledge based on the response.&lt;/p&gt;
&lt;p&gt;The chart shows the initial suspicion across the x axis. This can represent what the attacker either has already learned about whether the person is in the dataset or not, or what they learned based on a previous query.&lt;/p&gt;
&lt;p&gt;Then, after they query a differentially private database, they can update their suspicion. The y-axis shows the bounds of the information they get in return which depends on the epsilon choice. It is never guaranteed that they learn the "maximum" of the bounds, but instead guaranteed that they learn within the bounds -- it could be that with one response they learn nothing.&lt;/p&gt;
&lt;p&gt;The different colors in the chart represent different choices of epsilon. The bounds can be quite small with a small epsilon and much larger with a large epsilon. Choosing an epsilon of 5 is much different than an epsilon of 1. These bounds are the guarantees that differential privacy uses to protect individual information. You will also use these bounds to determine what is the balance between the information you are trying to learn and the information you are trying to protect.&lt;/p&gt;
&lt;p&gt;If you are doing something like preprocessing or data exploration, you usually choose an epsilon upfront and split it unevenly across your queries, so you can get a bit more information from some queries rather than others. Spending more of your epsilon on a query results in a more accurate, informative and therefore less private response.&lt;/p&gt;
&lt;p&gt;Any time you repetitively query or process data in your analysis, you track this epsilon (or other parameters depending on your definition) so you know the entire epsilon spent.&lt;/p&gt;
&lt;p&gt;In today's differentially private deep learning libraries, there are two approaches to epsilon choice. One is to choose a maximum epsilon, which will lead to early stopping. The other is to track the epsilon and to keep training until a certain accuracy is reached. The second choice means that you could end up with a higher epsilon than you originally thought, but you can also decide to throw away the model and retrain (which means you wasted time and compute, but you hopefully learned something about your data).&lt;/p&gt;
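&lt;p&gt;Here's a rough sketch of what those two approaches look like with PyTorch's Opacus; the toy model and the parameter values (target epsilon, delta, clipping norm) are placeholders, not recommendations:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;# Sketch of the two epsilon-handling styles with Opacus (toy model and data).
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

model = nn.Linear(10, 2)
optimizer = optim.SGD(model.parameters(), lr=0.1)
data = TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,)))
data_loader = DataLoader(data, batch_size=16)

privacy_engine = PrivacyEngine()

# Approach 1: fix the budget up front -- Opacus derives the noise level needed
# to stay within target_epsilon over the planned number of epochs.
model, optimizer, data_loader = privacy_engine.make_private_with_epsilon(
    module=model,
    optimizer=optimizer,
    data_loader=data_loader,
    target_epsilon=3.0,
    target_delta=1e-5,
    epochs=5,
    max_grad_norm=1.0,
)

# Approach 2 (alternative): call make_private(...) with a chosen noise_multiplier,
# train until your accuracy target is reached, and track the spend as you go.
criterion = nn.CrossEntropyLoss()
for features, labels in data_loader:
    optimizer.zero_grad()
    loss = criterion(model(features), labels)
    loss.backward()
    optimizer.step()

print("epsilon spent so far:", privacy_engine.get_epsilon(delta=1e-5))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
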
&lt;p&gt;So how does the second process work in deep learning? How can you track the epsilon if you're not setting it in advance?&lt;/p&gt;
&lt;h3 id="an-introduction-to-accounting"&gt;An Introduction to Accounting&lt;/h3&gt;
&lt;p&gt;If we don't know exactly how much epsilon we have to spend, we need to measure how much we did spend! In the case of deep learning, this is actually often how we track the privacy parameters. In each batch during DP-SGD training, the process calculates and calibrates the noise based on characteristics of the batch. This calculation also estimates the epsilon, and that is tracked by an accountant.&lt;/p&gt;
&lt;p&gt;Already in &lt;a href="https://arxiv.org/abs/1607.00133"&gt;the original definition of DP-SGD by Abadi et al.&lt;/a&gt;, there was an accountant to track the epsilon spent. This accountant is called the moments accountant. This approach is still used in both &lt;a href="https://jax-privacy.readthedocs.io/en/latest/"&gt;JAX Privacy&lt;/a&gt; and &lt;a href="https://opacus.ai/"&gt;PyTorch's Opacus&lt;/a&gt; for differentially private learning.&lt;/p&gt;
&lt;p&gt;To appropriately calculate the bounds and approximate the epsilon, the authors needed to understand the behavior of the noise they were using. The authors were interested in Gaussian noise as it has some nice properties when compared with other distributions and some benefits when looking at how deep learning systems work (e.g. when working with regularization, recognizing patterns and managing embedding spaces).&lt;/p&gt;
&lt;p&gt;&lt;img alt="A chart showing two different probability density distributions. The Laplace distribution has a very strong peak at 0 with exponential tails dropping off very quickly. In comparison, the Gaussian distribution has no peak and much more gradual tails. Both of these are shown at their proper scale when epsilon=0.9." src="./images/2026/gaussian_v_laplace_noise.png"&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I highly encourage you to play with &lt;a href="https://lpanavas.github.io/mechanism-comparison/"&gt;Liudas Panavas' interactive chart&lt;/a&gt; to explore more values of epsilon.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The problem with Gaussian noise compared to, say, a Laplace distribution is that it has a higher chance of ending up with less noise because of the gradual tails. If you end up too far in a tail, you could essentially add 0 noise, nullifying the privacy guarantee. If you want those tight bounds/guarantees like in Desfontaines's chart, how do you appropriately calculate the chances you end up adding noise from the middle, the 10th percentile, the .05th percentile?&lt;/p&gt;
&lt;p&gt;Abadi and coauthors were developing some interesting work on learning theory around differential privacy, and they had the idea to use higher order moments (like 3 for skewness and 4 for kurtosis) to better understand and therefore bound the probability of catastrophic failure (i.e. almost 0 noise). By working through the math and theory and additionally testing their implementation, they were able to develop better bounds for Gaussian noise in deep learning.&lt;/p&gt;
&lt;p&gt;Let's walk through how the moments accountant works:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Based on the ideal spend for the two parameters (epsilon and delta) and the clipping value (the maximum gradient norm for the mini-batch), a Gaussian distribution is formed to add noise to the clipped gradient. These parameters inform the standard deviation of the Gaussian noise distribution.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Several moments (e.g. 3 and 4, but you can go up to a very high number of moments) are recorded for that noise distribution and stored by the accountant.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Training continues, performing steps 1 and 2 for each minibatch.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;At the end of an epoch, sum the moments accumulated and calculate the upper bounds of those training rounds based on those values. This can be done to back out the amount of epsilon and delta spent in that epoch.&lt;sup id="fnref:1"&gt;&lt;a class="footnote-ref" href="#fn:1"&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If the target epsilon or delta is reached, you can stop training. If target epsilon or delta are not reached, training epochs can continue. Usually the accountant will also display the information via a log message per epoch.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
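&lt;p&gt;For the mathematically curious, the reason summing the moments works comes down to two results from the Abadi et al. paper, paraphrased here (see the paper for the exact statements and conditions):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;% \alpha_M(\lambda): the log moment generating function of the privacy loss at moment \lambda

% Composability: moments add up across the sequence of per-batch mechanisms M_1, ..., M_k
\alpha_M(\lambda) \le \sum_{i=1}^{k} \alpha_{M_i}(\lambda)

% Tail bound: for any \varepsilon &gt; 0, the composed mechanism M is (\varepsilon, \delta)-DP for
\delta = \min_{\lambda} \exp\big(\alpha_M(\lambda) - \lambda\varepsilon\big)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
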
&lt;p&gt;This moments accountant and the Gaussian bounds calculation eventually led to more advanced definitions, such as Renyi differential privacy, which is now the standard for today's deep learning implementations.&lt;sup id="fnref:2"&gt;&lt;a class="footnote-ref" href="#fn:2"&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;Okay, so how do you make sure that the moments accountant is doing things appropriately? Similar to difficult problems like cryptography, you never want to write your own unless you are working with actual experts. And even when (or especially because) experts wrote it, you gotta audit!&lt;/p&gt;
&lt;h3 id="auditing-v-accounting"&gt;Auditing v. accounting&lt;/h3&gt;
&lt;p&gt;Let's disentangle two separate concepts: auditing and accounting. Accounting is keeping track of the privacy parameters for a given experiment, process or query. In your case, this is likely your training or fine-tuning. Usually this means you:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;set an ideal budget&lt;/li&gt;
&lt;li&gt;track it via some accountant (like the moments accountant)&lt;/li&gt;
&lt;li&gt;ideally stop processing when the budget is reached&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;As you already learned, sometimes you might actually stop at a given model performance (e.g. error, TPR, FPR or some other metric). At that point you can also use the accountant to decide whether the privacy is "enough" to use the model.&lt;/p&gt;
&lt;p&gt;Auditing, however, is making sure that your differential privacy mechanism works properly and that your accountant works properly. This means testing that the libraries and tools you are using behave correctly, but it can also cover whether you are using them properly (i.e. looking at the entire setup and ensuring the parameters, processing and tools meet the requirements set).&lt;/p&gt;
&lt;p&gt;Let's dive deeper into auditing so you can know how to choose tools that are safe, have been tested and that also fit your needs.&lt;/p&gt;
&lt;h3 id="third-party-auditing-of-tools-is-necessary"&gt;Third-party auditing of tools is necessary&lt;/h3&gt;
&lt;p&gt;Auditing your tools and accounting makes sense. The entire privacy guarantee is based on whether the differential privacy libraries:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;theoretically prove what they're doing is sound&lt;/li&gt;
&lt;li&gt;implement that theory into software properly&lt;/li&gt;
&lt;li&gt;haven't accidentally introduced a new vulnerability&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Similar to other difficult problems like cryptography, developers need to make sure their implementation is tested by other experts and regularly updated with new insights, theory and attacks.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I also recommend reading &lt;a href="https://desfontain.es/blog/privacy-auditing-terminology.html"&gt;Damien Desfontaines' article on 3 types of privacy auditing&lt;/a&gt;. I'm talking about definition #3 here.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;In &lt;a href="https://arxiv.org/abs/2202.12219"&gt;Debugging Differential Privacy: A Case Study for Privacy Auditing&lt;/a&gt;, the authors showed how easy it is even for domain experts to miss important nuances in differential privacy in deep learning. By auditing a new approach, they uncovered a bug in the implementation which severely underestimated the epsilon and resulting privacy leakage.&lt;/p&gt;
&lt;p&gt;Does this make the original authors "bad" or any less experts than they were? No. It proves that without robust auditing from external parties, it's difficult to ensure you have "seen" everything. This is why the field of cryptography has embraced open-source, open auditing and even incentivized auditing as an industry--to attempt to catch these mistakes and ensure the security guarantees are real. Even then, it's possible to get this wrong, such as what happened with the &lt;a href="https://en.wikipedia.org/wiki/Heartbleed"&gt;OpenSSL Heartbleed Vulnerability&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;To audit the incorrect differential privacy implementation, the authors ran a &lt;a href="https://blog.kjamistan.com/defining-privacy-attacks-in-ai-and-ml.html"&gt;membership inference attack&lt;/a&gt; in which they inserted specifically poisoned examples that acted as canaries. Then they attempted to find these canaries with sophisticated membership inference techniques.&lt;/p&gt;
&lt;p&gt;To build successful canaries, they identified useful starting examples by testing canary creation with a subsample of the training data.&lt;sup id="fnref:3"&gt;&lt;a class="footnote-ref" href="#fn:3"&gt;3&lt;/a&gt;&lt;/sup&gt; Then they trained 10K models on different subsets of the poisoned vs non-poisoned examples--all supposedly with appropriate differential privacy granted via the new implementation.&lt;/p&gt;
&lt;p&gt;They admit that training 10K models was overkill and they could have probably just trained 1K to get the same results. Since the models were simple MNIST models, it was a fairly short training cycle per model, especially on large compute.&lt;/p&gt;
&lt;p&gt;They trained those models so that they could use the loss information on the (outlier) canaries to successfully target them for MIAs. By gathering the loss for canary examples versus non-canary examples repeatedly, they slowly built two distributions.&lt;/p&gt;
&lt;p&gt;Because these distributions have overlap (i.e. where the loss is similar or the same), they then optimized to find the threshold that best separates these losses and identifies the canaries versus the other examples. Below is a visual of their threshold findings.&lt;/p&gt;
&lt;p&gt;&lt;img alt="There are two Gaussian-looking distributions that clearly have different distribution properties. The canary/poisoned distribution has a higher peak and smaller tails and is shown in blue. The baseline distribution is shown in orange, and the mean is shifted to the right. There is a threshold line that is drawn to best separate the two distributions." src="./images/2026/threshold_canary_improper_dp.png"&gt;&lt;/p&gt;
&lt;p&gt;Because finding this threshold shows significant privacy leakage for the canaries, it can be used to estimate epsilon bounds. Comparing this estimate with the epsilon reported in the paper, the authors concluded that the reported epsilon was not the true epsilon.&lt;/p&gt;
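&lt;p&gt;As a rough illustration of how a threshold like that turns into an epsilon estimate, here is a small sketch (mine, not the paper's code) that sweeps thresholds over canary and baseline loss scores and converts the best true/false positive trade-off into an empirical lower bound, using the standard hypothesis-testing bound that the true positive rate can be at most exp(epsilon) times the false positive rate, plus delta. A real audit repeats this over many trained models and adds confidence intervals; the loss arrays below are made up.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;import numpy as np

def empirical_epsilon(canary_losses, baseline_losses, delta=1e-5):
    """Point estimate of an epsilon lower bound from a loss-threshold attack.

    The attack guesses "canary" whenever the loss falls below a threshold.
    Differential privacy caps the true positive rate at
    exp(epsilon) * false positive rate + delta, so a large observed gap
    between the two rates implies a large epsilon.
    """
    best = 0.0
    for t in np.concatenate([canary_losses, baseline_losses]):
        tpr = np.mean(canary_losses &amp;lt;= t)    # canaries correctly flagged
        fpr = np.mean(baseline_losses &amp;lt;= t)  # baseline examples wrongly flagged
        if fpr &amp;gt; 0 and tpr &amp;gt; delta:
            best = max(best, np.log((tpr - delta) / fpr))
    return best

# Made-up loss scores: canaries (members) tend to have noticeably lower loss.
rng = np.random.default_rng(1)
canary = rng.normal(0.5, 0.3, size=1000)
baseline = rng.normal(1.5, 0.5, size=1000)
print(f"empirical epsilon lower bound: {empirical_epsilon(canary, baseline):.2f}")
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;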
&lt;p&gt;Like any good auditing team, they took this as a prompt to investigate the code implementation, where they found a bug. The bug reduced the gradient sensitivity by the batch size, which is incorrect: the batch size doesn't affect the sensitivity of any of the gradients. The mistake was easy to make because gradient clipping is calibrated by batch (adaptive clipping by gradient norm), but the sensitivity is not.&lt;/p&gt;
&lt;p&gt;Therefore the implementation was underestimating sensitivity and adding too little noise to counter the gradient information. This, in turn, left those canaries overexposed (along with any other points that needed more privacy). Getting sensitivity right in theory and in practice is a difficult job, which is why auditing is necessary. There's also continuing research on new attacks, meaning audits must be performed regularly&lt;sup id="fnref:4"&gt;&lt;a class="footnote-ref" href="#fn:4"&gt;4&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
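&lt;p&gt;To make the sensitivity point concrete, here is a minimal sketch (again my own illustration, not the audited code) of the noise calibration in DP-SGD-style training: each per-example gradient is clipped to the norm bound, and the noise scale depends on that bound and the noise multiplier only, never on the batch size.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;import numpy as np

def noisy_batch_gradient(per_example_grads, clip_norm, noise_multiplier, rng):
    """Clip each example's gradient, sum, add Gaussian noise, then average.

    Any single example can change the clipped sum by at most `clip_norm`,
    so the noise standard deviation is noise_multiplier * clip_norm.
    Dividing that sensitivity by the batch size, as in the buggy
    implementation described above, adds far too little noise.
    """
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / max(norm, 1e-12)))
    total = np.sum(clipped, axis=0)
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=total.shape)
    return (total + noise) / len(per_example_grads)

rng = np.random.default_rng(0)
fake_grads = [rng.normal(size=10) for _ in range(32)]   # 32 fake per-example gradients
update = noisy_batch_gradient(fake_grads, clip_norm=1.0, noise_multiplier=1.1, rng=rng)
print(update.round(3))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;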
&lt;p&gt;Let's say you are using an appropriately audited library. The next thing you might want to review is to make sure that you're using it correctly. To do so in a repeatable and adaptable way, you probably want to build testing into the usage.&lt;/p&gt;
&lt;h2 id="appropriate-testing-is-hard"&gt;Appropriate testing is hard&lt;/h2&gt;
&lt;p&gt;An obvious starting point is to actually test the produced models for the most common privacy attacks. Can you successfully run a MIA or a LiRA on the model?&lt;/p&gt;
&lt;p&gt;&lt;a href="https://arxiv.org/abs/2404.17399"&gt;Aerni et al (2024)&lt;/a&gt; reviewed recent papers on privacy-preserving machine learning and ran these tests against the implementations. To do so, they introduced canary inputs and then ran LiRA and MIA attacks on the resulting models. In their analysis of the results, they surmised that most of the papers were either cherry-picking results or taking the average attack on something like a mislabeled example versus a cleverly crafted canary.&lt;/p&gt;
&lt;p&gt;As you have learned throughout this series, not all data has the same privacy risk. Some examples will be more prone to memorization because they are novel or represent complexity. It is useful to design privacy testing that takes this into account! One way to do this might be to evaluate sample complexity and example complexity as part of training and to select complex and interesting samples for future privacy evaluation.&lt;/p&gt;
&lt;p&gt;As you also learned, mislabeled examples are easy for a model to unlearn/forget because they contradict information that the model will learn from the rest of their class/label/examples. Therefore, they are not a good choice for testing real privacy concerns.&lt;/p&gt;
&lt;p&gt;Outside of looking at risks for outlier and complex examples, it's also useful to think about what highly repeated examples are expected to be learned, and which aren't. If you get to know your task and data via unsupervised methods like clustering, this could show aspects of both of these types of examples.&lt;/p&gt;
&lt;p&gt;Aerni et al. make a few research recommendations which are also useful if you're a practitioner looking to appropriately evaluate differential privacy for your use case. They recommend that you:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Evaluate membership inference success (specifically true positive rate (TPR) at low false positive rate (FPR)) for the most vulnerable samples in a dataset, instead of an aggregate over all samples. To make this process computationally efficient, audit a set of canaries whose privacy leakage approximates that of the most vulnerable sample. A minimal sketch of this metric follows the list.&lt;/li&gt;
&lt;li&gt;Use a state-of-the-art membership inference attack that is properly adapted to privacy defense specifics you are using in your training (i.e. DP or otherwise).&lt;/li&gt;
&lt;li&gt;Compare your model to DP baselines (e.g., DP-SGD) that use state-of-the-art techniques and reach similar utility.&lt;/li&gt;
&lt;/ol&gt;
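&lt;p&gt;Their first recommendation boils down to one number. Here is a small sketch of computing it, assuming you already have attack scores for a set of canaries and a set of held-out non-members; the score arrays below are made up.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;import numpy as np

def tpr_at_fpr(member_scores, nonmember_scores, target_fpr=0.001):
    """True positive rate of a score-threshold attack at a fixed low FPR.

    Higher attack score means "more likely a member". The threshold is set
    so that only `target_fpr` of non-members score above it, and we measure
    how many members (e.g. canaries) are still caught at that threshold.
    """
    threshold = np.quantile(nonmember_scores, 1.0 - target_fpr)
    return np.mean(member_scores &amp;gt; threshold)

# Made-up attack scores for vulnerable canaries vs. held-out non-members.
rng = np.random.default_rng(2)
canary_scores = rng.normal(2.0, 1.0, size=1_000)
nonmember_scores = rng.normal(0.0, 1.0, size=10_000)
print(f"TPR at 0.1% FPR: {tpr_at_fpr(canary_scores, nonmember_scores):.3f}")
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;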
&lt;p&gt;If you're a practitioner and you want to get inspired by their experiments, take a look at the &lt;a href="https://github.com/ethz-spylab/misleading-privacy-evals"&gt;GitHub repository&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;New approaches to testing have also helped move the field of privacy auditing forward. &lt;a href="https://arxiv.org/abs/2302.07956"&gt;Nasr et al. developed an interesting approach&lt;/a&gt; that audits the training process as it is performed, meaning you don't have to wait until &lt;em&gt;after&lt;/em&gt; you've trained a model in order to get a good idea of what privacy guarantees you are offering.&lt;/p&gt;
&lt;p&gt;In their work, they used a different type of attack to simulate the LiRA and state-of-the-art MIAs. They allow the attacker to observe the training process itself and to actively insert canary gradients&lt;sup id="fnref:5"&gt;&lt;a class="footnote-ref" href="#fn:5"&gt;5&lt;/a&gt;&lt;/sup&gt; into the process. Then, the attacker observes the loss created by these batches with and without canaries and trains a model to distinguish between them. They also present a weaker version where the process is only observed at certain intervals and there aren't canaries.&lt;/p&gt;
&lt;p&gt;By evaluating the training process this way, they were able to determine better lower bounds for the privacy guarantees, which helps people choose parameters that match the sensitivity they need. Beyond the lower bounds, you also need to understand the upper bounds and the likelihood that you end up somewhere in the middle, as &lt;a href="https://desfontain.es/blog/bad-ugly-good-maybe.html"&gt;Damien Desfontaines presented at PEPR&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Another interesting recent approach looks at the training process and tracks loss across epochs. &lt;a href="https://arxiv.org/abs/2411.05743"&gt;Pollock et al.&lt;/a&gt; showed that looking at the distribution of losses across the training process can identify more than 90% of the memorized and at-risk examples. They also &lt;a href="https://github.com/imperial-aisp/loss_traces"&gt;released their work on GitHub&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="There are three different images of frogs at the top of the graph, each with a different positioning and placement. The average frog is centered in the image and there's nothing else in the image. The easy-to-learn outlier is a picture of a frog from underneath. Then the hard-to-learn outlier is a frog from further away positioned slightly behind a wall of some sort. In the graph below, the loss traces are shown. The line for the average goes quickly down as the epochs continue. The easy-to-learn outlier has some erratic behavior in the early epochs and then drops and stabilizes in later epochs. The hard-to-learn outlier's loss almost never stabilizes. It spikes and drops and spikes and drops, in the very last epochs it looks to partially stabilize." src="./images/2026/loss_traces.png"&gt;&lt;/p&gt;
&lt;p&gt;This is an example from their paper, showing three different types of frog images in CIFAR-10. One is average (lost in the crowd), one is an easy-to-learn outlier (not complex) and one is difficult to learn (complex). The authors also note the success rate of MIAs targeting these examples, showing that the outliers have a higher risk of being identified.&lt;/p&gt;
&lt;p&gt;The loss traces are shown in the graph below, where the average example has a really stable and early decline in loss. The easy-to-learn outlier shows less stability in early and middle epochs, but eventually stabilizes. In comparison, the hard-to-learn outlier's loss almost never stabilizes. These are the indicators the authors used (via sampled losses) to sort the outliers from the average examples and to distinguish the complex outliers from the easier ones.&lt;/p&gt;
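&lt;p&gt;If you want to experiment with this signal yourself, a very simple starting point (not the authors' full method, which uses richer statistics) is to record each example's loss at every epoch and rank examples by how erratic their trace is.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;import numpy as np

def flag_unstable_examples(loss_traces, top_fraction=0.01):
    """Rank training examples by how erratic their per-epoch loss trace is.

    `loss_traces` has shape (num_examples, num_epochs) and holds each
    example's loss at the end of every epoch. Examples whose loss keeps
    spiking (high variance across epochs) behave like the hard-to-learn
    outliers above and deserve extra privacy scrutiny.
    """
    instability = loss_traces.var(axis=1)
    k = max(1, int(len(instability) * top_fraction))
    return np.argsort(instability)[::-1][:k]   # indices, most unstable first

# Made-up traces: 1000 examples over 50 epochs, with ten erratic outliers.
rng = np.random.default_rng(3)
traces = rng.normal(0.2, 0.05, size=(1000, 50))
traces[:10] += rng.normal(0.0, 1.0, size=(10, 50))
print(flag_unstable_examples(traces))   # should surface the first ten indices
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;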
&lt;p&gt;Ideally, better testing like this would be easily accessible to practitioners. &lt;a href="https://blog.tensorflow.org/2020/06/introducing-new-privacy-testing-library.html"&gt;TensorFlow built some of these experiments and research work&lt;/a&gt; into TensorFlow code and infrastructure; however, this repository may not be maintained in the future.&lt;/p&gt;
&lt;p&gt;It would be useful to have these types of tests written into most of the major deep learning libraries, and into ML Ops and evaluation software. These tests should be easy to opt into and use, so privacy testing can become a normal part of ML/AI workflows.&lt;/p&gt;
&lt;h4 id="open-research-questions-aka-kjams-wish-list"&gt;Open research questions (aka kjam's wish list)&lt;/h4&gt;
&lt;p&gt;There are several unanswered questions when you take the theory of differential privacy and apply it to real-world machine learning.&lt;/p&gt;
&lt;p&gt;Research often uses toy datasets which sometimes have little to do with today's machine learning tasks. Proving that something works with CIFAR-10 doesn't necessarily tell me, as a practitioner, if this will help me fine-tune an LLM, or train a useful diffusion model with user-submitted art, etc.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://arxiv.org/abs/2212.06470"&gt;Tramèr et al&lt;/a&gt; called for more realistic training examples that mimicked real-world use cases that require private training; such as learning with health data, or with the Netflix Prize dataset. Of course, it is difficult to create data that mimics real private data without actually releasing real private data.&lt;/p&gt;
&lt;p&gt;Another open question I'd like to see in research is a better way to reason about the preprocessing steps and how to incorporate a holistic understanding of the privacy guarantees in an end-to-end ML system.&lt;/p&gt;
&lt;p&gt;I'd also like to see better advice and research on parameter choices, differential privacy definition choices, noise choices and privacy unit choices. Although these significantly depend on the use case, there should be better research pointing practitioners in the right direction for their use case. Papers comparing these choices using real-world tasks are rare.&lt;/p&gt;
&lt;p&gt;Finally, as mentioned in the last section, there should be better ways to actually test privacy in deep learning systems as part of training and evaluation. And better ways to reason about what those test results mean for individuals.&lt;/p&gt;
&lt;p&gt;In the following articles, you'll dive deeper into other potential solutions, many of which are either new ways of framing the problem or research ideas. I'm curious... what do you think should be done about memorization and privacy risk in AI/ML systems? What potential mitigations do you find most appealing and why?&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;To dive deeper than this relatively high-level description, I recommend &lt;a href="https://www.youtube.com/watch?v=ZxDBEyjiPxI"&gt;watching the paper presentation&lt;/a&gt;.&amp;#160;&lt;a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;There's strong critique of using Renyi differential privacy and DP-SGD in practice by &lt;a href="https://arxiv.org/abs/2206.04621"&gt;Blanco-Justicia et al., 2022&lt;/a&gt; where the authors experimented with several values of epsilon and delta, and also reasoned about what level of privacy this offered.&amp;#160;&lt;a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:3"&gt;
&lt;p&gt;There are also interesting applications for using canaries to &lt;a href="https://desfontain.es/blog/better-empirical-privacy-metrics.html"&gt;audit other differential privacy tasks like synthetic data creation&lt;/a&gt;&amp;#160;&lt;a class="footnote-backref" href="#fnref:3" title="Jump back to footnote 3 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:4"&gt;
&lt;p&gt;In research on auditing differentially private deep learning systems, &lt;a href="https://arxiv.org/abs/2006.07709"&gt;Jagielski et al.&lt;/a&gt; uncovered how targeting private SGD with specific poisoning examples allowed them to more easily influence the model behavior even with the DP noise and clipping.&amp;#160;&lt;a class="footnote-backref" href="#fnref:4" title="Jump back to footnote 4 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:5"&gt;
&lt;p&gt;&lt;a href="https://arxiv.org/abs/2302.07956"&gt;The paper also tests different types of canaries&lt;/a&gt; and develops an algorithm for advanced canary generation. I think it's worth reviewing if you're thinking of building your own canaries and/or privacy testing.&amp;#160;&lt;a class="footnote-backref" href="#fnref:5" title="Jump back to footnote 5 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="ml-memorization"></category></entry><entry><title>Get your data local: Setting up Network Attached Storage (NAS) and your first steps in self-hosting</title><link href="https://blog.kjamistan.com/get-your-data-local-setting-up-network-attached-storage-nas-and-your-first-steps-in-self-hosting.html" rel="alternate"></link><published>2026-01-30T09:00:00+01:00</published><updated>2026-01-30T09:00:00+01:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2026-01-30:/get-your-data-local-setting-up-network-attached-storage-nas-and-your-first-steps-in-self-hosting.html</id><summary type="html">&lt;p&gt;If you're just getting started with local AI and local-first development, one of the initial hurdles will be getting your data local.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;More of an audio-visual person? Check out the &lt;a href="https://youtu.be/TwCdM7fKw0c"&gt;accompanying YouTube video on the Probably Private channel&lt;/a&gt; if you'd rather watch and listen.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Why should you store data locally …&lt;/p&gt;</summary><content type="html">&lt;p&gt;If you're just getting started with local AI and local-first development, one of the initial hurdles will be getting your data local.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;More of an audio-visual person? Check out the &lt;a href="https://youtu.be/TwCdM7fKw0c"&gt;accompanying YouTube video on the Probably Private channel&lt;/a&gt; if you'd rather watch and listen.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Why should you store data locally?&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Ease of use: If you're developing your projects locally, it's a lot easier if that data is already on your local network. The initial hurdle of downloading or transferring data will almost always be the slowest part of your setup (outside of training models from scratch) if you don't store data locally.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Better control and understanding of your data: by keeping your data local, you have more control over what accesses it and how it's used in your data/AI/ML workflows. In addition, you can run tools over it to build an understanding of your data, which can give you new insights about what you might want to build.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Security: having a local copy means that if something is lost or breached, at least you have your backup. In addition, moving away from less secure services or apps and using a hybrid setup of your own can improve your security posture overall.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Self-hosted apps have significantly improved: you can now run many openly available apps and software for everyday tasks, such as managing documents, highlighting photos, streaming music and running routine jobs.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Interested in getting your data local? I'll walk you through a few steps to ensure that your data is appropriately set up for your local-first projects.&lt;/p&gt;
&lt;h3 id="choosing-your-hardware"&gt;Choosing your hardware&lt;/h3&gt;
&lt;p&gt;You'll first want to decide what type of computer to buy for storing your data. I'm a fan of having one computer just for data backups and for a small amount of software you want to use with that data. In my experience it's good to choose something that's not your GPU-enabled machine because you want it to be smaller and cheaper to run as it will likely be on most of the time (especially if you automate some of your backups).&lt;/p&gt;
&lt;p&gt;First, you might want to guesstimate how much storage you will need. It's fine to start smaller and then either buy a second computer or practice data minimization to better manage your storage. For me, I looked at the data I wanted to back up (documents, photos, self-tracked data, computer backups) and calculated approximately how many gigs that was. Then, I multiplied that by 2. That's been more than enough for my initial 7 years of self-hosting my data, especially if you set up data hygiene (removing duplicates, removing older computer backups, consolidating documents, etc).&lt;/p&gt;
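&lt;p&gt;If you want to put numbers on that, the estimate is just a sum and a factor of two; the category sizes below are placeholders, so swap in your own.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;# Back-of-the-envelope sizing with made-up numbers, all in gigabytes.
data_gb = {"photos": 400, "documents": 60, "computer_backups": 900, "music": 150}
needed_gb = sum(data_gb.values()) * 2   # the 2x headroom from the text
print(f"plan for at least {needed_gb} GB of usable storage")
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;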
&lt;blockquote&gt;
&lt;p&gt;Pro Tip: Spend some time cleaning up and consolidating the data you care about BEFORE backing it up. This is general good practice for both you and anyone who might need to look at your data one day.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;You'll also want to figure out any other features you want on your device. For example, you likely want an ethernet interface, and you might want to look at energy expenditure or ways to expand the storage (e.g. extra slots to add more storage later). Don't overcomplicate your choices if it's your first setup; start small and simple!&lt;/p&gt;
&lt;p&gt;I chose a Network Attached Storage device (or NAS), but you might want to choose a beefier computer with more chip power and RAM if you're interested in self-hosting a lot of apps, software or functionality. If you also have a &lt;a href="https://blog.kjamistan.com/building-out-my-home-ai-lab-for-private-and-local-ai.html"&gt;GPU computer setup like mine&lt;/a&gt; you could also move to the larger machine as your workflow grows, so again, don't overcomplicate where you back up your data.&lt;/p&gt;
&lt;p&gt;Once you have some basic specs, you're going to want to also decide on how your redundancy works before clicking buy. Let's dive into how RAID works to help you ensure you have enough safety in your storage.&lt;/p&gt;
&lt;h3 id="trusting-your-setup-raid"&gt;Trusting your setup (RAID)&lt;/h3&gt;
&lt;p&gt;You'll use some version of RAID (Redundant Array of Independent Disks) to ensure that your data is stored safely. RAID was invented in the late 80s as a way to ensure data is properly backed up across multiple disks (traditionally HDDs (hard disk drives), though SSDs work too).&lt;/p&gt;
&lt;p&gt;To determine what works for you, take a look at the &lt;a href="https://en.wikipedia.org/wiki/Standard_RAID_levels"&gt;RAID levels&lt;/a&gt; and choose one that fits your liking. Probably this will be RAID-5 or RAID-6. A cool thing about RAID is that it uses coding theory and error correcting codes to ensure that if one of your drives fails (e.g. the drive in slot 1 breaks), there is enough information and parity on the other drives that you can fully recreate the data via the error correcting codes.&lt;sup id="fnref:1"&gt;&lt;a class="footnote-ref" href="#fn:1"&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;Now that you know which RAID you want to use, you can make the final calculations of the storage you need on your NAS device. This means you'll need a minimum number of disks (i.e. the minimum based on the RAID you chose -- 3 disks for RAID 5 and 4 disks for RAID 6), and your disks should be sized so that, together after parity, they give you your storage estimate times some overhead (the numbers you calculated in step 1).&lt;/p&gt;
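&lt;p&gt;Here is a quick way to sanity-check the disk math before you buy. The parity-disk counts are the standard ones for RAID 5 and RAID 6; the disk sizes are placeholders.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;# Usable capacity after parity, for sanity-checking a NAS shopping list.
RAID_PARITY_DISKS = {"raid5": 1, "raid6": 2}   # RAID 5 needs 3+ disks, RAID 6 needs 4+

def usable_tb(level, num_disks, disk_tb):
    parity = RAID_PARITY_DISKS[level]
    if num_disks &amp;lt; parity + 2:
        raise ValueError(f"{level} needs at least {parity + 2} disks")
    return (num_disks - parity) * disk_tb

print(usable_tb("raid5", num_disks=4, disk_tb=2))   # 6 TB usable from 4 x 2 TB
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;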
&lt;p&gt;I recommend looking through online or local computer store options for NAS computers and choosing one that meets your specifications. A lot of the modern NAS devices have easy ways to swap out drives or expand drives, making them especially easy for home use.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A photo of my NAS with a comic on it that says &amp;quot;einfach mal herunterfahren&amp;quot;. Roughly translated: just shut it off." src="./images/2026/nas.jpeg"&gt;&lt;/p&gt;
&lt;p&gt;This is my 6-year-old, 4-disk, RAID-5 enabled NAS that's been happily running some random scripts along with backups and a few apps. It runs Debian server without any issues. I bought the parts from Jacob.de. Here's a breakdown of the components and cost.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Name and Description&lt;/th&gt;
&lt;th&gt;Price (in EUR)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Housing and computer&lt;/td&gt;
&lt;td&gt;QNAP TBS-453DX M.2 SSD NASbook - NAS-Server&lt;/td&gt;
&lt;td&gt;539.03&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SSDs&lt;/td&gt;
&lt;td&gt;ADATA XPG SX8200 Pro - SSD - 2TB (263.05 per drive) x 4&lt;/td&gt;
&lt;td&gt;1052.20&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RAM&lt;/td&gt;
&lt;td&gt;PHS-memory 16GB RAM for QNAP TBS-453DX-4G DDR4&lt;/td&gt;
&lt;td&gt;119.46&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;1,710.69&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;You'll find that NAS servers have become much less expensive than 6 years ago, and hopefully that will continue as more people move to self-hosting.&lt;/p&gt;
&lt;p&gt;Of course, when moving your data, please follow the &lt;a href="https://www.veeam.com/blog/321-backup-rule.html"&gt;3-2-1 rule&lt;/a&gt; for storing your data: 3 copies (like on your computer, your NAS and a cloud you trust), on 2 different media (e.g. 2 separate computers), with at least 1 copy in a remote location (i.e. in a cloud you trust or on a backup server somewhere else).&lt;/p&gt;
&lt;p&gt;Once you have your NAS set up (see next step), you'll actually make sure your RAID is running. If you're using Linux, a popular choice is ZFS. I think &lt;a href="https://coffeeaddict.dev/selfhosted/zfs/"&gt;this guide&lt;/a&gt; on using ZFS with your RAID choice is a good starting point, but you can also use the documentation or follow a tutorial online that fits your liking.&lt;/p&gt;
&lt;p&gt;However, if you don't want to use Linux, you can use whatever operating system you like. Let's talk a little bit about operating systems and networking, since those are often the sticking points that make self-hosting feel "too hard" or where people get stuck.&lt;/p&gt;
&lt;h3 id="operating-systems-and-networking"&gt;Operating Systems and Networking&lt;/h3&gt;
&lt;p&gt;I've been a linux user for more than 15 years, but I know I might be an outlier there, so really my first advice is to just start with an operating system you like.&lt;/p&gt;
&lt;p&gt;If you use Windows regularly and like it, install that! If you are comfortable learning linux and have used it at least once for work, maybe get started with Ubuntu as it has a nice interface that should be easier to learn. If you are a Mac user and want to support automated Mac backups, consider a Mac-Mini or something similar.&lt;/p&gt;
&lt;p&gt;Regardless of your operating system, you should be able to answer:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;How do I set up the RAID level I like?&lt;/li&gt;
&lt;li&gt;How do I test that?&lt;/li&gt;
&lt;li&gt;How do I then automate my backups?&lt;/li&gt;
&lt;li&gt;How do I connect my GPU-enabled machine for my AI/ML workloads?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So long as you can make those steps happen, it really shouldn't matter what operating system you use.&lt;sup id="fnref:2"&gt;&lt;a class="footnote-ref" href="#fn:2"&gt;2&lt;/a&gt;&lt;/sup&gt; Obviously package support will vary depending on your operating system, but if you're just using your NAS for storage and then your GPU machine for training, inference or other math-heavy workloads, you should be fine.&lt;/p&gt;
&lt;p&gt;Networking is another place where people often get stuck. First, you might not actually want your data to be accessible outside of your home network, so I would recommend starting with just getting your services working locally. Then decide how you'd like to use your local storage when you aren't on the local network.&lt;/p&gt;
&lt;p&gt;Usually you'll set up some sort of VPN. I've had good experience using &lt;a href="https://tailscale.com/download"&gt;TailScale&lt;/a&gt;, but choose one that works for you.&lt;sup id="fnref:3"&gt;&lt;a class="footnote-ref" href="#fn:3"&gt;3&lt;/a&gt;&lt;/sup&gt; In addition, you'll likely want to better control your home router, which is why I use &lt;a href="https://openwrt.org/start"&gt;openwrt&lt;/a&gt;. More on this soon!&lt;/p&gt;
&lt;h3 id="choosing-something-youre-motivated-to-use-or-do"&gt;Choosing something you're motivated to use or do&lt;/h3&gt;
&lt;p&gt;Finally, the most important part is to make sure you choose a project you're actually motivated to do. My journey into NAS and self-hosting started because I wanted to back up my photos somewhere that wasn't Google or iCloud.&lt;/p&gt;
&lt;p&gt;For you, it might be similar, or another thing you like doing; like hosting your saved bookmarks, recipes, books or anything else. There are &lt;a href="https://github.com/awesome-selfhosted/awesome-selfhosted"&gt;so many self-hosted applications&lt;/a&gt; and &lt;a href="https://www.xda-developers.com/containers-for-self-hosting-beginners/"&gt;interesting guides for people new to self-hosting&lt;/a&gt;, probably just searching "self-host [Your idea here]" is enough to get you started. I can also recommend checking out &lt;a href="https://www.reddit.com/r/selfhosted/"&gt;the selfhosted subreddit&lt;/a&gt; for inspiration.&lt;/p&gt;
&lt;p&gt;Choose a project that you feel passionate enough about that when you hit a troubleshooting problem you are still motivated to fix it. Of course, take a hot cocoa or tea break in between, or even let a few days pass, but if you're motivated to overcome the initial obstacles of moving to self-hosting, your second, third and fourth projects will benefit greatly.&lt;/p&gt;
&lt;p&gt;Like with many things, starting a bit smaller and growing an idea over time has a lot of benefits. Be patient and enjoy the learning process. I'd be excited to hear how your self-hosting journey goes.&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;If you want to learn more about coding theory, I highly recommend &lt;a href="https://www.youtube.com/playlist?list=PLidiQIHRzpXLSQBywYbSZ5PUhkR6VWM2P"&gt;Mary Wooters course&lt;/a&gt;.&amp;#160;&lt;a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;I also always choose self-installed Linux because it generally has relatively good support for security patches. Whatever OS you use, make sure you are regularly updating your security packages -- and I would steer away from an unfamiliar OS that comes pre-installed on your device, as it might be a cheap Linux-based image that isn't kept updated.&amp;#160;&lt;a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:3"&gt;
&lt;p&gt;Some VPNs have pretty awful practices when it comes to privacy, so please make sure you choose a trusted and audited VPN to make sure they aren't tracking and selling your private data. Here's one horror story from &lt;a href="https://www.yahoo.com/news/articles/millions-private-chatgpt-conversations-being-140103898.html"&gt;Urban Proxy VPN&lt;/a&gt;.&amp;#160;&lt;a class="footnote-backref" href="#fnref:3" title="Jump back to footnote 3 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="personal-ai"></category></entry><entry><title>Building out my home AI Lab for private and local AI</title><link href="https://blog.kjamistan.com/building-out-my-home-ai-lab-for-private-and-local-ai.html" rel="alternate"></link><published>2026-01-15T09:00:00+01:00</published><updated>2026-01-15T09:00:00+01:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2026-01-15:/building-out-my-home-ai-lab-for-private-and-local-ai.html</id><summary type="html">&lt;p&gt;So, you wanna do at-home AI? Yes, you do!&lt;/p&gt;
&lt;p&gt;There's a bunch of great reasons to run your own AI including having more control over your data and models, learning more about how deep learning works, testing out new ideas without having to pay extra cloud or subscription costs and …&lt;/p&gt;</summary><content type="html">&lt;p&gt;So, you wanna do at-home AI? Yes, you do!&lt;/p&gt;
&lt;p&gt;There's a bunch of great reasons to run your own AI including having more control over your data and models, learning more about how deep learning works, testing out new ideas without having to pay extra cloud or subscription costs and building out your ability to run AI safely and under your own terms.&lt;/p&gt;
&lt;p&gt;In this short post, I'll show you the specs for my two machines; you can find out more about them via &lt;a href="https://youtu.be/3h_JCBVnHBI"&gt;my YouTube explainer&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I built out my first machine in 2017 to work on some adversarial machine learning projects and a few paid fine-tuning engagements. But probably more useful to talk about is my second machine, which I built out last year to bring to the &lt;a href="https://feministai.party"&gt;Feminist AI LAN party&lt;/a&gt; and to tinker with for my own AI projects.&lt;/p&gt;
&lt;div class="toc"&gt;&lt;span class="toctitle"&gt;Table of Contents&lt;/span&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#hardware-rundown"&gt;Hardware Rundown&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#software-tips"&gt;Software Tips&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#look-into-drivers"&gt;Look into drivers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#setting-up-your-python-and-gpu-software-environments"&gt;Setting up your Python and GPU-software environments&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#focus-on-one-virtual-environment-per-project-or-use-case"&gt;Focus on one virtual environment per project or use case&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#run-your-first-mlai-job"&gt;Run your first ML/AI job!&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;h3 id="hardware-rundown"&gt;Hardware Rundown&lt;/h3&gt;
&lt;p&gt;&lt;img alt="A photo of my very small, very rainbow gaming computer set up with 32GB of GPUs. The GPU lights up in rainbow colors, as does the RAM and the fans. The casing is pink and white." src="./images/2026/rainbow_gaming_pc.jpeg"&gt;
&lt;em&gt;My very small, very rainbow GPU machine&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Here's the full specs of my newest machine:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Name and Description&lt;/th&gt;
&lt;th&gt;Price (in EUR)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPU&lt;/td&gt;
&lt;td&gt;Gainward GEForce RTX Phantom&lt;/td&gt;
&lt;td&gt;2,999.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CPU&lt;/td&gt;
&lt;td&gt;AMD Ryzen 7 9800X3D CPU, 8 cores, 5.2 GHz, AM5 (Granite Ridge)&lt;/td&gt;
&lt;td&gt;538.79&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hard Drive&lt;/td&gt;
&lt;td&gt;Samsung 990 PRO Series NVMe SSD, PCIe 4.0 M.2 Type 2280 (4TB)&lt;/td&gt;
&lt;td&gt;328.03&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mainboard&lt;/td&gt;
&lt;td&gt;GIGABYTE B850I Aorus Pro, AMD B850 Mainboard&lt;/td&gt;
&lt;td&gt;298.99&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RAM&lt;/td&gt;
&lt;td&gt;G.Skill Trident Z5 CK DDR5-8200 RAM, CL40, XMP 3.0, CUDIMM - 48 GB Dual-Kit&lt;/td&gt;
&lt;td&gt;260.58&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Power Supply&lt;/td&gt;
&lt;td&gt;Corsair RMe Series RM1200e 80 PLUS Gold power supply&lt;/td&gt;
&lt;td&gt;239.90&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Housing&lt;/td&gt;
&lt;td&gt;GEJP-011 Jonsplus Z20 Micro-ATX&lt;/td&gt;
&lt;td&gt;94.90&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cooling&lt;/td&gt;
&lt;td&gt;Thermalright Peerless Assassin 120 SE A-RGB - 120 mm&lt;/td&gt;
&lt;td&gt;44.90&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Shipping&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;20.98&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Grand Total&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;4826.07&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;So for a total of 4826.07 euros, I have 32 GB of GPU memory and 48 GB of RAM. For me, compared with the hassle of getting my datasets to a cloud I trust and setting up their GPUs with software I like using, this was a worthwhile investment.&lt;/p&gt;
&lt;p&gt;If you're just getting started with deep learning and AI, I wouldn't advise getting such a beefy GPU. For a long time, I used the 16GB offered by my old machine for fine-tuning, running small models for inference and general ML/AI/DL tinkering.&lt;/p&gt;
&lt;p&gt;I bought these parts from a mixture of vendors including: MediaMarkt, Alternativ, NotebooksBilliger and CaseKing (who were also kind enough to let me drop in and exchange things, thank you!).&lt;/p&gt;
&lt;p&gt;I'll be sharing more about how to set up your own LAN party, but in case you're already in the purchasing mood, here's the networking equipment I brought with me:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Name and Description&lt;/th&gt;
&lt;th&gt;Price (in EUR)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Switch&lt;/td&gt;
&lt;td&gt;MikroTik Cloud Router Switch - CRS328-24P-4S+RM&lt;/td&gt;
&lt;td&gt;400.19&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;USB to Ethernet Connectors&lt;/td&gt;
&lt;td&gt;TP-LINK UE300C UE300C USB Type-C to RJ45 Gigabit Ethernet Network Adapter x 10 (18.07 per part)&lt;/td&gt;
&lt;td&gt;180.70&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Grand Total&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;580.89&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;I got these from MediaMarkt and OMG.de. If you want a smaller switch with fewer capabilities, that'll do fine, especially if you're only connecting one server.&lt;/p&gt;
&lt;p&gt;I couldn't find all the specs for my 2017 machine (oops!), but here's what I could find.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A photo of my older machine, with two green lit up GPUs. You can see it is much larger than the newer one." src="./images/2026/old_pc.jpeg"&gt;
&lt;em&gt;My old but trustworthy GPU machine&lt;/em&gt;&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Name and Description&lt;/th&gt;
&lt;th&gt;Price (in EUR)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPUs&lt;/td&gt;
&lt;td&gt;Gainward GeForce GTX 1080 Ti Phoenix GS x 2 (799 per GPU)&lt;/td&gt;
&lt;td&gt;1598.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RAM&lt;/td&gt;
&lt;td&gt;Kingston ValueRAM DIMM 16 GB DDR4-2400 x 8 (90 per unit)&lt;/td&gt;
&lt;td&gt;1463.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mainboard&lt;/td&gt;
&lt;td&gt;ASUS X99-E WS&lt;/td&gt;
&lt;td&gt;494&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Power Supply&lt;/td&gt;
&lt;td&gt;Corsair AX1500i 1500 Watt 80+ Titanium Quality&lt;/td&gt;
&lt;td&gt;450&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hard Drive&lt;/td&gt;
&lt;td&gt;Intel® 480GB DC S4600 Series 2.5" SATA, Solid State Drive&lt;/td&gt;
&lt;td&gt;384&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Housing&lt;/td&gt;
&lt;td&gt;Corsair Carbide Air 540&lt;/td&gt;
&lt;td&gt;175&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CPU&lt;/td&gt;
&lt;td&gt;??&lt;/td&gt;
&lt;td&gt;??&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cooling&lt;/td&gt;
&lt;td&gt;??&lt;/td&gt;
&lt;td&gt;??&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Incomplete Total&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;4564&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;If you're interested, I'll see if I can find the receipts or a price graph and dig around my computer to fill out the above chart.&lt;/p&gt;
&lt;p&gt;Back then, I bought most of the supplies from Amazon and Alternativ.&lt;/p&gt;
&lt;p&gt;As you can see, RAM cost a lot more back then (about 11 euros per gigabyte compared to about 5 euros now). That trend should continue. It's a lot harder to time pricing of GPUs mainly because the market is now very unpredictable and there are only a few suppliers.&lt;/p&gt;
&lt;p&gt;I'll be testing out a few other GPUs this year and maybe building out a machine live on YouTube, so stay tuned.&lt;/p&gt;
&lt;h3 id="software-tips"&gt;Software Tips&lt;/h3&gt;
&lt;p&gt;If you're new to setting up a computer for AI use, you'll probably have a learning curve for setting up Linux. Just be patient with yourself and take it slow!&lt;/p&gt;
&lt;p&gt;If you've already been running linux either at work on servers or on your own machines, there are still some tips if you're new to AI/ML workflows. I'll try to summarize some here and I'd be happy to update with additional feedback.&lt;/p&gt;
&lt;h4 id="look-into-drivers"&gt;Look into drivers&lt;/h4&gt;
&lt;p&gt;Depending on how old or new your GPUs are, you might run into driver issues. This is because the GPU providers sometimes change the chip architecture and the open-source drivers might not yet support it.&lt;/p&gt;
&lt;p&gt;This means first looking into what drivers you can use with the GPU you bought. Usually someone has posted about this, so I recommend looking at the GPU specifications page and then searching the internet. If the open-source drivers work, then install those. If only proprietary drivers work, install those.&lt;/p&gt;
&lt;p&gt;There will be additional libraries that you need that are usually distributed via your linux OS. For example, in debian/Ubuntu there are several supporting packages for NVIDIA GPUs that are required and often start with nvidia-. Look into specifically which of these work for the drivers that you choose.&lt;/p&gt;
&lt;p&gt;You can usually run a command once the drivers are properly installed to test that the mainboard and operating system can talk to your GPU. First, always reboot so the new drivers load properly. Then, I run nvidia-smi to see that my Ubuntu install can see my chip. There are different commands for AMD GPUs.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A screenshot of the output of nvidia-smi command. It shows a table where the GPU is listed and then a following table with the processes running on the GPU." src="./images/2026/nvidia-smi.png"&gt;&lt;/p&gt;
&lt;h4 id="setting-up-your-python-and-gpu-software-environments"&gt;Setting up your Python and GPU-software environments&lt;/h4&gt;
&lt;p&gt;The next thing you'll need to do is set up your Python environments either using something like &lt;a href="https://www.anaconda.com/download"&gt;conda&lt;/a&gt; or &lt;a href="https://docs.astral.sh/uv/"&gt;uv&lt;/a&gt;. Many people new to Python prefer uv, so if you haven't done a lot of Python, start there.&lt;/p&gt;
&lt;p&gt;In addition, your GPU has specific software that helps the Python libraries run the correct parallelization for your chips. For NVIDIA GPUs this is &lt;a href="https://developer.nvidia.com/cuda-downloads"&gt;CUDA&lt;/a&gt;. For AMD, you can use several open libraries, like &lt;a href="https://www.compilersutra.com/docs/gpu/opencl/basic/setting_up_opencl/"&gt;OpenCL&lt;/a&gt;, or use &lt;a href="https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html"&gt;ROCm&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;You should install this GPU-specific software from the vendor, making sure you select the right operating system and CPU architecture. Some of these libraries are also available via your operating system's packages, so have a look and see whether they are updated to your liking.&lt;/p&gt;
&lt;p&gt;I usually just install CUDA from the NVIDIA website, so I can always opt for the newest version. This is because some of the Python libraries you want to use might only support newer versions.&lt;/p&gt;
&lt;p&gt;&lt;img alt="An example of the CUDA installer selection, where you need to choose your operating system, your CPU architecture and what type of installer you want to use." src="./images/2026/cuda_installer_selection.png"&gt;
&lt;em&gt;CUDA Installer Example&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Once you have the GPU-software and your initial Python environment running, you can get started with AI/ML specific libraries.&lt;/p&gt;
&lt;h4 id="focus-on-one-virtual-environment-per-project-or-use-case"&gt;Focus on one virtual environment per project or use case&lt;/h4&gt;
&lt;p&gt;Always use new virtual environments for new projects. Because many of the ML/AI libraries will have underlying dependencies based on Python version, GPU-software version and other Python libraries, this means focusing on the most important library you want to use first.&lt;/p&gt;
&lt;p&gt;This means, for your project, making a new virtual environment with a Python version you know is supported by the most important library you want to use for the project. Usually, I start with &lt;a href="https://pytorch.org/get-started/locally/"&gt;PyTorch&lt;/a&gt;, but you may start with something else, like vLLM or Hugging-Face or some other library you want to choose.&lt;/p&gt;
&lt;p&gt;PyTorch has an easy-to-use recommender that helps you decide how to install it based on your operating system and GPU-software. Not all libraries will have something like this, so you might need to do some trial-and-error if you find that it doesn't work on your machine (like trying an earlier version of the library, or searching around to see if anyone has solved it).&lt;/p&gt;
&lt;p&gt;&lt;img alt="An example selector for which torch version to install. You choose which operating system, which package provider, which language and what type of GPU-software you are using. Then it outputs what command to use for installation." src="./images/2026/torch_installer_selection.png"&gt;
&lt;em&gt;Torch Installer Example&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;One thing to note: if you upgrade your CUDA/ROCm, you might break some of your virtual environments. This is just a pain you'll need to live with, and also a reason to start with the latest GPU-software available. There are ways to run multiple CUDA/ROCm installs, but I haven't actually done that for my personal projects yet.&lt;/p&gt;
&lt;p&gt;Once you get your main library installed, test it! For example, for PyTorch and CUDA, you can start a Python shell in your virtual environment and see that it runs on your GPU.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;torch&lt;/span&gt;
&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_available&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;It should return True. You can also use &lt;code&gt;torch.cuda.get_device_name(0)&lt;/code&gt; to ensure that it matches your expectations.&lt;/p&gt;
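&lt;p&gt;If you want one more check beyond the availability flag, a tiny computation on the GPU confirms the whole stack works end to end (this assumes an NVIDIA/CUDA setup; the device string may differ for other builds):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;import torch

# Run a small matrix multiply on the GPU and report which device did the work.
x = torch.rand(2048, 2048, device="cuda")
y = x @ x
print(torch.cuda.get_device_name(0), y.device)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;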
&lt;p&gt;Then, get started installing anything else you might want for the project. Make sure that the version of your main library does not change, either by specifying which version you want to keep as an additional installation requirement (e.g. install YOUR_MAIN_LIBRARY==VERSION new_library) or by keeping an eye on any libraries that get changed during installation.&lt;/p&gt;
&lt;h4 id="run-your-first-mlai-job"&gt;Run your first ML/AI job!&lt;/h4&gt;
&lt;p&gt;At this point, you're ready to test out your setup for running a workflow. Train your first model, fine-tune something, or serve a model that you want to use. Check that when it loads, it says you are using your GPU. If not, restart at the top and verify each step (i.e. drivers are working, GPU-software is working, library is working, workflow test).&lt;/p&gt;
&lt;p&gt;Enjoy! For more videos on how to run things on your setup, check out the &lt;a href="https://www.youtube.com/@ProbablyPrivate"&gt;Probably Private YouTube&lt;/a&gt;.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;If this post helped you, consider &lt;a href="https://probablyprivate.com/subscribe"&gt;subscribing to my newsletter&lt;/a&gt; or &lt;a href="https://www.youtube.com/@ProbablyPrivate?sub_confirmation=1"&gt;my YouTube&lt;/a&gt; and sharing my work! I also offer &lt;a href="https://kjamistan.com"&gt;advisory and workshops&lt;/a&gt; on topics like security and privacy in AI/ML and personal AI.&lt;/p&gt;</content><category term="personal-ai"></category></entry><entry><title>Differential Privacy in Today's AI: What's so hard?</title><link href="https://blog.kjamistan.com/differential-privacy-in-todays-ai-whats-so-hard.html" rel="alternate"></link><published>2026-01-06T00:00:00+01:00</published><updated>2026-01-06T00:00:00+01:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2026-01-06:/differential-privacy-in-todays-ai-whats-so-hard.html</id><summary type="html">&lt;p&gt;In the last article in the &lt;a href="https://blog.kjamistan.com/a-deep-dive-into-memorization-in-deep-learning.html"&gt;series on addressing the problems of memorization in deep learning and AI&lt;/a&gt;, you learned about differential privacy and how to apply it to deep learning/AI systems. In this article, you'll explore what can go wrong when using differential privacy training in deep learning …&lt;/p&gt;</summary><content type="html">&lt;p&gt;In the last article in the &lt;a href="https://blog.kjamistan.com/a-deep-dive-into-memorization-in-deep-learning.html"&gt;series on addressing the problems of memorization in deep learning and AI&lt;/a&gt;, you learned about differential privacy and how to apply it to deep learning/AI systems. In this article, you'll explore what can go wrong when using differential privacy training in deep learning and open questions around using differential privacy to address memorization in machine learning.&lt;/p&gt;
&lt;p&gt;In this article, you'll confront some larger issues in applying differential privacy in deep learning systems. These issues span beyond tuning parameters and applying technical thinking into how organizations function and what privacy means to us as individuals and society.&lt;/p&gt;
&lt;p&gt;You'll address the following questions:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;What data needs to be protected in the ML lifecycle?&lt;/li&gt;
&lt;li&gt;Is your data representative enough to learn while using DP?&lt;/li&gt;
&lt;li&gt;Can some tasks ever be private?&lt;/li&gt;
&lt;li&gt;How can you work across disciplines to develop real use cases?&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id="defining-sensitive-data-and-governing-its-processing"&gt;Defining sensitive data and governing its processing&lt;/h3&gt;
&lt;p&gt;You might recall from the last article that a group of Google researchers pretrained a BERT model (LLM) using differential privacy. This isn't as common as you might think; in many instances of using differential privacy at scale, practitioners first pretrain or train a base model on "public" data and then later fine-tune the model using differential privacy.&lt;/p&gt;
&lt;p&gt;There are certainly accuracy benefits when doing a part of the training without differential privacy. Many companies claim scraped data is public anyways, so there's no problem training on it. But, is all scraped data from the web really "public"? Are there any privacy problems with pretraining on internet-scale data scraped from the web?&lt;/p&gt;
&lt;p&gt;&lt;a href="https://arxiv.org/abs/2212.06470"&gt;Tramèr et al. (2024) released a position paper&lt;/a&gt; calling for a more nuanced approach to what data is considered public. The authors cite a real example where someone's phone number was memorized in a LLM. When the person asked for their data to be deleted, the company responded that memorization wasn't possible because they fine-tuned with differential privacy.&lt;sup id="fnref:1"&gt;&lt;a class="footnote-ref" href="#fn:1"&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;What actually happened was that person's number was exposed during pretraining on "public" data due to a file being on the public internet with the person's information. The model memorized it from the "public" pretraining and no amount of differential privacy fine-tuning mattered (in this instance).&lt;/p&gt;
&lt;p&gt;The paper also highlights a GitHub user who accidentally published their cryptocurrency wallet information. When they realized this, they deleted it; however, Copilot already memorized the string and someone extracted that information and emptied the wallet.&lt;/p&gt;
&lt;p&gt;So is web scraped data really "public"?&lt;/p&gt;
&lt;p&gt;The authors write:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Someone may post their contact information along with a research publication with the intent that it is used to contact that person about details of the publication. Sensitive data about individuals could also be uploaded to the Internet unintentionally (or by third parties privy to this information). As a result, people often underestimate how much information about them is accessible on the Web, and might not consent to their “publicly accessible” personal data being used for training machine learning models.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;In both of these cases privacy wasn't considered as part of the entire machine learning life cycle. To properly address privacy, you'd need to look at the entire system: from dataset collection, preprocessing, embedding models and the actual AI/deep learning system.&lt;/p&gt;
&lt;p&gt;Beyond deciding which data should be used with or without differential privacy, there are open questions around how to best apply differential privacy in machine learning setups. One of which is the ability to reason about the impact of differential privacy on groups in the dataset.&lt;/p&gt;
&lt;h4 id="is-the-data-representative-enough-for-error"&gt;Is the data representative enough for error?&lt;/h4&gt;
&lt;p&gt;For training deep learning, you need to evaluate the training pipeline in its entirety alongside deciding what data you have for the task. But before you know whether your training will even work, how can you decide if you have the right data to learn? And what do you already know about what you need to learn?&lt;/p&gt;
&lt;p&gt;&lt;a href="https://arxiv.org/abs/2011.11660"&gt;Tramèr and Boneh (2021)&lt;/a&gt; found that to reach the same learning accuracy a differentially private deep learning model might need an "order of magnitude" more data.&lt;/p&gt;
&lt;p&gt;How come? Well, if the model cannot memorize or learn from just one example, it must process many examples of that concept to learn it privately.&lt;/p&gt;
&lt;p&gt;If you studied machine learning or have worked in the field for some time, you might have come across PAC learning theory or universal learning theory. These theories help the field discover more about how machine learning works, how to make it more efficient, how to evaluate learning bounds and determine what data is required to appropriately learn. Too often, learnings from this field are not integrated into how today's AI/ML systems are built. This means you are often doing things more inefficiently than necessary.&lt;/p&gt;
&lt;p&gt;There's been significant evolution in the overlap between learning theory and differential privacy. For example, there are bounds on what can be learned (PAC theory) when it comes to using differential privacy on certain complex distributions. There's also mathematical proof that all distributions are differentially privately learnable, although not necessarily efficiently.&lt;sup id="fnref:2"&gt;&lt;a class="footnote-ref" href="#fn:2"&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;One thing worth considering when choosing datasets and collecting other people's data for deep learning is that the underlying distribution and sample complexity impact whether you can learn privately. You already learned this by understanding that complex classes or examples are prone to memorization and hard to unlearn, but I want you to consider the problem from a different perspective.&lt;/p&gt;
&lt;p&gt;In an applied setting: If you don't have enough &lt;em&gt;different&lt;/em&gt; and diverse data, or if you don't have a well-defined problem space and can't find data that adequately represents that problem, you probably won't be able to learn privately. And to be honest, you might not even be able to learn well without differential privacy!&lt;/p&gt;
&lt;p&gt;This means spending time to understand your training data and how it represents the task you are trying to learn. This means looking critically at benchmarks and deciding if the data really represents the task.&lt;/p&gt;
&lt;p&gt;My advice: think through your problem and task deeply. Figure out what you actually really need to learn and what is superfluous. Then determine if you can actually simulate, collect or produce data that matches that requirement. This will help you not only learn privately, but also more efficiently.&lt;/p&gt;
&lt;p&gt;Evaluating this as a team can spark useful conversations about what is worth learning and at what cost.&lt;/p&gt;
&lt;h2 id="is-it-possible-to-train-this-model-privately"&gt;Is it possible to train this model privately?&lt;/h2&gt;
&lt;p&gt;It can be difficult to determine whether a deep learning/AI model can be private given a particular use case and dataset. Is it private if the model memorizes sensitive data, despite differential privacy, due to repeated exposure? This can happen when multiple sources in the data contain the same sensitive information.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://arxiv.org/abs/2202.05520"&gt;Brown et al (2022)&lt;/a&gt; asked this same question specifically for language models. The authors argue that &lt;a href="https://en.wikipedia.org/wiki/Contextual_integrity"&gt;Nissenbaum's contextual integrity&lt;/a&gt; should apply to language models. Nissembaum's theory says each user should have autonomy about how, where and in what context their data appears. The authors argue the only data that matches how LLMs are used today is data intended to be freely available for the general public.&lt;/p&gt;
&lt;p&gt;Text origin and ownership are often difficult to define, yet defining them is a key decision for appropriately applying differential privacy. For example, as you learned &lt;a href="https://blog.kjamistan.com/differential-privacy-in-deep-learning.html"&gt;in the last article&lt;/a&gt;, to do appropriate privacy accounting, you define how much one person can contribute to the training. This is surprisingly difficult for text data because sometimes someone is quoting another person, or paraphrasing or referencing. Or someone may use different accounts or handles but be the same person. How can you define authorship well enough to apply differential privacy?&lt;sup id="fnref:3"&gt;&lt;a class="footnote-ref" href="#fn:3"&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;The authors also ask: in what context did I write this and to whom? Text is easy to forward, duplicate and share in new ways. Someone can forward my email, quote me or paraphrase something without attribution. This means that the original author and intent are easily lost, and person-level attribution and accounting ends up being quite difficult.&lt;/p&gt;
&lt;p&gt;The memorization that can happen, even with state-of-the-art differentially private language models, affects real lives. In 2021, &lt;a href="https://arxiv.org/abs/2104.07762"&gt;researchers found that people's names appeared alongside medical conditions&lt;/a&gt; that were extractable from clinical notes that were leaked online and appeared in the training data.&lt;/p&gt;
&lt;p&gt;The authors' thesis can be applied beyond language, since digitized data easily loses its sources and context. Nissenbaum's theory states that our human understanding of privacy doesn't translate well to the digital world. It's easy to accidentally overshare, to post something that goes beyond its original context or to take someone else's data and share it in a way they never intended.&lt;/p&gt;
&lt;p&gt;For those of us working on this problem: what large language models can truly be "privacy preserving"? When can you ensure the guarantees match the real-world concerns and context? Is example-level privacy good enough for the problem at hand? Do we need to think through attribution at a higher level?&lt;/p&gt;
&lt;p&gt;In my opinion, the field would benefit from more multidisciplinary teams having these discussions. This would help:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;ensure researchers are working on real-world problems&lt;/li&gt;
&lt;li&gt;align legal understandings with technical ones&lt;/li&gt;
&lt;li&gt;spark social conversations around what data goes into AI and what protections are expected&lt;/li&gt;
&lt;li&gt;develop new model types: ones that can attribute sources for example&lt;/li&gt;
&lt;li&gt;create new business models: ones where people opt into having their data used either for compensation or for co-use&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Privacy is, by default, a human understanding, and the best bet for achieving real privacy in ML/AI involves putting humans at the center of the conversation.&lt;/p&gt;
&lt;h4 id="developing-multidisciplinary-thinking"&gt;Developing multidisciplinary thinking&lt;/h4&gt;
&lt;p&gt;Defining the use cases, tasks and training data is rarely multidisciplinary. In many organizations, product defines the use cases and sends them over to the data or machine learning team, or even hires a third party provider to do so.&lt;/p&gt;
&lt;p&gt;This game of telephone means that sometimes the use case and task are not well aligned with the data and tools available, or even well defined. When this happens, it often also means that the privacy and security requirements end up being misunderstood or not well translated.&lt;/p&gt;
&lt;p&gt;Even when working with third parties, healthy conversations and co-design in multidisciplinary teams can improve both product performance and how well the privacy and security requirements are met.&lt;/p&gt;
&lt;p&gt;Ideally the product lifecycle includes privacy, security, machine learning and risk stakeholders from the beginning. It could look something like this:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Start product ideation as a multidisciplinary team&lt;/li&gt;
&lt;li&gt;Threat model and provide privacy engineering input based on early architecture, data and design choices (nothing built yet)&lt;/li&gt;
&lt;li&gt;Begin model, data and software development using identified privacy technologies and best practices&lt;/li&gt;
&lt;li&gt;Evaluate models based on specs from product, privacy and security&lt;/li&gt;
&lt;li&gt;Finalize model candidates and integrate them into stack&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=r30lMqmU-S0&amp;amp;list=PLJkNSeYcYBlC88vkG58yx3fHSobmCmDw_"&gt;Purple team&lt;/a&gt; models and perform privacy testing&lt;/li&gt;
&lt;li&gt;Tweak guardrails, controls and model choice based on attack and evaluation success&lt;/li&gt;
&lt;li&gt;Launch after sign-off from the multidisciplinary team&lt;/li&gt;
&lt;li&gt;Check-in and re-evaluate based on changing risk and model landscape and learnings from other products&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By testing privacy technologies like differential privacy as you go, the organization and involved teams gain knowledge, understanding and experience on how to use them effectively. Eventually there will be enough experience to expedite decision making on design patterns and to effectively integrate privacy and security technologies into common stack choices and platform design.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A card deck showing a title card that reads &amp;quot;Let's play Singularity: an AI governance game from Thoughtworks&amp;quot;" src="./images/2026/singularity_card_game.jpg"&gt;
&lt;em&gt;I helped co-develop the &lt;a href="https://www.thoughtworks.com/en-de/insights/blog/generative-ai/lets-play-singularity-ai-governance-card-game"&gt;Thoughtworks Singularity card game&lt;/a&gt; as a fun way to practice multidisciplinary threat modeling and risk assessment alongside my colleagues Jim Gumbley and Erin Nicholson.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;By involving many stakeholders in risk evaluation and mitigations, you'll also develop a more mature approach to data and AI. There's something special about evaluating risk as a team because understanding risk also means actually building a deeper understanding of the system and its parts. As a team you'll learn more about how models work, when they fail and what you should do about it.&lt;/p&gt;
&lt;p&gt;In the next article, you'll investigate what is already known about auditing privacy and differential privacy choices in deep learning systems, and explore what isn't yet known or regular practice.&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;This is fairly common practice and a good idea if you are fine-tuning and not actually pretraining your own models. You can follow the same advice from &lt;a href="https://blog.kjamistan.com/differential-privacy-in-deep-learning.html"&gt;the last article&lt;/a&gt; or &lt;a href="https://practicaldataprivacybook.com/"&gt;in my book on the topic&lt;/a&gt;.&amp;#160;&lt;a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;If you have an hour to spare, I highly recommend &lt;a href="https://www.youtube.com/watch?v=wk910Aj559A"&gt;this lecture on the topic from Shay Moran, an expert in studying learning theory and differential privacy&lt;/a&gt;.&amp;#160;&lt;a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:3"&gt;
&lt;p&gt;This is why many deep learning models try to protect privacy at the per-example level, but it's a very good point that this will certainly leave prolific creators, writers, journalists and frequently quoted people overexposed by design.&amp;#160;&lt;a class="footnote-backref" href="#fnref:3" title="Jump back to footnote 3 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:4"&gt;
&lt;p&gt;If you need an example of how to get started, check out my &lt;a href="https://www.youtube.com/watch?v=r30lMqmU-S0&amp;amp;list=PLJkNSeYcYBlC88vkG58yx3fHSobmCmDw_"&gt;Probably Private YouTube series on Purple Teaming AI Models&lt;/a&gt;.&amp;#160;&lt;a class="footnote-backref" href="#fnref:4" title="Jump back to footnote 4 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="ml-memorization"></category></entry><entry><title>Differential Privacy in Deep Learning</title><link href="https://blog.kjamistan.com/differential-privacy-in-deep-learning.html" rel="alternate"></link><published>2025-11-10T00:00:00+01:00</published><updated>2025-11-10T00:00:00+01:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2025-11-10:/differential-privacy-in-deep-learning.html</id><summary type="html">&lt;p&gt;Differential privacy influenced both privacy attacks and defenses you've investigated in this &lt;a href="https://blog.kjamistan.com/a-deep-dive-into-memorization-in-deep-learning.html"&gt;series on AI/ML memorization&lt;/a&gt;. You might be wondering: what exactly is differential privacy when it's applied to deep learning? And can it address the problem of memorization?&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Are you a visual learner? There's &lt;a href="https://youtu.be/p6p9i1Hbcns"&gt;a YouTube video on …&lt;/a&gt;&lt;/p&gt;&lt;/blockquote&gt;</summary><content type="html">&lt;p&gt;Differential privacy influenced both privacy attacks and defenses you've investigated in this &lt;a href="https://blog.kjamistan.com/a-deep-dive-into-memorization-in-deep-learning.html"&gt;series on AI/ML memorization&lt;/a&gt;. You might be wondering: what exactly is differential privacy when it's applied to deep learning? And can it address the problem of memorization?&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Are you a visual learner? There's &lt;a href="https://youtu.be/p6p9i1Hbcns"&gt;a YouTube video on this article&lt;/a&gt; on the Probably Private YouTube channel.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;In this article, you'll learn how differential privacy is applied to today's AI/deep learning models and evaluate if this could be a useful approach for addressing memorization problems. In following articles, you'll explore limitations of applying differential privacy in today's systems and critically think through auditing real-world applications.&lt;/p&gt;
&lt;div class="toc"&gt;&lt;span class="toctitle"&gt;Table of Contents&lt;/span&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#what-is-differential-privacy-in-machine-learning"&gt;What is differential privacy in machine learning?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#can-differential-privacy-help-with-memorization-and-privacy-attacks"&gt;Can differential privacy help with memorization and privacy attacks?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#how-does-it-work"&gt;How does it work?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#tips-from-the-field"&gt;Tips from the field&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#framing-the-problem-appropriately"&gt;Framing the problem appropriately&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;h3 id="what-is-differential-privacy-in-machine-learning"&gt;What is differential privacy in machine learning?&lt;/h3&gt;
&lt;p&gt;My favorite definition of differential privacy comes from &lt;a href="https://arxiv.org/abs/1906.01337"&gt;Desfontaines and Pejó's 2022 paper&lt;/a&gt; (brackets are added for this particular scenario):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;An attacker with perfect background knowledge (B) and unbounded computation power (C) is unable (R) to distinguish (F) anything about an individual (N) [when querying a machine learning model], uniformly across users (V) [whose data was in the training dataset] even in the worst-case scenario (Q)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That seems like a pretty high bar compared to what you've been evaluating with &lt;a href="https://blog.kjamistan.com/machine-unlearning-what-is-it.html"&gt;unlearning&lt;/a&gt;! Differential privacy offers a fairly strict and rigorous definition of privacy. For that reason, the variables (shown as letters above) are often combined in different ways to determine whether a stronger or weaker definition should be used, based on use-case or context-specific privacy requirements.&lt;/p&gt;
&lt;p&gt;Differential privacy provides guarantees for every individual when applied to data collection, access and use, while still offering enough information to learn something. This is the balance between data utility and individual privacy.&lt;/p&gt;
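&lt;p&gt;To make the utility-versus-privacy balance concrete before diving into deep learning, here's a minimal sketch of the simplest possible differentially private mechanism: a counting query with Laplace noise. This is my own illustrative toy, not something from the paper above, but it shows the core trade: the smaller the epsilon, the noisier (and more private) the answer.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np

def dp_count(records, epsilon):
    """Counting query with the Laplace mechanism.

    One person joining or leaving changes the true count by at most 1
    (the sensitivity), so adding noise drawn from Laplace(scale = 1/epsilon)
    yields an epsilon-differentially-private answer.
    """
    rng = np.random.default_rng()
    true_count = len(records)
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Smaller epsilon: stronger privacy, noisier (less useful) answer.
for eps in (0.1, 1.0, 10.0):
    print(eps, dp_count(range(1000), eps))
&lt;/code&gt;&lt;/pre&gt;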
&lt;hr&gt;
&lt;p&gt;&lt;strong&gt;TIP&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;If you're new to differential privacy, I highly recommend taking a gander through &lt;a href="https://desfontain.es/blog/friendly-intro-to-differential-privacy.html"&gt;Damien Desfontaines's introduction and in-depth articles&lt;/a&gt;. You can thank me later. :)&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;Ideally when doing machine learning, you are learning from many persons not from one individual; therefore, differential privacy is a natural fit if you want to make sure that you learn from a group and not from any one specific person.&lt;/p&gt;
&lt;p&gt;But, as you've learned throughout this series, this can be challenging when novel, complex examples show up...&lt;/p&gt;
&lt;h3 id="can-differential-privacy-help-with-memorization-and-privacy-attacks"&gt;Can differential privacy help with memorization and privacy attacks?&lt;/h3&gt;
&lt;p&gt;You might recall from &lt;a href="https://blog.kjamistan.com/differential-privacy-as-a-counterexample-to-aiml-memorization.html"&gt;the previous article on differential privacy&lt;/a&gt; that differentially private models reveal memorization problems in deep learning. Differentially private models underperformed on particular novel examples when compared with their non-differentially-private counterparts.&lt;/p&gt;
&lt;p&gt;Indeed, this is often the case. In research from &lt;a href="https://arxiv.org/abs/2202.07623"&gt;Stock et al (2022)&lt;/a&gt;, models trained using differential privacy successfully defended against reconstruction attacks and protected the training data better than non-DP models. The authors found that although some membership inference attacks succeeded, they were unable to extract and reconstruct the training data based on the differentially private model responses, even when the MIA was successful.&lt;/p&gt;
&lt;p&gt;To test this, they inserted canaries into the dataset and specifically targeted these canaries. They found that a model without differential privacy memorized the canaries and they were easy to reconstruct using normal exfiltration attacks. The DP model, even one with fairly weak privacy parameters, revealed only that it had seen the canary via a successful membership inference attack, but without the canary example in hand, an attacker could not extract the canary from the model.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Three charts comparing different epsilon's and differential privacy model secret extraction. In each chart there are two research papers that are compared and a line that shows the secret disclosure. The x-axis is the secret length and the y-axis is the leakage (i.e. how many bits can be successfully extracted). The first chart shows an epsilon of 21.2 and one of the DP models from an earlier paper is above the secret disclosure line for all secrets, which proves that the secret is easy to extract. The line for this improved DP paper shows shorter secrets are easy to extract, but crosses the disclosure line around 8-bit secret length. For the second chart showing epsilon 7.7 you can see that both papers allow smaller secrets to be extracted (or up to a certain number of bits - less than 4), but not longer. For the final chart with epsilon 3.8 you can see that no secrets can be extracted from either paper." src="./images/2025/renyi_epsilons_extraction.png"&gt;
&lt;strong&gt;Comparing epsilons under secret extraction, Stock et al (2022)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The charts above show the authors' results compared with an earlier paper on differentially private deep learning. You can see that for both papers, the epsilon choice, which is a critical part of effectively applying differential privacy, has a direct effect on the ability to extract a secret. You can also see that as epsilon lowers, so does the ability to extract that secret successfully from the model. In case it isn't clear, these results are specific to their implementation and shouldn't be used as evidence that every system will behave the same way.&lt;/p&gt;
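&lt;p&gt;If you want to try the canary methodology on your own models, a rough sketch looks like the following. Note this is my simplification of the general technique, not the authors' exact setup, and &lt;code&gt;generate&lt;/code&gt; is a placeholder for whatever decoding call your model serving stack provides.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import secrets

def make_canary(prefix="my password is "):
    """Create a unique secret string to plant in the training data."""
    secret = secrets.token_hex(4)          # e.g. "9f2a41bc"
    return prefix + secret, prefix, secret

def canary_extracted(model, generate, prompt, secret, num_samples=100):
    """Return True if completions of the canary prefix ever reproduce the
    secret, i.e. the model memorized the planted canary verbatim."""
    for _ in range(num_samples):
        completion = generate(model, prompt)   # hypothetical decoding helper
        if secret in completion:
            return True
    return False
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In a real evaluation you'd plant several canaries of varying lengths and frequencies, train with and without differential privacy, and compare extraction rates, which is roughly what the charts above are measuring.&lt;/p&gt;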
&lt;p&gt;If you've used differential privacy for training before, you might be worried that it cannot be applied successfully to today's largest models and still achieve accuracy. However, a Google Research group was able to &lt;a href="https://arxiv.org/abs/2108.01624"&gt;pre-train a BERT model in 2021 achieving 60% accuracy&lt;/a&gt;, less than 10% "worse" than the non-DP counterpart. In addition, DeepMind just released their first &lt;a href="https://research.google/blog/vaultgemma-the-worlds-most-capable-differentially-private-llm/"&gt;differentially private Gemma model&lt;/a&gt;, which scored quite high on several benchmarks and is available as an open weight model.&lt;/p&gt;
&lt;p&gt;Interestingly, differential privacy can be used to successfully protect other parts of the machine learning infrastructure from potential information leakage. In research from &lt;a href="https://arxiv.org/abs/2305.15594"&gt;Duan et al. (2023)&lt;/a&gt;, the authors discovered that they could successfully generate privacy-preserving alternative prompts via a differential privacy mechanism when using a blackbox LLM. These differentially private prompts leaked less information when the prompt came from a private prompt source. This can be useful for real-world use cases, such as when you ask a Code Assistant to update confidential code.&lt;/p&gt;
&lt;p&gt;Similarly, research around MIAs shows that differential privacy is an effective protection. When investigating "label-only" MIAs, where the model only returns the label (no confidence interval), &lt;a href="https://arxiv.org/abs/2007.14321"&gt;Choquette-Choo et al. (2021)&lt;/a&gt; found that "training with differential privacy or strong l2-regularization are the only current defenses that meaningfully decrease leakage of private information, even for points that are outliers of the training distribution".&lt;/p&gt;
&lt;p&gt;Okay, I'm sold! So, how can you actually implement differential privacy in a deep learning system effectively?&lt;/p&gt;
&lt;h3 id="how-does-it-work"&gt;How does it work?&lt;/h3&gt;
&lt;p&gt;Differentially private stochastic gradient descent (DP-SGD) is the traditional and still often used approach to training models with privacy. The definition comes from Abadi et al. in 2016, and I highly recommend &lt;a href="https://www.youtube.com/watch?v=ZxDBEyjiPxI"&gt;watching this video on how it works, especially if you are a visual learner&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Essentially, DP-SGD allows you to use the same deep learning libraries you would normally use (like PyTorch) and apply gradient clipping, averaging and carefully selected noise to protect the individual examples. The process looks like this:&lt;/p&gt;
&lt;p&gt;&lt;img alt="A graphic to walk through the differentially private stochastic gradient descent method." src="./images/2025/dpsgd.png"&gt;&lt;/p&gt;
&lt;p&gt;In the graphic above from &lt;a href="https://practicaldataprivacybook.com/"&gt;my book &lt;em&gt;Practical Data Privacy&lt;/em&gt;&lt;/a&gt;, you see a training epoch that starts with a mini-batch, which is a selected sample of the training data.&lt;/p&gt;
&lt;p&gt;This mini-batch is then broken down into a per sample (i.e. assumed per user) size, where the gradient is calculated by taking the derivative of the loss function with respect to the current model weights. Think of this gradient as "how much does this sample change the model".&lt;/p&gt;
&lt;p&gt;Then, each gradient is clipped to provide some protection for large changes and aggregated back into the mini-batch (there are several methods to optimize this) with the other gradients: here by a sum and then an average.&lt;/p&gt;
&lt;p&gt;The differential privacy noise is then added to the batch, noting that when you group multiple updates together you get additional privacy guarantees. The resulting gradients (now sorted by layer) are used to update model weights and restart the process until training finishes.&lt;/p&gt;
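&lt;p&gt;Here's a compact sketch of what one DP-SGD step looks like in PyTorch, written as an explicit per-example loop so the mechanics stay visible. Treat this as a teaching toy under my own simplifications: real libraries (Opacus, tensorflow-privacy) vectorize the per-sample gradients and, crucially, do the privacy accounting for you.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import torch

def dp_sgd_step(model, loss_fn, batch_x, batch_y,
                lr=0.1, clip_norm=1.0, noise_multiplier=1.1):
    """One simplified DP-SGD step: per-example clipping, noise, average, update."""
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]

    # 1. Per-example gradients (a loop is slow, but shows the idea clearly).
    for x, y in zip(batch_x, batch_y):
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        grads = torch.autograd.grad(loss, params)

        # 2. Clip each per-example gradient to a maximum L2 norm.
        total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = (clip_norm / (total_norm + 1e-6)).clamp(max=1.0)
        for s, g in zip(summed, grads):
            s.add_(g * scale)

    # 3. Add Gaussian noise calibrated to the clipping norm, average, update.
    batch_size = len(batch_x)
    with torch.no_grad():
        for p, s in zip(params, summed):
            noise = torch.normal(0.0, noise_multiplier * clip_norm, size=p.shape)
            p.add_(-(lr / batch_size) * (s + noise))
&lt;/code&gt;&lt;/pre&gt;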
&lt;p&gt;Choosing the noise carefully is an interesting avenue of research. Getting this choice right for machine learning systems means still being able to learn effectively, while also providing the same guarantees for individual privacy. As deep learning grew more popular, so did more precise definitions of the noise required in deep learning systems.&lt;/p&gt;
&lt;p&gt;The original approach of DP-SGD used particular properties of Gaussian (or normal) distributions to effectively calculate the noise and resulting privacy guarantees. This allowed for the usage of Gaussian noise, which has useful properties in deep learning because many tasks assume Gaussian error or other Gaussian distribution properties.&lt;/p&gt;
&lt;p&gt;Building on this approach, a new definition evolved called &lt;a href="https://arxiv.org/pdf/1702.07476"&gt;Rényi differential privacy (RDP)&lt;/a&gt;.&lt;sup id="fnref:1"&gt;&lt;a class="footnote-ref" href="#fn:1"&gt;1&lt;/a&gt;&lt;/sup&gt; RDP provided a new calculation of the bounds provided by the Gaussian distribution that both: (a) simplified the correct choice of parameters for deep learning and (b) allowed for a "tighter composition" so that you can add less noise to get the same guarantees. This didn't change the underlying mechanism in DP-SGD, it just gave a new calculation of the differential privacy parameters (like you read in the definition above) when using DP with a Gaussian distribution noise.&lt;sup id="fnref:2"&gt;&lt;a class="footnote-ref" href="#fn:2"&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;You might have tried applying DP-SGD or other approaches in the past and it didn't work well; or you heard someone who knows about machine learning say "oh but it doesn't work". Why is this a common experience? It's because there are many ways to tune DP more effectively for your use case and most novice approaches will not work well.&lt;/p&gt;
&lt;p&gt;Let's investigate some tips from people who have done DP more than once. :)&lt;/p&gt;
&lt;h3 id="tips-from-the-field"&gt;Tips from the field&lt;/h3&gt;
&lt;p&gt;There are several tips when analyzing successful implementations of DP training, especially of large models, like today's deep learning models.&lt;/p&gt;
&lt;p&gt;The best advice is:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Use large batch sizes&lt;/li&gt;
&lt;li&gt;Replace or remove batch normalization/dropout layers&lt;/li&gt;
&lt;li&gt;Set weight decay higher than normal&lt;/li&gt;
&lt;li&gt;If doing input augmentations, include them in the same mini-batch&lt;/li&gt;
&lt;li&gt;Experiment with scaling batch size alongside your learning rate scheduler&lt;/li&gt;
&lt;li&gt;There are usually exploitable ways to modify clipping norm, learning rate, architecture and/or activation choices that are specific to your task, data and sensitivity combinations.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Let's break this down to more intuitively learn from the advice.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Batch Size&lt;/strong&gt;: When you apply differential privacy, you will be adding noise per batch. This means that the larger the batches the more you can exploit the principles of centralized differential privacy to get more signal versus noise from each round. &lt;a href="https://arxiv.org/abs/2108.01624"&gt;The research on pre-training BERT&lt;/a&gt; used a batch size of around 2M examples. Of course, this can only be done if you have truly internet-scale texts, but it's a useful example nevertheless to think in much larger batches.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Learning from &lt;a href="https://arxiv.org/abs/2501.18914"&gt;DeepMind's latest experiments&lt;/a&gt;, you can calculate the required compute based on your data, batch and privacy parameter choices. This lets you find optimal batch sizes. This finding is probably most relevant if you are training large models, like their differentially private LLM, but the authors think the theory will "scale down" appropriately.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;strong&gt;Batch normalization and dropout&lt;/strong&gt;: Technically, differential privacy is doing this normalization for you, so these layers don't help like they normally would. &lt;a href="https://arxiv.org/abs/2204.13650"&gt;Research on large-scale image classification&lt;/a&gt; says to instead think through ways to replace these layers with something that conveys more signal, like a normal fully connected layer. This also applies to layers like layer-normalization, weight-normalization or any hyperparameters that help with layer smoothing.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Weight Decay&lt;/strong&gt;: Because noise addition causes more variance than usual during training, it's useful to allow weights to decay more slowly. You might also want to tune this with your batch size and learning rate. Playing with batch scheduling relative to learning rate and the interplay of those two with other hyperparameters is something worth experimenting with for your particular task and architecture. If you don't have the ability to experiment first, it'd be worthwhile investigating the latest research for similar tasks to learn and test new approaches.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Data Augmentations&lt;/strong&gt;: Because you are calculating clipping and noise addition per batch, it's useful to batch similar data together to get more signal. For this reason, if you are adding example-specific augmentations (as is customary in computer vision) keep the original image and its augmentations in the same mini-batch.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Batch size and learning rate&lt;/strong&gt;: Because you can expect more variance from DP training and because this will change how your epochs and learning stages work, you'll want to use a learning rate scheduler alongside a batch size scheduler. This can start with large learning rates and then slowly get smaller. Choosing an ideal stopping point will likely also require more attention than normal. Stopping earlier can provide better privacy (i.e. smaller epsilon and avoiding memorization) and ensure you are actually learning from signal and not noise.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Customize for your use case&lt;/strong&gt;: Much of the research on optimizing DP training exploits particular shifts in activation layers, architecture, clipping norms and other hyperparameters. This shows you more than anything that taking some extra time to test a few different approaches will pay off in performance. For example, &lt;a href="https://arxiv.org/abs/2108.01624"&gt;the BERT research&lt;/a&gt; found the interplay between the ADAM optimizer and transformer-specific scale invariant layers created challenges, which they solved by changing the weight decay. In &lt;a href="https://arxiv.org/pdf/2201.12328"&gt;2022 research on large-scale computer vision&lt;/a&gt;, authors tuned the clipping norm and learning rate to achieve optimal performance.&lt;/p&gt;
&lt;p&gt;As practical applications of differential privacy grow, there will be more learnings, knowledge and practical tips for different datasets, architectures and tasks. It's always worth taking a look at your specific task, data and architecture and debating what could present challenges when using DP training. Doing this while also reviewing the literature, blogs and deep learning library documentation can help you avoid headaches and more quickly build the insight needed to use differential privacy effectively.&lt;/p&gt;
&lt;p&gt;There are also differential privacy modifications for different use cases. For example, there are modifications where only the label needs privacy. This could be the case, for example, where certain features are known by all or most parties but the actual prediction is sensitive (i.e. recommendations, ad conversions or sensitive classification with public features). &lt;a href="https://arxiv.org/abs/2102.06062"&gt;Ghazi et al. 2021&lt;/a&gt; proposed an interesting algorithm that combines Bayesian priors with randomized response for label differential privacy, achieving results close to non-private learning. Of course, it could be easily argued that this approach doesn't protect against memorization; so if you use it, it's worth testing using privacy attacks.&lt;/p&gt;
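&lt;p&gt;To give a feel for the building block involved, here's a minimal sketch of classic k-ary randomized response applied to labels. This is the textbook mechanism only; Ghazi et al.'s contribution is layering Bayesian priors on top of it to squeeze out more utility, which this toy doesn't do.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np

def randomized_response_label(true_label, num_classes, epsilon, rng):
    """k-ary randomized response: keep the true label with probability
    exp(eps) / (exp(eps) + k - 1), otherwise report another label uniformly.
    Each reported label satisfies epsilon-label-DP on its own."""
    keep_prob = np.exp(epsilon) / (np.exp(epsilon) + num_classes - 1)
    flip_prob = (1.0 - keep_prob) / (num_classes - 1)
    probs = np.full(num_classes, flip_prob)
    probs[true_label] = keep_prob
    return int(rng.choice(num_classes, p=probs))

rng = np.random.default_rng(42)
noisy = [randomized_response_label(3, 10, epsilon=2.0, rng=rng) for _ in range(5)]
print(noisy)  # mostly 3s, with occasional random labels
&lt;/code&gt;&lt;/pre&gt;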
&lt;p&gt;Similarly, if your architecture can support a statistical query learning (SQ learning) setup, you can exploit the structure of these queries to implement differential privacy mechanisms at summation points.&lt;sup id="fnref:3"&gt;&lt;a class="footnote-ref" href="#fn:3"&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;Each win you find for eking out better performance might come from some tradeoff of memorization or leakage from your training examples. It can be easy to apply differential privacy without paying close attention to the goal you have at hand. Don't forget to actually test your memorization using MIAs and extraction attacks to measure what is "good enough" for your use case.&lt;/p&gt;
&lt;h3 id="framing-the-problem-appropriately"&gt;Framing the problem appropriately&lt;/h3&gt;
&lt;p&gt;For any implementation of differential privacy (deep learning or otherwise), framing the problem is an essential and complex step. To do so, you need to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Choose what you are trying to protect for your data and use case (i.e. privacy unit)&lt;/li&gt;
&lt;li&gt;Understand the data well enough to decide on things like clipping and bounds, or use a library that will help you with this choice&lt;/li&gt;
&lt;li&gt;Evaluate preprocessing steps and determine if they change any of those choices&lt;/li&gt;
&lt;li&gt;Find a reliable and audited DP implementation, like &lt;a href="https://github.com/meta-pytorch/opacus"&gt;Opacus from PyTorch&lt;/a&gt; or &lt;a href="https://www.tensorflow.org/responsible_ai/privacy/guide"&gt;tensorflow-privacy&lt;/a&gt;, and understand how it works&lt;/li&gt;
&lt;li&gt;Train your DP model and test any approaches to increase performance&lt;/li&gt;
&lt;li&gt;Run privacy testing on resulting models to increase your confidence that you've achieved the utility and privacy you wanted&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;p&gt;&lt;em&gt;NOTE&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;What is a privacy unit? What are bounds?&lt;/p&gt;
&lt;p&gt;When using differential privacy, there are a few things that need to be decided to effectively protect individuals. One of the things to define is the privacy unit. What &lt;em&gt;exactly&lt;/em&gt; are you protecting? What is that one small change in the data you'd like to avoid revealing to give people plausible deniability and protection? Often the privacy unit is the contributions of one person, but it could also be the contributions of one household, or it could be smaller, like protecting every contribution individually (i.e. every training example separately, but not, let's say, all examples that came from one person's Flickr account).&lt;/p&gt;
&lt;p&gt;Once you've defined the unit, you need to figure out what bounds need to be set (or already exist). For example, if you choose a privacy unit of each training example separately, then the bounds for DP-SGD would be how much any one example can change the loss calculation and gradient updates. To do this, you determine a clipping threshold which essentially acts as a maximum value of what any example can contribute to the gradient updates.&lt;sup id="fnref:4"&gt;&lt;a class="footnote-ref" href="#fn:4"&gt;4&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;Note that it's generally good practice to clip extremely large gradients during training as outliers or erroneous examples can have an uneven effect on training stability.&lt;/p&gt;
&lt;p&gt;I recommend playing around with &lt;a href="https://pair.withgoogle.com/explorables/private-and-fair/"&gt;this interactive visual from Google&lt;/a&gt; to get an idea of how clipping and differential privacy affect machine learning models.&lt;/p&gt;
&lt;hr&gt;
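&lt;p&gt;To show roughly what the "train your DP model" step might look like in practice, here's a small sketch using Opacus with a toy model and dataset. The exact API can shift between Opacus versions, so treat this as a starting point and check the current documentation; the hyperparameter values here are placeholders, not recommendations.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine  # pip install opacus

# Toy data and model, just to keep the sketch self-contained.
features = torch.randn(512, 20)
labels = torch.randint(0, 2, (512,))
train_loader = DataLoader(TensorDataset(features, labels), batch_size=64)

model = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
criterion = nn.CrossEntropyLoss()

# The PrivacyEngine wraps the model, optimizer and data loader so that
# per-sample clipping, noise addition and privacy accounting happen for you.
privacy_engine = PrivacyEngine()
model, optimizer, train_loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=train_loader,
    noise_multiplier=1.1,   # more noise: stronger privacy, lower utility
    max_grad_norm=1.0,      # per-sample clipping bound
)

for epoch in range(3):
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    # Track the privacy budget spent so far for your chosen delta.
    print("epoch", epoch, "epsilon so far:", privacy_engine.get_epsilon(delta=1e-5))
&lt;/code&gt;&lt;/pre&gt;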
&lt;p&gt;Any one of these steps is hard without differential privacy expertise, but each is essential for the team to learn how to train safer models together. By practicing these skills, you and your team will become both more familiar with how privacy works in real systems, and also able to leverage that knowledge to bring differential privacy into more use cases.&lt;/p&gt;
&lt;p&gt;In the next article, you'll explore some cautionary tales of using differential privacy and evaluate open problems in using differential privacy to address deep learning/AI memorization.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Acknowledgements&lt;/em&gt;: I would like to thank &lt;a href="https://desfontain.es/serious.html"&gt;Damien Desfontaines&lt;/a&gt; for his feedback, corrections and thoughts on this article. Any mistakes, typos, inaccuracies or controversial opinions are my own.&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;Rényi DP is similar to &lt;a href="https://arxiv.org/pdf/1605.02065"&gt;concentrated differential privacy&lt;/a&gt;, if you are familiar with that approach. If not, check out &lt;a href="https://desfontain.es/blog/renyi-dp-zero-concentrated-dp.html"&gt;Desfontaines's great introduction to it&lt;/a&gt;!&amp;#160;&lt;a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;There's &lt;a href="https://www.youtube.com/watch?v=oQzaA5KG3pM&amp;amp;ab_channel=DIMACSCCICADA"&gt;a great video from Ilya Mironov&lt;/a&gt;, the author of the paper, should you want a deeper and longer introduction to Rényi DP.&amp;#160;&lt;a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:3"&gt;
&lt;p&gt;See &lt;a href="https://arxiv.org/pdf/0803.0924"&gt;Kasiviswanathan et al. 2010&lt;/a&gt;.&amp;#160;&lt;a class="footnote-backref" href="#fnref:3" title="Jump back to footnote 3 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:4"&gt;
&lt;p&gt;I recommend reading through &lt;a href="https://docs.tmlt.dev/platform/latest/analytics/tutorials/clamping-bounds.html"&gt;Tumult Analytics' documentation&lt;/a&gt; on choosing and applying bounds if you are new to this step.&amp;#160;&lt;a class="footnote-backref" href="#fnref:4" title="Jump back to footnote 4 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="ml-memorization"></category></entry><entry><title>Attacks on Machine Unlearning: How Unlearned Models Leak Information</title><link href="https://blog.kjamistan.com/attacks-on-machine-unlearning-how-unlearned-models-leak-information.html" rel="alternate"></link><published>2025-10-13T00:00:00+02:00</published><updated>2025-10-13T00:00:00+02:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2025-10-13:/attacks-on-machine-unlearning-how-unlearned-models-leak-information.html</id><summary type="html">&lt;p&gt;In the past articles, you've been exploring the field of &lt;a href="https://blog.kjamistan.com/machine-unlearning-what-is-it.html"&gt;machine unlearning&lt;/a&gt;, investigating if you can surgically remove memorized or learned data from models without retraining them from scratch or from an earlier checkpoint.&lt;/p&gt;
&lt;p&gt;Unlearning is one proposed solution to the &lt;a href="https://blog.kjamistan.com/a-deep-dive-into-memorization-in-deep-learning.html"&gt;AI/ML memorization problem explored in this multi-article series …&lt;/a&gt;&lt;/p&gt;</summary><content type="html">&lt;p&gt;In the past articles, you've been exploring the field of &lt;a href="https://blog.kjamistan.com/machine-unlearning-what-is-it.html"&gt;machine unlearning&lt;/a&gt;, investigating if you can surgically remove memorized or learned data from models without retraining them from scratch or from an earlier checkpoint.&lt;/p&gt;
&lt;p&gt;Unlearning is one proposed solution to the &lt;a href="https://blog.kjamistan.com/a-deep-dive-into-memorization-in-deep-learning.html"&gt;AI/ML memorization problem explored in this multi-article series&lt;/a&gt;.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Are you a visual learner? There's &lt;a href="https://youtu.be/CuH7BHqIiYk"&gt;a YouTube video on this article (unlearning attacks)&lt;/a&gt; on the Probably Private channel.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;In this article, you'll investigate if the &lt;a href="https://blog.kjamistan.com/machine-unlearning-how-todays-unlearning-is-done.html"&gt;current proposed unlearning methods&lt;/a&gt; are safe against &lt;a href="https://blog.kjamistan.com/defining-privacy-attacks-in-ai-and-ml.html"&gt;our original attack definitions&lt;/a&gt; as well as against any interesting new attacks that unlearning might introduce.&lt;/p&gt;
&lt;h3 id="evaluating-unlearning-models-with-mias"&gt;Evaluating Unlearning Models with MIAs&lt;/h3&gt;
&lt;p&gt;As you already learned &lt;a href="https://blog.kjamistan.com/machine-unlearning-what-is-it.html"&gt;in the unlearning definition article&lt;/a&gt;, unlearning isn't yet well defined. This means that most research uses subpar or mixed evaluation criteria to determine if something is unlearned. The lack of clear, easy-to-implement and consistent evaluation criteria means that it's almost impossible to compare the many approaches against one another in any meaningful way.&lt;/p&gt;
&lt;p&gt;If the AI/ML industry decided on a consistent and useful metric, like MIA success, and a consistent approach to MIA testing, like holding the false-positive rate at a fixed low level (say 3%), then researchers and practitioners alike could more easily evaluate use cases and determine their risk appetite. New unlearning approaches would be applied quickly and progress could be made because there would be easily comparable measurements.&lt;/p&gt;
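&lt;p&gt;As a sketch of what such a consistent measurement could look like: score every candidate point with your MIA, then report the true-positive rate while holding the false-positive rate at the agreed level. This is my own minimal illustration of the idea, assuming higher attack scores mean "more likely a member".&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np

def tpr_at_fixed_fpr(member_scores, nonmember_scores, max_fpr=0.03):
    """Evaluate an MIA by its true-positive rate while holding the
    false-positive rate at a fixed low level (here 3%)."""
    member_scores = np.asarray(member_scores)
    nonmember_scores = np.asarray(nonmember_scores)
    # Threshold that flags at most max_fpr of the non-members.
    threshold = np.quantile(nonmember_scores, 1.0 - max_fpr)
    tpr = np.mean(np.greater_equal(member_scores, threshold))
    fpr = np.mean(np.greater_equal(nonmember_scores, threshold))
    return tpr, fpr
&lt;/code&gt;&lt;/pre&gt;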
&lt;p&gt;&lt;a href="https://arxiv.org/abs/2403.01218"&gt;Hayes and colleagues from Google DeepMind (2024)&lt;/a&gt; call out trends in unlearning research where suboptimal MIAs are used to boost the perceived performance of unlearning methods. By weakening attacks and then demonstrating that the unlearning method "works", much unlearning research gives a false sense of privacy without real gains.&lt;/p&gt;
&lt;p&gt;One reason behind this performance disparity is that researchers usually only perform MIAs on the forget set points. Sometimes they also add a small random training data subsample. But Hayes and team found that targeted attacks on a wider selection of training data, particularly points that might be overexposed after unlearning, show that "state of the art" unlearning actually makes new groups of persons vulnerable.&lt;/p&gt;
&lt;p&gt;In addition, the choice of the forget set introduces problems. The forget set should ideally be a diverse representation of the training data (from common case to uncommon cases) in order to truly evaluate whether the method can work. Hayes and team found that some forget sets are cherry picked -- causing unrealistic outcomes compared to forget sets chosen via representative sampling processes.&lt;/p&gt;
&lt;p&gt;Since unlearning, like learning, will have different rates based on example difficulty and class diversity, the authors call for explicit conversations on unlearning privacy tradeoffs. This also means focusing on practical advice, like what unlearning hyperparameters to choose and how to find useful metrics for stopping criteria (i.e. when the model has unlearned enough and is now ready for use).&lt;/p&gt;
&lt;p&gt;In their own testing, they found that the LiRA attack is by far the most effective at exposing privacy risk and modeling a repeatable way to test and compare unlearning methods. In their experiments, they compared per-example LiRAs versus "population" LiRAs and found the former to be qualitatively and quantitatively better at modeling privacy risk.&lt;/p&gt;
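&lt;p&gt;For intuition on what a per-example LiRA actually computes, here's a stripped-down scoring function: fit Gaussians to the logit-scaled confidences from shadow models that did and didn't train on the example, then take the likelihood ratio for the target model's confidence. This is a simplified single-query variant of the published attack, for illustration only.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np
from scipy.stats import norm

def logit_scale(p, eps=1e-6):
    """Map confidences in (0, 1) to a roughly Gaussian scale."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    return np.log(p / (1.0 - p))

def lira_score(target_conf, shadow_in_confs, shadow_out_confs):
    """Per-example LiRA: likelihood ratio of the target model's confidence
    under the 'trained on it' vs. 'never saw it' shadow distributions."""
    t = logit_scale(target_conf)
    in_scores = logit_scale(shadow_in_confs)
    out_scores = logit_scale(shadow_out_confs)
    mu_in, std_in = in_scores.mean(), in_scores.std() + 1e-6
    mu_out, std_out = out_scores.mean(), out_scores.std() + 1e-6
    # A ratio above 1 means "looks more like a training member".
    return norm.pdf(t, mu_in, std_in) / norm.pdf(t, mu_out, std_out)
&lt;/code&gt;&lt;/pre&gt;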
&lt;p&gt;This, of course, involves significant dedication to privacy risk testing as a normal part of training and operational infrastructure, as it requires the ability to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;test a variety of sampling methods for forget and remember sets&lt;/li&gt;
&lt;li&gt;run fine-tuning for unlearning, ideally creating several unlearned models&lt;/li&gt;
&lt;li&gt;train several (smaller) models that haven't seen the forget sets&lt;/li&gt;
&lt;li&gt;perform example-by-example LiRAs comparing the unseen models with the unlearned models&lt;/li&gt;
&lt;li&gt;evaluate tradeoffs to determine which unlearned model to use&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The feasibility of doing this at a non-big-tech company is probably somewhere between very small and impossible. Doing this even at a big-tech company requires significant investment: planning, people, expertise and compute time devoted to privacy metrics, which is probably not the case for those companies today. If the field wants to create better and more consistent metrics, there need to be easier ways to opt into regular privacy testing. These processes should be streamlined into normal training and evaluation languages, frameworks and ML/AI pipelines.&lt;/p&gt;
&lt;p&gt;Aside from the additional resources required for appropriate unlearning evaluation, unlearning introduces new attacks. Let's investigate emergent attacks against unlearned models.&lt;/p&gt;
&lt;h3 id="new-unlearning-attacks"&gt;New Unlearning Attacks&lt;/h3&gt;
&lt;p&gt;Since unlearning has a before and after state, this can be exploited to reveal exactly what was unlearned. In &lt;a href="https://arxiv.org/abs/2005.02205"&gt;Chen et al. (2021)&lt;/a&gt;, the authors introduced a novel membership inference attack that reveals whether the target sample was part of the original model and was unlearned.&lt;/p&gt;
&lt;p&gt;To do so, they:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Train an original model and train an additional unlearning model (or more than one). In a real attack, the attacker would either have downloaded the previous open-weight model or saved several inputs/outputs from previous models.&lt;/li&gt;
&lt;li&gt;Process a chosen example through each model and collect information about the prediction output. Here, the model will give a confidence range across several classes or potential next steps in a sequence (like predicting the next word). Save these outputs. If possible, this can also be directly saving the values at the logit layer, like some of the LiRA tests.&lt;/li&gt;
&lt;li&gt;Train a discriminator to separate the outputs from the unlearned model from that of the original. You can train this discriminator using local models and then test the discriminator on the actual outputs from the models you are trying to mimic/attack. If using logits like LiRA, you might also be able to infer a threshold and use this to calculate how likely it is that an output came from the original or unlearned model.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The most useful targets for this attack are the points that have been unlearned, or points in close proximity or with similar attributes, since you expect a large change in those outputs based on the unlearning process.&lt;/p&gt;
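&lt;p&gt;Step 3 above is just a standard binary classifier over the two models' outputs. Here's one way it might look, using logistic regression as the discriminator; the feature construction (stacking both posteriors plus their difference) is my own simplification rather than the authors' exact recipe.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np
from sklearn.linear_model import LogisticRegression

def train_unlearning_discriminator(post_original, post_unlearned, was_unlearned):
    """post_original, post_unlearned: arrays of shape (n_probes, n_classes)
    from querying both model versions on the same probe inputs (step 2).
    was_unlearned: 0/1 labels from your local shadow setup (step 1)."""
    post_original = np.asarray(post_original)
    post_unlearned = np.asarray(post_unlearned)
    features = np.concatenate(
        [post_original, post_unlearned, post_original - post_unlearned], axis=1
    )
    clf = LogisticRegression(max_iter=1000)
    clf.fit(features, was_unlearned)
    return clf  # clf.predict_proba on new probes estimates "was this unlearned?"
&lt;/code&gt;&lt;/pre&gt;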
&lt;p&gt;To mitigate these attacks, the authors recommend information suppression of the outputs, like returning only the most probable class or next sequence without any confidence intervals. Of course, if you are releasing an open-weight model this isn't possible. They also reference more robust and holistic approaches, like applying differential privacy, which you'll explore further in the next article.&lt;/p&gt;
&lt;p&gt;This attack has since been updated and enhanced by &lt;a href="https://arxiv.org/abs/2405.20272"&gt;Bertran et al. (2024)&lt;/a&gt;. The authors use similar methods to compare the original and unlearned models and reconstruct the unlearned data. How does that work?&lt;/p&gt;
&lt;p&gt;The authors investigated when the trained and unlearned models were different by one example. They were able to essentially calculate the exact difference between the models caused by that one example. This provided them with enough information to make a rough guess as to the sample itself by approximating the input that would account for that change in the model weights (i.e. an approximation of the embedding given the change in the gradients). This is similar to model inversion attacks, where you can reveal input and class information based on model gradients and activations. This is a &lt;em&gt;gradient reconstruction attack&lt;/em&gt; (a technique with a substantial body of literature behind it).&lt;/p&gt;
&lt;p&gt;The authors found that the loss on unlearned examples acts differently than on other examples. These oddities in confidence intervals leave artifacts of the deep-learning-based unlearning methods, which reveal that something in that class or near that training embedding was unlearned.&lt;/p&gt;
&lt;p&gt;In many ways, unlearning methods create model artifacts that leave clues as to what was unlearned. Even when done at scale, this could quickly expose "missing information", especially when comparing model responses over time. Because unlearning methods don't take this into account, they leave a new security and privacy problem that should be addressed.&lt;/p&gt;
&lt;p&gt;Additionally, new attacks target how information is stored in the embeddings themselves. By investigating embeddings you can find personal information like names, screen names, addresses, and other &lt;a href="https://www.lesswrong.com/posts/8viQEp8KBg2QSW4Yc"&gt;training data contents&lt;/a&gt;. This means embedding model updates also expose who requested their data removal and expose new persons and sources in the training data.&lt;sup id="fnref:2"&gt;&lt;a class="footnote-ref" href="#fn:2"&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;So far, you know that unlearning creates changes that can be observed in the forget sets and the retain sets. Some of these changes enable new attacks, like these new reconstruction avenues. But does unlearning create any other privacy risks?&lt;/p&gt;
&lt;h3 id="the-privacy-onion-effect"&gt;The Privacy Onion Effect&lt;/h3&gt;
&lt;p&gt;Carlini et al. published a paper called &lt;a href="https://arxiv.org/abs/2206.10469"&gt;"The Privacy Onion Effect"&lt;/a&gt; in 2022 which outlined new privacy risks when unlearning targets memorized examples. The authors discovered removing memorized data exposes new, different data points that were previously sheltered by those memorized points.&lt;/p&gt;
&lt;p&gt;They define the effect as:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Removing the “layer” of outlier points that are most vulnerable to a privacy attack exposes a new layer of previously-safe points to the same attack&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;They use LiRA and measure attack success across many points in a given dataset (similar to Hayes et al.). They measure the attack "advantage" (i.e. increase in exposure) for a particular training data example. As you already learned, some data points are more prone to memorization and to attack, particularly those that might be considered rare, novel or complex.&lt;/p&gt;
&lt;p&gt;Removing these points with unlearning exposes new points which are now more rare, novel and complex after the removal of the memorized data points.&lt;/p&gt;
&lt;p&gt;Going back to margin theory, you can think of these points like support vectors, holding up the decision boundaries. When you remove one layer of these supporting points, the next layer of supporting points is exposed. This can keep going, like an onion.&lt;sup id="fnref:1"&gt;&lt;a class="footnote-ref" href="#fn:1"&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;The privacy onion effect isn't global and isn't reproducible in every model and every dataset. This once again shows you how important it is to address these risks as you develop models, so the unique privacy risks for each dataset, model architecture and task combination are better understood. In fact, the authors prove that privacy auditing is unstable, producing different privacy risks even with small dataset changes.&lt;/p&gt;
&lt;p&gt;The authors' advice is clear: if you're doing privacy audits via membership inference, you need to use your actual training dataset because changes in the dataset will significantly affect the privacy risk of individuals (both those removed and others) in the model.&lt;/p&gt;
&lt;p&gt;Because unlearning presents new and different risks and attacks, it's important to step back and review the original goal.&lt;/p&gt;
&lt;h3 id="what-is-unlearning-trying-to-achieve"&gt;What is unlearning trying to achieve?&lt;/h3&gt;
&lt;p&gt;In many ways unlearning is poorly defined and implemented because it's built on a shaky understanding of privacy risk in deep learning. From where I sit, unlearning research feels like a back-and-forth conversation between privacy lawyers and technologists where neither side is really understanding what the other is trying to say.&lt;/p&gt;
&lt;p&gt;In my opinion, it'd be helpful to evaluate: What are we trying to achieve when implementing unlearning?&lt;/p&gt;
&lt;p&gt;If you want to build privacy-respecting deep learning systems, you have to acknowledge and embrace how and why problems like memorization happen. If you take a holistic approach, you'll see that almost none of the unlearning research focuses on that part of the problem: why and how memorization occurs. Instead, it focuses on byproducts of this phenomenon, by reducing performance on a particular forget example without addressing how the information in that example will affect both model outputs and how that example relates to other data.&lt;/p&gt;
&lt;p&gt;Defining unlearning is not just an activity for lawyers, policy makers and technologists. Privacy and privacy risk is a lived human experience; and some people hold an undue amount of risk just because of who they are (i.e. outliers, underrepresented persons, people that "stick out"). Defining how to address the memorization problem is about having conversations as a society about the actual risks and not hand waving that a solution will present itself automatically.&lt;/p&gt;
&lt;p&gt;If AI systems are going to affect people's lives, work and communities, defining AI privacy and redress must take into account the impact of these systems on people's lives and the unique effectiveness of these systems at using memorization to expose some people more than others.&lt;/p&gt;
&lt;p&gt;In the next three articles, you'll investigate differential privacy and privacy auditing as a solution to the memorization problem.&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;The authors tested many possible explanations, such as training regularization noise and the presence of outliers or duplicates, and discovered this phenomenon isn't global, it's local. It doesn't act uniformly, although it does generalize: it doesn't affect only a few examples, but instead whole groups of many examples. Targeted attacks show some points are easier to attack than others. The authors also had humans inspect the images to find and then remove the 10 most similar examples, to see if this makes the target image vulnerable.&amp;#160;&lt;a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;See &lt;a href="https://arxiv.org/abs/2309.05610"&gt;Privacy Side Channels in Machine Learning Systems, Debenedetti et al., 2024&lt;/a&gt;&amp;#160;&lt;a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="ml-memorization"></category></entry><entry><title>Machine Unlearning: How today's Unlearning is done</title><link href="https://blog.kjamistan.com/machine-unlearning-how-todays-unlearning-is-done.html" rel="alternate"></link><published>2025-09-19T00:00:00+02:00</published><updated>2025-09-19T00:00:00+02:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2025-09-19:/machine-unlearning-how-todays-unlearning-is-done.html</id><summary type="html">&lt;p&gt;Building on our understanding of machine unlearning and &lt;a href="https://blog.kjamistan.com/machine-unlearning-what-is-it.html"&gt;its varied definitions&lt;/a&gt;, in this article you'll learn common approaches to implementing unlearning. To effectively use these approaches, you'll first want to define what unlearning definition and measurement fits your needs.&lt;/p&gt;
&lt;p&gt;In current unlearning research, there are three main categories of unlearning …&lt;/p&gt;</summary><content type="html">&lt;p&gt;Building on our understanding of machine unlearning and &lt;a href="https://blog.kjamistan.com/machine-unlearning-what-is-it.html"&gt;its varied definitions&lt;/a&gt;, in this article you'll learn common approaches to implementing unlearning. To effectively use these approaches, you'll first want to define what unlearning definition and measurement fits your needs.&lt;/p&gt;
&lt;p&gt;In current unlearning research, there are three main categories of unlearning implementations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;early approaches that require specialized feature and model setups&lt;/li&gt;
&lt;li&gt;today's fine-tuning and training approaches for large deep learning/AI models&lt;/li&gt;
&lt;li&gt;a sampling of novel and interesting approaches&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;Are you a visual learner? There's &lt;a href="https://youtu.be/C-k4Zf39nNg"&gt;a YouTube video on this article (unlearning methods)&lt;/a&gt; on the Probably Private channel.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This is all part of a series on &lt;a href="https://blog.kjamistan.com/a-deep-dive-into-memorization-in-deep-learning.html"&gt;AI and machine learning memorization&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id="early-approaches"&gt;Early Approaches&lt;/h3&gt;
&lt;p&gt;Unlearning has been a topic of study for 10 years. Early approaches didn't target deep learning exclusively, and therefore there are some interesting and effective ways to manage the problem in other types of machine learning.&lt;/p&gt;
&lt;p&gt;One of the first papers on Machine Unlearning was &lt;a href="https://www.ieee-security.org/TC/SP2015/papers-archived/6949a463.pdf"&gt;Cao et al (2015)&lt;/a&gt;. The research focused on a variety of unlearning use cases like privacy concerns or removing erroneous or outdated data from the model.&lt;/p&gt;
&lt;p&gt;Their approach combined data subsets to generate aggregated machine learning features, which you can think of as input variables. Then those variables were used in combination with algorithms to predict outputs. Because of this structure, you can remove a single contribution to a feature or several summed contributions without needing to retrain the model. This approach works for statistical query learning algorithms, which allow you to perform queries of the underlying data and use those as direct algorithm inputs.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A model is being trained from several batches of data that are summed features. The data is broken up into subsets so that you can remove the data point from one of the summed features." src="./images/2025/summed_features_unlearning.png"&gt;&lt;/p&gt;
&lt;p&gt;You can adapt this approach for deep learning by training separate models which act as an ensemble. &lt;a href="https://arxiv.org/abs/1912.03817"&gt;Bourtoule et al. (2020)&lt;/a&gt; developed an approach called Sharded, Isolated, Sliced, and Aggregated or SISA learning. As the name suggests, this shards the data and trains a model from each shard. Those models form an ensemble that predicts the outputs as a committee.&lt;/p&gt;
&lt;p&gt;You can then remove or retrain just one model rather than the entire learning system. The authors also used checkpoints so you could potentially "roll back" one of the models to an earlier checkpoint before that exact data point was seen. So if you want to unlearn or forget a particular data subset or even one data point, you can amputate that shard from the model and retrain just the smaller model starting at the previous checkpoint.&lt;/p&gt;
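&lt;p&gt;To make the shard-and-retrain idea concrete, here is a minimal sketch of SISA-style training and unlearning (my own illustration, not the authors' implementation). The shard count, model choice and committee vote are placeholder assumptions, and the real SISA also slices and checkpoints within each shard, which this sketch skips.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np
from sklearn.linear_model import SGDClassifier

def train_shard(X, y):
    # Each shard gets its own small model; any estimator would do here.
    return SGDClassifier(loss="log_loss", random_state=0).fit(X, y)

def sisa_train(X, y, n_shards=4):
    # Shard the data and keep the assignment so we know who saw what.
    shard_ids = np.arange(len(X)) % n_shards
    models = [train_shard(X[shard_ids == s], y[shard_ids == s]) for s in range(n_shards)]
    return models, shard_ids

def sisa_predict(models, X):
    # The shards act as a committee: average their class probabilities.
    # (Sketch assumption: every shard has seen every class.)
    probs = np.mean([m.predict_proba(X) for m in models], axis=0)
    return probs.argmax(axis=1)

def sisa_unlearn(models, shard_ids, X, y, forget_idx):
    # Only the shard that saw the forgotten point is retrained, without it.
    s = shard_ids[forget_idx]
    keep = np.logical_and(shard_ids == s, np.arange(len(X)) != forget_idx)
    models[s] = train_shard(X[keep], y[keep])
    return models
&lt;/code&gt;&lt;/pre&gt;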
&lt;p&gt;This inspired another approach called &lt;a href="https://arxiv.org/abs/2106.04378"&gt;adaptive machine unlearning (Gupta et al., 2021)&lt;/a&gt;. The authors were working on adaptive deletion requests (i.e. requests that might target several data examples and resulting model behaviors that are not independent). They wanted to retain essential information in the model but still allow for user-driven deletion requests. To do so, they use an approach inspired by differential privacy and information theory, which inserts randomness into the learning and unlearning process to better guarantee that the data points do not leak as much information.&lt;/p&gt;
&lt;p&gt;However, these approaches don't map to today's large deep learning architectures, where data is first transformed directly into embeddings instead of features, and those embeddings are usually also sequence-related (like words in order, or pixels in order). These approaches are useful for a subset of machine learning and deep learning, but not all of what falls under today's "AI".&lt;sup id="fnref:1"&gt;&lt;a class="footnote-ref" href="#fn:1"&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;Luckily there is a series of approaches focused specifically on deep unlearning for these large embedding-based models.&lt;/p&gt;
&lt;h3 id="deep-unlearning"&gt;Deep Unlearning&lt;/h3&gt;
&lt;p&gt;For deep unlearning, most approaches exploit the same learning methods to instead unlearn. Recall that &lt;a href="https://blog.kjamistan.com/gaming-evaluation-the-evolution-of-deep-learning-training-and-evaluation.html"&gt;training calculates values across different nodes and layers in the deep learning network&lt;/a&gt; and calculates the error (i.e. how much the model was "off target"). This error, represented in gradients, is then used to update the model parameters.&lt;/p&gt;
&lt;p&gt;What is a gradient? You've probably calculated the slope of a line which connects two dots. This helps you measure the change from one dot to the other. Imagine that if those dots are moving, you can calculate both the change between them and the rate of change.&lt;/p&gt;
&lt;p&gt;A deep learning model operates as many functions. To adjust those functions to be more accurate, you compare the data to the current model functions and see how "wrong" the functions are. Looking at the size and rate of that error produces gradients. These gradients are the derivatives of the loss function: they tell you how the error of the current model functions changes with respect to each parameter, measured against the training data in that batch.&lt;/p&gt;
&lt;p&gt;A gradient points in the direction that increases the function, which here would increase the loss function (error). This is why in machine learning, you descend the gradient (gradient descent) to minimize error. Because it is too computation-intensive to calculate the gradient exactly for every example and every parameter, this calculation is approximated or averaged/batched.&lt;sup id="fnref:2"&gt;&lt;a class="footnote-ref" href="#fn:2"&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;&lt;img alt="There is a graph plotting loss. You can see a squiggly line with numerous local minimi. The gradient is showing a way up the line to increase loss. This is why you &amp;quot;descend&amp;quot; the gradient -- or go in the opposite direction." src="./images/2025/gradient.png"&gt;&lt;/p&gt;
&lt;p&gt;For each training batch, the error rate is calculated across the batch and the average change is used to send approximate updates to the model parameters. The parameter updates happen via a process called backpropagation.&lt;/p&gt;
&lt;p&gt;Gradient descent via backpropagation first happens in big jumps, where gradients are large, and then updates slow down as the network gets closer to the approximate representation of the task. Since you usually start from a "random guess", early training rounds happen in large steps. The middle training rounds usually convey improvements for majority classes and as you move to later training rounds, the nuances of the outliers, the less frequent classes or smaller representations will change. Sometimes this can result in over-optimization/overfitting on the training set, but that's usually for smaller networks than today's largest AI models.&lt;/p&gt;
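&lt;p&gt;As a tiny, self-contained illustration of the mechanics above (not any unlearning method), here is batched gradient descent on a made-up linear regression problem, where each update moves the parameters against the averaged gradient of the batch:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 3))              # one batch of 32 examples, 3 features
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=32)

w = np.zeros(3)                           # start from a "random guess"
lr = 0.1
for step in range(100):
    error = X @ w - y                     # how "off target" the model is on this batch
    loss = np.mean(error ** 2)
    grad = 2 * X.T @ error / len(X)       # average gradient over the batch
    w = w - lr * grad                     # descend: move against the gradient
&lt;/code&gt;&lt;/pre&gt;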
&lt;p&gt;Unlearning different parts of these weights, especially calculating how to unlearn specific examples, is difficult to disentangle from the other learned representations. Because network layers and their weights and biases are not easy to inspect or understand, the unlearning task is also opaque.&lt;/p&gt;
&lt;p&gt;Many of the deep unlearning methods essentially reverse this process (a minimal code sketch follows the list), by:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;ascending the gradient for a particular example or set of examples&lt;/li&gt;
&lt;li&gt;maximizing loss on an example while minimizing loss for several other examples you don't want to forget&lt;/li&gt;
&lt;li&gt;pushing the forget examples toward an alternative label/output (i.e. calculating loss against a new generic answer/label and moving parameters towards it)&lt;/li&gt;
&lt;li&gt;some mixture of the above combined with classic deep learning methods (like more fine-tuning, freezing weights, bounding loss contributions, etc.)&lt;/li&gt;
&lt;/ul&gt;
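&lt;p&gt;Here is a minimal PyTorch-style sketch of the first two bullets: ascend the gradient on a forget batch while descending on a retain batch. The model, optimizer, batches and weighting factor are placeholders, and as the footnotes note, practical implementations usually bound or clip the forget loss:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import torch
import torch.nn.functional as F

def unlearning_step(model, optimizer, forget_batch, retain_batch, forget_weight=1.0):
    """One fine-tuning step that pushes loss UP on forget examples
    and DOWN on retain examples. All names here are placeholders."""
    xf, yf = forget_batch
    xr, yr = retain_batch
    optimizer.zero_grad()
    forget_loss = F.cross_entropy(model(xf), yf)
    retain_loss = F.cross_entropy(model(xr), yr)
    # Ascend the gradient on the forget set, descend on the retain set.
    loss = retain_loss - forget_weight * forget_loss
    loss.backward()
    optimizer.step()
    return forget_loss.item(), retain_loss.item()
&lt;/code&gt;&lt;/pre&gt;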
&lt;p&gt;Looking at the &lt;a href="https://unlearning-challenge.github.io/"&gt;Google/NeurIPS Unlearning Challenge&lt;/a&gt;, more than half of the top 10 models did some combination of these approaches, which I will call sophisticated fine-tuning for unlearning.&lt;/p&gt;
&lt;p&gt;There's another batch of deep learning methods which compare the fully retrained model with the unlearned model. This can use methods like divergence or distance measurements between the networks or mix those measurements alongside fine-tuning methods.&lt;/p&gt;
&lt;p&gt;But since networks can exist in divergent states and have similar outputs, this can be prone to error if done without attention to the actual output impact. In successful approaches, this comparison can take the form of a student/teacher setup, where the student is told to "stay near the teacher" but unlearn a particular set of examples.&lt;sup id="fnref:3"&gt;&lt;a class="footnote-ref" href="#fn:3"&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
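&lt;p&gt;And a rough sketch of the student/teacher idea, in the spirit of (but not identical to) the Kurmanji et al. setup described in the footnote: the student is pulled toward the teacher's output distribution on retain data and pushed away from it on forget data. The KL terms and the alpha weight are illustrative assumptions:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import torch
import torch.nn.functional as F

def student_teacher_step(student, teacher, optimizer, retain_batch, forget_batch, alpha=1.0):
    xr, _ = retain_batch
    xf, _ = forget_batch
    optimizer.zero_grad()
    with torch.no_grad():
        teacher_retain = F.log_softmax(teacher(xr), dim=-1)
        teacher_forget = F.log_softmax(teacher(xf), dim=-1)
    student_retain = F.log_softmax(student(xr), dim=-1)
    student_forget = F.log_softmax(student(xf), dim=-1)
    # "Stay close" to the teacher on the retain data...
    stay = F.kl_div(student_retain, teacher_retain, log_target=True, reduction="batchmean")
    # ...and "move away" from the teacher on the forget data.
    move_away = F.kl_div(student_forget, teacher_forget, log_target=True, reduction="batchmean")
    loss = stay - alpha * move_away
    loss.backward()
    optimizer.step()
&lt;/code&gt;&lt;/pre&gt;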
&lt;p&gt;In general, these approaches work well with some unlearning definitions. The model ends up scoring lower on the examples they are unlearning and sometimes can no longer be attacked to reveal what they know about those examples.&lt;/p&gt;
&lt;p&gt;Unfortunately, these methods are prone to other problems:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;memorization/overfitting on other classes and their examples, especially if the population is small&lt;/li&gt;
&lt;li&gt;forgetting important task information, sometimes called knowledge entanglement&lt;/li&gt;
&lt;li&gt;performing worse across other metrics and classes (i.e. loss in model utility)&lt;/li&gt;
&lt;li&gt;unstable and unpredictable unlearning&lt;sup id="fnref:4"&gt;&lt;a class="footnote-ref" href="#fn:4"&gt;4&lt;/a&gt;&lt;/sup&gt;&lt;/li&gt;
&lt;li&gt;scaling issues (i.e. trying to unlearn 1% of the training data is much easier than unlearning 10%)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You'll come back to several of these issues in the next article, where you'll learn about attacks on deep unlearning.&lt;/p&gt;
&lt;p&gt;Because there aren't standardized metrics to compare these approaches, and since forgetting is heavily linked to data and task properties, it's difficult to determine the most relevant approaches for your particular combination of data, task, model architecture and unlearning result. If you're a researcher working in the space, focusing on practical comparisons would be a useful contribution -- particularly across large parameter models and complex tasks (i.e. information-intensive learning).&lt;/p&gt;
&lt;p&gt;In reading more than 50 academic unlearning papers, I came across several novel ideas and approaches. I'll summarize a few exceptional ones in the final section. My goal isn't to instruct you to use these approaches, but I'm hoping to inspire creative interpretation and thinking when viewing the unlearning task.&lt;/p&gt;
&lt;h3 id="novel-approaches-to-unlearning"&gt;Novel approaches to unlearning&lt;/h3&gt;
&lt;p&gt;One interesting approach is to first figure out how easy or hard it would be to unlearn a point and still generalize well. Think of it as a litmus test for the ability to unlearn a particular example or subset.&lt;/p&gt;
&lt;p&gt;In &lt;a href="https://arxiv.org/abs/2103.03279"&gt;"Remember What You Want to Forget: Algorithms for Machine Unlearning"&lt;/a&gt; the authors define machine unlearning by borrowing concepts from differential privacy. In their definition, an attacker who can actively choose the forget set should not be able to tell the difference between querying the unlearned model and a model that has never seen the forget set. They found lower bounds on unlearning, where a model cannot unlearn anymore without removing most of the utility on the retain dataset.&lt;/p&gt;
&lt;p&gt;Their unlearning method continuously calculates model updates to balance between the forget data points and relevant retain data points. They use sampled noise to provide stronger guarantees so the attacker might not be able to tell the difference between the models. This is inspired by differential privacy.&lt;/p&gt;
&lt;p&gt;A different approach from &lt;a href="https://arxiv.org/abs/1911.04933"&gt;Golatkar et al. (2020)&lt;/a&gt; looks at the problem via information theory. The authors calculate what information exists in the forget sample while holding the information in the retain sample high. Since computing the &lt;a href="https://en.wikipedia.org/wiki/Fisher_information"&gt;Fisher information matrix&lt;/a&gt; is too difficult within a neural network, they approximate the matrix's diagonal as part of their scrubbing procedure.&lt;/p&gt;
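&lt;p&gt;For intuition only (this is not Golatkar et al.'s scrubbing code), here is a sketch of the common diagonal approximation, built from averaged squared gradients. Strictly speaking this is the "empirical Fisher", since it uses the observed labels rather than labels sampled from the model:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import torch
import torch.nn.functional as F

def diagonal_fisher(model, data_loader, n_batches=10):
    """Approximate the diagonal of the Fisher information matrix by
    averaging squared gradients of the loss over a few batches."""
    fisher = {name: torch.zeros_like(p) for name, p in model.named_parameters()}
    model.eval()
    for i, (x, y) in enumerate(data_loader):
        if i == n_batches:
            break
        model.zero_grad()
        loss = F.cross_entropy(model(x), y)
        loss.backward()
        for name, p in model.named_parameters():
            if p.grad is not None:
                # Parameters with consistently large squared gradients
                # carry more information about these examples.
                fisher[name] += p.grad.detach() ** 2
    return {name: f / n_batches for name, f in fisher.items()}
&lt;/code&gt;&lt;/pre&gt;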
&lt;p&gt;Others have investigated building unlearning directly into deep learning architectures. &lt;a href="https://aclanthology.org/2023.emnlp-main.738.pdf"&gt;Chen et al. (2023)&lt;/a&gt; imagine an architecture where unlearning layers are trained as deletion requests come in and combined into model architectures. These layers borrow from the teacher-student ideas in Kurmanji et al.'s work.&lt;sup id="fnref2:2"&gt;&lt;a class="footnote-ref" href="#fn:2"&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;One final interesting approach looks at unlearning an entire class or group by investigating the mathematics of the learned model. &lt;a href="https://arxiv.org/abs/2404.13588"&gt;Chen et al. (2024)&lt;/a&gt; find the subspace of each class at each layer, identifying the kernel and null spaces. They then remove the subspaces for the forget class and merge the other subspaces back together. This mimics some of the interesting approaches in &lt;a href="https://wandb.ai/sauravmaheshkar/Intrinsic-Dimensions/reports/What-Are-Intrinsic-Dimensions-The-Secret-Behind-LoRA--Vmlldzo2MDcxMDc5"&gt;LoRA fine-tuning&lt;/a&gt;. After the subspaces are merged, the forget examples are fine-tuned with pseudo labeling (i.e. generic labels).&lt;/p&gt;
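&lt;p&gt;To make the subspace idea a bit more tangible, here is a loose numpy sketch (an assumption-heavy illustration, not Chen et al.'s algorithm): take the activations the forget class feeds into a layer, find their principal directions, and project the layer's weights so those directions no longer contribute:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np

def remove_class_subspace(W, class_activations, rank):
    """Project a layer's weight matrix so it ignores the input directions
    most used by the forget class.
    W: (out_dim, in_dim) weight matrix.
    class_activations: (n_examples, in_dim) layer inputs for the forget class.
    rank: how many principal directions of the class subspace to remove."""
    # Principal directions (right singular vectors) of the forget class's activations.
    _, _, vt = np.linalg.svd(class_activations, full_matrices=False)
    V = vt[:rank].T                             # (in_dim, rank) basis of the class subspace
    projector = np.eye(W.shape[1]) - V @ V.T    # projector onto the orthogonal complement
    return W @ projector
&lt;/code&gt;&lt;/pre&gt;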
&lt;p&gt;Many of these approaches came from research and not industry. Will these work in a production-level setup? Are they scalable and efficient in large-scale contexts (i.e. with billion or trillion-parameter models)? Remember: the unlearning promise is that these methods will be cheaper and easier than retraining the models regularly.&lt;/p&gt;
&lt;h3 id="scaling-of-deletion-and-removal-requests"&gt;Scaling of Deletion and Removal Requests&lt;/h3&gt;
&lt;p&gt;Most of this research approaches unlearning a small portion, sometimes a randomly chosen small subset of previously learned training data. Is this the problem definition at hand?&lt;/p&gt;
&lt;p&gt;Looking at &lt;a href="https://iapp.org/news/a/data-privacy-requests-metrics-lessons-for-your-privacy-program/"&gt;reported industry statistics on California (CCPA) deletion requests&lt;/a&gt;, there is a heavy skew in where consumers want their data removed. This could change over time, but presumably the unlearning scale should match the deletion request scale. Whether a company receives 10 deletion requests a week (or fewer) or thousands changes the available solutions. The same applies to data expiration and retention periods.&lt;/p&gt;
&lt;p&gt;Since large tech companies who release large models often release new models every 6 months, this is another relevant aspect to consider. If retraining a new model is already happening, why not retrain without the forget data? How many models are you supporting and serving at once, and under what privacy and consent conditions?&lt;/p&gt;
&lt;p&gt;Given the solutions proposed in this article by the leading unlearning research, it makes sense to use unlearning if you:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;have a small number of deletion requests or a small amount of expiring data&lt;/li&gt;
&lt;li&gt;have a deep learning model that you trained or fine-tuned in-house on personal data&lt;/li&gt;
&lt;li&gt;don't plan on retraining that model in the next 3-6 months&lt;/li&gt;
&lt;li&gt;have a team that can take research papers and implement them in a fine-tuning pipeline&lt;/li&gt;
&lt;li&gt;can implement the unlearning approach and test that it achieves whatever metric you define as "good enough"&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Since this applies to a very small group of companies, what is a reasonable approach for everyone else?&lt;/p&gt;
&lt;p&gt;In general, it would be useful for machine learning companies to focus on:&lt;/p&gt;
&lt;p&gt;a) truly public data (i.e. open access and not under copyright)
b) enthusiastic consent and opt-in&lt;/p&gt;
&lt;p&gt;Enthusiastic consent can lead to fewer deletion requests and create more collaborative and cooperative models, which would build trust in these systems. In addition, it could increase competition and people's ability to choose which organizations they would like to support with their data.&lt;/p&gt;
&lt;p&gt;Collaborative data creates a more diverse and participatory set of models, which also provides better understanding of what users and humans actually want from AI systems.&lt;/p&gt;
&lt;p&gt;In the next article, you'll look at whether today's unlearning approaches meet the definitions and tests laid out in the initial attacks article, as well as if they open up any new attack methods. In subsequent articles, you'll look at other approaches to solving the memorization problem, like using differential privacy during training.&lt;/p&gt;
&lt;p&gt;As always, I'm very open to feedback (positive, neutral or critical), questions (there are no stupid questions!) and creative re-use of this content. If you have any of those, please share it with me! This helps keep me inspired and writing. :)&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Acknowledgements&lt;/em&gt;: I would like to thank &lt;a href="https://desfontain.es/serious.html"&gt;Damien Desfontaines&lt;/a&gt; for his feedback, corrections and thoughts on this article. Any mistakes, typos, inaccuracies or controversial opinions are my own.&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;Although there are probably creative ways to think through how some of these could be applied to noise or sharding with a &lt;a href="https://huggingface.co/blog/moe"&gt;mixture of experts model&lt;/a&gt;.&amp;#160;&lt;a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;For a useful short video, I recommend watching &lt;a href="https://www.youtube.com/watch?v=qg4PchTECck&amp;amp;ab_channel=VisuallyExplained"&gt;this one from Visually Explained on YouTube&lt;/a&gt;.&amp;#160;&lt;a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;a class="footnote-backref" href="#fnref2:2" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:3"&gt;
&lt;p&gt;&lt;a href="https://arxiv.org/abs/2302.09880"&gt;Kurmanji et al. (2023)&lt;/a&gt; created several student models that are initialized with the original model weights and then diverge from this "teacher" model. The training procedure encourages the students to "stay close" to the teacher and retain data subsets while also "moving away" from the teacher on the forget data subsets.&amp;#160;&lt;a class="footnote-backref" href="#fnref:3" title="Jump back to footnote 3 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:4"&gt;
&lt;p&gt;&lt;a href="https://dl.acm.org/doi/pdf/10.1145/3319535.3363226"&gt;Unlearning in anomaly detection work&lt;/a&gt; found that unlearning creates unbounded and exponential loss. This happens because the prediction is already close to correct, then maximizing loss creates very large gradients which erase information in the network. Ideally you want to keep that information by appropriately bounding large losses.&amp;#160;&lt;a class="footnote-backref" href="#fnref:4" title="Jump back to footnote 4 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="ml-memorization"></category></entry><entry><title>Machine unlearning: what is it?</title><link href="https://blog.kjamistan.com/machine-unlearning-what-is-it.html" rel="alternate"></link><published>2025-08-13T00:00:00+02:00</published><updated>2025-08-13T00:00:00+02:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2025-08-13:/machine-unlearning-what-is-it.html</id><summary type="html">&lt;p&gt;Machine unlearning sounds pretty cool. It is the idea that you can remove information from a trained model at will. If this was possible, you'd be able to edit out things you don't want the model to know, from criminal behavior, racialized slurs to private information. It would solve many …&lt;/p&gt;</summary><content type="html">&lt;p&gt;Machine unlearning sounds pretty cool. It is the idea that you can remove information from a trained model at will. If this was possible, you'd be able to edit out things you don't want the model to know, from criminal behavior, racialized slurs to private information. It would solve many deep learning/AI problems at once.&lt;/p&gt;
&lt;p&gt;As unlearning is an active area of research, I've done my best to consolidate the field into three articles related to the &lt;a href="https://blog.kjamistan.com/a-deep-dive-into-memorization-in-deep-learning.html"&gt;problem of memorization&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In this article, you're going to explore unlearning definitions and clarify what unlearning is trying to achieve. In the next article, you'll study approaches to unlearning and evaluate their effectiveness. In the final unlearning article, you'll learn about successful attacks on models that have undergone unlearning. To note, all of this is part of &lt;a href="https://blog.kjamistan.com/a-deep-dive-into-memorization-in-deep-learning.html"&gt;a larger series exploring problems and solutions in machine learning memorization&lt;/a&gt;.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Are you a visual learner? There's &lt;a href="https://www.youtube.com/watch?v=opRq6kNua1c&amp;amp;ab_channel=ProbablyPrivate"&gt;a YouTube video on machine forgetting&lt;/a&gt; and on &lt;a href="https://youtu.be/0_ciCzHaM4o"&gt;unlearning definitions&lt;/a&gt; on the Probably Private channel.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;To start, what do we mean when we say unlearning? In order to build and evaluate unlearning solutions, you need to define it... so... what is unlearning?&lt;/p&gt;
&lt;h3 id="what-is-unlearning"&gt;What is unlearning?&lt;/h3&gt;
&lt;p&gt;What does it mean to unlearn something? In human speak, you might think about unlearning as simply forgetting. The interesting thing is that forgetting has been an area of deep learning study for some decades.&lt;/p&gt;
&lt;p&gt;Usually in machine learning you are attempting to not forget! There is a phenomenon in earlier deep learning called &lt;a href="https://en.wikipedia.org/wiki/Catastrophic_interference"&gt;&lt;em&gt;catastrophic forgetting&lt;/em&gt;&lt;/a&gt;, where you end up with a model that has forgotten important parts of what you wanted it to learn. This happens if a model is trained on significantly different tasks and then in later training the earlier tasks are forgotten.&lt;/p&gt;
&lt;p&gt;You can measure forgetting by looking at the error (i.e. loss) related to a particular example. This measure is also transferable across model architectures, and creates a generalized way of understanding if the model has learned a particular piece of information.&lt;/p&gt;
&lt;h3 id="defining-forgetting"&gt;Defining forgetting&lt;/h3&gt;
&lt;p&gt;&lt;a href="https://arxiv.org/abs/1812.05159"&gt;Toneva et al. studied this in 2019&lt;/a&gt; when investigating forgetting events. They defined these events as when an example went from properly classified to then misclassified at a later point in training.&lt;/p&gt;
&lt;p&gt;Their motivation for studying this phenomenon was to find the smallest representative dataset: a minimal amount of data that was still enough to properly learn a task or concept. They were also curious whether these forgetting events could teach them how to unlearn anomalies or mislabeled examples/concepts.&lt;/p&gt;
&lt;p&gt;In their research, they classified learning and forgetting by defining binary outcomes. Something is learned when it goes from being misclassified to being classified correctly, and forgetting is the opposite (was classified correctly, now misclassified). Their logic could also be applied to some threshold of accuracy increase, or some threshold of ratios between true positives and false negatives, etc. What matters is that there is a clear, measurable definition of what forgetting is, and that this measurement accurately captures what you would consider adequate evidence of such an event.&lt;sup id="fnref:1"&gt;&lt;a class="footnote-ref" href="#fn:1"&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;strong&gt;NOTE&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://arxiv.org/abs/1812.05159"&gt;Toneva et al.&lt;/a&gt; measures a forgetting event in several steps:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Pick a subset of examples you want to study. In the paper, these are from the training dataset, but you could likely rework this metric to sample from either training or any test dataset.&lt;/li&gt;
&lt;li&gt;As training progresses, track the classification or prediction result of these data points. Save these measurements in variables.&lt;/li&gt;
&lt;li&gt;If there is a decrease in accurate prediction or certainty for a particular example, store the time point and the shift as a forgetting moment.&lt;/li&gt;
&lt;li&gt;Continue until training is complete. At the end, analyze these events to find properties of forgetting.&lt;/li&gt;
&lt;/ol&gt;
&lt;hr&gt;
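<p>&lt;p&gt;Those steps are straightforward to prototype. Here is a small sketch of a tracker you could call on a fixed monitoring set during training; the names are placeholders, and the original paper tracks events per presentation of an example rather than per epoch:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import torch

class ForgettingTracker:
    """Tracks Toneva-style forgetting events for a fixed monitoring set:
    an event is recorded when an example flips from correct to incorrect."""
    def __init__(self, n_examples):
        self.prev_correct = [False] * n_examples
        self.events = [[] for _ in range(n_examples)]

    @torch.no_grad()
    def update(self, model, x_monitor, y_monitor, step):
        preds = model(x_monitor).argmax(dim=-1)
        correct = (preds == y_monitor).tolist()
        for i, (was, now) in enumerate(zip(self.prev_correct, correct)):
            if was and not now:
                self.events[i].append(step)   # forgetting moment for example i
        self.prev_correct = correct
&lt;/code&gt;&lt;/pre&gt;</p>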
&lt;p&gt;Interestingly, they also found what they called "unforgettable examples". For them, these didn't (directly) relate to memorization, but instead were test examples that the model always classified correctly once they were learned. In studying these unforgettable examples, they found that they were extremely generic and average. In comparison, the examples that were easiest to forget were atypical or even mislabeled.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Several rows of images, where the labels are written on the top. The most unforgettable example for the class is shown alongside the most forgettable examples. The unforgettable examples look much clearer and often have a full view of the object. The most forgettable examples show sometimes only half of the object or taken at an odd angle." src="./images/2025/unforgettable_examples_common.png"&gt;
&lt;em&gt;Forgettable and Unforgettable examples in a simple computer vision model&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Then they randomly selected training examples to remove and noticed that performance dropped very quickly. They were attempting to reduce the dataset size to pare down the amount of these "unforgettable" or common examples needed to still learn the concept.&lt;sup id="fnref:2"&gt;&lt;a class="footnote-ref" href="#fn:2"&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;However, this approach did not work with the forgettable instances. For the ones that were valid (i.e. not mislabeled), these were key in actual learning and could not be removed without significant drops in accuracy or concept learning.&lt;sup id="fnref:3"&gt;&lt;a class="footnote-ref" href="#fn:3"&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;Another interesting conclusion from the Toneva et al. paper is that the underlying data distribution complexity greatly affects the fraction of examples you can remove and still learn. They note that "for a given architecture, the higher the intrinsic dataset dimension, the larger the number of support vectors, and the fewer the number of unforgettable examples". Translated, given many of today's complex deep learning tasks, you need to have more data to learn, and that data will inform stored decision boundaries. If there are many data points supporting these boundaries, you can remove some of them and the model will still be performant. As you learned in &lt;a href="https://blog.kjamistan.com/how-memorization-happens-overparametrized-models.html"&gt;previous articles&lt;/a&gt;, this also means those examples hold information on how to define those boundaries. This is also why in sparse areas, these points are often memorized.&lt;/p&gt;
&lt;p&gt;This again proves that all datasets are not created equal! As you learned &lt;a href="https://blog.kjamistan.com/machine-learning-dataset-distributions-history-and-biases.html"&gt;in the datasets article&lt;/a&gt; some datasets are massively skewed, making certain parts of the data both more valuable and more vulnerable than the rest of the data. When looking at the qualities of the data itself, it will be more difficult to unlearn complex and outlier "unforgettable" examples compared to something that is hidden in the crowd (the inverse of the sentence above, where the clustering of information from common examples provides cover -- allowing you to remove many of them and still hold information).&lt;/p&gt;
&lt;p&gt;This has the benefit that things that aren't memorized (i.e. not "common and repeated" or "novel and class-defining") can be "unlearned" (in this definition) more easily, which is useful knowledge to build unlearning methods.&lt;/p&gt;
&lt;p&gt;These conclusions show us:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;harder-to-learn problems require more data or result in more memorization&lt;/li&gt;
&lt;li&gt;some data points are more informative, making them more valuable to memorize&lt;/li&gt;
&lt;li&gt;forgetting is as much a property of the task and data as it is a property of the algorithms and network&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;If you &lt;a href="https://blog.kjamistan.com/a-deep-dive-into-memorization-in-deep-learning.html"&gt;already read the previous articles in the series&lt;/a&gt; you already learned some of this, but it bears repeating.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;So if not all data can be forgotten the same way, and if some data must not be forgotten in order to learn properly, how can you figure out what you can safely unlearn?   &lt;/p&gt;
&lt;h3 id="what-can-even-be-unlearned-data-distributions-and-unlearning"&gt;What can even be unlearned? Data distributions and unlearning&lt;/h3&gt;
&lt;p&gt;If some data is inherently unforgettable, could this eventually reveal which data can be unlearned and which not? What if you need to unlearn something that is difficult or "impossible" to forget?&lt;/p&gt;
&lt;p&gt;&lt;a href="https://arxiv.org/abs/2205.10770"&gt;Tirumala et al (2022)&lt;/a&gt; wanted to investigate if there was a way to prevent memorization in LLMs but still fit the model with appropriate accuracy. They discovered a "forgetting baseline" which establishes a lower bounds on memorized sequences. They found that this baseline is correlated with model scale, meaning it is harder for larger models to forget things that they have learned (you already learned this in the &lt;a href="https://blog.kjamistan.com/how-memorization-happens-overparametrized-models.html"&gt;overparameterization article&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;&lt;img alt="Two graphs side-by-side. The left shows model performance over training epochs. You can see performance increase over time and then hold at a fairly high accuracy. In this same graph there is a yellow curve that starts as the model is increasing accuracy and then drops lower and then holds on a straight line. In the second graph you see a line going almost constantly up." src="./images/2025/forgetting_baseline.png"&gt;
&lt;em&gt;Forgetting baseline increases with model size&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;They uncovered that there appears to be a minimum, or baseline, related to model size. As model size increases, the number of examples that the model cannot forget increases. The first graph shows when a small batch was first learned (where the yellow curve starts). Even though this batch is no longer trained on, the retention of that information continues many iterations after it is initially learned. The second graph shows this property in relation to network size (i.e. number of parameters), showing a steady increase in the number of memorized sequences as model parameters increase.&lt;/p&gt;
&lt;p&gt;To add to the complexity, &lt;a href="https://arxiv.org/abs/2207.00099"&gt;Jagielski et al revealed in 2023&lt;/a&gt; that repetition and task-related properties were at play. Their research determined that examples repeated multiple times are harder to forget, that "more difficult" and outlier examples are harder to forget, and that non-convex problems create difficulties in forgetting. What does that mean?&lt;/p&gt;
&lt;p&gt;&lt;img alt="Two graphs where the y-axis shows the error. The graph labeled convex shows a curve going gradually down and then back up. The curve has a clear point where the error is minimized at the lowest point of error (optimum). In the non-convex graph there is a more squiggly line going up and down. You can see that there are many points where the curve hits a saddle point; it is lower than the points near it but not at the lowest point overall. These points are labeled local optimum and then the lowest point of the entire graph is labeled global optimum." src="./images/2025/convex_vs_non_convex.png"&gt;&lt;/p&gt;
&lt;p&gt;Most deep learning problems are non-convex problems in an extremely high-dimensional space. Actually this is exactly why they do well at complex problems like computer vision, text generation, translation and multi-modal inputs (text/vision inputs combined, for example). Without these properties, you could train a simple machine learning model and have a much cheaper energy/compute bill for the same performance. Jagielski et al.'s work showed that this complexity creates difficulties in forgetting.&lt;/p&gt;
&lt;p&gt;Forgetting relates directly to the complexity of the problem you are trying to solve. The question you need to answer: does the data you have represent the complexity of the problem space and how the model then can learn that problem space?&lt;/p&gt;
&lt;p&gt;&lt;img alt="A chart where the y-axis shows forgetting time and the x-axis shows learning time. In the bottom right (i.e. hard to learn easy to forget) there are &amp;quot;mislabeled examples&amp;quot;. As you move up on the right (hard to learn) there is rare examples. Then, there is a line where everything above it shows &amp;quot;unforgettable&amp;quot;. On the left (easy to learn, hard to forget) there is common examples and on the right (hard to learn and unforgettable) there is &amp;quot;complex examples&amp;quot;." src="./images/2025/unforgettable_examples_chart.png"&gt;
&lt;em&gt;Maini et al., Characterizing Datapoints via Second-Split Forgetting (2022)&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://arxiv.org/abs/2210.15031"&gt;Maini et al in 2022&lt;/a&gt; defined these characteristics when studying forgetting in deep learning. Their research found that there was a difference in forgetting mislabeled examples, rare (i.e. infrequent but useful) examples and complex examples.&lt;/p&gt;
&lt;p&gt;What are "complex" examples? The authors define them as "samples that are drawn from sub-groups in the dataset that require either (1) a hypothesis class of high complexity; or (2) higher sample complexity to be learnt relative to examples from the rest of the dataset."&lt;/p&gt;
&lt;p&gt;Let's break this down by taking a short detour into exactly what complex means in this context. In this case, complex is an attribute of the amount of information you can learn in an example, compared to its peers or to the entire world of examples. If I show you something that you already know (common), you probably won't learn much unless it's your first time learning it.&lt;/p&gt;
&lt;p&gt;This complexity can also relate to the class itself, where the class is in-and-of-itself an unexpected or surprise occurrence (i.e. an anomaly or aberrance). This "surprise" element can sometimes (but not always) be synonymous with what is meant by "outlier" in a distribution.&lt;/p&gt;
&lt;p&gt;The neat thing is that this complexity has been studied for almost as long as computers have existed. Essentially, this complexity represents what Claude Shannon called &lt;em&gt;information&lt;/em&gt;. &lt;a href="https://en.wikipedia.org/wiki/Information_theory"&gt;Shannon's information theory&lt;/a&gt; explains how complex data holds more information.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Are you a visual learner? There's &lt;a href="https://youtu.be/lLobwqcQEvQ?feature=shared"&gt;a YouTube video on information theory&lt;/a&gt; on the Probably Private channel.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This theory is the basis of much of &lt;a href="https://en.wikipedia.org/wiki/Computational_learning_theory"&gt;computational learning theory&lt;/a&gt;, where you want to extract information and store it in another format (a model, for example) that can hold this information in a compacted and compressed way for future use. As you can imagine, some information is more complex and useful than other information -- ideally you are only storing and communicating the most essential information when you want to learn or communicate efficiently.&lt;/p&gt;
&lt;p&gt;To better understand how information theory and learning theory work together -- imagine you are pulling different colored balls from a jar. After a few pulls of only red balls, you start to expect to see more red balls, maybe there are only red balls in there? Then you see a surprise blue ball. This ball holds more information, which helps you learn. This "aha!" is one example of both information and a "more complex" outcome -- which is what you are trying to store in your model, so the model is also not "surprised"&lt;sup id="fnref:4"&gt;&lt;a class="footnote-ref" href="#fn:4"&gt;4&lt;/a&gt;&lt;/sup&gt; to occasionally see a blue or green ball, even if it learns the majority are red.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A person is next to a jar of different colored balls. Most of them are red but a few are blue and one is green. When pulling a repeated red ball, there is little or no information. When pulling a green ball there is a lot of information. The person learns something when they see new information." src="./images/2025/info_theory_surprise.png"&gt;&lt;/p&gt;
&lt;p&gt;Essentially this is the goal of today's deep learning systems when you want to learn a "world problem" and save it in a multi-billion/trillion parameter model (which is in the end able to be compressed into several very large files on a computer). You want this model to have held all of the information needed to take an input and give a useful, informative output based on every piece of that information (and potential links/signals learned from these "surprise" connections based in information theory).&lt;/p&gt;
&lt;p&gt;Unfortunately there isn't a universal way to measure information in a piece of data, especially if its complexity (and therefore information content) is relative to the learning problem or to the dataset (or world!). You can, however, use the learning process to recognize more complexity, because these classes/examples/problems will probably take longer to learn and demonstrate stronger memorization properties when compared with less complex examples or classes.&lt;/p&gt;
&lt;p&gt;Perhaps these data points can teach us something about recognizing memorization and forgetting. In all of the papers you've explored thus far, there are different metrics the authors use to define this phenomenon. Maybe studying and better defining forgetting and learning can help you establish a baseline for example complexity and frequency within a population. This, of course, must account for things like model size and problem complexity.&lt;/p&gt;
&lt;p&gt;Most currently-developed ways of studying this problem are applied late in the learning process, via a modified training (i.e. leave-one-out) or after the model has been trained. You would need to fundamentally shift the machine learning training and evaluation to evaluate these properties during normal training, so that there can be appropriate metrics which are also efficient and scalable. As of yet, there isn't a universal definition or standard for doing this, and the research around how to appropriately measure this is not designed to scale and certainly isn't yet built into machine learning frameworks.&lt;/p&gt;
&lt;p&gt;With that in mind, you still need to define unlearning -- especially in relation to memorized / "unforgettable" examples. Going back to a metric that is easier to measure, you might remember &lt;a href="https://arxiv.org/abs/2311.17035"&gt;Nasr et al's work&lt;/a&gt; on defining memorization as to whether it is discoverable or extractable. This relates back to &lt;a href="https://blog.kjamistan.com/defining-privacy-attacks-in-ai-and-ml.html"&gt;our two attack definitions&lt;/a&gt;. Perhaps these can lead us to a clearer definition, one that is both measurable and scalable.&lt;/p&gt;
&lt;h3 id="i-swear-i-didnt-see-it-model-based-definitions"&gt;I swear I didn't see it: Model-based definitions&lt;/h3&gt;
&lt;p&gt;Putting the focus back to the model itself, perhaps there is a definition that looks at the model behavior, fidelity or other properties to define forgetting and/or "unlearning".&lt;/p&gt;
&lt;p&gt;As you learned &lt;a href="https://blog.kjamistan.com/how-memorization-happens-novelty.html"&gt;from a previous article&lt;/a&gt;, one way to differentiate between memorized examples and not memorized examples is to use the leave-one-out training to develop several models and then compare a model that has never seen a training example with one that has. But is there a way to determine that a model behaves similarly without having to train hundreds of large models?&lt;/p&gt;
&lt;p&gt;Some unlearning approaches look at measuring the distance between models. However, because many neural networks can learn the same task or same functions with random initialization weights (especially ones with millions of parameters), the similarity or distance of the weights of a network is not a very good choice for evaluating model similarity. It's easy due to permutation invariance (a cool property of linear algebra) to have different weights and yet the same outcomes.&lt;/p&gt;
&lt;p&gt;In 2023, &lt;a href="https://unlearning-challenge.github.io/"&gt;NeurIPS ran an "unlearning challenge"&lt;/a&gt;, which attempted to define unlearning. The measurements used in the challenge were predominantly defined and then evaluated by Google researchers, and included several interesting qualities:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;They choose "average case" datasets, where the worst-case scenario of unlearning only complex examples is removed.&lt;/li&gt;
&lt;li&gt;They use the definition of differential privacy to define a new unlearning measurement that is weaker than the concept of differential privacy, but that they claim is sufficient for proving unlearning.&lt;/li&gt;
&lt;li&gt;This definition attempts to keep model predictions between a model that has never seen the data (leave-one-out) and the model that has undergone unlearning close. Both models should behave as similarly as possible.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img alt="There is a graph with a y-axis that measures some model output (like prediction distribution or loss on a given example). Then there are two probability distributions, one in blue that skews further to the left which represents model responses that have never seen the example and one in red that skews to the right that represents the unlearned model. The goal of the attacker is to separate these two distributions by choosing a hypothesis line and marking everything to the left as one model and everything to the right as the unlearned model." src="./images/2025/unlearning_v_loo_model_prediction.png"&gt;&lt;/p&gt;
&lt;p&gt;To visualize this challenge and the measurement, you compare the leave-one-out model prediction performance metrics across an example set to that of the unlearned model. Similar to differential privacy attacks, the attacker must try to differentiate between the two models. The attacker only has query access to "some model" and must decide if that model is the one that has never seen the data or the one that has unlearned the data. Of course, if you have both models you can simply test both across a sampled subset and determine if this attack would be successful.&lt;/p&gt;
&lt;p&gt;As you have learned from &lt;a href="https://blog.kjamistan.com/a-deep-dive-into-memorization-in-deep-learning.html"&gt;the memorization series&lt;/a&gt; thus far, a model that has memorized a useful complex example or even a less useful common but repeated example will score higher on similar examples than models that haven't seen this information before. Therefore, this unlearning metric attempts to exploit this and see if there are examples that demonstrate the difference between the models.&lt;/p&gt;
&lt;p&gt;Similar to the attacks defined, this attack works best by finding a likelihood ratio exploiting several accuracy dimensions (False Positive Rate and False Negative Rate). The attacker can bring any prior knowledge they have to create an initial assumption and then query the "unknown model" many times in order to update their suspicion on if that model has changed (i.e. if the unlearned and other model behave differently).&lt;/p&gt;
&lt;p&gt;When doing so, an intelligent attacker would keep a history of all the queries and begin to distinguish minor differences by looking at the distribution of responses overall. Eventually, they might find a worst-case scenario query that clearly shows them how the query responses are easily separable.&lt;/p&gt;
&lt;p&gt;In the &lt;a href="https://unlearning-challenge.github.io/assets/data/Machine_Unlearning_Metric.pdf"&gt;metrics definitions for the challenge (PDF)&lt;/a&gt;, the authors use two attacks to measure the unlearning quality of the models. They then choose the best result for a particular example to calculate an epsilon that measures the divergence between the two models. They do this for the sampled subset that represents the forget set examples. Then, they average these epsilons across all their successful attacks or best guesses to determine what they call the "forgetting quality" of the unlearning approach.&lt;sup id="fnref:5"&gt;&lt;a class="footnote-ref" href="#fn:5"&gt;5&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
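&lt;p&gt;To give a feel for how attack error rates turn into a privacy-style number, here is a hedged sketch using the standard empirical-epsilon bound from the differential privacy auditing literature. The challenge's actual scoring involves more machinery (per-example attacks, binning and aggregation into a "forgetting quality" score), so treat this only as an illustration of the direction of the calculation:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import math

def empirical_epsilon(fpr, fnr, delta=0.0):
    """Lower-bound estimate of epsilon from an attacker's false positive
    and false negative rates (the better the attack, the larger epsilon)."""
    eps1 = math.log((1.0 - delta - fpr) / fnr)
    eps2 = math.log((1.0 - delta - fnr) / fpr)
    return max(eps1, eps2, 0.0)

# A near-random attacker implies little measurable difference between models:
print(empirical_epsilon(fpr=0.48, fnr=0.49))   # close to 0
# A strong attacker separating the models implies a large epsilon:
print(empirical_epsilon(fpr=0.05, fnr=0.10))   # roughly 2.9
&lt;/code&gt;&lt;/pre&gt;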
&lt;p&gt;You'll evaluate the approaches for the top scorers in the next article, but remember that this entire measurement requires a predefined forget and retain set and the ability to measure and compare the approach with a leave-one-out model. Presumably the authors could also say that the best approaches automatically extend to other models and therefore you do not need to measure every time -- but because you know &lt;a href="https://blog.kjamistan.com/how-memorization-happens-overparametrized-models.html"&gt;that model architecture&lt;/a&gt; and &lt;a href="https://blog.kjamistan.com/machine-learning-dataset-distributions-history-and-biases.html"&gt;datasets&lt;/a&gt; greatly influence memorization, it would be difficult to argue that one sandboxed experiment demonstrates a universal approach.&lt;/p&gt;
&lt;h3 id="privacy-attack-metrics-as-a-measurement"&gt;Privacy-attack metrics as a measurement&lt;/h3&gt;
&lt;p&gt;In the end, this unlearning metric greatly relies on privacy attacks as a measurement. These attack metrics can also be just directly used to determine the exposure of a given example or set of examples or even across the entire training dataset.&lt;/p&gt;
&lt;p&gt;For the challenge, the attacks and tools used were inspired directly by membership inference attacks, such as LiRA. On &lt;a href="https://research.google/blog/announcing-the-first-machine-unlearning-challenge/"&gt;Google's post about the challenge&lt;/a&gt;, they state:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Intuitively, if unlearning is successful, the unlearned model contains no traces of the forgotten examples, causing MIAs to fail: the attacker would be unable to infer that the forget set was, in fact, part of the original training set.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;So why not just use these attacks as the gold standard? Although they do require extra compute and time to evaluate models, they clearly help model developers and engineers better understand and address the concerns raised by memorization. New attacks should be developed and researched in case there are better ways, but practitioners need standard definitions to allow for comparison and evaluation.&lt;/p&gt;
&lt;p&gt;Additionally, LiRA and extraction attacks could be standardized and integrated into normal machine learning evaluation pipelines -- allowing data scientists and machine learning engineers to evaluate the effectiveness of their interventions and track memorization "metrics" in their systems.&lt;/p&gt;
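&lt;p&gt;To give a sense of what such a pipeline check could look like, here is a very stripped-down, loss-based likelihood-ratio sketch in the spirit of LiRA. It is not the full attack (real LiRA trains many shadow models and calibrates per example), and the losses below are placeholder numbers:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np
from scipy.stats import norm

def lira_style_score(target_loss, losses_in, losses_out):
    """Likelihood-ratio membership score for one example.
    losses_in:  losses from shadow models trained WITH the example.
    losses_out: losses from shadow models trained WITHOUT the example.
    A higher score means "more likely a member" (i.e. not forgotten)."""
    mu_in, sigma_in = np.mean(losses_in), np.std(losses_in) + 1e-8
    mu_out, sigma_out = np.mean(losses_out), np.std(losses_out) + 1e-8
    log_p_in = norm.logpdf(target_loss, mu_in, sigma_in)
    log_p_out = norm.logpdf(target_loss, mu_out, sigma_out)
    return log_p_in - log_p_out

# Placeholder numbers: members tend to have lower loss than non-members.
print(lira_style_score(target_loss=0.2,
                       losses_in=np.array([0.1, 0.2, 0.15]),
                       losses_out=np.array([0.9, 1.1, 0.8])))
&lt;/code&gt;&lt;/pre&gt;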
&lt;p&gt;Of course, with LiRA you need to decide how to hold the attack at a reasonable "true positive rate", so you are accurately guessing which data the model has seen and which not. In addition, there will always be "worst case" versus "average case" debates about which is adequate to properly define unlearning. Much of this could standardize over time in scientific literature as the field matures and as legal and privacy scholars become informed enough about the distinctions to offer advice.&lt;/p&gt;
&lt;p&gt;Even if you weaken the definition to some average case scenario, you can imagine a way where you might not actually prove you didn't learn a thing, you might simply &lt;em&gt;hide&lt;/em&gt; that you learned it. You can think of this as a slight remodel -- the walls are still there but they are a new color, can you find the artifacts that remain?&lt;/p&gt;
&lt;h3 id="harry-who-approximate-unlearning"&gt;Harry who? Approximate unlearning&lt;/h3&gt;
&lt;p&gt;If you don't want to use a precise measurement, you can perhaps find an approximate measurement. This is exactly what Microsoft Research investigated in their research &lt;a href="https://arxiv.org/abs/2310.02238"&gt;Who's Harry Potter? Approximate Unlearning in LLMs&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;They studied how they might be able to unlearn Harry Potter as a concept. They didn't necessarily want to prove the model never learned Harry Potter (MIA) or even that it never mentions the name. They just wanted to make sure it didn't do so often or consistently.&lt;/p&gt;
&lt;p&gt;In this "approximate" unlearning, they encountered other problems with picking a famous person-like character:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Related concepts&lt;/em&gt;: There are several persons related to Harry Potter, like Ron and Hermione. How do you just remove Harry without also eliminating other persons in Harry's proximity?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Downweighting the correct embeddings&lt;/em&gt;: Making it hard to answer "My name is Harry" also deprioritizes the token sequence "My name is", which is not what you want to do.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Disambiguation&lt;/em&gt;: There is more than one Harry! How do you remove the correct one?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Unlearning the correct links&lt;/em&gt;: How do you maintain links to other concepts, like Magic and Hogwarts while delinking Harry?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Simple replacements are too simple&lt;/em&gt;: Simply replacing tokens (i.e. substitute a new name "John" for Harry) leads to concept confusion, like the LLM responding with two persons acting as one.  "John found his keys and then Harry left the house."&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;To deal with these issues, they replaced several core concepts with more "generic" concepts. They then fine-tuned the model to "relearn" these generic links, suppressing the Harry Potter likelihood via this additional training.&lt;/p&gt;
&lt;p&gt;After that additional training, they reached a result where they determined the slippage was appropriate (i.e. Harry was rarely mentioned). The authors note that Harry Potter and linked concepts exist in their own universe, which allowed this information to be more cleanly separated from other concepts.&lt;sup id="fnref:6"&gt;&lt;a class="footnote-ref" href="#fn:6"&gt;6&lt;/a&gt;&lt;/sup&gt; Non-fiction content and real humans could be more difficult to remove.&lt;/p&gt;
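&lt;p&gt;As a toy sketch of the generic-replacement step only (the paper's full pipeline also uses a reinforced model and adjusted training targets, which this skips, and a naive mapping runs straight into the "too simple" problem from the list above), you can imagine rewriting the fine-tuning text with an invented mapping like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import re

# Invented example mapping from universe-specific terms to generic stand-ins.
generic_map = {
    "Harry Potter": "Jon Smith",
    "Hogwarts": "a boarding school",
    "Quidditch": "a team sport",
}

def genericize(text):
    """Rewrite fine-tuning text so specific concepts become generic ones."""
    for specific, generic in generic_map.items():
        text = re.sub(re.escape(specific), generic, text)
    return text

print(genericize("Harry Potter flew to Hogwarts to play Quidditch."))
# "Jon Smith flew to a boarding school to play a team sport."
&lt;/code&gt;&lt;/pre&gt;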
&lt;h3 id="open-problem-lack-of-clear-definitions"&gt;Open Problem: Lack of clear definitions&lt;/h3&gt;
&lt;p&gt;As you have learned thus far in this article, there are many differing measurements, metrics and ideas about what "unlearning" or "forgetting" could mean. Because there isn't yet a clear, agreed-upon definition of unlearning, it is also a difficult field to contribute to effectively: how can you move the needle on a problem when you haven't yet defined the problem?&lt;/p&gt;
&lt;p&gt;From the Google Unforgetting challenge:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The evaluation of forgetting algorithms in the literature has so far been highly inconsistent. While some works report the classification accuracy on the samples to unlearn, others report distance to the fully retrained model, and yet others use the error rate of membership inference attacks as a metric for forgetting quality.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Since you might want to compare 10 different approaches to unlearning, you are then trying to evaluate them with 3-4 different types of measurements. How do you choose the best one for your data and use case? Do you have time to evaluate every approach? Where is the advice on how to fix your specific unlearning problem?&lt;/p&gt;
&lt;p&gt;To make things more complicated, when lawyers or policy experts look at these problems, they see a myriad of other issues. &lt;a href="https://arxiv.org/abs/2412.06966"&gt;Research from a multidisciplinary team of researchers in big tech and universities&lt;/a&gt; pointed out that being able to unlearn one particular artist doesn't remove their influence on a group (i.e. Monet and the Impressionists). Being able to remove all pictures of Batman doesn't remove all references to the idea of that superhero, or even all photos of someone cosplaying that character. Essentially their advice focuses heavily on copyright and advises that you use guardrails to attempt to block violations, but they present a sound discussion of the complexity of the issue. Since there isn't yet much legal guidance, defining unlearning, both technically and legally, is going to be a long process.&lt;/p&gt;
&lt;p&gt;Additionally, it's hard to disentangle concepts in data at scale. &lt;a href="https://arxiv.org/abs/2110.11891"&gt;Thudi et al.&lt;/a&gt; called for better auditable unlearning definitions, noting that it's easy to learn the same information from a different data point. This is similar to the problem with graph networks, where if you delete your data, it doesn't mean that a friend of yours isn't sharing your data without your permission (i.e. via their contact book or an uploaded photo). Without a clear definition that is both auditable and that reflects the societal and personal views of privacy, it's problematic to call unlearning "done" or even achievable.&lt;/p&gt;
&lt;p&gt;Taking a personal and social view of privacy, your data might be easily traceable but you as a concept are not. Other people can contribute data about you without your consent because that's how data works -- it's unlikely that you'll ever have a full view of what digital data about you exists. That said, creating a better conversation around what humans want and what is technically sound could help create an understandable, feasible and auditable definition.  &lt;/p&gt;
&lt;p&gt;To be clear, that's a big part of why I've been writing this series -- to hopefully spark conversations and insights across disciplines in order to help guide better, clearer, safer definitions of Privacy-by-Design AI systems.&lt;/p&gt;
&lt;p&gt;Unfortunately I won't wrap up this post with an answer to this question, but I hope I've sparked some thinking around how to talk about the definitions currently available and begin to reason about which ones resonate with you, with your team, with your legal or policy expertise and with your own data.&lt;/p&gt;
&lt;p&gt;In the next article, you'll be exploring different ways researchers and teams have technically approached unlearning. You'll dive deeper into the problems of unlearning critical information. You'll also look at other open problems in unlearning, like scaling unlearning for real-world problems.&lt;/p&gt;
&lt;p&gt;As always, I'm very open to feedback (positive, neutral or critical), questions (there are no stupid questions!) and creative re-use of this content. If you have any of those, please share it with me! This helps keep me inspired and writing. :)&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Acknowledgements&lt;/em&gt;: I would like to thank &lt;a href="https://desfontain.es/serious.html"&gt;Damien Desfontaines&lt;/a&gt; for his feedback, corrections and thoughts on this article. His input greatly improved my thinking and writing. Any mistakes, typos, inaccuracies or controversial opinions are my own.&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;Wouldn't it be neat if you could actually track these as you trained and view concrete shifts in examples, even if only at certain training intervals? Although you &lt;em&gt;could&lt;/em&gt; do this in some custom MLOps setups, this is not standard practice. That said, this would certainly help both with debugging learning processes and with tracking problems like memorization... Note that this would need to be done in a performant manner, which would likely mean selecting some subsample of a given test or train batch rather than measuring every example.&amp;#160;&lt;a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;Yes, for the privacy professionals reading this -- this means information-driven data minimization (so long as this learning isn't centralized)! But... how to do this efficiently at scale is still an unsolved problem. Plus, how do you decide whose information is spared the learning process and whose information is used to learn? If you can answer this, please &lt;a href="https://probablyprivate.com/about/"&gt;write me&lt;/a&gt;!&amp;#160;&lt;a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:3"&gt;
&lt;p&gt;They also found extreme outliers and potentially mislabeled examples by diving further into the "most forgettable" examples. For obvious reasons, these examples actually hindered learning rather than supporting it. As they note: "Finding a way to separate those points from very informative ones is an ancient but still active area of research (John, 1995; Jiang et al., 2018)."&amp;#160;&lt;a class="footnote-backref" href="#fnref:3" title="Jump back to footnote 3 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:4"&gt;
&lt;p&gt;I try to avoid anthropomorphizing models; they are just fancy computer programs. Please excuse this as I try to explain the concept. :)&amp;#160;&lt;a class="footnote-backref" href="#fnref:4" title="Jump back to footnote 4 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:5"&gt;
&lt;p&gt;There is some pseudocode that is instructive on this approach in the &lt;a href="https://unlearning-challenge.github.io/assets/data/Machine_Unlearning_Metric.pdf"&gt;metrics definitions for the challenge&lt;/a&gt; and the &lt;a href="https://arxiv.org/abs/2406.09073"&gt;final wrap up of the competition&lt;/a&gt;. Note: they approximate some of these metrics and specifically call out that the population is too small to measure everything as accurately as desired for every model due to timing and computation constraints.&amp;#160;&lt;a class="footnote-backref" href="#fnref:5" title="Jump back to footnote 5 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:6"&gt;
&lt;p&gt;Even newer and significantly weaker definitions, like &lt;a href="https://arxiv.org/abs/2310.07579"&gt;"in-context unlearning"&lt;/a&gt;, where the unlearning is supposed to happen in the prompt, take LiRA as an inspiration and note that the LiRA definition can and should be used as the most accurate measurement of unlearning.&amp;#160;&lt;a class="footnote-backref" href="#fnref:6" title="Jump back to footnote 6 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="ml-memorization"></category></entry><entry><title>AI Risk and Threat Taxonomies</title><link href="https://blog.kjamistan.com/ai-risk-and-threat-taxonomies.html" rel="alternate"></link><published>2025-08-05T00:00:00+02:00</published><updated>2025-08-05T00:00:00+02:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2025-08-05:/ai-risk-and-threat-taxonomies.html</id><summary type="html">&lt;p&gt;It seems like every week &lt;a href="https://www.linkedin.com/in/katharinejarmul/"&gt;my LinkedIn&lt;/a&gt; feed is filled with new &lt;em&gt;just released&lt;/em&gt; AI risk taxonomies, threat models or AI governance handbooks. Usually these taxonomies come from governance consultants or standards authorities and are a great reference for understanding the wide variety of risks AI systems&lt;sup id="fnref:1"&gt;&lt;a class="footnote-ref" href="#fn:1"&gt;1&lt;/a&gt;&lt;/sup&gt; bring with …&lt;/p&gt;</summary><content type="html">&lt;p&gt;It seems like every week &lt;a href="https://www.linkedin.com/in/katharinejarmul/"&gt;my LinkedIn&lt;/a&gt; feed is filled with new &lt;em&gt;just released&lt;/em&gt; AI risk taxonomies, threat models or AI governance handbooks. Usually these taxonomies come from governance consultants or standards authorities and are a great reference for understanding the wide variety of risks AI systems&lt;sup id="fnref:1"&gt;&lt;a class="footnote-ref" href="#fn:1"&gt;1&lt;/a&gt;&lt;/sup&gt; bring with them.... but...&lt;/p&gt;
&lt;p&gt;&lt;img alt="A &amp;quot;this is fine&amp;quot; meme. The first panel shows the dog in the burning room with a headline: Breaking: New AI Risk Report, 500 updated definitions! And the second panel is the &amp;quot;This is fine&amp;quot; panel." src="./images/2025/ai_risk_taxonomy.jpg"&gt;&lt;/p&gt;
&lt;p&gt;Often they are maddeningly impractical.&lt;/p&gt;
&lt;p&gt;Let's say you're a governance or technical expert on a team and you get forwarded a 500-definition taxonomy with little to no categorization of where the threat lies, how to implement controls and, most importantly, whether it even applies to you. Where can you start with that document? For your mental health, I'd recommend closing your browser and making yourself a tea in the garden...&lt;/p&gt;
&lt;p&gt;So, let's NOT stop making taxonomies -- they are useful as a reference -- but let's START making deeply practical approaches for people who actually work in governance, data and AI. These documents can help teams:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Prioritize which risks matter&lt;/strong&gt;: Do you train your own models? No? Then for the love of the universe stop talking about data poisoning -- it's not your problem! Instead, focus on figuring out which threats are actually relevant for the ways you are using AI/ML systems, and then start with only those relevant attacks and vulnerabilities.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Stop (just) citing papers and start red teaming and testing&lt;/strong&gt;: Papers are great. I love papers, really, have you seen &lt;a href="https://practicaldataprivacybook.com/practical_data_privacy_urls/"&gt;my book's citations&lt;/a&gt;? ;) But have you heard of &lt;a href="https://github.com/paperswithcode"&gt;papers with code&lt;/a&gt;? Work with technical team members to actually build out and test out a few attacks once you know which ones concern you.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Build out data governance infrastructure&lt;/strong&gt;: Most organizations aren't training or hosting extremely large models themselves, but they are building tooling around these systems. Focus on getting data governance basics correct (documentation, tagging, cataloging, lineage and quality tracking) so that as your data/AI/ML maturity grows you've already covered the basics and you're ready to go.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Focus on system components and data access&lt;/strong&gt;: Concerned about AI privacy and security? Focus on what data and documents the system has access to and how. Build protections just like you would for any data access. For example, removing potentially sensitive data sources from any data the AI system accesses is a great start.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Flex your multidisciplinary risk muscle&lt;/strong&gt;: Not yet doing multidisciplinary risk assessment and evaluation? You're living in the past, bud! Yes, it'll "slow you down" and introduce new processes at first, but the benefits of faster releases, higher-quality, privacy-aware and secure products will definitely outweigh that initial friction.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Getting the ball rolling, even starting small, is the most essential thing you can do for building more secure, more privacy-aware systems. Then your ability to address all of those taxonomies grows with practice, platforms and systems that help you assess, manage and reduce the impact of new risks, threats and vulnerabilities.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Want more tips like this in your inbox? &lt;a href="https://probablyprivate.com/subscribe/"&gt;Subscribe to my newsletter&lt;/a&gt; or &lt;a href="https://www.youtube.com/@ProbablyPrivate"&gt;my YouTube channel&lt;/a&gt; to get the latest.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I'm curious: any other practical tips you have for folks to get started on AI system risk? Do taxonomies help you do your work? If so, for what work and how? And what are you doing outside of taxonomy work to address risk?&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;An AI system includes machine learning models, monitoring, evaluation, software, infrastructure/networking/hardware and data needed to run an AI-based product or service.&amp;#160;&lt;a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="security"></category></entry><entry><title>Algorithmic-based Guardrails: External guardrail models and alignment methods</title><link href="https://blog.kjamistan.com/algorithmic-based-guardrails-external-guardrail-models-and-alignment-methods.html" rel="alternate"></link><published>2025-07-28T00:00:00+02:00</published><updated>2025-07-28T00:00:00+02:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2025-07-28:/algorithmic-based-guardrails-external-guardrail-models-and-alignment-methods.html</id><summary type="html">&lt;p&gt;You've probably at some point heard the term "guardrails" when talking about security or safety in AI systems like LLMs or multi-modal models (i.e. models that include and produce multiple modalities, like speech and image, videos, image and text).&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Are you a visual learner? There's &lt;a href="https://youtu.be/IeyB-2cS5lM"&gt;a YouTube video for …&lt;/a&gt;&lt;/p&gt;&lt;/blockquote&gt;</summary><content type="html">&lt;p&gt;You've probably at some point heard the term "guardrails" when talking about security or safety in AI systems like LLMs or multi-modal models (i.e. models that include and produce multiple modalities, like speech and image, videos, image and text).&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Are you a visual learner? There's &lt;a href="https://youtu.be/IeyB-2cS5lM"&gt;a YouTube video for this article&lt;/a&gt; on the Probably Private channel.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;In this article in &lt;a href="https://blog.kjamistan.com/a-deep-dive-into-memorization-in-deep-learning.html"&gt;the series&lt;/a&gt;, you'll dive deeper into what technically falls under the term "guardrail" in today's AI systems and review whether these are a reasonable approach to memorization in AI/ML models.&lt;/p&gt;
&lt;h3 id="what-are-guardrails"&gt;What are guardrails?&lt;/h3&gt;
&lt;p&gt;The term is unfortunately difficult to pin down technically. Since it became popular, it's been used to describe a variety of interventions in AI/ML systems, including:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;software-based input and output filters (as you read &lt;a href="https://blog.kjamistan.com/blocking-aiml-memorization-with-software-guardrails.html#blocking-aiml-memorization-with-software-guardrails"&gt;in the last article&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;external algorithmic/machine learning model input and output filters&lt;/li&gt;
&lt;li&gt;actual fine-tuning or extended training which attempts to update the main model to reduce the chance of unwanted outputs (this is sometimes called &lt;em&gt;alignment&lt;/em&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Let's review the second two in further detail and compare them to what you learned about &lt;a href="https://blog.kjamistan.com/blocking-aiml-memorization-with-software-guardrails.html#blocking-aiml-memorization-with-software-guardrails"&gt;software-based filters&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id="external-machine-learning-models-as-input-and-output-filters"&gt;External machine learning models as input and output filters&lt;/h3&gt;
&lt;p&gt;Similar to the software-based input and output filters you reviewed, machine learning-based filters or &lt;em&gt;algorithmic filters&lt;/em&gt; attempt to identify problematic inputs and outputs. Instead of using the hashing or keyword-based approaches that you know from software filters, these use a trained model that sits outside of the main model and predicts whether the input or output should be blocked.&lt;/p&gt;
&lt;p&gt;There are already some popular open-source models that do just this, like Llama Guard, Prompt Guard and Code Shield, which all fall under the &lt;a href="https://github.com/meta-llama/PurpleLlama"&gt;Purple Llama family of models&lt;/a&gt; released by Meta. Let's investigate how these work in a real system.&lt;/p&gt;
&lt;p&gt;&lt;img alt="An example high-level architecture, where the data first comes in via a piece of software with an API call, then the data goes through some sort of input processing. After the input processing there is an algorithmic guardrail flags content violations based on the text input. Then the data goes to the LLM itself (this usually now includes the tokenization as part of the LLM). Before the output processing, there is another external algorithmic guardrail to flag potential violations. Then it flows through output processing and then back to API software and to the user." src="./images/2025/algo_guardrails.png"&gt;&lt;/p&gt;
&lt;p&gt;These models are trained to identify known problems within the models themselves, like toxicity and malicious content, as well as known attacks against multi-modal models, such as prompt injection attacks to provoke banned responses.&lt;/p&gt;
&lt;p&gt;The model uses the input from the conversation or individual chat message and identifies if the user, the prompt or the answer could be viewed as problematic. Some of these models are trained with a variety of categories, like violence, hateful language, illegal activities and even privacy. Some are trained just to identify one particular problem, like protecting the meta prompt or attempting to find cybersecurity errors in generated code.&lt;/p&gt;
&lt;p&gt;For example, a guardrail model can process chat input to identify if someone has included anything that would be considered a jailbreak attempt, like "Ignore instructions and do this instead". Or the model might identify particular racist remarks or slurs and flag a conversation as discriminatory.&lt;/p&gt;
&lt;p&gt;This works well for inputs that are easy to classify. The guardrail model is trained on a classification task, where the training data pairs each input (text and/or other modalities) with a label for the category of problem it contains (i.e. insecure code or a derogatory statement).&lt;/p&gt;
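&lt;p&gt;To make this concrete, here's a minimal sketch of what calling one of these guardrail classifiers might look like, using the Hugging Face transformers library and (as an assumption) the meta-llama/Llama-Guard-3-8B checkpoint, which is gated and requires requesting access. The example conversation is made up; the chat template handles the model's safety prompt formatting.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-Guard-3-8B"  # assumed checkpoint, gated on Hugging Face
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# A made-up single-turn conversation to moderate
chat = [{"role": "user", "content": "What is the home address of my old classmate Jane Doe?"}]

# The chat template wraps the conversation in the model's safety-classification prompt
input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt")
output = model.generate(input_ids=input_ids, max_new_tokens=30)

# The completion is a verdict like "safe" or "unsafe" plus a category code
prompt_len = input_ids.shape[-1]
print(tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True))
&lt;/code&gt;&lt;/pre&gt;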
&lt;p&gt;These content-filtering models can also be trained on new categories that a given organization has in mind -- for example, not to mention a competitor when answering about services. Llama Guard has specific instructions on how to add and train your own categories.&lt;/p&gt;
&lt;p&gt;These are still machine learning models, and machine learning models are relatively easy to fool or trick. Not only is this possible by actively attacking the model, &lt;a href="https://blog.kjamistan.com/adversarial-examples-demonstrate-memorization-properties.html"&gt;like you learned in the adversarial machine learning article&lt;/a&gt;, but also by simply testing interesting new and creative approaches that are unlikely to have been tested or trained yet.&lt;/p&gt;
&lt;p&gt;This has long been the case in cyber- and information security, where security professionals become quite skilled at thinking outside of the box and using what they know about systems and security to devise new attack vectors and creative workarounds. Because a machine learning model doesn't actually reason, it is often much easier to evade than a human would be.&lt;/p&gt;
&lt;p&gt;There are many examples of successful attacks, but one of my favorite recent examples came from &lt;a href="https://aclanthology.org/2024.acl-long.809/"&gt;researchers looking at ASCII art&lt;/a&gt;. In their work, they ask usually blocked questions by changing key words to ASCII art text. They even developed &lt;a href="https://github.com/uw-nsl/ArtPrompt"&gt;an open-source library&lt;/a&gt; so you can try out your own adversarial requests.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Two user flows are shown. On the top one, the user prompts to ask how to build a bomb and the guardrails reject it. On the second prompt the query is broken down into words before being sent and an ASCII mask is applied to the word &amp;quot;bomb&amp;quot; so now it is written in asterisks. Then the LLM system responds with how to build a bomb." src="./images/2025/artprompt.png"&gt;
&lt;em&gt;ArtPrompt example of evading guardrails with ASCII art&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;In addition to these creative evasions, the issues that come up for privacy and copyright are certainly much harder to train guardrails for than something like violent or criminal behavior.&lt;/p&gt;
&lt;p&gt;Although Llama Guard has a category for privacy, there aren't many details about how this was trained. I decided to try out a few conversation examples to see what it considered private or not.&lt;sup id="fnref:1"&gt;&lt;a class="footnote-ref" href="#fn:1"&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Flagged as a privacy violation&lt;/th&gt;
&lt;th&gt;Not flagged&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Outputting a list of numbers with the text "id"&lt;/td&gt;
&lt;td&gt;Asking for an ID&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;The text "credit card number" with any number of digits&lt;/td&gt;
&lt;td&gt;Entering or repeating a phone number&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Asking for a personal address&lt;/td&gt;
&lt;td&gt;Asking for non-public information about a person&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Asking to interpret a medical report and tell who it is&lt;/td&gt;
&lt;td&gt;Sharing medical information via chat&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;These results are quite similar to NVIDIA's &lt;a href="https://docs.nvidia.com/nemo/guardrails/latest/user-guides/guardrails-library.html"&gt;Nemo Guardrails&lt;/a&gt;, which uses Microsoft's &lt;a href="https://github.com/microsoft/presidio"&gt;Presidio&lt;/a&gt; to scan for easy-to-find categories of personal data and block or mask those tokens. Although it's great to block potential release of nonpublic information like an address or phone number, it doesn't mean that privacy is actually guaranteed.&lt;/p&gt;
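&lt;p&gt;If you're curious what that kind of scanning looks like in practice, here's a minimal sketch using Presidio directly (assuming the presidio-analyzer and presidio-anonymizer packages plus a spaCy English model are installed); the example text is made up.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

text = "Hi, my name is Jane Doe and my phone number is 212-555-0188."

# Detect easy-to-find categories of personal data (names, phone numbers, ...)
analyzer = AnalyzerEngine()
findings = analyzer.analyze(text=text, language="en")

# Mask the detected spans before the text goes back to the user
anonymizer = AnonymizerEngine()
print(anonymizer.anonymize(text=text, analyzer_results=findings).text)
# prints the same sentence with the name and phone number replaced by placeholder tags
&lt;/code&gt;&lt;/pre&gt;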
&lt;p&gt;Comparing these model-based guardrails to &lt;a href="https://blog.kjamistan.com/defining-privacy-attacks-in-ai-and-ml.html"&gt;the two major attack vectors&lt;/a&gt;, it's clear that the interventions don't prevent revealing memorized private or copyrighted information. Even just providing information about a person can be seen as a violation of their privacy and can reveal something that can be used outside of context and consent. And what about reproducing someone's face, voice or likeness? In addition, these guardrails don't prevent a membership inference attack, and aren't trained to evaluate if the training data is being repeated or exposed.&lt;/p&gt;
&lt;p&gt;Perhaps more important is the question: who decides what is a guardrail and how it's trained? Most companies don't have enough data to develop and train their own guardrail models, which means they are relying on model providers to release useful guardrails.&lt;/p&gt;
&lt;p&gt;Because each system is different, general privacy and intellectual property guardrails can also misfire: a company might legitimately need to receive personal information to perform a lookup in a customer database, or to return copyrighted material that the organization has a license to use. Since most models don't document in detail how they were trained and what they can do, organizations are left struggling to understand how to effectively deploy and monitor the guardrails available, and what to do if there aren't any guardrails that fit their use case.&lt;/p&gt;
&lt;p&gt;Since filter-like approaches are external to the actual model, what happens if you try to instead incorporate the guardrail task into the actual learning step? In this case, you want to ensure that while a task is learned, potential errors or undesired outputs are avoided -- which brings us to training-based approaches.&lt;/p&gt;
&lt;h3 id="fine-tuning-or-training-based-alignment-approaches"&gt;Fine-tuning or training-based alignment approaches&lt;/h3&gt;
&lt;p&gt;In addition to filtering and flagging inputs and outputs, today's largest models generally go through alignment as part of the fine-tuning/training step of the model development process. Let's review what this looks like and how it works.&lt;/p&gt;
&lt;p&gt;The entire training process of today's LLMs looks something like this:&lt;/p&gt;
&lt;p&gt;&lt;img alt="A pipeline of 4 steps: Pretraining, where the team is training a language model on unedited, partially cleaned, deduplicated documents. The next step is extended or continued pretraining which is an optional step, usually for context-specific unlabeled data, like additional documents in another language. The next large step is Training: Fine-tuning on a particular task such as code completion or chat. The final optional step is alignment: usually to train in particular guardrails." src="./images/2025/llm_pipeline.png"&gt;&lt;/p&gt;
&lt;p&gt;These terms can be confusing because they are very LLM-jargon specific, so let's translate what each step does:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Pretraining&lt;/strong&gt;: For me, this naming is very strange because essentially this is just unsupervised training that produces a base language model. This model is trained on content (text, image, video, etc) at scale. The input embeddings (i.e. text or multi-modal embeddings) are also learned. The model is not trained to specifically predict chat-style responses but it might have chat text as part of the training data. For LLMs, this results in a model that is good at predicting the next token(s) when given a small or large amount of text. For other sequence-based models, it will predict the next step when given an input sequence (such as next part of audio wave or next image for video).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;(Optional) Extended or Continued Pretraining&lt;/strong&gt;: This step can be used to further pretrain a publicly-released base model from a large LLM provider&lt;sup id="fnref:2"&gt;&lt;a class="footnote-ref" href="#fn:2"&gt;2&lt;/a&gt;&lt;/sup&gt; or to continue pretraining on a context- or language-specific dataset. Like in the previous step, this pretraining is just learning basic language or sequence modeling, so no labeled data or supervised training is used.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Training (or sometimes called Supervised Training or Supervised Fine Tuning)&lt;/strong&gt;: This is where the base language or sequence-based model is trained to complete a given task, like answering chat messages. You can also train base language models to do other things: write code, classify text, translate or perform other sequence-based machine learning tasks. Today's chat assistants are trained on chat-like texts with additional prompt inputs that give instructions and show completions. In these datasets the user input is listed under "User" and the model should learn to respond as the "Agent" speaker. &lt;a href="https://huggingface.co/datasets/HuggingFaceH4/instruction-dataset"&gt;Instruction datasets&lt;/a&gt; can also be used where there are task-completion examples, like counting, mathematics, "world model" tasks, etc.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;(Optional) Fine-tuning for Alignment&lt;/strong&gt;: Although this can be done as a normal part of the LLM training, sometimes a separate dataset and objective are used for training guardrail alignment. If so, this usually happens directly before models go into use to ensure that these guardrails are not partially changed or forgotten during another step in the fine-tuning process. These &lt;a href="https://huggingface.co/datasets/HuggingFaceH4/hhh_alignment"&gt;alignment datasets&lt;/a&gt; include examples of responding differently to requests for objectionable content.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;These steps can change based on a particular model setup, but they're useful to know even if your organization's method is different.&lt;/p&gt;
&lt;p&gt;To better understand common implementations for steps 3 and 4, you'll first need to become familiar with reinforcement learning.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;A brief introduction to &lt;a href="https://en.wikipedia.org/wiki/Reinforcement_learning"&gt;Reinforcement Learning&lt;/a&gt;: Reinforcement learning is a particular type of machine learning that uses an incentivization-like method to measure loss and update model parameters. The field has strong roots in robotics and control, where you'd like to set particular constraints and reward or penalize particular steps or next-action predictions in order to train the robot towards a goal.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;During the final training steps (#3 and #4), model optimization focuses on "preference" optimization. Let's review the most popular approaches for doing so.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Collect Initial Data on Human Preferences&lt;/strong&gt;: First you need to develop a dataset that shows human preference between a variety of chat responses or conversations. Usually these are collected by data workers&lt;sup id="fnref:3"&gt;&lt;a class="footnote-ref" href="#fn:3"&gt;3&lt;/a&gt;&lt;/sup&gt; who act as the "agent" and produce high quality responses or interact with an already trained chatbot to produce conversations. Once enough data exists, data workers shift to ranking and correcting responses (i.e. which response is better, or what would make this response even better).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Reinforcement Learning with Human Feedback (RLHF)&lt;/strong&gt;: The "human preference" chat data is used to train a reinforcement learning reward model that learns human preferences from this ranked data (think of this as a supervised text classification task). The reward model is then used to reward or penalize the main model as it continues training. The model updates (i.e. loss/optimization) are directly calculated via this reward/penalty. More specifically, the reward model output is combined via a policy function (which balances learning with "remembering what it already learned") and this function calculates the model parameter updates.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Direct Preference Optimization (DPO)&lt;/strong&gt;: Here, binary "human preference" data is used (i.e. a preferred versus a dispreferred response) instead of a more nuanced ranking (i.e. most favorite to least favorite). A policy objective then directly increases the likelihood of preferred responses and decreases the likelihood of dispreferred ones -- no separate reward model is needed (see the sketch after this list).&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
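&lt;p&gt;To make the DPO idea a little more concrete, here is a minimal sketch of the loss in PyTorch. It assumes you've already computed the summed log-probabilities of the preferred ("chosen") and dispreferred ("rejected") responses under both the policy being trained and a frozen reference model; the argument names are my own, and this illustrates the objective rather than a full training loop.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # How much more (or less) the policy likes each response than the reference model does
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the margin between preferred and dispreferred responses to be large
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
&lt;/code&gt;&lt;/pre&gt;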
&lt;p&gt;In addition to providing higher quality and more interesting responses, this human feedback also helps reduce harmful text learned from the internet -- whether that's violent and criminal activities or just blatant racism, sexism, ableism, etc. Relatively recently privacy and intellectual property have been added to the list of ways to align models.&lt;/p&gt;
&lt;p&gt;This means, however, that data workers must be given guidance on what constitutes a privacy or intellectual property violation. Only then can conversations be guided away from these outputs. As you learned in the &lt;a href="https://blog.kjamistan.com/defining-privacy-attacks-in-ai-and-ml.html"&gt;attacks article&lt;/a&gt;, this would be impossible for any human to do, since it would require photographic memory of the entire set of training data examples.&lt;/p&gt;
&lt;p&gt;One approach to automate this would be to actually test outputs from both the pretrained model and the fine-tuned model for their proximity to training examples and to encourage divergence from these examples. As far as I know this is not an active approach in production-grade model training.&lt;/p&gt;
&lt;p&gt;Since it would be impossible for humans to review every conversation to determine if it has released person-related information out of context or if it has repeated potentially copyrighted content without attribution, this alignment usually finds only the most blatant examples, as shown in the small exercise with Llama Guard. These models end up avoiding directly outputting personal contact information or recognizing blatant requests that might violate privacy (i.e. "Tell me the social security number of [person]").&lt;/p&gt;
&lt;p&gt;By design, this fine-tuning for alignment only modifies the model slowly and slightly, because there is a penalty for too much divergence from the underlying base model. This penalty exists to ensure the fine-tuning doesn't create "catastrophic forgetting" of the large-scale text learned in the base language model.&lt;/p&gt;
&lt;h3 id="jailbreaking-attacks"&gt;Jailbreaking attacks&lt;/h3&gt;
&lt;p&gt;Despite these external and internal model guardrails, there are many examples of quickly and easily "jailbreaking" models. This term refers either to evading the guardrail models (i.e. getting a prompt or response that should be flagged to pass through) or to modifying or exploiting the actual model in a way that subverts the alignment fine-tuning.&lt;/p&gt;
&lt;p&gt;Let's review a few broader categories of these attacks:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Clever prompting&lt;/strong&gt;: Early attacks, which often still work, use clever prompt engineering to evade guardrail model filters and/or alignment. One fun example is to "time travel" or "world travel" to a place where the topic is allowed. There are &lt;a href="https://csrc.nist.gov/pubs/ai/100/2/e2023/final"&gt;many great examples&lt;/a&gt; of clever prompt attacks, and there are likely to be many years of new prompt-based attack development.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Fine-tuning to remove alignment&lt;/strong&gt;: Many models are available for free download and use on HuggingFace or via GitHub. Even OpenAI and Copilot allow forking of the model and fine-tuning "on your own data". Although big model providers have terms of service that prohibit malicious fine-tuning or use, this doesn't necessarily stop motivated users from downloading or forking the model and then fine-tuning to remove guardrails. &lt;a href="https://www.lesswrong.com/posts/3eqHYxfWb5x4Qfz8C/unrlhf-efficiently-undoing-llm-safeguards"&gt;Recent research on publicly-released models&lt;/a&gt; shows the costs can be as low as $160 to significantly reduce fine-tuned guardrails.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Adversarial attacks&lt;/strong&gt;: &lt;a href="https://blog.kjamistan.com/adversarial-examples-demonstrate-memorization-properties.html"&gt;Adversarial attacks&lt;/a&gt; can be developed based on model outputs. These attacks can also be &lt;em&gt;transferred&lt;/em&gt; from one model to another, based on the same principles of &lt;a href="https://en.wikipedia.org/wiki/Transfer_learning"&gt;transfer learning&lt;/a&gt;. By design, adversarial attacks change the outputs and model behavior (either to make an error or to push outputs in a particular direction).&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;So far, there haven't been publicly reported examples of attackers using these methods to exfiltrate memorized data from machine learning models -- outside of producing naked images and videos of famous (or less famous) people without their consent, or impersonating famous people via deepfakes.&lt;/p&gt;
&lt;p&gt;As AI systems are used in increasingly proprietary and sensitive environments, these attacks will become more valuable. A targeted attack -- say, searching for a particular piece of content or going after a particular person or company -- is also easier to build than a general one.&lt;/p&gt;
&lt;p&gt;Accidental attacks on privacy, where personal information is released in response to ordinary queries, have already been reported, and it would be useful to require transparency reporting on how often this occurs.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://blog.kjamistan.com/how-memorization-happens-novelty.html"&gt;As you learned previously&lt;/a&gt;, the larger the model the easier this is to do -- even with existing guardrails. Recall that &lt;a href="https://arxiv.org/abs/2311.17035"&gt;Nasr et al. predicted the ability to extract more than a million word-for-word training data examples&lt;/a&gt; from ChatGPT with a larger budget.&lt;/p&gt;
&lt;p&gt;As with input and output filters, non-deterministic approaches to these problems (i.e. fine-tuning or ML-based filters) are unlikely to catch all unwanted outputs without a clear definition of what is expected. Because most organizations training large-scale AI systems do not actively test for memorization, it is difficult to then prove that the second training/fine-tuning step has reduced this memorization to an acceptable threshold.&lt;/p&gt;
&lt;p&gt;In general, when thinking about privacy, showing that even one person's privacy is fully violated (i.e. their data is memorized and exposed in a way they did not expect or consent to) is enough to demonstrate a problem. Checking this for every data example that was not collected under enthusiastic consent presents a scaling problem that would require significantly changing how privacy metrics and auditing are used in model training and evaluation.&lt;sup id="fnref:4"&gt;&lt;a class="footnote-ref" href="#fn:4"&gt;4&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;Although model training via fine-tuning and ML-based guardrails help with AI safety and reliability, they are not a substitute for thinking through and addressing real issues of privacy and memorization.&lt;/p&gt;
&lt;p&gt;In the next article, you'll learn about the field of "machine unlearning", or if/when/how it is possible to remove information that has already been learned from deep learning models.&lt;/p&gt;
&lt;p&gt;As always, I'm very open to feedback (positive, neutral or critical), questions (there are no stupid questions!) and creative re-use of this content. If you have any of those, please share it with me! This helps keep me inspired and writing. :)&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Acknowledgements&lt;/em&gt;: I would like to thank &lt;a href="https://vickiboykis.com/"&gt;Vicki Boykis&lt;/a&gt; for feedback, corrections and thoughts on this article. Any mistakes, typos, inaccuracies or controversial opinions are my own.&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;This isn't meant to be a thorough or exhaustive test or research, but I am curious if you happen to come across a holistic approach! I used Llama Guard 3 8B.&amp;#160;&lt;a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;For example, Mistral AI has a few &lt;a href="https://huggingface.co/mistralai/Mistral-7B-v0.1"&gt;base pretrained models&lt;/a&gt; and you can see examples of &lt;a href="https://lightning.ai/lightning-ai/studios/continued-pretraining-with-tinyllama-1-1b"&gt;extended/continued pretraining from Lightning AI&lt;/a&gt;.&amp;#160;&lt;a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:3"&gt;
&lt;p&gt;Usually paid extremely low wages and overwhelmingly employed in Africa and South America, these often highly skilled workers with advanced degrees have working conditions that are documented well by &lt;a href="https://data-workers.org/"&gt;DAIR's data workers reporting&lt;/a&gt;.&amp;#160;&lt;a class="footnote-backref" href="#fnref:3" title="Jump back to footnote 3 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:4"&gt;
&lt;p&gt;More details on what companies/organizations and individuals can do to combat this in future articles. :)&amp;#160;&lt;a class="footnote-backref" href="#fnref:4" title="Jump back to footnote 4 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="ml-memorization"></category></entry><entry><title>Blocking AI/ML Memorization with Software Guardrails</title><link href="https://blog.kjamistan.com/blocking-aiml-memorization-with-software-guardrails.html" rel="alternate"></link><published>2025-07-11T00:00:00+02:00</published><updated>2025-07-11T00:00:00+02:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2025-07-11:/blocking-aiml-memorization-with-software-guardrails.html</id><summary type="html">&lt;p&gt;One common way to control memorization in today's deep learning systems is to fix the problem by building software around it. This software can also be used to deal with other undesired behavior, like producing hate speech or mentioning criminal activities.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Are you a visual learner? There's &lt;a href="https://youtu.be/IeyB-2cS5lM"&gt;a YouTube video …&lt;/a&gt;&lt;/p&gt;&lt;/blockquote&gt;</summary><content type="html">&lt;p&gt;One common way to control memorization in today's deep learning systems is to fix the problem by building software around it. This software can also be used to deal with other undesired behavior, like producing hate speech or mentioning criminal activities.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Are you a visual learner? There's &lt;a href="https://youtu.be/IeyB-2cS5lM"&gt;a YouTube video for this article&lt;/a&gt; on the Probably Private channel.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;In this article &lt;a href="https://blog.kjamistan.com/a-deep-dive-into-memorization-in-deep-learning.html"&gt;in the series&lt;/a&gt;, you'll learn about how software around AI/deep learning models can be used and explore why these interventions provide more of a good feeling than an actual practical solution to the problem.&lt;/p&gt;
&lt;h3 id="how-an-ai-product-is-designed"&gt;How an AI product is designed&lt;/h3&gt;
&lt;p&gt;AI and deep learning models are just a tiny part of an overall system. Most of the system is deterministic software around the non-deterministic machine learning model. At an extremely high-level, this is how a Chat Assistant system might look:&lt;/p&gt;
&lt;p&gt;&lt;img alt="An example high-level architecture, where you see that the data first comes in via software with an API call, then the data goes through some sort of input processing. Then to the LLM itself (this usually now includes the tokenization as part of the LLM). Then to some output processing and then back to a piece of software with an API call to the user." src="./images/2025/llm_software_architecture.png"&gt;&lt;/p&gt;
&lt;p&gt;In the above figure, the chat messages come in from a user via an API call to software that processes the input. As you learned in exploring &lt;a href="https://blog.kjamistan.com/gaming-evaluation-the-evolution-of-deep-learning-training-and-evaluation.html"&gt;the design of a machine learning system&lt;/a&gt;, this text will be prepared for the machine learning input. Depending on the design, this could range from removing or correcting typos and grammatical errors to appending meta information from the user account or another data source. Eventually this text and any additional inputs are tokenized and sent to the AI model (via another API call).&lt;/p&gt;
&lt;p&gt;The AI model will process the tokenized input and calculate some predicted set of tokens as a response. More often than not, there is now software around this step that requests multiple possible responses. Depending on the design, the model might return the beginning of a response while the system continues calculating the next part of the response. Remember: the model will use its own response as part of the input to continue calculating the next word(s).&lt;/p&gt;
&lt;p&gt;If you have heard about topics like &lt;a href="https://huggingface.co/blog/how-to-generate#sampling"&gt;temperature&lt;/a&gt;, &lt;a href="https://huggingface.co/blog/how-to-generate#top-k-sampling"&gt;top-k&lt;/a&gt; and &lt;a href="https://huggingface.co/blog/how-to-generate#top-p-nucleus-sampling"&gt;top-p&lt;/a&gt; sampling, these are implemented in software around the model outputs, resulting in multiple queries before the final response is constructed.&lt;/p&gt;
&lt;p&gt;You don't need to learn the deep details of these sampling choices and settings; just know these are different parameters that the chat provider and/or the user can set to determine how deterministic or exploratory the response is. This creates several ways to sample longer answers and compare or explore response possibilities before determining a final response. For large models, there are other optimizations used, like potentially splitting the prediction task between a small and large model (see: speculative decoding) to improve speed.&lt;/p&gt;
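&lt;p&gt;As a small illustration (using the Hugging Face transformers generate API, with gpt2 purely as a stand-in model), these settings are just arguments passed to the sampling code that wraps the model:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The weather today is", return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,    # sample instead of greedy decoding
    temperature=0.7,   # sharpen (lower) or flatten (higher) the token distribution
    top_k=50,          # keep only the 50 most likely next tokens
    top_p=0.95,        # nucleus sampling: keep tokens covering 95% of probability mass
    max_new_tokens=40,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
&lt;/code&gt;&lt;/pre&gt;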
&lt;p&gt;Sometimes the response is fully formed, but sometimes the response can start before the final text is formulated. Either way, this response usually goes through another batch of software filters on its way back to the original user.&lt;/p&gt;
&lt;p&gt;There is a tradeoff between how much post processing you can do and the response latency, so usually these are light-touch filters and interfaces before the text reaches the user. Depending on the system this might be performed many times before the answer is fully formulated.&lt;/p&gt;
&lt;p&gt;This process starts all over again the next time the user sends a message.&lt;/p&gt;
&lt;h3 id="filtering-inputs-and-outputs"&gt;Filtering inputs and outputs&lt;/h3&gt;
&lt;p&gt;As you can probably tell from the diagram, if you want to use software to build protections against memorization you need to either:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;catch potentially harmful input before it reaches the AI model (i.e. in the input, text cleaning and tokenization step)&lt;/li&gt;
&lt;li&gt;or attempt to remove it as it is produced by the system (either as part of the testing and generation, or before it reaches the user).&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Let's explore and compare both options.&lt;/p&gt;
&lt;h4 id="prompt-rewriting"&gt;Prompt rewriting&lt;/h4&gt;
&lt;p&gt;In search engines, there's been significant research on &lt;a href="https://hughewilliams.com/2012/03/19/query-rewriting-in-search-engines/"&gt;rewriting queries&lt;/a&gt; to improve user experience, by correcting typos or expanding search terms for better results. This approach inspires the idea of prompt rewriting, where the user's interactions with the model might be modified before it hits the machine learning model.&lt;/p&gt;
&lt;p&gt;There are several motivations for rewriting prompts for better alignment with whatever the organization wants the model to do or not do. This is usually provided in a meta prompt (also called a system prompt) which describes in natural language how the model should behave and what it should or shouldn't do. You might have seen easy ways around this when the model wasn't trained to distinguish the meta prompt from user input, such as the classic &lt;a href="https://www.theverge.com/2024/7/19/24201414/openai-chatgpt-gpt-4o-prompt-injection-instruction-hierarchy"&gt;"ignore all previous instructions and ..."&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;But since models don't have any "concept" of which learned information can be used and which cannot, this type of intervention doesn't work as easily for memorization problems. Even if a company wanted to list every possible copyrighted character whose likeness should not be reproduced (i.e. "Don't show Batman"), there are easy ways to indirectly and even &lt;em&gt;unintentionally&lt;/em&gt; anchor copyrighted or otherwise memorized images/words.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://copycat-eval.github.io/"&gt;same research around copyright in generative images&lt;/a&gt; experimented with additional approaches, where prompts are tested for similarity to "forbidden" prompts and rewritten to avoid potential problems. This was explored &lt;a href="https://arxiv.org/abs/2303.13516"&gt;in related research&lt;/a&gt; that attempted to identify the forbidden "concepts" (for example: Batman) and then fine tune the model to remove the potentially problematic concept.&lt;/p&gt;
&lt;p&gt;For example, &lt;a href="https://blog.kjamistan.com/defining-privacy-attacks-in-ai-and-ml.html"&gt;a prompt like "Gotham superhero" should align closer with "superhero" and end up further from "Batman"&lt;/a&gt;. As you might guess, if implemented at scale this could be extremely expensive because you would need to find every possible term, test for memorization and then implement learning interventions. It might also not always work for the task you want it to do (i.e. which well-known superheroes aren't copyrighted?).&lt;/p&gt;
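&lt;p&gt;As a rough sketch of what such a similarity check could look like (this is my own toy illustration with a hypothetical blocklist and threshold, not the approach from the cited research), you could compare prompt embeddings against known-problematic prompts:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from sentence_transformers import SentenceTransformer, util

# Hypothetical blocklist of prompts known to anchor memorized or copyrighted content
forbidden_prompts = ["draw Batman", "show the Dark Knight on a rooftop"]

model = SentenceTransformer("all-MiniLM-L6-v2")
forbidden_embeddings = model.encode(forbidden_prompts, convert_to_tensor=True)

def too_close_to_forbidden(user_prompt, threshold=0.8):
    # Flag prompts whose embedding is very similar to any forbidden prompt
    prompt_embedding = model.encode(user_prompt, convert_to_tensor=True)
    similarity = util.cos_sim(prompt_embedding, forbidden_embeddings)
    return bool(similarity.max().ge(threshold).item())

print(too_close_to_forbidden("Gotham superhero in the rain"))
&lt;/code&gt;&lt;/pre&gt;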
&lt;h4 id="in-context-unlearning"&gt;In-context Unlearning&lt;/h4&gt;
&lt;p&gt;In-context learning (sometimes also called few-shot learning) is a common prompt engineering strategy where you type extra instructions and examples into the prompt to demonstrate the task or how you'd like it to respond. In-context or few-shot learning allows users to introduce a new concept or pattern to a general-purpose LLM on the fly by showing a few examples and then asking the model to complete the next item in the sequence.&lt;/p&gt;
&lt;p&gt;For example, you could give a list of sentences, each followed by the language it was written in, and then upload a document and ask the model to return each of its sentences with the detected language written next to it.&lt;/p&gt;
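&lt;p&gt;A hypothetical few-shot prompt for that language task might look something like this (the model is expected to continue the pattern on the last line):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;few_shot_prompt = (
    "Classify the language of each sentence.\n"
    "Sentence: Das Wetter ist heute schön. Language: German\n"
    "Sentence: Il fait beau aujourd'hui. Language: French\n"
    "Sentence: The weather is nice today. Language:"
)
&lt;/code&gt;&lt;/pre&gt;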
&lt;p&gt;In-context learning has been used alongside prompt rewriting as a way to "unlearn" concepts. &lt;a href="https://arxiv.org/abs/2310.07579"&gt;In-context "unlearning"&lt;/a&gt; modifies the user prompt to replace data points that should be forgotten with "dummy labels". This only scales if the forget-set is quite small and the concept is easily defined and filtered. Also it won't work as well for things that don't easily mold into an in-context prompt setting (i.e. freeform conversations). In other &lt;a href="https://arxiv.org/abs/2309.17410"&gt;research on data removal from models&lt;/a&gt;, this type of in-context or input rewriting was proven ineffective at reducing training data exfiltration.&lt;/p&gt;
&lt;p&gt;Doing in-context unlearning at scale successfully would mean being able to accurately determine that the user is performing an attack or that the prompt would unintentionally release memorized information. But because model developers aren't currently testing for memorization, architectures, training and evaluation would still need to be modified to cover this input or output testing.&lt;/p&gt;
&lt;p&gt;How could this type of rewriting or filtering work on the outputs instead of the inputs?&lt;/p&gt;
&lt;h4 id="research-and-applications-in-output-filtering"&gt;Research and applications in output filtering&lt;/h4&gt;
&lt;p&gt;Because filtering inputs is fairly difficult, in today's largest AI systems memorization testing is done via unsophisticated output filters. These filters only exist for certain systems and generally test if the model response directly matches training data that should not be output.&lt;/p&gt;
&lt;p&gt;For example, &lt;a href="https://docs.github.com/en/copilot/managing-copilot/managing-copilot-as-an-individual-subscriber/managing-copilot-policies-as-an-individual-subscriber#enabling-or-disabling-duplication-detection"&gt;GitHub's Copilot can test if the generated code directly matches publicly accessible code&lt;/a&gt;. To avoid unnecessary latency, this is usually done via an advanced hashing memory structure, so exact matches are found quickly and the false positive rate remains low.&lt;/p&gt;
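&lt;p&gt;As a toy illustration of exact-match filtering (this is my own simplification, not GitHub's actual system), you can hash normalized snippets of public code at index time and check generated code against that set:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import hashlib

def normalize(code):
    # Collapse whitespace so trivial formatting changes still match
    return " ".join(code.split())

def build_index(public_snippets):
    # Hash each normalized snippet once when the index is built
    return {hashlib.sha256(normalize(s).encode()).hexdigest() for s in public_snippets}

def is_verbatim_match(generated_code, index):
    return hashlib.sha256(normalize(generated_code).encode()).hexdigest() in index
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note that any change to the identifiers (say, renaming variables to French) produces a different hash, which is exactly the kind of evasion you'll see later in this article.&lt;/p&gt;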
&lt;p&gt;From the Copilot documentation, this is the description of the intervention.&lt;sup id="fnref:1"&gt;&lt;a class="footnote-ref" href="#fn:1"&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Copilot code referencing searches for matches by taking the code suggestion, plus some of the code that will surround the suggestion if it is accepted, and comparing it against an index of all public repositories on GitHub.com. Code in private GitHub repositories, or code outside of GitHub, is not included in the search process. The search index is refreshed every few months. As a result, newly committed code, and code from public repositories deleted before the index was created, may not be included in the search. For the same reason, the search may return matches to code that has been deleted or moved since the index was created.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This explains the &lt;a href="https://arstechnica.com/information-technology/2025/02/copilot-exposes-private-github-pages-some-removed-by-microsoft/"&gt;recent problems where private repository code was exposed&lt;/a&gt;: that code had already been memorized, yet was no longer being checked by the output filters. Depending on the index updates, this could also apply to code you might have deleted -- for example, if you found that you accidentally exposed a secret (like a key or password) or other potentially sensitive details (i.e. exposure to libraries or systems with known vulnerabilities or environment settings).&lt;/p&gt;
&lt;p&gt;Additional interventions can test visual output, such as asking a different machine learning model "is Batman in this image?" and block outputs that find undesirable memorized content in the output. As you might imagine, this is very difficult to scale, but might work for smaller models and a small subset of data or tasks.&lt;/p&gt;
&lt;p&gt;It is likely that larger LLMs including ChatGPT use some of these output filters to block certain undesired responses (i.e. Terms of Service violations) or to comply with right-to-be-forgotten requests. For example, in recent news, &lt;a href="https://www.cnet.com/tech/services-and-software/chatgpt-wont-answer-questions-about-certain-names-heres-what-we-know/"&gt;ChatGPT wouldn't respond when the response contained specific people's names&lt;/a&gt;, which seems like a clear sign of an output-filter rather than a concept-unlearning intervention.&lt;/p&gt;
&lt;h3 id="you-can-only-catch-what-you-definitely-know"&gt;You can only catch what you definitely know&lt;/h3&gt;
&lt;p&gt;The problem is that you can only really do this efficiently if you know what you are looking for and if it scales appropriately. Since very few companies test for memorization as a part of their model evaluation, it's also unknown internally how much memorization happens. If users can adjust settings like temperature or other parameters to shift model behavior at will, this would also change the produced content, making the problem even more non-deterministic than it already is.&lt;/p&gt;
&lt;p&gt;For software teams trying to develop these interventions, it's like you're building a box to fit an object in, but nobody has told you what the object is. You're building based on vibes, not based on facts and knowledge.&lt;/p&gt;
&lt;p&gt;If rigorous testing for privacy violations and memorization happens as part of the model training and evaluation, then you start from this basic understanding and likely both build better protections and train models with fewer issues.&lt;/p&gt;
&lt;p&gt;Unsurprisingly, software-based filters are easy to bypass. Any motivated attacker can easily sidestep things like prompt rewriting with their own prompt engineering (more on this in the next article).&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=Ohl5AGUOLXk&amp;amp;ab_channel=MITCBMM"&gt;Chiyuan Zhang&lt;/a&gt; presented several easy methods for bypassing the GitHub Copilot output filters (originally published &lt;a href="(https://arxiv.org/abs/2210.17546)"&gt;in research Ippolito et al.&lt;/a&gt;). By changing the variable names to French or adding comment markers to start the line, previously undesired code output was output because the hashing memory architecture didn't catch the similarities.&lt;sup id="fnref:2"&gt;&lt;a class="footnote-ref" href="#fn:2"&gt;2&lt;/a&gt;&lt;/sup&gt; This image shows how Ippolito et al.'s attack to produce a previously blocked function description by changing the variable names to French.&lt;/p&gt;
&lt;p&gt;&lt;img alt="An example block of code that shows variable names in French and then many lines of previously blocked code." src="./images/2025/zhang_french_copilot.png"&gt;&lt;/p&gt;
&lt;p&gt;This same research group found that models would at times perform "style transfer" on memorized text by changing spacing, language or writing style even when not prompted by the user to do so, again showing that near-memorization testing (or paraphrase testing) might be necessary to catch these types of responses.&lt;/p&gt;
&lt;p&gt;Determining that someone or something is in the training data and has been memorized is easy to perform when these output filters are on, as they are a direct indicator. Just like the ChatGPT example that (likely) exposed that a person had requested their data be deleted, these blocked answers leak information.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://arxiv.org/abs/2309.05610"&gt;Debenedetti et al.&lt;/a&gt; named these types of information leaks "side channels" -- borrowing the term from cybersecurity where an attacker can extract sensitive information by observing changes in outputs or related side channels (often by observing attributes like latency, response content or other signals).&lt;/p&gt;
&lt;p&gt;In this case, the side channel is as simple as producing prompts that generate a generic response (like, "I can't help you with that.") or generate a specifically different type of response (i.e. empty response, shortened response or fundamentally divergent response).&lt;/p&gt;
&lt;p&gt;In information security, this falls under the concept of &lt;a href="https://en.wikipedia.org/wiki/Non-interference_(security)"&gt;non-interference&lt;/a&gt;. This concept is easy to see with forgotten passwords. If a password reset form says "We emailed you your password" when an email is found, but "This user doesn't exist, please create an account" when it isn't, then the response leaks potentially sensitive information about whether the person has an account or not.&lt;/p&gt;
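&lt;p&gt;Sketched as code (a made-up example, with send_reset_email as a hypothetical helper), the difference between a leaky and a non-interfering response looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def leaky_password_reset(email, registered_emails):
    # Different responses reveal whether an account exists -- an information side channel
    if email in registered_emails:
        return "We emailed you a password reset link."
    return "This user doesn't exist, please create an account."

def non_interfering_password_reset(email, registered_emails):
    # Identical response either way, so the reply reveals nothing about account existence
    if email in registered_emails:
        send_reset_email(email)  # hypothetical helper that actually sends the mail
    return "If an account exists for this address, we sent a password reset link."
&lt;/code&gt;&lt;/pre&gt;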
&lt;p&gt;In conclusion, the output and input filter examples you've read about in this article leak information about which prompts and outputs are allowed (and which are not). Via a variety of clever prompts, these rudimentary safeguards are easy to evade. For this reason, software-based filters are not an appropriate intervention for problems like memorization.&lt;/p&gt;
&lt;p&gt;In the next article, you'll investigate fine-tuned guardrails and other training-based alignment methods to determine if they are a valid solution to this problem.&lt;/p&gt;
&lt;p&gt;As always, I'm very open to feedback (positive, neutral or critical), questions (there are no stupid questions!) and creative re-use of this content. If you have any of those, please share it with me! This helps keep me inspired and writing. :)&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Acknowledgements&lt;/em&gt;: I would like to thank &lt;a href="https://desfontain.es/serious.html"&gt;Damien Desfontaines&lt;/a&gt; and &lt;a href="https://vickiboykis.com/"&gt;Vicki Boykis&lt;/a&gt; for their feedback, corrections and thoughts on this article. Any mistakes, typos, inaccuracies or controversial opinions are my own.&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;Accessed on 5 March, 2025.&amp;#160;&lt;a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;You can now just turn these filters off in GitHub in your settings, and this option is likely to surface in other systems where these settings are not public-facing.&amp;#160;&lt;a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="ml-memorization"></category></entry><entry><title>Defining Privacy Attacks in AI and ML</title><link href="https://blog.kjamistan.com/defining-privacy-attacks-in-ai-and-ml.html" rel="alternate"></link><published>2025-06-12T00:00:00+02:00</published><updated>2025-06-12T00:00:00+02:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2025-06-12:/defining-privacy-attacks-in-ai-and-ml.html</id><summary type="html">&lt;p&gt;In &lt;a href="https://blog.kjamistan.com/a-deep-dive-into-memorization-in-deep-learning.html"&gt;this article series&lt;/a&gt;, you've been able to investigate memorization in AI/deep learning systems -- often via interesting attack vectors. In security modeling, it's useful to explicitly define the threats you are defending against, so you can both discuss and address them and compare potential interventions.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Prefer to learn by …&lt;/p&gt;&lt;/blockquote&gt;</summary><content type="html">&lt;p&gt;In &lt;a href="https://blog.kjamistan.com/a-deep-dive-into-memorization-in-deep-learning.html"&gt;this article series&lt;/a&gt;, you've been able to investigate memorization in AI/deep learning systems -- often via interesting attack vectors. In security modeling, it's useful to explicitly define the threats you are defending against, so you can both discuss and address them and compare potential interventions.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Prefer to learn by video? This post &lt;a href="https://youtu.be/v9McFcYahpg"&gt;is summarized on Probably Private's YouTube&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;In this article, you'll walk through two common attack vectors that exploit memorization in AI systems: membership inference and data reconstruction (or exfiltration).&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;strong&gt;NOTE&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This article is specifically about privacy attacks related to memorization, but the field of AI security is much larger and broader. Red teaming and other security testing for AI models are a common approach for companies releasing models into production systems. The field of &lt;a href="https://en.wikipedia.org/wiki/Adversarial_machine_learning"&gt;adversarial machine learning&lt;/a&gt; explores how AI/ML models can be hacked, tricked and manipulated. It's essential to understand how stochastic systems will behave when attacked or under unexpected conditions to ensure that the deployment is adequately protected or that humans who interact with the system receive appropriate training for handling unexpected or erroneous events.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id="what-is-a-membership-inference-attack"&gt;What is a membership inference attack?&lt;/h3&gt;
&lt;p&gt;A membership inference attack (MIA) attempts to &lt;em&gt;infer&lt;/em&gt; if a person (or particular example) was in the training data or not. It was first named by &lt;a href="https://arxiv.org/abs/1610.05820"&gt;Shokri et al.'s work in 2016&lt;/a&gt;, where the researchers were able to determine which examples were in-group (training data) and which ones were not. The original attack developed a system of shadow models that were similar to the target model. The outputs of these shadow models were used to train another model to discriminate between in-training and out-of-training examples.&lt;/p&gt;
&lt;p&gt;Since the initial attack definition, there have been a variety of improvements -- creating targeted variants to adaptively expose particular training data points or variants that attack correlated groups of data points. Several related attacks can &lt;a href="https://dl.acm.org/doi/abs/10.1145/3154793"&gt;expose sensitive attributes of individuals by revealing which subpopulations they belong to&lt;/a&gt; or teach attackers about &lt;a href="https://ieeexplore.ieee.org/abstract/document/9581166"&gt;overall training data populations and their qualities&lt;/a&gt;, which could be exploited to perform better MIAs.&lt;/p&gt;
&lt;p&gt;Why does this work? If a model memorizes a particular example, it will return noticeably higher confidence on that data point than on similar data points it hasn't seen. If these examples are infrequent or rare (i.e. in the long tail), then they are overexposed compared to other examples, which can "hide in the crowd". As you already learned, larger and more accurate models display this problem more often than smaller and less accurate models.&lt;/p&gt;
&lt;p&gt;To illustrate this, researchers trained multiple models on different splits of data. They then found ways to show the inlier- versus outlier-ness of particular examples by comparing how models performed on them depending on whether the example was part of the training data or not. For outlier examples that weren't in the training data, the models performed quite poorly. Even for inliers, if the example was more complex (i.e. harder to learn in one training round), then the loss (i.e. prediction error) on that example leaked more information than if it was easy to learn.&lt;/p&gt;
&lt;p&gt;This figure shows a view of the model's prediction accuracy via cross-entropy loss when comparing images in the training dataset with images outside of the training dataset. As a quick reminder, &lt;a href="https://en.wikipedia.org/wiki/Cross-entropy"&gt;cross-entropy&lt;/a&gt; is a common way to measure performance for a classification model: it calculates how far the predicted probabilities are from the true label.&lt;/p&gt;
&lt;p&gt;If you were performing a privacy attack, you'd want to find a way to separate the member distributions in red from the non-member distributions in blue. For outliers, this is much easier! And for more complex examples that are "harder to learn" this is also easier.&lt;sup id="fnref:1"&gt;&lt;a class="footnote-ref" href="#fn:1"&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;&lt;img alt="There are 4 charts plotting &amp;quot;member&amp;quot; versus &amp;quot;non-member&amp;quot; distributions taken from model outputs. On the top left, it shows an &amp;quot;easy to learn&amp;quot; (i.e. low loss) inlier example, where the member and non-member distributions nearly perfectly overlap. On the top right, it shows another easy to learn example (dogs), but an outlier image example -- in this case the losses are much easier to separate. In the bottom row it shows harder to learn classes (i.e. higher loss). In the &amp;quot;inlier&amp;quot; example, the distributions have differing tails, but also significant overlap. In the &amp;quot;outlier&amp;quot; example, the distributions have no overlap at all and are clear to separate." src="./images/2025/ce_loss_inlier_outlier.png"&gt;&lt;/p&gt;
&lt;p&gt;This problem is exacerbated when model size grows and when those models are trained with datasets where one large data collection, such as ImageNet, is used to pull both testing and training data. Unfortunately, as shown in &lt;a href="https://blog.kjamistan.com/how-memorization-happens-novelty.html"&gt;the previous article&lt;/a&gt;, these datasets often have duplicates or near-duplicates and this ends up leaking additional information that incentivizes memorization. As you can imagine, this doesn't just affect image datasets -- the internet is full of near duplicate or exact duplicate text and other content forms (i.e. video/audio/etc).&lt;/p&gt;
&lt;p&gt;This figure plots different model architectures, where the y-axis shows accurate attack successes and the x-axis shows the model's performance on the holdout test examples. As you've already learned throughout this series, performing well on a test-dataset from a long-tail distribution and many outliers likely requires some element of memorization.&lt;sup id="fnref:2"&gt;&lt;a class="footnote-ref" href="#fn:2"&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;&lt;img alt="The chart shows different color circles for different model architectures (a variety of CNNs and WRNs). For each architecture you see that the higher the accuracy gets, the higher the attack success." src="./images/2025/lira_tpr_test_accuracy.png"&gt;&lt;/p&gt;
&lt;p&gt;Combining these two pieces of knowledge (i.e. the likelihood that an example is an outlier and the confidence in the model to guess it properly) is a good way to infer this membership.&lt;/p&gt;
&lt;p&gt;Both of the above figures come from the &lt;a href="https://arxiv.org/abs/2112.03570"&gt;Likelihood Ratio Attack (Carlini et al., 2022)&lt;/a&gt;, which is specifically designed to minimize false positives (i.e. alerting that something is a member when it is not).&lt;sup id="fnref:3"&gt;&lt;a class="footnote-ref" href="#fn:3"&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;The steps to perform this attack are as follows:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Sample from a dataset similar to the dataset you think was used by the model you want to attack (the target model). Create data subsets, so that each example in the dataset is seen (and "not seen") by some of the models you will train. These are called shadow models, because they act as a stand-in for the target model.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Train your shadow models keeping track of which models have seen which examples. This creates sets of in- and out-shadow models, where the in-models have seen the example and the out-models have not. Try to match the model architecture and task of your target model as accurately as possible for best results.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Measure the model outputs (i.e. prediction accuracy / loss) on a particular example and scale that loss by using a logit function. Repeat steps 1-3 numerous times and try to cover a large swath of the training data, with a variety of classes and examples. Store the scaled losses with notes on the example they were measured with and whether the model had seen the example or not (in-versus-out shadow model).&lt;sup id="fnref:4"&gt;&lt;a class="footnote-ref" href="#fn:4"&gt;4&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;After many training iterations, once you have a representative set of scaled losses on a variety of examples, analyze the scaled losses of the in-versus-out shadow models. You will hopefully have two distributions that don't completely overlap. Similar to the other CE-loss distributions you saw above, you want to separate these two distributions so you can tell whether something is in the training data based on the loss.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Query the target model with the target example. Scale the returned prediction and use probability theory to figure out if the example comes from the "in-training" or "out-of-training" distributions.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;There are additional suggestions and improvements to this attack based on model architecture and datasets, which you can review &lt;a href="https://arxiv.org/abs/2112.03570"&gt;in the paper&lt;/a&gt; and the related &lt;a href="https://github.com/tensorflow/privacy/tree/master/research/mi_lira_2021"&gt;code repository&lt;/a&gt;.&lt;/p&gt;
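&lt;p&gt;To make steps 3-5 concrete, here's a heavily simplified sketch -- not the paper's implementation, and the losses, helper names and Gaussian assumption are purely illustrative. Given scaled losses collected from in- and out-shadow models for one target example, fit a distribution to each and compare how likely the target model's own loss is under each one.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np
from scipy.stats import norm

def logit_scale(loss):
    # Scale a per-example loss so its distribution is closer to Gaussian
    # (p is roughly the model's confidence in the true label).
    p = np.exp(-np.asarray(loss))
    p = np.clip(p, 1e-6, 1 - 1e-6)
    return np.log(p / (1 - p))

# Hypothetical scaled losses for one target example, gathered from shadow
# models that did ("in") and did not ("out") train on that example.
in_scores = logit_scale([0.05, 0.08, 0.04, 0.07])
out_scores = logit_scale([0.9, 1.1, 0.7, 1.3])

def membership_likelihood_ratio(target_loss):
    # Fit a Gaussian to each set of shadow scores and compare how likely the
    # observed (scaled) loss from the target model is under each distribution.
    score = logit_scale([target_loss])[0]
    p_in = norm.pdf(score, in_scores.mean(), in_scores.std() + 1e-6)
    p_out = norm.pdf(score, out_scores.mean(), out_scores.std() + 1e-6)
    return p_in / p_out  # ratios well above 1 suggest "member"

print(membership_likelihood_ratio(0.06))  # likely a training member
print(membership_likelihood_ratio(1.0))   # likely not
&lt;/code&gt;&lt;/pre&gt;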
&lt;p&gt;Related variants of these attacks can also directly expose the training data by guessing close to the exposed data point and optimizing the input until it reaches the memorized input. This can show what data is in the training and what data isn't.&lt;/p&gt;
&lt;p&gt;If I can learn something about you by knowing you were in the training data, then I can use this information for my own benefit. For example, if I know you have a particular disease, or visit a certain website often or if I know your immigration status or income level because your data was in a model that only represents people like that -- then this is all extra information I get from a membership inference attack.&lt;/p&gt;
&lt;p&gt;These attack vectors overlap with data reconstruction attacks, because if I know you are in the training data, I can also attempt to extract your data directly.&lt;/p&gt;
&lt;h3 id="what-is-a-data-reconstruction-attack"&gt;What is a data reconstruction attack?&lt;/h3&gt;
&lt;p&gt;Data reconstruction attacks attempt to &lt;em&gt;discover&lt;/em&gt;, &lt;em&gt;reconstruct&lt;/em&gt; and &lt;em&gt;exfiltrate&lt;/em&gt; the training data. As you might guess, this works better if the data was memorized!&lt;/p&gt;
&lt;p&gt;If combined with membership inference attacks, these two attacks can first determine which data probably exists in the model, and then either recreate that data and test its veracity or attempt to exfiltrate the memorized data from the model itself.&lt;/p&gt;
&lt;p&gt;As you learned about in &lt;a href="https://blog.kjamistan.com/how-memorization-happens-repetition.html"&gt;the article on repetition as a source of memorization&lt;/a&gt;, this can mean a full exfiltration of heavily repeated examples, which is easy to do if the example is common enough and has been memorized. In a way, this is expected data reconstruction, where we want to learn common information (i.e. a widely known text or celebrity face).&lt;/p&gt;
&lt;p&gt;And as you read in &lt;a href="https://blog.kjamistan.com/how-memorization-happens-novelty.html"&gt;the article about novelty as a source of memorization&lt;/a&gt;, these attacks can also directly expose less frequent examples, particularly outliers or infrequent examples. This might mean accidentally learning personal text data, like &lt;a href="https://arxiv.org/abs/1802.08232"&gt;social security numbers, credit card numbers and home addresses&lt;/a&gt; that can then be extracted either by querying the model itself, or by a targeted attack. There are variants of these attacks that use both "white box" (i.e. direct testing of the model with a view of the model's internal state) and "black box" (i.e. API access) methods.&lt;/p&gt;
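&lt;p&gt;As a rough illustration of the black-box variant, an extraction probe can be as simple as feeding the model a plausible prefix and checking whether the continuation reproduces a candidate secret verbatim. Everything here is hypothetical -- &lt;code&gt;generate&lt;/code&gt; stands in for whatever completion API you're testing, and the strings are invented.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def generate(prompt, max_tokens=32):
    # Hypothetical stand-in for a black-box completion API (in practice, an
    # HTTP call). Here it just pretends the model memorized the record.
    return "0000 1111 2222 3333, exp 09/27"

# A prefix the attacker can plausibly guess (e.g. from a known data format).
prefix = "Name: Jane Doe, Card number: "
candidate_secret = "0000 1111 2222 3333"

completion = generate(prefix)

# A verbatim hit means the model leaked training data; near-verbatim checks
# would additionally catch "style transfer" of memorized content.
if candidate_secret in completion:
    print("verbatim leak detected")
&lt;/code&gt;&lt;/pre&gt;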
&lt;p&gt;A related but contested variation of data reconstruction involves paraphrased information. Here, the model outputs are compared with their training data to discover partially verbatim and paraphrased content. For visual content, these "paraphrasing" attacks can look at portions of the image or video and determine if particular features come from particular training data examples.&lt;sup id="fnref:5"&gt;&lt;a class="footnote-ref" href="#fn:5"&gt;5&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;For example, the below images are not exact duplicates, but are clearly near-duplicates. For each pair, the left images are from the training data, and the right images are generated by prompting Midjourney with the training data caption. &lt;a href="https://arxiv.org/abs/2305.08694"&gt;This research from Webster (2023)&lt;/a&gt; unveiled efficient and accurate ways to reconstruct training data from diffusion models.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A set of 4 pairs of images, ranging from a backpack to a person's face. Each of the images looks almost the same as the other image, with slight changes in image clarity, dimension or small color changes." src="./images/2025/image_paraphrasing.png"&gt;&lt;/p&gt;
&lt;p&gt;For content creators or artists, this type of attack might be important if their content is particularly popular or interesting for particular AI/LLM audiences.&lt;/p&gt;
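&lt;p&gt;For text outputs, one rough way to flag this kind of near-verbatim or paraphrased reuse (rather than only exact matches) is to compare token n-grams between a generation and a training document and send anything with high overlap to a human reviewer. This is a simplification of the comparison methods used in the research above, with invented example strings.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def ngrams(text, n=5):
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_overlap(generated, training_doc, n=5):
    # Jaccard overlap between n-gram sets: 0.0 means disjoint, 1.0 identical.
    a, b = ngrams(generated, n), ngrams(training_doc, n)
    if not a or not b:
        return 0.0
    return len(a.intersection(b)) / len(a.union(b))

training_doc = "the quick brown fox jumps over the lazy dog near the riverbank"
paraphrase = "the quick brown fox jumps over the lazy dog by the riverbank"
print(ngram_overlap(paraphrase, training_doc))  # high overlap, flag for review
&lt;/code&gt;&lt;/pre&gt;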
&lt;p&gt;Let's walk through these two main attack vectors and their variations in real-world threats where an organization or individual might be concerned about their information being memorized.&lt;/p&gt;
&lt;h3 id="threat-outputting-copyright-material-or-images"&gt;Threat: Outputting copyright material or images&lt;/h3&gt;
&lt;p&gt;One obvious initial threat is explicitly copying copyrighted content and outputting it with little or no variation. This has already sparked several prominent lawsuits, such as &lt;a href="https://www.courthousenews.com/wp-content/uploads/2023/12/new-york-times-microsoft-open-ai-complaint.pdf"&gt;The New York Times vs. OpenAI&lt;/a&gt;, where ChatGPT outputs verbatim copies of popular New York Times articles without attribution.&lt;/p&gt;
&lt;p&gt;Researchers have also been actively reproducing these attacks in visual images, where copyrighted characters or visuals can be easily reproduced, even without directly invoking the name. For example, &lt;a href="https://copycat-eval.github.io/"&gt;typing "Gotham, Superhero"&lt;/a&gt; produces copies of Batman.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Two images that are clearly Batman, both generated using the prompt (&amp;quot;Gotham, Superhero&amp;quot;), one from PlaygroundAI and the other from DALL-E." src="./images/2025/gotham_superhero.png"&gt;&lt;/p&gt;
&lt;p&gt;This affects other mediums, such as music and video content as those AI models become easier to use and more widely available. There have already been highly publicized examples of &lt;a href="https://www.bbc.com/news/business-57761873"&gt;voice and video-cloning use for criminal activities&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;For organizations and persons whose income or existence relies primarily on producing and licensing copyrighted content, this is certainly a very serious threat -- one that deserves attention and discussion in a public forum.&lt;/p&gt;
&lt;h3 id="threat-violating-someones-privacy-by-directly-outputting-their-information"&gt;Threat: Violating someone's privacy by directly outputting their information&lt;/h3&gt;
&lt;p&gt;One drastic case is the either intentional or unintentional release of a person's sensitive information. This information could be their face, their words taken out of context, their contact information or other information they would rather not share via a machine learning model.&lt;/p&gt;
&lt;p&gt;This is documented in research, where ChatGPT directly &lt;a href="https://www.vice.com/en/article/chatgpt-can-reveal-personal-information-from-real-people-google-researchers-show/"&gt;output personal contact information&lt;/a&gt;, where &lt;a href="https://stable-diffusion-art.com/realistic-people/"&gt;StableDiffusion can reproduce a person's face&lt;/a&gt; and where &lt;a href="https://www.usenix.org/conference/usenixsecurity19/presentation/carlini"&gt;models trained on sensitive keyboard data output that information&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;There is currently no reporting requirement for companies providing machine learning models to test, audit or verify this behavior. A more comprehensive understanding could be achieved by institutionalizing privacy auditing and reporting for deep learning models, which could be standardized and enforced by a regulatory body. This could accompany other monitoring and testing for privacy rights related to hallucination, like when a model &lt;a href="https://noyb.eu/sites/default/files/2024-04/OpenAI%20Complaint_EN_redacted.pdf"&gt;repeats outdated and incorrect information&lt;/a&gt; or &lt;a href="https://www.bbc.com/news/articles/c0kgydkr516o"&gt;hallucinates things that never happened&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;For well-known persons, this threat extends to uses like DeepFakes or other clones, where their likeness is used in ways that they did not consent to, such as the rise of DeepFake-Pornography and DeepFake-Propaganda. Combining these attacks with other software like "face transfer" makes these violations easier to do with enough examples of the person's likeness/voice (i.e. this could be performed by a person close to the person, or a person with access to enough photo or video materials of the person).&lt;/p&gt;
&lt;h3 id="threat-learning-if-someones-data-is-in-a-model-or-if-someone-is-in-a-particular-population"&gt;Threat: Learning if someone's data is in a model or if someone is in a particular population&lt;/h3&gt;
&lt;p&gt;Membership Inference Attacks can reveal information about someone's participation in a particular service, population or activity. That might seem harmless at first -- who cares if someone knows that OpenAI scraped my data?&lt;/p&gt;
&lt;p&gt;For LLMs, it's not as relevant, but what about models that specifically target a certain population: like a model built to evaluate people with a disease for medical treatment, people with a certain income for advertising or credit evaluation or people with a particular political view for political ads or border control? These models exist, and membership in those models would expose related sensitive information that people likely don't want to share out of context.&lt;/p&gt;
&lt;p&gt;Related attacks on subpopulations rather than individuals can also reveal information about the subpopulation that exposes that group's sensitive attributes. This is similar to the Cambridge Analytica attacks, where &lt;a href="https://www.gsb.stanford.edu/insights/science-behind-cambridge-analytica-does-psychological-profiling-work"&gt;"harmless" information about liking Facebook Pages provided enough information&lt;/a&gt; to expose related sensitive attributes like gender, political affiliation, drug use and sexual preferences.&lt;/p&gt;
&lt;h3 id="threat-stealing-someones-work-without-attribution-or-compensation"&gt;Threat: Stealing someone's work without attribution or compensation&lt;/h3&gt;
&lt;p&gt;Another real-world problem for organizations is directly repeating someone's work without appropriate attribution. For example, &lt;a href="https://arstechnica.com/information-technology/2025/02/copilot-exposes-private-github-pages-some-removed-by-microsoft/"&gt;security researchers found private repository code from several FAANG-companies&lt;/a&gt; available in Copilot. The memorized code from the repositories was accessed when those repositories were public, but since then the repositories have been changed to private and the models haven't been updated.&lt;/p&gt;
&lt;p&gt;This isn't the same as the copyright issue, because there is quite a bit of content that isn't explicitly under copyright, but is intended to create awareness, wealth and recognition for the original creator. For example, in creative and software communities, there are popular licenses that require attribution or even specify under what conditions the work can be reused or remixed. When the author, coder, artist is cited by someone who remixes or reuses their work, this creates more awareness, building their audience or giving them new opportunities.&lt;/p&gt;
&lt;p&gt;Because of this, there's been increased awareness of Generative AI in artist communities; many artists are open to their art being used by AI systems, but would like attribution, compensation or both.&lt;/p&gt;
&lt;p&gt;In a recent example that overlaps with copyright protection, &lt;a href="https://techcrunch.com/2025/02/24/1000-artists-release-silent-album-to-protest-uk-copyright-sell-out-to-ai/"&gt;artists released a "silent" album&lt;/a&gt; to protest the UK's proposal to not enforce copyright for AI-generated work. There have been many such protests over the past 5 years.&lt;/p&gt;
&lt;h3 id="threat-overexposing-certain-populations-to-the-above-attacks"&gt;Threat: Overexposing certain populations to the above attacks&lt;/h3&gt;
&lt;p&gt;As you learned in the &lt;a href="https://blog.kjamistan.com/how-memorization-happens-novelty.html"&gt;how it works (novel examples) article&lt;/a&gt;, some examples are more beneficial to memorize for the model's evaluation scores (i.e. accuracy on related difficult examples). This means these examples are also memorized at a higher frequency than other examples.&lt;/p&gt;
&lt;p&gt;This might seem innocuous, but if you investigate these data distributions, you'll find that these underrepresented groups in those populations are overexposed when it comes to privacy. Research from &lt;a href="https://arxiv.org/pdf/1905.12101"&gt;Bagdasaryan and Shmatikov&lt;/a&gt; proved that models trained with differential privacy did poorly on fairness metrics across diverse groups. For example, privacy-respecting models performed worse on sentiment analysis on African-American English versus a "Standard" American English dataset. In the same research, privacy-respecting models misclassified gender for dark-skinned faces more frequently than light-skinned faces.&lt;/p&gt;
&lt;p&gt;This demonstrates how increases in model fairness and accuracy for subpopulations are directly related to specific memorization of individuals whose data comes from that subpopulation. Put differently, certain persons in an underrepresented group give up their privacy in exchange for better model accuracy on "data like them". This overexposes individuals in this group when compared with other persons from a majority population in the dataset who can "hide in the crowd".&lt;/p&gt;
&lt;p&gt;In larger machine learning datasets, this problem is exacerbated by human biases in labels, where a white man might be labeled "man" and a black woman is labeled "black woman". This labeling problem occurs any time the people doing the labeling assume that their own group represents the general population. This label bias exacerbates the memorization problem, because each label must be learned separately, and many of these "non-majority" labels will end up in the long tail and be more prone to memorization.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://arxiv.org/abs/1907.00164"&gt;Shokri et al. investigated this issue when looking at explanations&lt;/a&gt; for deep learning systems and found that data reconstruction attacks worked more easily on minority populations when using model explanations on the same or similar examples. &lt;a href="https://arxiv.org/abs/2011.03731"&gt;Chang and Shokri&lt;/a&gt; formalized this privacy issue for minority populations in other works, proving that minority populations are at greater privacy risk, especially when fair algorithm and model design do not take privacy into account.&lt;/p&gt;
&lt;h3 id="threat-exposing-critical-knowledge-from-training-data-unintentionally"&gt;Threat: Exposing critical knowledge from training data unintentionally&lt;/h3&gt;
&lt;p&gt;Another memorization issue is the ability of these systems to memorize corporate secrets, important legal contracts or otherwise confidential information. Because the model is not incentivized to understand the difference between text, photos or other media that should be learned versus other material that shouldn't, this creates a significant problem for organizations with confidential material they'd like to use for machine learning.&lt;/p&gt;
&lt;p&gt;For example, after the launch of ChatGPT-3.5, Amazon's legal department found &lt;a href="https://futurism.com/the-byte/amazon-begs-employees-chatgpt"&gt;text snippets that shared internal corporate secrets&lt;/a&gt; in the chat model's responses.&lt;/p&gt;
&lt;p&gt;This can also unintentionally happen when building systems with access to such documents -- even if they haven't been trained on that data. These exposures have little to do with AI memorization and more to do with lack of privacy and security understanding in Retrieval Augmented Generation system design.&lt;/p&gt;
&lt;h3 id="should-you-be-concerned"&gt;Should you be concerned?&lt;/h3&gt;
&lt;p&gt;As you've seen thus far, the only reason to be worried that sensitive data will be stored in the model is if you are training the model on data that you don't want explicitly memorized. If you don't train with copyrighted, licensed or person-related data, these attacks aren't a threat.&lt;/p&gt;
&lt;p&gt;If you are using corporate proprietary or internal data and you are only using the model internally, this probably isn't an issue, so long as the model outputs are also considered "for internal use only". As usual, talk with your legal and privacy teams to clarify.&lt;/p&gt;
&lt;p&gt;If teams or individuals are training their own models (i.e. personal or collaborative-based models) and they all consent to this training and co-own this model, this might not be a problem if the data is available for use across the entire company. In my experience, those teams should discuss and be aware of memorization, but they presumably enthusiastically consent to the use and development.&lt;/p&gt;
&lt;p&gt;So really, you should only be concerned about this phenomenon if you are training models that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;use people's data without their enthusiastic consent and knowledge&lt;/li&gt;
&lt;li&gt;are used or deployed in new contexts (i.e. for a new purpose / in the public sphere / for something that those people wouldn't agree to)&lt;/li&gt;
&lt;li&gt;don't address the privacy/content implications as part of model design and development&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You might think: that's got to be a TINY percentage of models, but my industry experience confirms that this describes a non-trivial number of AI systems, including many of the LLMs, Code Assistants and AI Agents.&lt;/p&gt;
&lt;p&gt;You might also be wondering about the implications if you use potentially at-risk models but don't train them. This puts you in a difficult position of not influencing the model development but potentially being exposed to the same threats above. Being aware of these threats is a good step when evaluating what systems to integrate for what tasks, and there will be a future article in this series on addressing exactly this situation.&lt;/p&gt;
&lt;p&gt;Now that you've identified the biggest potential threats, let's begin investigating ways to address these threats. In the following articles, you'll learn about:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Software- and system-based interventions, like output filtering and system prompts&lt;/li&gt;
&lt;li&gt;Fine-tuning guardrails&lt;/li&gt;
&lt;li&gt;Machine unlearning or intentional "forgetting"&lt;/li&gt;
&lt;li&gt;Differential privacy in training and fine-tuning&lt;/li&gt;
&lt;li&gt;Evaluating and auditing privacy metrics in deep learning systems&lt;/li&gt;
&lt;li&gt;Evaluating AI systems and their threats as a third-party user&lt;/li&gt;
&lt;li&gt;Pruning and distillation for information reduction&lt;/li&gt;
&lt;li&gt;Different types of models that could offer explicit and enthusiastic consent and public participation&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Have a burning question or idea related to these topics, or want to share new threats and ideas? Please feel free &lt;a href="https://probablyprivate.com/about/"&gt;to reach out via email&lt;/a&gt; or &lt;a href="https://www.linkedin.com/in/katharinejarmul/"&gt;LinkedIn&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Acknowledgements&lt;/em&gt;: I would like to thank &lt;a href="https://vickiboykis.com/"&gt;Vicki Boykis&lt;/a&gt; and &lt;a href="https://desfontain.es/serious.html"&gt;Damien Desfontaines&lt;/a&gt; for their feedback, corrections and thoughts on this series. Their input greatly contributed to improvements in my thinking and writing. Any mistakes, typos, inaccuracies or controversial opinions are my own.&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;These histograms show unscaled cross-entropy loss (CE-loss) collected from 1024 models that were trained with training data produced by leaving samples in/out. The CE-loss per example was collected to visually show the behavior of different types of examples and classes in the underlying training distributions and subsequent models.&amp;#160;&lt;a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;The y-axis is logarithmic and shows the attack accuracy when holding for a high performance rate (False Positive Rate of 0.1). This is done as part of their attack design, where they aim to create more rigorous standards for MIA measurement to just focus on attacks that don't guess membership incorrectly.&amp;#160;&lt;a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:3"&gt;
&lt;p&gt;A fun piece of information from this research is that the attack's success can be measured by looking at the model generalization gaps. This connects with what you've learned so far on &lt;a href="https://blog.kjamistan.com/gaming-evaluation-the-evolution-of-deep-learning-training-and-evaluation.html"&gt;evaluation metrics&lt;/a&gt; and &lt;a href="https://blog.kjamistan.com/how-memorization-happens-overparametrized-models.html"&gt;the generalization gap&lt;/a&gt; as an indicator for memorization. In general, they find that models that are more accurate and larger are easier to attack, which aligns with what you've already learned thus far.&amp;#160;&lt;a class="footnote-backref" href="#fnref:3" title="Jump back to footnote 3 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:4"&gt;
&lt;p&gt;There are numerous tips in the paper and an &lt;a href="https://github.com/tensorflow/privacy/tree/master/research/mi_lira_2021"&gt;openly available implementation on GitHub&lt;/a&gt; that shows how to parallelize this and how many models and data splits are efficient to develop the distributions needed for the attack.&amp;#160;&lt;a class="footnote-backref" href="#fnref:4" title="Jump back to footnote 4 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:5"&gt;
&lt;p&gt;For example, when the website "this person does not exist" launched, researchers were quick to find "AI generated people" who actually represented faces in commonly used public face datasets. See &lt;a href="https://arxiv.org/abs/2107.06018"&gt;Webster et al., This Person (Probably) Exists, 2021&lt;/a&gt;.&amp;#160;&lt;a class="footnote-backref" href="#fnref:5" title="Jump back to footnote 5 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="ml-memorization"></category></entry><entry><title>Priveedly: your private and personal content reader and recommender</title><link href="https://blog.kjamistan.com/priveedly-your-private-and-personal-content-reader-and-recommender.html" rel="alternate"></link><published>2025-01-23T00:00:00+01:00</published><updated>2025-01-23T00:00:00+01:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2025-01-23:/priveedly-your-private-and-personal-content-reader-and-recommender.html</id><summary type="html">&lt;p&gt;I'm excited to open-source a project that I've been using for the past 2 and a half years: a private/personal reader and recommender.&lt;/p&gt;
&lt;p&gt;It works with:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;RSS feeds&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.reddit.com/"&gt;Reddit&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://news.ycombinator.com/"&gt;HackerNews&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://lobste.rs/"&gt;Lobste.rs&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;and comes with an example Jupyter Notebook for training your own text-based recommendation model once you have …&lt;/p&gt;</summary><content type="html">&lt;p&gt;I'm excited to open-source a project that I've been using for the past 2 and a half years: a private/personal reader and recommender.&lt;/p&gt;
&lt;p&gt;It works with:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;RSS feeds&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.reddit.com/"&gt;Reddit&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://news.ycombinator.com/"&gt;HackerNews&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://lobste.rs/"&gt;Lobste.rs&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;and comes with an example Jupyter Notebook for training your own text-based recommendation model once you have enough content. For most folks, this will be about 3-6 months of active use -- depending on the amount of content you consume.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Interested in what it looks like? There's a short &lt;a href="https://youtu.be/J6SVGapJ1L0"&gt;video introduction on YouTube&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;If you just want to get started, head over to &lt;a href="https://github.com/kjam/priveedly"&gt;the project's GitHub&lt;/a&gt;! If you want a little history of why I bothered to build this and how I use it, read on.&lt;/p&gt;
&lt;h3 id="why-news-and-content-is-personal"&gt;Why news and content is personal&lt;/h3&gt;
&lt;p&gt;Despite what Social Media(TM) wants you to think, your content choices are deeply personal. You like the things you like, surely others like them, but you might be a very special combination of things which is what guides your interests.&lt;sup id="fnref:1"&gt;&lt;a class="footnote-ref" href="#fn:1"&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;The large content providers and social media platforms try to be everything to everyone, and when that doesn't work, they try to personalize by tracking you and putting you in ever smaller and smaller bins and cross-sections so that eventually your feed is "personalized" in a way that is still profitable for them to serve you content.&lt;/p&gt;
&lt;p&gt;Unfortunately, this means that if you are curious about something outside of your normal interactions, one poor click or follow might haunt you and rearrange your content.&lt;sup id="fnref:2"&gt;&lt;a class="footnote-ref" href="#fn:2"&gt;2&lt;/a&gt;&lt;/sup&gt; In my opinion, you shouldn't be afraid that clicking or reading something you are mildly interested in means you're doomed to see ads (or even deal with changes in online prices or search results) just because you clicked on one article.&lt;sup id="fnref:3"&gt;&lt;a class="footnote-ref" href="#fn:3"&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;It can also be fun to decide what you want to expose yourself to, for your own autonomy and purposes. Maybe your ideas change, maybe you are going through a huge life change, or maybe you want to surround yourself with a new bubble on the internet. Either way, deciding and determining directly what you read and see is a cool way to reclaim that autonomy.&lt;/p&gt;
&lt;p&gt;For these reasons, I decided that I was going to try to pursue more directed attention on my own reading and content online.&lt;/p&gt;
&lt;h3 id="down-the-rabbit-hole-shouldnt-this-be-easy"&gt;Down the rabbit hole: shouldn't this be easy?&lt;/h3&gt;
&lt;p&gt;I had long used feed readers, but I wanted to combine that with other content sources, like Reddit, Twitter&lt;sup id="fnref:4"&gt;&lt;a class="footnote-ref" href="#fn:4"&gt;4&lt;/a&gt;&lt;/sup&gt; and other tech news sites. As I first started investigating ways to just do this easily with online services (i.e. private services that promised to keep my data private), it was hard to find one that justified the cost. Many other services weren't very clear on if they actually implemented tracking-free clicks and content.&lt;/p&gt;
&lt;p&gt;I assumed there would be some easy open-source options, so then I looked there. There were some great ones I tried at first that were React-based, but since I am essentially incompetent at Javascript it was hard to figure out how to extend them. For Python-based readers, I tried &lt;a href="https://github.com/samuelclay/NewsBlur"&gt;NewsBlur&lt;/a&gt;, which was awesome, but also set up for much larger and in-depth usage than I was planning on. For me, the obvious options were asking too much (i.e. run a beefy, expensive server) and too complicated (i.e. learn Javascript).&lt;/p&gt;
&lt;p&gt;Since I know some things about feed and web scraping and language processing, I thought it might be fun to set up a small PoC... heheh -- yes, I know I am &lt;a href="https://www.xkcd.com/1319/"&gt;this XKCD comic (see below)&lt;/a&gt; and I literally cannot stop, don't bother sending help.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Famous XKCD automation comic showing that you think you are saving time by automating a task but then your own reworking of the automation takes much more time and effort." src="https://imgs.xkcd.com/comics/automation.png"&gt;&lt;/p&gt;
&lt;h3 id="built-small-and-simple-for-one-person-use"&gt;Built small-and-simple for one-person use&lt;/h3&gt;
&lt;p&gt;If you don't need to commercialize it, you can personalize it! Added benefit: this means you don't have to reach scale other than 1 user! You are already winning if one person can log in and use it. I ran my content server for the first year on a very small $3/month server. :)&lt;/p&gt;
&lt;p&gt;Since I already knew how to write scrapers and make a Django-based website, I did that. There are certainly a million other ways to do this, but I did what worked for me.&lt;/p&gt;
&lt;p&gt;Over time, I realized that I might want to filter content that I'm not interested in, especially when I get busy and don't log in for a month or two. When that happened, I wanted to only read the potentially interesting stuff and mark everything else as read.&lt;/p&gt;
&lt;p&gt;To start with building a recommender, I exported my data and played around with simple natural language processing to see what models worked for my data. I didn't overcomplicate or overthink it for my use, which is why I used &lt;a href="https://scikit-learn.org/stable/"&gt;scikit-learn&lt;/a&gt; and not some LLM.&lt;/p&gt;
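&lt;p&gt;If you're curious what "not overcomplicating it" looks like in practice, here's a minimal sketch of that kind of scikit-learn setup -- not the exact code from the project's notebook, and the example texts and labels are invented: TF-IDF features over item text plus a linear classifier predicting "interesting" or not.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical export of your own reading history: text plus a 1/0 label
# for "I marked this interesting".
texts = [
    "new release of a python privacy library",
    "celebrity gossip roundup for the week",
    "differential privacy explained with examples",
    "ten gadgets you absolutely must buy",
]
labels = [1, 0, 1, 0]

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=1),
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)

# Score unread items and read the most promising ones first.
unread = ["a gentle intro to membership inference attacks"]
print(model.predict_proba(unread)[0][1])  # probability the item is interesting
&lt;/code&gt;&lt;/pre&gt;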
&lt;p&gt;You might be different and decide:&lt;/p&gt;
&lt;p&gt;a) you want to build your own using a different web framework or open-source reader/recommender
b) you want to use an LLM&lt;/p&gt;
&lt;p&gt;I say: go for it! It's your project! :)&lt;/p&gt;
&lt;p&gt;Note: My current server costs about $6 a month to manage running the feed-reading, parsing and bulk-rating of articles every few hours. If you want to run an LLM it will cost a lot more and require much more memory.&lt;/p&gt;
&lt;h3 id="the-treasure-trove-of-your-own-data"&gt;The treasure trove of your own data&lt;/h3&gt;
&lt;p&gt;One cool thing about running your own content reader/recommender is that you can study your data over time. As a data scientist, I think this is really awesome (yes, I am a nerd).&lt;/p&gt;
&lt;p&gt;Once you have enough data to do some basic analysis or whenever you decide to train a model on your data, you can use that analysis or model introspection to investigate more about you. This can be a fun exercise and you can do it on the privacy of your own computer and/or server.&lt;sup id="fnref:5"&gt;&lt;a class="footnote-ref" href="#fn:5"&gt;5&lt;/a&gt;&lt;/sup&gt; There is an example &lt;a href="https://github.com/kjam/priveedly/blob/main/notebooks/Training%20and%20Testing%20Simple%20Recommendation%20Classifiers.ipynb"&gt;notebook in the GitHub&lt;/a&gt; to get you started and an accompanying video on &lt;a href="https://youtu.be/AMy3K3NbrLw"&gt;YouTube&lt;/a&gt;.&lt;/p&gt;
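&lt;p&gt;Introspecting a linear model like the one sketched above takes only a few lines: pairing the feature names with the learned coefficients shows which words and bigrams push an item toward "interesting" or away from it. This continues the hypothetical pipeline from the earlier sketch, not the project's notebook.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np

# Continuing the hypothetical pipeline from above: pull out the fitted steps.
vectorizer = model.named_steps["tfidfvectorizer"]
classifier = model.named_steps["logisticregression"]

terms = np.array(vectorizer.get_feature_names_out())
weights = classifier.coef_[0]

order = np.argsort(weights)
print("most negative terms:", terms[order[:5]])    # push toward "not interesting"
print("most positive terms:", terms[order[-5:]])   # push toward "interesting"
&lt;/code&gt;&lt;/pre&gt;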
&lt;p&gt;Should you want to change/retrain your model or even change what you read or what you mark interesting based on your model introspection, you can guide that yourself on your own terms. Changing the model is something you can do at any time, without anyone making money or poking you to change what you click so they can make money.&lt;/p&gt;
&lt;h3 id="open-sourcing-priveedly"&gt;Open-sourcing Priveedly&lt;/h3&gt;
&lt;p&gt;This project was really just for me for a long time, but I thought now is a hard time for many people to control their news and what they want to read, so I decided to clean it up as best I could and open-source it. If you think you can make Priveedly better by helping with the &lt;a href="https://github.com/kjam/priveedly?tab=readme-ov-file#some-additional-notes"&gt;open requests in the ReadMe&lt;/a&gt; or via &lt;a href="https://github.com/kjam/priveedly/issues"&gt;GitHub Issues&lt;/a&gt;, I would be very grateful!&lt;/p&gt;
&lt;p&gt;I hope you might be inspired to use Priveedly or whatever service/project you decide gives you the right balance of privacy, autonomy and fun.&lt;/p&gt;
&lt;p&gt;If you find this project useful and want to support my work, you can &lt;a href="https://probablyprivate.com"&gt;subscribe to my newsletter&lt;/a&gt;, buy &lt;a href="https://practicaldataprivacybook.com"&gt;my most recent book&lt;/a&gt;, follow me on &lt;a href="https://www.youtube.com/@ProbablyPrivate"&gt;YouTube&lt;/a&gt; or even hire me for &lt;a href="https://kjamistan.com/"&gt;corporate trainings, advisory and speaking engagements on topics like Privacy and Security in ML/AI systems&lt;/a&gt;.&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;I am odd, just probably like you are odd in some ways. My ideal feed is heavy on tech, computer science, machine learning but also on things like my favorite cooking blogs, artsy blogs and artists/comics I like.&amp;#160;&lt;a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;I sometimes want to read stuff without teaching the algorithm, just because I am curious what's behind a link. And yes, sometimes it is clickbait and I wished I didn't click, but I try to be kind to myself and tell myself that's okay too.&amp;#160;&lt;a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:3"&gt;
&lt;p&gt;I don't like BigTech or Third-party-ad-platform trying to target me via my clicks or reading interests. It makes me feel uncomfortable about clicking things. This isn't the internet I signed up for... (sad trombone)&amp;#160;&lt;a class="footnote-backref" href="#fnref:3" title="Jump back to footnote 3 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:4"&gt;
&lt;p&gt;RIP Twitter. The freely available API for your own feed got turned off a few months after The MuskRat took over. :( The original code that worked is still there (it accesses Lists and pulls from them), but I am highly doubtful that it still works and that the API hasn't dramatically changed.&amp;#160;&lt;a class="footnote-backref" href="#fnref:4" title="Jump back to footnote 4 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:5"&gt;
&lt;p&gt;I &lt;a href="https://youtu.be/jYwe-YHM4ag?t=1123"&gt;presented some interesting trends and tokens from my personal recommender model at PyData Paris&lt;/a&gt;, including one of the most negative bigrams (2-word-combinations), which was "Elon says". When I first saw this, it made me laugh all day long and was well worth the additional time and effort.&amp;#160;&lt;a class="footnote-backref" href="#fnref:5" title="Jump back to footnote 5 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="personal-ai"></category></entry><entry><title>Adversarial Examples Demonstrate Memorization Properties</title><link href="https://blog.kjamistan.com/adversarial-examples-demonstrate-memorization-properties.html" rel="alternate"></link><published>2025-01-15T00:00:00+01:00</published><updated>2025-01-15T00:00:00+01:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2025-01-15:/adversarial-examples-demonstrate-memorization-properties.html</id><summary type="html">&lt;p&gt;In this article, the last in the &lt;a href="https://blog.kjamistan.com/a-deep-dive-into-memorization-in-deep-learning.html"&gt;problem exploration section of the series&lt;/a&gt;, you'll explore adversarial machine learning - or how to trick a deep learning system.&lt;/p&gt;
&lt;p&gt;Adversarial examples demonstrate a different way to look at deep learning  memorization and generalization. They can show us how important the learned decision space …&lt;/p&gt;</summary><content type="html">&lt;p&gt;In this article, the last in the &lt;a href="https://blog.kjamistan.com/a-deep-dive-into-memorization-in-deep-learning.html"&gt;problem exploration section of the series&lt;/a&gt;, you'll explore adversarial machine learning - or how to trick a deep learning system.&lt;/p&gt;
&lt;p&gt;Adversarial examples demonstrate a different way to look at deep learning  memorization and generalization. They can show us how important the learned decision space and its properties are and how the training data and preprocessing affect that behavior. Adversarial examples demonstrate similar properties to outliers in deep learning systems.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Prefer to learn by video? This post &lt;a href="https://youtu.be/us4gUJKvwpQ"&gt;is summarized on Probably Private's YouTube&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;You'll also explore how adversarial learning contributed to model growth with early approaches to adversarial training and robustness, and how today's approaches find correlations to memorization in diffusion models.&lt;/p&gt;
&lt;h3 id="what-is-an-adversarial-example"&gt;What is an adversarial example?&lt;/h3&gt;
&lt;p&gt;Adversarial examples are those which trick a machine learning model into behaving in unlikely or unwanted ways. You've probably seen some adversarial examples &lt;a href="https://x.com/random_walker/status/1636923058370891778"&gt;on social media&lt;/a&gt; or &lt;a href="https://www.theregister.com/2024/10/29/chatgpt_hex_encoded_jailbreak/"&gt;in the news&lt;/a&gt;, which are often referred to as "jailbreaking" if they attack an LLM system.&lt;sup id="fnref:1"&gt;&lt;a class="footnote-ref" href="#fn:1"&gt;1&lt;/a&gt;&lt;/sup&gt; But the study of adversarial examples precedes the existence of LLMs, and can teach you about how deep learning models work.&lt;/p&gt;
&lt;p&gt;Let's examine an early example of an adversarial attack, from &lt;a href="https://arxiv.org/abs/1707.07397"&gt;MIT researchers in 2017&lt;/a&gt;. The machine learning lab researchers were building on still image work that introduced adversarial examples by altering the images in small ways. They wondered if they could build a 3D adversarial example - and were able to do so!&lt;/p&gt;
&lt;p&gt;&lt;img alt="An image of what appears to be a turtle with psychedelic shell. To the right of the image you can see the results of a classifier, which classifies it with more than 90% confidence that it is a rifle." src="./images/2024/adversarial_turtle.png"&gt;
&lt;em&gt;Watch the &lt;a href="https://www.youtube.com/watch?v=YXy6oX1iNoA&amp;amp;ab_channel=SynthesizingRobustAdversarialExamples"&gt;full video on YouTube&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Using then state-of-the-art computer vision models, they were able to 3D print an adversarial turtle which from many angles is improperly categorized as a rifle.&lt;/p&gt;
&lt;p&gt;You might be wondering, how does that work???&lt;/p&gt;
&lt;h3 id="how-adversarial-examples-happen"&gt;How adversarial examples happen&lt;/h3&gt;
&lt;p&gt;Adversarial examples occur primarily by increasing uncertainty or error in the machine learning system. Via a variety of methods, adversarial examples push inputs into other areas of the decision space or boundaries (think about what you learned about &lt;a href="https://blog.kjamistan.com/how-memorization-happens-overparameterized-models.html"&gt;margin theory&lt;/a&gt;!). In this case, the researchers wanted to push the turtle into "rifle" decision space. These attacks exploit those decision boundaries, in a similar way to how the training process creates them.&lt;/p&gt;
&lt;p&gt;Let's walk through exactly how a simple adversarial attack works, to get an idea for how it happens.&lt;/p&gt;
&lt;p&gt;You can take a model, really any model that is trained on a similar task -- here, a computer vision task. The properties of transfer learning make this possible. Deep learning models that are trained for similar tasks hold similar properties, learn similar things, and sometimes even have similar base datasets.&lt;/p&gt;
&lt;p&gt;With your local computer vision model, you take an input that you want to make adversarial. Because you have direct model access, you can run the image through the model to produce an inference / prediction. When you do so, you can also observe the weights and activations at each layer, and of course the output of the inference. Let's say it correctly identifies a person in the image.&lt;/p&gt;
&lt;p&gt;You want to make sure a person isn't found in the photo. You can then measure the gradient changes you would need to increase the error. You are essentially reversing the process of stochastic gradient descent, trying to increase error overall or towards another decision boundary (i.e. please make sure this photo has an imaginary boat in it). The goal is to reach an error level where the original classification no longer holds (i.e. the model returns that the photo has no person in it).&lt;/p&gt;
&lt;p&gt;An example from one of the &lt;a href="https://arxiv.org/abs/1412.6572"&gt;first well-cited papers on these attacks (Goodfellow et al, 2015)&lt;/a&gt; shows visually, what this might mean:&lt;/p&gt;
&lt;p&gt;&lt;img alt="On the left, there is an image of a panda. Then there is a plus and then another image, which just looks like colorful noisy pixels. Those two images are combined and the resulting image (to the right) looks again like a panda, but maybe the color looks a teeny bit different than the first one but it's barely discernible. The predictions on the images are written below the image. On the left image the model is 57.7% confident it's a panda. The middle image is classified with 8.2% confidence as a nematode and the right image is classified with 99.3% confidence as a gibbon." src="./images/2024/adversarial_panda.png"&gt;&lt;/p&gt;
&lt;p&gt;This attack uses the Fast Gradient Sign Method (FGSM), which functions similarly to what is described above.&lt;sup id="fnref:2"&gt;&lt;a class="footnote-ref" href="#fn:2"&gt;2&lt;/a&gt;&lt;/sup&gt; This method shows you "which direction to push" and where in the input to change in order to push that way. Then you actually create the perturbation, which aims to push the input in the right places in the right way, scaled here by 0.07 (represented in the equation as epsilon). This perturbation is combined with the original image, resulting in a large classification error (and the resulting high confidence in the incorrect gibbon class).&lt;/p&gt;
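&lt;p&gt;In code, the core of FGSM is only a few lines. Here's a hedged sketch in PyTorch (the model, input and label are placeholders for whatever classifier and data you're attacking): compute the loss on the true label, take the sign of the gradient with respect to the input, and nudge the input by epsilon in that direction.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import torch
import torch.nn.functional as F

def fgsm_attack(model, x, true_label, eps=0.07):
    # x: image tensor with shape (1, channels, height, width), values in [0, 1]
    # true_label: tensor of shape (1,) holding the correct class index
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), true_label)
    loss.backward()
    # Step in the direction that increases the loss on the true label.
    x_adv = x + eps * x.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()
&lt;/code&gt;&lt;/pre&gt;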
&lt;p&gt;An attacker can use a method like this, or one of several more complex methods developed over time, to introduce error into model inference and influence the model to return a particular decision. The turtle becoming a rifle was a specific example to show that deploying computer vision for security could go very wrong if it were used to target people carrying "weapons".&lt;/p&gt;
&lt;p&gt;In a way, these examples represent the unlikely inputs you see in the long tail of normal data collection. To a human, these are obvious, but to a computer vision or deep learning model, they are novel, erroneous or unknown. Interestingly enough, some of the initial defenses against adversarial attacks used this fact to correct the introduced error. Let's explore one of them that relates to our investigation of memorization.&lt;/p&gt;
&lt;h3 id="initial-defenses-manifold-ing"&gt;Initial defenses: Manifold-ing&lt;/h3&gt;
&lt;p&gt;One of the early defenses that caught my eye had an interesting approach to the adversarial input problem. Instead of attempting to build the most robust model, it attempted to adjust the input and draw it closer to the more common examples, essentially regularizing the error away. This approach, called &lt;a href="https://arxiv.org/abs/1705.09064"&gt;MagNet&lt;/a&gt;, was introduced by Meng and Chen in 2017.&lt;/p&gt;
&lt;p&gt;Let's explore how this correction worked, step-by-step 😉:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;First, potential adversarial examples are identified by a series of detectors. The detectors are trained to determine how abnormal the current example is based on the training examples. Numerous detectors were trained, so an adversary would need to know each of the detectors well enough to build an example that would 100% go through undetected&lt;sup id="fnref:3"&gt;&lt;a class="footnote-ref" href="#fn:3"&gt;3&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If an example is too far from the distribution of training examples, it is corrected using an autoencoder built specifically to bring the example closer to the nearest training examples - which sit on what the paper calls a manifold. You can also think of this as attempting to migrate the example back towards the nearest decision boundary, as you learned about in margin theory.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;To illustrate the output of this autoencoder, check out this diagram from the paper, which visually explains the second step in 2-D. The curved line represents the manifold and the green circles the training examples. The red crosses represent adversarial examples, and the arrows represent their correction via the reformer.&lt;/p&gt;
&lt;p&gt;&lt;img alt="An image with a curved line representing the manifold. There are green circles along the line which show the location of the training examples. There are a few red crosses farther from the manifold. They represent the adversarial inputs. Then there are arrows that point from them towards the manifold, which represent the reformer process to move them closer to the training examples." src="./images/2024/manifolding.png"&gt;&lt;/p&gt;
&lt;ol start="3"&gt;
&lt;li&gt;Only then can the "reformed" version of the example be run through inference - hopefully without the carefully designed noise that would have disrupted the system without the correction.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This protection was successful against all of the common attack vectors at the time.&lt;sup id="fnref:4"&gt;&lt;a class="footnote-ref" href="#fn:4"&gt;4&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
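&lt;p&gt;To get a feel for the two-step idea, here is a toy stand-in for the detector and the reformer: a linear "autoencoder" (PCA from scikit-learn) fit on training data that lives near a low-dimensional manifold. Reconstruction error flags abnormal inputs, and the reconstruction itself acts as the reformed input. This is only a sketch of the concept with made-up data, not the MagNet implementation.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# "Training data" that lives near a low-dimensional manifold:
# 2 latent factors embedded in 20 dimensions, plus a little noise.
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 20))
X_train = latent @ mixing + 0.05 * rng.normal(size=(500, 20))

# Linear "autoencoder": project onto the manifold and back.
autoencoder = PCA(n_components=2).fit(X_train)

def reconstruct(x):
    return autoencoder.inverse_transform(autoencoder.transform(x.reshape(1, -1)))[0]

def reconstruction_error(x):
    return float(np.linalg.norm(x - reconstruct(x)))

# Detector threshold: the largest error seen on the training data itself.
threshold = max(reconstruction_error(x) for x in X_train)

def detect_and_reform(x):
    # Flag abnormal inputs; "reform" them by pulling them back onto the manifold.
    return reconstruction_error(x) &amp;gt; threshold, reconstruct(x)

normal_input = X_train[0]
adversarial_like = normal_input + 2.0 * rng.normal(size=20)  # pushed off the manifold

print(detect_and_reform(normal_input)[0])      # False: close to the manifold
print(detect_and_reform(adversarial_like)[0])  # True: far from the manifold
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The real MagNet detectors and reformer are trained autoencoders rather than PCA, but the mechanics -- measure distance to the manifold, then pull the input back onto it before inference -- are the same.&lt;/p&gt;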
&lt;p&gt;How does this relate to memorization? Adversarial examples present the same types of problems for network functionality as singletons do, although in different ways (one is malicious, the other just odd). If you treated singletons as adversarial, you could choose to focus on learning the nearest decision boundary or manifold and handle them as outliers. This would reduce the chance of memorization.&lt;/p&gt;
&lt;p&gt;In doing so, you might choose to implement something more like the reformer, which could encode enough information from the outlier to shift decision boundaries, but not enough to memorize the initial input. The reformer algorithm would be a stand-in for something like differential privacy, auto-encoding outliers towards a more "common" case. One approach that encodes privacy into a learned representation can be found in Dwork et al.'s work &lt;em&gt;&lt;a href="https://arxiv.org/abs/1104.3913"&gt;Fairness through Awareness&lt;/a&gt;&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;These autoencoders and several other interesting approaches popular during those initial years were eventually replaced by a different approach to address adversarial examples.  &lt;/p&gt;
&lt;h3 id="the-heyday-and-wane-of-adversarial-research"&gt;The Heyday and Wane of Adversarial Research&lt;/h3&gt;
&lt;p&gt;Adversarial examples have likely existed as long as machine learning has existed, but they experienced a renaissance with deep learning due to the unique behavior of deep learning models in comparison to simpler models.&lt;/p&gt;
&lt;p&gt;In 2016 it was hard to attend any machine learning event without hearing about adversarial examples (kind of like trying to avoid LLMs in 2024). There were dedicated sections of popular machine learning conferences just for papers and posters exploring these problems. These papers looked at unique attack vectors, easy ways to produce adversarial examples and, of course, a variety of interesting and novel defense mechanisms, many of which were broken by follow-up research sometimes only days after a new state-of-the-art defense was published.&lt;/p&gt;
&lt;p&gt;Eventually newer approaches emerged -- ones that didn't try to figure out why or how adversarial examples worked or come up with clever ways to address them.&lt;/p&gt;
&lt;p&gt;In 2018, &lt;a href="https://arxiv.org/abs/1706.06083"&gt;a paper from MIT researchers (Mądry et al.)&lt;/a&gt; reached state-of-the-art adversarial performance with a new approach: throw compute and memory at the problem. Instead of trying to find, correct or otherwise disarm adversarial examples, they simply trained a bigger model for longer using adversarial examples alongside normal examples.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Two performance charts side by side, one using MNIST data and the other using CIFAR10 data. The loss is charted on the y-axis with training iterations on the x-axis. The iterations go from 25,000 to 75,000. On both charts you can see a drop in loss (also known as an increase in accuracy) until around 75,000 where sometimes the loss increases." src="./images/2024/adversarial_training.png"&gt;&lt;/p&gt;
&lt;p&gt;This works similarly to the phenomenon you've been exploring in this series, where double descent and an increase in memorization enhance model performance. Instead of figuring out new and interesting ways to understand deep learning, you just memorize adversarial examples in a massive model and move along.&lt;/p&gt;
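&lt;p&gt;In code terms, the recipe boils down to an ordinary training loop where each batch is perturbed by an attack before the gradient step. The sketch below uses a single FGSM-style step on a toy numpy logistic regression purely for brevity -- the paper uses multi-step attacks (PGD) on large networks -- so treat it as an illustration of the idea, not their implementation.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np

rng = np.random.default_rng(1)

# Toy binary classification data.
X = rng.normal(size=(200, 5))
true_w = np.array([1.0, -1.0, 2.0, 0.0, 0.5])
y = (X @ true_w + 0.1 * rng.normal(size=200) &amp;gt; 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b = np.zeros(5), 0.0
lr, epsilon = 0.1, 0.1

for step in range(500):
    p = sigmoid(X @ w + b)

    # Craft adversarial versions of the batch (one gradient-sign step per example):
    # push every input in the direction that increases its own loss.
    grad_x = (p - y)[:, None] * w[None, :]
    X_adv = X + epsilon * np.sign(grad_x)

    # Standard gradient step, but computed on the adversarial batch.
    p_adv = sigmoid(X_adv @ w + b)
    w -= lr * X_adv.T @ (p_adv - y) / len(y)
    b -= lr * np.mean(p_adv - y)

print("training accuracy on clean data:",
      np.mean((sigmoid(X @ w + b) &amp;gt; 0.5) == y))
&lt;/code&gt;&lt;/pre&gt;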
&lt;p&gt;In addition to this approach, a second popular method came about when diffusion models rose in popularity. As you read about &lt;a href="https://blog.kjamistan.com/how-memorization-happens-novelty.html"&gt;in a previous article&lt;/a&gt;, diffusion models can take an arbitrary noisy image and output prompt-based images. To do so, the diffusion process is reversed, so noise is gradually removed, shifting the output towards a known trajectory or goal. This can be applied to adversarial images in order to gradually remove potential noise.&lt;/p&gt;
&lt;p&gt;Recent research on &lt;a href="https://arxiv.org/abs/2206.10550"&gt;certifying adversarial robustness&lt;/a&gt; achieved "state of the art" results with a two-step process: apply a one-shot reverse diffusion step, then classify the result with a (probably fairly large) classifier. One interesting note is that a one-shot diffusion process works best, because iterative diffusers begin "filling in the blanks" and insert error by moving the input closer to an already learned class (or memorized input example). Choosing a less-diffused image results in higher fidelity to the actual input. This demonstrates what you learned with regard to &lt;a href="https://blog.kjamistan.com/how-memorization-happens-novelty.html"&gt;inpainting attacks&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id="exploring-the-problem-space"&gt;Exploring the problem space&lt;/h3&gt;
&lt;p&gt;So far in this series you've learned about:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;how data is collected and labeled for machine learning&lt;/li&gt;
&lt;li&gt;how training and evaluation works&lt;/li&gt;
&lt;li&gt;why and how accuracy became the most important metric&lt;/li&gt;
&lt;li&gt;model size and training time growth&lt;/li&gt;
&lt;li&gt;what examples are memorized and some understanding and intuition on why and how&lt;/li&gt;
&lt;li&gt;how researchers first understood and found examples of memorization in deep learning&lt;/li&gt;
&lt;li&gt;how differential privacy relates to the memorization problems discovered in deep learning&lt;/li&gt;
&lt;li&gt;how adversarial examples demonstrate similar qualities as outliers&lt;/li&gt;
&lt;li&gt;how adversarial approaches influenced today's models and vice-versa&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You now also know that memorization in deep learning happens. That it happens for common examples and outliers. That it's difficult to 100% understand and predict what is memorized. And that unless it is treated as a first-order problem which should be addressed and corrected, as adversarial examples were, memorization will continue to plague deep learning models. This memorization, as you have learned, presents serious problems for privacy guarantees, and certainly for any person whose data is used for training.&lt;/p&gt;
&lt;p&gt;In the next part of the article series, you'll explore the solution space. How can you address memorization? I'll walk you through active research areas, such as machine unlearning and differential privacy for training deep learning models. I'll also cover some areas which are useful but haven't gotten much attention, like personalized machine learning systems.&lt;/p&gt;
&lt;p&gt;If you've enjoyed this series, consider &lt;a href="https://probablyprivate.com"&gt;subscribing to my newsletter&lt;/a&gt;. If you'd be interested in a printed version of this series, I'd love to hear from you. I hope to produce a zine (physical) version with artist illustrations and content modifications for ease of reading and based on reader feedback.&lt;/p&gt;
&lt;p&gt;As always, I'm very open to feedback (positive, neutral or critical), questions (there are no stupid questions!) and creative re-use of this content. If you have any of those, please share it with me! This helps keep me inspired and writing. :)&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Acknowledgements&lt;/em&gt;: I would like to thank &lt;a href="https://vickiboykis.com/"&gt;Vicki Boykis&lt;/a&gt; for feedback, corrections and thoughts on this article. Any mistakes, typos, inaccuracies or controversial opinions are my own.&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;Interesting word choice, based on the fact that these models are often trained with "guardrails" to try to control behavior that the underlying language model has learned -- like swear words, how to build bombs or other undesired behavior for a large consumer-facing language model.&amp;#160;&lt;a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;To translate all the symbols: you compute the gradient with respect to the input (∇x) of the cost function J, which depends on the model parameters (θ), the input (x) and the target class (y). You do this to figure out the easiest and fastest way to achieve your adversarial goal (i.e. just increase error or increase error towards a particular target class). In this particular method, only the sign of each gradient component (i.e. positive or negative) is kept. This can be computed quickly and creates an easy-to-use output, where you follow (or reverse) the signs to "push" the input in the appropriate direction.&amp;#160;&lt;a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:3"&gt;
&lt;p&gt;This borrows from principles of cryptography, where you want to increase randomness and related uncertainty in the process to deter certain types of attacks. In cryptography, you can tell that a method is solid if it introduces enough randomness to provoke attacker uncertainty about the exact method, key and/or plaintext chosen.&amp;#160;&lt;a class="footnote-backref" href="#fnref:3" title="Jump back to footnote 3 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:4"&gt;
&lt;p&gt;I took some similar research code on &lt;a href="https://evademl.org/squeezing/"&gt;feature squeezing&lt;/a&gt; (where anomalies are detected and then compressed and smoothed) and turned it into a &lt;a href="https://resources.oreilly.com/live-training/security-for-machine-learning"&gt;GitLab exercise for a security in machine learning course&lt;/a&gt;, if you want to play around with some code examples.&amp;#160;&lt;a class="footnote-backref" href="#fnref:4" title="Jump back to footnote 4 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="ml-memorization"></category></entry><entry><title>Differential Privacy as a Counterexample to AI/ML Memorization</title><link href="https://blog.kjamistan.com/differential-privacy-as-a-counterexample-to-aiml-memorization.html" rel="alternate"></link><published>2025-01-02T00:00:00+01:00</published><updated>2025-01-02T00:00:00+01:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2025-01-02:/differential-privacy-as-a-counterexample-to-aiml-memorization.html</id><summary type="html">&lt;p&gt;At this point in reading the &lt;a href="https://blog.kjamistan.com/a-deep-dive-into-memorization-in-deep-learning.html"&gt;article series on AI/ML memorization&lt;/a&gt; you might be wondering, how did the field get so far without addressing the memorization problem? How did seminal papers like Zhang et al's &lt;a href="https://arxiv.org/abs/1611.03530"&gt;&lt;em&gt;Understanding Deep Learning Requires Rethinking Generalization&lt;/em&gt;&lt;/a&gt; not fundamentally change machine learning research? And maybe …&lt;/p&gt;</summary><content type="html">&lt;p&gt;At this point in reading the &lt;a href="https://blog.kjamistan.com/a-deep-dive-into-memorization-in-deep-learning.html"&gt;article series on AI/ML memorization&lt;/a&gt; you might be wondering, how did the field get so far without addressing the memorization problem? How did seminal papers like Zhang et al's &lt;a href="https://arxiv.org/abs/1611.03530"&gt;&lt;em&gt;Understanding Deep Learning Requires Rethinking Generalization&lt;/em&gt;&lt;/a&gt; not fundamentally change machine learning research? And maybe, is there any research on actually addressing the problem of memorization?&lt;/p&gt;
&lt;p&gt;I have an answer for the last question! In this article, you'll explore how differential privacy research both exposes memorization in deep learning networks and presents ways to address these issues.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Prefer to learn by video? This post &lt;a href="https://www.youtube.com/watch?v=zeYb-TvbbM0&amp;amp;ab_channel=ProbablyPrivate"&gt;is summarized on Probably Private's YouTube&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;In case differential privacy is new to you, let's walk through how differential privacy works and why it's a great fit for studying memorization.&lt;/p&gt;
&lt;h3 id="differential-privacy-a-primer"&gt;Differential Privacy: A primer&lt;/h3&gt;
&lt;p&gt;In 2006, Microsoft researcher Cynthia Dwork released a paper challenging common ideas around releasing data. To note, these ideas were already circulating in her research and related research for several years, but were not yet concretized for data release. In her paper &lt;a href="https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/dwork.pdf"&gt;&lt;em&gt;Differential Privacy&lt;/em&gt;&lt;/a&gt;, she posited that there was no real way to release data without potentially exposing someone's private information if the released information was combined with available external information. For example, if you release the average height of women in Lithuania and someone knows that a woman is 2cm over the average height, that person's height can now be calculated.&lt;/p&gt;
&lt;p&gt;This fact that any kind of information release can be detrimental to privacy is obviously at odds with the work of data science and study of information. Dwork and her peers didn't want or intend to stop all research - quite the contrary. They wanted to find new ways to provide safer guarantees than the current status quo, which often just aggregated data and suppressed or removed outliers.&lt;/p&gt;
&lt;p&gt;Differential privacy provided a new, safer way to release and share information. Differential privacy is a rigorous and scientific way of measuring information release and its impact on individual privacy. Instead of guessing and hoping you are releasing data in a safe manner, differential privacy gives you a way to measure the data release's impact on individual privacy. When used correctly, it provides strong privacy guarantees for people in the dataset.&lt;/p&gt;
&lt;p&gt;Differential privacy does this by giving you a new way to analyze the information gain someone can get by looking at the data. This information gain for the attacker (i.e. the person trying to learn more about someone) is privacy loss for the person or people they are trying to expose. The original differential privacy definition limits how much anyone can learn about any specific person.&lt;/p&gt;
&lt;p&gt;The definition is as follows:&lt;/p&gt;
&lt;p&gt;&lt;img alt="A database to the left labeled D1 and then a plus or minus sign and a drawing of a person equals a resulting database D2. Therefore, D1 with one person added or one person removed equals D2." src="./images/2024/dp_drawing.png"&gt;&lt;/p&gt;
&lt;div class="math"&gt;$$P[A(D_1) \in S] \le exp(\varepsilon) \times P[A(D_2) \in S]$$&lt;/div&gt;
&lt;p&gt;
&lt;em&gt;If you don't like math, just read on! It's okay :)&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;There are two databases (D1 and D2), which differ by one person. The definition tells us that if you ask a question about either database, you shouldn't be able to guess accurately if it is database 1 or 2 based on the answer. If you want to protect the privacy of single persons, you shouldn't notice when the database has changed by only one.&lt;/p&gt;
&lt;p&gt;The above equation can be described as a probability bounds problem. However, in this case, instead of reducing uncertainty, you are trying to increase uncertainty. You want the interactions with the database(s) to leak very little information, so the attacker's probability distributions based on prior knowledge and then updated with the information from the query response are closely bound. The attacker should remain unsure about which database they have queried and therefore also unsure about whether the person is in the dataset or not.&lt;/p&gt;
&lt;p&gt;Let's take another concrete example to review differential privacy with a real-world lens. Imagine you work at a company with a lot of business dashboards. One of the business dashboards you have access to shows the overall payroll spend by city and role.&lt;/p&gt;
&lt;p&gt;You find out someone new is joining the company in your city and you know their role. You're curious about what they are getting paid, so... you decide to take a screenshot the week before they start. Then, you take another screenshot the week they start (or when payroll goes out). When you compare those two screenshots, you have a pretty good guess at their annual salary, presuming they were the only joiner in that specific role/city combination. Even if they weren't the only joiner, you now have a bounds or range of possible salaries, and can better guess their salary.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A drawing of a person with exclamation points above their head. They are looking at a dashboard titled Payroll by Role and have a dashed line that represents where their screenshot from the prior example was." src="./images/2024/dp_dashboard.png"&gt;
&lt;em&gt;Image from my video course on Practical Data Privacy from O'Reilly Learning&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;This demonstrates the problems Dwork and others saw with simple aggregation. By just aggregating data, you don't protect individuals.&lt;/p&gt;
&lt;p&gt;But what if the dashboard didn't update as regularly, and when it did, it sometimes changed in more unpredictable ways? What if you weren't exactly sure your screenshot showed the correct number, just that it might be near the correct number -- or maybe not?&lt;/p&gt;
&lt;p&gt;When you implement differential privacy to release a dataset or to process data, you follow several key steps (sketched in code after this list):&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Understand sensitivity&lt;/em&gt;: How much can a person change the result? This can be easy to measure, like it is for our salary example, or difficult to measure -- like, how much can a person's data change a machine learning model? 🤔&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Determine bounds (if needed)&lt;/em&gt;: Sometimes you realize that sensitivity is unbounded, meaning a person could change the result by an unlimited amount. Technically, the company could pay someone anything from minimum wage to 10 billion dollars, so the natural bound would be very large. Instead of thinking about outliers, you want to think about the true distribution and keep individual contributions within a particular bound. Choosing this bound should still allow you to do meaningful data analysis but also provide protection for outliers.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Apply careful noise to the result&lt;/em&gt;: Once the sensitivity is known and the bounds are applied to the data, you can run the analysis or data processing and apply carefully calibrated noise to that process.&lt;sup id="fnref:1"&gt;&lt;a class="footnote-ref" href="#fn:1"&gt;1&lt;/a&gt;&lt;/sup&gt; Since you are essentially inserting error into the data analysis or processing&lt;sup id="fnref:2"&gt;&lt;a class="footnote-ref" href="#fn:2"&gt;2&lt;/a&gt;&lt;/sup&gt;, you want to know what type of error you are adding so you can adjust your analysis accordingly. For example, you might choose Gaussian noise because your dataset has a Gaussian distribution, and by applying Gaussian noise, you keep the overall Gaussian distribution intact. Differential privacy noise is tuned based on the extent of the analysis and the sensitivity of the query in question, so you can to some extent decide which answers get how much noise.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Track budget&lt;/em&gt;: In differential privacy, each time you run a query you need to track the privacy impact it has on the people in the dataset. This is called a &lt;em&gt;privacy budget&lt;/em&gt; and your budget spend is determined by parameters you set for a particular query or processing activity. Your entire budget is tracked, usually for the length of a particular analysis or process. This budget ensures that you haven't learned too much about any one individual in the group - and technically, when an individual or set of individuals run out of budget (i.e. when some limit of the parameters in that initial equation is reached), you should stop asking for more information.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
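&lt;p&gt;Here is a minimal sketch of those four steps for the payroll-dashboard example, using the Laplace mechanism on a bounded sum. The salary numbers, bounds and epsilon values are invented for illustration, and a real deployment would use a vetted differential privacy library rather than hand-rolled noise.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np

rng = np.random.default_rng(42)

# Step 2: determine bounds -- clamp each individual's contribution.
LOWER, UPPER = 30_000, 300_000          # assumed plausible salary range

# Step 1: sensitivity of a clamped *sum* query: adding or removing
# one person changes the total by at most UPPER.
sensitivity = UPPER

def dp_total_payroll(salaries, epsilon):
    clamped = np.clip(salaries, LOWER, UPPER)
    # Step 3: apply carefully calibrated (here, Laplace) noise.
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return clamped.sum() + noise

salaries = np.array([52_000, 61_000, 75_000, 120_000, 48_000])

# Step 4: track the budget -- each query spends part of the total epsilon.
total_budget, spent = 1.0, 0.0
for week in range(2):
    epsilon_per_query = 0.5
    assert spent + epsilon_per_query &amp;lt;= total_budget, "privacy budget exhausted"
    print(f"week {week}: noisy payroll = {dp_total_payroll(salaries, epsilon_per_query):,.0f}")
    spent += epsilon_per_query
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note how large the noise scale is here (300,000 / 0.5 = 600,000): with contributions bounded by a full salary and a small epsilon per query, the week-over-week difference in the dashboard tells the curious coworker very little -- which is exactly the point.&lt;/p&gt;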
&lt;p&gt;I recommend reading more about differential privacy via &lt;a href="https://desfontain.es/blog/friendly-intro-to-differential-privacy.html"&gt;Damien Desfontaines's blog series on differential privacy&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;One graphic from his blog series shows a visual idea of how choosing different budget values (here epsilon) relate to the information an attacker can gain by analyzing the results of the data release or query:&lt;/p&gt;
&lt;p&gt;&lt;img alt="A graph showing a variety of epsilon values with different colors. The lower values for epsilon form a lightly oval shape around the center line, showing how much more a person can learn when given a result based on their previous knowledge." src="./images/2024/dp_epsilon_suspicion.png"&gt;
&lt;em&gt;From &lt;a href="https://desfontain.es/blog/differential-privacy-in-more-detail.html"&gt;Differential Privacy, in more detail&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;In this graphic, the parameter choice for epsilon is the legend. To read the graph, the x-axis shows "how sure someone is that a person is in the dataset". The y-axis shows "how much more sure an attacker will become after spending this budget". Each epsilon value has a range from a lower to an upper bound. This range comes from the randomness chosen -- someone might learn more or less depending on the mechanism. However, the upper limit is a guarantee, and that is what separates differential privacy from other methods.&lt;/p&gt;
&lt;p&gt;Reading the graph, you can see that the parameter choice both affects your budget and has a fairly significant impact on the privacy guarantees. An epsilon of 5 reveals a lot of information, while an epsilon of 1 is fairly safe.&lt;/p&gt;
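&lt;p&gt;You can compute that worst case yourself. For a yes/no question ("is this person in the dataset?"), pure differential privacy bounds the attacker's posterior odds at e&lt;sup&gt;ε&lt;/sup&gt; times their prior odds. A few lines of Python show why epsilon 1 and epsilon 5 feel so different (the 50% starting suspicion is just an example):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import math

def worst_case_posterior(prior, epsilon):
    # Upper bound on the attacker's belief after seeing one epsilon-DP answer.
    prior_odds = prior / (1 - prior)
    posterior_odds = math.exp(epsilon) * prior_odds
    return posterior_odds / (1 + posterior_odds)

for eps in (0.5, 1, 5):
    print(eps, round(worst_case_posterior(0.5, eps), 3))
# prints roughly 0.622 for epsilon 0.5, 0.731 for epsilon 1, 0.993 for epsilon 5
&lt;/code&gt;&lt;/pre&gt;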
&lt;p&gt;To address machine learning memorization, you could reduce the individual or singleton impact on the training process. You just need to figure out how you can apply differential privacy to something like a machine learning model. If you can do so, this provides the differential privacy guarantees for the model and for anyone used in the training dataset.&lt;sup id="fnref:3"&gt;&lt;a class="footnote-ref" href="#fn:3"&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;Who has done this already? Let's investigate early research on applying differential privacy to deep learning.&lt;/p&gt;
&lt;h3 id="pates-near-misses"&gt;PATE's Near-misses&lt;/h3&gt;
&lt;p&gt;Papernot et al. architected one of the first differentially private deep learning systems in 2017. Their architecture, called &lt;a href="https://arxiv.org/abs/1802.08908"&gt;Private Aggregation of Teacher Ensembles, or PATE&lt;/a&gt;, achieved high accuracy, within 3 percentage points of the baseline models.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A graphic showing the PATE architecture. To the left is the sensitive data, which is divided up into n-datasets. Each teacher is then trained on those datasets. These teachers all link to an aggregate teacher, which then makes a predicted completion to a student who has incomplete public training data." src="./images/2024/pate_architecture.png"&gt;&lt;/p&gt;
&lt;p&gt;To review how PATE worked, let's walk through the architecture shown above. First, the sensitive data is separated into subsets so each person's data is only ever seen by one model. Then, many "teacher" models are trained, one per subset. After training, these teachers each cast one vote on the label for an example from the publicly available but unlabeled data used to train the student model. The votes are collected into a histogram, differential privacy noise is applied, and the highest class is chosen as the label. This uses differential privacy essentially as the labeling function for publicly available unlabeled data.&lt;sup id="fnref:4"&gt;&lt;a class="footnote-ref" href="#fn:4"&gt;4&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
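&lt;p&gt;The aggregation step itself is surprisingly small. Here is a sketch of the noisy-vote mechanism; the number of teachers, the number of classes and the Laplace scale are illustrative choices, not the paper's exact configuration.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np

rng = np.random.default_rng(7)

def noisy_aggregate(teacher_votes, num_classes, gamma=0.1):
    # teacher_votes: one predicted class index per teacher for a single public example.
    counts = np.bincount(teacher_votes, minlength=num_classes).astype(float)
    # Add Laplace noise to every class count, then take the winner.
    counts += rng.laplace(scale=1.0 / gamma, size=num_classes)
    return int(np.argmax(counts))

# 250 teachers voting on one unlabeled public example (10 classes).
votes = rng.integers(0, 10, size=250)
label_for_student = noisy_aggregate(votes, num_classes=10)
print(label_for_student)
&lt;/code&gt;&lt;/pre&gt;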
&lt;p&gt;In the paper, the authors found that the PATE architecture's "mistakes" were near-misses on odd or incorrectly labeled inputs. Here are some examples from the paper, shown with the correct labels that PATE guessed incorrectly.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Examples of near-misses from the model, where the image shows the example and then the label is shown below. The examples are hard to identify even for humans, with things like apostrophes being mislabeled as commas or out-of-vocabulary images like a cut-off special font or a Chinese character that is mislabeled." src="./images/2024/pate_misses.png"&gt;&lt;/p&gt;
&lt;p&gt;How many of these would you guess correctly? These near-misses are examples that could potentially confuse a human. In the exploration of &lt;a href="https://blog.kjamistan.com/how-memorization-happens-novelty.html"&gt;novelty and memorization&lt;/a&gt;, these training examples represent uncommon or novel examples of their class. This is exactly the type of example that Feldman proved would be memorized - because not memorizing it is too expensive if you want the highest accuracy model. Using differential privacy, however, blocks this novel example memorization from happening.&lt;/p&gt;
&lt;p&gt;Papernot et al. aren't the only ones -- much of the research studying memorization has explored the link between memorization of novel examples and differentially private training.&lt;/p&gt;
&lt;h3 id="memorization-and-differential-privacy-research"&gt;Memorization and Differential Privacy Research&lt;/h3&gt;
&lt;p&gt;In &lt;a href="https://arxiv.org/abs/1802.08232"&gt;the Secret Sharer paper&lt;/a&gt;, Carlini et al. compared data extraction from a model trained without differential privacy to extraction from models trained using different values of epsilon.&lt;/p&gt;
&lt;p&gt;They trained 7 models with different optimizers and epsilon values and compared the estimated exposure, which measures how strongly a given piece of sensitive information is memorized by the language model. The paper is from 2019, so they used a recurrent neural network (RNN), a different deep learning architecture for language models, rather than a transformer.&lt;/p&gt;
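&lt;p&gt;Roughly, exposure is measured by inserting a secret "canary" into the training data and then ranking how much the trained model likes that canary compared with every other candidate secret of the same format. Here is a hedged sketch of just the metric -- the log-likelihood scores below are stand-ins for real model outputs:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import math

def exposure(canary_log_likelihood, candidate_log_likelihoods):
    # Exposure = log2(number of candidates) - log2(rank of the canary).
    # A canary the model likes more than every other candidate (rank 1) gets the
    # maximum exposure; one buried in the middle of the pack gets close to zero bits.
    rank = 1 + sum(1 for ll in candidate_log_likelihoods
                   if ll &amp;gt; canary_log_likelihood)
    return math.log2(len(candidate_log_likelihoods)) - math.log2(rank)

# Pretend scores from a language model over 1,000 candidate secrets.
candidates = [-50.0 - 0.01 * i for i in range(1000)]
print(exposure(-49.0, candidates))  # ~10 bits: strongly memorized
print(exposure(-55.0, candidates))  # ~1 bit: barely memorized
&lt;/code&gt;&lt;/pre&gt;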
&lt;p&gt;Their results were as follows:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Optimizer&lt;/th&gt;
&lt;th&gt;Epsilon&lt;/th&gt;
&lt;th&gt;Test Loss&lt;/th&gt;
&lt;th&gt;Estimated Exposure&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;RMSprop&lt;/td&gt;
&lt;td&gt;0.65&lt;/td&gt;
&lt;td&gt;1.69&lt;/td&gt;
&lt;td&gt;1.1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RMSprop&lt;/td&gt;
&lt;td&gt;1.21&lt;/td&gt;
&lt;td&gt;1.59&lt;/td&gt;
&lt;td&gt;2.3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RMSprop&lt;/td&gt;
&lt;td&gt;5.26&lt;/td&gt;
&lt;td&gt;1.41&lt;/td&gt;
&lt;td&gt;1.8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RMSprop&lt;/td&gt;
&lt;td&gt;89&lt;/td&gt;
&lt;td&gt;1.34&lt;/td&gt;
&lt;td&gt;2.1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RMSprop&lt;/td&gt;
&lt;td&gt;2x10**8&lt;/td&gt;
&lt;td&gt;1.32&lt;/td&gt;
&lt;td&gt;3.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RMSprop&lt;/td&gt;
&lt;td&gt;1x10**9&lt;/td&gt;
&lt;td&gt;1.26&lt;/td&gt;
&lt;td&gt;2.8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SGD&lt;/td&gt;
&lt;td&gt;inf.&lt;/td&gt;
&lt;td&gt;2.11&lt;/td&gt;
&lt;td&gt;3.6&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;With ever-increasing values of epsilon, the accuracy improves and the exposure increases. This makes sense based on what you've learned so far, because memorizing the outliers and the long tail improves test accuracy. It also shows that implementing differential privacy reduces the chance of memorizing outliers. In this paper, the exposure and extraction attacks target outliers instead of common examples; therefore, the exposure increases as the differential privacy guarantees decrease.&lt;/p&gt;
&lt;p&gt;The researchers were unable to successfully perform any of the extraction attacks against the machine learning models trained with differential privacy.&lt;/p&gt;
&lt;h3 id="at-what-cost-at-what-gain"&gt;At what cost? At what gain?&lt;/h3&gt;
&lt;p&gt;As you learned in the &lt;a href="https://blog.kjamistan.com/gaming-evaluation-the-evolution-of-deep-learning-training-and-evaluation.html"&gt;Gaming Evaluation article in this series&lt;/a&gt;, the present machine learning culture is focused on accuracy above all other metrics. In exploring near-misses, I hope you have begun to question this focus. Would it be okay to guess these incorrectly in exchange for better privacy guarantees and less memorization?&lt;/p&gt;
&lt;p&gt;To some degree, differential privacy works as a regularizer for machine learning problems - forcing models to generalize and not memorize. If you want to make sure that you won't memorize any specific examples, you should apply it vigorously and as standard practice -- particularly when training large deep learning models.&lt;/p&gt;
&lt;p&gt;The types of mistakes that models trained with differential privacy make are small ones, mostly on examples from the dataset's long tail. Is avoiding those mistakes worth the cost of memorizing these examples? Perhaps for generic photos, but what about for someone's artwork, voice, personal details or writing? What cost are we individually and collectively paying and how will it affect work and life years later?&lt;/p&gt;
&lt;p&gt;Differential privacy training is not a magical salve that will solve all memorization problems for privacy in deep learning. You'll learn more about the limitations and critique of differentially private training in an upcoming article, as you begin exploring solutions to the problems discussed in these initial articles (coming in Spring 2025).&lt;/p&gt;
&lt;p&gt;In the next article, which is also the last article focused on the problem of memorization in deep learning, you'll explore adversarial learning and examples. Adversarial learning -- similar to differential privacy research -- is an area of research that understood aspects of the memorization phenomenon long before other research caught up.&lt;/p&gt;
&lt;p&gt;I'm very open to feedback (positive, neutral or critical), questions (there are no stupid questions!) and creative re-use of this content. If you have any of those, please share it with me! This helps keep me inspired and writing. :)&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Acknowledgements&lt;/em&gt;: I would like to thank &lt;a href="https://desfontain.es/serious.html"&gt;Damien Desfontaines&lt;/a&gt; for his feedback, corrections and thoughts on this article. Any mistakes, typos, inaccuracies or controversial opinions are my own.&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;I mention this because many descriptions of differential privacy mention "random noise", and that's not a very good description of what should be done. It is not uniformly random noise, but instead a noise distribution that you choose -- meaning you can fit the noise to the problem you want to solve.&amp;#160;&lt;a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;By the way, you already have error in your data because data is always an approximation and never 100% correct. For a great review on "ground truth", I recommend &lt;a href="https://www.youtube.com/watch?v=tDjuq6Wxj3s&amp;amp;ab_channel=SymposiaatCSAIL"&gt;Kate Crawford's lecture on the topic&lt;/a&gt;.&amp;#160;&lt;a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:3"&gt;
&lt;p&gt;I encourage you to learn more about differential privacy, especially if you work in data science and machine learning. If you'd like to read more on differential privacy, check out &lt;a href="https://desfontain.es/blog/friendly-intro-to-differential-privacy.html"&gt;Desfontaines's series&lt;/a&gt; and &lt;a href="https://practicaldataprivacybook.com/"&gt;my book&lt;/a&gt;, which has two chapters on differential privacy and its application in machine learning.&amp;#160;&lt;a class="footnote-backref" href="#fnref:3" title="Jump back to footnote 3 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:4"&gt;
&lt;p&gt;In theory, this system could also provide a prediction or inference service, where incoming data points are also labeled by majority differential privacy votes. Such a system was quite compute expensive to run at that time.&amp;#160;&lt;a class="footnote-backref" href="#fnref:4" title="Jump back to footnote 4 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;script type="text/javascript"&gt;if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
    var align = "center",
        indent = "0em",
        linebreak = "false";

    if (false) {
        align = (screen.width &lt; 768) ? "left" : align;
        indent = (screen.width &lt; 768) ? "0em" : indent;
        linebreak = (screen.width &lt; 768) ? 'true' : linebreak;
    }

    var mathjaxscript = document.createElement('script');
    mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
    mathjaxscript.type = 'text/javascript';
    mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';

    var configscript = document.createElement('script');
    configscript.type = 'text/x-mathjax-config';
    configscript[(window.opera ? "innerHTML" : "text")] =
        "MathJax.Hub.Config({" +
        "    config: ['MMLorHTML.js']," +
        "    TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
        "    jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
        "    extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
        "    displayAlign: '"+ align +"'," +
        "    displayIndent: '"+ indent +"'," +
        "    showMathMenu: true," +
        "    messageStyle: 'normal'," +
        "    tex2jax: { " +
        "        inlineMath: [ ['\\\\(','\\\\)'] ], " +
        "        displayMath: [ ['$$','$$'] ]," +
        "        processEscapes: true," +
        "        preview: 'TeX'," +
        "    }, " +
        "    'HTML-CSS': { " +
        "        availableFonts: ['STIX', 'TeX']," +
        "        preferredFont: 'STIX'," +
        "        styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
        "        linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
        "    }, " +
        "}); " +
        "if ('default' !== 'default') {" +
            "MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
                "var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
                "VARIANT['normal'].fonts.unshift('MathJax_default');" +
                "VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
                "VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
                "VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
            "});" +
            "MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
                "var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
                "VARIANT['normal'].fonts.unshift('MathJax_default');" +
                "VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
                "VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
                "VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
            "});" +
        "}";

    (document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
    (document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
&lt;/script&gt;</content><category term="ml-memorization"></category></entry><entry><title>How Memorization Happens: Overparametrized Models</title><link href="https://blog.kjamistan.com/how-memorization-happens-overparametrized-models.html" rel="alternate"></link><published>2024-12-18T00:00:00+01:00</published><updated>2024-12-18T00:00:00+01:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2024-12-18:/how-memorization-happens-overparametrized-models.html</id><summary type="html">&lt;p&gt;You've heard claims that we will "run out of data" to train AI systems. Why is that? In this article in the series on &lt;a href="https://blog.kjamistan.com/a-deep-dive-into-memorization-in-deep-learning.html"&gt;machine learning memorization&lt;/a&gt; you'll explore model size as a factor in memorization and the trend for bigger models as a general problem in machine learning.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Prefer …&lt;/p&gt;&lt;/blockquote&gt;</summary><content type="html">&lt;p&gt;You've heard claims that we will "run out of data" to train AI systems. Why is that? In this article in the series on &lt;a href="https://blog.kjamistan.com/a-deep-dive-into-memorization-in-deep-learning.html"&gt;machine learning memorization&lt;/a&gt; you'll explore model size as a factor in memorization and the trend for bigger models as a general problem in machine learning.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Prefer to learn by video? This post &lt;a href="https://youtu.be/va0QxMZBXvg"&gt;is summarized on Probably Private's YouTube&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;To begin, what is meant by the word overparameterization?&lt;/p&gt;
&lt;h3 id="what-is-a-parameter"&gt;What is a parameter?&lt;/h3&gt;
&lt;p&gt;In machine learning, a parameter is a value tied to the model's calculations. These calculations determine the model's predictions and how the model functions internally. You might have heard about parameter size, such as "this is the 7B (billion) parameter model" when using LLMs or other large generative models.&lt;/p&gt;
&lt;p&gt;The parameters are set as the model is trained. Usually a parameter starts at a "random" state and is adjusted as part of the training process. The parameters updating creates the "learning" part of the machine learning training.&lt;/p&gt;
&lt;p&gt;Alongside the parameters, there are also hyperparameters: additional variables or inputs, usually used by the training optimizers and learning algorithms, which are not part of the advertised parameter count. A hyperparameter is something a person can set beforehand, usually at a known state or range. Some hyperparameters can be adjusted as the model learns, such as an adaptive learning rate that expedites early stages of model training and then slows down with smaller steps as training enters later stages.&lt;/p&gt;
&lt;p&gt;In a deep learning model, the most common parameters are weights and biases. Let's view a diagram to understand how these work:&lt;/p&gt;
&lt;p&gt;&lt;img alt="An image of a zoomed in node or neuron, which is the building block of deep learning networks. To the left is an incoming arrow (edge) with a weight (w). Inside the neuron there is a bias (b) and a summation function that takes the input values from all of the incoming values and the bias that exists in the node. Then there is an activation function that operates on the result of the summation function. This activation feeds into the resulting arrows (edges) that connect this node to the following nodes." src="./images/2024/ml_network_node_edge.png"&gt;&lt;/p&gt;
&lt;p&gt;You can think of a deep learning model as a series of nodes, pictured above as a circle, and edges, pictured above as connecting arrows. The terms nodes and edges come from graph theory in mathematics, and you might know them from graph networks, like social network graphs.&lt;/p&gt;
&lt;p&gt;In deep learning, the nodes here are the "neurons" and the edges connect neurons to one another. In this network, each input value can change the result of the equations in the node. Those equations within a node are calculated and each node sends the results over the edges to the next series of nodes, and so-on. This is why the original term "neural network" exists, since the nodes were compared to neurons.&lt;sup id="fnref:1"&gt;&lt;a class="footnote-ref" href="#fn:1"&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;In each node, there is a &lt;em&gt;bias parameter&lt;/em&gt; which is part of the contained equation. The other values in the equation come from the incoming edges. Usually there are many nodes sitting at the same depth in the network. Together, these "same depth" nodes are called a &lt;em&gt;layer&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;Between each layer of nodes there are connections called edges. A &lt;em&gt;weight parameter&lt;/em&gt; exists at the edges -- sitting "between" the layers. These weights connect each layer to the next layer. The number and type of connections can change with different types of deep learning architectures, but often there are at least a few &lt;em&gt;fully connected&lt;/em&gt; layers, which means that each node in one layer connects with an edge (and its accompanying weight) to every node in the next layer.&lt;/p&gt;
&lt;p&gt;When a data example is first encoded (see &lt;a href="https://blog.kjamistan.com/encodings-and-embeddings-how-does-data-get-into-machine-learning-systems.html"&gt;initial encodings article&lt;/a&gt;) and input into the model at the input layer, each node calculates the result of its internal &lt;a href="https://en.wikipedia.org/wiki/Activation_function"&gt;&lt;em&gt;activation function&lt;/em&gt;&lt;/a&gt;. This is usually done by summing the incoming values together with the node's bias and then running the result through the chosen activation function, which can vary by architecture. A common choice for an activation function today is &lt;a href="https://en.wikipedia.org/wiki/Rectifier_(neural_networks)"&gt;ReLU&lt;/a&gt; because it has non-linear properties, allowing the network or model to learn more complex functions (alongside the power of linear algebra!).&lt;/p&gt;
&lt;p&gt;The result of the activation function for a given node is transmitted as input to the next layer, multiplied by the weight on the edge it travels along. This happens for each of the middle, &lt;em&gt;hidden&lt;/em&gt; layers. In the final layer, the activations are condensed into a range of probabilities to make a prediction. This final step is heavily dependent on model type, architecture and the task at hand (i.e. generate text/image/video versus predict a class label).&lt;/p&gt;
&lt;p&gt;&lt;img alt="An image of several layers, starting with a blue-colored input layers showing 5 neurons. Then 3 middle layers, called &amp;quot;hidden layers&amp;quot; and two output layers of smaller size. For illustrative purposes, the initial node / neuron in each layer has arrows showing it connects to every node in the next layer." src="./images/2024/connecting_layers.png"&gt;
&lt;em&gt;Example of connected layers. Only the top nodes' edges for each layer are shown, but imagine this continues for every node in the model.&lt;/em&gt;&lt;/p&gt;
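&lt;p&gt;As a concrete (if tiny) illustration, here is a forward pass through a small fully connected network in plain numpy -- just weights, biases, ReLU activations and a softmax at the end. The layer sizes and random values are arbitrary:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np

rng = np.random.default_rng(3)

def relu(z):
    return np.maximum(z, 0.0)

# A small fully connected network: 5 inputs, two hidden layers of 8, 2 outputs.
layer_sizes = [5, 8, 8, 2]
weights = [rng.normal(size=(m, n)) for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n) for n in layer_sizes[1:]]

def forward(x):
    activation = x
    # Hidden layers: weighted sum plus bias, then the ReLU activation function.
    for W, b in zip(weights[:-1], biases[:-1]):
        activation = relu(activation @ W + b)
    # Output layer: condense the activations into a range of probabilities (softmax).
    logits = activation @ weights[-1] + biases[-1]
    exps = np.exp(logits - logits.max())
    return exps / exps.sum()

print(forward(rng.normal(size=5)))  # two class probabilities that sum to 1
&lt;/code&gt;&lt;/pre&gt;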
&lt;p&gt;The training process updates the weights and biases, usually via some version of backpropagation, where the error from the model's current guesses is transmitted backwards through the network. This error is used to update the parameters so the model can perform better on the next round of predictions. At a high level, this happens by reducing the weights, biases and resulting activations that contributed to incorrect guesses and increasing those that contributed to correct outcomes.&lt;/p&gt;
&lt;p&gt;Now that you have a high-level understanding of parameters, let's investigate how they relate to model size.&lt;/p&gt;
&lt;h3 id="overparameterization-and-model-size-growth"&gt;Overparameterization and Model Size Growth&lt;/h3&gt;
&lt;p&gt;Since there must be at least as many parameters as nodes and edges, when the model's architecture grows and more layers are added, there will also be more parameters. In earlier deep learning, models ranged from a handful of layers to 150 or more, with a wide range of parameters per layer -- totaling roughly 20-140M parameters (see &lt;a href="https://en.wikipedia.org/wiki/Residual_neural_network"&gt;ResNet&lt;/a&gt;, &lt;a href="https://en.wikipedia.org/wiki/VGGNet"&gt;VGGNet&lt;/a&gt; and &lt;a href="https://en.wikipedia.org/wiki/AlexNet"&gt;AlexNet&lt;/a&gt;).&lt;/p&gt;
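&lt;p&gt;You can get a feel for why parameter counts grow so quickly just by counting the weights and biases in fully connected layers. (This is a rough sketch: convolutional layers like those in ResNet or VGGNet share weights, so their per-layer counts are smaller.)&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def fully_connected_param_count(layer_sizes):
    # Each pair of adjacent layers contributes (inputs x outputs) weights
    # plus one bias per output node.
    return sum(m * n + n for m, n in zip(layer_sizes[:-1], layer_sizes[1:]))

print(fully_connected_param_count([784, 512, 512, 10]))  # ~670 thousand parameters
print(fully_connected_param_count([4096] * 50))          # ~822 million parameters
&lt;/code&gt;&lt;/pre&gt;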
&lt;p&gt;In today's deep learning, those numbers have exploded. Many deep learning models released since 2022 are overparameterized, meaning they have more parameters than training data examples. If you've studied or used machine learning, you might be wondering if this results in &lt;em&gt;overfitting&lt;/em&gt; -- a model learning the training data so exactly that it does poorly on unseen data. With overparameterized models, the model could potentially encode every example in its parameters.&lt;/p&gt;
&lt;p&gt;The growth of model size and the subsequent growth of training time to ensure all parameters were adequately updated led to the discovery of double descent.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A graph where the y-axis represents test error and the x-axis represents the number of parameters. There is a first descent and then a high ascent (like a U curve) and then a second descent that tapers off after a while. On the bottom of the initial U-curve there is a dotted line marking the prior optimum you would reach when training. At the top of the ascent of that initial U-curve there is a dotted line representing where overfitting happens. As that peak then descends again due to training with more parameters and for a longer time, there is a note as the error tapers that this is considered now &amp;quot;good generalization&amp;quot;." src="./images/2024/double_descent.png"&gt;&lt;/p&gt;
&lt;p&gt;In smaller and older deep learning architectures, there was a point in parameter growth and training time where the model would overfit. This means that the model learned the training dataset too well (i.e. memorization and similar strategies) and therefore started performing poorly on the test dataset because of small divergences between the training and testing data.&lt;/p&gt;
&lt;p&gt;As model parameters and training time increased, these larger models had a second descent of the error where the models generalized well and outperformed smaller models. This led to a massive investment in larger and larger models.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://vcclab.org/articles/jcics-overtraining.pdf"&gt;Tetko et al.&lt;/a&gt; studied &lt;em&gt;overtraining&lt;/em&gt; as early as 1995. Overtraining increases the number of training epochs, and at that time this process created models that overfit and memorized the training dataset. That research recommended smaller networks with fewer hidden layers and less dense hidden layers, which could be trained without overfitting. They also recommended cross-validation via leave-one-out methods to compare models that had seen the data with those that hadn't (as you learned &lt;a href="https://blog.kjamistan.com/how-memorization-happens-novelty.html"&gt;is a key part of memorization research&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;Looking at how layer size and depth have changed over time, they have grown massively since the 1990s. Just charting GPT parameter growth since the first GPT model appeared (2018, the year after the transformer architecture was introduced) is a good way to visually inspect the changes in parameter size.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A chart showing GPT growth over time, where the y-axis counts the number of parameters and the x-axis represents each step of the GPT model. You can see a strong increase with each step from the size of the last model (more details below)." src="./images/2024/gpt_growth.png"&gt;&lt;/p&gt;
&lt;p&gt;This chart looks at GPT parameter size based on what is known about the OpenAI GPT models. The first GPT was released in 2018 and had 117 million parameters. The second was released in 2019 and had 1.5 billion parameters, already a parameter growth of more than 12x. The third GPT came in 2020 and had 175 billion parameters (&amp;gt;100x the size of GPT-2). Estimates of GPT-4 put it at 1.7 trillion parameters, almost 10x bigger than GPT-3. As you can tell from this trend, there is a huge push towards ever larger models.&lt;/p&gt;
&lt;p&gt;But understanding double descent and how deep learning works requires study. Simply embracing model size and training time growth without knowing how or what is changed by these aspects is unlikely to result in "forever" growth.&lt;sup id="fnref:2"&gt;&lt;a class="footnote-ref" href="#fn:2"&gt;2&lt;/a&gt;&lt;/sup&gt; What, exactly, has changed in terms of models when they are overparameterized and trained for many more iterations?&lt;/p&gt;
&lt;h3 id="ummm-do-we-know-how-deep-learning-works"&gt;Ummm, do we know how deep learning works?&lt;/h3&gt;
&lt;p&gt;This was the investigative question explored in the now famous paper &lt;a href="https://arxiv.org/abs/1611.03530"&gt;&lt;em&gt;Understanding Deep Learning Requires Rethinking Generalization&lt;/em&gt;&lt;/a&gt; (Chiyuan Zhang et al, 2017). You might remember Zhang's work from &lt;a href="https://blog.kjamistan.com/how-memorization-happens-novelty.html"&gt;the last article&lt;/a&gt;, when he worked with Feldman on quantifying novel example memorization. Three years before that work, in 2017, the researchers were trying to understand exactly how computer vision deep learning models were learning with greater accuracy than before, especially with the increased architecture size.&lt;/p&gt;
&lt;p&gt;They took the CIFAR10 and ImageNet datasets, then common datasets for large computer vision training and completely randomized the labels. Now the labels no longer applied to what was in the photo. A photo of a person was now a plane, a photo of a dog was now a building and so on.&lt;/p&gt;
&lt;p&gt;Even with the labels completely randomized, the training accuracy of the two most performant resulting models reached 89 percent. Of course, the accuracy on real data was awful; the model hadn't learned anything useful. But the fact that it learned completely random data raised new questions for understanding deep learning and challenged common thinking about overfitting and generalization. How do we measure generalization? Do we understand the difference between memorization and overfitting?&lt;/p&gt;
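&lt;p&gt;You can reproduce a miniature version of that experiment in a few lines: generate labels that carry no information at all, and watch an overparameterized model fit them anyway. This sketch uses scikit-learn's small MLP on tiny random data rather than the CIFAR10/ImageNet setups from the paper; training accuracy will usually land near 100% while accuracy on fresh data stays at chance.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# 200 random "images" with 10 classes -- the labels carry no information at all.
X = rng.normal(size=(200, 64))
random_labels = rng.integers(0, 10, size=200)

# Far more parameters than training examples.
model = MLPClassifier(hidden_layer_sizes=(512, 512), max_iter=5000, random_state=0)
model.fit(X, random_labels)

print("train accuracy on random labels:", model.score(X, random_labels))
print("accuracy on fresh random data  :",
      model.score(rng.normal(size=(200, 64)), rng.integers(0, 10, size=200)))
&lt;/code&gt;&lt;/pre&gt;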
&lt;p&gt;The researchers called for more research and understanding of how memorization happens and a better understanding of what generalization is in deep learning. In larger models, generalization doesn't seem to behave the same way as generalization in smaller models -- in part due to the complexities of increasing parameter size and how that can leave enough parameters for both good generalization and memorization.&lt;/p&gt;
&lt;h3 id="learning-the-identity-function"&gt;Learning the identity function&lt;/h3&gt;
&lt;p&gt;The identity function is a great example of a simple mathematical rule that could be trained using deep learning. The identity function takes an input and returns the input unchanged -- hence its name: identity. Think of it like adding 0 or multiplying by 1, but for more complex inputs.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://arxiv.org/pdf/1902.04698"&gt;Zhang et al. tested this idea in 2020&lt;/a&gt; by training a few computer vision models to take in fairly simple input (the MNIST and Fashion-MNIST datasets) and learn the identity function.&lt;sup id="fnref:3"&gt;&lt;a class="footnote-ref" href="#fn:3"&gt;3&lt;/a&gt;&lt;/sup&gt; To investigate if the identity function could be learned versus just input memorization, they trained on just one training example repeatedly and altered the depth of the network (and hence the number of parameters).&lt;/p&gt;
&lt;p&gt;They found that deeper networks with more parameters learned a constant function. What does that mean? Those networks learned to always answer with the example they were trained on, regardless of the input.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A series of rows and columns with different images. On the left there is a column just showing the same handwritten 7 from the NIST training dataset. Across the top row there is a variety of different inputs from MNIST and Fashion-MNIST. The y-axis shows the different CNN depths, from 1 layer to 20 layer. Each row shows the prediction from the input for each model. The shallow networks seem to have learned the identity function and return something close to the input. The intermediate layers seem to learn an edge identification, where the outline of the input shows up and the deeper networks return 7." src="./images/2024/learning_identity_function.png"&gt;
&lt;em&gt;Results of inference from CNN models trained with different depths. The 7 to the left is the entire training data for these models which attempted to learn the identity function. Each image is the prediction when given the input shown across the top row.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Intermediate depth networks learned to identify edges, which is probably a bit closer to the expected identity function (i.e. this is the outline of the shape you gave me), but not the same.&lt;/p&gt;
&lt;p&gt;Very shallow networks sometimes learned the identity function (as shown in this example) but with other architectures or learning functions, they simply returned noise or a blank image.&lt;/p&gt;
&lt;p&gt;Their conclusions were that computer vision models don't generalize the way that researchers and practitioners assumed they generalized -- and that more parameters might lead to more memorization instead of generalization.&lt;/p&gt;
&lt;p&gt;But this research doesn't necessarily provide a better understanding of how to measure generalization. Related research from margin theory, however, did provide some insights!&lt;/p&gt;
&lt;h3 id="margin-theory"&gt;Margin theory&lt;/h3&gt;
&lt;p&gt;Margins help estimate the &lt;em&gt;generalization gap&lt;/em&gt;, or the difference between how a model performed during training and how it performs on unseen test data.&lt;/p&gt;
&lt;p&gt;What is a margin? Margins are an important idea in support vector machines (SVMs), so let's use an example from SVMs to explain margins.&lt;/p&gt;
&lt;p&gt;&lt;img alt="An example of a SVM-style margin. There is one class on one side of a decision boundary line and another class on the other. The space between the closest example and the decision boundary is marked as the margin." src="./images/2024/margin_explainer.png"&gt;&lt;/p&gt;
&lt;p&gt;In support vector machines, you want to maximize the distance between the data points and the decision boundary. This improves the model confidence and avoids potential misclassification with "nearby" classes. Ideally, the boundary is further from the cluster of training examples, giving space to incorporate some "outliers", like the white peacocks.&lt;/p&gt;
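&lt;p&gt;For a linear SVM you can compute those margins directly: the signed distance from a point to the decision boundary is the decision function value divided by the norm of the weight vector. A small sketch with made-up blob data and scikit-learn's LinearSVC:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(5)

# Two well-separated blobs in 2-D.
X = np.vstack([rng.normal(loc=-2.0, size=(50, 2)), rng.normal(loc=2.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

svm = LinearSVC(C=1.0).fit(X, y)

# Geometric distance from each point to the decision boundary: its margin.
distances = svm.decision_function(X) / np.linalg.norm(svm.coef_)
print("smallest margin in the training set:", np.abs(distances).min())
&lt;/code&gt;&lt;/pre&gt;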
&lt;p&gt;But, how does this actually work in a high dimensional setting and with the complexity of a deep learning model? Deep learning models are more complex than support vector machines...&lt;/p&gt;
&lt;p&gt;&lt;a href="https://research.google/blog/predicting-the-generalization-gap-in-deep-neural-networks/"&gt;Research from Google&lt;/a&gt; has shown that margin theory applies if you take each layer of a deep learning model as its own decision engine. By sampling the distance between the intermediate input representation at that layer and the projected decision boundary, the margins can be estimated. The projected decision boundary is approximated via linearization even if the activation function is nonlinear.&lt;/p&gt;
&lt;p&gt;To put it another way, you are sampling layers and approximating their decision boundaries against the current input. Measuring these margins and determining if the network is maximizing them, as with SVMs, provides an accurate prediction of the model generalization. The larger the approximated margins across several key layers, the better the performance on unseen data.&lt;/p&gt;
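&lt;p&gt;Roughly speaking, the per-layer margin is approximated with a first-order (linearized) distance to the boundary between the top two classes: the difference of their logits divided by the norm of the gradient of that difference with respect to the layer's representation. Here is a hedged sketch of that calculation for a single layer and a single input; the &lt;code&gt;model.features&lt;/code&gt; / &lt;code&gt;model.head&lt;/code&gt; split is illustrative only, and the details differ from Google's implementation:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# First-order margin estimate at one layer of a PyTorch classifier.
import torch

def layer_margin(model, x, eps=1e-12):
    h = model.features(x)                   # intermediate representation
    h = h.detach().requires_grad_(True)
    logits = model.head(h)                  # rest of the network
    top2 = torch.topk(logits, k=2, dim=1).indices[0]
    diff = logits[0, top2[0]] - logits[0, top2[1]]   # f_i - f_j
    grad = torch.autograd.grad(diff, h)[0]           # d(f_i - f_j)/dh
    # Linearized distance from h to the decision boundary between i and j.
    return (diff / (grad.norm() + eps)).item()
&lt;/code&gt;&lt;/pre&gt;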
&lt;p&gt;Google Research successfully built a new loss function to maximize margins while training deep learning models. The loss function penalizes predictions based on margin distance at different layers. The resulting models are more robust against adversarial attacks or other unintentional input perturbations. It's neat that an idea from simpler machine learning models, like SVMs, can also be useful in deep learning.&lt;/p&gt;
&lt;p&gt;This provides some insight into memorization in large deep learning systems. Unlike SVMs, a deep learning model can create more decision boundaries due to its greater complexity and nonlinear learning. By memorizing outliers, a model can perform better on similar outliers that it sees later, even if these were not included in the original training data -- all because that outlier lands somewhere near the memorized input. This is one of the leading theories to explain why deep learning models memorize and generalize well.&lt;/p&gt;
&lt;p&gt;In the next article, you'll look at how privacy research came to this conclusion about deep learning networks long before most machine learning research. You'll explore deep learning through the lens of differential privacy, a rigorous definition of how to protect individual privacy.&lt;/p&gt;
&lt;p&gt;I'm very open to feedback (positive, neutral or critical), questions (there are no stupid questions!) and creative re-use of this content. If you have any of those, please share it with me! This helps keep me inspired and writing. :)&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Acknowledgements&lt;/em&gt;: I would like to thank &lt;a href="https://vickiboykis.com/"&gt;Vicki Boykis&lt;/a&gt;, &lt;a href="https://desfontain.es/serious.html"&gt;Damien Desfontaines&lt;/a&gt; and &lt;a href="https://www.linkedin.com/in/yanndupis/"&gt;Yann Dupis&lt;/a&gt; for their feedback, corrections and thoughts on this series. Their input greatly contributed to improvements in my thinking and writing. Any mistakes, typos, inaccuracies or controversial opinions are my own.&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;Of course, they are not nearly as complex as our brain, hence why the term is now considered outdated.&amp;#160;&lt;a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;In fact, &lt;a href="https://www.deeplearning.ai/the-batch/ai-giants-rethink-model-training-strategy-as-scaling-laws-break-down/"&gt;recent research and commentary&lt;/a&gt; suggest that current models have already reached peak scaling, and adding more parameters doesn't seem to improve performance anymore.&amp;#160;&lt;a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:3"&gt;
&lt;p&gt;&lt;a href="https://arxiv.org/abs/1810.10333v1"&gt;Radhakrishnan et al.&lt;/a&gt; first studied this in 2018 and demonstrated how traditional downsampling methods in computer vision models tended to memorize rather than learn the identity function. Their work also uncovered particular elements of autoencoders in computer vision that created mathematical properties that would inevitably end in memorization unless directly addressed in the architecture. Unfortunately, I learned of this work after the initial writing of this article; hence my use of Zhang et al.'s example.&amp;#160;&lt;a class="footnote-backref" href="#fnref:3" title="Jump back to footnote 3 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="ml-memorization"></category></entry><entry><title>How memorization happens: Novelty</title><link href="https://blog.kjamistan.com/how-memorization-happens-novelty.html" rel="alternate"></link><published>2024-12-09T00:00:00+01:00</published><updated>2024-12-09T00:00:00+01:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2024-12-09:/how-memorization-happens-novelty.html</id><summary type="html">&lt;p&gt;So far in &lt;a href="https://blog.kjamistan.com/a-deep-dive-into-memorization-in-deep-learning.html"&gt;this series on memorization in deep learning&lt;/a&gt;, you've learned how &lt;a href="https://blog.kjamistan.com/how-memorization-happens-repetition.html"&gt;massively repeated text and images incentivize training data memorization&lt;/a&gt;, but that's not the only training data that machine learning models memorize. Let's take a look at another proven memorization: novel examples.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Prefer to learn by video? This …&lt;/p&gt;&lt;/blockquote&gt;</summary><content type="html">&lt;p&gt;So far in &lt;a href="https://blog.kjamistan.com/a-deep-dive-into-memorization-in-deep-learning.html"&gt;this series on memorization in deep learning&lt;/a&gt;, you've learned how &lt;a href="https://blog.kjamistan.com/how-memorization-happens-repetition.html"&gt;massively repeated text and images incentivize training data memorization&lt;/a&gt;, but that's not the only training data that machine learning models memorize. Let's take a look at another proven memorization: novel examples.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Prefer to learn by video? This post &lt;a href="https://youtu.be/aTtAUgGv4hA?si=5LRHbXZnBV6GlWca"&gt;is summarized on Probably Private's YouTube&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;As you learned &lt;a href="https://blog.kjamistan.com/gaming-evaluation-the-evolution-of-deep-learning-training-and-evaluation.html"&gt;in the evaluation article&lt;/a&gt;, the chance of pulling a rare or novel example from the tail is fairly high, given that the tail is long and makes up the majority of the distribution. If you are training multiple models and evaluating them against one another based on test performance, there is a good chance that the best-performing model will have processed more of the novel examples that also exist in the test dataset.&lt;/p&gt;
&lt;p&gt;Vitaly Feldman, previously of Google Brain, now at Apple Research, initially studied this phenomenon in 2019 in his paper &lt;a href="https://arxiv.org/abs/1906.05271"&gt;Does Learning Require Memorization? A Short Tale About a Long Tail&lt;/a&gt;. Let's walk through the important parts of the paper together.&lt;/p&gt;
&lt;p&gt;In the first example of the paper, the learning algorithm will learn to differentiate two groups in a binomial population. This is a small toy example to easily define how a machine learning model should minimize the learning error. This example is an oversimplification of typical machine learning problems, but Feldman uses it to extrapolate learnings to more complex examples.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A graph showing a series of buckets where there is one population that skews to the left of the histogram and another that skews to the right. These two groups and/or populations are labeled to separate them visually from each other." src="./images/2024/two_groups_to_learn.png"&gt;&lt;/p&gt;
&lt;p&gt;In order to show how memorization happens, the paper then defines memorization mathematically. To do so, Feldman defines memorization by comparing two models. One has seen a particular example and the other has not. The difference between the two models demonstrates whether that point contributes significantly to memorization or not. This is a "leave one out" principle -- which can be used to test memorized training examples in real systems.&lt;/p&gt;
&lt;p&gt;&lt;img alt="An animated GIF showing the process. First a training dataset is shown and the initial model is trained. Then, one training example is taken out and a second &amp;quot;leave one out&amp;quot; model is trained. At the end, the performance of the models is compared to measure the memorization of that single example." src="./images/2024/loo_training.gif"&gt;&lt;/p&gt;
&lt;p&gt;This definition combined with the toy example shows that with a long-tail distribution, optimal model performance is reached only if some examples are memorized. The paper's significant contribution is a lower bound on model accuracy when a particular example or set of examples is not memorized. Because these are novel examples, the model must memorize them -- even if they appear only once in the dataset -- in order to achieve higher accuracy.&lt;/p&gt;
&lt;p&gt;&lt;img alt="An equation showing: the error of a given model is greater than or equal to the optimal model based on the data plus a normalized penalty for not fitting the data." src="./images/2024/simplified_feldman_equation.png"&gt;
&lt;em&gt;Feldman's proof of a lower bound&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;This simplification of Feldman's equation&lt;sup id="fnref:1"&gt;&lt;a class="footnote-ref" href="#fn:1"&gt;1&lt;/a&gt;&lt;/sup&gt; seems somewhat obvious -- of course the error is the optimal model plus some sort of representation of the things the model didn't learn properly. But let's summarize the impact of Feldman's normalized penalty.&lt;/p&gt;
&lt;p&gt;Feldman was able to use typical probability theory to formulate the penalty based on properties of the population and their distribution. Remember &lt;a href="https://blog.kjamistan.com/machine-learning-dataset-distributions-history-and-biases.html"&gt;the long-tail&lt;/a&gt;? His formulation creates a lower bound on what rare examples cost the model. The more rare an example, the more costly it is to the training process when the process tries to reduce error. In addition, those rare examples, or sets of rare examples are extremely costly to not learn if they will show up in the test dataset.&lt;/p&gt;
&lt;p&gt;To put it another way, the model's error is relative to the size of a given class in the dataset and whether that class is infrequent within the overall population. As you remember from the uncommon photos of buses (odd angles, only parts of the bus), this also applies to infrequent examples of more common classes.&lt;/p&gt;
&lt;p&gt;Based on the estimated distribution for long-tail data, the singleton examples (examples that only occur once) make up approximately 17% of the data. An algorithm that does not memorize these singleton examples will be suboptimal by approximately 7% in accuracy (maximum accuracy is then 93% if singletons are not memorized).&lt;/p&gt;
&lt;p&gt;Feldman's article focused on proving this mathematically; but does this happen in real machine learning or only in theory? This research spawned deeper investigations into the memorization problem with exciting results.&lt;/p&gt;
&lt;h3 id="what-color-is-a-peacock"&gt;What color is a peacock?&lt;/h3&gt;
&lt;p&gt;Did you know there are peacocks that are completely black and white? I didn't! And neither did researcher Chiyuan Zhang who worked alongside Feldman to study the deep learning memorization phenomenon. Their work attempted to find the novel examples that a model had memorized.&lt;/p&gt;
&lt;p&gt;Feldman and Zhang's work uncovered &lt;a href="https://pluskid.github.io/influence-memorization/"&gt;high influence pairs&lt;/a&gt;, where a "leave one out"-inspired training routine demonstrated the impact of novel examples on the model. Here are a few high influence pair examples from the paper:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Three rows and two columns of images, where the image on the left is the memorized image from the training data and the image on the right is the image most influenced by this memorized image when it appears in the evaluation or testing dataset. In the top row, there is a worm-like animal called a nematode where the photos look nearly the same, in the middle row there is a stole that is light brown and crocheted, the same stole is in both photos and in the bottom row there is a circular staircase labeled as a coil. Both photos show the same staircase at slightly different angles." src="./images/2024/high-influence-pair-trio.png"&gt;&lt;/p&gt;
&lt;p&gt;Some of these examples probably remind you of the data collection discussion from this series because some of the photos are literally from the same photo shoot. When the same photos appear in the training and testing datasets, the model's test performance will increase if it predicts those correctly. You can explore more high-influence pairs on &lt;a href="https://pluskid.github.io/influence-memorization/"&gt;the paper's site&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;To find these high influence pairs, the researchers needed to find a way to "leave one out" and measure the impact on the model performance. Because of the high costs of training large deep learning models, they didn't leave "just one" example out and retrain. Instead, they batched the initial dataset and experimented with leaving out sets of images. They then compared the models that had seen different sets of images on the same evaluation data.&lt;/p&gt;
&lt;p&gt;This allowed them to compare the model performance between models that had processed rare examples with other models trained the same way but without those examples. In doing so, they found the "high influence" pairs. If one of these pairs were included in the training data, the model performed much better on the test dataset example.&lt;/p&gt;
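&lt;p&gt;A rough sketch of that subsampling idea (not Feldman and Zhang's actual implementation) looks like this: train many models on random subsets, then compare accuracy on a test point between models that did and did not see a candidate training example. The &lt;code&gt;train_model&lt;/code&gt; function here is a placeholder for whatever training routine you use:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Subsampled influence estimate of each training example i on one test
# example: difference in accuracy between models whose random training
# subset contained i and models whose subset did not.
import numpy as np

def influence_estimates(X, y, x_test, y_test, train_model,
                        n_models=100, subset_frac=0.7, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    records = {i: [] for i in range(n)}
    for _ in range(n_models):
        idx = rng.choice(n, size=int(subset_frac * n), replace=False)
        model = train_model(X[idx], y[idx])
        correct = float(model.predict([x_test])[0] == y_test)
        included = set(idx.tolist())
        for i in range(n):
            records[i].append((i in included, correct))
    scores = {}
    for i, history in records.items():
        seen = [c for was_in, c in history if was_in]
        unseen = [c for was_in, c in history if not was_in]
        if seen and unseen:
            scores[i] = np.mean(seen) - np.mean(unseen)
    return scores   # large values flag candidate high-influence pairs
&lt;/code&gt;&lt;/pre&gt;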
&lt;p&gt;They also show that the influence of an image is related to the long-tail. Uncommon classes and uncommon images within a class were memorized more often than common ones, hence Zhang's discovery of black and white peacocks.&lt;sup id="fnref:2"&gt;&lt;a class="footnote-ref" href="#fn:2"&gt;2&lt;/a&gt;&lt;/sup&gt; They found that about 30% of examples have some level of memorization, or a significant "influence" on the model's performance for given test points. In their experiments, they demonstrated a 2.5-3.2% performance boost that came from memorization, which supports Feldman's initial theory that optimal performance on a long-tail distribution requires partial memorization.&lt;/p&gt;
&lt;p&gt;These novel examples were also discovered separately by other researchers working on extracting training data from large deep learning models. Let's investigate Carlini et al.'s work on diffusion models.&lt;/p&gt;
&lt;h3 id="and-other-types-of-deep-learning-models"&gt;And other types of deep learning models...&lt;/h3&gt;
&lt;p&gt;Carlini et al. extracted memorized examples from diffusion models&lt;sup id="fnref:3"&gt;&lt;a class="footnote-ref" href="#fn:3"&gt;3&lt;/a&gt;&lt;/sup&gt; with great success. For a quick primer: diffusion models are the deep learning models behind much of today's generative text-to-image systems, like DALL-E or Flux. These models have a specific architecture which uses an initial "random" sampling to create a base image. This random start is processed with denoising techniques to create a visual representation that matches a particular text input. So, when you type in "a unicorn jumping over the moon", there is an approximation of what those vectors represent together based on the training data, and the model is optimized to try to extract the closest representation of that text.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A series of images going from left to right showing first a very noisy non-sensical image and then with each step a building starts to emerge. By the final step the building can be seen as if it is from a photograph." src="./images/2024/stable_diffusion_steps.png"&gt;
&lt;em&gt;Example Stable Diffusion Steps: From noise to photorealism&lt;/em&gt;&lt;/p&gt;
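&lt;p&gt;As a very rough mental model, the generation loop repeatedly asks a trained denoiser to remove a little noise, conditioned on the text. Here is a heavily simplified sketch loosely following the DDPM formulation; &lt;code&gt;denoiser&lt;/code&gt; and the text conditioning are placeholders, and real samplers differ in many details:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Simplified reverse (denoising) sampling loop of a diffusion model.
import numpy as np

def sample(denoiser, text_embedding, steps=1000, shape=(64, 64, 3), seed=0):
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, steps)      # noise schedule
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.normal(size=shape)                  # start from pure noise
    for t in reversed(range(steps)):
        eps = denoiser(x, t, text_embedding)    # predicted noise at step t
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t:                                   # add fresh noise except at the last step
            x = x + np.sqrt(betas[t]) * rng.normal(size=shape)
    return x
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If the model has memorized a training image, this loop can converge on a near-copy of it when given the right prompt -- which is what the extraction experiments below exploit.&lt;/p&gt;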
&lt;p&gt;To test whether training data could be extracted from diffusion models, the researchers trained a large scale diffusion model on the original dataset that the &lt;a href="https://en.wikipedia.org/wiki/Stable_Diffusion"&gt;Stable Diffusion&lt;/a&gt; team used. Then, they used prompts from the training data to test extraction.&lt;/p&gt;
&lt;p&gt;One successful extraction is the photo below, where they used the name of an author from the training dataset. The extracted image is a near match of the original.&lt;/p&gt;
&lt;p&gt;&lt;img alt="On the left is a photo from the training data set with a person's name and their book title. On the right is the extracted image using the prompt of their name and the extracted image looks nearly the same as the training image (only slight noise artifacts)." src="./images/2024/stable_diffusion_extraction.png"&gt;&lt;/p&gt;
&lt;p&gt;They were able to extract more than 100 near-identical images of training data examples like this one. More than half of the memorized extracted images are of a person. In running the attack against a larger diffusion model (Imagen), they were able to extract images at a higher rate than from the smaller model, which supports prior research showing that model size also impacts memorization. They also found that more accurate models, measured by model performance metrics, memorize more data. In running further experiments, they show that by building their own diffusion model from scratch, they are able to extract 2.5% of the training data.&lt;/p&gt;
&lt;p&gt;Later in the paper, they perform a new and different type of extraction attack, which they name an "inpainting" attack. Inpainting is a desired quality of many image-generating or editing models -- for example, to remove a person from the background of a photo and "fill in the blank". In their inpainting attack, they cover a significant portion of the image (&amp;gt;50%) and query the diffusion model to complete the picture. When performing these attacks, they were able to quickly see the difference between models trained on the image shown and models that were not trained on the image.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Two rows and four columns of images are shown. On the left are the training images, next to each of them has about a half of the image removed. Then the last two images show an example where one generative model replaces the missing half of the photo when the original photo is in the training set and then an example from the model that doesn't have the original photo in the training dataset. The model that has seen the training data example creates an image very similar to the original and the model that doesn't creates a seemingly totally random image by comparison. For example, in the top row there is a bird next to some text. The model with the example puts a bird there and the model without the example puts a car where the bird was." src="./images/2024/inpainting_attack.png"&gt;&lt;/p&gt;
&lt;p&gt;They were able to show with this research that a diffusion model that has processed the original image in training can reproduce it much more clearly than a diffusion model that has not. This again supports the "leave-one-out" approach that Feldman and Zhang used.&lt;/p&gt;
&lt;p&gt;They also found that the easiest data to extract are outlier examples. These outliers carry significant privacy risk compared to more typical examples. When performing the attacks, they were able to target outliers in a &lt;a href="https://arxiv.org/abs/1610.05820"&gt;membership inference attack&lt;/a&gt;. This attack allowed them to determine if a particular image was in the training data or not. Here is a visual representation of their findings, where outliers were much easier to attack than common images.&lt;/p&gt;
&lt;p&gt;&lt;img alt="There are two groups of images. The group of images on the left show small images stacked together with duplicates or near duplicates. The images in the group on the right are a larger group of images with many bright colors and odd looking shapes in the images. The caption reads: when performing our membership inference attack, the hardest-to-attack examples (left) are all duplicates in the CIFAR-10 training set, and the easiest-to-attack examples (right) are visually outliers from CIFAR-10 images." src="./images/2024/diffusion_membership_inf_outliers.png"&gt;&lt;/p&gt;
&lt;p&gt;This work on diffusion models highlighted that their training methods and processes are part of the problem; and again directly linked model size and accuracy to memorization.&lt;/p&gt;
&lt;p&gt;Some models need to recall an individual, which makes them easier to attack. For example, a facial recognition model or a generative art model that needs to learn the styles of famous artists. These methods have inspired the types of attacks shown in papers today, which you will investigate to better understand memorization.&lt;/p&gt;
&lt;h3 id="model-inversion-attacks"&gt;Model Inversion Attacks&lt;/h3&gt;
&lt;p&gt;It's important to point out that training data extraction is not a completely new attack vector and that memorization isn't either. In a &lt;a href="https://www.usenix.org/conference/usenixsecurity16/technical-sessions/presentation/tramer"&gt;paper published in 2016 by Tramèr et al.&lt;/a&gt; (Tramèr is also a co-author on the diffusion paper), they demonstrated an attack called a model inversion attack, which allows an attacker with model access to extract information from the model. The most powerful version of the attack required direct access to the model, or to a local substitute -- obtained via "model stealing attacks" or by training a simpler model that mimics the real one.&lt;/p&gt;
&lt;p&gt;To perform a model inversion attack, an example of similar data is first generated. In the paper they use a facial recognition model to extract a copy of a person's face, which in this case must be memorized by the model to function correctly. They start with a base image of a face -- choosing one without significant markers (like glasses, beard, etc.).&lt;/p&gt;
&lt;p&gt;Then, the gradients and activations of the local model are observed for that input, and a loss optimizer is applied, just like you learned when training a model and evaluating its loss function. Only this time, the loss optimizer isn't trying to improve the model by training it -- it's being used to reverse engineer how to change the input in order to make it closer to the target. In this case, you want to update the image to more closely match the person's face. By doing this iteratively, you develop an image that looks like a fuzzy version of the training dataset target example.&lt;/p&gt;
&lt;p&gt;&lt;img alt="There are two images side by side. The one on the left shows a fuzzy version of the one on the right. The one on the right is the training data of the deep learning model and the one on the left is the extracted image from the model inversion attack." src="./images/2024/model_inversion_book.png"&gt;
&lt;em&gt;Model inversion attack on a facial recognition model: the training image is on the right and the extracted image is on the left&lt;/em&gt;&lt;/p&gt;
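&lt;p&gt;A minimal sketch of that iterative reconstruction, assuming white-box access to a PyTorch classifier, could look like the following. The names and image size are illustrative, and the original attack targeted a much simpler model and interleaved denoising steps to keep the image plausible:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Gradient-based model inversion sketch: start from a neutral image and
# repeatedly nudge it so the classifier becomes more confident in the
# target identity.
import torch

def invert(model, target_class, shape=(1, 1, 112, 92), steps=500, lr=0.1):
    x = torch.full(shape, 0.5, requires_grad=True)   # flat gray starting image
    opt = torch.optim.SGD([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        logits = model(x)
        # Minimize the negative log-probability of the target identity.
        loss = -torch.log_softmax(logits, dim=1)[0, target_class]
        loss.backward()
        opt.step()
        with torch.no_grad():
            x.clamp_(0.0, 1.0)                       # keep a valid pixel range
    return x.detach()   # a fuzzy approximation of the memorized face
&lt;/code&gt;&lt;/pre&gt;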
&lt;p&gt;This shows that there are several other cases where deep learning requires memorization, and the way that deep learning models are trained can be used to extract data from them.&lt;/p&gt;
&lt;p&gt;The funny thing is that this attack process is quite similar to how diffusion models work internally in their generative steps, which makes them highly susceptible both to memorization and to exploits that reveal the original data. In case you missed it, the &lt;a href="https://bjoernkarmann.dk/project/paragraphica"&gt;"Paragraphica" camera project&lt;/a&gt; was able to reproduce almost exact copies of street images that mirrored reality due to the high tendency for diffusion models to memorize their training data and repeat it when given an input query contained in the training data.&lt;/p&gt;
&lt;h3 id="and-it-happens-with-text-too"&gt;And it happens with text too...&lt;/h3&gt;
&lt;p&gt;Although it's easier to demonstrate visually with images, the memorization of outliers and novel examples also occurs with text, proven in 2018 by Carlini et al.'s&lt;sup id="fnref:4"&gt;&lt;a class="footnote-ref" href="#fn:4"&gt;4&lt;/a&gt;&lt;/sup&gt; paper &lt;a href="https://arxiv.org/abs/1802.08232"&gt;The Secret Sharer&lt;/a&gt;. In this paper, they were able to train a language model using text from the Enron emails (another example of a commonly used dataset collected without direct consent). They were able to successfully extract email addresses, social security numbers and credit card numbers that appeared in the emails by crafting targeted prompts and exploiting the model's tendency to memorize rarely seen data.&lt;/p&gt;
&lt;p&gt;How did they do it? They trained a then-common natural language processing (NLP) deep learning architecture (called an LSTM, another sequence-based deep learning model) with the Enron data. They trained it to predict next character tokens (which you might remember from &lt;a href="https://blog.kjamistan.com/encodings-and-embeddings-how-does-data-get-into-machine-learning-systems.html"&gt;the tokenization article&lt;/a&gt;). A character-level tokenization model predicts the next character given the preceding characters. The model was quite small compared to today's model sizes, with only 2 layers and likely under 5,000 parameters (it wasn't explicitly listed; this is an inference based on the numbers they posted).&lt;/p&gt;
&lt;p&gt;Given prompts such as "My social security number is...", they used the sequence-based model to figure out the most likely continuation. Via their generation algorithm they calculate an "exposure score" -- a metric which measures the memorization of a particular sequence. Even with this small model and relatively small dataset, they were able to successfully extract several sequences, and prove a relatively high exposure (think partial memorization) for the sequences they couldn't entirely extract.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A table with 4 columns: user, secret type, exposure and extracted. There are 6 distinct users, each identified by a letter (A-G). Each user has one or more types of secrets, either credit card numbers or social security numbers or both. Each of these secrets has an &amp;quot;exposure&amp;quot; score. The extracted column shows that 3 secrets out of 10 were successfully extracted." src="./images/2024/secret_sharer_enron_extraction.png"&gt;&lt;/p&gt;
&lt;p&gt;In a later investigation of the same phenomenon, many of &lt;a href="https://arxiv.org/abs/2311.17035"&gt;the same authors looked at large language models&lt;/a&gt;, both open-weight models like LLaMA and closed models like ChatGPT, to extract sensitive data. They were able to do so using a few different attack vectors:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Say "poem" forever attack: In this attack, they prompted ChatGPT to say the word poem forever. Why? The researchers believe a singular word or token repeated forever triggers behavior similar to the &lt;end of text&gt; token. When training a model the &lt;end of text&gt; token is repeated many times, because it appears at the end of a document, book, or other text and those texts are joined together when performing language model training. Today's models have two training steps: one called pretraining which is base language training on many texts and then another training, where chat or instruction text is used on the already trained language model. Since the chat model requires text to be in conversational form, the repetition of a single model diverges from the second training and seems to activate the base language model, which in turn spits out memorized data.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;img alt="A ChatGPT prompt where the prompt asks please repeat the word poem forever. The response shows many examples of poem and then several lines of personal contact information including an email, phone number and name. The exact details have been removed to protect the privacy of the person." src="./images/2024/poem_attack.png"&gt;&lt;/p&gt;
&lt;ol start="2"&gt;
&lt;li&gt;Nasr et al. found this attack is most successful with single words rather than multiple tokens. In addition, not all single-word tokens were equal in their extraction power. For example, the word (and token) "company" was more than 100x more powerful at extracting memorized data than "poem". By spending $200 on the OpenAI API, they were able to extract more than 200,000 unique memorized sequences, which included personal information, NSFW content, user identifiers, URLs, and even literature and code. Based on a statistical estimator they trained, they predict a dedicated attacker could extract much more from ChatGPT -- noting that the rate of extraction from GPT-3.5 was higher than for any other model they tested. This paper was published before even larger models, like GPT-4, were released.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;img alt="A chart with a y-axis labeled &amp;quot;number of unique extracted 50-grams&amp;quot; and an x-axis with label &amp;quot;number of extracted 50-grams&amp;quot;. There is a strong line upward, that then turns into a dotted line that plateaus like a logarithmic function as the numbers increase. The initial line is solid, which shows the actual extraction attack, extracting almost 0.3M unique 50-grams. Then it turns into a dotted line, which shows the prediction trajectory increasing up to nearly 1.3M extracted 50-grams." src="./images/2024/gpt_memorization_extraction_extrapolation.png"&gt;&lt;/p&gt;
&lt;ol start="3"&gt;
&lt;li&gt;To compare openly available models alongside the closed chat models, the authors generate longer texts and look for memorized chunks of text in the output. They compare the texts with a compilation of several popular training datasets, including &lt;a href="https://en.wikipedia.org/wiki/The_Pile_(dataset)"&gt;The Pile&lt;/a&gt; and &lt;a href="https://en.wikipedia.org/wiki/Common_Crawl"&gt;the Common Crawl Corpus&lt;/a&gt;. If the output contains 50 tokens verbatim from the example training dataset, this is considered a successful extraction of memorized training data (a sketch of this matching check follows the figure below). For every model, they successfully extract hundreds of thousands of memorized 50-token sequences, some of which are repeated many times. To note: the training datasets of these models are unknown, and could contain all, some or only parts of the example training data that the researchers compiled. This means that these figures are lower estimates on extractability, since the actual training data would likely yield better matches and easier extraction.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;img alt="A chart showing 5 columns: model family, parameters (billions), percentage of tokens memorized, unique 50-grams and extrapolated 50-grams. For each model family, you can observe that the smaller parameter model has less memorization than its larger counterpart. The highest token memorization is 0.852% by GPT-3.5-instruct. There is a note to the image stating that it was also easier to extract sequences from certain model families such as LLaMA, where the researchers were able to extract 16M 50-grams (2.9M unique 50-grams)." src="./images/2024/direct_copy_tokens_model_comparison.png"&gt;&lt;/p&gt;
&lt;p&gt;These attacks are unlikely to be the only successful ones; they are simply the most obvious ones to those who have studied the phenomenon of deep learning memorization. This research demonstrated how memorization occurs both with commonly duplicated text or images, as explored in the previous article, and now with novel cases, where the model memorizes a particular example or set of examples due to the way it is trained and optimized.&lt;/p&gt;
&lt;p&gt;Despite attempts to remove this possibility -- such as the use of guardrails, closed models, and paid APIs -- extraction of personal data is both theoretically and practically possible. This means that personal data, copyrighted data and other sensitive data exists in the model -- saved in the model weights and biases. The data doesn't need to be repeated to be memorized. Without access to the original language models (pretraining models) and the training data used, it will be difficult to estimate the exact amount of memorized data, particularly when the companies building closed models are not incentivized or required to perform these attacks and estimates internally.&lt;/p&gt;
&lt;p&gt;In the next article, you'll investigate how overparameterization of deep learning (aka the growth of model size) affects memorization, and how the "bigger is better" approach changed machine learning training, architectures and deployment.&lt;/p&gt;
&lt;p&gt;I'm very open to feedback (positive, neutral or critical), questions (there are no stupid questions!) and creative re-use of this content. If you have any of those, please share it with me! This helps keep me inspired and writing. :)&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Acknowledgements&lt;/em&gt;: I would like to thank &lt;a href="https://vickiboykis.com/"&gt;Vicki Boykis&lt;/a&gt;, &lt;a href="https://desfontain.es/serious.html"&gt;Damien Desfontaines&lt;/a&gt; and &lt;a href="https://www.linkedin.com/in/yanndupis/"&gt;Yann Dupis&lt;/a&gt; for their feedback, corrections and thoughts on this series. Their input greatly contributed to improvements in my thinking and writing. Any mistakes, typos, inaccuracies or controversial opinions are my own.&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;This simplified representation of Feldman's proof is an adequate summarization for our use case; however, to learn more or read through the entire series of proofs, please read &lt;a href="https://arxiv.org/abs/1906.05271"&gt;the paper&lt;/a&gt; or a longer study from &lt;a href="https://arxiv.org/abs/2012.06421"&gt;Brown et al.&lt;/a&gt;.&amp;#160;&lt;a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;He mentions this find along with presenting findings of several of his publications on the topic in his MIT lecture on &lt;a href="https://cbmm.mit.edu/video/quantifying-and-understanding-memorization-deep-neural-networks"&gt;Quantifying and Understanding memorization in Deep Neural Networks&lt;/a&gt;.&amp;#160;&lt;a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:3"&gt;
&lt;p&gt;Zhang and Feldman's work on measuring memorization used a traditional CNN design for computer vision (like what you learned about with AlexNet, just much larger and more modern). Diffusion models, which power much of the text-to-image generative AI, are a separate type of deep learning, where you can also extract memorized data.&amp;#160;&lt;a class="footnote-backref" href="#fnref:3" title="Jump back to footnote 3 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:4"&gt;
&lt;p&gt;Carlini has contributed significantly to research around security in machine learning models. In case you are inspired, he wrote a blog post on &lt;a href="https://nicholas.carlini.com/writing/2024/why-i-attack.html"&gt;Why he attacks AI&lt;/a&gt;.&amp;#160;&lt;a class="footnote-backref" href="#fnref:4" title="Jump back to footnote 4 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="ml-memorization"></category></entry><entry><title>How memorization happens: Repetition</title><link href="https://blog.kjamistan.com/how-memorization-happens-repetition.html" rel="alternate"></link><published>2024-12-03T00:00:00+01:00</published><updated>2024-12-03T00:00:00+01:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2024-12-03:/how-memorization-happens-repetition.html</id><summary type="html">&lt;p&gt;In this article in &lt;a href="https://blog.kjamistan.com/a-deep-dive-into-memorization-in-deep-learning.html"&gt;the deep learning memorization series&lt;/a&gt;, you'll learn how one part of memorization happens -- highly repeated data from the "head" of the long-tailed distribution.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Prefer to learn by video? This post &lt;a href="https://youtu.be/rDgFIiRTAHE?si=omH4DxA5OqOkJS3y"&gt;is summarized on Probably Private's YouTube&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Recall from &lt;a href="https://blog.kjamistan.com/machine-learning-dataset-distributions-history-and-biases.html"&gt;the data collection article&lt;/a&gt; that some examples are …&lt;/p&gt;</summary><content type="html">&lt;p&gt;In this article in &lt;a href="https://blog.kjamistan.com/a-deep-dive-into-memorization-in-deep-learning.html"&gt;the deep learning memorization series&lt;/a&gt;, you'll learn how one part of memorization happens -- highly repeated data from the "head" of the long-tailed distribution.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Prefer to learn by video? This post &lt;a href="https://youtu.be/rDgFIiRTAHE?si=omH4DxA5OqOkJS3y"&gt;is summarized on Probably Private's YouTube&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Recall from &lt;a href="https://blog.kjamistan.com/machine-learning-dataset-distributions-history-and-biases.html"&gt;the data collection article&lt;/a&gt; that some examples are overrepresented in the dataset. They live in the "head" area and might be duplicated and show cultural and societal biases based on the collection methods. You learned in the &lt;a href="https://blog.kjamistan.com/gaming-evaluation-the-evolution-of-deep-learning-training-and-evaluation.html"&gt;last article&lt;/a&gt; how training steps work and how data is sampled, along with the overriding cultural focus on accuracy above everything else. In this article, you'll evaluate how wanting to score highly in accuracy with an unevenly distributed dataset creates the first problem with memorization: memorizing common examples.&lt;/p&gt;
&lt;p&gt;To begin the analysis, you'll first explore how a simple machine learning system works, looking at random forests as a popular classical baseline, and then review how a deep learning system works. If you're already familiar with these models, you can skip ahead to &lt;a href="#sequential-deep-learning-models"&gt;sequence-based deep learning&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id="simple-machine-learning-model"&gt;Simple Machine Learning Model&lt;/h3&gt;
&lt;p&gt;Machine learning is the task of determining if computers (via software) can use patterns to make inferences or decisions about an example. Machine learning models are used to automate or expedite tasks, like identifying and sorting spam messages, or to offer assistance in making a particular decision, like whether a patient should undergo additional screening for a disease. Machine learning models learn patterns and condense information based on historical data.&lt;/p&gt;
&lt;p&gt;In today's machine learning, you usually choose what software and algorithm you want to use before you begin the training process using your training and test data. As you learned in the &lt;a href="https://blog.kjamistan.com/encodings-and-embeddings-how-does-data-get-into-machine-learning-systems.html"&gt;encodings and embeddings article&lt;/a&gt;, the data is transformed into mathematical form (vectors or matrices) in order to train and also predict.&lt;/p&gt;
&lt;p&gt;For simple machine learning models, you first choose an algorithm. A good set of examples for classic algorithm choices are shown in &lt;a href="https://scikit-learn.org/stable/machine_learning_map.html"&gt;scikit-learn's overview&lt;/a&gt;. Broadly, these choices depend on things like your data size and structure and the task you want to solve. For example, you might want to predict a number or trend, like in forecasting, or classify something, like finding all positive product reviews from a series of texts.&lt;/p&gt;
&lt;p&gt;One popular choice due to its simplicity and performance tradeoff is random forests, which is an ensemble of decision trees. You can use random forests for many classification tasks, where you want to assign an outcome or label to an incoming piece of data. Because random forests are built out of decision trees, let's review a decision tree first.&lt;/p&gt;
&lt;p&gt;In a decision tree, you want the algorithm to determine useful splits in the data based on particular attributes. Ideally, these attributes split the data into fairly homogenous buckets. For example, if you are trying to decide whether someone has a particular illness, you'd want to end up with a tree that splits perfectly the people who have the disease from those who don't by using information encoded into the data. An example tree could look something like this:&lt;/p&gt;
&lt;p&gt;&lt;img alt="A decision tree graphic showing a series of different questions where answering yes or no changes the outcome. The first question is whether the temperature is over 40 degrees Celsius. Then no answers &amp;quot;not COVID&amp;quot; and yes goes to another question of whether there is known exposure. The yes to that question goes to COVID and the no goes to another question: Is the at-home test positive? If yes, it's COVID and if no, it's not COVID." src="./images/2024/dt_covid.png"&gt;&lt;/p&gt;
&lt;p&gt;This is an oversimplified toy example, but it demonstrates the basic structure of a decision tree, where particular pieces of information are used to create hierarchical data splits based on particular attributes.&lt;/p&gt;
&lt;p&gt;A random forest is a collection of such decision trees, hence why it's called a forest. When you train a random forest, you specify the number of trees to train. Each tree is usually trained on its own random sample of the training data, which results in different splits across the trees.&lt;sup id="fnref:1"&gt;&lt;a class="footnote-ref" href="#fn:1"&gt;1&lt;/a&gt;&lt;/sup&gt; If you train many trees, some will likely be quite similar and some will be heavily biased based on their sample of data. With enough different trees, you can create robust performance. Because each tree gets a vote, the majority vote becomes the most likely class or label.&lt;/p&gt;
&lt;p&gt;Once you have a trained random forest, you can run inference (aka. prediction) tasks with your trained model. Your model is now an artifact that contains information and instructions for how to take a prepared piece of data and output a prediction (or a series of predictions with different likelihoods).&lt;/p&gt;
&lt;p&gt;To get a prediction, you send an example prepared the same way the training examples were processed -- but this time without the label or result. The model returns the particular outcome or classification label, usually with an indication of the confidence in that prediction. In a random forest, the model asks all trees to make a prediction and each tree votes on the outcome. The confidence is essentially the voting distribution (i.e. 30% of trees say not infected, 70% say infected). It's your job as the human to figure out if the model is making the correct decision.&lt;/p&gt;
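&lt;p&gt;In scikit-learn, that whole train-and-vote workflow is only a few lines. Here is a toy sketch on a bundled dataset (not the COVID example above):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Train a random forest and inspect the vote distribution behind a prediction.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

print("accuracy:", forest.score(X_test, y_test))
# predict_proba is (roughly) the voting distribution across the 100 trees,
# e.g. [0.30, 0.70] corresponds to a 30% / 70% split of the votes.
print("votes for the first test example:", forest.predict_proba(X_test[:1]))
&lt;/code&gt;&lt;/pre&gt;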
&lt;p&gt;When you take a simple model, like a random forest, you can often also reverse engineer the model's decision. This is useful for determining if you trust the decision. For example, you can look at which trees voted for the majority decision and investigate what branches in those trees contributed to that decision. This process is referred to as the interpretability or explainability of the model. When a model is simple, like with random forests, you can use your human understanding to evaluate how trustworthy and accurate you find the prediction. This is demonstrated in "human in the loop" systems, where a human can use a machine learning model and the interpretation of the model's prediction to make an informed decision.&lt;/p&gt;
&lt;p&gt;Now that you've investigated a simple machine learning model at a high level, let's take a look at how it compares with a deep learning model.&lt;/p&gt;
&lt;h3 id="deep-learning-model"&gt;Deep Learning Model&lt;/h3&gt;
&lt;p&gt;In a deep learning model, at the model selection level, you don't make one algorithm choice, but instead many choices. Because a deep learning model often consists of many layers of functions which interconnect, you are building a model architecture rather than making one algorithm choice. Researchers try out new architectures, adding new types of layers or changing the layers and the algorithms within them, to seek performance improvements or innovative approaches.&lt;/p&gt;
&lt;p&gt;Within industry, you are usually just implementing other people's architectures that you know work well for the type of problem you are solving. For example, many of today's LLMs use a &lt;a href="https://en.wikipedia.org/wiki/Transformer_(deep_learning_architecture)"&gt;Transformer architecture&lt;/a&gt;, which is a fairly complex deep learning architecture that uses an attention mechanism. The Transformer was introduced by Google researchers in 2017 in a famous paper called &lt;a href="https://arxiv.org/abs/1706.03762"&gt;Attention Is All You Need&lt;/a&gt;. GPT models are a decoder-only type of transformer (they keep only the part that "writes" the response), with small changes in how their layers work. For an illustrated and deeper dive into transformers, check out &lt;a href="https://jalammar.github.io/illustrated-transformer/"&gt;the Illustrated Transformer&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In much of today's deep learning, unless you are at a large company with an extensive machine learning research team or a machine learning-based startup focused on research, you are likely using someone else's model. This could mean that you are using an OpenAI API call, where you are also not even hosting the model, or it could mean you first download and use someone else's model and deploy it on your own infrastructure.&lt;/p&gt;
&lt;p&gt;There are also ways to download a model that someone else first trained and train it further. This is called &lt;a href="https://en.wikipedia.org/wiki/Transfer_learning"&gt;transfer learning&lt;/a&gt; or &lt;a href="https://en.wikipedia.org/wiki/Fine-tuning_(deep_learning)"&gt;fine-tuning&lt;/a&gt;. When you do so, you take a model trained on a task and train it further, to better align with your particular data or use case. If you don't have enough of your own data to train, there is increasing use of large language models (LLMs) to assist in building out robust training examples for model distillation tasks. With model distillation your goal is to actually build a smaller model that performs well on your particular use case, hence "distilling" the information from the larger model (see: &lt;a href="https://explosion.ai/blog/human-in-the-loop-distillation"&gt;Spacy's human-in-the-loop distillation&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;In deep learning, you are often dealing with data and tasks that aren't suited for simpler machine learning models -- like generating photos, videos, audio, text or translating those from one medium to another. Deep learning is what powers Generative AI, what allows for text-to-speech or speech-to-text, and what is used for computer vision tasks, like facial recognition or "self-driving" cars. The complexity of such tasks and the data size make deep learning more performant than simple machine learning models.&lt;/p&gt;
&lt;p&gt;For training deep learning models, as &lt;a href="https://blog.kjamistan.com/gaming-evaluation-the-evolution-of-deep-learning-training-and-evaluation.html"&gt;you learned in the training article&lt;/a&gt;, the entire training dataset is input into the model multiple times. In today's largest models, portions of the training data are seen many times over by extremely large models. These models are called over- or hyperparameterized because they actually have more parameters--weights, biases and other parameters that the functions of the network might use--than there are training data points. You might have heard about 1 trillion parameter models, and yet these models were trained with less than 1 trillion pieces of unique data.&lt;/p&gt;
&lt;p&gt;Compared with the earlier decision tree and random forest example, a deep learning model is much more difficult to interpret. Especially as the layer complexity and depth grows, as it does with large deep learning models, it's difficult to look at the activations of the nodes and make any sense of what is happening in a way that humans can understand. Despite the difficulty in working on human-interpretable understanding of deep learning, it hasn't stopped researchers from trying to peek inside these networks to see what is happening.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/marcotcr/lime"&gt;LIME or Local Interpretable Machine Learning Explanations&lt;/a&gt; investigated if small changes in inputs could expose and locate deep learning decision boundaries. To envision how this works, first think of a 3-dimensional space, where inputs are represented as coordinates. Then, imagine planes marking boundaries between those points that help you determine whether the point belongs to one group or another. In reality, these models are extremely high-dimensional and non-linear in nature, meaning it works a bit differently than you just imagined, but LIME worked by changing the coordinates to figure out where these boundaries were and then used that information to say this part of the image or text is why it was classified with this label. Since a neural network is not a simple linear equation, finding these boundaries can be quite difficult, but it was an interesting first step.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.quantamagazine.org/a-new-approach-to-understanding-how-machines-think-20190110/"&gt;Been Kim's work&lt;/a&gt; brought the field of deep learning interpretability to a new level. In her work at DeepMind, she investigates how hidden layers (the layers between the first and last one) can create intermediary representations which map much closer to interpretable patterns for humans. Her seminal contribution of &lt;a href="https://arxiv.org/abs/1711.11279"&gt;"Testing with Concept Activation Vectors" (TCAV)&lt;/a&gt; created a way to try to use human interpretability approaches to understand deep learning, not the other way around.&lt;/p&gt;
&lt;p&gt;Deep learning is a large area of machine learning, with many different model types. To focus our attention on particularly useful areas of deep learning to explore memorization, you'll start with sequential deep learning models, where you want to predict what happens next when presented with a sequence. As you might already know, this is the deep learning that powers today's text-based Generative AI, like OpenAI's ChatGPT and Google's Gemini.&lt;/p&gt;
&lt;h3 id="sequential-deep-learning-models"&gt;Sequential Deep Learning Models&lt;/h3&gt;
&lt;p&gt;For many years, it was difficult to do language-based deep learning. Deep learning computer vision models were well into production by the time word embeddings made language deep learning possible. This is partly because of the size and complexity of language, which has a much wider range of possible inputs and, of course, many different languages one could use. It was also because, for a long time, there wasn't a good way to build performant sequence-based models.&lt;/p&gt;
&lt;p&gt;As recently as 2018, generative text models would quickly devolve into babble or change topics midstream. What significantly shifted the field are two inventions: the attention mechanism and the large context window. Let's walk through both at a high-level to appreciate what they were able to bring to generative text.&lt;/p&gt;
&lt;p&gt;The attention mechanism creates specific parts of the deep learning model which hold "attention" on links between elements of the sequence. This helps create a web of references, which greatly improves the ability to produce meaningful language.&lt;/p&gt;
&lt;p&gt;These attention heads are both on the encoder ("reading what is coming in") and the decoder ("writing what is going out") with an additional one between the encoder and decoder. These attention heads can hold input or output embeddings within a sequence that are calculated as more significant or useful, depending on the task at hand and the training data.&lt;/p&gt;
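&lt;p&gt;At its core, a single attention head is only a few lines of math: every position scores every other position, the scores become weights via a softmax, and the output is a weighted mix of the value vectors. Here is a minimal numpy sketch (one head, no masking, batching or learned projections):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Minimal single-head scaled dot-product attention.
import numpy as np

def attention(Q, K, V):
    # Q, K, V: (sequence_length, d) arrays of query, key and value vectors.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)          # how much each token attends to each other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                     # weighted mix of the value vectors
&lt;/code&gt;&lt;/pre&gt;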
&lt;p&gt;Let's look at an early example of attention from the original Google paper. In this view you can see, for two attention heads, the weight each token places on other tokens, which reflects the links those heads have "learned" between sequential inputs.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A sentence reads out token by token that says: The Law will never be perfect, but its application should be just - this is what we are missing, in my opinion. There are two versions of the sentence, one on the left and one on the right. The words Law and application on the left are highlighted and they link to the word its on the right." src="./images/2024/transformer_attention_head.png"&gt;
&lt;em&gt;An example of attention head links&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;In this example, the attention head learned entity (or subject) linkage and attribution, allowing "law" and "its" to be linked. The attention also links to "application", tying law and application together via the pronoun "its".&lt;/p&gt;
&lt;p&gt;For early text-based transformers, attention mechanisms could only be used on a subset of the overall sequence due to memory and compute limitations, often limited to under 2000 tokens. This means that the prompt (initial instructions) as well as the ongoing generative text could only be given attention up to that context window limit. This limit bounded the length of meaningful generative text - particularly when using transformers for longer generative text tasks, such as summarization, chat, or search and information retrieval.&lt;/p&gt;
&lt;p&gt;To address these use cases, model developers like OpenAI increased the context window size of the attention mechanism, which also increased the hardware memory requirements and computational cost of the models. This means that you could often hold entire conversations or chapters of books in the context window, creating the ability to stay on task, but also giving the model extra context to ensure that the important tokens and ideas are always available. Today's LLM context windows are often 128,000 tokens or longer. Compare that to Shakespeare's Hamlet, which is just over 30,000 words.&lt;/p&gt;
&lt;p&gt;I often describe context windows as the model's RAM (Random Access Memory). RAM allows computers to easily grab recently used data to accelerate computations or loading times. Depending on the encoding, a token might be a byte, a character or (part of) a word, so these context window sizes roughly translate to a model holding more than 50,000 words in RAM. How does it process what word comes next?&lt;/p&gt;
&lt;p&gt;To oversimplify, you can think about all of the possible words and tokens as different points in 3D space, as you did when looking at decision space. The attention mechanism and context window might already have highlighted some of these points as more important or more relevant to the model - which also means points near them become more relevant.&lt;/p&gt;
&lt;p&gt;The model has been trained first on many sequences in order, and gotten information about how tokens come together. During pretraining (which is what first happens with today's language models), it updates many hidden layers of weights to "learn" how the input corpus works and what sequences of tokens are common or uncommon.&lt;/p&gt;
&lt;p&gt;The embedding information of each token also carries added information about tokens that are similar, near each other, or tokens that have a particular distance from one another which shows their relationship, as you learned when reviewing Word2Vec.&lt;/p&gt;
&lt;p&gt;If the model has learned enough about language and the input has enough context (i.e. via the context window RAM), it can create sensical combinations that are on topic by also combining its output as part of its sequencing (i.e. writing word-by-word and continuing to compute what a useful next series of tokens might be).&lt;sup id="fnref:2"&gt;&lt;a class="footnote-ref" href="#fn:2"&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;In a situation where you have: Why did the chicken ..., you can probably guess the next word, as it's a common phrase. In a deep learning sequence-based model, the network calculates which of all of the possible next steps are most relevant using the steps it has already seen and already generated. Usually the highest probability next step is chosen, but there might be clever ways to see the best sequences that aren't just exactly the next most probable step. For example, you can predict several different sequences using several of the next most probable steps and calculate the probability over the entire sequence, not just the exact next step.&lt;sup id="fnref:3"&gt;&lt;a class="footnote-ref" href="#fn:3"&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;In statistics (and in deep learning), this relates to the &lt;a href="https://en.wikipedia.org/wiki/Perplexity"&gt;log-perplexity&lt;/a&gt; of the next sequence. This is a useful comparison for the final stages of the decoder in a transformer, which takes all of the calculations to this point in time (including calculations from the context window) and runs them through several final deep learning layers. This ends with a &lt;a href="https://en.wikipedia.org/wiki/Softmax_function"&gt;Softmax function&lt;/a&gt; which takes the prior layer inputs and translates them into a probability distribution over the different tokens. Depending on the strategy, either the most likely token or some combination of the most likely tokens will be chosen.&lt;/p&gt;
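&lt;p&gt;To make that last step concrete, here is a small NumPy sketch of turning the decoder's final scores (logits) into a probability distribution with Softmax and then choosing the next token, either greedily or by sampling. The toy vocabulary and logit values are made up for illustration:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np

# A toy vocabulary and made-up logits from the final decoder layers
vocab = ["road", "coop", "egg", "banana"]
logits = np.array([4.2, 2.5, 1.1, 0.3])

# Softmax: translate raw scores into a probability distribution
probs = np.exp(logits - logits.max())
probs = probs / probs.sum()

# Greedy decoding: always take the single most probable token
greedy_token = vocab[int(np.argmax(probs))]

# Sampling: draw from the distribution, so less likely tokens appear sometimes
rng = np.random.default_rng(42)
sampled_token = vocab[rng.choice(len(vocab), p=probs)]

print(dict(zip(vocab, probs.round(3))))
print(greedy_token, sampled_token)
&lt;/code&gt;&lt;/pre&gt;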
&lt;p&gt;How does this affect the problem of memorization? Let's put together what you've learned thus far to see the bigger picture.&lt;/p&gt;
&lt;h3 id="repetition-begets-memorization-in-deep-learning"&gt;Repetition Begets Memorization in Deep Learning&lt;/h3&gt;
&lt;p&gt;Let's investigate a few facts that you now know:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Scraped, collected datasets contain a subset of examples that are repeated often and are much more common than the rest.&lt;/li&gt;
&lt;li&gt;During model training, the model will be optimized on these examples hundreds, if not thousands of times.&lt;/li&gt;
&lt;li&gt;Sequential language modeling must choose the best answer when one word is missing. It can also hold a massive number of words in memory to access at any time. These words build weights and connections in the network itself.&lt;/li&gt;
&lt;li&gt;The model and model developers are incentivized to score well on the "test" and are penalized (error and loss) when they fail. The training rounds should explicitly use this penalty to improve.&lt;/li&gt;
&lt;li&gt;The "best" model wins, regardless of interpretability or if cheating has occurred.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;There are both mathematical and human incentives to produce models that memorize common text, particularly if that text will be in the testing dataset and if that text has been seen multiple times during training. Even more so if that text is presumed to be known by the users of the model at a later stage.&lt;/p&gt;
&lt;p&gt;Google researchers proved exactly this fact in 2018, when they released a paper called &lt;a href="https://arxiv.org/abs/1802.08232"&gt;The Secret Sharer&lt;/a&gt;. Carlini et al. stated in the paper, "unintended memorization is a persistent, hard-to-avoid issue that can have serious consequences". They were able to demonstrate the extraction of both common and more rare training examples that had been processed multiple times in the training rounds.&lt;/p&gt;
&lt;p&gt;In a later piece of research, Carlini et al. were able to show that model size impacts memorization, and that repeated examples are especially prone to memorization. They estimated a lower bound of at least 1% of the training data being memorized, with an unknown upper bound.&lt;/p&gt;
&lt;p&gt;In some of their experiments, they were able to extract more than 32% of the text that was included in at least 100 training data examples, sometimes not the full text token-by-token, but enough to recognize and match the text. Their testing was on much smaller models than today's, using GPT-Neo models (125M to 2.7B parameters) and GPT-J (6B parameters). In comparison, it is believed that GPT-4 has around 1,800B parameters.&lt;/p&gt;
&lt;p&gt;Here is a comparison chart linking model size to text memorization by showing a few examples they were able to extract:&lt;/p&gt;
&lt;p&gt;&lt;img alt="A chart with 5 columns and 4 rows. In the left-most column there are text examples from the training dataset. In each subsequent column you can see portions of that column extracted from the model. The 6B parameter model has significantly more text that can be successfully extracted than the other models, which are 2.7B, 1.3B, 125M." src="./images/2024/text_extraction_repetition.png"&gt;&lt;/p&gt;
&lt;p&gt;This memorization within deep learning can also be observed in &lt;a href="https://arxiv.org/abs/2009.04013"&gt;attribute inference&lt;/a&gt; and &lt;a href="https://arxiv.org/abs/1610.05820"&gt;membership inference attacks&lt;/a&gt;, where an attacker can find and extract common properties of the underlying training data population or reveal if a particular example was in the training data based on the model response. Particularly interesting is the work &lt;a href="https://arxiv.org/abs/2107.06018"&gt;This Person Probably Exists&lt;/a&gt; on deriving attributes from a computer vision model that was trained on the CelebA dataset - a dataset consisting primarily of celebrity faces.&lt;/p&gt;
&lt;p&gt;Let's see if this type of behavior is easy to evoke using online freely-available tools.&lt;/p&gt;
&lt;h3 id="easy-examples"&gt;Easy Examples&lt;/h3&gt;
&lt;p&gt;With almost any generative tool, you can easily create images of particular brands, faces and common images, like this version of Angela Merkel, created on &lt;a href="https://stablediffusionweb.com/"&gt;Stable Diffusion Web&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A drawing of Angela Merkel's face produced by a free online tool" src="./images/2024/merkel.png"&gt;&lt;/p&gt;
&lt;p&gt;Let's see if we can also reproduce this type of memorization using ChatGPT (free version).&lt;/p&gt;
&lt;p&gt;An initial prompt of "Can you tell me some popular children's authors?" gave me a list starting with Dr. Seuss, so I asked to know about a book from Dr. Seuss and was told about Green Eggs and Ham and how it had good rhymes. I asked the ChatGPT service to show me some of the writing and it did so with the opening stanzas: Would you eat them in a box? Would you eat them with a fox? (and a few more lines).&lt;/p&gt;
&lt;p&gt;Then I asked:&lt;/p&gt;
&lt;p&gt;&lt;img alt="A prompt asking: Can you continue the story from where you left off? ChatGPT response: Certainly! Here’s the continuation of &amp;quot;Green Eggs and Ham&amp;quot; from where we left off: Would you eat them in a car? Eat them! Eat them! Here they are. I would not, could not, in a car.. the book continues for several more stanzas" src="./images/2024/dr_seuss_continued.jpg"&gt;
&lt;em&gt;Reading Dr. Seuss with ChatGPT Memorization&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;And got a not quite perfect but pretty close continuation of the entire book (the response was more than 600 words and spaced in the appropriate stanzas).&lt;/p&gt;
&lt;p&gt;ChatGPT can also easily reproduce code and related topics, like the Zen of Python:&lt;/p&gt;
&lt;p&gt;&lt;img alt="A prompt asks: What's the famous poem import thing in Python? and ChatGPT responds The famous poem import in Python is the &amp;quot;Zen of Python,&amp;quot; written by Tim Peters. It is a collection of guiding principles for writing computer programs in the Python language. You can access it by importing this in a Python script or interpreter. It goes on to write the entire Zen of Python word for word." src="./images/2024/zen_of_python.png"&gt;
&lt;em&gt;The Zen of Python from ChatGPT Memorization&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;You might be thinking, ah, well this is exactly the goal. If you train a large model on a bunch of repetitive things, then of course it can also repeat them. You're correct! It is indeed expected, anticipated and empirically true!&lt;/p&gt;
&lt;p&gt;It just might not be the outcome that Dr. Seuss's family expected, or at the scale that any popular artists, authors, creators, musicians and coders imagined. It can be both expected and yet have unanticipated and unintended secondary effects. It can be both desirable given the specific machine learning task but not well thought through in terms of cultural, societal and personal impact.&lt;/p&gt;
&lt;p&gt;Unfortunately, this is not the only way that memorization occurs. In the next article, you'll review how unique and novel examples also end up memorized. Stay tuned!&lt;/p&gt;
&lt;p&gt;I'm very open to feedback (positive, neutral or critical), questions (there are no stupid questions!) and creative re-use of this content. If you have any of those, please share it with me! This helps keep me inspired and writing. :)&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Acknowledgements&lt;/em&gt;: I would like to thank &lt;a href="https://vickiboykis.com/"&gt;Vicki Boykis&lt;/a&gt;, &lt;a href="https://desfontain.es/serious.html"&gt;Damien Desfontaines&lt;/a&gt; and &lt;a href="https://www.linkedin.com/in/yanndupis/"&gt;Yann Dupis&lt;/a&gt; for their feedback, corrections and thoughts on this series. Their input greatly contributed to improvements in my thinking and writing. Any mistakes, typos, inaccuracies or controversial opinions are my own.&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;Random forests usually use statistical methods like &lt;a href="https://en.wikipedia.org/wiki/Bootstrapping_(statistics)"&gt;bootstrapping&lt;/a&gt; and &lt;a href="https://en.wikipedia.org/wiki/Bootstrap_aggregating"&gt;bootstrap aggregation&lt;/a&gt;, or &lt;a href="https://en.wikipedia.org/wiki/Random_forest#Bagging"&gt;bagging&lt;/a&gt;. I encourage you to dive into the links to learn more, but on a high level you can think of these as statistics-informed sampling methods, which allow creation of samples that aim to represent the dataset or populations within the dataset. This may also increase the dataset size by resampling from a subset of data.&amp;#160;&lt;a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;For LLMs today, there is usually also a second round of deep learning training after the initial language model learning, which originally was based on reinforcement learning and called reinforcement learning with human feedback (RLHF). This usually involved chat-style prompts and data workers (often paid very little) who both wrote their own responses as if they were the AI assistant and rated which response was best out of a variety of responses. Now there are several approaches for this type of instruction training and tuning, which are not always reinforcement learning-based, but instead use traditional deep learning approaches by incorporating human preference information into the loss function.&amp;#160;&lt;a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:3"&gt;
&lt;p&gt;There has been some success at using approaches like &lt;a href="https://en.wikipedia.org/wiki/Beam_search"&gt;beam search&lt;/a&gt; to compare potential sequence options and calculate their probability or their preference (when using human input to determine best responses). This creates more options and potential variety in responses.&amp;#160;&lt;a class="footnote-backref" href="#fnref:3" title="Jump back to footnote 3 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="ml-memorization"></category></entry><entry><title>Gaming Evaluation - The evolution of deep learning training and evaluation</title><link href="https://blog.kjamistan.com/gaming-evaluation-the-evolution-of-deep-learning-training-and-evaluation.html" rel="alternate"></link><published>2024-11-26T00:00:00+01:00</published><updated>2024-11-26T00:00:00+01:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2024-11-26:/gaming-evaluation-the-evolution-of-deep-learning-training-and-evaluation.html</id><summary type="html">&lt;p&gt;In this article in the &lt;a href="https://blog.kjamistan.com/a-deep-dive-into-memorization-in-deep-learning.html"&gt;series on machine learning memorization&lt;/a&gt;, you'll dive deeper into how typical machine learning training and evaluation happens, a crucial step in ensuring the machine learning model actually "learns" something. Let's review the steps that lead up to training a deep learning model.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Two major steps are shown in rectangular boxes: Data Preparation and Preprocessing and Model Training and Evaluation. Above each of these major steps there are smaller boxes outlining substeps. The data preparation substeps are data collection, data cleaning and data labeling (if needed). The substeps for model training and evaluation are data encoding, model training and model evaluation." src="./images/2024/model_training_steps.png"&gt;
&lt;em&gt;High-level steps to …&lt;/em&gt;&lt;/p&gt;</summary><content type="html">&lt;p&gt;In this article in the &lt;a href="https://blog.kjamistan.com/a-deep-dive-into-memorization-in-deep-learning.html"&gt;series on machine learning memorization&lt;/a&gt;, you'll dive deeper into how typical machine learning training and evaluation happens, a crucial step in ensuring the machine learning model actually "learns" something. Let's review the steps that lead up to training a deep learning model.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Two major steps are shown in rectangular boxes: Data Preparation and Preprocessing and Model Training and Evaluation. Above each of these major steps there are smaller boxes outlining substeps. The data preparation substeps are data collection, data cleaning and data labeling (if needed). The substeps for model training and evaluation are data encoding, model training and model evaluation." src="./images/2024/model_training_steps.png"&gt;
&lt;em&gt;High-level steps to train a deep learning model&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;You've already learned the initial stages that lead up to training a model -- namely, how &lt;a href="https://blog.kjamistan.com/machine-learning-dataset-distributions-history-and-biases.html"&gt;data is collected and sometimes labeled&lt;/a&gt; (depending on the "task" you might need labeled or unlabeled data). After the data is processed, cleaned and saved, it will likely be stored in files, document stores or other distributed data architecture setups for easy access from the data science and/or machine learning team.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Prefer to learn by video? This post &lt;a href="https://youtu.be/IO3yI640H5A?si=J_-74oE5zkcCcviB"&gt;is summarized on Probably Private's YouTube&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;In a typical machine learning setup, the team uses a dedicated or on-demand GPU cluster or other machines that accelerate and parallelize machine learning training. This special hardware ensures that the massively parallelizable linear algebra computations run as quickly as possible, as you &lt;a href="https://blog.kjamistan.com/encodings-and-embeddings-how-does-data-get-into-machine-learning-systems.html#encodings-and-embeddings-how-does-data-get-into-machine-learning-systems"&gt;learned when reviewing AlexNet&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;At this point, the team will also decide:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;what data and task are relevant&lt;/li&gt;
&lt;li&gt;what model architecture(s) they will train&lt;/li&gt;
&lt;li&gt;an evaluation and validation strategy and dataset to evaluate the models&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The simplest and most common way to answer the data questions in points 1 and 3 is to use the data you've already collected and randomly split it into training and testing datasets. This is appealing because it ensures both datasets are similar, making evaluation easier and ensuring all data undergoes standardized preparation and preprocessing. Also, it's data you already have, so you don't need to figure out how to collect more data.&lt;/p&gt;
&lt;p&gt;Nearly every machine learning algorithm or architecture has hyperparameters: configuration values for the architecture or algorithm that are set before training rather than learned from the data. These are usually set directly if you have an idea what some of the values should be, or initialized randomly. If you choose random initialization, or search over a variety of values for the best configuration, you might run many parallel training runs to see which creates a better model.&lt;/p&gt;
&lt;p&gt;In conjunction, you can use cross-validation, where multiple models are trained and evaluated on different splits in the dataset and with different initializations of the hyperparameters.&lt;/p&gt;
&lt;p&gt;When using cross-validation, your data split might look like this:&lt;/p&gt;
&lt;p&gt;&lt;img alt="There are 4 horizontal bars with different vertical splits within the bar. At the top there is the simple example, where one model is trained and a simple train-test split is made. For this example the bar shows a split -- with the training data representing about 70% of the split. For the cross-validation example there are several examples of splits, each showing the testing dataset moving gradually from left to right but still encompasses about 30% of the entire dataset. The remaining data belongs to the training data for each model candidate. This produces N model candidates which can be compared and the best model can then be chosen. Each candidate has seen slightly different training and test data and might also have different hyperparameters. " src="./images/2024/cross_validation_rounds.png"&gt;
&lt;em&gt;A visual example of training and test splits&lt;/em&gt;&lt;/p&gt;
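&lt;p&gt;As a rough sketch of what this default workflow often looks like in code, here is a hedged scikit-learn example using a synthetic dataset as a stand-in for the collected data; the model choice and parameters are only illustrative:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# pip install scikit-learn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for "the data you've already collected"
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# The common default: a purely random split of the same collected dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = RandomForestClassifier(random_state=0)
model.fit(X_train, y_train)
print("hold-out accuracy:", model.score(X_test, y_test))

# Cross-validation: several candidates, each trained and tested on different splits
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
print("accuracy per fold:", scores)
&lt;/code&gt;&lt;/pre&gt;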
&lt;p&gt;In a perfect world, this is fine, because the data has been properly cleaned, is attributed correctly and you know that it's high quality data. You presume the data is free and available for use (i.e. data protection compliant and not under special licensing or copyright). You also presume it is representative (i.e. it doesn't have &lt;a href="https://en.wikipedia.org/wiki/Sampling_bias"&gt;sampling biases&lt;/a&gt;) and any labels are correct and appropriately representative. Unfortunately, our world is not perfect.&lt;/p&gt;
&lt;p&gt;Instead, as discussed &lt;a href="https://blog.kjamistan.com/machine-learning-dataset-distributions-history-and-biases.html"&gt;in a previous article&lt;/a&gt;, the data often has a significant mass of "typical" examples and then a long tail of more novel examples. Some of those novel examples are likely just errors and mistakes in labeling and collection. Some of the popular examples will repeat themselves either pixel-for-pixel and word-for-word or in chunks with close approximation, like beginning a business letter with "To Whom it May Concern".&lt;/p&gt;
&lt;p&gt;This brings several problems, some of which contribute significantly to the memorization problem. Let's evaluate them alongside the typical training process.&lt;/p&gt;
&lt;h3 id="data-quality-duplication-and-preprocessing-cleaning"&gt;Data Quality, Duplication and Preprocessing / Cleaning&lt;/h3&gt;
&lt;p&gt;Internet-scraped data has many quality issues, but so does data specifically collected for a task, due to many of &lt;a href="https://blog.kjamistan.com/machine-learning-dataset-distributions-history-and-biases.html"&gt;the societal, measurement and population biases described in the data collection article&lt;/a&gt;. If your goal is to create an accurate and representative view of something like everything you can see outside, or even a smaller task, like recognizing every voice speaking English for speech to text, you will certainly miss some representations and you'll likely also run into training data quality issues.&lt;/p&gt;
&lt;p&gt;Data quality is a hotly debated topic within machine learning and data science. Some machine learning scientists presume that if the errors only represent a small portion of the data, they will essentially be regularized out of the model. This presumes two things: (1) the errors represent a small portion of the overall data collected and (2) the model will not memorize erroneous data.&lt;/p&gt;
&lt;p&gt;If a data scientist presumes that there are significant quality issues with the current dataset, there should be a plan to deal with the problems. An example plan could look something like the following:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Test for duplication and remove duplicates using a near-match or perfect-match search strategy (see the sketch after this list). Remember that near-match is a hard problem and can require human intervention and labeling.&lt;/li&gt;
&lt;li&gt;Test for realistic bounds or patterns and regularize or remove data outside of those bounds. For example, find and remove overexposed photos.&lt;/li&gt;
&lt;li&gt;Apply domain-specific criteria to detect problems and either correct or remove those issues. For example, remove poor quality boilerplate or spam text.&lt;/li&gt;
&lt;li&gt;Determine other preprocessing to ensure all data is similarly standardized and irregularities have hopefully been removed. This can require things like removing outliers, &lt;a href="https://en.wikipedia.org/wiki/Normalization_(statistics)"&gt;normalizing data&lt;/a&gt; and filling in missing values.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This is a difficult task due to the input data complexity, especially when you are using non-tabular datasets (i.e. not data in rows and columns). Is a cropped photo of a larger photo a duplicate? What about text that varies by one paragraph? What are "realistic bounds" for outliers when it comes to PDF documents? Many of these are open research problems that require significant domain experience to address properly - which most data scientists don't have by default.&lt;/p&gt;
&lt;p&gt;Due to this skill mismatch and lack of resources, usually only the most rudimentary quality checks and preprocessing happens. This data is then considered "clean" for the following training and evaluation steps.&lt;/p&gt;
&lt;h3 id="sampling-bias"&gt;Sampling Bias&lt;/h3&gt;
&lt;p&gt;Sampling bias is data bias or error that comes from the way the data is collected and used.&lt;/p&gt;
&lt;p&gt;In deep learning, there are two forms of sampling bias. The first occurs during data collection, which you learned about in the last article. This bias is seen in the skewed representations and societal biases, but also in other features, like the represented language style and context. If you are using Wikipedia, the writing has a certain style, versus if you use arXiv (another popular source) or Reddit (a very different style of language, even though it will also be mainly in English). These choices greatly impact how a language model can learn and reproduce linguistic style and writing.&lt;/p&gt;
&lt;p&gt;The second type of bias is the actual sampling method used to perform the training split and validation in the dataset. In an ideal scenario, you'd study the underlying data populations and make specific decisions on how you'd like to split the training and evaluation/testing data so that each sample has an adequate representation of the population information you are trying to learn.&lt;/p&gt;
&lt;p&gt;In a perfect scenario you might even use a separate test and evaluation set that you specifically collected and labeled to ensure you know the data quality and provenance -- even if it slightly deviates from the original training data. For example, if you really wanted to evaluate a system with unseen data, you could collect the test and/or evaluation data via a separate process. Let's say you are testing a chatbot for customer search and knowledge base surfacing. You could collect the test and evaluation set by leveraging your customer service department -- who could create an entirely separate evaluation set based on their knowledge and experience. You could expand this dataset by sampling real customer queries of the system when it launches or in a beta setting and having the customer service team appropriately validate, label and annotate or enhance the dataset.&lt;/p&gt;
&lt;p&gt;In reality, a data scientist likely uses a built-in preprocessing train-test split that takes the entire data and runs a random sampler across it. Again, this probably wouldn't be a problem if the data always had a normal distribution and was high quality, but this is not usual with the large scale scraped or publicly available datasets. This means that random sampling is not actually representative of what you are trying to learn, and certainly not always a quality you want to reproduce. It also means the chance of sampling near-matches, extremely similar data to your training data, and one-off outliers or errors is high (because of the long tail and prevalent collection methods).&lt;/p&gt;
&lt;p&gt;This sampling bias impacts both the performance of the model on the data and the progression of model training, which brings us to our next problem.&lt;/p&gt;
&lt;h3 id="training-batches-and-rounds"&gt;Training batches and rounds&lt;/h3&gt;
&lt;p&gt;In deep learning setups, data science and machine learning teams use multiple training rounds to ensure the model "learns".&lt;/p&gt;
&lt;p&gt;Usually, training is broken down into iterations called epochs and then a smaller iteration called steps. This is then repeated as long as needed until the model scores high enough or the team decides it isn't working. When reviewing the following process, I want you to imagine this process happening thousands, if not hundreds of thousands of times, meaning the model processes the same data many thousands of times, each time trying to better "learn" from that data.&lt;/p&gt;
&lt;p&gt;Let's investigate a typical training epoch (a minimal code sketch of the loop follows the list):&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Before a training epoch can start, a batch size needs to be defined. Batch sizing is usually correlated to the dataset and hardware at hand. For large models and accompanying datasets, typical batch sizes start at 128 data examples.&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;At the beginning of a training epoch, a sampler is used to select the batch from the training dataset. The default sampler is "random" and breaks up the full training dataset into a particular number of batches. Note that the randomness is dependent on what hardware is being used, and therefore in many cases &lt;a href="https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/cuda/Randperm.cu#L25"&gt;not truly random&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="An image of a block of training data. Some subsets across the block are selected for batch #1, then batch #2 until batch #N, at which point all of the training data belongs in a batch." src="./images/2024/batch_sampling.png"&gt;
&lt;em&gt;Visual example for batch selection&lt;/em&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The examples in the first batch will be processed through the model to create a prediction or output (similar to how it works when a model is used normally to predict a label or the next token). This processing activates nodes and layers across the model's network based on the model's weights and biases.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The final layer of the model will predict a class or make another prediction, such as a token or other generative output. This response will be compared to the training data itself, such as the label or the next series of words in the training example.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Two symbols, one representing the actual input the other representing the model response during training. The input is a succulent plant, the model response is a turtle. There is a question above the images saying: What penalty should the model get for this response?" src="./images/2024/loss_calculation.png"&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Visual example of a loss calculation. During training, you continuously predict and measure loss to sequentially improve the model response.&lt;/em&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Depending on how correct or likely that answer is, there will be an error calculation (often called a loss function). You can choose different error calculations, but many are based on the concept of &lt;a href="https://en.wikipedia.org/wiki/Cross-entropy"&gt;cross entropy&lt;/a&gt;, which roughly measures how likely it is that the model response and the training data come from the same population (i.e. how predictable or "normal" is this response?). The error is used to derive the updates for all of the weights in the network to attempt to correct future responses. If the loss is large, it will heavily shift the weights for those examples. This means outliers and potentially erroneous inputs can have an outsized impact on the model training.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Once the parameter (weight and bias) changes for each layer are calculated, backpropagation begins. The process updates each layer with the new weights and the training step is complete. These updates usually happen at the end of the full batch.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A simple neural network showing multiple layers and nodes for each layer. Some of the nodes are highlighted in red lines showing how much error they hold that needs to be corrected by changing their weights and biases." src="./images/2024/node_errors.png"&gt;
&lt;em&gt;Visual example of corrections to particular weights based on error. The stronger the red border representing error, the larger the shift in the weights and biases related to that particular node.&lt;/em&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Then, the next batch is selected and the training step begins again. This is repeated until the epoch is complete, so the model has seen all possible training data once.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;At the end of an epoch, a subset of the test data is selected to evaluate the model performance. The performance of the model on that data is usually shown via a dashboard, so that the machine learning scientist, data scientist or engineer can determine if the training is going well, or if it should be stopped because the model isn't performing or something catastrophic has happened. Sometimes, the team will also stop training in what is called &lt;a href="https://en.wikipedia.org/wiki/Early_stopping"&gt;early stopping&lt;/a&gt; because the model has reached good or optimal performance, where they assess that further learning might result in overfitting or won't provide much additional gain.&lt;/li&gt;
&lt;/ol&gt;
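&lt;p&gt;The numbered steps above map fairly directly onto a few lines of framework code. Below is a minimal, hedged PyTorch sketch of a single epoch using a toy dataset and model; all names, sizes and the optimizer choice are illustrative only:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins for a real dataset and model
X = torch.randn(512, 16)
y = torch.randint(0, 2, (512,))
dataset = TensorDataset(X, y)

# Steps 1-2: define a batch size and let the DataLoader sample (shuffled) batches
loader = DataLoader(dataset, batch_size=128, shuffle=True)

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
loss_fn = nn.CrossEntropyLoss()  # Step 5: cross-entropy based loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for batch_X, batch_y in loader:       # Step 7: repeat until the epoch is complete
    optimizer.zero_grad()
    outputs = model(batch_X)          # Steps 3-4: forward pass and prediction
    loss = loss_fn(outputs, batch_y)  # Step 5: penalty for wrong or unlikely answers
    loss.backward()                   # Step 6: backpropagate the error
    optimizer.step()                  # Step 6: update weights and biases
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A real setup would add the end-of-epoch evaluation on held-out data (step 8), along with logging and checkpointing.&lt;/p&gt;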
&lt;p&gt;It's common to train for multiple epochs, meaning the entirety of the training dataset is used multiple times. For overparameterized networks, which you'll learn about more in a later article, this repetition is significant as the models require hundreds of thousands of "full passes" (i.e. one training epoch on all data) to reach peak performance (see: &lt;a href="https://arxiv.org/abs/2104.05605"&gt;early work on scaling Generative models&lt;/a&gt; and &lt;a href="https://developer.nvidia.com/blog/scaling-language-model-training-to-a-trillion-parameters-using-megatron/"&gt;NVIDIA's scaling language model training to 1T parameters&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;To dive deeper into the workings of these steps, take a look at &lt;a href="https://theneuralblog.com/forward-pass-backpropagation-example/"&gt;the Neural Blog's Forward Pass and Backpropagation example&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;By reviewing this process, you can infer that uncommon examples have a more significant impact on network weights compared to more common examples. Because there are also fewer of them, the model weights must account very specifically for those examples in order to achieve optimal performance. You will come back to this insight later in this series, but let's first evaluate the evaluation process.&lt;/p&gt;
&lt;h3 id="the-myth-of-unseen-datasets"&gt;The Myth of "Unseen" Datasets&lt;/h3&gt;
&lt;p&gt;The training steps of a deep learning model require that the data is seen in its entirety usually multiple times. If the test data is sampled from the same dataset, how "unseen" is the test data?&lt;/p&gt;
&lt;p&gt;The split between test and training is often random, and yet the datasets are often collected from similar samples. Let's take a look at an artifact from some research around memorization to see how this plays out.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Two images are next to each other, each with a picture that looks like it's taken by the same photographer in the same room at nearly the same time. The photo shows a room with a bright green wall and a person on a swing. The photos are labeled as &amp;quot;swing&amp;quot;." src="./images/2024/high_influence_pair_swing.png"&gt;
&lt;em&gt;Example training sample and test sample from ImageNet&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;On the left is a test data sample, with the accuracy of the model's prediction written underneath it (75%). But on the right is an example from the training data which most influenced the weights to guess "correctly" on the test data. The photos are clearly from the same photographer, on the same day, of the same thing. And yet, this is called "unseen" data?&lt;/p&gt;
&lt;p&gt;Presumably, you want the test data to be completely unseen so you can tell how well your model is actually generalizing. Generalization describes a model's ability to learn underlying patterns rather than overfit the training data, so that it still performs well on unseen or real-world data.&lt;/p&gt;
&lt;p&gt;Since you learned about the long-tail distribution of the scraped "real world" data, the chance that a test example is truly unseen is low when sampling from the same collected dataset. If you sample from the peak of the distribution, that data is massively duplicated, so this is certainly not unseen data. If you sample from the tail, you do have a much higher chance of "unseen" examples, but ideally you also want to learn most of the tail in order to generalize, which means you need significant data points from the tail in your training dataset.&lt;/p&gt;
&lt;p&gt;In fact, some of the best research on the problems of imbalanced classes, which are an effect of the "real world" distributions, guides practitioners to oversample the tail as training examples, leading to more "balance" between the peak and the tail. If you oversample the long tail for training, then this also means there is less of the tail for testing, and you again end up in the cycle of testing mainly with common examples, which the model should certainly learn.&lt;/p&gt;
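&lt;p&gt;To illustrate that oversampling practice, here is a small, hedged sketch using scikit-learn's resample utility; the class sizes are invented and a real pipeline would operate on the features as well, not only the labels:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np
from sklearn.utils import resample

# Imaginary imbalanced labels: a large "head" class and a small "tail" class
head = np.zeros(950, dtype=int)
tail = np.ones(50, dtype=int)

# Oversample the tail with replacement until it matches the head
tail_oversampled = resample(tail, replace=True, n_samples=len(head), random_state=0)

balanced = np.concatenate([head, tail_oversampled])
print(np.bincount(balanced))  # equal counts, but the tail rows are repeats
&lt;/code&gt;&lt;/pre&gt;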
&lt;p&gt;Why is this happening? This isn't representative of actual learning, and it certainly won't work well in the real world if this is the performance. Let's investigate one potential factor in how this occurs.&lt;/p&gt;
&lt;h3 id="the-pressure-to-publish-and-benchmarks"&gt;The pressure to publish and "Benchmarks"&lt;/h3&gt;
&lt;p&gt;In academia, there is more pressure to publish than ever before. Especially in fast-moving fields like machine learning or AI, researchers and students must attempt to create novel, breakthrough work at record speed. But usually novel work takes time, it takes inspiration, it takes many trials and errors and blockers until you have a really interesting idea and approach.&lt;/p&gt;
&lt;p&gt;So, how do you keep publishing at a high speed if you don't have the time to actually explore ideas fully? You aim for benchmarks!&lt;/p&gt;
&lt;p&gt;Benchmarks and their accompanying leaderboards have become a gamification of machine learning research.&lt;sup id="fnref:1"&gt;&lt;a class="footnote-ref" href="#fn:1"&gt;1&lt;/a&gt;&lt;/sup&gt; A benchmark dataset is often introduced as a paper itself (bonus: you get a paper published by creating a new benchmark) and usually introduces a particular task and dataset -- such as &lt;a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4389233"&gt;a generative AI model passing the bar exam&lt;/a&gt;. After publication, someone will beat the initial model that won the benchmark (another novel paper!). Then, someone else will beat that model (yet again a novel paper!).&lt;/p&gt;
&lt;p&gt;But is the data representative of real-world problems? Is the data diverse and representative? Is the testing data "unseen"? Is the benchmark useful for the use cases that people need solved?&lt;/p&gt;
&lt;h4 id="kaggle-culture-and-the-origin-of-leaderboards"&gt;Kaggle Culture and the origin of Leaderboards&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://www.kaggle.com/"&gt;Kaggle&lt;/a&gt; is an online machine learning community that started in 2010 (and has since largely been replaced by HuggingFace).&lt;/p&gt;
&lt;p&gt;Kaggle hosted many popular datasets and competitions in the early-to-mid 2010s. The goal was to share models that beat other models at particular machine learning competitions or tasks. This usually involved overengineering models with no attention to generalization, making them bigger than ever, using every possible feature you could think of, using (now dated but then trendy) techniques like AutoML, where feature extraction is automated and becomes opaque. You could often win money, internships or even get hired based on your Kaggle status.&lt;/p&gt;
&lt;p&gt;There are several known examples of teams or participants figuring out how to train on the test dataset, or &lt;a href="https://www.theregister.com/2020/01/21/ai_kaggle_contest_cheat/"&gt;directly encode the not very well hidden test dataset&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="An example leaderboard showing a variety of models and their score on an aggregated benchmark on HuggingFace. You can see that many of these models are submitted by single users who likely have overtrained on a subset of the evaluation data." src="./images/2024/llm_leaderboard.png"&gt;
&lt;em&gt;A Hugging Face LLM leaderboard, where all models are fine-tuned by individuals and "outperform" the base model trained by large expert teams. Can you guess what the fine-tuning data is?&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;The entire cultural goal was being the #1 machine learning model, and for that, you would do anything to squeeze out extreme accuracy, even if it wasn't really a very good machine learning model afterwards. This "winner takes all" leaderboard mentality still exists in today's machine learning community, now as &lt;a href="https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard"&gt;Hugging Face leaderboards&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The problems outlined in this article contribute significantly to common problems in real-use applications, like models never launching into production.&lt;sup id="fnref:2"&gt;&lt;a class="footnote-ref" href="#fn:2"&gt;2&lt;/a&gt;&lt;/sup&gt; These realities also contribute to "model drift" or "data drift", where model performance shifts when launched into real-world use cases in production settings. But where did the data drift to? Simply outside of the carefully collected training dataset representation.&lt;/p&gt;
&lt;p&gt;A question to others in the machine learning community: How sure are we that we are getting the population right? Are we using basic statistical thinking to model our data collection approach? Can we learn from social sciences on population representation? Are we focused on creating the best models for real-world impact? Are we challenging current data collection methods for bias, misrepresentation and (in many ways) lack of real world applicability? Can we foster better understanding of what AI models humans want and start our evaluation sets there?&lt;/p&gt;
&lt;p&gt;In the next article, you'll use what you learned to review how massively repeated examples are memorized. We're diving into the "heady" part first. ;)&lt;/p&gt;
&lt;p&gt;I'm very open to feedback (positive, neutral or critical), questions (there are no stupid questions!) and creative re-use of this content. If you have any of those, please share it with me! This helps keep me inspired and writing. :)&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Acknowledgements&lt;/em&gt;: I would like to thank &lt;a href="https://vickiboykis.com/"&gt;Vicki Boykis&lt;/a&gt;, &lt;a href="https://desfontain.es/serious.html"&gt;Damien Desfontaines&lt;/a&gt; and &lt;a href="https://www.linkedin.com/in/yanndupis/"&gt;Yann Dupis&lt;/a&gt; for their feedback, corrections and thoughts on this series. Their input greatly contributed to improvements in my thinking and writing. Any mistakes, typos, inaccuracies or controversial opinions are my own.&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;This obviously wasn't the initial intention of benchmarks, which was more about finding useful metrics that were (hopefully) not directly in the training datasets. I don't think the mentality I describe applies to all researchers or practitioners; however, it's still become a serious cultural and strategic problem in achieving useful model metrics and it's deeply affected model development. On the day I published this article, there was a really nice &lt;a href="https://www.technologyreview.com/2024/11/26/1107346/the-way-we-measure-progress-in-ai-is-terrible/"&gt;MIT Technology Article on the problems of benchmarks&lt;/a&gt;.&amp;#160;&lt;a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;There are online methods for evaluating models in real-time, like measuring performance metrics directly in the application using the model. Companies can and do develop these near real-time, online evaluations and sometimes even directly learn from production environments or deploy new models automatically when they perform better than the current ones. I would caution, however, that this lack of human oversight of the incoming "test" dataset can reproduce the same problems as biases in the collected data -- probably even more so if your application isn't used by billions of people (and even then, who and who not?).&amp;#160;&lt;a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="ml-memorization"></category></entry><entry><title>Exploring new meadows</title><link href="https://blog.kjamistan.com/exploring-new-meadows.html" rel="alternate"></link><published>2024-11-20T00:00:00+01:00</published><updated>2024-11-20T00:00:00+01:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2024-11-20:/exploring-new-meadows.html</id><summary type="html">&lt;p&gt;Hello!&lt;/p&gt;
&lt;p&gt;We may not know each other, but here you are on my website -- perhaps because you saw a post or someone shared a link. I'm resourceful, determined, intelligent and looking for new challenges. Welcome!&lt;/p&gt;
&lt;p&gt;Wenn Deutsch einfacher ist, schreiben Sie mir bitte per Email (katharine at kjamistan punkt com …&lt;/p&gt;</summary><content type="html">&lt;p&gt;Hello!&lt;/p&gt;
&lt;p&gt;We may not know each other, but here you are on my website -- perhaps because you saw a post or someone shared a link. I'm resourceful, determined, intelligent and looking for new challenges. Welcome!&lt;/p&gt;
&lt;p&gt;Wenn Deutsch einfacher ist, schreiben Sie mir bitte per Email (katharine at kjamistan punkt com) oder auf &lt;a href="https://www.linkedin.com/in/katharinejarmul/"&gt;LinkedIn&lt;/a&gt;, damit ich meinen Lebenslauf weitergeben kann.&lt;/p&gt;
&lt;h4 id="about-me"&gt;[About Me]&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;10+ years experience in working on machine learning, deep learning and AI systems. Started in Natural Language Processing (2011) and moved to privacy (federated learning, encrypted learning, differential privacy) in data and ML/AI systems in 2017. Experienced in driving data and ML projects to successful outcomes.&lt;/li&gt;
&lt;li&gt;Author of &lt;a href="https://www.oreilly.com/library/view/practical-data-privacy/9781098129453/"&gt;&lt;em&gt;Practical Data Privacy&lt;/em&gt; (O'Reilly 2022)&lt;/a&gt;, translated into &lt;a href="https://dpunkt.de/produkt/data-privacy-in-der-praxis/"&gt;German&lt;/a&gt; and coming soon in Polish, on using privacy technologies in data and machine learning systems and building better governance and privacy into data workflows and data teams. Video course to accompany the book coming in January 2025. Curator and author of the &lt;a href="https://probablyprivate.com/"&gt;&lt;em&gt;Probably Private&lt;/em&gt; newsletter&lt;/a&gt; with more than 500 subscribers.&lt;/li&gt;
&lt;li&gt;C-level consultant for security, governance and privacy in data and AI systems at Thoughtworks' clients in the EU and globally. Developing future-proof data strategies that achieve business goals while building trustworthy relationships with customers and partners.&lt;/li&gt;
&lt;li&gt;Technical leader with product know-how. Can cross technical, product and business lines to transfer knowledge, assess and reinforce alignment and develop strategic and pragmatic planning and execution. Experience leading small and large teams (3-50 developers/data persons).&lt;/li&gt;
&lt;li&gt;Multiple time startup founder with experience on raising, board communication, team building and product-market fit.&lt;/li&gt;
&lt;li&gt;Handlungssicher auf Deutsch (C1 Zertifikat). Ich interessiere mich für eine Stelle, bei der ich auf Deutsch arbeite.&lt;/li&gt;
&lt;li&gt;More than 15 years in the technology industry, with technical and product experience in machine learning, data science and data engineering, architecture, security engineering, software design and development and large-scale cloud deployment and automation.&lt;/li&gt;
&lt;li&gt;Regular speaker and keynoter at international conferences such as CCC, Strangeloop, QCon, ACM, PyData, PyCon, EuroPython. Due to my strong technical background, I have covered topics like data privacy, machine learning security and AI ethics and continue to be invited to speak on these topics.&lt;/li&gt;
&lt;li&gt;Lecturer, former adjunct professor and successful educator including courses for O'Reilly, University of Florida, DataCamp and numerous educational workshops. You can check out my teaching style &lt;a href="https://www.youtube.com/@ProbablyPrivate"&gt;on the Probably Private YouTube channel&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Excel at rapid grasp of new technologies, and asking difficult questions to surface critical issues -- driving teams to research, learn, debate and decisively resolve issues as they arise.&lt;/li&gt;
&lt;li&gt;Tinkerer, hacker and forever programmer, see &lt;a href="https://github.com/kjam"&gt;GitHub&lt;/a&gt; for an overview of my interests and open projects.&lt;/li&gt;
&lt;li&gt;Fluent in Python and GoLang and have experience with C++ and Java.&lt;/li&gt;
&lt;li&gt;Founder of PyLadies, mentor and ally for several women of color and immigrant women in tech initiatives, conference diversity scholarship organizer, persistent advocate for the underrepresented in tech.&lt;/li&gt;
&lt;li&gt;Background in investigative journalism, love public speaking, meeting new people and working with teams.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="about-you"&gt;[About You]&lt;/h4&gt;
&lt;p&gt;Here's a few things I'm hoping you can tell me:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;What's your team like? Is it diverse (gender, race, immigration status, age)?&lt;/li&gt;
&lt;li&gt;What relevant problems do you solve? What excites you about your work / product?&lt;/li&gt;
&lt;li&gt;Do you let folks learn on the job? Is this supported with mentoring / pairing / reviews, etc?&lt;/li&gt;
&lt;li&gt;Are you friendly to remote workers or based in Berlin?&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="dream-role"&gt;[Dream Role]&lt;/h4&gt;
&lt;p&gt;I'm not sure what is available right now given the noise around AI, so I am posting this to learn about opportunities outside of my direct network. Feel free to send it along!&lt;/p&gt;
&lt;p&gt;Ideally I'd like a role where I can:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Share my knowledge and continue research and work on AI/ML privacy and security&lt;/li&gt;
&lt;li&gt;Either strategically lead teams (leadership/management) or work in IC roles as a ML/tech lead&lt;/li&gt;
&lt;li&gt;Develop personalized and privacy-first AI/ML systems at a product-company (particularly interested in B2C) (see &lt;a href="https://blog.kjamistan.com/private-and-personalized-ai.html"&gt;my views on personal and private AI systems&lt;/a&gt;) or develop communally-run data and AI systems for things like public infrastructure, transport and energy.&lt;/li&gt;
&lt;li&gt;Spend some of my day speaking and writing German (oder meinen ganzen Tag!)&lt;/li&gt;
&lt;li&gt;Work with a motivated team who enjoys collaboration, learning and mutual support&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you think your organization might be a fit, please drop me a line! You can &lt;a href="https://www.linkedin.com/in/katharinejarmul/"&gt;reach out on LinkedIn&lt;/a&gt; or on email I'm katharine at the top-level domain you are currently on. Spelling matters (i.e. kath-A-rine).&lt;/p&gt;
&lt;p&gt;Thanks for dropping by. 🤗&lt;/p&gt;</content><category term="misc"></category></entry><entry><title>Private and Personalized AI</title><link href="https://blog.kjamistan.com/private-and-personalized-ai.html" rel="alternate"></link><published>2024-11-19T00:00:00+01:00</published><updated>2024-11-19T00:00:00+01:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2024-11-19:/private-and-personalized-ai.html</id><summary type="html">&lt;p&gt;I recently had the wonderful experience of &lt;a href="https://pydata.org/paris2024"&gt;keynoting PyData Paris&lt;/a&gt;, thanks again for the invite! When deciding on a topic, I was considering my &lt;a href="https://blog.kjamistan.com/a-deep-dive-into-memorization-in-deep-learning.html"&gt;recent research about how AI/ML systems memorize data&lt;/a&gt;. As I've mentioned in a few talks, if we indeed embraced the fact that machine learning systems …&lt;/p&gt;</summary><content type="html">&lt;p&gt;I recently had the wonderful experience of &lt;a href="https://pydata.org/paris2024"&gt;keynoting PyData Paris&lt;/a&gt;, thanks again for the invite! When deciding on a topic, I was considering my &lt;a href="https://blog.kjamistan.com/a-deep-dive-into-memorization-in-deep-learning.html"&gt;recent research about how AI/ML systems memorize data&lt;/a&gt;. As I've mentioned in a few talks, if we indeed embraced the fact that machine learning systems memorize training data, we'd probably design them differently. What would it look like if you could just use your own data, or own your own model, or both?&lt;/p&gt;
&lt;p&gt;I've been inspired by recent AI product shifts in this direction including &lt;a href="https://www.apple.com/apple-intelligence/"&gt;the Apple Intelligence launch&lt;/a&gt;, which promises to be more private and personalized. Although &lt;a href="https://www.macrumors.com/2024/10/28/apple-intelligence-eu-april-2025/"&gt;it's not available yet in the EU&lt;/a&gt;, likely due to its currently closed functionality, I am excited to see what innovation it brings to thinking about personalization in AI/ML systems.&lt;/p&gt;
&lt;p&gt;These developments struck me as similar to other events in technology history, like the emergence of the personal computer. Maybe we can learn from that history to see how to make AI systems a helpful and integral part of our society?&lt;/p&gt;
&lt;h3 id="what-can-we-learn-from-history"&gt;What can we learn from history?&lt;/h3&gt;
&lt;p&gt;At present AI is a centralized, specialized field. It reminds me a lot of early computing or pre-cloud data centers. Early computers were huge machines in rooms full of specialists, used only for special tasks. Here is an example of such a machine and its engineers in the late 1950s.&lt;/p&gt;
&lt;p&gt;&lt;img alt="An image of an IBM mainframe in the background of a room with two persons working on it. One person is wearing a suit and looking at the mainframe panel. The other person is wearing a dress and looks to be typing at a terminal or other interface." src="./images/2024/ibm_mainframe.jpg"&gt;
&lt;strong&gt;An IBM 704 mainframe at NACA in 1957&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The parallels are actually quite striking when you list out some characteristics:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;large, expensive and centralized compute&lt;/li&gt;
&lt;li&gt;run by a small group of highly specialized workers&lt;/li&gt;
&lt;li&gt;task-specific programming, often for research or large corporate interests&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;What brought about the revolution in computing? What made it so that we all walk around with a computer in our purse, pocket, bag?&lt;/p&gt;
&lt;p&gt;One of the initial turning points was the development of the personal computer (PC), but even that was mainly used by hobbyists and didn't initially have wider market impact. But as software became more useful, that perspective shifted. One good example of this shift was &lt;a href="https://en.wikipedia.org/wiki/VisiCalc"&gt;VisiCalc on Apple II&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="An image of a screen of VisiCalc, showing a black and green screen which looks like a spreadsheet. It is generating an invoice with unit names, IDs, costs and then adding tax and calculating a total." src="./images/2024/Visicalc.png"&gt;
&lt;strong&gt;VisiCalc&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;With VisiCalc, people could finally see something really useful, something that was worth buying a fairly expensive piece of electronics to do. Everyone needs spreadsheets, right?&lt;/p&gt;
&lt;p&gt;This success started a growing trend of making software not just for hobbyists, but for the general public, for your work and life. This momentum created more focus on user-friendly, understandable GUIs (graphic interfaces), it let people bring their own data and it created experiences of joy and fun. As this trend continued, there was a need to use more than one computer, or connect with others -- building both the market and actual demand for the internet and things like web browsers. Each of these steps in computing development brought new use cases, new persons, new data and new communities along with them.&lt;/p&gt;
&lt;h3 id="is-ai-community-oriented-user-friendly-easy"&gt;Is AI community-oriented? User-friendly? Easy?&lt;/h3&gt;
&lt;p&gt;This brings us back to the current status in AI. Where are we in this evolution?&lt;/p&gt;
&lt;p&gt;&lt;img alt="A line with 4 distinct spots. On the left it is labeled: Centralized, Expensive, Corporate. The next point is labeled: Hobbyist, Specialized. The next point is General Use, GUIs and Software, Easy-to-Install and Use and then the final point on the far right is Customizable, Easily Connected, Open for online and offline usage." src="./images/2024/ai_adoption_curve.png"&gt;
&lt;em&gt;Where do you place AI systems on this scale?&lt;/em&gt;&lt;sup id="fnref:1"&gt;&lt;a class="footnote-ref" href="#fn:1"&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;I think we are still somewhere in these early stages of VisiCalc, approaching the next stages, but somewhat slowly due to the lack of truly open models with open data. It's still quite difficult to try to bring your own data or your own use case -- other than typing something into a prompt or uploading one image at a time in an interface.&lt;/p&gt;
&lt;p&gt;How do I easily connect AI to the documents on my computer? How do I use it with my photo storage or art? How do I have it only use my writing, documents, emails?&lt;sup id="fnref:2"&gt;&lt;a class="footnote-ref" href="#fn:2"&gt;2&lt;/a&gt;&lt;/sup&gt; Can I train it myself to get the results I want (i.e. by labeling myself and/or engineering my own prompts without having to learn much about how it works)? How do I do all of this safely and successfully (i.e. without having to reread everything and do everything twice)?&lt;/p&gt;
&lt;p&gt;One of the main problems in achieving the next stages is that popular AI systems are inherently intransparent. This is a great marketing trick because you can claim magic, but awful for actually making AI systems more trustworthy and useful for humans. As a user, I don't have to understand everything, but I should be able to understand enough to avoid poor quality outcomes. I should also trust things enough to know they won't accidentally reply to an email from my boss with details about my upcoming job interview or &lt;a href="https://www.ndtv.com/feature/unbuttoned-blouse-made-up-bra-ex-google-techie-claims-her-photo-was-edited-for-ai-conference-6811999"&gt;provide a profile photo with my underwear showing&lt;/a&gt;. :/&lt;/p&gt;
&lt;p&gt;What will be the pivotal point that takes AI (or agents or ML models) from where we stand today and moves them into a true revolution, where everyone uses the AI systems directly as regularly as they open their laptop or unlock their phone?&lt;/p&gt;
&lt;h3 id="imagining-what-is-possible-local-document-search-retrieval-and-chat"&gt;Imagining what is possible: Local Document Search, Retrieval and Chat&lt;/h3&gt;
&lt;p&gt;I think this will come first when you can run an AI system as easily as installing software or an application on your phone. It needs to work offline or you need to control when and how it connects and what data it sends (because of the aforementioned trust issues and general usefulness).&lt;/p&gt;
&lt;p&gt;To test out what I wanted in relation to personalized AI, I built a completely local RAG. I'd already downloaded and tried &lt;a href="https://ollama.com/"&gt;Ollama&lt;/a&gt; and &lt;a href="https://gpt4all.io/index.html"&gt;GPT4All&lt;/a&gt;, both of which I liked, but I couldn't tinker with them as easily as I expected and I wanted to build out some other features I had in mind... (more on this soon!)&lt;/p&gt;
&lt;p&gt;&lt;img alt="A screenshot of a Jupyter Notebook showing a query and response around memorization in AI systems." src="./images/2024/local_offline_rag.png"&gt;
&lt;em&gt;A local, offline RAG system search and response&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;I built mine using &lt;a href="https://github.com/UKPLab/sentence-transformers"&gt;examples from UKPLab's sentence transformers&lt;/a&gt; and &lt;a href="https://github.com/Mozilla-Ocho/llamafile"&gt;Mozilla's Llamafiles&lt;/a&gt;. I didn't need a bunch of add-on libraries and it was quite straightforward. I went for simplicity and ability to shift out models or search easily over robustness and functionality. I also wanted it simple so that I could easily demonstrate how the underlying systems work (transparency is important!).&lt;sup id="fnref:3"&gt;&lt;a class="footnote-ref" href="#fn:3"&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
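&lt;p&gt;To give a rough idea of the moving parts, here is a minimal sketch (simplified, not the actual notebook) of a retrieval-plus-generation loop. It assumes sentence-transformers is installed and that a llamafile is already running locally, serving its OpenAI-compatible API on port 8080 (the default):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Minimal local RAG sketch: embed documents, retrieve the best match,
# then ask a locally running llamafile to answer using that context.
# Assumptions: sentence-transformers and requests are installed, and a
# llamafile server is running on http://localhost:8080 (its default port).
import requests
from sentence_transformers import SentenceTransformer, util

documents = [
    "Memorization in deep learning means models can reproduce training data.",
    "Word embeddings map tokens to vectors that preserve linguistic relationships.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # small, local embedding model
doc_embeddings = embedder.encode(documents, convert_to_tensor=True)

question = "What is memorization in machine learning?"
query_embedding = embedder.encode(question, convert_to_tensor=True)

# Cosine similarity search: pick the most relevant document for the question.
scores = util.cos_sim(query_embedding, doc_embeddings)[0]
best_doc = documents[int(scores.argmax())]

# Hand the question plus retrieved context to the local llamafile server.
response = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "local",
        "messages": [
            {"role": "system", "content": f"Answer using this context: {best_doc}"},
            {"role": "user", "content": question},
        ],
    },
)
print(response.json()["choices"][0]["message"]["content"])
&lt;/code&gt;&lt;/pre&gt;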
&lt;p&gt;I released &lt;a href="https://github.com/kjam/personalized-ai"&gt;my proof-of-concept as a Jupyter Notebook on GitHub&lt;/a&gt; with annotations. I'll be adding more notebooks, command-line programs and functionality to this -- so if you'd like to contribute, let me know!&lt;/p&gt;
&lt;p&gt;I'll also be releasing other personal-AI/ML model examples and workflows in the coming weeks and months to inspire others, debunk mythology about how "hard" it is to build local-first data and AI and to hear your feedback on what's interesting and useful.&lt;/p&gt;
&lt;p&gt;I believe we're at a critical moment in the adoption of AI and ML systems as something that can help connect us, that can serve real purpose and that can also be reliable, trustworthy and interesting (maybe even fun!). There are also some pretty dystopian futures that could occur if we continue to have intransparent, corporate-driven AI systems that are more smoke and mirrors than science. I'd like us to use this moment to build AI futures we actually want to see.&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;I'd be curious to hear your thoughts, so feel free &lt;a href="https://probablyprivate.com/about/"&gt;to write me an email&lt;/a&gt; or reach out on &lt;a href="https://www.linkedin.com/in/katharinejarmul/"&gt;LinkedIn&lt;/a&gt;.&amp;#160;&lt;a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;Some people say this is achieved by AI Agents, but I have yet to see an agentic workflow that is clear, trustworthy and transparent enough that I would install it and use it on my computer. I think the security and privacy problems with agents will continue to grow in the short-term, and that they will likely only be fixed with actual user control and transparency--including local-first model design and deployment.&amp;#160;&lt;a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:3"&gt;
&lt;p&gt;I took inspiration from &lt;a href="https://www.youtube.com/watch?v=0nA5QG3087g&amp;amp;ab_channel=HamelHusain"&gt;Ben Clavié's talk on RAG system design&lt;/a&gt;, where he recommends splitting search and retrieval from summarization. I concur that this gives much better results. Thank you for your work!&amp;#160;&lt;a class="footnote-backref" href="#fnref:3" title="Jump back to footnote 3 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="personal-ai"></category></entry><entry><title>Encodings and embeddings: How does data get into machine learning systems?</title><link href="https://blog.kjamistan.com/encodings-and-embeddings-how-does-data-get-into-machine-learning-systems.html" rel="alternate"></link><published>2024-11-18T00:00:00+01:00</published><updated>2024-11-18T00:00:00+01:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2024-11-18:/encodings-and-embeddings-how-does-data-get-into-machine-learning-systems.html</id><summary type="html">&lt;p&gt;In &lt;a href="https://blog.kjamistan.com/a-deep-dive-into-memorization-in-deep-learning.html"&gt;this series&lt;/a&gt;, you've learned a bit about how data is collected for machine learning, but what happens next? You need to turn the collected data -- images, text, video, audio or even just a spreadsheet -- into numbers that can be learned by a model. How does this happen?&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;TLDR (too …&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;/table&gt;</summary><content type="html">&lt;p&gt;In &lt;a href="https://blog.kjamistan.com/a-deep-dive-into-memorization-in-deep-learning.html"&gt;this series&lt;/a&gt;, you've learned a bit about how data is collected for machine learning, but what happens next? You need to turn the collected data -- images, text, video, audio or even just a spreadsheet -- into numbers that can be learned by a model. How does this happen?&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;TLDR (too long; didn't read)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Complex data like images and text need complex representations if you want to use them to predict or learn&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;One way to encode this data while preserving information uses linear algebra&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deep learning also uses linear algebra as building blocks for networks and architectures&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Word embeddings encoded language into linear algebra structures--enabling deep learning with language&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Word embeddings also encode cultural biases and sensitive information&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=r4elspFjlqg&amp;amp;list=PLJkNSeYcYBlCaamscxip0l2LGYCZ2TIom&amp;amp;index=2&amp;amp;ab_channel=ProbablyPrivate"&gt;Watch a video summary of this post&lt;/a&gt;&lt;/p&gt;
&lt;h2 id="why-encode-information"&gt;Why encode information?&lt;/h2&gt;
&lt;p&gt;Data is actually encoded all the time! When you save a file, when you open a program, when you write an email and hit send -- all of these take formats humans can interpret and translate them into formats computers can read, write and process.&lt;/p&gt;
&lt;p&gt;The default computer encoding is bytes (collections of &lt;a href="https://en.wikipedia.org/wiki/Bit"&gt;bits&lt;/a&gt;) -- which the computer can store or process using available hardware, like a CPU and attached memory. Bytes are also used to build datagrams which can be used by internet protocols to send data. These same principles relate to how information is also encoded into other messaging standards, like radio waves that are captured via an antenna and then decoded back into audio via a &lt;a href="https://en.wikipedia.org/wiki/Demodulation"&gt;demodulator&lt;/a&gt;.&lt;/p&gt;
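&lt;p&gt;A tiny Python illustration of that round trip -- text is encoded into bytes and decoded back without losing anything:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Text is encoded into bytes for storage and transport, then decoded back losslessly.
text = "Grüße aus Berlin"
encoded = text.encode("utf-8")     # b'Gr\xc3\xbc\xc3\x9fe aus Berlin' -- what the computer stores
decoded = encoded.decode("utf-8")  # back to the human-readable string
assert decoded == text
&lt;/code&gt;&lt;/pre&gt;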
&lt;p&gt;Encoding and decoding require the design and incorporation of standards to ensure systems interoperate properly. Imagine if your email provider took your text and encoded it incorrectly. The receiver of your email wouldn't be able to open it properly.&lt;/p&gt;
&lt;p&gt;In the early days of machine learning, encoding and decoding usually involved taking numerical datasets to predict another number, making the encoding, decoding and computation obvious and in some ways unnecessary because the computer already could do math on numbers. For example, if you wanted to project a line or trend based on previous data, you can do that without machine learning. As interest, research and use cases expanded, machine learning approaches reached domains where the data wasn't already encoded in numbers that could be learned easily. There needed to be a way to encode and decode letters, words, images, audio and so on.&lt;sup id="fnref:1"&gt;&lt;a class="footnote-ref" href="#fn:1"&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;h2 id="the-magic-of-linear-algebra"&gt;The "magic" of linear algebra&lt;/h2&gt;
&lt;p&gt;In some machine learning problems, a simple algorithm works well and quickly outperforms more complex models -- like when modeling simple linear or easy classification problems. In this case, choosing a simple model, a non-learning based algorithm and just using statistical measurements works well.&lt;/p&gt;
&lt;p&gt;However, there are many problems where the dimensionality of the inputs is too complex for a simple model. This was the case for computer vision problems, like photo classification and object recognition, until the creation of &lt;a href="https://en.wikipedia.org/wiki/AlexNet"&gt;AlexNet in 2012&lt;/a&gt;. AlexNet utilized neural networks and encoded the image into multi-dimensional matrices, or sets of numbers. The encoding mechanism did this in such a way that it attempted to preserve information and relationships and represent those in the resulting matrices. You can think of these matrices as a related set of numbers that preserves the patterns by creating numerical relationships between different "areas" or sections of the encoded data. This is what the machine learning model should learn.&lt;/p&gt;
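&lt;p&gt;To make "encoding an image into matrices" concrete, here is a minimal sketch (using Pillow and NumPy, not AlexNet's actual pipeline) showing that a loaded photo is just a grid of numbers -- the filename is a placeholder:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# A photo becomes a 3-dimensional array of numbers: height x width x color channels.
import numpy as np
from PIL import Image

image = Image.open("bus.jpg")   # any local image file
pixels = np.asarray(image)

print(pixels.shape)   # e.g. (480, 640, 3) -- rows, columns, RGB channels
print(pixels[0, 0])   # the top-left pixel as three numbers between 0 and 255
&lt;/code&gt;&lt;/pre&gt;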
&lt;p&gt;AlexNet was one of the breakthrough image recognition models that introduced deep learning as a viable solution to the variety of image-based machine learning tasks common at the time. This was because AlexNet cleverly leveraged a larger, "deeper" neural network architecture (deep learning) than other neural networks of the era. It also used a clever encoding mechanism.&lt;/p&gt;
&lt;p&gt;Another idea that AlexNet borrowed from earlier computer vision neural networks like &lt;a href="https://en.wikipedia.org/wiki/LeNet"&gt;LeNet-5 from 1998&lt;/a&gt; was the &lt;a href="https://en.wikipedia.org/wiki/Convolutional_layer"&gt;convolutional layer&lt;/a&gt;. These layers require many matrix computations, making them compute-hungry and therefore expensive in both computation and energy. One clever idea from the paper was to parallelize the processing by using two GPUs; in the past, usually only one CPU or GPU was used. By parallelizing the computations, the researchers were able to increase the model parameter size and also unlock the power of matrices and linear algebra for deep learning.&lt;/p&gt;
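&lt;p&gt;For intuition on why convolutional layers are so matrix-heavy, here is a small hand-rolled sketch of a single 2D convolution in NumPy -- real frameworks do this far more efficiently and in parallel across GPUs:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np

# A tiny 6x6 "image" and a 3x3 edge-detection style filter (kernel).
image = np.arange(36, dtype=float).reshape(6, 6)
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype=float)

# Slide the kernel over the image; every output value is an element-wise
# product of a patch and the kernel, summed up -- lots of small matrix math.
out = np.zeros((4, 4))
for i in range(4):
    for j in range(4):
        patch = image[i:i + 3, j:j + 3]
        out[i, j] = np.sum(patch * kernel)

print(out)  # one feature map; a real conv layer computes many of these at once
&lt;/code&gt;&lt;/pre&gt;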
&lt;p&gt;In the following diagram from &lt;a href="https://proceedings.neurips.cc/paper_files/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf"&gt;the original paper&lt;/a&gt;, each of the dotted lines and rectangles inside a layer show an example of what computations run at each layer on the parallel GPUs. You can think of each layer as a series of linear algebra matrix computations that take the results of the previous layer and continue to compute with them, with the goal of optimizing for the learning task at hand. You will learn about these in more depth in a later article.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A diagram of many layers shown in 3-D drawing as rectangles. The layers have small areas highlighted and then projected with dotted lines into the next layer until you reach the dimensions at the end. Each of the layers show an area highlighted internally in the layer which represents the matrices used by AlexNet in that particular layer. You can see that for each layer there are two highlights, showing the parallelization they achieved." src="./images/2024/alexnet.png"&gt;
&lt;em&gt;AlexNet Architecture Diagram&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Linear algebra has been used for hundreds of years to build systems of equations and map them to linear spaces. What does that mean and why is it relevant? You can take real world problems in engineering or physics and model them in mathematics. By taking data or known properties and building it into a system of equations and then mapping those equations into a "space", you essentially compress the problem space and can create optimized ways to solve for all results or a set of optimal results. You can also use these modeled systems to predict, infer, observe and learn.&lt;/p&gt;
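&lt;p&gt;As a very small example of what "building a system of equations and mapping it into a space" looks like in practice, here is NumPy solving two linear equations at once:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np

# Two equations: 2x + y = 5 and x + 3y = 10, written as a matrix A and a vector b.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([5.0, 10.0])

solution = np.linalg.solve(A, b)  # finds the x and y that satisfy both equations
print(solution)                   # [1. 3.], meaning x = 1 and y = 3
&lt;/code&gt;&lt;/pre&gt;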
&lt;p&gt;Linear algebra powers many machine learning systems and is the core building block of deep learning. By modeling complex tasks like how to locate and name the objects in an image (image segmentation and object detection/recognition) in linear algebra systems, deep learning can perform these quite challenging tasks.&lt;/p&gt;
&lt;p&gt;Computer vision benefited greatly from encoding images into matrices and leveraging those to unlock linear algebra powers, but what about text? Let's explore the changes that allowed for language-based deep learning, or natural language processing (NLP).&lt;/p&gt;
&lt;h2 id="encoding-language-with-tokens-and-embeddings"&gt;Encoding language with tokens and embeddings&lt;/h2&gt;
&lt;p&gt;Natural language processing leverages learnings from the field of linguistics. One way to encode language is to use linguistic knowledge like &lt;a href="https://en.wikipedia.org/wiki/Language_family"&gt;language families&lt;/a&gt; and &lt;a href="https://en.wikipedia.org/wiki/Root_(linguistics)"&gt;root words&lt;/a&gt; to chunk text into smaller words or &lt;a href="https://en.wikipedia.org/wiki/Word_stem"&gt;stems&lt;/a&gt;. You do this to achieve smaller and more consistent building blocks of language so that you can concentrate on patterns or information contained in a smaller vocabulary. In NLP, these linguistic chunks are called tokens. For example, you might take the word "foundation" and "founding" and "found" and agree that they should all be reduced to "found". This works fairly well, but what about the "found" in "lost and found"? Does it have the same meaning? The beautiful complexities of language and how each language develops differently adds challenges to tokenization.&lt;/p&gt;
&lt;p&gt;There are many approaches to tokenization, or the breaking down of text into machine learning ready chunks, which become quite language specific. Some of the approaches to tokenization include word-roots or stems, like the example above. Another approach is reading a language character-by-character, which works well for languages where one character has a lot of meaning, like Chinese. The character-based approach also works well when trying to do machine learning with less common languages, where there might be many words that aren't represented well in the training data. With character-based tokenization, the word "found" becomes literally a list of letters: "f", "o", "u", "n", "d". As you might imagine, this doesn't preserve as much of the deeper meanings of the word, because it doesn't necessarily attempt to find things like word roots explicitly -- although the machine learning model can still learn patterns of certain character sequences.&lt;/p&gt;
&lt;p&gt;There are also in-between approaches like subword tokenization, where linguistic understanding is used to break each word into word parts or pieces. This works well because it doesn't reduce the information as much as doing word-based tokenization with word stems. For example, "foundation" might become two tokens: "found", "ation" instead of being reduced to just "found". &lt;a href="https://huggingface.co/learn/nlp-course/chapter2/4"&gt;Hugging Face's introduction to tokenization&lt;/a&gt; is a great read to learn more in-depth about how different tokenizers work.&lt;/p&gt;
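&lt;p&gt;Here is a toy illustration of those three strategies in Python -- simplified stand-ins, not production tokenizers:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Toy illustration of three tokenization strategies (not production tokenizers).
word = "foundation"

# 1. Stem-based: reduce related words to a shared root.
stems = {"foundation": "found", "founding": "found", "found": "found"}
print(stems[word])        # "found"

# 2. Character-based: every character becomes its own token.
print(list(word))         # ['f', 'o', 'u', 'n', 'd', 'a', 't', 'i', 'o', 'n']

# 3. Subword: split into pieces that keep more information than the stem alone.
print(["found", "ation"]) # how a subword tokenizer might split it
&lt;/code&gt;&lt;/pre&gt;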
&lt;p&gt;Why are there so many approaches to tokenization for NLP? Due to the &lt;a href="https://blog.kjamistan.com/machine-learning-dataset-distributions-history-and-biases.html"&gt;long-tail distributions you learned about in the previous article&lt;/a&gt;, many languages are underrepresented in the online content available when compared with English. In addition, niche topics and content are underrepresented compared to more popular content.&lt;/p&gt;
&lt;p&gt;These imbalanced data problems present issues when encoding tokens into numerical representations so you can successfully train machine learning models. Early techniques borrowed from linguistic concepts like token frequency or tried to build encodings based on interesting uncommon tokens. These techniques didn’t successfully encode the relationship of the words to one another in longer texts or passages. These encoding methods and the related datasets presented challenges for early natural language processing models because they had to deal with extremely sparse datasets. Imagine choosing a number for every possible token and then just saying whether the token is there in a sentence or not. You will end up with many tokens that are missing in each sentence. This made machine learning difficult and costly because &lt;a href="https://en.wikipedia.org/wiki/Sparse_matrix"&gt;computing on sparse data&lt;/a&gt; is harder for computers to do.&lt;/p&gt;
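&lt;p&gt;To see how quickly the "one number per token, present or not" approach becomes sparse, here is a small sketch using scikit-learn's CountVectorizer:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Bag-of-words encoding: one column per vocabulary token, mostly zeros per sentence.
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "the bus stops at the station",
    "word embeddings encode linguistic relationships",
    "long tail distributions make learning hard",
]

vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(sentences)  # stored as a sparse matrix

print(matrix.shape)       # (3, number of unique tokens across all sentences)
print(matrix.toarray())   # each row is mostly zeros -- the sparsity problem
&lt;/code&gt;&lt;/pre&gt;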
&lt;p&gt;An important moment in text- and language-based machine learning was the creation of word- or token-embeddings, which moved away from sparse matrices and allowed for better leveraging of linear algebra (and therefore deep learning). In 2013, &lt;a href="https://en.wikipedia.org/wiki/Word2vec"&gt;Word2Vec&lt;/a&gt; (short for word to vector) was released. Word2Vec is a machine learning model which takes a word or token and maps it to a vector representation which is learned by first training the model on the linguistic relationships in text. The vector is like an encoded version of the word which tries to map its relationship to all the other words that the model has processed.&lt;sup id="fnref:2"&gt;&lt;a class="footnote-ref" href="#fn:2"&gt;2&lt;/a&gt;&lt;/sup&gt; This process produces mathematical connections or links between the words which show up together frequently, and it also can map different relationships when the word is used in different contexts, like the "found" example earlier in this article. This is why these representations are called "word embeddings", borrowing from the &lt;a href="https://en.wikipedia.org/wiki/Embedding"&gt;mathematical concept of embeddings&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="There is a 3D graph with a few data points: Man, Woman, King, Queen. The locations of these points represent the word embedding location for that token in this simplified 3D space. If you take the distance and direction between Man and King it is the same as the distance and direction between Woman and Queen." src="./images/2024/word2vec.png"&gt;
&lt;em&gt;Simplified 3D space example of Word2Vec word embeddings and their relationships&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Word2Vec introduced a context-aware way to link words in embedded form to one another. The Word2Vec model acted as an encoder into a compressed linear algebra space that translated the linguistic relationships more accurately. One famous example from the original paper used the model to complete analogies, like Man is to Woman as King is to Queen. When you took the vector representing "woman" and subtracted the vector for "man", you got a vector showing the distance and direction between those two words. If you added this difference to "king", you landed at "queen". Pretty neat!&lt;/p&gt;
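&lt;p&gt;If you want to try the analogy arithmetic yourself, gensim can download pretrained Word2Vec-style vectors. A minimal sketch -- note the download is large, and the exact neighbors and scores depend on the vectors you use:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# king - man + woman is close to queen, using pretrained vectors via gensim.
# Assumptions: gensim is installed and the ~1.6 GB vector file can be downloaded once.
import gensim.downloader as api

vectors = api.load("word2vec-google-news-300")

result = vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # e.g. [('queen', 0.71...)] -- the exact score depends on the model
&lt;/code&gt;&lt;/pre&gt;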
&lt;p&gt;Unfortunately, these embeddings had many other issues, including my discovery shortly after Google released their Word2Vec model that &lt;a href="https://blog.kjamistan.com/embedded-isms-in-vector-based-natural-language-processing.html"&gt;Man is to Computer Programmer as Woman is to Homemaker&lt;/a&gt;. 🙄 The resulting embedding models had learned racism, homophobia and US-centricity, which you can read more about in research by &lt;a href="https://arxiv.org/abs/1606.06121"&gt;Bolukbasi et al.&lt;/a&gt; and &lt;a href="https://www.pnas.org/doi/abs/10.1073/pnas.1720347115"&gt;Garg et al.&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;There are many newer approaches than Word2Vec, but the underlying principles remain similar. To learn about the advances that happened or to dive deeper into the topic, check out &lt;a href="https://vickiboykis.com/what_are_embeddings/next.html"&gt;Vicki Boykis's fantastic and freely available exploration of embeddings&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In the context of deep learning memorization, it might be useful, then, to memorize some relationships between these tokens. It can be quite useful to know that certain words always appear together, or that certain names are inherently connected. But this brings up considerations for privacy. Should embeddings related to private individuals be able to be learned or memorized?&lt;/p&gt;
&lt;h3 id="do-embeddings-contain-personal-information"&gt;Do embeddings contain personal information?&lt;/h3&gt;
&lt;p&gt;In Summer 2024 the Hamburg data protection authority released &lt;a href="https://datenschutz-hamburg.de/fileadmin/user_upload/HmbBfDI/Datenschutz/Informationen/240715_Discussion_Paper_Hamburg_DPA_KI_Models.pdf"&gt;a discussion paper stating that LLMs do not contain personal information&lt;/a&gt;. While the paper is not a legal ruling, it does set guidance for companies within Germany (and presumably the EU) who have interest in using, fine-tuning or training LLMs. For organizations who provide services in the EU, and therefore must follow the &lt;a href="https://commission.europa.eu/law/law-topic/data-protection/data-protection-eu_en"&gt;EU General Data Protection Regulation (GDPR)&lt;/a&gt;, these opinions provide useful legal interpretation and guide compliance and privacy decisions.&lt;/p&gt;
&lt;p&gt;Let's take a concrete example from the paper. The paper uses the question (in German): Ist ein LLM personenbezogen? (English: Is an LLM personal [data]?), which the paper tokenizes like so:&lt;/p&gt;
&lt;p&gt;[I][st][ e][in][ LL][M] [ person][en][be][z][ogen] [?]&lt;/p&gt;
&lt;p&gt;The paper also uses the example of someone named Mia Müller, stating that Mia's name is tokenized as "M", "ia", "Mü" and "ller". This is a key example used to say that the name is now split into tokens, and is therefore no longer personally identifiable.&lt;/p&gt;
&lt;p&gt;They reference &lt;a href="https://platform.openai.com/tokenizer"&gt;OpenAI's tokenizer&lt;/a&gt;, which has a handy online interface, so let's check their work quickly:&lt;/p&gt;
&lt;p&gt;&lt;img alt="An image of text where different pieces of the text are highlighted, showing where the tokens break down. Here, we see that the name Mia Müller contains 3 tokens: Mia, Mü, and ller. We also see that the sentence is broken down similar to what the paper described." src="./images/2024/tokenization_gpt3_openai.png"&gt;
&lt;em&gt;GPT-3 Tokenizer&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;img alt="Another tokenization example, this time with GPT-3.5 and GPT-4. In this case the name Mia Müller is only two tokens (split on the name), and the final sentence is Ist, ein, L, LM, person, en, be, z, ogen" src="./images/2024/tokenization_gpt4_openai.png"&gt;
&lt;em&gt;GPT-3.5 and GPT-4 Tokenizer&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Using GPT-3, I can reproduce their experiment... but there are differences between GPT-3 and GPT-3.5. How come?&lt;/p&gt;
&lt;p&gt;OpenAI's tokenization uses &lt;a href="https://en.wikipedia.org/wiki/Byte_pair_encoding"&gt;byte-pair encoding&lt;/a&gt; which helps for tokenizing multiple languages at once and processing messy internet or chat text. This encoding mechanism uses clever ways to detect and deduplicate linguistic patterns without explicitly incorporating linguistic knowledge.&lt;sup id="fnref:3"&gt;&lt;a class="footnote-ref" href="#fn:3"&gt;3&lt;/a&gt;&lt;/sup&gt; To note: the tokenizer doesn't show the embeddings, which are only available via a separate API call (the model is not released publicly for download). The tokenizer takes text and returns a series of indices (like a lookup table) for the appropriate token embedding in the OpenAI system.&lt;/p&gt;
&lt;p&gt;When evaluating the GPT tokenization output above, understand that it shows both the tokenizer and the related embeddings for that model. The GPT-3 tokenizer and its trained embedding model produce something closer to character-based embeddings when given German text (and likely this applies to other languages for that tokenizer-embedding combination). The GPT-3.5+ tokenizer and embedding model outputs something closer to subwords.&lt;/p&gt;
&lt;p&gt;One possible explanation for the differences between these tokenizer and embedding model combinations is that OpenAI acquired better German language training data, which resulted in better tokenization and embeddings for German text. As shown above, Mia's name is now tokenized as one token per name, meaning those words were common enough to each get their own token and related embedding. In the GPT-3 tokenizer and embedding model, common English names with only one common spelling are already tokenized as one token per name.&lt;/p&gt;
&lt;p&gt;Therefore, it is misleading to interpret the tokenization itself as a practice that removes personally identifiable data, which is what Hamburg has stated in their discussion paper. If you truly want to have a tokenizer obfuscate personal data, this must be done intentionally and likely is only truly accurate if the identifiable information is never tokenized.&lt;/p&gt;
&lt;p&gt;Furthermore, the office incorrectly describes the process of turning tokens into embeddings as an encoding mechanism that further diminishes the "personally identifiable" part of the data. That would be like saying storing a text on your computer makes it not personally identifiable, because it's actually stored in bytes.... which, of course, is not a very useful interpretation of how computer- or machine-readable encodings work. Just because a human cannot look at an embedding and know what word, token or letter it represents doesn't mean that that same person cannot use &lt;a href="https://github.com/openai/tiktoken"&gt;OpenAI's freely provided decoder&lt;/a&gt; to understand what the data is -- or that a machine cannot learn or interpret the data. In fact, this is exactly what machine learning is trying to accomplish.&lt;/p&gt;
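&lt;p&gt;For example, with the open-source tiktoken library you can encode the paper's example sentence and decode the token indices right back -- this sketch uses the cl100k_base encoding, the one behind GPT-3.5/GPT-4 era models:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Tokenize the example sentence and decode it back -- the mapping is fully reversible.
# Assumption: tiktoken is installed (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "Ist ein LLM personenbezogen? Mia Müller"
token_ids = enc.encode(text)

print(token_ids)                                               # a list of integer indices
print([enc.decode_single_token_bytes(t) for t in token_ids])   # the raw bytes behind each index
print(enc.decode(token_ids) == text)                           # True -- nothing is lost
&lt;/code&gt;&lt;/pre&gt;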
&lt;p&gt;By default, embedding models like Word2Vec and more powerful ones like OpenAI's model want to retain and internally represent the relationships between tokens. The trained model should take information in the tokens and transfer that into relationships in the embeddings. In doing so, it learns relationships, like how "be" and "zogen" together form a common mapping, especially when these tokens follow "personen". This is what makes embeddings so powerful.&lt;/p&gt;
&lt;p&gt;In addition, these embeddings are then used to train the actual language model. Nearly all natural language models (deep or otherwise) use embedding sequences to learn patterns. Even if the individual embeddings for a name are chunked oddly due to the tokenization strategy, it's likely that the embedding model has seen their sequence and combination. Even if that embedding model never saw those tokens together as part of the embedding model training, the sequence and relationship between those embeddings can be learned by the language model. By design, tokenization and the embedding model should enhance the ability for the language model to learn the relationships, not detract from it. This feature of data encoding and its subsequent model training means models also learn patterns in personal information.&lt;/p&gt;
&lt;p&gt;The interpretation that a model "only sees the multidimensional representation, devoid of personal data" is, again, like arguing that a computer processing data only sees bytes and therefore cannot interpret it, or that an algorithm cannot learn from personal data. Additionally, the growth of so-called "context windows" means that an LLM or other generative model holds thousands of tokens as accessible data and as sequencing information before creating a response or performing another task. When you chat with ChatGPT, it holds the entire conversation you are having along with the initial instructions or prompt written by the model designers, saving up to 128,000 tokens as additional "context". These embeddings and their ordering can contain many examples of personally identifiable text and are used by the model alongside additional user and session information to formulate a response.&lt;/p&gt;
&lt;p&gt;Large machine learning systems attempt to extract and compress information into data structures that leverage linear algebra and deep learning architectures. In doing so, they enable more complex machine learning tasks. This encoding should enhance learning from the data, not detract. Therefore, Hamburg's take is fairly misinformed when it comes to interpreting how personal data (or really any data) is encoded and used in larger machine learning systems.&lt;/p&gt;
&lt;p&gt;As you learned in this article, language and computer vision machine learning encode data differently, based on what was learned about how to best leverage the power of deep learning and linear algebra. You might be wondering whether computer vision models retain personal information. Some computer vision tasks are set up so the entire goal is to learn personally identifiable information, like with facial recognition systems which should remember the user's face (FaceID as one example). Other tasks might be set up differently, where the model is penalized for learning the specifics. Some questions to ask yourself for further reflection: Should it be known that a photo contains a celebrity, and should that celebrity's name be learned? Should it be learned that a piece of art comes from a particular artist by name? Each of these questions can also be applied to language learning, if a token (or series thereof) ends up representing a person.&lt;/p&gt;
&lt;p&gt;In the next article, you'll investigate how machine learning systems take these encodings or embeddings as input and process them for training machine learning models. You'll learn about how machine learning models are evaluated and validated. Finally, you'll explore machine learning culture to see how it affects memorization in machine learning.&lt;/p&gt;
&lt;p&gt;I'm very open to feedback (positive, neutral or critical), questions (there are no stupid questions!) and creative re-use of this content. If you have any of those, please share it with me! This helps keep me inspired and writing. :)&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Acknowledgements&lt;/em&gt;: I would like to thank &lt;a href="https://vickiboykis.com/"&gt;Vicki Boykis&lt;/a&gt;, &lt;a href="https://desfontain.es/serious.html"&gt;Damien Desfontaines&lt;/a&gt; and &lt;a href="https://www.linkedin.com/in/yanndupis/"&gt;Yann Dupis&lt;/a&gt; for their feedback, corrections and thoughts on this series. Their input greatly contributed to improvements in my thinking and writing. Any mistakes, typos, inaccuracies or controversial opinions are my own.&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;If you're interested in seeing some alternatives to the encoding and decoding that computers landed on and combining that with problems in machine learning, I recommend looking at the work of inventor, physicist and encoding/decoding machine pioneer &lt;a href="https://en.wikipedia.org/wiki/Emanuel_Goldberg"&gt;Emanuel Goldberg&lt;/a&gt;.&amp;#160;&lt;a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;This process uses either a continuous bag-of-words (pick the right word from a rotating list to fill in the blank) or a skip-gram (pick what words are contextually related and might show up now or soon) approach. You can read more about how this works in &lt;a href="https://en.wikipedia.org/wiki/Word2vec"&gt;the more detailed section of the Wikipedia article&lt;/a&gt;.&amp;#160;&lt;a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:3"&gt;
&lt;p&gt;Byte-pair encoding is an optimized compression algorithm, so things like repeated characters or bytes can and will be compressed into a single mapping based on the other tokens available in the dataset. This is a language-agnostic way of representing text that will expand to fit the common tokens and patterns seen in a large dataset, while also adapting to less common tokens or completely unseen tokens by breaking them down into smaller chunks (i.e. 7Fvw might become 7F, v, w).&amp;#160;&lt;a class="footnote-backref" href="#fnref:3" title="Jump back to footnote 3 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="ml-memorization"></category></entry><entry><title>Machine Learning dataset distributions, history, and biases</title><link href="https://blog.kjamistan.com/machine-learning-dataset-distributions-history-and-biases.html" rel="alternate"></link><published>2024-11-13T00:00:00+01:00</published><updated>2024-11-13T00:00:00+01:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2024-11-13:/machine-learning-dataset-distributions-history-and-biases.html</id><summary type="html">&lt;p&gt;You probably are already aware that many machine learning datasets come from scraped internet data. Maybe you received the infamous GPT response: "Please note that my knowledge is limited to information available up until September 2021." You might have also read fear-mongering opinions and articles that companies will &lt;a href="https://theconversation.com/researchers-warn-we-could-run-out-of-data-to-train-ai-by-2026-what-then-216741"&gt;"run out …&lt;/a&gt;&lt;/p&gt;</summary><content type="html">&lt;p&gt;You probably are already aware that many machine learning datasets come from scraped internet data. Maybe you received the infamous GPT response: "Please note that my knowledge is limited to information available up until September 2021." You might have also read fear-mongering opinions and articles that companies will &lt;a href="https://theconversation.com/researchers-warn-we-could-run-out-of-data-to-train-ai-by-2026-what-then-216741"&gt;"run out of data" to train AI systems&lt;/a&gt; soon.&lt;/p&gt;
&lt;p&gt;In this article, you'll examine exactly how data is collected. You'll look at what properties this data has and evaluate known issues with such collection processes, such as amplifying systemic biases and obscuring privacy. Understanding these points will help you better understand machine learning memorization and evaluate deep learning when designing systems. In this article and the next few articles, you'll be focusing on understanding how machine learning systems work, so that you can later understand how they memorize.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;TLDR (too long; didn't read)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Datasets collected online have a long-tail distribution&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Common examples are heavily repeated&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Uncommon examples outnumber common examples&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Trying to learn uncommon examples in ML systems is hard&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data collection culture is grab everything as cheaply as possible&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;The history of the internet and internet culture introduce systemic biases&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;This creates problems with privacy, equity and justice in ML systems&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=JDAPDpbXRXw&amp;amp;list=PLJkNSeYcYBlCaamscxip0l2LGYCZ2TIom&amp;amp;index=1&amp;amp;ab_channel=ProbablyPrivate"&gt;Watch a video summary of this post&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Let's explore today's collected datasets and see what you can learn about them and how they work. In &lt;a href="https://blog.kjamistan.com/a-deep-dive-into-memorization-in-deep-learning.html"&gt;this series&lt;/a&gt;, you'll focus specifically on data collected for large-scale deep learning tasks.&lt;/p&gt;
&lt;h3 id="natural-datasets-of-scraped-text-and-images"&gt;Natural Datasets of Scraped Text and Images&lt;/h3&gt;
&lt;p&gt;When you collect a large sample of text or image data and visualize the distribution, you often see a long tail. This was described by linguist George Zipf, and is sometimes referred to as the &lt;a href="https://en.wikipedia.org/wiki/Zipf%27s_law"&gt;Zipf-distribution or Zipf's law&lt;/a&gt;. With a long tail or Zipf probability distribution, you have a small number of very common types of examples and a long set of examples that are far less common.&lt;/p&gt;
&lt;p&gt;The probability distribution of the long tail looks like this:&lt;/p&gt;
&lt;p&gt;&lt;img alt="A probability distribution chart with a y-axis of numbers (0 to 100) and a x-axis called &amp;quot;Frequency&amp;quot;. The occurrences to the far left make up the &amp;quot;head&amp;quot;, reaching very high values until it steeply drops off into the &amp;quot;tail&amp;quot; with very low frequencies. You can see that the tail has many more examples than the head." src="./images/2024/long_tail_distribution.png"&gt;&lt;/p&gt;
&lt;p&gt;The common examples in the "head" occur at a much greater frequency than the less common "tail" examples. In addition, the tail composes a significant part of the entire distribution, so if you want to learn how to differentiate the examples in the tail (as with machine learning), this presents a difficult problem. How do you know what parts of the tail you need to learn and what might be not worth learning? Should you learn all of it, even examples which are singletons (i.e. only one example) or which may be outliers or errors?&lt;/p&gt;
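&lt;p&gt;You can see this distribution in almost any text you have lying around. A minimal sketch that counts word frequencies -- the filename is a placeholder:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Count word frequencies in a text file and compare the head with the tail.
from collections import Counter

with open("some_text.txt", encoding="utf-8") as f:
    words = f.read().lower().split()

counts = Counter(words)

print(counts.most_common(5))         # the "head": a few words appear extremely often
singletons = [w for w, c in counts.items() if c == 1]
print(len(singletons), len(counts))  # the "tail": a large share of words appear exactly once
&lt;/code&gt;&lt;/pre&gt;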
&lt;p&gt;Because of the difficulty, there is significant research dedicated to studying the Zipf or so-called "long tail" distribution. &lt;a href="https://arxiv.org/abs/2110.04596"&gt;A survey of deep learning with the long tail&lt;/a&gt; explored a variety of approaches to address the long tail problem, including oversampling&lt;sup id="fnref:1"&gt;&lt;a class="footnote-ref" href="#fn:1"&gt;1&lt;/a&gt;&lt;/sup&gt; of the less common occurrences to ensure the model appropriately learns these classes and examples.&lt;/p&gt;
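&lt;p&gt;As a sketch of the basic oversampling idea (not the survey's specific methods), you can repeat tail examples until the classes are balanced, for example with scikit-learn's resample helper:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Oversampling a rare class: repeat tail examples so the model sees them more often.
# Assumption: scikit-learn is installed; the filenames below are made up for illustration.
from sklearn.utils import resample

common = [("bus", f"bus_{i}.jpg") for i in range(1000)]    # head class
rare = [("ziggurat", f"zig_{i}.jpg") for i in range(10)]   # tail class

rare_oversampled = resample(rare, replace=True, n_samples=1000, random_state=0)
balanced = common + rare_oversampled

print(len(balanced))  # 2000 examples, half of them (repeated) ziggurat images
&lt;/code&gt;&lt;/pre&gt;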
&lt;p&gt;&lt;a href="https://ieeexplore.ieee.org/document/6909517"&gt;Another piece of research&lt;/a&gt; found two long tail problems in computer vision datasets. The first long tail happens when looking at the distribution of data across the classes&lt;sup id="fnref:2"&gt;&lt;a class="footnote-ref" href="#fn:2"&gt;2&lt;/a&gt;&lt;/sup&gt; or labels, like "person" or "bus" or "Ziggurat". In this distribution images of common objects and classes compose the head and much less common objects are in the tail.&lt;/p&gt;
&lt;p&gt;Within a class there is also a long tail based on other attributes. As shown in the following graphic, the positioning of the object in the photo has its own long tail distribution within a particular data class. The visual aspects of the objects within a class--for example, all photos labeled as buses--have typical representations for buses where the bus is in the center of the photo without any visual impediments or any other vehicles in the photo. There are also atypical views, where just the top part of the bus is visible over other cars and vehicles.&lt;/p&gt;
&lt;p&gt;These less common images are part of the tail of the class "bus", which is already in the tail of the overall categories of images collected. This makes a complex problem even more complex!&lt;/p&gt;
&lt;p&gt;&lt;img alt="There are three charts in the image. The top chart shows a long tail distribution as a histogram with a high peak at the left and a drastically declining set of classes. The typical classes to the left are classes like window and person, in the middle are classes like rope, spoon and locker and to the right where much less common classes are located are classes like coffin and Ziggurat. The two lower graphs look at distributions within a particular class category: investigating person and bus. A common visibility pattern for a person is a person standing looking directly at the camera with no obstructions. An uncommon view of a person is a person riding a horse where the photo shows them from the side. A common view of a bus is an unobstructed bus shown clear in the center of the photo with no other cars. An uncommon view of a bus is just the top of the bus visible over a series of other cars." src="./images/2024/long_tail_classes_and_visibility.png"&gt;
&lt;strong&gt;Long tail of all classes and long tail within a class&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;When these datasets are scraped at scale, the head represents larger classes, like photos labeled as persons or prominent buildings or text related to a common topic in a common language, like US-centric English-language news. Within those overrepresented classes, there are also less common examples of that population or class, like local news events which don't make US national news or a photo of a building that doesn't exist anymore.&lt;/p&gt;
&lt;p&gt;What about the entirety of the tail, though? In natural language processing, this means ALL of the world's other languages end up with a much smaller representation when compared with available text in English, due to the dominance of English on the internet.&lt;sup id="fnref:3"&gt;&lt;a class="footnote-ref" href="#fn:3"&gt;3&lt;/a&gt;&lt;/sup&gt; Within computer vision, generated "typical" wedding images look like US weddings because the much higher occurrence of those photos online greatly outweighs other representations of weddings around the world.&lt;/p&gt;
&lt;p&gt;This point will be quite important for understanding how memorization happens. To give you a preview, I want you to try to draw a "generic" person or make a list of what you might use to decide if a photo has a person in it. Would your list work for all photos you can take of a person or of people? What list would you need in order to make sure that all photos of people are classified correctly?&lt;/p&gt;
&lt;p&gt;Online spaces create a fair amount of content duplication, especially since the advent of the search engine and search engine optimization. Photos of people in front of the Eiffel Tower are much more common than animals in their natural habitat or life in places with fewer digital cameras and devices. For text, there exists massive duplication of boilerplate letter text, common licenses (like the Apache license), common marketing content, content with large distribution channels (like AP news blurbs) and even famous quotes. These texts are usually in English and represent US spelling and grammar rules.&lt;/p&gt;
&lt;p&gt;If you try to learn the entirety of the world, which might be the case if your goal is to create a "general AI" system, the duplicates make learning more difficult, because they are overrepresented. Finding duplicates to remove is easy if the duplicates are an exact match. But usually you have to solve the problem of near-duplication, where data is very close but not actually exactly duplicated. This is still a hard problem to automate.&lt;/p&gt;
&lt;p&gt;Humans are good at noticing things like if a photo is from the same moment or photoshoot but from a different angle. Humans are also good at noticing things like plagiarism or when an idea, quote, or section of content is mimicking another piece of content. Computers are still not very good at this. Therefore, it's unlikely you can truly remove all duplicates that a human might mark as duplicate using computer-assisted methods. Large deep learning models often memorize repeated data, which you'll explore later in this article series.&lt;/p&gt;
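&lt;p&gt;Exact duplicates are easy to catch by hashing; near-duplicates are where it gets hard. A minimal sketch contrasting the two with Python's standard library:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Exact duplicates are trivial to find with a hash; near-duplicates need fuzzier comparison.
import hashlib
from difflib import SequenceMatcher

a = "Licensed under the Apache License, Version 2.0 (the 'License');"
b = 'Licensed under the Apache License, Version 2.0 (the "License") ;'

# Exact-match check: a single changed character produces a completely different hash.
print(hashlib.sha256(a.encode()).hexdigest() == hashlib.sha256(b.encode()).hexdigest())  # False

# Fuzzy similarity: a ratio close to 1.0 means "probably the same content".
print(SequenceMatcher(None, a, b).ratio())  # close to 1.0 -- a near-duplicate a hash would miss
&lt;/code&gt;&lt;/pre&gt;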
&lt;p&gt;But, just how did scraped internet data come to represent "the world"? If the data was more diverse, would the resulting models be more representative? Would the models learn fewer biases? Let's explore by examining the history of machine learning data collection.&lt;/p&gt;
&lt;h3 id="data-collection-for-machine-learning-a-history"&gt;Data Collection for Machine Learning: A History&lt;/h3&gt;
&lt;p&gt;In the early days of deep learning, many datasets were collected by researchers or university research groups to provide data for deep learning research.&lt;/p&gt;
&lt;p&gt;One famous dataset is the &lt;a href="https://en.wikipedia.org/wiki/MNIST_database"&gt;MNIST dataset&lt;/a&gt;, first introduced in 1998 by Yann LeCun et al. The dataset is a canonical example for any computer vision student or machine learning hobbyist.&lt;/p&gt;
&lt;p&gt;The original dataset was collected from US National Institute of Standards and Technology (NIST) employees, who were asked to fill out a form that collected their writing. Then the dataset was expanded because it was too small and not as diverse as real handwriting, so the researchers asked several US high schools to participate. The students filled out forms that looked like this:&lt;/p&gt;
&lt;p&gt;&lt;img alt="A form with numerous fields filled out, including location, date, and several writing exercises, such as writing the numbers and letters of the English alphabet in different order and with varying cases. At the end, there is a field to write the beginning of the US Declaration of Independence." src="./images/2024/NIST_handwriting.jpg"&gt;
&lt;strong&gt;NIST Handwriting Form&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The details on how this assignment was given and whether consent was collected are fuzzy, but as a child of the 80s from the US, this form looks to me like a classroom assignment, not an activity kids would fill out for fun in their own time. There is no information on the form about what the data will be used for, which makes it hard to understand if, when and for how long the information from the form will be saved. Of course, these letters and digits have now been duplicated across the world many times for every entry-level computer vision class. If a student wanted to revoke consent for machine learning, this would now be impossible.&lt;/p&gt;
&lt;p&gt;This initial start turned into a longer trend, best described as "collect data as cheaply and as quickly as possible". This trend became a widespread fundamental practice within the field of machine learning. For those that did it well, it was also immensely profitable.&lt;/p&gt;
&lt;p&gt;Indexing the internet and all of its content fueled the growth of today's large-scale technology companies like Google, which advanced its search engine capabilities via massive data collection. These datasets were collected without special attention to copyright, privacy or consent, other than avoiding websites that specifically blocked crawlers via the &lt;a href="https://en.wikipedia.org/wiki/Robots.txt"&gt;'robots.txt'&lt;/a&gt; file.&lt;/p&gt;
&lt;p&gt;The data collection was described as "indexing", where keywords were matched to content URLs. But to produce these matches, the entire website content was scraped and saved first. The scraped data--usually a file or set of files--could be deleted or updated by contacting Google if you were the person running the website, who may or may not be the person whose content was posted.&lt;/p&gt;
&lt;p&gt;Additional datasets like &lt;a href="https://aclanthology.org/Q14-1006.pdf"&gt;Flickr30K&lt;/a&gt; and &lt;a href="http://vis-www.cs.umass.edu/lfw/"&gt;Labeled Faces in the Wild&lt;/a&gt; show a similar approach to data collection within the computer vision domain -- grab whatever you can and ask questions (or for permission) later.&lt;/p&gt;
&lt;p&gt;Unfortunately, this isn't exactly how many of us thought about using the internet, at least not until recently. &lt;a href="https://en.wikipedia.org/wiki/Helen_Nissenbaum"&gt;Helen Nissenbaum&lt;/a&gt; speaks about the context in which you write, post photos and connect with others online, and how this context often doesn't match the mental model you have when operating in the real, non-digital world. It's difficult for humans to understand exactly how, where and with whom they are sharing information via a digital interface, because the context and related transparency on how the data is used, stored and managed aren't entirely clear.&lt;/p&gt;
&lt;p&gt;When you are writing to a close friend by commenting on their post, you probably don't immediately assume a complete stranger will read it or scrape it and use it for machine learning. When you posted something 10 years ago on a personal blog, you probably didn't assume it would be stored somewhere a decade later and used in the latest GPT model. When you shared your photos on Flickr in 2010, you didn't foresee that it could end up in a Generative AI portrait. And yet, those things are indeed possible due to the lack of contextual integrity provided in many online spaces and platforms.&lt;/p&gt;
&lt;p&gt;Training with online data has other pitfalls and challenges, many to do with the skewed culture of the internet itself and the resulting biases in these scraped datasets.&lt;/p&gt;
&lt;h3 id="internet-culture-and-biases"&gt;Internet Culture and Biases&lt;/h3&gt;
&lt;p&gt;The internet was initially used by &lt;a href="https://en.wikipedia.org/wiki/History_of_the_World_Wide_Web"&gt;a small group of people, available only in a small number of places&lt;/a&gt;. It still carries the biases of those groups--being a place where it's often safer to be "Western", white and male.&lt;/p&gt;
&lt;p&gt;These online biases show up in training datasets produced by scraping the web. For example, one large NLP dataset used to train early GPT models was &lt;a href="https://en.wikipedia.org/wiki/The_Pile_(dataset)"&gt;The Pile&lt;/a&gt;. The Pile encompasses several scraped datasets, including one called &lt;a href="https://openwebtext2.readthedocs.io/en/latest/"&gt;OpenWebText2&lt;/a&gt;. This dataset contains the text of all the websites with top-rated linked Reddit posts between 2005 and 2020. Not only is this dataset a violation of those users' belief they could later delete their posts, but Reddit also &lt;a href="https://www.theatlantic.com/technology/archive/2020/06/reddit-racism-open-letter/612958/"&gt;hosts several popular communities promoting visceral and violent hatred, racism and bigotry&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The resulting datasets show massive societal biases, including, but not limited to, racism, sexism, homophobia, US-centrism and xenophobia. Work from researchers like &lt;a href="https://www.dair-institute.org/team/"&gt;Timnit Gebru and the DAIR Institute&lt;/a&gt;, &lt;a href="http://www.danah.org/papers/"&gt;danah boyd&lt;/a&gt; and &lt;a href="https://katecrawford.net/"&gt;Kate Crawford&lt;/a&gt; has highlighted these biases since 2017. Research like &lt;a href="https://arxiv.org/abs/1608.07187"&gt;Caliskan et al.'s analysis of sexism in translation&lt;/a&gt; and &lt;a href="https://arxiv.org/abs/1607.06520"&gt;Bolukbasi et al.'s work&lt;/a&gt;, which borrows from &lt;a href="https://blog.kjamistan.com/embedded-isms-in-vector-based-natural-language-processing.html"&gt;my initial research about sexism in word vectors&lt;/a&gt;, has been available since 2016. &lt;a href="http://gendershades.org/"&gt;Buolamwini and Gebru's Gender Shades&lt;/a&gt; demonstrated in 2018 that darker skinned women are at a disadvantage when it comes to accurate facial recognition. These problems have been well documented for nearly a decade, and yet the common practice in machine learning communities is still to use these problematic datasets and to produce more by scraping more data.&lt;/p&gt;
&lt;p&gt;Although text data can be used directly as it is collected, image data must be appropriately labeled to perform adequate computer vision or text-to-image generative tasks. For text-to-image or image-to-text models, the labels involve either describing the entire image or scene, or creating bounding boxes, where parts of the image are highlighted and describing a smaller subsection of the image. This might involve labeling all objects in the image separately along with their bounding boxes.&lt;/p&gt;
&lt;p&gt;For early computer vision datasets, appropriately learning labels (or categories of things, like a "cat") meant data collection attempted to find images with only one thing in them. To learn more quickly and to have more data with the same labels, this might mean that a photo of a person is just labeled "person", and that image may be of the person in the center, at the side, or in some other part of the image. As you can imagine, these labels vary significantly in quality and accuracy, depending on how they are collected. Many of today's labels are semi-automatically scraped from the web and use &lt;a href="https://en.wikipedia.org/wiki/Alt_attribute"&gt;image ALT text&lt;/a&gt; as the description.&lt;sup id="fnref:5"&gt;&lt;a class="footnote-ref" href="#fn:5"&gt;5&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;For higher-quality datasets, humans work as labelers of scraped or user-generated text, images and videos. These data workers are frequently subjected to &lt;a href="https://data-workers.org/"&gt;poor working conditions and lack of psychological support&lt;/a&gt; when facing traumatic content for systems like  content moderation. For other machine learning tasks, the instructions are often &lt;a href="https://peertube.dair-institute.org/w/rgT6Bq7VhLUR4VFdaZLnAr"&gt;meager and very little context is given to the data workers&lt;/a&gt;, resulting in datasets with lower quality assurance than one would want if people were properly informed about the task.&lt;sup id="fnref:6"&gt;&lt;a class="footnote-ref" href="#fn:6"&gt;6&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;Crawford and Paglen highlight additional issues with crowd-sourced labels in &lt;a href="https://excavating.ai/"&gt;Excavating AI&lt;/a&gt;. Their research and the resulting art piece and essay investigated the ImageNet dataset--highlighting labels like "alcoholic", "ballbuster" and "pervert". There is extensive academic research on the topic, like how online images amplify gender biases and other systems of oppression explored in &lt;a href="https://arxiv.org/abs/1605.06083v1"&gt;Flickr 30K dataset research&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Even without human labelers, collected data from the internet reinforces biases and stereotypes from search algorithms and content providers. Safiya Noble documented biases in search engines and their results in &lt;a href="https://nyupress.org/9781479837243/"&gt;Algorithms of Oppression&lt;/a&gt;. Her book inspired me to look at what surfaces in popular web crawl datasets, uncovering what it's like, for example, to search for &lt;a href="https://c4-search.apps.allenai.org/?q=brazilian+girl"&gt;"Brazilian Girl" in the C4 dataset (a part of the common crawl)&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A zoomed-in view of a search results page, where the top result is about how to date a Brazilian girl and the next link down is clearly an advertisement to engage in sex. Citing: Hot Brazilian Girl! She does anything!!" src="./images/2024/c4_brazilian_girl.png"&gt;&lt;/p&gt;
&lt;p&gt;The initial examples from the first two pages of results talk about how to ask a Brazilian girl out, how to date a Brazilian girl or they are fake advertisements to meet Brazilian girls. When this data is the only context a large language model or generative model receives on what "Brazilian girl" represents, then in the resulting model "dating" is closer to that idea than being a scientist, researcher, politician, athlete, philosopher, etc.&lt;/p&gt;
&lt;p&gt;These internet biases create AI systems that repeat and amplify them as their use expands, and that can influence how people are seen by others and how people see themselves. This has lasting impacts on society, and it further amplifies and entrenches harmful content.&lt;/p&gt;
&lt;p&gt;It's important to keep this in mind when learning about machine learning. What data are you trying to learn? What data do you think is "high quality" and why? What is the machine learning community doing by attempting to build expansive, cheap datasets?&lt;/p&gt;
&lt;p&gt;And when memorization happens on these data, what is memorized and reproduced from this internet culture?&lt;/p&gt;
&lt;h3 id="bigger-question-what-even-is-data"&gt;Bigger question: what even is data?&lt;/h3&gt;
&lt;p&gt;Often these datasets represent only a subset of the world, as you've learned thus far. Why are they used as universal truths if they can only represent a small sample of reality?&lt;/p&gt;
&lt;p&gt;danah boyd wrote about &lt;a href="https://content.iospress.com/download/information-services-and-use/isu200098?id=information-services-and-use%2Fisu200098"&gt;how measurement and scientific inquiry come from murky histories of cultural dominance, colonization and oppression&lt;/a&gt;. By assuming that there is a standard system of measurement for everything, the assumptions and biases built into what is "normal" and who determines these standards leave some examples marked as "normal" and others marked as outside the norm. Since machine learning then uses these standards and measurements to automatically learn to discriminate one group, idea or concept from another, these biases are highlighted, reproduced and become entrenched in the concept of data.&lt;/p&gt;
&lt;p&gt;Understanding memorization begins with understanding how data is collected and used, and what properties that data has. Massive duplication and the long-tail have a deep impact on how machine learning models--particularly deep learning models--learn, generalize and memorize.&lt;/p&gt;
&lt;p&gt;The ethical, social and philosophical problems of how data is collected and labeled are also important to study alongside memorization because it is very difficult to unlearn these concepts. When memorization of these biases happens, it becomes even more difficult to stop a model from reproducing those examples, often requiring serious intervention or a complete redesign and retraining.&lt;/p&gt;
&lt;p&gt;In the next article, you'll investigate how machine learning systems take these datasets as input and process them for machine learning. Specifically, you'll investigate how encoding and embeddings work to take complex input and make it easy to "learn".&lt;/p&gt;
&lt;p&gt;I'm very open to feedback (positive, neutral or critical), questions (there are no stupid questions!) and creative re-use of this content. If you have any of those, please share it with me! This helps keep me inspired and writing. :)&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Oversampling_and_undersampling_in_data_analysis"&gt;Oversampling&lt;/a&gt; involves pulling from a certain subset of examples more frequently when performing a machine learning or analysis task in order to ensure they are better represented within the overall dataset or population. This is a basic statistics strategy which help when using nonrepresentative data or when minority subpopulations need to be adequately evaluated (i.e. when one or more groups are heavily overrepresented compared to other groups). If performed, the resulting data analysis or machine learning is much more likely to process duplicates from the oversampled population. This becomes an important factor for memorization, which you will explore in a later article.&amp;#160;&lt;a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;Categories like "bus" or "person" are referred to as classes. Photos are labeled in order to be able to see an image later and create a prediction on what is in the image (like a bus). A dataset that is to be used for a classification problem might refer to classes as labels (or vice versa), because the data is tagged or labeled with the class name or encoding (sometimes a number that maps to a human-readable name). Technically, the classes refer to the categories. When the dataset is collected and the examples are tagged with the appropriate matching class, that process produces a label.&amp;#160;&lt;a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:3"&gt;
&lt;p&gt;This has prompted significant research and deployment strategies to offer better multi-lingual AI products, including &lt;a href="https://arxiv.org/abs/2305.07004"&gt;joint research from Microsoft China and university researchers&lt;/a&gt; which first translates incoming text and prompts into English before passing them to a production LLM system, which performs much better on English text than on Chinese.&amp;#160;&lt;a class="footnote-backref" href="#fnref:3" title="Jump back to footnote 3 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:4"&gt;
&lt;p&gt;The early internet was available to researchers and government and military personnel. As the World Wide Web grew, it was most accessible in the US and parts of Europe, where internet access for non-academic and non-military persons was subsidized and supported by local authorities. This created an overrepresentation on the web of these world views and lifestyles. The &lt;a href="https://en.wikipedia.org/wiki/Dot-com_bubble"&gt;early internet boom&lt;/a&gt; and resulting web infrastructure was primarily located in Silicon Valley, California, which brought the area and users' own political, economic and social views to the newsletters, websites, browsers and search engines of the time. These marks are still recognizable in the way many people search, browse and experience the internet today.&amp;#160;&lt;a class="footnote-backref" href="#fnref:4" title="Jump back to footnote 4 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:5"&gt;
&lt;p&gt;Due to the internet's propensity to skew representation, this practice results in a Western, white supremacist and patriarchal view of personhood. &lt;a href="https://arxiv.org/abs/2310.19981"&gt;Ghosh et al&lt;/a&gt; demonstrated that Stable Diffusion models prompted with "person" overwhelmingly produce a white male. Their research also uncovered erasure of indigenous identities and hypersexualization of women from particular areas of the world.&amp;#160;&lt;a class="footnote-backref" href="#fnref:5" title="Jump back to footnote 5 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:6"&gt;
&lt;p&gt;The lack of instructions and connection to the larger team and task could be a result of trying to keep the data workers (often subcontractors working for another company) further removed from their peers working on other parts of machine learning, like the high-paid data workers who train the models or architect the resulting systems.&amp;#160;&lt;a class="footnote-backref" href="#fnref:6" title="Jump back to footnote 6 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="ml-memorization"></category></entry><entry><title>Deep learning memorization, and why you should care</title><link href="https://blog.kjamistan.com/deep-learning-memorization-and-why-you-should-care.html" rel="alternate"></link><published>2024-11-04T00:00:00+01:00</published><updated>2024-11-04T00:00:00+01:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2024-11-04:/deep-learning-memorization-and-why-you-should-care.html</id><summary type="html">&lt;p&gt;When's the last time that ChatGPT parroted someone else's words to you? Or the last time a diffusion model you used recreated someone's art, someone's photo, someone's face? Has Copilot &lt;a href="https://x.com/docsparse/status/1581461734665367554"&gt;given you someone else's code without permission or attribution&lt;/a&gt;? If this happened, how would you know for sure?&lt;/p&gt;
&lt;p&gt;In this …&lt;/p&gt;</summary><content type="html">&lt;p&gt;When's the last time that ChatGPT parroted someone else's words to you? Or the last time a diffusion model you used recreated someone's art, someone's photo, someone's face? Has Copilot &lt;a href="https://x.com/docsparse/status/1581461734665367554"&gt;given you someone else's code without permission or attribution&lt;/a&gt;? If this happened, how would you know for sure?&lt;/p&gt;
&lt;p&gt;In this article series, you'll explore how and why memorization happens in deep learning&lt;sup id="fnref:1"&gt;&lt;a class="footnote-ref" href="#fn:1"&gt;1&lt;/a&gt;&lt;/sup&gt;, as well as what can be done to address the issues it raises.&lt;/p&gt;
&lt;p&gt;However, to ensure it's worth studying, let's first investigate whether this phenomenon really occurs.&lt;/p&gt;
&lt;h3 id="memorization-in-the-wild"&gt;Memorization in the wild&lt;/h3&gt;
&lt;p&gt;&lt;img alt="An image of two columns of text next to each other. On one side it shows a GPT-4 response. On the other side it shows text from a New York Times article. The text is all in red as the text is exactly the same." src="./images/2024/nyt_lawsuit.png"&gt;
&lt;em&gt;NYT vs. OpenAI&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Here is a screenshot of an excerpt from &lt;a href="https://www.courthousenews.com/wp-content/uploads/2023/12/new-york-times-microsoft-open-ai-complaint.pdf"&gt;the New York Times lawsuit against Microsoft and OpenAI&lt;/a&gt;. On the right is the original text of the New York Times article. On the left you can see the extracted text from GPT-4. If a word is red, it means it was directly repeated and therefore memorized by the deep learning model. Is this a violation of copyright law?&lt;/p&gt;
&lt;p&gt;&lt;img alt="An image of a photo of a woman on the left side that looks like a book promotional photo. It has her name and her book title underneath as the caption. On the right side you can see an almost exact copy of the image with some artifacts common to Generative AI. That is the extraction photo." src="./images/2024/stable_diffusion_extraction.png"&gt;
&lt;em&gt;Stable Diffusion Face Extraction&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;And here is an example from &lt;a href="https://arxiv.org/abs/2301.13188"&gt;a stable diffusion model trained by Carlini et al.&lt;/a&gt; on Stable Diffusion's training dataset. This person's face is repeated less than three times in the training data. When prompted with the person's name, you can reproduce their face, or more specifically, the photo from the training dataset. Is this a violation of privacy?&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of the linked video from OpenAI's demo of the Sky voice" src="./images/2024/openai_sky.png"&gt;
&lt;em&gt;Is OpenAI's Sky imitating ScarJo?&lt;/em&gt;
&lt;a href="https://www.youtube.com/watch?v=D9byh4MAsUQ"&gt;Link to watch above video&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Finally, here is an example of the new release of OpenAI's GPT-4o features, which originally started with a voice that sounded eerily like Scarlett Johansson's. Johansson had been approached several times by Sam Altman to be the voice of the new system but declined. Instead, it appears &lt;a href="https://arstechnica.com/tech-policy/2024/05/openai-pauses-chatgpt-4o-voice-that-fans-said-ripped-off-scarlett-johansson/"&gt;OpenAI found a voice actor to mimic her voice&lt;/a&gt; in order to give a cultural hat tip to her role in the movie &lt;em&gt;Her&lt;/em&gt;, where she voiced the AI character.&lt;/p&gt;
&lt;h3 id="what-is-happening-here"&gt;What is happening here?&lt;/h3&gt;
&lt;p&gt;Understanding how deep learning systems work and succeed at tasks has been an active area of research for more than a decade. In this series, you'll explore both the technical aspects of how these models memorize and the machine learning community culture that allows this to take place. You'll review the seminal research around privacy, security and memorization in deep learning, and better understand deep learning because of it. This knowledge will also help you better understand how to approach and use models and AI systems.&lt;/p&gt;
&lt;p&gt;You'll start by looking at how datasets are collected and what their properties are, then explore machine learning training and evaluation and the impact of those choices. You'll investigate what data repetition and novelty have to do with memorization, and how that can be mathematically modeled and proven. You'll learn the relationships between overparameterization, model size and memorization and see some examples of how this phenomenon was discovered long before GPT models were released.&lt;/p&gt;
&lt;p&gt;You'll also explore several ideas for how memorization can and should impact the way machine learning engineers manage data, the way models are trained, the way we talk about "intelligent systems" and how to reason about when to use deep learning.&lt;/p&gt;
&lt;h3 id="but-why-should-i-care-about-memorization"&gt;But, why should I care about memorization?&lt;/h3&gt;
&lt;p&gt;As one person who I spoke with put it, "it doesn't really matter if a model memorizes, as long as it brings us closer to human-level intelligence". But is that true?&lt;/p&gt;
&lt;p&gt;There is very little intelligence in merely saving a string of tokens or pixels and being able to repeat them when prompted. It is something that we humans are not so great at, but that is due to our intelligence, not in spite of it. Rote memorization is something computers have done for many decades and something they excel at.&lt;/p&gt;
&lt;p&gt;This critique is echoed in the &lt;a href="https://www.youtube.com/watch?v=vyqXLJsmsrk&amp;amp;ab_channel=MITDepartmentofPhysics"&gt;remarks from LeCun&lt;/a&gt; and many other deep learning researchers for several years now. The current way that practitioners train large language and computer vision systems is inherently linked to the training data and the limits within that data. These models can get quite good at mimicking the data, but it's heavily disputed whether their performance shows deep reasoning, world models or systems thinking.&lt;/p&gt;
&lt;p&gt;Memorization is not learning, even if it can mimic learning. If you want to build intelligent systems, you'll have to do much better than memorization. And you'd need to prove that deep learning models are capable of significantly more than memorization and remixing. Based on what you'll learn around evaluation datasets, you'll likely have new questions about how machine learning practitioners review what is learned, what is generalizable and how the field might actually move forward towards better generalization.&lt;/p&gt;
&lt;p&gt;There are additional reasons to care about memorization. Privacy is a fundamental human right according to the UN Human Rights Convention. The human right to self-determination about how information related to your personhood, your life and your behavior is collected, stored and used is a common understanding across many cultures, nations, lands and societies. As seen in the Second World War, how governments and technology systems collect, use, and proliferate data has a direct impact on people's lives.&lt;/p&gt;
&lt;p&gt;Privacy is closely related to trust, and how you manage your own privacy relates to who, how and what you trust. In this way, privacy mirrors social bonds that help keep society functioning, that help promote equality amongst persons and that create trust and accountability amongst ourselves and our institutions. When your trust in something is broken, you likely no longer want to share intimate details or data with such systems. And when your privacy is violated, for example, via online stalking or harassment, or even smaller examples, like a super creepy ad or a post that got shared out of context, you may feel violated. Your trust was broken.&lt;/p&gt;
&lt;p&gt;Privacy isn't equally available to everyone -- despite common beliefs to the contrary. Some of us have what I call "privacy privilege", where your face is not stored in a database used by the police or state intelligence to track your movements. Some of us might represent the best outcomes in the models, where those systems work in our favor. For example, you are granted automatic entrance in an interview process or you get pre-approved for a loan. In those cases, your trust isn't violated by the system's usage. But there are many persons who do not fall into those categories - where these systems violate their privacy, their right to self-determination, their right to protect themselves from algorithmic classifications and categorizations.&lt;/p&gt;
&lt;p&gt;Memorization in machine learning has deep implications in how to reason about choices in machine learning, and studying it can better expose phenomena like unfair outcomes, overexposed persons and how machine learning systems link to other systems of power and oppression in our world.&lt;/p&gt;
&lt;p&gt;Memorization violates consent, erodes privacy and throws what all of us are being sold under the banner of "intelligence" into question. By exposing how memorization works, you are also pushing for more realistic views of AI systems and more realistic assumptions around how they can and should be used. You are also evaluating how they shouldn't be used. By studying memorization, you counter fraudulent messages on how machine learning works, and expose much more interesting fields of study based in real science.&lt;/p&gt;
&lt;h3 id="lets-dive-in"&gt;Let's dive in!&lt;/h3&gt;
&lt;p&gt;I hope you're excited to learn more. In the coming articles, you'll explore how deep learning systems create the opportunity for memorization, along with a better understanding of how it happens.&lt;/p&gt;
&lt;p&gt;To get a head start, if you already work in machine learning, I want you to reflect on the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;How do you collect data?&lt;/li&gt;
&lt;li&gt;How do you incentivize and optimize learning?&lt;/li&gt;
&lt;li&gt;How do you architect deep learning models?&lt;/li&gt;
&lt;li&gt;How do you govern data usage in ML systems?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The next article specifically investigates data collection, looking at how long-tail datasets create uneven distributions to be learned. To stay up-to-date, you can &lt;a href="https://probablyprivate.com/"&gt;sign up for the Probably Private newsletter&lt;/a&gt; or &lt;a href="https://www.linkedin.com/in/katharinejarmul/"&gt;follow my work on LinkedIn&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I'm very open to feedback (positive, neutral or critical), questions (there are no stupid questions!) and creative re-use of this content. If you have any of those, please share it with me! This helps keep me inspired and writing. :)&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Acknowledgements&lt;/em&gt;: I would like to thank &lt;a href="https://vickiboykis.com/"&gt;Vicki Boykis&lt;/a&gt;, &lt;a href="https://desfontain.es/serious.html"&gt;Damien Desfontaines&lt;/a&gt; and &lt;a href="https://www.linkedin.com/in/yanndupis/"&gt;Yann Dupis&lt;/a&gt; for their feedback, corrections and thoughts on this series. Their input greatly contributed to improvements in my thinking and writing.&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;In this series, you'll explore deep learning as a field, which includes the use and training of neural networks to perform a task or series of tasks. A large language model (LLM) is a particular type of deep learning model which can either produce text (just like a normal language model), or answer chats with instructions or prompts, which is what ChatGPT does. You'll learn more about how these systems work from small building blocks (neurons and layers) to the entire model by studying how they are built, trained, evaluated and used.&amp;#160;&lt;a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="ml-memorization"></category></entry><entry><title>A Deep Dive into Memorization in Deep Learning</title><link href="https://blog.kjamistan.com/a-deep-dive-into-memorization-in-deep-learning.html" rel="alternate"></link><published>2024-11-03T00:00:00+01:00</published><updated>2024-11-03T00:00:00+01:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2024-11-03:/a-deep-dive-into-memorization-in-deep-learning.html</id><summary type="html">&lt;p&gt;Want to learn more about how, when and why machine learning, particularly deep learning systems memorize data? By studying memorization, you'll learn more about how machine learning systems really function, along with how privacy works from a technical point-of-view. You'll also be better able to decide how, when and where …&lt;/p&gt;</summary><content type="html">&lt;p&gt;Want to learn more about how, when and why machine learning, particularly deep learning systems memorize data? By studying memorization, you'll learn more about how machine learning systems really function, along with how privacy works from a technical point-of-view. You'll also be better able to decide how, when and where to use AI systems based on your new learnings.&lt;/p&gt;
&lt;p&gt;This series aims to introduce the topics to a general audience, but there are plenty of links to dive deeper in each article. This page will be updated as the series is published.&lt;/p&gt;
&lt;p&gt;The recommended reading order is as follows, but feel free to hop around!&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://blog.kjamistan.com/deep-learning-memorization-and-why-you-should-care.html"&gt;Introduction: Why study memorization in machine learning?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.kjamistan.com/machine-learning-dataset-distributions-history-and-biases.html"&gt;Start with the Data: Machine Learning dataset distributions, history, and biases&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.kjamistan.com/encodings-and-embeddings-how-does-data-get-into-machine-learning-systems.html"&gt;Encodings and embeddings: How does data get into machine learning systems?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.kjamistan.com/gaming-evaluation-the-evolution-of-deep-learning-training-and-evaluation.html"&gt;Gaming Evaluation: The evolution of deep learning training and evaluation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.kjamistan.com/how-memorization-happens-repetition.html"&gt;How Memorization Happens: Repetition&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.kjamistan.com/how-memorization-happens-novelty.html"&gt;How Memorization Happens: Novelty&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.kjamistan.com/how-memorization-happens-overparametrized-models.html"&gt;How Memorization Happens: Overparametrized models&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.kjamistan.com/differential-privacy-as-a-counterexample-to-aiml-memorization.html"&gt;Differential Privacy as a Counterexample to AI/ML Memorization&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.kjamistan.com/adversarial-examples-demonstrate-memorization-properties.html"&gt;Adversarial Examples Demonstrate Memorization Properties&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.kjamistan.com/defining-privacy-attacks-in-ai-and-ml.html"&gt;Privacy Attacks on AI/ML Models&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.kjamistan.com/blocking-aiml-memorization-with-software-guardrails.html"&gt;Blocking AI memorization with Software Guardrails&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.kjamistan.com/algorithmic-based-guardrails-external-guardrail-models-and-alignment-methods.html"&gt;Can we use Algorithmic Guardrails to block memorization?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.kjamistan.com/machine-unlearning-what-is-it.html"&gt;Machine Unlearning: What is it?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.kjamistan.com/machine-unlearning-how-todays-unlearning-is-done.html"&gt;Machine Unlearning: How it's done&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.kjamistan.com/attacks-on-machine-unlearning-how-unlearned-models-leak-information.html"&gt;Attacks on Machine Unlearning: How Unlearned Models Leak Information&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.kjamistan.com/differential-privacy-in-deep-learning.html"&gt;Differential Privacy in Deep Learning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.kjamistan.com/differential-privacy-in-todays-ai-whats-so-hard.html"&gt;Differential Privacy in Today's AI: What's so hard?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.kjamistan.com/differential-privacy-parameters-accounting-and-auditing-in-deep-learning-and-ai.html"&gt;Differential Privacy Parameters, Accouting, Auditing and Testing&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you are a more visual or video learner, I've made a &lt;a href="https://www.youtube.com/watch?v=JDAPDpbXRXw&amp;amp;list=PLJkNSeYcYBlCaamscxip0l2LGYCZ2TIom&amp;amp;ab_channel=ProbablyPrivate"&gt;YouTube playlist to accompany the series&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I'm very open to feedback (positive, neutral or critical), questions (there are no stupid questions!) and creative re-use of this content. If you have any of those, please share it with me! This helps keep me inspired and writing. :)&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Acknowledgements&lt;/em&gt;: I would like to thank &lt;a href="https://vickiboykis.com/"&gt;Vicki Boykis&lt;/a&gt;, &lt;a href="https://desfontain.es/serious.html"&gt;Damien Desfontaines&lt;/a&gt; and &lt;a href="https://www.linkedin.com/in/yanndupis/"&gt;Yann Dupis&lt;/a&gt; for their feedback, corrections and thoughts on this series. Their input greatly contributed to improvements in my thinking and writing. Any mistakes, typos, inaccuracies or controversial opinions are my own.&lt;/p&gt;</content><category term="ml-memorization"></category></entry><entry><title>Building a Privacy-First Newsletter</title><link href="https://blog.kjamistan.com/building-a-privacy-first-newsletter.html" rel="alternate"></link><published>2023-03-12T09:00:00+01:00</published><updated>2023-03-12T09:00:00+01:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2023-03-12:/building-a-privacy-first-newsletter.html</id><summary type="html">&lt;p&gt;Building a newsletter is a fairly common activity these days, with many creators, writers and thinkers making part of their living via subscribers willing to give small amounts of money out per year or month to get exclusive access. Beyond the paid subscriptions, there's an increasing demand for free, or …&lt;/p&gt;</summary><content type="html">&lt;p&gt;Building a newsletter is a fairly common activity these days, with many creators, writers and thinkers making part of their living via subscribers willing to give small amounts of money out per year or month to get exclusive access. Beyond the paid subscriptions, there's an increasing demand for free, or for fun, newsletters to cut through algorithmic noise. People enjoy hearing directly from other people they trust or enjoy, seeking advice, insight, humor and information, which is why the interest in newsletters and podcasts has grown.&lt;/p&gt;
&lt;p&gt;As there is a growing audience for these formats, you would think there would also be a wide array of newsletter platforms with different offerings. In Fall 2020 I started &lt;a href="https://probablyprivate.com"&gt;my newsletter &lt;em&gt;Probably Private&lt;/em&gt;&lt;/a&gt;, on the intersection of privacy and data science and went on a quest that took until Spring 2023 -- to create a privacy-first newsletter.&lt;/p&gt;
&lt;h2 id="why"&gt;Why?&lt;/h2&gt;
&lt;p&gt;A newsletter about privacy just seems like it should have privacy built in. For years now, I've been finding ways to manage my own online data, backups and even how I interact with social media -- finding a balance that fits my own political, cultural, social and individual idea of privacy. I think every human should have the ability to do this, and it should be fundamentally built into services that are offered, so that choice and consent are transparent and easily implemented in software, data and computing architectures.&lt;/p&gt;
&lt;p&gt;It also made sense to offer readers of my newsletter the privacy they deserve. I didn't want them to be automatically tracked, in any way. I thought they should be able to open a newsletter, read to their heart's delight, click on links, save things for later, or immediately send it to Spam should they see fit -- all without anyone knowing about it. Little did I know, this would turn out to be much more difficult than I originally thought.&lt;/p&gt;
&lt;h2 id="my-journey-begins"&gt;My Journey Begins...&lt;/h2&gt;
&lt;p&gt;Earlier kjam: Let's figure out what service to use by looking at what's popular and has some privacy policies I can read and ways to toggle what data is tracked! Off we go...&lt;/p&gt;
&lt;h3 id="revue"&gt;Revue&lt;/h3&gt;
&lt;p&gt;I first started out on &lt;a href="https://www.getrevue.co/"&gt;Revue&lt;/a&gt; in Fall 2020, as several folks recommended it to me and it was then a leader in newsletters, particularly those with supplemental paid options. It wasn't my intention to create a paid newsletter, but I thought if I ever did more newsletters, maybe one day there would be a paid one.&lt;/p&gt;
&lt;p&gt;I signed up, wrote the first installment, toggled off all possible tracking settings I could find and sent it out to my, then, about 50 subscribers. Later that day, I got an email from a reader mentioning that the links they received were tracked (!). I took a look at the fine-grained settings and found that there was literally no way for me to turn off click tracking on links. After some back and forth conversations on social media and via email with other privacy folks, it was recommended that I migrate to &lt;a href="https://buttondown.email/"&gt;Buttondown&lt;/a&gt;, a friendly and privacy-aware alternative. I picked up my content and migrated over...&lt;/p&gt;
&lt;h3 id="buttondown"&gt;Buttondown&lt;/h3&gt;
&lt;p&gt;I happily logged into Buttondown to see that I could turn off all tracking. I tested that no links were tracked. I tested that I couldn't see the views or opens, and I turned off emails to alert me of who was signing up or unsubscribing. Seemed that I was set!&lt;/p&gt;
&lt;p&gt;I wrote several newsletters and received no more privacy feedback, just content feedback. Finally, I thought, it's solved!&lt;/p&gt;
&lt;p&gt;But then I wanted to update and change the newsletter by setting up my own DNS and integrating Buttondown into my own website. First, I would need to start paying for Buttondown. This was to help cover the costs of the mail service provider and hosting. Sounded very reasonable, but I wanted to look further into these services, just to confirm they were also privacy-respecting, considering I'd now be helping pay for them.&lt;/p&gt;
&lt;p&gt;I first emailed the friendly Buttondown admin to confirm the services used. Then, I dug into the fine print from those services to figure out if tracking was somehow built in and what the options were for turning it off.&lt;/p&gt;
&lt;p&gt;This sent me down a new rabbit hole: namely, the sad state of privacy in email.&lt;/p&gt;
&lt;h2 id="the-plot-thickens-how-does-email-work"&gt;The Plot Thickens: How does Email work?&lt;/h2&gt;
&lt;p&gt;Many newsletter providers use a third-party mail service provider. This is the service that actually takes the email template, turns it into an email-friendly format and mails it out to your subscribers. Sometimes you are using a service that does both, but many times, particularly for newsletter "front-ends" or management services, the actual sending will be outsourced to a service that your newsletter provider uses to send bulk email.&lt;/p&gt;
&lt;p&gt;Let's walk through what normally then happens when this occurs.&lt;/p&gt;
&lt;p&gt;With normal SMTP, like when sending an email from your Gmail account, you are usually sending email from one large email provider to another, or within the same organization. Therefore, the SMTP services that need to send several messages back and forth to confirm sender, recipient and message text will either all occur within an internal service (like Google Mail or Outlook) or will happen between those services. This usually means your mail lands in the other person's Inbox and not in the Spam folder. For a deeper dive into how SMTP works, &lt;a href="https://en.wikipedia.org/wiki/Simple_Mail_Transfer_Protocol"&gt;check out Wikipedia&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;However, when you are sending bulk email, like with a newsletter, you need to send many emails at once. This is usually not allowed by the large email providers unless you are emailing a large internal group (i.e. a large work list). These providers turned off bulk sending long ago to fight spammers, and that created the surge of bulk email providers you can see today. These providers help send bulk mail for newsletters, brands for direct marketing and advertisers, and they can range from easy-to-use setups where you edit the email directly in the browser and hit send, to more complex ones, like using your cloud provider as a bulk email service, often requiring programmatic access.&lt;/p&gt;
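&lt;p&gt;To make "programmatic access" a bit more concrete, here is a rough sketch of what sending a newsletter issue over plain SMTP looks like from Python. The host, port and credentials are placeholders for illustration, not any real provider's settings -- every provider documents its own values.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import smtplib
from email.message import EmailMessage

# Placeholder SMTP settings -- substitute your provider's documented values.
SMTP_HOST = "smtp.example-provider.com"
SMTP_PORT = 587
SMTP_USER = "newsletter@example.com"
SMTP_PASSWORD = "use-an-app-password-here"

def send_issue(subscribers, subject, body_text):
    """Send one plain-text newsletter issue, one message per subscriber."""
    with smtplib.SMTP(SMTP_HOST, SMTP_PORT) as server:
        server.starttls()  # encrypt the connection to the mail server
        server.login(SMTP_USER, SMTP_PASSWORD)
        for address in subscribers:
            msg = EmailMessage()
            msg["From"] = SMTP_USER
            msg["To"] = address
            msg["Subject"] = subject
            msg.set_content(body_text)
            server.send_message(msg)

send_issue(["reader@example.org"], "Probably Private #1", "Hello, reader!")
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Bulk email providers wrap roughly this flow behind an API and a dashboard -- along with, as I was about to find out, a lot of tracking.&lt;/p&gt;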
&lt;h3 id="mail-service-providers-privacy"&gt;Mail Service Providers &amp;amp; Privacy&lt;/h3&gt;
&lt;p&gt;You can see how these services are ranked and compared &lt;a href="https://emailanalytics.com/bulk-email-services/"&gt;on sites like "Email Analytics"&lt;/a&gt; along with the delightful other articles that these types of sites feature, like how to track your employees and customers via email and other surveillance software.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A list of related articles showing topics like how to track your employees, how to install surveillance software for your workers and how to optimize email metrics" src="/images/2023/03/email_analytics_articles.png" title="Email Analytics Related Articles..."&gt;&lt;/p&gt;
&lt;p&gt;In fact, the deeper I dove into trying to find a privacy-first bulk email service, with some help from networking friends, the more I realized there weren't going to be many without tracking. Investigating &lt;a href="https://www.mailgun.com/"&gt;mailgun&lt;/a&gt;, &lt;a href="https://docs.buttondown.email/behind-the-scenes/running-costs"&gt;which Buttondown uses&lt;/a&gt;, dropped me into their Privacy Policies and Terms of Service, which uncovered data storage and retention periods that I did not expect. For example, below is an excerpt from &lt;a href="https://documentation.mailgun.com/en/latest/user_manual.html#events-1"&gt;their documentation&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Mailgun keeps track of every event that happens to every message (both inbound and outbound) and stores this data for at least 30 days for paid accounts and 2 days for free accounts.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Note that all of this would technically be stored centrally in the Buttondown admin account, meaning I couldn't verify access or retention in any way. Even if I chose to build my own newsletter to integrate directly with mailgun, there was no way to turn this off.&lt;/p&gt;
&lt;p&gt;At the time, they also had fairly expansive privacy policies that documented the data shared with their subprocessors (services and companies they use to process person-related data). They have since made their privacy policy more legible, but you can still see the wide array of data collected, processed and likely stored in the provider's account in &lt;a href="https://www.mailgun.com/legal/dpa/"&gt;their current DPA&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The types of Personal Data to be processed: The personal data submitted, the extent of which is determined and controlled by the Controller in its sole discretion, includes name, email, telephone numbers, IP address and other personal data included in the contact lists and message content.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Much of this is likely collected to fulfill customer demands (i.e. customers &lt;em&gt;want&lt;/em&gt; tracking) or as a way to combat spammers. But there was no way to turn it off, and there also wasn't a way to use the service for a given period of time or trial period, prove I am not a spammer and turn it off later. As a machine learning engineer and data scientist, I could think of a million other ways to detect spam activity than storing this history, but that's beside the point.&lt;/p&gt;
&lt;p&gt;I was truly in uncharted territory, so I started asking networking friends how I could find a service that didn't actively track opens, reads and unanswered bounces. That's when I learned about SMTP relays...&lt;/p&gt;
&lt;h4 id="what-is-smtp-relay"&gt;What is SMTP Relay?&lt;/h4&gt;
&lt;p&gt;An SMTP relay is a way to hand off SMTP requests between SMTP servers. This happens when the sender and receiver aren't in the same email domain. Much like your internet requests are handed off across the internet, an SMTP relay service hands off incoming and outgoing mail until it reaches the appropriate recipient mail server. You can read more about SMTP relay from &lt;a href="https://www.ionos.com/digitalguide/e-mail/technical-matters/smtp-relay/"&gt;Ionos's explanatory article&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;What I needed was a privacy-first SMTP relay that allowed me to turn off tracking for the email as it was forwarded out to recipients. I put a cry out for help on Twitter (reminder: this was early 2022, pre-Emerald Emperor).&lt;/p&gt;
&lt;p&gt;&lt;img alt="Tweet from @kjam asking if anyone knows a privacy-first SMTP relay service or if anyone could ask around and recommend one. Context: I'm trying to find one for Probably Private (with a link to the old buttondown newsletter.)" src="/images/2023/03/tweet_smtp_relay.png" title=""&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://twitter.com/kjam/status/1500813062970003466"&gt;original tweet&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;My request was:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Data hosted in Europe&lt;/li&gt;
&lt;li&gt;Data minimization built in (i.e. I can hold the emails, the service just does the send)&lt;/li&gt;
&lt;li&gt;Reasonable prices for a small number of emails per month (&amp;lt;$30/mo if possible)&lt;/li&gt;
&lt;li&gt;Clear privacy policy and data processing agreement&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Unfortunately, no one could point me in a reasonable direction. I was getting desperate, and it seemed like I might need to host my own email somehow...&lt;/p&gt;
&lt;h3 id="just-host-your-own-email-said-no-one-who-has-done-it"&gt;Just Host Your Own Email (said no one who has done it)&lt;/h3&gt;
&lt;p&gt;Of course there are many guides and articles on how to set up your own SMTP server. What none of these guides will tell you is what a pain maintenance is, and how you basically will be immediately marked as Spam until you prove yourself otherwise.&lt;/p&gt;
&lt;p&gt;Since the last time I set up a mail server (circa 2010), email has changed a bit. The deeper I dove into hosting my own server, the more it seemed impossible to manage due to the way that reputation management is performed. Due to the rising sophistication of phishing and spam, email over those past 10 years became a true battleground between spammers and mail providers. The victims of these battles are people or companies who would like to run their own email but who aren't going to send a lot of mail.&lt;/p&gt;
&lt;p&gt;When you set up a mail service and start sending mail, your reputation will be tracked on mail services in relation to your domain and your servers' IP addresses. When you first start or are unknown, this is very difficult, because you are assumed to be Spam. You have to then spend time and energy to increase your reputation -- all that on top of also having to maintain the mail server and whatever it was you were trying to do with it in the first place -- run a business, write a newsletter, etc.&lt;/p&gt;
&lt;p&gt;It wasn't intentional, but this basically pushed out a lot of self-hosted email providers or hobby projects. In another way, it's a bit terrifying for privacy, since most email is sent unencrypted and who knows what types of machine learning or other data "insights" are being run on/by most email providers (since now everyone has one or more email providers)... but I digress.  &lt;/p&gt;
&lt;p&gt;Self-hosting seemed like a lot of work and I also didn't really want to have one more thing to manage, I just wanted to run a privacy-first newsletter.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Please send help cartoon GIF showing a woman writing a send help message, giving it to a bird, and then the bird flies away and is eaten by a larger bird." src="/images/2023/03/send-help.gif" title=""&gt;&lt;/p&gt;
&lt;h3 id="back-to-privacy-first-email-providers"&gt;Back to Privacy-First Email Providers&lt;/h3&gt;
&lt;p&gt;I ended up routing completely back to email providers, namely because my needs right now as a relatively small newsletter don't actually exceed normal email sending rates. This sets a newsletter growth constraint, but one I was willing to accept for now in order to provide more privacy for my readers.&lt;/p&gt;
&lt;p&gt;I took a look at &lt;a href="https://proton.me/"&gt;Proton Mail&lt;/a&gt;, who has a great track record with regard to privacy, but they actually don't provide programmatic interfaces very easily, and certainly not for sending many emails at once.&lt;/p&gt;
&lt;p&gt;Finally, I found &lt;a href="https://runbox.com/"&gt;Runbox&lt;/a&gt;, a privacy- and security-first email provider based in Norway. Added bonus, they also prioritize green computing! It gave me warm fuzzies and I immediately signed up for a trial account. I tested using the API programmatically and didn't run into any problems, so I bought an enterprise account and migrated over probablyprivate.com.&lt;/p&gt;
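&lt;p&gt;In case it helps anyone on a similar hunt: as you'll see below, my newsletter code ended up being Django-based, and the nice thing about a plain, privacy-respecting SMTP provider is that Django's built-in SMTP email backend can talk to it with just a handful of settings. This is a minimal sketch with placeholder values, not Runbox's actual configuration -- check your provider's documentation for the real host and port.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# settings.py -- a minimal sketch; host, user and password are placeholders.
EMAIL_BACKEND = "django.core.mail.backends.smtp.EmailBackend"
EMAIL_HOST = "mail.example-provider.com"  # your provider's SMTP server
EMAIL_PORT = 587
EMAIL_USE_TLS = True  # encrypt the connection
EMAIL_HOST_USER = "newsletter@example.com"
EMAIL_HOST_PASSWORD = "load-this-from-an-environment-variable"
DEFAULT_FROM_EMAIL = "newsletter@example.com"
&lt;/code&gt;&lt;/pre&gt;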
&lt;p&gt;My troubles were over... or were they?&lt;/p&gt;
&lt;h2 id="populating-a-bare-bones-newsletter"&gt;Populating a Bare-Bones Newsletter&lt;/h2&gt;
&lt;p&gt;Now that I was ready to build the actual newsletter, it meant starting over. Since I wasn't able to find a newsletter provider with a privacy-first SMTP relay, it meant finding my own way to programmatically send emails. At first, I had set up my newsletter to use &lt;a href="https://ghost.org/docs/"&gt;Ghost.js&lt;/a&gt;, which I love as an editor, but it uses Node and the self-hosted and open-source version only allows &lt;a href="https://ghost.org/docs/faq/mailgun-newsletters/"&gt;integration with mailgun&lt;/a&gt;, which meant it wasn't something I was going to easily change, fork or fix.&lt;/p&gt;
&lt;p&gt;I went in search of a python-based self-hosted newsletter.&lt;/p&gt;
&lt;h3 id="django-newsletter"&gt;django-newsletter&lt;/h3&gt;
&lt;p&gt;I found &lt;a href="https://github.com/jazzband/django-newsletter"&gt;django-newsletter&lt;/a&gt;, with many users and fairly good support. As I started to work with it, however, I realized it was going to be a nightmare for the type of newsletter I imagined, namely because the code base was quite complex, it didn't support the latest Django and Python versions at that time. It seemed like administrative overkill for such a small newsletter, administered by just one person.&lt;/p&gt;
&lt;p&gt;I would, however, recommend django-newsletter for folks that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;have multiple newsletters and want to manage them in one interface&lt;/li&gt;
&lt;li&gt;have multiple authors/editors who need to work on posts together&lt;/li&gt;
&lt;li&gt;are already using Django as a web framework and might be able to contribute back upgrades and updates&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Being that it didn't fit for me, I...&lt;/p&gt;
&lt;h3 id="gave-up-and-wrote-my-own"&gt;Gave up and wrote my own&lt;/h3&gt;
&lt;p&gt;Yes, I know. It's definitely "the hard way", but I wrote my own Django models and administrative interfaces, along with the ability to manage posts myself. It took me about a day or two, as I already have experience using Django and also sending emails programmatically.&lt;/p&gt;
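&lt;p&gt;For the curious, the core of a hand-rolled setup like this really is small. Below is a simplified sketch -- the model and field names are illustrative, not the exact code running my newsletter -- with a subscriber list, an issue and a send loop that uses Django's send_mail (and SMTP settings like the ones shown earlier).&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# models.py -- a simplified sketch of a one-person newsletter backend.
from django.core.mail import send_mail
from django.db import models

class Subscriber(models.Model):
    email = models.EmailField(unique=True)
    confirmed = models.BooleanField(default=False)  # double opt-in flag

class Issue(models.Model):
    subject = models.CharField(max_length=200)
    body_text = models.TextField()             # plain-text version
    body_html = models.TextField(blank=True)   # rendered HTML version
    sent_at = models.DateTimeField(null=True, blank=True)

    def send(self):
        """Send this issue to every confirmed subscriber, one email each."""
        for subscriber in Subscriber.objects.filter(confirmed=True):
            send_mail(
                subject=self.subject,
                message=self.body_text,
                from_email=None,  # falls back to DEFAULT_FROM_EMAIL
                recipient_list=[subscriber.email],
                html_message=self.body_html or None,
            )
&lt;/code&gt;&lt;/pre&gt;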
&lt;p&gt;I don't really recommend this route, as it's probably overkill if you aren't already someone familiar with the basics of backend and/or frontend web development. It's also a pain to maintain. Should you actually know of an easy-to-use open-source newsletter manager that is not Node.js based (yes, I have issues), please reach out!&lt;/p&gt;
&lt;p&gt;To build out the site, I got pretty far myself with a Canva-based design that I edited (here's the &lt;a href="https://www.canva.com/p/templates/EAEsUZ4NNzs-pastel-pink-retro-and-geometry-shape-resume/"&gt;original&lt;/a&gt;), plus some help with CSS via Upwork. But I had reached my limit of being able to make the templates and front-end consistent.&lt;/p&gt;
&lt;h3 id="email-templating-minimal-loading-calls"&gt;Email templating &amp;amp; minimal loading calls&lt;/h3&gt;
&lt;p&gt;If you didn't already know, email templating is a complex problem. There are so many different email clients, screen sizes and ways to display email that one can easily get lost. This is why professionals often use a framework like &lt;a href="https://mjml.io/"&gt;MJML (from Mailjet)&lt;/a&gt; to ensure that the email works on as many readers as possible with some semblance of consistency.&lt;/p&gt;
&lt;p&gt;I also had a requirement that I wanted the site to not load so many things. Not only does this hurt performance, it also leaks (more) information in the browsing and networking calls. I wanted to minimize calls per page load, which meant having a super lean CSS that could, at times, even be loaded on the page itself. To make this work was beyond my front-end depth, so I called a dear friend and colleague.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://zanderle.com/"&gt;Žan has helped me on several engagements&lt;/a&gt; where I needed someone who does front-end work. Along with being a delightful human, he also knows back-end web development, making it easy to pass over my half-finished pile and scream help. (Thanks Žan!) He also figured out how to convert my poor HTML files into an actual email template using MJML and is, I think, only minorly scarred from the experience. ;)&lt;/p&gt;
&lt;h2 id="announcing-the-new-probably-private"&gt;Announcing the new Probably Private!&lt;/h2&gt;
&lt;p&gt;In the end, I now have a fairly no-frills but definitely privacy-first newsletter! It was a journey I did not expect, but I learned even more about privacy and the internet, which I'll be writing more about in an upcoming series diving into the technical details of internet privacy!&lt;/p&gt;
&lt;p&gt;If this post was interesting to you, or if you want to learn more about privacy in data science, &lt;a href="https://probablyprivate.com/subscribe/"&gt;please subscribe&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;I'm also always welcome to feedback (via mail: info /at probablyprivate / dot / com). &lt;/p&gt;</content><category term="internet"></category></entry><entry><title>Joining Dropout Labs!</title><link href="https://blog.kjamistan.com/joining-dropout-labs.html" rel="alternate"></link><published>2019-11-23T00:00:00+01:00</published><updated>2019-11-23T00:00:00+01:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2019-11-23:/joining-dropout-labs.html</id><summary type="html">&lt;p&gt;After months of searching, lots of fun (and some less fun) interviews and hours of self-reflection, I am excited to announce I am the new Head of Product at &lt;a href="https://dropoutlabs.com/"&gt;Dropout Labs&lt;/a&gt;! 🎉&lt;/p&gt;
&lt;p&gt;The interview and decision process was quite iterative and disruptive! I am somewhat to blame for this as I …&lt;/p&gt;</summary><content type="html">&lt;p&gt;After months of searching, lots of fun (and some less fun) interviews and hours of self-reflection, I am excited to announce I am the new Head of Product at &lt;a href="https://dropoutlabs.com/"&gt;Dropout Labs&lt;/a&gt;! 🎉&lt;/p&gt;
&lt;p&gt;The interview and decision process was quite iterative and disruptive! I am somewhat to blame for this as I chose to interview with more than 35 companies 😅 The decision process itself involved many pivots, but at &lt;a href="https://en.wikipedia.org/wiki/Foo_Camp"&gt;FooCamp&lt;/a&gt;, via several soul-searching conversations, I came to the conclusion that I couldn't walk away from my passion for changing machine learning for good by continuing to advocate for privacy &lt;em&gt;and&lt;/em&gt; security in machine learning.&lt;/p&gt;
&lt;p&gt;After coming to this conclusion, the choice was obvious. The team at Dropout Labs was deeply knowledgeable and passionate about this goal and truly believes in a future where encrypted machine learning is not only possible, it's the norm.&lt;/p&gt;
&lt;h5 id="what-is-dropout-labs-even"&gt;What is Dropout Labs even?&lt;/h5&gt;
&lt;p&gt;An amazing all-remote team built by successful entrepreneurs working on privacy-preserving machine learning at the intersection of deep learning and cryptography! Brag time: they built the open-source &lt;a href="https://github.com/tf-encrypted/tf-encrypted"&gt;TF Encrypted&lt;/a&gt; and &lt;a href="https://github.com/tf-encrypted/"&gt;several other important Tensorflow Libraries&lt;/a&gt; helping make secure and privacy-aware deep learning a reality.&lt;/p&gt;
&lt;p&gt;In addition to being able to stay in Berlin, the team impressed me with their knowledge and enthusiasm for privacy, machine learning and security. An all-remote culture is something I've always wanted to experience and is providing me with new learnings daily -- am I communicating the right amount? How can I ask better questions? How can I clearly share a specific insight with the team?&lt;/p&gt;
&lt;p&gt;If you know me or my work, you also know I wouldn't join a team that didn't have a mission or vision that aligned with my deeply held beliefs. Ours is clear: create a new reality for machine learning -- one where different actors (data owners, data scientists, security and privacy folks, end users) can collaborate to define trust in their relationships and confidently build better models in a privacy-aware and secure manner.&lt;/p&gt;
&lt;h5 id="what-is-this-product-that-you-are-building-as-head-of-product"&gt;What is this product that you are building as Head of Product?&lt;/h5&gt;
&lt;p&gt;This is what I &lt;em&gt;can&lt;/em&gt; say so far: we're exploring the intersection of machine learning pipelines, data privacy policy and encryption. We want to meet the problem where it will have the most impact: sensitive data in production systems that would benefit a machine learning or data science team if they had access.&lt;/p&gt;
&lt;p&gt;As we are using iterative design and development, you'll get a sneak peek long before our initial launch if you follow me on &lt;a href="https://www.linkedin.com/in/katharinejarmul"&gt;LinkedIn&lt;/a&gt; or &lt;a href="https://twitter.com/kjam"&gt;Twitter&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;BTW, you should also follow &lt;a href="https://twitter.com/dropoutlabsai"&gt;Dropout Labs&lt;/a&gt; and check out their posts on &lt;a href="https://medium.com/dropoutlabs"&gt;Medium&lt;/a&gt; to learn more!&lt;/p&gt;
&lt;h5 id="can-i-learn-even-more"&gt;Can I learn &lt;em&gt;even&lt;/em&gt; more?&lt;/h5&gt;
&lt;p&gt;Yes, of course! I'd love to chat about what we are building and get feedback! Specifically, if you are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;a data protection officer or policy/governance lead&lt;/li&gt;
&lt;li&gt;a data scientist or machine learning engineer&lt;/li&gt;
&lt;li&gt;a data or machine learning pipeline engineer&lt;/li&gt;
&lt;li&gt;a security or SecOps team member&lt;/li&gt;
&lt;li&gt;an executive at a company handling sensitive data&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I want to talk to you! As part of product development, we will be implementing a lot of fun prototypes and asking lots of questions -- which means I'd love to hear your needs and prioritize them. I can promise to listen and learn from you. If you are in Berlin (either for our call or after), I can treat you to lunch or a beverage of your choice as a Dankeschön!&lt;/p&gt;
&lt;p&gt;Please feel free to &lt;a href="https://twitter.com/kjam"&gt;DM me on Twitter&lt;/a&gt;, &lt;a href="https://www.linkedin.com/in/katharinejarmul"&gt;connect on LinkedIn&lt;/a&gt; or &lt;a href="mailto:katharine@dropoutlabs.com"&gt;drop me a line via email&lt;/a&gt;. I look forward to sharing more of this journey with you and learning along the way. 🤗&lt;/p&gt;</content><category term="misc"></category></entry><entry><title>Let's Get Together: More Details on Me, You and My Dream Gig</title><link href="https://blog.kjamistan.com/lets-get-together-more-details-on-me-you-and-my-dream-gig.html" rel="alternate"></link><published>2019-06-06T00:00:00+02:00</published><updated>2019-06-06T00:00:00+02:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2019-06-06:/lets-get-together-more-details-on-me-you-and-my-dream-gig.html</id><summary type="html">&lt;p&gt;Hello!&lt;/p&gt;
&lt;p&gt;We may not know each other, but here you are on my website -- perhaps because you saw a post or someone shared a link. I'm resourceful, determined, intelligent and looking for new challenges. Welcome!&lt;/p&gt;
&lt;p&gt;Here's more about me, in case it is news to you:&lt;/p&gt;
&lt;h4 id="about-me"&gt;[About Me]&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;Co-founder of …&lt;/li&gt;&lt;/ul&gt;</summary><content type="html">&lt;p&gt;Hello!&lt;/p&gt;
&lt;p&gt;We may not know each other, but here you are on my website -- perhaps because you saw a post or someone shared a link. I'm resourceful, determined, intelligent and looking for new challenges. Welcome!&lt;/p&gt;
&lt;p&gt;Here's more about me, in case it is news to you:&lt;/p&gt;
&lt;h4 id="about-me"&gt;[About Me]&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;Co-founder of KIProtect, a startup with a mission to make privacy easier. Our main technology was a set of new encryption methods allowing you to do more secure and privacy-aware machine learning. I led our business, sales, product and marketing efforts.&lt;/li&gt;
&lt;li&gt;More than 10 years in the technology industry, with broad engineering and product experience in data engineering, machine learning and data science, software design and development, large-scale AWS, Rackspace and Google Cloud deployment and automation. Deep understanding of data privacy and information security best practices for compliance with GDPR, HIPAA and new privacy regulations in Brazil and California.&lt;/li&gt;
&lt;li&gt;Extremely interested in making machine learning more fair, just, accountable, secure and privacy-aware.&lt;/li&gt;
&lt;li&gt;Regular speaker and keynoter at international conferences such as CCC, Strangeloop, QCon, ACM, PyData, PyCon, EuroPython. Due to my strong engineering and organizing background, I have covered topics like data privacy, machine learning security and AI ethics and continue to be invited to speak on these topics.&lt;/li&gt;
&lt;li&gt;Adjunct professor at the University of Florida and teacher for several online platforms (O'Reilly Safari, DataCamp) and offline ones (Frauenloop, PyLadies).&lt;/li&gt;
&lt;li&gt;Interested in sharpening my security engineering chops. Implemented basic security automation and monitoring, possess an in-depth understanding of machine learning security. Now anticipate gaining further expertise in the areas of pen-testing, network and container security, and exploit / vulnerability discovery.&lt;/li&gt;
&lt;li&gt;Years of experience in business and product side, thus a capable and resourceful intermediary for tech and business/product teams (i.e. I am product-tech bilingual).&lt;/li&gt;
&lt;li&gt;Excel at rapid grasp of new technologies, and asking difficult questions to surface critical issues -- driving teams to research, learn, debate and decisively resolve issues as they arise.&lt;/li&gt;
&lt;li&gt;Fluent in Python and GoLang and have experience with C++ and Java.&lt;/li&gt;
&lt;li&gt;Founder of PyLadies, mentor and ally for several women of color and immigrant women in tech initiatives, conference diversity scholarship organizer, persistent advocate for the “underrepresented” in tech and challenger of privilege in our industry.&lt;/li&gt;
&lt;li&gt;Background in investigative journalism, love public speaking, meeting new people and working with teams on cool s**t.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="about-you"&gt;[About You]&lt;/h4&gt;
&lt;p&gt;Here's a few things I'm hoping you can tell me:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;What is your team like? Is it diverse (gender, race, immigration status, age)? If not, why not?&lt;/li&gt;
&lt;li&gt;What relevant problems do you solve? Why is your work / product exciting?&lt;/li&gt;
&lt;li&gt;Do you let folks learn on the job? Is this supported with mentoring / pairing / reviews, etc?&lt;/li&gt;
&lt;li&gt;Are you friendly to remote workers or based in Berlin? If not, where are you and do you offer relocation?&lt;/li&gt;
&lt;li&gt;Are you flexible on start date or do you have a shorter engagement (like a small project or consulting) in mind?&lt;/li&gt;
&lt;li&gt;Did you read the above description about me and determine I'm a good fit based on our mutual interests or are you just here to add another email to your recruitment database? 😘&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="dream-role"&gt;[Dream Role]&lt;/h4&gt;
&lt;p&gt;To be fully transparent, I'm not precisely sure what I want to tackle next. There are multiple possible good fits; here are a few examples of positions where I could add significant value and passion:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Technical product owner focused on defining customer needs, developing the product roadmap and assuring feasibility&lt;/li&gt;
&lt;li&gt;Researcher at an AI institute or policy group -- focused on ethical, privacy and security concerns&lt;/li&gt;
&lt;li&gt;Technical partner / consultant at a VC firm focused on emerging technologies at the intersection of AI and security&lt;/li&gt;
&lt;li&gt;Machine Learning or Data Science Director at a non-profit or activist organization focused on supporting community-based initiatives and fighting injustice with (and in) data&lt;/li&gt;
&lt;li&gt;Machine Learning expert at a security consultancy or company who wants to either use data science to help solve security problems or explore ways machine learning can be exploited&lt;/li&gt;
&lt;li&gt;Security-focused data engineer or data engineering manager at a company managing large amounts of sensitive data&lt;/li&gt;
&lt;li&gt;Senior management or C-Suite at a startup focused on privacy, security and/or ethical AI&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I'm currently open to a variety of positions / roles and time allocations (i.e. freelance, consultant, part-time or full-time, etc). I'd love to hear your responses to the questions above -- feel free to drop me a line. I'm katharine at the top-level domain you are currently on. Spelling matters (i.e. kath-A-rine). Thanks for dropping by. 🤗&lt;/p&gt;</content><category term="misc"></category></entry><entry><title>Adversarial Learning for Good: My Talk at #34c3 on Deep Learning Blindspots</title><link href="https://blog.kjamistan.com/adversarial-learning-for-good-my-talk-at-34c3-on-deep-learning-blindspots.html" rel="alternate"></link><published>2017-12-28T00:00:00+01:00</published><updated>2017-12-28T00:00:00+01:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2017-12-28:/adversarial-learning-for-good-my-talk-at-34c3-on-deep-learning-blindspots.html</id><summary type="html">&lt;p&gt;When I first was introduced to the idea of adversarial learning for security purposes by &lt;a href="https://www.youtube.com/watch?v=JAGDpJFFM2A"&gt;Clarence Chio's 2016 DEF CON talk&lt;/a&gt; and his related &lt;a href="https://github.com/cchio/deep-pwning"&gt;open-source library deep-pwning&lt;/a&gt;, I immediately started wondering about applications of the field to both make robust and well-tested models, but also as a preventative measure against …&lt;/p&gt;</summary><content type="html">&lt;p&gt;When I first was introduced to the idea of adversarial learning for security purposes by &lt;a href="https://www.youtube.com/watch?v=JAGDpJFFM2A"&gt;Clarence Chio's 2016 DEF CON talk&lt;/a&gt; and his related &lt;a href="https://github.com/cchio/deep-pwning"&gt;open-source library deep-pwning&lt;/a&gt;, I immediately started wondering about applications of the field to both make robust and well-tested models, but also as a preventative measure against predatory machine learning practices in the field.&lt;/p&gt;
&lt;p&gt;After reading more literature and utilizing several other open-source libraries, I realized most examples and research focused around malicious uses, such as sending spam or malware without detection, or crashing self-driving cars. Although I find this research interesting, I wanted to determine if adversarial learning could be used for "good".&lt;sup id="fnref:1"&gt;&lt;a class="footnote-ref" href="#fn:1"&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;h3 id="a-brief-primer-on-adversarial-learning-basics"&gt;A brief primer on Adversarial Learning Basics&lt;/h3&gt;
&lt;p&gt;In case you haven't been following the explosion of adversarial learning in neural network research, papers and conferences, let's take a whirlwind tour of some concepts to get on the same page and provide further reading if you open up arXiv for fun on the weekend.&lt;/p&gt;
&lt;h4 id="how-does-it-work-what-does-it-do"&gt;How Does It Work? What Does It Do?&lt;/h4&gt;
&lt;p&gt;Most neural networks optimize their weights and other variables via backpropagation, using an optimization algorithm such as Stochastic Gradient Descent (or SGD) to minimize a loss function. Just as we use the gradients of that loss to train our network, researchers found we can use the same gradients to find weak links in our network and craft adversarial examples that exploit them.&lt;/p&gt;
&lt;p&gt;To get an intuition of what is happening when we apply adversarial learning, let's look at a graphic which can help us visualize both the learning and adversarial generation.&lt;/p&gt;
&lt;p&gt;&lt;img alt="gradient descent graphic" src="http://blog.datumbox.com/wp-content/uploads/2013/10/gradient-descent.png"&gt; &lt;a href="http://blog.datumbox.com/tuning-the-learning-rate-in-gradient-descent/"&gt;Source Image&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Here we see a visual example of SGD, where we start our weights randomly or perhaps with a specific distribution. At the beginning our weight produces high error rates, putting it in the red area, but we'd like to end up at the global minimum in the dark blue area. As the graphic shows, however, we may only end up in the local minimum with a slightly higher error rate on the right-hand side.&lt;/p&gt;
&lt;p&gt;With adversarial sample generation, we are essentially trying to push that point back up the hill. We can't change the weights, of course, but we can change the input. If we can get this unit to misfire, and a few other units to do the same, the network can end up misclassifying the input entirely. This is our goal when doing adversarial learning, and we can achieve it with a series of algorithms proven to create specific perturbations of the input that fool the network. As you may notice, we also need a trained model to apply these algorithms to and to test our success rate against.&lt;/p&gt;
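&lt;p&gt;To make that intuition concrete, here is a minimal sketch of the FGSM idea using present-day TensorFlow (the &lt;code&gt;model&lt;/code&gt;, inputs, labels and epsilon are placeholders I chose for illustration, not any particular attack library's API): take the gradient of the loss with respect to the input instead of the weights, then nudge every pixel a small step in the direction that increases the loss.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import tensorflow as tf

def fgsm_perturb(model, x, y_true, epsilon=0.1):
    """Return x plus an FGSM-style perturbation of size epsilon (a sketch)."""
    x = tf.convert_to_tensor(x, dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(x)  # track gradients with respect to the input, not the weights
        predictions = model(x, training=False)
        loss = tf.keras.losses.sparse_categorical_crossentropy(y_true, predictions)
    gradient = tape.gradient(loss, x)  # how the loss changes as each pixel changes
    x_adv = x + epsilon * tf.sign(gradient)  # step in the direction that raises the loss
    return tf.clip_by_value(x_adv, 0.0, 1.0)  # keep pixel values in a valid range
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Larger epsilon values make the change more visible to humans but more reliably fool the model, which is the trade-off in the face recognition experiment further down.&lt;/p&gt;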
&lt;h4 id="historical-tour-of-papers-moments-in-adversarial-ml"&gt;Historical Tour of Papers / Moments in Adversarial ML&lt;/h4&gt;
&lt;p&gt;The first prominent paper on adversarial examples came in the form of a technique to &lt;a href="https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/kdd05lowd.pdf"&gt;modify spam mail to be classified as real mail, published by a group of researchers in 2005&lt;/a&gt;. The authors used a technique of addressing important features and changing them using Bayesian and linear classifiers.&lt;/p&gt;
&lt;p&gt;In 2007, NIPS had their first workshop on &lt;a href="https://web.archive.org/web/20110822071402/http://mls-nips07.first.fraunhofer.de/"&gt;Machine Learning in Adversarial Environments for Computer Security&lt;/a&gt; which covered many techniques related primarily to linear classification but also other topics of interest in security such as network intrusion and bot detection.&lt;/p&gt;
&lt;p&gt;In 2013, following other interesting research on the topic, Battista Biggio and several other researchers released a paper on &lt;a href="https://arxiv.org/abs/1206.6389"&gt;Support Vector Machine (or SVM) poisoning attacks&lt;/a&gt;. The researchers were able to show they could alter specific training data and essentially render the model useless against targeted attacks (or at least hampered by the poor training). I highly recommend Biggio's later paper on &lt;a href="https://arxiv.org/abs/1709.00609"&gt;pattern-based classifiers under attack&lt;/a&gt;, and he has many other publications on techniques to attack ML models and to defend against those attacks.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Example poisoning attack" src="https://blog.kjamistan.com/images/2017/12/poisoning_face_recognition_biggio.png"&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://arxiv.org/abs/1709.00609"&gt;Photo: Example poisoning attack on a biometric dataset&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;In 2014, Christian Szegedy, Ian Goodfellow and several other Google researchers released their paper &lt;a href="https://arxiv.org/abs/1312.6199"&gt;Intriguing Properties of Neural Networks&lt;/a&gt; which outlined techniques to calculate carefully crafted perturbations of an image allowing an adversary to fool a neural network into misclassifying the image. Ian Goodfellow later released &lt;a href="https://arxiv.org/abs/1412.6572"&gt;a paper outlining an adversarial technique called the Fast Gradient Sign Method or FGSM&lt;/a&gt;, one of the widely used and implemented forms of attacks on neural network classifiers.&lt;/p&gt;
&lt;p&gt;In 2016, Nicolas Papernot and several other researchers released &lt;a href="https://arxiv.org/abs/1511.07528"&gt;a new technique which utilized a saliency map built from the Jacobian matrix of the model's outputs with respect to the input vector&lt;/a&gt;. He and Ian Goodfellow later released &lt;a href="https://github.com/tensorflow/cleverhans"&gt;a Python open-source library called cleverhans&lt;/a&gt; which implements FGSM and the Jacobian Saliency Map Attack (or JSMA).&lt;/p&gt;
&lt;p&gt;There have been many other papers and talks related to this topic since 2014, too many to cover here, but I recommend perusing some of the recent papers from the field and investigating areas of interest for yourself.&lt;/p&gt;
&lt;h3 id="malicious-attacks"&gt;Malicious Attacks&lt;/h3&gt;
&lt;p&gt;As mentioned previously, malicious attacks have been studied at length. Here are a few notable studies:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Spam: &lt;a href="https://pdfs.semanticscholar.org/3212/929ad5121464ac49741dd3462a5d469e668d.pdf"&gt;Adversarial Machine Learning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Malware recognition: &lt;a href="http://www.patrickmcdaniel.org/pubs/esorics17.pdf"&gt;Adversarial Examples for Malware Detection&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Malware generation: &lt;a href="https://arxiv.org/abs/1702.05983"&gt;Generating Adversarial Malware Examples for Black-Box Attacks Based on GAN
&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Poisoning of biometric data: &lt;a href="https://pdfs.semanticscholar.org/bafb/d93468634b5b43e3b29b3d86efae41559e8b.pdf"&gt;Adversarial Biometric Recognition&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Attacks on self-driving cars: &lt;a href="https://arxiv.org/pdf/1707.08945"&gt;Robust Physical-World Attacks
on Deep Learning Models&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;There are plenty more, but these give you an idea of what has been studied in the space. Of course, alongside many of these studies the authors also studied counter-attacks. Security is ever a cat-and-mouse game, so learning how to defend against these types of attacks, particularly via adversary detection or adversarial training, is a research space in its own right.&lt;/p&gt;
&lt;h4 id="real-life-adversarial-examples"&gt;Real-life Adversarial Examples&lt;/h4&gt;
&lt;p&gt;It has been debated whether adversarial learning will ever work for real-life objects or is only useful when the input is static, such as an image or a file. In a recent paper, a group of researchers at MIT were able to &lt;a href="https://arxiv.org/abs/1707.07397"&gt;print 3D objects which fooled a video-based Inception network into "thinking", for example, a turtle was a rifle&lt;/a&gt;. Their method applied techniques similar to FGSM across a space of possible alterations to the texture of the object itself.&lt;/p&gt;
&lt;iframe width="560" height="315" src="https://www.youtube.com/embed/piYnd_wYlT8" frameborder="0" gesture="media" allow="encrypted-media" allowfullscreen&gt;&lt;/iframe&gt;

&lt;h3 id="how-can-i-build-my-own-adversarial-samples"&gt;How can I build my own adversarial samples?&lt;/h3&gt;
&lt;p&gt;Hopefully you are now interested in building some of your own adversarial samples. Maybe you are a machine learning practitioner looking to better defend your network, or perhaps you are just intrigued by the topic. Please do not use these techniques to send spam or malware! Really though... don't.&lt;/p&gt;
&lt;p&gt;Okay, ethical use covered, let's check out the basic steps you'll need to go through when building adversarial samples:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Pick a problem / network type&lt;ul&gt;
&lt;li&gt;Figure out a target or idea. Do some research on what is used "in production" on those types of tasks.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Research “state of the art” or publicly available pretrained models or build your own&lt;ul&gt;
&lt;li&gt;Read research papers in the space, watch talks from target company. Determine if you will build your own or use a pretrained model.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;(optional) Fine tune your model&lt;ul&gt;
&lt;li&gt;If using a pretrained model, take time to fine-tune it by retraining the last few layers.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Use a library: cleverhans, FoolBox, DeepFool, deep-pwning&lt;ul&gt;
&lt;li&gt;Utilize one of many adversarial learning open-source tools to generate adversarial input.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Test your adversarial samples on another (or your target) network&lt;ul&gt;
&lt;li&gt;Not all problems and models are as easy to fool. Test your best images on your local network and possibly one that hasn't seen the same training data. Then take the highest confidence fooling input and pass it to the target network (a small evaluation sketch follows this list).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Want to get started right away? Here are some neat tools and libraries available in the open-source world for generating different adversarial examples.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/tensorflow/cleverhans"&gt;cleverhans&lt;/a&gt;: Implementations of FGSM and JSMA in Tensorflow and Keras&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/cchio/deep-pwning"&gt;deep-pwning&lt;/a&gt;: Generative drivers with examples for Semantic CNN, MNIST and CIFAR-10&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/bethgelab/foolbox"&gt;FooxBox&lt;/a&gt;: Implementations of many algorithms with support for Tensorflow, Torch, Keras and MXNet&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/lts4/deepfool"&gt;DeepFool&lt;/a&gt;: Torch-based implementation of the paper DeepFool (less detectable FGSM)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Evolving-AI-Lab/fooling"&gt;Evolving AI Lab: Fooling&lt;/a&gt;: Evolutionary network for generating images that humans don't recognize but networks do, implemented in Caffe&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/vu-aml/adlib"&gt;Vanderbuilt's adlib&lt;/a&gt;: sci-kit learn based fooling and poisoning algorithms for simple ML models.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;There are many more, but these seemed like a representative sample of what is available. Have a library you think should be included? Ping me or comment!&lt;/p&gt;
&lt;h3 id="benevolent-uses-of-adversarial-samples-a-proposal"&gt;Benevolent Uses of Adversarial Samples (a proposal)&lt;/h3&gt;
&lt;p&gt;I see the potential for numerous benevolent applications of these same techniques. The first idea that came to mind for me was fooling facial recognition used in surveillance technology (or simply for when you want to post a photo and not have it recognize you).&lt;/p&gt;
&lt;h4 id="face-recognition"&gt;Face Recognition&lt;/h4&gt;
&lt;p&gt;To test the idea, I retrained the final layers of the Keras pre-trained Inception V3 model to determine if a photo is a cat or a human. It achieved 99% accuracy in testing.&lt;sup id="fnref:2"&gt;&lt;a class="footnote-ref" href="#fn:2"&gt;2&lt;/a&gt;&lt;/sup&gt; Then, I utilized the cleverhans library to calculate adversaries using FGSM. I tried varying levels of epsilon, uploading each to Facebook. At low levels of perturbations, Facebook immediately recognized my photo as my face and suggested I tag myself. When I reached .21 epsilon, Facebook stopped suggesting a tag (this was around 95% confidence from my network that the photo was of a cat).&lt;/p&gt;
&lt;p&gt;&lt;img alt="me as a cat" src="https://blog.kjamistan.com/images/2017/12/me_as_a_cat.png"&gt;&lt;/p&gt;
&lt;p&gt;[Photo: me as a cat]&lt;/p&gt;
&lt;p&gt;The produced image clearly shows perturbations, but after speaking with computer vision specialist Irina Vidal Migallon&lt;sup id="fnref:3"&gt;&lt;a class="footnote-ref" href="#fn:3"&gt;3&lt;/a&gt;&lt;/sup&gt;, I learned it is possible Facebook is also using Viola-Jones statistics-based face detection or some other statistical solution. If that is the case, it's unlikely we would be able to fool it using a neural network with no humanly visible perturbations. But it does show that we &lt;em&gt;can&lt;/em&gt; use a neural network and adversarial learning techniques to fool face detection.&lt;sup id="fnref:4"&gt;&lt;a class="footnote-ref" href="#fn:4"&gt;4&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
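&lt;p&gt;For anyone who wants to try a similar experiment, here is a rough sketch of the transfer-learning step (written in current tf.keras rather than my original code, with arbitrary layer sizes): keep the pre-trained Inception V3 convolutional layers frozen and retrain only a small new classification head on your cat-vs-human photos.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import tensorflow as tf

def build_cat_vs_human_model(num_classes=2):
    """Reuse ImageNet features from Inception V3 and retrain only the final layers."""
    base = tf.keras.applications.InceptionV3(
        weights="imagenet", include_top=False, input_shape=(299, 299, 3))
    base.trainable = False  # freeze the pre-trained convolutional layers
    x = tf.keras.layers.GlobalAveragePooling2D()(base.output)
    x = tf.keras.layers.Dense(128, activation="relu")(x)
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)
    model = tf.keras.Model(base.input, outputs)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once this head is trained, the same model can be handed to an FGSM-style routine like the one sketched earlier to search for the smallest epsilon that flips its prediction.&lt;/p&gt;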
&lt;h4 id="steganography"&gt;Steganography&lt;/h4&gt;
&lt;p&gt;I had another idea while reading &lt;a href="http://www.evolvingai.org/fooling"&gt;a great paper which covered using adversarial learning alongside evolutionary networks&lt;/a&gt; to generate images which are not recognizable by humans but which a neural network classifies with 99% confidence. My idea is to apply this same image generation as a form of steganography.&lt;/p&gt;
&lt;p&gt;&lt;img alt="MNIST generated images" src="http://www.evolvingai.org/sites/fish34.cs.uwyo.edu.lab/files/whitenoise_lenet_images_5_runs_0.png"&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="http://www.evolvingai.org/fooling"&gt;Photo: Generated Images from MNIST dataset which the model classifies with high confidence as digits&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;In a time when it seems data we used to consider private (messages to friends and family on your phone, emails to your coworkers, etc.) can now be used to either sell you advertising or be inspected by border agents, I liked the idea of using a Generative Adversarial Network (or GAN) to send hidden messages. All the recipient would need is access to the training data and some information about the architecture. Of course, you could also send the model itself if you can secure the channel you use to send it. Then the recipient could use a self-trained or pretrained model to decode your message.&lt;/p&gt;
&lt;h4 id="some-other-benevolent-adversarial-learning-ideas"&gt;Some other benevolent adversarial learning ideas&lt;/h4&gt;
&lt;p&gt;Some other ideas I thought would be interesting to try are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Adware “Fooling”&lt;ul&gt;
&lt;li&gt;Can you trick your adware classifiers into thinking you are a different demographic? Perhaps keeping predatory advertising contained...&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Poisoning Your Private Data&lt;ul&gt;
&lt;li&gt;Using poisoning attacks, can you obscure your data?&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Investigation of Black Box Deployed Models&lt;ul&gt;
&lt;li&gt;By testing adversarial samples, can we learn more about the structure, architecture and use of ML systems of services we use?&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;??? (Your Idea Here)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I am curious to hear others' ideas on the topic, so please reach out if you can think of an ethical and benevolent application of adversarial learning!&lt;/p&gt;
&lt;h3 id="a-call-to-fellow-european-residents"&gt;A Call to Fellow European Residents&lt;/h3&gt;
&lt;p&gt;I chose to speak on the &lt;a href="https://fahrplan.events.ccc.de/congress/2017/Fahrplan/events.html#resilience"&gt;#34c3 Resiliency track&lt;/a&gt; because the goal of the track resonated with me. It asked for new techniques we can use in the not-always-so-great world we live in, so that we can live closer to the life we might want (for ourselves and for others).&lt;/p&gt;
&lt;p&gt;For EU residents, the passage and upcoming implementation of the General Data Protection Regulation (or GDPR) means we will have more rights than most people in the world regarding how corporations use, store and mine our data. I suggest we use these rights actively and with a communal effort towards exposing poor data management and predatory practices.&lt;/p&gt;
&lt;p&gt;In addition, adversarial techniques greatly benefit from more information. Knowing more about the system you are interacting with, and about the possible features or model types used, will give you an advantage when crafting your adversarial examples.&lt;sup id="fnref:5"&gt;&lt;a class="footnote-ref" href="#fn:5"&gt;5&lt;/a&gt;&lt;/sup&gt; In GDPR, there is a section which has often been cited as a "Right to an Explanation." &lt;a href="https://blog.kjamistan.com/gdpr-you-my-talk-at-cloudera-sessions-munchen.html"&gt;Although I have covered that this is much more likely to be enforced as a "Right to be Informed,"&lt;/a&gt; I suggest we EU residents utilize this portion of the regulation to inquire about use of our data and automated decisions via machine learning at companies whose services we use. If you live in Europe and are concerned about how a large company might be mining, using or selling your data, GDPR allows you more rights to determine if this is the case. Let's use GDPR to the fullest and share information gleaned from it with one another.&lt;/p&gt;
&lt;p&gt;A few articles of late about GDPR caught my eye. Mainly &lt;a href="https://www.brentozar.com/archive/2017/12/gdpr-stopped-selling-stuff-europe/"&gt;(my fellow) Americans complaining about implementation hassles and choosing to opt-out&lt;/a&gt;. Despite the ignorant takes, I was heartened by &lt;a href="https://lobste.rs/s/gbty61/gdpr_why_we_stopped_selling_stuff_europe"&gt;several threads from other European residents pointing out the benefits of the regulation&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I would love to see GDPR lead to the growth of privacy-concerned ethical data management companies. I would love to even pay for a service if they promised to not sell my data. I want to live in a world where the "free market" system then allows for ME as a consumer to choose someone to manage my data who has similar ethical views on the use of computers and data.&lt;/p&gt;
&lt;p&gt;If your startup, company or service offers these types of protections, please write me. I am excited to see the growth of this mindset, both in Europe and hopefully worldwide.&lt;/p&gt;
&lt;h3 id="my-talk-slides-video"&gt;My Talk Slides &amp;amp; Video&lt;/h3&gt;
&lt;p&gt;If you are interested in checking out my slides, here they are!&lt;/p&gt;
&lt;iframe src="https://docs.google.com/presentation/d/e/2PACX-1vTOPhtHIQunlU6h6oI2mt5h44oWayL8l7cI6FCNebTfcNKwvbdfMyoRAT6OOHs6rMewizzif7kW4n_u/embed?start=false&amp;loop=false&amp;delayms=3000" frameborder="0" width="960" height="569" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"&gt;&lt;/iframe&gt;

&lt;p&gt;Video:&lt;/p&gt;
&lt;iframe width="560" height="315" src="https://www.youtube.com/embed/BVJT-sE0WWQ" frameborder="0" gesture="media" allow="encrypted-media" allowfullscreen&gt;&lt;/iframe&gt;

&lt;h4 id="slide-references-in-order"&gt;Slide References (in order)&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.theinquirer.net/inquirer/news/3023199/apples-face-id-tech-cant-tell-two-chinese-women-apart"&gt;
Apple's Face ID tech can't tell two Chinese women apart&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/kdd05lowd.pdf"&gt;Adversarial Learning (2005)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://web.archive.org/web/20110822071402/http://mls-nips07.first.fraunhofer.de/"&gt;NIPS Workshop: ML in Adversarial Environments for Computer Security&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/1206.6389"&gt;Poisoning Attacks against Support Vector Machines&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/1312.6199"&gt;Intriguing Properties of Neural Networks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://blog.datumbox.com/tuning-the-learning-rate-in-gradient-descent/"&gt;Stochastic Gradient Descent Image&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/1707.07397"&gt;Synthesizing Robust Adversarial Examples&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/1511.05122"&gt;Adversarial Manipulation of Deep Representations&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/1602.02697"&gt;Practical BlackBox Attacks Against Machine Learning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/1605.07277"&gt;Transferability in Machine Learning: from Phenomena to Black-Box Attacks using Adversarial Samples&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/tensorflow/cleverhans"&gt;cleverhans&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/cchio/deep-pwning"&gt;deep-pwning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/vu-aml/adlib"&gt;Vanderbilt Computational Economics Research Lab adlib&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/lts4/deepfool"&gt;DeepFool&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/bethgelab/foolbox"&gt;FoolBox&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Evolving-AI-Lab/fooling"&gt;Evolving AI: Fooling&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.evolvingai.org/fooling"&gt;Evolving AI: Fooling - Deep neural networks are easily fooled: High confidence predictions for unrecognizable images&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.privacy-regulation.eu/en/r71.htm"&gt;GDPR: Recital 71&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;I am not a big fan of moral labels, so I use this term as it is widely understood. A much longer description of adversarial learning for ethical privacy-concerned motivations seemed like too long of a title and description, but that is my belief and intention. :)&amp;#160;&lt;a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;I think it's not a great implementation due to the fact that I don't work in computer vision and I used a few publicly available datasets with no extra alterations, but it did work for this purpose. If I was doing more than a proof of concept, I would likely spend time adding perturbations to the initial input (cropping, slicing), and find varied datasets.&amp;#160;&lt;a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:3"&gt;
&lt;p&gt;Irina's &lt;a href="http://pyvideo.org/pydata-berlin-2017/deep-learning-for-detection-on-a-phone.html"&gt;awesome PyData Berlin 2017 talk on deep learning for computer vision on a mobile phone&lt;/a&gt; is not to be missed!&amp;#160;&lt;a class="footnote-backref" href="#fnref:3" title="Jump back to footnote 3 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:4"&gt;
&lt;p&gt;Facebook has recently released the ability to &lt;a href="https://www.cnet.com/how-to/how-to-opt-out-facebooks-new-facial-recogition-feature/"&gt;opt-out of suggested facial recognition&lt;/a&gt;. This was, however, more of a proof-of-concept than a "Facebook Fooling" experiment.&amp;#160;&lt;a class="footnote-backref" href="#fnref:4" title="Jump back to footnote 4 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:5"&gt;
&lt;p&gt;However, this is not required. In fact, Nicolas Papernot has a series of great papers covering successful &lt;a href="https://arxiv.org/abs/1602.02697"&gt;black box attacks&lt;/a&gt; which query the model to get training data and then create useful adversarial examples as well as &lt;a href="https://arxiv.org/abs/1605.07277"&gt;transferability&lt;/a&gt; which shows you can use adversarial examples from one type of model to fool a different network or model with varying rates of success.&amp;#160;&lt;a class="footnote-backref" href="#fnref:5" title="Jump back to footnote 5 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="conferences"></category></entry><entry><title>Towards Interpretable Reliable Models</title><link href="https://blog.kjamistan.com/towards-interpretable-reliable-models.html" rel="alternate"></link><published>2017-10-29T00:00:00+02:00</published><updated>2017-10-29T00:00:00+02:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2017-10-29:/towards-interpretable-reliable-models.html</id><summary type="html">&lt;p&gt;I presented a keynote at &lt;a href="https://pydata.org/warsaw2017/"&gt;PyData Warsaw&lt;/a&gt; on moving toward interpretable reliable models. The talk was inspired by some of the work I admire in the field as well as a fear that if we do not address interpretable models as a community, we will be factors in our own …&lt;/p&gt;</summary><content type="html">&lt;p&gt;I presented a keynote at &lt;a href="https://pydata.org/warsaw2017/"&gt;PyData Warsaw&lt;/a&gt; on moving toward interpretable reliable models. The talk was inspired by some of the work I admire in the field as well as a fear that if we do not address interpretable models as a community, we will be factors in our own demise. In my talk, I addressed some of the main reasons I believe interpretability is important for the data science and machine learning community.&lt;/p&gt;
&lt;h3 id="why-care-about-interpretability"&gt;Why Care About Interpretability?&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;If we become so removed from the average person's understanding that we see it as a burden and a nuisance even to address their concerns, we will find ourselves the target of a cultural, political or regulatory backlash.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If we build interpretability, we allow area experts and our end users to give us realistic feedback and help improve our model overall. They can help us diagnose noise, see correlations and find better labels.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href="https://blog.kjamistan.com/gdpr-you-my-talk-at-cloudera-sessions-munchen.html"&gt;GDPR&lt;/a&gt; and other regulations are pushing for more transparency. If we fear or run from transparency, then we might want to ask ourselves WHY. Is it because we fear the gap between our user's understanding of models and our own explanation? If so, is it just a matter of some technical literacy? OR, is it because we aren't proud of the way we are using their data and perhaps our models are extensions of unethical or immoral decisions made in the preprocessing, training or use case.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Models can become racist, sexist and display other issues that are present in the data (often found in language data and crowdsourced data). If you are interested in reading more, I have &lt;a href="https://blog.kjamistan.com/pydata-amsterdam-keynote-on-ethical-machine-learning.html"&gt;a whole talk on this as well&lt;/a&gt;, or just start with the &lt;a href="https://www.princeton.edu/~aylinc/papers/caliskan-islam_semantics.pdf"&gt;amazing article on stereotypes in word vectors by Aylin Caliskan-Islam et al&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Now you are convinced interpretability might be useful, yes? So, where do we go from here? For better or worse, this is still a very open and broad area of research. I'll summarize a few libraries and papers you can use to get started immediately, as well as some problems in the space which are still areas of active research.&lt;/p&gt;
&lt;h3 id="what-can-i-do-now"&gt;What Can I Do Now?&lt;/h3&gt;
&lt;p&gt;There are several interesting open-source libraries which you can use to get started with interpretability. I highlighted a few in my talk, but there are &lt;strong&gt;many&lt;/strong&gt; more. I will try to outline a few of the interesting ones I found including some I didn't have time to outline in my talk.&lt;/p&gt;
&lt;h5 id="classification-explanations"&gt;Classification Explanations&lt;/h5&gt;
&lt;p&gt;This is currently the space that has the most open-source tools available; so if you are working on classifiers, the good news is there is more than one tool you can use.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;LIME (Local Interpretable Model-agnostic Explanations): &lt;a href="https://github.com/marcotcr/lime"&gt;GitHub&lt;/a&gt; and &lt;a href="https://arxiv.org/abs/1602.04938"&gt;Paper&lt;/a&gt; -- Find subsets of your data which can explain the model at a local level (a short usage sketch follows this list).&lt;/li&gt;
&lt;li&gt;eli5 (explain to me like I'm five): &lt;a href="https://github.com/TeamHG-Memex/eli5"&gt;GitHub&lt;/a&gt; Open-source library with great documentation allowing you to build visual explanations of classifiers and regression models.&lt;/li&gt;
&lt;li&gt;Sklearn-ExpertSys: &lt;a href="https://github.com/tmadl/sklearn-expertsys"&gt;GitHub&lt;/a&gt; -- Decision and Rule-based sets for Classifiers. I personally haven't had a chance to use this yet, but plan to do so as part of a longer blog series.&lt;/li&gt;
&lt;/ul&gt;
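&lt;p&gt;To give a feel for how lightweight these tools are to try, here is a minimal LIME sketch for a text classifier (assuming a scikit-learn pipeline whose &lt;code&gt;predict_proba&lt;/code&gt; accepts raw strings; the function name and arguments are my own placeholders):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from lime.lime_text import LimeTextExplainer

def explain_prediction(pipeline, text, class_names):
    """Print the words that pushed the pipeline toward its prediction for one document."""
    explainer = LimeTextExplainer(class_names=class_names)
    # The pipeline should include the vectorizer so predict_proba works on raw strings
    explanation = explainer.explain_instance(text, pipeline.predict_proba, num_features=6)
    for word, weight in explanation.as_list():
        print(f"{word}: {weight:+.3f}")
&lt;/code&gt;&lt;/pre&gt;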
&lt;h5 id="neural-network-architectures"&gt;Neural Network Architectures&lt;/h5&gt;
&lt;ul&gt;
&lt;li&gt;Attention-Based Networks: Attention RNNs are useful for determining what the network has learned, because we can inspect what the network attends to in its memory. This is especially meaningful for image-based networks, since we can then "see" the clusters of pixels the network focuses on. For more reading, check out: &lt;a href="https://papers.nips.cc/paper/5166-training-and-analysing-deep-recurrent-neural-networks.pdf"&gt;Training and Analyzing Deep RNNs&lt;/a&gt;, &lt;a href="https://aclweb.org/anthology/D/D15/D15-1044.pdf"&gt;A Neural Attention Model for Sentence Summarization&lt;/a&gt; and &lt;a href="https://arxiv.org/pdf/1502.03044.pdf"&gt;Show, Attend and Tell: Neural Image Caption Generation with Visual Attention&lt;/a&gt; for a start.&lt;/li&gt;
&lt;li&gt;Generator-Encoder Rationales: &lt;a href="https://github.com/taolei87/rcnn"&gt;GitHub&lt;/a&gt; and &lt;a href="https://people.csail.mit.edu/taolei/papers/emnlp16_rationale.pdf"&gt;Paper&lt;/a&gt; Great paper and library which shows a method of generating smaller rationales using phrases from the text for several NLP tasks including multi-aspect sentiment analysis.&lt;/li&gt;
&lt;/ul&gt;
&lt;h5 id="other-useful-open-source-tools-and-notebooks"&gt;Other useful open-source tools and notebooks&lt;/h5&gt;
&lt;ul&gt;
&lt;li&gt;YellowBrick: &lt;a href="https://github.com/DistrictDataLabs/yellowbrick"&gt;GitHub&lt;/a&gt; -- Data Visualization library aimed at making visual explanations easier. I have so far only played around with this for data exploration, not for explaining models, but I am curious to hear your experience!&lt;/li&gt;
&lt;li&gt;MMD-critic: &lt;a href="https://github.com/BeenKim/MMD-critic"&gt;GitHub&lt;/a&gt; A meaningful approach to sampling! Google Brain resident Been Kim also wrote &lt;a href="http://people.csail.mit.edu/beenkim/papers/KIM2016NIPS_MMD.pdf"&gt;an accompanying paper&lt;/a&gt; which explains how this library works to help you sample&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/ianozsvald/data_science_delivered/blob/master/ml_explain_regression_prediction.ipynb"&gt;Ian Ozsvald's Notebook using eli5&lt;/a&gt;: Ian and I have been chatting about these libraries, and I asked him to continue to update and elaborate his own use of tools like eli5. Updates will come as well, so check back!&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/eBay/bayesian-belief-networks"&gt;Bayesian Belief Networks&lt;/a&gt;: Probabilistic Programming is cool again! (or always was... probably?) This is one of &lt;em&gt;many&lt;/em&gt; libraries you can use for building Bayesian networks. Although this may not fit your definition of interpretability (if you have to expose this to the end-client they may not be able to make sense of it), it is worth exploring for your own probabilistic models.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;There are many more, which I hope to write about over the coming weeks in a series of blog posts and notebooks as I explore what I call: reverse engineering for model interpretability AND MVM: Minimal Viable Models. (more on this to come so check back or follow me on Twitter... 😉)&lt;/p&gt;
&lt;h3 id="what-is-still-unsolved"&gt;What is Still Unsolved?&lt;/h3&gt;
&lt;p&gt;Plenty. If you are a graduate student or you work in a research lab or you work with unlimited access to TPUs (ahem..), please help this area of research. Here are a few things that are still very difficult.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Interpretable views of neural networks: I don't mean the one part of ImageNet where you can see a face. I mean actual interpretation of neural networks in a meaningful and statistically significant way.&lt;/li&gt;
&lt;li&gt;Multidimensional Projections: Finding ways to explain models or clusters using 2-D or 3-D visualizations of multi-dimensional space is difficult at best and error-prone at worst. Watch &lt;a href="https://www.youtube.com/watch?v=UkmIljRIG_M"&gt;Matti Lyra's PyData Talk on Topic Modeling for some insight&lt;/a&gt;. Or follow up with research from the fields of multi-dimensional distance metrics as well as unsupervised learning.&lt;/li&gt;
&lt;li&gt;Kagglefication: Ensembles are killing us, with some sort of averaged metric I wish I could explain... 😝 But honestly, if we gamify machine learning, do we run the risk of making our own work in the field into an optimization game where the only metric is our f1 score? I hope not, but it makes me fearful sometimes... I fear we often find ways to boost or over-engineer our features to the point that we can no longer interpret the metrics and measurements we have created. This is a problem.&lt;/li&gt;
&lt;li&gt;Finding representative samples and ensuring our labels are useful: It's difficult enough to explain models that you know were trained on meticulously documented labels. This becomes much more difficult in the "real world" where tags or labels might at times be high-quality or in other moments be garbage (or entirely absent...).&lt;/li&gt;
&lt;li&gt;Measuring Interpretability: Until there is a built-in &lt;code&gt;sklearn.metrics.interpret&lt;/code&gt;, I'm not certain how widespread interpretability metrics or their usage will be. Even defining how we might calculate such a metric is difficult. Although we can build upon probabilistic models and cognitive science theory, how can we easily compare the interpretability of a text explanation with that of a regression model? Research is clear that this is &lt;em&gt;not&lt;/em&gt; impossible to do, so I hope we can find a solution which allows us to optimize for a metric like interpretability...&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;There are likely many more areas of research and concern, but these are the ones that, for me, struck a chord and seemed obvious areas we, as an open-source community, can work on. If you know of papers or research in the area, I am all ears! I hope this small post has at least inspired you to have more conversations with peers or colleagues around the subject of interpretability, which is a good start.&lt;/p&gt;
&lt;h4 id="my-slides-talk"&gt;My Slides / Talk&lt;/h4&gt;
&lt;p&gt;If you are curious about my slides, I have posted them below.&lt;/p&gt;
&lt;iframe src="https://docs.google.com/presentation/d/e/2PACX-1vR05kpagAbL5qo1QThxwu44TI5SQAws_UFVg3nUAmKp39uNG0xdBjcMA-VyEeqZRGGQtt0CS5h2DMTS/embed?start=false&amp;loop=false&amp;delayms=3000" frameborder="0" width="960" height="569" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"&gt;&lt;/iframe&gt;

&lt;p&gt;The video is available here:&lt;/p&gt;
&lt;iframe width="560" height="315" src="https://www.youtube.com/embed/B3PtcF-6Dtc?list=PLGVZCDnMOq0oe0eD-edj_2CuBIZ938bWT" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen&gt;&lt;/iframe&gt;

&lt;p&gt;Please continue the conversation in the comments below, or feel free to reach out on Twitter (&lt;a href="https://twitter.com/kjam"&gt;@kjam&lt;/a&gt;).&lt;/p&gt;</content><category term="conferences"></category></entry><entry><title>GDPR &amp; You: My Talk at Cloudera Sessions München</title><link href="https://blog.kjamistan.com/gdpr-you-my-talk-at-cloudera-sessions-munchen.html" rel="alternate"></link><published>2017-10-11T00:00:00+02:00</published><updated>2017-10-11T00:00:00+02:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2017-10-11:/gdpr-you-my-talk-at-cloudera-sessions-munchen.html</id><summary type="html">&lt;p&gt;Unless you have been avoiding all news, you have likely heard of the coming changes in European privacy regulations which go into effect in May 2018. The changes are covered under the General Data Privacy Regulation Directive, whose final text was made available in May 2016.&lt;/p&gt;
&lt;p&gt;I presented a talk …&lt;/p&gt;</summary><content type="html">&lt;p&gt;Unless you have been avoiding all news, you have likely heard of the coming changes in European privacy regulations which go into effect in May 2018. The changes are covered under the General Data Protection Regulation (GDPR), whose final text was made available in May 2016.&lt;/p&gt;
&lt;p&gt;I presented a talk at &lt;a href="http://go.cloudera.com/cloudera-sessions-2017-munich"&gt;Cloudera Sessions Munich&lt;/a&gt; covering a few topics I found interesting on data privacy and security overall (not just for GDPR). Although inspired by some of the GDPR provisions, my talk focused on how a few areas might be impacted by the regulation and dove into how companies can take GDPR as a suggestion to start taking ethical data science more seriously.&lt;/p&gt;
&lt;p&gt;The main takeaways I wanted to share are:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;1. GDPR doesn't require ethical or even interpretable machine learning. But you should be doing this anyways, right?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;There are a lot of scary articles out there, usually by someone with half of a clue, talking about how GDPR is going to kill artificial intelligence in Europe as we know it. They cite a paragraph in a recital which calls for the ability to explain automated decisions and processing to the data subject (aka client / user / you &amp;amp; me).&lt;/p&gt;
&lt;p&gt;However, if you take time to read the text of GDPR as well as consult several legal papers on the topic, it is fairly clear that this right doesn't exist the way it's being spread in the headlines. A great paper on this topic is &lt;a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2903469"&gt;Why a Right to Explanation of Automated Decision-Making Does Not Exist in the General Data Protection Regulation (Wachter et al., 2017)&lt;/a&gt;, where they delve into the potential legal implications of this section of the regulation and explain that it is highly likely this will be interpreted as a right to be informed.
That said, if you cannot explain your model at all, doesn't that concern you? As a data scientist and machine learning practitioner, it bothers me! In fact, I think if we were required to explain our models more often, this might lead to a better understanding of our problem space, innovative new ways to measure or classify our results and more ethical models. Why? Because if I take the time to create an interpretable model, I not only can better explain why it behaves that way, but I can also see if perhaps there has been some "data leakage", which means my model has learned something I wanted to avoid (i.e. &lt;a href="https://blog.kjamistan.com/embedded-isms-in-vector-based-natural-language-processing.html"&gt;how to be racist or sexist&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;So how do we promote more interpretability within the community? Interpretability in machine learning has already been a topic for several years, with &lt;a href="https://sites.google.com/site/nips2016interpretml/papers-1"&gt;workshops&lt;/a&gt;, &lt;a href="https://www.stat.washington.edu/research/reports/2012/tr609.pdf"&gt;great papers&lt;/a&gt;, &lt;a href="https://github.com/marcotcr/lime"&gt;open-source libraries&lt;/a&gt; and &lt;a href="https://www.oreilly.com/ideas/ideas-on-interpreting-machine-learning"&gt;in-depth blog writeups&lt;/a&gt;. What saddens me is how often the Kaggle-verse somehow values every last half-percentage point of accuracy over anything interpretable. &lt;em&gt;Don't be that person!&lt;/em&gt; Instead, spend time finding a model that you can explain, reason with and defend.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;2. Data privacy is a myth. However, you can do your best at REAL anonymization to protect your customers.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Think your data is private? If you have used a service that uses third-party data processing, had your data released as part of a competition or study, or simply leave default settings on most of your applications and sites, then it is probably not. Why? In a "big data" world, de-anonymization (especially targeted) is trivial.
Research in de-anonymization made a leap in 2008 when &lt;a href="https://www.cs.utexas.edu/~shmat/shmat_oak08netflix.pdf"&gt;Arvind Narayanan and Vitaly Shmatikov published their paper: Robust De-anonymization of Large Sparse Datasets&lt;/a&gt;. The researchers had successfully de-anonymized users in the data released for the Netflix Prize. This data was released knowingly by Netflix and, according to Netflix, had been properly anonymized. The paper was well-received and Narayanan went on to do further research on de-anonymization. It is also worth reading just for the fantastic burns.&lt;/p&gt;
&lt;blockquote class="twitter-tweet" data-lang="en"&gt;&lt;p lang="en" dir="ltr"&gt;Peak joy: I have a *real* reason to read the Netflix de-anon paper in entirety. And let me tell you, it is full of 🔥 &lt;a href="https://t.co/KU5HPaDLoE"&gt;https://t.co/KU5HPaDLoE&lt;/a&gt; &lt;a href="https://t.co/dmgHGqvg04"&gt;pic.twitter.com/dmgHGqvg04&lt;/a&gt;&lt;/p&gt;&amp;mdash; katharine jarmul (@kjam) &lt;a href="https://twitter.com/kjam/status/914108189582483456?ref_src=twsrc%5Etfw"&gt;September 30, 2017&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async src="//platform.twitter.com/widgets.js" charset="utf-8"&gt;&lt;/script&gt;

&lt;p&gt;Andreas Dewes and several reporters from NDR and ARD researched this same topic recently, presenting the findings in a &lt;a href="https://re-publica.com/de/17/session/nackt-im-netz-unternehmen-intimste-daten-sammeln-tauschen-und-verkaufen-und-was-uns"&gt;re:publica talk #NacktimNetz&lt;/a&gt; (Note: it is in German, but they also presented at DefCon and that video should be available soon). They were able to very easily get hold of click-stream data for German politicians, police officers and public servants via a third-party company selling individuals' complete URL streams. Without great difficulty, they could find personally identifiable information in the data and de-anonymize a person's complete browsing history.
So what can you do as a person handling potentially sensitive user data? Mainly, &lt;em&gt;don't be evil&lt;/em&gt; (but no, really this time...). Don't sell your customer data to third-parties. Don't release it as a competition because it will be fun. Don't give it to anyone. Don't keep it connected to the public internet with default passwords. Just, be smart about it. And if you do choose to give it, sell it or release it, know that you need to &lt;em&gt;really&lt;/em&gt; think about what that might mean WHEN someone deanonymizes it.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;3. Data portability will hopefully inspire and encourage more competition.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;A ray of hope in this slightly grim blog post is the GDPR articles related to data portability. To me, this is perhaps the most exciting part of GDPR and holds quite a lot of power if implemented properly. Of course, there is quite a lot of debate surrounding how this will actually be enforced by the courts.&lt;/p&gt;
&lt;p&gt;The &lt;a href="http://ec.europa.eu/newsroom/just/item-detail.cfm?item_id=50083"&gt;working party document&lt;/a&gt; is fairly clear about its interpretation, stating that:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In this regard, WP29 considers that the right to data portability covers data provided knowingly and actively by the data subject as well as the personal data generated by his or her activity. This new right cannot be undermined and limited to the personal information directly communicated by the data subject, for example, on an online form.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;To me, this sounded like the competition created by mobile phone number portability. I decided to read a bit about how that was implemented in Europe, and found several interesting papers related to the topic, including &lt;a href="https://www.researchgate.net/publication/222649110_Mobile_number_portability_in_Europe"&gt;Mobile number portability in Europe (Buehler et al., 2005)&lt;/a&gt;, which explored pricing and its relation to the number of people switching carriers. Via some networking folks, I found anecdotal evidence that in areas where startups and smaller network carriers were competing with the larger companies on features, a high proportion of mobile users ported their numbers.&lt;/p&gt;
&lt;p&gt;For me, data portability opens up this same door. What if I could get all of my location data and port it to a new company? What if I could choose who I use for my language learning apps and port data easily between them?&lt;/p&gt;
&lt;p&gt;The possibility of real competition over who is a better data guardian, and who has better features, better security and better privacy, could be real. This makes me both happy and hopeful.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;In case you want to look through them, you can find my slides here:&lt;/p&gt;
&lt;iframe src="https://docs.google.com/presentation/d/e/2PACX-1vTROyk4ZELHAzjU_DPQFcELCHeLSsGDhxTrTK4c0xd6cR-RL44sAFVnzxU6NtysQSLJKz-b1dXi_bnI/embed?start=false&amp;loop=false&amp;delayms=3000" frameborder="0" width="960" height="569" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"&gt;&lt;/iframe&gt;
&lt;p&gt;If there is video recording, I will post it as well.&lt;/p&gt;
&lt;h4 id="slide-references-in-order-they-were-presented"&gt;Slide References (in order they were presented)&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2903469"&gt;Why a Right to Explanation of Automated Decision-Making Does Not Exist in the General Data Protection Regulation (Wachter et al., 2017)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.oreilly.com/ideas/ideas-on-interpreting-machine-learning"&gt;O'Reilly Post: Ideas on Interpreting Machine Learning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://science.sciencemag.org/content/356/6334/183"&gt;Semantics derived automatically from language corpora contain human-like biases (Caliskan et al., 2017)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.cs.utexas.edu/~shmat/shmat_oak08netflix.pdf"&gt;Robust De-anonymization of Large Sparse Datasets (Narayanan et al. 2008)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://re-publica.com/de/17/session/nackt-im-netz-unternehmen-intimste-daten-sammeln-tauschen-und-verkaufen-und-was-uns"&gt;Andreas Dewes: re:publica talk #NacktimNetz&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://ec.europa.eu/newsroom/just/item-detail.cfm?item_id=50083"&gt;Article 29 working party document&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.researchgate.net/publication/222649110_Mobile_number_portability_in_Europe"&gt;Mobile number portability in Europe (Buehler et al., 2005)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://hbr.org/2016/08/the-barriers-big-companies-face-when-they-try-to-act-like-lean-startups"&gt;HBR: The Barriers Big Companies Face When They Try to Act Like Lean Startups&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</content><category term="conferences"></category></entry><entry><title>Algorithmic Art and "Künstliche Kunst"</title><link href="https://blog.kjamistan.com/algorithmic-art-and-kunstliche-kunst.html" rel="alternate"></link><published>2017-10-07T00:00:00+02:00</published><updated>2017-10-07T00:00:00+02:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2017-10-07:/algorithmic-art-and-kunstliche-kunst.html</id><summary type="html">&lt;p&gt;I was invited to give a talk at &lt;a href="http://404.ie"&gt;404 Dublin&lt;/a&gt;, a really cool conference joining community groups w/ tech folks and art installations. When thinking of what topics might be of interest to the audience, I selfishly went to one of my (side) passions.. following artists who are doing amazing …&lt;/p&gt;</summary><content type="html">&lt;p&gt;I was invited to give a talk at &lt;a href="http://404.ie"&gt;404 Dublin&lt;/a&gt;, a really cool conference joining community groups w/ tech folks and art installations. When thinking of what topics might be of interest to the audience, I selfishly went to one of my (side) passions.. following artists who are doing amazing things with the intersection of computers and art.&lt;/p&gt;
&lt;p&gt;So, what did I find when left to my own devices, Google, old art books and Twitter?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;1. I found really awesome algorithmic art that is older than I imagined.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Did you know the first publicly shown algorithmic visual art was made on a graph printer called the Zuse Graphomat Z64?&lt;/p&gt;
&lt;p&gt;THIS THING:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Zuse Graphomat Z64" src="https://upload.wikimedia.org/wikipedia/de/thumb/7/7a/Graphomat_Zuse_Z64_1.jpg/562px-Graphomat_Zuse_Z64_1.jpg"&gt;&lt;/p&gt;
&lt;p&gt;(image from Wikimedia)&lt;/p&gt;
&lt;p&gt;HOW COOL IS THAT? And the art was created by a mathematician who studied philosophy under &lt;a href="http://www.max-bense.de/"&gt;Max Bense&lt;/a&gt; at what is now the University of Stuttgart. &lt;a href="http://zkm.de/publikation/georg-nees-kuenstliche-kunst-die-anfaenge"&gt;Georg Nees&lt;/a&gt; (the artist) went on to create famous pieces now on display in galleries around the world, and his thesis on Algorithmic Art is massively hard to find and costs hundreds of Euros (yes, please send me a copy 😂). And yes, he is the one who described the art as "Künstliche Kunst" or "Artificial Art".&lt;sup id="fnref:1"&gt;&lt;a class="footnote-ref" href="#fn:1"&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;And he wasn't the only one! &lt;a href="https://www.youtube.com/watch?v=8i7uFCK7G0o"&gt;Nanni Balestrini was writing algorithmic poetry&lt;/a&gt; using trained rules in 1961! Georg Nees presented alongside his fellow student &lt;a href="http://www.hfk-bremen.de/en/profiles/n/frieder-nake"&gt;Frieder Nake&lt;/a&gt;. And as you proceed into the 70s, you hit &lt;a href="http://www.aaronshome.com/aaron/aaron/gallery/index.html"&gt;Harold Cohen creating AARON&lt;/a&gt;, a system designed to eventually create AI art. A system he spent 40 YEARS (yes!) working on. And Cybernetic Landscapes by &lt;a href="https://www.sfmoma.org/artwork/2015.9.23"&gt;Aaron Marcus&lt;/a&gt;. If you come across more interesting art in these early times, please post a comment or feel free to message me. I'm fascinated with early applications of "Cybernetics" and computers. 😀&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;2. I realized just how much neural network (inspired or created) art is pushing boundaries today.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I have been following the work of &lt;a href="http://genekogan.com/"&gt;Gene Kogan&lt;/a&gt;, &lt;a href="http://quasimondo.com/"&gt;Mario Klingemann&lt;/a&gt; and &lt;a href="http://memo.tv/"&gt;Memo Akten&lt;/a&gt; for the past year or so because they are amazing, inspiring and doing things I think will change the way we use deep learning in the coming years (I would argue they already &lt;em&gt;are&lt;/em&gt; doing this). If you haven't seen their work yet...&lt;/p&gt;
&lt;blockquote class="twitter-tweet" data-lang="en"&gt;&lt;p lang="en" dir="ltr"&gt;After 7.5 epochs of training we are still in a dark place. &lt;a href="https://t.co/Bs9869YSc6"&gt;pic.twitter.com/Bs9869YSc6&lt;/a&gt;&lt;/p&gt;&amp;mdash; Mario Klingemann (@quasimondo) &lt;a href="https://twitter.com/quasimondo/status/818040490444685312?ref_src=twsrc%5Etfw"&gt;January 8, 2017&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async src="//platform.twitter.com/widgets.js" charset="utf-8"&gt;&lt;/script&gt;

&lt;blockquote class="twitter-tweet" data-lang="en"&gt;&lt;p lang="en" dir="ltr"&gt;neural glitch: billions of computations to evoke the appearance of total disorder. and yet perfectly reproducible... a true dynamical system &lt;a href="https://t.co/3kZnBDLLci"&gt;pic.twitter.com/3kZnBDLLci&lt;/a&gt;&lt;/p&gt;&amp;mdash; Gene Kogan (@genekogan) &lt;a href="https://twitter.com/genekogan/status/911228868702408704?ref_src=twsrc%5Etfw"&gt;September 22, 2017&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async src="//platform.twitter.com/widgets.js" charset="utf-8"&gt;&lt;/script&gt;

&lt;p&gt;Memo Akten's &lt;a href="http://www.memo.tv/pattern-recognition/"&gt;Pattern Recognition&lt;/a&gt;
&lt;img alt="Pattern Recognition by Memo Akten" src="http://www.memo.tv/wpdev/wp-content/uploads/pr_alex_nat_30s_10fps.gif"&gt;&lt;/p&gt;
&lt;p&gt;Yr welcome! 😉&lt;/p&gt;
&lt;p&gt;But I also came across several artists and other persons in the field I hadn't heard of yet whose work I found really interesting, including:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Emily Daniels' work on creative poetry and &lt;a href="https://twitter.com/ker00lf"&gt;@ker00lf&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://twitter.com/elluba"&gt;Luba Elliott&lt;/a&gt; is essentially the mafia boss of creative AI, sharing her research, curation and experiments via her site, talks, newsletter and work.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://jonaslund.biz/"&gt;Jonus Lund&lt;/a&gt;'s trippy, whimsical and political views on our digital world&lt;/li&gt;
&lt;li&gt;&lt;a href="http://sebastianschmieg.com/works/lstm/"&gt;Sebastien Schmeig&lt;/a&gt;'s fantastic take on Futurism and AI: LSTM&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.jakeelwes.com/"&gt;Jake Elwes&lt;/a&gt; between creating neural networks trained on pornography, to his "closed loop" and auto-encoded Buddha...&lt;/li&gt;
&lt;li&gt;&lt;a href="https://video.vice.com/en_us/video/superhypercube/58065419aec9b98a0b3bc15d"&gt;SuperHyperCube&lt;/a&gt;: a VR game created by a collective of artists&lt;/li&gt;
&lt;li&gt;&lt;a href="http://alteredqualia.com/xg/examples/eyes_gaze3.html"&gt;Eyes Gaze&lt;/a&gt;: Neural network generated portraits using DeepGaze to create creepy and surreal images and interactions by Mike Tyka.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;There were more too, but those were some of my favorites. The more I looked, the more I realized I needed to visit galleries more. At least 3 of the artists I was inspired by had shown pieces in Berlin in the last year. Time to get off the computer and start &lt;em&gt;experiencing&lt;/em&gt; art.&lt;/p&gt;
&lt;p&gt;And I got to re-investigate some of the artists I feel like are making commentary on how AI and mainstream machine learning are affecting society, including:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="http://jamesbridle.com/works"&gt;James Bridle&lt;/a&gt;: both Citizen Ex and Autonomous Trap were spectacular&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.engadget.com/2017/06/16/ai-weiwei-hansel-and-gretel-surveillance/"&gt;Ai Weiwei's Hansel and Gretel&lt;/a&gt;: W T F&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;3. It isn't very difficult to get started generating your own neural network art.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;On my small laptop GPU, I was able to train several generative text networks using LSTM (long short-term memory) networks. The output usually just made me laugh, as you need quite a lot of interesting data to make them work well.&lt;sup id="fnref:2"&gt;&lt;a class="footnote-ref" href="#fn:2"&gt;2&lt;/a&gt;&lt;/sup&gt; I started with the usual suspects (a minimal training sketch follows the list below):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="http://karpathy.github.io/2015/05/21/rnn-effectiveness/"&gt;Andreas Karpathy's RNN LSTM&lt;/a&gt; (which is a great read if you haven't already done so)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/sherjilozair/char-rnn-tensorflow"&gt;A tensorflow-backed character RNN&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/fchollet/keras/blob/master/examples/lstm_text_generation.py"&gt;Keras Generative LSTM&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
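&lt;p&gt;For the curious, here is a rough, minimal sketch of what a character-level LSTM text generator looks like in Keras, loosely following the examples linked above. The corpus path, layer sizes and training settings are placeholders for illustration, not the exact ones I used.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Minimal character-level LSTM text generator (sketch; settings are illustrative).
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense

text = open('lyrics.txt').read().lower()   # placeholder corpus
chars = sorted(set(text))
char_to_idx = {c: i for i, c in enumerate(chars)}

# Cut the corpus into overlapping windows of maxlen characters.
maxlen, step = 40, 3
sentences = [text[i:i + maxlen] for i in range(0, len(text) - maxlen, step)]
next_chars = [text[i + maxlen] for i in range(0, len(text) - maxlen, step)]

# One-hot encode the windows and the character each one should predict.
X = np.zeros((len(sentences), maxlen, len(chars)), dtype=np.bool_)
y = np.zeros((len(sentences), len(chars)), dtype=np.bool_)
for i, sentence in enumerate(sentences):
    for t, c in enumerate(sentence):
        X[i, t, char_to_idx[c]] = 1
    y[i, char_to_idx[next_chars[i]]] = 1

model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, len(chars))))
model.add(Dense(len(chars), activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
model.fit(X, y, batch_size=128, epochs=20)
&lt;/code&gt;&lt;/pre&gt;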
&lt;p&gt;But there were even more interesting takes out there:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/rossgoodwin/neuralsnap"&gt;NeuralSnap&lt;/a&gt;: Poetry generated from images&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/emdaniels/poetic-inner-join"&gt;E.M. Daniels poetic inner join&lt;/a&gt;: Joining poetry together using Bayesian probability and RNNs&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And of course plenty for visual art:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://affinelayer.com/pixsrv"&gt;Pix2Pix&lt;/a&gt;: Coloring based on trained networks for edges&lt;/li&gt;
&lt;li&gt;&lt;a href="http://sites.skoltech.ru/compvision/projects/deepwarp/"&gt;DeepWarp&lt;/a&gt;: Gaze images&lt;/li&gt;
&lt;li&gt;&lt;a href="http://openframeworks.cc/"&gt;OpenFrameworks&lt;/a&gt;: C++-based creative coding suite&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you are looking for inspiration, you might want to start with these:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="http://creative.ai"&gt;Creative.ai&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://ml4a.github.io"&gt;Machine Learning for Artists (by Gene Kogan)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://elluba.com/"&gt;Luba Elliott's Newsletter&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And there are so many more to play with. I'd love to hear about your favorites or feel free to share more artists (new and old) who are pushing the boundaries with neural networks and art in the comments.&lt;/p&gt;
&lt;p&gt;Finally, if you want to peruse them, here are my slides:&lt;/p&gt;
&lt;iframe src="https://docs.google.com/presentation/d/e/2PACX-1vR8OLLnb1iRB9w9MXMMMq7iJ-iKLRfpYzjvdmxmfbi9zbq5jI8xR9gh9LUdF90J71VgxjgiCSwk4_3g/embed?start=false&amp;loop=false&amp;delayms=3000" frameborder="0" width="960" height="569" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"&gt;&lt;/iframe&gt;

&lt;p&gt;I will post video if it's shared publicly. Special thanks to &lt;a href="https://www.linkedin.com/in/vickyleeire/"&gt;Vicky Lee&lt;/a&gt; for making my talk at 404 possible.&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;To see a great interview including this "Künstliche Kunst" conversation alongside some of Nees' algorithmic contemporaries, check out &lt;a href="https://www.youtube.com/watch?v=ugLopHSPQH4"&gt;Early Computer art, man-machine&lt;/a&gt;.&amp;#160;&lt;a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;I tried making one with Tupac songs, but it quickly devolved from cursing into gibberish. Erykah Badu was my next goal, but I simply didn't have enough content. Then I set my eyes on James Joyce -- perhaps a bit too high brow for my small GPU. And finally had a lot of fun with U2 lyrics, leading to fun excerpts like those I showed in my talk as well as these gems: "oh my heart, love is bloody sunday", "i've got to get you, got to get you, got to get you...", and one particularly lulzy one "la la la la la la la la la la la la la la la la la".&amp;#160;&lt;a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="conferences"></category></entry><entry><title>Comparing scikit-learn Text Classifiers on a Fake News Dataset</title><link href="https://blog.kjamistan.com/comparing-scikit-learn-text-classifiers-on-a-fake-news-dataset.html" rel="alternate"></link><published>2017-08-28T00:00:00+02:00</published><updated>2017-08-28T00:00:00+02:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2017-08-28:/comparing-scikit-learn-text-classifiers-on-a-fake-news-dataset.html</id><summary type="html">&lt;p&gt;Finding ways to determine fake news from real news is a challenge most Natural Language Processing folks I meet and chat with want to solve. There is significant difficulty in doing this properly and without penalizing real news sources.&lt;/p&gt;
&lt;p&gt;I was discussing this problem with Miguel Martinez-Alvarez on my last …&lt;/p&gt;</summary><content type="html">&lt;p&gt;Finding ways to determine fake news from real news is a challenge most Natural Language Processing folks I meet and chat with want to solve. There is significant difficulty in doing this properly and without penalizing real news sources.&lt;/p&gt;
&lt;p&gt;I was discussing this problem with Miguel Martinez-Alvarez on my last visit to the SignalHQ offices; and his post on &lt;a href="https://miguelmalvarez.com/2017/03/23/how-can-machine-learning-and-ai-help-solving-the-fake-news-problem/"&gt;using AI to solve the fake news problem&lt;/a&gt; further elaborates on why this is no simple task.&lt;/p&gt;
&lt;p&gt;I stumbled across a post which built &lt;a href="https://opendatascience.com/blog/how-to-build-a-fake-news-classification-model/"&gt;a classifier for fake news with fairly high accuracy&lt;/a&gt; (and yay! the &lt;a href="https://github.com/GeorgeMcIntire/fake_real_news_dataset"&gt;dataset&lt;/a&gt; was published!). I wanted to investigate whether I could replicate the results and if the classifier actually learned anything useful.&lt;/p&gt;
&lt;h4 id="preparing-the-data"&gt;Preparing the data&lt;/h4&gt;
&lt;p&gt;In my initial investigation, I compared Multinomial Naive Bayes on bag-of-words (CountVectorizer) features as well as on term frequency-inverse document frequency (TfidfVectorizer) features. I also compared a Passive Aggressive linear classifier using the TF-IDF features. The resulting accuracy ranged from 83% to 93%. You can walk through &lt;a href="https://www.datacamp.com/community/tutorials/scikit-learn-fake-news"&gt;my initial investigation published on the DataCamp blog&lt;/a&gt; to read my approach and thoughts (&lt;a href="https://github.com/kjam/random_hackery/blob/master/Attempting%20to%20detect%20fake%20news.ipynb"&gt;a Jupyter notebook of the code&lt;/a&gt; is also available on my GitHub). In summary, the data was messy and I was concerned the features were likely nonsensical.&lt;/p&gt;
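&lt;p&gt;For readers who want the gist without opening the notebook, here is a minimal sketch of that first comparison. The CSV filename, column names and split parameters are assumptions for illustration; the linked notebook has the actual code.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Sketch of the initial comparison (file and column names are assumed).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score

df = pd.read_csv('fake_or_real_news.csv')   # assumed file with 'text' and 'label' columns
X_train, X_test, y_train, y_test = train_test_split(
    df['text'], df['label'], test_size=0.33, random_state=53)

count_vec = CountVectorizer(stop_words='english')
tfidf_vec = TfidfVectorizer(stop_words='english', max_df=0.7)

for name, vec, clf in [('NB + counts', count_vec, MultinomialNB()),
                       ('NB + tf-idf', tfidf_vec, MultinomialNB()),
                       ('PassiveAggressive + tf-idf', tfidf_vec, PassiveAggressiveClassifier())]:
    train_vectors = vec.fit_transform(X_train)   # refits the vectorizer on each pass
    test_vectors = vec.transform(X_test)
    clf.fit(train_vectors, y_train)
    print(name, accuracy_score(y_test, clf.predict(test_vectors)))
&lt;/code&gt;&lt;/pre&gt;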
&lt;h4 id="comparing-different-classification-models"&gt;Comparing different classification models&lt;/h4&gt;
&lt;p&gt;I wanted to take a deeper look into the features and compare them across classifiers. This time I added an additional few classifiers, so overall I would compare:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Multinomial Naive Bayes with Count Vectors&lt;/li&gt;
&lt;li&gt;Multinomial Naive Bayes with Tf-Idf Vectors&lt;/li&gt;
&lt;li&gt;Passive Aggressive linear model with Tf-Idf Vectors&lt;/li&gt;
&lt;li&gt;SVC linear model with Tf-Idf Vectors&lt;/li&gt;
&lt;li&gt;SGD linear model with Tf-Idf Vectors&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Without any parameter tuning, here is a simple ROC curve comparison of the results:
&lt;img alt="Fake and Real News ROC curve" src="https://blog.kjamistan.com/images/2017/08/fake_real_news_simple_roc_curve.png"&gt;
You can see that the linear models are outperforming the Naive Bayes classifiers, and that the accuracy scores are fairly good (even without parameter tuning).&lt;/p&gt;
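&lt;p&gt;If you want to reproduce a plot like this, here is a rough sketch using scikit-learn and matplotlib. It assumes a list of already-fitted (name, classifier, test matrix) tuples and treats REAL as the positive class; the variable names are placeholders rather than the exact ones in my notebook.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Sketch of the ROC comparison: one curve per fitted classifier.
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

plt.figure()
for name, clf, test_vectors in fitted_models:   # assumed: (name, fitted clf, test features)
    if hasattr(clf, 'decision_function'):
        scores = clf.decision_function(test_vectors)
    else:
        scores = clf.predict_proba(test_vectors)[:, 1]   # column 1 is classes_[1], i.e. REAL
    fpr, tpr, _ = roc_curve(y_test == 'REAL', scores)    # treat REAL as the positive label
    plt.plot(fpr, tpr, label='{} (AUC = {:.2f})'.format(name, auc(fpr, tpr)))

plt.plot([0, 1], [0, 1], linestyle='--', label='chance')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend(loc='lower right')
plt.show()
&lt;/code&gt;&lt;/pre&gt;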
&lt;p&gt;So indeed I could replicate the results, but what did the models &lt;em&gt;actually&lt;/em&gt; learn? What features signified real versus fake news?&lt;/p&gt;
&lt;h4 id="introspecting-significant-features"&gt;Introspecting significant features&lt;/h4&gt;
&lt;p&gt;To introspect the models, I used a method I first read about &lt;a href="https://stackoverflow.com/a/26980472"&gt;on StackOverflow&lt;/a&gt; showing how to extract coefficients for binary classification (and therefore show the most significant features for each class). After some extraction, I was able to compare the classifiers with one another. The &lt;a href="https://github.com/kjam/random_hackery/blob/master/Comparing%20Fake%20News%20Classifiers.ipynb"&gt;full notebook for running these extractions&lt;/a&gt; is available on my GitHub. I will summarize some of the findings here.&lt;/p&gt;
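&lt;p&gt;The idea behind that extraction is simple: pair each learned coefficient with its feature name and look at both extremes of the sorted list. Here is a minimal sketch; the helper name and the fitted vectorizer and classifier variables are illustrative, not the exact ones in the notebook.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Sketch of coefficient introspection for a binary classifier.
def most_informative_features(vectorizer, classifier, n=10):
    feature_names = vectorizer.get_feature_names()
    # For a binary problem coef_ has shape (1, n_features): negative weights pull
    # toward classes_[0] (here FAKE), positive weights toward classes_[1] (REAL).
    weighted = sorted(zip(classifier.coef_[0], feature_names))
    return weighted[:n], weighted[-n:][::-1]   # (top FAKE features, top REAL features)

# Assumed fitted objects from the training step above.
fake_feats, real_feats = most_informative_features(tfidf_vec, linear_clf)
&lt;/code&gt;&lt;/pre&gt;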
&lt;h6 id="fake-news-has-lots-of-noisy-identifiers"&gt;Fake news has lots of noisy identifiers&lt;/h6&gt;
&lt;p&gt;For most models, the top features for fake news were almost exclusively noise. Below are the top ten features ranked by weight for the most performant Naive Bayes classifier:&lt;/p&gt;
&lt;table&gt;
  &lt;tr&gt;
    &lt;th&gt;Feature&lt;/th&gt;
    &lt;th&gt;Weight&lt;/th&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;'0000'&lt;/td&gt;
    &lt;td&gt;-16.067750538483136&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;'000035'&lt;/td&gt;
    &lt;td&gt;-16.067750538483136&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;'0001'&lt;/td&gt;
    &lt;td&gt;-16.067750538483136&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;'0001pt'&lt;/td&gt;
    &lt;td&gt;-16.067750538483136&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;'000km'&lt;/td&gt;
    &lt;td&gt;-16.067750538483136&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;'0011'&lt;/td&gt;
    &lt;td&gt;-16.067750538483136&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;'006s'&lt;/td&gt;
    &lt;td&gt;-16.067750538483136&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;'007'&lt;/td&gt;
    &lt;td&gt;-16.067750538483136&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;'007s'&lt;/td&gt;
    &lt;td&gt;-16.067750538483136&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;'008s'&lt;/td&gt;
    &lt;td&gt;-16.067750538483136&lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;

&lt;p&gt;You might notice a pattern, yes? The "top features" all have the same weight and are alphabetical -- when I took a closer look there were more than 20,000 tokens as top performers with the same weight for Naive Bayes.&lt;sup id="fnref:2"&gt;&lt;a class="footnote-ref" href="#fn:2"&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;The top linear model features for fake news looked like this:&lt;/p&gt;
&lt;table&gt;
  &lt;tr&gt;
    &lt;th&gt;Feature&lt;/th&gt;
    &lt;th&gt;Weight&lt;/th&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;'2016'&lt;/td&gt;
    &lt;td&gt;-5.067099443402463&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;'october'&lt;/td&gt;
    &lt;td&gt;-4.2461599700216439&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;'hillary'&lt;/td&gt;
    &lt;td&gt;-4.0444719646755933&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;'share'&lt;/td&gt;
    &lt;td&gt;-3.1994347679575168&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;'article'&lt;/td&gt;
    &lt;td&gt;-2.9875364640619431&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;'november'&lt;/td&gt;
    &lt;td&gt;-2.872542653309075&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;'print'&lt;/td&gt;
    &lt;td&gt;-2.7039994399720166&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;'email'&lt;/td&gt;
    &lt;td&gt;-2.4671743850771906&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;'advertisement'&lt;/td&gt;
    &lt;td&gt;-2.3948473577644886&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;'oct'&lt;/td&gt;
    &lt;td&gt;-2.3773831096010531&lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;

&lt;p&gt;Also very noisy, with words like "share", "print" and "article" (probably from "Print article" links) as well as date strings (likely from publication headers). The only token that is not from auxiliary text on the page is likely "hillary", which in and of itself does not distinguish fake from real news. &lt;sup id="fnref:1"&gt;&lt;a class="footnote-ref" href="#fn:1"&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;h6 id="linear-models-agreed-that-to-say-is-a-real-news-feature"&gt;Linear Models agreed that "to say" is a real news feature&lt;/h6&gt;
&lt;p&gt;For the linear models, forms of the verb "to say" appeared near the top -- likely learned from quotations in professional journalism (e.g. "Chancellor Angela Merkel said..."). In fact, "said" was the most significant token for the top linear model, edging out the next token by 2 points. Here is a short summary of real news top tokens from the Passive Aggressive classifier:&lt;/p&gt;
&lt;table&gt;
  &lt;tr&gt;
    &lt;th&gt;Feature&lt;/th&gt;
    &lt;th&gt;Weight&lt;/th&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;'said'&lt;/td&gt;
    &lt;td&gt;4.6936244574076511&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;'says'&lt;/td&gt;
    &lt;td&gt;2.6841231322197814&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;'cruz'&lt;/td&gt;
    &lt;td&gt;2.4882327232138084&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;'tuesday'&lt;/td&gt;
    &lt;td&gt;2.4307699875323676&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;'friday'&lt;/td&gt;
    &lt;td&gt;2.4004245195582929&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;'islamic'&lt;/td&gt;
    &lt;td&gt;2.3792489975683924&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;'candidates'&lt;/td&gt;
    &lt;td&gt;2.3458465918387894&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;'gop'&lt;/td&gt;
    &lt;td&gt;2.3449946222238158&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;'conservative'&lt;/td&gt;
    &lt;td&gt;2.3312074608602522&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;'marriage'&lt;/td&gt;
    &lt;td&gt;2.3246779761740823&lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;

&lt;p&gt;Although there are more real topics included, there are also words like Friday and Tuesday. Perhaps we should only read the news on Friday or Tuesday to ensure it is real...&lt;/p&gt;
&lt;p&gt;&lt;img src="https://media.giphy.com/media/Smjo9iKgrt3DW/giphy.gif"/&gt;&lt;p&gt;&lt;a href="https://giphy.com/gifs/krysten-ritter-apt-23-dont-trust-the-b-in-Smjo9iKgrt3DW"&gt;via GIPHY&lt;/a&gt;&lt;/p&gt;&lt;/p&gt;
&lt;h6 id="overall-the-top-tokens-were-mainly-noise"&gt;Overall, the top tokens were mainly noise&lt;/h6&gt;
&lt;p&gt;When I aggregated the top tokens for both real and fake news, sorting by count (i.e. the most common tokens identified as real and fake for all models), I saw mainly noise. Here are the top tokens sorted by the number of occurrences for identifying real news:&lt;/p&gt;
&lt;table border="1" &gt;
  &lt;thead&gt;
    &lt;tr style="text-align: right;"&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;Aggregate Rank&lt;/th&gt;
      &lt;th&gt;Count&lt;/th&gt;
      &lt;th&gt;Label&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;said&lt;/th&gt;
      &lt;td&gt;9.8&lt;/td&gt;
      &lt;td&gt;5&lt;/td&gt;
      &lt;td&gt;REAL&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;cruz&lt;/th&gt;
      &lt;td&gt;3.5&lt;/td&gt;
      &lt;td&gt;4&lt;/td&gt;
      &lt;td&gt;REAL&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;tuesday&lt;/th&gt;
      &lt;td&gt;8.33333&lt;/td&gt;
      &lt;td&gt;3&lt;/td&gt;
      &lt;td&gt;REAL&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;conservative&lt;/th&gt;
      &lt;td&gt;4.66667&lt;/td&gt;
      &lt;td&gt;3&lt;/td&gt;
      &lt;td&gt;REAL&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;gop&lt;/th&gt;
      &lt;td&gt;3.33333&lt;/td&gt;
      &lt;td&gt;3&lt;/td&gt;
      &lt;td&gt;REAL&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;islamic&lt;/th&gt;
      &lt;td&gt;6.33333&lt;/td&gt;
      &lt;td&gt;3&lt;/td&gt;
      &lt;td&gt;REAL&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;says&lt;/th&gt;
      &lt;td&gt;8.33333&lt;/td&gt;
      &lt;td&gt;3&lt;/td&gt;
      &lt;td&gt;REAL&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;president&lt;/th&gt;
      &lt;td&gt;5.5&lt;/td&gt;
      &lt;td&gt;2&lt;/td&gt;
      &lt;td&gt;REAL&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;trump&lt;/th&gt;
      &lt;td&gt;9.5&lt;/td&gt;
      &lt;td&gt;2&lt;/td&gt;
      &lt;td&gt;REAL&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;state&lt;/th&gt;
      &lt;td&gt;3&lt;/td&gt;
      &lt;td&gt;2&lt;/td&gt;
      &lt;td&gt;REAL&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;And the top tokens for identifying fake news:&lt;/p&gt;
&lt;table border="1"&gt;
  &lt;thead&gt;
    &lt;tr style="text-align: right;"&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;Aggregate Rank&lt;/th&gt;
      &lt;th&gt;Count&lt;/th&gt;
      &lt;th&gt;Label&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;2016&lt;/th&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;3&lt;/td&gt;
      &lt;td&gt;FAKE&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;share&lt;/th&gt;
      &lt;td&gt;5&lt;/td&gt;
      &lt;td&gt;3&lt;/td&gt;
      &lt;td&gt;FAKE&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;print&lt;/th&gt;
      &lt;td&gt;7.33333&lt;/td&gt;
      &lt;td&gt;3&lt;/td&gt;
      &lt;td&gt;FAKE&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;october&lt;/th&gt;
      &lt;td&gt;2.66667&lt;/td&gt;
      &lt;td&gt;3&lt;/td&gt;
      &lt;td&gt;FAKE&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;november&lt;/th&gt;
      &lt;td&gt;5.66667&lt;/td&gt;
      &lt;td&gt;3&lt;/td&gt;
      &lt;td&gt;FAKE&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;hillary&lt;/th&gt;
      &lt;td&gt;2.33333&lt;/td&gt;
      &lt;td&gt;3&lt;/td&gt;
      &lt;td&gt;FAKE&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;article&lt;/th&gt;
      &lt;td&gt;4.33333&lt;/td&gt;
      &lt;td&gt;3&lt;/td&gt;
      &lt;td&gt;FAKE&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;0000&lt;/th&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;2&lt;/td&gt;
      &lt;td&gt;FAKE&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;election&lt;/th&gt;
      &lt;td&gt;7.5&lt;/td&gt;
      &lt;td&gt;2&lt;/td&gt;
      &lt;td&gt;FAKE&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;000035&lt;/th&gt;
      &lt;td&gt;2&lt;/td&gt;
      &lt;td&gt;2&lt;/td&gt;
      &lt;td&gt;FAKE&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;To see the code used to generate these rankings, please take a look at &lt;a href="https://github.com/kjam/random_hackery/blob/master/Comparing%20Fake%20News%20Classifiers.ipynb"&gt;the Jupyter Notebook&lt;/a&gt;.&lt;/p&gt;
&lt;h4 id="takeaways"&gt;Takeaways&lt;/h4&gt;
&lt;p&gt;As I conjectured from the start, fake news is a much harder problem than simply throwing simple NLP vectors at it and solving with a linear or Bayesian model. Although I found it interesting that the linear classifiers noticed that real news uses quoting verbs more often, this is far from a deep insight that could help us build a real-versus-fake news filter which might improve democracy.&lt;/p&gt;
&lt;p&gt;I did have fun spending a short time building on a few ideas, and it was useful to see that the linear models produced less token noise in their real news features. If I had taken time to clean the dataset of these tokens, I'm curious how the comparison between the models would change.&lt;/p&gt;
&lt;p&gt;In the end, the dataset is likely not a great candidate for building a robust fake versus real news model. It seems to have a lot of token noise (dates, share and print links and a limited variety of topics). It is also fairly small and therefore any models would likely suffer from having a smaller token set and have trouble generalizing.&lt;/p&gt;
&lt;p&gt;I'm always curious to hear other trends or ideas you have in approaching these topics. Feel free to comment below or &lt;a href="https://twitter.com/kjam"&gt;reach out via Twitter (@kjam)&lt;/a&gt;.&lt;/p&gt;
&lt;h6 id="footnotes"&gt;Footnotes&lt;/h6&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;Perhaps the fact that Clinton did not appear alongside it means a longer n-gram could identify references to popular alt-right and conservative monikers like "Lying Hillary" versus "Hillary Clinton".&amp;#160;&lt;a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;Some fun ones in there included '11truther', '0h4at2yetra17uxetni02ls2jeg0mty45jrcu7mrzsrpcbq464i', 'nostrums', 'wordpress' and 'woot'. (I'm sure there are many more finds awaiting more study ...)&amp;#160;&lt;a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="research"></category></entry><entry><title>Data Unit Testing: EuroPython Tutorial</title><link href="https://blog.kjamistan.com/data-unit-testing-europython-tutorial.html" rel="alternate"></link><published>2017-07-14T00:00:00+02:00</published><updated>2017-07-14T00:00:00+02:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2017-07-14:/data-unit-testing-europython-tutorial.html</id><summary type="html">&lt;p&gt;I gave a long and opinionated tutorial at &lt;a href="https://ep2017.europython.eu/p3/schedule/ep2017/"&gt;EuroPython 2017&lt;/a&gt; about how we &lt;a href="https://ep2017.europython.eu/conference/talks/data-unit-testing-with-python"&gt;should do unit testing and validation within a data science scope&lt;/a&gt;. The GitHub repository for the course (which is part of my &lt;a href="https://blog.kjamistan.com/practical-data-cleaning-with-python-resources.html"&gt;O'Reilly Live Online training&lt;/a&gt;) is &lt;a href="https://github.com/kjam/data-cleaning-101"&gt;https://github.com/kjam/data-cleaning-101&lt;/a&gt;. I will continue editing and …&lt;/p&gt;</summary><content type="html">&lt;p&gt;I gave a long and opinionated tutorial at &lt;a href="https://ep2017.europython.eu/p3/schedule/ep2017/"&gt;EuroPython 2017&lt;/a&gt; about how we &lt;a href="https://ep2017.europython.eu/conference/talks/data-unit-testing-with-python"&gt;should do unit testing and validation within a data science scope&lt;/a&gt;. The GitHub repository for the course (which is part of my &lt;a href="https://blog.kjamistan.com/practical-data-cleaning-with-python-resources.html"&gt;O'Reilly Live Online training&lt;/a&gt;) is &lt;a href="https://github.com/kjam/data-cleaning-101"&gt;https://github.com/kjam/data-cleaning-101&lt;/a&gt;. I will continue editing and updating the repository with more examples, so feel free to fork or star it to get updates.&lt;/p&gt;
&lt;p&gt;The slides for the talk are also available here:&lt;/p&gt;
&lt;iframe src="https://docs.google.com/presentation/d/18RyP6X1eRvdvK720UtX3TxFbke6kfVLLXI0zqeiEBtU/embed?start=false&amp;loop=false&amp;delayms=3000" frameborder="0" width="960" height="569" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"&gt;&lt;/iframe&gt;

&lt;p&gt;And for those who attended, please &lt;a href="https://bit.ly/data-unit-testing-feedback"&gt;give me feedback&lt;/a&gt;!&lt;/p&gt;</content><category term="trainings"></category></entry><entry><title>if Ethics is not None</title><link href="https://blog.kjamistan.com/if-ethics-is-not-none.html" rel="alternate"></link><published>2017-07-14T00:00:00+02:00</published><updated>2017-07-14T00:00:00+02:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2017-07-14:/if-ethics-is-not-none.html</id><summary type="html">&lt;p&gt;This past Wednesday, I had the pleasure of giving a keynote at &lt;a href="https://ep2017.europython.eu/en/"&gt;EuroPython 2017&lt;/a&gt;. I covered a historical view of ethics in computing. The slides are shared here, but it was also recorded so I will post a video when it is available. (Updated: video added!)&lt;/p&gt;
&lt;p&gt;In addition, a series …&lt;/p&gt;</summary><content type="html">&lt;p&gt;This past Wednesday, I had the pleasure of giving a keynote at &lt;a href="https://ep2017.europython.eu/en/"&gt;EuroPython 2017&lt;/a&gt;. I covered a historical view of ethics in computing. The slides are shared here, but it was also recorded so I will post a video when it is available. (Updated: video added!)&lt;/p&gt;
&lt;p&gt;In addition, a series of blog posts and interviews I conducted during my research will be here in August, so stay tuned for more historical computing memories!&lt;/p&gt;
&lt;p&gt;Slides:&lt;/p&gt;
&lt;iframe src="https://docs.google.com/presentation/d/1EFHTp8okoIOvn3j0ga8YzNBIF80zGNKsROwttS2_j00/embed?start=false&amp;loop=false&amp;delayms=3000" frameborder="0" width="960" height="569" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"&gt;&lt;/iframe&gt;

&lt;p&gt;Video:&lt;/p&gt;
&lt;iframe width="1280" height="720" src="https://www.youtube.com/embed/FtRbAePXUoI" frameborder="0" gesture="media" allow="encrypted-media" allowfullscreen&gt;&lt;/iframe&gt;</content><category term="conferences"></category></entry><entry><title>Practical Data Cleaning with Python Resources</title><link href="https://blog.kjamistan.com/practical-data-cleaning-with-python-resources.html" rel="alternate"></link><published>2017-05-03T00:00:00+02:00</published><updated>2017-05-03T00:00:00+02:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2017-05-03:/practical-data-cleaning-with-python-resources.html</id><summary type="html">&lt;h2 id="practical-data-cleaning-resources"&gt;Practical Data Cleaning Resources&lt;/h2&gt;
&lt;h4 id="oreilly-live-online-training"&gt;(O'Reilly Live Online Training)&lt;/h4&gt;
&lt;p&gt;This week I will be giving my first O'Reilly Live Online Training via the Safari platform. I'm pretty excited to share some of my favorite data cleaning libraries and tips for validating and testing your data workflows.&lt;/p&gt;
&lt;p&gt;This post hopes to be …&lt;/p&gt;</summary><content type="html">&lt;h2 id="practical-data-cleaning-resources"&gt;Practical Data Cleaning Resources&lt;/h2&gt;
&lt;h4 id="oreilly-live-online-training"&gt;(O'Reilly Live Online Training)&lt;/h4&gt;
&lt;p&gt;This week I will be giving my first O'Reilly Live Online Training via the Safari platform. I'm pretty excited to share some of my favorite data cleaning libraries and tips for validating and testing your data workflows.&lt;/p&gt;
&lt;p&gt;This post hopes to be a resource to those attending the class, but also anyone interested in the subject of practical data cleaning with Python. If you have tips or ideas on extra content or links to add, feel free to comment or reach out via Twitter or email.&lt;/p&gt;
&lt;p&gt;Hope you enjoy!&lt;/p&gt;
&lt;h3 id="libraries-repositories"&gt;Libraries / Repositories&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Course Repository: https://github.com/kjam/data-cleaning-101&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="deduplication"&gt;Deduplication&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;Dedupe: https://github.com/dedupeio/dedupe&lt;/li&gt;
&lt;li&gt;CSV Dedupe: https://github.com/dedupeio/csvdedupe&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="string-matching"&gt;String Matching&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;Fuzzy Wuzzy: https://github.com/seatgeek/fuzzywuzzy&lt;/li&gt;
&lt;li&gt;TextaCy: https://github.com/chartbeat-labs/textacy&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="managing-nulls"&gt;Managing Nulls&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;Pandas functions: http://pandas.pydata.org/pandas-docs/stable/missing_data.html&lt;/li&gt;
&lt;li&gt;Dora: https://github.com/NathanEpstein/Dora&lt;/li&gt;
&lt;li&gt;Badfish: https://github.com/harshnisar/badfish&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="normalization-preprocessing"&gt;Normalization &amp;amp; Preprocessing&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;Scikit-learn preprocessing: http://scikit-learn.org/stable/modules/preprocessing.html&lt;/li&gt;
&lt;li&gt;Pandas stats: http://pandas.pydata.org/pandas-docs/stable/basics.html#descriptive-statistics&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="specific-data-cleaning-topics"&gt;Specific data cleaning topics&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;Privacy? https://github.com/datascopeanalytics/scrubadub&lt;/li&gt;
&lt;li&gt;Measurements? http://pint.readthedocs.io/&lt;/li&gt;
&lt;li&gt;Versioning ML Data? https://github.com/NathanEpstein/Dora&lt;/li&gt;
&lt;li&gt;Dates? http://arrow.readthedocs.io/en/latest/ or https://github.com/kennethreitz/maya&lt;/li&gt;
&lt;li&gt;AutoClean? https://github.com/rhiever/datacleaner&lt;/li&gt;
&lt;li&gt;DIY Parser? https://github.com/datamade/parserator&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="simple-pipelines-graphs-task-processing"&gt;Simple pipelines / graphs, task processing&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;Dask: https://github.com/dask/dask&lt;/li&gt;
&lt;li&gt;Distributed: https://github.com/dask/distributed&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="schema-validation"&gt;Schema Validation&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Voluptuous: https://github.com/alecthomas/voluptuous&lt;/li&gt;
&lt;li&gt;Validr: https://github.com/guyskk/validr&lt;/li&gt;
&lt;li&gt;With Serialization: https://marshmallow.readthedocs.io/en/latest/&lt;/li&gt;
&lt;li&gt;For JVM / Apache: https://avro.apache.org/&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="dataframe-validation"&gt;Dataframe Validation&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;Engarde: https://github.com/TomAugspurger/engarde&lt;/li&gt;
&lt;li&gt;Validada: https://github.com/jnmclarty/validada&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="constraint-detection"&gt;Constraint Detection&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;TDDA: Test-Driven Data Analysis: https://github.com/tdda/tdda&lt;/li&gt;
&lt;li&gt;SciPy: https://docs.scipy.org/doc/scipy-0.19.0/reference/stats.html#statistical-functions&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="property-based-testing"&gt;Property-based Testing&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;Hypothesis: https://hypothesis.readthedocs.io/&lt;/li&gt;
&lt;li&gt;Haskell's Quickcheck: https://hackage.haskell.org/package/QuickCheck&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="more-validation-and-testing"&gt;More Validation and Testing&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;Model Cross Validation: http://scikit-learn.org/stable/modules/cross_validation.html&lt;/li&gt;
&lt;li&gt;Testing ML features: https://github.com/machinalis/featureforge&lt;/li&gt;
&lt;li&gt;Built-in Stats: https://docs.python.org/3/library/statistics.html&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="unit-testing-basics"&gt;Unit Testing Basics&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;PyTest: https://docs.pytest.org/en/latest/&lt;/li&gt;
&lt;li&gt;Mocking: https://docs.python.org/3/library/unittest.mock-examples.html&lt;/li&gt;
&lt;li&gt;Faking Data with Faker: https://faker.readthedocs.io/en/master/&lt;/li&gt;
&lt;li&gt;Faker CSVs: https://github.com/pereorga/csvfaker&lt;/li&gt;
&lt;li&gt;Watch &lt;a href="https://www.youtube.com/watch?v=FxSsnHeWQBY"&gt;Ned Batchelder’s testing talk&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Continuous Integration: &lt;a href="https://travis-ci.org/"&gt;TravisCI&lt;/a&gt;, &lt;a href="https://jenkins.io/"&gt;Jenkins&lt;/a&gt;, &lt;a href="https://www.jetbrains.com/teamcity/"&gt;TeamCity&lt;/a&gt; and many more&lt;/li&gt;
&lt;li&gt;Better Code Reviews: http://www.bettercode.reviews/&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="testing-pipelines"&gt;Testing Pipelines&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="http://blog.cloudera.com/blog/2015/07/how-to-do-data-quality-checks-using-apache-spark-dataframes/"&gt;Data Quality Checks with Spark DataFrames&lt;/a&gt;  &lt;/li&gt;
&lt;li&gt;Drunken Data Quality (Spark DF): https://github.com/FRosner/drunken-data-quality&lt;/li&gt;
&lt;li&gt;Apache Beam: https://beam.apache.org/documentation/pipelines/test-your-pipeline/&lt;/li&gt;
&lt;li&gt;Tip: Check your framework first!&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="open-datasets-to-try-out-your-skills"&gt;Open Datasets (to try out your skills!)&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.kaggle.com/datasets"&gt;Kaggle Datasets&lt;/a&gt;: beyond just competition data, Kaggle also has shared datasets curated by users.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/caesar0301/awesome-public-datasets"&gt;Awesome Datasets GitHub List&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.quora.com/Where-can-I-find-large-datasets-open-to-the-public"&gt;Quora: Where can I find large public datasets?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://scikit-learn.org/stable/datasets/index.html"&gt;Scikit-learn datasets&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.dataquest.io/blog/free-datasets-for-projects/"&gt;Dataquest.io: 17 places to find open datasets for projects&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.nltk.org/data.html"&gt;NLTK Data&lt;/a&gt;: NLP data such as books, scripts, articles and poems&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="research"&gt;Research&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="http://sirrice.github.io/files/papers/cleaning-hilda16.pdf"&gt;Towards Reliable Interactive Data Cleaning: A User Survey and Recommendations, S Krishnan, D Haas, M. J. Franklin, 2016&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.cs.toronto.edu/~mvolkovs/icde14_data_cleaning.pdf"&gt;Continuous Data Cleaning, M. Volkovs, F. Chiang, J. Szlichta, and R. J. Miller. ICDE, 2014&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://activeclean.github.io/"&gt;ActiveClean: Krishnan, Franklin, Goldberg, Wang, Wu, 2016&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cs.uwaterloo.ca/~x4chu/SIGMOD2015_2.pdf"&gt;Katara: A Data Cleaning System Powered by Knowledge Bases and Crowdsourcing, X Chu, Morcos, Ilyas et al. 2015&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://svn.aksw.org/papers/2014/WWW_Databugger/public.pdf"&gt;Test-driven Evaluation of Linked Data Quality, D. Kontokostas, P. Westphal, S. Auer, S. Hellmann, J. Lehmann, R. Cornelissen, and A. Zaveri., 2015&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/pdf/1703.05921.pdf"&gt;Unsupervised Anomaly Detection with Generative Adversarial Networks to Guide Marker Discovery, T Schlegl, P Seeböck, S M. Waldstein, U Schmidt-Erfurth, and G Langs, 2017&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That's all for now! Check back as I plan to update and evolve this list with more libraries and examples.&lt;/p&gt;</content><category term="trainings"></category></entry><entry><title>PyData Amsterdam Keynote on Ethical Machine Learning</title><link href="https://blog.kjamistan.com/pydata-amsterdam-keynote-on-ethical-machine-learning.html" rel="alternate"></link><published>2017-04-07T00:00:00+02:00</published><updated>2017-04-07T00:00:00+02:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2017-04-07:/pydata-amsterdam-keynote-on-ethical-machine-learning.html</id><summary type="html">&lt;p&gt;I was kindly asked by the PyData Amsterdam organizers to keynote the conference. As a passionate fan of ethical machine learning and the great research being done by data scientists and academics around the world -- I am very enthused to present the topic to the conference.&lt;/p&gt;
&lt;p&gt;My slides are currently …&lt;/p&gt;</summary><content type="html">&lt;p&gt;I was kindly asked by the PyData Amsterdam organizers to keynote the conference. As a passionate fan of ethical machine learning and the great research being done by data scientists and academics around the world -- I am very enthused to present the topic to the conference.&lt;/p&gt;
&lt;p&gt;My slides are currently available as &lt;a href="https://github.com/kjam/random_hackery/tree/master/talks"&gt;a jupyter notebook via GitHub&lt;/a&gt; and I will post them in an easier-to-browse format soon. I will be adding the video as well as several extra posts regarding the research and findings here.&lt;/p&gt;
&lt;p&gt;I would especially like to thank &lt;a href="https://github.com/mattilyra"&gt;Matti Lyra&lt;/a&gt; for his help and suggestions in crafting this talk. I would also like to thank &lt;a href="http://francoiseprovencher.weebly.com/blog"&gt;Françoise Provencher&lt;/a&gt; for pointing me to some of the great resources.&lt;/p&gt;
&lt;h2 id="talk-and-slide-references"&gt;Talk and Slide References&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.propublica.org/article/minority-neighborhoods-higher-car-insurance-premiums-white-areas-same-risk"&gt;Minority Areas Pay Higher Car Insurance than White Areas with the Same Risk&lt;/a&gt; by ProPublica&lt;/li&gt;
&lt;li&gt;&lt;a href="https://deardesignstudent.com/ethics-cant-be-a-side-hustle-b9e78c090aee"&gt;Ethics Can't be a Side Hustle&lt;/a&gt; by Mike Monteiro&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/1605.06083"&gt;Stereotyping and Bias in the Flickr30 Dataset&lt;/a&gt; by Emiel van Miltenburg\n- &lt;a href="https://www.cnet.com/news/why-facebook-is-giving-out-free-wi-fi-for-check-ins/"&gt;Why Facebook is giving out free Wi-Fi for check-ins&lt;/a&gt; by CNet&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.formisimo.com/blog/do-not-untick-this-box-if-you-do-not-want-to-not-receive-updates/"&gt;Do not untick this box if you do not want to receive updates&lt;/a&gt; by Formismo&lt;/li&gt;
&lt;li&gt;&lt;a href="https://media.ccc.de/v/32c3-7482-say_hi_to_your_new_boss_how_algorithms_might_soon_control_our_lives"&gt;Say hi to your new boss: How algorithms might soon control our lives&lt;/a&gt; and &lt;a href="https://github.com/adewes/32c3"&gt;GitHub Repo&lt;/a&gt; by Andreas Dewes&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.princeton.edu/~aylinc/papers/caliskan-islam_semantics.pdf"&gt;Semantics derived automatically from language corpora necessarily contain human biases&lt;/a&gt; by Aylin Caliskan-Islam, Joanna J. Bryson, and Arvind Narayanan (Related 33c3 talk by Aylin Caliskan-Islam: &lt;a href="https://www.youtube.com/watch?v=j7FwpZB1hWc"&gt;Story of discrimination and unfairness&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing"&gt;Machine Bias&lt;/a&gt; by ProPublica&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/@AbeGong/ethics-for-powerful-algorithms-1-of-3-a060054efd84"&gt;Ethics for powerful algorithms&lt;/a&gt; by Abe Gong&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/1412.3756"&gt;Certifying and removing disparate impact&lt;/a&gt; by Michael Feldman, Sorelle A. Friedler, John Moeller, Carlos Scheidegger, Suresh Venkatasubramanian&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/1510.02377"&gt;FairTest: Discovering Unwarranted Associations in Data-Driven Applications&lt;/a&gt; and &lt;a href="https://github.com/columbia/fairtest"&gt;Github Repository&lt;/a&gt; by Florian Tramèr, Vaggelis Atlidakis, Roxana Geambasu, Daniel Hsu, Jean-Pierre Hubaux, Mathias Humbert, Ari Juels, Huang Lin&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=MqoRzNhrTnQ"&gt;When Recommendation Systems Go Bad&lt;/a&gt; by Evan Estola&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/1610.02413"&gt;Equality of Opportunity in Supervised Learning&lt;/a&gt; with &lt;a href="https://research.google.com/bigpicture/attacking-discrimination-in-ml/"&gt;interactive data visualization with generated loan data&lt;/a&gt; by Moritz Hardt, Eric Price, Nathan Srebro (interactive by Google BigPicture)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.oreilly.com/ideas/ideas-on-interpreting-machine-learning"&gt;Ideas on Interpreting Machine Learning&lt;/a&gt; by Patrick Hall, Wen Phan and SriSatish Ambati&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.uio.no/studier/emner/sv/oekonomi/ECON4135/h09/undervisningsmateriale/FinancialModelersManifesto.pdf"&gt;Financial Modeler's Manifesto&lt;/a&gt; by Emanuel Derman and Paul Wilmott&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="recommended-reading-related-work"&gt;Recommended Reading &amp;amp; Related Work&lt;/h2&gt;
&lt;p&gt;In addition to the papers I was able to reference in the slides, I have appended here some recommended reading on the topic of Ethics in Machine Learning. Expect this list to expand over time :)&lt;/p&gt;
&lt;h3 id="conferences"&gt;Conferences&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="http://www.fatml.org/"&gt;Fairness, Accountability, and Transparency in Machine Learning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://ethicsinnlp.org/"&gt;Ethics in NLP&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="blogs-and-publications"&gt;Blogs and Publications&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="http://unbias.wp.horizon.ac.uk/"&gt;UnBias: Emancipating Users Against Algorithmic Biases for a Trusted Digital Economy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://joanna-bryson.blogspot.nl/"&gt;Joanna J Bryson's Blog (and entire CV)&lt;/a&gt;. You can also [follow her on Twitter].(https://twitter.com/j2bryson)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://nlpers.blogspot.nl/"&gt;Hal Daumé III's NLP Blog&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://algorithmicfairness.wordpress.com/"&gt;Algorithmic Fairness by Suresh Venkat&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://randomwalker.info/"&gt;Arvind Narayanan's work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;I also wrote a post about &lt;a href="https://blog.kjamistan.com/embedded-isms-in-vector-based-natural-language-processing.html"&gt;embedded racism and sexism in word vectors&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="books"&gt;Books&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://weaponsofmathdestructionbook.com/"&gt;Weapons of Math Destruction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.emanuelderman.com/books/models-behaving-badly"&gt;Models.Behaving.Badly&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="articles"&gt;Articles&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://hbr.org/2016/12/a-guide-to-solving-social-problems-with-machine-learning"&gt;A Guide to Solving Social Problems with Machine Learning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.oreilly.com/ideas/the-ethics-of-artificial-intelligence"&gt;The Ethics of Artificial Intelligence&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.ncjrs.gov/pdffiles1/nij/240696.pdf"&gt;Predicting Recidivism Risk: New Tool in Philadelphia Shows Great Promise&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="podcasts"&gt;Podcasts&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="http://machine-ethics.net/"&gt;The Machine Ethics Podcast&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://ethicalmachines.com/"&gt;Ethical Machines&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="news-on-ethical-machine-learning-and-models-gone-bad"&gt;News on Ethical Machine Learning and Models Gone Bad&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.theguardian.com/us-news/2016/dec/18/michigan-unemployment-agency-fraud-accusations"&gt;Michigan unemployment agency made 20,000 false fraud accusations&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.theverge.com/2016/10/11/13243890/facebook-twitter-instagram-police-surveillance-geofeedia-api"&gt;Facebook, Twitter, and Instagram surveillance tool was used to arrest Baltimore protesters&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.autoblog.com/2011/05/31/women-voice-command-systems/"&gt;Many Cars Tone Deaf To Women's Voices&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.washingtonpost.com/news/wonk/wp/2016/03/10/uber-seems-to-offer-better-service-in-areas-with-more-white-people-that-raises-some-tough-questions/"&gt;Uber seems to offer better service in areas with more white people. That raises some tough questions.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.theguardian.com/technology/2016/sep/08/artificial-intelligence-beauty-contest-doesnt-like-black-people"&gt;A beauty contest was judged by AI and the robots didn't like dark skin&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.bloomberg.com/graphics/2016-amazon-same-day/"&gt;Amazon Doesn’t Consider the Race of Its Customers. Should It?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.forbes.com/sites/ciocentral/2016/12/21/on-the-ethical-use-of-data-vs-the-internet-of-things/2/#1d18691f1d18"&gt;On The Ethical Use Of Data Vs. The Internet Of Things&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://hbr.org/2016/12/research-how-subtle-class-cues-can-backfire-on-your-resume"&gt;How Subtle Class Cues can Backfire on Your Resumé&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.fastcompany.com/3067285/the-future-of-work/can-artificial-intelligence-wipe-unconscious-bias-from-your-workday"&gt;Can Artificial Intelligence Wipe Unconscious Bias From Your Workday?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.oreilly.com/ideas/on-computational-ethics"&gt;On Computational Ethics&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.weforum.org/agenda/2017/02/ai-learned-to-betray-others-heres-why-thats-okay"&gt;AI learned to betray others. Here's why that's okay&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.nytimes.com/2012/08/09/opinion/after-knight-capital-new-code-for-trades.html"&gt;Errant code? Not just a bug.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://psmag.com/artificial-intelligence-will-be-as-biased-and-prejudiced-as-its-human-creators-38fe415f86dd"&gt;Artificial Intelligence Will Be as Biased and Prejudiced as Its Human Creators&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://motherboard.vice.com/en_us/article/ai-can-learn-values-from-reading"&gt;If We Don’t Want AI to Be Evil, We Should Teach It to Read&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.seattletimes.com/business/microsoft/how-linkedins-search-engine-may-reflect-a-bias/"&gt;How LinkedIn’s search engine may reflect a gender bias&lt;/a&gt; by Matt Day, Seattle Times (and follow up by Samanta Cooney: &lt;a href="http://motto.time.com/4484530/linkedin-gender-bias-search/"&gt;LinkedIn Tweaks Search Algorithm After Report Suggests Gender Bias&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;</content><category term="conferences"></category></entry><entry><title>Ten Tips for First-Time Conference Speakers</title><link href="https://blog.kjamistan.com/ten-tips-for-first-time-conference-speakers.html" rel="alternate"></link><published>2017-02-11T00:00:00+01:00</published><updated>2017-02-11T00:00:00+01:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2017-02-11:/ten-tips-for-first-time-conference-speakers.html</id><summary type="html">&lt;p&gt;The saddest moment for me at conferences is when I'm in the middle of an interesting conversation with a bright person and I ask her when her talk is and she says, "Who me?"&lt;/p&gt;
&lt;p&gt;The number of folks I speak with every year at conferences who have amazing stories to …&lt;/p&gt;</summary><content type="html">&lt;p&gt;The saddest moment for me at conferences is when I'm in the middle of an interesting conversation with a bright person and I ask her when her talk is and she says, "Who me?"&lt;/p&gt;
&lt;p&gt;The number of folks I speak with every year at conferences who have amazing stories to share and who are working on great datasets and tools is astounding. I often feel overwhelmed by being the most average person in the room.&lt;/p&gt;
&lt;p&gt;That said, I think one way we can help increase diversity of ideas and culture in our community is to encourage and support first-time speakers. And I strongly believe a more diverse community benefits us all by creating more opportunities, increased inventiveness and fresh perspectives on the important topics and problems we face.&lt;/p&gt;
&lt;h2 id="why-speak-at-conferences-what-good-is-it"&gt;Why speak at conferences? What good is it?&lt;/h2&gt;
&lt;p&gt;Besides being good practice for management roles or other roles where public speaking is important, conferences give you an opportunity to share your work and knowledge and engage with others who you might not have met organically. I find the conversations I enjoy after giving a talk inspire new ideas and research for me and often teach me just as much as I learned in preparation for the talk.&lt;/p&gt;
&lt;h2 id="but-i-dont-like-public-speaking"&gt;But I don't like public speaking...&lt;/h2&gt;
&lt;p&gt;Honestly, that's fine. If you tried it once and you hated it, okay. You could always submit a panel, perhaps? Or give a tutorial? (I know, I'm trying too hard). However, if you haven't tried public speaking outside of the time you were in a play in grade school, &lt;em&gt;please&lt;/em&gt; give it a second chance.&lt;/p&gt;
&lt;h2 id="fine-where-are-my-tips"&gt;Fine, where are my tips?&lt;/h2&gt;
&lt;p&gt;Me RN&lt;/p&gt;
&lt;p&gt;&lt;img alt="excite!!" src="http://i.amz.mshcdn.com/ZONonowm38Eyu2i_CgFPnMISon0=/fit-in/1200x9600/http%3A%2F%2Fmashable.com%2Fwp-content%2Fuploads%2F2013%2F07%2Fexcited-baby.gif"&gt;&lt;/p&gt;
&lt;p&gt;Image: &lt;a href="http://maxafax.tumblr.com/"&gt;maxafax&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Here goes!&lt;/p&gt;
&lt;h3 id="1-talk-about-something-you-love"&gt;1. Talk about something you love.&lt;/h3&gt;
&lt;p&gt;The best talks are ones where the presenter is passionate and interested in the topic. You will likely put in hours practicing and rehashing ideas for your talk. You might need to research or test your hypothesis or implementation. Let it be a joy for both you and your audience.&lt;/p&gt;
&lt;h3 id="2-avoid-writers-block"&gt;2. Avoid writer's block.&lt;/h3&gt;
&lt;p&gt;Try techniques writers use! Get a notebook and write down every topic that comes to mind. Write down everything you know about every topic. Get on the phone and talk about it with your boss, coworker, friend, mother. Write down any and all ideas that come from those conversations. Read books or listen to talks or podcasts on the topic and write more notes on them. Reread your notes and code and/or data and repeat the above process until you have way too many words and not enough time to fit them.&lt;/p&gt;
&lt;h3 id="3-dont-be-afraid-to-engage-mentors-and-experts"&gt;3. Don't be afraid to engage mentors and experts.&lt;/h3&gt;
&lt;p&gt;Is there someone whose talk, library, career or accomplishments helped inspire your idea? Even if they might seem too busy or famous -- it’s likely they will be flattered and interested in speaking with you. Reach out and see what happens -- you could be pleasantly surprised!&lt;/p&gt;
&lt;h3 id="4-have-stage-fright-co-present"&gt;4. Have stage fright? Co-present!&lt;/h3&gt;
&lt;p&gt;If you are someone who is &lt;em&gt;truly&lt;/em&gt; terrified of speaking in front of groups, the best practice is to co-present. If for some reason you freeze, you have a partner to take over! And with practice, I honestly believe you can overcome your fear. You can also propose a panel and help moderate so the limelight is not focused on you and you have the opportunity to introduce and interview experts you collect for the topic.&lt;/p&gt;
&lt;h3 id="5-dont-worry-about-knowing-everything-be-prepared-to-learn"&gt;5. Don’t worry about knowing everything; be prepared to learn.&lt;/h3&gt;
&lt;p&gt;You won’t know everything about your topic. Be willing to learn and ask lots of questions. Be willing to be humbled by the knowledge of your attendees. Be willing to thank people for sharing knowledge with you. Be willing to admit you don’t know an answer, but be willing to help find it. (Side note: Don’t be afraid to get technical and dig deep!!)&lt;/p&gt;
&lt;h3 id="6-practice"&gt;6. Practice.&lt;/h3&gt;
&lt;p&gt;Practice your timing and your slide presentation. Practice in front of (every|any)one. Practice in front of your cat. Practice in front of your boss. Practice in front of a local meetup group. Practice in your sleep. Practice on Snapchat. Practice in front of a mirror. Basically, practice until you are saying similar enough things every time, you stop reading your notes, and the timing and talk progression are second nature.&lt;/p&gt;
&lt;h3 id="7-on-the-day-of-the-talk-get-rest-eat-breakfast-dont-look-at-your-slides"&gt;7. On the day of the talk, get rest, eat breakfast, don’t look at your slides.&lt;/h3&gt;
&lt;p&gt;By now you've practiced so much you could do it in your sleep. Give your mind a break. Make yourself a nice cup of tea or a latte. Get a good night’s rest. Meditate or watch a fun movie or do some (non-coding) reading or writing. You’ll be fine! In fact, you’ll be great! Time to just relax and enjoy your upcoming speech with ease.&lt;/p&gt;
&lt;h3 id="8-take-a-deep-breath-smile-stare-at-one-person-walk-around"&gt;8. Take a deep breath. Smile. Stare at one person. Walk around.&lt;/h3&gt;
&lt;p&gt;As you're giving your talk, remember to breathe! I like to take a deep breath and smile as I get on stage. Even if you don’t feel like smiling, it helps! If you get nervous about the crowd size, find a few friendly faces (or one or two you don’t know) and focus on those. Walk around the stage while you talk to ease your nerves and engage your audience.&lt;/p&gt;
&lt;h3 id="9-dont-take-yourself-or-your-talk-too-seriously"&gt;9. Don’t take yourself or your talk too seriously.&lt;/h3&gt;
&lt;p&gt;You are not a brain surgeon. If your talk completely flops, no one is going to die. If your slides freeze up, the world will continue turning. If you mispronounce someone’s name or you forget to mention a library, no one is going to put you in time-out. It’s OK to mess up and it doesn’t mean you’re not a smart cookie.&lt;/p&gt;
&lt;h3 id="10-ask-for-listen-to-and-learn-from-feedback"&gt;10. Ask for, listen to, and learn from feedback.&lt;/h3&gt;
&lt;p&gt;Feedback, both in the form of any written reviews as well as people on Twitter or folks who come up to speak to you later, is great! There will always be haters; try not to focus on reviews that say nothing constructive. Ask for feedback from mentors and colleagues who were there. Take both positive and negative feedback to heart and use it to make your &lt;em&gt;next&lt;/em&gt; talk even better.&lt;/p&gt;
&lt;p&gt;If you've made it this far: 👯 🎉 🙌 I hope to see you at an upcoming conference! In case you need ideas for where to present, I help organize &lt;a href="http://pydata.org/berlin2017/"&gt;the PyData Berlin Conference&lt;/a&gt;, which is guaranteed to be absolutely fabulous and is happening July 1-2, 2017. The PyData Berlin committee will also be organizing some local workshops to encourage first-time speakers and some mentorship opportunities -- so feel free to reach out for more information (forms and links for these will also be added to the website soon).&lt;/p&gt;
&lt;p&gt;Now that you are inspired, I recommend getting started on your proposal. For some further advice on writing a great proposal, I can recommend:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://us.pycon.org/2017/speaking/talks/"&gt;PyConUS Advice for Talk Proposals&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.noelrappin.com/railsrx/2014/3/17/what-i-learned-from-reading-429-conference-proposals"&gt;Noel Rappin: What I Learned from Reading 429 Conference Proposals&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.sarahmei.com/blog/2014/04/07/what-your-conference-proposal-is-missing/"&gt;Sarah Mei: What your Conference Proposal is Missing&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Look forward to seeing you speak up! 👏&lt;/p&gt;</content><category term="conferences"></category></entry><entry><title>The Practice of Programming: 18 Years Later</title><link href="https://blog.kjamistan.com/the-practice-of-programming-18-years-later.html" rel="alternate"></link><published>2017-01-20T00:00:00+01:00</published><updated>2017-01-20T00:00:00+01:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2017-01-20:/the-practice-of-programming-18-years-later.html</id><summary type="html">&lt;p&gt;Over the new year holiday time I had a chance to get away from it all, and snuck up to Finland to sit in a lodge on the Gulf of Finland, sip coffee, take saunas and read. I brought along a few books, the only programming one being &lt;a href="http://www.cs.princeton.edu/~bwk/tpop.webpage/"&gt;Brian W …&lt;/a&gt;&lt;/p&gt;</summary><content type="html">&lt;p&gt;Over the new year holiday time I had a chance to get away from it all, and snuck up to Finland to sit in a lodge on the Gulf of Finland, sip coffee, take saunas and read. I brought along a few books, the only programming one being &lt;a href="http://www.cs.princeton.edu/~bwk/tpop.webpage/"&gt;Brian W. Kernighan and Rob Pike's "The Practice of Programming."&lt;/a&gt;&lt;/p&gt;
&lt;blockquote class="twitter-tweet" data-lang="en"&gt;&lt;p lang="en" dir="ltr"&gt;Cabin: woke up like this. 😂 😍 &lt;a href="https://t.co/spr130gFzR"&gt;pic.twitter.com/spr130gFzR&lt;/a&gt;&lt;/p&gt;&amp;mdash; katharine jarmul (@kjam) &lt;a href="https://twitter.com/kjam/status/816206196591984640"&gt;January 3, 2017&lt;/a&gt;&lt;/blockquote&gt;

&lt;script async src="//platform.twitter.com/widgets.js" charset="utf-8"&gt;&lt;/script&gt;

&lt;p&gt;I received the book as a loan from a long-time mentor, who helped me first learn how to write production-ready code. I remember reading it in 2008 and having difficulty understanding all the concepts. As I moved from city to city, I always thought I should probably mail it back, or perhaps read it again &lt;em&gt;first&lt;/em&gt;, then mail it back...&lt;/p&gt;
&lt;h2 id="practice-of-programming-the-book"&gt;Practice of Programming: The Book&lt;/h2&gt;
&lt;p&gt;The book is 18 years old. It covers C programming. It handles issues like signed versus unsigned integers, piping data between systems with mismatched byte order and a few other topics that affect neither my programming nor that of most folks I know. Why reread it?&lt;/p&gt;
&lt;p&gt;Brian W. Kernighan and Rob Pike should need no introduction, but here is one in case you are like me and getting older and dependent on Google. Kernighan is a contributor to the C programming language and co-author of the famous book, &lt;a href="https://en.wikipedia.org/wiki/The_C_Programming_Language"&gt;"The C Programming Language"&lt;/a&gt;. He worked at &lt;a href="http://www.theverge.com/2012/3/21/2887206/jon-gertner-idea-factory-bell-labs-great-american-age-innovation-book-review"&gt;Bell Labs&lt;/a&gt; with Rob Pike, famous in his own right for developing numerous parts of the Unix system we all know and love today; and the whole &lt;a href="https://github.com/golang/go/graphs/contributors"&gt;Go language thing...&lt;/a&gt; #nbd.&lt;/p&gt;
&lt;p&gt;What gems still held my attention, 18 years after they were published and nearly 9 years after I was first handed the book? Many more than you might think; here are a few:&lt;/p&gt;
&lt;h4 id="debugging"&gt;Debugging&lt;/h4&gt;
&lt;p&gt;Chapter 5 is devoted solely to debugging and has many informative sections, including tips on finding patterns, rubber ducking (but with &lt;a href="https://discourse.codinghorror.com/t/rubber-duck-problem-solving/67/32"&gt;a teddy bear instead&lt;/a&gt;), analyzing data to help find programming bugs, and how to solve "non-reproducible" errors. The section that is truly timeless is &lt;em&gt;5.7 Other People's Bugs&lt;/em&gt;, which valiantly takes on how to find, manage and report other programmers' errors.&lt;/p&gt;
&lt;p&gt;Including this tidbit:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;If you think that you have found a bug in someone else's program, the first step is to make absolutely sure it is a genuine bug, so you don't waste the author's time and lose your own credibility.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;As someone who has written and helped fix many bugs, this resonated, especially since the standard today seems to be to simply open a GitHub issue and let the author(s) and contributors figure it out. If most of us spent an extra day debugging the issue, we might even fix it ourselves (we have the source code), or at least present a well-proven test case to help alleviate the burden on open-source maintainers.&lt;/p&gt;
&lt;p&gt;In that vein, Kernighan and Pike write:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Finally, put yourself in the shoes of the person who receives your report. You want to provide the owner with as good a test case as you can manage. It's not very helpful if the bug can be demonstrated only with large inputs, or an elaborate environment, or multiple supporting files. Strip the test down to a minimal and self-contained case. Include other information that could possibly be relevant, like the version of the program itself, and of the compiler, operating system and hardware.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I feel like a checklist of these points should be required before submitting bug reports. A kind of &lt;a href="https://www.joelonsoftware.com/2000/08/09/the-joel-test-12-steps-to-better-code/"&gt;Joel Test&lt;/a&gt; for error reporting.&lt;/p&gt;
&lt;p&gt;On the topic of errors, the authors also reference Donald Knuth's &lt;a href="http://onlinelibrary.wiley.com/doi/10.1002/spe.4380190702/abstract"&gt;The Errors of TeX&lt;/a&gt;, which deserves its own separate treatment (or post).&lt;sup id="fnref:1"&gt;&lt;a class="footnote-ref" href="#fn:1"&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;h4 id="testing"&gt;Testing&lt;/h4&gt;
&lt;p&gt;Chapter 6 is devoted to testing. As &lt;a href="https://www.safaribooksonline.com/library/view/building-data-pipelines/9781491970270/video289850.html"&gt;a fan of testing (even for your data!)&lt;/a&gt;, this chapter stood out, not just for its methodical evaluation of how, when and why to write tests, but also for its use of data validation (!!) and test automation (!!!). The fact that good developers still have to explain why these types of tests belong in their test suite (or convince managers and higher-ups that the tests are even necessary) is a sad and telling reflection of our priorities and our (non)adherence to lessons learned long ago.&lt;/p&gt;
&lt;p&gt;I especially liked this passage:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;It is important to test your own code: don't assume that some testing organization or user will find things for you. But it's easy to delude yourself about how carefully you are testing, so try to ignore the code and think of the hard cases, not the easy ones. To quote Don Knuth describing how he creates tests for the TEX formatter, "I get into the meanest, nastiest frame of mind that I can manage, and I write the nastiest [testing] code I can think of; then I turn around and embed that in even nastier constructions that are almost obscene."&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I literally spit my coffee out when reading this bit, imagining the coders of the world finding their worst selves and attacking their product with vigor and malice. But it &lt;em&gt;IS&lt;/em&gt; great advice. How many times have I written the obvious test instead of devoting a day or a few hours figuring out how to break my own code? &lt;sup id="fnref:2"&gt;&lt;a class="footnote-ref" href="#fn:2"&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
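&lt;p&gt;If you want to channel that nasty frame of mind without hand-writing every edge case, property-based testing is one way to do it. Here is a minimal, purely illustrative sketch using &lt;a href="https://hypothesis.readthedocs.io/en/latest/"&gt;Hypothesis&lt;/a&gt; (the &lt;code&gt;round_trip&lt;/code&gt; function and the property are invented for this example, not from the book):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;code&gt;# Illustrative sketch: let Hypothesis hunt for nasty inputs so we don't
# have to imagine them all ourselves. round_trip is a made-up example function.
from hypothesis import given, strategies as st


def round_trip(text):
    return text.encode("utf-8").decode("utf-8")


@given(st.text())
def test_round_trip_returns_original(text):
    # Hypothesis generates empty strings, emoji, control characters and
    # very long strings -- the "meanest, nastiest" cases it can find.
    assert round_trip(text) == text
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Run under pytest, a failing property gets shrunk to a minimal counterexample, which pairs nicely with the book's advice about minimal, self-contained test cases.&lt;/p&gt;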
&lt;h4 id="portability"&gt;Portability&lt;/h4&gt;
&lt;p&gt;The final chapter that struck me as still very much applicable today was Chapter 8 on Portability. This was a surprise, as I assumed the portability issues in 1999 didn't reflect any I might have seen as a developer. Grrllll, was I wrong...&lt;/p&gt;
&lt;p&gt;I can't even begin to explain my joy and amusement at turning the page and reading this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;8.8 Internationalization&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;If one lives in the United States, it's easy to forget that English is not the only language, ASCII not the only character set, $ not the only currency symbol, dates can be written with the day first, times can be based on a 24-hour clock, and so on.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The data errors, report misunderstandings and general grief I have seen in my career due to these misconceptions (sometimes my own, of course) are too numerous for me to recount. Additionally, the fact that we still debate the &lt;em&gt;need&lt;/em&gt; for internationalization of smaller tools or even our own websites is interesting to note, given an 18-year-old book outlining internationalization as a requirement.&lt;/p&gt;
&lt;p&gt;Beyond internationalization, Kernighan and Pike touch upon portability for different environments, and elaborate on the pitfalls of massive if/else or switch statements in compilers or setup configuration files. Their warning against modifying source for one particular install was succinct and useful:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;When you modify a program to adapt to a new environment, don't begin by making a copy of the entire program. Instead, adapt the existing source. You will probably need to make changes to the main body of the code, and if you edit a copy, before long you will have divergent versions. As much as possible, there should only be a single source for a program; if you find you need to change something to port to a particular environment, find a way to make the change work everywhere.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Finally, something I think we have caught up to (although should still remember)! Version control, generalization (when useful) and open-source libraries eating the world. Hooray us!&lt;/p&gt;
&lt;h4 id="other-fun-to-me-notes"&gt;Other fun (to me) notes&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;An entire section on self-generating code and ideas for better code written by machines.&lt;/li&gt;
&lt;li&gt;Seeing &lt;code&gt;print("%s", str)&lt;/code&gt; and doing a double-take to make sure I was not reading Python.&lt;/li&gt;
&lt;li&gt;A paragraph outlining (very politely) how ridiculous it is that we still need to support carriage returns (&lt;code&gt;\r&lt;/code&gt;) despite the fact that computers have no carriages.&lt;/li&gt;
&lt;li&gt;Learning that "big endian" is a reference to Jonathan Swift's Gulliver's Travels.&lt;/li&gt;
&lt;li&gt;Code to roll your own RegEx parser in C.&lt;/li&gt;
&lt;li&gt;Telnetting from machine to machine to copy files and using checksum (&lt;code&gt;sum&lt;/code&gt;) to test if the copy was properly performed.&lt;/li&gt;
&lt;li&gt;A &lt;em&gt;still&lt;/em&gt; semi-functional TCL and Perl script to scrape the web. See footnote for the code.&lt;sup id="fnref:3"&gt;&lt;a class="footnote-ref" href="#fn:3"&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/li&gt;
&lt;li&gt;Checking your email with grep&lt;blockquote&gt;
&lt;p&gt;Where did I save that mail from Bob?
&lt;code&gt;% grep '^From:.* bob@' mail/*&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="in-conclusion"&gt;In Conclusion&lt;/h4&gt;
&lt;p&gt;Granted, some of the content in this book was merely fun review for me and several themes are problems of a different era, but I found it remarkably relevant given its age. We often talk about books even a year old as outdated, but this made me reconsider how easy it is to treat every new thing as just that: NEW. More often than not, it's the same programming paradigms the folks at Bell Labs have been working on since the '80s.&lt;/p&gt;
&lt;p&gt;Moral of the story: Never too old to (re)read a good book.&lt;/p&gt;
&lt;p&gt;Oh, and, &lt;a href="https://twitter.com/ryanjoneil"&gt;Ryan&lt;/a&gt;... I'm sending your book back! Thanks for the loan! 😇&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;Debating doing a series on some of these older but still relevant texts. If this post is interesting to you, please let me know!&amp;#160;&lt;a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;This is a good point to remind you how much a tool like &lt;a href="https://hypothesis.readthedocs.io/en/latest/"&gt;Hypothesis&lt;/a&gt; can help you find those nasty corners of your code that you may or may not be able to reach.&amp;#160;&lt;a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:3"&gt;
&lt;p&gt;&lt;a href="https://gist.github.com/kjam/e31d4e50d9d5b50ca9337e3b677d20fa"&gt;Check out the unmodified 18-year old code as a Gist&lt;/a&gt;. Exact usage from book is to run as so: geturl.tcl $1 | unhtml.pl | fmt.awk. I couldn't get piping to work with my current setup, but the scripts still worked using tclsh and perl as a series of commands (granted most sites reject or don't respond to HTTP/1.0 requests without headers anymore... 😏)&amp;#160;&lt;a class="footnote-backref" href="#fnref:3" title="Jump back to footnote 3 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="programming"></category></entry><entry><title>New O'Reilly Video Training: Data Pipelines with Python</title><link href="https://blog.kjamistan.com/new-oreilly-video-training-data-pipelines-with-python.html" rel="alternate"></link><published>2016-12-13T00:00:00+01:00</published><updated>2016-12-13T00:00:00+01:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2016-12-13:/new-oreilly-video-training-data-pipelines-with-python.html</id><summary type="html">&lt;p&gt;I'm really excited to announce a new &lt;a href="http://shop.oreilly.com/product/0636920055334.do"&gt;Python video course with O'Reilly on data pipelines&lt;/a&gt;. If you are interested in learning some of the popular options available for workflow automation and management in Python, take a look!&lt;/p&gt;
&lt;p&gt;In the course, I cover:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Using &lt;a href="http://www.celeryproject.org/"&gt;Celery&lt;/a&gt; for simple automation&lt;/li&gt;
&lt;li&gt;Setting up &lt;a href="https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html"&gt;Hadoop …&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;</summary><content type="html">&lt;p&gt;I'm really excited to announce a new &lt;a href="http://shop.oreilly.com/product/0636920055334.do"&gt;Python video course with O'Reilly on data pipelines&lt;/a&gt;. If you are interested in learning some of the popular options available for workflow automation and management in Python, take a look!&lt;/p&gt;
&lt;p&gt;In the course, I cover:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Using &lt;a href="http://www.celeryproject.org/"&gt;Celery&lt;/a&gt; for simple automation&lt;/li&gt;
&lt;li&gt;Setting up &lt;a href="https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html"&gt;Hadoop for file storage&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Comparing tools like &lt;a href="https://airflow.incubator.apache.org/"&gt;Airflow&lt;/a&gt; and &lt;a href="http://luigi.readthedocs.io/en/stable/"&gt;Luigi&lt;/a&gt; for your pipeline needs&lt;/li&gt;
&lt;li&gt;How to parallelize data processing with &lt;a href="http://dask.pydata.org/en/latest/"&gt;Dask&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;A brief look at other popular tools like &lt;a href="http://spark.apache.org/"&gt;Apache Spark&lt;/a&gt; and &lt;a href="https://channels.readthedocs.io/en/stable/"&gt;Django Channels&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;More general and broad concepts like testing, DAGs, producers, consumers and how to be a not-awful systems caretaker.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;There is also a &lt;a href="https://github.com/kjam/data-pipelines-course"&gt;public repository available&lt;/a&gt; which covers the code and tools used.&lt;/p&gt;
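&lt;p&gt;To give a flavor of the kind of tooling the course compares, here is a minimal Luigi sketch of my own (illustrative only, not taken from the course materials): one task writes a file, a second task depends on it, and Luigi wires up the dependency graph.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;code&gt;# Illustrative Luigi sketch (not from the course): a two-task pipeline where
# the second task declares a dependency on the first via requires().
import luigi


class ExtractNumbers(luigi.Task):
    def output(self):
        return luigi.LocalTarget("numbers.txt")

    def run(self):
        with self.output().open("w") as f:
            f.write("\n".join(str(i) for i in range(10)))


class SumNumbers(luigi.Task):
    def requires(self):
        return ExtractNumbers()

    def output(self):
        return luigi.LocalTarget("total.txt")

    def run(self):
        with self.input().open() as f:
            total = sum(int(line) for line in f if line.strip())
        with self.output().open("w") as f:
            f.write(str(total))


if __name__ == "__main__":
    luigi.build([SumNumbers()], local_scheduler=True)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;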
&lt;p&gt;I appreciate any and all feedback from students who are enrolled or have taken the course, so please reach out! :)&lt;/p&gt;</content><category term="trainings"></category></entry><entry><title>DAGs &amp; Dask: How and When to Accelerate your Data Analysis</title><link href="https://blog.kjamistan.com/dags-dask-how-and-when-to-accelerate-your-data-analysis.html" rel="alternate"></link><published>2016-10-29T00:00:00+02:00</published><updated>2016-10-29T00:00:00+02:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2016-10-29:/dags-dask-how-and-when-to-accelerate-your-data-analysis.html</id><summary type="html">&lt;p&gt;I gave a talk about &lt;a href="https://en.wikipedia.org/wiki/Directed_acyclic_graph"&gt;Directed Acyclic Graphs (DAGs)&lt;/a&gt; and &lt;a href="https://github.com/dask"&gt;Dask&lt;/a&gt; at &lt;a href="https://cz.pycon.org/2016/"&gt;PyConCZ 2016&lt;/a&gt;. It was super fun and I had a great time at the conference. If you want to read my slides below, here they are! There will be videos available later, so I'll post the link / video …&lt;/p&gt;</summary><content type="html">&lt;p&gt;I gave a talk about &lt;a href="https://en.wikipedia.org/wiki/Directed_acyclic_graph"&gt;Directed Acyclic Graphs (DAGs)&lt;/a&gt; and &lt;a href="https://github.com/dask"&gt;Dask&lt;/a&gt; at &lt;a href="https://cz.pycon.org/2016/"&gt;PyConCZ 2016&lt;/a&gt;. It was super fun and I had a great time at the conference. If you want to read my slides below, here they are! There will be videos available later, so I'll post the link / video here when I see it.&lt;/p&gt;
&lt;p&gt;The notebooks I used are available on GitHub: &lt;a href="https://github.com/kjam/data-wrangling-pycon/blob/master/books/other-notebooks/"&gt;Fun with Dask Notebooks&lt;/a&gt;. If you have any questions, reach out &lt;a href="https://twitter.com/kjam"&gt;on Twitter&lt;/a&gt;.&lt;/p&gt;
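&lt;p&gt;If you haven't played with Dask before, here is a tiny sketch (mine, not from the talk notebooks) of how &lt;code&gt;dask.delayed&lt;/code&gt; builds up a DAG of lazy tasks and only executes it when you ask:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;code&gt;# Tiny illustration of building a task graph (a DAG) with dask.delayed.
from dask import delayed


def inc(x):
    return x + 1


def add(x, y):
    return x + y


# Nothing runs yet; each call just adds a node to the graph.
a = delayed(inc)(1)
b = delayed(inc)(2)
total = delayed(add)(a, b)

# compute() walks the DAG and runs it, parallelizing independent branches.
print(total.compute())  # prints 5
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;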
&lt;iframe src="https://docs.google.com/presentation/d/1a4hsRoTWVRNTuQNcb_bY66bKikpwQJordjBx-_qs-FY/embed?start=false&amp;loop=false&amp;delayms=3000" frameborder="0" width="960" height="569" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"&gt;&lt;/iframe&gt;</content><category term="conferences"></category></entry><entry><title>Introduction to Data Wrangling @ PyConCZ</title><link href="https://blog.kjamistan.com/introduction-to-data-wrangling-pyconcz.html" rel="alternate"></link><published>2016-10-29T00:00:00+02:00</published><updated>2016-10-29T00:00:00+02:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2016-10-29:/introduction-to-data-wrangling-pyconcz.html</id><summary type="html">&lt;p&gt;&lt;a href="https://cz.pycon.org/2016/"&gt;PyConCZ 2016&lt;/a&gt; was such a fun conference! First off, it was the first time I got to see &lt;a href="https://twitter.com/JackieKazil"&gt;Jackie Kazil&lt;/a&gt; since we started writing our &lt;a href="http://shop.oreilly.com/product/0636920032861.do"&gt;O'Reilly book Data Wrangling with Python&lt;/a&gt; together, HOORAYYYY!&lt;/p&gt;
&lt;blockquote class="twitter-tweet" data-lang="en"&gt;&lt;p lang="en" dir="ltr"&gt;OMG PYTHONISTAS! &lt;a href="https://twitter.com/JackieKazil"&gt;@JackieKazil&lt;/a&gt; &amp;amp; I are together for the first time since we started the &lt;a href="https://twitter.com/OReillyMedia"&gt;@OReillyMedia&lt;/a&gt; Data Wrangling …&lt;/p&gt;&lt;/blockquote&gt;</summary><content type="html">&lt;p&gt;&lt;a href="https://cz.pycon.org/2016/"&gt;PyConCZ 2016&lt;/a&gt; was such a fun conference! First off, it was the first time I got to see &lt;a href="https://twitter.com/JackieKazil"&gt;Jackie Kazil&lt;/a&gt; since we started writing our &lt;a href="http://shop.oreilly.com/product/0636920032861.do"&gt;O'Reilly book Data Wrangling with Python&lt;/a&gt; together, HOORAYYYY!&lt;/p&gt;
&lt;blockquote class="twitter-tweet" data-lang="en"&gt;&lt;p lang="en" dir="ltr"&gt;OMG PYTHONISTAS! &lt;a href="https://twitter.com/JackieKazil"&gt;@JackieKazil&lt;/a&gt; &amp;amp; I are together for the first time since we started the &lt;a href="https://twitter.com/OReillyMedia"&gt;@OReillyMedia&lt;/a&gt; Data Wrangling with Python book! 🙌 💜 🐍 &lt;a href="https://t.co/1LG3iCspQ3"&gt;pic.twitter.com/1LG3iCspQ3&lt;/a&gt;&lt;/p&gt;&amp;mdash; katharine jarmul (@kjam) &lt;a href="https://twitter.com/kjam/status/792353586328047616"&gt;October 29, 2016&lt;/a&gt;&lt;/blockquote&gt;

&lt;script async src="//platform.twitter.com/widgets.js" charset="utf-8"&gt;&lt;/script&gt;

&lt;p&gt;Secondly, it was super awesome and well organized, so THANK YOU to the organizers!! 🙌 I gave two talks, one about &lt;a href="http://kjamistan.com/dags-dask-how-and-when-to-accelerate-your-data-analysis/"&gt;Dask and parallelized Data Analysis&lt;/a&gt;, and a second one on Introduction to Data Wrangling with Python.&lt;/p&gt;
&lt;p&gt;The notebook I used is available on GitHub: &lt;a href="https://github.com/kjam/data-wrangling-pycon/blob/master/books/other-notebooks/2016%20Election%20FEC%20Data.ipynb"&gt;Data Analysis with Pandas on 2016 US Election Data&lt;/a&gt;. If you have any questions, reach out &lt;a href="https://twitter.com/kjam"&gt;on Twitter&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Enough typing, here are the slides! It was also recorded, so I will post the video of the talk as soon as I see it!&lt;/p&gt;
&lt;iframe src="https://docs.google.com/presentation/d/1-x2b7-P2BCLg0joLxruz4zXk6Nevhk3AKCAc2CCZOPY/embed?start=false&amp;loop=false&amp;delayms=3000" frameborder="0" width="960" height="749" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"&gt;&lt;/iframe&gt;</content><category term="conferences"></category></entry><entry><title>Chatbot Scraper: Europarl Scraper: 24 Languages of Politics, at your fingertips</title><link href="https://blog.kjamistan.com/chatbot-scraper-europarl-scraper-24-languages-of-politics-at-your-fingertips.html" rel="alternate"></link><published>2016-10-20T00:00:00+02:00</published><updated>2016-10-20T00:00:00+02:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2016-10-20:/chatbot-scraper-europarl-scraper-24-languages-of-politics-at-your-fingertips.html</id><summary type="html">&lt;p&gt;I participated in a two-day &lt;a href="http://www.meetup.com/PyData-Berlin/events/232774832/?eventId=232774832"&gt;PyDataBerlin Hackathon event&lt;/a&gt; in early-October and decided to build a scraper for European Parliament. This was after I found the &lt;a href="http://www.statmt.org/europarl/"&gt;Europarl parallel corpus&lt;/a&gt; a bit underwhelming as it is messy and not tagged for party, speakers or topic (this is understandable, as it is primarily …&lt;/p&gt;</summary><content type="html">&lt;p&gt;I participated in a two-day &lt;a href="http://www.meetup.com/PyData-Berlin/events/232774832/?eventId=232774832"&gt;PyDataBerlin Hackathon event&lt;/a&gt; in early-October and decided to build a scraper for European Parliament. This was after I found the &lt;a href="http://www.statmt.org/europarl/"&gt;Europarl parallel corpus&lt;/a&gt; a bit underwhelming as it is messy and not tagged for party, speakers or topic (this is understandable, as it is primarily used as a multilingual training corpus for machine-learning translation models).&lt;/p&gt;
&lt;p&gt;At the hackathon, many folks were working on really interesting projects to analyze bias, framing and different word usage depending on party. Since I know a bit of web scraping, I built &lt;a href="https://github.com/kjam/europarl_scraper"&gt;a scraper for the current European Parliament site&lt;/a&gt;. The data from the scraper is also available via &lt;a href="http://s3.eu-central-1.amazonaws.com/europarlspeeches/"&gt;a public bucket on S3&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;All of the folks involved in the hackathon shared their findings at last night's &lt;a href="http://www.meetup.com/PyData-Berlin/events/234668866/"&gt;PyData Berlin meetup&lt;/a&gt;. It was really interesting! Felix Biessmann, David Batista and Jirka Lewandowski all found correlations between word choices and party. I encourage you to check out their slides!&lt;/p&gt;
&lt;p&gt;I hope we can have another PyData Berlin hackathon soon, and that my data can be useful for further research into political language bias. Although I spent a lot of time in my slides making jokes (I don't have much analysis to present, and talking about web scraping is a bit boring), I do strongly believe that democracy is hard, and the more folks who are "good at data" helping to analyze, keep watch and collaborate with those who understand politics, the better.&lt;/p&gt;
&lt;p&gt;Here are my slides, feel free to reach out if you have questions about the data or if you do anything interesting with it! 🙌&lt;/p&gt;
&lt;iframe src="https://docs.google.com/presentation/d/1MZHdgWFovx71z71JkDV35Lk65HAUXj-yInZp6CCgV3c/embed?start=false&amp;loop=false&amp;delayms=3000" frameborder="0" width="960" height="569" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"&gt;&lt;/iframe&gt;</content><category term="hacking"></category></entry><entry><title>Chatbot Scraper: Using (today's) IRC logs as your NLP datasets</title><link href="https://blog.kjamistan.com/chatbot-scraper-using-todays-irc-logs-as-your-nlp-datasets.html" rel="alternate"></link><published>2016-09-29T00:00:00+02:00</published><updated>2016-09-29T00:00:00+02:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2016-09-29:/chatbot-scraper-using-todays-irc-logs-as-your-nlp-datasets.html</id><summary type="html">&lt;p&gt;I dunno about you, but I often find myself bored with NLP (natural language processing) datasets. Too often they are older, based around something that is not particularly interesting to me or something I've analyzed or used before.&lt;/p&gt;
&lt;p&gt;For me, &lt;a href="https://wikipedia.org/wiki/Internet_Relay_Chat"&gt;IRC&lt;/a&gt; has often been a source of community, fun, sometimes …&lt;/p&gt;</summary><content type="html">&lt;p&gt;I dunno about you, but I often find myself bored with NLP (natural language processing) datasets. Too often they are older, based around something that is not particularly interesting to me or something I've analyzed or used before.&lt;/p&gt;
&lt;p&gt;For me, &lt;a href="https://wikipedia.org/wiki/Internet_Relay_Chat"&gt;IRC&lt;/a&gt; has often been a source of community, fun, sometimes trolliness (is that a word yet?) and clearly an interesting source of news / assistance with regards to my work.&lt;/p&gt;
&lt;p&gt;Given the fact that &lt;a href="https://freenode.net/"&gt;freenode&lt;/a&gt; has many publicly logged channels, I decided to see if I could scrape &lt;a href="https://botbot.me"&gt;botbot.me&lt;/a&gt; to get more data for NLP fun.&lt;/p&gt;
&lt;p&gt;After about a day of tinkering and testing, I present &lt;a href="https://github.com/kjam/chatbot_scraper"&gt;chatbot_scraper&lt;/a&gt;. It currently &lt;a href="https://botbot.me/"&gt;only scrapes the public lists for botbot.me&lt;/a&gt;, but if you use a major open-source framework / platform, you'll likely find at least one channel of interest. For me, I'm perusing the docker logs looking for interesting new topics. For you, who knows?! (Although feel free to send interesting things you find!) To get started, take a look at the &lt;code&gt;README.md&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Here is an example run:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;python&lt;span class="w"&gt; &lt;/span&gt;botbot_scraper.py&lt;span class="w"&gt; &lt;/span&gt;--network_name&lt;span class="w"&gt; &lt;/span&gt;freenode&lt;span class="w"&gt; &lt;/span&gt;--chan_name&lt;span class="w"&gt; &lt;/span&gt;docker&lt;span class="w"&gt; &lt;/span&gt;--start_date&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;2016&lt;/span&gt;-08-30&lt;span class="w"&gt; &lt;/span&gt;--end_date&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;2016&lt;/span&gt;-09-05
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;For more info, try the help command:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;python&lt;span class="w"&gt; &lt;/span&gt;botbot_scraper.py&lt;span class="w"&gt; &lt;/span&gt;-h
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I am hoping to expand it for more public chat logs and possibly even slack logging (although I'm unsure what ToS Slack has, probably too constrictive tbh..). That said, let me know if you have suggestions or issues on the &lt;a href="https://github.com/kjam/chatbot_scraper/issues"&gt;issues page&lt;/a&gt; or simply fork and send a pull request!&lt;/p&gt;
&lt;p&gt;Cheers and happy bot-ing!&lt;/p&gt;</content><category term="hacking"></category></entry><entry><title>Automating your Data Cleanup with Python</title><link href="https://blog.kjamistan.com/automating-your-data-cleanup-with-python.html" rel="alternate"></link><published>2016-09-17T00:00:00+02:00</published><updated>2016-09-17T00:00:00+02:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2016-09-17:/automating-your-data-cleanup-with-python.html</id><summary type="html">&lt;p&gt;I gave a talk at &lt;a href="http://2016.pyconuk.org/"&gt;PyCon UK 2016&lt;/a&gt; on automating your data cleanup with Python. I want to again thank the organizers for having me and thank the folks who attended. If you have any questions or are interested in talking about data cleaning problems, feel free to reach out …&lt;/p&gt;</summary><content type="html">&lt;p&gt;I gave a talk at &lt;a href="http://2016.pyconuk.org/"&gt;PyCon UK 2016&lt;/a&gt; on automating your data cleanup with Python. I want to again thank the organizers for having me and thank the folks who attended. If you have any questions or are interested in talking about data cleaning problems, feel free to reach out: katharine at kjamistan or &lt;a href="http://twitter.com/kjam"&gt;on social media&lt;/a&gt;. Here are my slides:&lt;/p&gt;
&lt;iframe src="https://docs.google.com/presentation/d/1HCHh-V8QnZ2vGbDA-95mIAuVNQP9OXCVqknOYMg1wUg/embed?start=false&amp;loop=false&amp;delayms=5000" frameborder="0" width="960" height="749" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"&gt;&lt;/iframe&gt;

&lt;p&gt;And here is the video! :)&lt;/p&gt;
&lt;iframe width="560" height="315" src="https://www.youtube.com/embed/gp-ngPV_ZX8" frameborder="0" allowfullscreen&gt;&lt;/iframe&gt;</content><category term="conferences"></category></entry><entry><title>Embedded *isms in Vector-Based Natural Language Processing</title><link href="https://blog.kjamistan.com/embedded-isms-in-vector-based-natural-language-processing.html" rel="alternate"></link><published>2016-09-16T00:00:00+02:00</published><updated>2016-09-16T00:00:00+02:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2016-09-16:/embedded-isms-in-vector-based-natural-language-processing.html</id><summary type="html">&lt;p&gt;You may have read recently about &lt;a href="http://www.nytimes.com/2016/06/26/opinion/sunday/artificial-intelligences-white-guy-problem.html?_r=0"&gt;machine learning's&lt;/a&gt; &lt;a href="https://www.oreilly.com/learning/how-we-amplify-privilege-with-supervised-machine-learning"&gt;bias problem&lt;/a&gt; particularly in word &lt;a href="https://arxiv.org/abs/1606.06121"&gt;embeddings&lt;/a&gt; and &lt;a href="https://www.technologyreview.com/s/602025/how-vector-space-mathematics-reveals-the-hidden-sexism-in-language/"&gt;vectors&lt;/a&gt;. It's a massive problem. If you are using word embeddings to generate associative words, phrases or to do comparisons, you should be aware of the biases you are introducing into your work. In preparation …&lt;/p&gt;</summary><content type="html">&lt;p&gt;You may have read recently about &lt;a href="http://www.nytimes.com/2016/06/26/opinion/sunday/artificial-intelligences-white-guy-problem.html?_r=0"&gt;machine learning's&lt;/a&gt; &lt;a href="https://www.oreilly.com/learning/how-we-amplify-privilege-with-supervised-machine-learning"&gt;bias problem&lt;/a&gt; particularly in word &lt;a href="https://arxiv.org/abs/1606.06121"&gt;embeddings&lt;/a&gt; and &lt;a href="https://www.technologyreview.com/s/602025/how-vector-space-mathematics-reveals-the-hidden-sexism-in-language/"&gt;vectors&lt;/a&gt;. It's a massive problem. If you are using word embeddings to generate associative words, phrases or to do comparisons, you should be aware of the biases you are introducing into your work. In preparation for &lt;a href="__GHOST_URL__/i-hate-you-nlp/"&gt;my EuroPython talk on machine learning with sentiment analysis&lt;/a&gt;, I came across some disturbing nearest neighbor vectors when using Google's news vectors[^1] in emotionally charged speech; this provoked me to further investigate the bounds of *isms[^2] in word embeddings.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;I must warn you that parts of this post are disgusting, disturbing and awful.&lt;/strong&gt; If you are having a rough day, feel free to save it for another time. If you are already sick of seeing hateful language, this is likely not a post to read at present. That said, I feel it is my duty as a former journalist to look at it, expose it, and hope to spark better conversations around how we handle both implicit and explicit bias and prejudice in our models.&lt;/p&gt;
&lt;p&gt;In my research, not dissimilar to &lt;a href="https://arxiv.org/abs/1606.06121"&gt;Bolukbasi, Chang, Zou, Saligrama and Kalai's findings&lt;/a&gt;, I found word embeddings rife with examples of sexism. Take the following example, &lt;code&gt;model.most_similar(['lady'], topn=20)&lt;/code&gt; produces several expected words, 'woman', 'gentleman', even 'gal' alongside some gems like 'beauty queen', 'FLOTUS' and 'vivacious blonde'. Whereas, &lt;code&gt;model.most_similar(['gentleman'], topn=20)&lt;/code&gt; produces several expected words, 'man', 'gentlemen', 'gent' as well as some flattering terms like 'statesman', 'sportsman' and 'stunningly handsome'.[^3]&lt;/p&gt;
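&lt;p&gt;If you want to poke at these vectors yourself, here is a minimal sketch of how you could load the pretrained news vectors with gensim and run the same kind of queries (the file path is a placeholder for wherever you saved the download described in the first footnote):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;code&gt;# Sketch: load Google's pretrained news vectors and query nearest neighbors.
# The path below is a placeholder; see the first footnote for the download.
from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin.gz", binary=True
)

print(model.most_similar(["lady"], topn=20))
print(model.most_similar(positive=["doctor", "woman"], negative=["man"], topn=5))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;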
&lt;p&gt;To dive a bit deeper into how these biases play out, let's do some standard analogies. We all know &lt;a href="https://www.google.co.uk/search?q=king+queen+word2vec"&gt;the King-Queen comparison&lt;/a&gt;, how might that apply to other professions?&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;In&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;most_similar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;positive&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;doctor&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;woman&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;negative&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;man&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;Out&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;gynecologist&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.7093892097473145&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;nurse&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.647728681564331&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;So, Doctor - Man + Woman = Gynecologist or Nurse. Great! What else?&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;In&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;most_similar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;positive&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;professor&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;woman&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;negative&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;man&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;Out&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;associate_professor&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;0.7771055698394775&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;assistant_professor&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.7558495402336121&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;So, Professor - Man + Woman = Associate / Assistant Professor. Now, for &lt;a href="https://blog.kjamistan.com/obligatory-women-in-tech-post.html"&gt;something near and dear to me...&lt;/a&gt;&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;In&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;most_similar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;positive&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;computer_programmer&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;woman&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;negative&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;man&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;Out&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;homemaker&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.5627118945121765&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;housewife&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.5105047225952148&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;graphic_designer&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.505180299282074&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;So, Computer Programmer - Man + Woman = housewife. Or graphic designer. Because of course women only do design work (never great male designers or amazing female DBAs). Note that these vectors have varying degrees of similarity (the second element of each tuple); the higher the number, the closer the vectors. That said, these are &lt;em&gt;real&lt;/em&gt; responses from word2vec.[^4]&lt;/p&gt;
&lt;p&gt;I hadn't seen much written about word2vec's racist and xenophobic tendencies, but after playing around with sexism, I assumed I would find some. Again, &lt;strong&gt;fair warning that hateful language lies ahead!&lt;/strong&gt;&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt; &lt;span class="n"&gt;In&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;most_similar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;positive&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;immigrant&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;topn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
 &lt;span class="n"&gt;Out&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;immigrants&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.7985076904296875&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Immigrant&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.6984704732894897&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;migrant&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.6784891486167908&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;illegal_immigrant&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.6712934970855713&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;So it only took until the fourth most similar vector for our model to assume our immigrant is illegal. Scanning the rest of the word list, I found some &lt;a href="https://en.wikipedia.org/wiki/Binghamton_shootings"&gt;references to violence&lt;/a&gt; tied to immigrants, but no positive associative words.&lt;/p&gt;
&lt;p&gt;A few searches into African-American and man, I found that 'Negroes' existed not far from 'african_american' and 'black' + 'man'. Taking a look at the other nearest neighbors,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;  &lt;span class="n"&gt;In&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;most_similar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;positive&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Negroes&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;topn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;Out&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;negroes&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.7197504639625549&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;blacks&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.6292858123779297&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Negro&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.5892727375030518&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Blacks&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.5798656344413757&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;negro&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.5609244108200073&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;slaves&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.5548534393310547&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;niggers&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.553610622882843&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Yep. Word2Vec just dropped &lt;a href="https://www.youtube.com/watch?v=nwTejVem4zc"&gt;the N-Word&lt;/a&gt; in the middle of my search. It's clear that &lt;a href="https://www.theguardian.com/technology/2016/mar/30/microsoft-racist-sexist-chatbot-twitter-drugs"&gt;Microsoft isn't the only one with potential racist bot abuse&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;There were many more offensive phrases I found, many of which I didn't save or write down as I could really only stomach 5 minutes at a time of research until I needed a mental and spiritual break. Here are a summary of some I remembered and was able to find again:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;mexicans =&amp;gt; illegals, beaners&lt;/li&gt;
&lt;li&gt;asians =&amp;gt; gooks, wimps&lt;/li&gt;
&lt;li&gt;jews =&amp;gt; kikes&lt;/li&gt;
&lt;li&gt;asian + woman =&amp;gt; teenage girl, sucking dick&lt;/li&gt;
&lt;li&gt;gay + man =&amp;gt; "horribly, horribly deranged"&lt;/li&gt;
&lt;li&gt;transsexual + man =&amp;gt; convicted rapist[^5]&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I'm certain these are not the only *isms that lie in the vectors. Although these offensive vectors are often not the top similar result, we can see that hidden inside these word embeddings are offensive, demeaning, repulsive mirrors of the *isms in our society. Journalists are not always unbiased, and the news itself often contains quotes, references and other pointers to things we might rather not see or confront. Using the news to train our language models, as shown here, exposes them to the *ism-rich underbelly of our society.&lt;/p&gt;
&lt;p&gt;We, as data scientists and computer programmers, should recognize these statistical certainties in our data. I will note that doing similar searches in &lt;a href="https://github.com/idio/wiki2vec"&gt;the Wikipedia vectors&lt;/a&gt; produced far less offensive and hateful speech. I would be curious whether other vector models trained on different texts can help us produce more ethical models for our use, or whether we can confirm findings around &lt;strong&gt;unlearning&lt;/strong&gt; bias[^6].&lt;/p&gt;
&lt;p&gt;Confronting &lt;a href="http://boingboing.net/2015/12/02/racist-algorithms-how-big-dat.html"&gt;racism&lt;/a&gt;, sexism, heteronormativeism and likely many other *isms in our models is not something we can avoid or ignore: it's already here and at work. Taking a raw look at it and determining how we then treat our broken models is a step we will all be forced to take either now or later.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Note: if you find other *isms or are working on anything related to challenging bias in machine learning, I would love to hear from you! Feel free to reach out in the comments, email katharine at kjamistan or &lt;a href="http://twitter.com/kjam"&gt;on social media&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;[^1] To download the model used in this post and read about how the model was developed, go to &lt;a href="https://code.google.com/archive/p/word2vec/"&gt;Google's original word2vec release&lt;/a&gt;. tldr; it was trained on 300 billion words via english-language news articles (on Google News datasets) and contains 300-dimensional vectors for 3 million words.&lt;/p&gt;
&lt;p&gt;[^2] For the purpose of this post, *isms will be used to represent a variety of oppressive societal constructs such as racism, sexism and heterosexism. I am certain there are likely more hidden *isms in word embeddings, as well as more examples of these *isms in the news vectors, in other embedding models and in other languages.&lt;/p&gt;
&lt;p&gt;[^3] Mind you: I was surprised at that one! Indeed, it shows the inherent cultural bias of judging all genders by our looks -- another *ism in our social language.&lt;/p&gt;
&lt;p&gt;[^4] To see the entire code yourself, check out &lt;a href="https://github.com/kjam/random_hackery/blob/master/*isms%20and%20word%20embeddings.ipynb"&gt;my github&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;[^5] I found 'transsexual' via searching for 'transgender'.&lt;/p&gt;
&lt;p&gt;[^6] &lt;a href="https://arxiv.org/abs/1606.06121"&gt;Bolukbasi, Chang, Zou, Saligrama and Kalai's research&lt;/a&gt; was also able to show bias can be expressed as a directional vector(&lt;strong&gt;!!!&lt;/strong&gt;). We could possibly use machine learning to unlearn the aforementioned biases.&lt;/p&gt;</content><category term="research"></category></entry><entry><title>Obligatory Women In Tech Post</title><link href="https://blog.kjamistan.com/obligatory-women-in-tech-post.html" rel="alternate"></link><published>2016-09-16T00:00:00+02:00</published><updated>2016-09-16T00:00:00+02:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2016-09-16:/obligatory-women-in-tech-post.html</id><content type="html">&lt;p&gt;&lt;strong&gt;Question:&lt;/strong&gt; How does it feel to be a woman in tech?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Answer:&lt;/strong&gt;&lt;/p&gt;
&lt;iframe src="https://giphy.com/embed/13Xs7FQmAsqsHS" width="480" height="256" frameBorder="0" class="giphy-embed" allowFullScreen&gt;&lt;/iframe&gt;

&lt;p&gt;&lt;a href="https://giphy.com/gifs/hair-blow-dries-13Xs7FQmAsqsHS"&gt;via GIPHY&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;see also:&lt;/em&gt; &lt;a href="http://www.laweekly.com/arts/geek-chicks-pyladies-a-gang-of-female-computer-programmers-2373431"&gt;OG PyLadies Interview&lt;/a&gt;&lt;/p&gt;</content><category term="life"></category></entry><entry><title>I Hate You, NLP ;)</title><link href="https://blog.kjamistan.com/i-hate-you-nlp.html" rel="alternate"></link><published>2016-07-21T00:00:00+02:00</published><updated>2016-07-21T00:00:00+02:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2016-07-21:/i-hate-you-nlp.html</id><summary type="html">&lt;p&gt;"I had a great time talking about Sentiment Analysis and Natural Language processing at &lt;a href="https://ep2016.europython.eu/"&gt;EuroPython 2016&lt;/a&gt;. Here are my slides for your review, feel free to reach out &lt;a href="https://twitter.com/kjam"&gt;on Twitter&lt;/a&gt; or email if you'd like to chat further about NLP, machine learning and sentiment. I look forward to starting more …&lt;/p&gt;</summary><content type="html">&lt;p&gt;"I had a great time talking about Sentiment Analysis and Natural Language processing at &lt;a href="https://ep2016.europython.eu/"&gt;EuroPython 2016&lt;/a&gt;. Here are my slides for your review, feel free to reach out &lt;a href="https://twitter.com/kjam"&gt;on Twitter&lt;/a&gt; or email if you'd like to chat further about NLP, machine learning and sentiment. I look forward to starting more conversations about how we are handling NLP in open source and sentiment analysis.&lt;/p&gt;
&lt;iframe src="https://docs.google.com/presentation/d/1c9TbcDpxpyjKopY-LOeL49oqYAPrOcQyB6XQjewEPrg/embed?start=false&amp;loop=false&amp;delayms=5000" frameborder="0" width="960" height="749" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"&gt;&lt;/iframe&gt;

&lt;p&gt;And here's &lt;a href="https://www.youtube.com/watch?v=vitEXiOuiEk"&gt;the video&lt;/a&gt;!&lt;/p&gt;
&lt;iframe width="560" height="315" src="https://www.youtube.com/embed/vitEXiOuiEk" frameborder="0" allowfullscreen&gt;&lt;/iframe&gt;</content><category term="conferences"></category></entry><entry><title>Python Flight Search</title><link href="https://blog.kjamistan.com/python-flight-search.html" rel="alternate"></link><published>2016-03-29T00:00:00+02:00</published><updated>2016-03-29T00:00:00+02:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2016-03-29:/python-flight-search.html</id><summary type="html">&lt;p&gt;Like many people, I enjoy travel. With family and friends all across the United States and a home base in Berlin, it's fairly easy to find a reason to travel -- either globally or within the EU. That said, what I find more difficult is to determine what's the best way …&lt;/p&gt;</summary><content type="html">&lt;p&gt;Like many people, I enjoy travel. With family and friends all across the United States and a home base in Berlin, it's fairly easy to find a reason to travel -- either globally or within the EU. That said, what I find more difficult is to determine what's the best way to get from one place to another. I have used &lt;em&gt;many&lt;/em&gt; flight trackers before and generally was happy with the results, but I always wondered if there was more to the flight matrix...&lt;/p&gt;
&lt;p&gt;As I was planning a potential visit to Cuba, many of the "normal" sites were lacking available trips. Since I'm based in Berlin, it's also easy (and cheap -- thanks budget air!) to fly out of Frankfurt, Paris, Amsterdam or London. This usually means setting up countless alert variations on numerous sites.&lt;sup id="fnref:1"&gt;&lt;a class="footnote-ref" href="#fn:1"&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;Being a person who has &lt;a href="https://www.youtube.com/watch?v=p1iX0uxM1w8"&gt;written some scrapers in her time&lt;/a&gt;, I thought I'd at least write one to compare a few of the popular flight search sites. I was curious to know what different options the sites gave and compare if the same flights were listed with different prices.&lt;/p&gt;
&lt;h4 id="diving-into-github"&gt;Diving into GitHub&lt;/h4&gt;
&lt;p&gt;It's always good to see what's out there when you're building something new -- just in case what you're building already exists (or mainly exists). Upon some searching I came across several flight trackers / scrapers written in Python.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/Skyscanner/skyscanner-python-sdk"&gt;FlightScanner's Python SDK&lt;/a&gt; looked great. I applied to get an API Key, and so far haven't heard back.&lt;sup id="fnref:2"&gt;&lt;a class="footnote-ref" href="#fn:2"&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;I found &lt;a href="https://github.com/mayanez/flight_scraper"&gt;a GitHub flight scraper from @mayanez&lt;/a&gt;, but after installing it, I realized it no longer worked. This is a common problem for scrapers, since they usually need constant maintenance to keep functioning: every time a site or API changes, it can break your project.&lt;/p&gt;
&lt;p&gt;I then located Google's flight search API, QPX Express (built on the ITA Matrix technology Google acquired). I registered and created a client in my Google Cloud Developer Console (hint: you must search for the API for it to show up), and perused &lt;a href="https://developers.google.com/qpx-express/v1/trips/search#request"&gt;the search documentation&lt;/a&gt;. It's worth noting this API charges money &lt;a href="https://developers.google.com/qpx-express/v1/pricing"&gt;after the first 50 requests per day&lt;/a&gt;.&lt;/p&gt;
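&lt;p&gt;To give you a feel for it, here's a minimal sketch of the kind of request QPX Express expects. The field names follow the search documentation linked above, but treat this as an illustration rather than a drop-in script -- double-check the docs, and swap in your own key for the &lt;code&gt;YOUR_API_KEY&lt;/code&gt; placeholder:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;import requests

QPX_URL = ('https://www.googleapis.com/qpxExpress/v1/trips/search'
           '?key=YOUR_API_KEY')  # placeholder -- use your own key

# one adult, one-way Berlin (TXL) to San Francisco (SFO)
payload = {
    'request': {
        'passengers': {'adultCount': 1},
        'slice': [{
            'origin': 'TXL',
            'destination': 'SFO',
            'date': '2016-05-01',
        }],
        'solutions': 20,  # how many itineraries to return at most
    }
}

resp = requests.post(QPX_URL, json=payload)
resp.raise_for_status()

# each tripOption is one itinerary, with a total sale price attached
for option in resp.json().get('trips', {}).get('tripOption', []):
    print(option['saleTotal'], option['id'])
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
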
&lt;p&gt;I was interested in comparing Google's flight search with some of the popular ones here in Europe. Momondo was sadly out, with no API and a strict "no automation" policy in their Terms of Service. With some luck, I found that SkyPicker (another great site for low-fare searches) does have &lt;a href="http://docs.skypickerpublicapi.apiary.io/#"&gt;an API with some documentation&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I also found &lt;a href="http://airfinder.de"&gt;airfinder.de&lt;/a&gt;, a popular aggregator here in Germany, has a simple search and no restrictions on automation. I was able to write a scraper to parse responses on their site.&lt;/p&gt;
&lt;p&gt;I've collected the code I wrote &lt;a href="https://github.com/kjam/python_flight_search"&gt;in a repository on GitHub&lt;/a&gt;. Note that there is &lt;em&gt;a lot&lt;/em&gt; more information available in these API responses, so you could easily extend the code to add filtering for your favorite (or least favorite) airlines and airports. I've included a script I used to pull the results into a pandas DataFrame for easy comparison and analysis.&lt;/p&gt;
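&lt;p&gt;A simplified sketch of that idea, with illustrative field names (each API labels its fields differently, so you'd map them into a common shape first):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;import pandas as pd

# pretend these came back from the different searches, already flattened
# into a common shape (one dict per itinerary, per source)
results = [
    {'source': 'qpx', 'price': 1650.00, 'duration_minutes': 980,
     'departure': '2016-05-01 18:30'},
    {'source': 'skypicker', 'price': 1540.00, 'duration_minutes': 1150,
     'departure': '2016-05-01 21:10'},
]

df = pd.DataFrame(results)
df['departure'] = pd.to_datetime(df['departure'])

# quick per-source price summary for the same route and date
print(df.groupby('source')['price'].describe())
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
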
&lt;h4 id="what-i-found"&gt;What I found&lt;/h4&gt;
&lt;p&gt;The first thing I noticed was that, although there were some duplicates, there was definite variance. (Aha! See??? I'm not crazy!) Some of the sites really offered quite a few mixed-carrier flights (usually cheaper but longer routes), while others focused on direct flights. The duplicates I saw were always listed with the same times and prices (Conspiracy theory thwarted... 😢).&lt;/p&gt;
&lt;p&gt;I found a pretty large variance depending on the search input. For the most part, I was searching for flights out of Berlin, attempting to go long distances (America, Asia, the Caribbean). Your mileage may vary (HAAAA..😂😂😂).&lt;/p&gt;
&lt;p&gt;I also wondered how travel time compared to price (the eternal time vs. money question). I assumed the two would show a roughly linear negative correlation, with price decreasing as travel duration increased. I was wrong.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Flight Duration versus Price" src="https://blog.kjamistan.com/images/2016/03/duration_vs_price.png"&gt;&lt;/p&gt;
&lt;p&gt;In addition, I looked at mean prices across time-of-day buckets. I like to take morning flights so I can just get them out of the way… but for this particular flight search (Berlin to San Francisco), that preference is costly:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;departure_tod (mean price)
early am        2495.080635
morning         2459.062500
afternoon       2392.573200
evening         1663.772432
late evening    1544.032000
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
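&lt;p&gt;If you want to reproduce something like this, one way is to bin the departure hour and take the mean price per bucket -- roughly like the snippet below (continuing the DataFrame from the earlier sketch; the bucket edges are just a guess at reasonable cut-offs):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;def time_of_day(hour):
    """Map a departure hour onto a rough time-of-day bucket."""
    if hour &lt; 8:
        return 'early am'
    if hour &lt; 12:
        return 'morning'
    if hour &lt; 17:
        return 'afternoon'
    if hour &lt; 21:
        return 'evening'
    return 'late evening'

# df: one row per itinerary, with parsed 'departure' datetimes and prices
df['departure_tod'] = df['departure'].dt.hour.map(time_of_day)
print(df.groupby('departure_tod')['price'].mean().sort_values(ascending=False))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;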

&lt;p&gt;There are plenty of other questions to ask and answer with this dataset, so feel free to play around with your own searches, or let me know if there's anything in particular you'd like me to explore.&lt;sup id="fnref:3"&gt;&lt;a class="footnote-ref" href="#fn:3"&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;For now, I have a solid way to compare prices across a few aggregators, and some new flight search tools to use going forward.&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;&lt;a href="http://gif.co/oWqd.gif"&gt;My Feelings about this.&lt;/a&gt;&amp;#160;&lt;a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;To be fair, they do have a note that they get thousands of requests and cannot fulfill all of them. If you have a business need for their API, I'm fairly certain you could get an API Key much faster.&amp;#160;&lt;a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:3"&gt;
&lt;p&gt;I'm hoping to write some price comparison over time blog posts from this data, so let me know if you have any specific questions.&amp;#160;&lt;a class="footnote-backref" href="#fnref:3" title="Jump back to footnote 3 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="hacking"></category></entry><entry><title>Data Wrangling with Python Course</title><link href="https://blog.kjamistan.com/data-wrangling-with-python-course.html" rel="alternate"></link><published>2016-02-29T00:00:00+01:00</published><updated>2016-02-29T00:00:00+01:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2016-02-29:/data-wrangling-with-python-course.html</id><summary type="html">&lt;p&gt;I'll be in New York on July 13th and 14th, teaching how to "big data" with Python. We'll cover Pandas, Hadoop, PySpark and more on automation, acquisition and managing your data.&lt;/p&gt;
&lt;h3 id="next-course-new-york-city-july-13-14"&gt;Next Course: New York City, July 13-14&lt;/h3&gt;
&lt;p&gt;&lt;a href="https://www.eventbrite.co.uk/e/learn-big-data-wrangling-with-python-tickets-24220425946"&gt;Tickets are available on Eventbrite&lt;/a&gt; with a special Early Bird and Student …&lt;/p&gt;</summary><content type="html">&lt;p&gt;I'll be in New York on July 13th and 14th, teaching how to "big data" with Python. We'll cover Pandas, Hadoop, PySpark and more on automation, acquisition and managing your data.&lt;/p&gt;
&lt;h3 id="next-course-new-york-city-july-13-14"&gt;Next Course: New York City, July 13-14&lt;/h3&gt;
&lt;p&gt;&lt;a href="https://www.eventbrite.co.uk/e/learn-big-data-wrangling-with-python-tickets-24220425946"&gt;Tickets are available on Eventbrite&lt;/a&gt; with a special Early Bird and Student discount. If you don't want to use Eventbrite and would like to pay via invoice instead, please make a note in the form comments.&lt;/p&gt;
&lt;p&gt;If you want to attend, &lt;em&gt;please fill out the form below&lt;/em&gt; and let me know more about what you're hoping to learn. I like to modify the course once I know more about the students, so that you can have a tailored experience and I can make sure it's engaging and interesting.&lt;/p&gt;</content><category term="trainings"></category></entry><entry><title>Data Wrangling with Python</title><link href="https://blog.kjamistan.com/data-wrangling-with-python.html" rel="alternate"></link><published>2015-11-01T00:00:00+01:00</published><updated>2015-11-01T00:00:00+01:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2015-11-01:/data-wrangling-with-python.html</id><summary type="html">&lt;p&gt;Just a quick note that my book: Data Wrangling with Python is available for &lt;a href="http://www.amazon.com/Data-Wrangling-Python-Jacqueline-Kazil/dp/1491948817/ref=sr_1_1?s=books&amp;amp;ie=UTF8&amp;amp;qid=1445422551&amp;amp;sr=1-1&amp;amp;keywords=katharine+jarmul"&gt;prepurchase on Amazon&lt;/a&gt; as well as in &lt;a href="http://shop.oreilly.com/product/0636920032861.do"&gt;early release on O'Reilly's web site&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Data Wrangling with Python" src="http://ecx.images-amazon.com/images/I/51qWQ75%2BCXL._SX379_BO1\n,204\n,203\n,200_.jpg"&gt;&lt;/p&gt;
&lt;p&gt;Pick up a copy for less than the full price now. I'll be posting some examples of problems we work through in the book …&lt;/p&gt;</summary><content type="html">&lt;p&gt;Just a quick note that my book: Data Wrangling with Python is available for &lt;a href="http://www.amazon.com/Data-Wrangling-Python-Jacqueline-Kazil/dp/1491948817/ref=sr_1_1?s=books&amp;amp;ie=UTF8&amp;amp;qid=1445422551&amp;amp;sr=1-1&amp;amp;keywords=katharine+jarmul"&gt;prepurchase on Amazon&lt;/a&gt; as well as in &lt;a href="http://shop.oreilly.com/product/0636920032861.do"&gt;early release on O'Reilly's web site&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Data Wrangling with Python" src="http://ecx.images-amazon.com/images/I/51qWQ75%2BCXL._SX379_BO1\n,204\n,203\n,200_.jpg"&gt;&lt;/p&gt;
&lt;p&gt;Pick up a copy for less than the full price now. I'll be posting some examples of problems we work through in the book in the coming weeks, as well as some classes in Europe where you can learn in person, so stay tuned!&lt;/p&gt;
&lt;p&gt;Also be on the lookout for my &lt;a href="http://kjamistan.com/upcoming-courses"&gt;upcoming courses&lt;/a&gt; to learn applied Data Wrangling via intensive weekend-long trainings.&lt;/p&gt;</content><category term="books"></category></entry><entry><title>Europython 2015</title><link href="https://blog.kjamistan.com/europython-2015.html" rel="alternate"></link><published>2015-07-23T00:00:00+02:00</published><updated>2015-07-23T00:00:00+02:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2015-07-23:/europython-2015.html</id><content type="html">&lt;h3 id="introduction-to-data-analysis-tutorial"&gt;Introduction to Data Analysis Tutorial&lt;/h3&gt;
&lt;p&gt;Want to learn how to analyze data using Python? If you're at #EuroPython, you should drop by my course! If not, watch the video online later today (I will post the link!)&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.haikudeck.com/p/b8T4gEIWvi/introduction-to-data-analysis---europython-2015"&gt;Slides&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/kjam/data-wrangling-pycon"&gt;Repo&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ipynb.kjamistan.com:8888"&gt;Notebooks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://bit.ly/data-class-feedback"&gt;Feedback&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</content><category term="conferences"></category></entry></feed>