<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>kjam's blog</title><link href="https://blog.kjamistan.com/" rel="alternate"></link><link href="https://blog.kjamistan.com/feeds/atom.xml" rel="self"></link><id>https://blog.kjamistan.com/</id><updated>2026-04-07T00:00:00+02:00</updated><entry><title>Using Claude Code with Locally-Hosted models</title><link href="https://blog.kjamistan.com/using-claude-code-with-locally-hosted-models.html" rel="alternate"></link><published>2026-04-07T00:00:00+02:00</published><updated>2026-04-07T00:00:00+02:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2026-04-07:/using-claude-code-with-locally-hosted-models.html</id><summary type="html">&lt;p&gt;I've been exploring privacy and security aspects of AI-assisted coding and also experimenting with those workflows for my own work. In doing so, I've got a pretty robust setup for using Claude Code with both the Anthropic backend and a locally hosted GPU machine in my &lt;a href="https://blog.kjamistan.com/building-out-my-home-ai-lab-for-private-and-local-ai.html"&gt;at home AI lab …&lt;/a&gt;&lt;/p&gt;</summary><content type="html">&lt;p&gt;I've been exploring privacy and security aspects of AI-assisted coding and also experimenting with those workflows for my own work. In doing so, I've got a pretty robust setup for using Claude Code with both the Anthropic backend and a locally hosted GPU machine in my &lt;a href="https://blog.kjamistan.com/building-out-my-home-ai-lab-for-private-and-local-ai.html"&gt;at home AI lab&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I thought it would be useful to put together a guide for others on how to get started using Claude Code with a local-first setup.&lt;/p&gt;
&lt;div class="toc"&gt;&lt;span class="toctitle"&gt;Table of Contents&lt;/span&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#why-though"&gt;Why, though?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#system-setup"&gt;System setup&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#claude-code-intricacies"&gt;Claude Code intricacies&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#change-your-environment-variable-for-anthropic_"&gt;Change your environment variable for ANTHROPIC_*&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#set-the-model-name-when-launching-claude-code"&gt;Set the model name when launching Claude Code&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#other-settings-that-can-come-in-handy"&gt;Other settings that can come in handy&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#model-serving-llamacpp-or-vllm"&gt;Model serving: llama.cpp or vllm?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#model-serving-which-models"&gt;Model serving: which models?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#putting-it-all-together"&gt;Putting it all together&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#privacy-and-security-advice"&gt;Privacy and Security Advice&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#sandboxing-101"&gt;Sandboxing 101&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#synthetic-data-generation"&gt;Synthetic Data Generation&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#fully-synthetic-data"&gt;Fully synthetic data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#statistical-methods-for-synthetic-data"&gt;Statistical methods for synthetic data&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#spying-on-claude-code"&gt;Spying on Claude Code&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;h2 id="why-though"&gt;Why, though?&lt;/h2&gt;
&lt;p&gt;Let's just cover why first, because I'm not here to tell you what model you should or shouldn't use. :) I am here, however, to tell you that Claude uses A LOT more tokens than the other models I'm using locally and takes about the same time to respond.&lt;/p&gt;
&lt;p&gt;Of course, if you are already paying for Claude and you like it, keep doing you! But, if you are curious about local-first design or want to experiment as you are reaching your monthly token-limit, I think it's always a good idea to try out new things and see if they work for you.&lt;/p&gt;
&lt;p&gt;In addition, being a privacy researcher, there are serious privacy benefits to keeping your data local and choosing what data goes into Anthropic (or other) servers. Testing this workflow out might help you determine when sending the data to Anthropic is worth it and when not. :)&lt;/p&gt;
&lt;h2 id="system-setup"&gt;System setup&lt;/h2&gt;
&lt;p&gt;I have a &lt;a href="https://blog.kjamistan.com/building-out-my-home-ai-lab-for-private-and-local-ai.html"&gt;much longer article&lt;/a&gt; and &lt;a href="https://youtu.be/3h_JCBVnHBI"&gt;YouTube video&lt;/a&gt; on getting your local AI lab set up, but you'll need to have a pretty beefy GPU or similar compute if you want your local models to compete with the rapid response and code quality of the cloud-based models.&lt;/p&gt;
&lt;p&gt;For my setup I have 32GB of GPU memory. This means I can load several of the larger quantized models directly onto the GPU without a problem. My favorite coding buddy right now is Qwen-3.5-35B (quantized), but I've also tested out Qwen3-Coder, DeepSeek3 and GLM-Flash. You can have a look on &lt;a href="https://huggingface.co/models?other=code"&gt;HuggingFace&lt;/a&gt; to get an idea of what models might fit on your computer.&lt;/p&gt;
&lt;p&gt;I'm pretty certain that with a smaller card (or chip) you'll run into latency issues or even be unable to use certain models. I don't think that should deter you from giving it a try, if only to build a workflow you can use once you get a bigger machine or when models get even more task-specific and smaller.&lt;/p&gt;
&lt;p&gt;I will also be getting some other chips later this year, looking at you &lt;a href="https://tenstorrent.com/"&gt;Tenstorrent&lt;/a&gt; 😏. These will be cheaper than my current GPU and have about the same amount of memory. The catch will be that the setup might be a bit harder... Wanna get updates on how it goes? Give me a &lt;a href="https://www.youtube.com/@ProbablyPrivate"&gt;follow on YouTube&lt;/a&gt;, &lt;a href="https://probablyprivate.com/"&gt;join my newsletter&lt;/a&gt; or &lt;a href="https://www.linkedin.com/in/katharinejarmul/"&gt;find me on LinkedIn&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="claude-code-intricacies"&gt;Claude Code intricacies&lt;/h2&gt;
&lt;p&gt;The initial catches for getting Claude Code to communicate with a local model were not well documented, so here are the steps:&lt;/p&gt;
&lt;h3 id="change-your-environment-variable-for-anthropic_"&gt;Change your environment variable for ANTHROPIC_*&lt;/h3&gt;
&lt;p&gt;Here's a brief breakdown of the settings you should set, either in your Claude settings.json file or in your environment. I use environment variables because then I can switch models quickly. I point the BASE_URL to my local GPU server instead of localhost, but adjust it for your setup. For example, you can also point it to an ollama server running locally.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="go"&gt;ANTHROPIC_BASE_URL=http://YOUR_IP_ADDRESS:PORT&lt;/span&gt;
&lt;span class="go"&gt;ANTHROPIC_DEFAULT_OPUS_MODEL=my-model&lt;/span&gt;
&lt;span class="go"&gt;ANTHROPIC_DEFAULT_SONNET_MODEL=my-model&lt;/span&gt;
&lt;span class="go"&gt;ANTHROPIC_DEFAULT_HAIKU_MODEL=my-model&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;In most shells, you can do so by running the following:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="go"&gt;export ANTHROPIC_BASE_URL=http://localhost:8000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;And reset back to the Anthropic backend by running:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="go"&gt;unset ANTHROPIC_BASE_URL&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3 id="set-the-model-name-when-launching-claude-code"&gt;Set the model name when launching Claude Code&lt;/h3&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="go"&gt;claude --model qwen35_35&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Note: this is tricky when using models from HuggingFace because Claude Code deliberately doesn't allow slashes in the model name. This means that you also have to set up your GPU server with special model names. More on that in the next section.&lt;/p&gt;
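&lt;p&gt;For example, assuming you've given the model a slash-free alias on the server side (like the &lt;code&gt;glm_4_flash&lt;/code&gt; alias in the llama.cpp config further down), you'd pass that alias to Claude Code instead of the full HuggingFace repo name:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;# the server maps this alias to unsloth/GLM-4.7-Flash-GGUF:UD-Q4_K_XL (see config below)
claude --model glm_4_flash
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
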
&lt;h3 id="other-settings-that-can-come-in-handy"&gt;Other settings that can come in handy&lt;/h3&gt;
&lt;p&gt;You may want to setup specific things on your server, such as thinking through caching&lt;sup id="fnref:1"&gt;&lt;a class="footnote-ref" href="#fn:1"&gt;1&lt;/a&gt;&lt;/sup&gt;, context size and tool call support. Much of this will depend on how you serve, what type of machine you have and what models you want to use.&lt;/p&gt;
&lt;p&gt;That said, Claude Code is VERY verbose, so if you are intent on using the Claude Code interface, I recommend you increase context size so that you have more flexibility. So far, setting context at 131072 tokens (server-side setting) has worked well for me, but for a few longer-running tasks, I've bumped it up to 256000.&lt;/p&gt;
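&lt;p&gt;Where that context size gets set depends on your server. As one example with the llama.cpp server used later in this post, you can pass it on the command line; the same value can also live in the config file as &lt;code&gt;ctx-size&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;./llama.cpp/build/bin/llama-server -m PATH_TO_MODEL.gguf --ctx-size 131072
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
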
&lt;h2 id="model-serving-llamacpp-or-vllm"&gt;Model serving: llama.cpp or vllm?&lt;/h2&gt;
&lt;p&gt;For most of my work with LLMs and VLMs I use vllm (check out &lt;a href="https://www.youtube.com/watch?v=k930Mtf_rLk"&gt;my video on using vllm&lt;/a&gt;). It's easy to get started with and it has a bunch of out-of-the-box performance upgrades. vllm has a guide on &lt;a href="https://docs.vllm.ai/en/latest/serving/integrations/claude_code/#configuring-claude-code"&gt;getting started with Claude Code&lt;/a&gt; which is pretty straightforward to use for your own setup.&lt;/p&gt;
&lt;p&gt;However, I had heard really good things about serving quantized models with llama.cpp. I also know that llama.cpp powers a good part of the local model serving ecosystem (ollama, for example, builds on it), so why not go to the source?&lt;/p&gt;
&lt;p&gt;Getting llama.cpp compiled for my GPU was a bit of a challenge, but I eventually found a guide with the right flags for my GPU&lt;sup id="fnref:2"&gt;&lt;a class="footnote-ref" href="#fn:2"&gt;2&lt;/a&gt;&lt;/sup&gt;. There are also instructions on how to just run pre-built binaries &lt;a href="https://github.com/ggml-org/llama.cpp?tab=readme-ov-file"&gt;in the llama.cpp repository&lt;/a&gt;. However, I wanted to compile it to make sure it was using my GPU architecture to the best of its abilities.&lt;/p&gt;
&lt;p&gt;This is &lt;a href="https://www.glukhov.org/llm-hosting/llama-cpp/"&gt;a pretty good walkthrough on steps to get started&lt;/a&gt; should you run into issues or want to follow a step-by-step guide.&lt;/p&gt;
&lt;p&gt;Once you have llama.cpp compiled and running, I recommend setting up a config file, so you can run llama-server once and serve multiple models.&lt;/p&gt;
&lt;p&gt;Example config.ini file with one model:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;[*]&lt;/span&gt;
&lt;span class="c1"&gt;# Global settings&lt;/span&gt;
&lt;span class="na"&gt;jinja&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;

&lt;span class="k"&gt;[glm_4_flash]&lt;/span&gt;
&lt;span class="na"&gt;hf-repo&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;unsloth/GLM-4.7-Flash-GGUF:UD-Q4_K_XL&lt;/span&gt;
&lt;span class="na"&gt;jinja&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;
&lt;span class="na"&gt;temp&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;0.7&lt;/span&gt;
&lt;span class="na"&gt;ctx-size&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;131072&lt;/span&gt;
&lt;span class="na"&gt;top-p&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;0.9&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;# originally 1&lt;/span&gt;
&lt;span class="na"&gt;min-p&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;0.01&lt;/span&gt;
&lt;span class="na"&gt;fa&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;auto&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;To run the server, run the following, with the file path updated to point to your configuration file:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="go"&gt;./llama.cpp/build/bin/llama-server --models-preset PATH_TO_CONFIG_FILE --sleep-idle-seconds 300 --host 0.0.0.0 --port 9999&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2 id="model-serving-which-models"&gt;Model serving: which models?&lt;/h2&gt;
&lt;p&gt;Whether you're using vllm or llama.cpp (or something else), you'll need to choose what models to use. Here are some models I've tried so far that I can recommend:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href="https://huggingface.co/unsloth/Qwen3.5-35B-A3B-GGUF"&gt;Qwen 3.5 - 35B quantized by Unsloth&lt;/a&gt;: This has become my current goto coding companion. I would say it is certainly more generalist and relatively good at initial planning. Comparing it to Claude Opus is of course a stretch (it's probably a 10th of the size with a 10th of the information!), but if you're up for iterating and clarifying, you can get the same end result with a bit more of your own effort.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href="https://unsloth.ai/docs/models/glm-4.7-flash"&gt;GLM Flash&lt;/a&gt;: I'm just getting started integrating this one into my workflows. So far I really like it for debugging, but it's probably powerful at other things that I haven't tested it on yet. Will keep you updated as I get to know the performance better via more testing.  &lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href="https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF"&gt;Qwen3-Coder-Next (quantized)&lt;/a&gt;: This was the model I started with, but for my workflows it seemed like it didn't match as well. However, I am not a software engineer! So I wanted to mention it because I think if you are writing software maybe this model is worth testing out.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href="https://huggingface.co/google/gemma-3-27b-it-qat-q4_0-gguf"&gt;Gemma3-27B (quantized)&lt;/a&gt;: This one I've used for doing things like rewriting documentation, texts and project history/planning. I really like pairing with Gemma on text-workflows, and this is a pretty powerful text-to-text model.  &lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;You'll note right now that I've leaned heavily on Unsloth quantized models. I will definitely be comparing them with other quantized models soon, but I had to start somewhere and I thought I'd do so methodically. If you have any quantized models you prefer, please feel free to reach out.&lt;/p&gt;
&lt;p&gt;Another good way to get to know what models are useful is to take a look at what's trending or what has a lot of downloads on HuggingFace. Just note that ANYONE can upload a model to HuggingFace, so just like you wouldn't install a random software package off the internet without verifying it's not malware, don't install a random HuggingFace model without verifying who built it (!!).&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;If you are serving models on the open internet (i.e. not across your local network), use the --api-key flag to authenticate your requests. Otherwise you're just giving people trolling the internet free compute and potentially asking for much more serious privacy problems. :)&lt;/p&gt;
&lt;/blockquote&gt;
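&lt;p&gt;As a rough sketch with the llama.cpp server (check your own server's documentation for the equivalent), that means starting the server with a key and sending it as a bearer token with each request:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;# server side: require a token for all requests
./llama.cpp/build/bin/llama-server -m PATH_TO_MODEL.gguf --api-key YOUR_SECRET_KEY

# client side: requests without the token should now be rejected
curl http://YOUR_IP_ADDRESS:PORT/v1/models -H "Authorization: Bearer YOUR_SECRET_KEY"
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
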
&lt;p&gt;I'll try to keep this list updated, but if you see a model that you think would fit well, feel free to reach out and tell me about it.&lt;/p&gt;
&lt;h2 id="putting-it-all-together"&gt;Putting it all together&lt;/h2&gt;
&lt;p&gt;Here are the steps to get it all running:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Decide on a model you'd like to start with that also fits on your machine.&lt;/li&gt;
&lt;li&gt;Decide between vllm or llama.cpp and get that installed and running with your model of choice.&lt;/li&gt;
&lt;li&gt;Test out the connection and model name with a simple curl request to the serving machine (&lt;a href="https://gist.github.com/kjam/9f39cd69f340832d322ea27e10fabb35"&gt;example&lt;/a&gt;; see also the sketch after this list).&lt;/li&gt;
&lt;li&gt;If all worked so far, set your environment variables and get Claude Code running with --model [YOUR MODEL HERE].&lt;/li&gt;
&lt;li&gt;Send over your first prompt!&lt;/li&gt;
&lt;/ol&gt;
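&lt;p&gt;For step 3, here's a minimal sketch of such a test against the OpenAI-compatible chat endpoint that both llama.cpp and vllm expose; the model name assumes the &lt;code&gt;glm_4_flash&lt;/code&gt; alias from the config above, so swap in your own:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;curl http://YOUR_IP_ADDRESS:PORT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm_4_flash",
    "messages": [{"role": "user", "content": "Say hello in one short sentence."}]
  }'
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
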
&lt;p&gt;I hope you enjoy getting everything set up and running to try out local-only Claude Code usage. But don't stop there, because even though it's local, that doesn't mean it's always secure!&lt;/p&gt;
&lt;h2 id="privacy-and-security-advice"&gt;Privacy and Security Advice&lt;/h2&gt;
&lt;p&gt;It wouldn't be very on-brand of me not to talk about the privacy and security of using an AI coding assistant (even if it is using a local model), so let's dive into some basics that are useful to know.&lt;/p&gt;
&lt;h3 id="sandboxing-101"&gt;Sandboxing 101&lt;/h3&gt;
&lt;p&gt;Claude Code ships with a sandbox, but so far I've been very underwhelmed by its ability to block commands.&lt;sup id="fnref:3"&gt;&lt;a class="footnote-ref" href="#fn:3"&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Tip: For security best practices, check out Rich Harang's article on &lt;a href="https://developer.nvidia.com/blog/practical-security-guidance-for-sandboxing-agentic-workflows-and-managing-execution-risk/"&gt;Sandboxing&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I've outlined what the documentation says in this section, but so far some of the deny rules seem to be ignored. I will keep this section here because I hope that Claude Code eventually fixes the bugs, but my real advice is to launch Claude Code within a VM and only put files there that it's fine to read/write/manipulate at will.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Look into tool and file permissions.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;To check what your default permissions are, you can run:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="go"&gt;/permissions&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;You can change these in your Claude settings.json file. It is useful to start with more restrictive file settings (e.g. allowRead and allowWrite with a small list). You can combine allow/deny for more granularity.&lt;/p&gt;
&lt;p&gt;For example, the following settings have a mixture of allow and deny, which clarifies &lt;em&gt;exactly&lt;/em&gt; what local files in the project folder can be written, read and used.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;sandbox&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;more_sandbox_settings_here&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;the following is just a snippet !&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;filesystem&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;allowWrite&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;.&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;denyWrite&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;.env&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;settings/config.json&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;denyRead&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;./secrets/*&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;.env*&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Warning! This only denies the Read tool, but Claude Code + friends can still run &lt;code&gt;cat&lt;/code&gt;, &lt;code&gt;less&lt;/code&gt;, etc. I would not rely on this as full protection.&lt;/p&gt;
&lt;p&gt;You can also add tool-specific permissions, which can help especially if you want to expand the tools used. That syntax looks similar and has similar specificity rules. Note that these permissions currently live on the same key/value level as sandbox, but that can change so please check the latest documentation.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;quot;permissions&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;allow&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Bash(python *)&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Bash(ls *)&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;deny&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Bash(curl *)&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Bash(cat *)&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Bash(aws *)&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Read(**/.env)&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Read(**/.env.*)&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Read(**/secrets/**)&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Again, warning! So far I've been able to get around some of these deny lists by cleverly asking for other tools. This is by no means foolproof or actually sandboxed. &amp;gt;.&amp;lt;&lt;/p&gt;
&lt;p&gt;To state again, I believe a strong VM solution or file encryption is probably the only way to actually block reads or other bash commands.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Because sandbox settings and defaults might change, I recommend checking any of this advice against &lt;a href="https://code.claude.com/docs/en/sandboxing"&gt;Claude Code documentation&lt;/a&gt; on sandboxing.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;ol start="2"&gt;
&lt;li&gt;Check out the operating system, network and managed settings.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Outside of tool calls and file permissions, you might want to ensure that operating system and networking controls are in place. By default, processes that Claude Code spawns within a sandbox inherit the same sandbox properties; however, there is one caveat:&lt;/p&gt;
&lt;p&gt;&lt;img alt="A warning note from Claude Code documentation that explicitly states by default if something isn't working that might be related to the sandbox, that Claude will send a request to violate sandbox policies. You have to change your settings to disable this." src="/images/2026/claude_escape_hatch.png"&gt;&lt;/p&gt;
&lt;p&gt;So you might want to change that setting immediately. While you're at it, you can also decide whether you want to send telemetry data by setting CLAUDE_CODE_ENABLE_TELEMETRY to 0 or 1.&lt;/p&gt;
&lt;p&gt;In addition, there are networking rules you can set, such as approved or denied domains, and whether the sandbox can connect to local hosts and Unix sockets. Note that the more you allow for networking, the more security risk you open up. Of course, you need to find a balance between denying everything and allowing everything, and if you are running this on an organization laptop/computer, I recommend bothering the security team to take a look and make some recommendations (if they haven't already).&lt;/p&gt;
&lt;p&gt;If you are working in a security or engineering leadership team, you're probably interested in both the &lt;a href="https://code.claude.com/docs/en/permissions#managed-settings"&gt;managed settings&lt;/a&gt; and &lt;a href="https://code.claude.com/docs/en/devcontainer"&gt;devcontainers&lt;/a&gt;. For organization-wide settings, these can be used to override local settings and hopefully create a secure and private baseline across the organization.  &lt;/p&gt;
&lt;ol start="3"&gt;
&lt;li&gt;Containerize (or use VMs, jails or similar controls especially when running in a secure environment)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The Claude Code sandbox is not a true sandbox, so containerize Claude Code (or any AI system) when running it in a secure, production or server environment that may hold sensitive data or other valuable targets.&lt;/p&gt;
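&lt;p&gt;As a rough starting point (the image, paths and model alias here are placeholders for whatever your setup uses), you could run Claude Code inside a throwaway container that only sees a single project directory and talks to your local model server:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;docker run --rm -it \
  -v "$PWD/my-project:/workspace" \
  -w /workspace \
  node:22-bookworm bash

# inside the container:
npm install -g @anthropic-ai/claude-code
export ANTHROPIC_BASE_URL=http://YOUR_IP_ADDRESS:PORT
claude --model glm_4_flash
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This doesn't make the container escape-proof, but it does limit which files are even visible to the agent.&lt;/p&gt;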
&lt;p&gt;This also means starting to test your security assumptions about your environment, for example via a security audit, pen-testing or threat modeling plus red teaming, for more trust and an overall better security posture.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;If you are interested in starting to learn security vulnerabilities in AI systems, check out my free YouTube course on &lt;a href="https://www.youtube.com/playlist?list=PLJkNSeYcYBlC88vkG58yx3fHSobmCmDw_"&gt;Purple Teaming AI Systems&lt;/a&gt;. I offer internal trainings and hackdays, so &lt;a href="mailto:katharine@kjamistan.com"&gt;email me&lt;/a&gt; if your team might want in-house training.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;For my setup I went so far as to get a separate computer to run Claude Code. I connect to it from my main work machine and it connects to my GPU machine over the local network. I move files that I've thoroughly tested and reviewed back to my main machine. Of course, this is my setup because I am also testing how to break security controls, and that's not a very good idea to do on your main machine. :)&lt;/p&gt;
&lt;h3 id="synthetic-data-generation"&gt;Synthetic Data Generation&lt;/h3&gt;
&lt;p&gt;In many coding situations you might need example data to complete the exercise. Too often, organizations' test data is real data sampled from a production environment. Instead, it's advisable to build synthetic data that meets your testing and LLM requirements.&lt;/p&gt;
&lt;h4 id="fully-synthetic-data"&gt;Fully synthetic data&lt;/h4&gt;
&lt;p&gt;The safest option is to build out fully synthetic data using deterministic libraries to do so. For example, in Python the &lt;a href="https://faker.readthedocs.io/en/master/"&gt;Faker library&lt;/a&gt; is a common choice for building out fully synthetic data.&lt;/p&gt;
&lt;p&gt;To build out data like this, you only need to know the general data types and any relationships that must be maintained, and then follow the library documentation on how to build out such datasets. If you're also looking to more thoroughly test your code, you might want to add in &lt;a href="https://en.wikipedia.org/wiki/Property_testing"&gt;property-based testing&lt;/a&gt;.&lt;/p&gt;
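&lt;p&gt;Here's a minimal sketch of what that looks like with Faker; the schema (name, email, city, signup date, monthly spend) is made up for illustration, so swap in whatever your tests actually need:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;# Fully synthetic rows built with Faker -- a hypothetical customer schema.
import random
from faker import Faker

fake = Faker()
Faker.seed(42)   # make the synthetic data reproducible
random.seed(42)

def synthetic_customer() -&gt; dict:
    return {
        "name": fake.name(),
        "email": fake.email(),
        "city": fake.city(),
        "signup_date": fake.date_between(start_date="-2y", end_date="today").isoformat(),
        "monthly_spend": round(random.uniform(5.0, 500.0), 2),
    }

rows = [synthetic_customer() for _ in range(100)]
print(rows[0])
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
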
&lt;p&gt;Why is it fully synthetic? Well, it only uses the data types and potential business logic, and no other information from real records, to generate data. However, this might not be what you need depending on what you're doing with your feature or software.&lt;/p&gt;
&lt;p&gt;Sometimes you might need data that's informed by the actual statistics and distributions in your real data. If that's the case, you'll want to choose a statistical method for producing synthetic data.&lt;/p&gt;
&lt;h4 id="statistical-methods-for-synthetic-data"&gt;Statistical methods for synthetic data&lt;/h4&gt;
&lt;p&gt;If you need data that has particular properties related to your real production data, you'll most likely choose a statistical method for generating your data.&lt;/p&gt;
&lt;p&gt;First and foremost, it's important to create an understanding of the privacy, testing and statistical requirements for the data. You'll want to engage anyone at your organization who might lead such efforts before building out modeling for the data.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I offer internal workshops and trainings on &lt;a href="https://kjamistan.com"&gt;evaluating privacy enhancing technologies for synthetic data generation&lt;/a&gt; should you want to roll out such programs organization-wide.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;There's always a privacy tradeoff between properly representing the data and leaking information that could lead to reidentification of real individuals. Not convinced? Check out &lt;a href="https://www.usenix.org/conference/pepr24/presentation/desfontaines"&gt;Damien Desfontaines' USENIX talk&lt;/a&gt; on privacy leakage in synthetic data generator products. For this reason, it's important to gradually build out synthetic data and do so informed by real privacy requirements and dangers.&lt;/p&gt;
&lt;p&gt;If you just need to make sure that two fields are appropriately linked (e.g. regional address and phone number matching), or if you need a number or attribute to lie in a particular distribution range (e.g. bounded within your real distributions), I would err on the side of choosing a simple method like the fully synthetic approach and then building out modifications to alter any rows or entries that violate those rules.&lt;/p&gt;
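&lt;p&gt;As a tiny illustration of that "simple method plus modifications" idea (the field name and bounds here are hypothetical, agreed-upon requirements rather than raw production statistics), you can post-process fully synthetic rows so that anything outside the allowed range gets clamped back in:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;# Post-processing step: the rows could come from the Faker sketch above.
rows = [
    {"name": "Alex Example", "monthly_spend": 999.0},
    {"name": "Sam Sample", "monthly_spend": 42.0},
]

ALLOWED_SPEND_RANGE = (10.0, 300.0)  # agreed-upon business bounds, not raw production stats

def enforce_bounds(row: dict) -&gt; dict:
    low, high = ALLOWED_SPEND_RANGE
    row["monthly_spend"] = min(max(row["monthly_spend"], low), high)
    return row

rows = [enforce_bounds(row) for row in rows]
print(rows)  # the 999.0 entry is clamped to 300.0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
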
&lt;p&gt;However, if your requirements for linkage and attributes become more intertwined, there are some interesting methods that have been developed. In &lt;a href="https://arxiv.org/abs/2108.04978"&gt;research that won a synthetic data generation challenge from NIST&lt;/a&gt;, the authors took certain attribute-based samples to develop a graph of data relationships. They then applied differential privacy to their distribution samples and were able to produce well-performing synthetic data with very strong guarantees.&lt;/p&gt;
&lt;p&gt;There are also deep-learning based synthetic generation libraries that can "learn" or identify your data properties and generate data based on those properties. I want to remind you of the &lt;a href="https://blog.kjamistan.com/a-deep-dive-into-memorization-in-deep-learning.html"&gt;memorization problems&lt;/a&gt; related to deep learning models, especially ones that fine-tune on small datasets.&lt;/p&gt;
&lt;p&gt;However, if you do want to build a deep learning setup for synthetic generation, I recommend looking into libraries or setups that also use differential privacy as part of their training. &lt;a href="https://blog.kjamistan.com/differential-privacy-in-deep-learning.html"&gt;Differential privacy in deep learning&lt;/a&gt; combats pesky memorization and helps introduce measurable privacy into your data generation.&lt;/p&gt;
&lt;h3 id="spying-on-claude-code"&gt;Spying on Claude Code&lt;/h3&gt;
&lt;p&gt;I've been investigating the inner workings of Claude Code software and prompt infrastructure for a few weeks now. Want to also spy on your "coding assistant"? Let me show you how.&lt;/p&gt;
&lt;p&gt;I have a two-tiered setup at present because it helps me better analyze different parts of the data flows.&lt;/p&gt;
&lt;p&gt;First, I am using &lt;a href="https://github.com/mitmproxy/mitmproxy"&gt;mitmproxy&lt;/a&gt; to proxy traffic from Claude Code and log it into easily parseable JSON files. I have &lt;a href="https://gist.github.com/kjam/4959e6a52e07e20224d56bf1fe0cd92f"&gt;an example as a Gist to get you started&lt;/a&gt;.&lt;/p&gt;
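&lt;p&gt;If you want to start from scratch instead of the Gist, a bare-bones mitmproxy addon might look roughly like the sketch below (the file name and output path are made up). You'd run it with &lt;code&gt;mitmdump -s log_claude_traffic.py&lt;/code&gt; and point Claude Code at the proxy, for example via the HTTPS_PROXY environment variable plus trusting the mitmproxy CA certificate.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;# log_claude_traffic.py -- write each intercepted request/response pair as one JSON line
import json

from mitmproxy import http

OUT_FILE = "claude_traffic.jsonl"

def response(flow: http.HTTPFlow) -&gt; None:
    record = {
        "url": flow.request.pretty_url,
        "method": flow.request.method,
        "request_body": flow.request.get_text(strict=False),
        "status_code": flow.response.status_code,
        "response_body": flow.response.get_text(strict=False),
    }
    with open(OUT_FILE, "a") as f:
        f.write(json.dumps(record) + "\n")
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
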
&lt;p&gt;Then I'm also using &lt;a href="https://github.com/eunomia-bpf/agentsight/tree/master"&gt;AgentSight&lt;/a&gt;, an eBPF-based tool with an interface that specifically looks at the Claude Code binary and follows process forks, system calls and sequences of both. You can read more about the design and usage &lt;a href="https://arxiv.org/abs/2508.02736"&gt;in their ArXiv paper&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A view from the AgentSight dashboard that shows processes, http events, system calls and SSL requests." src="/images/2026/agentsight.png"&gt;&lt;/p&gt;
&lt;p&gt;I built AgentSight from source (it took a bit of manipulating the Makefile in my setup), and it's useful to set CLAUDE_BIN as an environment variable when you call the built binary. They also have prebuilt binaries, but those didn't work for my setup.&lt;/p&gt;
&lt;p&gt;I think in the future it might be useful for you to design your own Agent-Spyware by building specific ebpf libraries and packages, but I am by no means an expert on that. A tip from someone more informed says &lt;a href="https://github.com/iovisor/bcc"&gt;bcc is a good starting point&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;If you build something and want any practitioner feedback, I'd be very interested.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I'll be posting more on agentic security and privacy and testing out alternatives like &lt;a href="https://opencode.ai/"&gt;opencode&lt;/a&gt;. So far I really like the ease of tool calling via opencode in comparison, but I'm just getting started on my investigation...&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I'll be updating this post as I iron out my workflows and gain more experience. I'll also be releasing a longer series on what I find from a privacy and security perspective later this year, so stay tuned! If you have burning questions or any interesting research you are working on in the space, feel free to reach out.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;If this post helped you, consider &lt;a href="https://probablyprivate.com/subscribe"&gt;subscribing to my newsletter&lt;/a&gt; or &lt;a href="https://www.youtube.com/@ProbablyPrivate?sub_confirmation=1"&gt;my YouTube&lt;/a&gt; and sharing my work! I also offer &lt;a href="https://kjamistan.com"&gt;advisory and workshops&lt;/a&gt; and a new &lt;a href="https://maven.com/katharine-jarmul/practical-ai-privacy/"&gt;Maven Practical AI Privacy course&lt;/a&gt; on topics like security and privacy in AI/ML and personal AI.&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;There are several ways to control caching in both llama.cpp and vllm. Here's a useful &lt;a href="https://github.com/ggml-org/llama.cpp/discussions/20574"&gt;starting point for llama.cpp&lt;/a&gt; and for &lt;a href="https://docs.vllm.ai/en/latest/design/prefix_caching/"&gt;vllm&lt;/a&gt;. Again, I would first use the system for a while before implementing caching so you can diagnose any potential caching pros and cons based on your initial experience and observations.&amp;#160;&lt;a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;For my NVIDIA Blackwell architecture + Debian setup &lt;a href="https://forums.developer.nvidia.com/t/tutorial-build-llama-cpp-from-source-and-run-qwen3-235b/352604"&gt;this guide&lt;/a&gt; did the trick. But your architecture and the flags you need will differ, so try following the initial README or looking around with your GPU name + OS + compile llama.cpp.&amp;#160;&lt;a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:3"&gt;
&lt;p&gt;Even &lt;a href="https://claude.ai/share/795fcbe8-bc8f-4c6d-8230-1bef84465f0f"&gt;Claude&lt;/a&gt; can't figure out how to block reads from files in the working directory.&amp;#160;&lt;a class="footnote-backref" href="#fnref:3" title="Jump back to footnote 3 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="personal-ai"></category></entry><entry><title>Practical AI Privacy: A 6-week online Maven masterclass</title><link href="https://blog.kjamistan.com/practical-ai-privacy-a-6-week-online-maven-masterclass.html" rel="alternate"></link><published>2026-03-30T00:00:00+02:00</published><updated>2026-03-30T00:00:00+02:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2026-03-30:/practical-ai-privacy-a-6-week-online-maven-masterclass.html</id><summary type="html">&lt;p&gt;AI usage (both intended and not) is increasing in our products, software and lives. For some, this is a welcome way to automate tedious tasks; for others, an intrusion that doesn't seem to end. For everyone, AI changes how you might think about and evaluate data privacy.&lt;/p&gt;
&lt;p&gt;At workplaces, there's …&lt;/p&gt;</summary><content type="html">&lt;p&gt;AI usage (both intended and not) is increasing in our products, software and lives. For some, this is a welcome way to automate tedious tasks; for others, an intrusion that doesn't seem to end. For everyone, AI changes how you might think about and evaluate data privacy.&lt;/p&gt;
&lt;p&gt;At workplaces, there's often top-down and bottom-up incentivization to automate and manage your work with AI. Increasingly, these tasks touch sensitive data, documents and workflows. How can you automate these workflows safely? What are the best practices with regard to privacy and security?&lt;/p&gt;
&lt;p&gt;&lt;img alt="Katharine on stage presenting about privacy in AI systems" src="/images/2026/kj_on_stage.jpg"&gt;
&lt;em&gt;Katharine on Practical Data Privacy at GOTO Amsterdam 2023&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;I've been considering these questions long before LLMs came on the scene. My book &lt;a href="https://practicaldataprivacybook.com/"&gt;Practical Data Privacy (O'Reilly 2023)&lt;/a&gt; is one of the most recommended introductions to privacy controls in data, AI and ML workflows.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://maven.com/katharine-jarmul/practical-ai-privacy/"&gt;Practical AI Privacy masterclass&lt;/a&gt; is my 2026 update for that book. The course focuses on inference instead of training and information-heavy workloads, like LLMs, diffusion models and advanced setups like agentic workflows. It aims to give AI model users (not model developers) the ability to better understand and control their privacy and security.&lt;/p&gt;
&lt;p&gt;By creating a safe place to experiment and learn about AI privacy risks and controls, you'll learn real skills you can use both at work and in your own personal AI usage. You'll leave the course able to assess, evaluate and address privacy risks presented by large models and AI workflows. You'll have code you wrote, analyzed and tested at your fingertips, for work and personal projects.&lt;/p&gt;
&lt;h3 id="who-is-the-class-for-data-software-and-ai-engineering-but-not-only"&gt;Who is the class for? Data, Software and AI Engineering, but not only&lt;/h3&gt;
&lt;p&gt;Since &lt;a href="https://maven.com/katharine-jarmul/practical-ai-privacy/"&gt;the course&lt;/a&gt; is quite hands-on, the target audience is someone who is comfortable reading and writing some code and who wants to build out privacy engineering in AI/ML workflows. I chose this style because I think there's an immediate need for people to do these tasks at work, and I want to create a safe environment where you can practice and learn.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Want a taste of what we'll cover and how I teach? Check out &lt;a href=""&gt;my Maven Lightning Lesson on why privacy is such a hard problem to solve in AI systems&lt;/a&gt; and &lt;a href="https://www.youtube.com/playlist?list=PLJkNSeYcYBlC88vkG58yx3fHSobmCmDw_"&gt;my Probably Private YouTube mini-course on security&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;If you're not an engineer it doesn't necessarily mean the course isn't for you. If you wear a hat helping shape policy and architecture (like privacy analysts, privacy and security architects) or if you are a privacy professional or product owner, I think your expertise will help drive different aspects of how privacy fits into AI systems.&lt;/p&gt;
&lt;p&gt;Multidisciplinary teams are essential to any successful privacy program because it's often non-engineering roles that drive policy and process at an organization.&lt;/p&gt;
&lt;p&gt;If you want to join the course but are intimidated by coding, know that there will be ways to partner and team up with others; as well as ways to contribute your knowledge as part of the larger conversations as to how we build these systems.&lt;/p&gt;
&lt;h3 id="what-youll-learn-and-why"&gt;What You'll Learn and Why&lt;/h3&gt;
&lt;p&gt;I've broken down the main concepts you'll learn with some details on what each concept is and why it's important. Several concepts build on previous ones, so I've tried to keep them in the general order in which they will be taught.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;What you'll learn&lt;/th&gt;
&lt;th&gt;Details&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Getting Started with Local AI&lt;/td&gt;
&lt;td&gt;Create a safe experimentation environment for testing new ideas.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Architecting an AI product workflow&lt;/td&gt;
&lt;td&gt;Practice architecture decisions, learn how AI products work&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Generating synthetic data and initial evaluations&lt;/td&gt;
&lt;td&gt;Evaluate synthetic data and generic evals as building block for privacy evaluations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Privacy attacks on AI systems&lt;/td&gt;
&lt;td&gt;Learn how to run attacks focused on extracting confidential or sensitive information&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reviewing and changing architectures&lt;/td&gt;
&lt;td&gt;Based on what you've learned, re-evaluate your architecture choices and make new designs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Evaluating basic protections (permissions, pseudonymization, input sanitization)&lt;/td&gt;
&lt;td&gt;Determine when and how basic protections address the privacy and security issues exposed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Guardrails&lt;/td&gt;
&lt;td&gt;Learn what guardrails can and cannot do and get hands-on practice using them in your setup&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Advanced protections (other than guardrails): Prompt minimization, routing and local options&lt;/td&gt;
&lt;td&gt;Experiment with more advanced protections for privacy and update your architectures accordingly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Privacy testing and evaluations&lt;/td&gt;
&lt;td&gt;Build evaluation and testing suites around your use case, tuned to the privacy requirements&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Privacy monitoring and observability&lt;/td&gt;
&lt;td&gt;Establish observability and monitoring best practices with privacy as the focus&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;All of this will happen in hands-on labs with extra office hours for questions, deeper dives and experimentation. The course will have two main lessons per week along with time for questions. There will also be some at-home assignments, team projects and asynchronous conversations (via the Maven platform and Mattermost).&lt;/p&gt;
&lt;h3 id="feedback-questions-ideas-very-welcome"&gt;Feedback, Questions, Ideas very welcome!&lt;/h3&gt;
&lt;p&gt;I'd be happy to answer any questions you have. I'll try to keep this post updated as questions emerge to ensure the course is clear.&lt;/p&gt;
&lt;p&gt;If you have feedback on the topics, or wish I covered something you expected to see, feel free to write me. You can &lt;a href="mailto:katharine@kjamistan.com"&gt;email me&lt;/a&gt; or reach out &lt;a href="https://linkedin.com/in/katharinejarmul"&gt;on LinkedIn&lt;/a&gt;.&lt;/p&gt;</content><category term="classes"></category></entry><entry><title>Differential Privacy Parameters, Accounting and Auditing in Deep Learning and AI</title><link href="https://blog.kjamistan.com/differential-privacy-parameters-accounting-and-auditing-in-deep-learning-and-ai.html" rel="alternate"></link><published>2026-02-06T00:00:00+01:00</published><updated>2026-02-06T00:00:00+01:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2026-02-06:/differential-privacy-parameters-accounting-and-auditing-in-deep-learning-and-ai.html</id><summary type="html">&lt;p&gt;You've learned in the last few articles about &lt;a href="https://blog.kjamistan.com/differential-privacy-in-deep-learning.html"&gt;how differential privacy works&lt;/a&gt; and some of the &lt;a href="https://blog.kjamistan.com/differential-privacy-in-todays-ai-whats-so-hard.html"&gt;common pitfalls&lt;/a&gt; of actually using it in deep learning scenarios.&lt;/p&gt;
&lt;p&gt;In this article, you'll learn about tracking differential privacy: through parameter choice, accounting and auditing. If done well, these choices and methods reduce memorization …&lt;/p&gt;</summary><content type="html">&lt;p&gt;You've learned in the last few articles about &lt;a href="https://blog.kjamistan.com/differential-privacy-in-deep-learning.html"&gt;how differential privacy works&lt;/a&gt; and some of the &lt;a href="https://blog.kjamistan.com/differential-privacy-in-todays-ai-whats-so-hard.html"&gt;common pitfalls&lt;/a&gt; of actually using it in deep learning scenarios.&lt;/p&gt;
&lt;p&gt;In this article, you'll learn about tracking differential privacy: through parameter choice, accounting and auditing. If done well, these choices and methods reduce memorization in deep learning systems.&lt;/p&gt;
&lt;h3 id="how-do-the-privacy-parameters-in-differential-privacy-work"&gt;How do the privacy parameters in differential privacy work?&lt;/h3&gt;
&lt;p&gt;As you likely recall &lt;a href="https://blog.kjamistan.com/differential-privacy-in-deep-learning.html"&gt;from the previous article&lt;/a&gt;, differential privacy has privacy parameters in the definition. Reasoning about these parameters is something a team really has to do in practice to develop an understanding of their meaning and what they measure. As a starting point, however, I recommend &lt;a href="https://desfontain.es/blog/differential-privacy-in-more-detail.html"&gt;this useful chart generated by Damien Desfontaines&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A graph showing how an attacker can update their suspicion of a person in the dataset based on a variety of epsilon values given an epsilon-based differential privacy query interface. On the x-axis there is initial suspicion with a range from 0 to 1. The y-axis is shown as updated suspicion with the same range. There is a legend on the right hand side showing possible epsilon ranges, from 0 to 7 with different bounds. The graph itself has radiating ranges that begin along the diagonal between 0 and 1 where initial suspicion equals updated suspicion. These ranges look like radiating concentric oblong shapes -- where lower values of epsilon are wrapped in larger values. These begin to look more like logarithmic curves the further out you go -- where epsilon is larger than 3." src="./images/2026/epsilon_graph.png"&gt;
&lt;em&gt;Updated suspicion based on epsilon choices&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;This chart assumes a sophisticated attacker who attempts to determine which database they are interacting with and update their knowledge based on the response.&lt;/p&gt;
&lt;p&gt;The chart shows the initial suspicion across the x axis. This can represent what the attacker either has already learned about whether the person is in the dataset or not, or what they learned based on a previous query.&lt;/p&gt;
&lt;p&gt;Then, after they query a differentially private database, they can update their suspicion. The y-axis shows the bounds of the information they get in return which depends on the epsilon choice. It is never guaranteed that they learn the "maximum" of the bounds, but instead guaranteed that they learn within the bounds -- it could be that with one response they learn nothing.&lt;/p&gt;
&lt;p&gt;The different colors in the chart represent different choices of epsilon. The bounds can be quite small with a small epsilon and much larger with a large epsilon. Choosing an epsilon of 5 is much different than an epsilon of 1. These bounds are the guarantees that differential privacy uses to protect individual information. You will also use these bounds to determine what is the balance between the information you are trying to learn and the information you are trying to protect.&lt;/p&gt;
&lt;p&gt;If you are doing something like preprocessing or data exploration, you usually choose an epsilon upfront and split it unevenly across your queries, so you can get a bit more information from some queries rather than others. Spending more of your epsilon on a query results in a more accurate, informative and therefore less private response.&lt;/p&gt;
&lt;p&gt;Any time you repetitively query or process data in your analysis, you track this epsilon (or other parameters depending on your definition) so you know the entire epsilon spent.&lt;/p&gt;
&lt;p&gt;In today's differentially private deep learning libraries, there are two approaches to epsilon choice. One is to choose a maximum epsilon, which will lead to early stopping. The other is to track the epsilon and to keep training until a certain accuracy is reached. The second choice means that you could end up with a higher epsilon than you originally thought, but you can also decide to throw away the model and retrain (which means you wasted time and compute, but you hopefully learned something about your data).&lt;/p&gt;
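&lt;p&gt;Here's a rough sketch of what those two approaches look like with PyTorch's Opacus; the toy model and the parameter values (target epsilon, delta, clipping norm) are placeholders, not recommendations:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;# Sketch of the two epsilon-handling styles with Opacus (toy model and data).
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

model = nn.Linear(10, 2)
optimizer = optim.SGD(model.parameters(), lr=0.1)
data = TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,)))
data_loader = DataLoader(data, batch_size=16)

privacy_engine = PrivacyEngine()

# Approach 1: fix the budget up front -- Opacus derives the noise level needed
# to stay within target_epsilon over the planned number of epochs.
model, optimizer, data_loader = privacy_engine.make_private_with_epsilon(
    module=model,
    optimizer=optimizer,
    data_loader=data_loader,
    target_epsilon=3.0,
    target_delta=1e-5,
    epochs=5,
    max_grad_norm=1.0,
)

# Approach 2 (alternative): call make_private(...) with a chosen noise_multiplier,
# train until your accuracy target is reached, and track the spend as you go.
criterion = nn.CrossEntropyLoss()
for features, labels in data_loader:
    optimizer.zero_grad()
    loss = criterion(model(features), labels)
    loss.backward()
    optimizer.step()

print("epsilon spent so far:", privacy_engine.get_epsilon(delta=1e-5))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
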
&lt;p&gt;So how does the second process work in deep learning? How can you track the epsilon if you're not setting it in advance?&lt;/p&gt;
&lt;h3 id="an-introduction-to-accounting"&gt;An Introduction to Accounting&lt;/h3&gt;
&lt;p&gt;If we don't know exactly how much epsilon we have to spend, we need to measure how much we did spend! In the case of deep learning, this is actually often how we track the privacy parameters. In each batch during DP-SGD training, the process calculates and calibrates the noise based on characteristics of the batch. This calculation also estimates the epsilon, and that is tracked by an accountant.&lt;/p&gt;
&lt;p&gt;Already in &lt;a href="https://arxiv.org/abs/1607.00133"&gt;the original definition of DP-SGD by Abadi et al.&lt;/a&gt;, there was an accountant to track the epsilon spent. This accountant is called the moments accountant. This approach is still used in both &lt;a href="https://jax-privacy.readthedocs.io/en/latest/"&gt;JAX Privacy&lt;/a&gt; and &lt;a href="https://opacus.ai/"&gt;PyTorch's Opacus&lt;/a&gt; for differentially private learning.&lt;/p&gt;
&lt;p&gt;To appropriately calculate the bounds and approximate the epsilon, the authors needed to understand the behavior of the noise they were using. The authors were interested in Gaussian noise as it has some nice properties when compared with other distributions and some benefits when looking at how deep learning systems work (e.g. when working with regularization, recognizing patterns and managing embedding spaces).&lt;/p&gt;
&lt;p&gt;&lt;img alt="A chart showing two different probability density distributions. The Laplace distribution has a very strong peak at 0 with exponential tails dropping off very quickly. In comparison, the Gaussian distribution has no peak and much more gradual tails. Both of these are shown at their proper scale when epsilon=0.9." src="./images/2026/gaussian_v_laplace_noise.png"&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I highly encourage you to play with &lt;a href="https://lpanavas.github.io/mechanism-comparison/"&gt;Liudas Panavas' interactive chart&lt;/a&gt; to explore more values of epsilon.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The problem with Gaussian noise compared to, say, a Laplace distribution is that it has a higher chance of ending up with less noise because of the gradual tails. If you end up too far in a tail, you could essentially add 0 noise, nullifying the privacy guarantee. If you want those tight bounds/guarantees like in Desfontaines's chart, how do you appropriately calculate the chances you end up adding noise from the middle, the 10th percentile, the .05th percentile?&lt;/p&gt;
&lt;p&gt;Abadi and coauthors were developing some interesting work on learning theory around differential privacy, and they had the idea to use higher order moments (like 3 for skewness and 4 for kurtosis) to better understand and therefore bound the probability of catastrophic failure (i.e. almost 0 noise). By working through the math and theory and additionally testing their implementation, they were able to develop better bounds for Gaussian noise in deep learning.&lt;/p&gt;
&lt;p&gt;Let's walk through how the moments accountant works:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Based on the ideal spend for the two parameters (epsilon and delta) and the clipping value (the maximum gradient norm for the mini-batch), a Gaussian distribution is formed to add noise to the clipped gradient. These parameters inform the standard deviation of the Gaussian noise distribution.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Several moments (e.g. 3 and 4, but you can go up to a very high number of moments) are recorded for that noise distribution and stored by the accountant.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Training continues, performing steps 1 and 2 for each minibatch.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;At the end of an epoch, sum the moments accumulated and calculate the upper bounds of those training rounds based on those values. This can be done to back out the amount of epsilon and delta spent in that epoch.&lt;sup id="fnref:1"&gt;&lt;a class="footnote-ref" href="#fn:1"&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If the target epsilon or delta is reached, you can stop training. If target epsilon or delta are not reached, training epochs can continue. Usually the accountant will also display the information via a log message per epoch.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
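&lt;p&gt;For the mathematically curious, the reason summing the moments works comes down to two results from the Abadi et al. paper, paraphrased here (see the paper for the exact statements and conditions):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;% \alpha_M(\lambda): the log moment generating function of the privacy loss at moment \lambda

% Composability: moments add up across the sequence of per-batch mechanisms M_1, ..., M_k
\alpha_M(\lambda) \le \sum_{i=1}^{k} \alpha_{M_i}(\lambda)

% Tail bound: for any \varepsilon &gt; 0, the composed mechanism M is (\varepsilon, \delta)-DP for
\delta = \min_{\lambda} \exp\big(\alpha_M(\lambda) - \lambda\varepsilon\big)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
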
&lt;p&gt;This moments accountant and the Gaussian bounds calculation eventually led to more advanced definitions, such as Renyi differential privacy, which is now the standard for today's deep learning implementations.&lt;sup id="fnref:2"&gt;&lt;a class="footnote-ref" href="#fn:2"&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;Okay, so how do you make sure that the moments accountant is doing things appropriately? Similar to difficult problems like cryptography, you never want to write your own unless you are working with actual experts. And even when (or especially because) experts wrote it, you gotta audit!&lt;/p&gt;
&lt;h3 id="auditing-v-accounting"&gt;Auditing v. accounting&lt;/h3&gt;
&lt;p&gt;Let's disentangle two separate concepts: auditing and accounting. Accounting is keeping track of the privacy parameters for a given experiment, process or query. In your case, this is likely your training or fine-tuning. Usually this means you:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;set an ideal budget&lt;/li&gt;
&lt;li&gt;track it via some accountant (like the moments accountant)&lt;/li&gt;
&lt;li&gt;ideally stop processing when the budget is reached&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;As you already learned, sometimes you might actually stop at a given model performance (e.g. error, TPR, FPR or some other metric). At that point you can also use the accountant to decide whether the privacy is "enough" to use the model.&lt;/p&gt;
&lt;p&gt;Auditing, however, is making sure that your differential privacy mechanism works properly and that your accountant works properly. This means testing that the libraries and tools you are using behave correctly, but it can also cover whether you are using them properly (i.e. looking at the entire setup and ensuring the parameters, processing and tools meet the requirements set).&lt;/p&gt;
&lt;p&gt;Let's dive deeper into auditing so you can know how to choose tools that are safe, have been tested and that also fit your needs.&lt;/p&gt;
&lt;h3 id="third-party-auditing-of-tools-is-necessary"&gt;Third-party auditing of tools is necessary&lt;/h3&gt;
&lt;p&gt;Auditing your tools and accounting makes sense. The entire privacy guarantee is based on whether the differential privacy libraries:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;theoretically prove what they're doing is sound&lt;/li&gt;
&lt;li&gt;implement that theory into software properly&lt;/li&gt;
&lt;li&gt;haven't accidentally introduced a new vulnerability&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Similar to other difficult problems like cryptography, developers need to make sure their implementation is tested by other experts and regularly updated with new insights, theory and attacks.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I also recommend reading &lt;a href="https://desfontain.es/blog/privacy-auditing-terminology.html"&gt;Damien Desfontaines' article on 3 types of privacy auditing&lt;/a&gt;. I'm talking about definition #3 here.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;In &lt;a href="https://arxiv.org/abs/2202.12219"&gt;Debugging Differential Privacy: A Case Study for Privacy Auditing&lt;/a&gt;, the authors showed how easy it is even for domain experts to miss important nuances in differential privacy in deep learning. By auditing a new approach, they uncovered a bug in the implementation which severely underestimated the epsilon and resulting privacy leakage.&lt;/p&gt;
&lt;p&gt;Does this make the original authors "bad" or any less experts than they were? No. It proves that without robust auditing from external parties, it's difficult to ensure you have "seen" everything. This is why the field of cryptography has embraced open-source, open auditing and even incentivized auditing as an industry--to attempt to catch these mistakes and ensure the security guarantees are real. Even then, it's possible to get this wrong, such as what happened with the &lt;a href="https://en.wikipedia.org/wiki/Heartbleed"&gt;OpenSSL Heartbleed Vulnerability&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;To audit the incorrect differential privacy implementation, the authors ran a &lt;a href="https://blog.kjamistan.com/defining-privacy-attacks-in-ai-and-ml.html"&gt;membership inference attack&lt;/a&gt; in which they inserted specifically poisoned examples that acted as canaries. Then they attempted to find these canaries with sophisticated membership inference techniques.&lt;/p&gt;
&lt;p&gt;To build successful canaries, they identified useful starting examples by testing canary creation with a subsample of the training data.&lt;sup id="fnref:3"&gt;&lt;a class="footnote-ref" href="#fn:3"&gt;3&lt;/a&gt;&lt;/sup&gt; Then they trained 10K models on different subsets of the poisoned vs non-poisoned examples--all supposedly with appropriate differential privacy granted via the new implementation.&lt;/p&gt;
&lt;p&gt;They admit that training 10K models was overkill and they could have probably just trained 1K to get the same results. Since the models were simple MNIST models, it was a fairly short training cycle per model, especially on large compute.&lt;/p&gt;
&lt;p&gt;They trained those models so that they could use the loss information on the (outlier) canaries to successfully target them for MIAs. By gathering the loss for canary examples versus non-canary examples repeatedly, they slowly built two distributions.&lt;/p&gt;
&lt;p&gt;Because these distributions have overlap (i.e. where the loss is similar or the same), they then optimized to find the threshold that best separates these losses and identifies the canaries versus the other examples. Below is a visual of their threshold findings.&lt;/p&gt;
&lt;p&gt;&lt;img alt="There are two Gaussian-looking distributions that clearly have different distribution properties. The canary/poisoned distribution has a higher peak and smaller tails and is shown in blue. The baseline distribution is shown in orange, and the mean is shifted to the right. There is a threshold line that is drawn to best separate the two distributions." src="./images/2026/threshold_canary_improper_dp.png"&gt;&lt;/p&gt;
&lt;p&gt;Because finding this threshold shows significant privacy leakage for the canaries, it can be used to estimate epsilon bounds. Comparing this estimate with the epsilon reported in the paper, the authors concluded that the reported epsilon was not the true epsilon.&lt;/p&gt;
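&lt;p&gt;As a rough illustration of how a threshold like that turns into an epsilon estimate, here is a small sketch (mine, not the paper's code) that sweeps thresholds over canary and baseline loss scores and converts the best true/false positive trade-off into an empirical lower bound, using the standard hypothesis-testing bound that the true positive rate can be at most exp(epsilon) times the false positive rate, plus delta. A real audit repeats this over many trained models and adds confidence intervals; the loss arrays below are made up.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;import numpy as np

def empirical_epsilon(canary_losses, baseline_losses, delta=1e-5):
    """Point estimate of an epsilon lower bound from a loss-threshold attack.

    The attack guesses "canary" whenever the loss falls below a threshold.
    Differential privacy caps the true positive rate at
    exp(epsilon) * false positive rate + delta, so a large observed gap
    between the two rates implies a large epsilon.
    """
    best = 0.0
    for t in np.concatenate([canary_losses, baseline_losses]):
        tpr = np.mean(canary_losses &amp;lt;= t)    # canaries correctly flagged
        fpr = np.mean(baseline_losses &amp;lt;= t)  # baseline examples wrongly flagged
        if fpr &amp;gt; 0 and tpr &amp;gt; delta:
            best = max(best, np.log((tpr - delta) / fpr))
    return best

# Made-up loss scores: canaries (members) tend to have noticeably lower loss.
rng = np.random.default_rng(1)
canary = rng.normal(0.5, 0.3, size=1000)
baseline = rng.normal(1.5, 0.5, size=1000)
print(f"empirical epsilon lower bound: {empirical_epsilon(canary, baseline):.2f}")
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;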
&lt;p&gt;Like any good auditing team, they took this as a prompt to investigate the code implementation, where they found a bug. The bug reduced the gradient sensitivity by the batch size, which is incorrect: the batch size doesn't affect the sensitivity of any of the gradients. The mistake was easy to make because gradient clipping is calibrated by batch (adaptive clipping by gradient norm), but the sensitivity is not.&lt;/p&gt;
&lt;p&gt;Therefore the implementation was underestimating sensitivity and adding too little noise to counter the gradient information. This, in turn, left those canaries overexposed (along with any other points that needed more privacy). Getting sensitivity right in theory and in practice is a difficult job, which is why auditing is necessary. There's also continuing research on new attacks, meaning audits must be performed regularly&lt;sup id="fnref:4"&gt;&lt;a class="footnote-ref" href="#fn:4"&gt;4&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
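&lt;p&gt;To make the sensitivity point concrete, here is a minimal sketch (again my own illustration, not the audited code) of the noise calibration in DP-SGD-style training: each per-example gradient is clipped to the norm bound, and the noise scale depends on that bound and the noise multiplier only, never on the batch size.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;import numpy as np

def noisy_batch_gradient(per_example_grads, clip_norm, noise_multiplier, rng):
    """Clip each example's gradient, sum, add Gaussian noise, then average.

    Any single example can change the clipped sum by at most `clip_norm`,
    so the noise standard deviation is noise_multiplier * clip_norm.
    Dividing that sensitivity by the batch size, as in the buggy
    implementation described above, adds far too little noise.
    """
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / max(norm, 1e-12)))
    total = np.sum(clipped, axis=0)
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=total.shape)
    return (total + noise) / len(per_example_grads)

rng = np.random.default_rng(0)
fake_grads = [rng.normal(size=10) for _ in range(32)]   # 32 fake per-example gradients
update = noisy_batch_gradient(fake_grads, clip_norm=1.0, noise_multiplier=1.1, rng=rng)
print(update.round(3))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;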
&lt;p&gt;Let's say you are using an appropriately audited library. The next thing you might want to review is to make sure that you're using it correctly. To do so in a repeatable and adaptable way, you probably want to build testing into the usage.&lt;/p&gt;
&lt;h2 id="appropriate-testing-is-hard"&gt;Appropriate testing is hard&lt;/h2&gt;
&lt;p&gt;An obvious starting point is to actually test the produced models for the most common privacy attacks. Can you successfully run a MIA or a LiRA on the model?&lt;/p&gt;
&lt;p&gt;&lt;a href="https://arxiv.org/abs/2404.17399"&gt;Aerni et al (2024)&lt;/a&gt; reviewed recent papers on privacy-preserving machine learning and ran these tests against the implementations. To do so, they introduced canary inputs and then ran LiRA and MIA attacks on the resulting models. In their analysis of the results, they surmised that most of the papers were either cherry-picking results or taking the average attack on something like a mislabeled example versus a cleverly crafted canary.&lt;/p&gt;
&lt;p&gt;As you have learned throughout this series, not all data has the same privacy risk. Some examples will be more prone to memorization because they are novel or represent complexity. It is useful to design privacy testing that takes this into account! One way to do this might be to evaluate sample complexity and example complexity as part of training and to select complex and interesting samples for future privacy evaluation.&lt;/p&gt;
&lt;p&gt;As you also learned, mislabeled examples are easy for a model to unlearn/forget because they contradict information that the model will learn from the rest of their class/label/examples. Therefore, they are not a good choice for testing real privacy concerns.&lt;/p&gt;
&lt;p&gt;Outside of looking at risks for outlier and complex examples, it's also useful to think about what highly repeated examples are expected to be learned, and which aren't. If you get to know your task and data via unsupervised methods like clustering, this could show aspects of both of these types of examples.&lt;/p&gt;
&lt;p&gt;Aerni et al. make a few research recommendations which are also useful if you're a practitioner looking to appropriately evaluate differential privacy for your use case. They recommend that you:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Evaluate membership inference success (specifically true positive rate (TPR) at low false positive rate (FPR)) for the most vulnerable samples in a dataset, instead of an aggregate over all samples. To make this process computationally efficient, audit a set of canaries whose privacy leakage approximates that of the most vulnerable sample. A minimal sketch of this metric follows the list.&lt;/li&gt;
&lt;li&gt;Use a state-of-the-art membership inference attack that is properly adapted to privacy defense specifics you are using in your training (i.e. DP or otherwise).&lt;/li&gt;
&lt;li&gt;Compare your model to DP baselines (e.g., DP-SGD) that use state-of-the-art techniques and reach similar utility.&lt;/li&gt;
&lt;/ol&gt;
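&lt;p&gt;Their first recommendation boils down to one number. Here is a small sketch of computing it, assuming you already have attack scores for a set of canaries and a set of held-out non-members; the score arrays below are made up.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;import numpy as np

def tpr_at_fpr(member_scores, nonmember_scores, target_fpr=0.001):
    """True positive rate of a score-threshold attack at a fixed low FPR.

    Higher attack score means "more likely a member". The threshold is set
    so that only `target_fpr` of non-members score above it, and we measure
    how many members (e.g. canaries) are still caught at that threshold.
    """
    threshold = np.quantile(nonmember_scores, 1.0 - target_fpr)
    return np.mean(member_scores &amp;gt; threshold)

# Made-up attack scores for vulnerable canaries vs. held-out non-members.
rng = np.random.default_rng(2)
canary_scores = rng.normal(2.0, 1.0, size=1_000)
nonmember_scores = rng.normal(0.0, 1.0, size=10_000)
print(f"TPR at 0.1% FPR: {tpr_at_fpr(canary_scores, nonmember_scores):.3f}")
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;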
&lt;p&gt;If you're a practitioner and you want to get inspired by their experiments, take a look at the &lt;a href="https://github.com/ethz-spylab/misleading-privacy-evals"&gt;GitHub repository&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;New approaches to testing have also helped move the field of privacy auditing forward. &lt;a href="https://arxiv.org/abs/2302.07956"&gt;Nasr et al. developed an interesting approach&lt;/a&gt; that audits the training process as it is performed, meaning you don't have to wait until &lt;em&gt;after&lt;/em&gt; you've trained a model in order to get a good idea of what privacy guarantees you are offering.&lt;/p&gt;
&lt;p&gt;In their work, they used a different type of attack to simulate the LiRA and state-of-the-art MIAs. They allow the attacker to observe the training process itself and to actively insert canary gradients&lt;sup id="fnref:5"&gt;&lt;a class="footnote-ref" href="#fn:5"&gt;5&lt;/a&gt;&lt;/sup&gt; into the process. Then, the attacker observes the loss created by these batches with and without canaries and trains a model to distinguish between them. They also present a weaker version where the process is only observed at certain intervals and there aren't canaries.&lt;/p&gt;
&lt;p&gt;By evaluating the training process this way, they were able to determine better lower bounds for the privacy guarantees, which helps people choose parameters that match the sensitivity they need. Beyond the lower bounds, you also need to understand the upper bounds and the likelihood that you end up somewhere in the middle, as &lt;a href="https://desfontain.es/blog/bad-ugly-good-maybe.html"&gt;Damien Desfontaines presented at PEPR&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Another interesting recent approach looks at the training process and tracks loss across epochs. &lt;a href="https://arxiv.org/abs/2411.05743"&gt;Pollock et al.&lt;/a&gt; showed that looking at the distribution of losses across the training process can identify more than 90% of the memorized and at-risk examples. They also &lt;a href="https://github.com/imperial-aisp/loss_traces"&gt;released their work on GitHub&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="There are three different images of frogs at the top of the graph, each with a different positioning and placement. The average frog is centered in the image and there's nothing else in the image. The easy-to-learn outlier is a picture of a frog from underneath. Then the hard-to-learn outlier is a frog from further away positioned slightly behind a wall of some sort. In the graph below, the loss traces are shown. The line for the average goes quickly down as the epochs continue. The easy-to-learn outlier has some erratic behavior in the early epochs and then drops and stabilizes in later epochs. The hard-to-learn outlier's loss almost never stabilizes. It spikes and drops and spikes and drops, in the very last epochs it looks to partially stabilize." src="./images/2026/loss_traces.png"&gt;&lt;/p&gt;
&lt;p&gt;This is an example from their paper, showing three different types of frog images in CIFAR-10. One is average (lost in the crowd), one is an easy-to-learn outlier (not complex) and one is difficult to learn (complex). The authors also note the success rate of MIAs targeting these examples, showing that the outliers have a higher risk of being identified.&lt;/p&gt;
&lt;p&gt;The loss traces are shown in the graph below, where the average example has a really stable and early decline in loss. The easy-to-learn outlier shows less stability in early and middle epochs, but eventually stabilizes. In comparison, the hard-to-learn outlier's loss almost never stabilizes. These are the indicators the authors used (via sampled losses) to sort the outliers from the average examples and to distinguish the complex outliers from the easier ones.&lt;/p&gt;
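&lt;p&gt;If you want to experiment with this signal yourself, a very simple starting point (not the authors' full method, which uses richer statistics) is to record each example's loss at every epoch and rank examples by how erratic their trace is.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;import numpy as np

def flag_unstable_examples(loss_traces, top_fraction=0.01):
    """Rank training examples by how erratic their per-epoch loss trace is.

    `loss_traces` has shape (num_examples, num_epochs) and holds each
    example's loss at the end of every epoch. Examples whose loss keeps
    spiking (high variance across epochs) behave like the hard-to-learn
    outliers above and deserve extra privacy scrutiny.
    """
    instability = loss_traces.var(axis=1)
    k = max(1, int(len(instability) * top_fraction))
    return np.argsort(instability)[::-1][:k]   # indices, most unstable first

# Made-up traces: 1000 examples over 50 epochs, with ten erratic outliers.
rng = np.random.default_rng(3)
traces = rng.normal(0.2, 0.05, size=(1000, 50))
traces[:10] += rng.normal(0.0, 1.0, size=(10, 50))
print(flag_unstable_examples(traces))   # should surface the first ten indices
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;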
&lt;p&gt;Ideally, better testing like this would be easily accessible to practitioners. &lt;a href="https://blog.tensorflow.org/2020/06/introducing-new-privacy-testing-library.html"&gt;TensorFlow built some of these experiments and research work&lt;/a&gt; into TensorFlow code and infrastructure; however, this repository may not be maintained in the future.&lt;/p&gt;
&lt;p&gt;It would be useful to have these types of tests written into most of the major deep learning libraries, and into ML Ops and evaluation software. These tests should be easy to opt into and use, so privacy testing can become a normal part of ML/AI workflows.&lt;/p&gt;
&lt;h4 id="open-research-questions-aka-kjams-wish-list"&gt;Open research questions (aka kjam's wish list)&lt;/h4&gt;
&lt;p&gt;There are several unanswered questions when you take the theory of differential privacy and apply it to real-world machine learning.&lt;/p&gt;
&lt;p&gt;Research often uses toy datasets which sometimes have little to do with today's machine learning tasks. Proving that something works with CIFAR-10 doesn't necessarily tell me, as a practitioner, if this will help me fine-tune an LLM, or train a useful diffusion model with user-submitted art, etc.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://arxiv.org/abs/2212.06470"&gt;Tramèr et al&lt;/a&gt; called for more realistic training examples that mimicked real-world use cases that require private training; such as learning with health data, or with the Netflix Prize dataset. Of course, it is difficult to create data that mimics real private data without actually releasing real private data.&lt;/p&gt;
&lt;p&gt;Another open question I'd like to see in research is a better way to reason about the preprocessing steps and how to incorporate a holistic understanding of the privacy guarantees in an end-to-end ML system.&lt;/p&gt;
&lt;p&gt;I'd also like to see better advice and research on parameter choices, differential privacy definition choices, noise choices and privacy unit choices. Although these significantly depend on the use case, there should be better research pointing practitioners in the right direction for their use case. Papers comparing these choices using real-world tasks are rare.&lt;/p&gt;
&lt;p&gt;Finally, as mentioned in the last section, there should be better ways to actually test privacy in deep learning systems as part of training and evaluation. And better ways to reason about what those test results mean for individuals.&lt;/p&gt;
&lt;p&gt;In the following articles, you'll dive deeper into other potential solutions, many of which are either new ways of framing the problem or research ideas. I'm curious... what do you think should be done about memorization and privacy risk in AI/ML systems? What potential mitigations do you find most appealing and why?&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;To dive deeper than this relatively high-level description, I recommend &lt;a href="https://www.youtube.com/watch?v=ZxDBEyjiPxI"&gt;watching the paper presentation&lt;/a&gt;.&amp;#160;&lt;a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;There's strong critique of using Renyi differential privacy and DP-SGD in practice by &lt;a href="https://arxiv.org/abs/2206.04621"&gt;Blanco-Justicia et al., 2022&lt;/a&gt; where the authors experimented with several values of epsilon and delta, and also reasoned about what level of privacy this offered.&amp;#160;&lt;a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:3"&gt;
&lt;p&gt;There are also interesting applications for using canaries to &lt;a href="https://desfontain.es/blog/better-empirical-privacy-metrics.html"&gt;audit other differential privacy tasks like synthetic data creation&lt;/a&gt;&amp;#160;&lt;a class="footnote-backref" href="#fnref:3" title="Jump back to footnote 3 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:4"&gt;
&lt;p&gt;In research on auditing differentially private deep learning systems, &lt;a href="https://arxiv.org/abs/2006.07709"&gt;Jagielski et al.&lt;/a&gt; uncovered how targeting private SGD with specific poisoning examples allowed them to more easily influence the model behavior even with the DP noise and clipping.&amp;#160;&lt;a class="footnote-backref" href="#fnref:4" title="Jump back to footnote 4 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:5"&gt;
&lt;p&gt;&lt;a href="https://arxiv.org/abs/2302.07956"&gt;The paper also tests different types of canaries&lt;/a&gt; and develops an algorithm for advanced canary generation. I think it's worth reviewing if you're thinking of building your own canaries and/or privacy testing.&amp;#160;&lt;a class="footnote-backref" href="#fnref:5" title="Jump back to footnote 5 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="ml-memorization"></category></entry><entry><title>Get your data local: Setting up Network Attached Storage (NAS) and your first steps in self-hosting</title><link href="https://blog.kjamistan.com/get-your-data-local-setting-up-network-attached-storage-nas-and-your-first-steps-in-self-hosting.html" rel="alternate"></link><published>2026-01-30T09:00:00+01:00</published><updated>2026-01-30T09:00:00+01:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2026-01-30:/get-your-data-local-setting-up-network-attached-storage-nas-and-your-first-steps-in-self-hosting.html</id><summary type="html">&lt;p&gt;If you're just getting started with local AI and local-first development, one of the initial hurdles will be getting your data local.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;More of an audio-visual person? Check out the &lt;a href="https://youtu.be/TwCdM7fKw0c"&gt;accompanying YouTube video on the Probably Private channel&lt;/a&gt; if you'd rather watch and listen.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Why should you store data locally …&lt;/p&gt;</summary><content type="html">&lt;p&gt;If you're just getting started with local AI and local-first development, one of the initial hurdles will be getting your data local.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;More of an audio-visual person? Check out the &lt;a href="https://youtu.be/TwCdM7fKw0c"&gt;accompanying YouTube video on the Probably Private channel&lt;/a&gt; if you'd rather watch and listen.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Why should you store data locally?&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Ease of use: If you're developing your projects locally, it's a lot easier if that data is already on your local network. The initial hurdle of downloading or transferring data will almost always be the slowest part of your setup (outside of training models from scratch) if you don't store data locally.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Better control and understanding of your data: by keeping your data local, you have more control over what accesses it and how it's used in your data/AI/ML workflows. In addition, you can run tools over it to build an understanding of your data, which can give you new insights about what you might want to build.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Security: having a local copy means that if something is lost or breached, at least you have your backup. In addition, moving away from less secure services or apps and using a hybrid setup of your own can improve your security posture overall.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Self-hosted apps have significantly improved: you can now run many openly available apps and software for everyday tasks, such as managing documents, highlighting photos, streaming music and running routine jobs.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Interested in getting your data local? I'll walk you through a few steps to ensure that your data is appropriately set up for your local-first projects.&lt;/p&gt;
&lt;h3 id="choosing-your-hardware"&gt;Choosing your hardware&lt;/h3&gt;
&lt;p&gt;You'll first want to decide what type of computer to buy for storing your data. I'm a fan of having one computer just for data backups and for a small amount of software you want to use with that data. In my experience it's good to choose something that's not your GPU-enabled machine because you want it to be smaller and cheaper to run as it will likely be on most of the time (especially if you automate some of your backups).&lt;/p&gt;
&lt;p&gt;First, you might want to guesstimate how much storage you will need. It's fine to start smaller and then either buy a second computer or practice data minimization to better manage your storage. For me, I looked at the data I wanted to back up (documents, photos, self-tracked data, computer backups) and calculated approximately how many gigs that was. Then, I multiplied that by 2. That's been more than enough for my initial 7 years of self-hosting my data, especially if you set up data hygiene (removing duplicates, removing older computer backups, consolidating documents, etc).&lt;/p&gt;
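&lt;p&gt;If you want to put numbers on that, the estimate is just a sum and a factor of two; the category sizes below are placeholders, so swap in your own.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;# Back-of-the-envelope sizing with made-up numbers, all in gigabytes.
data_gb = {"photos": 400, "documents": 60, "computer_backups": 900, "music": 150}
needed_gb = sum(data_gb.values()) * 2   # the 2x headroom from the text
print(f"plan for at least {needed_gb} GB of usable storage")
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;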
&lt;blockquote&gt;
&lt;p&gt;Pro Tip: Spend some time cleaning up and consolidating the data you care about BEFORE backing it up. This is general good practice for both you and anyone who might need to look at your data one day.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;You'll also want to figure out any other features you want on your device. For example, you likely want an ethernet interface, and you might want to look at energy expenditure or ways to expand the storage (e.g. extra slots to add more storage later). Don't overcomplicate your choices if it's your first setup; start small and simple!&lt;/p&gt;
&lt;p&gt;I chose a Network Attached Storage device (or NAS), but you might want to choose a beefier computer with more chip power and RAM if you're interested in self-hosting a lot of apps, software or functionality. If you also have a &lt;a href="https://blog.kjamistan.com/building-out-my-home-ai-lab-for-private-and-local-ai.html"&gt;GPU computer setup like mine&lt;/a&gt; you could also move to the larger machine as your workflow grows, so again, don't overcomplicate where you back up your data.&lt;/p&gt;
&lt;p&gt;Once you have some basic specs, you're going to want to also decide on how your redundancy works before clicking buy. Let's dive into how RAID works to help you ensure you have enough safety in your storage.&lt;/p&gt;
&lt;h3 id="trusting-your-setup-raid"&gt;Trusting your setup (RAID)&lt;/h3&gt;
&lt;p&gt;You'll use some version of RAID (Redundant Array of Independent Disks) to ensure that your data is stored safely. RAID was invented in the late 80s as a way to ensure data is properly backed up across multiple disks (traditionally HDDs (hard disk drives), though SSDs work too).&lt;/p&gt;
&lt;p&gt;To determine what works for you, take a look at the &lt;a href="https://en.wikipedia.org/wiki/Standard_RAID_levels"&gt;RAID levels&lt;/a&gt; and choose one that fits your liking. Probably this will be RAID-5 or RAID-6. A cool thing about RAID is that it uses coding theory and error correcting codes to ensure that if one of your drives fails (e.g. the drive in slot 1 breaks), there is enough information and parity on the other drives that you can fully recreate the data via the error correcting codes.&lt;sup id="fnref:1"&gt;&lt;a class="footnote-ref" href="#fn:1"&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;Now that you know which RAID you want to use, you can make the final calculations of the storage you need on your NAS device. This means you'll need a minimum number of disks (i.e. the minimum based on the RAID you chose -- 3 disks for RAID 5 and 4 disks for RAID 6), and your disks should be sized so that, together after parity, they give you your storage estimate times some overhead (the numbers you calculated in step 1).&lt;/p&gt;
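&lt;p&gt;Here is a quick way to sanity-check the disk math before you buy. The parity-disk counts are the standard ones for RAID 5 and RAID 6; the disk sizes are placeholders.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;# Usable capacity after parity, for sanity-checking a NAS shopping list.
RAID_PARITY_DISKS = {"raid5": 1, "raid6": 2}   # RAID 5 needs 3+ disks, RAID 6 needs 4+

def usable_tb(level, num_disks, disk_tb):
    parity = RAID_PARITY_DISKS[level]
    if num_disks &amp;lt; parity + 2:
        raise ValueError(f"{level} needs at least {parity + 2} disks")
    return (num_disks - parity) * disk_tb

print(usable_tb("raid5", num_disks=4, disk_tb=2))   # 6 TB usable from 4 x 2 TB
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;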
&lt;p&gt;I recommend looking through online or local computer store options for NAS computers and choosing one that meets your specifications. A lot of the modern NAS devices have easy ways to swap out drives or expand drives, making them especially easy for home use.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A photo of my NAS with a comic on it that says &amp;quot;einfach mal herunterfahren&amp;quot;. Roughly translated: just shut it off." src="./images/2026/nas.jpeg"&gt;&lt;/p&gt;
&lt;p&gt;This is my 6-year-old, 4-disk, RAID-5 enabled NAS that's been happily running some random scripts along with backups and a few apps. It runs Debian server without any issues. I bought the parts from Jacob.de. Here's a breakdown of the components and cost.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Name and Description&lt;/th&gt;
&lt;th&gt;Price (in EUR)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Housing and computer&lt;/td&gt;
&lt;td&gt;QNAP TBS-453DX M.2 SSD NASbook - NAS-Server&lt;/td&gt;
&lt;td&gt;539.03&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SSDs&lt;/td&gt;
&lt;td&gt;ADATA XPG SX8200 Pro - SSD - 2TB (263.05 per drive) x 4&lt;/td&gt;
&lt;td&gt;1052.20&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RAM&lt;/td&gt;
&lt;td&gt;PHS-memory 16GB RAM for QNAP TBS-453DX-4G DDR4&lt;/td&gt;
&lt;td&gt;119.46&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;1,710.69&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;You'll find that NAS servers have become much less expensive than 6 years ago, and hopefully that will continue as more people move to self-hosting.&lt;/p&gt;
&lt;p&gt;Of course, when moving your data, please follow the &lt;a href="https://www.veeam.com/blog/321-backup-rule.html"&gt;3-2-1 rule&lt;/a&gt; for storing your data: 3 copies (like on your computer, your NAS and a cloud you trust), on 2 different media (e.g. 2 separate computers), with at least 1 copy in a remote location (i.e. in a cloud you trust or on a backup server somewhere else).&lt;/p&gt;
&lt;p&gt;Once you have your NAS set up (see next step), you'll actually make sure your RAID is running. If you're using Linux, a popular choice is ZFS. I think &lt;a href="https://coffeeaddict.dev/selfhosted/zfs/"&gt;this guide&lt;/a&gt; on using ZFS with your RAID choice is a good starting point, but you can also use the documentation or follow a tutorial online that fits your liking.&lt;/p&gt;
&lt;p&gt;However, if you don't want to use Linux, you can use whatever operating system you like. Let's talk a little bit about operating systems and networking, since those are often the sticking points that make self-hosting feel "too hard" or where people get stuck.&lt;/p&gt;
&lt;h3 id="operating-systems-and-networking"&gt;Operating Systems and Networking&lt;/h3&gt;
&lt;p&gt;I've been a linux user for more than 15 years, but I know I might be an outlier there, so really my first advice is to just start with an operating system you like.&lt;/p&gt;
&lt;p&gt;If you use Windows regularly and like it, install that! If you are comfortable learning linux and have used it at least once for work, maybe get started with Ubuntu as it has a nice interface that should be easier to learn. If you are a Mac user and want to support automated Mac backups, consider a Mac-Mini or something similar.&lt;/p&gt;
&lt;p&gt;Regardless of your operating system, you should be able to answer:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;How do I set up the RAID level I like?&lt;/li&gt;
&lt;li&gt;How do I test that?&lt;/li&gt;
&lt;li&gt;How do I then automate my backups?&lt;/li&gt;
&lt;li&gt;How do I connect my GPU-enabled machine for my AI/ML workloads?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So long as you can make those steps happen, it really shouldn't matter what operating system you use.&lt;sup id="fnref:2"&gt;&lt;a class="footnote-ref" href="#fn:2"&gt;2&lt;/a&gt;&lt;/sup&gt; Obviously package support will vary depending on your operating system, but if you're just using your NAS for storage and then your GPU machine for training, inference or other math-heavy workloads, you should be fine.&lt;/p&gt;
&lt;p&gt;Networking is another place where people often get stuck. First, you might not actually want your data to be accessible outside of your home network, so I would recommend starting with just getting your services working locally. Then decide how you'd like to use your local storage when you aren't on the local network.&lt;/p&gt;
&lt;p&gt;Usually you'll set up some sort of VPN. I've had good experience using &lt;a href="https://tailscale.com/download"&gt;TailScale&lt;/a&gt;, but choose one that works for you.&lt;sup id="fnref:3"&gt;&lt;a class="footnote-ref" href="#fn:3"&gt;3&lt;/a&gt;&lt;/sup&gt; In addition, you'll likely want to better control your home router, which is why I use &lt;a href="https://openwrt.org/start"&gt;openwrt&lt;/a&gt;. More on this soon!&lt;/p&gt;
&lt;h3 id="choosing-something-youre-motivated-to-use-or-do"&gt;Choosing something you're motivated to use or do&lt;/h3&gt;
&lt;p&gt;Finally, the most important part is to make sure you choose a project you're actually motivated to do. My journey into NAS and self-hosting started because I wanted to back up my photos somewhere that wasn't Google or iCloud.&lt;/p&gt;
&lt;p&gt;For you, it might be similar, or another thing you like doing; like hosting your saved bookmarks, recipes, books or anything else. There are &lt;a href="https://github.com/awesome-selfhosted/awesome-selfhosted"&gt;so many self-hosted applications&lt;/a&gt; and &lt;a href="https://www.xda-developers.com/containers-for-self-hosting-beginners/"&gt;interesting guides for people new to self-hosting&lt;/a&gt;, probably just searching "self-host [Your idea here]" is enough to get you started. I can also recommend checking out &lt;a href="https://www.reddit.com/r/selfhosted/"&gt;the selfhosted subreddit&lt;/a&gt; for inspiration.&lt;/p&gt;
&lt;p&gt;Choose a project that you feel passionate enough about that when you hit a troubleshooting problem you are still motivated to fix it. Of course, take a hot cocoa or tea break in between, or even let a few days pass, but if you're motivated to overcome the initial obstacles of moving to self-hosting, your second, third and fourth projects will benefit greatly.&lt;/p&gt;
&lt;p&gt;Like with many things, starting a bit smaller and growing an idea over time has a lot of benefits. Be patient and enjoy the learning process. I'd be excited to hear how your self-hosting journey goes.&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;If you want to learn more about coding theory, I highly recommend &lt;a href="https://www.youtube.com/playlist?list=PLidiQIHRzpXLSQBywYbSZ5PUhkR6VWM2P"&gt;Mary Wooters course&lt;/a&gt;.&amp;#160;&lt;a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;I also always choose self-installed Linux because it generally has relatively good support for security patches. Whatever OS you use, make sure you are regularly updating your security packages -- and I would steer away from an unfamiliar OS that comes pre-installed on your device, as it might be a cheap Linux-based image that isn't kept updated.&amp;#160;&lt;a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:3"&gt;
&lt;p&gt;Some VPNs have pretty awful practices when it comes to privacy, so please make sure you choose a trusted and audited VPN to make sure they aren't tracking and selling your private data. Here's one horror story from &lt;a href="https://www.yahoo.com/news/articles/millions-private-chatgpt-conversations-being-140103898.html"&gt;Urban Proxy VPN&lt;/a&gt;.&amp;#160;&lt;a class="footnote-backref" href="#fnref:3" title="Jump back to footnote 3 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="personal-ai"></category></entry><entry><title>Building out my home AI Lab for private and local AI</title><link href="https://blog.kjamistan.com/building-out-my-home-ai-lab-for-private-and-local-ai.html" rel="alternate"></link><published>2026-01-15T09:00:00+01:00</published><updated>2026-01-15T09:00:00+01:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2026-01-15:/building-out-my-home-ai-lab-for-private-and-local-ai.html</id><summary type="html">&lt;p&gt;So, you wanna do at-home AI? Yes, you do!&lt;/p&gt;
&lt;p&gt;There's a bunch of great reasons to run your own AI including having more control over your data and models, learning more about how deep learning works, testing out new ideas without having to pay extra cloud or subscription costs and …&lt;/p&gt;</summary><content type="html">&lt;p&gt;So, you wanna do at-home AI? Yes, you do!&lt;/p&gt;
&lt;p&gt;There's a bunch of great reasons to run your own AI including having more control over your data and models, learning more about how deep learning works, testing out new ideas without having to pay extra cloud or subscription costs and building out your ability to run AI safely and under your own terms.&lt;/p&gt;
&lt;p&gt;In this short post, I'll show you the specs for my two machines; you can find out more about them via &lt;a href="https://youtu.be/3h_JCBVnHBI"&gt;my YouTube explainer&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I built out my first machine in 2017 to work on some adversarial machine learning projects and a few paid fine-tuning engagements. But probably more useful to talk about is my second machine, which I built out last year to bring to the &lt;a href="https://feministai.party"&gt;Feminist AI LAN party&lt;/a&gt; and to tinker with for my own AI projects.&lt;/p&gt;
&lt;div class="toc"&gt;&lt;span class="toctitle"&gt;Table of Contents&lt;/span&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#hardware-rundown"&gt;Hardware Rundown&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#software-tips"&gt;Software Tips&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#look-into-drivers"&gt;Look into drivers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#setting-up-your-python-and-gpu-software-environments"&gt;Setting up your Python and GPU-software environments&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#focus-on-one-virtual-environment-per-project-or-use-case"&gt;Focus on one virtual environment per project or use case&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#run-your-first-mlai-job"&gt;Run your first ML/AI job!&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;h3 id="hardware-rundown"&gt;Hardware Rundown&lt;/h3&gt;
&lt;p&gt;&lt;img alt="A photo of my very small, very rainbow gaming computer set up with 32GB of GPUs. The GPU lights up in rainbow colors, as does the RAM and the fans. The casing is pink and white." src="./images/2026/rainbow_gaming_pc.jpeg"&gt;
&lt;em&gt;My very small, very rainbow GPU machine&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Here's the full specs of my newest machine:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Name and Description&lt;/th&gt;
&lt;th&gt;Price (in EUR)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPU&lt;/td&gt;
&lt;td&gt;Gainward GEForce RTX Phantom&lt;/td&gt;
&lt;td&gt;2,999.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CPU&lt;/td&gt;
&lt;td&gt;AMD Ryzen 7 9800X3D CPU, 8 cores, 5.2 GHz, AM5 (Granite Ridge)&lt;/td&gt;
&lt;td&gt;538.79&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hard Drive&lt;/td&gt;
&lt;td&gt;Samsung 990 PRO Series NVMe SSD, PCIe 4.0 M.2 Type 2280 (4TB)&lt;/td&gt;
&lt;td&gt;328.03&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mainboard&lt;/td&gt;
&lt;td&gt;GIGABYTE B850I Aorus Pro, AMD B850 Mainboard&lt;/td&gt;
&lt;td&gt;298.99&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RAM&lt;/td&gt;
&lt;td&gt;G.Skill Trident Z5 CK DDR5-8200 RAM, CL40, XMP 3.0, CUDIMM - 48 GB Dual-Kit&lt;/td&gt;
&lt;td&gt;260.58&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Power Supply&lt;/td&gt;
&lt;td&gt;Corsair RMe Series RM1200e 80 PLUS Gold power supply&lt;/td&gt;
&lt;td&gt;239.90&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Housing&lt;/td&gt;
&lt;td&gt;GEJP-011 Jonsplus Z20 Micro-ATX&lt;/td&gt;
&lt;td&gt;94.90&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cooling&lt;/td&gt;
&lt;td&gt;Thermalright Peerless Assassin 120 SE A-RGB - 120 mm&lt;/td&gt;
&lt;td&gt;44.90&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Shipping&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;20.98&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Grand Total&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;4826.07&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;So for a total of 4826.07 euros, I have 32 GB of GPU memory and 48 GB of RAM. For me, compared with the hassle of getting my datasets to a cloud I trust and setting up their GPUs with software I like using, this was a worthwhile investment.&lt;/p&gt;
&lt;p&gt;If you're just getting started with deep learning and AI, I wouldn't advise getting such a beefy GPU. For a long time, I used the 16GB offered by my old machine for fine-tuning, running small models for inference and general ML/AI/DL tinkering.&lt;/p&gt;
&lt;p&gt;I bought these parts from a mixture of vendors including: MediaMarkt, Alternativ, NotebooksBilliger and CaseKing (who were also kind enough to let me drop in and exchange things, thank you!).&lt;/p&gt;
&lt;p&gt;I'll be sharing more about how to set up your own LAN party, but in case you're already in the purchasing mood, here's the networking equipment I brought with me:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Name and Description&lt;/th&gt;
&lt;th&gt;Price (in EUR)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Switch&lt;/td&gt;
&lt;td&gt;MikroTik Cloud Router Switch - CRS328-24P-4S+RM&lt;/td&gt;
&lt;td&gt;400.19&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;USB to Ethernet Connectors&lt;/td&gt;
&lt;td&gt;TP-LINK UE300C UE300C USB Type-C to RJ45 Gigabit Ethernet Network Adapter x 10 (18.07 per part)&lt;/td&gt;
&lt;td&gt;180.70&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Grand Total&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;580.89&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;I got these from MediaMarkt and OMG.de. If you want a smaller switch with fewer capabilities, that'll do fine, especially if you're only connecting one server.&lt;/p&gt;
&lt;p&gt;I couldn't find all the specs for my 2017 machine (oops!), but here's what I could find.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A photo of my older machine, with two green lit up GPUs. You can see it is much larger than the newer one." src="./images/2026/old_pc.jpeg"&gt;
&lt;em&gt;My old but trustworthy GPU machine&lt;/em&gt;&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Name and Description&lt;/th&gt;
&lt;th&gt;Price (in EUR)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPUs&lt;/td&gt;
&lt;td&gt;Gainward GeForce GTX 1080 Ti Phoenix GS x 2 (799 per GPU)&lt;/td&gt;
&lt;td&gt;1598.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RAM&lt;/td&gt;
&lt;td&gt;Kingston ValueRAM DIMM 16 GB DDR4-2400 x 8 (90 per unit)&lt;/td&gt;
&lt;td&gt;1463.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mainboard&lt;/td&gt;
&lt;td&gt;ASUS X99-E WS&lt;/td&gt;
&lt;td&gt;494&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Power Supply&lt;/td&gt;
&lt;td&gt;Corsair AX1500i 1500 Watt 80+ Titanium Quality&lt;/td&gt;
&lt;td&gt;450&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hard Drive&lt;/td&gt;
&lt;td&gt;Intel® 480GB DC S4600 Series 2.5" SATA, Solid State Drive&lt;/td&gt;
&lt;td&gt;384&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Housing&lt;/td&gt;
&lt;td&gt;Corsair Carbide Air 540&lt;/td&gt;
&lt;td&gt;175&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CPU&lt;/td&gt;
&lt;td&gt;??&lt;/td&gt;
&lt;td&gt;??&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cooling&lt;/td&gt;
&lt;td&gt;??&lt;/td&gt;
&lt;td&gt;??&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Incomplete Total&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;4564&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;If you're interested, I'll see if I can find the receipts or a price graph and dig around my computer to fill out the above chart.&lt;/p&gt;
&lt;p&gt;Back then, I bought most of the supplies from Amazon and Alternativ.&lt;/p&gt;
&lt;p&gt;As you can see, RAM cost a lot more back then (about 11 euros per gigabyte compared to about 5 euros now). That trend should continue. It's a lot harder to time pricing of GPUs mainly because the market is now very unpredictable and there are only a few suppliers.&lt;/p&gt;
&lt;p&gt;I'll be testing out a few other GPUs this year and maybe building out a machine live on YouTube, so stay tuned.&lt;/p&gt;
&lt;h3 id="software-tips"&gt;Software Tips&lt;/h3&gt;
&lt;p&gt;If you're new to setting up a computer for AI use, you'll probably have a learning curve for setting up Linux. Just be patient with yourself and take it slow!&lt;/p&gt;
&lt;p&gt;If you've already been running linux either at work on servers or on your own machines, there are still some tips if you're new to AI/ML workflows. I'll try to summarize some here and I'd be happy to update with additional feedback.&lt;/p&gt;
&lt;h4 id="look-into-drivers"&gt;Look into drivers&lt;/h4&gt;
&lt;p&gt;Depending on how old or new your GPUs are, you might run into driver issues. This is because the GPU providers sometimes change the chip architecture and the open-source drivers might not yet support it.&lt;/p&gt;
&lt;p&gt;This means first looking into what drivers you can use with the GPU you bought. Usually someone has posted about this, so I recommend looking at the GPU specifications page and then searching the internet. If the open-source drivers work, then install those. If only proprietary drivers work, install those.&lt;/p&gt;
&lt;p&gt;There will be additional libraries that you need that are usually distributed via your linux OS. For example, in debian/Ubuntu there are several supporting packages for NVIDIA GPUs that are required and often start with nvidia-. Look into specifically which of these work for the drivers that you choose.&lt;/p&gt;
&lt;p&gt;You can usually run a command once the drivers are properly installed to test that the mainboard and operating system can talk to your GPU. First, always reboot so the new drivers load properly. Then, I run nvidia-smi to see that my Ubuntu install can see my chip. There are different commands for AMD GPUs.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A screenshot of the output of nvidia-smi command. It shows a table where the GPU is listed and then a following table with the processes running on the GPU." src="./images/2026/nvidia-smi.png"&gt;&lt;/p&gt;
&lt;h4 id="setting-up-your-python-and-gpu-software-environments"&gt;Setting up your Python and GPU-software environments&lt;/h4&gt;
&lt;p&gt;The next thing you'll need to do is set up your Python environments either using something like &lt;a href="https://www.anaconda.com/download"&gt;conda&lt;/a&gt; or &lt;a href="https://docs.astral.sh/uv/"&gt;uv&lt;/a&gt;. Many people new to Python prefer uv, so if you haven't done a lot of Python, start there.&lt;/p&gt;
&lt;p&gt;In addition, your GPU has specific software that helps the Python libraries run the correct parallelization for your chips. For NVIDIA GPUs this is &lt;a href="https://developer.nvidia.com/cuda-downloads"&gt;CUDA&lt;/a&gt;. For AMD, you can use several open libraries, like &lt;a href="https://www.compilersutra.com/docs/gpu/opencl/basic/setting_up_opencl/"&gt;OpenCL&lt;/a&gt;, or use &lt;a href="https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html"&gt;ROCm&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;You should install this GPU-specific software from the vendor, making sure you select the right operating system and CPU architecture. Some of these libraries are also available via your operating system's packages, so have a look and see whether they are updated to your liking.&lt;/p&gt;
&lt;p&gt;I usually just install CUDA from the NVIDIA website, so I can always opt for the newest version. This is because some of the Python libraries you want to use might only support newer versions.&lt;/p&gt;
&lt;p&gt;&lt;img alt="An example of the CUDA installer selection, where you need to choose your operating system, your CPU architecture and what type of installer you want to use." src="./images/2026/cuda_installer_selection.png"&gt;
&lt;em&gt;CUDA Installer Example&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Once you have the GPU-software and your initial Python environment running, you can get started with AI/ML specific libraries.&lt;/p&gt;
&lt;h4 id="focus-on-one-virtual-environment-per-project-or-use-case"&gt;Focus on one virtual environment per project or use case&lt;/h4&gt;
&lt;p&gt;Always use new virtual environments for new projects. Because many of the ML/AI libraries will have underlying dependencies based on Python version, GPU-software version and other Python libraries, this means focusing on the most important library you want to use first.&lt;/p&gt;
&lt;p&gt;This means, for your project, making a new virtual environment with a Python version you know is supported by the most important library you want to use for the project. Usually, I start with &lt;a href="https://pytorch.org/get-started/locally/"&gt;PyTorch&lt;/a&gt;, but you may start with something else, like vLLM or Hugging-Face or some other library you want to choose.&lt;/p&gt;
&lt;p&gt;PyTorch has an easy-to-use recommender that helps you decide how to install it based on your operating system and GPU-software. Not all libraries will have something like this, so you might need to do some trial-and-error if you find that it doesn't work on your machine (like trying an earlier version of the library, or searching around to see if anyone has solved it).&lt;/p&gt;
&lt;p&gt;&lt;img alt="An example selector for which torch version to install. You choose which operating system, which package provider, which language and what type of GPU-software you are using. Then it outputs what command to use for installation." src="./images/2026/torch_installer_selection.png"&gt;
&lt;em&gt;Torch Installer Example&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;One thing to note: if you upgrade your CUDA/ROCm, you might break some of your virtual environments. This is just a pain you'll need to live with, and also a reason to start with the latest GPU-software available. There are ways to run multiple CUDA/ROCm installs, but I haven't actually done that for my personal projects yet.&lt;/p&gt;
&lt;p&gt;Once you get your main library installed, test it! For example, for PyTorch and CUDA, you can start a Python shell in your virtual environment and see that it runs on your GPU.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;torch&lt;/span&gt;
&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_available&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;It should return True. You can also use &lt;code&gt;torch.cuda.get_device_name(0)&lt;/code&gt; to ensure that it matches your expectations.&lt;/p&gt;
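&lt;p&gt;If you want one more check beyond the availability flag, a tiny computation on the GPU confirms the whole stack works end to end (this assumes an NVIDIA/CUDA setup; the device string may differ for other builds):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;import torch

# Run a small matrix multiply on the GPU and report which device did the work.
x = torch.rand(2048, 2048, device="cuda")
y = x @ x
print(torch.cuda.get_device_name(0), y.device)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;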
&lt;p&gt;Then, get started installing anything else you might want for the project. Make sure that the version of your main library does not change, either by specifying which version you want to keep as an additional installation requirement (e.g. install YOUR_MAIN_LIBRARY==VERSION new_library) or by keeping an eye on any libraries that get changed during installation.&lt;/p&gt;
&lt;h4 id="run-your-first-mlai-job"&gt;Run your first ML/AI job!&lt;/h4&gt;
&lt;p&gt;At this point, you're ready to test out your setup for running a workflow. Train your first model, fine-tune something, or serve a model that you want to use. Check that when it loads, it says you are using your GPU. If not, restart at the top and verify each step (i.e. drivers are working, GPU-software is working, library is working, workflow test).&lt;/p&gt;
&lt;p&gt;Enjoy! For more videos on how to run things on your setup, check out the &lt;a href="https://www.youtube.com/@ProbablyPrivate"&gt;Probably Private YouTube&lt;/a&gt;.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;If this post helped you, consider &lt;a href="https://probablyprivate.com/subscribe"&gt;subscribing to my newsletter&lt;/a&gt; or &lt;a href="https://www.youtube.com/@ProbablyPrivate?sub_confirmation=1"&gt;my YouTube&lt;/a&gt; and sharing my work! I also offer &lt;a href="https://kjamistan.com"&gt;advisory and workshops&lt;/a&gt; on topics like security and privacy in AI/ML and personal AI.&lt;/p&gt;</content><category term="personal-ai"></category></entry><entry><title>Differential Privacy in Today's AI: What's so hard?</title><link href="https://blog.kjamistan.com/differential-privacy-in-todays-ai-whats-so-hard.html" rel="alternate"></link><published>2026-01-06T00:00:00+01:00</published><updated>2026-01-06T00:00:00+01:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2026-01-06:/differential-privacy-in-todays-ai-whats-so-hard.html</id><summary type="html">&lt;p&gt;In the last article in the &lt;a href="https://blog.kjamistan.com/a-deep-dive-into-memorization-in-deep-learning.html"&gt;series on addressing the problems of memorization in deep learning and AI&lt;/a&gt;, you learned about differential privacy and how to apply it to deep learning/AI systems. In this article, you'll explore what can go wrong when using differential privacy training in deep learning …&lt;/p&gt;</summary><content type="html">&lt;p&gt;In the last article in the &lt;a href="https://blog.kjamistan.com/a-deep-dive-into-memorization-in-deep-learning.html"&gt;series on addressing the problems of memorization in deep learning and AI&lt;/a&gt;, you learned about differential privacy and how to apply it to deep learning/AI systems. In this article, you'll explore what can go wrong when using differential privacy training in deep learning and open questions around using differential privacy to address memorization in machine learning.&lt;/p&gt;
&lt;p&gt;In this article, you'll confront some larger issues in applying differential privacy in deep learning systems. These issues span beyond tuning parameters and applying technical thinking into how organizations function and what privacy means to us as individuals and society.&lt;/p&gt;
&lt;p&gt;You'll address the following questions:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;What data needs to be protected in the ML lifecycle?&lt;/li&gt;
&lt;li&gt;Is your data representative enough to learn while using DP?&lt;/li&gt;
&lt;li&gt;Can some tasks ever be private?&lt;/li&gt;
&lt;li&gt;How can you work across disciplines to develop real use cases?&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id="defining-sensitive-data-and-governing-its-processing"&gt;Defining sensitive data and governing its processing&lt;/h3&gt;
&lt;p&gt;You might recall from the last article that a group of Google researchers pretrained a BERT model (LLM) using differential privacy. This isn't as common as you might think; in many instances of using differential privacy at scale, practitioners first pretrain or train a base model on "public" data and then later fine-tune the model using differential privacy.&lt;/p&gt;
&lt;p&gt;There are certainly accuracy benefits when doing a part of the training without differential privacy. Many companies claim scraped data is public anyways, so there's no problem training on it. But, is all scraped data from the web really "public"? Are there any privacy problems with pretraining on internet-scale data scraped from the web?&lt;/p&gt;
&lt;p&gt;&lt;a href="https://arxiv.org/abs/2212.06470"&gt;Tramèr et al. (2024) released a position paper&lt;/a&gt; calling for a more nuanced approach to what data is considered public. The authors cite a real example where someone's phone number was memorized in a LLM. When the person asked for their data to be deleted, the company responded that memorization wasn't possible because they fine-tuned with differential privacy.&lt;sup id="fnref:1"&gt;&lt;a class="footnote-ref" href="#fn:1"&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;What actually happened was that person's number was exposed during pretraining on "public" data due to a file being on the public internet with the person's information. The model memorized it from the "public" pretraining and no amount of differential privacy fine-tuning mattered (in this instance).&lt;/p&gt;
&lt;p&gt;The paper also highlights a GitHub user who accidentally published their cryptocurrency wallet information. When they realized this, they deleted it; however, Copilot already memorized the string and someone extracted that information and emptied the wallet.&lt;/p&gt;
&lt;p&gt;So is web scraped data really "public"?&lt;/p&gt;
&lt;p&gt;The authors write:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Someone may post their contact information along with a research publication with the intent that it is used to contact that person about details of the publication. Sensitive data about individuals could also be uploaded to the Internet unintentionally (or by third parties privy to this information). As a result, people often underestimate how much information about them is accessible on the Web, and might not consent to their “publicly accessible” personal data being used for training machine learning models.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;In both of these cases privacy wasn't considered as part of the entire machine learning life cycle. To properly address privacy, you'd need to look at the entire system: from dataset collection, preprocessing, embedding models and the actual AI/deep learning system.&lt;/p&gt;
&lt;p&gt;Beyond deciding which data should be used with or without differential privacy, there are open questions around how to best apply differential privacy in machine learning setups. One of which is the ability to reason about the impact of differential privacy on groups in the dataset.&lt;/p&gt;
&lt;h4 id="is-the-data-representative-enough-for-error"&gt;Is the data representative enough for error?&lt;/h4&gt;
&lt;p&gt;For training deep learning, you need to evaluate the training pipeline in its entirety alongside deciding what data you have for the task. But before you know whether your training will even work, how can you decide if you have the right data to learn? And what do you already know about what you need to learn?&lt;/p&gt;
&lt;p&gt;&lt;a href="https://arxiv.org/abs/2011.11660"&gt;Tramèr and Boneh (2021)&lt;/a&gt; found that to reach the same learning accuracy a differentially private deep learning model might need an "order of magnitude" more data.&lt;/p&gt;
&lt;p&gt;How come? Well, if the model cannot memorize or learn from just one example, it must process many examples of that concept to learn it privately.&lt;/p&gt;
&lt;p&gt;If you studied machine learning or have worked in the field for some time, you might have come across PAC learning theory or universal learning theory. These theories help the field discover more about how machine learning works, how to make it more efficient, how to evaluate learning bounds and determine what data is required to appropriately learn. Too often, learnings from this field are not integrated into how today's AI/ML systems are built. This means you are often doing things more inefficiently than necessary.&lt;/p&gt;
&lt;p&gt;There's been significant evolution in the overlap between learning theory and differential privacy. For example, there are bounds on what can be learned (PAC theory) when it comes to using differential privacy on certain complex distributions. There's also mathematical proof that all distributions are differentially privately learnable, although not necessarily efficiently.&lt;sup id="fnref:2"&gt;&lt;a class="footnote-ref" href="#fn:2"&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;One thing worth considering when choosing datasets and collecting other people's data for deep learning is that the underlying distribution and sample complexity impact whether you can learn privately. You already learned this by understanding that complex classes or examples are prone to memorization and hard to unlearn, but I want you to consider the problem from a different perspective.&lt;/p&gt;
&lt;p&gt;In an applied setting: If you don't have enough &lt;em&gt;different&lt;/em&gt; and diverse data, or if you don't have a well-defined problem space and can't find data that adequately represents that problem, you probably won't be able to learn privately. And to be honest, you might not even be able to learn well without differential privacy!&lt;/p&gt;
&lt;p&gt;This means spending time to understand your training data and how it represents the task you are trying to learn. This means looking critically at benchmarks and deciding if the data really represents the task.&lt;/p&gt;
&lt;p&gt;My advice: think through your problem and task deeply. Figure out what you actually really need to learn and what is superfluous. Then determine if you can actually simulate, collect or produce data that matches that requirement. This will help you not only learn privately, but also more efficiently.&lt;/p&gt;
&lt;p&gt;Evaluating this as a team can spark useful conversations about what is worth learning and at what cost.&lt;/p&gt;
&lt;h2 id="is-it-possible-to-train-this-model-privately"&gt;Is it possible to train this model privately?&lt;/h2&gt;
&lt;p&gt;It can be difficult to determine whether a deep learning/AI model can be private given a particular use case and dataset. Is it private if the model memorizes sensitive data, despite differential privacy, due to repeated exposure? This can happen when multiple sources in the data contain the same sensitive information.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://arxiv.org/abs/2202.05520"&gt;Brown et al (2022)&lt;/a&gt; asked this same question specifically for language models. The authors argue that &lt;a href="https://en.wikipedia.org/wiki/Contextual_integrity"&gt;Nissenbaum's contextual integrity&lt;/a&gt; should apply to language models. Nissembaum's theory says each user should have autonomy about how, where and in what context their data appears. The authors argue the only data that matches how LLMs are used today is data intended to be freely available for the general public.&lt;/p&gt;
&lt;p&gt;Text origin and ownership are often difficult to define, yet defining them is a key decision for appropriately applying differential privacy. For example, as you learned &lt;a href="https://blog.kjamistan.com/differential-privacy-in-deep-learning.html"&gt;in the last article&lt;/a&gt;, to do appropriate privacy accounting, you define how much one person can contribute to the training. This is surprisingly difficult for text data because sometimes someone is quoting another person, or paraphrasing or referencing. Or someone may use different accounts or handles but be the same person. How can you define authorship well enough to apply differential privacy?&lt;sup id="fnref:3"&gt;&lt;a class="footnote-ref" href="#fn:3"&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;The authors also ask: in what context did I write this and to whom? Text is easy to forward, duplicate and share in new ways. Someone can forward my email, quote me or paraphrase something without attribution. This means that the original author and intent are easily lost, and person-level attribution and accounting ends up being quite difficult.&lt;/p&gt;
&lt;p&gt;The memorization that can happen, even with state-of-the-art differentially private language models, affects real lives. In 2021, &lt;a href="https://arxiv.org/abs/2104.07762"&gt;researchers found that people's names appeared alongside medical conditions&lt;/a&gt; that were extractable from clinical notes that were leaked online and appeared in the training data.&lt;/p&gt;
&lt;p&gt;The authors' thesis can be applied beyond language, since digitized data easily loses its sources and context. Nissenbaum's theory states that our human understanding of privacy doesn't translate well to the digital world. It's easy to accidentally overshare, to post something that goes beyond its original context or to take someone else's data and share it in a way they never intended.&lt;/p&gt;
&lt;p&gt;For those of us working on this problem: what large language models can truly be "privacy preserving"? When can you ensure the guarantees match the real-world concerns and context? Is example-level privacy good enough for the problem at hand? Do we need to think through attribution at a higher level?&lt;/p&gt;
&lt;p&gt;In my opinion, the field would benefit from more multidisciplinary teams having these discussions. This would help:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;ensure researchers are working on real-world problems&lt;/li&gt;
&lt;li&gt;align legal understandings with technical ones&lt;/li&gt;
&lt;li&gt;spark social conversations around what data goes into AI and what protections are expected&lt;/li&gt;
&lt;li&gt;develop new model types: ones that can attribute sources for example&lt;/li&gt;
&lt;li&gt;create new business models: ones where people opt into having their data used either for compensation or for co-use&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Privacy is, by default, a human understanding, and the best bet for achieving real privacy in ML/AI involves putting humans at the center of the conversation.&lt;/p&gt;
&lt;h4 id="developing-multidisciplinary-thinking"&gt;Developing multidisciplinary thinking&lt;/h4&gt;
&lt;p&gt;Defining the use cases, tasks and training data is rarely multidisciplinary. In many organizations, product defines the use cases and sends them over to the data or machine learning team, or even hires a third party provider to do so.&lt;/p&gt;
&lt;p&gt;This game of telephone means that sometimes the use case and task are not well aligned with the data and tools available, or even well defined. When this happens, it often also means that the privacy and security requirements end up being misunderstood or not well translated.&lt;/p&gt;
&lt;p&gt;Even when working with third parties, healthy conversations and co-design in multidisciplinary teams can improve both product performance and how well the privacy and security requirements are met.&lt;/p&gt;
&lt;p&gt;Ideally the product lifecycle includes privacy, security, machine learning and risk stakeholders from the beginning. It could look something like this:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Start product ideation as a multidisciplinary team&lt;/li&gt;
&lt;li&gt;Threat model and provide privacy engineering input based on early architecture, data and design choices (nothing built yet)&lt;/li&gt;
&lt;li&gt;Begin model, data and software development using identified privacy technologies and best practices&lt;/li&gt;
&lt;li&gt;Evaluate models based on specs from product, privacy and security&lt;/li&gt;
&lt;li&gt;Finalize model candidates and integrate them into stack&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=r30lMqmU-S0&amp;amp;list=PLJkNSeYcYBlC88vkG58yx3fHSobmCmDw_"&gt;Purple team&lt;/a&gt; models and perform privacy testing&lt;/li&gt;
&lt;li&gt;Tweak guardrails, controls and model choice based on attack and evaluation success&lt;/li&gt;
&lt;li&gt;Launch after sign-off from the multidisciplinary team&lt;/li&gt;
&lt;li&gt;Check-in and re-evaluate based on changing risk and model landscape and learnings from other products&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By testing privacy technologies like differential privacy as you go, the organization and involved teams gain knowledge, understanding and experience on how to use them effectively. Eventually there will be enough experience to expedite decision making on design patterns and to effectively integrate privacy and security technologies into common stack choices and platform design.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A card deck showing a title card that reads &amp;quot;Let's play Singularity: an AI governance game from Thoughtworks&amp;quot;" src="./images/2026/singularity_card_game.jpg"&gt;
&lt;em&gt;I helped co-develop the &lt;a href="https://www.thoughtworks.com/en-de/insights/blog/generative-ai/lets-play-singularity-ai-governance-card-game"&gt;Thoughtworks Singularity card game&lt;/a&gt; as a fun way to practice multidisciplinary threat modeling and risk assessment alongside my colleagues Jim Gumbley and Erin Nicholson.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;By involving many stakeholders in risk evaluation and mitigations, you'll also develop a more mature approach to data and AI. There's something special about evaluating risk as a team because understanding risk also means actually building a deeper understanding of the system and its parts. As a team you'll learn more about how models work, when they fail and what you should do about it.&lt;/p&gt;
&lt;p&gt;In the next article, you'll investigate what is already known about auditing privacy and differential privacy choices in deep learning systems, and explore what isn't yet known or regular practice.&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;This is fairly common practice and a good idea if you are fine-tuning and not actually pretraining your own models. You can follow the same advice from &lt;a href="https://blog.kjamistan.com/differential-privacy-in-deep-learning.html"&gt;the last article&lt;/a&gt; or &lt;a href="https://practicaldataprivacybook.com/"&gt;in my book on the topic&lt;/a&gt;.&amp;#160;&lt;a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;If you have an hour to spare, I highly recommend &lt;a href="https://www.youtube.com/watch?v=wk910Aj559A"&gt;this lecture on the topic from Shay Moran, an expert in studying learning theory and differential privacy&lt;/a&gt;.&amp;#160;&lt;a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:3"&gt;
&lt;p&gt;This is why many deep learning models try to protect privacy at the per-example level, but it's a very good point that this will certainly leave prolific creators, writers, journalists and frequently quoted people overexposed by design.&amp;#160;&lt;a class="footnote-backref" href="#fnref:3" title="Jump back to footnote 3 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:4"&gt;
&lt;p&gt;If you need an example of how to get started, check out my &lt;a href="https://www.youtube.com/watch?v=r30lMqmU-S0&amp;amp;list=PLJkNSeYcYBlC88vkG58yx3fHSobmCmDw_"&gt;Probably Private YouTube series on Purple Teaming AI Models&lt;/a&gt;.&amp;#160;&lt;a class="footnote-backref" href="#fnref:4" title="Jump back to footnote 4 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="ml-memorization"></category></entry><entry><title>Differential Privacy in Deep Learning</title><link href="https://blog.kjamistan.com/differential-privacy-in-deep-learning.html" rel="alternate"></link><published>2025-11-10T00:00:00+01:00</published><updated>2025-11-10T00:00:00+01:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2025-11-10:/differential-privacy-in-deep-learning.html</id><summary type="html">&lt;p&gt;Differential privacy influenced both privacy attacks and defenses you've investigated in this &lt;a href="https://blog.kjamistan.com/a-deep-dive-into-memorization-in-deep-learning.html"&gt;series on AI/ML memorization&lt;/a&gt;. You might be wondering: what exactly is differential privacy when it's applied to deep learning? And can it address the problem of memorization?&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Are you a visual learner? There's &lt;a href="https://youtu.be/p6p9i1Hbcns"&gt;a YouTube video on …&lt;/a&gt;&lt;/p&gt;&lt;/blockquote&gt;</summary><content type="html">&lt;p&gt;Differential privacy influenced both privacy attacks and defenses you've investigated in this &lt;a href="https://blog.kjamistan.com/a-deep-dive-into-memorization-in-deep-learning.html"&gt;series on AI/ML memorization&lt;/a&gt;. You might be wondering: what exactly is differential privacy when it's applied to deep learning? And can it address the problem of memorization?&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Are you a visual learner? There's &lt;a href="https://youtu.be/p6p9i1Hbcns"&gt;a YouTube video on this article&lt;/a&gt; on the Probably Private YouTube channel.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;In this article, you'll learn how differential privacy is applied to today's AI/deep learning models and evaluate if this could be a useful approach for addressing memorization problems. In following articles, you'll explore limitations of applying differential privacy in today's systems and critically think through auditing real-world applications.&lt;/p&gt;
&lt;div class="toc"&gt;&lt;span class="toctitle"&gt;Table of Contents&lt;/span&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="#what-is-differential-privacy-in-machine-learning"&gt;What is differential privacy in machine learning?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#can-differential-privacy-help-with-memorization-and-privacy-attacks"&gt;Can differential privacy help with memorization and privacy attacks?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#how-does-it-work"&gt;How does it work?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#tips-from-the-field"&gt;Tips from the field&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#framing-the-problem-appropriately"&gt;Framing the problem appropriately&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;h3 id="what-is-differential-privacy-in-machine-learning"&gt;What is differential privacy in machine learning?&lt;/h3&gt;
&lt;p&gt;My favorite definition of differential privacy comes from &lt;a href="https://arxiv.org/abs/1906.01337"&gt;Desfontaines and Pejó's 2022 paper&lt;/a&gt; (brackets are added for this particular scenario):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;An attacker with perfect background knowledge (B) and unbounded computation power (C) is unable (R) to distinguish (F) anything about an individual (N) [when querying a machine learning model], uniformly across users (V) [whose data was in the training dataset] even in the worst-case scenario (Q)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That seems like a pretty high bar compared to what you've been evaluating with &lt;a href="https://blog.kjamistan.com/machine-unlearning-what-is-it.html"&gt;unlearning&lt;/a&gt;! Differential privacy offers a fairly strict and rigorous definition of privacy. For that reason, the variables (shown as letters above) are often combined in different ways to determine whether a stronger or weaker definition should be used, based on use-case or context-specific privacy requirements.&lt;/p&gt;
&lt;p&gt;Differential privacy provides guarantees for every individual when applied to data collection, access and use, while still offering enough information to learn something. This is the balance between data utility and individual privacy.&lt;/p&gt;
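&lt;p&gt;To make the utility-versus-privacy balance concrete before diving into deep learning, here's a minimal sketch of the simplest possible differentially private mechanism: a counting query with Laplace noise. This is my own illustrative toy, not something from the paper above, but it shows the core trade: the smaller the epsilon, the noisier (and more private) the answer.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np

def dp_count(records, epsilon):
    """Counting query with the Laplace mechanism.

    One person joining or leaving changes the true count by at most 1
    (the sensitivity), so adding noise drawn from Laplace(scale = 1/epsilon)
    yields an epsilon-differentially-private answer.
    """
    rng = np.random.default_rng()
    true_count = len(records)
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Smaller epsilon: stronger privacy, noisier (less useful) answer.
for eps in (0.1, 1.0, 10.0):
    print(eps, dp_count(range(1000), eps))
&lt;/code&gt;&lt;/pre&gt;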
&lt;hr&gt;
&lt;p&gt;&lt;strong&gt;TIP&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;If you're new to differential privacy, I highly recommend taking a gander through &lt;a href="https://desfontain.es/blog/friendly-intro-to-differential-privacy.html"&gt;Damien Desfontaines's introduction and in-depth articles&lt;/a&gt;. You can thank me later. :)&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;Ideally when doing machine learning, you are learning from many persons not from one individual; therefore, differential privacy is a natural fit if you want to make sure that you learn from a group and not from any one specific person.&lt;/p&gt;
&lt;p&gt;But, as you've learned throughout this series, this can be challenging when novel, complex examples show up...&lt;/p&gt;
&lt;h3 id="can-differential-privacy-help-with-memorization-and-privacy-attacks"&gt;Can differential privacy help with memorization and privacy attacks?&lt;/h3&gt;
&lt;p&gt;You might recall from &lt;a href="https://blog.kjamistan.com/differential-privacy-as-a-counterexample-to-aiml-memorization.html"&gt;the previous article on differential privacy&lt;/a&gt; that differentially private models reveal memorization problems in deep learning. Differentially private models underperformed on particular novel examples when compared with their non-differentially-private counterparts.&lt;/p&gt;
&lt;p&gt;Indeed, this is often the case. In research from &lt;a href="https://arxiv.org/abs/2202.07623"&gt;Stock et al (2022)&lt;/a&gt;, models trained using differential privacy successfully defended against reconstruction attacks and protected the training data better than non-DP models. The authors found that although some membership inference attacks succeeded, they were unable to extract and reconstruct the training data based on the differentially private model responses, even when the MIA was successful.&lt;/p&gt;
&lt;p&gt;To test this, they inserted canaries into the dataset and specifically targeted these canaries. They found that a model without differential privacy memorized the canaries and they were easy to reconstruct using normal exfiltration attacks. The DP model, even one with fairly weak privacy parameters, revealed only that it had seen the canary via a successful membership inference attack, but without the canary example in hand, an attacker could not extract the canary from the model.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Three charts comparing different epsilon's and differential privacy model secret extraction. In each chart there are two research papers that are compared and a line that shows the secret disclosure. The x-axis is the secret length and the y-axis is the leakage (i.e. how many bits can be successfully extracted). The first chart shows an epsilon of 21.2 and one of the DP models from an earlier paper is above the secret disclosure line for all secrets, which proves that the secret is easy to extract. The line for this improved DP paper shows shorter secrets are easy to extract, but crosses the disclosure line around 8-bit secret length. For the second chart showing epsilon 7.7 you can see that both papers allow smaller secrets to be extracted (or up to a certain number of bits - less than 4), but not longer. For the final chart with epsilon 3.8 you can see that no secrets can be extracted from either paper." src="./images/2025/renyi_epsilons_extraction.png"&gt;
&lt;strong&gt;Comparing epsilons under secret extraction, Stock et al (2022)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The charts above show the authors' results compared with an earlier paper on differentially private deep learning. You can see that for both papers, the epsilon choice, which is a critical part of effectively applying differential privacy, has a direct effect on the ability to extract a secret. You can also see that as epsilon lowers, so does the ability to extract that secret successfully from the model. In case it isn't clear, these results are specific to their implementation and shouldn't be used as evidence that every system will behave the same way.&lt;/p&gt;
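&lt;p&gt;If you want to try the canary methodology on your own models, a rough sketch looks like the following. Note this is my simplification of the general technique, not the authors' exact setup, and &lt;code&gt;generate&lt;/code&gt; is a placeholder for whatever decoding call your model serving stack provides.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import secrets

def make_canary(prefix="my password is "):
    """Create a unique secret string to plant in the training data."""
    secret = secrets.token_hex(4)          # e.g. "9f2a41bc"
    return prefix + secret, prefix, secret

def canary_extracted(model, generate, prompt, secret, num_samples=100):
    """Return True if completions of the canary prefix ever reproduce the
    secret, i.e. the model memorized the planted canary verbatim."""
    for _ in range(num_samples):
        completion = generate(model, prompt)   # hypothetical decoding helper
        if secret in completion:
            return True
    return False
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In a real evaluation you'd plant several canaries of varying lengths and frequencies, train with and without differential privacy, and compare extraction rates, which is roughly what the charts above are measuring.&lt;/p&gt;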
&lt;p&gt;If you've used differential privacy for training before, you might be worried that it cannot be applied successfully to today's largest models and still achieve accuracy. However, a Google Research group was able to &lt;a href="https://arxiv.org/abs/2108.01624"&gt;pre-train a BERT model in 2021 achieving 60% accuracy&lt;/a&gt;, less than 10% "worse" than the non-DP counterpart. In addition, DeepMind just released their first &lt;a href="https://research.google/blog/vaultgemma-the-worlds-most-capable-differentially-private-llm/"&gt;differentially private Gemma model&lt;/a&gt;, which scored quite high on several benchmarks and is available as an open weight model.&lt;/p&gt;
&lt;p&gt;Interestingly, differential privacy can be used to successfully protect other parts of the machine learning infrastructure from potential information leakage. In research from &lt;a href="https://arxiv.org/abs/2305.15594"&gt;Duan et al. (2023)&lt;/a&gt;, the authors discovered that they could successfully generate privacy-preserving alternative prompts via a differential privacy mechanism when using a blackbox LLM. These differentially private prompts leaked less information when the prompt came from a private prompt source. This can be useful for real-world use cases, such as when you ask a Code Assistant to update confidential code.&lt;/p&gt;
&lt;p&gt;Similarly, research around MIAs shows that differential privacy is an effective protection. When investigating "label-only" MIAs, where the model only returns the label (no confidence interval), &lt;a href="https://arxiv.org/abs/2007.14321"&gt;Choquette-Choo et al. (2021)&lt;/a&gt; found that "training with differential privacy or strong l2-regularization are the only current defenses that meaningfully decrease leakage of private information, even for points that are outliers of the training distribution".&lt;/p&gt;
&lt;p&gt;Okay, I'm sold! So, how can you actually implement differential privacy in a deep learning system effectively?&lt;/p&gt;
&lt;h3 id="how-does-it-work"&gt;How does it work?&lt;/h3&gt;
&lt;p&gt;Differentially private stochastic gradient descent (DP-SGD) is the traditional and still often used approach to training models with privacy. The definition comes from Abadi et al. in 2016, and I highly recommend &lt;a href="https://www.youtube.com/watch?v=ZxDBEyjiPxI"&gt;watching this video on how it works, especially if you are a visual learner&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Essentially, DP-SGD allows you to use the same deep learning libraries you would normally use (like PyTorch) and apply gradient clipping, averaging and carefully selected noise to protect the individual examples. The process looks like this:&lt;/p&gt;
&lt;p&gt;&lt;img alt="A graphic to walk through the differentially private stochastic gradient descent method." src="./images/2025/dpsgd.png"&gt;&lt;/p&gt;
&lt;p&gt;In the graphic above from &lt;a href="https://practicaldataprivacybook.com/"&gt;my book &lt;em&gt;Practical Data Privacy&lt;/em&gt;&lt;/a&gt;, you see a training epoch that starts with a mini-batch, which is a selected sample of the training data.&lt;/p&gt;
&lt;p&gt;This mini-batch is then broken down into a per sample (i.e. assumed per user) size, where the gradient is calculated by taking the derivative of the loss function with respect to the current model weights. Think of this gradient as "how much does this sample change the model".&lt;/p&gt;
&lt;p&gt;Then, each gradient is clipped to provide some protection for large changes and aggregated back into the mini-batch (there are several methods to optimize this) with the other gradients: here by a sum and then an average.&lt;/p&gt;
&lt;p&gt;The differential privacy noise is then added to the batch, noting that when you group multiple updates together you get additional privacy guarantees. The resulting gradients (now sorted by layer) are used to update model weights and restart the process until training finishes.&lt;/p&gt;
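&lt;p&gt;Here's a compact sketch of what one DP-SGD step looks like in PyTorch, written as an explicit per-example loop so the mechanics stay visible. Treat this as a teaching toy under my own simplifications: real libraries (Opacus, tensorflow-privacy) vectorize the per-sample gradients and, crucially, do the privacy accounting for you.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import torch

def dp_sgd_step(model, loss_fn, batch_x, batch_y,
                lr=0.1, clip_norm=1.0, noise_multiplier=1.1):
    """One simplified DP-SGD step: per-example clipping, noise, average, update."""
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]

    # 1. Per-example gradients (a loop is slow, but shows the idea clearly).
    for x, y in zip(batch_x, batch_y):
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        grads = torch.autograd.grad(loss, params)

        # 2. Clip each per-example gradient to a maximum L2 norm.
        total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = (clip_norm / (total_norm + 1e-6)).clamp(max=1.0)
        for s, g in zip(summed, grads):
            s.add_(g * scale)

    # 3. Add Gaussian noise calibrated to the clipping norm, average, update.
    batch_size = len(batch_x)
    with torch.no_grad():
        for p, s in zip(params, summed):
            noise = torch.normal(0.0, noise_multiplier * clip_norm, size=p.shape)
            p.add_(-(lr / batch_size) * (s + noise))
&lt;/code&gt;&lt;/pre&gt;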
&lt;p&gt;Choosing the noise carefully is an interesting avenue of research. Getting this choice right for machine learning systems means still being able to learn effectively, while also providing the same guarantees for individual privacy. As deep learning grew more popular, so did more precise definitions of the noise required in deep learning systems.&lt;/p&gt;
&lt;p&gt;The original approach of DP-SGD used particular properties of Gaussian (or normal) distributions to effectively calculate the noise and resulting privacy guarantees. This allowed for the usage of Gaussian noise, which has useful properties in deep learning because many tasks assume Gaussian error or other Gaussian distribution properties.&lt;/p&gt;
&lt;p&gt;Building on this approach, a new definition evolved called &lt;a href="https://arxiv.org/pdf/1702.07476"&gt;Rényi differential privacy (RDP)&lt;/a&gt;.&lt;sup id="fnref:1"&gt;&lt;a class="footnote-ref" href="#fn:1"&gt;1&lt;/a&gt;&lt;/sup&gt; RDP provided a new calculation of the bounds provided by the Gaussian distribution that both: (a) simplified the correct choice of parameters for deep learning and (b) allowed for a "tighter composition" so that you can add less noise to get the same guarantees. This didn't change the underlying mechanism in DP-SGD, it just gave a new calculation of the differential privacy parameters (like you read in the definition above) when using DP with a Gaussian distribution noise.&lt;sup id="fnref:2"&gt;&lt;a class="footnote-ref" href="#fn:2"&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;You might have tried applying DP-SGD or other approaches in the past and it didn't work well; or you heard someone who knows about machine learning say "oh but it doesn't work". Why is this a common experience? It's because there are many ways to tune DP more effectively for your use case and most novice approaches will not work well.&lt;/p&gt;
&lt;p&gt;Let's investigate some tips from people who have done DP more than once. :)&lt;/p&gt;
&lt;h3 id="tips-from-the-field"&gt;Tips from the field&lt;/h3&gt;
&lt;p&gt;There are several tips when analyzing successful implementations of DP training, especially of large models, like today's deep learning models.&lt;/p&gt;
&lt;p&gt;The best advice is:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Use large batch sizes&lt;/li&gt;
&lt;li&gt;Replace or remove batch normalization/dropout layers&lt;/li&gt;
&lt;li&gt;Set weight decay higher than normal&lt;/li&gt;
&lt;li&gt;If doing input augmentations, include them in the same mini-batch&lt;/li&gt;
&lt;li&gt;Experiment with scaling batch size alongside your learning rate scheduler&lt;/li&gt;
&lt;li&gt;There are usually exploitable ways to modify clipping norm, learning rate, architecture and/or activation choices that are specific to your task, data and sensitivity combinations.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Let's break this down to more intuitively learn from the advice.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Batch Size&lt;/strong&gt;: When you apply differential privacy, you will be adding noise per batch. This means that the larger the batches the more you can exploit the principles of centralized differential privacy to get more signal versus noise from each round. &lt;a href="https://arxiv.org/abs/2108.01624"&gt;The research on pre-training BERT&lt;/a&gt; used a batch size of around 2M examples. Of course, this can only be done if you have truly internet-scale texts, but it's a useful example nevertheless to think in much larger batches.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Learning from &lt;a href="https://arxiv.org/abs/2501.18914"&gt;DeepMind's latest experiments&lt;/a&gt;, you can calculate the required compute based on your data, batch and privacy parameter choices. This lets you find optimal batch sizes. This finding is probably most relevant if you are training large models, like their differentially private LLM, but the authors think the theory will "scale down" appropriately.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;strong&gt;Batch normalization and dropout&lt;/strong&gt;: Technically, differential privacy is doing this normalization for you, so these layers don't help like they normally would. &lt;a href="https://arxiv.org/abs/2204.13650"&gt;Research on large-scale image classification&lt;/a&gt; says to instead think through ways to replace these layers with something that conveys more signal, like a normal fully connected layer. This also applies to layers like layer-normalization, weight-normalization or any hyperparameters that help with layer smoothing.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Weight Decay&lt;/strong&gt;: Because noise addition causes more variance than usual during training, it's useful to allow weights to decay more slowly. You might also want to tune this with your batch size and learning rate. Playing with batch scheduling relative to learning rate and the interplay of those two with other hyperparameters is something worth experimenting with for your particular task and architecture. If you don't have the ability to experiment first, it'd be worthwhile investigating the latest research for similar tasks to learn and test new approaches.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Data Augmentations&lt;/strong&gt;: Because you are calculating clipping and noise addition per batch, it's useful to batch similar data together to get more signal. For this reason, if you are adding example-specific augmentations (as is customary in computer vision) keep the original image and its augmentations in the same mini-batch.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Batch size and learning rate&lt;/strong&gt;: Because you can expect more variance from DP training and because this will change how your epochs and learning stages work, you'll want to use a learning rate scheduler alongside a batch size scheduler. This can start with large learning rates and then slowly get smaller. Choosing an ideal stopping point will likely also require more attention than normal. Stopping earlier can provide better privacy (i.e. smaller epsilon and avoiding memorization) and ensure you are actually learning from signal and not noise.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Customize for your use case&lt;/strong&gt;: Much of the research on optimizing DP training exploits particular shifts in activation layers, architecture, clipping norms and other hyperparameters. This shows you more than anything that taking some extra time to test a few different approaches will pay off in performance. For example, &lt;a href="https://arxiv.org/abs/2108.01624"&gt;the BERT research&lt;/a&gt; found the interplay between the ADAM optimizer and transformer-specific scale invariant layers created challenges, which they solved by changing the weight decay. In &lt;a href="https://arxiv.org/pdf/2201.12328"&gt;2022 research on large-scale computer vision&lt;/a&gt;, authors tuned the clipping norm and learning rate to achieve optimal performance.&lt;/p&gt;
&lt;p&gt;As practical applications of differential privacy grow, there will be more learnings, knowledge and practical tips for different datasets, architectures and tasks. It's always worth taking a look at your specific task, data and architecture and debating what could present challenges when using DP training. Doing this while also reviewing the literature, blogs and deep learning library documentation can help you avoid headaches and more quickly build the insight needed to use differential privacy effectively.&lt;/p&gt;
&lt;p&gt;There are also differential privacy modifications for different use cases. For example, there are modifications where only the label needs privacy. This could be the case, for example, where certain features are known by all or most parties but the actual prediction is sensitive (i.e. recommendations, ad conversions or sensitive classification with public features). &lt;a href="https://arxiv.org/abs/2102.06062"&gt;Ghazi et al. 2021&lt;/a&gt; proposed an interesting algorithm that combines Bayesian priors with randomized response for label differential privacy, achieving results close to non-private learning. Of course, it could be easily argued that this approach doesn't protect against memorization; so if you use it, it's worth testing using privacy attacks.&lt;/p&gt;
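&lt;p&gt;To give a feel for the building block involved, here's a minimal sketch of classic k-ary randomized response applied to labels. This is the textbook mechanism only; Ghazi et al.'s contribution is layering Bayesian priors on top of it to squeeze out more utility, which this toy doesn't do.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np

def randomized_response_label(true_label, num_classes, epsilon, rng):
    """k-ary randomized response: keep the true label with probability
    exp(eps) / (exp(eps) + k - 1), otherwise report another label uniformly.
    Each reported label satisfies epsilon-label-DP on its own."""
    keep_prob = np.exp(epsilon) / (np.exp(epsilon) + num_classes - 1)
    flip_prob = (1.0 - keep_prob) / (num_classes - 1)
    probs = np.full(num_classes, flip_prob)
    probs[true_label] = keep_prob
    return int(rng.choice(num_classes, p=probs))

rng = np.random.default_rng(42)
noisy = [randomized_response_label(3, 10, epsilon=2.0, rng=rng) for _ in range(5)]
print(noisy)  # mostly 3s, with occasional random labels
&lt;/code&gt;&lt;/pre&gt;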
&lt;p&gt;Similarly, if your architecture can support a statistical query learning (SQ learning) setup, you can exploit the structure of these queries to implement differential privacy mechanisms at summation points.&lt;sup id="fnref:3"&gt;&lt;a class="footnote-ref" href="#fn:3"&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;Each win you find for eking out better performance might come from some tradeoff of memorization or leakage from your training examples. It can be easy to apply differential privacy without paying close attention to the goal you have at hand. Don't forget to actually test your memorization using MIAs and extraction attacks to measure what is "good enough" for your use case.&lt;/p&gt;
&lt;h3 id="framing-the-problem-appropriately"&gt;Framing the problem appropriately&lt;/h3&gt;
&lt;p&gt;For any implementation of differential privacy (deep learning or otherwise), framing the problem is an essential and complex step. To do so, you need to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Choose what you are trying to protect for your data and use case (i.e. privacy unit)&lt;/li&gt;
&lt;li&gt;Understand the data well enough to decide on things like clipping and bounds, or use a library that will help you with this choice&lt;/li&gt;
&lt;li&gt;Evaluate preprocessing steps and determine if they change any of those choices&lt;/li&gt;
&lt;li&gt;Find a reliable and audited DP implementation, like &lt;a href="https://github.com/meta-pytorch/opacus"&gt;Opacus from PyTorch&lt;/a&gt; or &lt;a href="https://www.tensorflow.org/responsible_ai/privacy/guide"&gt;tensorflow-privacy&lt;/a&gt;, and understand how it works&lt;/li&gt;
&lt;li&gt;Train your DP model and test any approaches to increase performance&lt;/li&gt;
&lt;li&gt;Run privacy testing on resulting models to increase your confidence that you've achieved the utility and privacy you wanted&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;p&gt;&lt;em&gt;NOTE&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;What is a privacy unit? What are bounds?&lt;/p&gt;
&lt;p&gt;When using differential privacy, there are a few things that need to be decided to effectively protect individuals. One of the things to define is the privacy unit. What &lt;em&gt;exactly&lt;/em&gt; are you protecting? What is that one small change in the data you'd like to avoid revealing to give people plausible deniability and protection? Often the privacy unit is the contributions of one person, but it could also be the contributions of one household, or it could be smaller, like protecting every contribution individually (i.e. every training example separately, but not, let's say, all examples that came from one person's Flickr account).&lt;/p&gt;
&lt;p&gt;Once you've defined the unit, you need to figure out what bounds need to be set (or already exist). For example, if you choose a privacy unit of each training example separately, then the bounds for DP-SGD would be how much any one example can change the loss calculation and gradient updates. To do this, you determine a clipping threshold which essentially acts as a maximum value of what any example can contribute to the gradient updates.&lt;sup id="fnref:4"&gt;&lt;a class="footnote-ref" href="#fn:4"&gt;4&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;Note that it's generally good practice to clip extremely large gradients during training as outliers or erroneous examples can have an uneven effect on training stability.&lt;/p&gt;
&lt;p&gt;I recommend playing around with &lt;a href="https://pair.withgoogle.com/explorables/private-and-fair/"&gt;this interactive visual from Google&lt;/a&gt; to get an idea of how clipping and differential privacy affect machine learning models.&lt;/p&gt;
&lt;hr&gt;
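&lt;p&gt;To show roughly what the "train your DP model" step might look like in practice, here's a small sketch using Opacus with a toy model and dataset. The exact API can shift between Opacus versions, so treat this as a starting point and check the current documentation; the hyperparameter values here are placeholders, not recommendations.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine  # pip install opacus

# Toy data and model, just to keep the sketch self-contained.
features = torch.randn(512, 20)
labels = torch.randint(0, 2, (512,))
train_loader = DataLoader(TensorDataset(features, labels), batch_size=64)

model = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
criterion = nn.CrossEntropyLoss()

# The PrivacyEngine wraps the model, optimizer and data loader so that
# per-sample clipping, noise addition and privacy accounting happen for you.
privacy_engine = PrivacyEngine()
model, optimizer, train_loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=train_loader,
    noise_multiplier=1.1,   # more noise: stronger privacy, lower utility
    max_grad_norm=1.0,      # per-sample clipping bound
)

for epoch in range(3):
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    # Track the privacy budget spent so far for your chosen delta.
    print("epoch", epoch, "epsilon so far:", privacy_engine.get_epsilon(delta=1e-5))
&lt;/code&gt;&lt;/pre&gt;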
&lt;p&gt;Any one of these steps is hard without differential privacy expertise, but each is essential for the team to learn how to train safer models together. By practicing these skills, you and your team will become both more familiar with how privacy works in real systems, and also able to leverage that knowledge to bring differential privacy into more use cases.&lt;/p&gt;
&lt;p&gt;In the next article, you'll explore some cautionary tales of using differential privacy and evaluate open problems in using differential privacy to address deep learning/AI memorization.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Acknowledgements&lt;/em&gt;: I would like to thank &lt;a href="https://desfontain.es/serious.html"&gt;Damien Desfontaines&lt;/a&gt; for his feedback, corrections and thoughts on this article. Any mistakes, typos, inaccuracies or controversial opinions are my own.&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;Rényi DP is similar to &lt;a href="https://arxiv.org/pdf/1605.02065"&gt;concentrated differential privacy&lt;/a&gt;, if you are familiar with that approach. If not, check out &lt;a href="https://desfontain.es/blog/renyi-dp-zero-concentrated-dp.html"&gt;Desfontaines's great introduction to it&lt;/a&gt;!&amp;#160;&lt;a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;There's &lt;a href="https://www.youtube.com/watch?v=oQzaA5KG3pM&amp;amp;ab_channel=DIMACSCCICADA"&gt;a great video from Ilya Mironov&lt;/a&gt;, the author of the paper, should you want a deeper and longer introduction to Rényi DP.&amp;#160;&lt;a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:3"&gt;
&lt;p&gt;See &lt;a href="https://arxiv.org/pdf/0803.0924"&gt;Kasiviswanathan et al. 2010&lt;/a&gt;.&amp;#160;&lt;a class="footnote-backref" href="#fnref:3" title="Jump back to footnote 3 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:4"&gt;
&lt;p&gt;I recommend reading through &lt;a href="https://docs.tmlt.dev/platform/latest/analytics/tutorials/clamping-bounds.html"&gt;Tumult Analytics' documentation&lt;/a&gt; on choosing and applying bounds if you are new to this step.&amp;#160;&lt;a class="footnote-backref" href="#fnref:4" title="Jump back to footnote 4 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="ml-memorization"></category></entry><entry><title>Attacks on Machine Unlearning: How Unlearned Models Leak Information</title><link href="https://blog.kjamistan.com/attacks-on-machine-unlearning-how-unlearned-models-leak-information.html" rel="alternate"></link><published>2025-10-13T00:00:00+02:00</published><updated>2025-10-13T00:00:00+02:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2025-10-13:/attacks-on-machine-unlearning-how-unlearned-models-leak-information.html</id><summary type="html">&lt;p&gt;In the past articles, you've been exploring the field of &lt;a href="https://blog.kjamistan.com/machine-unlearning-what-is-it.html"&gt;machine unlearning&lt;/a&gt;, investigating if you can surgically remove memorized or learned data from models without retraining them from scratch or from an earlier checkpoint.&lt;/p&gt;
&lt;p&gt;Unlearning is one proposed solution to the &lt;a href="https://blog.kjamistan.com/a-deep-dive-into-memorization-in-deep-learning.html"&gt;AI/ML memorization problem explored in this multi-article series …&lt;/a&gt;&lt;/p&gt;</summary><content type="html">&lt;p&gt;In the past articles, you've been exploring the field of &lt;a href="https://blog.kjamistan.com/machine-unlearning-what-is-it.html"&gt;machine unlearning&lt;/a&gt;, investigating if you can surgically remove memorized or learned data from models without retraining them from scratch or from an earlier checkpoint.&lt;/p&gt;
&lt;p&gt;Unlearning is one proposed solution to the &lt;a href="https://blog.kjamistan.com/a-deep-dive-into-memorization-in-deep-learning.html"&gt;AI/ML memorization problem explored in this multi-article series&lt;/a&gt;.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Are you a visual learner? There's &lt;a href="https://youtu.be/CuH7BHqIiYk"&gt;a YouTube video on this article (unlearning attacks)&lt;/a&gt; on the Probably Private channel.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;In this article, you'll investigate if the &lt;a href="https://blog.kjamistan.com/machine-unlearning-how-todays-unlearning-is-done.html"&gt;current proposed unlearning methods&lt;/a&gt; are safe against &lt;a href="https://blog.kjamistan.com/defining-privacy-attacks-in-ai-and-ml.html"&gt;our original attack definitions&lt;/a&gt; as well as against any interesting new attacks that unlearning might introduce.&lt;/p&gt;
&lt;h3 id="evaluating-unlearning-models-with-mias"&gt;Evaluating Unlearning Models with MIAs&lt;/h3&gt;
&lt;p&gt;As you already learned &lt;a href="https://blog.kjamistan.com/machine-unlearning-what-is-it.html"&gt;in the unlearning definition article&lt;/a&gt;, unlearning isn't yet well defined. This means that most research uses subpar or mixed evaluation criteria to determine if something is unlearned. The lack of clear, easy-to-implement and consistent evaluation criteria means that it's almost impossible to compare the many approaches against one another in any meaningful way.&lt;/p&gt;
&lt;p&gt;If the AI/ML industry decided on a consistent and useful metric, like MIA success, and a consistent approach to MIA testing, like holding the false-positive rate at a fixed low level (say 3%), then researchers and practitioners alike could more easily evaluate use cases and determine their risk appetite. New unlearning approaches would be applied quickly and progress could be made because there would be easily comparable measurements.&lt;/p&gt;
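&lt;p&gt;As a sketch of what such a consistent measurement could look like: score every candidate point with your MIA, then report the true-positive rate while holding the false-positive rate at the agreed level. This is my own minimal illustration of the idea, assuming higher attack scores mean "more likely a member".&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np

def tpr_at_fixed_fpr(member_scores, nonmember_scores, max_fpr=0.03):
    """Evaluate an MIA by its true-positive rate while holding the
    false-positive rate at a fixed low level (here 3%)."""
    member_scores = np.asarray(member_scores)
    nonmember_scores = np.asarray(nonmember_scores)
    # Threshold that flags at most max_fpr of the non-members.
    threshold = np.quantile(nonmember_scores, 1.0 - max_fpr)
    tpr = np.mean(np.greater_equal(member_scores, threshold))
    fpr = np.mean(np.greater_equal(nonmember_scores, threshold))
    return tpr, fpr
&lt;/code&gt;&lt;/pre&gt;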
&lt;p&gt;&lt;a href="https://arxiv.org/abs/2403.01218"&gt;Hayes and colleagues from Google DeepMind (2024)&lt;/a&gt; call out trends in unlearning research where suboptimal MIAs are used to boost the perceived performance of unlearning methods. By weakening attacks and then demonstrating that the unlearning method "works", much unlearning research gives a false sense of privacy without real gains.&lt;/p&gt;
&lt;p&gt;One reason behind this performance disparity is that researchers usually only perform MIAs on the forget set points. Sometimes they also add a small random training data subsample. But Hayes and team found that targeted attacks on a wider selection of training data, particularly points that might be overexposed after unlearning, show that "state of the art" unlearning actually makes new groups of persons vulnerable.&lt;/p&gt;
&lt;p&gt;In addition, the choice of the forget set introduces problems. The forget set should ideally be a diverse representation of the training data (from common case to uncommon cases) in order to truly evaluate whether the method can work. Hayes and team found that some forget sets are cherry picked -- causing unrealistic outcomes compared to forget sets chosen via representative sampling processes.&lt;/p&gt;
&lt;p&gt;Since unlearning, like learning, will have different rates based on example difficulty and class diversity, the authors call for explicit conversations on unlearning privacy tradeoffs. This also means focusing on practical advice, like what unlearning hyperparameters to choose and how to find useful metrics for stopping criteria (i.e. when the model has unlearned enough and is now ready for use).&lt;/p&gt;
&lt;p&gt;In their own testing, they found that the LiRA attack is by far the most effective at exposing privacy risk and modeling a repeatable way to test and compare unlearning methods. In their experiments, they compared per-example LiRAs versus "population" LiRAs and found the former to be qualitatively and quantitatively better at modeling privacy risk.&lt;/p&gt;
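&lt;p&gt;For intuition on what a per-example LiRA actually computes, here's a stripped-down scoring function: fit Gaussians to the logit-scaled confidences from shadow models that did and didn't train on the example, then take the likelihood ratio for the target model's confidence. This is a simplified single-query variant of the published attack, for illustration only.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np
from scipy.stats import norm

def logit_scale(p, eps=1e-6):
    """Map confidences in (0, 1) to a roughly Gaussian scale."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    return np.log(p / (1.0 - p))

def lira_score(target_conf, shadow_in_confs, shadow_out_confs):
    """Per-example LiRA: likelihood ratio of the target model's confidence
    under the 'trained on it' vs. 'never saw it' shadow distributions."""
    t = logit_scale(target_conf)
    in_scores = logit_scale(shadow_in_confs)
    out_scores = logit_scale(shadow_out_confs)
    mu_in, std_in = in_scores.mean(), in_scores.std() + 1e-6
    mu_out, std_out = out_scores.mean(), out_scores.std() + 1e-6
    # A ratio above 1 means "looks more like a training member".
    return norm.pdf(t, mu_in, std_in) / norm.pdf(t, mu_out, std_out)
&lt;/code&gt;&lt;/pre&gt;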
&lt;p&gt;This, of course, involves significant dedication to privacy risk testing as a normal part of training and operational infrastructure, as it requires the ability to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;test a variety of sampling methods for forget and remember sets&lt;/li&gt;
&lt;li&gt;run fine-tuning for unlearning, ideally creating several unlearned models&lt;/li&gt;
&lt;li&gt;train several (smaller) models that haven't seen the forget sets&lt;/li&gt;
&lt;li&gt;perform example-by-example LiRAs comparing the unseen models with the unlearned models&lt;/li&gt;
&lt;li&gt;evaluate tradeoffs to determine which unlearned model to use&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The feasibility of doing this at a non-big-tech company is probably somewhere between very small and impossible. Doing this even at a big-tech company requires significant investment: planning, people, expertise and compute time devoted to privacy metrics, which is probably not the case for those companies today. If the field wants to create better and more consistent metrics, there need to be easier ways to opt into regular privacy testing. These processes should be streamlined into normal training and evaluation languages, frameworks and ML/AI pipelines.&lt;/p&gt;
&lt;p&gt;Aside from the additional resources required for appropriate unlearning evaluation, unlearning introduces new attacks. Let's investigate emergent attacks against unlearned models.&lt;/p&gt;
&lt;h3 id="new-unlearning-attacks"&gt;New Unlearning Attacks&lt;/h3&gt;
&lt;p&gt;Since unlearning has a before and after state, this can be exploited to reveal exactly what was unlearned. In &lt;a href="https://arxiv.org/abs/2005.02205"&gt;Chen et al. (2021)&lt;/a&gt;, the authors introduced a novel membership inference attack that reveals whether the target sample was part of the original model and was unlearned.&lt;/p&gt;
&lt;p&gt;To do so, they:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Train an original model and train an additional unlearning model (or more than one). In a real attack, the attacker would either have downloaded the previous open-weight model or saved several inputs/outputs from previous models.&lt;/li&gt;
&lt;li&gt;Process a chosen example through each model and collect information about the prediction output. Here, the model will give a confidence range across several classes or potential next steps in a sequence (like predicting the next word). Save these outputs. If possible, this can also be directly saving the values at the logit layer, like some of the LiRA tests.&lt;/li&gt;
&lt;li&gt;Train a discriminator to separate the outputs from the unlearned model from that of the original. You can train this discriminator using local models and then test the discriminator on the actual outputs from the models you are trying to mimic/attack. If using logits like LiRA, you might also be able to infer a threshold and use this to calculate how likely it is that an output came from the original or unlearned model.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The most useful targets for this attack are the points that have been unlearned, or points in close proximity or with similar attributes, since you expect a large change in those outputs based on the unlearning process.&lt;/p&gt;
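&lt;p&gt;Step 3 above is just a standard binary classifier over the two models' outputs. Here's one way it might look, using logistic regression as the discriminator; the feature construction (stacking both posteriors plus their difference) is my own simplification rather than the authors' exact recipe.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np
from sklearn.linear_model import LogisticRegression

def train_unlearning_discriminator(post_original, post_unlearned, was_unlearned):
    """post_original, post_unlearned: arrays of shape (n_probes, n_classes)
    from querying both model versions on the same probe inputs (step 2).
    was_unlearned: 0/1 labels from your local shadow setup (step 1)."""
    post_original = np.asarray(post_original)
    post_unlearned = np.asarray(post_unlearned)
    features = np.concatenate(
        [post_original, post_unlearned, post_original - post_unlearned], axis=1
    )
    clf = LogisticRegression(max_iter=1000)
    clf.fit(features, was_unlearned)
    return clf  # clf.predict_proba on new probes estimates "was this unlearned?"
&lt;/code&gt;&lt;/pre&gt;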
&lt;p&gt;To mitigate these attacks, the authors recommend information suppression of the outputs, like returning only the most probable class or next sequence without any confidence intervals. Of course, if you are releasing an open-weight model this isn't possible. They also reference more robust and holistic approaches, like applying differential privacy, which you'll explore further in the next article.&lt;/p&gt;
&lt;p&gt;This attack has since been updated and enhanced by &lt;a href="https://arxiv.org/abs/2405.20272"&gt;Bertran et al. (2024)&lt;/a&gt;. The authors use similar methods to compare the original and unlearned models and reconstruct the unlearned data. How does that work?&lt;/p&gt;
&lt;p&gt;The authors investigated when the trained and unlearned models were different by one example. They were able to essentially calculate the exact difference between the models caused by that one example. This provided them with enough information to make a rough guess as to the sample itself by approximating the input that would account for that change in the model weights (i.e. an approximation of the embedding given the change in the gradients). This is similar to model inversion attacks, where you can reveal input and class information based on model gradients and activations. This is a &lt;em&gt;gradient reconstruction attack&lt;/em&gt; (a technique with a substantial body of literature behind it).&lt;/p&gt;
&lt;p&gt;The authors found that the loss on unlearned examples acts differently than on other examples. These oddities in confidence intervals leave artifacts of the deep-learning-based unlearning methods, which reveal that something in that class or near that training embedding was unlearned.&lt;/p&gt;
&lt;p&gt;In many ways, unlearning methods create model artifacts that leave clues as to what was unlearned. Even when done at scale, this could quickly expose "missing information", especially when comparing model responses over time. Because unlearning methods don't take this into account, they leave a new security and privacy problem that should be addressed.&lt;/p&gt;
&lt;p&gt;Additionally, new attacks target how information is stored in the embeddings themselves. By investigating embeddings you can find personal information like names, screen names, addresses, and other &lt;a href="https://www.lesswrong.com/posts/8viQEp8KBg2QSW4Yc"&gt;training data contents&lt;/a&gt;. This means embedding model updates also expose who requested their data removal and expose new persons and sources in the training data.&lt;sup id="fnref:2"&gt;&lt;a class="footnote-ref" href="#fn:2"&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;So far, you know that unlearning creates changes that can be observed in the forget sets and the retain sets. Some of these changes enable new attacks, like these new reconstruction avenues. But does unlearning create any other privacy risks?&lt;/p&gt;
&lt;h3 id="the-privacy-onion-effect"&gt;The Privacy Onion Effect&lt;/h3&gt;
&lt;p&gt;Carlini et al. published a paper called &lt;a href="https://arxiv.org/abs/2206.10469"&gt;"The Privacy Onion Effect"&lt;/a&gt; in 2022 which outlined new privacy risks when unlearning targets memorized examples. The authors discovered removing memorized data exposes new, different data points that were previously sheltered by those memorized points.&lt;/p&gt;
&lt;p&gt;They define the effect as:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Removing the “layer” of outlier points that are most vulnerable to a privacy attack exposes a new layer of previously-safe points to the same attack&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;They use LiRA and measure attack success across many points in a given dataset (similar to Hayes et al.). They measure the attack "advantage" (i.e. increase in exposure) for a particular training data example. As you already learned, some data points are more prone to memorization and to attack, particularly those that might be considered rare, novel or complex.&lt;/p&gt;
&lt;p&gt;Removing these points with unlearning exposes new points which are now more rare, novel and complex after the removal of the memorized data points.&lt;/p&gt;
&lt;p&gt;Going back to margin theory, you can think of these points like support vectors, holding up the decision boundaries. When you remove one layer of these supporting points, the next layer of supporting points is exposed. This can keep going, like an onion.&lt;sup id="fnref:1"&gt;&lt;a class="footnote-ref" href="#fn:1"&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;The privacy onion effect isn't global and isn't reproducible in every model and every dataset. This once again shows you how important it is to address these risks as you develop models, so the unique privacy risks for each dataset, model architecture and task combination are better understood. In fact, the authors prove that privacy auditing is unstable, producing different privacy risks even with small dataset changes.&lt;/p&gt;
&lt;p&gt;The authors' advice is clear: if you're doing privacy audits via membership inference, you need to use your actual training dataset because changes in the dataset will significantly affect the privacy risk of individuals (both those removed and others) in the model.&lt;/p&gt;
&lt;p&gt;Because unlearning presents new and different risks and attacks, it's important to step back and review the original goal.&lt;/p&gt;
&lt;h3 id="what-is-unlearning-trying-to-achieve"&gt;What is unlearning trying to achieve?&lt;/h3&gt;
&lt;p&gt;In many ways unlearning is poorly defined and implemented because it's built on a shaky understanding of privacy risk in deep learning. From where I sit, unlearning research feels like a back-and-forth conversation between privacy lawyers and technologists where neither side is really understanding what the other is trying to say.&lt;/p&gt;
&lt;p&gt;In my opinion, it'd be helpful to evaluate: What are we trying to achieve when implementing unlearning?&lt;/p&gt;
&lt;p&gt;If you want to build privacy-respecting deep learning systems, you have to acknowledge and embrace how and why problems like memorization happen. If you take a holistic approach, you'll see that almost none of the unlearning research focuses on that part of the problem: why and how memorization occurs. Instead, it focuses on byproducts of this phenomenon, by reducing performance on a particular forget example without addressing how the information in that example will affect both model outputs and how that example relates to other data.&lt;/p&gt;
&lt;p&gt;Defining unlearning is not just an activity for lawyers, policy makers and technologists. Privacy and privacy risk is a lived human experience; and some people hold an undue amount of risk just because of who they are (i.e. outliers, underrepresented persons, people that "stick out"). Defining how to address the memorization problem is about having conversations as a society about the actual risks and not hand waving that a solution will present itself automatically.&lt;/p&gt;
&lt;p&gt;If AI systems are going to affect people's lives, work and communities, defining AI privacy and redress must take into account the impact of these systems on people's lives and the unique effectiveness of these systems at using memorization to expose some people more than others.&lt;/p&gt;
&lt;p&gt;In the next three articles, you'll investigate differential privacy and privacy auditing as a solution to the memorization problem.&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;The authors tested many possible explanations, such as training regularization noise and the presence of outliers or duplicates, and discovered this phenomenon isn't global, it's local. It doesn't act uniformly, although it does generalize: it doesn't affect only a few examples, but instead whole groups of many examples. Targeted attacks show some points are easier to attack than others. The authors also had humans inspect the images to find and then remove the 10 most similar examples, to see if this makes the target image vulnerable.&amp;#160;&lt;a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;See &lt;a href="https://arxiv.org/abs/2309.05610"&gt;Privacy Side Channels in Machine Learning Systems, Debenedetti et al., 2024&lt;/a&gt;&amp;#160;&lt;a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="ml-memorization"></category></entry><entry><title>Machine Unlearning: How today's Unlearning is done</title><link href="https://blog.kjamistan.com/machine-unlearning-how-todays-unlearning-is-done.html" rel="alternate"></link><published>2025-09-19T00:00:00+02:00</published><updated>2025-09-19T00:00:00+02:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2025-09-19:/machine-unlearning-how-todays-unlearning-is-done.html</id><summary type="html">&lt;p&gt;Building on our understanding of machine unlearning and &lt;a href="https://blog.kjamistan.com/machine-unlearning-what-is-it.html"&gt;its varied definitions&lt;/a&gt;, in this article you'll learn common approaches to implementing unlearning. To effectively use these approaches, you'll first want to define what unlearning definition and measurement fits your needs.&lt;/p&gt;
&lt;p&gt;In current unlearning research, there are three main categories of unlearning …&lt;/p&gt;</summary><content type="html">&lt;p&gt;Building on our understanding of machine unlearning and &lt;a href="https://blog.kjamistan.com/machine-unlearning-what-is-it.html"&gt;its varied definitions&lt;/a&gt;, in this article you'll learn common approaches to implementing unlearning. To effectively use these approaches, you'll first want to define what unlearning definition and measurement fits your needs.&lt;/p&gt;
&lt;p&gt;In current unlearning research, there are three main categories of unlearning implementations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;early approaches that require specialized feature and model setups&lt;/li&gt;
&lt;li&gt;today's fine-tuning and training approaches for large deep learning/AI models&lt;/li&gt;
&lt;li&gt;a sampling of novel and interesting approaches&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;Are you a visual learner? There's &lt;a href="https://youtu.be/C-k4Zf39nNg"&gt;a YouTube video on this article (unlearning methods)&lt;/a&gt; on the Probably Private channel.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This is all part of a series on &lt;a href="https://blog.kjamistan.com/a-deep-dive-into-memorization-in-deep-learning.html"&gt;AI and machine learning memorization&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id="early-approaches"&gt;Early Approaches&lt;/h3&gt;
&lt;p&gt;Unlearning has been a topic of study for 10 years. Early approaches didn't target deep learning exclusively, and therefore there are some interesting and effective ways to manage the problem in other types of machine learning.&lt;/p&gt;
&lt;p&gt;One of the first papers on Machine Unlearning was &lt;a href="https://www.ieee-security.org/TC/SP2015/papers-archived/6949a463.pdf"&gt;Cao et al (2015)&lt;/a&gt;. The research focused on a variety of unlearning use cases like privacy concerns or removing erroneous or outdated data from the model.&lt;/p&gt;
&lt;p&gt;Their approach combined data subsets to generate aggregated machine learning features, which you can think of as input variables. Then those variables were used in combination with algorithms to predict outputs. Because of this structure, you can remove a single contribution to a feature or several summed contributions without needing to retrain the model. This approach works for statistical query learning algorithms, which allow you to perform queries of the underlying data and use those as direct algorithm inputs.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A model is being trained from several batches of data that are summed features. The data is broken up into subsets so that you can remove the data point from one of the summed features." src="./images/2025/summed_features_unlearning.png"&gt;&lt;/p&gt;
&lt;p&gt;You can adapt this approach for deep learning by training separate models which act as an ensemble. &lt;a href="https://arxiv.org/abs/1912.03817"&gt;Bourtoule et al. (2020)&lt;/a&gt; developed an approach called Sharded, Isolated, Sliced, and Aggregated or SISA learning. As the name suggests, this shards the data and trains a model from each shard. Those models form an ensemble that predicts the outputs as a committee.&lt;/p&gt;
&lt;p&gt;You can then remove or retrain just one model rather than the entire learning system. The authors also used checkpoints so you could potentially "roll back" one of the models to an earlier checkpoint before that exact data point was seen. So if you want to unlearn or forget a particular data subset or even one data point, you can amputate that shard from the model and retrain just the smaller model starting at the previous checkpoint.&lt;/p&gt;
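&lt;p&gt;To make the shard-and-retrain idea concrete, here is a minimal sketch of SISA-style training and unlearning (my own illustration, not the authors' implementation). The shard count, model choice and committee vote are placeholder assumptions, and the real SISA also slices and checkpoints within each shard, which this sketch skips.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np
from sklearn.linear_model import SGDClassifier

def train_shard(X, y):
    # Each shard gets its own small model; any estimator would do here.
    return SGDClassifier(loss="log_loss", random_state=0).fit(X, y)

def sisa_train(X, y, n_shards=4):
    # Shard the data and keep the assignment so we know who saw what.
    shard_ids = np.arange(len(X)) % n_shards
    models = [train_shard(X[shard_ids == s], y[shard_ids == s]) for s in range(n_shards)]
    return models, shard_ids

def sisa_predict(models, X):
    # The shards act as a committee: average their class probabilities.
    # (Sketch assumption: every shard has seen every class.)
    probs = np.mean([m.predict_proba(X) for m in models], axis=0)
    return probs.argmax(axis=1)

def sisa_unlearn(models, shard_ids, X, y, forget_idx):
    # Only the shard that saw the forgotten point is retrained, without it.
    s = shard_ids[forget_idx]
    keep = np.logical_and(shard_ids == s, np.arange(len(X)) != forget_idx)
    models[s] = train_shard(X[keep], y[keep])
    return models
&lt;/code&gt;&lt;/pre&gt;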
&lt;p&gt;This inspired another approach called &lt;a href="https://arxiv.org/abs/2106.04378"&gt;adaptive machine unlearning (Gupta et al., 2021)&lt;/a&gt;. The authors were working on adaptive deletion requests (i.e. requests that might target several data examples and resulting model behaviors that are not independent). They wanted to retain essential information in the model but still allow for user-driven deletion requests. To do so, they use an approach inspired by differential privacy and information theory, which inserts randomness into the learning and unlearning process to better guarantee that the data points do not leak as much information.&lt;/p&gt;
&lt;p&gt;However, these approaches don't map to today's large deep learning architectures, where data is first transformed directly into embeddings instead of features, and those embeddings are usually also sequence-related (like words in order, or pixels in order). These approaches are useful for a subset of machine learning and deep learning, but not all of what falls under today's "AI".&lt;sup id="fnref:1"&gt;&lt;a class="footnote-ref" href="#fn:1"&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;Luckily there is a series of approaches focused specifically on deep unlearning for these large embedding-based models.&lt;/p&gt;
&lt;h3 id="deep-unlearning"&gt;Deep Unlearning&lt;/h3&gt;
&lt;p&gt;For deep unlearning, most approaches exploit the same learning methods to instead unlearn. Recall that &lt;a href="https://blog.kjamistan.com/gaming-evaluation-the-evolution-of-deep-learning-training-and-evaluation.html"&gt;training calculates values across different nodes and layers in the deep learning network&lt;/a&gt; and calculates the error (i.e. how much the model was "off target"). This error, represented in gradients, is then used to update the model parameters.&lt;/p&gt;
&lt;p&gt;What is a gradient? You've probably calculated the slope of a line which connects two dots. This helps you measure the change from one dot to the other. Imagine that if those dots are moving, you can calculate both the change between them and the rate of change.&lt;/p&gt;
&lt;p&gt;A deep learning model operates as many functions. To adjust those functions to be more accurate, you compare the data to the current model functions and see how "wrong" the functions are. Looking at the size and rate of that error produces gradients. These gradients are the derivatives of the loss function: they tell you how the error of the current model functions changes with respect to each parameter, measured against the training data in that batch.&lt;/p&gt;
&lt;p&gt;A gradient points in the direction that increases the function, which here would increase the loss function (error). This is why in machine learning, you descend the gradient (gradient descent) to minimize error. Because it is too computation-intensive to calculate the gradient exactly for every example and every parameter, this calculation is approximated or averaged/batched.&lt;sup id="fnref:2"&gt;&lt;a class="footnote-ref" href="#fn:2"&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;&lt;img alt="There is a graph plotting loss. You can see a squiggly line with numerous local minimi. The gradient is showing a way up the line to increase loss. This is why you &amp;quot;descend&amp;quot; the gradient -- or go in the opposite direction." src="./images/2025/gradient.png"&gt;&lt;/p&gt;
&lt;p&gt;For each training batch, the error rate is calculated across the batch and the average change is used to send approximate updates to the model parameters. The parameter updates happen via a process called backpropagation.&lt;/p&gt;
&lt;p&gt;Gradient descent via backpropagation first happens in big jumps, where gradients are large, and then updates slow down as the network gets closer to the approximate representation of the task. Since you usually start from a "random guess", early training rounds happen in large steps. The middle training rounds usually convey improvements for majority classes and as you move to later training rounds, the nuances of the outliers, the less frequent classes or smaller representations will change. Sometimes this can result in over-optimization/overfitting on the training set, but that's usually for smaller networks than today's largest AI models.&lt;/p&gt;
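&lt;p&gt;As a tiny, self-contained illustration of the mechanics above (not any unlearning method), here is batched gradient descent on a made-up linear regression problem, where each update moves the parameters against the averaged gradient of the batch:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 3))              # one batch of 32 examples, 3 features
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=32)

w = np.zeros(3)                           # start from a "random guess"
lr = 0.1
for step in range(100):
    error = X @ w - y                     # how "off target" the model is on this batch
    loss = np.mean(error ** 2)
    grad = 2 * X.T @ error / len(X)       # average gradient over the batch
    w = w - lr * grad                     # descend: move against the gradient
&lt;/code&gt;&lt;/pre&gt;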
&lt;p&gt;Unlearning different parts of these weights, especially calculating how to unlearn specific examples, is difficult to disentangle from the other learned representations. Because network layers and their weights and biases are not easy to inspect or understand, the unlearning task is also opaque.&lt;/p&gt;
&lt;p&gt;Many of the deep unlearning methods essentially reverse this process (a minimal code sketch follows the list), by:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;ascending the gradient for a particular example or set of examples&lt;/li&gt;
&lt;li&gt;maximizing loss on an example while minimizing loss for several other examples you don't want to forget&lt;/li&gt;
&lt;li&gt;pushing the forget examples toward an alternative label/output (i.e. calculating loss against a new generic answer/label and moving parameters towards it)&lt;/li&gt;
&lt;li&gt;some mixture of the above combined with classic deep learning methods (like more fine-tuning, freezing weights, bounding loss contributions, etc.)&lt;/li&gt;
&lt;/ul&gt;
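&lt;p&gt;Here is a minimal PyTorch-style sketch of the first two bullets: ascend the gradient on a forget batch while descending on a retain batch. The model, optimizer, batches and weighting factor are placeholders, and as the footnotes note, practical implementations usually bound or clip the forget loss:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import torch
import torch.nn.functional as F

def unlearning_step(model, optimizer, forget_batch, retain_batch, forget_weight=1.0):
    """One fine-tuning step that pushes loss UP on forget examples
    and DOWN on retain examples. All names here are placeholders."""
    xf, yf = forget_batch
    xr, yr = retain_batch
    optimizer.zero_grad()
    forget_loss = F.cross_entropy(model(xf), yf)
    retain_loss = F.cross_entropy(model(xr), yr)
    # Ascend the gradient on the forget set, descend on the retain set.
    loss = retain_loss - forget_weight * forget_loss
    loss.backward()
    optimizer.step()
    return forget_loss.item(), retain_loss.item()
&lt;/code&gt;&lt;/pre&gt;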
&lt;p&gt;Looking at the &lt;a href="https://unlearning-challenge.github.io/"&gt;Google/NeurIPS Unlearning Challenge&lt;/a&gt;, more than half of the top 10 models did some combination of these approaches, which I will call sophisticated fine-tuning for unlearning.&lt;/p&gt;
&lt;p&gt;There's another batch of deep learning methods which compare the fully retrained model with the unlearned model. This can use methods like divergence or distance measurements between the networks or mix those measurements alongside fine-tuning methods.&lt;/p&gt;
&lt;p&gt;But since networks can exist in divergent states and have similar outputs, this can be prone to error if done without attention to the actual output impact. In successful approaches, this comparison can take the form of a student/teacher setup, where the student is told to "stay near the teacher" but unlearn a particular set of examples.&lt;sup id="fnref:3"&gt;&lt;a class="footnote-ref" href="#fn:3"&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
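&lt;p&gt;And a rough sketch of the student/teacher idea, in the spirit of (but not identical to) the Kurmanji et al. setup described in the footnote: the student is pulled toward the teacher's output distribution on retain data and pushed away from it on forget data. The KL terms and the alpha weight are illustrative assumptions:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import torch
import torch.nn.functional as F

def student_teacher_step(student, teacher, optimizer, retain_batch, forget_batch, alpha=1.0):
    xr, _ = retain_batch
    xf, _ = forget_batch
    optimizer.zero_grad()
    with torch.no_grad():
        teacher_retain = F.log_softmax(teacher(xr), dim=-1)
        teacher_forget = F.log_softmax(teacher(xf), dim=-1)
    student_retain = F.log_softmax(student(xr), dim=-1)
    student_forget = F.log_softmax(student(xf), dim=-1)
    # "Stay close" to the teacher on the retain data...
    stay = F.kl_div(student_retain, teacher_retain, log_target=True, reduction="batchmean")
    # ...and "move away" from the teacher on the forget data.
    move_away = F.kl_div(student_forget, teacher_forget, log_target=True, reduction="batchmean")
    loss = stay - alpha * move_away
    loss.backward()
    optimizer.step()
&lt;/code&gt;&lt;/pre&gt;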
&lt;p&gt;In general, these approaches work well with some unlearning definitions. The model ends up scoring lower on the examples they are unlearning and sometimes can no longer be attacked to reveal what they know about those examples.&lt;/p&gt;
&lt;p&gt;Unfortunately, these methods are prone to other problems:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;memorization/overfitting on other classes and their examples, especially if the population is small&lt;/li&gt;
&lt;li&gt;forgetting important task information, sometimes called knowledge entanglement&lt;/li&gt;
&lt;li&gt;performing worse across other metrics and classes (i.e. loss in model utility)&lt;/li&gt;
&lt;li&gt;unstable and unpredictable unlearning&lt;sup id="fnref:4"&gt;&lt;a class="footnote-ref" href="#fn:4"&gt;4&lt;/a&gt;&lt;/sup&gt;&lt;/li&gt;
&lt;li&gt;scaling issues (i.e. trying to unlearn 1% of the training data is much easier than unlearning 10%)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You'll come back to several of these issues in the next article, where you'll learn about attacks on deep unlearning.&lt;/p&gt;
&lt;p&gt;Because there aren't standardized metrics to compare these approaches, and since forgetting is heavily linked to data and task properties, it's difficult to determine the most relevant approaches for your particular combination of data, task, model architecture and unlearning result. If you're a researcher working in the space, focusing on practical comparisons would be a useful contribution -- particularly across large parameter models and complex tasks (i.e. information-intensive learning).&lt;/p&gt;
&lt;p&gt;In reading more than 50 academic unlearning papers, I came across several novel ideas and approaches. I'll summarize a few exceptional ones in the final section. My goal isn't to instruct you to use these approaches, but I'm hoping to inspire creative interpretation and thinking when viewing the unlearning task.&lt;/p&gt;
&lt;h3 id="novel-approaches-to-unlearning"&gt;Novel approaches to unlearning&lt;/h3&gt;
&lt;p&gt;One interesting approach is to first figure out how easy or hard it would be to unlearn a point and still generalize well. Think of it as a litmus test for the ability to unlearn a particular example or subset.&lt;/p&gt;
&lt;p&gt;In &lt;a href="https://arxiv.org/abs/2103.03279"&gt;"Remember What You Want to Forget: Algorithms for Machine Unlearning"&lt;/a&gt; the authors define machine unlearning by borrowing concepts from differential privacy. In their definition, an attacker who can actively choose the forget set should not be able to tell the difference between querying the unlearned model and a model that has never seen the forget set. They found lower bounds on unlearning, where a model cannot unlearn anymore without removing most of the utility on the retain dataset.&lt;/p&gt;
&lt;p&gt;Their unlearning method continuously calculates model updates to balance between the forget data points and relevant retain data points. They use sampled noise to provide stronger guarantees so the attacker might not be able to tell the difference between the models. This is inspired by differential privacy.&lt;/p&gt;
&lt;p&gt;A different approach from &lt;a href="https://arxiv.org/abs/1911.04933"&gt;Golatkar et al. (2020)&lt;/a&gt; looks at the problem via information theory. The authors calculate what information exists in the forget sample while holding the information in the retain sample high. Since computing the &lt;a href="https://en.wikipedia.org/wiki/Fisher_information"&gt;Fisher information matrix&lt;/a&gt; is too difficult within a neural network, they approximate the matrix's diagonal as part of their scrubbing procedure.&lt;/p&gt;
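&lt;p&gt;For intuition only (this is not Golatkar et al.'s scrubbing code), here is a sketch of the common diagonal approximation, built from averaged squared gradients. Strictly speaking this is the "empirical Fisher", since it uses the observed labels rather than labels sampled from the model:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import torch
import torch.nn.functional as F

def diagonal_fisher(model, data_loader, n_batches=10):
    """Approximate the diagonal of the Fisher information matrix by
    averaging squared gradients of the loss over a few batches."""
    fisher = {name: torch.zeros_like(p) for name, p in model.named_parameters()}
    model.eval()
    for i, (x, y) in enumerate(data_loader):
        if i == n_batches:
            break
        model.zero_grad()
        loss = F.cross_entropy(model(x), y)
        loss.backward()
        for name, p in model.named_parameters():
            if p.grad is not None:
                # Parameters with consistently large squared gradients
                # carry more information about these examples.
                fisher[name] += p.grad.detach() ** 2
    return {name: f / n_batches for name, f in fisher.items()}
&lt;/code&gt;&lt;/pre&gt;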
&lt;p&gt;Others have investigated building unlearning directly into deep learning architectures. &lt;a href="https://aclanthology.org/2023.emnlp-main.738.pdf"&gt;Chen et al. (2023)&lt;/a&gt; imagine an architecture where unlearning layers are trained as deletion requests come in and combined into model architectures. These layers borrow from the teacher-student ideas in Kurmanji et al.'s work.&lt;sup id="fnref2:2"&gt;&lt;a class="footnote-ref" href="#fn:2"&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;One final interesting approach looks at unlearning an entire class or group by investigating the mathematics of the learned model. &lt;a href="https://arxiv.org/abs/2404.13588"&gt;Chen et al. (2024)&lt;/a&gt; find the subspace of each class at each layer, identifying the kernel and null spaces. They then remove the subspaces for the forget class and merge the other subspaces back together. This mimics some of the interesting approaches in &lt;a href="https://wandb.ai/sauravmaheshkar/Intrinsic-Dimensions/reports/What-Are-Intrinsic-Dimensions-The-Secret-Behind-LoRA--Vmlldzo2MDcxMDc5"&gt;LoRA fine-tuning&lt;/a&gt;. After the subspaces are merged, the forget examples are fine-tuned with pseudo labeling (i.e. generic labels).&lt;/p&gt;
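&lt;p&gt;To make the subspace idea a bit more tangible, here is a loose numpy sketch (an assumption-heavy illustration, not Chen et al.'s algorithm): take the activations the forget class feeds into a layer, find their principal directions, and project the layer's weights so those directions no longer contribute:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np

def remove_class_subspace(W, class_activations, rank):
    """Project a layer's weight matrix so it ignores the input directions
    most used by the forget class.
    W: (out_dim, in_dim) weight matrix.
    class_activations: (n_examples, in_dim) layer inputs for the forget class.
    rank: how many principal directions of the class subspace to remove."""
    # Principal directions (right singular vectors) of the forget class's activations.
    _, _, vt = np.linalg.svd(class_activations, full_matrices=False)
    V = vt[:rank].T                             # (in_dim, rank) basis of the class subspace
    projector = np.eye(W.shape[1]) - V @ V.T    # projector onto the orthogonal complement
    return W @ projector
&lt;/code&gt;&lt;/pre&gt;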
&lt;p&gt;Many of these approaches came from research and not industry. Will these work in a production-level setup? Are they scalable and efficient in large-scale contexts (i.e. with billion or trillion-parameter models)? Remember: the unlearning promise is that these methods will be cheaper and easier than retraining the models regularly.&lt;/p&gt;
&lt;h3 id="scaling-of-deletion-and-removal-requests"&gt;Scaling of Deletion and Removal Requests&lt;/h3&gt;
&lt;p&gt;Most of this research approaches unlearning a small portion, sometimes a randomly chosen small subset of previously learned training data. Is this the problem definition at hand?&lt;/p&gt;
&lt;p&gt;Looking at &lt;a href="https://iapp.org/news/a/data-privacy-requests-metrics-lessons-for-your-privacy-program/"&gt;reported industry statistics on California (CCPA) deletion requests&lt;/a&gt;, there is a heavy skew in where consumers want their data removed. This could change over time, but presumably the unlearning scale should match the deletion request scale. Whether a company receives 10 deletion requests a week (or fewer) or thousands changes the available solutions. The same applies to data expiration and retention periods.&lt;/p&gt;
&lt;p&gt;Since large tech companies who release large models often release new models every 6 months, this is another relevant aspect to consider. If retraining a new model is already happening, why not retrain without the forget data? How many models are you supporting and serving at once, and under what privacy and consent conditions?&lt;/p&gt;
&lt;p&gt;Given the solutions proposed in this article by the leading unlearning research, it makes sense to use unlearning if you:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;have a small number of deletion requests or a small amount of expiring data&lt;/li&gt;
&lt;li&gt;have a deep learning model that you trained or fine-tuned in-house on personal data&lt;/li&gt;
&lt;li&gt;don't plan on retraining that model in the next 3-6 months&lt;/li&gt;
&lt;li&gt;have a team that can take research papers and implement them in a fine-tuning pipeline&lt;/li&gt;
&lt;li&gt;can implement the unlearning approach and test that it achieves whatever metric you define as "good enough"&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Since this applies to a very small group of companies, what is a reasonable approach for everyone else?&lt;/p&gt;
&lt;p&gt;In general, it would be useful for machine learning companies to focus on:&lt;/p&gt;
&lt;p&gt;a) truly public data (i.e. open access and not under copyright)
b) enthusiastic consent and opt-in&lt;/p&gt;
&lt;p&gt;Enthusiastic consent can lead to fewer deletion requests and create more collaborative and cooperative models, which would build trust in these systems. In addition, it could increase competition and people's ability to choose which organizations they would like to support with their data.&lt;/p&gt;
&lt;p&gt;Collaborative data creates a more diverse and participatory set of models, which also provides better understanding of what users and humans actually want from AI systems.&lt;/p&gt;
&lt;p&gt;In the next article, you'll look at whether today's unlearning approaches meet the definitions and tests laid out in the initial attacks article, as well as if they open up any new attack methods. In subsequent articles, you'll look at other approaches to solving the memorization problem, like using differential privacy during training.&lt;/p&gt;
&lt;p&gt;As always, I'm very open to feedback (positive, neutral or critical), questions (there are no stupid questions!) and creative re-use of this content. If you have any of those, please share it with me! This helps keep me inspired and writing. :)&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Acknowledgements&lt;/em&gt;: I would like to thank &lt;a href="https://desfontain.es/serious.html"&gt;Damien Desfontaines&lt;/a&gt; for his feedback, corrections and thoughts on this article. Any mistakes, typos, inaccuracies or controversial opinions are my own.&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;Although there are probably creative ways to think through how some of these could be applied to noise or sharding with a &lt;a href="https://huggingface.co/blog/moe"&gt;mixture of experts model&lt;/a&gt;.&amp;#160;&lt;a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;For a useful short video, I recommend watching &lt;a href="https://www.youtube.com/watch?v=qg4PchTECck&amp;amp;ab_channel=VisuallyExplained"&gt;this one from Visually Explained on YouTube&lt;/a&gt;.&amp;#160;&lt;a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;a class="footnote-backref" href="#fnref2:2" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:3"&gt;
&lt;p&gt;&lt;a href="https://arxiv.org/abs/2302.09880"&gt;Kurmanji et al. (2023)&lt;/a&gt; created several student models that are initialized with the original model weights and then diverge from this "teacher" model. The training procedure encourages the students to "stay close" to the teacher and retain data subsets while also "moving away" from the teacher on the forget data subsets.&amp;#160;&lt;a class="footnote-backref" href="#fnref:3" title="Jump back to footnote 3 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:4"&gt;
&lt;p&gt;&lt;a href="https://dl.acm.org/doi/pdf/10.1145/3319535.3363226"&gt;Unlearning in anomaly detection work&lt;/a&gt; found that unlearning creates unbounded and exponential loss. This happens because the prediction is already close to correct, then maximizing loss creates very large gradients which erase information in the network. Ideally you want to keep that information by appropriately bounding large losses.&amp;#160;&lt;a class="footnote-backref" href="#fnref:4" title="Jump back to footnote 4 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="ml-memorization"></category></entry><entry><title>Machine unlearning: what is it?</title><link href="https://blog.kjamistan.com/machine-unlearning-what-is-it.html" rel="alternate"></link><published>2025-08-13T00:00:00+02:00</published><updated>2025-08-13T00:00:00+02:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2025-08-13:/machine-unlearning-what-is-it.html</id><summary type="html">&lt;p&gt;Machine unlearning sounds pretty cool. It is the idea that you can remove information from a trained model at will. If this was possible, you'd be able to edit out things you don't want the model to know, from criminal behavior, racialized slurs to private information. It would solve many …&lt;/p&gt;</summary><content type="html">&lt;p&gt;Machine unlearning sounds pretty cool. It is the idea that you can remove information from a trained model at will. If this was possible, you'd be able to edit out things you don't want the model to know, from criminal behavior, racialized slurs to private information. It would solve many deep learning/AI problems at once.&lt;/p&gt;
&lt;p&gt;As unlearning is an active area of research, I've done my best to consolidate the field into three articles related to the &lt;a href="https://blog.kjamistan.com/a-deep-dive-into-memorization-in-deep-learning.html"&gt;problem of memorization&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In this article, you're going to explore unlearning definitions and clarify what unlearning is trying to achieve. In the next article, you'll study approaches to unlearning and evaluate their effectiveness. In the final unlearning article, you'll learn about successful attacks on models that have undergone unlearning. To note, all of this is part of &lt;a href="https://blog.kjamistan.com/a-deep-dive-into-memorization-in-deep-learning.html"&gt;a larger series exploring problems and solutions in machine learning memorization&lt;/a&gt;.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Are you a visual learner? There's &lt;a href="https://www.youtube.com/watch?v=opRq6kNua1c&amp;amp;ab_channel=ProbablyPrivate"&gt;a YouTube video on machine forgetting&lt;/a&gt; and on &lt;a href="https://youtu.be/0_ciCzHaM4o"&gt;unlearning definitions&lt;/a&gt; on the Probably Private channel.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;To start, what do we mean when we say unlearning? In order to build and evaluate unlearning solutions, you need to define it... so... what is unlearning?&lt;/p&gt;
&lt;h3 id="what-is-unlearning"&gt;What is unlearning?&lt;/h3&gt;
&lt;p&gt;What does it mean to unlearn something? In human speak, you might think about unlearning as simply forgetting. The interesting thing is that forgetting has been an area of deep learning study for some decades.&lt;/p&gt;
&lt;p&gt;Usually in machine learning you are attempting to not forget! There is a phenomenon in earlier deep learning called &lt;a href="https://en.wikipedia.org/wiki/Catastrophic_interference"&gt;&lt;em&gt;catastrophic forgetting&lt;/em&gt;&lt;/a&gt;, where you end up with a model that has forgotten important parts of what you wanted it to learn. This happens if a model is trained on significantly different tasks and then in later training the earlier tasks are forgotten.&lt;/p&gt;
&lt;p&gt;You can measure forgetting by looking at the error (i.e. loss) related to a particular example. This measure is also transferable across model architectures, and creates a generalized way of understanding if the model has learned a particular piece of information.&lt;/p&gt;
&lt;h3 id="defining-forgetting"&gt;Defining forgetting&lt;/h3&gt;
&lt;p&gt;&lt;a href="https://arxiv.org/abs/1812.05159"&gt;Toneva et al. studied this in 2019&lt;/a&gt; when investigating forgetting events. They defined these events as when an example went from properly classified to then misclassified at a later point in training.&lt;/p&gt;
&lt;p&gt;Their motivation for studying this phenomenon was to find the smallest representative dataset: a minimal amount of data that was still enough to properly learn a task or concept. They were also curious whether these forgetting events could teach them how to unlearn anomalies or mislabeled examples/concepts.&lt;/p&gt;
&lt;p&gt;In their research, they classified learning and forgetting by defining binary outcomes. Something is learned when it goes from being misclassified to being classified correctly, and forgetting is the opposite (was classified correctly, now misclassified). Their logic could also be applied to some threshold of accuracy increase, or some threshold of ratios between true positives and false negatives, etc. What matters is that there is a clear, measurable definition of what forgetting is, and that this measurement accurately captures what you would consider adequate evidence of such an event.&lt;sup id="fnref:1"&gt;&lt;a class="footnote-ref" href="#fn:1"&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;strong&gt;NOTE&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://arxiv.org/abs/1812.05159"&gt;Toneva et al.&lt;/a&gt; measures a forgetting event in several steps:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Pick a subset of examples you want to study. In the paper, these are from the training dataset, but you could likely rework this metric to sample from either training or any test dataset.&lt;/li&gt;
&lt;li&gt;As training progresses, track the classification or prediction result of these data points. Save these measurements in variables.&lt;/li&gt;
&lt;li&gt;If there is a decrease in accurate prediction or certainty for a particular example, store the time point and the shift as a forgetting moment.&lt;/li&gt;
&lt;li&gt;Continue until training is complete. At the end, analyze these events to find properties of forgetting.&lt;/li&gt;
&lt;/ol&gt;
&lt;hr&gt;
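<p>&lt;p&gt;Those steps are straightforward to prototype. Here is a small sketch of a tracker you could call on a fixed monitoring set during training; the names are placeholders, and the original paper tracks events per presentation of an example rather than per epoch:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import torch

class ForgettingTracker:
    """Tracks Toneva-style forgetting events for a fixed monitoring set:
    an event is recorded when an example flips from correct to incorrect."""
    def __init__(self, n_examples):
        self.prev_correct = [False] * n_examples
        self.events = [[] for _ in range(n_examples)]

    @torch.no_grad()
    def update(self, model, x_monitor, y_monitor, step):
        preds = model(x_monitor).argmax(dim=-1)
        correct = (preds == y_monitor).tolist()
        for i, (was, now) in enumerate(zip(self.prev_correct, correct)):
            if was and not now:
                self.events[i].append(step)   # forgetting moment for example i
        self.prev_correct = correct
&lt;/code&gt;&lt;/pre&gt;</p>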
&lt;p&gt;Interestingly, they also found what they called "unforgettable examples". For them, these didn't (directly) relate to memorization, but instead were test examples that the model always classified correctly once they were learned. In studying these unforgettable examples, they found that they were extremely generic and average. In comparison, the examples that were easiest to forget were atypical or even mislabeled.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Several rows of images, where the labels are written on the top. The most unforgettable example for the class is shown alongside the most forgettable examples. The unforgettable examples look much clearer and often have a full view of the object. The most forgettable examples show sometimes only half of the object or taken at an odd angle." src="./images/2025/unforgettable_examples_common.png"&gt;
&lt;em&gt;Forgettable and Unforgettable examples in a simple computer vision model&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Then they randomly selected training examples to remove and noticed that performance dropped very quickly. They were attempting to reduce the dataset size to pare down the amount of these "unforgettable" or common examples needed to still learn the concept.&lt;sup id="fnref:2"&gt;&lt;a class="footnote-ref" href="#fn:2"&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;However, this approach did not work with the forgettable instances. For the ones that were valid (i.e. not mislabeled), these were key in actual learning and could not be removed without significant drops in accuracy or concept learning.&lt;sup id="fnref:3"&gt;&lt;a class="footnote-ref" href="#fn:3"&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;Another interesting conclusion from the Toneva et al. paper is that the underlying data distribution complexity greatly affects the fraction of examples you can remove and still learn. They note that "for a given architecture, the higher the intrinsic dataset dimension, the larger the number of support vectors, and the fewer the number of unforgettable examples". Translated, given many of today's complex deep learning tasks, you need to have more data to learn, and that data will inform stored decision boundaries. If there are many data points supporting these boundaries, you can remove some of them and the model will still be performant. As you learned in &lt;a href="https://blog.kjamistan.com/how-memorization-happens-overparametrized-models.html"&gt;previous articles&lt;/a&gt;, this also means those examples hold information on how to define those boundaries. This is also why in sparse areas, these points are often memorized.&lt;/p&gt;
&lt;p&gt;This again proves that all datasets are not created equal! As you learned &lt;a href="https://blog.kjamistan.com/machine-learning-dataset-distributions-history-and-biases.html"&gt;in the datasets article&lt;/a&gt; some datasets are massively skewed, making certain parts of the data both more valuable and more vulnerable than the rest of the data. When looking at the qualities of the data itself, it will be more difficult to unlearn complex and outlier "unforgettable" examples compared to something that is hidden in the crowd (the inverse of the sentence above, where the clustering of information from common examples provides cover -- allowing you to remove many of them and still hold information).&lt;/p&gt;
&lt;p&gt;This has the benefit that things that aren't memorized (i.e. not "common and repeated" or "novel and class-defining") can be "unlearned" (in this definition) more easily, which is useful knowledge to build unlearning methods.&lt;/p&gt;
&lt;p&gt;These conclusions show us:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;harder-to-learn problems require more data or result in more memorization&lt;/li&gt;
&lt;li&gt;some data points are more informative, making them more valuable to memorize&lt;/li&gt;
&lt;li&gt;forgetting is as much a property of the task and data as it is a property of the algorithms and network&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;If you &lt;a href="https://blog.kjamistan.com/a-deep-dive-into-memorization-in-deep-learning.html"&gt;already read the previous articles in the series&lt;/a&gt; you already learned some of this, but it bears repeating.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;So if not all data can be forgotten the same way, and if some data must not be forgotten in order to learn properly, how can you figure out what you can safely unlearn?   &lt;/p&gt;
&lt;h3 id="what-can-even-be-unlearned-data-distributions-and-unlearning"&gt;What can even be unlearned? Data distributions and unlearning&lt;/h3&gt;
&lt;p&gt;If some data is inherently unforgettable, could this eventually reveal which data can be unlearned and which not? What if you need to unlearn something that is difficult or "impossible" to forget?&lt;/p&gt;
&lt;p&gt;&lt;a href="https://arxiv.org/abs/2205.10770"&gt;Tirumala et al (2022)&lt;/a&gt; wanted to investigate if there was a way to prevent memorization in LLMs but still fit the model with appropriate accuracy. They discovered a "forgetting baseline" which establishes a lower bounds on memorized sequences. They found that this baseline is correlated with model scale, meaning it is harder for larger models to forget things that they have learned (you already learned this in the &lt;a href="https://blog.kjamistan.com/how-memorization-happens-overparametrized-models.html"&gt;overparameterization article&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;&lt;img alt="Two graphs side-by-side. The left shows model performance over training epochs. You can see performance increase over time and then hold at a fairly high accuracy. In this same graph there is a yellow curve that starts as the model is increasing accuracy and then drops lower and then holds on a straight line. In the second graph you see a line going almost constantly up." src="./images/2025/forgetting_baseline.png"&gt;
&lt;em&gt;Forgetting baseline increases with model size&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;They uncovered that there appears to be a minimum, or baseline, related to model size. As model size increases, the number of examples that the model cannot forget increases. The first graph shows when a small batch was first learned (where the yellow curve starts). Even though this batch is no longer trained on, the retention of that information continues many iterations after it is initially learned. The second graph shows this property in relation to network size (i.e. number of parameters), showing a steady increase in the number of memorized sequences as model parameters increase.&lt;/p&gt;
&lt;p&gt;To add to the complexity, &lt;a href="https://arxiv.org/abs/2207.00099"&gt;Jagielski et al revealed in 2023&lt;/a&gt; that repetition and task-related properties were at play. Their research determined that examples repeated multiple times are harder to forget, that "more difficult" and outlier examples are harder to forget, and that non-convex problems create difficulties in forgetting. What does that mean?&lt;/p&gt;
&lt;p&gt;&lt;img alt="Two graphs where the y-axis shows the error. The graph labeled convex shows a curve going gradually down and then back up. The curve has a clear point where the error is minimized at the lowest point of error (optimum). In the non-convex graph there is a more squiggly line going up and down. You can see that there are many points where the curve hits a saddle point; it is lower than the points near it but not at the lowest point overall. These points are labeled local optimum and then the lowest point of the entire graph is labeled global optimum." src="./images/2025/convex_vs_non_convex.png"&gt;&lt;/p&gt;
&lt;p&gt;Most deep learning problems are non-convex problems in an extremely high-dimensional space. Actually this is exactly why they do well at complex problems like computer vision, text generation, translation and multi-modal inputs (text/vision inputs combined, for example). Without these properties, you could train a simple machine learning model and have a much cheaper energy/compute bill for the same performance. Jagielski et al.'s work showed that this complexity creates difficulties in forgetting.&lt;/p&gt;
&lt;p&gt;Forgetting relates directly to the complexity of the problem you are trying to solve. The question you need to answer: does the data you have represent the complexity of the problem space and how the model then can learn that problem space?&lt;/p&gt;
&lt;p&gt;&lt;img alt="A chart where the y-axis shows forgetting time and the x-axis shows learning time. In the bottom right (i.e. hard to learn easy to forget) there are &amp;quot;mislabeled examples&amp;quot;. As you move up on the right (hard to learn) there is rare examples. Then, there is a line where everything above it shows &amp;quot;unforgettable&amp;quot;. On the left (easy to learn, hard to forget) there is common examples and on the right (hard to learn and unforgettable) there is &amp;quot;complex examples&amp;quot;." src="./images/2025/unforgettable_examples_chart.png"&gt;
&lt;em&gt;Maini et al., Characterizing Datapoints via Second-Split Forgetting (2022)&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://arxiv.org/abs/2210.15031"&gt;Maini et al in 2022&lt;/a&gt; defined these characteristics when studying forgetting in deep learning. Their research found that there was a difference in forgetting mislabeled examples, rare (i.e. infrequent but useful) examples and complex examples.&lt;/p&gt;
&lt;p&gt;What are "complex" examples? The authors define them as "samples that are drawn from sub-groups in the dataset that require either (1) a hypothesis class of high complexity; or (2) higher sample complexity to be learnt relative to examples from the rest of the dataset."&lt;/p&gt;
&lt;p&gt;Let's break this down by taking a short detour into exactly what complex means in this context. In this case, complex is an attribute of the amount of information you can learn in an example, compared to its peers or to the entire world of examples. If I show you something that you already know (common), you probably won't learn much unless it's your first time learning it.&lt;/p&gt;
&lt;p&gt;This complexity can also relate to the class itself, where the class is in-and-of-itself an unexpected or surprise occurrence (i.e. an anomaly or aberrance). This "surprise" element can sometimes (but not always) be synonymous with what is meant by "outlier" in a distribution.&lt;/p&gt;
&lt;p&gt;The neat thing is that this complexity has been studied for almost as long as computers have existed. Essentially, this complexity represents what Claude Shannon called &lt;em&gt;information&lt;/em&gt;. &lt;a href="https://en.wikipedia.org/wiki/Information_theory"&gt;Shannon's information theory&lt;/a&gt; explains how complex data holds more information.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Are you a visual learner? There's &lt;a href="https://youtu.be/lLobwqcQEvQ?feature=shared"&gt;a YouTube video on information theory&lt;/a&gt; on the Probably Private channel.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This theory is the basis of much of &lt;a href="https://en.wikipedia.org/wiki/Computational_learning_theory"&gt;computational learning theory&lt;/a&gt;, where you want to extract information and store it in another format (a model, for example) that can hold this information in a compacted and compressed way for future use. As you can imagine, some information is more complex and useful than other information -- ideally you are only storing and communicating the most essential information when you want to learn or communicate efficiently.&lt;/p&gt;
&lt;p&gt;To better understand how information theory and learning theory work together -- imagine you are pulling different colored balls from a jar. After a few pulls of only red balls, you start to expect to see more red balls, maybe there are only red balls in there? Then you see a surprise blue ball. This ball holds more information, which helps you learn. This "aha!" is one example of both information and a "more complex" outcome -- which is what you are trying to store in your model, so the model is also not "surprised"&lt;sup id="fnref:4"&gt;&lt;a class="footnote-ref" href="#fn:4"&gt;4&lt;/a&gt;&lt;/sup&gt; to occasionally see a blue or green ball, even if it learns the majority are red.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A person is next to a jar of different colored balls. Most of them are red but a few are blue and one is green. When pulling a repeated red ball, there is little or no information. When pulling a green ball there is a lot of information. The person learns something when they see new information." src="./images/2025/info_theory_surprise.png"&gt;&lt;/p&gt;
&lt;p&gt;Essentially this is the goal of today's deep learning systems when you want to learn a "world problem" and save it in a multi-billion/trillion parameter model (which is in the end able to be compressed into several very large files on a computer). You want this model to have held all of the information needed to take an input and give a useful, informative output based on every piece of that information (and potential links/signals learned from these "surprise" connections based in information theory).&lt;/p&gt;
&lt;p&gt;Unfortunately there isn't a universal way to measure information in a piece of data, especially if its complexity (and therefore information content) is relative to the learning problem or to the dataset (or world!). You can, however, use the learning process to recognize more complexity, because these classes/examples/problems will probably take longer to learn and demonstrate stronger memorization properties when compared with less complex examples or classes.&lt;/p&gt;
&lt;p&gt;Perhaps these data points can teach us something about recognizing memorization and forgetting. In all of the papers you've explored thus far, there are different metrics the authors use to define this phenomenon. Maybe studying and better defining forgetting and learning can help you establish a baseline for example complexity and frequency within a population. This, of course, must account for things like model size and problem complexity.&lt;/p&gt;
&lt;p&gt;Most currently-developed ways of studying this problem are applied late in the learning process, via a modified training (i.e. leave-one-out) or after the model has been trained. You would need to fundamentally shift the machine learning training and evaluation to evaluate these properties during normal training, so that there can be appropriate metrics which are also efficient and scalable. As of yet, there isn't a universal definition or standard for doing this, and the research around how to appropriately measure this is not designed to scale and certainly isn't yet built into machine learning frameworks.&lt;/p&gt;
&lt;p&gt;With that in mind, you still need to define unlearning -- especially in relation to memorized / "unforgettable" examples. Going back to a metric that is easier to measure, you might remember &lt;a href="https://arxiv.org/abs/2311.17035"&gt;Nasr et al's work&lt;/a&gt; on defining memorization as to whether it is discoverable or extractable. This relates back to &lt;a href="https://blog.kjamistan.com/defining-privacy-attacks-in-ai-and-ml.html"&gt;our two attack definitions&lt;/a&gt;. Perhaps these can lead us to a clearer definition, one that is both measurable and scalable.&lt;/p&gt;
&lt;h3 id="i-swear-i-didnt-see-it-model-based-definitions"&gt;I swear I didn't see it: Model-based definitions&lt;/h3&gt;
&lt;p&gt;Putting the focus back to the model itself, perhaps there is a definition that looks at the model behavior, fidelity or other properties to define forgetting and/or "unlearning".&lt;/p&gt;
&lt;p&gt;As you learned &lt;a href="https://blog.kjamistan.com/how-memorization-happens-novelty.html"&gt;from a previous article&lt;/a&gt;, one way to differentiate between memorized examples and not memorized examples is to use the leave-one-out training to develop several models and then compare a model that has never seen a training example with one that has. But is there a way to determine that a model behaves similarly without having to train hundreds of large models?&lt;/p&gt;
&lt;p&gt;Some unlearning approaches look at measuring the distance between models. However, because many neural networks can learn the same task or same functions with random initialization weights (especially ones with millions of parameters), the similarity or distance of the weights of a network is not a very good choice for evaluating model similarity. It's easy due to permutation invariance (a cool property of linear algebra) to have different weights and yet the same outcomes.&lt;/p&gt;
&lt;p&gt;In 2023, &lt;a href="https://unlearning-challenge.github.io/"&gt;NeurIPS ran an "unlearning challenge"&lt;/a&gt;, which attempted to define unlearning. The measurements used in the challenge were predominantly defined and then evaluated by Google researchers, and included several interesting qualities:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;They choose "average case" datasets, where the worst-case scenario of unlearning only complex examples is removed.&lt;/li&gt;
&lt;li&gt;They use the definition of differential privacy to define a new unlearning measurement that is weaker than the concept of differential privacy, but that they claim is sufficient for proving unlearning.&lt;/li&gt;
&lt;li&gt;This definition attempts to keep model predictions between a model that has never seen the data (leave-one-out) and the model that has undergone unlearning close. Both models should behave as similarly as possible.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img alt="There is a graph with a y-axis that measures some model output (like prediction distribution or loss on a given example). Then there are two probability distributions, one in blue that skews further to the left which represents model responses that have never seen the example and one in red that skews to the right that represents the unlearned model. The goal of the attacker is to separate these two distributions by choosing a hypothesis line and marking everything to the left as one model and everything to the right as the unlearned model." src="./images/2025/unlearning_v_loo_model_prediction.png"&gt;&lt;/p&gt;
&lt;p&gt;To visualize this challenge and the measurement, you compare the leave-one-out model prediction performance metrics across an example set to that of the unlearned model. Similar to differential privacy attacks, the attacker must try to differentiate between the two models. The attacker only has query access to "some model" and must decide if that model is the one that has never seen the data or the one that has unlearned the data. Of course, if you have both models you can simply test both across a sampled subset and determine if this attack would be successful.&lt;/p&gt;
&lt;p&gt;As you have learned from &lt;a href="https://blog.kjamistan.com/a-deep-dive-into-memorization-in-deep-learning.html"&gt;the memorization series&lt;/a&gt; thus far, a model that has memorized a useful complex example or even a less useful common but repeated example will score higher on similar examples than models that haven't seen this information before. Therefore, this unlearning metric attempts to exploit this and see if there are examples that demonstrate the difference between the models.&lt;/p&gt;
&lt;p&gt;Similar to the attacks defined, this attack works best by finding a likelihood ratio exploiting several accuracy dimensions (False Positive Rate and False Negative Rate). The attacker can bring any prior knowledge they have to create an initial assumption and then query the "unknown model" many times in order to update their suspicion on if that model has changed (i.e. if the unlearned and other model behave differently).&lt;/p&gt;
&lt;p&gt;When doing so, an intelligent attacker would keep a history of all the queries and begin to distinguish minor differences by looking at the distribution of responses overall. Eventually, they might find a worst-case scenario query that clearly shows them how the query responses are easily separable.&lt;/p&gt;
&lt;p&gt;In the &lt;a href="https://unlearning-challenge.github.io/assets/data/Machine_Unlearning_Metric.pdf"&gt;metrics definitions for the challenge (PDF)&lt;/a&gt;, the authors use two attacks to measure the unlearning quality of the models. They then choose the best result for a particular example to calculate an epsilon that measures the divergence between the two models. They do this for the sampled subset that represents the forget set examples. Then, they average these epsilons across all their successful attacks or best guesses to determine what they call the "forgetting quality" of the unlearning approach.&lt;sup id="fnref:5"&gt;&lt;a class="footnote-ref" href="#fn:5"&gt;5&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
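&lt;p&gt;To give a feel for how attack error rates turn into a privacy-style number, here is a hedged sketch using the standard empirical-epsilon bound from the differential privacy auditing literature. The challenge's actual scoring involves more machinery (per-example attacks, binning and aggregation into a "forgetting quality" score), so treat this only as an illustration of the direction of the calculation:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import math

def empirical_epsilon(fpr, fnr, delta=0.0):
    """Lower-bound estimate of epsilon from an attacker's false positive
    and false negative rates (the better the attack, the larger epsilon)."""
    eps1 = math.log((1.0 - delta - fpr) / fnr)
    eps2 = math.log((1.0 - delta - fnr) / fpr)
    return max(eps1, eps2, 0.0)

# A near-random attacker implies little measurable difference between models:
print(empirical_epsilon(fpr=0.48, fnr=0.49))   # close to 0
# A strong attacker separating the models implies a large epsilon:
print(empirical_epsilon(fpr=0.05, fnr=0.10))   # roughly 2.9
&lt;/code&gt;&lt;/pre&gt;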
&lt;p&gt;You'll evaluate the approaches for the top scorers in the next article, but remember that this entire measurement requires a predefined forget and retain set and the ability to measure and compare the approach with a leave-one-out model. Presumably the authors could also say that the best approaches automatically extend to other models and therefore you do not need to measure every time -- but because you know &lt;a href="https://blog.kjamistan.com/how-memorization-happens-overparametrized-models.html"&gt;that model architecture&lt;/a&gt; and &lt;a href="https://blog.kjamistan.com/machine-learning-dataset-distributions-history-and-biases.html"&gt;datasets&lt;/a&gt; greatly influence memorization, it would be difficult to argue that one sandboxed experiment demonstrates a universal approach.&lt;/p&gt;
&lt;h3 id="privacy-attack-metrics-as-a-measurement"&gt;Privacy-attack metrics as a measurement&lt;/h3&gt;
&lt;p&gt;In the end, this unlearning metric greatly relies on privacy attacks as a measurement. These attack metrics can also be just directly used to determine the exposure of a given example or set of examples or even across the entire training dataset.&lt;/p&gt;
&lt;p&gt;For the challenge, the attacks and tools used were inspired directly by membership inference attacks, such as LiRA. On &lt;a href="https://research.google/blog/announcing-the-first-machine-unlearning-challenge/"&gt;Google's post about the challenge&lt;/a&gt;, they state:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Intuitively, if unlearning is successful, the unlearned model contains no traces of the forgotten examples, causing MIAs to fail: the attacker would be unable to infer that the forget set was, in fact, part of the original training set.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;So why not just use these attacks as the gold standard? Although they do require extra compute and time to evaluate models, they clearly help model developers and engineers better understand and address the concerns raised by memorization. New attacks should be developed and researched in case there are better ways, but practitioners need standard definitions to allow for comparison and evaluation.&lt;/p&gt;
&lt;p&gt;Additionally, LiRA and extraction attacks could be standardized and integrated into normal machine learning evaluation pipelines -- allowing data scientists and machine learning engineers to evaluate the effectiveness of their interventions and track memorization "metrics" in their systems.&lt;/p&gt;
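&lt;p&gt;To give a sense of what such a pipeline check could look like, here is a very stripped-down, loss-based likelihood-ratio sketch in the spirit of LiRA. It is not the full attack (real LiRA trains many shadow models and calibrates per example), and the losses below are placeholder numbers:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np
from scipy.stats import norm

def lira_style_score(target_loss, losses_in, losses_out):
    """Likelihood-ratio membership score for one example.
    losses_in:  losses from shadow models trained WITH the example.
    losses_out: losses from shadow models trained WITHOUT the example.
    A higher score means "more likely a member" (i.e. not forgotten)."""
    mu_in, sigma_in = np.mean(losses_in), np.std(losses_in) + 1e-8
    mu_out, sigma_out = np.mean(losses_out), np.std(losses_out) + 1e-8
    log_p_in = norm.logpdf(target_loss, mu_in, sigma_in)
    log_p_out = norm.logpdf(target_loss, mu_out, sigma_out)
    return log_p_in - log_p_out

# Placeholder numbers: members tend to have lower loss than non-members.
print(lira_style_score(target_loss=0.2,
                       losses_in=np.array([0.1, 0.2, 0.15]),
                       losses_out=np.array([0.9, 1.1, 0.8])))
&lt;/code&gt;&lt;/pre&gt;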
&lt;p&gt;Of course, with LiRA you need to decide how to hold the attack at a reasonable "true positive rate", so you are accurately guessing which data the model has seen and which not. In addition, there will always be "worst case" versus "average case" debates about which is adequate to properly define unlearning. Much of this could standardize over time in scientific literature as the field matures and as legal and privacy scholars become informed enough about the distinctions to offer advice.&lt;/p&gt;
&lt;p&gt;Even if you weaken the definition to some average case scenario, you can imagine a way where you might not actually prove you didn't learn a thing, you might simply &lt;em&gt;hide&lt;/em&gt; that you learned it. You can think of this as a slight remodel -- the walls are still there but they are a new color, can you find the artifacts that remain?&lt;/p&gt;
&lt;h3 id="harry-who-approximate-unlearning"&gt;Harry who? Approximate unlearning&lt;/h3&gt;
&lt;p&gt;If you don't want to use a precise measurement, you can perhaps find an approximate measurement. This is exactly what Microsoft Research investigated in their research &lt;a href="https://arxiv.org/abs/2310.02238"&gt;Who's Harry Potter? Approximate Unlearning in LLMs&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;They studied how they might be able to unlearn Harry Potter as a concept. They didn't necessarily want to prove the model never learned Harry Potter (MIA) or even that it never mentions the name. They just wanted to make sure it didn't do so often or consistently.&lt;/p&gt;
&lt;p&gt;In this "approximate" unlearning, they encountered other problems with picking a famous person-like character:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Related concepts&lt;/em&gt;: There are several persons related to Harry Potter, like Ron and Hermione. How do you just remove Harry without also eliminating other persons in Harry's proximity?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Downweighting the correct embeddings&lt;/em&gt;: Making it hard to answer "My name is Harry" also deprioritizes the token sequence "My name is", which is not what you want to do.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Disambiguation&lt;/em&gt;: There is more than one Harry! How do you remove the correct one?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Unlearning the correct links&lt;/em&gt;: How do you maintain links to other concepts, like Magic and Hogwarts while delinking Harry?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Simple replacements are too simple&lt;/em&gt;: Simply replacing tokens (i.e. substitute a new name "John" for Harry) leads to concept confusion, like the LLM responding with two persons acting as one.  "John found his keys and then Harry left the house."&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;To deal with these issues, they replaced several core concepts with more "generic" concepts. They then fine-tuned the model to "relearn" these generic links, suppressing the Harry Potter likelihood via this additional training.&lt;/p&gt;
&lt;p&gt;After that additional training, they reached a result where they determined the slippage was appropriate (i.e. Harry was rarely mentioned). The authors note that Harry Potter and linked concepts exist in their own universe, which allowed this information to be more cleanly separated from other concepts.&lt;sup id="fnref:6"&gt;&lt;a class="footnote-ref" href="#fn:6"&gt;6&lt;/a&gt;&lt;/sup&gt; Non-fiction content and real humans could be more difficult to remove.&lt;/p&gt;
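&lt;p&gt;As a toy sketch of the generic-replacement step only (the paper's full pipeline also uses a reinforced model and adjusted training targets, which this skips, and a naive mapping runs straight into the "too simple" problem from the list above), you can imagine rewriting the fine-tuning text with an invented mapping like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import re

# Invented example mapping from universe-specific terms to generic stand-ins.
generic_map = {
    "Harry Potter": "Jon Smith",
    "Hogwarts": "a boarding school",
    "Quidditch": "a team sport",
}

def genericize(text):
    """Rewrite fine-tuning text so specific concepts become generic ones."""
    for specific, generic in generic_map.items():
        text = re.sub(re.escape(specific), generic, text)
    return text

print(genericize("Harry Potter flew to Hogwarts to play Quidditch."))
# "Jon Smith flew to a boarding school to play a team sport."
&lt;/code&gt;&lt;/pre&gt;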
&lt;h3 id="open-problem-lack-of-clear-definitions"&gt;Open Problem: Lack of clear definitions&lt;/h3&gt;
&lt;p&gt;As you have learned thus far in this article, there are many differing measurements, metrics and ideas about what "unlearning" or "forgetting" could mean. Because there isn't yet a clear, agreed-upon definition of unlearning, it is also a difficult field to contribute to effectively: how can you move the needle on a problem when you haven't yet defined the problem?&lt;/p&gt;
&lt;p&gt;From the Google Unforgetting challenge:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The evaluation of forgetting algorithms in the literature has so far been highly inconsistent. While some works report the classification accuracy on the samples to unlearn, others report distance to the fully retrained model, and yet others use the error rate of membership inference attacks as a metric for forgetting quality.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Since you might want to compare 10 different approaches to unlearning, you are then trying to evaluate them with 3-4 different types of measurements. How do you choose the best one for your data and use case? Do you have time to evaluate every approach? Where is the advice on how to fix your specific unlearning problem?&lt;/p&gt;
&lt;p&gt;To make things more complicated, when lawyers or policy experts look at these problems, they see a myriad of other issues. &lt;a href="https://arxiv.org/abs/2412.06966"&gt;Research from a multidisciplinary team of researchers in big tech and universities&lt;/a&gt; pointed out that being able to unlearn one particular artist doesn't remove their influence on a group (i.e. Monet and the Impressionists). Being able to remove all pictures of Batman doesn't remove all references to the idea of that superhero, or even all photos of someone cosplaying that character. Essentially their advice focuses heavily on copyright and advises that you use guardrails to attempt to block violations, but they present a sound discussion of the complexity of the issue. Since there isn't yet much legal guidance, defining unlearning, both technically and legally, is going to be a long process.&lt;/p&gt;
&lt;p&gt;Additionally, it's hard to disentangle concepts in data at scale. &lt;a href="https://arxiv.org/abs/2110.11891"&gt;Thudi et al.&lt;/a&gt; called for better auditable unlearning definitions, noting that it's easy to learn the same information from a different data point. This is similar to the problem with graph networks, where if you delete your data, it doesn't mean that a friend of yours isn't sharing your data without your permission (i.e. via their contact book or an uploaded photo). Without a clear definition that is both auditable and that reflects the societal and personal views of privacy, it's problematic to call unlearning "done" or even achievable.&lt;/p&gt;
&lt;p&gt;Taking a personal and social view of privacy, your data might be easily traceable but you as a concept are not. Other people can contribute data about you without your consent because that's how data works -- it's unlikely that you'll ever have a full view of what digital data about you exists. That said, creating a better conversation around what humans want and what is technically sound could help create an understandable, feasible and auditable definition.  &lt;/p&gt;
&lt;p&gt;To be clear, that's a big part of why I've been writing this series -- to hopefully spark conversations and insights across disciplines in order to help guide better, clearer, safer definitions of Privacy-by-Design AI systems.&lt;/p&gt;
&lt;p&gt;Unfortunately I won't wrap up this post with an answer to this question, but I hope I've sparked some thinking around how to talk about the definitions currently available and begin to reason about which ones resonate with you, with your team, with your legal or policy expertise and with your own data.&lt;/p&gt;
&lt;p&gt;In the next article, you'll be exploring different ways researchers and teams have technically approached unlearning. You'll dive deeper into the problems of unlearning critical information. You'll also look at other open problems in unlearning, like scaling unlearning for real-world problems.&lt;/p&gt;
&lt;p&gt;As always, I'm very open to feedback (positive, neutral or critical), questions (there are no stupid questions!) and creative re-use of this content. If you have any of those, please share it with me! This helps keep me inspired and writing. :)&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Acknowledgements&lt;/em&gt;: I would like to thank &lt;a href="https://desfontain.es/serious.html"&gt;Damien Desfontaines&lt;/a&gt; for his feedback, corrections and thoughts on this article. His input greatly improved my thinking and writing. Any mistakes, typos, inaccuracies or controversial opinions are my own.&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;Wouldn't it be neat if you could actually track these as you trained and view concrete shifts in examples, even if only at certain training intervals? Although you &lt;em&gt;could&lt;/em&gt; do this in some custom MLOps setups, this is not standard practice. That said, this would certainly help both with debugging learning processes and with tracking problems like memorization... Note that this would need to be done in a performant manner, which would likely mean selecting some subsample of a given test or train batch rather than measuring every example.&amp;#160;&lt;a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;Yes, for the privacy professionals reading this -- this means information-driven data minimization (so long as this learning isn't centralized)! But... how to do this efficiently at scale is still an unsolved problem. Plus, how do you decide whose information is spared the learning process and whose information is used to learn? If you can answer this, please &lt;a href="https://probablyprivate.com/about/"&gt;write me&lt;/a&gt;!&amp;#160;&lt;a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:3"&gt;
&lt;p&gt;They also found extreme outliers and potentially mislabeled examples by diving further into the "most forgettable" examples. For obvious reasons, these examples actually hindered learning rather than supporting it. As they note: "Finding a way to separate those points from very informative ones is an ancient but still active area of research (John, 1995; Jiang et al., 2018)."&amp;#160;&lt;a class="footnote-backref" href="#fnref:3" title="Jump back to footnote 3 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:4"&gt;
&lt;p&gt;I try to avoid anthropomorphizing models; they are just fancy computer programs. Please excuse this as I try to explain the concept. :)&amp;#160;&lt;a class="footnote-backref" href="#fnref:4" title="Jump back to footnote 4 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:5"&gt;
&lt;p&gt;There is some pseudocode that is instructive on this approach in the &lt;a href="https://unlearning-challenge.github.io/assets/data/Machine_Unlearning_Metric.pdf"&gt;metrics definitions for the challenge&lt;/a&gt; and the &lt;a href="https://arxiv.org/abs/2406.09073"&gt;final wrap up of the competition&lt;/a&gt;. Note: they approximate some of these metrics and specifically call out that the population is too small to measure everything as accurately as desired for every model due to timing and computation constraints.&amp;#160;&lt;a class="footnote-backref" href="#fnref:5" title="Jump back to footnote 5 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:6"&gt;
&lt;p&gt;Even newer and significantly weaker definitions, like &lt;a href="https://arxiv.org/abs/2310.07579"&gt;"in-context unlearning"&lt;/a&gt;, where the unlearning is supposed to happen in the prompt, take LiRA as an inspiration and note that the LiRA definition can and should be used as the most accurate measurement of unlearning.&amp;#160;&lt;a class="footnote-backref" href="#fnref:6" title="Jump back to footnote 6 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="ml-memorization"></category></entry><entry><title>AI Risk and Threat Taxonomies</title><link href="https://blog.kjamistan.com/ai-risk-and-threat-taxonomies.html" rel="alternate"></link><published>2025-08-05T00:00:00+02:00</published><updated>2025-08-05T00:00:00+02:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2025-08-05:/ai-risk-and-threat-taxonomies.html</id><summary type="html">&lt;p&gt;It seems like every week &lt;a href="https://www.linkedin.com/in/katharinejarmul/"&gt;my LinkedIn&lt;/a&gt; feed is filled with new &lt;em&gt;just released&lt;/em&gt; AI risk taxonomies, threat models or AI governance handbooks. Usually these taxonomies come from governance consultants or standards authorities and are a great reference for understanding the wide variety of risks AI systems&lt;sup id="fnref:1"&gt;&lt;a class="footnote-ref" href="#fn:1"&gt;1&lt;/a&gt;&lt;/sup&gt; bring with …&lt;/p&gt;</summary><content type="html">&lt;p&gt;It seems like every week &lt;a href="https://www.linkedin.com/in/katharinejarmul/"&gt;my LinkedIn&lt;/a&gt; feed is filled with new &lt;em&gt;just released&lt;/em&gt; AI risk taxonomies, threat models or AI governance handbooks. Usually these taxonomies come from governance consultants or standards authorities and are a great reference for understanding the wide variety of risks AI systems&lt;sup id="fnref:1"&gt;&lt;a class="footnote-ref" href="#fn:1"&gt;1&lt;/a&gt;&lt;/sup&gt; bring with them.... but...&lt;/p&gt;
&lt;p&gt;&lt;img alt="A &amp;quot;this is fine&amp;quot; meme. The first panel shows the dog in the burning room with a headline: Breaking: New AI Risk Report, 500 updated definitions! And the second panel is the &amp;quot;This is fine&amp;quot; panel." src="./images/2025/ai_risk_taxonomy.jpg"&gt;&lt;/p&gt;
&lt;p&gt;Often they are maddeningly impractical.&lt;/p&gt;
&lt;p&gt;Let's say you're a governance or technical expert on a team and you get forwarded a 500-definition taxonomy with little to no categorization of where the threat lies, how to implement controls and, most importantly, whether it even applies to you. Where can you start with that document? For your mental health, I'd recommend closing your browser and making yourself a tea in the garden...&lt;/p&gt;
&lt;p&gt;So, let's NOT stop making taxonomies -- they are useful as a reference -- but let's START making deeply practical approaches for people who actually work in governance, data and AI. These documents can help teams:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Prioritize which risks matter&lt;/strong&gt;: Do you train your own models? No? Then for the love of the universe stop talking about data poisoning -- it's not your problem! Instead, focus on figuring out which threats are actually relevant for the ways you are using AI/ML systems, and then start with only those relevant attacks and vulnerabilities.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Stop (just) citing papers and start red teaming and testing&lt;/strong&gt;: Papers are great. I love papers, really, have you seen &lt;a href="https://practicaldataprivacybook.com/practical_data_privacy_urls/"&gt;my book's citations&lt;/a&gt;? ;) But have you heard of &lt;a href="https://github.com/paperswithcode"&gt;papers with code&lt;/a&gt;? Work with technical team members to actually build out and test out a few attacks once you know which ones concern you.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Build out data governance infrastructure&lt;/strong&gt;: Most organizations aren't training or hosting extremely large models themselves, but they are building tooling around these systems. Focus on getting data governance basics correct (documentation, tagging, cataloging, lineage and quality tracking) so that as your data/AI/ML maturity grows you've already covered the basics and you're ready to go.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Focus on system components and data access&lt;/strong&gt;: Concerned about AI privacy and security? Focus on what data and documents the system has access to and how. Build protections just like you would for any data access. For example, removing potentially sensitive data sources from any data the AI system accesses is a great start.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Flex your multidisciplinary risk muscle&lt;/strong&gt;: Not yet doing multidisciplinary risk assessment and evaluation? You're living in the past, bud! Yes, it'll "slow you down" and introduce new processes at first, but the benefits of faster releases, higher-quality, privacy-aware and secure products will definitely outweigh that initial friction.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Getting the ball rolling, even starting small, is the most essential thing you can do for building more secure, more privacy-aware systems. Then your ability to address all of those taxonomies grows with practice, platforms and systems that help you assess, manage and reduce the impact of new risks, threats and vulnerabilities.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Want more tips like this in your inbox? &lt;a href="https://probablyprivate.com/subscribe/"&gt;Subscribe to my newsletter&lt;/a&gt; or &lt;a href="https://www.youtube.com/@ProbablyPrivate"&gt;my YouTube channel&lt;/a&gt; to get the latest.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I'm curious: any other practical tips you have for folks to get started on AI system risk? Do taxonomies help you do your work? If so, for what work and how? And what are you doing outside of taxonomy work to address risk?&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;An AI system includes machine learning models, monitoring, evaluation, software, infrastructure/networking/hardware and data needed to run an AI-based product or service.&amp;#160;&lt;a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="security"></category></entry><entry><title>Algorithmic-based Guardrails: External guardrail models and alignment methods</title><link href="https://blog.kjamistan.com/algorithmic-based-guardrails-external-guardrail-models-and-alignment-methods.html" rel="alternate"></link><published>2025-07-28T00:00:00+02:00</published><updated>2025-07-28T00:00:00+02:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2025-07-28:/algorithmic-based-guardrails-external-guardrail-models-and-alignment-methods.html</id><summary type="html">&lt;p&gt;You've probably at some point heard the term "guardrails" when talking about security or safety in AI systems like LLMs or multi-modal models (i.e. models that include and produce multiple modalities, like speech and image, videos, image and text).&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Are you a visual learner? There's &lt;a href="https://youtu.be/IeyB-2cS5lM"&gt;a YouTube video for …&lt;/a&gt;&lt;/p&gt;&lt;/blockquote&gt;</summary><content type="html">&lt;p&gt;You've probably at some point heard the term "guardrails" when talking about security or safety in AI systems like LLMs or multi-modal models (i.e. models that include and produce multiple modalities, like speech and image, videos, image and text).&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Are you a visual learner? There's &lt;a href="https://youtu.be/IeyB-2cS5lM"&gt;a YouTube video for this article&lt;/a&gt; on the Probably Private channel.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;In this article in &lt;a href="https://blog.kjamistan.com/a-deep-dive-into-memorization-in-deep-learning.html"&gt;the series&lt;/a&gt;, you'll dive deeper into what technically falls under the term "guardrail" in today's AI systems and review whether these are a reasonable approach to memorization in AI/ML models.&lt;/p&gt;
&lt;h3 id="what-are-guardrails"&gt;What are guardrails?&lt;/h3&gt;
&lt;p&gt;The term is unfortunately difficult to pin down technically. Since it became popular, it's been used to describe a variety of interventions in AI/ML systems, including:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;software-based input and output filters (as you read &lt;a href="https://blog.kjamistan.com/blocking-aiml-memorization-with-software-guardrails.html#blocking-aiml-memorization-with-software-guardrails"&gt;in the last article&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;external algorithmic/machine learning model input and output filters&lt;/li&gt;
&lt;li&gt;actual fine-tuning or extended training which attempts to update the main model to reduce the chance of unwanted outputs (this is sometimes called &lt;em&gt;alignment&lt;/em&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Let's review the second two in further detail and compare them to what you learned about &lt;a href="https://blog.kjamistan.com/blocking-aiml-memorization-with-software-guardrails.html#blocking-aiml-memorization-with-software-guardrails"&gt;software-based filters&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id="external-machine-learning-models-as-input-and-output-filters"&gt;External machine learning models as input and output filters&lt;/h3&gt;
&lt;p&gt;Similar to the software-based input and output filters you reviewed, machine learning-based filters or &lt;em&gt;algorithmic filters&lt;/em&gt; attempt to identify problematic inputs and outputs. Instead of using the hashing or keyword-based approaches that you know from software filters, these use a trained model that sits outside of the main model and predicts whether the input or output should be blocked.&lt;/p&gt;
&lt;p&gt;There are already some popular open-source models that do just this, like Llama Guard, Prompt Guard and Code Shield, which all fall under the &lt;a href="https://github.com/meta-llama/PurpleLlama"&gt;Purple Llama family of models&lt;/a&gt; released by Meta. Let's investigate how these work in a real system.&lt;/p&gt;
&lt;p&gt;&lt;img alt="An example high-level architecture, where the data first comes in via a piece of software with an API call, then the data goes through some sort of input processing. After the input processing there is an algorithmic guardrail flags content violations based on the text input. Then the data goes to the LLM itself (this usually now includes the tokenization as part of the LLM). Before the output processing, there is another external algorithmic guardrail to flag potential violations. Then it flows through output processing and then back to API software and to the user." src="./images/2025/algo_guardrails.png"&gt;&lt;/p&gt;
&lt;p&gt;These models are trained to identify known problems within the models themselves, like toxicity and malicious content, as well as known attacks against multi-modal models, such as prompt injection attacks to provoke banned responses.&lt;/p&gt;
&lt;p&gt;The model uses the input from the conversation or individual chat message and identifies if the user, the prompt or the answer could be viewed as problematic. Some of these models are trained with a variety of categories, like violence, hateful language, illegal activities and even privacy. Some are trained just to identify one particular problem, like protecting the meta prompt or attempting to find cybersecurity errors in generated code.&lt;/p&gt;
&lt;p&gt;For example, a guardrail model can process chat input to identify if someone has included anything that would be considered a jailbreak attempt, like "Ignore instructions and do this instead". Or the model might identify particular racist remarks or slurs and flag a conversation as discriminatory.&lt;/p&gt;
&lt;p&gt;This works well for inputs that are easy to classify. The guardrail model is trained on a classification task, where the training data pairs each input (text and/or other modalities) with a label for the category of problem it contains (i.e. insecure code or a derogatory statement).&lt;/p&gt;
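&lt;p&gt;To make this concrete, here's a minimal sketch of what calling one of these guardrail classifiers might look like, using the Hugging Face transformers library and (as an assumption) the meta-llama/Llama-Guard-3-8B checkpoint, which is gated and requires requesting access. The example conversation is made up; the chat template handles the model's safety prompt formatting.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-Guard-3-8B"  # assumed checkpoint, gated on Hugging Face
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# A made-up single-turn conversation to moderate
chat = [{"role": "user", "content": "What is the home address of my old classmate Jane Doe?"}]

# The chat template wraps the conversation in the model's safety-classification prompt
input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt")
output = model.generate(input_ids=input_ids, max_new_tokens=30)

# The completion is a verdict like "safe" or "unsafe" plus a category code
prompt_len = input_ids.shape[-1]
print(tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True))
&lt;/code&gt;&lt;/pre&gt;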
&lt;p&gt;These content-filtering models can also be trained on new categories that a given organization has in mind -- for example, not to mention a competitor when answering about services. Llama Guard has specific instructions on how to add and train your own categories.&lt;/p&gt;
&lt;p&gt;These are still machine learning models, and machine learning models are relatively easy to fool or trick. Not only is this possible by actively attacking the model, &lt;a href="https://blog.kjamistan.com/adversarial-examples-demonstrate-memorization-properties.html"&gt;like you learned in the adversarial machine learning article&lt;/a&gt;, but also by simply testing interesting new and creative approaches that are unlikely to have been tested or trained yet.&lt;/p&gt;
&lt;p&gt;This has long been the case in cyber- and information security, where security professionals become quite skilled at thinking outside of the box and using what they know about systems and security to devise new attack vectors and creative workarounds. Because a machine learning model doesn't actually reason, it is often much easier to evade than a human would be.&lt;/p&gt;
&lt;p&gt;There are many examples of successful attacks, but one of my favorite recent examples came from &lt;a href="https://aclanthology.org/2024.acl-long.809/"&gt;researchers looking at ASCII art&lt;/a&gt;. In their work, they ask usually blocked questions by changing key words to ASCII art text. They even developed &lt;a href="https://github.com/uw-nsl/ArtPrompt"&gt;an open-source library&lt;/a&gt; so you can try out your own adversarial requests.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Two user flows are shown. On the top one, the user prompts to ask how to build a bomb and the guardrails reject it. On the second prompt the query is broken down into words before being sent and an ASCII mask is applied to the word &amp;quot;bomb&amp;quot; so now it is written in asterisks. Then the LLM system responds with how to build a bomb." src="./images/2025/artprompt.png"&gt;
&lt;em&gt;ArtPrompt example of evading guardrails with ASCII art&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;In addition to these creative evasions, the issues that come up for privacy and copyright are certainly much harder to train guardrails for than something like violent or criminal behavior.&lt;/p&gt;
&lt;p&gt;Although Llama Guard has a category for privacy, there aren't many details about how this was trained. I decided to try out a few conversation examples to see what it considered private or not.&lt;sup id="fnref:1"&gt;&lt;a class="footnote-ref" href="#fn:1"&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Flagged as a privacy violation&lt;/th&gt;
&lt;th&gt;Not flagged&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Outputting a list of numbers with the text "id"&lt;/td&gt;
&lt;td&gt;Asking for an ID&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;The text "credit card number" with any number of digits&lt;/td&gt;
&lt;td&gt;Entering or repeating a phone number&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Asking for a personal address&lt;/td&gt;
&lt;td&gt;Asking for non-public information about a person&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Asking to interpret a medical report and tell who it is&lt;/td&gt;
&lt;td&gt;Sharing medical information via chat&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;These results are quite similar to NVIDIA's &lt;a href="https://docs.nvidia.com/nemo/guardrails/latest/user-guides/guardrails-library.html"&gt;Nemo Guardrails&lt;/a&gt;, which uses Microsoft's &lt;a href="https://github.com/microsoft/presidio"&gt;Presidio&lt;/a&gt; to scan for easy-to-find categories of personal data and block or mask those tokens. Although it's great to block potential release of nonpublic information like an address or phone number, it doesn't mean that privacy is actually guaranteed.&lt;/p&gt;
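&lt;p&gt;If you're curious what that kind of scanning looks like in practice, here's a minimal sketch using Presidio directly (assuming the presidio-analyzer and presidio-anonymizer packages plus a spaCy English model are installed); the example text is made up.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

text = "Hi, my name is Jane Doe and my phone number is 212-555-0188."

# Detect easy-to-find categories of personal data (names, phone numbers, ...)
analyzer = AnalyzerEngine()
findings = analyzer.analyze(text=text, language="en")

# Mask the detected spans before the text goes back to the user
anonymizer = AnonymizerEngine()
print(anonymizer.anonymize(text=text, analyzer_results=findings).text)
# prints the same sentence with the name and phone number replaced by placeholder tags
&lt;/code&gt;&lt;/pre&gt;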
&lt;p&gt;Comparing these model-based guardrails to &lt;a href="https://blog.kjamistan.com/defining-privacy-attacks-in-ai-and-ml.html"&gt;the two major attack vectors&lt;/a&gt;, it's clear that the interventions don't prevent revealing memorized private or copyrighted information. Even just providing information about a person can be seen as a violation of their privacy and can reveal something that can be used outside of context and consent. And what about reproducing someone's face, voice or likeness? In addition, these guardrails don't prevent a membership inference attack, and aren't trained to evaluate if the training data is being repeated or exposed.&lt;/p&gt;
&lt;p&gt;Perhaps more important is the question: who decides what is a guardrail and how it's trained? Most companies don't have enough data to develop and train their own guardrail models, which means they are relying on model providers to release useful guardrails.&lt;/p&gt;
&lt;p&gt;Because each system is different, general privacy and intellectual property guardrails can also misfire: a company might legitimately need to receive personal information to perform a lookup in a customer database, or to return copyrighted material that the organization has a license to use. Since most models don't document in detail how they were trained and what they can do, organizations are left struggling to understand how to effectively deploy and monitor the guardrails available, and what to do if there aren't any guardrails that fit their use case.&lt;/p&gt;
&lt;p&gt;Since filter-like approaches are external to the actual model, what happens if you try to instead incorporate the guardrail task into the actual learning step? In this case, you want to ensure that while a task is learned, potential errors or undesired outputs are avoided -- which brings us to training-based approaches.&lt;/p&gt;
&lt;h3 id="fine-tuning-or-training-based-alignment-approaches"&gt;Fine-tuning or training-based alignment approaches&lt;/h3&gt;
&lt;p&gt;In addition to filtering and flagging inputs and outputs, today's largest models generally go through alignment as part of the fine-tuning/training step of the model development process. Let's review what this looks like and how it works.&lt;/p&gt;
&lt;p&gt;The entire training process of today's LLMs looks something like this:&lt;/p&gt;
&lt;p&gt;&lt;img alt="A pipeline of 4 steps: Pretraining, where the team is training a language model on unedited, partially cleaned, deduplicated documents. The next step is extended or continued pretraining which is an optional step, usually for context-specific unlabeled data, like additional documents in another language. The next large step is Training: Fine-tuning on a particular task such as code completion or chat. The final optional step is alignment: usually to train in particular guardrails." src="./images/2025/llm_pipeline.png"&gt;&lt;/p&gt;
&lt;p&gt;These terms can be confusing because they are very LLM-jargon specific, so let's translate what each step does:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Pretraining&lt;/strong&gt;: For me, this naming is very strange because essentially this is just unsupervised training that produces a base language model. This model is trained on content (text, image, video, etc) at scale. The input embeddings (i.e. text or multi-modal embeddings) are also learned. The model is not trained to specifically predict chat-style responses but it might have chat text as part of the training data. For LLMs, this results in a model that is good at predicting the next token(s) when given a small or large amount of text. For other sequence-based models, it will predict the next step when given an input sequence (such as next part of audio wave or next image for video).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;(Optional) Extended or Continued Pretraining&lt;/strong&gt;: This step can be used to further pretrain a publicly-released base model from a large LLM provider&lt;sup id="fnref:2"&gt;&lt;a class="footnote-ref" href="#fn:2"&gt;2&lt;/a&gt;&lt;/sup&gt; or to continue pretraining on a context- or language-specific dataset. Like in the previous step, this pretraining is just learning basic language or sequence modeling, so no labeled data or supervised training is used.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Training (or sometimes called Supervised Training or Supervised Fine Tuning)&lt;/strong&gt;: This is where the base language or sequence-based model is trained to complete a given task, like answering chat messages. You can also train base language models to do other things: write code, classify text, translate or perform other sequence-based machine learning tasks. Today's chat assistants are trained on chat-like texts with additional prompt inputs that give instructions and show completions. In these datasets the user input is listed under "User" and the model should learn to respond as the "Agent" speaker. &lt;a href="https://huggingface.co/datasets/HuggingFaceH4/instruction-dataset"&gt;Instruction datasets&lt;/a&gt; can also be used where there are task-completion examples, like counting, mathematics, "world model" tasks, etc.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;(Optional) Fine-tuning for Alignment&lt;/strong&gt;: Although this can be done as a normal part of the LLM training, sometimes a separate dataset and objective are used for training guardrail alignment. If so, this usually happens directly before models go into use to ensure that these guardrails are not partially changed or forgotten during another step in the fine-tuning process. These &lt;a href="https://huggingface.co/datasets/HuggingFaceH4/hhh_alignment"&gt;alignment datasets&lt;/a&gt; include examples of responding differently to requests for objectionable content.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;These steps can change based on a particular model setup, but they're useful to know even if your organization's method is different.&lt;/p&gt;
&lt;p&gt;To better understand common implementations for steps 3 and 4, you'll first need to become familiar with reinforcement learning.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;A brief introduction to &lt;a href="https://en.wikipedia.org/wiki/Reinforcement_learning"&gt;Reinforcement Learning&lt;/a&gt;: Reinforcement learning is a particular type of machine learning that uses an incentivization-like method to measure loss and update model parameters. The field has strong roots in robotics and control, where you'd like to set particular constraints and reward or penalize particular steps or next-action predictions in order to train the robot towards a goal.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;During the final training steps (#3 and #4), model optimization focuses on "preference" optimization. Let's review the most popular approaches for doing so.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Collect Initial Data on Human Preferences&lt;/strong&gt;: First you need to develop a dataset that shows human preference between a variety of chat responses or conversations. Usually these are collected by data workers&lt;sup id="fnref:3"&gt;&lt;a class="footnote-ref" href="#fn:3"&gt;3&lt;/a&gt;&lt;/sup&gt; who act as the "agent" and produce high quality responses or interact with an already trained chatbot to produce conversations. Once enough data exists, data workers shift to ranking and correcting responses (i.e. which response is better, or what would make this response even better).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Reinforcement Learning with Human Feedback (RLHF)&lt;/strong&gt;: The "human preference" chat data is used to train a reinforcement learning reward model that learns human preferences from this ranked data (think of this as a supervised text classification task). The reward model is then used to reward or penalize the main model as it continues training. The model updates (i.e. loss/optimization) are directly calculated via this reward/penalty. More specifically, the reward model output is combined via a policy function (which balances learning with "remembering what it already learned") and this function calculates the model parameter updates.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Direct Preference Optimization (DPO)&lt;/strong&gt;: Here, binary "human preference" data is used (i.e. a preferred versus a dispreferred response) instead of a more nuanced ranking (i.e. most favorite to least favorite). A policy objective then directly increases the likelihood of preferred responses and decreases the likelihood of dispreferred ones -- no separate reward model is needed (see the sketch after this list).&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
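&lt;p&gt;To make the DPO idea a little more concrete, here is a minimal sketch of the loss in PyTorch. It assumes you've already computed the summed log-probabilities of the preferred ("chosen") and dispreferred ("rejected") responses under both the policy being trained and a frozen reference model; the argument names are my own, and this illustrates the objective rather than a full training loop.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # How much more (or less) the policy likes each response than the reference model does
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the margin between preferred and dispreferred responses to be large
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
&lt;/code&gt;&lt;/pre&gt;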
&lt;p&gt;In addition to providing higher quality and more interesting responses, this human feedback also helps reduce harmful text learned from the internet -- whether that's violent and criminal activities or just blatant racism, sexism, ableism, etc. Relatively recently privacy and intellectual property have been added to the list of ways to align models.&lt;/p&gt;
&lt;p&gt;This means, however, that data workers must be given guidance on what constitutes a privacy or intellectual property violation. Only then can conversations be guided away from these outputs. As you learned in the &lt;a href="https://blog.kjamistan.com/defining-privacy-attacks-in-ai-and-ml.html"&gt;attacks article&lt;/a&gt;, this would be impossible for any human to do, since it would require photographic memory of the entire set of training data examples.&lt;/p&gt;
&lt;p&gt;One approach to automate this would be to actually test outputs from both the pretrained model and the fine-tuned model for their proximity to training examples and to encourage divergence from these examples. As far as I know this is not an active approach in production-grade model training.&lt;/p&gt;
&lt;p&gt;Since it would be impossible for humans to review every conversation to determine if it has released person-related information out of context or if it has repeated potentially copyrighted content without attribution, this alignment usually finds only the most blatant examples, as shown in the small exercise with Llama Guard. These models end up avoiding directly outputting personal contact information or recognizing blatant requests that might violate privacy (i.e. "Tell me the social security number of [person]").&lt;/p&gt;
&lt;p&gt;By design, this fine-tuning for alignment only modifies the model slowly and slightly, because there is a penalty for too much divergence from the underlying base model. This penalty exists to ensure the fine-tuning doesn't create "catastrophic forgetting" of the large-scale text learned in the base language model.&lt;/p&gt;
&lt;h3 id="jailbreaking-attacks"&gt;Jailbreaking attacks&lt;/h3&gt;
&lt;p&gt;Despite these external and internal model guardrails, there are many examples of quickly and easily "jailbreaking" models. This term refers either to evading the guardrail models (i.e. getting a prompt or response that should be flagged to pass through) or to modifying or exploiting the actual model in a way that subverts the alignment fine-tuning.&lt;/p&gt;
&lt;p&gt;Let's review a few broader categories of these attacks:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Clever prompting&lt;/strong&gt;: Early attacks, which often still work, use clever prompt engineering to evade guardrail model filters and/or alignment. One fun example is to "time travel" or "world travel" to a place where the topic is allowed. There are &lt;a href="https://csrc.nist.gov/pubs/ai/100/2/e2023/final"&gt;many great examples&lt;/a&gt; of clever prompt attacks, and there are likely to be many years of new prompt-based attack development.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Fine-tuning to remove alignment&lt;/strong&gt;: Many models are available for free download and use on HuggingFace or via GitHub. Even OpenAI and Copilot allow forking of the model and fine-tuning "on your own data". Although big model providers have terms of service that prohibit malicious fine-tuning or use, this doesn't necessarily stop motivated users from downloading or forking the model and then fine-tuning to remove guardrails. &lt;a href="https://www.lesswrong.com/posts/3eqHYxfWb5x4Qfz8C/unrlhf-efficiently-undoing-llm-safeguards"&gt;Recent research on publicly-released models&lt;/a&gt; shows the costs can be as low as $160 to significantly reduce fine-tuned guardrails.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Adversarial attacks&lt;/strong&gt;: &lt;a href="https://blog.kjamistan.com/adversarial-examples-demonstrate-memorization-properties.html"&gt;Adversarial attacks&lt;/a&gt; can be developed based on model outputs. These attacks can also be &lt;em&gt;transferred&lt;/em&gt; from one model to another, based on the same principles of &lt;a href="https://en.wikipedia.org/wiki/Transfer_learning"&gt;transfer learning&lt;/a&gt;. By design, adversarial attacks change the outputs and model behavior (either to make an error or to push outputs in a particular direction).&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;So far, there haven't been publicly reported examples of attackers using these methods to exfiltrate memorized data from machine learning models -- outside of producing naked images and videos of famous (or less famous) people without their consent, or impersonating famous people via deepfakes.&lt;/p&gt;
&lt;p&gt;As AI systems are used in increasingly proprietary and sensitive environments, these attacks will become more valuable. A targeted attack -- say, searching for a particular piece of content or going after a particular person or company -- is also easier to build than a general one.&lt;/p&gt;
&lt;p&gt;Accidental attacks on privacy, where personal information is released in response to ordinary queries, have already been reported, and it would be useful to require transparency reporting on how often this occurs.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://blog.kjamistan.com/how-memorization-happens-novelty.html"&gt;As you learned previously&lt;/a&gt;, the larger the model the easier this is to do -- even with existing guardrails. Recall that &lt;a href="https://arxiv.org/abs/2311.17035"&gt;Nasr et al. predicted the ability to extract more than a million word-for-word training data examples&lt;/a&gt; from ChatGPT with a larger budget.&lt;/p&gt;
&lt;p&gt;As with input and output filters, non-deterministic approaches to these problems (i.e. fine-tuning or ML-based filters) are unlikely to catch all unwanted outputs without a clear definition of what is expected. Because most organizations training large-scale AI systems do not actively test for memorization, it is difficult to then prove that the second training/fine-tuning step has reduced this memorization to an acceptable threshold.&lt;/p&gt;
&lt;p&gt;In general, when thinking about privacy, showing that even one person's privacy is fully violated (i.e. their data is memorized and exposed in a way they did not expect or consent to) is enough to demonstrate a problem. Checking this for every data example that was not collected under enthusiastic consent presents a scaling problem that would require significantly changing how privacy metrics and auditing are used in model training and evaluation.&lt;sup id="fnref:4"&gt;&lt;a class="footnote-ref" href="#fn:4"&gt;4&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;Although model training via fine-tuning and ML-based guardrails help with AI safety and reliability, they are not a substitute for thinking through and addressing real issues of privacy and memorization.&lt;/p&gt;
&lt;p&gt;In the next article, you'll learn about the field of "machine unlearning", or if/when/how it is possible to remove information that has already been learned from deep learning models.&lt;/p&gt;
&lt;p&gt;As always, I'm very open to feedback (positive, neutral or critical), questions (there are no stupid questions!) and creative re-use of this content. If you have any of those, please share it with me! This helps keep me inspired and writing. :)&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Acknowledgements&lt;/em&gt;: I would like to thank &lt;a href="https://vickiboykis.com/"&gt;Vicki Boykis&lt;/a&gt; for feedback, corrections and thoughts on this article. Any mistakes, typos, inaccuracies or controversial opinions are my own.&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;This isn't meant to be a thorough or exhaustive test or research, but I am curious if you happen to come across a holistic approach! I used Llama Guard 3 8B.&amp;#160;&lt;a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;For example, Mistral AI has a few &lt;a href="https://huggingface.co/mistralai/Mistral-7B-v0.1"&gt;base pretrained models&lt;/a&gt; and you can see examples of &lt;a href="https://lightning.ai/lightning-ai/studios/continued-pretraining-with-tinyllama-1-1b"&gt;extended/continued pretraining from Lightning AI&lt;/a&gt;.&amp;#160;&lt;a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:3"&gt;
&lt;p&gt;Usually paid extremely low wages and overwhelmingly employed in Africa and South America, these often highly skilled workers with advanced degrees have working conditions that are documented well by &lt;a href="https://data-workers.org/"&gt;DAIR's data workers reporting&lt;/a&gt;.&amp;#160;&lt;a class="footnote-backref" href="#fnref:3" title="Jump back to footnote 3 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:4"&gt;
&lt;p&gt;More details on what companies/organizations and individuals can do to combat this in future articles. :)&amp;#160;&lt;a class="footnote-backref" href="#fnref:4" title="Jump back to footnote 4 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="ml-memorization"></category></entry><entry><title>Blocking AI/ML Memorization with Software Guardrails</title><link href="https://blog.kjamistan.com/blocking-aiml-memorization-with-software-guardrails.html" rel="alternate"></link><published>2025-07-11T00:00:00+02:00</published><updated>2025-07-11T00:00:00+02:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2025-07-11:/blocking-aiml-memorization-with-software-guardrails.html</id><summary type="html">&lt;p&gt;One common way to control memorization in today's deep learning systems is to fix the problem by building software around it. This software can also be used to deal with other undesired behavior, like producing hate speech or mentioning criminal activities.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Are you a visual learner? There's &lt;a href="https://youtu.be/IeyB-2cS5lM"&gt;a YouTube video …&lt;/a&gt;&lt;/p&gt;&lt;/blockquote&gt;</summary><content type="html">&lt;p&gt;One common way to control memorization in today's deep learning systems is to fix the problem by building software around it. This software can also be used to deal with other undesired behavior, like producing hate speech or mentioning criminal activities.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Are you a visual learner? There's &lt;a href="https://youtu.be/IeyB-2cS5lM"&gt;a YouTube video for this article&lt;/a&gt; on the Probably Private channel.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;In this article &lt;a href="https://blog.kjamistan.com/a-deep-dive-into-memorization-in-deep-learning.html"&gt;in the series&lt;/a&gt;, you'll learn about how software around AI/deep learning models can be used and explore why these interventions provide more of a good feeling than an actual practical solution to the problem.&lt;/p&gt;
&lt;h3 id="how-an-ai-product-is-designed"&gt;How an AI product is designed&lt;/h3&gt;
&lt;p&gt;AI and deep learning models are just a tiny part of an overall system. Most of the system is deterministic software around the non-deterministic machine learning model. At an extremely high-level, this is how a Chat Assistant system might look:&lt;/p&gt;
&lt;p&gt;&lt;img alt="An example high-level architecture, where you see that the data first comes in via software with an API call, then the data goes through some sort of input processing. Then to the LLM itself (this usually now includes the tokenization as part of the LLM). Then to some output processing and then back to a piece of software with an API call to the user." src="./images/2025/llm_software_architecture.png"&gt;&lt;/p&gt;
&lt;p&gt;In the above figure, the chat messages come in from a user via an API call to software that processes the input. As you learned in exploring &lt;a href="https://blog.kjamistan.com/gaming-evaluation-the-evolution-of-deep-learning-training-and-evaluation.html"&gt;the design of a machine learning system&lt;/a&gt;, this text will be prepared for the machine learning input. Depending on the design, this could range from removing or correcting typos and grammatical errors to appending meta information from the user account or another data source. Eventually this text and any additional inputs are tokenized and sent to the AI model (via another API call).&lt;/p&gt;
&lt;p&gt;The AI model will process the tokenized input and calculate some predicted set of tokens as a response. More often than not, there is now software around this step that requests multiple possible responses. Depending on the design, the model might return the beginning of a response while the system continues calculating the next part of the response. Remember: the model will use its own response as part of the input to continue calculating the next word(s).&lt;/p&gt;
&lt;p&gt;If you have heard about topics like &lt;a href="https://huggingface.co/blog/how-to-generate#sampling"&gt;temperature&lt;/a&gt;, &lt;a href="https://huggingface.co/blog/how-to-generate#top-k-sampling"&gt;top-k&lt;/a&gt; and &lt;a href="https://huggingface.co/blog/how-to-generate#top-p-nucleus-sampling"&gt;top-p&lt;/a&gt; sampling, these are implemented in software around the model outputs, resulting in multiple queries before the final response is constructed.&lt;/p&gt;
&lt;p&gt;You don't need to learn the deep details of these sampling choices and settings; just know these are different parameters that the chat provider and/or the user can set to determine how deterministic or exploratory the response is. This creates several ways to sample longer answers and compare or explore response possibilities before determining a final response. For large models, there are other optimizations used, like potentially splitting the prediction task between a small and large model (see: speculative decoding) to improve speed.&lt;/p&gt;
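&lt;p&gt;As a small illustration (using the Hugging Face transformers generate API, with gpt2 purely as a stand-in model), these settings are just arguments passed to the sampling code that wraps the model:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The weather today is", return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,    # sample instead of greedy decoding
    temperature=0.7,   # sharpen (lower) or flatten (higher) the token distribution
    top_k=50,          # keep only the 50 most likely next tokens
    top_p=0.95,        # nucleus sampling: keep tokens covering 95% of probability mass
    max_new_tokens=40,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
&lt;/code&gt;&lt;/pre&gt;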
&lt;p&gt;Sometimes the response is fully formed, but sometimes the response can start before the final text is formulated. Either way, this response usually goes through another batch of software filters on its way back to the original user.&lt;/p&gt;
&lt;p&gt;There is a tradeoff between how much post processing you can do and the response latency, so usually these are light-touch filters and interfaces before the text reaches the user. Depending on the system this might be performed many times before the answer is fully formulated.&lt;/p&gt;
&lt;p&gt;This process starts all over again the next time the user sends a message.&lt;/p&gt;
&lt;h3 id="filtering-inputs-and-outputs"&gt;Filtering inputs and outputs&lt;/h3&gt;
&lt;p&gt;As you can probably tell from the diagram, if you want to use software to build protections against memorization you need to either:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;catch potentially harmful input before it reaches the AI model (i.e. in the input, text cleaning and tokenization step)&lt;/li&gt;
&lt;li&gt;or attempt to remove it as it is produced by the system (either as part of the testing and generation, or before it reaches the user).&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Let's explore and compare both options.&lt;/p&gt;
&lt;h4 id="prompt-rewriting"&gt;Prompt rewriting&lt;/h4&gt;
&lt;p&gt;In search engines, there's been significant research on &lt;a href="https://hughewilliams.com/2012/03/19/query-rewriting-in-search-engines/"&gt;rewriting queries&lt;/a&gt; to improve user experience, by correcting typos or expanding search terms for better results. This approach inspires the idea of prompt rewriting, where the user's interactions with the model might be modified before it hits the machine learning model.&lt;/p&gt;
&lt;p&gt;There are several motivations for rewriting prompts for better alignment with whatever the organization wants the model to do or not do. This is usually provided in a meta prompt (also called a system prompt) which describes in natural language how the model should behave and what it should or shouldn't do. You might have seen easy ways around this when the model wasn't trained to distinguish the meta prompt from user input, such as the classic &lt;a href="https://www.theverge.com/2024/7/19/24201414/openai-chatgpt-gpt-4o-prompt-injection-instruction-hierarchy"&gt;"ignore all previous instructions and ..."&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;But since models don't have any "concept" of which learned information can be used and which cannot, this type of intervention doesn't work as easily for memorization problems. Even if a company wanted to list every possible copyrighted character whose likeness should not be reproduced (i.e. "Don't show Batman"), there are easy ways to indirectly and even &lt;em&gt;unintentionally&lt;/em&gt; anchor copyrighted or otherwise memorized images/words.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://copycat-eval.github.io/"&gt;same research around copyright in generative images&lt;/a&gt; experimented with additional approaches, where prompts are tested for similarity to "forbidden" prompts and rewritten to avoid potential problems. This was explored &lt;a href="https://arxiv.org/abs/2303.13516"&gt;in related research&lt;/a&gt; that attempted to identify the forbidden "concepts" (for example: Batman) and then fine tune the model to remove the potentially problematic concept.&lt;/p&gt;
&lt;p&gt;For example, &lt;a href="https://blog.kjamistan.com/defining-privacy-attacks-in-ai-and-ml.html"&gt;a prompt like "Gotham superhero" should align closer with "superhero" and end up further from "Batman"&lt;/a&gt;. As you might guess, if implemented at scale this could be extremely expensive because you would need to find every possible term, test for memorization and then implement learning interventions. It might also not always work for the task you want it to do (i.e. which well-known superheroes aren't copyrighted?).&lt;/p&gt;
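&lt;p&gt;As a rough sketch of what such a similarity check could look like (this is my own toy illustration with a hypothetical blocklist and threshold, not the approach from the cited research), you could compare prompt embeddings against known-problematic prompts:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from sentence_transformers import SentenceTransformer, util

# Hypothetical blocklist of prompts known to anchor memorized or copyrighted content
forbidden_prompts = ["draw Batman", "show the Dark Knight on a rooftop"]

model = SentenceTransformer("all-MiniLM-L6-v2")
forbidden_embeddings = model.encode(forbidden_prompts, convert_to_tensor=True)

def too_close_to_forbidden(user_prompt, threshold=0.8):
    # Flag prompts whose embedding is very similar to any forbidden prompt
    prompt_embedding = model.encode(user_prompt, convert_to_tensor=True)
    similarity = util.cos_sim(prompt_embedding, forbidden_embeddings)
    return bool(similarity.max().ge(threshold).item())

print(too_close_to_forbidden("Gotham superhero in the rain"))
&lt;/code&gt;&lt;/pre&gt;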
&lt;h4 id="in-context-unlearning"&gt;In-context Unlearning&lt;/h4&gt;
&lt;p&gt;In-context learning (sometimes also called few-shot learning) is a common prompt engineering strategy where you type extra instructions and examples into the prompt to demonstrate the task or how you'd like it to respond. In-context or few-shot learning allows users to introduce a new concept or pattern to a general-purpose LLM on the fly by showing a few examples and then asking the model to complete the next item in the sequence.&lt;/p&gt;
&lt;p&gt;For example, you could give a list of sentences, each followed by the language it was written in, and then upload a document and ask the model to return each of its sentences with the detected language written next to it.&lt;/p&gt;
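&lt;p&gt;A hypothetical few-shot prompt for that language task might look something like this (the model is expected to continue the pattern on the last line):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;few_shot_prompt = (
    "Classify the language of each sentence.\n"
    "Sentence: Das Wetter ist heute schön. Language: German\n"
    "Sentence: Il fait beau aujourd'hui. Language: French\n"
    "Sentence: The weather is nice today. Language:"
)
&lt;/code&gt;&lt;/pre&gt;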
&lt;p&gt;In-context learning has been used alongside prompt rewriting as a way to "unlearn" concepts. &lt;a href="https://arxiv.org/abs/2310.07579"&gt;In-context "unlearning"&lt;/a&gt; modifies the user prompt to replace data points that should be forgotten with "dummy labels". This only scales if the forget-set is quite small and the concept is easily defined and filtered. Also it won't work as well for things that don't easily mold into an in-context prompt setting (i.e. freeform conversations). In other &lt;a href="https://arxiv.org/abs/2309.17410"&gt;research on data removal from models&lt;/a&gt;, this type of in-context or input rewriting was proven ineffective at reducing training data exfiltration.&lt;/p&gt;
&lt;p&gt;Doing in-context unlearning at scale successfully would mean being able to accurately determine that the user is performing an attack or that the prompt would unintentionally release memorized information. But because model developers aren't currently testing for memorization, architectures, training and evaluation would still need to be modified to cover this input or output testing.&lt;/p&gt;
&lt;p&gt;How could this type of rewriting or filtering work on the outputs instead of the inputs?&lt;/p&gt;
&lt;h4 id="research-and-applications-in-output-filtering"&gt;Research and applications in output filtering&lt;/h4&gt;
&lt;p&gt;Because filtering inputs is fairly difficult, in today's largest AI systems memorization testing is done via unsophisticated output filters. These filters only exist for certain systems and generally test if the model response directly matches training data that should not be output.&lt;/p&gt;
&lt;p&gt;For example, &lt;a href="https://docs.github.com/en/copilot/managing-copilot/managing-copilot-as-an-individual-subscriber/managing-copilot-policies-as-an-individual-subscriber#enabling-or-disabling-duplication-detection"&gt;GitHub's Copilot can test if the generated code directly matches publicly accessible code&lt;/a&gt;. To avoid unnecessary latency, this is usually done via an advanced hashing memory structure, so exact matches are found quickly and the false positive rate remains low.&lt;/p&gt;
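&lt;p&gt;As a toy illustration of exact-match filtering (this is my own simplification, not GitHub's actual system), you can hash normalized snippets of public code at index time and check generated code against that set:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import hashlib

def normalize(code):
    # Collapse whitespace so trivial formatting changes still match
    return " ".join(code.split())

def build_index(public_snippets):
    # Hash each normalized snippet once when the index is built
    return {hashlib.sha256(normalize(s).encode()).hexdigest() for s in public_snippets}

def is_verbatim_match(generated_code, index):
    return hashlib.sha256(normalize(generated_code).encode()).hexdigest() in index
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note that any change to the identifiers (say, renaming variables to French) produces a different hash, which is exactly the kind of evasion you'll see later in this article.&lt;/p&gt;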
&lt;p&gt;From the Copilot documentation, this is the description of the intervention.&lt;sup id="fnref:1"&gt;&lt;a class="footnote-ref" href="#fn:1"&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Copilot code referencing searches for matches by taking the code suggestion, plus some of the code that will surround the suggestion if it is accepted, and comparing it against an index of all public repositories on GitHub.com. Code in private GitHub repositories, or code outside of GitHub, is not included in the search process. The search index is refreshed every few months. As a result, newly committed code, and code from public repositories deleted before the index was created, may not be included in the search. For the same reason, the search may return matches to code that has been deleted or moved since the index was created.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This explains the &lt;a href="https://arstechnica.com/information-technology/2025/02/copilot-exposes-private-github-pages-some-removed-by-microsoft/"&gt;recent problems where private repository code was exposed&lt;/a&gt;: that code had already been memorized, yet was no longer being checked by the output filters. Depending on the index updates, this could also apply to code you might have deleted -- for example, if you found that you accidentally exposed a secret (like a key or password) or other potentially sensitive details (i.e. exposure to libraries or systems with known vulnerabilities or environment settings).&lt;/p&gt;
&lt;p&gt;Additional interventions can test visual output, such as asking a different machine learning model "is Batman in this image?" and block outputs that find undesirable memorized content in the output. As you might imagine, this is very difficult to scale, but might work for smaller models and a small subset of data or tasks.&lt;/p&gt;
&lt;p&gt;It is likely that larger LLMs including ChatGPT use some of these output filters to block certain undesired responses (i.e. Terms of Service violations) or to comply with right-to-be-forgotten requests. For example, in recent news, &lt;a href="https://www.cnet.com/tech/services-and-software/chatgpt-wont-answer-questions-about-certain-names-heres-what-we-know/"&gt;ChatGPT wouldn't respond when the response contained specific people's names&lt;/a&gt;, which seems like a clear sign of an output-filter rather than a concept-unlearning intervention.&lt;/p&gt;
&lt;h3 id="you-can-only-catch-what-you-definitely-know"&gt;You can only catch what you definitely know&lt;/h3&gt;
&lt;p&gt;The problem is that you can only really do this efficiently if you know what you are looking for and if it scales appropriately. Since very few companies test for memorization as a part of their model evaluation, it's also unknown internally how much memorization happens. If users can adjust settings like temperature or other parameters to shift model behavior at will, this would also change the produced content, making the problem even more non-deterministic than it already is.&lt;/p&gt;
&lt;p&gt;For software teams trying to develop these interventions, it's like you're building a box to fit an object in, but nobody has told you what the object is. You're building based on vibes, not based on facts and knowledge.&lt;/p&gt;
&lt;p&gt;If rigorous testing for privacy violations and memorization happens as part of the model training and evaluation, then you start from this basic understanding and likely both build better protections and train models with fewer issues.&lt;/p&gt;
&lt;p&gt;Unsurprisingly, software-based filters are easy to bypass. Any motivated attacker can easily sidestep things like prompt rewriting with their own prompt engineering (more on this in the next article).&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=Ohl5AGUOLXk&amp;amp;ab_channel=MITCBMM"&gt;Chiyuan Zhang&lt;/a&gt; presented several easy methods for bypassing the GitHub Copilot output filters (originally published &lt;a href="(https://arxiv.org/abs/2210.17546)"&gt;in research Ippolito et al.&lt;/a&gt;). By changing the variable names to French or adding comment markers to start the line, previously undesired code output was output because the hashing memory architecture didn't catch the similarities.&lt;sup id="fnref:2"&gt;&lt;a class="footnote-ref" href="#fn:2"&gt;2&lt;/a&gt;&lt;/sup&gt; This image shows how Ippolito et al.'s attack to produce a previously blocked function description by changing the variable names to French.&lt;/p&gt;
&lt;p&gt;&lt;img alt="An example block of code that shows variable names in French and then many lines of previously blocked code." src="./images/2025/zhang_french_copilot.png"&gt;&lt;/p&gt;
&lt;p&gt;This same research group found that models would at times perform "style transfer" on memorized text by changing spacing, language or writing style even when not prompted by the user to do so, again showing that near-memorization testing (or paraphrase testing) might be necessary to catch these types of responses.&lt;/p&gt;
&lt;p&gt;Determining that someone or something is in the training data and has been memorized is easy to perform when these output filters are on, as they are a direct indicator. Just like the ChatGPT example that (likely) exposed that a person had requested their data be deleted, these blocked answers leak information.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://arxiv.org/abs/2309.05610"&gt;Debenedetti et al.&lt;/a&gt; named these types of information leaks "side channels" -- borrowing the term from cybersecurity where an attacker can extract sensitive information by observing changes in outputs or related side channels (often by observing attributes like latency, response content or other signals).&lt;/p&gt;
&lt;p&gt;In this case, the side channel is as simple as producing prompts that generate a generic response (like, "I can't help you with that.") or generate a specifically different type of response (i.e. empty response, shortened response or fundamentally divergent response).&lt;/p&gt;
&lt;p&gt;In information security, this falls under the concept of &lt;a href="https://en.wikipedia.org/wiki/Non-interference_(security)"&gt;non-interference&lt;/a&gt;. This concept is easy to see with forgotten passwords. If a password reset form says "We emailed you your password" when an email is found, but "This user doesn't exist, please create an account" when it isn't, then the response leaks potentially sensitive information about whether the person has an account or not.&lt;/p&gt;
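&lt;p&gt;Sketched as code (a made-up example, with send_reset_email as a hypothetical helper), the difference between a leaky and a non-interfering response looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def leaky_password_reset(email, registered_emails):
    # Different responses reveal whether an account exists -- an information side channel
    if email in registered_emails:
        return "We emailed you a password reset link."
    return "This user doesn't exist, please create an account."

def non_interfering_password_reset(email, registered_emails):
    # Identical response either way, so the reply reveals nothing about account existence
    if email in registered_emails:
        send_reset_email(email)  # hypothetical helper that actually sends the mail
    return "If an account exists for this address, we sent a password reset link."
&lt;/code&gt;&lt;/pre&gt;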
&lt;p&gt;In conclusion, the output and input filter examples you've read about in this article leak information about which prompts and outputs are allowed (and which are not). Via a variety of clever prompts, these rudimentary safeguards are easy to evade. For this reason, software-based filters are not an appropriate intervention for problems like memorization.&lt;/p&gt;
&lt;p&gt;In the next article, you'll investigate fine-tuned guardrails and other training-based alignment methods to determine if they are a valid solution to this problem.&lt;/p&gt;
&lt;p&gt;As always, I'm very open to feedback (positive, neutral or critical), questions (there are no stupid questions!) and creative re-use of this content. If you have any of those, please share it with me! This helps keep me inspired and writing. :)&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Acknowledgements&lt;/em&gt;: I would like to thank &lt;a href="https://desfontain.es/serious.html"&gt;Damien Desfontaines&lt;/a&gt; and &lt;a href="https://vickiboykis.com/"&gt;Vicki Boykis&lt;/a&gt; for their feedback, corrections and thoughts on this article. Any mistakes, typos, inaccuracies or controversial opinions are my own.&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;Accessed on 5 March, 2025.&amp;#160;&lt;a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;You can now just turn these filters off in GitHub in your settings, and this option is likely to surface in other systems where these settings are not public-facing.&amp;#160;&lt;a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="ml-memorization"></category></entry><entry><title>Defining Privacy Attacks in AI and ML</title><link href="https://blog.kjamistan.com/defining-privacy-attacks-in-ai-and-ml.html" rel="alternate"></link><published>2025-06-12T00:00:00+02:00</published><updated>2025-06-12T00:00:00+02:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2025-06-12:/defining-privacy-attacks-in-ai-and-ml.html</id><summary type="html">&lt;p&gt;In &lt;a href="https://blog.kjamistan.com/a-deep-dive-into-memorization-in-deep-learning.html"&gt;this article series&lt;/a&gt;, you've been able to investigate memorization in AI/deep learning systems -- often via interesting attack vectors. In security modeling, it's useful to explicitly define the threats you are defending against, so you can both discuss and address them and compare potential interventions.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Prefer to learn by …&lt;/p&gt;&lt;/blockquote&gt;</summary><content type="html">&lt;p&gt;In &lt;a href="https://blog.kjamistan.com/a-deep-dive-into-memorization-in-deep-learning.html"&gt;this article series&lt;/a&gt;, you've been able to investigate memorization in AI/deep learning systems -- often via interesting attack vectors. In security modeling, it's useful to explicitly define the threats you are defending against, so you can both discuss and address them and compare potential interventions.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Prefer to learn by video? This post &lt;a href="https://youtu.be/v9McFcYahpg"&gt;is summarized on Probably Private's YouTube&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;In this article, you'll walk through two common attack vectors that exploit memorization in AI systems: membership inference and data reconstruction (or exfiltration).&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;strong&gt;NOTE&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This article is specifically about privacy attacks related to memorization, but the field of AI security is much larger and broader. Red teaming and other security testing for AI models are a common approach for companies releasing models into production systems. The field of &lt;a href="https://en.wikipedia.org/wiki/Adversarial_machine_learning"&gt;adversarial machine learning&lt;/a&gt; explores how AI/ML models can be hacked, tricked and manipulated. It's essential to understand how stochastic systems will behave when attacked or under unexpected conditions to ensure that the deployment is adequately protected or that humans who interact with the system receive appropriate training for handling unexpected or erroneous events.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id="what-is-a-membership-inference-attack"&gt;What is a membership inference attack?&lt;/h3&gt;
&lt;p&gt;A membership inference attack (MIA) attempts to &lt;em&gt;infer&lt;/em&gt; if a person (or particular example) was in the training data or not. It was first named by &lt;a href="https://arxiv.org/abs/1610.05820"&gt;Shokri et al.'s work in 2016&lt;/a&gt;, where the researchers were able to determine which examples were in-group (training data) and which ones were not. The original attack developed a system of shadow models that were similar to the target model. The outputs of these shadow models were used to train another model to discriminate between in-training and out-of-training examples.&lt;/p&gt;
&lt;p&gt;Since the initial attack definition, there have been a variety of improvements -- creating targeted variants to adaptively expose particular training data points or variants that attack correlated groups of data points. Several related attacks can &lt;a href="https://dl.acm.org/doi/abs/10.1145/3154793"&gt;expose sensitive attributes of individuals by revealing which subpopulations they belong to&lt;/a&gt; or teach attackers about &lt;a href="https://ieeexplore.ieee.org/abstract/document/9581166"&gt;overall training data populations and their qualities&lt;/a&gt;, which could be exploited to perform better MIAs.&lt;/p&gt;
&lt;p&gt;Why does this work? If a model memorizes a particular example, it will return noticeably higher confidence on that data point than on similar data points it hasn't seen. If these examples are infrequent or rare (i.e. in the long tail), then they are overexposed compared to other examples, which can "hide in the crowd". As you already learned, larger and more accurate models display this problem more often than smaller and less accurate models.&lt;/p&gt;
&lt;p&gt;To illustrate this, researchers trained multiple models on different splits of data. They then found ways to show the inlier- versus outlier-ness of particular examples by comparing how models performed on them depending on whether the example was part of the training data or not. For outlier examples that weren't in the training data, the models performed quite poorly. Even for inliers, if the example was more complex (i.e. harder to learn in one training round), then the loss (i.e. prediction error) on that example leaked more information than if it was easy to learn.&lt;/p&gt;
&lt;p&gt;This figure shows a view of the model's prediction accuracy via cross-entropy loss when comparing images in the training dataset with images outside of the training dataset. As a quick reminder, &lt;a href="https://en.wikipedia.org/wiki/Cross-entropy"&gt;cross-entropy&lt;/a&gt; is a common way to measure performance for a classification model: it calculates how far the predicted probabilities are from the true label.&lt;/p&gt;
&lt;p&gt;If you were performing a privacy attack, you'd want to find a way to separate the member distributions in red from the non-member distributions in blue. For outliers, this is much easier! And for more complex examples that are "harder to learn" this is also easier.&lt;sup id="fnref:1"&gt;&lt;a class="footnote-ref" href="#fn:1"&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;&lt;img alt="There are 4 charts plotting &amp;quot;member&amp;quot; versus &amp;quot;non-member&amp;quot; distributions taken from model outputs. On the top left, it shows an &amp;quot;easy to learn&amp;quot; (i.e. low loss) inlier example, where the member and non-member distributions nearly perfectly overlap. On the top right, it shows another easy to learn example (dogs), but an outlier image example -- in this case the losses are much easier to separate. In the bottom row it shows harder to learn classes (i.e. higher loss). In the &amp;quot;inlier&amp;quot; example, the distributions have differing tails, but also significant overlap. In the &amp;quot;outlier&amp;quot; example, the distributions have no overlap at all and are clear to separate." src="./images/2025/ce_loss_inlier_outlier.png"&gt;&lt;/p&gt;
&lt;p&gt;This problem is exacerbated when model size grows and when those models are trained with datasets where one large data collection, such as ImageNet, is used to pull both testing and training data. Unfortunately, as shown in &lt;a href="https://blog.kjamistan.com/how-memorization-happens-novelty.html"&gt;the previous article&lt;/a&gt;, these datasets often have duplicates or near-duplicates and this ends up leaking additional information that incentivizes memorization. As you can imagine, this doesn't just affect image datasets -- the internet is full of near duplicate or exact duplicate text and other content forms (i.e. video/audio/etc).&lt;/p&gt;
&lt;p&gt;This figure plots different model architectures, where the y-axis shows accurate attack successes and the x-axis shows the model's performance on the holdout test examples. As you've already learned throughout this series, performing well on a test-dataset from a long-tail distribution and many outliers likely requires some element of memorization.&lt;sup id="fnref:2"&gt;&lt;a class="footnote-ref" href="#fn:2"&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;&lt;img alt="The chart shows different color circles for different model architectures (a variety of CNNs and WRNs). For each architecture you see that the higher the accuracy gets, the higher the attack success." src="./images/2025/lira_tpr_test_accuracy.png"&gt;&lt;/p&gt;
&lt;p&gt;Combining these two pieces of knowledge (i.e. the likelihood that an example is an outlier and the confidence in the model to guess it properly) is a good way to infer this membership.&lt;/p&gt;
&lt;p&gt;Both of the above figures come from the &lt;a href="https://arxiv.org/abs/2112.03570"&gt;Likelihood Ratio Attack (Carlini et al., 2022)&lt;/a&gt;, which is specifically designed to minimize false positives (i.e. alerting that something is a member when it is not).&lt;sup id="fnref:3"&gt;&lt;a class="footnote-ref" href="#fn:3"&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;The steps to perform this attack are as follows:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Sample from a dataset similar to the dataset you think was used by the model you want to attack (the target model). Create data subsets, so that each example in the dataset is seen (and "not seen") by some of the models you will train. These are called shadow models, because they act as a stand-in for the target model.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Train your shadow models keeping track of which models have seen which examples. This creates sets of in- and out-shadow models, where the in-models have seen the example and the out-models have not. Try to match the model architecture and task of your target model as accurately as possible for best results.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Measure the model outputs (i.e. prediction accuracy / loss) on a particular example and scale that loss by using a logit function. Repeat steps 1-3 numerous times and try to cover a large swath of the training data, with a variety of classes and examples. Store the scaled losses with notes on the example they were measured with and whether the model had seen the example or not (in-versus-out shadow model).&lt;sup id="fnref:4"&gt;&lt;a class="footnote-ref" href="#fn:4"&gt;4&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;After many training iterations, once you have a representative set of scaled losses on a variety of examples, analyze the scaled losses of the in-versus-out shadow models. You will hopefully have two distributions that don't completely overlap. Similar to the other CE-loss distributions you saw above, you want to separate these two distributions so you can tell whether something is in the training data based on the loss.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Query the target model with the target example. Scale the returned prediction and use probability theory to figure out if the example comes from the "in-training" or "out-of-training" distributions.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;There are additional suggestions and improvements to this attack based on model architecture and datasets, which you can review &lt;a href="https://arxiv.org/abs/2112.03570"&gt;in the paper&lt;/a&gt; and the related &lt;a href="https://github.com/tensorflow/privacy/tree/master/research/mi_lira_2021"&gt;code repository&lt;/a&gt;.&lt;/p&gt;
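&lt;p&gt;To make steps 3-5 concrete, here's a heavily simplified sketch -- not the paper's implementation, and the losses, helper names and Gaussian assumption are purely illustrative. Given scaled losses collected from in- and out-shadow models for one target example, fit a distribution to each and compare how likely the target model's own loss is under each one.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np
from scipy.stats import norm

def logit_scale(loss):
    # Scale a per-example loss so its distribution is closer to Gaussian
    # (p is roughly the model's confidence in the true label).
    p = np.exp(-np.asarray(loss))
    p = np.clip(p, 1e-6, 1 - 1e-6)
    return np.log(p / (1 - p))

# Hypothetical scaled losses for one target example, gathered from shadow
# models that did ("in") and did not ("out") train on that example.
in_scores = logit_scale([0.05, 0.08, 0.04, 0.07])
out_scores = logit_scale([0.9, 1.1, 0.7, 1.3])

def membership_likelihood_ratio(target_loss):
    # Fit a Gaussian to each set of shadow scores and compare how likely the
    # observed (scaled) loss from the target model is under each distribution.
    score = logit_scale([target_loss])[0]
    p_in = norm.pdf(score, in_scores.mean(), in_scores.std() + 1e-6)
    p_out = norm.pdf(score, out_scores.mean(), out_scores.std() + 1e-6)
    return p_in / p_out  # ratios well above 1 suggest "member"

print(membership_likelihood_ratio(0.06))  # likely a training member
print(membership_likelihood_ratio(1.0))   # likely not
&lt;/code&gt;&lt;/pre&gt;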
&lt;p&gt;Related variants of these attacks can also directly expose the training data by guessing close to the exposed data point and optimizing the input until it reaches the memorized input. This can show what data is in the training and what data isn't.&lt;/p&gt;
&lt;p&gt;If I can learn something about you by knowing you were in the training data, then I can use this information for my own benefit. For example, if I know you have a particular disease, or visit a certain website often or if I know your immigration status or income level because your data was in a model that only represents people like that -- then this is all extra information I get from a membership inference attack.&lt;/p&gt;
&lt;p&gt;These attack vectors overlap with data reconstruction attacks, because if I know you are in the training data, I can also attempt to extract your data directly.&lt;/p&gt;
&lt;h3 id="what-is-a-data-reconstruction-attack"&gt;What is a data reconstruction attack?&lt;/h3&gt;
&lt;p&gt;Data reconstruction attacks attempt to &lt;em&gt;discover&lt;/em&gt;, &lt;em&gt;reconstruct&lt;/em&gt; and &lt;em&gt;exfiltrate&lt;/em&gt; the training data. As you might guess, this works better if the data was memorized!&lt;/p&gt;
&lt;p&gt;If combined with membership inference attacks, these two attacks can first determine which data probably exists in the model, and then either recreate that data and test its veracity or attempt to exfiltrate the memorized data from the model itself.&lt;/p&gt;
&lt;p&gt;As you learned about in &lt;a href="https://blog.kjamistan.com/how-memorization-happens-repetition.html"&gt;the article on repetition as a source of memorization&lt;/a&gt;, this can mean a full exfiltration of heavily repeated examples, which is easy to do if the example is common enough and has been memorized. In a way, this is expected data reconstruction, where we want to learn common information (i.e. a widely known text or celebrity face).&lt;/p&gt;
&lt;p&gt;And as you read in &lt;a href="https://blog.kjamistan.com/how-memorization-happens-novelty.html"&gt;the article about novelty as a source of memorization&lt;/a&gt;, these attacks can also directly expose less frequent examples, particularly outliers or infrequent examples. This might mean accidentally learning personal text data, like &lt;a href="https://arxiv.org/abs/1802.08232"&gt;social security numbers, credit card numbers and home addresses&lt;/a&gt; that can then be extracted either by querying the model itself, or by a targeted attack. There are variants of these attacks that use both "white box" (i.e. direct testing of the model with a view of the model's internal state) and "black box" (i.e. API access) methods.&lt;/p&gt;
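&lt;p&gt;As a rough illustration of the black-box variant, an extraction probe can be as simple as feeding the model a plausible prefix and checking whether the continuation reproduces a candidate secret verbatim. Everything here is hypothetical -- &lt;code&gt;generate&lt;/code&gt; stands in for whatever completion API you're testing, and the strings are invented.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def generate(prompt, max_tokens=32):
    # Hypothetical stand-in for a black-box completion API (in practice, an
    # HTTP call). Here it just pretends the model memorized the record.
    return "0000 1111 2222 3333, exp 09/27"

# A prefix the attacker can plausibly guess (e.g. from a known data format).
prefix = "Name: Jane Doe, Card number: "
candidate_secret = "0000 1111 2222 3333"

completion = generate(prefix)

# A verbatim hit means the model leaked training data; near-verbatim checks
# would additionally catch "style transfer" of memorized content.
if candidate_secret in completion:
    print("verbatim leak detected")
&lt;/code&gt;&lt;/pre&gt;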
&lt;p&gt;A related but contested variation of data reconstruction involves paraphrased information. Here, the model outputs are compared with their training data to discover partially verbatim and paraphrased content. For visual content, these "paraphrasing" attacks can look at portions of the image or video and determine if particular features come from particular training data examples.&lt;sup id="fnref:5"&gt;&lt;a class="footnote-ref" href="#fn:5"&gt;5&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;For example, the below images are not exact duplicates, but are clearly near-duplicates. For each pair, the left images are from the training data, and the right images are generated by prompting Midjourney with the training data caption. &lt;a href="https://arxiv.org/abs/2305.08694"&gt;This research from Webster (2023)&lt;/a&gt; unveiled efficient and accurate ways to reconstruct training data from diffusion models.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A set of 4 pairs of images, ranging from a backpack to a person's face. Each of the images looks almost the same as the other image, with slight changes in image clarity, dimension or small color changes." src="./images/2025/image_paraphrasing.png"&gt;&lt;/p&gt;
&lt;p&gt;For content creators or artists, this type of attack might be important if their content is particularly popular or interesting for particular AI/LLM audiences.&lt;/p&gt;
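&lt;p&gt;For text outputs, one rough way to flag this kind of near-verbatim or paraphrased reuse (rather than only exact matches) is to compare token n-grams between a generation and a training document and send anything with high overlap to a human reviewer. This is a simplification of the comparison methods used in the research above, with invented example strings.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def ngrams(text, n=5):
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_overlap(generated, training_doc, n=5):
    # Jaccard overlap between n-gram sets: 0.0 means disjoint, 1.0 identical.
    a, b = ngrams(generated, n), ngrams(training_doc, n)
    if not a or not b:
        return 0.0
    return len(a.intersection(b)) / len(a.union(b))

training_doc = "the quick brown fox jumps over the lazy dog near the riverbank"
paraphrase = "the quick brown fox jumps over the lazy dog by the riverbank"
print(ngram_overlap(paraphrase, training_doc))  # high overlap, flag for review
&lt;/code&gt;&lt;/pre&gt;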
&lt;p&gt;Let's walk through these two main attack vectors and their variations in real-world threats where an organization or individual might be concerned about their information being memorized.&lt;/p&gt;
&lt;h3 id="threat-outputting-copyright-material-or-images"&gt;Threat: Outputting copyright material or images&lt;/h3&gt;
&lt;p&gt;One obvious initial threat is explicitly copying copyrighted content and outputting it with little or no variation. This has already sparked several prominent lawsuits, such as &lt;a href="https://www.courthousenews.com/wp-content/uploads/2023/12/new-york-times-microsoft-open-ai-complaint.pdf"&gt;The New York Times vs. OpenAI&lt;/a&gt;, where ChatGPT outputs verbatim copies of popular New York Times articles without attribution.&lt;/p&gt;
&lt;p&gt;Researchers have also been actively reproducing these attacks in visual images, where copyrighted characters or visuals can be easily reproduced, even without directly invoking the name. For example, &lt;a href="https://copycat-eval.github.io/"&gt;typing "Gotham, Superhero"&lt;/a&gt; produces copies of Batman.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Two images that are clearly Batman, both generated using the prompt (&amp;quot;Gotham, Superhero&amp;quot;), one from PlaygroundAI and the other from DALL-E." src="./images/2025/gotham_superhero.png"&gt;&lt;/p&gt;
&lt;p&gt;This affects other mediums, such as music and video content as those AI models become easier to use and more widely available. There have already been highly publicized examples of &lt;a href="https://www.bbc.com/news/business-57761873"&gt;voice and video-cloning use for criminal activities&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;For organizations and persons whose income or existence relies primarily on producing and licensing copyrighted content, this is certainly a very serious threat -- one that deserves attention and discussion in a public forum.&lt;/p&gt;
&lt;h3 id="threat-violating-someones-privacy-by-directly-outputting-their-information"&gt;Threat: Violating someone's privacy by directly outputting their information&lt;/h3&gt;
&lt;p&gt;One drastic case is the either intentional or unintentional release of a person's sensitive information. This information could be their face, their words taken out of context, their contact information or other information they would rather not share via a machine learning model.&lt;/p&gt;
&lt;p&gt;This is documented in research, where ChatGPT directly &lt;a href="https://www.vice.com/en/article/chatgpt-can-reveal-personal-information-from-real-people-google-researchers-show/"&gt;output personal contact information&lt;/a&gt;, where &lt;a href="https://stable-diffusion-art.com/realistic-people/"&gt;StableDiffusion can reproduce a person's face&lt;/a&gt; and where &lt;a href="https://www.usenix.org/conference/usenixsecurity19/presentation/carlini"&gt;models trained on sensitive keyboard data output that information&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;There is currently no reporting requirement for companies providing machine learning models to test, audit or verify this behavior. A more comprehensive understanding could be achieved by institutionalizing privacy auditing and reporting for deep learning models, which could be standardized and enforced by a regulatory body. This could accompany other monitoring and testing for privacy rights related to hallucination, like when a model &lt;a href="https://noyb.eu/sites/default/files/2024-04/OpenAI%20Complaint_EN_redacted.pdf"&gt;repeats outdated and incorrect information&lt;/a&gt; or &lt;a href="https://www.bbc.com/news/articles/c0kgydkr516o"&gt;hallucinates things that never happened&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;For well-known persons, this threat extends to uses like DeepFakes or other clones, where their likeness is used in ways that they did not consent to, such as the rise of DeepFake-Pornography and DeepFake-Propaganda. Combining these attacks with other software like "face transfer" makes these violations easier to do with enough examples of the person's likeness/voice (i.e. this could be performed by a person close to the person, or a person with access to enough photo or video materials of the person).&lt;/p&gt;
&lt;h3 id="threat-learning-if-someones-data-is-in-a-model-or-if-someone-is-in-a-particular-population"&gt;Threat: Learning if someone's data is in a model or if someone is in a particular population&lt;/h3&gt;
&lt;p&gt;Membership Inference Attacks can reveal information about someone's participation in a particular service, population or activity. That might seem harmless at first -- who cares if someone knows that OpenAI scraped my data?&lt;/p&gt;
&lt;p&gt;For LLMs, it's not as relevant, but what about models that specifically target a certain population: like a model built to evaluate people with a disease for medical treatment, people with a certain income for advertising or credit evaluation or people with a particular political view for political ads or border control? These models exist, and membership in those models would expose related sensitive information that people likely don't want to share out of context.&lt;/p&gt;
&lt;p&gt;Related attacks on subpopulations rather than individuals can also reveal information about the subpopulation that exposes that group's sensitive attributes. This is similar to the Cambridge Analytica attacks, where &lt;a href="https://www.gsb.stanford.edu/insights/science-behind-cambridge-analytica-does-psychological-profiling-work"&gt;"harmless" information about liking Facebook Pages provided enough information&lt;/a&gt; to expose related sensitive attributes like gender, political affiliation, drug use and sexual preferences.&lt;/p&gt;
&lt;h3 id="threat-stealing-someones-work-without-attribution-or-compensation"&gt;Threat: Stealing someone's work without attribution or compensation&lt;/h3&gt;
&lt;p&gt;Another real-world problem for organizations is directly repeating someone's work without appropriate attribution. For example, &lt;a href="https://arstechnica.com/information-technology/2025/02/copilot-exposes-private-github-pages-some-removed-by-microsoft/"&gt;security researchers found private repository code from several FAANG-companies&lt;/a&gt; available in Copilot. The memorized code from the repositories was accessed when those repositories were public, but since then the repositories have been changed to private and the models haven't been updated.&lt;/p&gt;
&lt;p&gt;This isn't the same as the copyright issue, because there is quite a bit of content that isn't explicitly under copyright, but is intended to create awareness, wealth and recognition for the original creator. For example, in creative and software communities, there are popular licenses that require attribution or even specify under what conditions the work can be reused or remixed. When the author, coder, artist is cited by someone who remixes or reuses their work, this creates more awareness, building their audience or giving them new opportunities.&lt;/p&gt;
&lt;p&gt;Because of this, there's been increased awareness of Generative AI in artist communities; many artists are open to their art being used by AI systems, but would like attribution, compensation or both.&lt;/p&gt;
&lt;p&gt;In a recent example that overlaps with copyright protection, &lt;a href="https://techcrunch.com/2025/02/24/1000-artists-release-silent-album-to-protest-uk-copyright-sell-out-to-ai/"&gt;artists released a "silent" album&lt;/a&gt; to protest the UK's proposal to not enforce copyright for AI-generated work. There have been many such protests over the past 5 years.&lt;/p&gt;
&lt;h3 id="threat-overexposing-certain-populations-to-the-above-attacks"&gt;Threat: Overexposing certain populations to the above attacks&lt;/h3&gt;
&lt;p&gt;As you learned in the &lt;a href="https://blog.kjamistan.com/how-memorization-happens-novelty.html"&gt;how it works (novel examples) article&lt;/a&gt;, some examples are more beneficial to memorize for the model's evaluation scores (i.e. accuracy on related difficult examples). This means these examples are also memorized at a higher frequency than other examples.&lt;/p&gt;
&lt;p&gt;This might seem innocuous, but if you investigate these data distributions, you'll find that these underrepresented groups in those populations are overexposed when it comes to privacy. Research from &lt;a href="https://arxiv.org/pdf/1905.12101"&gt;Bagdasaryan and Shmatikov&lt;/a&gt; proved that models trained with differential privacy did poorly on fairness metrics across diverse groups. For example, privacy-respecting models performed worse on sentiment analysis on African-American English versus a "Standard" American English dataset. In the same research, privacy-respecting models misclassified gender for dark-skinned faces more frequently than light-skinned faces.&lt;/p&gt;
&lt;p&gt;This demonstrates how increases in model fairness and accuracy for subpopulations are directly related to specific memorization of individuals whose data comes from that subpopulation. Put differently, certain persons in an underrepresented group give up their privacy in exchange for better model accuracy on "data like them". This overexposes individuals in this group when compared with other persons from a majority population in the dataset who can "hide in the crowd".&lt;/p&gt;
&lt;p&gt;In larger machine learning datasets, this problem is exacerbated by human biases in labels, where a white man might be labeled "man" and a black woman is labeled "black woman". This labeling problem occurs any time the people doing the labeling assume that their own group represents the general population. This label bias exacerbates the memorization problem, because each label must be learned separately, and many of these "non-majority" labels will end up in the long tail and be more prone to memorization.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://arxiv.org/abs/1907.00164"&gt;Shokri et al. investigated this issue when looking at explanations&lt;/a&gt; for deep learning systems and found that data reconstruction attacks worked more easily on minority populations when using model explanations on the same or similar examples. &lt;a href="https://arxiv.org/abs/2011.03731"&gt;Chang and Shokri&lt;/a&gt; formalized this privacy issue for minority populations in other works, proving that minority populations are at greater privacy risk, especially when fair algorithm and model design do not take privacy into account.&lt;/p&gt;
&lt;h3 id="threat-exposing-critical-knowledge-from-training-data-unintentionally"&gt;Threat: Exposing critical knowledge from training data unintentionally&lt;/h3&gt;
&lt;p&gt;Another memorization issue is the ability of these systems to memorize corporate secrets, important legal contracts or otherwise confidential information. Because the model is not incentivized to understand the difference between text, photos or other media that should be learned versus other material that shouldn't, this creates a significant problem for organizations with confidential material they'd like to use for machine learning.&lt;/p&gt;
&lt;p&gt;For example, after the launch of ChatGPT-3.5, Amazon's legal department found &lt;a href="https://futurism.com/the-byte/amazon-begs-employees-chatgpt"&gt;text snippets that shared internal corporate secrets&lt;/a&gt; in the chat model's responses.&lt;/p&gt;
&lt;p&gt;This can also unintentionally happen when building systems with access to such documents -- even if they haven't been trained on that data. These exposures have little to do with AI memorization and more to do with lack of privacy and security understanding in Retrieval Augmented Generation system design.&lt;/p&gt;
&lt;h3 id="should-you-be-concerned"&gt;Should you be concerned?&lt;/h3&gt;
&lt;p&gt;As you've seen thus far, the only reason to be worried that sensitive data will be stored in the model is if you are training the model on data that you don't want explicitly memorized. If you don't train with copyrighted, licensed or person-related data, these attacks aren't a threat.&lt;/p&gt;
&lt;p&gt;If you are using corporate proprietary or internal data and you are only using the model internally, this probably isn't an issue, so long as the model outputs are also considered "for internal use only". As usual, talk with your legal and privacy teams to clarify.&lt;/p&gt;
&lt;p&gt;If teams or individuals are training their own models (i.e. personal or collaborative-based models) and they all consent to this training and co-own this model, this might not be a problem if the data is available for use across the entire company. In my experience, those teams should discuss and be aware of memorization, but they presumably enthusiastically consent to the use and development.&lt;/p&gt;
&lt;p&gt;So really, you should only be concerned about this phenomenon if you are training models that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;use people's data without their enthusiastic consent and knowledge&lt;/li&gt;
&lt;li&gt;are used or deployed in new contexts (i.e. for a new purpose / in the public sphere / for something that those people wouldn't agree to)&lt;/li&gt;
&lt;li&gt;don't address the privacy/content implications as part of model design and development&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You might think: that's got to be a TINY percentage of models, but my industry experience confirms that this describes a non-trivial number of AI systems, including many of the LLMs, Code Assistants and AI Agents.&lt;/p&gt;
&lt;p&gt;You might also be wondering about the implications if you use potentially at-risk models but don't train them. This puts you in a difficult position of not influencing the model development but potentially being exposed to the same threats above. Being aware of these threats is a good step when evaluating what systems to integrate for what tasks, and there will be a future article in this series on addressing exactly this situation.&lt;/p&gt;
&lt;p&gt;Now that you've identified the biggest potential threats, let's begin investigating ways to address these threats. In the following articles, you'll learn about:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Software- and system-based interventions, like output filtering and system prompts&lt;/li&gt;
&lt;li&gt;Fine-tuning guardrails&lt;/li&gt;
&lt;li&gt;Machine unlearning or intentional "forgetting"&lt;/li&gt;
&lt;li&gt;Differential privacy in training and fine-tuning&lt;/li&gt;
&lt;li&gt;Evaluating and auditing privacy metrics in deep learning systems&lt;/li&gt;
&lt;li&gt;Evaluating AI systems and their threats as a third-party user&lt;/li&gt;
&lt;li&gt;Pruning and distillation for information reduction&lt;/li&gt;
&lt;li&gt;Different types of models that could offer explicit and enthusiastic consent and public participation&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Have a burning question or idea related to these topics, or want to share new threats and ideas? Please feel free &lt;a href="https://probablyprivate.com/about/"&gt;to reach out via email&lt;/a&gt; or &lt;a href="https://www.linkedin.com/in/katharinejarmul/"&gt;LinkedIn&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Acknowledgements&lt;/em&gt;: I would like to thank &lt;a href="https://vickiboykis.com/"&gt;Vicki Boykis&lt;/a&gt; and &lt;a href="https://desfontain.es/serious.html"&gt;Damien Desfontaines&lt;/a&gt; for their feedback, corrections and thoughts on this series. Their input greatly contributed to improvements in my thinking and writing. Any mistakes, typos, inaccuracies or controversial opinions are my own.&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;These histograms show unscaled cross-entropy loss (CE-loss) collected from 1024 models that were trained with training data produced by leaving samples in/out. The CE-loss per example was collected to visually show the behavior of different types of examples and classes in the underlying training distributions and subsequent models.&amp;#160;&lt;a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;The y-axis is logarithmic and shows the attack accuracy when holding for a high performance rate (False Positive Rate of 0.1). This is done as part of their attack design, where they aim to create more rigorous standards for MIA measurement to just focus on attacks that don't guess membership incorrectly.&amp;#160;&lt;a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:3"&gt;
&lt;p&gt;A fun piece of information from this research is that the attack's success can be measured by looking at the model generalization gaps. This connects with what you've learned so far on &lt;a href="https://blog.kjamistan.com/gaming-evaluation-the-evolution-of-deep-learning-training-and-evaluation.html"&gt;evaluation metrics&lt;/a&gt; and &lt;a href="https://blog.kjamistan.com/how-memorization-happens-overparametrized-models.html"&gt;the generalization gap&lt;/a&gt; as an indicator for memorization. In general, they find that models that are more accurate and larger are easier to attack, which aligns with what you've already learned thus far.&amp;#160;&lt;a class="footnote-backref" href="#fnref:3" title="Jump back to footnote 3 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:4"&gt;
&lt;p&gt;There are numerous tips in the paper and an &lt;a href="https://github.com/tensorflow/privacy/tree/master/research/mi_lira_2021"&gt;openly available implementation on GitHub&lt;/a&gt; that shows how to parallelize this and how many models and data splits are efficient to develop the distributions needed for the attack.&amp;#160;&lt;a class="footnote-backref" href="#fnref:4" title="Jump back to footnote 4 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:5"&gt;
&lt;p&gt;For example, when the website "this person does not exist" launched, researchers were quick to find "AI generated people" who actually represented faces in commonly used public face datasets. See &lt;a href="https://arxiv.org/abs/2107.06018"&gt;Webster et al., This Person (Probably) Exists, 2021&lt;/a&gt;.&amp;#160;&lt;a class="footnote-backref" href="#fnref:5" title="Jump back to footnote 5 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="ml-memorization"></category></entry><entry><title>Priveedly: your private and personal content reader and recommender</title><link href="https://blog.kjamistan.com/priveedly-your-private-and-personal-content-reader-and-recommender.html" rel="alternate"></link><published>2025-01-23T00:00:00+01:00</published><updated>2025-01-23T00:00:00+01:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2025-01-23:/priveedly-your-private-and-personal-content-reader-and-recommender.html</id><summary type="html">&lt;p&gt;I'm excited to open-source a project that I've been using for the past 2 and a half years: a private/personal reader and recommender.&lt;/p&gt;
&lt;p&gt;It works with:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;RSS feeds&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.reddit.com/"&gt;Reddit&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://news.ycombinator.com/"&gt;HackerNews&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://lobste.rs/"&gt;Lobste.rs&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;and comes with an example Jupyter Notebook for training your own text-based recommendation model once you have …&lt;/p&gt;</summary><content type="html">&lt;p&gt;I'm excited to open-source a project that I've been using for the past 2 and a half years: a private/personal reader and recommender.&lt;/p&gt;
&lt;p&gt;It works with:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;RSS feeds&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.reddit.com/"&gt;Reddit&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://news.ycombinator.com/"&gt;HackerNews&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://lobste.rs/"&gt;Lobste.rs&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;and comes with an example Jupyter Notebook for training your own text-based recommendation model once you have enough content. For most folks, this will be about 3-6 months of active use -- depending on the amount of content you consume.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Interested in what it looks like? There's a short &lt;a href="https://youtu.be/J6SVGapJ1L0"&gt;video introduction on YouTube&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;If you just want to get started, head over to &lt;a href="https://github.com/kjam/priveedly"&gt;the project's GitHub&lt;/a&gt;! If you want a little history of why I bothered to build this and how I use it, read on.&lt;/p&gt;
&lt;h3 id="why-news-and-content-is-personal"&gt;Why news and content is personal&lt;/h3&gt;
&lt;p&gt;Despite what Social Media(TM) wants you to think, your content choices are deeply personal. You like the things you like, surely others like them, but you might be a very special combination of things which is what guides your interests.&lt;sup id="fnref:1"&gt;&lt;a class="footnote-ref" href="#fn:1"&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;The large content providers and social media platforms try to be everything to everyone, and when that doesn't work, they try to personalize by tracking you and putting you in ever smaller and smaller bins and cross-sections so that eventually your feed is "personalized" in a way that is still profitable for them to serve you content.&lt;/p&gt;
&lt;p&gt;Unfortunately, this means that if you are curious about something outside of your normal interactions, one poor click or follow might haunt you and rearrange your content.&lt;sup id="fnref:2"&gt;&lt;a class="footnote-ref" href="#fn:2"&gt;2&lt;/a&gt;&lt;/sup&gt; In my opinion, you shouldn't be afraid that clicking or reading something you are mildly interested in means you're doomed to see ads (or even deal with changes in online prices or search results) just because you clicked on one article.&lt;sup id="fnref:3"&gt;&lt;a class="footnote-ref" href="#fn:3"&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;It can also be fun to decide what you want to expose yourself to, for your own autonomy and purposes. Maybe your ideas change, maybe you are going through a huge life change, or maybe you want to surround yourself with a new bubble on the internet. Either way, deciding and determining directly what you read and see is a cool way to reclaim that autonomy.&lt;/p&gt;
&lt;p&gt;For these reasons, I decided that I was going to try to pursue more directed attention on my own reading and content online.&lt;/p&gt;
&lt;h3 id="down-the-rabbit-hole-shouldnt-this-be-easy"&gt;Down the rabbit hole: shouldn't this be easy?&lt;/h3&gt;
&lt;p&gt;I had long used feed readers, but I wanted to combine that with other content sources, like Reddit, Twitter&lt;sup id="fnref:4"&gt;&lt;a class="footnote-ref" href="#fn:4"&gt;4&lt;/a&gt;&lt;/sup&gt; and other tech news sites. As I first started investigating ways to just do this easily with online services (i.e. private services that promised to keep my data private), it was hard to find one that justified the cost. Many other services weren't very clear on if they actually implemented tracking-free clicks and content.&lt;/p&gt;
&lt;p&gt;I assumed there would be some easy open-source options, so then I looked there. There were some great ones I tried at first that were React-based, but since I am essentially incompetent at Javascript it was hard to figure out how to extend them. For Python-based readers, I tried &lt;a href="https://github.com/samuelclay/NewsBlur"&gt;NewsBlur&lt;/a&gt;, which was awesome, but also set up for much larger and in-depth usage than I was planning on. For me, the obvious options were asking too much (i.e. run a beefy, expensive server) and too complicated (i.e. learn Javascript).&lt;/p&gt;
&lt;p&gt;Since I know some things about feed and web scraping and language processing, I thought it might be fun to set up a small PoC... heheh -- yes, I know I am &lt;a href="https://www.xkcd.com/1319/"&gt;this XKCD comic (see below)&lt;/a&gt; and I literally cannot stop, don't bother sending help.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Famous XKCD automation comic showing that you think you are saving time by automating a task but then your own reworking of the automation takes much more time and effort." src="https://imgs.xkcd.com/comics/automation.png"&gt;&lt;/p&gt;
&lt;h3 id="built-small-and-simple-for-one-person-use"&gt;Built small-and-simple for one-person use&lt;/h3&gt;
&lt;p&gt;If you don't need to commercialize it, you can personalize it! Added benefit: this means you don't have to reach scale other than 1 user! You are already winning if one person can log in and use it. I ran my content server for the first year on a very small $3/month server. :)&lt;/p&gt;
&lt;p&gt;Since I already knew how to write scrapers and make a Django-based website, I did that. There are certainly a million other ways to do this, but I did what worked for me.&lt;/p&gt;
&lt;p&gt;Over time, I realized that I might want to filter content that I'm not interested in, especially when I get busy and don't log in for a month or two. When that happened, I wanted to only read the potentially interesting stuff and mark everything else as read.&lt;/p&gt;
&lt;p&gt;To start with building a recommender, I exported my data and played around with simple natural language processing to see what models worked for my data. I didn't overcomplicate or overthink it for my use, which is why I used &lt;a href="https://scikit-learn.org/stable/"&gt;scikit-learn&lt;/a&gt; and not some LLM.&lt;/p&gt;
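&lt;p&gt;If you're curious what "not overcomplicating it" looks like in practice, here's a minimal sketch of that kind of scikit-learn setup -- not the exact code from the project's notebook, and the example texts and labels are invented: TF-IDF features over item text plus a linear classifier predicting "interesting" or not.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical export of your own reading history: text plus a 1/0 label
# for "I marked this interesting".
texts = [
    "new release of a python privacy library",
    "celebrity gossip roundup for the week",
    "differential privacy explained with examples",
    "ten gadgets you absolutely must buy",
]
labels = [1, 0, 1, 0]

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=1),
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)

# Score unread items and read the most promising ones first.
unread = ["a gentle intro to membership inference attacks"]
print(model.predict_proba(unread)[0][1])  # probability the item is interesting
&lt;/code&gt;&lt;/pre&gt;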
&lt;p&gt;You might be different and decide:&lt;/p&gt;
&lt;p&gt;a) you want to build your own using a different web framework or open-source reader/recommender
b) you want to use an LLM&lt;/p&gt;
&lt;p&gt;I say: go for it! It's your project! :)&lt;/p&gt;
&lt;p&gt;Note: My current server costs about $6 a month to manage running the feed-reading, parsing and bulk-rating of articles every few hours. If you want to run an LLM it will cost a lot more and require much more memory.&lt;/p&gt;
&lt;h3 id="the-treasure-trove-of-your-own-data"&gt;The treasure trove of your own data&lt;/h3&gt;
&lt;p&gt;One cool thing about running your own content reader/recommender is that you can study your data over time. As a data scientist, I think this is really awesome (yes, I am a nerd).&lt;/p&gt;
&lt;p&gt;Once you have enough data to do some basic analysis or whenever you decide to train a model on your data, you can use that analysis or model introspection to investigate more about you. This can be a fun exercise and you can do it on the privacy of your own computer and/or server.&lt;sup id="fnref:5"&gt;&lt;a class="footnote-ref" href="#fn:5"&gt;5&lt;/a&gt;&lt;/sup&gt; There is an example &lt;a href="https://github.com/kjam/priveedly/blob/main/notebooks/Training%20and%20Testing%20Simple%20Recommendation%20Classifiers.ipynb"&gt;notebook in the GitHub&lt;/a&gt; to get you started and an accompanying video on &lt;a href="https://youtu.be/AMy3K3NbrLw"&gt;YouTube&lt;/a&gt;.&lt;/p&gt;
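&lt;p&gt;Introspecting a linear model like the one sketched above takes only a few lines: pairing the feature names with the learned coefficients shows which words and bigrams push an item toward "interesting" or away from it. This continues the hypothetical pipeline from the earlier sketch, not the project's notebook.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np

# Continuing the hypothetical pipeline from above: pull out the fitted steps.
vectorizer = model.named_steps["tfidfvectorizer"]
classifier = model.named_steps["logisticregression"]

terms = np.array(vectorizer.get_feature_names_out())
weights = classifier.coef_[0]

order = np.argsort(weights)
print("most negative terms:", terms[order[:5]])    # push toward "not interesting"
print("most positive terms:", terms[order[-5:]])   # push toward "interesting"
&lt;/code&gt;&lt;/pre&gt;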
&lt;p&gt;Should you want to change/retrain your model or even change what you read or what you mark interesting based on your model introspection, you can guide that yourself on your own terms. Changing the model is something you can do at any time, without anyone making money or poking you to change what you click so they can make money.&lt;/p&gt;
&lt;h3 id="open-sourcing-priveedly"&gt;Open-sourcing Priveedly&lt;/h3&gt;
&lt;p&gt;This project was really just for me for a long time, but I thought now is a hard time for many people to control their news and what they want to read, so I decided to clean it up as best I could and open-source it. If you think you can make Priveedly better by helping with the &lt;a href="https://github.com/kjam/priveedly?tab=readme-ov-file#some-additional-notes"&gt;open requests in the ReadMe&lt;/a&gt; or via &lt;a href="https://github.com/kjam/priveedly/issues"&gt;GitHub Issues&lt;/a&gt;, I would be very grateful!&lt;/p&gt;
&lt;p&gt;I hope you might be inspired to use Priveedly or whatever service/project you decide gives you the right balance of privacy, autonomy and fun.&lt;/p&gt;
&lt;p&gt;If you find this project useful and want to support my work, you can &lt;a href="https://probablyprivate.com"&gt;subscribe to my newsletter&lt;/a&gt;, buy &lt;a href="https://practicaldataprivacybook.com"&gt;my most recent book&lt;/a&gt;, follow me on &lt;a href="https://www.youtube.com/@ProbablyPrivate"&gt;YouTube&lt;/a&gt; or even hire me for &lt;a href="https://kjamistan.com/"&gt;corporate trainings, advisory and speaking engagements on topics like Privacy and Security in ML/AI systems&lt;/a&gt;.&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;I am odd, just probably like you are odd in some ways. My ideal feed is heavy on tech, computer science, machine learning but also on things like my favorite cooking blogs, artsy blogs and artists/comics I like.&amp;#160;&lt;a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;I sometimes want to read stuff without teaching the algorithm, just because I am curious what's behind a link. And yes, sometimes it is clickbait and I wished I didn't click, but I try to be kind to myself and tell myself that's okay too.&amp;#160;&lt;a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:3"&gt;
&lt;p&gt;I don't like BigTech or Third-party-ad-platform trying to target me via my clicks or reading interests. It makes me feel uncomfortable about clicking things. This isn't the internet I signed up for... (sad trombone)&amp;#160;&lt;a class="footnote-backref" href="#fnref:3" title="Jump back to footnote 3 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:4"&gt;
&lt;p&gt;RIP Twitter. The freely available API for your own feed got turned off a few months after The MuskRat took over. :( The original code that worked is still there (it accesses Lists and pulls from them), but I am highly doubtful that it still works and that the API hasn't dramatically changed.&amp;#160;&lt;a class="footnote-backref" href="#fnref:4" title="Jump back to footnote 4 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:5"&gt;
&lt;p&gt;I &lt;a href="https://youtu.be/jYwe-YHM4ag?t=1123"&gt;presented some interesting trends and tokens from my personal recommender model at PyData Paris&lt;/a&gt;, including one of the most negative bigrams (2-word-combinations), which was "Elon says". When I first saw this, it made me laugh all day long and was well worth the additional time and effort.&amp;#160;&lt;a class="footnote-backref" href="#fnref:5" title="Jump back to footnote 5 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="personal-ai"></category></entry><entry><title>Adversarial Examples Demonstrate Memorization Properties</title><link href="https://blog.kjamistan.com/adversarial-examples-demonstrate-memorization-properties.html" rel="alternate"></link><published>2025-01-15T00:00:00+01:00</published><updated>2025-01-15T00:00:00+01:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2025-01-15:/adversarial-examples-demonstrate-memorization-properties.html</id><summary type="html">&lt;p&gt;In this article, the last in the &lt;a href="https://blog.kjamistan.com/a-deep-dive-into-memorization-in-deep-learning.html"&gt;problem exploration section of the series&lt;/a&gt;, you'll explore adversarial machine learning - or how to trick a deep learning system.&lt;/p&gt;
&lt;p&gt;Adversarial examples demonstrate a different way to look at deep learning  memorization and generalization. They can show us how important the learned decision space …&lt;/p&gt;</summary><content type="html">&lt;p&gt;In this article, the last in the &lt;a href="https://blog.kjamistan.com/a-deep-dive-into-memorization-in-deep-learning.html"&gt;problem exploration section of the series&lt;/a&gt;, you'll explore adversarial machine learning - or how to trick a deep learning system.&lt;/p&gt;
&lt;p&gt;Adversarial examples demonstrate a different way to look at deep learning  memorization and generalization. They can show us how important the learned decision space and its properties are and how the training data and preprocessing affect that behavior. Adversarial examples demonstrate similar properties to outliers in deep learning systems.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Prefer to learn by video? This post &lt;a href="https://youtu.be/us4gUJKvwpQ"&gt;is summarized on Probably Private's YouTube&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;You'll also explore how adversarial learning contributed to model growth with early approaches to adversarial training and robustness, and how today's approaches find correlations to memorization in diffusion models.&lt;/p&gt;
&lt;h3 id="what-is-an-adversarial-example"&gt;What is an adversarial example?&lt;/h3&gt;
&lt;p&gt;Adversarial examples are those which trick a machine learning model into behaving in unlikely or unwanted ways. You've probably seen some adversarial examples &lt;a href="https://x.com/random_walker/status/1636923058370891778"&gt;on social media&lt;/a&gt; or &lt;a href="https://www.theregister.com/2024/10/29/chatgpt_hex_encoded_jailbreak/"&gt;in the news&lt;/a&gt;, which are often referred to as "jailbreaking" if they attack an LLM system.&lt;sup id="fnref:1"&gt;&lt;a class="footnote-ref" href="#fn:1"&gt;1&lt;/a&gt;&lt;/sup&gt; But the study of adversarial examples precedes the existence of LLMs, and can teach you about how deep learning models work.&lt;/p&gt;
&lt;p&gt;Let's examine an early example of an adversarial attack, from &lt;a href="https://arxiv.org/abs/1707.07397"&gt;MIT researchers in 2017&lt;/a&gt;. The machine learning lab researchers were building on still image work that introduced adversarial examples by altering the images in small ways. They wondered if they could build a 3D adversarial example - and were able to do so!&lt;/p&gt;
&lt;p&gt;&lt;img alt="An image of what appears to be a turtle with psychedelic shell. To the right of the image you can see the results of a classifier, which classifies it with more than 90% confidence that it is a rifle." src="./images/2024/adversarial_turtle.png"&gt;
&lt;em&gt;Watch the &lt;a href="https://www.youtube.com/watch?v=YXy6oX1iNoA&amp;amp;ab_channel=SynthesizingRobustAdversarialExamples"&gt;full video on YouTube&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Using then state-of-the-art computer vision models, they were able to 3D print an adversarial turtle which from many angles is improperly categorized as a rifle.&lt;/p&gt;
&lt;p&gt;You might be wondering, how does that work???&lt;/p&gt;
&lt;h3 id="how-adversarial-examples-happen"&gt;How adversarial examples happen&lt;/h3&gt;
&lt;p&gt;Adversarial examples occur primarily by increasing uncertainty or error in the machine learning system. Via a variety of methods, adversarial examples push inputs into other areas of the decision space or boundaries (think about what you learned about &lt;a href="https://blog.kjamistan.com/how-memorization-happens-overparameterized-models.html"&gt;margin theory&lt;/a&gt;!). In this case, the researchers wanted to push the turtle into "rifle" decision space. These attacks exploit those decision boundaries, in a similar way to how the training process creates them.&lt;/p&gt;
&lt;p&gt;Let's walk through exactly how a simple adversarial attack works, to get an idea for how it happens.&lt;/p&gt;
&lt;p&gt;You can take a model, really any model that is trained on a similar task -- here, a computer vision task. The properties of transfer learning make this possible. Deep learning models that are trained for similar tasks hold similar properties, learn similar things, and sometimes even have similar base datasets.&lt;/p&gt;
&lt;p&gt;With your local computer vision model, you take an input that you want to make adversarial. Because you have direct model access, you can run the image through the model to produce an inference / prediction. When you do so, you can also observe the weights and activations at each layer, and of course the output of the inference. Let's say it correctly identifies a person in the image.&lt;/p&gt;
&lt;p&gt;You want to make sure a person isn't found in the photo. You can then measure the gradient changes you would need to increase the error. You are essentially reversing the process of stochastic gradient descent, trying to increase error overall or towards another decision boundary (i.e. please make sure this photo has an imaginary boat in it). The goal is to reach an error level where the original classification no longer holds (i.e. the model returns that the photo has no person in it).&lt;/p&gt;
&lt;p&gt;An example from one of the &lt;a href="https://arxiv.org/abs/1412.6572"&gt;first well-cited papers on these attacks (Goodfellow et al, 2015)&lt;/a&gt; shows visually, what this might mean:&lt;/p&gt;
&lt;p&gt;&lt;img alt="On the left, there is an image of a panda. Then there is a plus and then another image, which just looks like colorful noisy pixels. Those two images are combined and the resulting image (to the right) looks again like a panda, but maybe the color looks a teeny bit different than the first one but it's barely discernible. The predictions on the images are written below the image. On the left image the model is 57.7% confident it's a panda. The middle image is classified with 8.2% confidence as a nematode and the right image is classified with 99.3% confidence as a gibbon." src="./images/2024/adversarial_panda.png"&gt;&lt;/p&gt;
&lt;p&gt;This attack uses the Fast Gradient Sign Method (FGSM), which functions similarly to what is described above.&lt;sup id="fnref:2"&gt;&lt;a class="footnote-ref" href="#fn:2"&gt;2&lt;/a&gt;&lt;/sup&gt; This method shows you "which direction to push" and where in the input to change in order to push that way. Then you actually create the perturbation, which aims to push the input in the right places in the right way, scaled here by 0.07 (represented in the equation as epsilon). This perturbation is combined with the original image, resulting in a large classification error (and the resulting high confidence in the incorrect gibbon class).&lt;/p&gt;
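&lt;p&gt;In code, the core of FGSM is only a few lines. Here's a hedged sketch in PyTorch (the model, input and label are placeholders for whatever classifier and data you're attacking): compute the loss on the true label, take the sign of the gradient with respect to the input, and nudge the input by epsilon in that direction.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import torch
import torch.nn.functional as F

def fgsm_attack(model, x, true_label, eps=0.07):
    # x: image tensor with shape (1, channels, height, width), values in [0, 1]
    # true_label: tensor of shape (1,) holding the correct class index
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), true_label)
    loss.backward()
    # Step in the direction that increases the loss on the true label.
    x_adv = x + eps * x.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()
&lt;/code&gt;&lt;/pre&gt;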
&lt;p&gt;An attacker can use a method like this, or one of several more complex methods developed over time, to introduce error into model inference and influence the model to return a particular decision. The turtle becoming a rifle was a specific example to show that deploying computer vision for security could go very wrong if it were used to target people carrying "weapons".&lt;/p&gt;
&lt;p&gt;In a way, these examples represent the unlikely inputs you see in the long tail of normal data collection. To a human, these are obvious, but to a computer vision or deep learning model, they are novel, erroneous or unknown. Interestingly enough, some of the initial defenses against adversarial attacks used this fact to correct the introduced error. Let's explore one of them that relates to our investigation of memorization.&lt;/p&gt;
&lt;h3 id="initial-defenses-manifold-ing"&gt;Initial defenses: Manifold-ing&lt;/h3&gt;
&lt;p&gt;One of the early defenses that caught my eye had an interesting approach to the adversarial input problem. Instead of attempting to build the most robust model, it attempted to adjust the input and draw it closer to the more common examples, essentially regularizing the error away. This approach, called &lt;a href="https://arxiv.org/abs/1705.09064"&gt;MagNet&lt;/a&gt;, was introduced by Meng and Chen in 2017.&lt;/p&gt;
&lt;p&gt;Let's explore how this correction worked, step-by-step 😉:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;First, potential adversarial examples are identified by a series of detectors. The detectors are trained to determine how abnormal the current example is based on the training examples. Numerous detectors were trained, so an adversary would need to know each of the detectors well enough to build an example that would 100% go through undetected&lt;sup id="fnref:3"&gt;&lt;a class="footnote-ref" href="#fn:3"&gt;3&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If an example is too far from the distribution of training examples, it is corrected using an autoencoder built specifically to bring the example closer to the nearest training examples - which sit on what the paper calls a manifold. You can also think of this as attempting to migrate the example back towards the nearest decision boundary, as you learned about in margin theory.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;To illustrate the output of this autoencoder, check out this diagram from the paper, which visually explains the second step in 2-D. The curved line represents the manifold and the green circles the training examples. The red crosses represent adversarial examples, and the arrows represent their correction via the reformer.&lt;/p&gt;
&lt;p&gt;&lt;img alt="An image with a curved line representing the manifold. There are green circles along the line which show the location of the training examples. There are a few red crosses farther from the manifold. They represent the adversarial inputs. Then there are arrows that point from them towards the manifold, which represent the reformer process to move them closer to the training examples." src="./images/2024/manifolding.png"&gt;&lt;/p&gt;
&lt;ol start="3"&gt;
&lt;li&gt;Only then can the "reformed" version of the example be run through inference - hopefully without the carefully designed noise that would have disrupted the system without the correction.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This protection was successful against all of the common attack vectors at the time.&lt;sup id="fnref:4"&gt;&lt;a class="footnote-ref" href="#fn:4"&gt;4&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
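&lt;p&gt;To get a feel for the two-step idea, here is a toy stand-in for the detector and the reformer: a linear "autoencoder" (PCA from scikit-learn) fit on training data that lives near a low-dimensional manifold. Reconstruction error flags abnormal inputs, and the reconstruction itself acts as the reformed input. This is only a sketch of the concept with made-up data, not the MagNet implementation.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# "Training data" that lives near a low-dimensional manifold:
# 2 latent factors embedded in 20 dimensions, plus a little noise.
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 20))
X_train = latent @ mixing + 0.05 * rng.normal(size=(500, 20))

# Linear "autoencoder": project onto the manifold and back.
autoencoder = PCA(n_components=2).fit(X_train)

def reconstruct(x):
    return autoencoder.inverse_transform(autoencoder.transform(x.reshape(1, -1)))[0]

def reconstruction_error(x):
    return float(np.linalg.norm(x - reconstruct(x)))

# Detector threshold: the largest error seen on the training data itself.
threshold = max(reconstruction_error(x) for x in X_train)

def detect_and_reform(x):
    # Flag abnormal inputs; "reform" them by pulling them back onto the manifold.
    return reconstruction_error(x) &amp;gt; threshold, reconstruct(x)

normal_input = X_train[0]
adversarial_like = normal_input + 2.0 * rng.normal(size=20)  # pushed off the manifold

print(detect_and_reform(normal_input)[0])      # False: close to the manifold
print(detect_and_reform(adversarial_like)[0])  # True: far from the manifold
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The real MagNet detectors and reformer are trained autoencoders rather than PCA, but the mechanics -- measure distance to the manifold, then pull the input back onto it before inference -- are the same.&lt;/p&gt;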
&lt;p&gt;How does this relate to memorization? Adversarial examples present the same types of problems for network functionality as singletons do, although in different ways (one is malicious, the other just odd). If you treated singletons as adversarial, you could choose to focus on learning the nearest decision boundary or manifold and handle them as outliers. This would reduce the chance of memorization.&lt;/p&gt;
&lt;p&gt;In doing so, you might choose to implement something more like the reformer, which could encode enough information from the outlier to shift decision boundaries, but not enough to memorize the initial input. The reformer algorithm would be a stand-in for something like differential privacy, auto-encoding outliers towards a more "common" case. One approach that encodes privacy into a learned representation can be found in Dwork et al.'s work &lt;em&gt;&lt;a href="https://arxiv.org/abs/1104.3913"&gt;Fairness through Awareness&lt;/a&gt;&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;These autoencoders and several other interesting approaches popular during those initial years were eventually replaced by a different approach to address adversarial examples.  &lt;/p&gt;
&lt;h3 id="the-heyday-and-wane-of-adversarial-research"&gt;The Heyday and Wane of Adversarial Research&lt;/h3&gt;
&lt;p&gt;Adversarial examples have likely existed as long as machine learning has existed, but they experienced a renaissance with deep learning due to the unique behavior of deep learning models in comparison to simpler models.&lt;/p&gt;
&lt;p&gt;In 2016 it was hard to attend any machine learning event without hearing about adversarial examples (kind of like trying to avoid LLMs in 2024). There were dedicated sections of popular machine learning conferences just for papers and posters exploring these problems. These papers looked at unique attack vectors, easy ways to produce adversarial examples and, of course, a variety of interesting and novel defense mechanisms, many of which were broken by follow-up research sometimes only days after a new state-of-the-art defense was published.&lt;/p&gt;
&lt;p&gt;Eventually newer approaches emerged -- ones that didn't try to figure out why or how adversarial examples worked or come up with clever ways to address them.&lt;/p&gt;
&lt;p&gt;In 2018, &lt;a href="https://arxiv.org/abs/1706.06083"&gt;a paper from MIT researchers (Mądry et al.)&lt;/a&gt; reached state-of-the-art adversarial performance with a new approach: throw compute and memory at the problem. Instead of trying to find, correct or otherwise disarm adversarial examples, they simply trained a bigger model for longer using adversarial examples alongside normal examples.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Two performance charts side by side, one using MNIST data and the other using CIFAR10 data. The loss is charted on the y-axis with training iterations on the x-axis. The iterations go from 25,000 to 75,000. On both charts you can see a drop in loss (also known as an increase in accuracy) until around 75,000 where sometimes the loss increases." src="./images/2024/adversarial_training.png"&gt;&lt;/p&gt;
&lt;p&gt;This works similarly to the phenomenon you've been exploring in this series, where double descent and an increase in memorization enhance model performance. Instead of figuring out new and interesting ways to understand deep learning, you just memorize adversarial examples in a massive model and move along.&lt;/p&gt;
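&lt;p&gt;In code terms, the recipe boils down to an ordinary training loop where each batch is perturbed by an attack before the gradient step. The sketch below uses a single FGSM-style step on a toy numpy logistic regression purely for brevity -- the paper uses multi-step attacks (PGD) on large networks -- so treat it as an illustration of the idea, not their implementation.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np

rng = np.random.default_rng(1)

# Toy binary classification data.
X = rng.normal(size=(200, 5))
true_w = np.array([1.0, -1.0, 2.0, 0.0, 0.5])
y = (X @ true_w + 0.1 * rng.normal(size=200) &amp;gt; 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b = np.zeros(5), 0.0
lr, epsilon = 0.1, 0.1

for step in range(500):
    p = sigmoid(X @ w + b)

    # Craft adversarial versions of the batch (one gradient-sign step per example):
    # push every input in the direction that increases its own loss.
    grad_x = (p - y)[:, None] * w[None, :]
    X_adv = X + epsilon * np.sign(grad_x)

    # Standard gradient step, but computed on the adversarial batch.
    p_adv = sigmoid(X_adv @ w + b)
    w -= lr * X_adv.T @ (p_adv - y) / len(y)
    b -= lr * np.mean(p_adv - y)

print("training accuracy on clean data:",
      np.mean((sigmoid(X @ w + b) &amp;gt; 0.5) == y))
&lt;/code&gt;&lt;/pre&gt;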
&lt;p&gt;In addition to this approach, a second popular method came about when diffusion models rose in popularity. As you read about &lt;a href="https://blog.kjamistan.com/how-memorization-happens-novelty.html"&gt;in a previous article&lt;/a&gt;, diffusion models can take an arbitrary noisy image and output prompt-based images. To do so, the diffusion process is reversed, so noise is gradually removed, shifting the output towards a known trajectory or goal. This can be applied to adversarial images in order to gradually remove potential noise.&lt;/p&gt;
&lt;p&gt;Recent research on &lt;a href="https://arxiv.org/abs/2206.10550"&gt;certifying adversarial robustness&lt;/a&gt; achieved "state of the art" results with a two-step process: apply a one-shot reverse diffusion step, then classify the result with a (probably fairly large) classifier. One interesting note is that a one-shot diffusion process works best, because iterative diffusers begin "filling in the blanks" and insert error by moving the input closer to an already learned class (or memorized input example). Choosing a less-diffused image results in higher fidelity to the actual input. This demonstrates what you learned with regard to &lt;a href="https://blog.kjamistan.com/how-memorization-happens-novelty.html"&gt;inpainting attacks&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id="exploring-the-problem-space"&gt;Exploring the problem space&lt;/h3&gt;
&lt;p&gt;So far in this series you've learned about:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;how data is collected and labeled for machine learning&lt;/li&gt;
&lt;li&gt;how training and evaluation works&lt;/li&gt;
&lt;li&gt;why and how accuracy became the most important metric&lt;/li&gt;
&lt;li&gt;model size and training time growth&lt;/li&gt;
&lt;li&gt;what examples are memorized and some understanding and intuition on why and how&lt;/li&gt;
&lt;li&gt;how researchers first understood and found examples of memorization in deep learning&lt;/li&gt;
&lt;li&gt;how differential privacy relates to the memorization problems discovered in deep learning&lt;/li&gt;
&lt;li&gt;how adversarial examples demonstrate similar qualities as outliers&lt;/li&gt;
&lt;li&gt;how adversarial approaches influenced today's models and vice-versa&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You now also know that memorization in deep learning happens. That it happens for common examples and outliers. That it's difficult to 100% understand and predict what is memorized. And that unless it is treated as a first-order problem which should be addressed and corrected, as adversarial examples were, memorization will continue to plague deep learning models. This memorization, as you have learned, presents serious problems for privacy guarantees, and certainly for any person whose data is used for training.&lt;/p&gt;
&lt;p&gt;In the next part of the article series, you'll explore the solution space. How can you address memorization? I'll walk you through active research areas, such as machine unlearning and differential privacy for training deep learning models. I'll also cover some areas which are useful but haven't gotten much attention, like personalized machine learning systems.&lt;/p&gt;
&lt;p&gt;If you've enjoyed this series, consider &lt;a href="https://probablyprivate.com"&gt;subscribing to my newsletter&lt;/a&gt;. If you'd be interested in a printed version of this series, I'd love to hear from you. I hope to produce a zine (physical) version with artist illustrations and content modifications for ease of reading and based on reader feedback.&lt;/p&gt;
&lt;p&gt;As always, I'm very open to feedback (positive, neutral or critical), questions (there are no stupid questions!) and creative re-use of this content. If you have any of those, please share it with me! This helps keep me inspired and writing. :)&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Acknowledgements&lt;/em&gt;: I would like to thank &lt;a href="https://vickiboykis.com/"&gt;Vicki Boykis&lt;/a&gt; for feedback, corrections and thoughts on this article. Any mistakes, typos, inaccuracies or controversial opinions are my own.&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;Interesting word choice, based on the fact that these models are often trained with "guardrails" to try to control behavior that the underlying language model has learned -- like swear words, how to build bombs or other undesired behavior for a large consumer-facing language model.&amp;#160;&lt;a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;To translate all the symbols: you compute the gradient with respect to the input (∇x) of the cost function J, which depends on the model parameters (θ), the input (x) and the target class (y). You do this to figure out the easiest and fastest way to achieve your adversarial goal (i.e. just increase error or increase error towards a particular target class). In this particular method, only the sign of each gradient component (i.e. positive or negative) is kept. This can be computed quickly and creates an easy-to-use output, where you follow (or reverse) the signs to "push" the input in the appropriate direction.&amp;#160;&lt;a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:3"&gt;
&lt;p&gt;This borrows from principles of cryptography, where you want to increase randomness and related uncertainty in the process to deter certain types of attacks. In cryptography, you can tell that a method is solid if it introduces enough randomness to provoke attacker uncertainty about the exact method, key and/or plaintext chosen.&amp;#160;&lt;a class="footnote-backref" href="#fnref:3" title="Jump back to footnote 3 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:4"&gt;
&lt;p&gt;I took some similar research code on &lt;a href="https://evademl.org/squeezing/"&gt;feature squeezing&lt;/a&gt; (where anomalies are detected and then compressed and smoothed) and turned it into a &lt;a href="https://resources.oreilly.com/live-training/security-for-machine-learning"&gt;GitLab exercise for a security in machine learning course&lt;/a&gt;, if you want to play around with some code examples.&amp;#160;&lt;a class="footnote-backref" href="#fnref:4" title="Jump back to footnote 4 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="ml-memorization"></category></entry><entry><title>Differential Privacy as a Counterexample to AI/ML Memorization</title><link href="https://blog.kjamistan.com/differential-privacy-as-a-counterexample-to-aiml-memorization.html" rel="alternate"></link><published>2025-01-02T00:00:00+01:00</published><updated>2025-01-02T00:00:00+01:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2025-01-02:/differential-privacy-as-a-counterexample-to-aiml-memorization.html</id><summary type="html">&lt;p&gt;At this point in reading the &lt;a href="https://blog.kjamistan.com/a-deep-dive-into-memorization-in-deep-learning.html"&gt;article series on AI/ML memorization&lt;/a&gt; you might be wondering, how did the field get so far without addressing the memorization problem? How did seminal papers like Zhang et al's &lt;a href="https://arxiv.org/abs/1611.03530"&gt;&lt;em&gt;Understanding Deep Learning Requires Rethinking Generalization&lt;/em&gt;&lt;/a&gt; not fundamentally change machine learning research? And maybe …&lt;/p&gt;</summary><content type="html">&lt;p&gt;At this point in reading the &lt;a href="https://blog.kjamistan.com/a-deep-dive-into-memorization-in-deep-learning.html"&gt;article series on AI/ML memorization&lt;/a&gt; you might be wondering, how did the field get so far without addressing the memorization problem? How did seminal papers like Zhang et al's &lt;a href="https://arxiv.org/abs/1611.03530"&gt;&lt;em&gt;Understanding Deep Learning Requires Rethinking Generalization&lt;/em&gt;&lt;/a&gt; not fundamentally change machine learning research? And maybe, is there any research on actually addressing the problem of memorization?&lt;/p&gt;
&lt;p&gt;I have an answer for the last question! In this article, you'll explore how differential privacy research both exposes memorization in deep learning networks and presents ways to address these issues.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Prefer to learn by video? This post &lt;a href="https://www.youtube.com/watch?v=zeYb-TvbbM0&amp;amp;ab_channel=ProbablyPrivate"&gt;is summarized on Probably Private's YouTube&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;In case differential privacy is new to you, let's walk through how differential privacy works and why it's a great fit for studying memorization.&lt;/p&gt;
&lt;h3 id="differential-privacy-a-primer"&gt;Differential Privacy: A primer&lt;/h3&gt;
&lt;p&gt;In 2006, Microsoft researcher Cynthia Dwork released a paper challenging common ideas around releasing data. To note, these ideas were already circulating in her research and related research for several years, but were not yet concretized for data release. In her paper &lt;a href="https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/dwork.pdf"&gt;&lt;em&gt;Differential Privacy&lt;/em&gt;&lt;/a&gt;, she posited that there was no real way to release data without potentially exposing someone's private information if the released information was combined with available external information. For example, if you release the average height of women in Lithuania and someone knows that a woman is 2cm over the average height, that person's height can now be calculated.&lt;/p&gt;
&lt;p&gt;This fact that any kind of information release can be detrimental to privacy is obviously at odds with the work of data science and study of information. Dwork and her peers didn't want or intend to stop all research - quite the contrary. They wanted to find new ways to provide safer guarantees than the current status quo, which often just aggregated data and suppressed or removed outliers.&lt;/p&gt;
&lt;p&gt;Differential privacy provided a new, safer way to release and share information. Differential privacy is a rigorous and scientific way of measuring information release and its impact on individual privacy. Instead of guessing and hoping you are releasing data in a safe manner, differential privacy gives you a way to measure the data release's impact on individual privacy. When used correctly, it provides strong privacy guarantees for people in the dataset.&lt;/p&gt;
&lt;p&gt;Differential privacy does this by giving you a new way to analyze the information gain someone can get by looking at the data. This information gain for the attacker (i.e. the person trying to learn more about someone) is privacy loss for the person or people they are trying to expose. The original differential privacy definition limits how much anyone can learn about any specific person.&lt;/p&gt;
&lt;p&gt;The definition is as follows:&lt;/p&gt;
&lt;p&gt;&lt;img alt="A database to the left labeled D1 and then a plus or minus sign and a drawing of a person equals a resulting database D2. Therefore, D1 with one person added or one person removed equals D2." src="./images/2024/dp_drawing.png"&gt;&lt;/p&gt;
&lt;div class="math"&gt;$$P[A(D_1) \in S] \le exp(\varepsilon) \times P[A(D_2) \in S]$$&lt;/div&gt;
&lt;p&gt;
&lt;em&gt;If you don't like math, just read on! It's okay :)&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;There are two databases (D1 and D2), which differ by one person. The definition tells us that if you ask a question about either database, you shouldn't be able to guess accurately if it is database 1 or 2 based on the answer. If you want to protect the privacy of single persons, you shouldn't notice when the database has changed by only one.&lt;/p&gt;
&lt;p&gt;The above equation can be described as a probability bounds problem. However, in this case, instead of reducing uncertainty, you are trying to increase uncertainty. You want the interactions with the database(s) to leak very little information, so the attacker's probability distributions based on prior knowledge and then updated with the information from the query response are closely bound. The attacker should remain unsure about which database they have queried and therefore also unsure about whether the person is in the dataset or not.&lt;/p&gt;
&lt;p&gt;Let's take another concrete example to review differential privacy with a real-world lens. Imagine you work at a company with a lot of business dashboards. One of the business dashboards you have access to shows the overall payroll spend by city and role.&lt;/p&gt;
&lt;p&gt;You find out someone new is joining the company in your city and you know their role. You're curious about what they are getting paid, so... you decide to take a screenshot the week before they start. Then, you take another screenshot the week they start (or when payroll goes out). When you compare those two screenshots, you have a pretty good guess at their annual salary, presuming they were the only joiner in that specific role/city combination. Even if they weren't the only joiner, you now have a bounds or range of possible salaries, and can better guess their salary.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A drawing of a person with exclamation points above their head. They are looking at a dashboard titled Payroll by Role and have a dashed line that represents where their screenshot from the prior example was." src="./images/2024/dp_dashboard.png"&gt;
&lt;em&gt;Image from my video course on Practical Data Privacy from O'Reilly Learning&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;This demonstrates the problems Dwork and others saw with simple aggregation. By just aggregating data, you don't protect individuals.&lt;/p&gt;
&lt;p&gt;But what if the dashboard didn't update as regularly, and when it did, it sometimes changed in more unpredictable ways? What if you weren't exactly sure your screenshot showed the correct number, just that it might be near the correct number -- or maybe not?&lt;/p&gt;
&lt;p&gt;When you implement differential privacy to release a dataset or to process data, you follow several key steps (sketched in code after this list):&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Understand sensitivity&lt;/em&gt;: How much can a person change the result? This can be easy to measure, like it is for our salary example, or difficult to measure -- like, how much can a person's data change a machine learning model? 🤔&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Determine bounds (if needed)&lt;/em&gt;: Sometimes you realize that sensitivity is unbounded, meaning a person could change the result by an unlimited amount. Technically, the company could pay someone anything from minimum wage to 10 billion dollars, so the natural bound would be very large. Instead of thinking about outliers, you want to think about the true distribution and keep individual contributions within a particular bound. Choosing this bound should still allow you to do meaningful data analysis but also provide protection for outliers.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Apply careful noise to the result&lt;/em&gt;: Once the sensitivity is known and the bounds are applied to the data, you can run the analysis or data processing and apply carefully calibrated noise to that process.&lt;sup id="fnref:1"&gt;&lt;a class="footnote-ref" href="#fn:1"&gt;1&lt;/a&gt;&lt;/sup&gt; Since you are essentially inserting error into the data analysis or processing&lt;sup id="fnref:2"&gt;&lt;a class="footnote-ref" href="#fn:2"&gt;2&lt;/a&gt;&lt;/sup&gt;, you want to know what type of error you are adding so you can adjust your analysis accordingly. For example, you might choose Gaussian noise because your dataset has a Gaussian distribution, and by applying Gaussian noise, you keep the overall Gaussian distribution intact. Differential privacy noise is tuned based on the extent of the analysis and the sensitivity of the query in question, so you can to some extent decide which answers get how much noise.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Track budget&lt;/em&gt;: In differential privacy, each time you run a query you need to track the privacy impact it has on the people in the dataset. This is called a &lt;em&gt;privacy budget&lt;/em&gt; and your budget spend is determined by parameters you set for a particular query or processing activity. Your entire budget is tracked, usually for the length of a particular analysis or process. This budget ensures that you haven't learned too much about any one individual in the group - and technically, when an individual or set of individuals run out of budget (i.e. when some limit of the parameters in that initial equation is reached), you should stop asking for more information.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
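&lt;p&gt;Here is a minimal sketch of those four steps for the payroll-dashboard example, using the Laplace mechanism on a bounded sum. The salary numbers, bounds and epsilon values are invented for illustration, and a real deployment would use a vetted differential privacy library rather than hand-rolled noise.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np

rng = np.random.default_rng(42)

# Step 2: determine bounds -- clamp each individual's contribution.
LOWER, UPPER = 30_000, 300_000          # assumed plausible salary range

# Step 1: sensitivity of a clamped *sum* query: adding or removing
# one person changes the total by at most UPPER.
sensitivity = UPPER

def dp_total_payroll(salaries, epsilon):
    clamped = np.clip(salaries, LOWER, UPPER)
    # Step 3: apply carefully calibrated (here, Laplace) noise.
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return clamped.sum() + noise

salaries = np.array([52_000, 61_000, 75_000, 120_000, 48_000])

# Step 4: track the budget -- each query spends part of the total epsilon.
total_budget, spent = 1.0, 0.0
for week in range(2):
    epsilon_per_query = 0.5
    assert spent + epsilon_per_query &amp;lt;= total_budget, "privacy budget exhausted"
    print(f"week {week}: noisy payroll = {dp_total_payroll(salaries, epsilon_per_query):,.0f}")
    spent += epsilon_per_query
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note how large the noise scale is here (300,000 / 0.5 = 600,000): with contributions bounded by a full salary and a small epsilon per query, the week-over-week difference in the dashboard tells the curious coworker very little -- which is exactly the point.&lt;/p&gt;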
&lt;p&gt;I recommend reading more about differential privacy via &lt;a href="https://desfontain.es/blog/friendly-intro-to-differential-privacy.html"&gt;Damien Desfontaines's blog series on differential privacy&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;One graphic from his blog series shows a visual idea of how choosing different budget values (here epsilon) relate to the information an attacker can gain by analyzing the results of the data release or query:&lt;/p&gt;
&lt;p&gt;&lt;img alt="A graph showing a variety of epsilon values with different colors. The lower values for epsilon form a lightly oval shape around the center line, showing how much more a person can learn when given a result based on their previous knowledge." src="./images/2024/dp_epsilon_suspicion.png"&gt;
&lt;em&gt;From &lt;a href="https://desfontain.es/blog/differential-privacy-in-more-detail.html"&gt;Differential Privacy, in more detail&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;In this graphic, the parameter choice for epsilon is the legend. To read the graph, the x-axis shows "how sure someone is that a person is in the dataset". The y-axis shows "how much more sure an attacker will become after spending this budget". Each epsilon value has a range from a lower to an upper bound. This range comes from the randomness chosen -- someone might learn more or less depending on the mechanism. However, the upper limit is a guarantee, and that is what separates differential privacy from other methods.&lt;/p&gt;
&lt;p&gt;Reading the graph, you can see that the parameter choice both affects your budget and has a fairly significant impact on the privacy guarantees. An epsilon of 5 reveals a lot of information, while an epsilon of 1 is fairly safe.&lt;/p&gt;
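&lt;p&gt;You can compute that worst case yourself. For a yes/no question ("is this person in the dataset?"), pure differential privacy bounds the attacker's posterior odds at e&lt;sup&gt;ε&lt;/sup&gt; times their prior odds. A few lines of Python show why epsilon 1 and epsilon 5 feel so different (the 50% starting suspicion is just an example):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import math

def worst_case_posterior(prior, epsilon):
    # Upper bound on the attacker's belief after seeing one epsilon-DP answer.
    prior_odds = prior / (1 - prior)
    posterior_odds = math.exp(epsilon) * prior_odds
    return posterior_odds / (1 + posterior_odds)

for eps in (0.5, 1, 5):
    print(eps, round(worst_case_posterior(0.5, eps), 3))
# prints roughly 0.622 for epsilon 0.5, 0.731 for epsilon 1, 0.993 for epsilon 5
&lt;/code&gt;&lt;/pre&gt;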
&lt;p&gt;To address machine learning memorization, you could reduce the individual or singleton impact on the training process. You just need to figure out how you can apply differential privacy to something like a machine learning model. If you can do so, this provides the differential privacy guarantees for the model and for anyone used in the training dataset.&lt;sup id="fnref:3"&gt;&lt;a class="footnote-ref" href="#fn:3"&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;Who has done this already? Let's investigate early research on applying differential privacy to deep learning.&lt;/p&gt;
&lt;h3 id="pates-near-misses"&gt;PATE's Near-misses&lt;/h3&gt;
&lt;p&gt;Papernot et al. architected one of the first differentially private deep learning systems in 2017. Their architecture, called &lt;a href="https://arxiv.org/abs/1802.08908"&gt;Private Aggregation of Teacher Ensembles, or PATE&lt;/a&gt;, achieved high accuracy, within 3 percentage points of the baseline models.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A graphic showing the PATE architecture. To the left is the sensitive data, which is divided up into n-datasets. Each teacher is then trained on those datasets. These teachers all link to an aggregate teacher, which then makes a predicted completion to a student who has incomplete public training data." src="./images/2024/pate_architecture.png"&gt;&lt;/p&gt;
&lt;p&gt;To review how PATE worked, let's walk through the architecture shown above. First, the sensitive data is separated into subsets so each person's data is only ever seen by one model. Then, many "teacher" models are trained, one per subset. After training, these teachers each cast one vote on the label for an example from the publicly available but unlabeled data used to train the student model. The votes are collected into a histogram, differential privacy noise is applied, and the highest class is chosen as the label. This uses differential privacy essentially as the labeling function for publicly available unlabeled data.&lt;sup id="fnref:4"&gt;&lt;a class="footnote-ref" href="#fn:4"&gt;4&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
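&lt;p&gt;The aggregation step itself is surprisingly small. Here is a sketch of the noisy-vote mechanism; the number of teachers, the number of classes and the Laplace scale are illustrative choices, not the paper's exact configuration.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np

rng = np.random.default_rng(7)

def noisy_aggregate(teacher_votes, num_classes, gamma=0.1):
    # teacher_votes: one predicted class index per teacher for a single public example.
    counts = np.bincount(teacher_votes, minlength=num_classes).astype(float)
    # Add Laplace noise to every class count, then take the winner.
    counts += rng.laplace(scale=1.0 / gamma, size=num_classes)
    return int(np.argmax(counts))

# 250 teachers voting on one unlabeled public example (10 classes).
votes = rng.integers(0, 10, size=250)
label_for_student = noisy_aggregate(votes, num_classes=10)
print(label_for_student)
&lt;/code&gt;&lt;/pre&gt;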
&lt;p&gt;In the paper, the authors found that the PATE architecture's "mistakes" were near-misses on odd or incorrectly labeled inputs. Here are some examples from the paper, shown with the correct labels that PATE guessed incorrectly.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Examples of near-misses from the model, where the image shows the example and then the label is shown below. The examples are hard to identify even for humans, with things like apostrophes being mislabeled as commas or out-of-vocabulary images like a cut-off special font or a Chinese character that is mislabeled." src="./images/2024/pate_misses.png"&gt;&lt;/p&gt;
&lt;p&gt;How many of these would you guess correctly? These near-misses are examples that could potentially confuse a human. In the exploration of &lt;a href="https://blog.kjamistan.com/how-memorization-happens-novelty.html"&gt;novelty and memorization&lt;/a&gt;, these training examples represent uncommon or novel examples of their class. This is exactly the type of example that Feldman proved would be memorized - because not memorizing it is too expensive if you want the highest accuracy model. Using differential privacy, however, blocks this novel example memorization from happening.&lt;/p&gt;
&lt;p&gt;Papernot et al. aren't the only ones -- much of the research studying memorization has explored the link between memorization of novel examples and differentially private training.&lt;/p&gt;
&lt;h3 id="memorization-and-differential-privacy-research"&gt;Memorization and Differential Privacy Research&lt;/h3&gt;
&lt;p&gt;In &lt;a href="https://arxiv.org/abs/1802.08232"&gt;the Secret Sharer paper&lt;/a&gt;, Carlini et al. compared data extraction from a model trained without differential privacy to extraction from models trained using different values of epsilon.&lt;/p&gt;
&lt;p&gt;They trained 7 models with different optimizers and epsilon values and compared the estimated exposure, which measures how strongly a given piece of sensitive information is memorized by the language model. The paper is from 2019, so they used a recurrent neural network (RNN), a different deep learning architecture for language models, rather than a transformer.&lt;/p&gt;
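&lt;p&gt;Roughly, exposure is measured by inserting a secret "canary" into the training data and then ranking how much the trained model likes that canary compared with every other candidate secret of the same format. Here is a hedged sketch of just the metric -- the log-likelihood scores below are stand-ins for real model outputs:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import math

def exposure(canary_log_likelihood, candidate_log_likelihoods):
    # Exposure = log2(number of candidates) - log2(rank of the canary).
    # A canary the model likes more than every other candidate (rank 1) gets the
    # maximum exposure; one buried in the middle of the pack gets close to zero bits.
    rank = 1 + sum(1 for ll in candidate_log_likelihoods
                   if ll &amp;gt; canary_log_likelihood)
    return math.log2(len(candidate_log_likelihoods)) - math.log2(rank)

# Pretend scores from a language model over 1,000 candidate secrets.
candidates = [-50.0 - 0.01 * i for i in range(1000)]
print(exposure(-49.0, candidates))  # ~10 bits: strongly memorized
print(exposure(-55.0, candidates))  # ~1 bit: barely memorized
&lt;/code&gt;&lt;/pre&gt;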
&lt;p&gt;Their results were as follows:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Optimizer&lt;/th&gt;
&lt;th&gt;Epsilon&lt;/th&gt;
&lt;th&gt;Test Loss&lt;/th&gt;
&lt;th&gt;Estimated Exposure&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;RMSprop&lt;/td&gt;
&lt;td&gt;0.65&lt;/td&gt;
&lt;td&gt;1.69&lt;/td&gt;
&lt;td&gt;1.1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RMSprop&lt;/td&gt;
&lt;td&gt;1.21&lt;/td&gt;
&lt;td&gt;1.59&lt;/td&gt;
&lt;td&gt;2.3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RMSprop&lt;/td&gt;
&lt;td&gt;5.26&lt;/td&gt;
&lt;td&gt;1.41&lt;/td&gt;
&lt;td&gt;1.8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RMSprop&lt;/td&gt;
&lt;td&gt;89&lt;/td&gt;
&lt;td&gt;1.34&lt;/td&gt;
&lt;td&gt;2.1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RMSprop&lt;/td&gt;
&lt;td&gt;2x10**8&lt;/td&gt;
&lt;td&gt;1.32&lt;/td&gt;
&lt;td&gt;3.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RMSprop&lt;/td&gt;
&lt;td&gt;1x10**9&lt;/td&gt;
&lt;td&gt;1.26&lt;/td&gt;
&lt;td&gt;2.8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SGD&lt;/td&gt;
&lt;td&gt;inf.&lt;/td&gt;
&lt;td&gt;2.11&lt;/td&gt;
&lt;td&gt;3.6&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;With ever-increasing values of epsilon, the accuracy improves and the exposure increases. This makes sense based on what you've learned so far, because memorizing the outliers and the long tail improves test accuracy. It also shows that implementing differential privacy reduces the chance of memorizing outliers. In this paper, the exposure and extraction attacks target outliers instead of common examples; therefore, the exposure increases as the differential privacy guarantees decrease.&lt;/p&gt;
&lt;p&gt;The researchers were unable to successfully perform any of the extraction attacks against the machine learning models trained with differential privacy.&lt;/p&gt;
&lt;h3 id="at-what-cost-at-what-gain"&gt;At what cost? At what gain?&lt;/h3&gt;
&lt;p&gt;As you learned in the &lt;a href="https://blog.kjamistan.com/gaming-evaluation-the-evolution-of-deep-learning-training-and-evaluation.html"&gt;Gaming Evaluation article in this series&lt;/a&gt;, the present machine learning culture is focused on accuracy above all other metrics. In exploring near-misses, I hope you have begun to question this focus. Would it be okay to guess these incorrectly in exchange for better privacy guarantees and less memorization?&lt;/p&gt;
&lt;p&gt;To some degree, differential privacy works as a regularizer for machine learning problems - forcing models to generalize and not memorize. If you want to make sure that you won't memorize any specific examples, you should apply it vigorously and as standard practice -- particularly when training large deep learning models.&lt;/p&gt;
&lt;p&gt;The types of mistakes that models trained with differential privacy make are small ones, mostly on examples from the dataset's long tail. Is avoiding those mistakes worth the cost of memorizing these examples? Perhaps for generic photos, but what about for someone's artwork, voice, personal details or writing? What cost are we individually and collectively paying and how will it affect work and life years later?&lt;/p&gt;
&lt;p&gt;Differential privacy training is not a magical salve that will solve all memorization problems for privacy in deep learning. You'll learn more about the limitations and critique of differentially private training in an upcoming article, as you begin exploring solutions to the problems discussed in these initial articles (coming in Spring 2025).&lt;/p&gt;
&lt;p&gt;In the next article, which is also the last article focused on the problem of memorization in deep learning, you'll explore adversarial learning and examples. Adversarial learning -- similar to differential privacy research -- is an area of research that understood aspects of the memorization phenomenon long before other research caught up.&lt;/p&gt;
&lt;p&gt;I'm very open to feedback (positive, neutral or critical), questions (there are no stupid questions!) and creative re-use of this content. If you have any of those, please share it with me! This helps keep me inspired and writing. :)&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Acknowledgements&lt;/em&gt;: I would like to thank &lt;a href="https://desfontain.es/serious.html"&gt;Damien Desfontaines&lt;/a&gt; for his feedback, corrections and thoughts on this article. Any mistakes, typos, inaccuracies or controversial opinions are my own.&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;I mention this because many descriptions of differential privacy mention "random noise", and that's not a very good description of what should be done. It is not uniformly random noise, but instead a noise distribution that you choose -- meaning you can fit the noise to the problem you want to solve.&amp;#160;&lt;a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;By the way, you already have error in your data because data is always an approximation and never 100% correct. For a great review on "ground truth", I recommend &lt;a href="https://www.youtube.com/watch?v=tDjuq6Wxj3s&amp;amp;ab_channel=SymposiaatCSAIL"&gt;Kate Crawford's lecture on the topic&lt;/a&gt;.&amp;#160;&lt;a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:3"&gt;
&lt;p&gt;I encourage you to learn more about differential privacy, especially if you work in data science and machine learning. If you'd like to read more on differential privacy, check out &lt;a href="https://desfontain.es/blog/friendly-intro-to-differential-privacy.html"&gt;Desfontaines's series&lt;/a&gt; and &lt;a href="https://practicaldataprivacybook.com/"&gt;my book&lt;/a&gt;, which has two chapters on differential privacy and its application in machine learning.&amp;#160;&lt;a class="footnote-backref" href="#fnref:3" title="Jump back to footnote 3 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:4"&gt;
&lt;p&gt;In theory, this system could also provide a prediction or inference service, where incoming data points are also labeled by majority differential privacy votes. Such a system was quite compute expensive to run at that time.&amp;#160;&lt;a class="footnote-backref" href="#fnref:4" title="Jump back to footnote 4 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;script type="text/javascript"&gt;if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
    var align = "center",
        indent = "0em",
        linebreak = "false";

    if (false) {
        align = (screen.width &lt; 768) ? "left" : align;
        indent = (screen.width &lt; 768) ? "0em" : indent;
        linebreak = (screen.width &lt; 768) ? 'true' : linebreak;
    }

    var mathjaxscript = document.createElement('script');
    mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
    mathjaxscript.type = 'text/javascript';
    mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';

    var configscript = document.createElement('script');
    configscript.type = 'text/x-mathjax-config';
    configscript[(window.opera ? "innerHTML" : "text")] =
        "MathJax.Hub.Config({" +
        "    config: ['MMLorHTML.js']," +
        "    TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
        "    jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
        "    extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
        "    displayAlign: '"+ align +"'," +
        "    displayIndent: '"+ indent +"'," +
        "    showMathMenu: true," +
        "    messageStyle: 'normal'," +
        "    tex2jax: { " +
        "        inlineMath: [ ['\\\\(','\\\\)'] ], " +
        "        displayMath: [ ['$$','$$'] ]," +
        "        processEscapes: true," +
        "        preview: 'TeX'," +
        "    }, " +
        "    'HTML-CSS': { " +
        "        availableFonts: ['STIX', 'TeX']," +
        "        preferredFont: 'STIX'," +
        "        styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
        "        linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
        "    }, " +
        "}); " +
        "if ('default' !== 'default') {" +
            "MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
                "var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
                "VARIANT['normal'].fonts.unshift('MathJax_default');" +
                "VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
                "VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
                "VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
            "});" +
            "MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
                "var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
                "VARIANT['normal'].fonts.unshift('MathJax_default');" +
                "VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
                "VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
                "VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
            "});" +
        "}";

    (document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
    (document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
&lt;/script&gt;</content><category term="ml-memorization"></category></entry><entry><title>How Memorization Happens: Overparametrized Models</title><link href="https://blog.kjamistan.com/how-memorization-happens-overparametrized-models.html" rel="alternate"></link><published>2024-12-18T00:00:00+01:00</published><updated>2024-12-18T00:00:00+01:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2024-12-18:/how-memorization-happens-overparametrized-models.html</id><summary type="html">&lt;p&gt;You've heard claims that we will "run out of data" to train AI systems. Why is that? In this article in the series on &lt;a href="https://blog.kjamistan.com/a-deep-dive-into-memorization-in-deep-learning.html"&gt;machine learning memorization&lt;/a&gt; you'll explore model size as a factor in memorization and the trend for bigger models as a general problem in machine learning.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Prefer …&lt;/p&gt;&lt;/blockquote&gt;</summary><content type="html">&lt;p&gt;You've heard claims that we will "run out of data" to train AI systems. Why is that? In this article in the series on &lt;a href="https://blog.kjamistan.com/a-deep-dive-into-memorization-in-deep-learning.html"&gt;machine learning memorization&lt;/a&gt; you'll explore model size as a factor in memorization and the trend for bigger models as a general problem in machine learning.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Prefer to learn by video? This post &lt;a href="https://youtu.be/va0QxMZBXvg"&gt;is summarized on Probably Private's YouTube&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;To begin, what is meant by the word overparameterization?&lt;/p&gt;
&lt;h3 id="what-is-a-parameter"&gt;What is a parameter?&lt;/h3&gt;
&lt;p&gt;In machine learning, a parameter is a value tied to the model's calculations. These calculations determine the model's predictions and how the model functions internally. You might have heard about parameter size, such as "this is the 7B (billion) parameter model" when using LLMs or other large generative models.&lt;/p&gt;
&lt;p&gt;The parameters are set as the model is trained. Usually a parameter starts at a "random" state and is adjusted as part of the training process. The parameters updating creates the "learning" part of the machine learning training.&lt;/p&gt;
&lt;p&gt;Alongside the parameters, there are also hyperparameters: additional variables or inputs, usually used by the training optimizers and learning algorithms, which are not part of the advertised parameter count. A hyperparameter is something a person can set beforehand, usually at a known state or range. Some hyperparameters can be adjusted as the model learns, such as an adaptive learning rate that expedites early stages of model training and then slows down with smaller steps as training enters later stages.&lt;/p&gt;
&lt;p&gt;In a deep learning model, the most common parameters are weights and biases. Let's view a diagram to understand how these work:&lt;/p&gt;
&lt;p&gt;&lt;img alt="An image of a zoomed in node or neuron, which is the building block of deep learning networks. To the left is an incoming arrow (edge) with a weight (w). Inside the neuron there is a bias (b) and a summation function that takes the input values from all of the incoming values and the bias that exists in the node. Then there is an activation function that operates on the result of the summation function. This activation feeds into the resulting arrows (edges) that connect this node to the following nodes." src="./images/2024/ml_network_node_edge.png"&gt;&lt;/p&gt;
&lt;p&gt;You can think of a deep learning model as a series of nodes, pictured above as a circle, and edges, pictured above as connecting arrows. The terms nodes and edges come from graph theory in mathematics, and you might know them from graph networks, like social network graphs.&lt;/p&gt;
&lt;p&gt;In deep learning, the nodes here are the "neurons" and the edges connect neurons to one another. In this network, each input value can change the result of the equations in the node. Those equations within a node are calculated and each node sends the results over the edges to the next series of nodes, and so-on. This is why the original term "neural network" exists, since the nodes were compared to neurons.&lt;sup id="fnref:1"&gt;&lt;a class="footnote-ref" href="#fn:1"&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;In each node, there is a &lt;em&gt;bias parameter&lt;/em&gt; which is part of the contained equation. The other values in the equation come from the incoming edges. Usually there are many nodes sitting at the same depth in the network. Together, these "same depth" nodes are called a &lt;em&gt;layer&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;Between each layer of nodes there are connections called edges. A &lt;em&gt;weight parameter&lt;/em&gt; exists at the edges -- sitting "between" the layers. These weights connect each layer to the next layer. The number and type of connections can change with different types of deep learning architectures, but often there are at least a few &lt;em&gt;fully connected&lt;/em&gt; layers, which means that each node in one layer connects with an edge (and its accompanying weight) to every node in the next layer.&lt;/p&gt;
&lt;p&gt;When a data example is first encoded (see &lt;a href="https://blog.kjamistan.com/encodings-and-embeddings-how-does-data-get-into-machine-learning-systems.html"&gt;initial encodings article&lt;/a&gt;) and input into the model at the input layer, each node calculates the result of its internal &lt;a href="https://en.wikipedia.org/wiki/Activation_function"&gt;&lt;em&gt;activation function&lt;/em&gt;&lt;/a&gt;. This is usually done by summing the incoming values together with the node's bias and then running the result through the chosen activation function, which can vary by architecture. A common choice for an activation function today is &lt;a href="https://en.wikipedia.org/wiki/Rectifier_(neural_networks)"&gt;ReLU&lt;/a&gt; because it has non-linear properties, allowing the network or model to learn more complex functions (alongside the power of linear algebra!).&lt;/p&gt;
&lt;p&gt;The result of the activation function for a given node is transmitted as input to the next layer, multiplied by the weight on the edge it travels along. This happens for each of the middle, &lt;em&gt;hidden&lt;/em&gt; layers. In the final layer, the activations are condensed into a range of probabilities to make a prediction. This final step is heavily dependent on model type, architecture and the task at hand (i.e. generate text/image/video versus predict a class label).&lt;/p&gt;
&lt;p&gt;&lt;img alt="An image of several layers, starting with a blue-colored input layers showing 5 neurons. Then 3 middle layers, called &amp;quot;hidden layers&amp;quot; and two output layers of smaller size. For illustrative purposes, the initial node / neuron in each layer has arrows showing it connects to every node in the next layer." src="./images/2024/connecting_layers.png"&gt;
&lt;em&gt;Example of connected layers. Only the top nodes' edges for each layer are shown, but imagine this continues for every node in the model.&lt;/em&gt;&lt;/p&gt;
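&lt;p&gt;As a concrete (if tiny) illustration, here is a forward pass through a small fully connected network in plain numpy -- just weights, biases, ReLU activations and a softmax at the end. The layer sizes and random values are arbitrary:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np

rng = np.random.default_rng(3)

def relu(z):
    return np.maximum(z, 0.0)

# A small fully connected network: 5 inputs, two hidden layers of 8, 2 outputs.
layer_sizes = [5, 8, 8, 2]
weights = [rng.normal(size=(m, n)) for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n) for n in layer_sizes[1:]]

def forward(x):
    activation = x
    # Hidden layers: weighted sum plus bias, then the ReLU activation function.
    for W, b in zip(weights[:-1], biases[:-1]):
        activation = relu(activation @ W + b)
    # Output layer: condense the activations into a range of probabilities (softmax).
    logits = activation @ weights[-1] + biases[-1]
    exps = np.exp(logits - logits.max())
    return exps / exps.sum()

print(forward(rng.normal(size=5)))  # two class probabilities that sum to 1
&lt;/code&gt;&lt;/pre&gt;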
&lt;p&gt;The training process updates the weights and biases, usually via some version of backpropagation, where the error from the model's current guesses is transmitted backwards through the network. This error is used to update the parameters so the model can perform better on the next round of predictions. At a high level, this happens by reducing the weights, biases and resulting activations that contributed to incorrect guesses and increasing those that contributed to correct outcomes.&lt;/p&gt;
&lt;p&gt;Now that you have a high-level understanding of parameters, let's investigate how they relate to model size.&lt;/p&gt;
&lt;h3 id="overparameterization-and-model-size-growth"&gt;Overparameterization and Model Size Growth&lt;/h3&gt;
&lt;p&gt;Since there must be at least as many parameters as nodes and edges, when the model's architecture grows and more layers are added, there will also be more parameters. In earlier deep learning, models ranged from a handful of layers to 150 or more, with a wide range of parameters per layer -- totaling roughly 20-140M parameters (see &lt;a href="https://en.wikipedia.org/wiki/Residual_neural_network"&gt;ResNet&lt;/a&gt;, &lt;a href="https://en.wikipedia.org/wiki/VGGNet"&gt;VGGNet&lt;/a&gt; and &lt;a href="https://en.wikipedia.org/wiki/AlexNet"&gt;AlexNet&lt;/a&gt;).&lt;/p&gt;
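&lt;p&gt;You can get a feel for why parameter counts grow so quickly just by counting the weights and biases in fully connected layers. (This is a rough sketch: convolutional layers like those in ResNet or VGGNet share weights, so their per-layer counts are smaller.)&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def fully_connected_param_count(layer_sizes):
    # Each pair of adjacent layers contributes (inputs x outputs) weights
    # plus one bias per output node.
    return sum(m * n + n for m, n in zip(layer_sizes[:-1], layer_sizes[1:]))

print(fully_connected_param_count([784, 512, 512, 10]))  # ~670 thousand parameters
print(fully_connected_param_count([4096] * 50))          # ~822 million parameters
&lt;/code&gt;&lt;/pre&gt;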
&lt;p&gt;In today's deep learning, those numbers have exploded. Many deep learning models released since 2022 are overparameterized, meaning they have more parameters than training data examples. If you've studied or used machine learning, you might be wondering if this results in &lt;em&gt;overfitting&lt;/em&gt; -- a model learning the training data so exactly that it does poorly on unseen data. With overparameterized models, the model could potentially encode every example in its parameters.&lt;/p&gt;
&lt;p&gt;The growth of model size and the subsequent growth of training time to ensure all parameters were adequately updated led to the discovery of double descent.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A graph where the y-axis represents test error and the x-axis represents the number of parameters. There is a first descent and then a high ascent (like a U curve) and then a second descent that tapers off after a while. On the bottom of the initial U-curve there is a dotted line marking the prior optimum you would reach when training. At the top of the ascent of that initial U-curve there is a dotted line representing where overfitting happens. As that peak then descends again due to training with more parameters and for a longer time, there is a note as the error tapers that this is considered now &amp;quot;good generalization&amp;quot;." src="./images/2024/double_descent.png"&gt;&lt;/p&gt;
&lt;p&gt;In smaller and older deep learning architectures, there was a point in parameter growth and training time where the model would overfit. This means that the model learned the training dataset too well (i.e. memorization and similar strategies) and therefore started performing poorly on the test dataset because of small divergences between the training and testing data.&lt;/p&gt;
&lt;p&gt;As model parameters and training time increased, these larger models had a second descent of the error where the models generalized well and outperformed smaller models. This led to a massive investment in larger and larger models.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://vcclab.org/articles/jcics-overtraining.pdf"&gt;Tetko et al.&lt;/a&gt; studied &lt;em&gt;overtraining&lt;/em&gt; as early as 1995. Overtraining increases the number of training epochs, and at that time this process created models that overfit and memorized the training dataset. That research recommended smaller networks with fewer hidden layers and less dense hidden layers, which could be trained without overfitting. They also recommended cross-validation via leave-one-out methods to compare models that had seen the data with those that hadn't (as you learned &lt;a href="https://blog.kjamistan.com/how-memorization-happens-novelty.html"&gt;is a key part of memorization research&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;Looking at how layer size and depth have changed over time, they have grown massively since the 1990s. Just charting GPT parameter growth since the first GPT model appeared (2018, the year after the transformer architecture was introduced) is a good way to visually inspect the changes in parameter size.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A chart showing GPT growth over time, where the y-axis counts the number of parameters and the x-axis represents each step of the GPT model. You can see a strong increase with each step from the size of the last model (more details below)." src="./images/2024/gpt_growth.png"&gt;&lt;/p&gt;
&lt;p&gt;This chart looks at GPT parameter size based on what is known about the OpenAI GPT models. The first GPT was released in 2018 and had 117 million parameters. The second was released in 2019 and had 1.5 billion parameters, already a parameter growth of more than 12x. The third GPT came in 2020 and had 175 billion parameters (&amp;gt;100x the size of GPT-2). Estimates of GPT-4 put it at 1.7 trillion parameters, almost 10x bigger than GPT-3. As you can tell from this trend, there is a huge push towards ever larger models.&lt;/p&gt;
&lt;p&gt;But understanding double descent and how deep learning works requires study. Simply embracing model size and training time growth without knowing how or what is changed by these aspects is unlikely to result in "forever" growth.&lt;sup id="fnref:2"&gt;&lt;a class="footnote-ref" href="#fn:2"&gt;2&lt;/a&gt;&lt;/sup&gt; What, exactly, has changed in terms of models when they are overparameterized and trained for many more iterations?&lt;/p&gt;
&lt;h3 id="ummm-do-we-know-how-deep-learning-works"&gt;Ummm, do we know how deep learning works?&lt;/h3&gt;
&lt;p&gt;This was the investigative question explored in the now famous paper &lt;a href="https://arxiv.org/abs/1611.03530"&gt;&lt;em&gt;Understanding Deep Learning Requires Rethinking Generalization&lt;/em&gt;&lt;/a&gt; (Chiyuan Zhang et al, 2017). You might remember Zhang's work from &lt;a href="https://blog.kjamistan.com/how-memorization-happens-novelty.html"&gt;the last article&lt;/a&gt;, when he worked with Feldman on quantifying novel example memorization. Three years before that work, in 2017, the researchers were trying to understand exactly how computer vision deep learning models were learning with greater accuracy than before, especially with the increased architecture size.&lt;/p&gt;
&lt;p&gt;They took the CIFAR10 and ImageNet datasets, then common datasets for large computer vision training and completely randomized the labels. Now the labels no longer applied to what was in the photo. A photo of a person was now a plane, a photo of a dog was now a building and so on.&lt;/p&gt;
&lt;p&gt;Even with the labels completely randomized, the training accuracy of the two most performant resulting models reached 89 percent. Of course, the accuracy on real data was awful; the model hadn't learned anything useful. But the fact that it learned completely random data raised new questions for understanding deep learning and challenged common thinking about overfitting and generalization. How do we measure generalization? Do we understand the difference between memorization and overfitting?&lt;/p&gt;
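&lt;p&gt;You can reproduce a miniature version of that experiment in a few lines: generate labels that carry no information at all, and watch an overparameterized model fit them anyway. This sketch uses scikit-learn's small MLP on tiny random data rather than the CIFAR10/ImageNet setups from the paper; training accuracy will usually land near 100% while accuracy on fresh data stays at chance.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# 200 random "images" with 10 classes -- the labels carry no information at all.
X = rng.normal(size=(200, 64))
random_labels = rng.integers(0, 10, size=200)

# Far more parameters than training examples.
model = MLPClassifier(hidden_layer_sizes=(512, 512), max_iter=5000, random_state=0)
model.fit(X, random_labels)

print("train accuracy on random labels:", model.score(X, random_labels))
print("accuracy on fresh random data  :",
      model.score(rng.normal(size=(200, 64)), rng.integers(0, 10, size=200)))
&lt;/code&gt;&lt;/pre&gt;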
&lt;p&gt;The researchers called for more research and understanding of how memorization happens and a better understanding of what generalization is in deep learning. In larger models, generalization doesn't seem to behave the same way as generalization in smaller models -- in part due to the complexities of increasing parameter size and how that can leave enough parameters for both good generalization and memorization.&lt;/p&gt;
&lt;h3 id="learning-the-identity-function"&gt;Learning the identity function&lt;/h3&gt;
&lt;p&gt;The identity function is a great example of a simple mathematical rule that could be trained using deep learning. The identity function takes an input and returns the input unchanged -- hence its name: identity. Think of it like adding 0 or multiplying by 1, but for more complex inputs.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://arxiv.org/pdf/1902.04698"&gt;Zhang et al. tested this idea in 2020&lt;/a&gt; by training a few computer vision models to take in fairly simple input (the MNIST and Fashion-MNIST datasets) and learn the identity function.&lt;sup id="fnref:3"&gt;&lt;a class="footnote-ref" href="#fn:3"&gt;3&lt;/a&gt;&lt;/sup&gt; To investigate if the identity function could be learned versus just input memorization, they trained on just one training example repeatedly and altered the depth of the network (and hence the number of parameters).&lt;/p&gt;
&lt;p&gt;They found that deeper networks with more parameters learned a constant function. What does that mean? Those networks learned to always answer with the example they were trained on, regardless of the input.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A series of rows and columns with different images. On the left there is a column just showing the same handwritten 7 from the NIST training dataset. Across the top row there is a variety of different inputs from MNIST and Fashion-MNIST. The y-axis shows the different CNN depths, from 1 layer to 20 layer. Each row shows the prediction from the input for each model. The shallow networks seem to have learned the identity function and return something close to the input. The intermediate layers seem to learn an edge identification, where the outline of the input shows up and the deeper networks return 7." src="./images/2024/learning_identity_function.png"&gt;
&lt;em&gt;Results of inference from CNN models trained with different depths. The 7 to the left is the entire training data for these models which attempted to learn the identity function. Each image is the prediction when given the input shown across the top row.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Intermediate depth networks learned to identify edges, which is probably a bit closer to the expected identity function (i.e. this is the outline of the shape you gave me), but not the same.&lt;/p&gt;
&lt;p&gt;Very shallow networks sometimes learned the identity function (as shown in this example) but with other architectures or learning functions, they simply returned noise or a blank image.&lt;/p&gt;
&lt;p&gt;Their conclusions were that computer vision models don't generalize the way that researchers and practitioners assumed they generalized -- and that more parameters might lead to more memorization instead of generalization.&lt;/p&gt;
&lt;p&gt;But this research doesn't necessarily provide a better understanding of how to measure generalization. Related research from margin theory, however, did provide some insights!&lt;/p&gt;
&lt;h3 id="margin-theory"&gt;Margin theory&lt;/h3&gt;
&lt;p&gt;Margins help estimate the &lt;em&gt;generalization gap&lt;/em&gt;, or the difference between how a model performed during training and how it performs on unseen test data.&lt;/p&gt;
&lt;p&gt;What is a margin? Margins are an important idea in support vector machines (SVMs), so let's use an example from SVMs to explain margins.&lt;/p&gt;
&lt;p&gt;&lt;img alt="An example of a SVM-style margin. There is one class on one side of a decision boundary line and another class on the other. The space between the closest example and the decision boundary is marked as the margin." src="./images/2024/margin_explainer.png"&gt;&lt;/p&gt;
&lt;p&gt;In support vector machines, you want to maximize the distance between the data points and the decision boundary. This improves the model confidence and avoids potential misclassification with "nearby" classes. Ideally, the boundary is further from the cluster of training examples, giving space to incorporate some "outliers", like the white peacocks.&lt;/p&gt;
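&lt;p&gt;For a linear SVM you can compute those margins directly: the signed distance from a point to the decision boundary is the decision function value divided by the norm of the weight vector. A small sketch with made-up blob data and scikit-learn's LinearSVC:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(5)

# Two well-separated blobs in 2-D.
X = np.vstack([rng.normal(loc=-2.0, size=(50, 2)), rng.normal(loc=2.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

svm = LinearSVC(C=1.0).fit(X, y)

# Geometric distance from each point to the decision boundary: its margin.
distances = svm.decision_function(X) / np.linalg.norm(svm.coef_)
print("smallest margin in the training set:", np.abs(distances).min())
&lt;/code&gt;&lt;/pre&gt;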
&lt;p&gt;But, how does this actually work in a high dimensional setting and with the complexity of a deep learning model? Deep learning models are more complex than support vector machines...&lt;/p&gt;
&lt;p&gt;&lt;a href="https://research.google/blog/predicting-the-generalization-gap-in-deep-neural-networks/"&gt;Research from Google&lt;/a&gt; has shown that margin theory applies if you take each layer of a deep learning model as its own decision engine. By sampling the distance between the intermediate input representation at that layer and the projected decision boundary, the margins can be estimated. The projected decision boundary is approximated via linearization even if the activation function is nonlinear.&lt;/p&gt;
&lt;p&gt;To put it another way, you are sampling layers and approximating their decision boundaries against the current input. Measuring these margins and determining if the network is maximizing them, as with SVMs, provides an accurate prediction of the model generalization. The larger the approximated margins across several key layers, the better the performance on unseen data.&lt;/p&gt;
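&lt;p&gt;Roughly speaking, the per-layer margin is approximated with a first-order (linearized) distance to the boundary between the top two classes: the difference of their logits divided by the norm of the gradient of that difference with respect to the layer's representation. Here is a hedged sketch of that calculation for a single layer and a single input; the &lt;code&gt;model.features&lt;/code&gt; / &lt;code&gt;model.head&lt;/code&gt; split is illustrative only, and the details differ from Google's implementation:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# First-order margin estimate at one layer of a PyTorch classifier.
import torch

def layer_margin(model, x, eps=1e-12):
    h = model.features(x)                   # intermediate representation
    h = h.detach().requires_grad_(True)
    logits = model.head(h)                  # rest of the network
    top2 = torch.topk(logits, k=2, dim=1).indices[0]
    diff = logits[0, top2[0]] - logits[0, top2[1]]   # f_i - f_j
    grad = torch.autograd.grad(diff, h)[0]           # d(f_i - f_j)/dh
    # Linearized distance from h to the decision boundary between i and j.
    return (diff / (grad.norm() + eps)).item()
&lt;/code&gt;&lt;/pre&gt;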
&lt;p&gt;Google Research successfully built a new loss function to maximize margins while training deep learning models. The loss function penalizes predictions based on margin distance at different layers. The resulting models are more robust against adversarial attacks or other unintentional input perturbations. It's neat that an idea from simpler machine learning models, like SVMs, can also be useful in deep learning.&lt;/p&gt;
&lt;p&gt;This provides some insight into memorization in large deep learning systems. Unlike SVMs, a deep learning model can create more decision boundaries due to its greater complexity and nonlinear learning. By memorizing outliers, a model can perform better on similar outliers that it sees later, even if these were not included in the original training data -- all because that outlier lands somewhere near the memorized input. This is one of the leading theories to explain why deep learning models memorize and generalize well.&lt;/p&gt;
&lt;p&gt;In the next article, you'll look at how privacy research came to this conclusion about deep learning networks long before most machine learning research. You'll explore deep learning through the lens of differential privacy, a rigorous definition of how to protect individual privacy.&lt;/p&gt;
&lt;p&gt;I'm very open to feedback (positive, neutral or critical), questions (there are no stupid questions!) and creative re-use of this content. If you have any of those, please share it with me! This helps keep me inspired and writing. :)&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Acknowledgements&lt;/em&gt;: I would like to thank &lt;a href="https://vickiboykis.com/"&gt;Vicki Boykis&lt;/a&gt;, &lt;a href="https://desfontain.es/serious.html"&gt;Damien Desfontaines&lt;/a&gt; and &lt;a href="https://www.linkedin.com/in/yanndupis/"&gt;Yann Dupis&lt;/a&gt; for their feedback, corrections and thoughts on this series. Their input greatly contributed to improvements in my thinking and writing. Any mistakes, typos, inaccuracies or controversial opinions are my own.&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;Of course, they are not nearly as complex as our brain, hence why the term is now considered outdated.&amp;#160;&lt;a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;In fact, &lt;a href="https://www.deeplearning.ai/the-batch/ai-giants-rethink-model-training-strategy-as-scaling-laws-break-down/"&gt;recent research and commentary&lt;/a&gt; suggest that current models have already reached peak scaling, and adding more parameters doesn't seem to improve performance anymore.&amp;#160;&lt;a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:3"&gt;
&lt;p&gt;&lt;a href="https://arxiv.org/abs/1810.10333v1"&gt;Radhakrishnan et al.&lt;/a&gt; first studied this in 2018 and demonstrated how traditional downsampling methods in computer vision models tended to memorize rather than learn the identity function. Their work also uncovered particular elements of autoencoders in computer vision that created mathematical properties that would inevitably end in memorization unless directly addressed in the architecture. Unfortunately, I learned of this work after the initial writing of this article; hence my use of Zhang et al.'s example.&amp;#160;&lt;a class="footnote-backref" href="#fnref:3" title="Jump back to footnote 3 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="ml-memorization"></category></entry><entry><title>How memorization happens: Novelty</title><link href="https://blog.kjamistan.com/how-memorization-happens-novelty.html" rel="alternate"></link><published>2024-12-09T00:00:00+01:00</published><updated>2024-12-09T00:00:00+01:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2024-12-09:/how-memorization-happens-novelty.html</id><summary type="html">&lt;p&gt;So far in &lt;a href="https://blog.kjamistan.com/a-deep-dive-into-memorization-in-deep-learning.html"&gt;this series on memorization in deep learning&lt;/a&gt;, you've learned how &lt;a href="https://blog.kjamistan.com/how-memorization-happens-repetition.html"&gt;massively repeated text and images incentivize training data memorization&lt;/a&gt;, but that's not the only training data that machine learning models memorize. Let's take a look at another proven memorization: novel examples.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Prefer to learn by video? This …&lt;/p&gt;&lt;/blockquote&gt;</summary><content type="html">&lt;p&gt;So far in &lt;a href="https://blog.kjamistan.com/a-deep-dive-into-memorization-in-deep-learning.html"&gt;this series on memorization in deep learning&lt;/a&gt;, you've learned how &lt;a href="https://blog.kjamistan.com/how-memorization-happens-repetition.html"&gt;massively repeated text and images incentivize training data memorization&lt;/a&gt;, but that's not the only training data that machine learning models memorize. Let's take a look at another proven memorization: novel examples.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Prefer to learn by video? This post &lt;a href="https://youtu.be/aTtAUgGv4hA?si=5LRHbXZnBV6GlWca"&gt;is summarized on Probably Private's YouTube&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;As you learned &lt;a href="https://blog.kjamistan.com/gaming-evaluation-the-evolution-of-deep-learning-training-and-evaluation.html"&gt;in the evaluation article&lt;/a&gt;, the chance of pulling a rare or novel example from the tail is fairly high, given that the tail is long and makes up the majority of the distribution. If you are training multiple models and evaluating them against one another based on test performance, there is a good chance that the best-performing model will have processed more of the novel examples that also exist in the test dataset.&lt;/p&gt;
&lt;p&gt;Vitaly Feldman, previously of Google Brain, now at Apple Research, initially studied this phenomenon in 2019 in his paper &lt;a href="https://arxiv.org/abs/1906.05271"&gt;Does Learning Require Memorization? A Short Tale About a Long Tail&lt;/a&gt;. Let's walk through the important parts of the paper together.&lt;/p&gt;
&lt;p&gt;In the first example of the paper, the learning algorithm will learn to differentiate two groups in a binomial population. This is a small toy example to easily define how a machine learning model should minimize the learning error. This example is an oversimplification of typical machine learning problems, but Feldman uses it to extrapolate learnings to more complex examples.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A graph showing a series of buckets where there is one population that skews to the left of the histogram and another that skews to the right. These two groups and/or populations are labeled to separate them visually from each other." src="./images/2024/two_groups_to_learn.png"&gt;&lt;/p&gt;
&lt;p&gt;In order to show how memorization happens, the paper then defines memorization mathematically. To do so, Feldman defines memorization by comparing two models. One has seen a particular example and the other has not. The difference between the two models demonstrates whether that point contributes significantly to memorization or not. This is a "leave one out" principle -- which can be used to test memorized training examples in real systems.&lt;/p&gt;
&lt;p&gt;&lt;img alt="An animated GIF showing the process. First a training dataset is shown and the initial model is trained. Then, one training example is taken out and a second &amp;quot;leave one out&amp;quot; model is trained. At the end, the performance of the models is compared to measure the memorization of that single example." src="./images/2024/loo_training.gif"&gt;&lt;/p&gt;
&lt;p&gt;This definition combined with the toy example shows that with a long-tail distribution, optimal model performance is reached only if some examples are memorized. The paper's significant contribution is a lower bound on model accuracy when a particular example or set of examples is not memorized. Because these are novel examples, the model must memorize them -- even if they appear only once in the dataset -- in order to achieve higher accuracy.&lt;/p&gt;
&lt;p&gt;&lt;img alt="An equation showing: the error of a given model is greater than or equal to the optimal model based on the data plus a normalized penalty for not fitting the data." src="./images/2024/simplified_feldman_equation.png"&gt;
&lt;em&gt;Feldman's proof of a lower bound&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;This simplification of Feldman's equation&lt;sup id="fnref:1"&gt;&lt;a class="footnote-ref" href="#fn:1"&gt;1&lt;/a&gt;&lt;/sup&gt; seems somewhat obvious -- of course the error is the optimal model plus some sort of representation of the things the model didn't learn properly. But let's summarize the impact of Feldman's normalized penalty.&lt;/p&gt;
&lt;p&gt;Feldman was able to use typical probability theory to formulate the penalty based on properties of the population and their distribution. Remember &lt;a href="https://blog.kjamistan.com/machine-learning-dataset-distributions-history-and-biases.html"&gt;the long-tail&lt;/a&gt;? His formulation creates a lower bound on what rare examples cost the model. The more rare an example, the more costly it is to the training process when the process tries to reduce error. In addition, those rare examples, or sets of rare examples are extremely costly to not learn if they will show up in the test dataset.&lt;/p&gt;
&lt;p&gt;To put it another way, the model's error is relative to the size of a given class in the dataset and whether that class is infrequent within the overall population. As you remember from the uncommon photos of buses (odd angles, only parts of the bus), this also applies to infrequent examples of more common classes.&lt;/p&gt;
&lt;p&gt;Based on the estimated distribution for long-tail data, the singleton examples (examples that only occur once) make up approximately 17% of the data. An algorithm that does not memorize these singleton examples will be suboptimal by approximately 7% in accuracy (maximum accuracy is then 93% if singletons are not memorized).&lt;/p&gt;
&lt;p&gt;Feldman's article focused on proving this mathematically; but does this happen in real machine learning or only in theory? This research spawned deeper investigations into the memorization problem with exciting results.&lt;/p&gt;
&lt;h3 id="what-color-is-a-peacock"&gt;What color is a peacock?&lt;/h3&gt;
&lt;p&gt;Did you know there are peacocks that are completely black and white? I didn't! And neither did researcher Chiyuan Zhang who worked alongside Feldman to study the deep learning memorization phenomenon. Their work attempted to find the novel examples that a model had memorized.&lt;/p&gt;
&lt;p&gt;Feldman and Zhang's work uncovered &lt;a href="https://pluskid.github.io/influence-memorization/"&gt;high influence pairs&lt;/a&gt;, where a "leave one out"-inspired training routine demonstrated the impact of novel examples on the model. Here are a few high influence pair examples from the paper:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Three rows and two columns of images, where the image on the left is the memorized image from the training data and the image on the right is the image most influenced by this memorized image when it appears in the evaluation or testing dataset. In the top row, there is a worm-like animal called a nematode where the photos look nearly the same, in the middle row there is a stole that is light brown and crocheted, the same stole is in both photos and in the bottom row there is a circular staircase labeled as a coil. Both photos show the same staircase at slightly different angles." src="./images/2024/high-influence-pair-trio.png"&gt;&lt;/p&gt;
&lt;p&gt;Some of these examples probably remind you of the data collection discussion from this series because some of the photos are literally from the same photo shoot. When the same photos appear in the training and testing datasets, the model's test performance will increase if it predicts those correctly. You can explore more high-influence pairs on &lt;a href="https://pluskid.github.io/influence-memorization/"&gt;the paper's site&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;To find these high influence pairs, the researchers needed to find a way to "leave one out" and measure the impact on the model performance. Because of the high costs of training large deep learning models, they didn't leave "just one" example out and retrain. Instead, they batched the initial dataset and experimented with leaving out sets of images. They then compared the models that had seen different sets of images on the same evaluation data.&lt;/p&gt;
&lt;p&gt;This allowed them to compare the model performance between models that had processed rare examples with other models trained the same way but without those examples. In doing so, they found the "high influence" pairs. If one of these pairs were included in the training data, the model performed much better on the test dataset example.&lt;/p&gt;
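&lt;p&gt;A rough sketch of that subsampling idea (not Feldman and Zhang's actual implementation) looks like this: train many models on random subsets, then compare accuracy on a test point between models that did and did not see a candidate training example. The &lt;code&gt;train_model&lt;/code&gt; function here is a placeholder for whatever training routine you use:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Subsampled influence estimate of each training example i on one test
# example: difference in accuracy between models whose random training
# subset contained i and models whose subset did not.
import numpy as np

def influence_estimates(X, y, x_test, y_test, train_model,
                        n_models=100, subset_frac=0.7, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    records = {i: [] for i in range(n)}
    for _ in range(n_models):
        idx = rng.choice(n, size=int(subset_frac * n), replace=False)
        model = train_model(X[idx], y[idx])
        correct = float(model.predict([x_test])[0] == y_test)
        included = set(idx.tolist())
        for i in range(n):
            records[i].append((i in included, correct))
    scores = {}
    for i, history in records.items():
        seen = [c for was_in, c in history if was_in]
        unseen = [c for was_in, c in history if not was_in]
        if seen and unseen:
            scores[i] = np.mean(seen) - np.mean(unseen)
    return scores   # large values flag candidate high-influence pairs
&lt;/code&gt;&lt;/pre&gt;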
&lt;p&gt;They also show that the influence of an image is related to the long-tail. Uncommon classes and uncommon images within a class were memorized more often than common ones, hence Zhang's discovery of black and white peacocks.&lt;sup id="fnref:2"&gt;&lt;a class="footnote-ref" href="#fn:2"&gt;2&lt;/a&gt;&lt;/sup&gt; They found that about 30% of examples have some level of memorization, or a significant "influence" on the model's performance for given test points. In their experiments, they demonstrated a 2.5-3.2% performance boost that came from memorization, which supports Feldman's initial theory that optimal performance on a long-tail distribution requires partial memorization.&lt;/p&gt;
&lt;p&gt;These novel examples were also discovered separately by other researchers working on extracting training data from large deep learning models. Let's investigate Carlini et al.'s work on diffusion models.&lt;/p&gt;
&lt;h3 id="and-other-types-of-deep-learning-models"&gt;And other types of deep learning models...&lt;/h3&gt;
&lt;p&gt;Carlini et al. extracted memorized examples from diffusion models&lt;sup id="fnref:3"&gt;&lt;a class="footnote-ref" href="#fn:3"&gt;3&lt;/a&gt;&lt;/sup&gt; with great success. For a quick primer: diffusion models are the deep learning models behind much of today's generative text-to-image systems, like DALL-E or Flux. These models have a specific architecture which uses an initial "random" sampling to create a base image. This random start is processed with denoising techniques to create a visual representation that matches a particular text input. So, when you type in "a unicorn jumping over the moon", there is an approximation of what those vectors represent together based on the training data, and the model is optimized to try to extract the closest representation of that text.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A series of images going from left to right showing first a very noisy non-sensical image and then with each step a building starts to emerge. By the final step the building can be seen as if it is from a photograph." src="./images/2024/stable_diffusion_steps.png"&gt;
&lt;em&gt;Example Stable Diffusion Steps: From noise to photorealism&lt;/em&gt;&lt;/p&gt;
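&lt;p&gt;As a very rough mental model, the generation loop repeatedly asks a trained denoiser to remove a little noise, conditioned on the text. Here is a heavily simplified sketch loosely following the DDPM formulation; &lt;code&gt;denoiser&lt;/code&gt; and the text conditioning are placeholders, and real samplers differ in many details:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Simplified reverse (denoising) sampling loop of a diffusion model.
import numpy as np

def sample(denoiser, text_embedding, steps=1000, shape=(64, 64, 3), seed=0):
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, steps)      # noise schedule
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.normal(size=shape)                  # start from pure noise
    for t in reversed(range(steps)):
        eps = denoiser(x, t, text_embedding)    # predicted noise at step t
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t:                                   # add fresh noise except at the last step
            x = x + np.sqrt(betas[t]) * rng.normal(size=shape)
    return x
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If the model has memorized a training image, this loop can converge on a near-copy of it when given the right prompt -- which is what the extraction experiments below exploit.&lt;/p&gt;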
&lt;p&gt;To test whether training data could be extracted from diffusion models, the researchers trained a large scale diffusion model on the original dataset that the &lt;a href="https://en.wikipedia.org/wiki/Stable_Diffusion"&gt;Stable Diffusion&lt;/a&gt; team used. Then, they used prompts from the training data to test extraction.&lt;/p&gt;
&lt;p&gt;One successful extraction is the photo below, where they used the name of an author from the training dataset. The extracted image is a near match of the original.&lt;/p&gt;
&lt;p&gt;&lt;img alt="On the left is a photo from the training data set with a person's name and their book title. On the right is the extracted image using the prompt of their name and the extracted image looks nearly the same as the training image (only slight noise artifacts)." src="./images/2024/stable_diffusion_extraction.png"&gt;&lt;/p&gt;
&lt;p&gt;They were able to extract more than 100 near-identical images of training data examples like this one. More than half of the memorized extracted images are of a person. In running the attack against a larger diffusion model (Imagen), they were able to extract images at a higher rate than from the smaller model, which supports prior research showing that model size also impacts memorization. They also found that more accurate models, measured by model performance metrics, memorize more data. In running further experiments, they show that by building their own diffusion model from scratch, they are able to extract 2.5% of the training data.&lt;/p&gt;
&lt;p&gt;Later in the paper, they perform a new and different type of extraction attack, which they name an "inpainting" attack. Inpainting is a desired quality of many image-generating or editing models -- for example, to remove a person from the background of a photo and "fill in the blank". In their inpainting attack, they cover a significant portion of the image (&amp;gt;50%) and query the diffusion model to complete the picture. When performing these attacks, they were able to quickly see the difference between models trained on the image shown and models that were not trained on the image.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Two rows and four columns of images are shown. On the left are the training images, next to each of them has about a half of the image removed. Then the last two images show an example where one generative model replaces the missing half of the photo when the original photo is in the training set and then an example from the model that doesn't have the original photo in the training dataset. The model that has seen the training data example creates an image very similar to the original and the model that doesn't creates a seemingly totally random image by comparison. For example, in the top row there is a bird next to some text. The model with the example puts a bird there and the model without the example puts a car where the bird was." src="./images/2024/inpainting_attack.png"&gt;&lt;/p&gt;
&lt;p&gt;They were able to show with this research that a diffusion model that has processed the original image in training can reproduce it much more clearly than a diffusion model that has not. This again supports the "leave-one-out" approach that Feldman and Zhang used.&lt;/p&gt;
&lt;p&gt;They also found that the easiest data to extract are outlier examples. These outliers carry significant privacy risk compared to more typical examples. When performing the attacks, they were able to target outliers in a &lt;a href="https://arxiv.org/abs/1610.05820"&gt;membership inference attack&lt;/a&gt;. This attack allowed them to determine if a particular image was in the training data or not. Here is a visual representation of their findings, where outliers were much easier to attack than common images.&lt;/p&gt;
&lt;p&gt;&lt;img alt="There are two groups of images. The group of images on the left show small images stacked together with duplicates or near duplicates. The images in the group on the right are a larger group of images with many bright colors and odd looking shapes in the images. The caption reads: when performing our membership inference attack, the hardest-to-attack examples (left) are all duplicates in the CIFAR-10 training set, and the easiest-to-attack examples (right) are visually outliers from CIFAR-10 images." src="./images/2024/diffusion_membership_inf_outliers.png"&gt;&lt;/p&gt;
&lt;p&gt;This work on diffusion models highlighted that their training methods and processes are part of the problem; and again directly linked model size and accuracy to memorization.&lt;/p&gt;
&lt;p&gt;Some models need to recall an individual, which makes them easier to attack. For example, a facial recognition model or a generative art model that needs to learn the styles of famous artists. These methods have inspired the types of attacks shown in papers today, which you will investigate to better understand memorization.&lt;/p&gt;
&lt;h3 id="model-inversion-attacks"&gt;Model Inversion Attacks&lt;/h3&gt;
&lt;p&gt;It's important to point out that training data extraction is not a completely new attack vector and that memorization isn't either. In a &lt;a href="https://www.usenix.org/conference/usenixsecurity16/technical-sessions/presentation/tramer"&gt;paper published in 2016 by Tramèr et al.&lt;/a&gt; (Tramèr is also a co-author on the diffusion paper), they demonstrated an attack called a model inversion attack, which allows an attacker with model access to extract information from the model. The most powerful version of the attack required direct access to the model, or to a local substitute -- obtained via "model stealing attacks" or by training a simpler model that mimics the real one.&lt;/p&gt;
&lt;p&gt;To perform a model inversion attack, an example of similar data is first generated. In the paper they use a facial recognition model to extract a copy of a person's face, which in this case must be memorized by the model to function correctly. They start with a base image of a face -- choosing one without significant markers (like glasses, beard, etc.).&lt;/p&gt;
&lt;p&gt;Then, the gradients and activations of the local model are observed for that input, and a loss optimizer is applied, just like you learned when training a model and evaluating its loss function. Only this time, the loss optimizer isn't trying to improve the model by training it -- it's being used to reverse engineer how to change the input in order to make it closer to the target. In this case, you want to update the image to more closely match the person's face. By doing this iteratively, you develop an image that looks like a fuzzy version of the training dataset target example.&lt;/p&gt;
&lt;p&gt;&lt;img alt="There are two images side by side. The one on the left shows a fuzzy version of the one on the right. The one on the right is the training data of the deep learning model and the one on the left is the extracted image from the model inversion attack." src="./images/2024/model_inversion_book.png"&gt;
&lt;em&gt;Model inversion attack on a facial recognition model: the training image is on the right and the extracted image is on the left&lt;/em&gt;&lt;/p&gt;
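&lt;p&gt;A minimal sketch of that iterative reconstruction, assuming white-box access to a PyTorch classifier, could look like the following. The names and image size are illustrative, and the original attack targeted a much simpler model and interleaved denoising steps to keep the image plausible:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Gradient-based model inversion sketch: start from a neutral image and
# repeatedly nudge it so the classifier becomes more confident in the
# target identity.
import torch

def invert(model, target_class, shape=(1, 1, 112, 92), steps=500, lr=0.1):
    x = torch.full(shape, 0.5, requires_grad=True)   # flat gray starting image
    opt = torch.optim.SGD([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        logits = model(x)
        # Minimize the negative log-probability of the target identity.
        loss = -torch.log_softmax(logits, dim=1)[0, target_class]
        loss.backward()
        opt.step()
        with torch.no_grad():
            x.clamp_(0.0, 1.0)                       # keep a valid pixel range
    return x.detach()   # a fuzzy approximation of the memorized face
&lt;/code&gt;&lt;/pre&gt;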
&lt;p&gt;This shows that there are several other cases where deep learning requires memorization, and the way that deep learning models are trained can be used to extract data from them.&lt;/p&gt;
&lt;p&gt;The funny thing is that this attack process is quite similar to how diffusion models work internally in their generative steps, which makes them highly susceptible both to memorization and to exploits that reveal the original data. In case you missed it, the &lt;a href="https://bjoernkarmann.dk/project/paragraphica"&gt;"Paragraphica" camera project&lt;/a&gt; was able to reproduce almost exact copies of street images that mirrored reality due to the high tendency for diffusion models to memorize their training data and repeat it when given an input query contained in the training data.&lt;/p&gt;
&lt;h3 id="and-it-happens-with-text-too"&gt;And it happens with text too...&lt;/h3&gt;
&lt;p&gt;Although it's easier to demonstrate visually with images, the memorization of outliers and novel examples also occurs with text, proven in 2018 by Carlini et al.'s&lt;sup id="fnref:4"&gt;&lt;a class="footnote-ref" href="#fn:4"&gt;4&lt;/a&gt;&lt;/sup&gt; paper &lt;a href="https://arxiv.org/abs/1802.08232"&gt;The Secret Sharer&lt;/a&gt;. In this paper, they were able to train a language model using text from the Enron emails (another example of a commonly used dataset collected without direct consent). They were able to successfully extract email addresses, social security numbers and credit card numbers that appeared in the emails by crafting targeted prompts and exploiting the model's tendency to memorize rarely seen data.&lt;/p&gt;
&lt;p&gt;How did they do it? They trained a then-common natural language processing (NLP) deep learning architecture (called an LSTM, another sequence-based deep learning model) with the Enron data. They trained it to predict next character tokens (which you might remember from &lt;a href="https://blog.kjamistan.com/encodings-and-embeddings-how-does-data-get-into-machine-learning-systems.html"&gt;the tokenization article&lt;/a&gt;). A character-level tokenization model predicts the next character given the preceding characters. The model was quite small compared to today's model sizes, with only 2 layers and likely under 5,000 parameters (it wasn't explicitly listed; this is an inference based on the numbers they posted).&lt;/p&gt;
&lt;p&gt;Given prompts such as "My social security number is...", they used the sequence-based model to figure out the most likely continuation. Via their generation algorithm they calculate an "exposure score" -- a metric which measures the memorization of a particular sequence. Even with this small model and relatively small dataset, they were able to successfully extract several sequences, and prove a relatively high exposure (think partial memorization) for the sequences they couldn't entirely extract.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A table with 4 columns: user, secret type, exposure and extracted. There are 6 distinct users, each identified by a letter (A-G). Each user has one or more types of secrets, either credit card numbers or social security numbers or both. Each of these secrets has an &amp;quot;exposure&amp;quot; score. The extracted column shows that 3 secrets out of 10 were successfully extracted." src="./images/2024/secret_sharer_enron_extraction.png"&gt;&lt;/p&gt;
&lt;p&gt;In a later investigation of the same phenomenon, many of &lt;a href="https://arxiv.org/abs/2311.17035"&gt;the same authors looked at large language models&lt;/a&gt;, both open-weight models like LLaMA and closed models like ChatGPT, to extract sensitive data. They were able to do so using a few different attack vectors:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Say "poem" forever attack: In this attack, they prompted ChatGPT to say the word poem forever. Why? The researchers believe a singular word or token repeated forever triggers behavior similar to the &lt;end of text&gt; token. When training a model the &lt;end of text&gt; token is repeated many times, because it appears at the end of a document, book, or other text and those texts are joined together when performing language model training. Today's models have two training steps: one called pretraining which is base language training on many texts and then another training, where chat or instruction text is used on the already trained language model. Since the chat model requires text to be in conversational form, the repetition of a single model diverges from the second training and seems to activate the base language model, which in turn spits out memorized data.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;img alt="A ChatGPT prompt where the prompt asks please repeat the word poem forever. The response shows many examples of poem and then several lines of personal contact information including an email, phone number and name. The exact details have been removed to protect the privacy of the person." src="./images/2024/poem_attack.png"&gt;&lt;/p&gt;
&lt;ol start="2"&gt;
&lt;li&gt;Nasr et al. found this attack is most successful with single words rather than multiple tokens. In addition, not all single-word tokens were equal in their extraction power. For example, the word (and token) "company" was more than 100x more powerful at extracting memorized data than "poem". By spending $200 on the OpenAI API, they were able to extract more than 200,000 unique memorized sequences, which included personal information, NSFW content, user identifiers, URLs, and even literature and code. Based on a statistical estimator they trained, they predict a dedicated attacker could extract much more from ChatGPT -- noting that the rate of extraction from GPT-3.5 was higher than for any other model they tested. This paper was published before even larger models, like GPT-4, were released.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;img alt="A chart with a y-axis labeled &amp;quot;number of unique extracted 50-grams&amp;quot; and an x-axis with label &amp;quot;number of extracted 50-grams&amp;quot;. There is a strong line upward, that then turns into a dotted line that plateaus like a logarithmic function as the numbers increase. The initial line is solid, which shows the actual extraction attack, extracting almost 0.3M unique 50-grams. Then it turns into a dotted line, which shows the prediction trajectory increasing up to nearly 1.3M extracted 50-grams." src="./images/2024/gpt_memorization_extraction_extrapolation.png"&gt;&lt;/p&gt;
&lt;ol start="3"&gt;
&lt;li&gt;To compare openly available models alongside the closed chat models, the authors generate longer texts and look for memorized chunks of text in the output. They compare the texts with a compilation of several popular training datasets, including &lt;a href="https://en.wikipedia.org/wiki/The_Pile_(dataset)"&gt;The Pile&lt;/a&gt; and &lt;a href="https://en.wikipedia.org/wiki/Common_Crawl"&gt;the Common Crawl Corpus&lt;/a&gt;. If the output contains 50 tokens verbatim from the example training dataset, this is considered a successful extraction of memorized training data (a sketch of this matching check follows the figure below). For every model, they successfully extract hundreds of thousands of memorized 50-token sequences, some of which are repeated many times. To note: the training datasets of these models are unknown, and could contain all, some or only parts of the example training data that the researchers compiled. This means that these figures are lower estimates on extractability, since the actual training data would likely yield better matches and easier extraction.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;img alt="A chart showing 5 columns: model family, parameters (billions), percentage of tokens memorized, unique 50-grams and extrapolated 50-grams. For each model family, you can observe that the smaller parameter model has less memorization than its larger counterpart. The highest token memorization is 0.852% by GPT-3.5-instruct. There is a note to the image stating that it was also easier to extract sequences from certain model families such as LLaMA, where the researchers were able to extract 16M 50-grams (2.9M unique 50-grams)." src="./images/2024/direct_copy_tokens_model_comparison.png"&gt;&lt;/p&gt;
&lt;p&gt;These attacks are unlikely to be the only successful ones; they are simply the most obvious ones to those who have studied the phenomenon of deep learning memorization. This research demonstrated how memorization occurs both with commonly duplicated text or images, as explored in the previous article, and now with novel cases, where the model memorizes a particular example or set of examples due to the way it is trained and optimized.&lt;/p&gt;
&lt;p&gt;Despite attempts to remove this possibility -- such as the use of guardrails, closed models, and paid APIs -- extraction of personal data is both theoretically and practically possible. This means that personal data, copyrighted data and other sensitive data exists in the model -- saved in the model weights and biases. The data doesn't need to be repeated to be memorized. Without access to the original language models (pretraining models) and the training data used, it will be difficult to estimate the exact amount of memorized data, particularly when the companies building closed models are not incentivized or required to perform these attacks and estimates internally.&lt;/p&gt;
&lt;p&gt;In the next article, you'll investigate how overparameterization of deep learning (aka the growth of model size) affects memorization, and how the "bigger is better" approach changed machine learning training, architectures and deployment.&lt;/p&gt;
&lt;p&gt;I'm very open to feedback (positive, neutral or critical), questions (there are no stupid questions!) and creative re-use of this content. If you have any of those, please share it with me! This helps keep me inspired and writing. :)&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Acknowledgements&lt;/em&gt;: I would like to thank &lt;a href="https://vickiboykis.com/"&gt;Vicki Boykis&lt;/a&gt;, &lt;a href="https://desfontain.es/serious.html"&gt;Damien Desfontaines&lt;/a&gt; and &lt;a href="https://www.linkedin.com/in/yanndupis/"&gt;Yann Dupis&lt;/a&gt; for their feedback, corrections and thoughts on this series. Their input greatly contributed to improvements in my thinking and writing. Any mistakes, typos, inaccuracies or controversial opinions are my own.&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;This simplified representation of Feldman's proof is an adequate summarization for our use case; however, to learn more or read through the entire series of proofs, please read &lt;a href="https://arxiv.org/abs/1906.05271"&gt;the paper&lt;/a&gt; or a longer study from &lt;a href="https://arxiv.org/abs/2012.06421"&gt;Brown et al.&lt;/a&gt;.&amp;#160;&lt;a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;He mentions this find along with presenting findings of several of his publications on the topic in his MIT lecture on &lt;a href="https://cbmm.mit.edu/video/quantifying-and-understanding-memorization-deep-neural-networks"&gt;Quantifying and Understanding memorization in Deep Neural Networks&lt;/a&gt;.&amp;#160;&lt;a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:3"&gt;
&lt;p&gt;Zhang and Feldman's work on measuring memorization used a traditional CNN design for computer vision (like what you learned about with AlexNet, just much larger and more modern). Diffusion models, which power much of the text-to-image generative AI, are a separate type of deep learning, where you can also extract memorized data.&amp;#160;&lt;a class="footnote-backref" href="#fnref:3" title="Jump back to footnote 3 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:4"&gt;
&lt;p&gt;Carlini has contributed significantly to research around security in machine learning models. In case you are inspired, he wrote a blog post on &lt;a href="https://nicholas.carlini.com/writing/2024/why-i-attack.html"&gt;Why he attacks AI&lt;/a&gt;.&amp;#160;&lt;a class="footnote-backref" href="#fnref:4" title="Jump back to footnote 4 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="ml-memorization"></category></entry><entry><title>How memorization happens: Repetition</title><link href="https://blog.kjamistan.com/how-memorization-happens-repetition.html" rel="alternate"></link><published>2024-12-03T00:00:00+01:00</published><updated>2024-12-03T00:00:00+01:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2024-12-03:/how-memorization-happens-repetition.html</id><summary type="html">&lt;p&gt;In this article in &lt;a href="https://blog.kjamistan.com/a-deep-dive-into-memorization-in-deep-learning.html"&gt;the deep learning memorization series&lt;/a&gt;, you'll learn how one part of memorization happens -- highly repeated data from the "head" of the long-tailed distribution.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Prefer to learn by video? This post &lt;a href="https://youtu.be/rDgFIiRTAHE?si=omH4DxA5OqOkJS3y"&gt;is summarized on Probably Private's YouTube&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Recall from &lt;a href="https://blog.kjamistan.com/machine-learning-dataset-distributions-history-and-biases.html"&gt;the data collection article&lt;/a&gt; that some examples are …&lt;/p&gt;</summary><content type="html">&lt;p&gt;In this article in &lt;a href="https://blog.kjamistan.com/a-deep-dive-into-memorization-in-deep-learning.html"&gt;the deep learning memorization series&lt;/a&gt;, you'll learn how one part of memorization happens -- highly repeated data from the "head" of the long-tailed distribution.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Prefer to learn by video? This post &lt;a href="https://youtu.be/rDgFIiRTAHE?si=omH4DxA5OqOkJS3y"&gt;is summarized on Probably Private's YouTube&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Recall from &lt;a href="https://blog.kjamistan.com/machine-learning-dataset-distributions-history-and-biases.html"&gt;the data collection article&lt;/a&gt; that some examples are overrepresented in the dataset. They live in the "head" area and might be duplicated and show cultural and societal biases based on the collection methods. You learned in the &lt;a href="https://blog.kjamistan.com/gaming-evaluation-the-evolution-of-deep-learning-training-and-evaluation.html"&gt;last article&lt;/a&gt; how training steps work and how data is sampled, along with the overriding cultural focus on accuracy above everything else. In this article, you'll evaluate how wanting to score highly in accuracy with an unevenly distributed dataset creates the first problem with memorization: memorizing common examples.&lt;/p&gt;
&lt;p&gt;To begin the analysis, you'll first explore how a simple machine learning system works, looking at random forests as a popular classical baseline, and then review how a deep learning system works. If you're already familiar with these models, you can skip ahead to &lt;a href="#sequential-deep-learning-models"&gt;sequence-based deep learning&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id="simple-machine-learning-model"&gt;Simple Machine Learning Model&lt;/h3&gt;
&lt;p&gt;Machine learning is the task of determining if computers (via software) can use patterns to make inferences or decisions about an example. Machine learning models are used to automate or expedite tasks, like identifying and sorting spam messages, or to offer assistance in making a particular decision, like whether a patient should undergo additional screening for a disease. Machine learning models learn patterns and condense information based on historical data.&lt;/p&gt;
&lt;p&gt;In today's machine learning, you usually choose what software and algorithm you want to use before you begin the training process using your training and test data. As you learned in the &lt;a href="https://blog.kjamistan.com/encodings-and-embeddings-how-does-data-get-into-machine-learning-systems.html"&gt;encodings and embeddings article&lt;/a&gt;, the data is transformed into mathematical form (vectors or matrices) in order to train and also predict.&lt;/p&gt;
&lt;p&gt;For simple machine learning models, you first choose an algorithm. A good set of examples for classic algorithm choices are shown in &lt;a href="https://scikit-learn.org/stable/machine_learning_map.html"&gt;scikit-learn's overview&lt;/a&gt;. Broadly, these choices depend on things like your data size and structure and the task you want to solve. For example, you might want to predict a number or trend, like in forecasting, or classify something, like finding all positive product reviews from a series of texts.&lt;/p&gt;
&lt;p&gt;One popular choice due to its simplicity and performance tradeoff is random forests, which is an ensemble of decision trees. You can use random forests for many classification tasks, where you want to assign an outcome or label to an incoming piece of data. Because random forests are built out of decision trees, let's review a decision tree first.&lt;/p&gt;
&lt;p&gt;In a decision tree, you want the algorithm to determine useful splits in the data based on particular attributes. Ideally, these attributes split the data into fairly homogenous buckets. For example, if you are trying to decide whether someone has a particular illness, you'd want to end up with a tree that splits perfectly the people who have the disease from those who don't by using information encoded into the data. An example tree could look something like this:&lt;/p&gt;
&lt;p&gt;&lt;img alt="A decision tree graphic showing a series of different questions where answering yes or no changes the outcome. The first question is whether the temperature is over 40 degrees Celsius. Then no answers &amp;quot;not COVID&amp;quot; and yes goes to another question of whether there is known exposure. The yes to that question goes to COVID and the no goes to another question: Is the at-home test positive? If yes, it's COVID and if no, it's not COVID." src="./images/2024/dt_covid.png"&gt;&lt;/p&gt;
&lt;p&gt;This is an oversimplified toy example, but it demonstrates the basic structure of a decision tree, where particular pieces of information are used to create hierarchical data splits based on particular attributes.&lt;/p&gt;
&lt;p&gt;A random forest is a collection of such decision trees, hence why it's called a forest. When you train a random forest, you specify the number of trees to train. Each tree is usually trained on its own random sample of the training data, which results in different splits across the trees.&lt;sup id="fnref:1"&gt;&lt;a class="footnote-ref" href="#fn:1"&gt;1&lt;/a&gt;&lt;/sup&gt; If you train many trees, some will likely be quite similar and some will be heavily biased based on their sample of data. With enough different trees, you can create robust performance. Because each tree gets a vote, the majority vote becomes the most likely class or label.&lt;/p&gt;
&lt;p&gt;Once you have a trained random forest, you can run inference (aka. prediction) tasks with your trained model. Your model is now an artifact that contains information and instructions for how to take a prepared piece of data and output a prediction (or a series of predictions with different likelihoods).&lt;/p&gt;
&lt;p&gt;To get a prediction, you send an example prepared the same way the training examples were processed -- but this time without the label or result. The model returns the particular outcome or classification label, usually with an indication of the confidence in that prediction. In a random forest, the model asks all trees to make a prediction and each tree votes on the outcome. The confidence is essentially the voting distribution (i.e. 30% of trees say not infected, 70% say infected). It's your job as the human to figure out if the model is making the correct decision.&lt;/p&gt;
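&lt;p&gt;In scikit-learn, that whole train-and-vote workflow is only a few lines. Here is a toy sketch on a bundled dataset (not the COVID example above):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Train a random forest and inspect the vote distribution behind a prediction.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

print("accuracy:", forest.score(X_test, y_test))
# predict_proba is (roughly) the voting distribution across the 100 trees,
# e.g. [0.30, 0.70] corresponds to a 30% / 70% split of the votes.
print("votes for the first test example:", forest.predict_proba(X_test[:1]))
&lt;/code&gt;&lt;/pre&gt;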
&lt;p&gt;When you take a simple model, like a random forest, you can often also reverse engineer the model's decision. This is useful for determining if you trust the decision. For example, you can look at which trees voted for the majority decision and investigate what branches in those trees contributed to that decision. This process is referred to as the interpretability or explainability of the model. When a model is simple, like with random forests, you can use your human understanding to evaluate how trustworthy and accurate you find the prediction. This is demonstrated in "human in the loop" systems, where a human can use a machine learning model and the interpretation of the model's prediction to make an informed decision.&lt;/p&gt;
&lt;p&gt;Now that you've investigated a simple machine learning model at a high level, let's take a look at how it compares with a deep learning model.&lt;/p&gt;
&lt;h3 id="deep-learning-model"&gt;Deep Learning Model&lt;/h3&gt;
&lt;p&gt;In a deep learning model, at the model selection level, you don't make one algorithm choice, but instead many choices. Because a deep learning model often consists of many layers of functions which interconnect, you are building a model architecture rather than making one algorithm choice. Researchers try out new architectures, adding new types of layers or changing the layers and the algorithms within them, to seek performance improvements or innovative approaches.&lt;/p&gt;
&lt;p&gt;Within industry, you are usually just implementing other people's architectures that you know work well for the type of problem you are solving. For example, many of today's LLMs use a &lt;a href="https://en.wikipedia.org/wiki/Transformer_(deep_learning_architecture)"&gt;Transformer architecture&lt;/a&gt;, which is a fairly complex deep learning architecture that uses an attention mechanism. The Transformer was introduced by Google researchers in 2017 in a famous paper called &lt;a href="https://arxiv.org/abs/1706.03762"&gt;Attention Is All You Need&lt;/a&gt;. GPT models are a decoder-only type of transformer (they keep only the part that "writes" the response), with small changes in how their layers work. For an illustrated and deeper dive into transformers, check out &lt;a href="https://jalammar.github.io/illustrated-transformer/"&gt;the Illustrated Transformer&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In much of today's deep learning, unless you are at a large company with an extensive machine learning research team or a machine learning-based startup focused on research, you are likely using someone else's model. This could mean that you are using an OpenAI API call, where you are also not even hosting the model, or it could mean you first download and use someone else's model and deploy it on your own infrastructure.&lt;/p&gt;
&lt;p&gt;There are also ways to download a model that someone else first trained and train it further. This is called &lt;a href="https://en.wikipedia.org/wiki/Transfer_learning"&gt;transfer learning&lt;/a&gt; or &lt;a href="https://en.wikipedia.org/wiki/Fine-tuning_(deep_learning)"&gt;fine-tuning&lt;/a&gt;. When you do so, you take a model trained on a task and train it further, to better align with your particular data or use case. If you don't have enough of your own data to train, there is increasing use of large language models (LLMs) to assist in building out robust training examples for model distillation tasks. With model distillation your goal is to actually build a smaller model that performs well on your particular use case, hence "distilling" the information from the larger model (see: &lt;a href="https://explosion.ai/blog/human-in-the-loop-distillation"&gt;Spacy's human-in-the-loop distillation&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;In deep learning, you are often dealing with data and tasks that aren't suited for simpler machine learning models -- like generating photos, videos, audio, text or translating those from one medium to another. Deep learning is what powers Generative AI, what allows for text-to-speech or speech-to-text, and what is used for computer vision tasks, like facial recognition or "self-driving" cars. The complexity of such tasks and the data size make deep learning more performant than simple machine learning models.&lt;/p&gt;
&lt;p&gt;For training deep learning models, as &lt;a href="https://blog.kjamistan.com/gaming-evaluation-the-evolution-of-deep-learning-training-and-evaluation.html"&gt;you learned in the training article&lt;/a&gt;, the entire training dataset is input into the model multiple times. In today's largest models, portions of the training data are seen many times over by extremely large models. These models are called over- or hyperparameterized because they actually have more parameters--weights, biases and other parameters that the functions of the network might use--than there are training data points. You might have heard about 1 trillion parameter models, and yet these models were trained with less than 1 trillion pieces of unique data.&lt;/p&gt;
&lt;p&gt;Compared with the earlier decision tree and random forest example, a deep learning model is much more difficult to interpret. Especially as the layer complexity and depth grows, as it does with large deep learning models, it's difficult to look at the activations of the nodes and make any sense of what is happening in a way that humans can understand. Despite the difficulty in working on human-interpretable understanding of deep learning, it hasn't stopped researchers from trying to peek inside these networks to see what is happening.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/marcotcr/lime"&gt;LIME or Local Interpretable Machine Learning Explanations&lt;/a&gt; investigated if small changes in inputs could expose and locate deep learning decision boundaries. To envision how this works, first think of a 3-dimensional space, where inputs are represented as coordinates. Then, imagine planes marking boundaries between those points that help you determine whether the point belongs to one group or another. In reality, these models are extremely high-dimensional and non-linear in nature, meaning it works a bit differently than you just imagined, but LIME worked by changing the coordinates to figure out where these boundaries were and then used that information to say this part of the image or text is why it was classified with this label. Since a neural network is not a simple linear equation, finding these boundaries can be quite difficult, but it was an interesting first step.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.quantamagazine.org/a-new-approach-to-understanding-how-machines-think-20190110/"&gt;Been Kim's work&lt;/a&gt; brought the field of deep learning interpretability to a new level. In her work at DeepMind, she investigates how hidden layers (the layers between the first and last one) can create intermediary representations which map much closer to interpretable patterns for humans. Her seminal contribution of &lt;a href="https://arxiv.org/abs/1711.11279"&gt;"Testing with Concept Activation Vectors" (TCAV)&lt;/a&gt; created a way to try to use human interpretability approaches to understand deep learning, not the other way around.&lt;/p&gt;
&lt;p&gt;Deep learning is a large area of machine learning, with many different model types. To focus our attention on particularly useful areas of deep learning to explore memorization, you'll start with sequential deep learning models, where you want to predict what happens next when presented with a sequence. As you might already know, this is the deep learning that powers today's text-based Generative AI, like OpenAI's ChatGPT and Google's Gemini.&lt;/p&gt;
&lt;h3 id="sequential-deep-learning-models"&gt;Sequential Deep Learning Models&lt;/h3&gt;
&lt;p&gt;For many years, it was difficult to do language-based deep learning. Deep learning computer vision models were well into production by the time word embeddings made language deep learning possible. This is partly because of the size and complexity of language, which has a much wider range of possible inputs and, of course, many different languages one could use. It was also because, for a long time, there wasn't a good way to build performant sequence-based models.&lt;/p&gt;
&lt;p&gt;As recently as 2018, generative text models would quickly devolve into babble or change topics midstream. What significantly shifted the field are two inventions: the attention mechanism and the large context window. Let's walk through both at a high-level to appreciate what they were able to bring to generative text.&lt;/p&gt;
&lt;p&gt;The attention mechanism creates specific parts of the deep learning model which hold "attention" on links between elements of the sequence. This helps create a web of references, which greatly improves the ability to produce meaningful language.&lt;/p&gt;
&lt;p&gt;These attention heads are both on the encoder ("reading what is coming in") and the decoder ("writing what is going out") with an additional one between the encoder and decoder. These attention heads can hold input or output embeddings within a sequence that are calculated as more significant or useful, depending on the task at hand and the training data.&lt;/p&gt;
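&lt;p&gt;At its core, a single attention head is only a few lines of math: every position scores every other position, the scores become weights via a softmax, and the output is a weighted mix of the value vectors. Here is a minimal numpy sketch (one head, no masking, batching or learned projections):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Minimal single-head scaled dot-product attention.
import numpy as np

def attention(Q, K, V):
    # Q, K, V: (sequence_length, d) arrays of query, key and value vectors.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)          # how much each token attends to each other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                     # weighted mix of the value vectors
&lt;/code&gt;&lt;/pre&gt;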
&lt;p&gt;Let's look at an early example of attention from the original Google paper. In this view you can see, for two attention heads, the weight each token places on other tokens, which reflects the links those heads have "learned" between sequential inputs.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A sentence reads out token by token that says: The Law will never be perfect, but its application should be just - this is what we are missing, in my opinion. There are two versions of the sentence, one on the left and one on the right. The words Law and application on the left are highlighted and they link to the word its on the right." src="./images/2024/transformer_attention_head.png"&gt;
&lt;em&gt;An example of attention head links&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;In this example, the attention head learned entity (or subject) linkage and attribution, allowing "law" and "its" to be linked. The attention also links to "application", tying law and application together via the pronoun "its".&lt;/p&gt;
&lt;p&gt;For early text-based transformers, attention mechanisms could only be used on a subset of the overall sequence due to memory and compute limitations, often limited to under 2000 tokens. This means that the prompt (initial instructions) as well as the ongoing generative text could only be given attention up to that context window limit. This limit bounded the length of meaningful generative text - particularly when using transformers for longer generative text tasks, such as summarization, chat, or search and information retrieval.&lt;/p&gt;
&lt;p&gt;To address these use cases, model developers like OpenAI increased the context window size of the attention mechanism, which also increased the hardware memory requirements and computational cost of the models. This means that you could often hold entire conversations or chapters of books in the context window, creating the ability to stay on task, but also giving the model extra context to ensure that the important tokens and ideas are always available. Today's LLM context windows are often 128,000 tokens or longer. Compare that to Shakespeare's Hamlet, which is just over 30,000 words.&lt;/p&gt;
&lt;p&gt;I often describe context windows as the model's RAM (Random Access Memory). RAM allows computers to easily grab recently used data to accelerate computations or loading times. Depending on the encoding, a token might be a byte, a character or (part of) a word, so these context window sizes roughly translate to a model holding more than 50,000 words in RAM. How does it process what word comes next?&lt;/p&gt;
&lt;p&gt;To oversimplify, you can think about all of the possible words and tokens as different points in 3D space, as you did when looking at decision space. The attention mechanism and context window might already have highlighted some of these points as more important or more relevant to the model - which also means points near them become more relevant.&lt;/p&gt;
&lt;p&gt;The model has been trained first on many sequences in order, and gotten information about how tokens come together. During pretraining (which is what first happens with today's language models), it updates many hidden layers of weights to "learn" how the input corpus works and what sequences of tokens are common or uncommon.&lt;/p&gt;
&lt;p&gt;The embedding information of each token also carries added information about tokens that are similar, near each other, or tokens that have a particular distance from one another which shows their relationship, as you learned when reviewing Word2Vec.&lt;/p&gt;
&lt;p&gt;If the model has learned enough about language and the input has enough context (i.e. via the context window RAM), it can create sensical combinations that are on topic by also combining its output as part of its sequencing (i.e. writing word-by-word and continuing to compute what a useful next series of tokens might be).&lt;sup id="fnref:2"&gt;&lt;a class="footnote-ref" href="#fn:2"&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;In a situation where you have: Why did the chicken ..., you can probably guess the next word, as it's a common phrase. In a deep learning sequence-based model, the network calculates which of all of the possible next steps are most relevant using the steps it has already seen and already generated. Usually the highest probability next step is chosen, but there might be clever ways to see the best sequences that aren't just exactly the next most probable step. For example, you can predict several different sequences using several of the next most probable steps and calculate the probability over the entire sequence, not just the exact next step.&lt;sup id="fnref:3"&gt;&lt;a class="footnote-ref" href="#fn:3"&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;In statistics (and in deep learning), this relates to the &lt;a href="https://en.wikipedia.org/wiki/Perplexity"&gt;log-perplexity&lt;/a&gt; of the next sequence. This is a useful comparison for the final stages of the decoder in a transformer, which takes all of the calculations to this point in time (including calculations from the context window) and runs them through several final deep learning layers. This ends with a &lt;a href="https://en.wikipedia.org/wiki/Softmax_function"&gt;Softmax function&lt;/a&gt; which takes the prior layer inputs and translates them into a probability distribution over the different tokens. Depending on the strategy, either the most likely token or some combination of the most likely tokens will be chosen.&lt;/p&gt;
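&lt;p&gt;To make that last step concrete, here is a small NumPy sketch of turning the decoder's final scores (logits) into a probability distribution with Softmax and then choosing the next token, either greedily or by sampling. The toy vocabulary and logit values are made up for illustration:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np

# A toy vocabulary and made-up logits from the final decoder layers
vocab = ["road", "coop", "egg", "banana"]
logits = np.array([4.2, 2.5, 1.1, 0.3])

# Softmax: translate raw scores into a probability distribution
probs = np.exp(logits - logits.max())
probs = probs / probs.sum()

# Greedy decoding: always take the single most probable token
greedy_token = vocab[int(np.argmax(probs))]

# Sampling: draw from the distribution, so less likely tokens appear sometimes
rng = np.random.default_rng(42)
sampled_token = vocab[rng.choice(len(vocab), p=probs)]

print(dict(zip(vocab, probs.round(3))))
print(greedy_token, sampled_token)
&lt;/code&gt;&lt;/pre&gt;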
&lt;p&gt;How does this affect the problem of memorization? Let's put together what you've learned thus far to see the bigger picture.&lt;/p&gt;
&lt;h3 id="repetition-begets-memorization-in-deep-learning"&gt;Repetition Begets Memorization in Deep Learning&lt;/h3&gt;
&lt;p&gt;Let's investigate a few facts that you now know:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Scraped, collected datasets contain a subset of examples that are repeated often and are much more common than the rest.&lt;/li&gt;
&lt;li&gt;During model training, the model will be optimized on these examples hundreds, if not thousands of times.&lt;/li&gt;
&lt;li&gt;Sequential language modeling must choose the best answer when one word is missing. It can also hold a massive number of words in memory to access at any time. These words build weights and connections in the network itself.&lt;/li&gt;
&lt;li&gt;The model and model developers are incentivized to score well on the "test" and are penalized (error and loss) when they fail. The training rounds should explicitly use this penalty to improve.&lt;/li&gt;
&lt;li&gt;The "best" model wins, regardless of interpretability or if cheating has occurred.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;There are both mathematical and human incentives to produce models that memorize common text, particularly if that text will be in the testing dataset and if that text has been seen multiple times during training. Even more so if that text is presumed to be known by the users of the model at a later stage.&lt;/p&gt;
&lt;p&gt;Google researchers proved exactly this fact in 2018, when they released a paper called &lt;a href="https://arxiv.org/abs/1802.08232"&gt;The Secret Sharer&lt;/a&gt;. Carlini et al. stated in the paper, "unintended memorization is a persistent, hard-to-avoid issue that can have serious consequences". They were able to demonstrate the extraction of both common and more rare training examples that had been processed multiple times in the training rounds.&lt;/p&gt;
&lt;p&gt;In a later piece of research, Carlini et al. were able to show that model size impacts memorization, and that repeated examples are especially prone to memorization. They estimated a lower bound of at least 1% of the training data being memorized, with an unknown upper bound.&lt;/p&gt;
&lt;p&gt;In some of their experiments, they were able to extract more than 32% of the text that was included in at least 100 training data examples, sometimes not the full text token-by-token, but enough to recognize and match the text. Their testing was on much smaller models than today's, using GPT-Neo models (125M to 2.7B parameters) and GPT-J (6B parameters). In comparison, it is believed that GPT-4 has around 1,800B parameters.&lt;/p&gt;
&lt;p&gt;Here is a comparison chart linking model size to text memorization by showing a few examples they were able to extract:&lt;/p&gt;
&lt;p&gt;&lt;img alt="A chart with 5 columns and 4 rows. In the left-most column there are text examples from the training dataset. In each subsequent column you can see portions of that column extracted from the model. The 6B parameter model has significantly more text that can be successfully extracted than the other models, which are 2.7B, 1.3B, 125M." src="./images/2024/text_extraction_repetition.png"&gt;&lt;/p&gt;
&lt;p&gt;This memorization within deep learning can also be observed in &lt;a href="https://arxiv.org/abs/2009.04013"&gt;attribute inference&lt;/a&gt; and &lt;a href="https://arxiv.org/abs/1610.05820"&gt;membership inference attacks&lt;/a&gt;, where an attacker can find and extract common properties of the underlying training data population or reveal if a particular example was in the training data based on the model response. Particularly interesting is the work &lt;a href="https://arxiv.org/abs/2107.06018"&gt;This Person Probably Exists&lt;/a&gt; on deriving attributes from a computer vision model that was trained on the CelebA dataset - a dataset consisting primarily of celebrity faces.&lt;/p&gt;
&lt;p&gt;Let's see if this type of behavior is easy to evoke using online freely-available tools.&lt;/p&gt;
&lt;h3 id="easy-examples"&gt;Easy Examples&lt;/h3&gt;
&lt;p&gt;With almost any generative tool, you can easily create images of particular brands, faces and common images, like this version of Angela Merkel, created on &lt;a href="https://stablediffusionweb.com/"&gt;Stable Diffusion Web&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A drawing of Angela Merkel's face produced by a free online tool" src="./images/2024/merkel.png"&gt;&lt;/p&gt;
&lt;p&gt;Let's see if we can also reproduce this type of memorization using ChatGPT (free version).&lt;/p&gt;
&lt;p&gt;An initial prompt of "Can you tell me some popular children's authors?" gave me a list starting with Dr. Seuss, so I asked to know about a book from Dr. Seuss and was told about Green Eggs and Ham and how it had good rhymes. I asked the ChatGPT service to show me some of the writing and it did so with the opening stanzas: Would you eat them in a box? Would you eat them with a fox? (and a few more lines).&lt;/p&gt;
&lt;p&gt;Then I asked:&lt;/p&gt;
&lt;p&gt;&lt;img alt="A prompt asking: Can you continue the story from where you left off? ChatGPT response: Certainly! Here’s the continuation of &amp;quot;Green Eggs and Ham&amp;quot; from where we left off: Would you eat them in a car? Eat them! Eat them! Here they are. I would not, could not, in a car.. the book continues for several more stanzas" src="./images/2024/dr_seuss_continued.jpg"&gt;
&lt;em&gt;Reading Dr. Seuss with ChatGPT Memorization&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;And got a not quite perfect but pretty close continuation of the entire book (the response was more than 600 words and spaced in the appropriate stanzas).&lt;/p&gt;
&lt;p&gt;ChatGPT can also easily reproduce code and related topics, like the Zen of Python:&lt;/p&gt;
&lt;p&gt;&lt;img alt="A prompt asks: What's the famous poem import thing in Python? and ChatGPT responds The famous poem import in Python is the &amp;quot;Zen of Python,&amp;quot; written by Tim Peters. It is a collection of guiding principles for writing computer programs in the Python language. You can access it by importing this in a Python script or interpreter. It goes on to write the entire Zen of Python word for word." src="./images/2024/zen_of_python.png"&gt;
&lt;em&gt;The Zen of Python from ChatGPT Memorization&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;You might be thinking, ah, well this is exactly the goal. If you train a large model on a bunch of repetitive things, then of course it can also repeat them. You're correct! It is indeed expected, anticipated and empirically true!&lt;/p&gt;
&lt;p&gt;It just might not be the outcome that Dr. Seuss's family expected, or at the scale that any popular artists, authors, creators, musicians and coders imagined. It can be both expected and yet have unanticipated and unintended secondary effects. It can be both desirable given the specific machine learning task but not well thought through in terms of cultural, societal and personal impact.&lt;/p&gt;
&lt;p&gt;Unfortunately, this is not the only way that memorization occurs. In the next article, you'll review how unique and novel examples also end up memorized. Stay tuned!&lt;/p&gt;
&lt;p&gt;I'm very open to feedback (positive, neutral or critical), questions (there are no stupid questions!) and creative re-use of this content. If you have any of those, please share it with me! This helps keep me inspired and writing. :)&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Acknowledgements&lt;/em&gt;: I would like to thank &lt;a href="https://vickiboykis.com/"&gt;Vicki Boykis&lt;/a&gt;, &lt;a href="https://desfontain.es/serious.html"&gt;Damien Desfontaines&lt;/a&gt; and &lt;a href="https://www.linkedin.com/in/yanndupis/"&gt;Yann Dupis&lt;/a&gt; for their feedback, corrections and thoughts on this series. Their input greatly contributed to improvements in my thinking and writing. Any mistakes, typos, inaccuracies or controversial opinions are my own.&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;Random forests usually use statistical methods like &lt;a href="https://en.wikipedia.org/wiki/Bootstrapping_(statistics)"&gt;bootstrapping&lt;/a&gt; and &lt;a href="https://en.wikipedia.org/wiki/Bootstrap_aggregating"&gt;bootstrap aggregation&lt;/a&gt;, or &lt;a href="https://en.wikipedia.org/wiki/Random_forest#Bagging"&gt;bagging&lt;/a&gt;. I encourage you to dive into the links to learn more, but on a high level you can think of these as statistics-informed sampling methods, which allow creation of samples that aim to represent the dataset or populations within the dataset. This may also increase the dataset size by resampling from a subset of data.&amp;#160;&lt;a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;For LLMs today, there is usually also a second round of deep learning training after the initial language model learning, which originally was based on reinforcement learning and called reinforcement learning with human feedback (RLHF). This usually involved chat-style prompts and data workers (often paid very little) who both wrote their own responses as if they were the AI assistant and rated which response was best out of a variety of responses. Now there are several approaches for this type of instruction training and tuning, which are not always reinforcement learning-based, but instead use traditional deep learning approaches by incorporating human preference information into the loss function.&amp;#160;&lt;a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:3"&gt;
&lt;p&gt;There has been some success at using approaches like &lt;a href="https://en.wikipedia.org/wiki/Beam_search"&gt;beam search&lt;/a&gt; to compare potential sequence options and calculate their probability or their preference (when using human input to determine best responses). This creates more options and potential variety in responses.&amp;#160;&lt;a class="footnote-backref" href="#fnref:3" title="Jump back to footnote 3 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="ml-memorization"></category></entry><entry><title>Gaming Evaluation - The evolution of deep learning training and evaluation</title><link href="https://blog.kjamistan.com/gaming-evaluation-the-evolution-of-deep-learning-training-and-evaluation.html" rel="alternate"></link><published>2024-11-26T00:00:00+01:00</published><updated>2024-11-26T00:00:00+01:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2024-11-26:/gaming-evaluation-the-evolution-of-deep-learning-training-and-evaluation.html</id><summary type="html">&lt;p&gt;In this article in the &lt;a href="https://blog.kjamistan.com/a-deep-dive-into-memorization-in-deep-learning.html"&gt;series on machine learning memorization&lt;/a&gt;, you'll dive deeper into how typical machine learning training and evaluation happens, a crucial step in ensuring the machine learning model actually "learns" something. Let's review the steps that lead up to training a deep learning model.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Two major steps are shown in rectangular boxes: Data Preparation and Preprocessing and Model Training and Evaluation. Above each of these major steps there are smaller boxes outlining substeps. The data preparation substeps are data collection, data cleaning and data labeling (if needed). The substeps for model training and evaluation are data encoding, model training and model evaluation." src="./images/2024/model_training_steps.png"&gt;
&lt;em&gt;High-level steps to …&lt;/em&gt;&lt;/p&gt;</summary><content type="html">&lt;p&gt;In this article in the &lt;a href="https://blog.kjamistan.com/a-deep-dive-into-memorization-in-deep-learning.html"&gt;series on machine learning memorization&lt;/a&gt;, you'll dive deeper into how typical machine learning training and evaluation happens, a crucial step in ensuring the machine learning model actually "learns" something. Let's review the steps that lead up to training a deep learning model.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Two major steps are shown in rectangular boxes: Data Preparation and Preprocessing and Model Training and Evaluation. Above each of these major steps there are smaller boxes outlining substeps. The data preparation substeps are data collection, data cleaning and data labeling (if needed). The substeps for model training and evaluation are data encoding, model training and model evaluation." src="./images/2024/model_training_steps.png"&gt;
&lt;em&gt;High-level steps to train a deep learning model&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;You've already learned the initial stages that lead up to training a model -- namely, how &lt;a href="https://blog.kjamistan.com/machine-learning-dataset-distributions-history-and-biases.html"&gt;data is collected and sometimes labeled&lt;/a&gt; (depending on the "task" you might need labeled or unlabeled data). After the data is processed, cleaned and saved, it will likely be stored in files, document stores or other distributed data architecture setups for easy access from the data science and/or machine learning team.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Prefer to learn by video? This post &lt;a href="https://youtu.be/IO3yI640H5A?si=J_-74oE5zkcCcviB"&gt;is summarized on Probably Private's YouTube&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;In a typical machine learning setup, the team uses a dedicated or on-demand GPU cluster or other machines that accelerate and parallelize machine learning training. This special hardware ensures that the massively parallelizable linear algebra computations run as quickly as possible, as you &lt;a href="https://blog.kjamistan.com/encodings-and-embeddings-how-does-data-get-into-machine-learning-systems.html#encodings-and-embeddings-how-does-data-get-into-machine-learning-systems"&gt;learned when reviewing AlexNet&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;At this point, the team will also decide:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;what data and task are relevant&lt;/li&gt;
&lt;li&gt;what model architecture(s) they will train&lt;/li&gt;
&lt;li&gt;an evaluation and validation strategy and dataset to evaluate the models&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The simplest and most common way to answer the data questions in points 1 and 3 is to use the data you've already collected and randomly split it into training and testing datasets. This is appealing because it ensures both datasets are similar, making evaluation easier and ensuring all data undergoes standardized preparation and preprocessing. Also, it's data you already have, so you don't need to figure out how to collect more data.&lt;/p&gt;
&lt;p&gt;Nearly every machine learning algorithm or architecture has hyperparameters: configuration values for the architecture or algorithm that are set before training rather than learned from the data. These are usually set directly if you have an idea what some of the values should be, or initialized randomly. If you choose random initialization, or search over a variety of values for the best configuration, you might run many parallel training runs to see which creates a better model.&lt;/p&gt;
&lt;p&gt;In conjunction, you can use cross-validation, where multiple models are trained and evaluated on different splits in the dataset and with different initializations of the hyperparameters.&lt;/p&gt;
&lt;p&gt;When using cross-validation, your data split might look like this:&lt;/p&gt;
&lt;p&gt;&lt;img alt="There are 4 horizontal bars with different vertical splits within the bar. At the top there is the simple example, where one model is trained and a simple train-test split is made. For this example the bar shows a split -- with the training data representing about 70% of the split. For the cross-validation example there are several examples of splits, each showing the testing dataset moving gradually from left to right but still encompasses about 30% of the entire dataset. The remaining data belongs to the training data for each model candidate. This produces N model candidates which can be compared and the best model can then be chosen. Each candidate has seen slightly different training and test data and might also have different hyperparameters. " src="./images/2024/cross_validation_rounds.png"&gt;
&lt;em&gt;A visual example of training and test splits&lt;/em&gt;&lt;/p&gt;
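&lt;p&gt;As a rough sketch of what this default workflow often looks like in code, here is a hedged scikit-learn example using a synthetic dataset as a stand-in for the collected data; the model choice and parameters are only illustrative:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# pip install scikit-learn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for "the data you've already collected"
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# The common default: a purely random split of the same collected dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = RandomForestClassifier(random_state=0)
model.fit(X_train, y_train)
print("hold-out accuracy:", model.score(X_test, y_test))

# Cross-validation: several candidates, each trained and tested on different splits
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
print("accuracy per fold:", scores)
&lt;/code&gt;&lt;/pre&gt;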
&lt;p&gt;In a perfect world, this is fine, because the data has been properly cleaned, is attributed correctly and you know that it's high quality data. You presume the data is free and available for use (i.e. data protection compliant and not under special licensing or copyright). You also presume it is representative (i.e. it doesn't have &lt;a href="https://en.wikipedia.org/wiki/Sampling_bias"&gt;sampling biases&lt;/a&gt;) and any labels are correct and appropriately representative. Unfortunately, our world is not perfect.&lt;/p&gt;
&lt;p&gt;Instead, as discussed &lt;a href="https://blog.kjamistan.com/machine-learning-dataset-distributions-history-and-biases.html"&gt;in a previous article&lt;/a&gt;, the data often has a significant mass of "typical" examples and then a long tail of more novel examples. Some of those novel examples are likely just errors and mistakes in labeling and collection. Some of the popular examples will repeat themselves either pixel-for-pixel and word-for-word or in chunks with close approximation, like beginning a business letter with "To Whom it May Concern".&lt;/p&gt;
&lt;p&gt;This brings several problems, some of which contribute significantly to the memorization problem. Let's evaluate them alongside the typical training process.&lt;/p&gt;
&lt;h3 id="data-quality-duplication-and-preprocessing-cleaning"&gt;Data Quality, Duplication and Preprocessing / Cleaning&lt;/h3&gt;
&lt;p&gt;Internet-scraped data has many quality issues, but so does data specifically collected for a task, due to many of &lt;a href="https://blog.kjamistan.com/machine-learning-dataset-distributions-history-and-biases.html"&gt;the societal, measurement and population biases described in the data collection article&lt;/a&gt;. If your goal is to create an accurate and representative view of something like everything you can see outside, or even a smaller task, like recognizing every voice speaking English for speech to text, you will certainly miss some representations and you'll likely also run into training data quality issues.&lt;/p&gt;
&lt;p&gt;Data quality is a hotly debated topic within machine learning and data science. Some machine learning scientists presume that if the errors only represent a small portion of the data, they will essentially be regularized out of the model. This presumes two things: (1) the errors represent a small portion of the overall data collected and (2) the model will not memorize erroneous data.&lt;/p&gt;
&lt;p&gt;If a data scientist presumes that there are significant quality issues with the current dataset, there should be a plan to deal with the problems. An example plan could look something like the following:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Test for duplication and remove duplicates using a near-match or perfect-match search strategy (see the sketch after this list). Remember that near-match is a hard problem and can require human intervention and labeling.&lt;/li&gt;
&lt;li&gt;Test for realistic bounds or patterns and regularize or remove data outside of those bounds. For example, find and remove overexposed photos.&lt;/li&gt;
&lt;li&gt;Apply domain-specific criteria to detect problems and either correct or remove those issues. For example, remove poor quality boilerplate or spam text.&lt;/li&gt;
&lt;li&gt;Determine other preprocessing to ensure all data is similarly standardized and irregularities have hopefully been removed. This can require things like removing outliers, &lt;a href="https://en.wikipedia.org/wiki/Normalization_(statistics)"&gt;normalizing data&lt;/a&gt; and filling in missing values.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This is a difficult task due to the input data complexity, especially when you are using non-tabular datasets (i.e. not data in rows and columns). Is a cropped photo of a larger photo a duplicate? What about text that varies by one paragraph? What are "realistic bounds" for outliers when it comes to PDF documents? Many of these are open research problems that require significant domain experience to address properly - which most data scientists don't have by default.&lt;/p&gt;
&lt;p&gt;Due to this skill mismatch and lack of resources, usually only the most rudimentary quality checks and preprocessing happens. This data is then considered "clean" for the following training and evaluation steps.&lt;/p&gt;
&lt;h3 id="sampling-bias"&gt;Sampling Bias&lt;/h3&gt;
&lt;p&gt;Sampling bias is data bias or error that comes from the way the data is collected and used.&lt;/p&gt;
&lt;p&gt;In deep learning, there are two forms of sampling bias. The first occurs during data collection, which you learned about in the last article. This bias is seen in the skewed representations and societal biases, but also in other features, like the represented language style and context. If you are using Wikipedia, the writing has a certain style, versus if you use arXiv (another popular source) or Reddit (a very different style of language, even though it will also be mainly in English). These choices greatly impact how a language model can learn and reproduce linguistic style and writing.&lt;/p&gt;
&lt;p&gt;The second type of bias is the actual sampling method used to perform the training split and validation in the dataset. In an ideal scenario, you'd study the underlying data populations and make specific decisions on how you'd like to split the training and evaluation/testing data so that each sample has an adequate representation of the population information you are trying to learn.&lt;/p&gt;
&lt;p&gt;In a perfect scenario you might even use a separate test and evaluation set that you specifically collected and labeled to ensure you know the data quality and provenance -- even if it slightly deviates from the original training data. For example, if you really wanted to evaluate a system with unseen data, you could collect the test and/or evaluation data via a separate process. Let's say you are testing a chatbot for customer search and knowledge base surfacing. You could collect the test and evaluation set by leveraging your customer service department -- who could create an entirely separate evaluation set based on their knowledge and experience. You could expand this dataset by sampling real customer queries of the system when it launches or in a beta setting and having the customer service team appropriately validate, label and annotate or enhance the dataset.&lt;/p&gt;
&lt;p&gt;In reality, a data scientist likely uses a built-in preprocessing train-test split that takes the entire data and runs a random sampler across it. Again, this probably wouldn't be a problem if the data always had a normal distribution and was high quality, but this is not usual with the large scale scraped or publicly available datasets. This means that random sampling is not actually representative of what you are trying to learn, and certainly not always a quality you want to reproduce. It also means the chance of sampling near-matches, extremely similar data to your training data, and one-off outliers or errors is high (because of the long tail and prevalent collection methods).&lt;/p&gt;
&lt;p&gt;This sampling bias impacts both the performance of the model on the data and the progression of model training, which brings us to our next problem.&lt;/p&gt;
&lt;h3 id="training-batches-and-rounds"&gt;Training batches and rounds&lt;/h3&gt;
&lt;p&gt;In deep learning setups, data science and machine learning teams use multiple training rounds to ensure the model "learns".&lt;/p&gt;
&lt;p&gt;Usually, training is broken down into iterations called epochs and then a smaller iteration called steps. This is then repeated as long as needed until the model scores high enough or the team decides it isn't working. When reviewing the following process, I want you to imagine this process happening thousands, if not hundreds of thousands of times, meaning the model processes the same data many thousands of times, each time trying to better "learn" from that data.&lt;/p&gt;
&lt;p&gt;Let's investigate a typical training epoch (a minimal code sketch of the loop follows the list):&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Before a training epoch can start, a batch size needs to be defined. Batch sizing is usually correlated to the dataset and hardware at hand. For large models and accompanying datasets, typical batch sizes start at 128 data examples.&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;At the beginning of a training epoch, a sampler is used to select the batch from the training dataset. The default sampler is "random" and breaks up the full training dataset into a particular number of batches. Note that the randomness is dependent on what hardware is being used, and therefore in many cases &lt;a href="https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/cuda/Randperm.cu#L25"&gt;not truly random&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="An image of a block of training data. Some subsets across the block are selected for batch #1, then batch #2 until batch #N, at which point all of the training data belongs in a batch." src="./images/2024/batch_sampling.png"&gt;
&lt;em&gt;Visual example for batch selection&lt;/em&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The examples in the first batch will be processed through the model to create a prediction or output (similar to how it works when a model is used normally to predict a label or the next token). This processing activates nodes and layers across the model's network based on the model's weights and biases.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The final layer of the model will predict a class or make another prediction, such as a token or other generative output. This response will be compared to the training data itself, such as the label or the next series of words in the training example.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Two symbols, one representing the actual input the other representing the model response during training. The input is a succulent plant, the model response is a turtle. There is a question above the images saying: What penalty should the model get for this response?" src="./images/2024/loss_calculation.png"&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Visual example of a loss calculation. During training, you continuously predict and measure loss to sequentially improve the model response.&lt;/em&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Depending on how correct or likely that answer is, there will be an error calculation (often called a loss function). You can choose different error calculations, but many are based on the concept of &lt;a href="https://en.wikipedia.org/wiki/Cross-entropy"&gt;cross entropy&lt;/a&gt;, which roughly measures how likely it is that the model response and the training data come from the same population (i.e. how predictable or "normal" is this response?). The error is used to derive the updates for all of the weights in the network to attempt to correct future responses. If the loss is large, it will heavily shift the weights for those examples. This means outliers and potentially erroneous inputs can have an outsized impact on the model training.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Once the parameter (weight and bias) changes for each layer are calculated, backpropagation begins. The process updates each layer with the new weights and the training step is complete. These updates usually happen at the end of the full batch.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A simple neural network showing multiple layers and nodes for each layer. Some of the nodes are highlighted in red lines showing how much error they hold that needs to be corrected by changing their weights and biases." src="./images/2024/node_errors.png"&gt;
&lt;em&gt;Visual example of corrections to particular weights based on error. The stronger the red border representing error, the larger the shift in the weights and biases related to that particular node.&lt;/em&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Then, the next batch is selected and the training step begins again. This is repeated until the epoch is complete, so the model has seen all possible training data once.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;At the end of an epoch, a subset of the test data is selected to evaluate the model performance. The performance of the model on that data is usually shown via a dashboard, so that the machine learning scientist, data scientist or engineer can determine if the training is going well, or if it should be stopped because the model isn't performing or something catastrophic has happened. Sometimes, the team will also stop training in what is called &lt;a href="https://en.wikipedia.org/wiki/Early_stopping"&gt;early stopping&lt;/a&gt; because the model has reached good or optimal performance, where they assess that further learning might result in overfitting or won't provide much additional gain.&lt;/li&gt;
&lt;/ol&gt;
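&lt;p&gt;The numbered steps above map fairly directly onto a few lines of framework code. Below is a minimal, hedged PyTorch sketch of a single epoch using a toy dataset and model; all names, sizes and the optimizer choice are illustrative only:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins for a real dataset and model
X = torch.randn(512, 16)
y = torch.randint(0, 2, (512,))
dataset = TensorDataset(X, y)

# Steps 1-2: define a batch size and let the DataLoader sample (shuffled) batches
loader = DataLoader(dataset, batch_size=128, shuffle=True)

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
loss_fn = nn.CrossEntropyLoss()  # Step 5: cross-entropy based loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for batch_X, batch_y in loader:       # Step 7: repeat until the epoch is complete
    optimizer.zero_grad()
    outputs = model(batch_X)          # Steps 3-4: forward pass and prediction
    loss = loss_fn(outputs, batch_y)  # Step 5: penalty for wrong or unlikely answers
    loss.backward()                   # Step 6: backpropagate the error
    optimizer.step()                  # Step 6: update weights and biases
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A real setup would add the end-of-epoch evaluation on held-out data (step 8), along with logging and checkpointing.&lt;/p&gt;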
&lt;p&gt;It's common to train for multiple epochs, meaning the entirety of the training dataset is used multiple times. For overparameterized networks, which you'll learn about more in a later article, this repetition is significant as the models require hundreds of thousands of "full passes" (i.e. one training epoch on all data) to reach peak performance (see: &lt;a href="https://arxiv.org/abs/2104.05605"&gt;early work on scaling Generative models&lt;/a&gt; and &lt;a href="https://developer.nvidia.com/blog/scaling-language-model-training-to-a-trillion-parameters-using-megatron/"&gt;NVIDIA's scaling language model training to 1T parameters&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;To dive deeper into the workings of these steps, take a look at &lt;a href="https://theneuralblog.com/forward-pass-backpropagation-example/"&gt;the Neural Blog's Forward Pass and Backpropagation example&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;By reviewing this process, you can infer that uncommon examples have a more significant impact on network weights compared to more common examples. Because there are also fewer of them, the model weights must account very specifically for those examples in order to achieve optimal performance. You will come back to this insight later in this series, but let's first evaluate the evaluation process.&lt;/p&gt;
&lt;h3 id="the-myth-of-unseen-datasets"&gt;The Myth of "Unseen" Datasets&lt;/h3&gt;
&lt;p&gt;The training steps of a deep learning model require that the data is seen in its entirety usually multiple times. If the test data is sampled from the same dataset, how "unseen" is the test data?&lt;/p&gt;
&lt;p&gt;The split between test and training is often random, and yet the datasets are often collected from similar samples. Let's take a look at an artifact from some research around memorization to see how this plays out.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Two images are next to each other, each with a picture that looks like it's taken by the same photographer in the same room at nearly the same time. The photo shows a room with a bright green wall and a person on a swing. The photos are labeled as &amp;quot;swing&amp;quot;." src="./images/2024/high_influence_pair_swing.png"&gt;
&lt;em&gt;Example training sample and test sample from ImageNet&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;On the left is a test data sample, with the accuracy of the model's prediction written underneath it (75%). But on the right is an example from the training data which most influenced the weights to guess "correctly" on the test data. The photos are clearly from the same photographer, on the same day, of the same thing. And yet, this is called "unseen" data?&lt;/p&gt;
&lt;p&gt;Presumably, you want the test data to be completely unseen so you can tell how well your model is actually generalizing. Generalization describes a model's ability to learn underlying patterns rather than overfit the training data, so that it still performs well on unseen or real-world data.&lt;/p&gt;
&lt;p&gt;Since you learned about the long-tail distribution of the scraped "real world" data, the chance that a test example is truly unseen is low when sampling from the same collected dataset. If you sample from the peak of the distribution, that data is massively duplicated, so this is certainly not unseen data. If you sample from the tail, you do have a much higher chance of "unseen" examples, but ideally you also want to learn most of the tail in order to generalize, which means you need significant data points from the tail in your training dataset.&lt;/p&gt;
&lt;p&gt;In fact, some of the best research on the problems of imbalanced classes, which are an effect of the "real world" distributions, guides practitioners to oversample the tail as training examples, leading to more "balance" between the peak and the tail. If you oversample the long tail for training, then this also means there is less of the tail for testing, and you again end up in the cycle of testing mainly with common examples, which the model should certainly learn.&lt;/p&gt;
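&lt;p&gt;To illustrate that oversampling practice, here is a small, hedged sketch using scikit-learn's resample utility; the class sizes are invented and a real pipeline would operate on the features as well, not only the labels:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np
from sklearn.utils import resample

# Imaginary imbalanced labels: a large "head" class and a small "tail" class
head = np.zeros(950, dtype=int)
tail = np.ones(50, dtype=int)

# Oversample the tail with replacement until it matches the head
tail_oversampled = resample(tail, replace=True, n_samples=len(head), random_state=0)

balanced = np.concatenate([head, tail_oversampled])
print(np.bincount(balanced))  # equal counts, but the tail rows are repeats
&lt;/code&gt;&lt;/pre&gt;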
&lt;p&gt;Why is this happening? This isn't representative of actual learning, and it certainly won't work well in the real world if this is the performance. Let's investigate one potential factor in how this occurs.&lt;/p&gt;
&lt;h3 id="the-pressure-to-publish-and-benchmarks"&gt;The pressure to publish and "Benchmarks"&lt;/h3&gt;
&lt;p&gt;In academia, there is more pressure to publish than ever before. Especially in fast-moving fields like machine learning or AI, researchers and students must attempt to create novel, breakthrough work at record speed. But usually novel work takes time, it takes inspiration, it takes many trials and errors and blockers until you have a really interesting idea and approach.&lt;/p&gt;
&lt;p&gt;So, how do you keep publishing at a high speed if you don't have the time to actually explore ideas fully? You aim for benchmarks!&lt;/p&gt;
&lt;p&gt;Benchmarks and their accompanying leaderboards have become a gamification of machine learning research.&lt;sup id="fnref:1"&gt;&lt;a class="footnote-ref" href="#fn:1"&gt;1&lt;/a&gt;&lt;/sup&gt; A benchmark dataset is often introduced as a paper itself (bonus: you get a paper published by creating a new benchmark) and usually introduces a particular task and dataset -- such as &lt;a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4389233"&gt;a generative AI model passing the bar exam&lt;/a&gt;. After publication, someone will beat the initial model that won the benchmark (another novel paper!). Then, someone else will beat that model (yet again a novel paper!).&lt;/p&gt;
&lt;p&gt;But is the data representative of real-world problems? Is the data diverse and representative? Is the testing data "unseen"? Is the benchmark useful for the use cases that people need solved?&lt;/p&gt;
&lt;h4 id="kaggle-culture-and-the-origin-of-leaderboards"&gt;Kaggle Culture and the origin of Leaderboards&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://www.kaggle.com/"&gt;Kaggle&lt;/a&gt; is an online machine learning community that started in 2010 (and has since largely been replaced by HuggingFace).&lt;/p&gt;
&lt;p&gt;Kaggle hosted many popular datasets and competitions in the early-to-mid 2010s. The goal was to share models that beat other models at particular machine learning competitions or tasks. This usually involved overengineering models with no attention to generalization, making them bigger than ever, using every possible feature you could think of, using (now dated but then trendy) techniques like AutoML, where feature extraction is automated and becomes opaque. You could often win money, internships or even get hired based on your Kaggle status.&lt;/p&gt;
&lt;p&gt;There are several known examples of teams or participants figuring out how to train on the test dataset, or &lt;a href="https://www.theregister.com/2020/01/21/ai_kaggle_contest_cheat/"&gt;directly encode the not very well hidden test dataset&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="An example leaderboard showing a variety of models and their score on an aggregated benchmark on HuggingFace. You can see that many of these models are submitted by single users who likely have overtrained on a subset of the evaluation data." src="./images/2024/llm_leaderboard.png"&gt;
&lt;em&gt;A Hugging Face LLM leaderboard, where all models are fine-tuned by individuals and "outperform" the base model trained by large expert teams. Can you guess what the fine-tuning data is?&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;The entire cultural goal was being the #1 machine learning model, and for that, you would do anything to squeeze out extreme accuracy, even if it wasn't really a very good machine learning model afterwards. This "winner takes all" leaderboard mentality still exists in today's machine learning community, now as &lt;a href="https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard"&gt;Hugging Face leaderboards&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The problems outlined in this article contribute significantly to common problems in real-use applications, like models never launching into production.&lt;sup id="fnref:2"&gt;&lt;a class="footnote-ref" href="#fn:2"&gt;2&lt;/a&gt;&lt;/sup&gt; These realities also contribute to "model drift" or "data drift", where model performance shifts when launched into real-world use cases in production settings. But where did the data drift to? Simply outside of the carefully collected training dataset representation.&lt;/p&gt;
&lt;p&gt;A question to others in the machine learning community: How sure are we that we are getting the population right? Are we using basic statistical thinking to model our data collection approach? Can we learn from social sciences on population representation? Are we focused on creating the best models for real-world impact? Are we challenging current data collection methods for bias, misrepresentation and (in many ways) lack of real world applicability? Can we foster better understanding of what AI models humans want and start our evaluation sets there?&lt;/p&gt;
&lt;p&gt;In the next article, you'll use what you learned to review how massively repeated examples are memorized. We're diving into the "heady" part first. ;)&lt;/p&gt;
&lt;p&gt;I'm very open to feedback (positive, neutral or critical), questions (there are no stupid questions!) and creative re-use of this content. If you have any of those, please share it with me! This helps keep me inspired and writing. :)&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Acknowledgements&lt;/em&gt;: I would like to thank &lt;a href="https://vickiboykis.com/"&gt;Vicki Boykis&lt;/a&gt;, &lt;a href="https://desfontain.es/serious.html"&gt;Damien Desfontaines&lt;/a&gt; and &lt;a href="https://www.linkedin.com/in/yanndupis/"&gt;Yann Dupis&lt;/a&gt; for their feedback, corrections and thoughts on this series. Their input greatly contributed to improvements in my thinking and writing. Any mistakes, typos, inaccuracies or controversial opinions are my own.&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;This obviously wasn't the initial intention of benchmarks, which was more about finding useful metrics that were (hopefully) not directly in the training datasets. I don't think the mentality I describe applies to all researchers or practitioners; however, it's still become a serious cultural and strategic problem in achieving useful model metrics and it's deeply affected model development. On the day I published this article, there was a really nice &lt;a href="https://www.technologyreview.com/2024/11/26/1107346/the-way-we-measure-progress-in-ai-is-terrible/"&gt;MIT Technology Article on the problems of benchmarks&lt;/a&gt;.&amp;#160;&lt;a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;There are online methods for evaluating models in real-time, like measuring performance metrics directly in the application using the model. Companies can and do develop these near real-time, online evaluations and sometimes even directly learn from production environments or deploy new models automatically when they perform better than the current ones. I would caution, however, that this lack of human oversight of the incoming "test" dataset can reproduce the same problems as biases in the collected data -- probably even more so if your application isn't used by billions of people (and even then, who and who not?).&amp;#160;&lt;a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="ml-memorization"></category></entry><entry><title>Exploring new meadows</title><link href="https://blog.kjamistan.com/exploring-new-meadows.html" rel="alternate"></link><published>2024-11-20T00:00:00+01:00</published><updated>2024-11-20T00:00:00+01:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2024-11-20:/exploring-new-meadows.html</id><summary type="html">&lt;p&gt;Hello!&lt;/p&gt;
&lt;p&gt;We may not know each other, but here you are on my website -- perhaps because you saw a post or someone shared a link. I'm resourceful, determined, intelligent and looking for new challenges. Welcome!&lt;/p&gt;
&lt;p&gt;Wenn Deutsch einfacher ist, schreiben Sie mir bitte per Email (katharine at kjamistan punkt com …&lt;/p&gt;</summary><content type="html">&lt;p&gt;Hello!&lt;/p&gt;
&lt;p&gt;We may not know each other, but here you are on my website -- perhaps because you saw a post or someone shared a link. I'm resourceful, determined, intelligent and looking for new challenges. Welcome!&lt;/p&gt;
&lt;p&gt;Wenn Deutsch einfacher ist, schreiben Sie mir bitte per Email (katharine at kjamistan punkt com) oder auf &lt;a href="https://www.linkedin.com/in/katharinejarmul/"&gt;LinkedIn&lt;/a&gt;, damit ich meinen Lebenslauf weitergeben kann.&lt;/p&gt;
&lt;h4 id="about-me"&gt;[About Me]&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;10+ years experience in working on machine learning, deep learning and AI systems. Started in Natural Language Processing (2011) and moved to privacy (federated learning, encrypted learning, differential privacy) in data and ML/AI systems in 2017. Experienced in driving data and ML projects to successful outcomes.&lt;/li&gt;
&lt;li&gt;Author of &lt;a href="https://www.oreilly.com/library/view/practical-data-privacy/9781098129453/"&gt;&lt;em&gt;Practical Data Privacy&lt;/em&gt; (O'Reilly 2022)&lt;/a&gt;, translated into &lt;a href="https://dpunkt.de/produkt/data-privacy-in-der-praxis/"&gt;German&lt;/a&gt; and coming soon in Polish, on using privacy technologies in data and machine learning systems and building better governance and privacy into data workflows and data teams. Video course to accompany the book coming in January 2025. Curator and author of the &lt;a href="https://probablyprivate.com/"&gt;&lt;em&gt;Probably Private&lt;/em&gt; newsletter&lt;/a&gt; with more than 500 subscribers.&lt;/li&gt;
&lt;li&gt;C-level consultant for security, governance and privacy in data and AI systems at Thoughtworks' clients in the EU and globally. Developing future-proof data strategies that achieve business goals while building trustworthy relationships with customers and partners.&lt;/li&gt;
&lt;li&gt;Technical leader with product know-how. Can cross technical, product and business lines to transfer knowledge, assess and reinforce alignment and develop strategic and pragmatic planning and execution. Experience leading small and large teams (3-50 developers/data persons).&lt;/li&gt;
&lt;li&gt;Multiple time startup founder with experience on raising, board communication, team building and product-market fit.&lt;/li&gt;
&lt;li&gt;Handlungssicher auf Deutsch (C1 Zertifikat). Ich interessiere mich für eine Stelle, bei der ich auf Deutsch arbeite.&lt;/li&gt;
&lt;li&gt;More than 15 years in the technology industry, with technical and product experience in machine learning, data science and data engineering, architecture, security engineering, software design and development and large-scale cloud deployment and automation.&lt;/li&gt;
&lt;li&gt;Regular speaker and keynoter at international conferences such as CCC, Strangeloop, QCon, ACM, PyData, PyCon, EuroPython. Due to my strong technical background, I have covered topics like data privacy, machine learning security and AI ethics and continue to be invited to speak on these topics.&lt;/li&gt;
&lt;li&gt;Lecturer, former adjunct professor and successful educator including courses for O'Reilly, University of Florida, DataCamp and numerous educational workshops. You can check out my teaching style &lt;a href="https://www.youtube.com/@ProbablyPrivate"&gt;on the Probably Private YouTube channel&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Excel at rapid grasp of new technologies, and asking difficult questions to surface critical issues -- driving teams to research, learn, debate and decisively resolve issues as they arise.&lt;/li&gt;
&lt;li&gt;Tinkerer, hacker and forever programmer, see &lt;a href="https://github.com/kjam"&gt;GitHub&lt;/a&gt; for an overview of my interests and open projects.&lt;/li&gt;
&lt;li&gt;Fluent in Python and GoLang and have experience with C++ and Java.&lt;/li&gt;
&lt;li&gt;Founder of PyLadies, mentor and ally for several women of color and immigrant women in tech initiatives, conference diversity scholarship organizer, persistent advocate for the underrepresented in tech.&lt;/li&gt;
&lt;li&gt;Background in investigative journalism, love public speaking, meeting new people and working with teams.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="about-you"&gt;[About You]&lt;/h4&gt;
&lt;p&gt;Here's a few things I'm hoping you can tell me:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;What's your team like? Is it diverse (gender, race, immigration status, age)?&lt;/li&gt;
&lt;li&gt;What relevant problems do you solve? What excites you about your work / product?&lt;/li&gt;
&lt;li&gt;Do you let folks learn on the job? Is this supported with mentoring / pairing / reviews, etc?&lt;/li&gt;
&lt;li&gt;Are you friendly to remote workers or based in Berlin?&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="dream-role"&gt;[Dream Role]&lt;/h4&gt;
&lt;p&gt;I'm not sure what is available right now given the noise around AI, so I am posting this to learn about opportunities outside of my direct network. Feel free to send it along!&lt;/p&gt;
&lt;p&gt;Ideally I'd like a role where I can:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Share my knowledge and continue research and work on AI/ML privacy and security&lt;/li&gt;
&lt;li&gt;Either strategically lead teams (leadership/management) or work in IC roles as a ML/tech lead&lt;/li&gt;
&lt;li&gt;Develop personalized and privacy-first AI/ML systems at a product-company (particularly interested in B2C) (see &lt;a href="https://blog.kjamistan.com/private-and-personalized-ai.html"&gt;my views on personal and private AI systems&lt;/a&gt;) or develop communally-run data and AI systems for things like public infrastructure, transport and energy.&lt;/li&gt;
&lt;li&gt;Spend some of my day speaking and writing German (oder meinen ganzen Tag!)&lt;/li&gt;
&lt;li&gt;Work with a motivated team who enjoys collaboration, learning and mutual support&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you think your organization might be a fit, please drop me a line! You can &lt;a href="https://www.linkedin.com/in/katharinejarmul/"&gt;reach out on LinkedIn&lt;/a&gt; or on email I'm katharine at the top-level domain you are currently on. Spelling matters (i.e. kath-A-rine).&lt;/p&gt;
&lt;p&gt;Thanks for dropping by. 🤗&lt;/p&gt;</content><category term="misc"></category></entry><entry><title>Private and Personalized AI</title><link href="https://blog.kjamistan.com/private-and-personalized-ai.html" rel="alternate"></link><published>2024-11-19T00:00:00+01:00</published><updated>2024-11-19T00:00:00+01:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2024-11-19:/private-and-personalized-ai.html</id><summary type="html">&lt;p&gt;I recently had the wonderful experience of &lt;a href="https://pydata.org/paris2024"&gt;keynoting PyData Paris&lt;/a&gt;, thanks again for the invite! When deciding on a topic, I was considering my &lt;a href="https://blog.kjamistan.com/a-deep-dive-into-memorization-in-deep-learning.html"&gt;recent research about how AI/ML systems memorize data&lt;/a&gt;. As I've mentioned in a few talks, if we indeed embraced the fact that machine learning systems …&lt;/p&gt;</summary><content type="html">&lt;p&gt;I recently had the wonderful experience of &lt;a href="https://pydata.org/paris2024"&gt;keynoting PyData Paris&lt;/a&gt;, thanks again for the invite! When deciding on a topic, I was considering my &lt;a href="https://blog.kjamistan.com/a-deep-dive-into-memorization-in-deep-learning.html"&gt;recent research about how AI/ML systems memorize data&lt;/a&gt;. As I've mentioned in a few talks, if we indeed embraced the fact that machine learning systems memorize training data, we'd probably design them differently. What would it look like if you could just use your own data, or own your own model, or both?&lt;/p&gt;
&lt;p&gt;I've been inspired by recent AI product shifts in this direction including &lt;a href="https://www.apple.com/apple-intelligence/"&gt;the Apple Intelligence launch&lt;/a&gt;, which promises to be more private and personalized. Although &lt;a href="https://www.macrumors.com/2024/10/28/apple-intelligence-eu-april-2025/"&gt;it's not available yet in the EU&lt;/a&gt;, likely due to its currently closed functionality, I am excited to see what innovation it brings to thinking about personalization in AI/ML systems.&lt;/p&gt;
&lt;p&gt;These developments struck me as similar to other events in technology history, like the emergence of the personal computer. Maybe we can learn from that history to see how to make AI systems a helpful and integral part of our society?&lt;/p&gt;
&lt;h3 id="what-can-we-learn-from-history"&gt;What can we learn from history?&lt;/h3&gt;
&lt;p&gt;At present AI is a centralized, specialized field. It reminds me a lot of early computing or pre-cloud data centers. Early computers were huge machines in rooms full of specialists, used only for special tasks. Here is an example of such a machine and its engineers in the late 1950s.&lt;/p&gt;
&lt;p&gt;&lt;img alt="An image of an IBM mainframe in the background of a room with two persons working on it. One person is wearing a suit and looking at the mainframe panel. The other person is wearing a dress and looks to be typing at a terminal or other interface." src="./images/2024/ibm_mainframe.jpg"&gt;
&lt;strong&gt;An IBM 704 mainframe at NACA in 1957&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The parallels are actually quite striking when you list out some characteristics:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;large, expensive and centralized compute&lt;/li&gt;
&lt;li&gt;run by a small group of highly specialized workers&lt;/li&gt;
&lt;li&gt;task-specific programming, often for research or large corporate interests&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;What brought about the revolution in computing? What made it so that we all walk around with a computer in our purse, pocket, bag?&lt;/p&gt;
&lt;p&gt;One of the initial turning points was the development of the personal computer (PC), but even that was mainly used by hobbyists and didn't initially have wider market impact. But as software became more useful, that perspective shifted. One good example of this shift was &lt;a href="https://en.wikipedia.org/wiki/VisiCalc"&gt;VisiCalc on Apple II&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="An image of a screen of VisiCalc, showing a black and green screen which looks like a spreadsheet. It is generating an invoice with unit names, IDs, costs and then adding tax and calculating a total." src="./images/2024/Visicalc.png"&gt;
&lt;strong&gt;VisiCalc&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;With VisiCalc, people could finally see something really useful, something that was worth buying a fairly expensive piece of electronics to do. Everyone needs spreadsheets, right?&lt;/p&gt;
&lt;p&gt;This success started a growing trend of making software not just for hobbyists, but for the general public, for your work and life. This momentum created more focus on user-friendly, understandable GUIs (graphic interfaces), it let people bring their own data and it created experiences of joy and fun. As this trend continued, there was a need to use more than one computer, or connect with others -- building both the market and actual demand for the internet and things like web browsers. Each of these steps in computing development brought new use cases, new persons, new data and new communities along with them.&lt;/p&gt;
&lt;h3 id="is-ai-community-oriented-user-friendly-easy"&gt;Is AI community-oriented? User-friendly? Easy?&lt;/h3&gt;
&lt;p&gt;This brings us back to the current status in AI. Where are we in this evolution?&lt;/p&gt;
&lt;p&gt;&lt;img alt="A line with 4 distinct spots. On the left it is labeled: Centralized, Expensive, Corporate. The next point is labeled: Hobbyist, Specialized. The next point is General Use, GUIs and Software, Easy-to-Install and Use and then the final point on the far right is Customizable, Easily Connected, Open for online and offline usage." src="./images/2024/ai_adoption_curve.png"&gt;
&lt;em&gt;Where do you place AI systems on this scale?&lt;/em&gt;&lt;sup id="fnref:1"&gt;&lt;a class="footnote-ref" href="#fn:1"&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;I think we are still somewhere in these early stages of VisiCalc, approaching the next stages, but somewhat slowly due to the lack of truly open models with open data. It's still quite difficult to try to bring your own data or your own use case -- other than typing something into a prompt or uploading one image at a time in an interface.&lt;/p&gt;
&lt;p&gt;How do I easily connect AI to the documents on my computer? How do I use it with my photo storage or art? How do I have it only use my writing, documents, emails?&lt;sup id="fnref:2"&gt;&lt;a class="footnote-ref" href="#fn:2"&gt;2&lt;/a&gt;&lt;/sup&gt; Can I train it myself to get the results I want (i.e. by labeling myself and/or engineering my own prompts without having to learn much about how it works)? How do I do all of this safely and successfully (i.e. without having to reread everything and do everything twice)?&lt;/p&gt;
&lt;p&gt;One of the main problems in achieving the next stages is that popular AI systems are inherently intransparent. This is a great marketing trick because you can claim magic, but awful for actually making AI systems more trustworthy and useful for humans. As a user, I don't have to understand everything, but I should be able to understand enough to avoid poor quality outcomes. I should also trust things enough to know they won't accidentally reply to an email from my boss with details about my upcoming job interview or &lt;a href="https://www.ndtv.com/feature/unbuttoned-blouse-made-up-bra-ex-google-techie-claims-her-photo-was-edited-for-ai-conference-6811999"&gt;provide a profile photo with my underwear showing&lt;/a&gt;. :/&lt;/p&gt;
&lt;p&gt;What will be the pivotal point that takes AI (or agents or ML models) from where we stand today and moves them into a true revolution, where everyone uses the AI systems directly as regularly as they open their laptop or unlock their phone?&lt;/p&gt;
&lt;h3 id="imagining-what-is-possible-local-document-search-retrieval-and-chat"&gt;Imagining what is possible: Local Document Search, Retrieval and Chat&lt;/h3&gt;
&lt;p&gt;I think this will come first when you can run an AI system as easily as installing software or an application on your phone. It needs to work offline or you need to control when and how it connects and what data it sends (because of the aforementioned trust issues and general usefulness).&lt;/p&gt;
&lt;p&gt;To test out what I wanted in relation to personalized AI, I built a completely local RAG. I'd already downloaded and tried &lt;a href="https://ollama.com/"&gt;Ollama&lt;/a&gt; and &lt;a href="https://gpt4all.io/index.html"&gt;GPT4All&lt;/a&gt;, both of which I liked, but I couldn't tinker with them as easily as I expected and I wanted to build out some other features I had in mind... (more on this soon!)&lt;/p&gt;
&lt;p&gt;&lt;img alt="A screenshot of a Jupyter Notebook showing a query and response around memorization in AI systems." src="./images/2024/local_offline_rag.png"&gt;
&lt;em&gt;A local, offline RAG system search and response&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;I built mine using &lt;a href="https://github.com/UKPLab/sentence-transformers"&gt;examples from UKPLab's sentence transformers&lt;/a&gt; and &lt;a href="https://github.com/Mozilla-Ocho/llamafile"&gt;Mozilla's Llamafiles&lt;/a&gt;. I didn't need a bunch of add-on libraries and it was quite straightforward. I went for simplicity and ability to shift out models or search easily over robustness and functionality. I also wanted it simple so that I could easily demonstrate how the underlying systems work (transparency is important!).&lt;sup id="fnref:3"&gt;&lt;a class="footnote-ref" href="#fn:3"&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
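&lt;p&gt;To give a rough idea of the moving parts, here is a minimal sketch (simplified, not the actual notebook) of a retrieval-plus-generation loop. It assumes sentence-transformers is installed and that a llamafile is already running locally, serving its OpenAI-compatible API on port 8080 (the default):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Minimal local RAG sketch: embed documents, retrieve the best match,
# then ask a locally running llamafile to answer using that context.
# Assumptions: sentence-transformers and requests are installed, and a
# llamafile server is running on http://localhost:8080 (its default port).
import requests
from sentence_transformers import SentenceTransformer, util

documents = [
    "Memorization in deep learning means models can reproduce training data.",
    "Word embeddings map tokens to vectors that preserve linguistic relationships.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # small, local embedding model
doc_embeddings = embedder.encode(documents, convert_to_tensor=True)

question = "What is memorization in machine learning?"
query_embedding = embedder.encode(question, convert_to_tensor=True)

# Cosine similarity search: pick the most relevant document for the question.
scores = util.cos_sim(query_embedding, doc_embeddings)[0]
best_doc = documents[int(scores.argmax())]

# Hand the question plus retrieved context to the local llamafile server.
response = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "local",
        "messages": [
            {"role": "system", "content": f"Answer using this context: {best_doc}"},
            {"role": "user", "content": question},
        ],
    },
)
print(response.json()["choices"][0]["message"]["content"])
&lt;/code&gt;&lt;/pre&gt;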
&lt;p&gt;I released &lt;a href="https://github.com/kjam/personalized-ai"&gt;my proof-of-concept as a Jupyter Notebook on GitHub&lt;/a&gt; with annotations. I'll be adding more notebooks, command-line programs and functionality to this -- so if you'd like to contribute, let me know!&lt;/p&gt;
&lt;p&gt;I'll also be releasing other personal-AI/ML model examples and workflows in the coming weeks and months to inspire others, debunk mythology about how "hard" it is to build local-first data and AI and to hear your feedback on what's interesting and useful.&lt;/p&gt;
&lt;p&gt;I believe we're at a critical moment in the adoption of AI and ML systems as something that can help connect us, that can serve real purpose and that can also be reliable, trustworthy and interesting (maybe even fun!). There are also some pretty dystopian futures that could occur if we continue to have intransparent, corporate-driven AI systems that are more smoke and mirrors than science. I'd like us to use this moment to build AI futures we actually want to see.&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;I'd be curious to hear your thoughts, so feel free &lt;a href="https://probablyprivate.com/about/"&gt;to write me an email&lt;/a&gt; or reach out on &lt;a href="https://www.linkedin.com/in/katharinejarmul/"&gt;LinkedIn&lt;/a&gt;.&amp;#160;&lt;a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;Some people say this is achieved by AI Agents, but I have yet to see an agentic workflow that is clear, trustworthy and transparent enough that I would install it and use it on my computer. I think the security and privacy problems with agents will continue to grow in the short-term, and that they will likely only be fixed with actual user control and transparency--including local-first model design and deployment.&amp;#160;&lt;a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:3"&gt;
&lt;p&gt;I took inspiration from &lt;a href="https://www.youtube.com/watch?v=0nA5QG3087g&amp;amp;ab_channel=HamelHusain"&gt;Ben Clavié's talk on RAG system design&lt;/a&gt;, where he recommends splitting search and retrieval from summarization. I concur that this gives much better results. Thank you for your work!&amp;#160;&lt;a class="footnote-backref" href="#fnref:3" title="Jump back to footnote 3 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="personal-ai"></category></entry><entry><title>Encodings and embeddings: How does data get into machine learning systems?</title><link href="https://blog.kjamistan.com/encodings-and-embeddings-how-does-data-get-into-machine-learning-systems.html" rel="alternate"></link><published>2024-11-18T00:00:00+01:00</published><updated>2024-11-18T00:00:00+01:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2024-11-18:/encodings-and-embeddings-how-does-data-get-into-machine-learning-systems.html</id><summary type="html">&lt;p&gt;In &lt;a href="https://blog.kjamistan.com/a-deep-dive-into-memorization-in-deep-learning.html"&gt;this series&lt;/a&gt;, you've learned a bit about how data is collected for machine learning, but what happens next? You need to turn the collected data -- images, text, video, audio or even just a spreadsheet -- into numbers that can be learned by a model. How does this happen?&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;TLDR (too …&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;/table&gt;</summary><content type="html">&lt;p&gt;In &lt;a href="https://blog.kjamistan.com/a-deep-dive-into-memorization-in-deep-learning.html"&gt;this series&lt;/a&gt;, you've learned a bit about how data is collected for machine learning, but what happens next? You need to turn the collected data -- images, text, video, audio or even just a spreadsheet -- into numbers that can be learned by a model. How does this happen?&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;TLDR (too long; didn't read)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Complex data like images and text need complex representations if you want to use them to predict or learn&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;One way to encode this data while preserving information uses linear algebra&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deep learning also uses linear algebra as building blocks for networks and architectures&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Word embeddings encoded language into linear algebra structures--enabling deep learning with language&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Word embeddings also encode cultural biases and sensitive information&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=r4elspFjlqg&amp;amp;list=PLJkNSeYcYBlCaamscxip0l2LGYCZ2TIom&amp;amp;index=2&amp;amp;ab_channel=ProbablyPrivate"&gt;Watch a video summary of this post&lt;/a&gt;&lt;/p&gt;
&lt;h2 id="why-encode-information"&gt;Why encode information?&lt;/h2&gt;
&lt;p&gt;Data is actually encoded all the time! When you save a file, when you open a program, when you write an email and hit send -- all of these take formats humans can interpret and translate them into formats computers can read, write and process.&lt;/p&gt;
&lt;p&gt;The default computer encoding is bytes (collections of &lt;a href="https://en.wikipedia.org/wiki/Bit"&gt;bits&lt;/a&gt;) -- which the computer can store or process using available hardware, like a CPU and attached memory. Bytes are also used to build datagrams which can be used by internet protocols to send data. These same principles relate to how information is also encoded into other messaging standards, like radio waves that are captured via an antenna and then decoded back into audio via a &lt;a href="https://en.wikipedia.org/wiki/Demodulation"&gt;demodulator&lt;/a&gt;.&lt;/p&gt;
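&lt;p&gt;A tiny Python illustration of that round trip -- text is encoded into bytes and decoded back without losing anything:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Text is encoded into bytes for storage and transport, then decoded back losslessly.
text = "Grüße aus Berlin"
encoded = text.encode("utf-8")     # b'Gr\xc3\xbc\xc3\x9fe aus Berlin' -- what the computer stores
decoded = encoded.decode("utf-8")  # back to the human-readable string
assert decoded == text
&lt;/code&gt;&lt;/pre&gt;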
&lt;p&gt;Encoding and decoding require the design and incorporation of standards to ensure systems interoperate properly. Imagine if your email provider took your text and encoded it incorrectly. The receiver of your email wouldn't be able to open it properly.&lt;/p&gt;
&lt;p&gt;In the early days of machine learning, encoding and decoding usually involved taking numerical datasets to predict another number, making the encoding, decoding and computation obvious and in some ways unnecessary because the computer already could do math on numbers. For example, if you wanted to project a line or trend based on previous data, you can do that without machine learning. As interest, research and use cases expanded, machine learning approaches reached domains where the data wasn't already encoded in numbers that could be learned easily. There needed to be a way to encode and decode letters, words, images, audio and so on.&lt;sup id="fnref:1"&gt;&lt;a class="footnote-ref" href="#fn:1"&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;h2 id="the-magic-of-linear-algebra"&gt;The "magic" of linear algebra&lt;/h2&gt;
&lt;p&gt;In some machine learning problems, a simple algorithm works well and quickly outperforms more complex models -- like when modeling simple linear or easy classification problems. In this case, choosing a simple model, a non-learning based algorithm and just using statistical measurements works well.&lt;/p&gt;
&lt;p&gt;However, there are many problems where the dimensionality of the inputs is too complex for a simple model. This was the case for computer vision problems, like photo classification and object recognition, until the creation of &lt;a href="https://en.wikipedia.org/wiki/AlexNet"&gt;AlexNet in 2012&lt;/a&gt;. AlexNet utilized neural networks and encoded the image into multi-dimensional matrices, or sets of numbers. The encoding mechanism did this in such a way that it attempted to preserve information and relationships and represent those in the resulting matrices. You can think of these matrices as a related set of numbers that preserves the patterns by creating numerical relationships between different "areas" or sections of the encoded data. This is what the machine learning model should learn.&lt;/p&gt;
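&lt;p&gt;To make "encoding an image into matrices" concrete, here is a minimal sketch (using Pillow and NumPy, not AlexNet's actual pipeline) showing that a loaded photo is just a grid of numbers -- the filename is a placeholder:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# A photo becomes a 3-dimensional array of numbers: height x width x color channels.
import numpy as np
from PIL import Image

image = Image.open("bus.jpg")   # any local image file
pixels = np.asarray(image)

print(pixels.shape)   # e.g. (480, 640, 3) -- rows, columns, RGB channels
print(pixels[0, 0])   # the top-left pixel as three numbers between 0 and 255
&lt;/code&gt;&lt;/pre&gt;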
&lt;p&gt;AlexNet was one of the breakthrough image recognition models that introduced deep learning as a viable solution to the variety of image-based machine learning tasks common at the time. This was because AlexNet cleverly leveraged a larger, "deeper" neural network architecture (deep learning) than other neural networks of the era. It also used a clever encoding mechanism.&lt;/p&gt;
&lt;p&gt;Another idea that AlexNet borrowed from earlier computer vision neural networks like &lt;a href="https://en.wikipedia.org/wiki/LeNet"&gt;LeNet-5 from 1998&lt;/a&gt; was the &lt;a href="https://en.wikipedia.org/wiki/Convolutional_layer"&gt;convolutional layer&lt;/a&gt;. These layers require many matrix computations, making them compute-hungry and therefore expensive in both computation and energy. One clever idea from the paper was to parallelize the processing by using two GPUs; in the past, usually only one CPU or GPU was used. By parallelizing the computations, the researchers were able to increase the model parameter size and also unlock the power of matrices and linear algebra for deep learning.&lt;/p&gt;
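&lt;p&gt;For intuition on why convolutional layers are so matrix-heavy, here is a small hand-rolled sketch of a single 2D convolution in NumPy -- real frameworks do this far more efficiently and in parallel across GPUs:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np

# A tiny 6x6 "image" and a 3x3 edge-detection style filter (kernel).
image = np.arange(36, dtype=float).reshape(6, 6)
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype=float)

# Slide the kernel over the image; every output value is an element-wise
# product of a patch and the kernel, summed up -- lots of small matrix math.
out = np.zeros((4, 4))
for i in range(4):
    for j in range(4):
        patch = image[i:i + 3, j:j + 3]
        out[i, j] = np.sum(patch * kernel)

print(out)  # one feature map; a real conv layer computes many of these at once
&lt;/code&gt;&lt;/pre&gt;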
&lt;p&gt;In the following diagram from &lt;a href="https://proceedings.neurips.cc/paper_files/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf"&gt;the original paper&lt;/a&gt;, each of the dotted lines and rectangles inside a layer show an example of what computations run at each layer on the parallel GPUs. You can think of each layer as a series of linear algebra matrix computations that take the results of the previous layer and continue to compute with them, with the goal of optimizing for the learning task at hand. You will learn about these in more depth in a later article.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A diagram of many layers shown in 3-D drawing as rectangles. The layers have small areas highlighted and then projected with dotted lines into the next layer until you reach the dimensions at the end. Each of the layers show an area highlighted internally in the layer which represents the matrices used by AlexNet in that particular layer. You can see that for each layer there are two highlights, showing the parallelization they achieved." src="./images/2024/alexnet.png"&gt;
&lt;em&gt;AlexNet Architecture Diagram&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Linear algebra has been used for hundreds of years to build systems of equations and map them to linear spaces. What does that mean and why is it relevant? You can take real world problems in engineering or physics and model them in mathematics. By taking data or known properties and building it into a system of equations and then mapping those equations into a "space", you essentially compress the problem space and can create optimized ways to solve for all results or a set of optimal results. You can also use these modeled systems to predict, infer, observe and learn.&lt;/p&gt;
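&lt;p&gt;As a very small example of what "building a system of equations and mapping it into a space" looks like in practice, here is NumPy solving two linear equations at once:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np

# Two equations: 2x + y = 5 and x + 3y = 10, written as a matrix A and a vector b.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([5.0, 10.0])

solution = np.linalg.solve(A, b)  # finds the x and y that satisfy both equations
print(solution)                   # [1. 3.], meaning x = 1 and y = 3
&lt;/code&gt;&lt;/pre&gt;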
&lt;p&gt;Linear algebra powers many machine learning systems and is the core building block of deep learning. By modeling complex tasks like how to locate and name the objects in an image (image segmentation and object detection/recognition) in linear algebra systems, deep learning can perform these quite challenging tasks.&lt;/p&gt;
&lt;p&gt;Computer vision benefited greatly from encoding images into matrices and leveraging those to unlock linear algebra powers, but what about text? Let's explore the changes that allowed for language-based deep learning, or natural language processing (NLP).&lt;/p&gt;
&lt;h2 id="encoding-language-with-tokens-and-embeddings"&gt;Encoding language with tokens and embeddings&lt;/h2&gt;
&lt;p&gt;Natural language processing leverages learnings from the field of linguistics. One way to encode language is to use linguistic knowledge like &lt;a href="https://en.wikipedia.org/wiki/Language_family"&gt;language families&lt;/a&gt; and &lt;a href="https://en.wikipedia.org/wiki/Root_(linguistics)"&gt;root words&lt;/a&gt; to chunk text into smaller words or &lt;a href="https://en.wikipedia.org/wiki/Word_stem"&gt;stems&lt;/a&gt;. You do this to achieve smaller and more consistent building blocks of language so that you can concentrate on patterns or information contained in a smaller vocabulary. In NLP, these linguistic chunks are called tokens. For example, you might take the word "foundation" and "founding" and "found" and agree that they should all be reduced to "found". This works fairly well, but what about the "found" in "lost and found"? Does it have the same meaning? The beautiful complexities of language and how each language develops differently adds challenges to tokenization.&lt;/p&gt;
&lt;p&gt;There are many approaches to tokenization, or the breaking down of text into machine learning ready chunks, which become quite language specific. Some of the approaches to tokenization include word-roots or stems, like the example above. Another approach is reading a language character-by-character, which works well for languages where one character has a lot of meaning, like Chinese. The character-based approach also works well when trying to do machine learning with less common languages, where there might be many words that aren't represented well in the training data. With character-based tokenization, the word "found" becomes literally a list of letters: "f", "o", "u", "n", "d". As you might imagine, this doesn't preserve as much of the deeper meanings of the word, because it doesn't necessarily attempt to find things like word roots explicitly -- although the machine learning model can still learn patterns of certain character sequences.&lt;/p&gt;
&lt;p&gt;There are also in-between approaches like subword tokenization, where linguistic understanding is used to break each word into word parts or pieces. This works well because it doesn't reduce the information as much as doing word-based tokenization with word stems. For example, "foundation" might become two tokens: "found", "ation" instead of being reduced to just "found". &lt;a href="https://huggingface.co/learn/nlp-course/chapter2/4"&gt;Hugging Face's introduction to tokenization&lt;/a&gt; is a great read to learn more in-depth about how different tokenizers work.&lt;/p&gt;
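&lt;p&gt;Here is a toy illustration of those three strategies in Python -- simplified stand-ins, not production tokenizers:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Toy illustration of three tokenization strategies (not production tokenizers).
word = "foundation"

# 1. Stem-based: reduce related words to a shared root.
stems = {"foundation": "found", "founding": "found", "found": "found"}
print(stems[word])        # "found"

# 2. Character-based: every character becomes its own token.
print(list(word))         # ['f', 'o', 'u', 'n', 'd', 'a', 't', 'i', 'o', 'n']

# 3. Subword: split into pieces that keep more information than the stem alone.
print(["found", "ation"]) # how a subword tokenizer might split it
&lt;/code&gt;&lt;/pre&gt;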
&lt;p&gt;Why are there so many approaches to tokenization for NLP? Due to the &lt;a href="https://blog.kjamistan.com/machine-learning-dataset-distributions-history-and-biases.html"&gt;long-tail distributions you learned about in the previous article&lt;/a&gt;, many languages are underrepresented in the online content available when compared with English. In addition, niche topics and content are underrepresented compared to more popular content.&lt;/p&gt;
&lt;p&gt;These imbalanced data problems present issues when encoding tokens into numerical representations so you can successfully train machine learning models. Early techniques borrowed from linguistic concepts like token frequency or tried to build encodings based on interesting uncommon tokens. These techniques didn’t successfully encode the relationship of the words to one another in longer texts or passages. These encoding methods and the related datasets presented challenges for early natural language processing models because they had to deal with extremely sparse datasets. Imagine choosing a number for every possible token and then just saying whether the token is there in a sentence or not. You will end up with many tokens that are missing in each sentence. This made machine learning difficult and costly because &lt;a href="https://en.wikipedia.org/wiki/Sparse_matrix"&gt;computing on sparse data&lt;/a&gt; is harder for computers to do.&lt;/p&gt;
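&lt;p&gt;To see how quickly the "one number per token, present or not" approach becomes sparse, here is a small sketch using scikit-learn's CountVectorizer:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Bag-of-words encoding: one column per vocabulary token, mostly zeros per sentence.
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "the bus stops at the station",
    "word embeddings encode linguistic relationships",
    "long tail distributions make learning hard",
]

vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(sentences)  # stored as a sparse matrix

print(matrix.shape)       # (3, number of unique tokens across all sentences)
print(matrix.toarray())   # each row is mostly zeros -- the sparsity problem
&lt;/code&gt;&lt;/pre&gt;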
&lt;p&gt;An important moment in text- and language-based machine learning was the creation of word- or token-embeddings, which moved away from sparse matrices and allowed for better leveraging of linear algebra (and therefore deep learning). In 2013, &lt;a href="https://en.wikipedia.org/wiki/Word2vec"&gt;Word2Vec&lt;/a&gt; (short for word to vector) was released. Word2Vec is a machine learning model which takes a word or token and maps it to a vector representation which is learned by first training the model on the linguistic relationships in text. The vector is like an encoded version of the word which tries to map its relationship to all the other words that the model has processed.&lt;sup id="fnref:2"&gt;&lt;a class="footnote-ref" href="#fn:2"&gt;2&lt;/a&gt;&lt;/sup&gt; This process produces mathematical connections or links between the words which show up together frequently, and it also can map different relationships when the word is used in different contexts, like the "found" example earlier in this article. This is why these representations are called "word embeddings", borrowing from the &lt;a href="https://en.wikipedia.org/wiki/Embedding"&gt;mathematical concept of embeddings&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="There is a 3D graph with a few data points: Man, Woman, King, Queen. The locations of these points represent the word embedding location for that token in this simplified 3D space. If you take the distance and direction between Man and King it is the same as the distance and direction between Woman and Queen." src="./images/2024/word2vec.png"&gt;
&lt;em&gt;Simplified 3D space example of Word2Vec word embeddings and their relationships&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Word2Vec introduced a context-aware way to link words in embedded form to one another. The Word2Vec model acted as an encoder into a compressed linear algebra space that translated the linguistic relationships more accurately. One famous example from the original paper used the model to complete analogies, like Man is to Woman as King is to Queen. When you took the vector representing "woman" and subtracted the vector for "man", you got a vector showing the distance and direction between those two words. If you added this difference to "king", you landed at "queen". Pretty neat!&lt;/p&gt;
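&lt;p&gt;If you want to try the analogy arithmetic yourself, gensim can download pretrained Word2Vec-style vectors. A minimal sketch -- note the download is large, and the exact neighbors and scores depend on the vectors you use:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# king - man + woman is close to queen, using pretrained vectors via gensim.
# Assumptions: gensim is installed and the ~1.6 GB vector file can be downloaded once.
import gensim.downloader as api

vectors = api.load("word2vec-google-news-300")

result = vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # e.g. [('queen', 0.71...)] -- the exact score depends on the model
&lt;/code&gt;&lt;/pre&gt;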
&lt;p&gt;Unfortunately, these embeddings had many other issues, including my discovery shortly after Google released their Word2Vec model that &lt;a href="https://blog.kjamistan.com/embedded-isms-in-vector-based-natural-language-processing.html"&gt;Man is to Computer Programmer as Woman is to Homemaker&lt;/a&gt;. 🙄 The resulting embedding models had learned racism, homophobia and US-centricity, which you can read more about in research by &lt;a href="https://arxiv.org/abs/1606.06121"&gt;Bolukbasi et al.&lt;/a&gt; and &lt;a href="https://www.pnas.org/doi/abs/10.1073/pnas.1720347115"&gt;Garg et al.&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;There are many newer approaches than Word2Vec, but the underlying principles remain similar. To learn about the advances that happened or to dive deeper into the topic, check out &lt;a href="https://vickiboykis.com/what_are_embeddings/next.html"&gt;Vicki Boykis's fantastic and freely available exploration of embeddings&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In the context of deep learning memorization, it might be useful, then, to memorize some relationships between these tokens. It can be quite useful to know that certain words always appear together, or that certain names are inherently connected. But this brings up considerations for privacy. Should embeddings related to private individuals be able to be learned or memorized?&lt;/p&gt;
&lt;h3 id="do-embeddings-contain-personal-information"&gt;Do embeddings contain personal information?&lt;/h3&gt;
&lt;p&gt;In Summer 2024 the Hamburg data protection authority released &lt;a href="https://datenschutz-hamburg.de/fileadmin/user_upload/HmbBfDI/Datenschutz/Informationen/240715_Discussion_Paper_Hamburg_DPA_KI_Models.pdf"&gt;a discussion paper stating that LLMs do not contain personal information&lt;/a&gt;. While the paper is not a legal ruling, it does set guidance for companies within Germany (and presumably the EU) who have interest in using, fine-tuning or training LLMs. For organizations who provide services in the EU, and therefore must follow the &lt;a href="https://commission.europa.eu/law/law-topic/data-protection/data-protection-eu_en"&gt;EU General Data Protection Regulation (GDPR)&lt;/a&gt;, these opinions provide useful legal interpretation and guide compliance and privacy decisions.&lt;/p&gt;
&lt;p&gt;Let's take a concrete example from the paper. The paper uses the question (in German): Ist ein LLM personenbezogen? (English: Is an LLM personal [data]?), which the paper tokenizes like so:&lt;/p&gt;
&lt;p&gt;[I][st][ e][in][ LL][M] [ person][en][be][z][ogen] [?]&lt;/p&gt;
&lt;p&gt;The paper also uses the example of someone named Mia Müller, stating that Mia's name is tokenized as "M", "ia", "Mü" and "ller". This is a key example used to say that the name is now split into tokens, and is therefore no longer personally identifiable.&lt;/p&gt;
&lt;p&gt;They reference &lt;a href="https://platform.openai.com/tokenizer"&gt;OpenAI's tokenizer&lt;/a&gt;, which has a handy online interface, so let's check their work quickly:&lt;/p&gt;
&lt;p&gt;&lt;img alt="An image of text where different pieces of the text are highlighted, showing where the tokens break down. Here, we see that the name Mia Müller contains 3 tokens: Mia, Mü, and ller. We also see that the sentence is broken down similar to what the paper described." src="./images/2024/tokenization_gpt3_openai.png"&gt;
&lt;em&gt;GPT-3 Tokenizer&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;img alt="Another tokenization example, this time with GPT-3.5 and GPT-4. In this case the name Mia Müller is only two tokens (split on the name), and the final sentence is Ist, ein, L, LM, person, en, be, z, ogen" src="./images/2024/tokenization_gpt4_openai.png"&gt;
&lt;em&gt;GPT-3.5 and GPT-4 Tokenizer&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Using GPT-3, I can reproduce their experiment... but there are differences between GPT-3 and GPT-3.5. How come?&lt;/p&gt;
&lt;p&gt;OpenAI's tokenization uses &lt;a href="https://en.wikipedia.org/wiki/Byte_pair_encoding"&gt;byte-pair encoding&lt;/a&gt; which helps for tokenizing multiple languages at once and processing messy internet or chat text. This encoding mechanism uses clever ways to detect and deduplicate linguistic patterns without explicitly incorporating linguistic knowledge.&lt;sup id="fnref:3"&gt;&lt;a class="footnote-ref" href="#fn:3"&gt;3&lt;/a&gt;&lt;/sup&gt; To note: the tokenizer doesn't show the embeddings, which are only available via a separate API call (the model is not released publicly for download). The tokenizer takes text and returns a series of indices (like a lookup table) for the appropriate token embedding in the OpenAI system.&lt;/p&gt;
&lt;p&gt;When evaluating the GPT tokenization output above, understand that it shows both the tokenizer and the related embeddings for that model. The GPT-3 tokenizer and its trained embedding model produce something closer to character-based embeddings when given German text (and likely this applies to other languages for that tokenizer-embedding combination). The GPT-3.5+ tokenizer and embedding model outputs something closer to subwords.&lt;/p&gt;
&lt;p&gt;One possible explanation for the differences between these tokenizer and embedding model combinations is that OpenAI acquired better German language training data, which resulted in better tokenization and embeddings for German text. As shown above, Mia's name is now tokenized as one token per name, meaning those words were common enough to each get their own token and related embedding. In the GPT-3 tokenizer and embedding model, common English names with only one common spelling are already tokenized as one token per name.&lt;/p&gt;
&lt;p&gt;Therefore, it is misleading to interpret the tokenization itself as a practice that removes personally identifiable data, which is what Hamburg has stated in their discussion paper. If you truly want to have a tokenizer obfuscate personal data, this must be done intentionally and likely is only truly accurate if the identifiable information is never tokenized.&lt;/p&gt;
&lt;p&gt;Furthermore, the office incorrectly describes the process of turning tokens into embeddings as an encoding mechanism that further diminishes the "personally identifiable" part of the data. That would be like saying storing a text on your computer makes it not personally identifiable, because it's actually stored in bytes.... which, of course, is not a very useful interpretation of how computer- or machine-readable encodings work. Just because a human cannot look at an embedding and know what word, token or letter it represents doesn't mean that that same person cannot use &lt;a href="https://github.com/openai/tiktoken"&gt;OpenAI's freely provided decoder&lt;/a&gt; to understand what the data is -- or that a machine cannot learn or interpret the data. In fact, this is exactly what machine learning is trying to accomplish.&lt;/p&gt;
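&lt;p&gt;For example, with the open-source tiktoken library you can encode the paper's example sentence and decode the token indices right back -- this sketch uses the cl100k_base encoding, the one behind GPT-3.5/GPT-4 era models:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Tokenize the example sentence and decode it back -- the mapping is fully reversible.
# Assumption: tiktoken is installed (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "Ist ein LLM personenbezogen? Mia Müller"
token_ids = enc.encode(text)

print(token_ids)                                               # a list of integer indices
print([enc.decode_single_token_bytes(t) for t in token_ids])   # the raw bytes behind each index
print(enc.decode(token_ids) == text)                           # True -- nothing is lost
&lt;/code&gt;&lt;/pre&gt;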
&lt;p&gt;By default, embedding models like Word2Vec and more powerful ones like OpenAI's model want to retain and internally represent the relationships between tokens. The trained model should take information in the tokens and transfer that into relationships in the embeddings. In doing so, it learns relationships, like how "be" and "zogen" together form a common mapping, especially when these tokens follow "personen". This is what makes embeddings so powerful.&lt;/p&gt;
&lt;p&gt;In addition, these embeddings are then used to train the actual language model. Nearly all natural language models (deep or otherwise) use embedding sequences to learn patterns. Even if the individual embeddings for a name are chunked oddly due to the tokenization strategy, it's likely that the embedding model has seen their sequence and combination. Even if that embedding model never saw those tokens together as part of the embedding model training, the sequence and relationship between those embeddings can be learned by the language model. By design, tokenization and the embedding model should enhance the ability for the language model to learn the relationships, not detract from it. This feature of data encoding and its subsequent model training means models also learn patterns in personal information.&lt;/p&gt;
&lt;p&gt;The interpretation that a model "only sees the multidimensional representation, devoid of personal data" is, again, like arguing that a computer processing data only sees bytes and therefore cannot interpret it, or that an algorithm cannot learn from personal data. Additionally, the growth of so-called "context windows" means that an LLM or other generative model holds thousands of tokens as accessible data and as sequencing information before creating a response or performing another task. When you chat with ChatGPT, it holds the entire conversation you are having along with the initial instructions or prompt written by the model designers, saving up to 128,000 tokens as additional "context". These embeddings and their ordering can contain many examples of personally identifiable text and are used by the model alongside additional user and session information to formulate a response.&lt;/p&gt;
&lt;p&gt;Large machine learning systems attempt to extract and compress information into data structures that leverage linear algebra and deep learning architectures. In doing so, they enable more complex machine learning tasks. This encoding should enhance learning from the data, not detract. Therefore, Hamburg's take is fairly misinformed when it comes to interpreting how personal data (or really any data) is encoded and used in larger machine learning systems.&lt;/p&gt;
&lt;p&gt;As you learned in this article, language and computer vision machine learning encode data differently, based on what was learned about how to best leverage the power of deep learning and linear algebra. You might be wondering whether computer vision models retain personal information. Some computer vision tasks are set up so the entire goal is to learn personally identifiable information, like with facial recognition systems which should remember the user's face (FaceID as one example). Other tasks might be set up differently, where the model is penalized for learning the specifics. Some questions to ask yourself for further reflection: Should it be known that a photo contains a celebrity, and should that celebrity's name be learned? Should it be learned that a piece of art comes from a particular artist by name? Each of these questions can also be applied to language learning, if a token (or series thereof) ends up representing a person.&lt;/p&gt;
&lt;p&gt;In the next article, you'll investigate how machine learning systems take these encodings or embeddings as input and process them for training machine learning models. You'll learn about how machine learning models are evaluated and validated. Finally, you'll explore machine learning culture to see how it affects memorization in machine learning.&lt;/p&gt;
&lt;p&gt;I'm very open to feedback (positive, neutral or critical), questions (there are no stupid questions!) and creative re-use of this content. If you have any of those, please share it with me! This helps keep me inspired and writing. :)&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Acknowledgements&lt;/em&gt;: I would like to thank &lt;a href="https://vickiboykis.com/"&gt;Vicki Boykis&lt;/a&gt;, &lt;a href="https://desfontain.es/serious.html"&gt;Damien Desfontaines&lt;/a&gt; and &lt;a href="https://www.linkedin.com/in/yanndupis/"&gt;Yann Dupis&lt;/a&gt; for their feedback, corrections and thoughts on this series. Their input greatly contributed to improvements in my thinking and writing. Any mistakes, typos, inaccuracies or controversial opinions are my own.&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;If you're interested in seeing some alternatives to the encoding and decoding that computers landed on and combining that with problems in machine learning, I recommend looking at the work of inventor, physicist and encoding/decoding machine pioneer &lt;a href="https://en.wikipedia.org/wiki/Emanuel_Goldberg"&gt;Emanuel Goldberg&lt;/a&gt;.&amp;#160;&lt;a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;This process uses either a continuous bag-of-words (pick the right word from a rotating list to fill in the blank) or a skip-gram (pick what words are contextually related and might show up now or soon) approach. You can read more about how this works in &lt;a href="https://en.wikipedia.org/wiki/Word2vec"&gt;the more detailed section of the Wikipedia article&lt;/a&gt;.&amp;#160;&lt;a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:3"&gt;
&lt;p&gt;Byte-pair encoding is an optimized compression algorithm, so things like repeated characters or bytes can and will be compressed into a single mapping based on the other tokens available in the dataset. This is a language-agnostic way of representing text that will expand to fit the common tokens and patterns seen in a large dataset, while also adapting to less common tokens or completely unseen tokens by breaking them down into smaller chunks (i.e. 7Fvw might become 7F, v, w).&amp;#160;&lt;a class="footnote-backref" href="#fnref:3" title="Jump back to footnote 3 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="ml-memorization"></category></entry><entry><title>Machine Learning dataset distributions, history, and biases</title><link href="https://blog.kjamistan.com/machine-learning-dataset-distributions-history-and-biases.html" rel="alternate"></link><published>2024-11-13T00:00:00+01:00</published><updated>2024-11-13T00:00:00+01:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2024-11-13:/machine-learning-dataset-distributions-history-and-biases.html</id><summary type="html">&lt;p&gt;You probably are already aware that many machine learning datasets come from scraped internet data. Maybe you received the infamous GPT response: "Please note that my knowledge is limited to information available up until September 2021." You might have also read fear-mongering opinions and articles that companies will &lt;a href="https://theconversation.com/researchers-warn-we-could-run-out-of-data-to-train-ai-by-2026-what-then-216741"&gt;"run out …&lt;/a&gt;&lt;/p&gt;</summary><content type="html">&lt;p&gt;You probably are already aware that many machine learning datasets come from scraped internet data. Maybe you received the infamous GPT response: "Please note that my knowledge is limited to information available up until September 2021." You might have also read fear-mongering opinions and articles that companies will &lt;a href="https://theconversation.com/researchers-warn-we-could-run-out-of-data-to-train-ai-by-2026-what-then-216741"&gt;"run out of data" to train AI systems&lt;/a&gt; soon.&lt;/p&gt;
&lt;p&gt;In this article, you'll examine exactly how data is collected. You'll look at what properties this data has and evaluate known issues with such collection processes, such as amplifying systemic biases and obscuring privacy. Understanding these points will help you better understand machine learning memorization and evaluate deep learning when designing systems. In this article and the next few articles, you'll be focusing on understanding how machine learning systems work, so that you can later understand how they memorize.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;TLDR (too long; didn't read)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Datasets collected online have a long-tail distribution&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Common examples are heavily repeated&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Uncommon examples outnumber common examples&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Trying to learn uncommon examples in ML systems is hard&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data collection culture is grab everything as cheaply as possible&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;The history of the internet and internet culture introduce systemic biases&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;This creates problems with privacy, equity and justice in ML systems&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=JDAPDpbXRXw&amp;amp;list=PLJkNSeYcYBlCaamscxip0l2LGYCZ2TIom&amp;amp;index=1&amp;amp;ab_channel=ProbablyPrivate"&gt;Watch a video summary of this post&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Let's explore today's collected datasets and see what you can learn about them and how they work. In &lt;a href="https://blog.kjamistan.com/a-deep-dive-into-memorization-in-deep-learning.html"&gt;this series&lt;/a&gt;, you'll focus specifically on data collected for large-scale deep learning tasks.&lt;/p&gt;
&lt;h3 id="natural-datasets-of-scraped-text-and-images"&gt;Natural Datasets of Scraped Text and Images&lt;/h3&gt;
&lt;p&gt;When you collect a large sample of text or image data and visualize the distribution, you often see a long tail. This was described by linguist George Zipf, and is sometimes referred to as the &lt;a href="https://en.wikipedia.org/wiki/Zipf%27s_law"&gt;Zipf-distribution or Zipf's law&lt;/a&gt;. With a long tail or Zipf probability distribution, you have a small number of very common types of examples and a long set of examples that are far less common.&lt;/p&gt;
&lt;p&gt;The probability distribution of the long tail looks like this:&lt;/p&gt;
&lt;p&gt;&lt;img alt="A probability distribution chart with a y-axis of numbers (0 to 100) and a x-axis called &amp;quot;Frequency&amp;quot;. The occurrences to the far left make up the &amp;quot;head&amp;quot;, reaching very high values until it steeply drops off into the &amp;quot;tail&amp;quot; with very low frequencies. You can see that the tail has many more examples than the head." src="./images/2024/long_tail_distribution.png"&gt;&lt;/p&gt;
&lt;p&gt;The common examples in the "head" occur at a much greater frequency than the less common "tail" examples. In addition, the tail composes a significant part of the entire distribution, so if you want to learn how to differentiate the examples in the tail (as with machine learning), this presents a difficult problem. How do you know what parts of the tail you need to learn and what might be not worth learning? Should you learn all of it, even examples which are singletons (i.e. only one example) or which may be outliers or errors?&lt;/p&gt;
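&lt;p&gt;You can see this distribution in almost any text you have lying around. A minimal sketch that counts word frequencies -- the filename is a placeholder:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Count word frequencies in a text file and compare the head with the tail.
from collections import Counter

with open("some_text.txt", encoding="utf-8") as f:
    words = f.read().lower().split()

counts = Counter(words)

print(counts.most_common(5))         # the "head": a few words appear extremely often
singletons = [w for w, c in counts.items() if c == 1]
print(len(singletons), len(counts))  # the "tail": a large share of words appear exactly once
&lt;/code&gt;&lt;/pre&gt;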
&lt;p&gt;Because of the difficulty, there is significant research dedicated to studying the Zipf or so-called "long tail" distribution. &lt;a href="https://arxiv.org/abs/2110.04596"&gt;A survey of deep learning with the long tail&lt;/a&gt; explored a variety of approaches to address the long tail problem, including oversampling&lt;sup id="fnref:1"&gt;&lt;a class="footnote-ref" href="#fn:1"&gt;1&lt;/a&gt;&lt;/sup&gt; of the less common occurrences to ensure the model appropriately learns these classes and examples.&lt;/p&gt;
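&lt;p&gt;As a sketch of the basic oversampling idea (not the survey's specific methods), you can repeat tail examples until the classes are balanced, for example with scikit-learn's resample helper:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Oversampling a rare class: repeat tail examples so the model sees them more often.
# Assumption: scikit-learn is installed; the filenames below are made up for illustration.
from sklearn.utils import resample

common = [("bus", f"bus_{i}.jpg") for i in range(1000)]    # head class
rare = [("ziggurat", f"zig_{i}.jpg") for i in range(10)]   # tail class

rare_oversampled = resample(rare, replace=True, n_samples=1000, random_state=0)
balanced = common + rare_oversampled

print(len(balanced))  # 2000 examples, half of them (repeated) ziggurat images
&lt;/code&gt;&lt;/pre&gt;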
&lt;p&gt;&lt;a href="https://ieeexplore.ieee.org/document/6909517"&gt;Another piece of research&lt;/a&gt; found two long tail problems in computer vision datasets. The first long tail happens when looking at the distribution of data across the classes&lt;sup id="fnref:2"&gt;&lt;a class="footnote-ref" href="#fn:2"&gt;2&lt;/a&gt;&lt;/sup&gt; or labels, like "person" or "bus" or "Ziggurat". In this distribution images of common objects and classes compose the head and much less common objects are in the tail.&lt;/p&gt;
&lt;p&gt;Within a class there is also a long tail based on other attributes. As shown in the following graphic, the positioning of the object in the photo has its own long tail distribution within a particular data class. The visual aspects of the objects within a class--for example, all photos labeled as buses--have typical representations for buses where the bus is in the center of the photo without any visual impediments or any other vehicles in the photo. There are also atypical views, where just the top part of the bus is visible over other cars and vehicles.&lt;/p&gt;
&lt;p&gt;These less common images are part of the tail of the class "bus", which is already in the tail of the overall categories of images collected. This makes a complex problem even more complex!&lt;/p&gt;
&lt;p&gt;&lt;img alt="There are three charts in the image. The top chart shows a long tail distribution as a histogram with a high peak at the left and a drastically declining set of classes. The typical classes to the left are classes like window and person, in the middle are classes like rope, spoon and locker and to the right where much less common classes are located are classes like coffin and Ziggurat. The two lower graphs look at distributions within a particular class category: investigating person and bus. A common visibility pattern for a person is a person standing looking directly at the camera with no obstructions. An uncommon view of a person is a person riding a horse where the photo shows them from the side. A common view of a bus is an unobstructed bus shown clear in the center of the photo with no other cars. An uncommon view of a bus is just the top of the bus visible over a series of other cars." src="./images/2024/long_tail_classes_and_visibility.png"&gt;
&lt;strong&gt;Long tail of all classes and long tail within a class&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;When these datasets are scraped at scale, the head represents larger classes, like photos labeled as persons or prominent buildings or text related to a common topic in a common language, like US-centric English-language news. Within those overrepresented classes, there are also less common examples of that population or class, like local news events which don't make US national news or a photo of a building that doesn't exist anymore.&lt;/p&gt;
&lt;p&gt;What about the entirety of the tail, though? In natural language processing, this means ALL of the world's other languages end up with a much smaller representation when compared with available text in English, due to the dominance of English on the internet.&lt;sup id="fnref:3"&gt;&lt;a class="footnote-ref" href="#fn:3"&gt;3&lt;/a&gt;&lt;/sup&gt; Within computer vision, generated "typical" wedding images look like US weddings because the much higher occurrence of those photos online greatly outweighs other representations of weddings around the world.&lt;/p&gt;
&lt;p&gt;This point will be quite important for understanding how memorization happens. To give you a preview, I want you to try to draw a "generic" person or make a list of what you might use to decide if a photo has a person in it. Would your list work for all photos you can take of a person or of people? What list would you need in order to make sure that all photos of people are classified correctly?&lt;/p&gt;
&lt;p&gt;Online spaces create a fair amount of content duplication, especially since the advent of the search engine and search engine optimization. Photos of people in front of the Eiffel Tower are much more common than animals in their natural habitat or life in places with fewer digital cameras and devices. For text, there exists massive duplication of boilerplate letter text, common licenses (like the Apache license), common marketing content, content with large distribution channels (like AP news blurbs) and even famous quotes. These texts are usually in English and represent US spelling and grammar rules.&lt;/p&gt;
&lt;p&gt;If you try to learn the entirety of the world, which might be the case if your goal is to create a "general AI" system, the duplicates make learning more difficult, because they are overrepresented. Finding duplicates to remove is easy if the duplicates are an exact match. But usually you have to solve the problem of near-duplication, where data is very close but not actually exactly duplicated. This is still a hard problem to automate.&lt;/p&gt;
&lt;p&gt;Humans are good at noticing things like if a photo is from the same moment or photoshoot but from a different angle. Humans are also good at noticing things like plagiarism or when an idea, quote, or section of content is mimicking another piece of content. Computers are still not very good at this. Therefore, it's unlikely you can truly remove all duplicates that a human might mark as duplicate using computer-assisted methods. Large deep learning models often memorize repeated data, which you'll explore later in this article series.&lt;/p&gt;
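&lt;p&gt;Exact duplicates are easy to catch by hashing; near-duplicates are where it gets hard. A minimal sketch contrasting the two with Python's standard library:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Exact duplicates are trivial to find with a hash; near-duplicates need fuzzier comparison.
import hashlib
from difflib import SequenceMatcher

a = "Licensed under the Apache License, Version 2.0 (the 'License');"
b = 'Licensed under the Apache License, Version 2.0 (the "License") ;'

# Exact-match check: a single changed character produces a completely different hash.
print(hashlib.sha256(a.encode()).hexdigest() == hashlib.sha256(b.encode()).hexdigest())  # False

# Fuzzy similarity: a ratio close to 1.0 means "probably the same content".
print(SequenceMatcher(None, a, b).ratio())  # close to 1.0 -- a near-duplicate a hash would miss
&lt;/code&gt;&lt;/pre&gt;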
&lt;p&gt;But, just how did scraped internet data come to represent "the world"? If the data was more diverse, would the resulting models be more representative? Would the models learn fewer biases? Let's explore by examining the history of machine learning data collection.&lt;/p&gt;
&lt;h3 id="data-collection-for-machine-learning-a-history"&gt;Data Collection for Machine Learning: A History&lt;/h3&gt;
&lt;p&gt;In the early days of deep learning, many datasets were collected by researchers or university research groups to provide data for deep learning research.&lt;/p&gt;
&lt;p&gt;One famous dataset is the &lt;a href="https://en.wikipedia.org/wiki/MNIST_database"&gt;MNIST dataset&lt;/a&gt;, first introduced in 1998 by Yann LeCun et al. The dataset is a canonical example for any computer vision student or machine learning hobbyist.&lt;/p&gt;
&lt;p&gt;The original dataset was collected from US National Institute of Standards and Technology (NIST) employees, who were asked to fill out a form that collected their writing. Then the dataset was expanded because it was too small and not as diverse as real handwriting, so the researchers asked several US high schools to participate. The students filled out forms that looked like this:&lt;/p&gt;
&lt;p&gt;&lt;img alt="A form with numerous fields filled out, including location, date, and several writing exercises, such as writing the numbers and letters of the English alphabet in different order and with varying cases. At the end, there is a field to write the beginning of the US Declaration of Independence." src="./images/2024/NIST_handwriting.jpg"&gt;
&lt;strong&gt;NIST Handwriting Form&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The details on how this assignment was given and whether consent was collected are fuzzy, but as a child of the 80s from the US, this form looks to me like a classroom assignment, not an activity kids would fill out for fun in their own time. There is no information on the form about what the data will be used for, which makes it hard to understand if, when and for how long the information from the form will be saved. Of course, these letters and digits have now been duplicated across the world many times for every entry-level computer vision class. If a student wanted to revoke consent for machine learning, this would now be impossible.&lt;/p&gt;
&lt;p&gt;This initial start turned into a longer trend, best described as "collect data as cheaply and as quickly as possible". This trend became a widespread fundamental practice within the field of machine learning. For those that did it well, it was also immensely profitable.&lt;/p&gt;
&lt;p&gt;Indexing the internet and all of its content fueled the growth of today's large-scale technology companies like Google, which advanced its search engine capabilities via massive data collection. These datasets were collected without special attention to copyright, privacy or consent, other than avoiding websites that specifically blocked crawlers via the &lt;a href="https://en.wikipedia.org/wiki/Robots.txt"&gt;'robots.txt'&lt;/a&gt; file.&lt;/p&gt;
&lt;p&gt;The data collection was described as "indexing", where keywords were matched to content URLs. But to produce these matches, the entire website content was scraped and saved first. The scraped data--usually a file or set of files--could be deleted or updated by contacting Google if you were the person running the website, who may or may not be the person whose content was posted.&lt;/p&gt;
&lt;p&gt;Additional datasets like &lt;a href="https://aclanthology.org/Q14-1006.pdf"&gt;Flickr30K&lt;/a&gt; and &lt;a href="http://vis-www.cs.umass.edu/lfw/"&gt;Labeled Faces in the Wild&lt;/a&gt; show a similar approach to data collection within the computer vision domain -- grab whatever you can and ask questions (or for permission) later.&lt;/p&gt;
&lt;p&gt;Unfortunately, this isn't exactly how many of us thought about using the internet, at least not until recently. &lt;a href="https://en.wikipedia.org/wiki/Helen_Nissenbaum"&gt;Helen Nissenbaum&lt;/a&gt; speaks about the context in which you write, post photos and connect with others online, and how this context often doesn't match the mental model you have when operating in the real, non-digital world. It's difficult for humans to understand exactly how, where and with whom they are sharing information via a digital interface, because the context and related transparency on how the data is used, stored and managed aren't entirely clear.&lt;/p&gt;
&lt;p&gt;When you are writing to a close friend by commenting on their post, you probably don't immediately assume a complete stranger will read it or scrape it and use it for machine learning. When you posted something 10 years ago on a personal blog, you probably didn't assume it would be stored somewhere a decade later and used in the latest GPT model. When you shared your photos on Flickr in 2010, you didn't foresee that it could end up in a Generative AI portrait. And yet, those things are indeed possible due to the lack of contextual integrity provided in many online spaces and platforms.&lt;/p&gt;
&lt;p&gt;Training with online data has other pitfalls and challenges, many to do with the skewed culture of the internet itself and the resulting biases in these scraped datasets.&lt;/p&gt;
&lt;h3 id="internet-culture-and-biases"&gt;Internet Culture and Biases&lt;/h3&gt;
&lt;p&gt;The internet was initially used by &lt;a href="https://en.wikipedia.org/wiki/History_of_the_World_Wide_Web"&gt;a small group of people, available only in a small number of places&lt;/a&gt;. It still carries the biases of those groups--being a place where it's often safer to be "Western", white and male.&lt;/p&gt;
&lt;p&gt;These online biases show up in training datasets produced by scraping the web. For example, one large NLP dataset used to train early GPT models was &lt;a href="https://en.wikipedia.org/wiki/The_Pile_(dataset)"&gt;The Pile&lt;/a&gt;. The Pile encompasses several scraped datasets, including one called &lt;a href="https://openwebtext2.readthedocs.io/en/latest/"&gt;OpenWebText2&lt;/a&gt;. This dataset contains the text of all the websites with top-rated linked Reddit posts between 2005 and 2020. Not only is this dataset a violation of those users' belief they could later delete their posts, but Reddit also &lt;a href="https://www.theatlantic.com/technology/archive/2020/06/reddit-racism-open-letter/612958/"&gt;hosts several popular communities promoting visceral and violent hatred, racism and bigotry&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The resulting datasets show massive societal biases, including, but not limited to, racism, sexism, homophobia, US-centrism and xenophobia. Work from researchers like &lt;a href="https://www.dair-institute.org/team/"&gt;Timnit Gebru and the DAIR Institute&lt;/a&gt;, &lt;a href="http://www.danah.org/papers/"&gt;danah boyd&lt;/a&gt; and &lt;a href="https://katecrawford.net/"&gt;Kate Crawford&lt;/a&gt; has highlighted these biases since 2017. Research like &lt;a href="https://arxiv.org/abs/1608.07187"&gt;Caliskan et al.'s analysis of sexism in translation&lt;/a&gt; and &lt;a href="https://arxiv.org/abs/1607.06520"&gt;Bolukbasi et al.'s work&lt;/a&gt;, which borrows from &lt;a href="https://blog.kjamistan.com/embedded-isms-in-vector-based-natural-language-processing.html"&gt;my initial research about sexism in word vectors&lt;/a&gt;, has been available since 2016. &lt;a href="http://gendershades.org/"&gt;Buolamwini and Gebru's Gender Shades&lt;/a&gt; demonstrated in 2018 that darker skinned women are at a disadvantage when it comes to accurate facial recognition. These problems have been well documented for nearly a decade, and yet the common practice in machine learning communities is still to use these problematic datasets and to produce more by scraping more data.&lt;/p&gt;
&lt;p&gt;Although text data can be used directly as it is collected, image data must be appropriately labeled to perform adequate computer vision or text-to-image generative tasks. For text-to-image or image-to-text models, the labels involve either describing the entire image or scene, or creating bounding boxes, where parts of the image are highlighted and describing a smaller subsection of the image. This might involve labeling all objects in the image separately along with their bounding boxes.&lt;/p&gt;
&lt;p&gt;For early computer vision datasets, appropriately learning labels (or categories of things, like a "cat") meant data collection attempted to find images with only one thing in them. To learn more quickly and to have more data with the same labels, this might mean that a photo of a person is just labeled "person", and that image may be of the person in the center, at the side, or in some other part of the image. As you can imagine, these labels vary significantly in quality and accuracy, depending on how they are collected. Many of today's labels are semi-automatically scraped from the web and use &lt;a href="https://en.wikipedia.org/wiki/Alt_attribute"&gt;image ALT text&lt;/a&gt; as the description.&lt;sup id="fnref:5"&gt;&lt;a class="footnote-ref" href="#fn:5"&gt;5&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;For higher-quality datasets, humans work as labelers of scraped or user-generated text, images and videos. These data workers are frequently subjected to &lt;a href="https://data-workers.org/"&gt;poor working conditions and lack of psychological support&lt;/a&gt; when facing traumatic content for systems like  content moderation. For other machine learning tasks, the instructions are often &lt;a href="https://peertube.dair-institute.org/w/rgT6Bq7VhLUR4VFdaZLnAr"&gt;meager and very little context is given to the data workers&lt;/a&gt;, resulting in datasets with lower quality assurance than one would want if people were properly informed about the task.&lt;sup id="fnref:6"&gt;&lt;a class="footnote-ref" href="#fn:6"&gt;6&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;Crawford and Paglen highlight additional issues with crowd-sourced labels in &lt;a href="https://excavating.ai/"&gt;Excavating AI&lt;/a&gt;. Their research and the resulting art piece and essay investigated the ImageNet dataset--highlighting labels like "alcoholic", "ballbuster" and "pervert". There is extensive academic research on the topic, like how online images amplify gender biases and other systems of oppression explored in &lt;a href="https://arxiv.org/abs/1605.06083v1"&gt;Flickr 30K dataset research&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Even without human labelers, collected data from the internet reinforces biases and stereotypes from search algorithms and content providers. Safiya Noble documented biases in search engines and their results in &lt;a href="https://nyupress.org/9781479837243/"&gt;Algorithms of Oppression&lt;/a&gt;. Her book inspired me to look at what surfaces in popular web crawl datasets, uncovering what it's like, for example, to search for &lt;a href="https://c4-search.apps.allenai.org/?q=brazilian+girl"&gt;"Brazilian Girl" in the C4 dataset (a part of the common crawl)&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A zoomed-in view of a search results page, where the top result is about how to date a Brazilian girl and the next link down is clearly an advertisement to engage in sex. Citing: Hot Brazilian Girl! She does anything!!" src="./images/2024/c4_brazilian_girl.png"&gt;&lt;/p&gt;
&lt;p&gt;The initial examples from the first two pages of results talk about how to ask a Brazilian girl out, how to date a Brazilian girl or they are fake advertisements to meet Brazilian girls. When this data is the only context a large language model or generative model receives on what "Brazilian girl" represents, then in the resulting model "dating" is closer to that idea than being a scientist, researcher, politician, athlete, philosopher, etc.&lt;/p&gt;
&lt;p&gt;These internet biases create AI systems that repeat and amplify them as their use expands, and that can influence how people are seen by others and how people see themselves. This has lasting impacts on society, and it further amplifies and entrenches harmful content.&lt;/p&gt;
&lt;p&gt;It's important to keep this in mind when learning about machine learning. What data are you trying to learn? What data do you think is "high quality" and why? What is the machine learning community doing by attempting to build expansive, cheap datasets?&lt;/p&gt;
&lt;p&gt;And when memorization happens on these data, what is memorized and reproduced from this internet culture?&lt;/p&gt;
&lt;h3 id="bigger-question-what-even-is-data"&gt;Bigger question: what even is data?&lt;/h3&gt;
&lt;p&gt;Often these datasets represent only a subset of the world, as you've learned thus far. Why are they used as universal truths if they can only represent a small sample of reality?&lt;/p&gt;
&lt;p&gt;danah boyd wrote about &lt;a href="https://content.iospress.com/download/information-services-and-use/isu200098?id=information-services-and-use%2Fisu200098"&gt;how measurement and scientific inquiry come from murky histories of cultural dominance, colonization and oppression&lt;/a&gt;. By assuming that there is a standard system of measurement for everything, the assumptions and biases built into what is "normal" and who determines these standards leave some examples marked as "normal" and others marked as outside the norm. Since machine learning then uses these standards and measurements to automatically learn to discriminate one group, idea or concept from another, these biases are highlighted, reproduced and become entrenched in the concept of data.&lt;/p&gt;
&lt;p&gt;Understanding memorization begins with understanding how data is collected and used, and what properties that data has. Massive duplication and the long-tail have a deep impact on how machine learning models--particularly deep learning models--learn, generalize and memorize.&lt;/p&gt;
&lt;p&gt;The ethical, social and philosophical problems of how data is collected and labeled are also important to study alongside memorization because it is very difficult to unlearn these concepts. When memorization of these biases happens, it becomes even more difficult to stop a model from reproducing those examples, often requiring serious intervention or a complete redesign and retraining.&lt;/p&gt;
&lt;p&gt;In the next article, you'll investigate how machine learning systems take these datasets as input and process them for machine learning. Specifically, you'll investigate how encoding and embeddings work to take complex input and make it easy to "learn".&lt;/p&gt;
&lt;p&gt;I'm very open to feedback (positive, neutral or critical), questions (there are no stupid questions!) and creative re-use of this content. If you have any of those, please share it with me! This helps keep me inspired and writing. :)&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Oversampling_and_undersampling_in_data_analysis"&gt;Oversampling&lt;/a&gt; involves pulling from a certain subset of examples more frequently when performing a machine learning or analysis task in order to ensure they are better represented within the overall dataset or population. This is a basic statistics strategy which help when using nonrepresentative data or when minority subpopulations need to be adequately evaluated (i.e. when one or more groups are heavily overrepresented compared to other groups). If performed, the resulting data analysis or machine learning is much more likely to process duplicates from the oversampled population. This becomes an important factor for memorization, which you will explore in a later article.&amp;#160;&lt;a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;Categories like "bus" or "person" are referred to as classes. Photos are labeled in order to be able to see an image later and create a prediction on what is in the image (like a bus). A dataset that is to be used for a classification problem might refer to classes as labels (or vice versa), because the data is tagged or labeled with the class name or encoding (sometimes a number that maps to a human-readable name). Technically, the classes refer to the categories. When the dataset is collected and the examples are tagged with the appropriate matching class, that process produces a label.&amp;#160;&lt;a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:3"&gt;
&lt;p&gt;This has prompted significant research and deployment strategies to offer better multi-lingual AI products, including &lt;a href="https://arxiv.org/abs/2305.07004"&gt;joint research from Microsoft China and university researchers&lt;/a&gt; which first translates incoming text and prompts into English before passing them to a production LLM system, which performs much better on English text than on Chinese.&amp;#160;&lt;a class="footnote-backref" href="#fnref:3" title="Jump back to footnote 3 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:4"&gt;
&lt;p&gt;The early internet was available to researchers and government and military personnel. As the World Wide Web grew, it was most accessible in the US and parts of Europe, where internet access for non-academic and non-military persons was subsidized and supported by local authorities. This created an overrepresentation on the web of these world views and lifestyles. The &lt;a href="https://en.wikipedia.org/wiki/Dot-com_bubble"&gt;early internet boom&lt;/a&gt; and resulting web infrastructure was primarily located in Silicon Valley, California, which brought the area and users' own political, economic and social views to the newsletters, websites, browsers and search engines of the time. These marks are still recognizable in the way many people search, browse and experience the internet today.&amp;#160;&lt;a class="footnote-backref" href="#fnref:4" title="Jump back to footnote 4 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:5"&gt;
&lt;p&gt;Due to the internet's propensity to skew representation, this practice results in a Western, white supremacist and patriarchal view of personhood. &lt;a href="https://arxiv.org/abs/2310.19981"&gt;Ghosh et al&lt;/a&gt; demonstrated that Stable Diffusion models prompted with "person" overwhelmingly produce a white male. Their research also uncovered erasure of indigenous identities and hypersexualization of women from particular areas of the world.&amp;#160;&lt;a class="footnote-backref" href="#fnref:5" title="Jump back to footnote 5 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:6"&gt;
&lt;p&gt;The lack of instructions and connection to the larger team and task could be a result of trying to keep the data workers (often subcontractors working for another company) further removed from their peers working on other parts of machine learning, like the high-paid data workers who train the models or architect the resulting systems.&amp;#160;&lt;a class="footnote-backref" href="#fnref:6" title="Jump back to footnote 6 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="ml-memorization"></category></entry><entry><title>Deep learning memorization, and why you should care</title><link href="https://blog.kjamistan.com/deep-learning-memorization-and-why-you-should-care.html" rel="alternate"></link><published>2024-11-04T00:00:00+01:00</published><updated>2024-11-04T00:00:00+01:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2024-11-04:/deep-learning-memorization-and-why-you-should-care.html</id><summary type="html">&lt;p&gt;When's the last time that ChatGPT parroted someone else's words to you? Or the last time a diffusion model you used recreated someone's art, someone's photo, someone's face? Has Copilot &lt;a href="https://x.com/docsparse/status/1581461734665367554"&gt;given you someone else's code without permission or attribution&lt;/a&gt;? If this happened, how would you know for sure?&lt;/p&gt;
&lt;p&gt;In this …&lt;/p&gt;</summary><content type="html">&lt;p&gt;When's the last time that ChatGPT parroted someone else's words to you? Or the last time a diffusion model you used recreated someone's art, someone's photo, someone's face? Has Copilot &lt;a href="https://x.com/docsparse/status/1581461734665367554"&gt;given you someone else's code without permission or attribution&lt;/a&gt;? If this happened, how would you know for sure?&lt;/p&gt;
&lt;p&gt;In this article series, you'll explore how and why memorization happens in deep learning&lt;sup id="fnref:1"&gt;&lt;a class="footnote-ref" href="#fn:1"&gt;1&lt;/a&gt;&lt;/sup&gt;, as well as what can be done to address the issues it raises.&lt;/p&gt;
&lt;p&gt;However, to ensure it's worth studying, let's first investigate whether this phenomenon really occurs.&lt;/p&gt;
&lt;h3 id="memorization-in-the-wild"&gt;Memorization in the wild&lt;/h3&gt;
&lt;p&gt;&lt;img alt="An image of two columns of text next to each other. On one side it shows a GPT-4 response. On the other side it shows text from a New York Times article. The text is all in red as the text is exactly the same." src="./images/2024/nyt_lawsuit.png"&gt;
&lt;em&gt;NYT vs. OpenAI&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Here is a screenshot of an excerpt from &lt;a href="https://www.courthousenews.com/wp-content/uploads/2023/12/new-york-times-microsoft-open-ai-complaint.pdf"&gt;the New York Times lawsuit against Microsoft and OpenAI&lt;/a&gt;. On the right is the original text of the New York Times article. On the left you can see the extracted text from GPT-4. If a word is red, it means it was directly repeated and therefore memorized by the deep learning model. Is this a violation of copyright law?&lt;/p&gt;
&lt;p&gt;&lt;img alt="An image of a photo of a woman on the left side that looks like a book promotional photo. It has her name and her book title underneath as the caption. On the right side you can see an almost exact copy of the image with some artifacts common to Generative AI. That is the extraction photo." src="./images/2024/stable_diffusion_extraction.png"&gt;
&lt;em&gt;Stable Diffusion Face Extraction&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;And here is an example from &lt;a href="https://arxiv.org/abs/2301.13188"&gt;a stable diffusion model trained by Carlini et al.&lt;/a&gt; on Stable Diffusion's training dataset. This person's face is repeated less than three times in the training data. When prompted with the person's name, you can reproduce their face, or more specifically, the photo from the training dataset. Is this a violation of privacy?&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of the linked video from OpenAI's demo of the Sky voice" src="./images/2024/openai_sky.png"&gt;
&lt;em&gt;Is OpenAI's Sky imitating ScarJo?&lt;/em&gt;
&lt;a href="https://www.youtube.com/watch?v=D9byh4MAsUQ"&gt;Link to watch above video&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Finally, here is an example of the new release of OpenAI's GPT-4o features, which originally started with a voice that sounded eerily like Scarlett Johansson's. Johansson had been approached several times by Sam Altman to be the voice of the new system but declined. Instead, it appears &lt;a href="https://arstechnica.com/tech-policy/2024/05/openai-pauses-chatgpt-4o-voice-that-fans-said-ripped-off-scarlett-johansson/"&gt;OpenAI found a voice actor to mimic her voice&lt;/a&gt; in order to give a cultural hat tip to her role in the movie &lt;em&gt;Her&lt;/em&gt;, where she voiced the AI character.&lt;/p&gt;
&lt;h3 id="what-is-happening-here"&gt;What is happening here?&lt;/h3&gt;
&lt;p&gt;Understanding how deep learning systems work and succeed at tasks has been an active area of research for more than a decade. In this series, you'll explore both the technical aspects of how these models memorize and the machine learning community culture that allows this to take place. You'll review the seminal research around privacy, security and memorization in deep learning, and better understand deep learning because of it. This knowledge will also help you better understand how to approach and use models and AI systems.&lt;/p&gt;
&lt;p&gt;You'll start by looking at how datasets are collected and what their properties are, then explore machine learning training and evaluation and the impact of those choices. You'll investigate what data repetition and novelty have to do with memorization, and how that can be mathematically modeled and proven. You'll learn the relationships between overparameterization, model size and memorization and see some examples of how this phenomenon was discovered long before GPT models were released.&lt;/p&gt;
&lt;p&gt;You'll also explore several ideas for how memorization can and should impact the way machine learning engineers manage data, the way models are trained, the way we talk about "intelligent systems" and how to reason about when to use deep learning.&lt;/p&gt;
&lt;h3 id="but-why-should-i-care-about-memorization"&gt;But, why should I care about memorization?&lt;/h3&gt;
&lt;p&gt;As one person who I spoke with put it, "it doesn't really matter if a model memorizes, as long as it brings us closer to human-level intelligence". But is that true?&lt;/p&gt;
&lt;p&gt;There is very little intelligence in merely saving a string of tokens or pixels and being able to repeat them when prompted. It is something that we humans are not so great at, but that is due to our intelligence, not in spite of it. Rote memorization is something computers have done for many decades and something they excel at.&lt;/p&gt;
&lt;p&gt;This critique is echoed in the &lt;a href="https://www.youtube.com/watch?v=vyqXLJsmsrk&amp;amp;ab_channel=MITDepartmentofPhysics"&gt;remarks from LeCun&lt;/a&gt; and many other deep learning researchers for several years now. The current way that practitioners train large language and computer vision systems is inherently linked to the training data and the limits within that data. These models can get quite good at mimicking the data, but it's heavily disputed whether their performance shows deep reasoning, world models or systems thinking.&lt;/p&gt;
&lt;p&gt;Memorization is not learning, even if it can mimic learning. If you want to build intelligent systems, you'll have to do much better than memorization. And you'd need to prove that deep learning models are capable of significantly more than memorization and remixing. Based on what you'll learn around evaluation datasets, you'll likely have new questions about how machine learning practitioners review what is learned, what is generalizable and how the field might actually move forward towards better generalization.&lt;/p&gt;
&lt;p&gt;There are additional reasons to care about memorization. Privacy is a fundamental human right according to the UN Human Rights Convention. The human right to self-determination about how information related to your personhood, your life and your behavior is collected, stored and used is a common understanding across many cultures, nations, lands and societies. As seen in the Second World War, how governments and technology systems collect, use, and proliferate data has a direct impact on people's lives.&lt;/p&gt;
&lt;p&gt;Privacy is closely related to trust, and how you manage your own privacy relates to who, how and what you trust. In this way, privacy mirrors social bonds that help keep society functioning, that help promote equality amongst persons and that create trust and accountability amongst ourselves and our institutions. When your trust in something is broken, you likely no longer want to share intimate details or data with such systems. And when your privacy is violated, for example, via online stalking or harassment, or even smaller examples, like a super creepy ad or a post that got shared out of context, you may feel violated. Your trust was broken.&lt;/p&gt;
&lt;p&gt;Privacy isn't equally available to everyone -- despite common beliefs to the contrary. Some of us have what I call "privacy privilege", where your face is not stored in a database used by the police or state intelligence to track your movements. Some of us might represent the best outcomes in the models, where those systems work in our favor. For example, you are granted automatic entrance in an interview process or you get pre-approved for a loan. In those cases, your trust isn't violated by the system's usage. But there are many persons who do not fall into those categories - where these systems violate their privacy, their right to self-determination, their right to protect themselves from algorithmic classifications and categorizations.&lt;/p&gt;
&lt;p&gt;Memorization in machine learning has deep implications in how to reason about choices in machine learning, and studying it can better expose phenomena like unfair outcomes, overexposed persons and how machine learning systems link to other systems of power and oppression in our world.&lt;/p&gt;
&lt;p&gt;Memorization violates consent, erodes privacy and throws what all of us are being sold under the banner of "intelligence" into question. By exposing how memorization works, you are also pushing for more realistic views of AI systems and more realistic assumptions around how they can and should be used. You are also evaluating how they shouldn't be used. By studying memorization, you counter fraudulent messages on how machine learning works, and expose much more interesting fields of study based in real science.&lt;/p&gt;
&lt;h3 id="lets-dive-in"&gt;Let's dive in!&lt;/h3&gt;
&lt;p&gt;I hope you're excited to learn more. In the coming articles, you'll explore how deep learning systems create the opportunity for memorization, along with a better understanding of how it happens.&lt;/p&gt;
&lt;p&gt;To get a head start, if you already work in machine learning, I want you to reflect on the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;How do you collect data?&lt;/li&gt;
&lt;li&gt;How do you incentivize and optimize learning?&lt;/li&gt;
&lt;li&gt;How do you architect deep learning models?&lt;/li&gt;
&lt;li&gt;How do you govern data usage in ML systems?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The next article specifically investigates data collection, looking at how long-tail datasets create uneven distributions to be learned. To stay up-to-date, you can &lt;a href="https://probablyprivate.com/"&gt;sign up for the Probably Private newsletter&lt;/a&gt; or &lt;a href="https://www.linkedin.com/in/katharinejarmul/"&gt;follow my work on LinkedIn&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I'm very open to feedback (positive, neutral or critical), questions (there are no stupid questions!) and creative re-use of this content. If you have any of those, please share it with me! This helps keep me inspired and writing. :)&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Acknowledgements&lt;/em&gt;: I would like to thank &lt;a href="https://vickiboykis.com/"&gt;Vicki Boykis&lt;/a&gt;, &lt;a href="https://desfontain.es/serious.html"&gt;Damien Desfontaines&lt;/a&gt; and &lt;a href="https://www.linkedin.com/in/yanndupis/"&gt;Yann Dupis&lt;/a&gt; for their feedback, corrections and thoughts on this series. Their input greatly contributed to improvements in my thinking and writing.&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;In this series, you'll explore deep learning as a field, which includes the use and training of neural networks to perform a task or series of tasks. A large language model (LLM) is a particular type of deep learning model which can either produce text (just like a normal language model), or answer chats with instructions or prompts, which is what ChatGPT does. You'll learn more about how these systems work from small building blocks (neurons and layers) to the entire model by studying how they are built, trained, evaluated and used.&amp;#160;&lt;a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="ml-memorization"></category></entry><entry><title>A Deep Dive into Memorization in Deep Learning</title><link href="https://blog.kjamistan.com/a-deep-dive-into-memorization-in-deep-learning.html" rel="alternate"></link><published>2024-11-03T00:00:00+01:00</published><updated>2024-11-03T00:00:00+01:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2024-11-03:/a-deep-dive-into-memorization-in-deep-learning.html</id><summary type="html">&lt;p&gt;Want to learn more about how, when and why machine learning, particularly deep learning systems memorize data? By studying memorization, you'll learn more about how machine learning systems really function, along with how privacy works from a technical point-of-view. You'll also be better able to decide how, when and where …&lt;/p&gt;</summary><content type="html">&lt;p&gt;Want to learn more about how, when and why machine learning, particularly deep learning systems memorize data? By studying memorization, you'll learn more about how machine learning systems really function, along with how privacy works from a technical point-of-view. You'll also be better able to decide how, when and where to use AI systems based on your new learnings.&lt;/p&gt;
&lt;p&gt;This series aims to introduce the topics to a general audience, but there are plenty of links to dive deeper in each article. This page will be updated as the series is published.&lt;/p&gt;
&lt;p&gt;The recommended reading order is as follows, but feel free to hop around!&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://blog.kjamistan.com/deep-learning-memorization-and-why-you-should-care.html"&gt;Introduction: Why study memorization in machine learning?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.kjamistan.com/machine-learning-dataset-distributions-history-and-biases.html"&gt;Start with the Data: Machine Learning dataset distributions, history, and biases&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.kjamistan.com/encodings-and-embeddings-how-does-data-get-into-machine-learning-systems.html"&gt;Encodings and embeddings: How does data get into machine learning systems?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.kjamistan.com/gaming-evaluation-the-evolution-of-deep-learning-training-and-evaluation.html"&gt;Gaming Evaluation: The evolution of deep learning training and evaluation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.kjamistan.com/how-memorization-happens-repetition.html"&gt;How Memorization Happens: Repetition&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.kjamistan.com/how-memorization-happens-novelty.html"&gt;How Memorization Happens: Novelty&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.kjamistan.com/how-memorization-happens-overparametrized-models.html"&gt;How Memorization Happens: Overparametrized models&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.kjamistan.com/differential-privacy-as-a-counterexample-to-aiml-memorization.html"&gt;Differential Privacy as a Counterexample to AI/ML Memorization&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.kjamistan.com/adversarial-examples-demonstrate-memorization-properties.html"&gt;Adversarial Examples Demonstrate Memorization Properties&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.kjamistan.com/defining-privacy-attacks-in-ai-and-ml.html"&gt;Privacy Attacks on AI/ML Models&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.kjamistan.com/blocking-aiml-memorization-with-software-guardrails.html"&gt;Blocking AI memorization with Software Guardrails&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.kjamistan.com/algorithmic-based-guardrails-external-guardrail-models-and-alignment-methods.html"&gt;Can we use Algorithmic Guardrails to block memorization?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.kjamistan.com/machine-unlearning-what-is-it.html"&gt;Machine Unlearning: What is it?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.kjamistan.com/machine-unlearning-how-todays-unlearning-is-done.html"&gt;Machine Unlearning: How it's done&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.kjamistan.com/attacks-on-machine-unlearning-how-unlearned-models-leak-information.html"&gt;Attacks on Machine Unlearning: How Unlearned Models Leak Information&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.kjamistan.com/differential-privacy-in-deep-learning.html"&gt;Differential Privacy in Deep Learning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.kjamistan.com/differential-privacy-in-todays-ai-whats-so-hard.html"&gt;Differential Privacy in Today's AI: What's so hard?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.kjamistan.com/differential-privacy-parameters-accounting-and-auditing-in-deep-learning-and-ai.html"&gt;Differential Privacy Parameters, Accouting, Auditing and Testing&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you are a more visual or video learner, I've made a &lt;a href="https://www.youtube.com/watch?v=JDAPDpbXRXw&amp;amp;list=PLJkNSeYcYBlCaamscxip0l2LGYCZ2TIom&amp;amp;ab_channel=ProbablyPrivate"&gt;YouTube playlist to accompany the series&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I'm very open to feedback (positive, neutral or critical), questions (there are no stupid questions!) and creative re-use of this content. If you have any of those, please share it with me! This helps keep me inspired and writing. :)&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Acknowledgements&lt;/em&gt;: I would like to thank &lt;a href="https://vickiboykis.com/"&gt;Vicki Boykis&lt;/a&gt;, &lt;a href="https://desfontain.es/serious.html"&gt;Damien Desfontaines&lt;/a&gt; and &lt;a href="https://www.linkedin.com/in/yanndupis/"&gt;Yann Dupis&lt;/a&gt; for their feedback, corrections and thoughts on this series. Their input greatly contributed to improvements in my thinking and writing. Any mistakes, typos, inaccuracies or controversial opinions are my own.&lt;/p&gt;</content><category term="ml-memorization"></category></entry><entry><title>Building a Privacy-First Newsletter</title><link href="https://blog.kjamistan.com/building-a-privacy-first-newsletter.html" rel="alternate"></link><published>2023-03-12T09:00:00+01:00</published><updated>2023-03-12T09:00:00+01:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2023-03-12:/building-a-privacy-first-newsletter.html</id><summary type="html">&lt;p&gt;Building a newsletter is a fairly common activity these days, with many creators, writers and thinkers making part of their living via subscribers willing to give small amounts of money out per year or month to get exclusive access. Beyond the paid subscriptions, there's an increasing demand for free, or …&lt;/p&gt;</summary><content type="html">&lt;p&gt;Building a newsletter is a fairly common activity these days, with many creators, writers and thinkers making part of their living via subscribers willing to give small amounts of money out per year or month to get exclusive access. Beyond the paid subscriptions, there's an increasing demand for free, or for fun, newsletters to cut through algorithmic noise. People enjoy hearing directly from other people they trust or enjoy, seeking advice, insight, humor and information, which is why the interest in newsletters and podcasts has grown.&lt;/p&gt;
&lt;p&gt;As there is a growing audience for these formats, you would think there would also be a wide array of newsletter platforms with different offerings. In Fall 2020 I started &lt;a href="https://probablyprivate.com"&gt;my newsletter &lt;em&gt;Probably Private&lt;/em&gt;&lt;/a&gt;, on the intersection of privacy and data science and went on a quest that took until Spring 2023 -- to create a privacy-first newsletter.&lt;/p&gt;
&lt;h2 id="why"&gt;Why?&lt;/h2&gt;
&lt;p&gt;A newsletter about privacy just seems like it should have privacy built in. For years now, I've been finding ways to manage my own online data, backups and even how I interact with social media -- finding a balance that fits my own political, cultural, social and individual idea of privacy. I think every human should have the ability to do this, and it should be fundamentally built into services that are offered, so that choice and consent are transparent and easily implemented in software, data and computing architectures.&lt;/p&gt;
&lt;p&gt;It also made sense to offer readers of my newsletter the privacy they deserve. I didn't want them to be automatically tracked, in any way. I thought they should be able to open a newsletter, read to their heart's delight, click on links, save things for later, or immediately send it to Spam should they see fit -- all without anyone knowing about it. Little did I know, this would turn out to be much more difficult than I originally thought.&lt;/p&gt;
&lt;h2 id="my-journey-begins"&gt;My Journey Begins...&lt;/h2&gt;
&lt;p&gt;Earlier kjam: Let's figure out what service to use by looking at what's popular and has some privacy policies I can read and ways to toggle what data is tracked! Off we go...&lt;/p&gt;
&lt;h3 id="revue"&gt;Revue&lt;/h3&gt;
&lt;p&gt;I first started out on &lt;a href="https://www.getrevue.co/"&gt;Revue&lt;/a&gt; in Fall 2020, as several folks recommended it to me and it was then a leader in newsletters, particularly those with supplemental paid options. It wasn't my intention to create a paid newsletter, but I thought if I ever did more newsletters, maybe one day there would be a paid one.&lt;/p&gt;
&lt;p&gt;I signed up, wrote the first installment, toggled off all possible tracking settings I could find and sent it out to my, then, about 50 subscribers. Later that day, I got an email from a reader mentioning that the links they received were tracked (!). I took a look at the fine-grained settings and found that there was literally no way for me to turn off click tracking on links. After some back and forth conversations on social media and via email with other privacy folks, it was recommended that I migrate to &lt;a href="https://buttondown.email/"&gt;Buttondown&lt;/a&gt;, a friendly and privacy-aware alternative. I picked up my content and migrated over...&lt;/p&gt;
&lt;h3 id="buttondown"&gt;Buttondown&lt;/h3&gt;
&lt;p&gt;I happily logged into Buttondown to see that I could turn off all tracking. I tested that no links were tracked. I tested that I couldn't see the views or opens, and I turned off emails to alert me of who was signing up or unsubscribing. Seemed that I was set!&lt;/p&gt;
&lt;p&gt;I wrote several newsletters and received no more privacy feedback, just content feedback. Finally, I thought, it's solved!&lt;/p&gt;
&lt;p&gt;But then I wanted to update and change the newsletter by setting up my own DNS and integrating Buttondown into my own website. First, I would need to start paying for Buttondown. This was to help cover the costs of the mail service provider and hosting. Sounded very reasonable, but I wanted to look further into these services, just to confirm they were also privacy-respecting, considering I'd now be helping pay for them.&lt;/p&gt;
&lt;p&gt;I first emailed the friendly Buttondown admin to confirm the services used. Then, I dug into the fine print from those services to figure out if tracking was somehow built in and what the options were for turning it off.&lt;/p&gt;
&lt;p&gt;This sent me down a new rabbit hole: namely, the sad state of privacy in email.&lt;/p&gt;
&lt;h2 id="the-plot-thickens-how-does-email-work"&gt;The Plot Thickens: How does Email work?&lt;/h2&gt;
&lt;p&gt;Many newsletter providers use a third-party mail service provider. This is the service that actually takes the email template, turns it into an email-friendly format and mails it out to your subscribers. Sometimes you are using a service that does both, but many times, particularly for newsletter "front-ends" or management services, the actual sending will be outsourced to a service that your newsletter provider uses to send bulk email.&lt;/p&gt;
&lt;p&gt;Let's walk through what normally then happens when this occurs.&lt;/p&gt;
&lt;p&gt;With normal SMTP, like when sending an email from your Gmail account, you are usually sending email from one large email provider to another, or within the same organization. Therefore, the SMTP services that need to send several messages back and forth to confirm sender, recipient and message text will either all occur within an internal service (like Google Mail or Outlook) or will happen between those services. This usually means your mail lands in the other person's Inbox and not in the Spam folder. For a deeper dive into how SMTP works, &lt;a href="https://en.wikipedia.org/wiki/Simple_Mail_Transfer_Protocol"&gt;check out Wikipedia&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;However, when you are sending bulk email, like with a newsletter, you need to send many emails at once. This is usually not allowed by the large email providers unless you are emailing a large internal group (i.e. a large work list). These providers turned off bulk sending long ago to fight spammers, and that created the surge of bulk email providers you can see today. These providers help send bulk mail for newsletters, brands for direct marketing and advertisers, and they can range from easy-to-use setups where you edit the email directly in the browser and hit send, to more complex ones, like using your cloud provider as a bulk email service, often requiring programmatic access.&lt;/p&gt;
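&lt;p&gt;To make "programmatic access" a bit more concrete, here is a rough sketch of what sending a newsletter issue over plain SMTP looks like from Python. The host, port and credentials are placeholders for illustration, not any real provider's settings -- every provider documents its own values.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import smtplib
from email.message import EmailMessage

# Placeholder SMTP settings -- substitute your provider's documented values.
SMTP_HOST = "smtp.example-provider.com"
SMTP_PORT = 587
SMTP_USER = "newsletter@example.com"
SMTP_PASSWORD = "use-an-app-password-here"

def send_issue(subscribers, subject, body_text):
    """Send one plain-text newsletter issue, one message per subscriber."""
    with smtplib.SMTP(SMTP_HOST, SMTP_PORT) as server:
        server.starttls()  # encrypt the connection to the mail server
        server.login(SMTP_USER, SMTP_PASSWORD)
        for address in subscribers:
            msg = EmailMessage()
            msg["From"] = SMTP_USER
            msg["To"] = address
            msg["Subject"] = subject
            msg.set_content(body_text)
            server.send_message(msg)

send_issue(["reader@example.org"], "Probably Private #1", "Hello, reader!")
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Bulk email providers wrap roughly this flow behind an API and a dashboard -- along with, as I was about to find out, a lot of tracking.&lt;/p&gt;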
&lt;h3 id="mail-service-providers-privacy"&gt;Mail Service Providers &amp;amp; Privacy&lt;/h3&gt;
&lt;p&gt;You can see how these services are ranked and compared &lt;a href="https://emailanalytics.com/bulk-email-services/"&gt;on sites like "Email Analytics"&lt;/a&gt; along with the delightful other articles that these types of sites feature, like how to track your employees and customers via email and other surveillance software.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A list of related articles showing topics like how to track your employees, how to install surveillance software for your workers and how to optimize email metrics" src="/images/2023/03/email_analytics_articles.png" title="Email Analytics Related Articles..."&gt;&lt;/p&gt;
&lt;p&gt;In fact, the deeper I dove into trying to find a privacy-first bulk email service, with some help from networking friends, the more I realized there weren't going to be many without tracking. Investigating &lt;a href="https://www.mailgun.com/"&gt;mailgun&lt;/a&gt;, &lt;a href="https://docs.buttondown.email/behind-the-scenes/running-costs"&gt;which Buttondown uses&lt;/a&gt;, dropped me into their Privacy Policies and Terms of Service, which uncovered data storage and retention periods that I did not expect. For example, below is an excerpt from &lt;a href="https://documentation.mailgun.com/en/latest/user_manual.html#events-1"&gt;their documentation&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Mailgun keeps track of every event that happens to every message (both inbound and outbound) and stores this data for at least 30 days for paid accounts and 2 days for free accounts.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Note that all of this would technically be stored centrally in the Buttondown admin account, meaning I couldn't verify access or retention in any way. Even if I chose to build my own newsletter to integrate directly with mailgun, there was no way to turn this off.&lt;/p&gt;
&lt;p&gt;At the time, they also had fairly expansive privacy policies that documented the data shared with their subprocessors (services and companies they use to process person-related data). They have since made their privacy policy more legible, but you can still see the wide array of data collected, processed and likely stored in the provider's account in &lt;a href="https://www.mailgun.com/legal/dpa/"&gt;their current DPA&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The types of Personal Data to be processed: The personal data submitted, the extent of which is determined and controlled by the Controller in its sole discretion, includes name, email, telephone numbers, IP address and other personal data included in the contact lists and message content.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Much of this is likely collected to fulfill customer demands (i.e. customers &lt;em&gt;want&lt;/em&gt; tracking) or as a way to combat spammers. But there was no way to turn it off, and there also wasn't a way to use the service for a given period of time or trial period, prove I am not a spammer and turn it off later. As a machine learning engineer and data scientist, I could think of a million other ways to detect spam activity than storing this history, but that's beside the point.&lt;/p&gt;
&lt;p&gt;I was truly in uncharted territory, so I started asking networking friends how I could find a service that didn't actively track opens, reads and unanswered bounces. That's when I learned about SMTP relays...&lt;/p&gt;
&lt;h4 id="what-is-smtp-relay"&gt;What is SMTP Relay?&lt;/h4&gt;
&lt;p&gt;An SMTP relay is a way to hand off SMTP requests between SMTP servers. This happens when the sender and receiver aren't in the same email domain. Much like your internet requests are handed off across the internet, an SMTP relay service hands off incoming and outgoing mail until it reaches the appropriate recipient mail server. You can read more about SMTP relay from &lt;a href="https://www.ionos.com/digitalguide/e-mail/technical-matters/smtp-relay/"&gt;Ionos's explanatory article&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;What I needed was a privacy-first SMTP relay that allowed me to turn off tracking for the email as it was forwarded out to recipients. I put a cry out for help on Twitter (reminder: this was early 2022, pre-Emerald Emperor).&lt;/p&gt;
&lt;p&gt;&lt;img alt="Tweet from @kjam asking if anyone knows a privacy-first SMTP relay service or if anyone could ask around and recommend one. Context: I'm trying to find one for Probably Private (with a link to the old buttondown newsletter.)" src="/images/2023/03/tweet_smtp_relay.png" title=""&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://twitter.com/kjam/status/1500813062970003466"&gt;original tweet&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;My request was:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Data hosted in Europe&lt;/li&gt;
&lt;li&gt;Data minimization built in (i.e. I can hold the emails, the service just does the send)&lt;/li&gt;
&lt;li&gt;Reasonable prices for a small number of emails per month (&amp;lt;$30/mo if possible)&lt;/li&gt;
&lt;li&gt;Clear privacy policy and data processing agreement&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Unfortunately, no one could point me in a reasonable direction. I was getting desperate, and it seemed like I might need to host my own email somehow...&lt;/p&gt;
&lt;h3 id="just-host-your-own-email-said-no-one-who-has-done-it"&gt;Just Host Your Own Email (said no one who has done it)&lt;/h3&gt;
&lt;p&gt;Of course there are many guides and articles on how to set up your own SMTP server. What none of these guides will tell you is what a pain maintenance is, and how you basically will be immediately marked as Spam until you prove yourself otherwise.&lt;/p&gt;
&lt;p&gt;Since the last time I set up a mail server (circa 2010), email has changed a bit. The deeper I dove into hosting my own server, the more it seemed impossible to manage due to the way that reputation management is performed. Due to the rising sophistication of phishing and spam, email over those past 10 years became a true battleground between spammers and mail providers. The victims of these battles are people or companies who would like to run their own email but who aren't going to send a lot of mail.&lt;/p&gt;
&lt;p&gt;When you set up a mail service and start sending mail, your reputation will be tracked on mail services in relation to your domain and your servers' IP addresses. When you first start or are unknown, this is very difficult, because you are assumed to be Spam. You have to then spend time and energy to increase your reputation -- all that on top of also having to maintain the mail server and whatever it was you were trying to do with it in the first place -- run a business, write a newsletter, etc.&lt;/p&gt;
&lt;p&gt;It wasn't intentional, but this basically pushed out a lot of self-hosted email providers or hobby projects. In another way, it's a bit terrifying for privacy, since most email is sent unencrypted and who knows what types of machine learning or other data "insights" are being run on/by most email providers (since now everyone has one or more email providers)... but I digress.  &lt;/p&gt;
&lt;p&gt;Self-hosting seemed like a lot of work and I also didn't really want to have one more thing to manage, I just wanted to run a privacy-first newsletter.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Please send help cartoon GIF showing a woman writing a send help message, giving it to a bird, and then the bird flies away and is eaten by a larger bird." src="/images/2023/03/send-help.gif" title=""&gt;&lt;/p&gt;
&lt;h3 id="back-to-privacy-first-email-providers"&gt;Back to Privacy-First Email Providers&lt;/h3&gt;
&lt;p&gt;I ended up routing completely back to email providers, namely because my needs right now as a relatively small newsletter don't actually exceed normal email sending rates. This sets a newsletter growth constraint, but one I was willing to accept for now in order to provide more privacy for my readers.&lt;/p&gt;
&lt;p&gt;I took a look at &lt;a href="https://proton.me/"&gt;Proton Mail&lt;/a&gt;, who has a great track record with regard to privacy, but they actually don't provide programmatic interfaces very easily, and certainly not for sending many emails at once.&lt;/p&gt;
&lt;p&gt;Finally, I found &lt;a href="https://runbox.com/"&gt;Runbox&lt;/a&gt;, a privacy- and security-first email provider based in Norway. Added bonus, they also prioritize green computing! It gave me warm fuzzies and I immediately signed up for a trial account. I tested using the API programmatically and didn't run into any problems, so I bought an enterprise account and migrated over probablyprivate.com.&lt;/p&gt;
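&lt;p&gt;In case it helps anyone on a similar hunt: as you'll see below, my newsletter code ended up being Django-based, and the nice thing about a plain, privacy-respecting SMTP provider is that Django's built-in SMTP email backend can talk to it with just a handful of settings. This is a minimal sketch with placeholder values, not Runbox's actual configuration -- check your provider's documentation for the real host and port.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# settings.py -- a minimal sketch; host, user and password are placeholders.
EMAIL_BACKEND = "django.core.mail.backends.smtp.EmailBackend"
EMAIL_HOST = "mail.example-provider.com"  # your provider's SMTP server
EMAIL_PORT = 587
EMAIL_USE_TLS = True  # encrypt the connection
EMAIL_HOST_USER = "newsletter@example.com"
EMAIL_HOST_PASSWORD = "load-this-from-an-environment-variable"
DEFAULT_FROM_EMAIL = "newsletter@example.com"
&lt;/code&gt;&lt;/pre&gt;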
&lt;p&gt;My troubles were over... or were they?&lt;/p&gt;
&lt;h2 id="populating-a-bare-bones-newsletter"&gt;Populating a Bare-Bones Newsletter&lt;/h2&gt;
&lt;p&gt;Now that I was ready to build the actual newsletter, it meant starting over. Since I wasn't able to find a newsletter provider with a privacy-first SMTP relay, it meant finding my own way to programmatically send emails. At first, I had set up my newsletter to use &lt;a href="https://ghost.org/docs/"&gt;Ghost.js&lt;/a&gt;, which I love as an editor, but it uses Node and the self-hosted and open-source version only allows &lt;a href="https://ghost.org/docs/faq/mailgun-newsletters/"&gt;integration with mailgun&lt;/a&gt;, which meant it wasn't something I was going to easily change, fork or fix.&lt;/p&gt;
&lt;p&gt;I went in search of a python-based self-hosted newsletter.&lt;/p&gt;
&lt;h3 id="django-newsletter"&gt;django-newsletter&lt;/h3&gt;
&lt;p&gt;I found &lt;a href="https://github.com/jazzband/django-newsletter"&gt;django-newsletter&lt;/a&gt;, with many users and fairly good support. As I started to work with it, however, I realized it was going to be a nightmare for the type of newsletter I imagined, namely because the code base was quite complex, it didn't support the latest Django and Python versions at that time. It seemed like administrative overkill for such a small newsletter, administered by just one person.&lt;/p&gt;
&lt;p&gt;I would, however, recommend django-newsletter for folks that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;have multiple newsletters and want to manage them in one interface&lt;/li&gt;
&lt;li&gt;have multiple authors/editors who need to work on posts together&lt;/li&gt;
&lt;li&gt;are already using Django as a web framework and might be able to contribute back upgrades and updates&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Being that it didn't fit for me, I...&lt;/p&gt;
&lt;h3 id="gave-up-and-wrote-my-own"&gt;Gave up and wrote my own&lt;/h3&gt;
&lt;p&gt;Yes, I know. It's definitely "the hard way", but I wrote my own Django models and administrative interfaces, along with the ability to manage posts myself. It took me about a day or two, as I already have experience using Django and also sending emails programmatically.&lt;/p&gt;
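&lt;p&gt;For the curious, the core of a hand-rolled setup like this really is small. Below is a simplified sketch -- the model and field names are illustrative, not the exact code running my newsletter -- with a subscriber list, an issue and a send loop that uses Django's send_mail (and SMTP settings like the ones shown earlier).&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# models.py -- a simplified sketch of a one-person newsletter backend.
from django.core.mail import send_mail
from django.db import models

class Subscriber(models.Model):
    email = models.EmailField(unique=True)
    confirmed = models.BooleanField(default=False)  # double opt-in flag

class Issue(models.Model):
    subject = models.CharField(max_length=200)
    body_text = models.TextField()             # plain-text version
    body_html = models.TextField(blank=True)   # rendered HTML version
    sent_at = models.DateTimeField(null=True, blank=True)

    def send(self):
        """Send this issue to every confirmed subscriber, one email each."""
        for subscriber in Subscriber.objects.filter(confirmed=True):
            send_mail(
                subject=self.subject,
                message=self.body_text,
                from_email=None,  # falls back to DEFAULT_FROM_EMAIL
                recipient_list=[subscriber.email],
                html_message=self.body_html or None,
            )
&lt;/code&gt;&lt;/pre&gt;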
&lt;p&gt;I don't really recommend this route, as it's probably overkill if you aren't already someone familiar with the basics of backend and/or frontend web development. It's also a pain to maintain. Should you actually know of an easy-to-use open-source newsletter manager that is not Node.js based (yes, I have issues), please reach out!&lt;/p&gt;
&lt;p&gt;To build out the site, I got pretty far myself with a Canva-based design that I edited (here's the &lt;a href="https://www.canva.com/p/templates/EAEsUZ4NNzs-pastel-pink-retro-and-geometry-shape-resume/"&gt;original&lt;/a&gt;), plus some help with CSS via Upwork. But I had reached my limit of being able to make the templates and front-end consistent.&lt;/p&gt;
&lt;h3 id="email-templating-minimal-loading-calls"&gt;Email templating &amp;amp; minimal loading calls&lt;/h3&gt;
&lt;p&gt;If you didn't already know, email templating is a complex problem. There are so many different email clients, screen sizes and ways to display email that one can easily get lost. This is why professionals often use a framework like &lt;a href="https://mjml.io/"&gt;MJML (from Mailjet)&lt;/a&gt; to ensure that the email works on as many readers as possible with some semblance of consistency.&lt;/p&gt;
&lt;p&gt;I also had a requirement that I wanted the site to not load so many things. Not only does this hurt performance, it also leaks (more) information in the browsing and networking calls. I wanted to minimize calls per page load, which meant having a super lean CSS that could, at times, even be loaded on the page itself. To make this work was beyond my front-end depth, so I called a dear friend and colleague.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://zanderle.com/"&gt;Žan has helped me on several engagements&lt;/a&gt; where I needed someone who does front-end work. Along with being a delightful human, he also knows back-end web development, making it easy to pass over my half-finished pile and scream help. (Thanks Žan!) He also figured out how to convert my poor HTML files into an actual email template using MJML and is, I think, only minorly scarred from the experience. ;)&lt;/p&gt;
&lt;h2 id="announcing-the-new-probably-private"&gt;Announcing the new Probably Private!&lt;/h2&gt;
&lt;p&gt;In the end, I now have a fairly no-frills but definitely privacy-first newsletter! It was a journey I did not expect, but I learned even more about privacy and the internet, which I'll be writing more about in an upcoming series diving into the technical details of internet privacy!&lt;/p&gt;
&lt;p&gt;If this post was interesting to you, or if you want to learn more about privacy in data science, &lt;a href="https://probablyprivate.com/subscribe/"&gt;please subscribe&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;I'm also always welcome to feedback (via mail: info /at probablyprivate / dot / com). &lt;/p&gt;</content><category term="internet"></category></entry><entry><title>Joining Dropout Labs!</title><link href="https://blog.kjamistan.com/joining-dropout-labs.html" rel="alternate"></link><published>2019-11-23T00:00:00+01:00</published><updated>2019-11-23T00:00:00+01:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2019-11-23:/joining-dropout-labs.html</id><summary type="html">&lt;p&gt;After months of searching, lots of fun (and some less fun) interviews and hours of self-reflection, I am excited to announce I am the new Head of Product at &lt;a href="https://dropoutlabs.com/"&gt;Dropout Labs&lt;/a&gt;! 🎉&lt;/p&gt;
&lt;p&gt;The interview and decision process was quite iterative and disruptive! I am somewhat to blame for this as I …&lt;/p&gt;</summary><content type="html">&lt;p&gt;After months of searching, lots of fun (and some less fun) interviews and hours of self-reflection, I am excited to announce I am the new Head of Product at &lt;a href="https://dropoutlabs.com/"&gt;Dropout Labs&lt;/a&gt;! 🎉&lt;/p&gt;
&lt;p&gt;The interview and decision process was quite iterative and disruptive! I am somewhat to blame for this as I chose to interview with more than 35 companies 😅 The decision process itself involved many pivots, but at &lt;a href="https://en.wikipedia.org/wiki/Foo_Camp"&gt;FooCamp&lt;/a&gt;, via several soul-searching conversations, I came to the conclusion that I couldn't walk away from my passion for changing machine learning for good by continuing to advocate for privacy &lt;em&gt;and&lt;/em&gt; security in machine learning.&lt;/p&gt;
&lt;p&gt;After coming to this conclusion, the choice was obvious. The team at Dropout Labs was deeply knowledgeable and passionate about this goal and truly believes in a future where encrypted machine learning is not only possible, it's the norm.&lt;/p&gt;
&lt;h5 id="what-is-dropout-labs-even"&gt;What is Dropout Labs even?&lt;/h5&gt;
&lt;p&gt;An amazing all-remote team built by successful entrepreneurs working on privacy-preserving machine learning at the intersection of deep learning and cryptography! Brag time: they built the open-source &lt;a href="https://github.com/tf-encrypted/tf-encrypted"&gt;TF Encrypted&lt;/a&gt; and &lt;a href="https://github.com/tf-encrypted/"&gt;several other important Tensorflow Libraries&lt;/a&gt; helping make secure and privacy-aware deep learning a reality.&lt;/p&gt;
&lt;p&gt;In addition to being able to stay in Berlin, the team impressed me with their knowledge and enthusiasm for privacy, machine learning and security. An all-remote culture is something I've always wanted to experience and is providing me with new learnings daily -- am I communicating the right amount? How can I ask better questions? How can I clearly share a specific insight with the team?&lt;/p&gt;
&lt;p&gt;If you know me or my work, you also know I wouldn't join a team that didn't have a mission or vision that aligned with my deeply held beliefs. Ours is clear: create a new reality for machine learning -- one where different actors (data owners, data scientists, security and privacy folks, end users) can collaborate to define trust in their relationships and confidently build better models in a privacy-aware and secure manner.&lt;/p&gt;
&lt;h5 id="what-is-this-product-that-you-are-building-as-head-of-product"&gt;What is this product that you are building as Head of Product?&lt;/h5&gt;
&lt;p&gt;This is what I &lt;em&gt;can&lt;/em&gt; say so far: we're exploring the intersection of machine learning pipelines, data privacy policy and encryption. We want to meet the problem where it will have the most impact: sensitive data in production systems that would benefit a machine learning or data science team if they had access.&lt;/p&gt;
&lt;p&gt;As we are using iterative design and development, you'll get a sneak peek long before our initial launch if you follow me on &lt;a href="https://www.linkedin.com/in/katharinejarmul"&gt;LinkedIn&lt;/a&gt; or &lt;a href="https://twitter.com/kjam"&gt;Twitter&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;BTW, you should also follow &lt;a href="https://twitter.com/dropoutlabsai"&gt;Dropout Labs&lt;/a&gt; and check out their posts on &lt;a href="https://medium.com/dropoutlabs"&gt;Medium&lt;/a&gt; to learn more!&lt;/p&gt;
&lt;h5 id="can-i-learn-even-more"&gt;Can I learn &lt;em&gt;even&lt;/em&gt; more?&lt;/h5&gt;
&lt;p&gt;Yes, of course! I'd love to chat about what we are building and get feedback! Specifically, if you are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;a data protection officer or policy/governance lead&lt;/li&gt;
&lt;li&gt;a data scientist or machine learning engineer&lt;/li&gt;
&lt;li&gt;a data or machine learning pipeline engineer&lt;/li&gt;
&lt;li&gt;a security or SecOps team member&lt;/li&gt;
&lt;li&gt;an executive at a company handling sensitive data&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I want to talk to you! As part of product development, we will be implementing a lot of fun prototypes and asking lots of questions -- which means I'd love to hear your needs and prioritize them. I can promise to listen and learn from you. If you are in Berlin (either for our call or after), I can treat you to lunch or a beverage of your choice as a Dankeschön!&lt;/p&gt;
&lt;p&gt;Please feel free to &lt;a href="https://twitter.com/kjam"&gt;DM me on Twitter&lt;/a&gt;, &lt;a href="https://www.linkedin.com/in/katharinejarmul"&gt;connect on LinkedIn&lt;/a&gt; or &lt;a href="mailto:katharine@dropoutlabs.com"&gt;drop me a line via email&lt;/a&gt;. I look forward to sharing more of this journey with you and learning along the way. 🤗&lt;/p&gt;</content><category term="misc"></category></entry><entry><title>Let's Get Together: More Details on Me, You and My Dream Gig</title><link href="https://blog.kjamistan.com/lets-get-together-more-details-on-me-you-and-my-dream-gig.html" rel="alternate"></link><published>2019-06-06T00:00:00+02:00</published><updated>2019-06-06T00:00:00+02:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2019-06-06:/lets-get-together-more-details-on-me-you-and-my-dream-gig.html</id><summary type="html">&lt;p&gt;Hello!&lt;/p&gt;
&lt;p&gt;We may not know each other, but here you are on my website -- perhaps because you saw a post or someone shared a link. I'm resourceful, determined, intelligent and looking for new challenges. Welcome!&lt;/p&gt;
&lt;p&gt;Here's more about me, in case it is news to you:&lt;/p&gt;
&lt;h4 id="about-me"&gt;[About Me]&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;Co-founder of …&lt;/li&gt;&lt;/ul&gt;</summary><content type="html">&lt;p&gt;Hello!&lt;/p&gt;
&lt;p&gt;We may not know each other, but here you are on my website -- perhaps because you saw a post or someone shared a link. I'm resourceful, determined, intelligent and looking for new challenges. Welcome!&lt;/p&gt;
&lt;p&gt;Here's more about me, in case it is news to you:&lt;/p&gt;
&lt;h4 id="about-me"&gt;[About Me]&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;Co-founder of KIProtect, a startup with a mission to make privacy easier. Our main technology was a set of new encryption methods allowing you to do more secure and privacy-aware machine learning. I led our business, sales, product and marketing efforts.&lt;/li&gt;
&lt;li&gt;More than 10 years in the technology industry, with broad engineering and product experience in data engineering, machine learning and data science, software design and development, large-scale AWS, Rackspace and Google Cloud deployment and automation. Deep understanding of data privacy and information security best practices for compliance with GDPR, HIPAA and new privacy regulations in Brazil and California.&lt;/li&gt;
&lt;li&gt;Extremely interested in making machine learning more fair, just, accountable, secure and privacy-aware.&lt;/li&gt;
&lt;li&gt;Regular speaker and keynoter at international conferences such as CCC, Strangeloop, QCon, ACM, PyData, PyCon, EuroPython. Due to my strong engineering and organizing background, I have covered topics like data privacy, machine learning security and AI ethics and continue to be invited to speak on these topics.&lt;/li&gt;
&lt;li&gt;Adjunct professor at the University of Florida and teacher for several online platforms (O'Reilly Safari, DataCamp) and offline ones (Frauenloop, PyLadies).&lt;/li&gt;
&lt;li&gt;Interested in sharpening my security engineering chops. Implemented basic security automation and monitoring, possess an in-depth understanding of machine learning security. Now anticipate gaining further expertise in the areas of pen-testing, network and container security, and exploit / vulnerability discovery.&lt;/li&gt;
&lt;li&gt;Years of experience in business and product side, thus a capable and resourceful intermediary for tech and business/product teams (i.e. I am product-tech bilingual).&lt;/li&gt;
&lt;li&gt;Excel at rapid grasp of new technologies, and asking difficult questions to surface critical issues -- driving teams to research, learn, debate and decisively resolve issues as they arise.&lt;/li&gt;
&lt;li&gt;Fluent in Python and GoLang and have experience with C++ and Java.&lt;/li&gt;
&lt;li&gt;Founder of PyLadies, mentor and ally for several women of color and immigrant women in tech initiatives, conference diversity scholarship organizer, persistent advocate for the “underrepresented” in tech and challenger of privilege in our industry.&lt;/li&gt;
&lt;li&gt;Background in investigative journalism, love public speaking, meeting new people and working with teams on cool s**t.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="about-you"&gt;[About You]&lt;/h4&gt;
&lt;p&gt;Here's a few things I'm hoping you can tell me:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;What is your team like? Is it diverse (gender, race, immigration status, age)? If not, why not?&lt;/li&gt;
&lt;li&gt;What relevant problems do you solve? Why is your work / product exciting?&lt;/li&gt;
&lt;li&gt;Do you let folks learn on the job? Is this supported with mentoring / pairing / reviews, etc?&lt;/li&gt;
&lt;li&gt;Are you friendly to remote workers or based in Berlin? If not, where are you and do you offer relocation?&lt;/li&gt;
&lt;li&gt;Are you flexible on start date or do you have a shorter engagement (like a small project or consulting) in mind?&lt;/li&gt;
&lt;li&gt;Did you read the above description about me and determine I'm a good fit based on our mutual interests or are you just here to add another email to your recruitment database? 😘&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="dream-role"&gt;[Dream Role]&lt;/h4&gt;
&lt;p&gt;To be fully transparent, I'm not precisely sure what I want to tackle next. There are multiple possible good fits; here are a few examples of positions where I could add significant value and passion:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Technical product owner focused on defining customer needs, developing the product roadmap and assuring feasibility&lt;/li&gt;
&lt;li&gt;Researcher at an AI institute or policy group -- focused on ethical, privacy and security concerns&lt;/li&gt;
&lt;li&gt;Technical partner / consultant at a VC firm focused on emerging technologies at the intersection of AI and security&lt;/li&gt;
&lt;li&gt;Machine Learning or Data Science Director at a non-profit or activist organization focused on supporting community-based initiatives and fighting injustice with (and in) data&lt;/li&gt;
&lt;li&gt;Machine Learning expert at a security consultancy or company who wants to either use data science to help solve security problems or explore ways machine learning can be exploited&lt;/li&gt;
&lt;li&gt;Security-focused data engineer or data engineering manager at a company managing large amounts of sensitive data&lt;/li&gt;
&lt;li&gt;Senior management or C-Suite at a startup focused on privacy, security and/or ethical AI&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I'm currently open to a variety of positions / roles and time allocations (i.e. freelance, consultant, part-time or full-time, etc). I'd love to hear your responses to the questions above -- feel free to drop me a line. I'm katharine at the top-level domain you are currently on. Spelling matters (i.e. kath-A-rine). Thanks for dropping by. 🤗&lt;/p&gt;</content><category term="misc"></category></entry><entry><title>Adversarial Learning for Good: My Talk at #34c3 on Deep Learning Blindspots</title><link href="https://blog.kjamistan.com/adversarial-learning-for-good-my-talk-at-34c3-on-deep-learning-blindspots.html" rel="alternate"></link><published>2017-12-28T00:00:00+01:00</published><updated>2017-12-28T00:00:00+01:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2017-12-28:/adversarial-learning-for-good-my-talk-at-34c3-on-deep-learning-blindspots.html</id><summary type="html">&lt;p&gt;When I first was introduced to the idea of adversarial learning for security purposes by &lt;a href="https://www.youtube.com/watch?v=JAGDpJFFM2A"&gt;Clarence Chio's 2016 DEF CON talk&lt;/a&gt; and his related &lt;a href="https://github.com/cchio/deep-pwning"&gt;open-source library deep-pwning&lt;/a&gt;, I immediately started wondering about applications of the field to both make robust and well-tested models, but also as a preventative measure against …&lt;/p&gt;</summary><content type="html">&lt;p&gt;When I first was introduced to the idea of adversarial learning for security purposes by &lt;a href="https://www.youtube.com/watch?v=JAGDpJFFM2A"&gt;Clarence Chio's 2016 DEF CON talk&lt;/a&gt; and his related &lt;a href="https://github.com/cchio/deep-pwning"&gt;open-source library deep-pwning&lt;/a&gt;, I immediately started wondering about applications of the field to both make robust and well-tested models, but also as a preventative measure against predatory machine learning practices in the field.&lt;/p&gt;
&lt;p&gt;After reading more literature and utilizing several other open-source libraries, I realized most examples and research focused around malicious uses, such as sending spam or malware without detection, or crashing self-driving cars. Although I find this research interesting, I wanted to determine if adversarial learning could be used for "good".&lt;sup id="fnref:1"&gt;&lt;a class="footnote-ref" href="#fn:1"&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;h3 id="a-brief-primer-on-adversarial-learning-basics"&gt;A brief primer on Adversarial Learning Basics&lt;/h3&gt;
&lt;p&gt;In case you haven't been following the explosion of adversarial learning in neural network research, papers and conferences, let's take a whirlwind tour of some concepts to get on the same page and provide further reading if you open up arXiv for fun on the weekend.&lt;/p&gt;
&lt;h4 id="how-does-it-work-what-does-it-do"&gt;How Does It Work? What Does It Do?&lt;/h4&gt;
&lt;p&gt;Most neural networks optimize their weights and other variables via backpropagation, using an optimization algorithm such as Stochastic Gradient Descent (or SGD) to minimize a loss function. Just as we use the gradients of that loss to train our network, researchers found we can use the same gradients to find weak links in our network and craft adversarial examples that exploit them.&lt;/p&gt;
&lt;p&gt;To get an intuition of what is happening when we apply adversarial learning, let's look at a graphic which can help us visualize both the learning and adversarial generation.&lt;/p&gt;
&lt;p&gt;&lt;img alt="gradient descent graphic" src="http://blog.datumbox.com/wp-content/uploads/2013/10/gradient-descent.png"&gt; &lt;a href="http://blog.datumbox.com/tuning-the-learning-rate-in-gradient-descent/"&gt;Source Image&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Here we see a visual example of SGD, where we start our weights randomly or perhaps with a specific distribution. At the beginning our weight produces high error rates, putting it in the red area, but we'd like to end up at the global minimum in the dark blue area. As the graphic shows, however, we may only end up in the local minimum with a slightly higher error rate on the right-hand side.&lt;/p&gt;
&lt;p&gt;With adversarial sample generation, we are essentially trying to push that point back up the hill. We can't change the weights, of course, but we can change the input. If we can get this unit to misfire, and a few other units to do the same, the network can end up misclassifying the input entirely. This is our goal when doing adversarial learning, and we can achieve it with a series of algorithms proven to create specific perturbations of the input that fool the network. As you may notice, we also need a trained model to apply these algorithms to and to test our success rate against.&lt;/p&gt;
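&lt;p&gt;To make that intuition concrete, here is a minimal sketch of the FGSM idea using present-day TensorFlow (the &lt;code&gt;model&lt;/code&gt;, inputs, labels and epsilon are placeholders I chose for illustration, not any particular attack library's API): take the gradient of the loss with respect to the input instead of the weights, then nudge every pixel a small step in the direction that increases the loss.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import tensorflow as tf

def fgsm_perturb(model, x, y_true, epsilon=0.1):
    """Return x plus an FGSM-style perturbation of size epsilon (a sketch)."""
    x = tf.convert_to_tensor(x, dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(x)  # track gradients with respect to the input, not the weights
        predictions = model(x, training=False)
        loss = tf.keras.losses.sparse_categorical_crossentropy(y_true, predictions)
    gradient = tape.gradient(loss, x)  # how the loss changes as each pixel changes
    x_adv = x + epsilon * tf.sign(gradient)  # step in the direction that raises the loss
    return tf.clip_by_value(x_adv, 0.0, 1.0)  # keep pixel values in a valid range
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Larger epsilon values make the change more visible to humans but more reliably fool the model, which is the trade-off in the face recognition experiment further down.&lt;/p&gt;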
&lt;h4 id="historical-tour-of-papers-moments-in-adversarial-ml"&gt;Historical Tour of Papers / Moments in Adversarial ML&lt;/h4&gt;
&lt;p&gt;The first prominent paper on adversarial examples came in the form of a technique to &lt;a href="https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/kdd05lowd.pdf"&gt;modify spam mail to be classified as real mail, published by a group of researchers in 2005&lt;/a&gt;. The authors used a technique of addressing important features and changing them using Bayesian and linear classifiers.&lt;/p&gt;
&lt;p&gt;In 2007, NIPS had their first workshop on &lt;a href="https://web.archive.org/web/20110822071402/http://mls-nips07.first.fraunhofer.de/"&gt;Machine Learning in Adversarial Environments for Computer Security&lt;/a&gt; which covered many techniques related primarily to linear classification but also other topics of interest in security such as network intrusion and bot detection.&lt;/p&gt;
&lt;p&gt;In 2013, following other interesting research on the topic, Battista Biggio and several other researchers released a paper on &lt;a href="https://arxiv.org/abs/1206.6389"&gt;Support Vector Machine (or SVM) poisoning attacks&lt;/a&gt;. The researchers were able to show they could alter specific training data and essentially render the model useless against targeted attacks (or at least hampered by the poor training). I highly recommend Biggio's later paper on &lt;a href="https://arxiv.org/abs/1709.00609"&gt;pattern-based classifiers under attack&lt;/a&gt;, and he has many other publications on techniques to attack ML models and to defend against those attacks.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Example poisoning attack" src="https://blog.kjamistan.com/images/2017/12/poisoning_face_recognition_biggio.png"&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://arxiv.org/abs/1709.00609"&gt;Photo: Example poisoning attack on a biometric dataset&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;In 2014, Christian Szegedy, Ian Goodfellow and several other Google researchers released their paper &lt;a href="https://arxiv.org/abs/1312.6199"&gt;Intriguing Properties of Neural Networks&lt;/a&gt; which outlined techniques to calculate carefully crafted perturbations of an image allowing an adversary to fool a neural network into misclassifying the image. Ian Goodfellow later released &lt;a href="https://arxiv.org/abs/1412.6572"&gt;a paper outlining an adversarial technique called the Fast Gradient Sign Method or FGSM&lt;/a&gt;, one of the widely used and implemented forms of attacks on neural network classifiers.&lt;/p&gt;
&lt;p&gt;In 2016, Nicolas Papernot and several other researchers released &lt;a href="https://arxiv.org/abs/1511.07528"&gt;a new technique which utilized a saliency map built from the Jacobian matrix of the model's outputs with respect to the input vector&lt;/a&gt;. He and Ian Goodfellow later released &lt;a href="https://github.com/tensorflow/cleverhans"&gt;a Python open-source library called cleverhans&lt;/a&gt; which implements FGSM and the Jacobian Saliency Map Attack (or JSMA).&lt;/p&gt;
&lt;p&gt;There have been many other papers and talks related to this topic since 2014, too many to cover here, but I recommend perusing some of the recent papers from the field and investigating areas of interest for yourself.&lt;/p&gt;
&lt;h3 id="malicious-attacks"&gt;Malicious Attacks&lt;/h3&gt;
&lt;p&gt;As mentioned previously, malicious attacks have been studied at length. Here are a few notable studies:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Spam: &lt;a href="https://pdfs.semanticscholar.org/3212/929ad5121464ac49741dd3462a5d469e668d.pdf"&gt;Adversarial Machine Learning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Malware recognition: &lt;a href="http://www.patrickmcdaniel.org/pubs/esorics17.pdf"&gt;Adversarial Examples for Malware Detection&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Malware generation: &lt;a href="https://arxiv.org/abs/1702.05983"&gt;Generating Adversarial Malware Examples for Black-Box Attacks Based on GAN
&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Poisoning of biometric data: &lt;a href="https://pdfs.semanticscholar.org/bafb/d93468634b5b43e3b29b3d86efae41559e8b.pdf"&gt;Adversarial Biometric Recognition&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Attacks on self-driving cars: &lt;a href="https://arxiv.org/pdf/1707.08945"&gt;Robust Physical-World Attacks
on Deep Learning Models&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;There are plenty more, but these give you an idea of what has been studied in the space. Of course, alongside many of these studies the authors also studied counter-attacks. Security is ever a cat-and-mouse game, so learning how to defend against these types of attacks, particularly via adversary detection or adversarial training, is a research space in its own right.&lt;/p&gt;
&lt;h4 id="real-life-adversarial-examples"&gt;Real-life Adversarial Examples&lt;/h4&gt;
&lt;p&gt;It has been debated whether adversarial learning will ever work for real-life objects or is only useful when the input is static, such as an image or a file. In a recent paper, a group of researchers at MIT were able to &lt;a href="https://arxiv.org/abs/1707.07397"&gt;print 3D objects which fooled a video-based Inception network into "thinking", for example, a turtle was a rifle&lt;/a&gt;. Their method applied techniques similar to FGSM across a space of possible alterations to the texture of the object itself.&lt;/p&gt;
&lt;iframe width="560" height="315" src="https://www.youtube.com/embed/piYnd_wYlT8" frameborder="0" gesture="media" allow="encrypted-media" allowfullscreen&gt;&lt;/iframe&gt;

&lt;h3 id="how-can-i-build-my-own-adversarial-samples"&gt;How can I build my own adversarial samples?&lt;/h3&gt;
&lt;p&gt;Hopefully you are now interested in building some of your own adversarial samples. Maybe you are a machine learning practitioner looking to better defend your network, or perhaps you are just intrigued by the topic. Please do not use these techniques to send spam or malware! Really though... don't.&lt;/p&gt;
&lt;p&gt;Okay, ethical use covered, let's check out the basic steps you'll need to go through when building adversarial samples:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Pick a problem / network type&lt;ul&gt;
&lt;li&gt;Figure out a target or idea. Do some research on what is used "in production" on those types of tasks.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Research “state of the art” or publicly available pretrained models or build your own&lt;ul&gt;
&lt;li&gt;Read research papers in the space, watch talks from target company. Determine if you will build your own or use a pretrained model.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;(optional) Fine tune your model&lt;ul&gt;
&lt;li&gt;If using a pretrained model, take time to fine-tune it by retraining the last few layers.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Use a library: cleverhans, FoolBox, DeepFool, deep-pwning&lt;ul&gt;
&lt;li&gt;Utilize one of many adversarial learning open-source tools to generate adversarial input.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Test your adversarial samples on another (or your target) network&lt;ul&gt;
&lt;li&gt;Not all problems and models are as easy to fool. Test your best images on your local network and possibly one that hasn't seen the same training data. Then take the highest confidence fooling input and pass it to the target network (a small evaluation sketch follows this list).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Want to get started right away? Here are some neat tools and libraries available in the open-source world for generating different adversarial examples.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/tensorflow/cleverhans"&gt;cleverhans&lt;/a&gt;: Implementations of FGSM and JSMA in Tensorflow and Keras&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/cchio/deep-pwning"&gt;deep-pwning&lt;/a&gt;: Generative drivers with examples for Semantic CNN, MNIST and CIFAR-10&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/bethgelab/foolbox"&gt;FooxBox&lt;/a&gt;: Implementations of many algorithms with support for Tensorflow, Torch, Keras and MXNet&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/lts4/deepfool"&gt;DeepFool&lt;/a&gt;: Torch-based implementation of the paper DeepFool (less detectable FGSM)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Evolving-AI-Lab/fooling"&gt;Evolving AI Lab: Fooling&lt;/a&gt;: Evolutionary network for generating images that humans don't recognize but networks do, implemented in Caffe&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/vu-aml/adlib"&gt;Vanderbuilt's adlib&lt;/a&gt;: sci-kit learn based fooling and poisoning algorithms for simple ML models.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;There are many more, but these seemed like a representative sample of what is available. Have a library you think should be included? Ping me or comment!&lt;/p&gt;
&lt;h3 id="benevolent-uses-of-adversarial-samples-a-proposal"&gt;Benevolent Uses of Adversarial Samples (a proposal)&lt;/h3&gt;
&lt;p&gt;I see the potential for numerous benevolent applications of these same techniques. The first idea that came to mind for me was fooling facial recognition used in surveillance technology (or simply for when you want to post a photo and not have it recognize you).&lt;/p&gt;
&lt;h4 id="face-recognition"&gt;Face Recognition&lt;/h4&gt;
&lt;p&gt;To test the idea, I retrained the final layers of the Keras pre-trained Inception V3 model to determine if a photo is a cat or a human. It achieved 99% accuracy in testing.&lt;sup id="fnref:2"&gt;&lt;a class="footnote-ref" href="#fn:2"&gt;2&lt;/a&gt;&lt;/sup&gt; Then, I utilized the cleverhans library to calculate adversaries using FGSM. I tried varying levels of epsilon, uploading each to Facebook. At low levels of perturbations, Facebook immediately recognized my photo as my face and suggested I tag myself. When I reached .21 epsilon, Facebook stopped suggesting a tag (this was around 95% confidence from my network that the photo was of a cat).&lt;/p&gt;
&lt;p&gt;&lt;img alt="me as a cat" src="https://blog.kjamistan.com/images/2017/12/me_as_a_cat.png"&gt;&lt;/p&gt;
&lt;p&gt;[Photo: me as a cat]&lt;/p&gt;
&lt;p&gt;The produced image clearly shows perturbations, but after speaking with computer vision specialist Irina Vidal Migallon&lt;sup id="fnref:3"&gt;&lt;a class="footnote-ref" href="#fn:3"&gt;3&lt;/a&gt;&lt;/sup&gt;, I learned it is possible Facebook is also using Viola-Jones statistics-based face detection or some other statistical solution. If that is the case, it's unlikely we would be able to fool it using a neural network with no humanly visible perturbations. But it does show that we &lt;em&gt;can&lt;/em&gt; use a neural network and adversarial learning techniques to fool face detection.&lt;sup id="fnref:4"&gt;&lt;a class="footnote-ref" href="#fn:4"&gt;4&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
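&lt;p&gt;For anyone who wants to try a similar experiment, here is a rough sketch of the transfer-learning step (written in current tf.keras rather than my original code, with arbitrary layer sizes): keep the pre-trained Inception V3 convolutional layers frozen and retrain only a small new classification head on your cat-vs-human photos.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import tensorflow as tf

def build_cat_vs_human_model(num_classes=2):
    """Reuse ImageNet features from Inception V3 and retrain only the final layers."""
    base = tf.keras.applications.InceptionV3(
        weights="imagenet", include_top=False, input_shape=(299, 299, 3))
    base.trainable = False  # freeze the pre-trained convolutional layers
    x = tf.keras.layers.GlobalAveragePooling2D()(base.output)
    x = tf.keras.layers.Dense(128, activation="relu")(x)
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)
    model = tf.keras.Model(base.input, outputs)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once this head is trained, the same model can be handed to an FGSM-style routine like the one sketched earlier to search for the smallest epsilon that flips its prediction.&lt;/p&gt;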
&lt;h4 id="steganography"&gt;Steganography&lt;/h4&gt;
&lt;p&gt;I had another idea while reading &lt;a href="http://www.evolvingai.org/fooling"&gt;a great paper which covered using adversarial learning alongside evolutionary networks&lt;/a&gt; to generate images which are not recognizable by humans but which a neural network classifies with 99% confidence. My idea is to apply this same image generation as a form of steganography.&lt;/p&gt;
&lt;p&gt;&lt;img alt="MNIST generated images" src="http://www.evolvingai.org/sites/fish34.cs.uwyo.edu.lab/files/whitenoise_lenet_images_5_runs_0.png"&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="http://www.evolvingai.org/fooling"&gt;Photo: Generated Images from MNIST dataset which the model classifies with high confidence as digits&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;In a time when it seems data we used to consider private (messages to friends and family on your phone, emails to your coworkers, etc.) can now be used to either sell you advertising or be inspected by border agents, I liked the idea of using a Generative Adversarial Network (or GAN) to send hidden messages. All the recipient would need is access to the training data and some information about the architecture. Of course, you could also send the model itself if you can secure the channel you use to send it. Then the recipient could use a self-trained or pretrained model to decode your message.&lt;/p&gt;
&lt;h4 id="some-other-benevolent-adversarial-learning-ideas"&gt;Some other benevolent adversarial learning ideas&lt;/h4&gt;
&lt;p&gt;Some other ideas I thought would be interesting to try are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Adware “Fooling”&lt;ul&gt;
&lt;li&gt;Can you trick your adware classifiers into thinking you are a different demographic? Perhaps keeping predatory advertising contained...&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Poisoning Your Private Data&lt;ul&gt;
&lt;li&gt;Using poisoning attacks, can you obscure your data?&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Investigation of Black Box Deployed Models&lt;ul&gt;
&lt;li&gt;By testing adversarial samples, can we learn more about the structure, architecture and use of ML systems of services we use?&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;??? (Your Idea Here)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I am curious to hear others' ideas on the topic, so please reach out if you can think of an ethical and benevolent application of adversarial learning!&lt;/p&gt;
&lt;h3 id="a-call-to-fellow-european-residents"&gt;A Call to Fellow European Residents&lt;/h3&gt;
&lt;p&gt;I chose to speak on the &lt;a href="https://fahrplan.events.ccc.de/congress/2017/Fahrplan/events.html#resilience"&gt;#34c3 Resiliency track&lt;/a&gt; because the goal of the track resonated with me. It asked for new techniques we can use in the not-always-so-great world we live in, so that we can live closer to the life we might want (for ourselves and for others).&lt;/p&gt;
&lt;p&gt;For EU residents, the passage and upcoming implementation of the General Data Protection Regulation (or GDPR) means we will have more rights than most people in the world regarding how corporations use, store and mine our data. I suggest we use these rights actively and with a communal effort towards exposing poor data management and predatory practices.&lt;/p&gt;
&lt;p&gt;In addition, adversarial techniques greatly benefit from more information. Knowing more about the system you are interacting with, and about the possible features or model types used, will give you an advantage when crafting your adversarial examples.&lt;sup id="fnref:5"&gt;&lt;a class="footnote-ref" href="#fn:5"&gt;5&lt;/a&gt;&lt;/sup&gt; In GDPR, there is a section which has often been cited as a "Right to an Explanation." &lt;a href="https://blog.kjamistan.com/gdpr-you-my-talk-at-cloudera-sessions-munchen.html"&gt;Although I have covered that this is much more likely to be enforced as a "Right to be Informed,"&lt;/a&gt; I suggest we EU residents utilize this portion of the regulation to inquire about use of our data and automated decisions via machine learning at companies whose services we use. If you live in Europe and are concerned about how a large company might be mining, using or selling your data, GDPR allows you more rights to determine if this is the case. Let's use GDPR to the fullest and share information gleaned from it with one another.&lt;/p&gt;
&lt;p&gt;A few articles of late about GDPR caught my eye. Mainly &lt;a href="https://www.brentozar.com/archive/2017/12/gdpr-stopped-selling-stuff-europe/"&gt;(my fellow) Americans complaining about implementation hassles and choosing to opt-out&lt;/a&gt;. Despite the ignorant takes, I was heartened by &lt;a href="https://lobste.rs/s/gbty61/gdpr_why_we_stopped_selling_stuff_europe"&gt;several threads from other European residents pointing out the benefits of the regulation&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I would love to see GDPR lead to the growth of privacy-concerned ethical data management companies. I would love to even pay for a service if they promised to not sell my data. I want to live in a world where the "free market" system then allows for ME as a consumer to choose someone to manage my data who has similar ethical views on the use of computers and data.&lt;/p&gt;
&lt;p&gt;If your startup, company or service offers these types of protections, please write me. I am excited to see the growth of this mindset, both in Europe and hopefully worldwide.&lt;/p&gt;
&lt;h3 id="my-talk-slides-video"&gt;My Talk Slides &amp;amp; Video&lt;/h3&gt;
&lt;p&gt;If you are interested in checking out my slides, here they are!&lt;/p&gt;
&lt;iframe src="https://docs.google.com/presentation/d/e/2PACX-1vTOPhtHIQunlU6h6oI2mt5h44oWayL8l7cI6FCNebTfcNKwvbdfMyoRAT6OOHs6rMewizzif7kW4n_u/embed?start=false&amp;loop=false&amp;delayms=3000" frameborder="0" width="960" height="569" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"&gt;&lt;/iframe&gt;

&lt;p&gt;Video:&lt;/p&gt;
&lt;iframe width="560" height="315" src="https://www.youtube.com/embed/BVJT-sE0WWQ" frameborder="0" gesture="media" allow="encrypted-media" allowfullscreen&gt;&lt;/iframe&gt;

&lt;h4 id="slide-references-in-order"&gt;Slide References (in order)&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.theinquirer.net/inquirer/news/3023199/apples-face-id-tech-cant-tell-two-chinese-women-apart"&gt;
Apple's Face ID tech can't tell two Chinese women apart&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/kdd05lowd.pdf"&gt;Adversarial Learning (2005)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://web.archive.org/web/20110822071402/http://mls-nips07.first.fraunhofer.de/"&gt;NIPS Workshop: ML in Adversarial Environments for Computer Security&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/1206.6389"&gt;Poisoning Attacks against Support Vector Machines&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/1312.6199"&gt;Intriguing Properties of Neural Networks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://blog.datumbox.com/tuning-the-learning-rate-in-gradient-descent/"&gt;Stochastic Gradient Descent Image&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/1707.07397"&gt;Synthesizing Robust Adversarial Examples&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/1511.05122"&gt;Adversarial Manipulation of Deep Representations&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/1602.02697"&gt;Practical BlackBox Attacks Against Machine Learning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/1605.07277"&gt;Transferability in Machine Learning: from Phenomena to Black-Box Attacks using Adversarial Samples&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/tensorflow/cleverhans"&gt;cleverhans&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/cchio/deep-pwning"&gt;deep-pwning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/vu-aml/adlib"&gt;Vanderbilt Computational Economics Research Lab adlib&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/lts4/deepfool"&gt;DeepFool&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/bethgelab/foolbox"&gt;FoolBox&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Evolving-AI-Lab/fooling"&gt;Evolving AI: Fooling&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.evolvingai.org/fooling"&gt;Evolving AI: Fooling - Deep neural networks are easily fooled: High confidence predictions for unrecognizable images&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.privacy-regulation.eu/en/r71.htm"&gt;GDPR: Recital 71&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;I am not a big fan of moral labels, so I use this term as it is widely understood. A much longer description of adversarial learning for ethical privacy-concerned motivations seemed like too long of a title and description, but that is my belief and intention. :)&amp;#160;&lt;a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;I think it's not a great implementation due to the fact that I don't work in computer vision and I used a few publicly available datasets with no extra alterations, but it did work for this purpose. If I was doing more than a proof of concept, I would likely spend time adding perturbations to the initial input (cropping, slicing), and find varied datasets.&amp;#160;&lt;a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:3"&gt;
&lt;p&gt;Irina's &lt;a href="http://pyvideo.org/pydata-berlin-2017/deep-learning-for-detection-on-a-phone.html"&gt;awesome PyData Berlin 2017 talk on deep learning for computer vision on a mobile phone&lt;/a&gt; is not to be missed!&amp;#160;&lt;a class="footnote-backref" href="#fnref:3" title="Jump back to footnote 3 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:4"&gt;
&lt;p&gt;Facebook has recently released the ability to &lt;a href="https://www.cnet.com/how-to/how-to-opt-out-facebooks-new-facial-recogition-feature/"&gt;opt-out of suggested facial recognition&lt;/a&gt;. This was, however, more of a proof-of-concept than a "Facebook Fooling" experiment.&amp;#160;&lt;a class="footnote-backref" href="#fnref:4" title="Jump back to footnote 4 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:5"&gt;
&lt;p&gt;However, this is not required. In fact, Nicolas Papernot has a series of great papers covering successful &lt;a href="https://arxiv.org/abs/1602.02697"&gt;black box attacks&lt;/a&gt; which query the model to get training data and then create useful adversarial examples as well as &lt;a href="https://arxiv.org/abs/1605.07277"&gt;transferability&lt;/a&gt; which shows you can use adversarial examples from one type of model to fool a different network or model with varying rates of success.&amp;#160;&lt;a class="footnote-backref" href="#fnref:5" title="Jump back to footnote 5 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="conferences"></category></entry><entry><title>Towards Interpretable Reliable Models</title><link href="https://blog.kjamistan.com/towards-interpretable-reliable-models.html" rel="alternate"></link><published>2017-10-29T00:00:00+02:00</published><updated>2017-10-29T00:00:00+02:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2017-10-29:/towards-interpretable-reliable-models.html</id><summary type="html">&lt;p&gt;I presented a keynote at &lt;a href="https://pydata.org/warsaw2017/"&gt;PyData Warsaw&lt;/a&gt; on moving toward interpretable reliable models. The talk was inspired by some of the work I admire in the field as well as a fear that if we do not address interpretable models as a community, we will be factors in our own …&lt;/p&gt;</summary><content type="html">&lt;p&gt;I presented a keynote at &lt;a href="https://pydata.org/warsaw2017/"&gt;PyData Warsaw&lt;/a&gt; on moving toward interpretable reliable models. The talk was inspired by some of the work I admire in the field as well as a fear that if we do not address interpretable models as a community, we will be factors in our own demise. In my talk, I addressed some of the main reasons I believe interpretability is important for the data science and machine learning community.&lt;/p&gt;
&lt;h3 id="why-care-about-interpretability"&gt;Why Care About Interpretability?&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;If we become so removed from the average person's understanding that we see it as a burden and a nuisance even to address their concerns, we will find ourselves the target of a cultural, political or regulatory backlash.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If we build interpretability, we allow area experts and our end users to give us realistic feedback and help improve our model overall. They can help us diagnose noise, see correlations and find better labels.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href="https://blog.kjamistan.com/gdpr-you-my-talk-at-cloudera-sessions-munchen.html"&gt;GDPR&lt;/a&gt; and other regulations are pushing for more transparency. If we fear or run from transparency, then we might want to ask ourselves WHY. Is it because we fear the gap between our user's understanding of models and our own explanation? If so, is it just a matter of some technical literacy? OR, is it because we aren't proud of the way we are using their data and perhaps our models are extensions of unethical or immoral decisions made in the preprocessing, training or use case.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Models can become racist, sexist and display other issues that are present in the data (often found in language data and crowdsourced data). If you are interested in reading more, I have &lt;a href="https://blog.kjamistan.com/pydata-amsterdam-keynote-on-ethical-machine-learning.html"&gt;a whole talk on this as well&lt;/a&gt;, or just start with the &lt;a href="https://www.princeton.edu/~aylinc/papers/caliskan-islam_semantics.pdf"&gt;amazing article on stereotypes in word vectors by Aylin Caliskan-Islam et al&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Now you are convinced interpretability might be useful, yes? So, where do we go from here? For better or worse, this is still a very open and broad area of research. I'll summarize a few libraries and papers you can use to get started immediately, as well as some problems in the space which are still areas of active research.&lt;/p&gt;
&lt;h3 id="what-can-i-do-now"&gt;What Can I Do Now?&lt;/h3&gt;
&lt;p&gt;There are several interesting open-source libraries which you can use to get started with interpretability. I highlighted a few in my talk, but there are &lt;strong&gt;many&lt;/strong&gt; more. I will try to outline a few of the interesting ones I found including some I didn't have time to outline in my talk.&lt;/p&gt;
&lt;h5 id="classification-explanations"&gt;Classification Explanations&lt;/h5&gt;
&lt;p&gt;This is currently the space that has the most open-source tools available; so if you are working on classifiers, the good news is there is more than one tool you can use.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;LIME (Local Interpretable Model-agnostic Explanations): &lt;a href="https://github.com/marcotcr/lime"&gt;GitHub&lt;/a&gt; and &lt;a href="https://arxiv.org/abs/1602.04938"&gt;Paper&lt;/a&gt; -- Find subsets of your data which can explain the model at a local level (a short usage sketch follows this list).&lt;/li&gt;
&lt;li&gt;eli5 (explain to me like I'm five): &lt;a href="https://github.com/TeamHG-Memex/eli5"&gt;GitHub&lt;/a&gt; Open-source library with great documentation allowing you to build visual explanations of classifiers and regression models.&lt;/li&gt;
&lt;li&gt;Sklearn-ExpertSys: &lt;a href="https://github.com/tmadl/sklearn-expertsys"&gt;GitHub&lt;/a&gt; -- Decision and Rule-based sets for Classifiers. I personally haven't had a chance to use this yet, but plan to do so as part of a longer blog series.&lt;/li&gt;
&lt;/ul&gt;
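&lt;p&gt;To give a feel for how lightweight these tools are to try, here is a minimal LIME sketch for a text classifier (assuming a scikit-learn pipeline whose &lt;code&gt;predict_proba&lt;/code&gt; accepts raw strings; the function name and arguments are my own placeholders):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from lime.lime_text import LimeTextExplainer

def explain_prediction(pipeline, text, class_names):
    """Print the words that pushed the pipeline toward its prediction for one document."""
    explainer = LimeTextExplainer(class_names=class_names)
    # The pipeline should include the vectorizer so predict_proba works on raw strings
    explanation = explainer.explain_instance(text, pipeline.predict_proba, num_features=6)
    for word, weight in explanation.as_list():
        print(f"{word}: {weight:+.3f}")
&lt;/code&gt;&lt;/pre&gt;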
&lt;h5 id="neural-network-architectures"&gt;Neural Network Architectures&lt;/h5&gt;
&lt;ul&gt;
&lt;li&gt;Attention-Based Networks: Attention RNNs are useful for determining what the network has learned, because we can inspect what the network attends to in its memory. This is especially meaningful for image-based networks, since we can then "see" the clusters of pixels the network focuses on. For more reading, check out: &lt;a href="https://papers.nips.cc/paper/5166-training-and-analysing-deep-recurrent-neural-networks.pdf"&gt;Training and Analyzing Deep RNNs&lt;/a&gt;, &lt;a href="https://aclweb.org/anthology/D/D15/D15-1044.pdf"&gt;A Neural Attention Model for Sentence Summarization&lt;/a&gt; and &lt;a href="https://arxiv.org/pdf/1502.03044.pdf"&gt;Show, Attend and Tell: Neural Image Caption Generation with Visual Attention&lt;/a&gt; for a start.&lt;/li&gt;
&lt;li&gt;Generator-Encoder Rationales: &lt;a href="https://github.com/taolei87/rcnn"&gt;GitHub&lt;/a&gt; and &lt;a href="https://people.csail.mit.edu/taolei/papers/emnlp16_rationale.pdf"&gt;Paper&lt;/a&gt; Great paper and library which shows a method of generating smaller rationales using phrases from the text for several NLP tasks including multi-aspect sentiment analysis.&lt;/li&gt;
&lt;/ul&gt;
&lt;h5 id="other-useful-open-source-tools-and-notebooks"&gt;Other useful open-source tools and notebooks&lt;/h5&gt;
&lt;ul&gt;
&lt;li&gt;YellowBrick: &lt;a href="https://github.com/DistrictDataLabs/yellowbrick"&gt;GitHub&lt;/a&gt; -- Data Visualization library aimed at making visual explanations easier. I have so far only played around with this for data exploration, not for explaining models, but I am curious to hear your experience!&lt;/li&gt;
&lt;li&gt;MMD-critic: &lt;a href="https://github.com/BeenKim/MMD-critic"&gt;GitHub&lt;/a&gt; A meaningful approach to sampling! Google Brain resident Been Kim also wrote &lt;a href="http://people.csail.mit.edu/beenkim/papers/KIM2016NIPS_MMD.pdf"&gt;an accompanying paper&lt;/a&gt; which explains how this library works to help you sample&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/ianozsvald/data_science_delivered/blob/master/ml_explain_regression_prediction.ipynb"&gt;Ian Ozsvald's Notebook using eli5&lt;/a&gt;: Ian and I have been chatting about these libraries, and I asked him to continue to update and elaborate his own use of tools like eli5. Updates will come as well, so check back!&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/eBay/bayesian-belief-networks"&gt;Bayesian Belief Networks&lt;/a&gt;: Probabilistic Programming is cool again! (or always was... probably?) This is one of &lt;em&gt;many&lt;/em&gt; libraries you can use for building Bayesian networks. Although this may not fit your definition of interpretability (if you have to expose this to the end-client they may not be able to make sense of it), it is worth exploring for your own probabilistic models.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;There are many more, which I hope to write about over the coming weeks in a series of blog posts and notebooks as I explore what I call: reverse engineering for model interpretability AND MVM: Minimal Viable Models. (more on this to come so check back or follow me on Twitter... 😉)&lt;/p&gt;
&lt;h3 id="what-is-still-unsolved"&gt;What is Still Unsolved?&lt;/h3&gt;
&lt;p&gt;Plenty. If you are a graduate student or you work in a research lab or you work with unlimited access to TPUs (ahem..), please help this area of research. Here are a few things that are still very difficult.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Interpretable views of neural networks: I don't mean the one part of ImageNet where you can see a face. I mean actual interpretation of neural networks in a meaningful and statistically significant way.&lt;/li&gt;
&lt;li&gt;Multidimensional Projections: Finding ways to explain models or clusters using 2-D or 3-D visualizations of multi-dimensional space is difficult at best and error-prone at worst. Watch &lt;a href="https://www.youtube.com/watch?v=UkmIljRIG_M"&gt;Matti Lyra's PyData Talk on Topic Modeling for some insight&lt;/a&gt;. Or follow up with research from the fields of multi-dimensional distance metrics as well as unsupervised learning.&lt;/li&gt;
&lt;li&gt;Kagglefication: Ensembles are killing us, with some sort of averaged metric I wish I could explain... 😝 But honestly, if we gamify machine learning, do we run the risk of making our own work in the field into an optimization game where the only metric is our f1 score? I hope not, but it makes me fearful sometimes... I fear we often find ways to boost or over-engineer our features to the point that we can no longer interpret the metrics and measurements we have created. This is a problem.&lt;/li&gt;
&lt;li&gt;Finding representative samples and ensuring our labels are useful: It's difficult enough to explain models that you know were trained on meticulously documented labels. This becomes much more difficult in the "real world" where tags or labels might at times be high-quality or in other moments be garbage (or entirely absent...).&lt;/li&gt;
&lt;li&gt;Measuring Interpretability: Until there is a built-in &lt;code&gt;sklearn.metrics.interpret&lt;/code&gt;, I'm not certain how widespread interpretability metrics or their usage will be. Even defining how we might calculate such a metric is difficult. Although we can build upon probabilistic models and cognitive science theory, how can we easily compare the interpretability of a text explanation with that of a regression model? Research is clear that this is &lt;em&gt;not&lt;/em&gt; impossible to do, so I hope we can find a solution which allows us to optimize for a metric like interpretability...&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;There are likely many more areas of research and concern, but these are the ones that, for me, struck a chord and seemed obvious areas we, as an open-source community, can work on. If you know of papers or research in the area, I am all ears! I hope this small post has at least inspired you to have more conversations with peers or colleagues around the subject of interpretability, which is a good start.&lt;/p&gt;
&lt;h4 id="my-slides-talk"&gt;My Slides / Talk&lt;/h4&gt;
&lt;p&gt;If you are curious about my slides, I have posted them below.&lt;/p&gt;
&lt;iframe src="https://docs.google.com/presentation/d/e/2PACX-1vR05kpagAbL5qo1QThxwu44TI5SQAws_UFVg3nUAmKp39uNG0xdBjcMA-VyEeqZRGGQtt0CS5h2DMTS/embed?start=false&amp;loop=false&amp;delayms=3000" frameborder="0" width="960" height="569" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"&gt;&lt;/iframe&gt;

&lt;p&gt;The video is available here:&lt;/p&gt;
&lt;iframe width="560" height="315" src="https://www.youtube.com/embed/B3PtcF-6Dtc?list=PLGVZCDnMOq0oe0eD-edj_2CuBIZ938bWT" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen&gt;&lt;/iframe&gt;

&lt;p&gt;Please continue the conversation in the comments below, or feel free to reach out on Twitter (&lt;a href="https://twitter.com/kjam"&gt;@kjam&lt;/a&gt;).&lt;/p&gt;</content><category term="conferences"></category></entry><entry><title>GDPR &amp; You: My Talk at Cloudera Sessions München</title><link href="https://blog.kjamistan.com/gdpr-you-my-talk-at-cloudera-sessions-munchen.html" rel="alternate"></link><published>2017-10-11T00:00:00+02:00</published><updated>2017-10-11T00:00:00+02:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2017-10-11:/gdpr-you-my-talk-at-cloudera-sessions-munchen.html</id><summary type="html">&lt;p&gt;Unless you have been avoiding all news, you have likely heard of the coming changes in European privacy regulations which go into effect in May 2018. The changes are covered under the General Data Privacy Regulation Directive, whose final text was made available in May 2016.&lt;/p&gt;
&lt;p&gt;I presented a talk …&lt;/p&gt;</summary><content type="html">&lt;p&gt;Unless you have been avoiding all news, you have likely heard of the coming changes in European privacy regulations which go into effect in May 2018. The changes are covered under the General Data Protection Regulation (GDPR), whose final text was made available in May 2016.&lt;/p&gt;
&lt;p&gt;I presented a talk at &lt;a href="http://go.cloudera.com/cloudera-sessions-2017-munich"&gt;Cloudera Sessions Munich&lt;/a&gt; covering a few topics I found interesting on data privacy and security overall (not just for GDPR). Although inspired by some of the GDPR provisions, my talk focused on how a few areas might be impacted by the regulation and dove into how companies can take GDPR as a suggestion to start taking ethical data science more seriously.&lt;/p&gt;
&lt;p&gt;The main takeaways I wanted to share are:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;1. GDPR doesn't require ethical or even interpretable machine learning. But you should be doing this anyways, right?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;There are a lot of scary articles out there, usually by someone with half of a clue, talking about how GDPR is going to kill artificial intelligence in Europe as we know it. They cite a paragraph in a recital which calls for the ability to explain automated decisions and processing to the data subject (aka client / user / you &amp;amp; me).&lt;/p&gt;
&lt;p&gt;However, if you take time to read the text of GDPR as well as consult several legal papers on the topic, it is fairly clear that this right doesn't exist the way it's being spread in the headlines. A great paper on this topic is &lt;a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2903469"&gt;Why a Right to Explanation of Automated Decision-Making Does Not Exist in the General Data Protection Regulation (Wachter et al., 2017)&lt;/a&gt;, where they delve into the potential legal implications of this section of the regulation and explain that it is highly likely this will be interpreted as a right to be informed.
That said, if you cannot explain your model at all, doesn't that concern you? As a data scientist and machine learning practitioner, it bothers me! In fact, I think if we were required to explain our models more often, this might lead to a better understanding of our problem space, innovative new ways to measure or classify our results and more ethical models. Why? Because if I take the time to create an interpretable model, I not only can better explain why it behaves that way, but I can also see if perhaps there has been some "data leakage", which means my model has learned something I wanted to avoid (i.e. &lt;a href="https://blog.kjamistan.com/embedded-isms-in-vector-based-natural-language-processing.html"&gt;how to be racist or sexist&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;So how do we promote more interpretability within the community? Interpretability in machine learning has already been a topic for several years, with &lt;a href="https://sites.google.com/site/nips2016interpretml/papers-1"&gt;workshops&lt;/a&gt;, &lt;a href="https://www.stat.washington.edu/research/reports/2012/tr609.pdf"&gt;great papers&lt;/a&gt;, &lt;a href="https://github.com/marcotcr/lime"&gt;open-source libraries&lt;/a&gt; and &lt;a href="https://www.oreilly.com/ideas/ideas-on-interpreting-machine-learning"&gt;in-depth blog writeups&lt;/a&gt;. What saddens me is how often the Kaggle-verse somehow values every last half-percentage point of accuracy over anything interpretable. &lt;em&gt;Don't be that person!&lt;/em&gt; Instead, spend time finding a model that you can explain, reason with and defend.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;2. Data privacy is a myth. However, you can do your best at REAL anonymization to protect your customers.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Think your data is private? If you have used a service that uses third-party data processing, had your data released as part of a competition or study, or simply leave default settings on most of your applications and sites, then it is probably not. Why? In a "big data" world, de-anonymization (especially targeted) is trivial.
Research in de-anonymization made a leap in 2008 when &lt;a href="https://www.cs.utexas.edu/~shmat/shmat_oak08netflix.pdf"&gt;Arvind Narayanan and Vitaly Shmatikov published their paper: Robust De-anonymization of Large Sparse Datasets&lt;/a&gt;. The researchers had successfully de-anonymized users in the data released for the Netflix Prize. This data was released knowingly by Netflix and, according to Netflix, had been properly anonymized. The paper was well-received and Narayanan went on to do further research on de-anonymization. It is also worth reading just for the fantastic burns.&lt;/p&gt;
&lt;blockquote class="twitter-tweet" data-lang="en"&gt;&lt;p lang="en" dir="ltr"&gt;Peak joy: I have a *real* reason to read the Netflix de-anon paper in entirety. And let me tell you, it is full of 🔥 &lt;a href="https://t.co/KU5HPaDLoE"&gt;https://t.co/KU5HPaDLoE&lt;/a&gt; &lt;a href="https://t.co/dmgHGqvg04"&gt;pic.twitter.com/dmgHGqvg04&lt;/a&gt;&lt;/p&gt;&amp;mdash; katharine jarmul (@kjam) &lt;a href="https://twitter.com/kjam/status/914108189582483456?ref_src=twsrc%5Etfw"&gt;September 30, 2017&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async src="//platform.twitter.com/widgets.js" charset="utf-8"&gt;&lt;/script&gt;

&lt;p&gt;Andreas Dewes and several reporters from NDR and ARD researched this same topic recently, presenting the findings in a &lt;a href="https://re-publica.com/de/17/session/nackt-im-netz-unternehmen-intimste-daten-sammeln-tauschen-und-verkaufen-und-was-uns"&gt;re:publica talk #NacktimNetz&lt;/a&gt; (Note: it is in German, but they also presented at DefCon and that video should be available soon). They were able to very easily get hold of click-stream data for German politicians, police officers and public servants via a third-party company selling individuals' complete URL streams. Without great difficulty, they could find personally identifiable information in the data and de-anonymize a person's complete browsing history.
So what can you do as a person handling potentially sensitive user data? Mainly, &lt;em&gt;don't be evil&lt;/em&gt; (but no, really this time...). Don't sell your customer data to third-parties. Don't release it as a competition because it will be fun. Don't give it to anyone. Don't keep it connected to the public internet with default passwords. Just, be smart about it. And if you do choose to give it, sell it or release it, know that you need to &lt;em&gt;really&lt;/em&gt; think about what that might mean WHEN someone deanonymizes it.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;3. Data portability will hopefully inspire and encourage more competition.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;A ray of hope in this slightly grim blog post is the GDPR articles related to data portability. To me, this is perhaps the most exciting part of GDPR and holds quite a lot of power if implemented properly. Of course, there is quite a lot of debate surrounding how this will actually be enforced by the courts.&lt;/p&gt;
&lt;p&gt;The &lt;a href="http://ec.europa.eu/newsroom/just/item-detail.cfm?item_id=50083"&gt;working party document&lt;/a&gt; is fairly clear about its interpretation, stating that:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In this regard, WP29 considers that the right to data portability covers data provided knowingly and actively by the data subject as well as the personal data generated by his or her activity. This new right cannot be undermined and limited to the personal information directly communicated by the data subject, for example, on an online form.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;To me, this sounded like the competition created by mobile phone number portability. I decided to read a bit about how that was implemented in Europe, and found several interesting papers related to the topic, including &lt;a href="https://www.researchgate.net/publication/222649110_Mobile_number_portability_in_Europe"&gt;Mobile number portability in Europe (Buehler et al., 2005)&lt;/a&gt;, which explored pricing and its relation to the number of people switching carriers. Via some networking folks, I found anecdotal evidence that in areas where startups and smaller network carriers were competing with the larger companies on features, a high proportion of mobile users ported their numbers.&lt;/p&gt;
&lt;p&gt;For me, data portability opens up this same door. What if I could get all of my location data and port it to a new company? What if I could choose who I use for my language learning apps and port data easily between them?&lt;/p&gt;
&lt;p&gt;The possibility of real competition over who is a better data guardian, and who has better features, better security and better privacy, could be real. This makes me both happy and hopeful.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;In case you want to look through them, you can find my slides here:&lt;/p&gt;
&lt;iframe src="https://docs.google.com/presentation/d/e/2PACX-1vTROyk4ZELHAzjU_DPQFcELCHeLSsGDhxTrTK4c0xd6cR-RL44sAFVnzxU6NtysQSLJKz-b1dXi_bnI/embed?start=false&amp;loop=false&amp;delayms=3000" frameborder="0" width="960" height="569" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"&gt;&lt;/iframe&gt;
&lt;p&gt;If there is video recording, I will post it as well.&lt;/p&gt;
&lt;h4 id="slide-references-in-order-they-were-presented"&gt;Slide References (in order they were presented)&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2903469"&gt;Why a Right to Explanation of Automated Decision-Making Does Not Exist in the General Data Protection Regulation (Wachter et al., 2017)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.oreilly.com/ideas/ideas-on-interpreting-machine-learning"&gt;O'Reilly Post: Ideas on Interpreting Machine Learning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://science.sciencemag.org/content/356/6334/183"&gt;Semantics derived automatically from language corpora contain human-like biases (Caliskan et al., 2017)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.cs.utexas.edu/~shmat/shmat_oak08netflix.pdf"&gt;Robust De-anonymization of Large Sparse Datasets (Narayanan et al. 2008)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://re-publica.com/de/17/session/nackt-im-netz-unternehmen-intimste-daten-sammeln-tauschen-und-verkaufen-und-was-uns"&gt;Andreas Dewes: re:publica talk #NacktimNetz&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://ec.europa.eu/newsroom/just/item-detail.cfm?item_id=50083"&gt;Article 29 working party document&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.researchgate.net/publication/222649110_Mobile_number_portability_in_Europe"&gt;Mobile number portability in Europe (Buehler et al., 2005)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://hbr.org/2016/08/the-barriers-big-companies-face-when-they-try-to-act-like-lean-startups"&gt;HBR: The Barriers Big Companies Face When They Try to Act Like Lean Startups&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</content><category term="conferences"></category></entry><entry><title>Algorithmic Art and "Künstliche Kunst"</title><link href="https://blog.kjamistan.com/algorithmic-art-and-kunstliche-kunst.html" rel="alternate"></link><published>2017-10-07T00:00:00+02:00</published><updated>2017-10-07T00:00:00+02:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2017-10-07:/algorithmic-art-and-kunstliche-kunst.html</id><summary type="html">&lt;p&gt;I was invited to give a talk at &lt;a href="http://404.ie"&gt;404 Dublin&lt;/a&gt;, a really cool conference joining community groups w/ tech folks and art installations. When thinking of what topics might be of interest to the audience, I selfishly went to one of my (side) passions.. following artists who are doing amazing …&lt;/p&gt;</summary><content type="html">&lt;p&gt;I was invited to give a talk at &lt;a href="http://404.ie"&gt;404 Dublin&lt;/a&gt;, a really cool conference joining community groups w/ tech folks and art installations. When thinking of what topics might be of interest to the audience, I selfishly went to one of my (side) passions.. following artists who are doing amazing things with the intersection of computers and art.&lt;/p&gt;
&lt;p&gt;So, what did I find when left to my own devices, Google, old art books and Twitter?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;1. I found really awesome algorithmic art that is older than I imagined.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Did you know the first publicly shown algorithmic visual art was made on a graph printer called the Zuse Graphomat Z64?&lt;/p&gt;
&lt;p&gt;THIS THING:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Zuse Graphomat Z64" src="https://upload.wikimedia.org/wikipedia/de/thumb/7/7a/Graphomat_Zuse_Z64_1.jpg/562px-Graphomat_Zuse_Z64_1.jpg"&gt;&lt;/p&gt;
&lt;p&gt;(image from Wikimedia)&lt;/p&gt;
&lt;p&gt;HOW COOL IS THAT? And the art was created by a mathematician who studied philosophy under &lt;a href="http://www.max-bense.de/"&gt;Max Bense&lt;/a&gt; at what is now the University of Stuttgart. &lt;a href="http://zkm.de/publikation/georg-nees-kuenstliche-kunst-die-anfaenge"&gt;Georg Nees&lt;/a&gt; (the artist) went on to create famous pieces now on display in galleries around the world, and his thesis on Algorithmic Art is massively hard to find and costs hundreds of Euros (yes, please send me a copy 😂). And yes, he is the one who described the art as "Künstliche Kunst" or "Artificial Art".&lt;sup id="fnref:1"&gt;&lt;a class="footnote-ref" href="#fn:1"&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;And he wasn't the only one! &lt;a href="https://www.youtube.com/watch?v=8i7uFCK7G0o"&gt;Nanni Balestrini was writing algorithmic poetry&lt;/a&gt; using trained rules in 1961! Georg Nees presented alongside his fellow student &lt;a href="http://www.hfk-bremen.de/en/profiles/n/frieder-nake"&gt;Frieder Nake&lt;/a&gt;. And as you proceed into the 70s, you hit &lt;a href="http://www.aaronshome.com/aaron/aaron/gallery/index.html"&gt;Harold Cohen creating AARON&lt;/a&gt;, a system designed to eventually create AI art. A system he spent 40 YEARS (yes!) working on. And Cybernetic Landscapes by &lt;a href="https://www.sfmoma.org/artwork/2015.9.23"&gt;Aaron Marcus&lt;/a&gt;. If you come across more interesting art in these early times, please post a comment or feel free to message me. I'm fascinated with early applications of "Cybernetics" and computers. 😀&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;2. I realized just how much neural network (inspired or created) art is pushing boundaries today.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I have been following the work of &lt;a href="http://genekogan.com/"&gt;Gene Kogan&lt;/a&gt;, &lt;a href="http://quasimondo.com/"&gt;Mario Klingemann&lt;/a&gt; and &lt;a href="http://memo.tv/"&gt;Memo Akten&lt;/a&gt; for the past year or so because they are amazing, inspiring and doing things I think will change the way we use deep learning in the coming years (I would argue they already &lt;em&gt;are&lt;/em&gt; doing this). If you haven't seen their work yet...&lt;/p&gt;
&lt;blockquote class="twitter-tweet" data-lang="en"&gt;&lt;p lang="en" dir="ltr"&gt;After 7.5 epochs of training we are still in a dark place. &lt;a href="https://t.co/Bs9869YSc6"&gt;pic.twitter.com/Bs9869YSc6&lt;/a&gt;&lt;/p&gt;&amp;mdash; Mario Klingemann (@quasimondo) &lt;a href="https://twitter.com/quasimondo/status/818040490444685312?ref_src=twsrc%5Etfw"&gt;January 8, 2017&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async src="//platform.twitter.com/widgets.js" charset="utf-8"&gt;&lt;/script&gt;

&lt;blockquote class="twitter-tweet" data-lang="en"&gt;&lt;p lang="en" dir="ltr"&gt;neural glitch: billions of computations to evoke the appearance of total disorder. and yet perfectly reproducible... a true dynamical system &lt;a href="https://t.co/3kZnBDLLci"&gt;pic.twitter.com/3kZnBDLLci&lt;/a&gt;&lt;/p&gt;&amp;mdash; Gene Kogan (@genekogan) &lt;a href="https://twitter.com/genekogan/status/911228868702408704?ref_src=twsrc%5Etfw"&gt;September 22, 2017&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async src="//platform.twitter.com/widgets.js" charset="utf-8"&gt;&lt;/script&gt;

&lt;p&gt;Memo Akten's &lt;a href="http://www.memo.tv/pattern-recognition/"&gt;Pattern Recognition&lt;/a&gt;
&lt;img alt="Pattern Recognition by Memo Akten" src="http://www.memo.tv/wpdev/wp-content/uploads/pr_alex_nat_30s_10fps.gif"&gt;&lt;/p&gt;
&lt;p&gt;Yr welcome! 😉&lt;/p&gt;
&lt;p&gt;But I also came across several artists and other persons in the field I hadn't heard of yet whose work I found really interesting, including:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Emily Daniels' work on creative poetry and &lt;a href="https://twitter.com/ker00lf"&gt;@ker00lf&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://twitter.com/elluba"&gt;Luba Elliott&lt;/a&gt; is essentially the mafia boss of creative AI, sharing her research, curation and experiments via her site, talks, newsletter and work.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://jonaslund.biz/"&gt;Jonus Lund&lt;/a&gt;'s trippy, whimsical and political views on our digital world&lt;/li&gt;
&lt;li&gt;&lt;a href="http://sebastianschmieg.com/works/lstm/"&gt;Sebastien Schmeig&lt;/a&gt;'s fantastic take on Futurism and AI: LSTM&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.jakeelwes.com/"&gt;Jake Elwes&lt;/a&gt; between creating neural networks trained on pornography, to his "closed loop" and auto-encoded Buddha...&lt;/li&gt;
&lt;li&gt;&lt;a href="https://video.vice.com/en_us/video/superhypercube/58065419aec9b98a0b3bc15d"&gt;SuperHyperCube&lt;/a&gt;: a VR game created by a collective of artists&lt;/li&gt;
&lt;li&gt;&lt;a href="http://alteredqualia.com/xg/examples/eyes_gaze3.html"&gt;Eyes Gaze&lt;/a&gt;: Neural network generated portraits using DeepGaze to create creepy and surreal images and interactions by Mike Tyka.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;There were more too, but those were some of my favorites. The more I looked, the more I realized I needed to visit galleries more. At least 3 of the artists I was inspired by had shown pieces in Berlin in the last year. Time to get off the computer and start &lt;em&gt;experiencing&lt;/em&gt; art.&lt;/p&gt;
&lt;p&gt;And I got to re-investigate some of the artists I feel like are making commentary on how AI and mainstream machine learning are affecting society, including:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="http://jamesbridle.com/works"&gt;James Bridle&lt;/a&gt;: both Citizen Ex and Autonomous Trap were spectacular&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.engadget.com/2017/06/16/ai-weiwei-hansel-and-gretel-surveillance/"&gt;Ai Weiwei's Hansel and Gretel&lt;/a&gt;: W T F&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;3. It isn't very difficult to get started generating your own neural network art.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;On my small laptop GPU, I was able to train several generative text networks using LSTM (long short-term memory) networks. The output usually just made me laugh, as you need quite a lot of interesting data to make them work well.&lt;sup id="fnref:2"&gt;&lt;a class="footnote-ref" href="#fn:2"&gt;2&lt;/a&gt;&lt;/sup&gt; I started with the usual suspects (a minimal training sketch follows the list below):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="http://karpathy.github.io/2015/05/21/rnn-effectiveness/"&gt;Andreas Karpathy's RNN LSTM&lt;/a&gt; (which is a great read if you haven't already done so)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/sherjilozair/char-rnn-tensorflow"&gt;A tensorflow-backed character RNN&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/fchollet/keras/blob/master/examples/lstm_text_generation.py"&gt;Keras Generative LSTM&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
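&lt;p&gt;For the curious, here is a rough, minimal sketch of what a character-level LSTM text generator looks like in Keras, loosely following the examples linked above. The corpus path, layer sizes and training settings are placeholders for illustration, not the exact ones I used.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Minimal character-level LSTM text generator (sketch; settings are illustrative).
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense

text = open('lyrics.txt').read().lower()   # placeholder corpus
chars = sorted(set(text))
char_to_idx = {c: i for i, c in enumerate(chars)}

# Cut the corpus into overlapping windows of maxlen characters.
maxlen, step = 40, 3
sentences = [text[i:i + maxlen] for i in range(0, len(text) - maxlen, step)]
next_chars = [text[i + maxlen] for i in range(0, len(text) - maxlen, step)]

# One-hot encode the windows and the character each one should predict.
X = np.zeros((len(sentences), maxlen, len(chars)), dtype=np.bool_)
y = np.zeros((len(sentences), len(chars)), dtype=np.bool_)
for i, sentence in enumerate(sentences):
    for t, c in enumerate(sentence):
        X[i, t, char_to_idx[c]] = 1
    y[i, char_to_idx[next_chars[i]]] = 1

model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, len(chars))))
model.add(Dense(len(chars), activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
model.fit(X, y, batch_size=128, epochs=20)
&lt;/code&gt;&lt;/pre&gt;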
&lt;p&gt;But there were even more interesting takes out there:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/rossgoodwin/neuralsnap"&gt;NeuralSnap&lt;/a&gt;: Poetry generated from images&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/emdaniels/poetic-inner-join"&gt;E.M. Daniels poetic inner join&lt;/a&gt;: Joining poetry together using Bayesian probability and RNNs&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And of course plenty for visual art:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://affinelayer.com/pixsrv"&gt;Pix2Pix&lt;/a&gt;: Coloring based on trained networks for edges&lt;/li&gt;
&lt;li&gt;&lt;a href="http://sites.skoltech.ru/compvision/projects/deepwarp/"&gt;DeepWarp&lt;/a&gt;: Gaze images&lt;/li&gt;
&lt;li&gt;&lt;a href="http://openframeworks.cc/"&gt;OpenFrameworks&lt;/a&gt;: C++-based creative coding suite&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you are looking for inspiration, you might want to start with these:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="http://creative.ai"&gt;Creative.ai&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://ml4a.github.io"&gt;Machine Learning for Artists (by Gene Kogan)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://elluba.com/"&gt;Luba Elliott's Newsletter&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And there are so many more to play with. I'd love to hear about your favorites or feel free to share more artists (new and old) who are pushing the boundaries with neural networks and art in the comments.&lt;/p&gt;
&lt;p&gt;Finally, if you want to peruse them, here are my slides:&lt;/p&gt;
&lt;iframe src="https://docs.google.com/presentation/d/e/2PACX-1vR8OLLnb1iRB9w9MXMMMq7iJ-iKLRfpYzjvdmxmfbi9zbq5jI8xR9gh9LUdF90J71VgxjgiCSwk4_3g/embed?start=false&amp;loop=false&amp;delayms=3000" frameborder="0" width="960" height="569" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"&gt;&lt;/iframe&gt;

&lt;p&gt;I will post video if it's shared publicly. Special thanks to &lt;a href="https://www.linkedin.com/in/vickyleeire/"&gt;Vicky Lee&lt;/a&gt; for making my talk at 404 possible.&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;To see a great interview including this "Künstliche Kunst" conversation alongside some of Nees' algorithmic contemporaries, check out &lt;a href="https://www.youtube.com/watch?v=ugLopHSPQH4"&gt;Early Computer art, man-machine&lt;/a&gt;.&amp;#160;&lt;a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;I tried making one with Tupac songs, but it quickly devolved from cursing into gibberish. Erykah Badu was my next goal, but I simply didn't have enough content. Then I set my eyes on James Joyce -- perhaps a bit too high brow for my small GPU. And finally had a lot of fun with U2 lyrics, leading to fun excerpts like those I showed in my talk as well as these gems: "oh my heart, love is bloody sunday", "i've got to get you, got to get you, got to get you...", and one particularly lulzy one "la la la la la la la la la la la la la la la la la".&amp;#160;&lt;a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="conferences"></category></entry><entry><title>Comparing scikit-learn Text Classifiers on a Fake News Dataset</title><link href="https://blog.kjamistan.com/comparing-scikit-learn-text-classifiers-on-a-fake-news-dataset.html" rel="alternate"></link><published>2017-08-28T00:00:00+02:00</published><updated>2017-08-28T00:00:00+02:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2017-08-28:/comparing-scikit-learn-text-classifiers-on-a-fake-news-dataset.html</id><summary type="html">&lt;p&gt;Finding ways to determine fake news from real news is a challenge most Natural Language Processing folks I meet and chat with want to solve. There is significant difficulty in doing this properly and without penalizing real news sources.&lt;/p&gt;
&lt;p&gt;I was discussing this problem with Miguel Martinez-Alvarez on my last …&lt;/p&gt;</summary><content type="html">&lt;p&gt;Finding ways to determine fake news from real news is a challenge most Natural Language Processing folks I meet and chat with want to solve. There is significant difficulty in doing this properly and without penalizing real news sources.&lt;/p&gt;
&lt;p&gt;I was discussing this problem with Miguel Martinez-Alvarez on my last visit to the SignalHQ offices; and his post on &lt;a href="https://miguelmalvarez.com/2017/03/23/how-can-machine-learning-and-ai-help-solving-the-fake-news-problem/"&gt;using AI to solve the fake news problem&lt;/a&gt; further elaborates on why this is no simple task.&lt;/p&gt;
&lt;p&gt;I stumbled across a post which built &lt;a href="https://opendatascience.com/blog/how-to-build-a-fake-news-classification-model/"&gt;a classifier for fake news with fairly high accuracy&lt;/a&gt; (and yay! the &lt;a href="https://github.com/GeorgeMcIntire/fake_real_news_dataset"&gt;dataset&lt;/a&gt; was published!). I wanted to investigate whether I could replicate the results and if the classifier actually learned anything useful.&lt;/p&gt;
&lt;h4 id="preparing-the-data"&gt;Preparing the data&lt;/h4&gt;
&lt;p&gt;In my initial investigation, I compared Multinomial Naive Bayes on bag-of-words (CountVectorizer) features as well as on term frequency-inverse document frequency (TfidfVectorizer) features. I also compared a Passive Aggressive linear classifier using the TF-IDF features. The resulting accuracy ranged from 83% to 93%. You can walk through &lt;a href="https://www.datacamp.com/community/tutorials/scikit-learn-fake-news"&gt;my initial investigation published on the DataCamp blog&lt;/a&gt; to read my approach and thoughts (&lt;a href="https://github.com/kjam/random_hackery/blob/master/Attempting%20to%20detect%20fake%20news.ipynb"&gt;a Jupyter notebook of the code&lt;/a&gt; is also available on my GitHub). In summary, the data was messy and I was concerned the features were likely nonsensical.&lt;/p&gt;
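&lt;p&gt;For readers who want the gist without opening the notebook, here is a minimal sketch of that first comparison. The CSV filename, column names and split parameters are assumptions for illustration; the linked notebook has the actual code.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Sketch of the initial comparison (file and column names are assumed).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score

df = pd.read_csv('fake_or_real_news.csv')   # assumed file with 'text' and 'label' columns
X_train, X_test, y_train, y_test = train_test_split(
    df['text'], df['label'], test_size=0.33, random_state=53)

count_vec = CountVectorizer(stop_words='english')
tfidf_vec = TfidfVectorizer(stop_words='english', max_df=0.7)

for name, vec, clf in [('NB + counts', count_vec, MultinomialNB()),
                       ('NB + tf-idf', tfidf_vec, MultinomialNB()),
                       ('PassiveAggressive + tf-idf', tfidf_vec, PassiveAggressiveClassifier())]:
    train_vectors = vec.fit_transform(X_train)   # refits the vectorizer on each pass
    test_vectors = vec.transform(X_test)
    clf.fit(train_vectors, y_train)
    print(name, accuracy_score(y_test, clf.predict(test_vectors)))
&lt;/code&gt;&lt;/pre&gt;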
&lt;h4 id="comparing-different-classification-models"&gt;Comparing different classification models&lt;/h4&gt;
&lt;p&gt;I wanted to take a deeper look into the features and compare them across classifiers. This time I added an additional few classifiers, so overall I would compare:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Multinomial Naive Bayes with Count Vectors&lt;/li&gt;
&lt;li&gt;Multinomial Naive Bayes with Tf-Idf Vectors&lt;/li&gt;
&lt;li&gt;Passive Aggressive linear model with Tf-Idf Vectors&lt;/li&gt;
&lt;li&gt;SVC linear model with Tf-Idf Vectors&lt;/li&gt;
&lt;li&gt;SGD linear model with Tf-Idf Vectors&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Without any parameter tuning, here is a simple ROC curve comparison of the results:
&lt;img alt="Fake and Real News ROC curve" src="https://blog.kjamistan.com/images/2017/08/fake_real_news_simple_roc_curve.png"&gt;
You can see that the linear models are outperforming the Naive Bayes classifiers, and that the accuracy scores are fairly good (even without parameter tuning).&lt;/p&gt;
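&lt;p&gt;If you want to reproduce a plot like this, here is a rough sketch using scikit-learn and matplotlib. It assumes a list of already-fitted (name, classifier, test matrix) tuples and treats REAL as the positive class; the variable names are placeholders rather than the exact ones in my notebook.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Sketch of the ROC comparison: one curve per fitted classifier.
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

plt.figure()
for name, clf, test_vectors in fitted_models:   # assumed: (name, fitted clf, test features)
    if hasattr(clf, 'decision_function'):
        scores = clf.decision_function(test_vectors)
    else:
        scores = clf.predict_proba(test_vectors)[:, 1]   # column 1 is classes_[1], i.e. REAL
    fpr, tpr, _ = roc_curve(y_test == 'REAL', scores)    # treat REAL as the positive label
    plt.plot(fpr, tpr, label='{} (AUC = {:.2f})'.format(name, auc(fpr, tpr)))

plt.plot([0, 1], [0, 1], linestyle='--', label='chance')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend(loc='lower right')
plt.show()
&lt;/code&gt;&lt;/pre&gt;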
&lt;p&gt;So indeed I could replicate the results, but what did the models &lt;em&gt;actually&lt;/em&gt; learn? What features signified real versus fake news?&lt;/p&gt;
&lt;h4 id="introspecting-significant-features"&gt;Introspecting significant features&lt;/h4&gt;
&lt;p&gt;To introspect the models, I used a method I first read about &lt;a href="https://stackoverflow.com/a/26980472"&gt;on StackOverflow&lt;/a&gt; showing how to extract coefficients for binary classification (and therefore show the most significant features for each class). After some extraction, I was able to compare the classifiers with one another. The &lt;a href="https://github.com/kjam/random_hackery/blob/master/Comparing%20Fake%20News%20Classifiers.ipynb"&gt;full notebook for running these extractions&lt;/a&gt; is available on my GitHub. I will summarize some of the findings here.&lt;/p&gt;
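&lt;p&gt;The idea behind that extraction is simple: pair each learned coefficient with its feature name and look at both extremes of the sorted list. Here is a minimal sketch; the helper name and the fitted vectorizer and classifier variables are illustrative, not the exact ones in the notebook.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Sketch of coefficient introspection for a binary classifier.
def most_informative_features(vectorizer, classifier, n=10):
    feature_names = vectorizer.get_feature_names()
    # For a binary problem coef_ has shape (1, n_features): negative weights pull
    # toward classes_[0] (here FAKE), positive weights toward classes_[1] (REAL).
    weighted = sorted(zip(classifier.coef_[0], feature_names))
    return weighted[:n], weighted[-n:][::-1]   # (top FAKE features, top REAL features)

# Assumed fitted objects from the training step above.
fake_feats, real_feats = most_informative_features(tfidf_vec, linear_clf)
&lt;/code&gt;&lt;/pre&gt;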
&lt;h6 id="fake-news-has-lots-of-noisy-identifiers"&gt;Fake news has lots of noisy identifiers&lt;/h6&gt;
&lt;p&gt;For most models, the top features for fake news were almost exclusively noise. Below are the top ten features ranked by weight for the most performant Naive Bayes classifier:&lt;/p&gt;
&lt;table&gt;
  &lt;tr&gt;
    &lt;th&gt;Feature&lt;/th&gt;
    &lt;th&gt;Weight&lt;/th&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;'0000'&lt;/td&gt;
    &lt;td&gt;-16.067750538483136&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;'000035'&lt;/td&gt;
    &lt;td&gt;-16.067750538483136&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;'0001'&lt;/td&gt;
    &lt;td&gt;-16.067750538483136&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;'0001pt'&lt;/td&gt;
    &lt;td&gt;-16.067750538483136&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;'000km'&lt;/td&gt;
    &lt;td&gt;-16.067750538483136&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;'0011'&lt;/td&gt;
    &lt;td&gt;-16.067750538483136&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;'006s'&lt;/td&gt;
    &lt;td&gt;-16.067750538483136&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;'007'&lt;/td&gt;
    &lt;td&gt;-16.067750538483136&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;'007s'&lt;/td&gt;
    &lt;td&gt;-16.067750538483136&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;'008s'&lt;/td&gt;
    &lt;td&gt;-16.067750538483136&lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;

&lt;p&gt;You might notice a pattern, yes? The "top features" all have the same weight and are alphabetical -- when I took a closer look there were more than 20,000 tokens as top performers with the same weight for Naive Bayes.&lt;sup id="fnref:2"&gt;&lt;a class="footnote-ref" href="#fn:2"&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;The top linear model features for fake news looked like this:&lt;/p&gt;
&lt;table&gt;
  &lt;tr&gt;
    &lt;th&gt;Feature&lt;/th&gt;
    &lt;th&gt;Weight&lt;/th&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;'2016'&lt;/td&gt;
    &lt;td&gt;-5.067099443402463&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;'october'&lt;/td&gt;
    &lt;td&gt;-4.2461599700216439&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;'hillary'&lt;/td&gt;
    &lt;td&gt;-4.0444719646755933&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;'share'&lt;/td&gt;
    &lt;td&gt;-3.1994347679575168&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;'article'&lt;/td&gt;
    &lt;td&gt;-2.9875364640619431&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;'november'&lt;/td&gt;
    &lt;td&gt;-2.872542653309075&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;'print'&lt;/td&gt;
    &lt;td&gt;-2.7039994399720166&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;'email'&lt;/td&gt;
    &lt;td&gt;-2.4671743850771906&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;'advertisement'&lt;/td&gt;
    &lt;td&gt;-2.3948473577644886&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;'oct'&lt;/td&gt;
    &lt;td&gt;-2.3773831096010531&lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;

&lt;p&gt;Also very noisy, with words like "share", "print" and "article" (probably from "Print article" links) as well as date strings (likely from publication headers). The only token that is not from auxiliary text on the page is likely "hillary", which in and of itself does not distinguish fake from real news. &lt;sup id="fnref:1"&gt;&lt;a class="footnote-ref" href="#fn:1"&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;h6 id="linear-models-agreed-that-to-say-is-a-real-news-feature"&gt;Linear Models agreed that "to say" is a real news feature&lt;/h6&gt;
&lt;p&gt;For the linear models, forms of the verb "to say" appeared near the top -- likely learned from quotations in professional journalism (e.g. "Chancellor Angela Merkel said..."). In fact, "said" was the most significant token for the top linear model, edging out the next token by 2 points. Here is a short summary of real news top tokens from the Passive Aggressive classifier:&lt;/p&gt;
&lt;table&gt;
  &lt;tr&gt;
    &lt;th&gt;Feature&lt;/th&gt;
    &lt;th&gt;Weight&lt;/th&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;'said'&lt;/td&gt;
    &lt;td&gt;4.6936244574076511&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;'says'&lt;/td&gt;
    &lt;td&gt;2.6841231322197814&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;'cruz'&lt;/td&gt;
    &lt;td&gt;2.4882327232138084&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;'tuesday'&lt;/td&gt;
    &lt;td&gt;2.4307699875323676&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;'friday'&lt;/td&gt;
    &lt;td&gt;2.4004245195582929&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;'islamic'&lt;/td&gt;
    &lt;td&gt;2.3792489975683924&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;'candidates'&lt;/td&gt;
    &lt;td&gt;2.3458465918387894&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;'gop'&lt;/td&gt;
    &lt;td&gt;2.3449946222238158&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;'conservative'&lt;/td&gt;
    &lt;td&gt;2.3312074608602522&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;'marriage'&lt;/td&gt;
    &lt;td&gt;2.3246779761740823&lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;

&lt;p&gt;Although there are more real topics included, there are also words like Friday and Tuesday. Perhaps we should only read the news on Friday or Tuesday to ensure it is real...&lt;/p&gt;
&lt;p&gt;&lt;img src="https://media.giphy.com/media/Smjo9iKgrt3DW/giphy.gif"/&gt;&lt;p&gt;&lt;a href="https://giphy.com/gifs/krysten-ritter-apt-23-dont-trust-the-b-in-Smjo9iKgrt3DW"&gt;via GIPHY&lt;/a&gt;&lt;/p&gt;&lt;/p&gt;
&lt;h6 id="overall-the-top-tokens-were-mainly-noise"&gt;Overall, the top tokens were mainly noise&lt;/h6&gt;
&lt;p&gt;When I aggregated the top tokens for both real and fake news, sorting by count (i.e. the most common tokens identified as real and fake for all models), I saw mainly noise. Here are the top tokens sorted by the number of occurrences for identifying real news:&lt;/p&gt;
&lt;table border="1" &gt;
  &lt;thead&gt;
    &lt;tr style="text-align: right;"&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;Aggregate Rank&lt;/th&gt;
      &lt;th&gt;Count&lt;/th&gt;
      &lt;th&gt;Label&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;said&lt;/th&gt;
      &lt;td&gt;9.8&lt;/td&gt;
      &lt;td&gt;5&lt;/td&gt;
      &lt;td&gt;REAL&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;cruz&lt;/th&gt;
      &lt;td&gt;3.5&lt;/td&gt;
      &lt;td&gt;4&lt;/td&gt;
      &lt;td&gt;REAL&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;tuesday&lt;/th&gt;
      &lt;td&gt;8.33333&lt;/td&gt;
      &lt;td&gt;3&lt;/td&gt;
      &lt;td&gt;REAL&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;conservative&lt;/th&gt;
      &lt;td&gt;4.66667&lt;/td&gt;
      &lt;td&gt;3&lt;/td&gt;
      &lt;td&gt;REAL&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;gop&lt;/th&gt;
      &lt;td&gt;3.33333&lt;/td&gt;
      &lt;td&gt;3&lt;/td&gt;
      &lt;td&gt;REAL&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;islamic&lt;/th&gt;
      &lt;td&gt;6.33333&lt;/td&gt;
      &lt;td&gt;3&lt;/td&gt;
      &lt;td&gt;REAL&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;says&lt;/th&gt;
      &lt;td&gt;8.33333&lt;/td&gt;
      &lt;td&gt;3&lt;/td&gt;
      &lt;td&gt;REAL&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;president&lt;/th&gt;
      &lt;td&gt;5.5&lt;/td&gt;
      &lt;td&gt;2&lt;/td&gt;
      &lt;td&gt;REAL&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;trump&lt;/th&gt;
      &lt;td&gt;9.5&lt;/td&gt;
      &lt;td&gt;2&lt;/td&gt;
      &lt;td&gt;REAL&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;state&lt;/th&gt;
      &lt;td&gt;3&lt;/td&gt;
      &lt;td&gt;2&lt;/td&gt;
      &lt;td&gt;REAL&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;And the top tokens for identifying fake news:&lt;/p&gt;
&lt;table border="1"&gt;
  &lt;thead&gt;
    &lt;tr style="text-align: right;"&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;Aggregate Rank&lt;/th&gt;
      &lt;th&gt;Count&lt;/th&gt;
      &lt;th&gt;Label&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;2016&lt;/th&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;3&lt;/td&gt;
      &lt;td&gt;FAKE&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;share&lt;/th&gt;
      &lt;td&gt;5&lt;/td&gt;
      &lt;td&gt;3&lt;/td&gt;
      &lt;td&gt;FAKE&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;print&lt;/th&gt;
      &lt;td&gt;7.33333&lt;/td&gt;
      &lt;td&gt;3&lt;/td&gt;
      &lt;td&gt;FAKE&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;october&lt;/th&gt;
      &lt;td&gt;2.66667&lt;/td&gt;
      &lt;td&gt;3&lt;/td&gt;
      &lt;td&gt;FAKE&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;november&lt;/th&gt;
      &lt;td&gt;5.66667&lt;/td&gt;
      &lt;td&gt;3&lt;/td&gt;
      &lt;td&gt;FAKE&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;hillary&lt;/th&gt;
      &lt;td&gt;2.33333&lt;/td&gt;
      &lt;td&gt;3&lt;/td&gt;
      &lt;td&gt;FAKE&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;article&lt;/th&gt;
      &lt;td&gt;4.33333&lt;/td&gt;
      &lt;td&gt;3&lt;/td&gt;
      &lt;td&gt;FAKE&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;0000&lt;/th&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;2&lt;/td&gt;
      &lt;td&gt;FAKE&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;election&lt;/th&gt;
      &lt;td&gt;7.5&lt;/td&gt;
      &lt;td&gt;2&lt;/td&gt;
      &lt;td&gt;FAKE&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;000035&lt;/th&gt;
      &lt;td&gt;2&lt;/td&gt;
      &lt;td&gt;2&lt;/td&gt;
      &lt;td&gt;FAKE&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;To see the code used to generate these rankings, please take a look at &lt;a href="https://github.com/kjam/random_hackery/blob/master/Comparing%20Fake%20News%20Classifiers.ipynb"&gt;the Jupyter Notebook&lt;/a&gt;.&lt;/p&gt;
&lt;h4 id="takeaways"&gt;Takeaways&lt;/h4&gt;
&lt;p&gt;As I conjectured from the start, fake news is a much harder problem than simply throwing simple NLP vectors at it and solving with a linear or Bayesian model. Although I found it interesting that the linear classifiers noticed that real news uses quoting verbs more often, this is far from a deep insight that could help us build a real-versus-fake news filter which might improve democracy.&lt;/p&gt;
&lt;p&gt;I did have fun spending a short time building on a few ideas, and it was useful to see that the linear models produced less token noise in their real news features. If I had taken time to clean the dataset of these tokens, I'm curious how the comparison between the models would change.&lt;/p&gt;
&lt;p&gt;In the end, the dataset is likely not a great candidate for building a robust fake versus real news model. It seems to have a lot of token noise (dates, share and print links and a limited variety of topics). It is also fairly small and therefore any models would likely suffer from having a smaller token set and have trouble generalizing.&lt;/p&gt;
&lt;p&gt;I'm always curious to hear other trends or ideas you have in approaching these topics. Feel free to comment below or &lt;a href="https://twitter.com/kjam"&gt;reach out via Twitter (@kjam)&lt;/a&gt;.&lt;/p&gt;
&lt;h6 id="footnotes"&gt;Footnotes&lt;/h6&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;Perhaps the fact that Clinton did not appear alongside it means a longer n-gram could identify references to popular alt-right and conservative monikers like "Lying Hillary" versus "Hillary Clinton".&amp;#160;&lt;a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;Some fun ones in there included '11truther', '0h4at2yetra17uxetni02ls2jeg0mty45jrcu7mrzsrpcbq464i', 'nostrums', 'wordpress' and 'woot'. (I'm sure there are many more finds awaiting more study ...)&amp;#160;&lt;a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="research"></category></entry><entry><title>Data Unit Testing: EuroPython Tutorial</title><link href="https://blog.kjamistan.com/data-unit-testing-europython-tutorial.html" rel="alternate"></link><published>2017-07-14T00:00:00+02:00</published><updated>2017-07-14T00:00:00+02:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2017-07-14:/data-unit-testing-europython-tutorial.html</id><summary type="html">&lt;p&gt;I gave a long and opinionated tutorial at &lt;a href="https://ep2017.europython.eu/p3/schedule/ep2017/"&gt;EuroPython 2017&lt;/a&gt; about how we &lt;a href="https://ep2017.europython.eu/conference/talks/data-unit-testing-with-python"&gt;should do unit testing and validation within a data science scope&lt;/a&gt;. The GitHub repository for the course (which is part of my &lt;a href="https://blog.kjamistan.com/practical-data-cleaning-with-python-resources.html"&gt;O'Reilly Live Online training&lt;/a&gt;) is &lt;a href="https://github.com/kjam/data-cleaning-101"&gt;https://github.com/kjam/data-cleaning-101&lt;/a&gt;. I will continue editing and …&lt;/p&gt;</summary><content type="html">&lt;p&gt;I gave a long and opinionated tutorial at &lt;a href="https://ep2017.europython.eu/p3/schedule/ep2017/"&gt;EuroPython 2017&lt;/a&gt; about how we &lt;a href="https://ep2017.europython.eu/conference/talks/data-unit-testing-with-python"&gt;should do unit testing and validation within a data science scope&lt;/a&gt;. The GitHub repository for the course (which is part of my &lt;a href="https://blog.kjamistan.com/practical-data-cleaning-with-python-resources.html"&gt;O'Reilly Live Online training&lt;/a&gt;) is &lt;a href="https://github.com/kjam/data-cleaning-101"&gt;https://github.com/kjam/data-cleaning-101&lt;/a&gt;. I will continue editing and updating the repository with more examples, so feel free to fork or star it to get updates.&lt;/p&gt;
&lt;p&gt;The slides for the talk are also available here:&lt;/p&gt;
&lt;iframe src="https://docs.google.com/presentation/d/18RyP6X1eRvdvK720UtX3TxFbke6kfVLLXI0zqeiEBtU/embed?start=false&amp;loop=false&amp;delayms=3000" frameborder="0" width="960" height="569" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"&gt;&lt;/iframe&gt;

&lt;p&gt;And for those who attended, please &lt;a href="https://bit.ly/data-unit-testing-feedback"&gt;give me feedback&lt;/a&gt;!&lt;/p&gt;</content><category term="trainings"></category></entry><entry><title>if Ethics is not None</title><link href="https://blog.kjamistan.com/if-ethics-is-not-none.html" rel="alternate"></link><published>2017-07-14T00:00:00+02:00</published><updated>2017-07-14T00:00:00+02:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2017-07-14:/if-ethics-is-not-none.html</id><summary type="html">&lt;p&gt;This past Wednesday, I had the pleasure of giving a keynote at &lt;a href="https://ep2017.europython.eu/en/"&gt;EuroPython 2017&lt;/a&gt;. I covered a historical view of ethics in computing. The slides are shared here, but it was also recorded so I will post a video when it is available. (Updated: video added!)&lt;/p&gt;
&lt;p&gt;In addition, a series …&lt;/p&gt;</summary><content type="html">&lt;p&gt;This past Wednesday, I had the pleasure of giving a keynote at &lt;a href="https://ep2017.europython.eu/en/"&gt;EuroPython 2017&lt;/a&gt;. I covered a historical view of ethics in computing. The slides are shared here, but it was also recorded so I will post a video when it is available. (Updated: video added!)&lt;/p&gt;
&lt;p&gt;In addition, a series of blog posts and interviews I conducted during my research will be here in August, so stay tuned for more historical computing memories!&lt;/p&gt;
&lt;p&gt;Slides:&lt;/p&gt;
&lt;iframe src="https://docs.google.com/presentation/d/1EFHTp8okoIOvn3j0ga8YzNBIF80zGNKsROwttS2_j00/embed?start=false&amp;loop=false&amp;delayms=3000" frameborder="0" width="960" height="569" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"&gt;&lt;/iframe&gt;

&lt;p&gt;Video:&lt;/p&gt;
&lt;iframe width="1280" height="720" src="https://www.youtube.com/embed/FtRbAePXUoI" frameborder="0" gesture="media" allow="encrypted-media" allowfullscreen&gt;&lt;/iframe&gt;</content><category term="conferences"></category></entry><entry><title>Practical Data Cleaning with Python Resources</title><link href="https://blog.kjamistan.com/practical-data-cleaning-with-python-resources.html" rel="alternate"></link><published>2017-05-03T00:00:00+02:00</published><updated>2017-05-03T00:00:00+02:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2017-05-03:/practical-data-cleaning-with-python-resources.html</id><summary type="html">&lt;h2 id="practical-data-cleaning-resources"&gt;Practical Data Cleaning Resources&lt;/h2&gt;
&lt;h4 id="oreilly-live-online-training"&gt;(O'Reilly Live Online Training)&lt;/h4&gt;
&lt;p&gt;This week I will be giving my first O'Reilly Live Online Training via the Safari platform. I'm pretty excited to share some of my favorite data cleaning libraries and tips for validating and testing your data workflows.&lt;/p&gt;
&lt;p&gt;This post hopes to be …&lt;/p&gt;</summary><content type="html">&lt;h2 id="practical-data-cleaning-resources"&gt;Practical Data Cleaning Resources&lt;/h2&gt;
&lt;h4 id="oreilly-live-online-training"&gt;(O'Reilly Live Online Training)&lt;/h4&gt;
&lt;p&gt;This week I will be giving my first O'Reilly Live Online Training via the Safari platform. I'm pretty excited to share some of my favorite data cleaning libraries and tips for validating and testing your data workflows.&lt;/p&gt;
&lt;p&gt;This post hopes to be a resource to those attending the class, but also anyone interested in the subject of practical data cleaning with Python. If you have tips or ideas on extra content or links to add, feel free to comment or reach out via Twitter or email.&lt;/p&gt;
&lt;p&gt;Hope you enjoy!&lt;/p&gt;
&lt;h3 id="libraries-repositories"&gt;Libraries / Repositories&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Course Repository: https://github.com/kjam/data-cleaning-101&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="deduplication"&gt;Deduplication&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;Dedupe: https://github.com/dedupeio/dedupe&lt;/li&gt;
&lt;li&gt;CSV Dedupe: https://github.com/dedupeio/csvdedupe&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="string-matching"&gt;String Matching&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;Fuzzy Wuzzy: https://github.com/seatgeek/fuzzywuzzy&lt;/li&gt;
&lt;li&gt;TextaCy: https://github.com/chartbeat-labs/textacy&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="managing-nulls"&gt;Managing Nulls&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;Pandas functions: http://pandas.pydata.org/pandas-docs/stable/missing_data.html&lt;/li&gt;
&lt;li&gt;Dora: https://github.com/NathanEpstein/Dora&lt;/li&gt;
&lt;li&gt;Badfish: https://github.com/harshnisar/badfish&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="normalization-preprocessing"&gt;Normalization &amp;amp; Preprocessing&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;Scikit-learn preprocessing: http://scikit-learn.org/stable/modules/preprocessing.html&lt;/li&gt;
&lt;li&gt;Pandas stats: http://pandas.pydata.org/pandas-docs/stable/basics.html#descriptive-statistics&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="specific-data-cleaning-topics"&gt;Specific data cleaning topics&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;Privacy? https://github.com/datascopeanalytics/scrubadub&lt;/li&gt;
&lt;li&gt;Measurements? http://pint.readthedocs.io/&lt;/li&gt;
&lt;li&gt;Versioning ML Data? https://github.com/NathanEpstein/Dora&lt;/li&gt;
&lt;li&gt;Dates? http://arrow.readthedocs.io/en/latest/ or https://github.com/kennethreitz/maya&lt;/li&gt;
&lt;li&gt;AutoClean? https://github.com/rhiever/datacleaner&lt;/li&gt;
&lt;li&gt;DIY Parser? https://github.com/datamade/parserator&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="simple-pipelines-graphs-task-processing"&gt;Simple pipelines / graphs, task processing&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;Dask: https://github.com/dask/dask&lt;/li&gt;
&lt;li&gt;Distributed: https://github.com/dask/distributed&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="schema-validation"&gt;Schema Validation&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Voluptuous: https://github.com/alecthomas/voluptuous&lt;/li&gt;
&lt;li&gt;Validr: https://github.com/guyskk/validr&lt;/li&gt;
&lt;li&gt;With Serialization: https://marshmallow.readthedocs.io/en/latest/&lt;/li&gt;
&lt;li&gt;For JVM / Apache: https://avro.apache.org/&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="dataframe-validation"&gt;Dataframe Validation&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;Engarde: https://github.com/TomAugspurger/engarde&lt;/li&gt;
&lt;li&gt;Validada: https://github.com/jnmclarty/validada&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="constraint-detection"&gt;Constraint Detection&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;TDDA: Test-Driven Data Analysis: https://github.com/tdda/tdda&lt;/li&gt;
&lt;li&gt;SciPy: https://docs.scipy.org/doc/scipy-0.19.0/reference/stats.html#statistical-functions&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="property-based-testing"&gt;Property-based Testing&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;Hypothesis: https://hypothesis.readthedocs.io/&lt;/li&gt;
&lt;li&gt;Haskell's Quickcheck: https://hackage.haskell.org/package/QuickCheck&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="more-validation-and-testing"&gt;More Validation and Testing&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;Model Cross Validation: http://scikit-learn.org/stable/modules/cross_validation.html&lt;/li&gt;
&lt;li&gt;Testing ML features: https://github.com/machinalis/featureforge&lt;/li&gt;
&lt;li&gt;Built-in Stats: https://docs.python.org/3/library/statistics.html&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="unit-testing-basics"&gt;Unit Testing Basics&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;PyTest: https://docs.pytest.org/en/latest/&lt;/li&gt;
&lt;li&gt;Mocking: https://docs.python.org/3/library/unittest.mock-examples.html&lt;/li&gt;
&lt;li&gt;Faking Data with Faker: https://faker.readthedocs.io/en/master/&lt;/li&gt;
&lt;li&gt;Faker CSVs: https://github.com/pereorga/csvfaker&lt;/li&gt;
&lt;li&gt;Watch &lt;a href="https://www.youtube.com/watch?v=FxSsnHeWQBY"&gt;Ned Batchelder’s testing talk&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Continuous Integration: &lt;a href="https://travis-ci.org/"&gt;TravisCI&lt;/a&gt;, &lt;a href="https://jenkins.io/"&gt;Jenkins&lt;/a&gt;, &lt;a href="https://www.jetbrains.com/teamcity/"&gt;TeamCity&lt;/a&gt; and many more&lt;/li&gt;
&lt;li&gt;Better Code Reviews: http://www.bettercode.reviews/&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="testing-pipelines"&gt;Testing Pipelines&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="http://blog.cloudera.com/blog/2015/07/how-to-do-data-quality-checks-using-apache-spark-dataframes/"&gt;Data Quality Checks with Spark DataFrames&lt;/a&gt;  &lt;/li&gt;
&lt;li&gt;Drunken Data Quality (Spark DF): https://github.com/FRosner/drunken-data-quality&lt;/li&gt;
&lt;li&gt;Apache Beam: https://beam.apache.org/documentation/pipelines/test-your-pipeline/&lt;/li&gt;
&lt;li&gt;Tip: Check your framework first!&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="open-datasets-to-try-out-your-skills"&gt;Open Datasets (to try out your skills!)&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.kaggle.com/datasets"&gt;Kaggle Datasets&lt;/a&gt;: beyond just competition data, Kaggle also has shared datasets curated by users.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/caesar0301/awesome-public-datasets"&gt;Awesome Datasets GitHub List&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.quora.com/Where-can-I-find-large-datasets-open-to-the-public"&gt;Quora: Where can I find large public datasets?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://scikit-learn.org/stable/datasets/index.html"&gt;Scikit-learn datasets&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.dataquest.io/blog/free-datasets-for-projects/"&gt;Dataquest.io: 17 places to find open datasets for projects&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.nltk.org/data.html"&gt;NLTK Data&lt;/a&gt;: NLP data such as books, scripts, articles and poems&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="research"&gt;Research&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="http://sirrice.github.io/files/papers/cleaning-hilda16.pdf"&gt;Towards Reliable Interactive Data Cleaning: A User Survey and Recommendations, S Krishnan, D Haas, M. J. Franklin, 2016&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.cs.toronto.edu/~mvolkovs/icde14_data_cleaning.pdf"&gt;Continuous Data Cleaning, M. Volkovs, F. Chiang, J. Szlichta, and R. J. Miller. ICDE, 2014&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://activeclean.github.io/"&gt;ActiveClean: Krishnan, Franklin, Goldberg, Wang, Wu, 2016&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cs.uwaterloo.ca/~x4chu/SIGMOD2015_2.pdf"&gt;Katara: A Data Cleaning System Powered by Knowledge Bases and Crowdsourcing, X Chu, Morcos, Ilyas et al. 2015&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://svn.aksw.org/papers/2014/WWW_Databugger/public.pdf"&gt;Test-driven Evaluation of Linked Data Quality, D. Kontokostas, P. Westphal, S. Auer, S. Hellmann, J. Lehmann, R. Cornelissen, and A. Zaveri., 2015&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/pdf/1703.05921.pdf"&gt;Unsupervised Anomaly Detection with Generative Adversarial Networks to Guide Marker Discovery, T Schlegl, P Seeböck, S M. Waldstein, U Schmidt-Erfurth, and G Langs, 2017&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That's all for now! Check back as I plan to update and evolve this list with more libraries and examples.&lt;/p&gt;</content><category term="trainings"></category></entry><entry><title>PyData Amsterdam Keynote on Ethical Machine Learning</title><link href="https://blog.kjamistan.com/pydata-amsterdam-keynote-on-ethical-machine-learning.html" rel="alternate"></link><published>2017-04-07T00:00:00+02:00</published><updated>2017-04-07T00:00:00+02:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2017-04-07:/pydata-amsterdam-keynote-on-ethical-machine-learning.html</id><summary type="html">&lt;p&gt;I was kindly asked by the PyData Amsterdam organizers to keynote the conference. As a passionate fan of ethical machine learning and the great research being done by data scientists and academics around the world -- I am very enthused to present the topic to the conference.&lt;/p&gt;
&lt;p&gt;My slides are currently …&lt;/p&gt;</summary><content type="html">&lt;p&gt;I was kindly asked by the PyData Amsterdam organizers to keynote the conference. As a passionate fan of ethical machine learning and the great research being done by data scientists and academics around the world -- I am very enthused to present the topic to the conference.&lt;/p&gt;
&lt;p&gt;My slides are currently available as &lt;a href="https://github.com/kjam/random_hackery/tree/master/talks"&gt;a jupyter notebook via GitHub&lt;/a&gt; and I will post them in an easier-to-browse format soon. I will be adding the video as well as several extra posts regarding the research and findings here.&lt;/p&gt;
&lt;p&gt;I would especially like to thank &lt;a href="https://github.com/mattilyra"&gt;Matti Lyra&lt;/a&gt; for his help and suggestions in crafting this talk. I would also like to thank &lt;a href="http://francoiseprovencher.weebly.com/blog"&gt;Françoise Provencher&lt;/a&gt; for pointing me to some of the great resources.&lt;/p&gt;
&lt;h2 id="talk-and-slide-references"&gt;Talk and Slide References&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.propublica.org/article/minority-neighborhoods-higher-car-insurance-premiums-white-areas-same-risk"&gt;Minority Areas Pay Higher Car Insurance than White Areas with the Same Risk&lt;/a&gt; by ProPublica&lt;/li&gt;
&lt;li&gt;&lt;a href="https://deardesignstudent.com/ethics-cant-be-a-side-hustle-b9e78c090aee"&gt;Ethics Can't be a Side Hustle&lt;/a&gt; by Mike Monteiro&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/1605.06083"&gt;Stereotyping and Bias in the Flickr30 Dataset&lt;/a&gt; by Emiel van Miltenburg\n- &lt;a href="https://www.cnet.com/news/why-facebook-is-giving-out-free-wi-fi-for-check-ins/"&gt;Why Facebook is giving out free Wi-Fi for check-ins&lt;/a&gt; by CNet&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.formisimo.com/blog/do-not-untick-this-box-if-you-do-not-want-to-not-receive-updates/"&gt;Do not untick this box if you do not want to receive updates&lt;/a&gt; by Formismo&lt;/li&gt;
&lt;li&gt;&lt;a href="https://media.ccc.de/v/32c3-7482-say_hi_to_your_new_boss_how_algorithms_might_soon_control_our_lives"&gt;Say hi to your new boss: How algorithms might soon control our lives&lt;/a&gt; and &lt;a href="https://github.com/adewes/32c3"&gt;GitHub Repo&lt;/a&gt; by Andreas Dewes&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.princeton.edu/~aylinc/papers/caliskan-islam_semantics.pdf"&gt;Semantics derived automatically from language corpora necessarily contain human biases&lt;/a&gt; by Aylin Caliskan-Islam, Joanna J. Bryson, and Arvind Narayanan (Related 33c3 talk by Aylin Caliskan-Islam: &lt;a href="https://www.youtube.com/watch?v=j7FwpZB1hWc"&gt;Story of discrimination and unfairness&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing"&gt;Machine Bias&lt;/a&gt; by ProPublica&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/@AbeGong/ethics-for-powerful-algorithms-1-of-3-a060054efd84"&gt;Ethics for powerful algorithms&lt;/a&gt; by Abe Gong&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/1412.3756"&gt;Certifying and removing disparate impact&lt;/a&gt; by Michael Feldman, Sorelle A. Friedler, John Moeller, Carlos Scheidegger, Suresh Venkatasubramanian&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/1510.02377"&gt;FairTest: Discovering Unwarranted Associations in Data-Driven Applications&lt;/a&gt; and &lt;a href="https://github.com/columbia/fairtest"&gt;Github Repository&lt;/a&gt; by Florian Tramèr, Vaggelis Atlidakis, Roxana Geambasu, Daniel Hsu, Jean-Pierre Hubaux, Mathias Humbert, Ari Juels, Huang Lin&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=MqoRzNhrTnQ"&gt;When Recommendation Systems Go Bad&lt;/a&gt; by Evan Estola&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/1610.02413"&gt;Equality of Opportunity in Supervised Learning&lt;/a&gt; with &lt;a href="https://research.google.com/bigpicture/attacking-discrimination-in-ml/"&gt;interactive data visualization with generated loan data&lt;/a&gt; by Moritz Hardt, Eric Price, Nathan Srebro (interactive by Google BigPicture)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.oreilly.com/ideas/ideas-on-interpreting-machine-learning"&gt;Ideas on Interpreting Machine Learning&lt;/a&gt; by Patrick Hall, Wen Phan and SriSatish Ambati&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.uio.no/studier/emner/sv/oekonomi/ECON4135/h09/undervisningsmateriale/FinancialModelersManifesto.pdf"&gt;Financial Modeler's Manifesto&lt;/a&gt; by Emanuel Derman and Paul Wilmott&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="recommended-reading-related-work"&gt;Recommended Reading &amp;amp; Related Work&lt;/h2&gt;
&lt;p&gt;In addition to the papers I was able to reference in the slides, I have appended here some recommended reading on the topic of Ethics in Machine Learning. Expect this list to expand over time :)&lt;/p&gt;
&lt;h3 id="conferences"&gt;Conferences&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="http://www.fatml.org/"&gt;Fairness, Accountability, and Transparency in Machine Learning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://ethicsinnlp.org/"&gt;Ethics in NLP&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="blogs-and-publications"&gt;Blogs and Publications&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="http://unbias.wp.horizon.ac.uk/"&gt;UnBias: Emancipating Users Against Algorithmic Biases for a Trusted Digital Economy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://joanna-bryson.blogspot.nl/"&gt;Joanna J Bryson's Blog (and entire CV)&lt;/a&gt;. You can also [follow her on Twitter].(https://twitter.com/j2bryson)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://nlpers.blogspot.nl/"&gt;Hal Daumé III's NLP Blog&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://algorithmicfairness.wordpress.com/"&gt;Algorithmic Fairness by Suresh Venkat&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://randomwalker.info/"&gt;Arvind Narayanan's work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;I also wrote a post about &lt;a href="https://blog.kjamistan.com/embedded-isms-in-vector-based-natural-language-processing.html"&gt;embedded racism and sexism in word vectors&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="books"&gt;Books&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://weaponsofmathdestructionbook.com/"&gt;Weapons of Math Destruction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.emanuelderman.com/books/models-behaving-badly"&gt;Models.Behaving.Badly&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="articles"&gt;Articles&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://hbr.org/2016/12/a-guide-to-solving-social-problems-with-machine-learning"&gt;A Guide to Solving Social Problems with Machine Learning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.oreilly.com/ideas/the-ethics-of-artificial-intelligence"&gt;The Ethics of Artificial Intelligence&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.ncjrs.gov/pdffiles1/nij/240696.pdf"&gt;Predicting Recidivism Risk: New Tool in Philadelphia Shows Great Promise&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="podcasts"&gt;Podcasts&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="http://machine-ethics.net/"&gt;The Machine Ethics Podcast&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://ethicalmachines.com/"&gt;Ethical Machines&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="news-on-ethical-machine-learning-and-models-gone-bad"&gt;News on Ethical Machine Learning and Models Gone Bad&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.theguardian.com/us-news/2016/dec/18/michigan-unemployment-agency-fraud-accusations"&gt;Michigan unemployment agency made 20,000 false fraud accusations&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.theverge.com/2016/10/11/13243890/facebook-twitter-instagram-police-surveillance-geofeedia-api"&gt;Facebook, Twitter, and Instagram surveillance tool was used to arrest Baltimore protesters&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.autoblog.com/2011/05/31/women-voice-command-systems/"&gt;Many Cars Tone Deaf To Women's Voices&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.washingtonpost.com/news/wonk/wp/2016/03/10/uber-seems-to-offer-better-service-in-areas-with-more-white-people-that-raises-some-tough-questions/"&gt;Uber seems to offer better service in areas with more white people. That raises some tough questions.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.theguardian.com/technology/2016/sep/08/artificial-intelligence-beauty-contest-doesnt-like-black-people"&gt;A beauty contest was judged by AI and the robots didn't like dark skin&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.bloomberg.com/graphics/2016-amazon-same-day/"&gt;Amazon Doesn’t Consider the Race of Its Customers. Should It?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.forbes.com/sites/ciocentral/2016/12/21/on-the-ethical-use-of-data-vs-the-internet-of-things/2/#1d18691f1d18"&gt;On The Ethical Use Of Data Vs. The Internet Of Things&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://hbr.org/2016/12/research-how-subtle-class-cues-can-backfire-on-your-resume"&gt;How Subtle Class Cues can Backfire on Your Resumé&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.fastcompany.com/3067285/the-future-of-work/can-artificial-intelligence-wipe-unconscious-bias-from-your-workday"&gt;Can Artificial Intelligence Wipe Unconscious Bias From Your Workday?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.oreilly.com/ideas/on-computational-ethics"&gt;On Computational Ethics&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.weforum.org/agenda/2017/02/ai-learned-to-betray-others-heres-why-thats-okay"&gt;AI learned to betray others. Here's why that's okay&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.nytimes.com/2012/08/09/opinion/after-knight-capital-new-code-for-trades.html"&gt;Errant code? Not just a bug.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://psmag.com/artificial-intelligence-will-be-as-biased-and-prejudiced-as-its-human-creators-38fe415f86dd"&gt;Artificial Intelligence Will Be as Biased and Prejudiced as Its Human Creators&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://motherboard.vice.com/en_us/article/ai-can-learn-values-from-reading"&gt;If We Don’t Want AI to Be Evil, We Should Teach It to Read&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.seattletimes.com/business/microsoft/how-linkedins-search-engine-may-reflect-a-bias/"&gt;How LinkedIn’s search engine may reflect a gender bias&lt;/a&gt; by Matt Day, Seattle Times (and follow up by Samanta Cooney: &lt;a href="http://motto.time.com/4484530/linkedin-gender-bias-search/"&gt;LinkedIn Tweaks Search Algorithm After Report Suggests Gender Bias&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;</content><category term="conferences"></category></entry><entry><title>Ten Tips for First-Time Conference Speakers</title><link href="https://blog.kjamistan.com/ten-tips-for-first-time-conference-speakers.html" rel="alternate"></link><published>2017-02-11T00:00:00+01:00</published><updated>2017-02-11T00:00:00+01:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2017-02-11:/ten-tips-for-first-time-conference-speakers.html</id><summary type="html">&lt;p&gt;The saddest moment for me at conferences is when I'm in the middle of an interesting conversation with a bright person and I ask her when her talk is and she says, "Who me?"&lt;/p&gt;
&lt;p&gt;The number of folks I speak with every year at conferences who have amazing stories to …&lt;/p&gt;</summary><content type="html">&lt;p&gt;The saddest moment for me at conferences is when I'm in the middle of an interesting conversation with a bright person and I ask her when her talk is and she says, "Who me?"&lt;/p&gt;
&lt;p&gt;The number of folks I speak with every year at conferences who have amazing stories to share and who are working on great datasets and tools is astounding. I often feel overwhelmed by being the most average person in the room.&lt;/p&gt;
&lt;p&gt;That said, I think one way we can help increase diversity of ideas and culture in our community is to encourage and support first-time speakers. And I strongly believe a more diverse community benefits us all by creating more opportunities, increased inventiveness and fresh perspectives on the important topics and problems we face.&lt;/p&gt;
&lt;h2 id="why-speak-at-conferences-what-good-is-it"&gt;Why speak at conferences? What good is it?&lt;/h2&gt;
&lt;p&gt;Besides being good practice for management roles or other roles where public speaking is important, conferences give you an opportunity to share your work and knowledge and engage with others who you might not have met organically. I find the conversations I enjoy after giving a talk inspire new ideas and research for me and often teach me just as much as I learned in preparation for the talk.&lt;/p&gt;
&lt;h2 id="but-i-dont-like-public-speaking"&gt;But I don't like public speaking...&lt;/h2&gt;
&lt;p&gt;Honestly, that's fine. If you tried it once and you hated it, okay. You could always submit a panel, perhaps? Or give a tutorial? (I know, I'm trying too hard). However, if you haven't tried public speaking outside of the time you were in a play in grade school, &lt;em&gt;please&lt;/em&gt; give it a second chance.&lt;/p&gt;
&lt;h2 id="fine-where-are-my-tips"&gt;Fine, where are my tips?&lt;/h2&gt;
&lt;p&gt;Me RN&lt;/p&gt;
&lt;p&gt;&lt;img alt="excite!!" src="http://i.amz.mshcdn.com/ZONonowm38Eyu2i_CgFPnMISon0=/fit-in/1200x9600/http%3A%2F%2Fmashable.com%2Fwp-content%2Fuploads%2F2013%2F07%2Fexcited-baby.gif"&gt;&lt;/p&gt;
&lt;p&gt;Image: &lt;a href="http://maxafax.tumblr.com/"&gt;maxafax&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Here goes!&lt;/p&gt;
&lt;h3 id="1-talk-about-something-you-love"&gt;1. Talk about something you love.&lt;/h3&gt;
&lt;p&gt;The best talks are ones where the presenter is passionate and interested in the topic. You will likely put in hours practicing and rehashing ideas for your talk. You might need to research or test your hypothesis or implementation. Let it be a joy for both you and your audience.&lt;/p&gt;
&lt;h3 id="2-avoid-writers-block"&gt;2. Avoid writer's block.&lt;/h3&gt;
&lt;p&gt;Try techniques writers use! Get a notebook and write down every topic that comes to mind. Write down everything you know about every topic. Get on the phone and talk about it with your boss, coworker, friend, mother. Write down any and all ideas that come from those conversations. Read books or listen to talks or podcasts on the topic and write more notes on them. Reread your notes and code and/or data and repeat the above process until you have way too many words and not enough time to fit them.&lt;/p&gt;
&lt;h3 id="3-dont-be-afraid-to-engage-mentors-and-experts"&gt;3. Don't be afraid to engage mentors and experts.&lt;/h3&gt;
&lt;p&gt;Is there someone whose talk, library, career or accomplishments helped inspire your idea? Even if they might seem too busy or famous -- it’s likely they will be flattered and interested in speaking with you. Reach out and see what happens -- you could be pleasantly surprised!&lt;/p&gt;
&lt;h3 id="4-have-stage-fright-co-present"&gt;4. Have stage fright? Co-present!&lt;/h3&gt;
&lt;p&gt;If you are someone who is &lt;em&gt;truly&lt;/em&gt; terrified of speaking in front of groups, the best practice is to co-present. If for some reason you freeze, you have a partner to take over! And with practice, I honestly believe you can overcome your fear. You can also propose a panel and help moderate so the limelight is not focused on you and you have the opportunity to introduce and interview experts you collect for the topic.&lt;/p&gt;
&lt;h3 id="5-dont-worry-about-knowing-everything-be-prepared-to-learn"&gt;5. Don’t worry about knowing everything; be prepared to learn.&lt;/h3&gt;
&lt;p&gt;You won’t know everything about your topic. Be willing to learn and ask lots of questions. Be willing to be humbled by the knowledge of your attendees. Be willing to thank people for sharing knowledge with you. Be willing to admit you don’t know an answer, but be willing to help find it. (Side note: Don’t be afraid to get technical and dig deep!!)&lt;/p&gt;
&lt;h3 id="6-practice"&gt;6. Practice.&lt;/h3&gt;
&lt;p&gt;Practice your timing and your slide presentation. Practice in front of (every|any)one. Practice in front of your cat. Practice in front of your boss. Practice in front of a local meetup group. Practice in your sleep. Practice on Snapchat. Practice in front of a mirror. Basically, practice until you are saying similar enough things every time, you stop reading your notes, and the timing and talk progression are second nature.&lt;/p&gt;
&lt;h3 id="7-on-the-day-of-the-talk-get-rest-eat-breakfast-dont-look-at-your-slides"&gt;7. On the day of the talk, get rest, eat breakfast, don’t look at your slides.&lt;/h3&gt;
&lt;p&gt;By now you've practiced so much you could do it in your sleep. Give your mind a break. Make yourself a nice cup of tea or a latte. Get a good night’s rest. Meditate or watch a fun movie or do some (non-coding) reading or writing. You’ll be fine! In fact, you’ll be great! Time to just relax and enjoy your upcoming speech with ease.&lt;/p&gt;
&lt;h3 id="8-take-a-deep-breath-smile-stare-at-one-person-walk-around"&gt;8. Take a deep breath. Smile. Stare at one person. Walk around.&lt;/h3&gt;
&lt;p&gt;As you're giving your talk, remember to breathe! I like to take a deep breath and smile as I get on stage. Even if you don’t feel like smiling, it helps! If you get nervous about the crowd size, find a few friendly faces (or one or two you don’t know) and focus on those. Walk around the stage while you talk to ease your nerves and engage your audience.&lt;/p&gt;
&lt;h3 id="9-dont-take-yourself-or-your-talk-too-seriously"&gt;9. Don’t take yourself or your talk too seriously.&lt;/h3&gt;
&lt;p&gt;You are not a brain surgeon. If your talk completely flops, no one is going to die. If your slides freeze up, the world will continue turning. If you mispronounce someone’s name or you forget to mention a library, no one is going to put you in time-out. It’s OK to mess up and it doesn’t mean you’re not a smart cookie.&lt;/p&gt;
&lt;h3 id="10-ask-for-listen-to-and-learn-from-feedback"&gt;10. Ask for, listen to, and learn from feedback.&lt;/h3&gt;
&lt;p&gt;Feedback, both in the form of any written reviews as well as people on Twitter or folks who come up to speak to you later, is great! There will always be haters; try not to focus on reviews that say nothing constructive. Ask for feedback from mentors and colleagues who were there. Take both positive and negative feedback to heart and use it to make your &lt;em&gt;next&lt;/em&gt; talk even better.&lt;/p&gt;
&lt;p&gt;If you've made it this far: 👯 🎉 🙌 I hope to see you at an upcoming conference! In case you need ideas for where to present, I help organize &lt;a href="http://pydata.org/berlin2017/"&gt;the PyData Berlin Conference&lt;/a&gt;, which is guaranteed to be absolutely fabulous and is happening July 1-2, 2017. The PyData Berlin committee will also be organizing some local workshops to encourage first-time speakers and some mentorship opportunities -- so feel free to reach out for more information (forms and links for these will also be added to the website soon).&lt;/p&gt;
&lt;p&gt;Now that you are inspired, I recommend getting started on your proposal. For some further advice on writing a great proposal, I can recommend:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://us.pycon.org/2017/speaking/talks/"&gt;PyConUS Advice for Talk Proposals&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.noelrappin.com/railsrx/2014/3/17/what-i-learned-from-reading-429-conference-proposals"&gt;Noel Rappin: What I Learned from Reading 429 Conference Proposals&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.sarahmei.com/blog/2014/04/07/what-your-conference-proposal-is-missing/"&gt;Sarah Mei: What your Conference Proposal is Missing&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Look forward to seeing you speak up! 👏&lt;/p&gt;</content><category term="conferences"></category></entry><entry><title>The Practice of Programming: 18 Years Later</title><link href="https://blog.kjamistan.com/the-practice-of-programming-18-years-later.html" rel="alternate"></link><published>2017-01-20T00:00:00+01:00</published><updated>2017-01-20T00:00:00+01:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2017-01-20:/the-practice-of-programming-18-years-later.html</id><summary type="html">&lt;p&gt;Over the new year holiday time I had a chance to get away from it all, and snuck up to Finland to sit in a lodge on the Gulf of Finland, sip coffee, take saunas and read. I brought along a few books, the only programming one being &lt;a href="http://www.cs.princeton.edu/~bwk/tpop.webpage/"&gt;Brian W …&lt;/a&gt;&lt;/p&gt;</summary><content type="html">&lt;p&gt;Over the new year holiday time I had a chance to get away from it all, and snuck up to Finland to sit in a lodge on the Gulf of Finland, sip coffee, take saunas and read. I brought along a few books, the only programming one being &lt;a href="http://www.cs.princeton.edu/~bwk/tpop.webpage/"&gt;Brian W. Kernighan and Rob Pike's "The Practice of Programming."&lt;/a&gt;&lt;/p&gt;
&lt;blockquote class="twitter-tweet" data-lang="en"&gt;&lt;p lang="en" dir="ltr"&gt;Cabin: woke up like this. 😂 😍 &lt;a href="https://t.co/spr130gFzR"&gt;pic.twitter.com/spr130gFzR&lt;/a&gt;&lt;/p&gt;&amp;mdash; katharine jarmul (@kjam) &lt;a href="https://twitter.com/kjam/status/816206196591984640"&gt;January 3, 2017&lt;/a&gt;&lt;/blockquote&gt;

&lt;script async src="//platform.twitter.com/widgets.js" charset="utf-8"&gt;&lt;/script&gt;

&lt;p&gt;I received the book as a loan from a long-time mentor, who helped me first learn how to write production-ready code. I remember reading it in 2008 and having difficulty understanding all the concepts. As I moved from city to city, I always thought I should probably mail it back, or perhaps read it again &lt;em&gt;first&lt;/em&gt;, then mail it back...&lt;/p&gt;
&lt;h2 id="practice-of-programming-the-book"&gt;Practice of Programming: The Book&lt;/h2&gt;
&lt;p&gt;The book is 18 years old. It covers C programming. It handles issues like signed versus unsigned integers, piping data between systems with mismatched byte order and a few other topics that affect neither my programming nor that of most folks I know. Why reread it?&lt;/p&gt;
&lt;p&gt;Brian W. Kernighan and Rob Pike should need no introduction, but here is one in case you are like me and getting older and dependent on Google. Kernighan is a contributor to the C programming language and co-author of the famous book, &lt;a href="https://en.wikipedia.org/wiki/The_C_Programming_Language"&gt;"The C Programming Language"&lt;/a&gt;. He worked at &lt;a href="http://www.theverge.com/2012/3/21/2887206/jon-gertner-idea-factory-bell-labs-great-american-age-innovation-book-review"&gt;Bell Labs&lt;/a&gt; with Rob Pike, famous in his own right for developing numerous parts of the Unix system we all know and love today; and the whole &lt;a href="https://github.com/golang/go/graphs/contributors"&gt;Go language thing...&lt;/a&gt; #nbd.&lt;/p&gt;
&lt;p&gt;What gems still held my attention, 18 years after they were published and nearly 9 years after I was first handed the book? Many more than you might think; here are a few:&lt;/p&gt;
&lt;h4 id="debugging"&gt;Debugging&lt;/h4&gt;
&lt;p&gt;Chapter 5 is devoted solely to debugging and has many informative sections, including tips on finding patterns, rubber ducking (but with &lt;a href="https://discourse.codinghorror.com/t/rubber-duck-problem-solving/67/32"&gt;a teddy bear instead&lt;/a&gt;), analyzing data to help find programming bugs, and how to solve "non-reproducible" errors. The section that is truly timeless is &lt;em&gt;5.7 Other People's Bugs&lt;/em&gt;, which valiantly takes on how to find, manage and report other programmers' errors.&lt;/p&gt;
&lt;p&gt;Including this tidbit:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;If you think that you have found a bug in someone else's program, the first step is to make absolutely sure it is a genuine bug, so you don't waste the author's time and lose your own credibility.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;As someone who has written and helped fix many bugs, this resonated, especially since the standard today seems to be to simply open a GitHub issue and let the author(s) and contributors figure it out. If most of us spent an extra day debugging the issue, we might even fix it ourselves (we have the source code), or at least present a well-proven test case to help alleviate the burden on open-source maintainers.&lt;/p&gt;
&lt;p&gt;In that vein, Kernighan and Pike write:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Finally, put yourself in the shoes of the person who receives your report. You want to provide the owner with as good a test case as you can manage. It's not very helpful if the bug can be demonstrated only with large inputs, or an elaborate environment, or multiple supporting files. Strip the test down to a minimal and self-contained case. Include other information that could possibly be relevant, like the version of the program itself, and of the compiler, operating system and hardware.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I feel like a checklist of these points should be required before submitting bug reports. A kind of &lt;a href="https://www.joelonsoftware.com/2000/08/09/the-joel-test-12-steps-to-better-code/"&gt;Joel Test&lt;/a&gt; for error reporting.&lt;/p&gt;
&lt;p&gt;On the topic of errors, the authors also reference Donald Knuth's &lt;a href="http://onlinelibrary.wiley.com/doi/10.1002/spe.4380190702/abstract"&gt;The Errors of TeX&lt;/a&gt;, which deserves its own separate treatment (or post).&lt;sup id="fnref:1"&gt;&lt;a class="footnote-ref" href="#fn:1"&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;h4 id="testing"&gt;Testing&lt;/h4&gt;
&lt;p&gt;Chapter 6 is devoted to testing. As &lt;a href="https://www.safaribooksonline.com/library/view/building-data-pipelines/9781491970270/video289850.html"&gt;a fan of testing (even for your data!)&lt;/a&gt;, this chapter stood out, not just for its methodical evaluation of how, when and why to write tests, but also for its use of data validation (!!) and test automation (!!!). The fact that good developers still have to explain why these types of tests belong in their test suite (or convince managers and higher-ups that the tests are even necessary) is a sad and telling reflection of our priorities and our (non)adherence to lessons learned long ago.&lt;/p&gt;
&lt;p&gt;I especially liked this passage:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;It is important to test your own code: don't assume that some testing organization or user will find things for you. But it's easy to delude yourself about how carefully you are testing, so try to ignore the code and think of the hard cases, not the easy ones. To quote Don Knuth describing how he creates tests for the TEX formatter, "I get into the meanest, nastiest frame of mind that I can manage, and I write the nastiest [testing] code I can think of; then I turn around and embed that in even nastier constructions that are almost obscene."&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I literally spit my coffee out when reading this bit, imagining the coders of the world finding their worst selves and attacking their product with vigor and malice. But it &lt;em&gt;IS&lt;/em&gt; great advice. How many times have I written the obvious test instead of devoting a day or a few hours figuring out how to break my own code? &lt;sup id="fnref:2"&gt;&lt;a class="footnote-ref" href="#fn:2"&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
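&lt;p&gt;If you want to channel that nasty frame of mind without hand-writing every edge case, property-based testing is one way to do it. Here is a minimal, purely illustrative sketch using &lt;a href="https://hypothesis.readthedocs.io/en/latest/"&gt;Hypothesis&lt;/a&gt; (the &lt;code&gt;round_trip&lt;/code&gt; function and the property are invented for this example, not from the book):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;code&gt;# Illustrative sketch: let Hypothesis hunt for nasty inputs so we don't
# have to imagine them all ourselves. round_trip is a made-up example function.
from hypothesis import given, strategies as st


def round_trip(text):
    return text.encode("utf-8").decode("utf-8")


@given(st.text())
def test_round_trip_returns_original(text):
    # Hypothesis generates empty strings, emoji, control characters and
    # very long strings -- the "meanest, nastiest" cases it can find.
    assert round_trip(text) == text
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Run under pytest, a failing property gets shrunk to a minimal counterexample, which pairs nicely with the book's advice about minimal, self-contained test cases.&lt;/p&gt;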
&lt;h4 id="portability"&gt;Portability&lt;/h4&gt;
&lt;p&gt;The final chapter that struck me as still very much applicable today was Chapter 8 on Portability. This was a surprise, as I assumed the portability issues in 1999 didn't reflect any I might have seen as a developer. Grrllll, was I wrong...&lt;/p&gt;
&lt;p&gt;I can't even begin to explain my joy and amusement at turning the page and reading this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;8.8 Internationalization&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;If one lives in the United States, it's easy to forget that English is not the only language, ASCII not the only character set, $ not the only currency symbol, dates can be written with the day first, times can be based on a 24-hour clock, and so on.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The data errors, report misunderstandings and general grief I have seen in my career due to these misconceptions (sometimes my own, of course) are too numerous for me to recount. Additionally, the fact that we still debate the &lt;em&gt;need&lt;/em&gt; for internationalization of smaller tools or even our own websites is interesting to note, given an 18-year-old book outlining internationalization as a requirement.&lt;/p&gt;
&lt;p&gt;Beyond internationalization, Kernighan and Pike touch upon portability for different environments, and elaborate on the pitfalls of massive if/else or switch statements in compilers or setup configuration files. Their warning against modifying source for one particular install was succinct and useful:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;When you modify a program to adapt to a new environment, don't begin by making a copy of the entire program. Instead, adapt the existing source. You will probably need to make changes to the main body of the code, and if you edit a copy, before long you will have divergent versions. As much as possible, there should only be a single source for a program; if you find you need to change something to port to a particular environment, find a way to make the change work everywhere.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Finally, something I think we have caught up to (although should still remember)! Version control, generalization (when useful) and open-source libraries eating the world. Hooray us!&lt;/p&gt;
&lt;h4 id="other-fun-to-me-notes"&gt;Other fun (to me) notes&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;An entire section on self-generating code and ideas for better code written by machines.&lt;/li&gt;
&lt;li&gt;Seeing &lt;code&gt;print("%s", str)&lt;/code&gt; and doing a double-take to make sure I was not reading Python.&lt;/li&gt;
&lt;li&gt;A paragraph outlining (very politely) how ridiculous it is that we still need to support carriage returns (&lt;code&gt;\r&lt;/code&gt;) despite the fact that computers have no carriages.&lt;/li&gt;
&lt;li&gt;Learning that "big endian" is a reference to Jonathan Swift's Gulliver's Travels.&lt;/li&gt;
&lt;li&gt;Code to roll your own RegEx parser in C.&lt;/li&gt;
&lt;li&gt;Telnetting from machine to machine to copy files and using checksum (&lt;code&gt;sum&lt;/code&gt;) to test if the copy was properly performed.&lt;/li&gt;
&lt;li&gt;A &lt;em&gt;still&lt;/em&gt; semi-functional TCL and Perl script to scrape the web. See footnote for the code.&lt;sup id="fnref:3"&gt;&lt;a class="footnote-ref" href="#fn:3"&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/li&gt;
&lt;li&gt;Checking your email with grep&lt;blockquote&gt;
&lt;p&gt;Where did I save that mail from Bob?
&lt;code&gt;% grep '^From:.* bob@' mail/*&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="in-conclusion"&gt;In Conclusion&lt;/h4&gt;
&lt;p&gt;Granted, some of the content in this book was merely fun review for me and several themes are problems of a different era, but I found it remarkably relevant given its age. We often talk about books even a year old as outdated, but this made me reconsider how easy it is to treat every new thing as just that: NEW. More often than not, it's the same programming paradigms the folks at Bell Labs have been working on since the '80s.&lt;/p&gt;
&lt;p&gt;Moral of the story: Never too old to (re)read a good book.&lt;/p&gt;
&lt;p&gt;Oh, and, &lt;a href="https://twitter.com/ryanjoneil"&gt;Ryan&lt;/a&gt;... I'm sending your book back! Thanks for the loan! 😇&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;Debating doing a series on some of these older but still relevant texts. If this post is interesting to you, please let me know!&amp;#160;&lt;a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;This is a good point to remind you how much a tool like &lt;a href="https://hypothesis.readthedocs.io/en/latest/"&gt;Hypothesis&lt;/a&gt; can help you find those nasty corners of your code that you may or may not be able to reach.&amp;#160;&lt;a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:3"&gt;
&lt;p&gt;&lt;a href="https://gist.github.com/kjam/e31d4e50d9d5b50ca9337e3b677d20fa"&gt;Check out the unmodified 18-year old code as a Gist&lt;/a&gt;. Exact usage from book is to run as so: geturl.tcl $1 | unhtml.pl | fmt.awk. I couldn't get piping to work with my current setup, but the scripts still worked using tclsh and perl as a series of commands (granted most sites reject or don't respond to HTTP/1.0 requests without headers anymore... 😏)&amp;#160;&lt;a class="footnote-backref" href="#fnref:3" title="Jump back to footnote 3 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="programming"></category></entry><entry><title>New O'Reilly Video Training: Data Pipelines with Python</title><link href="https://blog.kjamistan.com/new-oreilly-video-training-data-pipelines-with-python.html" rel="alternate"></link><published>2016-12-13T00:00:00+01:00</published><updated>2016-12-13T00:00:00+01:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2016-12-13:/new-oreilly-video-training-data-pipelines-with-python.html</id><summary type="html">&lt;p&gt;I'm really excited to announce a new &lt;a href="http://shop.oreilly.com/product/0636920055334.do"&gt;Python video course with O'Reilly on data pipelines&lt;/a&gt;. If you are interested in learning some of the popular options available for workflow automation and management in Python, take a look!&lt;/p&gt;
&lt;p&gt;In the course, I cover:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Using &lt;a href="http://www.celeryproject.org/"&gt;Celery&lt;/a&gt; for simple automation&lt;/li&gt;
&lt;li&gt;Setting up &lt;a href="https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html"&gt;Hadoop …&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;</summary><content type="html">&lt;p&gt;I'm really excited to announce a new &lt;a href="http://shop.oreilly.com/product/0636920055334.do"&gt;Python video course with O'Reilly on data pipelines&lt;/a&gt;. If you are interested in learning some of the popular options available for workflow automation and management in Python, take a look!&lt;/p&gt;
&lt;p&gt;In the course, I cover:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Using &lt;a href="http://www.celeryproject.org/"&gt;Celery&lt;/a&gt; for simple automation&lt;/li&gt;
&lt;li&gt;Setting up &lt;a href="https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html"&gt;Hadoop for file storage&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Comparing tools like &lt;a href="https://airflow.incubator.apache.org/"&gt;Airflow&lt;/a&gt; and &lt;a href="http://luigi.readthedocs.io/en/stable/"&gt;Luigi&lt;/a&gt; for your pipeline needs&lt;/li&gt;
&lt;li&gt;How to parallelize data processing with &lt;a href="http://dask.pydata.org/en/latest/"&gt;Dask&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;A brief look at other popular tools like &lt;a href="http://spark.apache.org/"&gt;Apache Spark&lt;/a&gt; and &lt;a href="https://channels.readthedocs.io/en/stable/"&gt;Django Channels&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;More general and broad concepts like testing, DAGs, producers, consumers and how to be a not-awful systems caretaker.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;There is also a &lt;a href="https://github.com/kjam/data-pipelines-course"&gt;public repository available&lt;/a&gt; which covers the code and tools used.&lt;/p&gt;
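&lt;p&gt;To give a flavor of the kind of tooling the course compares, here is a minimal Luigi sketch of my own (illustrative only, not taken from the course materials): one task writes a file, a second task depends on it, and Luigi wires up the dependency graph.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;code&gt;# Illustrative Luigi sketch (not from the course): a two-task pipeline where
# the second task declares a dependency on the first via requires().
import luigi


class ExtractNumbers(luigi.Task):
    def output(self):
        return luigi.LocalTarget("numbers.txt")

    def run(self):
        with self.output().open("w") as f:
            f.write("\n".join(str(i) for i in range(10)))


class SumNumbers(luigi.Task):
    def requires(self):
        return ExtractNumbers()

    def output(self):
        return luigi.LocalTarget("total.txt")

    def run(self):
        with self.input().open() as f:
            total = sum(int(line) for line in f if line.strip())
        with self.output().open("w") as f:
            f.write(str(total))


if __name__ == "__main__":
    luigi.build([SumNumbers()], local_scheduler=True)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;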
&lt;p&gt;I appreciate any and all feedback from students who are enrolled or have taken the course, so please reach out! :)&lt;/p&gt;</content><category term="trainings"></category></entry><entry><title>DAGs &amp; Dask: How and When to Accelerate your Data Analysis</title><link href="https://blog.kjamistan.com/dags-dask-how-and-when-to-accelerate-your-data-analysis.html" rel="alternate"></link><published>2016-10-29T00:00:00+02:00</published><updated>2016-10-29T00:00:00+02:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2016-10-29:/dags-dask-how-and-when-to-accelerate-your-data-analysis.html</id><summary type="html">&lt;p&gt;I gave a talk about &lt;a href="https://en.wikipedia.org/wiki/Directed_acyclic_graph"&gt;Directed Acyclic Graphs (DAGs)&lt;/a&gt; and &lt;a href="https://github.com/dask"&gt;Dask&lt;/a&gt; at &lt;a href="https://cz.pycon.org/2016/"&gt;PyConCZ 2016&lt;/a&gt;. It was super fun and I had a great time at the conference. If you want to read my slides below, here they are! There will be videos available later, so I'll post the link / video …&lt;/p&gt;</summary><content type="html">&lt;p&gt;I gave a talk about &lt;a href="https://en.wikipedia.org/wiki/Directed_acyclic_graph"&gt;Directed Acyclic Graphs (DAGs)&lt;/a&gt; and &lt;a href="https://github.com/dask"&gt;Dask&lt;/a&gt; at &lt;a href="https://cz.pycon.org/2016/"&gt;PyConCZ 2016&lt;/a&gt;. It was super fun and I had a great time at the conference. If you want to read my slides below, here they are! There will be videos available later, so I'll post the link / video here when I see it.&lt;/p&gt;
&lt;p&gt;The notebooks I used are available on GitHub: &lt;a href="https://github.com/kjam/data-wrangling-pycon/blob/master/books/other-notebooks/"&gt;Fun with Dask Notebooks&lt;/a&gt;. If you have any questions, reach out &lt;a href="https://twitter.com/kjam"&gt;on Twitter&lt;/a&gt;.&lt;/p&gt;
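&lt;p&gt;If you haven't played with Dask before, here is a tiny sketch (mine, not from the talk notebooks) of how &lt;code&gt;dask.delayed&lt;/code&gt; builds up a DAG of lazy tasks and only executes it when you ask:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;code&gt;# Tiny illustration of building a task graph (a DAG) with dask.delayed.
from dask import delayed


def inc(x):
    return x + 1


def add(x, y):
    return x + y


# Nothing runs yet; each call just adds a node to the graph.
a = delayed(inc)(1)
b = delayed(inc)(2)
total = delayed(add)(a, b)

# compute() walks the DAG and runs it, parallelizing independent branches.
print(total.compute())  # prints 5
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;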
&lt;iframe src="https://docs.google.com/presentation/d/1a4hsRoTWVRNTuQNcb_bY66bKikpwQJordjBx-_qs-FY/embed?start=false&amp;loop=false&amp;delayms=3000" frameborder="0" width="960" height="569" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"&gt;&lt;/iframe&gt;</content><category term="conferences"></category></entry><entry><title>Introduction to Data Wrangling @ PyConCZ</title><link href="https://blog.kjamistan.com/introduction-to-data-wrangling-pyconcz.html" rel="alternate"></link><published>2016-10-29T00:00:00+02:00</published><updated>2016-10-29T00:00:00+02:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2016-10-29:/introduction-to-data-wrangling-pyconcz.html</id><summary type="html">&lt;p&gt;&lt;a href="https://cz.pycon.org/2016/"&gt;PyConCZ 2016&lt;/a&gt; was such a fun conference! First off, it was the first time I got to see &lt;a href="https://twitter.com/JackieKazil"&gt;Jackie Kazil&lt;/a&gt; since we started writing our &lt;a href="http://shop.oreilly.com/product/0636920032861.do"&gt;O'Reilly book Data Wrangling with Python&lt;/a&gt; together, HOORAYYYY!&lt;/p&gt;
&lt;blockquote class="twitter-tweet" data-lang="en"&gt;&lt;p lang="en" dir="ltr"&gt;OMG PYTHONISTAS! &lt;a href="https://twitter.com/JackieKazil"&gt;@JackieKazil&lt;/a&gt; &amp;amp; I are together for the first time since we started the &lt;a href="https://twitter.com/OReillyMedia"&gt;@OReillyMedia&lt;/a&gt; Data Wrangling …&lt;/p&gt;&lt;/blockquote&gt;</summary><content type="html">&lt;p&gt;&lt;a href="https://cz.pycon.org/2016/"&gt;PyConCZ 2016&lt;/a&gt; was such a fun conference! First off, it was the first time I got to see &lt;a href="https://twitter.com/JackieKazil"&gt;Jackie Kazil&lt;/a&gt; since we started writing our &lt;a href="http://shop.oreilly.com/product/0636920032861.do"&gt;O'Reilly book Data Wrangling with Python&lt;/a&gt; together, HOORAYYYY!&lt;/p&gt;
&lt;blockquote class="twitter-tweet" data-lang="en"&gt;&lt;p lang="en" dir="ltr"&gt;OMG PYTHONISTAS! &lt;a href="https://twitter.com/JackieKazil"&gt;@JackieKazil&lt;/a&gt; &amp;amp; I are together for the first time since we started the &lt;a href="https://twitter.com/OReillyMedia"&gt;@OReillyMedia&lt;/a&gt; Data Wrangling with Python book! 🙌 💜 🐍 &lt;a href="https://t.co/1LG3iCspQ3"&gt;pic.twitter.com/1LG3iCspQ3&lt;/a&gt;&lt;/p&gt;&amp;mdash; katharine jarmul (@kjam) &lt;a href="https://twitter.com/kjam/status/792353586328047616"&gt;October 29, 2016&lt;/a&gt;&lt;/blockquote&gt;

&lt;script async src="//platform.twitter.com/widgets.js" charset="utf-8"&gt;&lt;/script&gt;

&lt;p&gt;Secondly, it was super awesome and well organized, so THANK YOU to the organizers!! 🙌 I gave two talks, one about &lt;a href="http://kjamistan.com/dags-dask-how-and-when-to-accelerate-your-data-analysis/"&gt;Dask and parallelized Data Analysis&lt;/a&gt;, and a second one on Introduction to Data Wrangling with Python.&lt;/p&gt;
&lt;p&gt;The notebook I used is available on GitHub: &lt;a href="https://github.com/kjam/data-wrangling-pycon/blob/master/books/other-notebooks/2016%20Election%20FEC%20Data.ipynb"&gt;Data Analysis with Pandas on 2016 US Election Data&lt;/a&gt;. If you have any questions, reach out &lt;a href="https://twitter.com/kjam"&gt;on Twitter&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Enough typing, here are the slides! It was also recorded, so I will post the video of the talk as soon as I see it!&lt;/p&gt;
&lt;iframe src="https://docs.google.com/presentation/d/1-x2b7-P2BCLg0joLxruz4zXk6Nevhk3AKCAc2CCZOPY/embed?start=false&amp;loop=false&amp;delayms=3000" frameborder="0" width="960" height="749" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"&gt;&lt;/iframe&gt;</content><category term="conferences"></category></entry><entry><title>Chatbot Scraper: Europarl Scraper: 24 Languages of Politics, at your fingertips</title><link href="https://blog.kjamistan.com/chatbot-scraper-europarl-scraper-24-languages-of-politics-at-your-fingertips.html" rel="alternate"></link><published>2016-10-20T00:00:00+02:00</published><updated>2016-10-20T00:00:00+02:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2016-10-20:/chatbot-scraper-europarl-scraper-24-languages-of-politics-at-your-fingertips.html</id><summary type="html">&lt;p&gt;I participated in a two-day &lt;a href="http://www.meetup.com/PyData-Berlin/events/232774832/?eventId=232774832"&gt;PyDataBerlin Hackathon event&lt;/a&gt; in early-October and decided to build a scraper for European Parliament. This was after I found the &lt;a href="http://www.statmt.org/europarl/"&gt;Europarl parallel corpus&lt;/a&gt; a bit underwhelming as it is messy and not tagged for party, speakers or topic (this is understandable, as it is primarily …&lt;/p&gt;</summary><content type="html">&lt;p&gt;I participated in a two-day &lt;a href="http://www.meetup.com/PyData-Berlin/events/232774832/?eventId=232774832"&gt;PyDataBerlin Hackathon event&lt;/a&gt; in early-October and decided to build a scraper for European Parliament. This was after I found the &lt;a href="http://www.statmt.org/europarl/"&gt;Europarl parallel corpus&lt;/a&gt; a bit underwhelming as it is messy and not tagged for party, speakers or topic (this is understandable, as it is primarily used as a multilingual training corpus for machine-learning translation models).&lt;/p&gt;
&lt;p&gt;At the hackathon, many folks were working on really interesting projects to analyze bias, framing and different word usage depending on party. Since I know a bit of web scraping, I built &lt;a href="https://github.com/kjam/europarl_scraper"&gt;a scraper for the current European Parliament site&lt;/a&gt;. The data from the scraper is also available via &lt;a href="http://s3.eu-central-1.amazonaws.com/europarlspeeches/"&gt;a public bucket on S3&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;All of the folks involved in the hackathon shared their findings at last night's &lt;a href="http://www.meetup.com/PyData-Berlin/events/234668866/"&gt;PyData Berlin meetup&lt;/a&gt;. It was really interesting! Felix Biessmann, David Batista and Jirka Lewandowski all found correlations between word choices and party. I encourage you to check out their slides!&lt;/p&gt;
&lt;p&gt;I hope we can have another PyData Berlin hackathon soon, and that my data can be useful for further research into political language bias. Although I spent a lot of time in my slides making jokes (I don't have much analysis to present, and talking about web scraping is a bit boring), I do strongly believe that democracy is hard, and the more folks who are "good at data" helping to analyze, keep watch and collaborate with those who understand politics, the better.&lt;/p&gt;
&lt;p&gt;Here are my slides, feel free to reach out if you have questions about the data or if you do anything interesting with it! 🙌&lt;/p&gt;
&lt;iframe src="https://docs.google.com/presentation/d/1MZHdgWFovx71z71JkDV35Lk65HAUXj-yInZp6CCgV3c/embed?start=false&amp;loop=false&amp;delayms=3000" frameborder="0" width="960" height="569" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"&gt;&lt;/iframe&gt;</content><category term="hacking"></category></entry><entry><title>Chatbot Scraper: Using (today's) IRC logs as your NLP datasets</title><link href="https://blog.kjamistan.com/chatbot-scraper-using-todays-irc-logs-as-your-nlp-datasets.html" rel="alternate"></link><published>2016-09-29T00:00:00+02:00</published><updated>2016-09-29T00:00:00+02:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2016-09-29:/chatbot-scraper-using-todays-irc-logs-as-your-nlp-datasets.html</id><summary type="html">&lt;p&gt;I dunno about you, but I often find myself bored with NLP (natural language processing) datasets. Too often they are older, based around something that is not particularly interesting to me or something I've analyzed or used before.&lt;/p&gt;
&lt;p&gt;For me, &lt;a href="https://wikipedia.org/wiki/Internet_Relay_Chat"&gt;IRC&lt;/a&gt; has often been a source of community, fun, sometimes …&lt;/p&gt;</summary><content type="html">&lt;p&gt;I dunno about you, but I often find myself bored with NLP (natural language processing) datasets. Too often they are older, based around something that is not particularly interesting to me or something I've analyzed or used before.&lt;/p&gt;
&lt;p&gt;For me, &lt;a href="https://wikipedia.org/wiki/Internet_Relay_Chat"&gt;IRC&lt;/a&gt; has often been a source of community, fun, sometimes trolliness (is that a word yet?) and clearly an interesting source of news / assistance with regards to my work.&lt;/p&gt;
&lt;p&gt;Given the fact that &lt;a href="https://freenode.net/"&gt;freenode&lt;/a&gt; has many publicly logged channels, I decided to see if I could scrape &lt;a href="https://botbot.me"&gt;botbot.me&lt;/a&gt; to get more data for NLP fun.&lt;/p&gt;
&lt;p&gt;After about a day of tinkering and testing, I present &lt;a href="https://github.com/kjam/chatbot_scraper"&gt;chatbot_scraper&lt;/a&gt;. It currently &lt;a href="https://botbot.me/"&gt;only scrapes the public lists for botbot.me&lt;/a&gt;, but if you use a major open-source framework / platform, you'll likely find at least one channel of interest. For me, I'm perusing the docker logs looking for interesting new topics. For you, who knows?! (Although feel free to send interesting things you find!) To get started, take a look at the &lt;code&gt;README.md&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Here is an example run:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;python&lt;span class="w"&gt; &lt;/span&gt;botbot_scraper.py&lt;span class="w"&gt; &lt;/span&gt;--network_name&lt;span class="w"&gt; &lt;/span&gt;freenode&lt;span class="w"&gt; &lt;/span&gt;--chan_name&lt;span class="w"&gt; &lt;/span&gt;docker&lt;span class="w"&gt; &lt;/span&gt;--start_date&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;2016&lt;/span&gt;-08-30&lt;span class="w"&gt; &lt;/span&gt;--end_date&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;2016&lt;/span&gt;-09-05
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;For more info, try the help command:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;python&lt;span class="w"&gt; &lt;/span&gt;botbot_scraper.py&lt;span class="w"&gt; &lt;/span&gt;-h
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I am hoping to expand it for more public chat logs and possibly even slack logging (although I'm unsure what ToS Slack has, probably too constrictive tbh..). That said, let me know if you have suggestions or issues on the &lt;a href="https://github.com/kjam/chatbot_scraper/issues"&gt;issues page&lt;/a&gt; or simply fork and send a pull request!&lt;/p&gt;
&lt;p&gt;Cheers and happy bot-ing!&lt;/p&gt;</content><category term="hacking"></category></entry><entry><title>Automating your Data Cleanup with Python</title><link href="https://blog.kjamistan.com/automating-your-data-cleanup-with-python.html" rel="alternate"></link><published>2016-09-17T00:00:00+02:00</published><updated>2016-09-17T00:00:00+02:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2016-09-17:/automating-your-data-cleanup-with-python.html</id><summary type="html">&lt;p&gt;I gave a talk at &lt;a href="http://2016.pyconuk.org/"&gt;PyCon UK 2016&lt;/a&gt; on automating your data cleanup with Python. I want to again thank the organizers for having me and thank the folks who attended. If you have any questions or are interested in talking about data cleaning problems, feel free to reach out …&lt;/p&gt;</summary><content type="html">&lt;p&gt;I gave a talk at &lt;a href="http://2016.pyconuk.org/"&gt;PyCon UK 2016&lt;/a&gt; on automating your data cleanup with Python. I want to again thank the organizers for having me and thank the folks who attended. If you have any questions or are interested in talking about data cleaning problems, feel free to reach out: katharine at kjamistan or &lt;a href="http://twitter.com/kjam"&gt;on social media&lt;/a&gt;. Here are my slides:&lt;/p&gt;
&lt;iframe src="https://docs.google.com/presentation/d/1HCHh-V8QnZ2vGbDA-95mIAuVNQP9OXCVqknOYMg1wUg/embed?start=false&amp;loop=false&amp;delayms=5000" frameborder="0" width="960" height="749" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"&gt;&lt;/iframe&gt;

&lt;p&gt;And here is the video! :)&lt;/p&gt;
&lt;iframe width="560" height="315" src="https://www.youtube.com/embed/gp-ngPV_ZX8" frameborder="0" allowfullscreen&gt;&lt;/iframe&gt;</content><category term="conferences"></category></entry><entry><title>Embedded *isms in Vector-Based Natural Language Processing</title><link href="https://blog.kjamistan.com/embedded-isms-in-vector-based-natural-language-processing.html" rel="alternate"></link><published>2016-09-16T00:00:00+02:00</published><updated>2016-09-16T00:00:00+02:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2016-09-16:/embedded-isms-in-vector-based-natural-language-processing.html</id><summary type="html">&lt;p&gt;You may have read recently about &lt;a href="http://www.nytimes.com/2016/06/26/opinion/sunday/artificial-intelligences-white-guy-problem.html?_r=0"&gt;machine learning's&lt;/a&gt; &lt;a href="https://www.oreilly.com/learning/how-we-amplify-privilege-with-supervised-machine-learning"&gt;bias problem&lt;/a&gt; particularly in word &lt;a href="https://arxiv.org/abs/1606.06121"&gt;embeddings&lt;/a&gt; and &lt;a href="https://www.technologyreview.com/s/602025/how-vector-space-mathematics-reveals-the-hidden-sexism-in-language/"&gt;vectors&lt;/a&gt;. It's a massive problem. If you are using word embeddings to generate associative words, phrases or to do comparisons, you should be aware of the biases you are introducing into your work. In preparation …&lt;/p&gt;</summary><content type="html">&lt;p&gt;You may have read recently about &lt;a href="http://www.nytimes.com/2016/06/26/opinion/sunday/artificial-intelligences-white-guy-problem.html?_r=0"&gt;machine learning's&lt;/a&gt; &lt;a href="https://www.oreilly.com/learning/how-we-amplify-privilege-with-supervised-machine-learning"&gt;bias problem&lt;/a&gt; particularly in word &lt;a href="https://arxiv.org/abs/1606.06121"&gt;embeddings&lt;/a&gt; and &lt;a href="https://www.technologyreview.com/s/602025/how-vector-space-mathematics-reveals-the-hidden-sexism-in-language/"&gt;vectors&lt;/a&gt;. It's a massive problem. If you are using word embeddings to generate associative words, phrases or to do comparisons, you should be aware of the biases you are introducing into your work. In preparation for &lt;a href="__GHOST_URL__/i-hate-you-nlp/"&gt;my EuroPython talk on machine learning with sentiment analysis&lt;/a&gt;, I came across some disturbing nearest neighbor vectors when using Google's news vectors[^1] in emotionally charged speech; this provoked me to further investigate the bounds of *isms[^2] in word embeddings.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;I must warn you that parts of this post are disgusting, disturbing and awful.&lt;/strong&gt; If you are having a rough day, feel free to save it for another time. If you are already sick of seeing hateful language, this is likely not a post to read at present. That said, I feel it is my duty as a former journalist to look at it, expose it, and hope to spark better conversations around how we handle both implicit and explicit bias and prejudice in our models.&lt;/p&gt;
&lt;p&gt;In my research, not dissimilar to &lt;a href="https://arxiv.org/abs/1606.06121"&gt;Bolukbasi, Chang, Zou, Saligrama and Kalai's findings&lt;/a&gt;, I found word embeddings rife with examples of sexism. Take the following example, &lt;code&gt;model.most_similar(['lady'], topn=20)&lt;/code&gt; produces several expected words, 'woman', 'gentleman', even 'gal' alongside some gems like 'beauty queen', 'FLOTUS' and 'vivacious blonde'. Whereas, &lt;code&gt;model.most_similar(['gentleman'], topn=20)&lt;/code&gt; produces several expected words, 'man', 'gentlemen', 'gent' as well as some flattering terms like 'statesman', 'sportsman' and 'stunningly handsome'.[^3]&lt;/p&gt;
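&lt;p&gt;If you want to poke at these vectors yourself, here is a minimal sketch of how you could load the pretrained news vectors with gensim and run the same kind of queries (the file path is a placeholder for wherever you saved the download described in the first footnote):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;code&gt;# Sketch: load Google's pretrained news vectors and query nearest neighbors.
# The path below is a placeholder; see the first footnote for the download.
from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin.gz", binary=True
)

print(model.most_similar(["lady"], topn=20))
print(model.most_similar(positive=["doctor", "woman"], negative=["man"], topn=5))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;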
&lt;p&gt;To dive a bit deeper into how these biases play out, let's do some standard analogies. We all know &lt;a href="https://www.google.co.uk/search?q=king+queen+word2vec"&gt;the King-Queen comparison&lt;/a&gt;, how might that apply to other professions?&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;In&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;most_similar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;positive&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;doctor&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;woman&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;negative&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;man&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;Out&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;gynecologist&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.7093892097473145&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;nurse&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.647728681564331&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;So, Doctor - Man + Woman = Gynecologist or Nurse. Great! What else?&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;In&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;most_similar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;positive&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;professor&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;woman&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;negative&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;man&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;Out&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;associate_professor&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;0.7771055698394775&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;assistant_professor&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.7558495402336121&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;So, Professor - Man + Woman = Associate / Assistant Professor. Now, for &lt;a href="https://blog.kjamistan.com/obligatory-women-in-tech-post.html"&gt;something near and dear to me...&lt;/a&gt;&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;In&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;most_similar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;positive&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;computer_programmer&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;woman&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;negative&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;man&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;Out&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;homemaker&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.5627118945121765&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;housewife&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.5105047225952148&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;graphic_designer&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.505180299282074&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;So, Computer Programmer - Man + Woman = housewife. Or graphic designer. Because of course women only do design work (never great male designers or amazing female DBAs). Note that these vectors have varying degrees of similarity (the second element of each tuple); the higher the number, the closer the vectors. That said, these are &lt;em&gt;real&lt;/em&gt; responses from word2vec.[^4]&lt;/p&gt;
&lt;p&gt;I hadn't seen much written about word2vec's racist and xenophobic tendencies, but after playing around with sexism, I assumed I would find some. Again, &lt;strong&gt;fair warning that hateful language lies ahead!&lt;/strong&gt;&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt; &lt;span class="n"&gt;In&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;most_similar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;positive&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;immigrant&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;topn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
 &lt;span class="n"&gt;Out&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;immigrants&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.7985076904296875&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Immigrant&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.6984704732894897&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;migrant&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.6784891486167908&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;illegal_immigrant&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.6712934970855713&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;So it only took until the fourth most similar vector for our model to assume our immigrant is illegal. Scanning the rest of the word list, I found some &lt;a href="https://en.wikipedia.org/wiki/Binghamton_shootings"&gt;references to violence&lt;/a&gt; tied to immigrants, but no positive associative words.&lt;/p&gt;
&lt;p&gt;A few searches into African-American and man, I found that 'Negroes' existed not far from 'african_american' and 'black' + 'man'. Taking a look at the other nearest neighbors,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;  &lt;span class="n"&gt;In&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;most_similar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;positive&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Negroes&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;topn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;Out&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;negroes&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.7197504639625549&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;blacks&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.6292858123779297&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Negro&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.5892727375030518&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Blacks&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.5798656344413757&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;negro&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.5609244108200073&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;slaves&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.5548534393310547&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;niggers&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.553610622882843&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Yep. Word2Vec just dropped &lt;a href="https://www.youtube.com/watch?v=nwTejVem4zc"&gt;the N-Word&lt;/a&gt; in the middle of my search. It's clear that &lt;a href="https://www.theguardian.com/technology/2016/mar/30/microsoft-racist-sexist-chatbot-twitter-drugs"&gt;Microsoft isn't the only one with potential racist bot abuse&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;There were many more offensive phrases I found, many of which I didn't save or write down as I could really only stomach 5 minutes at a time of research until I needed a mental and spiritual break. Here are a summary of some I remembered and was able to find again:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;mexicans =&amp;gt; illegals, beaners&lt;/li&gt;
&lt;li&gt;asians =&amp;gt; gooks, wimps&lt;/li&gt;
&lt;li&gt;jews =&amp;gt; kikes&lt;/li&gt;
&lt;li&gt;asian + woman =&amp;gt; teenage girl, sucking dick&lt;/li&gt;
&lt;li&gt;gay + man =&amp;gt; "horribly, horribly deranged"&lt;/li&gt;
&lt;li&gt;transsexual + man =&amp;gt; convicted rapist[^5]&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I'm certain these are not the only *isms that lie in the vectors. Although these offensive vectors are often not the top similar result, we can see that hidden inside these word embeddings are offensive, demeaning, repulsive mirrors of the *isms in our society. Journalists are not always unbiased, and the news itself often contains quotes, references and other pointers to things we might rather not see or confront. Using the news to train our language models, as shown here, exposes them to the *ism-rich underbelly of our society.&lt;/p&gt;
&lt;p&gt;We, as data scientists and computer programmers, should recognize these statistical certainties in our data. I will note that doing similar searches in &lt;a href="https://github.com/idio/wiki2vec"&gt;the Wikipedia vectors&lt;/a&gt; produced far less offensive and hateful speech. I would be curious whether other vector models trained on different texts can help us produce more ethical models for our use, or whether we can confirm findings around &lt;strong&gt;unlearning&lt;/strong&gt; bias[^6].&lt;/p&gt;
&lt;p&gt;Confronting &lt;a href="http://boingboing.net/2015/12/02/racist-algorithms-how-big-dat.html"&gt;racism&lt;/a&gt;, sexism, heteronormativeism and likely many other *isms in our models is not something we can avoid or ignore: it's already here and at work. Taking a raw look at it and determining how we then treat our broken models is a step we will all be forced to take either now or later.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Note: if you find other *isms or are working on anything related to challenging bias in machine learning, I would love to hear from you! Feel free to reach out in the comments, email katharine at kjamistan or &lt;a href="http://twitter.com/kjam"&gt;on social media&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;[^1] To download the model used in this post and read about how the model was developed, go to &lt;a href="https://code.google.com/archive/p/word2vec/"&gt;Google's original word2vec release&lt;/a&gt;. tldr; it was trained on 300 billion words via english-language news articles (on Google News datasets) and contains 300-dimensional vectors for 3 million words.&lt;/p&gt;
&lt;p&gt;[^2] For the purpose of this post, *isms will be used to represent a variety of oppressive societal constructs such as racism, sexism and heterosexism. I am certain there are likely more hidden *isms in word embeddings, as well as more examples of these *isms in the news vectors, in other embedding models and in other languages.&lt;/p&gt;
&lt;p&gt;[^3] Mind you: I was surprised at that one! Indeed, it shows the inherent cultural bias of judging all genders by our looks -- another *ism in our social language.&lt;/p&gt;
&lt;p&gt;[^4] To see the entire code yourself, check out &lt;a href="https://github.com/kjam/random_hackery/blob/master/*isms%20and%20word%20embeddings.ipynb"&gt;my github&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;[^5] I found 'transsexual' via searching for 'transgender'.&lt;/p&gt;
&lt;p&gt;[^6] &lt;a href="https://arxiv.org/abs/1606.06121"&gt;Bolukbasi, Chang, Zou, Saligrama and Kalai's research&lt;/a&gt; was also able to show bias can be expressed as a directional vector(&lt;strong&gt;!!!&lt;/strong&gt;). We could possibly use machine learning to unlearn the aforementioned biases.&lt;/p&gt;</content><category term="research"></category></entry><entry><title>Obligatory Women In Tech Post</title><link href="https://blog.kjamistan.com/obligatory-women-in-tech-post.html" rel="alternate"></link><published>2016-09-16T00:00:00+02:00</published><updated>2016-09-16T00:00:00+02:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2016-09-16:/obligatory-women-in-tech-post.html</id><content type="html">&lt;p&gt;&lt;strong&gt;Question:&lt;/strong&gt; How does it feel to be a woman in tech?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Answer:&lt;/strong&gt;&lt;/p&gt;
&lt;iframe src="https://giphy.com/embed/13Xs7FQmAsqsHS" width="480" height="256" frameBorder="0" class="giphy-embed" allowFullScreen&gt;&lt;/iframe&gt;

&lt;p&gt;&lt;a href="https://giphy.com/gifs/hair-blow-dries-13Xs7FQmAsqsHS"&gt;via GIPHY&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;see also:&lt;/em&gt; &lt;a href="http://www.laweekly.com/arts/geek-chicks-pyladies-a-gang-of-female-computer-programmers-2373431"&gt;OG PyLadies Interview&lt;/a&gt;&lt;/p&gt;</content><category term="life"></category></entry><entry><title>I Hate You, NLP ;)</title><link href="https://blog.kjamistan.com/i-hate-you-nlp.html" rel="alternate"></link><published>2016-07-21T00:00:00+02:00</published><updated>2016-07-21T00:00:00+02:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2016-07-21:/i-hate-you-nlp.html</id><summary type="html">&lt;p&gt;"I had a great time talking about Sentiment Analysis and Natural Language processing at &lt;a href="https://ep2016.europython.eu/"&gt;EuroPython 2016&lt;/a&gt;. Here are my slides for your review, feel free to reach out &lt;a href="https://twitter.com/kjam"&gt;on Twitter&lt;/a&gt; or email if you'd like to chat further about NLP, machine learning and sentiment. I look forward to starting more …&lt;/p&gt;</summary><content type="html">&lt;p&gt;"I had a great time talking about Sentiment Analysis and Natural Language processing at &lt;a href="https://ep2016.europython.eu/"&gt;EuroPython 2016&lt;/a&gt;. Here are my slides for your review, feel free to reach out &lt;a href="https://twitter.com/kjam"&gt;on Twitter&lt;/a&gt; or email if you'd like to chat further about NLP, machine learning and sentiment. I look forward to starting more conversations about how we are handling NLP in open source and sentiment analysis.&lt;/p&gt;
&lt;iframe src="https://docs.google.com/presentation/d/1c9TbcDpxpyjKopY-LOeL49oqYAPrOcQyB6XQjewEPrg/embed?start=false&amp;loop=false&amp;delayms=5000" frameborder="0" width="960" height="749" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"&gt;&lt;/iframe&gt;

&lt;p&gt;And here's &lt;a href="https://www.youtube.com/watch?v=vitEXiOuiEk"&gt;the video&lt;/a&gt;!&lt;/p&gt;
&lt;iframe width="560" height="315" src="https://www.youtube.com/embed/vitEXiOuiEk" frameborder="0" allowfullscreen&gt;&lt;/iframe&gt;</content><category term="conferences"></category></entry><entry><title>Python Flight Search</title><link href="https://blog.kjamistan.com/python-flight-search.html" rel="alternate"></link><published>2016-03-29T00:00:00+02:00</published><updated>2016-03-29T00:00:00+02:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2016-03-29:/python-flight-search.html</id><summary type="html">&lt;p&gt;Like many people, I enjoy travel. With family and friends all across the United States and a home base in Berlin, it's fairly easy to find a reason to travel -- either globally or within the EU. That said, what I find more difficult is to determine what's the best way …&lt;/p&gt;</summary><content type="html">&lt;p&gt;Like many people, I enjoy travel. With family and friends all across the United States and a home base in Berlin, it's fairly easy to find a reason to travel -- either globally or within the EU. That said, what I find more difficult is to determine what's the best way to get from one place to another. I have used &lt;em&gt;many&lt;/em&gt; flight trackers before and generally was happy with the results, but I always wondered if there was more to the flight matrix...&lt;/p&gt;
&lt;p&gt;As I was planning a potential visit to Cuba, many of the "normal" sites were lacking available trips. Since I'm based in Berlin, it's also easy (and cheap -- thanks budget air!) to fly out of Frankfurt, Paris, Amsterdam or London. This usually means setting up countless alert variations on numerous sites.&lt;sup id="fnref:1"&gt;&lt;a class="footnote-ref" href="#fn:1"&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;Being a person who has &lt;a href="https://www.youtube.com/watch?v=p1iX0uxM1w8"&gt;written some scrapers in her time&lt;/a&gt;, I thought I'd at least write one to compare a few of the popular flight search sites. I was curious to know what different options the sites gave and compare if the same flights were listed with different prices.&lt;/p&gt;
&lt;h4 id="diving-into-github"&gt;Diving into GitHub&lt;/h4&gt;
&lt;p&gt;It's always good to see what's out there when you're building something new -- just in case what you're building already exists (or mainly exists). Upon some searching I came across several flight trackers / scrapers written in Python.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/Skyscanner/skyscanner-python-sdk"&gt;FlightScanner's Python SDK&lt;/a&gt; looked great. I applied to get an API Key, and so far haven't heard back.&lt;sup id="fnref:2"&gt;&lt;a class="footnote-ref" href="#fn:2"&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;I found &lt;a href="https://github.com/mayanez/flight_scraper"&gt;a GitHub flight scraper from @mayanez&lt;/a&gt;, but after installing it, I realized it no longer worked. This is a common problem for scrapers, since they usually need constant maintenance to keep functioning: every time a site or API changes, it can break your project.&lt;/p&gt;
&lt;p&gt;I then located Google's flight search API, QPX Express (built on the ITA Matrix technology Google acquired). I registered and created a client in my Google Cloud Developer Console (hint: you must search for the API for it to show up), and perused &lt;a href="https://developers.google.com/qpx-express/v1/trips/search#request"&gt;the search documentation&lt;/a&gt;. It's worth noting this API charges money &lt;a href="https://developers.google.com/qpx-express/v1/pricing"&gt;after the first 50 requests per day&lt;/a&gt;.&lt;/p&gt;
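&lt;p&gt;To give you a feel for it, here's a minimal sketch of the kind of request QPX Express expects. The field names follow the search documentation linked above, but treat this as an illustration rather than a drop-in script -- double-check the docs, and swap in your own key for the &lt;code&gt;YOUR_API_KEY&lt;/code&gt; placeholder:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;import requests

QPX_URL = ('https://www.googleapis.com/qpxExpress/v1/trips/search'
           '?key=YOUR_API_KEY')  # placeholder -- use your own key

# one adult, one-way Berlin (TXL) to San Francisco (SFO)
payload = {
    'request': {
        'passengers': {'adultCount': 1},
        'slice': [{
            'origin': 'TXL',
            'destination': 'SFO',
            'date': '2016-05-01',
        }],
        'solutions': 20,  # how many itineraries to return at most
    }
}

resp = requests.post(QPX_URL, json=payload)
resp.raise_for_status()

# each tripOption is one itinerary, with a total sale price attached
for option in resp.json().get('trips', {}).get('tripOption', []):
    print(option['saleTotal'], option['id'])
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
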
&lt;p&gt;I was interested in comparing Google's flight search with some of the popular ones here in Europe. Momondo was sadly out, with no API and a strict "no automation" policy in their Terms of Service. With some luck, I found that SkyPicker (another great site for low-fare searches) does have &lt;a href="http://docs.skypickerpublicapi.apiary.io/#"&gt;an API with some documentation&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I also found &lt;a href="http://airfinder.de"&gt;airfinder.de&lt;/a&gt;, a popular aggregator here in Germany, has a simple search and no restrictions on automation. I was able to write a scraper to parse responses on their site.&lt;/p&gt;
&lt;p&gt;I've collected the code I wrote &lt;a href="https://github.com/kjam/python_flight_search"&gt;in a repository on GitHub&lt;/a&gt;. Note that there is &lt;em&gt;a lot&lt;/em&gt; more information available in these API responses, so you could easily extend the code to add filtering for your favorite (or least favorite) airlines and airports. I've included a script I used to pull the results into a pandas DataFrame for easy comparison and analysis.&lt;/p&gt;
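&lt;p&gt;A simplified sketch of that idea, with illustrative field names (each API labels its fields differently, so you'd map them into a common shape first):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;import pandas as pd

# pretend these came back from the different searches, already flattened
# into a common shape (one dict per itinerary, per source)
results = [
    {'source': 'qpx', 'price': 1650.00, 'duration_minutes': 980,
     'departure': '2016-05-01 18:30'},
    {'source': 'skypicker', 'price': 1540.00, 'duration_minutes': 1150,
     'departure': '2016-05-01 21:10'},
]

df = pd.DataFrame(results)
df['departure'] = pd.to_datetime(df['departure'])

# quick per-source price summary for the same route and date
print(df.groupby('source')['price'].describe())
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
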
&lt;h4 id="what-i-found"&gt;What I found&lt;/h4&gt;
&lt;p&gt;The first thing I noticed was that, although there were some duplicates, there was definite variance. (Aha! See??? I'm not crazy!) Some of the sites really offered quite a few mixed-carrier flights (usually cheaper but longer routes), while others focused on direct flights. The duplicates I saw were always listed with the same times and prices (Conspiracy theory thwarted... 😢).&lt;/p&gt;
&lt;p&gt;I found a pretty large variance depending on the search input. For the most part, I was searching for flights out of Berlin, attempting to go long distances (America, Asia, the Caribbean). Your mileage may vary (HAAAA..😂😂😂).&lt;/p&gt;
&lt;p&gt;I also wondered how travel time compared to price (the eternal time vs. money question). I assumed the two would show a roughly linear negative correlation, with price decreasing as travel duration increased. I was wrong.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Flight Duration versus Price" src="https://blog.kjamistan.com/images/2016/03/duration_vs_price.png"&gt;&lt;/p&gt;
&lt;p&gt;In addition, I looked at mean prices across time-of-day buckets. I like to take morning flights so I can just get them out of the way… but for this particular flight search (Berlin to San Francisco), that preference is costly:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;departure_tod (mean price)
early am        2495.080635
morning         2459.062500
afternoon       2392.573200
evening         1663.772432
late evening    1544.032000
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
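&lt;p&gt;If you want to reproduce something like this, one way is to bin the departure hour and take the mean price per bucket -- roughly like the snippet below (continuing the DataFrame from the earlier sketch; the bucket edges are just a guess at reasonable cut-offs):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;def time_of_day(hour):
    """Map a departure hour onto a rough time-of-day bucket."""
    if hour &lt; 8:
        return 'early am'
    if hour &lt; 12:
        return 'morning'
    if hour &lt; 17:
        return 'afternoon'
    if hour &lt; 21:
        return 'evening'
    return 'late evening'

# df: one row per itinerary, with parsed 'departure' datetimes and prices
df['departure_tod'] = df['departure'].dt.hour.map(time_of_day)
print(df.groupby('departure_tod')['price'].mean().sort_values(ascending=False))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;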

&lt;p&gt;There are plenty of other questions to ask and answer with this dataset, so feel free to play around with your own searches, or let me know if there's anything in particular you'd like me to explore.&lt;sup id="fnref:3"&gt;&lt;a class="footnote-ref" href="#fn:3"&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;For now, I have a solid way to compare prices across a few aggregators, and some new flight search tools to use going forward.&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;&lt;a href="http://gif.co/oWqd.gif"&gt;My Feelings about this.&lt;/a&gt;&amp;#160;&lt;a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;To be fair, they do have a note that they get thousands of requests and cannot fulfill all of them. If you have a business need for their API, I'm fairly certain you could get an API Key much faster.&amp;#160;&lt;a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:3"&gt;
&lt;p&gt;I'm hoping to write some price comparison over time blog posts from this data, so let me know if you have any specific questions.&amp;#160;&lt;a class="footnote-backref" href="#fnref:3" title="Jump back to footnote 3 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="hacking"></category></entry><entry><title>Data Wrangling with Python Course</title><link href="https://blog.kjamistan.com/data-wrangling-with-python-course.html" rel="alternate"></link><published>2016-02-29T00:00:00+01:00</published><updated>2016-02-29T00:00:00+01:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2016-02-29:/data-wrangling-with-python-course.html</id><summary type="html">&lt;p&gt;I'll be in New York on July 13th and 14th, teaching how to "big data" with Python. We'll cover Pandas, Hadoop, PySpark and more on automation, acquisition and managing your data.&lt;/p&gt;
&lt;h3 id="next-course-new-york-city-july-13-14"&gt;Next Course: New York City, July 13-14&lt;/h3&gt;
&lt;p&gt;&lt;a href="https://www.eventbrite.co.uk/e/learn-big-data-wrangling-with-python-tickets-24220425946"&gt;Tickets are available on Eventbrite&lt;/a&gt; with a special Early Bird and Student …&lt;/p&gt;</summary><content type="html">&lt;p&gt;I'll be in New York on July 13th and 14th, teaching how to "big data" with Python. We'll cover Pandas, Hadoop, PySpark and more on automation, acquisition and managing your data.&lt;/p&gt;
&lt;h3 id="next-course-new-york-city-july-13-14"&gt;Next Course: New York City, July 13-14&lt;/h3&gt;
&lt;p&gt;&lt;a href="https://www.eventbrite.co.uk/e/learn-big-data-wrangling-with-python-tickets-24220425946"&gt;Tickets are available on Eventbrite&lt;/a&gt; with a special Early Bird and Student discount. If you don't want to use Eventbrite and would like to pay via invoice instead, please make a note in the form comments.&lt;/p&gt;
&lt;p&gt;If you want to attend, &lt;em&gt;please fill out the form below&lt;/em&gt; and let me know more about what you're hoping to learn. I like to modify the course once I know more about the students, so that you can have a tailored experience and I can make sure it's engaging and interesting.&lt;/p&gt;</content><category term="trainings"></category></entry><entry><title>Data Wrangling with Python</title><link href="https://blog.kjamistan.com/data-wrangling-with-python.html" rel="alternate"></link><published>2015-11-01T00:00:00+01:00</published><updated>2015-11-01T00:00:00+01:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2015-11-01:/data-wrangling-with-python.html</id><summary type="html">&lt;p&gt;Just a quick note that my book: Data Wrangling with Python is available for &lt;a href="http://www.amazon.com/Data-Wrangling-Python-Jacqueline-Kazil/dp/1491948817/ref=sr_1_1?s=books&amp;amp;ie=UTF8&amp;amp;qid=1445422551&amp;amp;sr=1-1&amp;amp;keywords=katharine+jarmul"&gt;prepurchase on Amazon&lt;/a&gt; as well as in &lt;a href="http://shop.oreilly.com/product/0636920032861.do"&gt;early release on O'Reilly's web site&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Data Wrangling with Python" src="http://ecx.images-amazon.com/images/I/51qWQ75%2BCXL._SX379_BO1\n,204\n,203\n,200_.jpg"&gt;&lt;/p&gt;
&lt;p&gt;Pick up a copy for less than the full price now. I'll be posting some examples of problems we work through in the book …&lt;/p&gt;</summary><content type="html">&lt;p&gt;Just a quick note that my book: Data Wrangling with Python is available for &lt;a href="http://www.amazon.com/Data-Wrangling-Python-Jacqueline-Kazil/dp/1491948817/ref=sr_1_1?s=books&amp;amp;ie=UTF8&amp;amp;qid=1445422551&amp;amp;sr=1-1&amp;amp;keywords=katharine+jarmul"&gt;prepurchase on Amazon&lt;/a&gt; as well as in &lt;a href="http://shop.oreilly.com/product/0636920032861.do"&gt;early release on O'Reilly's web site&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Data Wrangling with Python" src="http://ecx.images-amazon.com/images/I/51qWQ75%2BCXL._SX379_BO1\n,204\n,203\n,200_.jpg"&gt;&lt;/p&gt;
&lt;p&gt;Pick up a copy for less than the full price now. I'll be posting some examples of problems we work through in the book in the coming weeks, as well as some classes in Europe where you can learn in person, so stay tuned!&lt;/p&gt;
&lt;p&gt;Also be on the lookout for my &lt;a href="http://kjamistan.com/upcoming-courses"&gt;upcoming courses&lt;/a&gt; to learn applied Data Wrangling via intensive weekend-long trainings.&lt;/p&gt;</content><category term="books"></category></entry><entry><title>Europython 2015</title><link href="https://blog.kjamistan.com/europython-2015.html" rel="alternate"></link><published>2015-07-23T00:00:00+02:00</published><updated>2015-07-23T00:00:00+02:00</updated><author><name>katharine</name></author><id>tag:blog.kjamistan.com,2015-07-23:/europython-2015.html</id><content type="html">&lt;h3 id="introduction-to-data-analysis-tutorial"&gt;Introduction to Data Analysis Tutorial&lt;/h3&gt;
&lt;p&gt;Want to learn how to analyze data using Python? If you're at #EuroPython, you should drop by my course! If not, watch the video online later today (I will post the link!)&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.haikudeck.com/p/b8T4gEIWvi/introduction-to-data-analysis---europython-2015"&gt;Slides&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/kjam/data-wrangling-pycon"&gt;Repo&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ipynb.kjamistan.com:8888"&gt;Notebooks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://bit.ly/data-class-feedback"&gt;Feedback&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</content><category term="conferences"></category></entry></feed>