AI Training Data Leak: A Growing Security Nightmare

PureID

Srishti Chaubey

March 13, 2025

AI Training Data Leak Exposes 12,000+ Private API Keys & Passwords

A recent study by Truffle Security uncovered a massive security problem: over 12,000 real secrets, including API keys and passwords, were embedded in AI training data. These secrets, sourced from Common Crawl’s publicly available web archive, included authentication tokens for services such as AWS, MailChimp, and WalkScore.

How Did This Happen?

Common Crawl, a nonprofit that archives vast amounts of web data, is widely used for training AI models, including OpenAI’s ChatGPT, Google Gemini, and Meta’s Llama. However, an analysis of 400 terabytes of data from 2.67 billion web pages in 2024 revealed alarming findings:

  • Over 200 different types of secrets were exposed, with AWS, MailChimp, and WalkScore being among the most affected.
  • 1,500+ MailChimp API keys were hardcoded into front-end HTML and JavaScript.
  • A single WalkScore API key was used 57,029 times across 1,871 subdomains.

This issue is a symptom of a widespread problem: developers frequently leave credentials in code during development and forget to remove them before deployment.
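
To make the pattern concrete, here is a hypothetical sketch of the kind of front-end code the study describes. The key value, endpoint, and field names below are invented for illustration; they are not taken from the report, and the real MailChimp API differs in its details.

```typescript
// Hypothetical front-end code illustrating the anti-pattern: anything shipped
// to the browser is publicly served, so crawlers (and therefore AI training
// datasets built on them) can pick up the credential along with the page.
const MAILCHIMP_API_KEY = "0123456789abcdef0123456789abcdef-us1"; // fake value, for illustration only

export async function subscribe(email: string): Promise<void> {
  // Calling the provider's API directly from the browser forces the key into
  // publicly downloadable JavaScript.
  await fetch("https://api.example-mailing-service.test/v1/subscribers", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${MAILCHIMP_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ email_address: email, status: "subscribed" }),
  });
}
```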

The Bigger Threat: AI-Powered Credential Harvesting

Cybercriminals have long used web scraping to extract sensitive information, but AI models amplify the risk. Since AI is trained on vast amounts of publicly available data, it can inadvertently learn, store, and reproduce these secrets. Even when training data is screened, current filtering mechanisms are not foolproof.

Truffle Security highlighted another concern: AI coding tools cannot distinguish between valid credentials and harmless placeholders, so even example credentials in training data reinforce the habit of hardcoding secrets, making AI-assisted development a potential security liability.

Beyond Credential Leaks: AI Training Risks Grow

This issue is part of a broader set of security challenges tied to AI training data:

  1. Wayback Copilot Attack – Even if organizations later make repositories private, older versions of their data can remain accessible through AI tools like Microsoft Copilot, because search engines indexed and cached the content while it was public.
  2. Jailbreak Attacks – Hackers are finding ways to bypass AI security safeguards and extract confidential data from models.
  3. AI Misalignment Risks – If AI is trained on insecure code, it may reproduce those patterns and generate unsafe or hazardous recommendations.

How Organizations Can Protect Themselves

Following the discovery, affected vendors revoked compromised keys, but organizations must adopt proactive security measures to prevent future leaks:

  • Use Environment Variables – Never hardcode secrets in source code. Load them from environment variables or a secure vault instead (see the sketch after this list).
  • Automate Secret Scanning – Run scanners such as TruffleHog or GitGuardian to detect exposed credentials in code and commits, and keep secrets in a dedicated store such as AWS Secrets Manager rather than in the repository (a simplified scanning sketch also follows below).
  • Adopt Zero-Trust Authentication – Move away from passwords entirely with passwordless and zero-trust authentication solutions like PureID to mitigate credential-related risks.
  • Enhance AI Training Data Security – AI providers must improve data sanitization so that sensitive information is filtered out of training datasets.
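
For the environment-variable point above, here is a minimal, hypothetical sketch assuming a Node.js 18+ and TypeScript backend: the browser calls this server-side handler, and only the server ever sees the credential, which is injected through an environment variable (for example by a vault or the deployment pipeline). The endpoint and variable name are placeholders, not a real provider API.

```typescript
// Hypothetical server-side handler: the secret is read from the environment
// at startup and never ships to the browser.
const MAILCHIMP_API_KEY = process.env.MAILCHIMP_API_KEY;
if (!MAILCHIMP_API_KEY) {
  throw new Error("MAILCHIMP_API_KEY is not set"); // fail fast on misconfiguration
}

export async function subscribe(email: string): Promise<void> {
  // Placeholder endpoint for illustration; a real integration would call the
  // provider's documented API from here, server-side only.
  await fetch("https://api.example-mailing-service.test/v1/subscribers", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${MAILCHIMP_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ email_address: email, status: "subscribed" }),
  });
}
```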
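
For the secret-scanning point, the sketch below shows, in a deliberately simplified form, the kind of pattern matching such tools perform. The two regexes are commonly cited formats for AWS access key IDs and MailChimp-style keys, used here purely for illustration; real scanners like TruffleHog and GitGuardian ship far larger rule sets, scan git history, and can verify whether a detected key is live.

```typescript
// Minimal illustration of pattern-based secret scanning; not a substitute for
// dedicated tools, which cover many more credential formats and verify matches.
import { readFileSync, readdirSync, statSync } from "node:fs";
import { join } from "node:path";

// Two illustrative patterns; real rule sets are much larger.
const PATTERNS: Record<string, RegExp> = {
  "AWS access key ID": /AKIA[0-9A-Z]{16}/g,
  "MailChimp-style API key": /[0-9a-f]{32}-us[0-9]{1,2}/g,
};

function scanFile(path: string): void {
  const text = readFileSync(path, "utf8");
  for (const [name, pattern] of Object.entries(PATTERNS)) {
    for (const match of text.matchAll(pattern)) {
      console.log(`${path}: possible ${name}: ${match[0]}`);
    }
  }
}

function walk(dir: string): void {
  for (const entry of readdirSync(dir)) {
    const path = join(dir, entry);
    if (statSync(path).isDirectory()) {
      // Skip noisy directories; a real scanner also walks git history.
      if (entry !== ".git" && entry !== "node_modules") walk(path);
    } else {
      scanFile(path);
    }
  }
}

walk(process.argv[2] ?? ".");
```

Running a check like this (or, better, a full-featured scanner) in CI or a pre-commit hook catches keys before they ever reach a public page.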

Conclusion

This AI training data exposure underscores a critical cybersecurity concern: the mass scraping of data for AI training can inadvertently expose sensitive information. While vendors have taken corrective action, the industry must rethink security practices in an AI-driven world.

As AI grows more advanced, so must our approach to safeguarding digital identities and authentication systems. It’s time for organizations to embrace a passwordless future and strengthen their security posture against evolving threats.

Stay secure. Stay informed.

Connect with Us!

Subscribe to receive new blog posts from PureID in your mailbox.