Generative AI and Data Privacy: The Challenge of PII Use in Training Data Sets

06/11/2024

by Bill Tolson

The rapid advancement of generative AI models, such as large language models and image generators, has ushered in a new era of technological capabilities. However, this progress also raises significant concerns regarding data privacy compliance when personally identifiable information (PII) is used to train these powerful AI systems.

Generative AI and data privacy

At the core of this issue is a fundamental tension between three things: the vast amounts of data needed to train generative AI models; the rights of individual data subjects, including the right to have their PII deleted; and the principles of data minimization and purpose limitation enshrined in many data privacy regulations, such as the EU's GDPR, California's CCPA/CPRA, and a growing number of new state data privacy laws.

Generative AI models are trained on massive datasets, often containing billions or trillions of data points, so that they can learn the patterns and relationships needed to generate human-like text, images, or other outputs. While this training data can come from various sources, including publicly available ones, it is not uncommon for organizations to incorporate PII, such as personal names, addresses, or other identifiable information, into their training sets.

Using PII in AI training sets raises several concerns from a data privacy perspective. First, it may violate the principle of purpose limitation if the personal data was initially collected for a purpose other than training AI models. Additionally, it could conflict with the data minimization principle, which encourages organizations to limit the collection and processing of personal data to what is strictly necessary.

Data privacy law rights

Many data privacy laws also grant individuals the right to access, rectify, or delete their personal data held by organizations. However, when PII is deeply embedded within the parameters of a trained generative AI model, it becomes incredibly challenging, if not impossible, to isolate and remove that specific data point without significantly degrading the model's performance.

This predicament poses a significant compliance challenge for organizations operating under data privacy regulations. Failing to effectively remove requested PII from trained AI models could be considered a violation, potentially leading to substantial fines, legal actions, and reputational damage.

To mitigate these risks, organizations should explore techniques such as synthetic data generation, differential privacy, and federated learning, which can help obfuscate or anonymize individual PII while preserving the overall patterns and distributions in the training data.
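As a rough illustration of the differential privacy idea mentioned above, the sketch below applies the Laplace mechanism to a simple counting query. It is a minimal example under stated assumptions, not a recipe for training a model privately (that is usually done with specialized training methods). The function names, the `opted_in` field, and the choice of epsilon are all hypothetical and chosen only for illustration.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one sample from a Laplace(0, scale) distribution.

    The difference of two independent exponential variables with
    mean `scale` follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    """Return a noisy count of the records matching `predicate`.

    Assumption: each person contributes at most one record, so adding
    or removing one person changes the true count by at most 1
    (sensitivity 1). The noise scale is therefore 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical usage: how many people in a small data set opted in?
rng = random.Random()
people = [{"opted_in": True}, {"opted_in": False}, {"opted_in": True}]
noisy = private_count(people, lambda p: p["opted_in"], epsilon=0.5, rng=rng)
print(f"Noisy opt-in count: {noisy:.2f}")
```

A smaller epsilon adds more noise and gives stronger privacy; a larger epsilon gives a more accurate answer with weaker protection. The overall pattern survives while any single person's presence is masked.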

A best practice is to maintain detailed data lineage and documentation of training data sources, preprocessing steps, and model versioning, which helps identify and manage PII within AI systems. Additionally, companies should consider updating their consent requests so that they explicitly cover the use of individual PII in generative AI training data sets.

Furthermore, using PII in generative AI training raises broader ethical and legal concerns beyond compliance with data deletion requests. Organizations must carefully evaluate the potential risks, such as algorithmic bias, discrimination, and privacy violations, and implement appropriate safeguards and governance processes.

As generative AI continues to advance and permeate various industries, regulatory bodies and policymakers may need to provide additional guidance or legal frameworks explicitly addressing the challenges posed by AI systems and personal data. Striking the right balance between innovation and data privacy protection will be crucial for fostering public trust and enabling the responsible development of these transformative technologies.

Implications of generative AI models and the new data privacy laws

If a data subject exercises their right to have their personally identifiable information (PII) deleted under various data privacy laws like the GDPR, CCPA/CPRA, or the numerous other state data privacy laws, and that PII was used in training a generative AI model, there could be significant challenges in fully complying with the deletion request.

Key implications and considerations include:

Difficulty in Isolating and Removing Specific PII: Generative AI models, especially large language models and image generators, are trained on massive datasets containing vast numbers of data points. Identifying and removing the specific PII of an individual data subject from an already trained model can be extremely difficult, if not impossible, because the training data is distributed across the model's parameters.

Potential Model Degradation: If the PII of a data subject is successfully identified and removed from the training data, retraining the generative AI model without that data could degrade its performance and accuracy and change the model's conclusions and recommendations. The extent of this degradation would depend on the size of the overall training dataset and the significance of the removed data point.

Compliance Challenges: Failing to effectively remove the requested PII from the trained AI model could be considered a violation of data privacy regulations, potentially leading to fines, legal actions, and damage to the organization's reputation.

Synthetic Data and Differential Privacy: One potential solution could be using synthetic data generation techniques or differential privacy methods during the initial training process. These approaches could help obscure or anonymize the individual PII while preserving the overall patterns and distributions in the data, making it easier to comply with deletion requests without significantly impacting the AI model's performance.

Data Lineage and Documentation: Maintaining detailed data lineage and documentation of the training data sources, preprocessing steps, and model versioning could help organizations better identify and manage PII within their AI systems, facilitating compliance with data subject rights.

Ethical and Legal Considerations: Using PII in generative AI training raises ethical and legal concerns beyond compliance with data deletion requests.
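To make the data lineage point above more concrete, here is a minimal sketch of a lineage log that answers one question: which model versions were trained on data containing a given person's PII? The class, method names, and identifiers such as `subject-42` are hypothetical; a real system would also need to track preprocessing steps, timestamps, and storage locations.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class LineageLog:
    """Tracks which data subjects appear in which datasets,
    and which datasets were used to train which model versions."""

    # dataset ID -> IDs of the data subjects with records in it
    dataset_subjects: dict[str, set[str]] = field(default_factory=dict)
    # model version -> IDs of the datasets it was trained on
    model_datasets: dict[str, list[str]] = field(default_factory=dict)

    def record_dataset(self, dataset_id: str, subject_ids: list[str]) -> None:
        self.dataset_subjects.setdefault(dataset_id, set()).update(subject_ids)

    def record_training(self, model_version: str, dataset_ids: list[str]) -> None:
        self.model_datasets[model_version] = list(dataset_ids)

    def datasets_containing(self, subject_id: str) -> list[str]:
        """Datasets that hold at least one record for this subject."""
        return sorted(
            ds for ds, subjects in self.dataset_subjects.items()
            if subject_id in subjects
        )

    def models_affected_by(self, subject_id: str) -> list[str]:
        """Model versions trained on any dataset containing this subject."""
        hits = set(self.datasets_containing(subject_id))
        return sorted(
            model for model, datasets in self.model_datasets.items()
            if hits.intersection(datasets)
        )


# Hypothetical usage when a deletion request arrives for "subject-42".
log = LineageLog()
log.record_dataset("crm-export-2023", ["subject-42", "subject-7"])
log.record_dataset("support-tickets", ["subject-7"])
log.record_training("model-v1", ["crm-export-2023"])
log.record_training("model-v2", ["support-tickets"])
print(log.models_affected_by("subject-42"))
```

A record like this does not remove PII from a trained model, but it can tell an organization which datasets need cleaning and which model versions may need retraining or review.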

One last question:

If a specific data subject requests deletion of their PII and that PII was used for generative AI content creation, would all the associated content generation output from that model also need to be deleted and the model retrained with the new training set?

I have not reviewed all of the new AI laws emerging worldwide, but so far, I have not seen these questions or concerns raised.

Ultimately, addressing data subject deletion requests for generative AI models trained on PII may require a combination of technical solutions, robust data governance practices, and, potentially, new regulatory guidance or legal frameworks that specifically address the challenges posed by AI systems and their use of PII.

Author

Bill Tolson

President at Tolson Communications LLC

Bill Tolson is President of Tolson Communications LLC, an advisory and consulting firm. He has 25-plus years in the archiving, information governance, data privacy, data security, and eDiscovery industries. Bill has held executive leadership positions in a wide range of high technology organizations, from consulting firms and technology startups to multinationals. Companies include Contoural, Hewlett Packard, StorageTek, Iomega, Hitachi Data Systems, Recommind, Actiance, and Archive360, where he was the Vice President of Global Compliance and eDiscovery for seven years.

Bill is a frequent speaker at legal and information governance industry events and has authored four eBooks, including Email Archiving for Dummies, Cloud Archiving for Dummies, The Bartenders Guide to eDiscovery, and the Know IT All's Guide to eDiscovery. Bill has also authored 60-plus industry articles and hundreds of blogs and has hosted 37 podcasts with industry pundits, subject matter experts, state legislators, and attorneys.
