
Q&A: Why data provenance is critical to AI-powered drug discovery

Pharmaceutical research and development is becoming more expensive and taking longer. Innovative AI scale-ups and platforms are addressing this by quickly accumulating vast datasets, but the larger companies that acquire or partner with them need certainty that those datasets are lawful, traceable, and defensible. Here we explain what major players need to know when entering into M&A, partnerships and collaborations with smaller innovators.

Life sciences companies are racing to deliver on an AI promise: faster identification of promising therapies, streamlined development, and novel treatments reaching patients sooner and more safely.

Biotechs and pharma companies are integrating the latest generative AI models into their R&D systems: analyzing vast volumes of data, identifying patterns to inform recommendations or predictions (such as whether a candidate compound can bind to a given protein target), and optimizing clinical studies.

This may dramatically increase the efficiency—and reduce the cost—of drug development. This is particularly true in the rare and orphan disease space, where the cost of identifying and testing treatments can be astronomical.

However, realizing these benefits depends on factors less glamorous than the algorithms themselves: the quality and provenance of the data that feeds them.

This trend is driving transactional activity. We have advised on most of the highest-profile partnerships between AI biotechs and pharma companies for the deployment of AI to aid drug discovery.

These agreements raise complex and potentially existential issues around IP ownership, data usage rights, liability allocation, regulatory compliance, and exclusivity.

The shape of these partnership arrangements is beginning to resemble a pure AI licensing model, with pharma companies now prioritizing the ability to take third-party model weights and fine-tune them using proprietary datasets on their own infrastructure.

There are obvious advantages to this approach, although it raises complex questions around the future ownership and usage rights for derived or modified models.

This dynamic holds a lot of promise, but it also carries risk. Fast-growth companies are scrambling to create the best models (with increasing specialization for a specific indication, therapeutic area or even specific protein), and to do so they must ingest huge volumes of high-quality data.

Depending on the model, this may include clinical trial data, real-world evidence, genomic data, electronic health records, and publicly available datasets. If this data is not "clean" (traceable, accountable, and compliant), it can introduce privacy and legal risk.

Another concern is secondary use, where AI models are trained on data collected from trials conducted before the technology existed—meaning the patient could not have explicitly consented to such use.

This raises difficult questions about whether original consent frameworks adequately cover the application of personal health data to train machine learning algorithms, particularly when those models may be commercialized or used in ways that were not contemplated at the time of data collection.

These concerns underscore the importance of data provenance—understanding where data comes from, how it was collected, and under what terms it can and can’t be used.

For companies on both sides of AI-pharma deals, rigorous data provenance practices are becoming essential to managing risk and ensuring long-term value.

What does data provenance mean?

IBM defines data provenance as a historical record that details a data set’s authenticity and integrity: who created it, the history of any modifications and who made those changes. In superhero terms, we might just refer to this as its origin story.

If the data is accurate and was collected in compliance with all laws applicable at the time, it is more likely to meet legal and industry standards and to mitigate concerns about future use. For data to be considered “clean” from a privacy standpoint, the data subject must have been provided with the proper notices and consents.

These will vary by jurisdiction. In addition, the data must have been handled in accordance with applicable laws at every step of the processing chain, with appropriate records of processing, legal bases, and transfer mechanisms in place.
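
By way of illustration only, the elements described above could be captured in a structured provenance record that travels with each data set. The sketch below is hypothetical: the schema, field names, and values are ours rather than any regulatory or industry standard, but they show the kind of origin, consent, and handling metadata a diligence team would hope to find.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical, illustrative schema only -- not a regulatory or industry standard.
@dataclass
class ProvenanceRecord:
    dataset_id: str                # internal identifier for the data set
    source: str                    # where the data came from (trial, biobank, vendor, public registry)
    collected_under: str           # the notice/consent form version in force at collection
    legal_basis: str               # e.g. "explicit consent" or "research exemption"
    permitted_uses: List[str]      # uses covered by the original notice and consent
    transfer_mechanism: str        # basis relied on for any cross-border transfers
    modification_history: List[str] = field(default_factory=list)  # who changed what, and when

# Example record for an in-licensed data set (all values invented for illustration).
record = ProvenanceRecord(
    dataset_id="ds-0001",
    source="Licensed biobank cohort",
    collected_under="Consent form v2 (2019)",
    legal_basis="explicit consent",
    permitted_uses=["original study", "related secondary research"],
    transfer_mechanism="standard contractual clauses",
)

# A simple screening check a diligence team might run across many such records:
covers_ai_training = any("ai" in use.lower() or "model training" in use.lower()
                         for use in record.permitted_uses)
print(f"{record.dataset_id}: AI training expressly covered? {covers_ai_training}")
```

The exact schema matters less than the discipline: if fields like these cannot be populated for a given data set, that gap is itself relevant to diligence.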

If data is later discovered to be “unclean”—as a result of improper collection or handling—researchers have a menu of unsavory options. The data can be permanently anonymized or de-identified, though this can render it useless for future research purposes.

Such anonymization may also be impossible with certain types of data, such as genetic data. In some cases, the only recourse has been to obtain new consents by locating patients and providing them with updated forms that explicitly describe data uses not contemplated in the original notice and consent documentation.

Intellectual property considerations also arise. Various types of intellectual property rights will protect the content used to train AI models, and, absent appropriate permission or an applicable defense or exception at law, the processes involved in training an AI model will amount to infringement of those rights.

Many pharma and biotech companies will have procured datasets from third-party providers, such as biobanks, research institutions or commercial data vendors.

These licensing arrangements may contain express or implied restrictions on some of the technical steps involved in developing or deploying AI models.

More existentially, the ownership and licensing provisions in those agreements require particularly careful consideration, as they may provide the data licensor with an argument to assert ownership over improvements and derivative works that are made using the licensed data. This would arguably include wider training datasets and even the resultant AI model itself.

These downstream complications underscore why thorough vetting of data provenance is critical during the due diligence phase. Undiscovered defects in data collection or handling (or their related consents and licensing terms) can significantly impair the value of a data asset, expose the acquiring party to regulatory risk, and necessitate costly remediation efforts that could have been identified and addressed previously.

What is “secondary use,” and is it a cause for concern?

Most clinical trial consent forms include a standard disclosure authorizing secondary use, which is permission to use data for different but usually related future purposes, such as broader research applications.

The scope of this disclosure is relevant to assessing potential legal risks and compliance steps around use of data for AI purposes, whether as training data or as an input.

However, while such disclosures are standard, there is significant variation in how specific or broad they are. We have observed that some early-stage and emerging growth companies take shortcuts on consent forms, relying on outdated, incorrect, or AI-generated templates.

The challenges are likely greater for older consent forms that would not have anticipated specific AI applications. While some of this data may be subject to public interest or research exemptions, these exceptions do not always provide enough comfort or latitude to use data in novel ways.

Following an acquisition, a buyer may need to locate patients to re-consent for secondary use, which can be time-consuming and uncertain.

Given data provenance concerns—what should buyers ask during due diligence?

Strategic buyers and investors should therefore vet data provenance when considering an acquisition or partnership.

During due diligence, they should probe whether the target has established robust privacy and AI governance processes around data collection and use. They should pay particular attention to data used to train AI models, as well as data held in external databases from which a model can retrieve information to supplement a given prompt (retrieval-augmented generation).

Key questions include whether the target has relied on publicly available datasets from bodies such as the FDA, which may carry fewer consent-related risks but still warrant scrutiny for accuracy and permitted uses; and, where the target has in-licensed data from third parties, the terms and scope of those licenses.

More broadly, buyers should also assess whether relevant data was collected in compliance with applicable laws (privacy and intellectual property) and whether AI models were trained on properly permissioned data.

It is also important to understand whether the target has tracked its risk exposure where models have ingested potentially unusable datasets, and whether it has considered the impact on model performance if corrupted or noncompliant inputs must be deleted.

Finally, buyers should evaluate whether appropriate cybersecurity measures are in place to protect models containing personal information, including access controls that limit both model access and output distribution to authorized personnel. This should be based on a detailed understanding of how the buyer intends to use the data after the acquisition.

The legal risks around AI are always driven by the use case. Risk varies significantly with the computational approaches a buyer intends to deploy, the types of models involved, and the development techniques used. A buyer needs a legal team with a deep understanding of AI technologies.

What about dealmaking and risk?

Representations and warranties offer one mechanism for surfacing the specifics of data sets and uncovering possible shortcomings.

Purchase agreements should include representations confirming that the seller lawfully sourced or acquired the underlying data, holds all rights necessary to use that data in training AI models, and may transfer the data (and the corresponding model weights and code for the AI models) to the buyer.

Buyers should also seek confirmation that they may continue using the data for both intended and future purposes without obtaining additional consents or licenses.

Buyers can try to negotiate corresponding indemnities to further allocate risk, although these can be difficult to obtain. Indemnification provisions are particularly valuable where the target cannot fully substantiate data provenance.

Buyers may also consider whether escrow arrangements or purchase price adjustments tied to data quality—or contingent on completion of specified remediation steps—are warranted given the circumstances.

Buyers should recognize, however, that representations and warranties insurance may not provide complete protection, or any protection at all, for data-related defects. Coverage gaps are especially likely where representations are qualified by knowledge or materiality thresholds, or where disclosure schedules carve out specific data sets.

We are also seeing broad AI-related exclusions from policies, although the market is slowly moving to AI-specific cover. In practice, buyers should be prepared to undertake data remediation efforts post-closing and factor the potential cost of such efforts into their transaction analysis.

Are there workarounds when it comes to structuring a deal?

Sometimes you simply don’t need the data. Some deals or agreements separate the intellectual property or assets from the underlying training data. That will not always be the case, but it is worth exploring whether particular datasets are genuinely necessary to the overall value of the deal.

This approach may allow a buyer to purchase AI models or algorithms while leaving the underlying data with the seller, thereby reducing exposure to data provenance risks.

For instance, pharma companies are increasingly licensing AI models, but not the underlying data used to train them. Many AI experts argue that, in principle, once a model is trained it does not “store” the data.

This is a technical question at issue in multiple copyright lawsuits globally. EU privacy regulators have also addressed it, saying that the weights for an AI model can indeed be personal data to the extent they are a statistical representation of the underlying training data.

What are the main regulatory frameworks that apply?

Overlapping regulatory frameworks govern data use and protection, many of which predate the AI era and draw from established privacy principles.

There is increasing fragmentation globally, driven by geopolitics and the diverse strategic objectives of different governments and policymakers.

The U.S. and the EU are now moving towards targeted AI-specific regulations and guidance, with early signs of harmonization.

Accordingly, in January 2026, the European Medicines Agency (EMA) and the U.S. Food and Drug Administration (FDA) jointly published Guiding Principles of Good AI Practice in Drug Development, which address AI use from research through manufacturing, as well as pharmacovigilance (drug safety monitoring).

Organizations operating in the U.S. must navigate an uncertain domestic landscape, complicating compliance.

At the federal level, legislative and executive priorities remain in flux and sometimes at odds—with one camp promoting AI dominance and another addressing safety concerns.

In the meantime, several states have enacted their own rules. The Trump Administration has signaled its intent to override or delay state-level AI regulations it views as obstructing national policy.

Historically, a significant portion of clinical trial data fell outside the scope of the Health Insurance Portability and Accountability Act (HIPAA), either due to how the trial was conducted or because an applicable exemption applied.

That calculus shifted as researchers increasingly turned to “real world” data sets derived from electronic health records, which fall under HIPAA. While such data can still be used, doing so requires additional processing—whether through de-identification, obtaining appropriate consents, or careful analysis of available exemptions.

Any such use of protected health information (PHI) to train AI models triggers the full suite of HIPAA obligations: the privacy, security, and breach notification rules all apply. Organizations deploying AI that touches PHI must execute a Business Associate Agreement (BAA) with their AI vendors, adhere to the “minimum necessary” standard, and implement robust technical safeguards such as encryption and access controls.

HHS will treat a vendor as a business associate based on the nature of the services it provides, regardless of whether a BAA is in place, and the covered entity bears most of the regulatory risk for failing to formalize the relationship.

In January 2025, the FDA issued draft guidance outlining a risk-based framework for integrating AI and machine learning into drug development.

The guidance emphasizes several core principles: sponsors must clearly define the “context of use” for any AI model, conduct a structured risk assessment proportionate to the model's impact on regulatory decisions, and maintain transparency around model design and limitations.

Critically, the FDA underscores that data integrity remains foundational; organizations must be able to demonstrate that the data used to train and validate AI models is fit for purpose and sufficient to support claims of safety and efficacy.

The guidance makes clear that AI should augment, not replace, human expert judgment in drug development. For dealmakers, this framework signals the types of documentation and validation processes that acquirers should expect to see (and diligence carefully) in any life sciences transaction involving AI-enabled research or development.

The EU AI Act represents the first effort to enact cross-sector harmonized AI-specific legislation, although its application to drug discovery is limited in practice. The EU appears to be loosening its most stringent AI rules in response to calls by the 2024 Draghi Report to boost Europe’s global competitiveness.

In any case, EU-sourced patient data is subject to the GDPR, which remains the region’s most relevant source of compliance requirements. The GDPR requires data controllers to have a lawful basis for processing personal data, which is not necessarily straightforward when the data was originally collected for other purposes.

Health data is “special category” data under the GDPR, attracting additional compliance requirements, in particular under Article 9. Organizations must have secured explicit consent or be able to demonstrate that the processing is necessary for research and statistical purposes (narrowly construed) or for public health.

Arguably the biggest GDPR hurdle is the purpose limitation, under which controllers are not permitted to use personal data for purposes that are incompatible with those for which the data was originally collected (secondary use).

Other key requirements include transparency (which raises complexities for some types of AI system or where the data has been aggregated with third-party data sources), data minimization (limiting the collection of personal information to what is directly relevant and necessary to accomplish a specified purpose), security, and strict cross-border transfer mechanisms.

The position under the UK GDPR is broadly equivalent, with nuances in the application and interpretation by local regulators.

To learn more about how EU and U.S. regulations impact M&A involving AI, please read this article.

To what extent are regulators thinking about data provenance, and is there a risk of disgorgement?

In the U.S., disgorgement and data-destruction remedies are a growing risk for companies that improperly train AI models on healthcare data. While no regulator has yet imposed a major disgorgement penalty on a large-scale AI model, potential acquirers should consider this possibility.

The Federal Trade Commission (FTC) has applied disgorgement-style monetary penalties, substantial civil fines, mandatory deletion of improperly collected data, permanent prohibitions on sharing health information without explicit consumer consent, and injunctive relief.

Targets have included health-focused technology companies such as BetterHelp, GoodRx, Flo Health, and Kochava. The FDA has focused its own enforcement on product safety rather than data misuse, employing tools such as warning letters, injunctions, consent decrees, and product recalls.

It is unclear whether data provenance failures are insurable, and a buyer’s representations and warranties insurance may exclude or limit coverage for regulatory fines. If a target’s executives were aware that data was improperly processed, acquirers could face difficulties recovering losses, as well as potential fraud exposure.

In the cyber context, a hack could reveal shoddy data practices and require notification to authorities and data subjects. After an acquisition, this obligation often rests with the data owner, meaning an acquirer may need to contact data subjects with whom it has no relationship.

Regulatory risk can occur even without a data breach. In a worst-case scenario, models trained on unreliable, improperly weighted, or unlawfully obtained data could produce flawed outputs that inform clinical trial design or drug development decisions, potentially contributing to patient harm and product liability exposure.

Any final takeaways?

Data provenance is a growing risk factor for large pharmaceutical companies seeking to acquire or partner with AI-driven biotechs.

Smaller firms racing to scale datasets may rely on sources that lack full consent, robust documentation, or clear legal rights, which can leave a pharma company exposed to regulatory, operational, and reputational consequences.

Legal and compliance teams must learn the nuances of different AI technologies, frontier use cases and the factors that affect legal risk.

Depending on the model and use case, it may be appropriate to impose stringent provenance expectations during due diligence, scrutinizing sourcing and consent frameworks.

Strong AI-specific representations and warranties help to flush out information, and there is a growing toolkit of novel deal terms (including indemnities) to allocate data quality risks.

Our cross-practice and global AI Diligence Unit combines expertise in all relevant legal areas to deliver technology-led, efficient diligence on AI risks.
