Conformal Prediction In Drug Discovery
Introduction
Conformal prediction (CP) is a statistical framework for quantifying uncertainty in machine learning models that provides rigorous confidence guarantees on predictions. Unlike conventional models that output point estimates or uncalibrated probabilities, a conformal predictor produces a prediction set or interval that is guaranteed to contain the true value with a user-specified probability (confidence level) under minimal assumptions. For example, at 90% confidence, a CP for regression will output an interval that covers the true experimental value in at least 90% of cases. This property, known as validity, holds distribution-free – it does not rely on any specific data distribution or model type, only requiring an exchangeability (roughly, i.i.d.) assumption. In other words, CP offers distribution-agnostic uncertainty quantification: no matter if one uses a random forest or a deep neural network, the CP framework can wrap around it and produce well-calibrated prediction regions with guaranteed coverage. Furthermore, CP is model-agnostic and imposes little computational overhead, making it easy to integrate with any machine learning algorithm. These advantages – minimal assumptions, guaranteed marginal coverage, and easy integration – have led to growing interest in CP for high-stakes applications.
In drug discovery, the ability to quantify prediction uncertainty is especially crucial. Errors in this domain carry high costs: a false positive (e.g. predicting a molecule will be active when it is not) can waste substantial resources on synthesis and experimental testing, whereas a false negative might cause a promising drug candidate to be overlooked. CP’s guarantee that a specified fraction of predictions will be correct helps limit such false leads. In fact, using CP has been shown to reduce false positives and improve the hit rate in virtual screening, since only predictions deemed sufficiently confident are taken as reliable. Drug discovery data also pose challenges like data scarcity and noise. Models are often trained on relatively small, biased datasets – for example, limited assays for a novel target – which restricts their applicability domain. Experimental measurements (e.g. bioassay readouts) can be noisy and variable, adding uncertainty to any predicted potency or ADMET property. CP directly addresses these issues by providing explicit measures of uncertainty alongside predictions, rather than overconfident point estimates. A conformal predictor might output a wide confidence interval for a molecule when data is scarce or noisy, signaling to researchers that the prediction is unreliable and more data or experiments are needed.
Another motivation for CP in this field is the increasing emphasis on trustworthy and explainable AI in healthcare-related decisions. Regulatory agencies and industry guidelines often require a clear characterization of a model’s domain of applicability and confidence in its predictions. Traditional approaches to define an applicability domain (AD) in QSAR modeling, such as distance-based or ensemble-based heuristics, lack formal statistical guarantees. CP offers a transparent and formal alternative to applicability domain estimation – effectively, each prediction comes with a guarantee of validity under the stated confidence level. Since its introduction to QSAR modeling by Norinder et al. (2014), conformal prediction has been applied in various drug discovery contexts, including virtual screening campaigns, toxicity prediction, and even clinical trial outcome modeling. By producing individualized confidence bounds for each prediction, CP enables more informed decision-making – for instance, prioritizing compounds with both high predicted activity and narrow confidence intervals for synthesis. In summary, CP’s ability to deliver valid prediction sets with guaranteed coverage makes it a powerful tool to increase the reliability and efficiency of drug discovery pipelines, where decisions must carefully balance risk and reward.
Mathematical Foundations of CP
Exchangeability. The theoretical guarantee of conformal prediction rests on the assumption of exchangeability of data. A sequence of examples $(z_1, z_2, \dots, z_n)$ (where each $z_i = (x_i, y_i)$ includes features and a label) is exchangeable if its joint probability distribution is invariant under permutation – informally, the data points have no inherent order. Exchangeability is a slightly weaker condition than full independence and identical distribution (i.i.d.): it allows for any ordering of samples as long as there is no temporal or contextual bias. In the drug discovery context, this means we assume that the training, calibration, and test compounds are all drawn from the same underlying distribution (e.g. the same chemical space or experimental protocol). If this assumption holds, no data point carries privileged information (such as being an out-of-distribution example), and the symmetry under permutations can be exploited to derive rigorous confidence statements. Why is this important? Under exchangeability, one can prove that CP prediction sets achieve the nominal coverage probability exactly (or conservatively) in finite samples. Essentially, because the new test sample could equally well have appeared as part of the calibration data, its rank among the calibration nonconformity scores is uniformly distributed. This rank invariance implies validity of the CP p-values and prediction sets. If exchangeability is violated – for example, due to covariate shift (the test compounds differ systematically from training, as in an evolving chemical library or a new scaffold series) or temporal drift (data from a later time have different properties than earlier data) – the CP coverage guarantee may no longer hold. In practice, drug discovery datasets often approximate exchangeability (e.g. when drawing a diverse screening subset at random), but there are notable exceptions. Eklund et al. observed that in a five-year collection of in-house screening data, temporal correlations led to weakened validity of conformal predictors (actual coverage fell below nominal). They showed that periodically updating the model and recalibrating (a “semi-off-line” CP approach) could partially restore validity. This highlights that careful attention must be paid to data sampling – whenever the i.i.d. assumption is questionable, CP users should check for coverage deviations or use techniques to mitigate dataset shift.
Nonconformity Measures. At the heart of conformal prediction is the concept of a nonconformity measure (also called a strangeness or nonconformity score function). This is a function $A(z)$ that quantifies how “atypical” or nonconforming an example $z = (x,y)$ is relative to a reference set (typically the training or calibration set). Intuitively, $A(z)$ is large if the example does not “fit in” with the others – for instance, if the model’s prediction for $x$ was very different from the true $y$, or if $x$ lies far from the training chemical space. The choice of nonconformity measure is up to the practitioner and serves as a plug-in for domain knowledge. It plays a crucial role: it determines how the model’s errors are judged and thus affects the size (efficiency) of the prediction sets. A simple and common choice in regression tasks (e.g. predicting a compound’s $pIC_{50}$ or log-solubility) is the absolute residual error:
$$\alpha_i = \left| y_i - \hat{y}_i \right|$$
where $\hat{y}_i$ is the predicted value for compound $i$ given by a regression model (trained on a proper training set). Intuitively, compounds for which the model makes a large error are considered more nonconforming. In classification tasks (e.g. active vs inactive, or multi-class toxicity outcomes), a popular nonconformity score is based on the model’s estimated class probabilities. For an example $z_i=(x_i,y_i)$ with true class $y_i$, one can define:
$$\alpha_i = 1 - \hat{P}(y_i \mid x_i)$$
i.e. one minus the predicted probability assigned to the true class by the model. This score is near 0 for instances that the model predicts with high confidence (since $\hat{P}(\text{true class})$ is high) and close to 1 for instances the model finds confusing (low predicted probability for the true label). Variants of this for multi-class classification include using the difference between the top predicted probability and that of the true class (a margin-based nonconformity). Many implementations use the margin for binary classification; for example, if a Random Forest outputs the fraction of trees voting “active”, one can take $\alpha = 1 - p_{\text{true}}$ (which reduces to 0 if all trees predict the true class, or 0.5 if the model is completely unsure). More generally, any metric of “prediction uncertainty” or “distance from the training data” can serve as a nonconformity measure. This means CP can flexibly incorporate traditional applicability domain metrics: for example, in chemical space modeling one could define $\alpha_i$ as the distance from molecule $x_i$ to its nearest neighbor in the training set, or an ensemble-based uncertainty (such as the entropy of a predictive distribution). Crucially, the CP validity guarantee holds for any choice of nonconformity function – the trade-off is that a poor choice can lead to very conservative (wide) prediction sets or reduced efficiency. In practice, one tries to choose $A(z)$ that correlates with prediction error: this yields tighter intervals while still capturing most errors. Nonconformity measures leveraging model-specific information (e.g. the margin from an SVM hyperplane, or attention weights in a graph neural network) can specialize the conformal predictor to the problem at hand.
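To make these score definitions concrete, the following minimal sketch computes calibration nonconformity scores for a regression and a classification model. It is illustrative only: the scikit-learn models, synthetic arrays, and all variable names are assumptions, not part of any specific published workflow.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic stand-ins for molecular descriptors and labels (hypothetical data).
X_train, y_train_reg = rng.normal(size=(200, 16)), rng.normal(size=200)
X_cal,   y_cal_reg   = rng.normal(size=(50, 16)),  rng.normal(size=50)
y_train_cls = rng.integers(0, 2, size=200)   # e.g. inactive = 0, active = 1
y_cal_cls   = rng.integers(0, 2, size=50)

# Regression nonconformity: absolute residual |y - y_hat| on the calibration set.
reg = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train_reg)
alpha_reg = np.abs(y_cal_reg - reg.predict(X_cal))

# Classification nonconformity: 1 - predicted probability of the true class.
# (Columns of predict_proba follow clf.classes_, which is [0, 1] here.)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train_cls)
proba = clf.predict_proba(X_cal)
alpha_cls = 1.0 - proba[np.arange(len(y_cal_cls)), y_cal_cls]
```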
P-Values and Prediction Sets. Given a nonconformity measure, conformal prediction proceeds by computing p-values for new samples to decide which labels to include in the prediction set. Formally, suppose we have a calibration set of $n$ examples (held out and not used in model training) with computed nonconformity scores $\{\alpha_1, \alpha_2, \dots, \alpha_n\}$. For a new test example with features $x_{n+1}$, and for a candidate label $y$, we first compute the nonconformity score $\alpha_{n+1} = A((x_{n+1}, y))$ by evaluating how strange $(x_{n+1}, y)$ would be relative to the calibration data. In classification, this means we temporarily assign the label $y$ to $x_{n+1}$ and compute the score; in regression, one typically uses the model’s point prediction $\hat{y}_{n+1}$ for $x_{n+1}$ and considers potential residuals. The conformal p-value for the new example (with label $y$) is then calculated as:
$$p^{\,y}_{n+1} = \frac{\left|\{\, i = 1, \dots, n : \alpha_i \ge \alpha_{n+1} \,\}\right| + 1}{n + 1}$$
where we add 1 to the count to account for the new sample itself in the ordering. This p-value essentially measures how “compatible” the new example is with the distribution of nonconformity scores seen in calibration – it is the fraction of calibration instances that are at least as strange as the new one (including the new one in the denominator as a random tie-break). A high p-value (close to 1) means the new $(x_{n+1}, y)$ looks very typical (many calibration points have higher nonconformity), whereas a low p-value means $(x_{n+1}, y)$ is an outlier in terms of nonconformity. To make a prediction, we consider the set of labels for which the null hypothesis “the new example conforms as well as the calibration data” is not rejected at the significance level $\varepsilon$. In practice, one chooses a desired confidence level $1-\varepsilon$ (e.g. 0.80 or 0.95), which corresponds to an allowed error rate $\varepsilon = 0.20$ or 0.05. The prediction set output by a conformal classifier is then:
$$\Gamma_{1-\varepsilon}(x_{n+1}) = \{\, y : p^{\,y}_{n+1} > \varepsilon \,\}$$
the set of all labels that achieve p-value above the significance threshold. This set is guaranteed to contain the true label with probability at least $1-\varepsilon$ (over the randomness of the training/calibration data). In a regression setting where $y$ is a real-valued property, the prediction set $\Gamma$ is typically a continuous interval. A convenient formulation in inductive conformal regression is to use the quantile of calibration residuals: for confidence $1-\varepsilon$, let $Q$ be the $\lceil (1-\varepsilon)(n+1) \rceil$-th order statistic of $\{\alpha_1,\dots,\alpha_n\}$ (roughly the $(1-\varepsilon)$ quantile of the calibration errors). Then one can simply set:
$$\Gamma_{1-\varepsilon}(x_{n+1}) = \left[\, \hat{y}_{n+1} - Q,\; \hat{y}_{n+1} + Q \,\right]$$
i.e. an interval around the model’s point prediction, with half-width equal to that quantile. By construction, at most $\varepsilon$ fraction of calibration points had residuals larger than $Q$, so the new interval will fail to cover $y_{n+1}$ at most $\varepsilon$ fraction of the time (in expectation). This simple recipe is often referred to as split conformal prediction in regression settings.
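As an illustration of this recipe, here is a minimal split conformal regression sketch in Python. The data, model choice, and names are assumptions for illustration; the quantile index follows the order-statistic definition above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
eps = 0.10  # significance level -> 90% confidence intervals

# Hypothetical descriptor matrices and a continuous endpoint (e.g. pIC50).
X_train, y_train = rng.normal(size=(300, 16)), rng.normal(size=300)
X_cal,   y_cal   = rng.normal(size=(100, 16)), rng.normal(size=100)
X_test           = rng.normal(size=(5, 16))

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

# Calibration nonconformity scores: absolute residuals.
alphas = np.abs(y_cal - model.predict(X_cal))

# Q = the ceil((1 - eps) * (n + 1))-th smallest calibration score.
n = len(alphas)
k = int(np.ceil((1 - eps) * (n + 1)))
Q = np.sort(alphas)[min(k, n) - 1]   # guard against k > n for very small eps

# 90% prediction intervals: point prediction +/- Q.
y_hat = model.predict(X_test)
intervals = np.column_stack([y_hat - Q, y_hat + Q])
print(intervals)
```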
CP Variants: Transductive, Inductive, and Split. There are a few formulations of conformal prediction, differing in how they utilize the data for calibration:
- Full (Transductive) Conformal Prediction: In the original formulation by Vovk et al., conformal prediction is applied in a transductive or online manner, where each test example is handled one at a time in sequence. For each new point, the model is trained on all available training data, and then for each possible label value, the nonconformity score is computed and compared against all training points (often via a leave-one-out procedure). This yields exact p-values for that test point. Then the test point (with its true label, once revealed) can be added to the set, and the next point is predicted. Transductive CP uses all $n$ training samples plus the test sample in question for calibration, and thus it achieves maximum utilization of data and provably exact validity. However, it is extremely computationally intensive: for each test sample and each candidate label, one may need to re-train or at least re-evaluate the model (or nonconformity function) $n+1$ times. In drug discovery problems with large datasets or many test compounds, full CP is often impractical.
- Inductive Conformal Prediction (ICP): To improve efficiency, inductive or split conformal prediction uses a hold-out calibration set disjoint from the training set. The model is trained on a proper training set (e.g. 70% of data), and then a separate calibration set (e.g. 20%) is passed through the model to collect nonconformity scores. These calibration scores (together with the exchangeability assumption) serve as a representative distribution for model errors. At prediction time, the model (trained on the proper training set) is applied to a test example once to get the necessary quantities (e.g. predicted label or score), and the calibration scores are used to compute the p-value as described above. In this way, ICP avoids retraining or expensive leave-one-out loops for each test point – the heavy lifting is done only once on the calibration set. The trade-off is that we waste some data for calibration instead of training, potentially reducing model accuracy slightly; but in large datasets this is negligible, and in small data scenarios one can use cross-validation variants. Inductive CP is the most widely used approach in cheminformatics applications because it scales well to modern datasets and integrates neatly with typical train/test splits. Notably, ICP still provides the same validity guarantee: under exchangeability, $\Pr(y_{n+1} \in \Gamma_{1-\varepsilon}(x_{n+1})) \ge 1-\varepsilon$ for any model and any nonconformity function. This guarantee is marginal (averaged over all test points); how to ensure stronger conditional guarantees is a topic of research. A minimal code sketch of this workflow for classification follows this list.
- Split Conformal Prediction: In statistics literature, the term “split conformal” often refers specifically to the inductive method for regression introduced by Papadopoulos et al. and further popularized by Lei et al. – essentially the method of using a calibration set and outputting interval predictions by the quantile method described above. In this article, we treat “split conformal” as synonymous with inductive conformal (since both involve a data split into train and calibration sets). Some authors distinguish them in that ICP computes a p-value for each label and can handle classification with prediction sets, whereas split conformal (a close relative of conformalized quantile regression in modern deep learning contexts) focuses on regression intervals. The underlying theory is the same, with slight differences in implementation. For completeness, we note there are also variants like cross-conformal prediction and jackknife+, which attempt to use all data for training while still providing valid intervals, by combining multiple splits or leveraging cross-validation. These are beyond our scope but can be useful for small datasets (common in drug discovery) where one hesitates to sacrifice data for calibration.
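The sketch below illustrates the inductive workflow for binary classification: calibrate once, then build a prediction set for each new compound from conformal p-values. The model, synthetic data, and the `prediction_set` helper are assumptions for illustration, not a reference implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
eps = 0.20  # 80% confidence

# Hypothetical descriptors and active(1)/inactive(0) labels.
X_train, y_train = rng.normal(size=(400, 16)), rng.integers(0, 2, size=400)
X_cal,   y_cal   = rng.normal(size=(100, 16)), rng.integers(0, 2, size=100)
X_test           = rng.normal(size=(3, 16))

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Calibration scores: 1 - P(true class).
cal_proba = clf.predict_proba(X_cal)
cal_alpha = 1.0 - cal_proba[np.arange(len(y_cal)), y_cal]

def prediction_set(probs_one, eps, cal_alpha):
    """Return the class labels whose conformal p-value exceeds eps."""
    labels = []
    for cls, p_cls in enumerate(probs_one):      # columns follow clf.classes_ = [0, 1]
        alpha_new = 1.0 - p_cls                  # score if this label were the truth
        p_value = (np.sum(cal_alpha >= alpha_new) + 1) / (len(cal_alpha) + 1)
        if p_value > eps:
            labels.append(cls)
    return labels

for probs in clf.predict_proba(X_test):
    print(prediction_set(probs, eps, cal_alpha))  # e.g. [1], [0, 1], or [] (null)
```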
In summary, conformal prediction provides a distribution-free, model-agnostic way to obtain prediction sets with finite-sample guarantees. The keys are: (1) an exchangeability assumption ensuring the calibration data represent the test distribution, (2) a nonconformity measure quantifying how unusual new points are, and (3) the mathematical machinery to convert nonconformity ranks into valid p-values and confidence sets. When these pieces come together, we obtain the powerful result that, for any requested confidence level (say 95%), the method will output a prediction band that contains the true value at least 95% of the time, regardless of the underlying model or data distribution. This level of reliability is especially appealing in drug discovery applications, as we discuss next.
Types of CP and Nonconformity Scores in Drug Discovery
The conformal prediction framework has several extensions and variants that have been tailored to practical considerations in drug discovery. Here we compare some common approaches and discuss how to design nonconformity scores for typical tasks in this domain. Table 1 provides a summary of key CP variants and their characteristics, followed by a discussion of nonconformity choices for regression vs classification, and specialized data types.
CP Variants and Their Trade-offs. A first important distinction is between transductive (full) CP and inductive CP, which we introduced above. Transductive CP uses all available labeled data for each prediction and thus achieves exact finite-sample validity without setting aside any data for calibration, at the cost of heavy computation. Inductive CP sacrifices some theoretical elegance (validity is marginal over an ensemble of splits) in exchange for efficiency, making it the default for most real-world applications. In the context of drug discovery, where one might need to predict properties for millions of candidate compounds in a virtual screening library, inductive CP is essentially the only feasible option. Another consideration is the handling of imbalanced data and heterogeneous tasks, which is common in drug discovery (e.g. active compounds are often a tiny fraction of a screening library). A specialized variant called Mondrian Conformal Prediction (MCP) addresses this by conditioning the conformal procedure on predefined categories (named after Mondrian’s partitioning). In Mondrian CP, calibration is done within each category or class. For example, in a binary active/inactive classification, one maintains separate calibration score distributions for actives and inactives. This way, one can guarantee class-conditional validity – e.g., 90% of active compounds will have their true label in the prediction set at 90% confidence, and 90% of inactive compounds will as well (each class treated independently). This is extremely useful in imbalanced datasets: a regular CP might achieve 90% overall coverage but perhaps only 50% coverage on the minority class, whereas Mondrian CP ensures the error rate is controlled for each class. The cost is that the calibration data is effectively split by class, so one needs enough data in each stratum to calibrate reliably. In practice, Inductive Mondrian CP (combining both strategies) is widely used for classification in drug discovery, as noted by Norinder et al. and others.
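A minimal sketch of class-conditional (Mondrian) calibration for the binary case is shown below, extending the classification example above. The only change is that a candidate label's score is compared against calibration scores from that class alone; the imbalanced synthetic data and all names are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)
eps = 0.10

# Hypothetical imbalanced screening data: roughly 10% actives.
X_train = rng.normal(size=(500, 16))
y_train = (rng.random(500) < 0.1).astype(int)
X_cal   = rng.normal(size=(200, 16))
y_cal   = (rng.random(200) < 0.1).astype(int)
X_test  = rng.normal(size=(3, 16))

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Per-class calibration score distributions (Mondrian conditioning on the label).
cal_proba = clf.predict_proba(X_cal)
cal_alpha = 1.0 - cal_proba[np.arange(len(y_cal)), y_cal]
alpha_by_class = {c: cal_alpha[y_cal == c] for c in (0, 1)}

def mondrian_prediction_set(probs_one, eps):
    labels = []
    for cls, p_cls in enumerate(probs_one):      # columns follow clf.classes_
        scores = alpha_by_class[cls]             # compare only within this class
        p_value = (np.sum(scores >= 1.0 - p_cls) + 1) / (len(scores) + 1)
        if p_value > eps:
            labels.append(cls)
    return labels

for probs in clf.predict_proba(X_test):
    print(mondrian_prediction_set(probs, eps))
```

Note that the minority-class calibration set here is small (a few dozen compounds), which is exactly the practical limitation mentioned above.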
Other variants worth noting include cross-validation ensembles of CP (to use all data without a separate calibration set, at the expense of multiple model fits) and online CP for sequential data. In drug discovery, sequential (online) CP could be relevant for active learning or iterative screening settings, where models are updated as new compounds are tested. Researchers have proposed semi-offline CP to handle temporally ordered data, updating the calibration set in batches over time to deal with drift. There are also extensions like Venn predictors that output probability distributions with guaranteed calibration, but these are less common in the literature compared to CP and will not be our focus.
Table 1 below contrasts some key types of conformal prediction as applied to drug discovery problems:
CP Variant | Description | Pros | Cons | Use in Drug Discovery |
---|---|---|---|---|
Transductive CP (Full) | Uses all training data and incorporates each test point on the fly; computes p-value by considering the test point among training points (leave-one-out or retraining for each test). | Exact validity for each test sample; uses maximum data for training. | Computationally prohibitive for large $n$ or many tests; not practical for high-throughput screening. | Rarely used explicitly due to cost; conceptually important as baseline. |
Inductive CP (ICP) | Splits data into training set and separate calibration set; train once, then calibrate residuals; predict test points without retraining. | Efficient (single model training); easy to implement; valid marginal coverage. | Uses fewer samples for training (since some are set aside for calibration); validity is marginal, not conditional. | Most common approach (e.g. conformal QSAR models); standard for large-scale predictions. |
Mondrian CP (conditional) | Calibration is stratified by a context (e.g. class label or data cluster). Each stratum has its own nonconformity distribution. | Guarantees coverage within each category (class-conditional validity); handles imbalance better. | Requires sufficient calibration data per category; categories must be known upfront (cannot handle totally novel class). | Frequently used for classification (active/inactive) and scenarios with known subpopulations (e.g. separate models for assay batches). |
Cross-Conformal / Jackknife+ | Averages or combines multiple CP predictors (e.g. via cross-validation) to avoid a single calibration split. | Uses all data for training (no data “wasted” purely for calibration); often yields tighter intervals than single-split. | More computational cost (train $k$ models for $k$-fold CV); theoretical guarantees become slightly more complex (typically still valid). | Used in research when data is very limited (e.g. small in vivo datasets) to maximize data usage. Not yet common in routine practice. |
Online/Semi-offline CP | Continuously update the CP with new data in sequential manner, or periodically recalibrate to account for drift. | Adapts to non-stationary data; maintains validity over time if done properly. | Implementation complexity; risk of violation of exchangeability if not carefully reset; needs recalibration steps. | Applicable in iterative design-make-test cycles and evolving bioassays; Eklund et al. used periodic recalibration for 5-year project data. |
Nonconformity Measures for Regression vs. Classification. The choice of nonconformity measure $A(z)$ in a conformal predictor has a large impact on the efficiency (size) of the prediction sets. In drug discovery tasks, one typically leverages the output of the underlying QSAR or ML model to define $A$. We already described common choices: absolute error for regression, and $1 - p(\text{true class})$ for classification. We reiterate these with context:
- Regression: For a property like binding affinity or pharmacokinetic endpoints (logD, pIC50, etc.), a natural nonconformity score is the difference between predicted and observed value. Often the absolute error $|y - \hat{y}|$ is used. If the model tends to have larger errors on certain ranges, sometimes scaled residuals or other error metrics can be used. Another strategy is to use quantile regression models to estimate prediction intervals directly, and then apply CP on the quantile regression error – this can yield tighter intervals by incorporating heteroscedasticity (variable noise levels). Simpler yet, one may take the absolute standardized residual, if an estimate of uncertainty per compound is available (for example, from an ensemble variance).
- Classification: For a typical binary classification (active vs inactive, toxic vs non-toxic), one can take the model’s confidence in the predicted class as a measure of conformity. If $\hat{p}$ is the predicted probability for the positive class, a reasonable nonconformity is $\min\{\hat{p}, 1-\hat{p}\}$ (which is 0.5 for a completely uncertain prediction and near 0 for a very confident prediction). This coincides with $1 - \hat{P}(y_{\text{true}} \mid x)$, as noted earlier, whenever the predicted class is the true class. In multi-class settings, a common choice is $1 - \hat{P}(y_{\text{true}} \mid x)$ as well, which again corresponds to the model’s confidence in the true class. Some works instead use the difference between the top predicted probability and the true class probability (so that if the true class is not the top prediction, the score is large). The fraction of trees voting for the predicted class in a Random Forest is an example used in early CP papers – effectively an uncalibrated confidence. Notably, if one is using Mondrian CP for class conditioning, the nonconformity measure should be chosen in a label-neutral way (since separate calibration is done per class). In practice, however, many implementations simply use the same formula restricted to each class’s calibration data.
- Hybrid / Other Measures: Some interesting problem-specific nonconformity scores have been proposed in drug discovery. For example, for graph neural network models that predict molecular properties, one can design $A(z)$ to incorporate model-internal uncertainty signals: an approach might be to use the entropy of the model’s output distribution (for classification) as $\alpha$ (higher entropy = more nonconforming), or use metrics like the attention weights in an attentive GNN as a proxy for confidence. As a hypothetical example, one could define $A(z)$ = 1 - (average attention weight on relevant substructure), so that molecules the GNN pays diffuse attention to are deemed strange. For ligand-based models, the distance in latent space (embedding space) to the nearest training compound is a sensible nonconformity measure – this directly connects to applicability domain: a novel chemotype far from any training point should be considered nonconforming (leading to a large prediction set). Indeed, Cortés-Ciriano et al. note that any metric used for applicability domain determination (like similarity or distance) can be plugged into CP as a nonconformity measure. In toxicity prediction, sometimes consensus models are used; one could define $A$ as the disagreement among models, e.g. variance of predictions from an ensemble (with higher variance indicating more nonconformity).
Because nonconformity scores are so flexible, researchers often experiment to find a measure that yields narrow intervals while still maintaining validity. A poor choice of nonconformity (for instance, a score uncorrelated with true errors) will still give valid coverage in theory, but may produce unwieldy prediction sets (low efficiency) or flag almost all predictions as “unreliable”. For example, if one chose a completely random nonconformity score, the method would technically cover 90% of true values at 90% confidence, but it would do so by often outputting nearly the entire label space as the prediction set! In contrast, a well-chosen $A(z)$ will differentiate easy-to-predict compounds (low $\alpha$) from hard ones (high $\alpha$), allowing the conformal algorithm to confidently output small sets for the former. There is an inherent trade-off between validity and efficiency: CP guarantees validity no matter what $A$ is, but it does not guarantee that the prediction sets will be narrow. The efficiency (average size of intervals or number of labels) depends on how informative the nonconformity measure is about the target. In practice, domain experts use their understanding of the problem to craft $A(z)$ – for instance, incorporating assay noise levels, model calibration information, or known applicability domain limits – to maximize the usefulness of conformal predictions in drug discovery projects.
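As one illustration of tailoring $A(z)$ so that it correlates with error, the sketch below normalizes residuals by a per-compound uncertainty estimate (here, the spread of a random forest's per-tree predictions), giving wider intervals where the ensemble disagrees and narrower ones elsewhere. The data, the `tree_spread` helper, and the smoothing constant `beta` are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(4)
eps, beta = 0.10, 0.1  # beta: smoothing constant to avoid division by ~0

X_train, y_train = rng.normal(size=(300, 16)), rng.normal(size=300)
X_cal,   y_cal   = rng.normal(size=(100, 16)), rng.normal(size=100)
X_test           = rng.normal(size=(5, 16))

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

def tree_spread(X):
    # Per-compound uncertainty proxy: std of the individual trees' predictions.
    per_tree = np.stack([t.predict(X) for t in model.estimators_])
    return per_tree.std(axis=0)

# Normalized nonconformity: |residual| / (uncertainty + beta).
sigma_cal = tree_spread(X_cal)
alphas = np.abs(y_cal - model.predict(X_cal)) / (sigma_cal + beta)

n = len(alphas)
k = int(np.ceil((1 - eps) * (n + 1)))
Q = np.sort(alphas)[min(k, n) - 1]

# Interval half-width now varies per compound: Q * (sigma + beta).
sigma_test = tree_spread(X_test)
y_hat = model.predict(X_test)
half = Q * (sigma_test + beta)
print(np.column_stack([y_hat - half, y_hat + half]))
```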
Limitations and Failure Modes of CP in Drug Discovery
While conformal prediction is a powerful framework, it is not without limitations, especially when applied to real-world drug discovery data. Understanding when and why CP can fail (or produce less useful results) is key to applying it properly.
Violations of Exchangeability: As discussed, the validity of CP rests on the assumption that the calibration (and training) data and the new samples are exchangeable. In practice, this assumption can be broken in numerous ways in drug discovery. Covariate shift is common – for instance, a QSAR model trained on one chemical series may be applied to a different series where the structure–activity relationship is slightly different. Here the new compounds are not drawn from the same distribution as the calibration set. Temporal drift is another culprit: models are often built on historical data, then used prospectively. As chemistry efforts progress, later compounds may explore new regions of chemical space or new experimental protocols might alter the data characteristics. Eklund et al. (2015) provide a vivid example: they applied inductive CP to predict assay outcomes over a 5-year project and found the conformal predictor was no longer exactly valid due to time-dependent changes in the data. Specifically, the early data and later data were not exchangeable, which led to actual coverage falling below the nominal level (more errors than expected in later batches). The CP framework does acknowledge this risk – in fact, tools exist to test for exchangeability assumption violations. In the face of drift, one mitigation is to frequently recalibrate the CP model. For example, using a sliding window calibration set that always uses the most recent data can help maintain validity at the cost of forgetting old data (semi-offline CP). If distribution shift is suspected (e.g., the model is now being used on a different chemical scaffold or a new patient population), CP predictions should be interpreted with caution. It’s possible in such cases that CP outputs unusually large prediction intervals or empty prediction sets – these are warning signs that the new data may be out of the model’s applicability domain. In summary, CP guarantees are strong but conditional on the data being representative. In drug discovery, where regime changes happen (new target classes, new assay technologies, etc.), one must monitor and adapt the conformal predictor to avoid invalid inference.
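A minimal sketch of the sliding-window recalibration idea is given below. The window size, the batch-wise update loop, and all names are assumptions for illustration; the point predictor is assumed to exist already and is not retrained here.

```python
import numpy as np
from collections import deque

eps, window = 0.10, 500          # keep only the most recent 500 calibration scores
cal_scores = deque(maxlen=window)  # rolling store of |y - y_hat| residuals

def update_calibration(y_true_batch, y_pred_batch):
    """Append the newest residuals; the oldest ones fall out of the window."""
    cal_scores.extend(np.abs(np.asarray(y_true_batch) - np.asarray(y_pred_batch)))

def interval(y_pred):
    """Conformal interval computed from the current (most recent) calibration window."""
    scores = np.sort(np.asarray(cal_scores))
    n = len(scores)
    k = int(np.ceil((1 - eps) * (n + 1)))
    Q = scores[min(k, n) - 1]
    return y_pred - Q, y_pred + Q

# Hypothetical usage in a design-make-test cycle:
# for each batch of newly assayed compounds:
#     update_calibration(measured_values, model.predict(descriptors))
#     ... then call interval(model.predict(x_new)) for the next round of predictions.
```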
Choice of Nonconformity Measure: Although any nonconformity measure yields valid coverage in theory, an inappropriate or poorly tuned $A(z)$ can lead to practical failure modes. One issue is inefficiency: if the nonconformity scores do not differentiate between good and bad predictions, the resulting prediction sets will be large for almost all compounds. For example, suppose one used a constant nonconformity score for all calibration examples. Then for a new sample, the p-value calculation will basically yield $p \approx \frac{\text{rank}}{n+1}$ based on random tie-breaking – meaning at 80% confidence, roughly 20% of compounds will be randomly declared unreliable, and intervals might be extremely wide to maintain coverage. Such a CP predictor is valid but useless, as it does not meaningfully prioritize compounds. In more realistic terms, if one chooses a nonconformity measure that is only weakly related to prediction error – say, using molecular weight as $A(z)$ when predicting toxicity – the conformal algorithm might still cover the true labels, but the “tightness” of the prediction sets will be suboptimal. Researchers have noted that conformal prediction inherits the weaknesses of the underlying model if not accounted for. For instance, if a QSAR model systematically overfits a particular region of chemistry, the residual-based nonconformity measure might be very small on calibration (if the model fits calibration data too well) but then explode on truly novel compounds, leading to invalid (under-covering) results for those. This ties back to distribution shift as well – a poorly chosen $A(z)$ might not flag novel compounds as strongly as it should. Another subtle issue is model update frequency: if one continues to use a fixed calibration set nonconformity distribution while the model parameters themselves are updated (say retrained on more data), the p-values might become mis-calibrated. It is generally required to recalibrate whenever the model changes, otherwise the nonconformity scores no longer reflect the model’s current error behavior.
In classification, an interesting failure mode occurs if the conformal prediction sets are always too large or too small. If the model is near-perfect, CP will simply predict a single label for most cases (which is fine). But if the model has blind spots, CP might output prediction sets with multiple labels or even no labels (if all p-values are below $\varepsilon$). The latter case is called a null prediction – it means the algorithm is saying “no label meets the confidence criterion.” This can happen if the new sample has an extremely large nonconformity score, larger than all calibration instances, resulting in $p_{y} = \frac{1}{n+1}$ for all classes $y$. In drug discovery, a null prediction might be interpreted as the compound being outside the model’s domain (none of the model’s predictions can be trusted at the desired level). While this is not a “failure” per se (it is the correct conclusion under the method’s logic), if null predictions occur too frequently, it impedes the usefulness of the model. Mondrian CP can reduce null predictions for minority classes by calibrating their own error rates, but one must still be cautious in very imbalanced cases where even Mondrian calibration data for the rare class is sparse.
Computational Challenges: Inductive conformal prediction is relatively lightweight, but in some scenarios the computation can become non-trivial. One challenge is scaling to ultra-large datasets. If one has millions of compounds in a virtual library to evaluate, computing p-values for each can still be time-consuming (since for each test sample, one must compare its nonconformity to $n$ calibration examples). Fortunately, this comparison can be done in $O(n)$ per test or even faster if the nonconformity scores are sorted once. Indeed, many implementations pre-sort the calibration scores and then just find the rank of the new score via binary search, making it very fast. The bigger issue is often memory: storing nonconformity scores for very large $n$ or caching model predictions. In large-scale drug discovery projects, one might also use distributed computing to handle CP calculations for many compounds in parallel. Full (transductive) CP is basically infeasible beyond small datasets due to retraining costs, but one could approximate it with clever reuse of computations or incremental training (e.g., for leave-one-out, some ML models like linear models or nearest-neighbors can update predictions quickly without full retrain). Another computational consideration is if one wants conformal prediction for structured outputs or complex models – for example, predicting an entire molecular optimization path with confidence. These advanced uses may require custom nonconformity measures and can be expensive to calibrate.
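The pre-sorting trick mentioned above can be expressed in a few lines, as in the following numpy sketch (hypothetical names and synthetic scores): sort the calibration scores once, then obtain each p-value with a binary search.

```python
import numpy as np

def pvalues(alpha_new, cal_scores_sorted):
    """Conformal p-values for an array of new nonconformity scores.

    cal_scores_sorted must be sorted ascending; each lookup is O(log n).
    """
    n = len(cal_scores_sorted)
    # Number of calibration scores >= alpha_new, via binary search.
    n_ge = n - np.searchsorted(cal_scores_sorted, alpha_new, side="left")
    return (n_ge + 1) / (n + 1)

# Example: one sort, then a very large number of cheap lookups.
rng = np.random.default_rng(5)
cal_sorted = np.sort(rng.exponential(size=1000))
new_scores = rng.exponential(size=1_000_000)
p = pvalues(new_scores, cal_sorted)
```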
In summary, the main failure modes of CP in drug discovery are tied to dataset shift, mis-specified nonconformity measures, and practical computation limits. If exchangeability is violated, the promised coverage can deteriorate – making the CP overly optimistic (dangerously so, if not detected). If the nonconformity metric is not thoughtfully chosen, the CP may be technically valid but provide little value (giant intervals or too many undecided cases). And while CP is computationally cheap relative to many Bayesian uncertainty quantification methods, applying it naïvely in a discovery pipeline with thousands of models or millions of compounds may require optimization. Users of CP should remain vigilant: always verify the empirical coverage on a hold-out if possible, perform checks for changes in data distribution, and refine the nonconformity measure as needed. When done properly, CP will retain its guarantees; when done carelessly, one might be lulled into a false sense of security by “formal” intervals that quietly failed to meet assumptions.
Future Directions and Conclusion
Conformal prediction in drug discovery is a rapidly evolving area, and several exciting directions are emerging to further enhance its utility:
Integration with Deep Learning Models: Modern drug discovery increasingly relies on deep learning architectures – from graph neural networks (GNNs) that operate on molecular graphs to transformer models for protein or molecule sequences. Integrating CP with these complex models is a natural next step. Encouragingly, CP is model-agnostic, so in principle one can take a deep model and wrap a conformal calibration around it. Recent studies have done exactly this: Zhang et al. combined deep feed-forward neural nets and GNNs with inductive CP for toxicity prediction (Tox21 datasets). They found that the resulting conformal predictors provided well-calibrated confidence intervals and improved minority-class (toxic) detection compared to the raw models. One reason is that deep learning models often produce overconfident probabilities; conformal calibration adjusts for this, yielding more realistic uncertainty bounds. Moreover, the use of Mondrian CP in conjunction with deep models can tackle class imbalance effectively by enforcing per-class error rates. We anticipate greater exploration of nonconformity measures tailored to deep models. For instance, a GNN could use the uncertainty in its node embeddings or the variance in an ensemble of dropouts as part of the nonconformity score. There is also interest in conformal molecular generation: ensuring that when a model generates novel compounds, it can attach confidence that the compound will meet certain property criteria with high probability. Techniques like conformal multi-objective optimization (producing sets of optimized molecules with guaranteed success rates) might appear in the future.
Active Learning and Adaptive Experimentation: Drug discovery is an iterative process – models suggest new compounds, chemists make and test them, and the new data is fed back into models. CP can play a pivotal role in active learning or experiment selection. Because CP provides a principled measure of uncertainty, one can design selection strategies that, for example, prioritize compounds for synthesis which the model is uncertain about (low p-values or very wide prediction intervals) in order to quickly improve the model in those regions. Alternatively, one might focus on compounds that are predicted active with high confidence (small conformal prediction sets) to pursue the most promising leads. Ahlberg et al. (2017) explored strategies to decide which compounds to make next based on conformal predictions of ADME properties. They compared different automated decision rules (like focusing on high-confidence positives vs exploring low-confidence regions) and demonstrated that using CP-informed criteria can save experimental resources while still finding optimal drug candidates. We foresee CP becoming a standard component in design–make–test–analyze (DMTA) cycles, as a decision support tool: for example, flagging which predicted leads are within the model’s confidence domain and which are extrapolations. CP can also enhance Bayesian optimization for drug design by ensuring that the model’s proposed next candidates meet a desired confidence level.
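The toy sketch below contrasts the two selection rules described above (exploiting confident predicted actives versus exploring uncertain compounds). The activity cutoff, batch size, function name, and the assumption that per-compound conformal intervals are already available are all illustrative choices, not a prescribed strategy.

```python
import numpy as np

def select_for_synthesis(y_hat, lower, upper, activity_cutoff=6.5, n_explore=10):
    """Split candidates into 'exploit' and 'explore' picks from conformal intervals.

    y_hat, lower, upper: arrays of point predictions and interval bounds (e.g. pIC50).
    """
    width = upper - lower
    # Exploit: the whole interval clears the activity cutoff -> confident actives.
    exploit = np.where(lower >= activity_cutoff)[0]
    # Explore: widest intervals -> compounds the model is least certain about.
    explore = np.argsort(-width)[:n_explore]
    return exploit, explore

# Example with made-up predictions for 1000 virtual compounds.
rng = np.random.default_rng(6)
y_hat = rng.normal(6.0, 1.0, size=1000)
half = rng.uniform(0.2, 1.5, size=1000)
exploit_idx, explore_idx = select_for_synthesis(y_hat, y_hat - half, y_hat + half)
```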
Beyond Marginal Coverage – Conditional and Localized Validity: One active research area in conformal prediction is improving the conditional validity of CP. The standard CP guarantee is marginal, meaning averaged over all samples the error rate is bounded by $\varepsilon$. In drug discovery, one might desire stronger guarantees for certain subsets – e.g., “This model’s activity predictions are 90% accurate for compounds similar to our lead series.” Mondrian CP is one approach to conditional validity based on known categories, but there is ongoing work on more flexible conditioning (like conditioning on a continuous descriptor). Methods like conditional conformal prediction and weighting schemes could allow CP to maintain validity on important subdomains (such as a particular chemistry or a particular range of property values). This might involve training a separate conformal predictor for, say, large molecules vs small molecules, or using covariate-dependent nonconformity functions. While not trivial, any advances here would directly benefit drug discovery by allowing more nuanced risk assessments – for instance, knowing that within a well-explored scaffold series the CP is tight and valid, whereas in a novel series it defaults to broader, more conservative predictions.
User Interpretability and Trust: As CP becomes more integrated into drug discovery pipelines, an interesting “soft” aspect is how medicinal chemists and project teams interact with these predictions. The presentation of conformal prediction sets (e.g., “Compound X will have pIC50 between 6.5 and 7.8 with 90% confidence”) needs to be communicated effectively to non-experts. Visualization tools, such as plotting the predicted interval for each compound alongside actual values, can help build intuition. Trust in AI models can be significantly bolstered by methods like CP: knowing that a model can say “I’m not sure” for certain compounds is comforting in a decision-making context. There is potential to combine CP with explainable AI techniques – for example, providing an explanation for why a prediction interval is wide (perhaps highlighting a molecular substructure that lies outside the training domain).
Automating Nonconformity Selection and Efficiency Improvements: Another future direction is to automate or learn the nonconformity measure itself. Instead of manually specifying $A(z)$, one could imagine a meta-learning approach where we train a small model to predict the error of the main model, and use that as $A$. Ensemble and stacking approaches already hint at this (where an ensemble’s disagreement correlates with error). Research into adaptive conformal prediction aims to make prediction sets smaller while preserving validity, by using information about $x$ (features) in the calibration – essentially trying to achieve conditional validity. This might involve training models that output interval predictions directly (like quantile regression or Bayesian NNs) and then conformalizing them (a technique known as Conformalized Quantile Regression, CQR). Such approaches could yield tighter intervals for easy compounds and only wide intervals for genuinely hard cases.
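For completeness, here is a compact sketch of conformalized quantile regression (CQR) as referenced above, using scikit-learn quantile gradient boosting on synthetic data (all names are hypothetical): two quantile models give an initial band, and a conformal correction computed on the calibration set restores the coverage guarantee.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(7)
eps = 0.10  # target 90% coverage

X_train, y_train = rng.normal(size=(500, 8)), rng.normal(size=500)
X_cal,   y_cal   = rng.normal(size=(200, 8)), rng.normal(size=200)
X_test           = rng.normal(size=(5, 8))

# Two quantile regressors for the lower and upper bounds of the initial band.
lo = GradientBoostingRegressor(loss="quantile", alpha=eps / 2).fit(X_train, y_train)
hi = GradientBoostingRegressor(loss="quantile", alpha=1 - eps / 2).fit(X_train, y_train)

# Conformity scores: how far outside the band each calibration point falls.
scores = np.maximum(lo.predict(X_cal) - y_cal, y_cal - hi.predict(X_cal))

n = len(scores)
k = int(np.ceil((1 - eps) * (n + 1)))
Q = np.sort(scores)[min(k, n) - 1]

# Conformalized intervals: widen (or shrink, if Q < 0) the band by Q.
intervals = np.column_stack([lo.predict(X_test) - Q, hi.predict(X_test) + Q])
print(intervals)
```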
In conclusion, conformal prediction has emerged as a valuable tool to provide rigorous uncertainty quantification in drug discovery and chemical machine learning. Its ability to deliver valid confidence measures for each prediction addresses a critical need in a field where decisions can have large financial and scientific consequences. By outputting prediction sets with a guaranteed coverage probability, CP methods allow researchers to identify which predictions can be trusted and which require caution. This leads to more efficient use of resources – focusing experiments where the model is less certain, and expediting projects when the model is confident. We have discussed how CP operates, its mathematical underpinnings (exchangeability and nonconformity), and how it can be adapted (inductive, Mondrian, etc.) for practical use. We also examined limitations like dataset shift and the importance of proper calibration. Going forward, the integration of CP with state-of-the-art deep learning and its deployment in active learning loops hold great promise for making drug discovery AI both reliable and informative. As the industry embraces data-driven decisions, methods like conformal prediction will be instrumental in ensuring that those decisions are backed by sound uncertainty estimates – ultimately increasing the trust in AI models to guide the discovery of new therapeutics.
References:
- Shafer, G. & Vovk, V. (2008). A Tutorial on Conformal Prediction. Journal of Machine Learning Research, 9, 371–421.
- Cortés-Ciriano, I. & Bender, A. (2019). Concepts and Applications of Conformal Prediction in Computational Drug Discovery. Molecular Informatics, 39(8-9), 1900351.
- Norinder, U. et al. (2014). Introducing Conformal Prediction in Predictive Modeling. A Transparent and Flexible Alternative to Applicability Domain Determination. J. Chem. Inf. Model., 54(6), 1596–1603.
- Eklund, M., Norinder, U., Boyer, S. & Carlsson, L. (2015). The application of conformal prediction to the drug discovery process. Ann. Math. Artif. Intell., 74, 117–132.
- Zhang, J. et al. (2021). Deep Learning Based Conformal Prediction of Toxicity. J. Chem. Inf. Model., 61(6), 2641–2653.
- Ahlberg, E. et al. (2017). Using Conformal Prediction to Prioritize Compound Synthesis in Drug Discovery. Proc. of Machine Learning Research, 60, 29–48.
- Lei, J., G’Sell, M., Rinaldo, A., Tibshirani, R. & Wasserman, L. (2018). Distribution-Free Predictive Inference for Regression. Journal of the American Statistical Association, 113(523), 1094–1111.
- Vovk, V., Gammerman, A. & Shafer, G. (2005). Algorithmic Learning in a Random World. Springer.
- Papadopoulos, H. (2008). Inductive Conformal Prediction: Theory and Application to Neural Networks. In Tools in Artificial Intelligence, ed. by D. T. Larose, pp. 315–330.
- Linusson, H., Boström, H. & Johansson, U. (2014). Different approaches to address the class imbalance problem in conformal prediction. Proc. of COPA 2014, 11–19.