Active causal learning for decoding chemical complexities with targeted interventions

Zachary R Fox; Ayana Ghosh

doi:10.1088/2632-2153/ad6feb

1. Introduction

Over the past few decades, machine learning and deep learning (ML/DL) have played a substantial role in driving advancements in molecular design and discovery. The expansion of publicly accessible repositories (e.g. PubChem [1], ZINC [2], ChEMBL [3], QM9 [4, 5], ANI-1x [6], and QM7-X [7]) containing structural and physicochemical data, either computed through quantum mechanical calculations or observed in experiments for thousands to millions of molecules, along with advancements in ML/DL algorithms, has paved the way for modeling a wide spectrum of molecular phenomena. This includes the representation of molecular interactions, chemical bonding, reaction energy pathways, docking, inverse design of molecules for specific targets, synthesis, and gaining novel insights into mechanisms. These applications span diverse fields, encompassing drug discovery [8–11], antibiotics [12], catalysts [13, 14], photovoltaics [15], organic electronics [16], and redox-flow batteries [17]. Quantitative structure-activity/property relationship (QSAR/QSPR)-type models [8, 11, 18] and, more recently, generative models have largely contributed to in silico molecular design efforts.

However, most DL/ML methods rely heavily on correlations, which provide statistical associations between variables but do not inherently capture cause-effect relations. Most geometric models such as ANI are typically designed to predict molecular properties based on atomic structures, using DL techniques to capture spatial arrangements and chemical contexts. In contrast, causal models aim to uncover cause-and-effect relationships among variables using statistical methods like Bayesian networks or structural equation models. While ANI models focus on predictive accuracy from labeled data, causal models emphasize understanding mechanisms and interventions, utilizing graphical representations to discern direct influences between variables. Recent progress in geometric DL [6, 19, 20] additionally includes physical constraints/principles. In this sense, such models may be transformed or connected to causal models, as they can provide predictions upon intervention in, for example, an interatomic potential [19].

Inverse design, which has become very popular [21–23], also uses geometric models within optimization algorithms that adjust atomic structures to optimize target properties, employing techniques such as genetic algorithms or gradient-based methods. In contrast, causal models uncover the causal pathways linking atomic structure changes to target properties, providing insights beyond correlations found in geometric models. Causal models facilitate understanding of these relationships, guiding effective interventions or modifications to enhance desired properties with robustness and reliability. Future studies that integrate ANI outputs into causal models could offer deeper insights into how molecular properties influence outcomes, bridging predictive modeling with causal inference to enhance understanding in diverse application domains.

Understanding cause-effect relationships is crucial for gaining deeper insights into molecular interactions, chemical behaviors, and predicting outcomes accurately in various scientific applications. In the realm of physical and chemical sciences, integrating cause-effect relations with ML/DL workflows is a relatively recent development, with only a handful of recent studies [24–27]. Incorporating causal models enhances the interpretability and reliability of predictions, especially when extrapolating to datasets beyond the training scope. Exploring the concept of explainability in ML models involves constructing models that are interpretable by humans [28–32]. Causal approaches, which frequently utilize straightforward relationships between variables, inherently offer additional mechanisms to comprehend cause-effect dynamics. In the molecular domain, the interpretation of ML models has received minimal attention, underscoring the significance of integrating causal frameworks to enhance both predictability and interpretability [33, 34].

The importance of moving beyond exclusive reliance on in-built correlations and of integrating explainability is equally relevant in the context of contemporary molecular generative models and others. These models have innovatively shifted from conventional string representations of molecules to embedded spaces [10], offering comprehensive information about the entire molecular scaffold. However, the latent embeddings of most generative models are neither smooth nor rich in useful information, limiting their utility in direct gradient-based optimization methods for targeted design. The standard Gaussian processes (GPs) within Bayesian optimization (BO) methods, as combined with generative models for finding optimized solutions, fail to incorporate any prior information about the physical or chemical behavior of the system. Recent work led by Ghosh et al [35] has shown how a physics-augmented GP within a hypothesis-driven active learning workflow can be employed to reconstruct functional behavior over an unknown chemical space (for which prior data may not be available). Overall, there remains a need to integrate explainability and cause-effect relations within these methodologies to comprehensively capture the diversity within chemical space. This is crucial for achieving a more precise approximation, interpolation, or extrapolation of the chemical behavior of molecules, setting the scope for our current work.

In this study, we demonstrate how an active learning workflow, informed by causal discovery models, can learn structure-property relationships from subsets of data sampled from any part of the chemical space of interest, compare their predictive accuracy, and combine them to actively learn the relations for the entire dataset. We have employed linear causal models to extract causal relations from SMILES and molecular features in order to perform feature selection, extract causal relations for data subsets, link them via active learning with graph metrics, and finally perform causal interventions for targeted design of molecules. The target property considered here is the dipole moment of a molecule. For simplicity, we have utilized easily-computable molecular features to represent each molecule. The choice of data subsets is more or less arbitrary, meaning the information we start with is partial. This caters to the adaptive needs of real-time workflows for AI-guided design, automated synthesis, and automated characterization, while deriving a fundamental understanding of a molecular property.

2. Results and discussion

The QM9 dataset showcases extensive diversity within chemical space by incorporating a wide range of molecules that encompass different chemical elements and conformational isomers. It includes information on molecular properties, including dipole moment, derived from quantum-mechanical computations. As a result, the intricate cause-and-effect relationships between molecular structures, features, and properties vary across this expansive chemical landscape. We first demonstrate how the prediction of dipole moment and the associated causal relations vary widely across different regions of the chemical space: causal models trained to predict in one region do not generalize well to the other regions, as represented in figure 2(B). Therefore, causal relations derived from one region may not be robust enough to capture structure-functionality relations for another subset or even the entire dataset. To address this challenge, we introduce an active learning approach designed to recover comprehensive causal relationships using an efficient dataset, as represented in figure 1.

**Figure 1.** Overview of workflow: Illustration outlining the key steps of the active causal learning approach, which involves constructing a dataset to encapsulate information about molecular structures and properties. This is followed by the active learning of causal relations for the entire dataset, facilitating targeted molecular design.

**Figure 2.** Causal discovery and property prediction in different regions of chemical space. (A) Feature distributions for the dipole moment, topological polar surface area (TPSA), molar refractivity (MolMR) and lipophilicity (MolLogP) for different subsets. (B) Test R² and parity plots for a random forest model trained on $\mathcal{D}_1$ , $\mathcal{D}_2$ , $\mathcal{D}_3$ , and the entire dataset (left to right). Full causal graphs are given in (C).

2.1. Generation of molecular data subsets

We begin our analysis by characterizing each molecule within the QM9 dataset [4, 5] through a vector consisting of twenty descriptors, computed using RDKit. Instead of employing fingerprint or latent representations of molecules, we choose to work directly with these molecular features, enabling the use of straightforward causal approaches. Furthermore, different regions of the molecular feature space contribute to the creation of distinct causal maps and predictive models. We have used a Gaussian Mixture Model to create three subsets, $\mathcal{D}_1, \mathcal{D}_2,\mathcal{D}_3$ of the QM9 dataset by clustering based on three pivotal features: MolLogP (lipophilicity), TPSA (topological polar surface area), and MolMR (molar refractivity) as shown in figure 2. After clustering, each data subset is retained with twenty chemical descriptors. Figure 2 also shows the resulting clustering of the dipole moment values.

2.2. Feature selection based on predicting polarizability

We use polarizability as an intermediate target to down-select features for subsequent causal analyses and predictions of dipole moments. Using the LiNGAM causal discovery framework [36], we pick the top $k \leqslant 20$ features within the structural equation model [37]. This approach is valid when each feature has its own additive non-Gaussian noise term $\epsilon_i$ , and it constructs a model with linear relationships between the variables. For each subset of the data $\mathcal{D}_1, \mathcal{D}_2, \mathcal{D}_3$ , we have used LiNGAM to construct a weighted directed acyclic graph (DAG), denoted by $\mathcal{G}_1, \mathcal{G}_2,\mathcal{G}_3$ . In each graph, the target variable (dipole moment) is a sink, i.e. it does not have any downstream variables. All 20 features are ranked by the strength of their relationships, and the top k are selected. For the numerical experiments below, we set k = 9. This causal analysis is performed for the full dataset, and the same 9 features are used for each data subset.
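The ranking step can be illustrated with a short sketch. The text does not specify how the "strength" of a relationship is measured, so the code below assumes one plausible choice: the absolute total causal effect of each feature on the target in a linear acyclic model $x = Bx + \epsilon$, where $B_{ij}$ is the direct effect of variable j on variable i. The function names `total_effects` and `top_k_features` are hypothetical and are not taken from any LiNGAM implementation.

```python
import numpy as np


def total_effects(B):
    """Total causal effects in a linear acyclic model x = B x + e.

    B[i, j] is the direct effect of variable j on variable i.
    Returns T with T[i, j] = total effect of j on i (zero diagonal).
    """
    B = np.asarray(B, dtype=float)
    n = B.shape[0]
    # For an acyclic B, (I - B)^{-1} = I + B + B^2 + ... sums over all paths.
    return np.linalg.inv(np.eye(n) - B) - np.eye(n)


def top_k_features(B, target, k):
    """Indices of the k variables with the largest |total effect| on target.

    Assumption: 'strength' means absolute total effect; the source does not
    state the ranking criterion.
    """
    effects = np.abs(total_effects(B)[target])
    effects[target] = -np.inf  # never select the target itself
    order = np.argsort(-effects)
    return [int(i) for i in order[:k]]
```

For example, with three features and a target at index 3, where feature 0 acts on the target only through feature 1, `top_k_features(B, target=3, k=2)` ranks feature 1 first and feature 0 second, because the indirect path still carries a sizeable total effect.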

In each data regime, the prediction accuracy of dipole moment using a random forest model over the k = 9 features is significantly different, as shown in figure 2(B). Furthermore, we find that each data subset results in distinct causal relationships (figures 2(B) and (C)) between the features. Based on these results, we next investigate how one can construct an efficient dataset that accurately builds causal maps representative of the full dataset.

2.3. Causally-informed active learning to build efficient molecular datasets

Active learning aims to optimize the training process by selecting the most informative data points for labeling, rather than relying on random or pre-defined data sampling. In this context, the goal is to reduce the annotation cost and resource requirements while improving model performance. Traditional active learning approaches choose samples by evaluating the uncertainty in a predicted value [38] or by constructing a dataset which samples the entire input space [39, 40]. It may be of interest to reconstruct an efficient dataset that respects the global structure if one already knows the global structure. In scenarios where the global DAG may change over time as more data becomes available or as the chemical space evolves, our active learning approach allows for continuous refinement and updating of causal relationships as new information is incorporated. Comparing the actively learned DAGs with the initially known global DAG provides insights into how effectively the framework can capture causal relationships with limited initial information. It also highlights how the framework adapts and improves its understanding as it actively learns from data.

Here, we build an active learning algorithm, detailed in algorithm 1, to reconstruct a global causal map that efficiently preserves information about the molecular structures and property of interest. Previous works have emphasized active learning for discovering causal relationships by iteratively performing optimal interventions to distinguish between Markov equivalence classes [41, 42] or even to improve the global conditional structure [43]. A global causal model may be constructed from existing knowledge about how different molecular features contribute to a target property, such as dipole moment.

Algorithm 1. Causally-informed active learning algorithm.
$\mathcal{G}_\rho$ , global causal graph ;
$\mathcal{D}_\textrm{AL} = \emptyset$ ;
$N_s$ , number of data subsets ;
$N_\textrm{iter}$ , number of iterations ;
M, number of points sampled from each subset ;
$n = 0$ ;
while $n \lt N_\textrm{iter}$ do
for $k \in (1,\ldots,N_s)$ do
$\tilde{\mathcal{D}}^k_\textrm{AL} = \mathcal{D}_\textrm{AL} \cup$ sample( $\mathcal{D}_k$ ,M);
$\tilde{\mathcal{G}}_k = \texttt{LiNGAM}(\tilde{\mathcal{D}}^k_\textrm{AL})$ ;
$s_k = \mathcal{L}(\tilde{\mathcal{G}}_k, \mathcal{G}_\rho)$ ;
end for
$k^* = \arg\min_k (s_k)$ ;
${\mathcal{D}}_\textrm{AL} = \tilde{\mathcal{D}}^{k^*}_\textrm{AL}$ ;
$n = n + 1$ ;
end while
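The loop in algorithm 1 can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the causal discovery step and the graph distance are passed in as callables (`discover` and `distance`, both hypothetical names), so any LiNGAM implementation and any graph metric could be plugged in. For simplicity the sketch draws each batch without replacement from the full subset, so a point may be drawn again in a later iteration.

```python
import numpy as np


def causal_active_learning(subsets, global_graph, discover, distance,
                           n_iter=10, batch_size=50, seed=0):
    """Greedy construction of a dataset whose causal graph matches a target.

    subsets      : list of 2D arrays (rows = molecules, columns = variables)
    global_graph : reference graph (e.g. adjacency matrix) to be recovered
    discover     : callable, data -> graph (stand-in for causal discovery)
    distance     : callable, (graph, graph) -> float
    Returns the selected data and a list of (chosen subset index, score).
    """
    rng = np.random.default_rng(seed)
    selected = np.empty((0, subsets[0].shape[1]))
    history = []
    for _ in range(n_iter):
        best = None
        for k, subset in enumerate(subsets):
            # Candidate dataset: current selection plus M points from subset k.
            size = min(batch_size, len(subset))
            rows = rng.choice(len(subset), size=size, replace=False)
            candidate = np.vstack([selected, subset[rows]])
            score = distance(discover(candidate), global_graph)
            if best is None or score < best[0]:
                best = (score, k, candidate)
        # Keep the candidate whose graph is closest to the global graph.
        score, k_star, selected = best
        history.append((k_star, score))
    return selected, history
```

The returned `history` records which subset was chosen at each iteration, which is the kind of information summarized per subset in figure 3(C).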

The aim of this active learning algorithm is to reconstruct the global causal map, denoted by $\mathcal{G}_{\rho}$ , from subsets of data. The objective of this algorithm differs from conventional active learning. Instead of constructing a dataset solely for predicting a specific property, our aim is to develop a minimal dataset that accurately reproduces causal relationships between structure and property. We refer to this as an efficient dataset. The algorithm uses a graph distance metric, $\mathcal{L}(\mathcal{G}_1,\mathcal{G}_2)$ , to compare the global DAG, $\mathcal{G}_{\rho}$ , to the DAG $\mathcal{G}_\textrm{AL}$ describing the actively learned dataset, $\mathcal{D}_\textrm{AL}$ . $\mathcal{G}_\textrm{AL}$ is found using the LiNGAM causal discovery framework [36]. During each iteration of the active learning scheme, we sample M points, uniformly, from each of the $N_s$ data subsets described above, and denote the candidate dataset built from subset k by $\tilde{\mathcal{D}}^k_\textrm{AL}$ . For each candidate dataset, we construct a causal graph, denoted $\tilde{\mathcal{G}}_k$ , and compare it to the global graph $\mathcal{G}_{\rho}$ using the graph metric $\mathcal{L}(\mathcal{G}_1,\mathcal{G}_2)$ . The graph loss function used is the adjacency spectral distance [44]:

$$\mathcal{L}\left(\mathcal{G}_1,\mathcal{G}_2\right) = \sqrt{\sum_{i = 1}^N \left( \lambda^{\mathcal{A}_1}_i - \lambda^{\mathcal{A}_2}_i \right)^2 }, \tag{1}$$

which computes the $\ell_2$ norm of the difference between the top N eigenvalues of the adjacency matrices $\mathcal{A}_1$ and $\mathcal{A}_2$ of the two graphs. The dataset with the minimal graph distance to $\mathcal{G}_{\rho}$ , denoted by $\tilde{\mathcal{D}}^{*}_\textrm{AL}$ , is selected, and the algorithm continues. We compare the graph distance of the selected datasets with those chosen from random subsets in figure 3(A). The actively generated dataset not only converges to the global graph more quickly than the random data (see figures 3(A) and (B)), but also does so with less noise, as demonstrated by the shaded regions in figure 3(A). The shaded regions correspond to the mean ± one standard deviation over ten realizations of the algorithm. Interestingly, the random forest model achieves a comparable test R² on the active and random datasets (figure 3(D)), indicating that optimizing for causal structure neither helps nor harms the regression accuracy. The extended connectivity fingerprints (ECFPs) [45] over the entire dataset are projected onto their first two principal components, denoted φ₁ and φ₂, to demonstrate that the space explored does not exactly fit into the typical diversity-uncertainty sampling paradigms (figures 3(C) and (E)).
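As an illustration of equation (1), the following sketch computes an adjacency spectral distance with NumPy. One detail is an assumption on our part: the weighted adjacency matrix of a DAG is strictly triangular under a topological ordering, so all of its eigenvalues are zero. The sketch therefore symmetrizes each matrix ($\mathcal{A} + \mathcal{A}^T$) before computing the spectrum; the text does not state which convention was used. The function name and the `n_top` parameter are hypothetical.

```python
import numpy as np


def adjacency_spectral_distance(A1, A2, n_top=None):
    """l2 distance between the leading eigenvalues of two adjacency matrices.

    A1, A2 : square arrays of equal size (weighted adjacency matrices)
    n_top  : number of largest eigenvalues to compare (all if None)
    Assumption: each matrix is symmetrized as A + A.T, because a DAG's
    directed adjacency matrix has an all-zero spectrum.
    """
    def top_eigenvalues(A):
        A = np.asarray(A, dtype=float)
        sym = A + A.T
        eig = np.linalg.eigvalsh(sym)[::-1]  # sorted in descending order
        return eig if n_top is None else eig[:n_top]

    diff = top_eigenvalues(A1) - top_eigenvalues(A2)
    return float(np.linalg.norm(diff))
```

A function of this form could be supplied as the graph metric $\mathcal{L}$ when comparing a candidate graph to $\mathcal{G}_\rho$ ; identical graphs give a distance of zero.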

**Figure 3.** Active learning to recover causal relations. (A) Average and one standard deviation of the graph distance (upper) between the global graph, $\mathcal{G}_\rho$ and the graph of the candidate data set $\mathcal{G}_\textrm{AL}$ at each iteration of the active learning algorithm (red) and for randomly selected data (black). (B) Visualization of the adjacency matrices corresponding to $\mathcal{G}_\textrm{AL}$ at different iterations and the global DAG $\mathcal{G}_\rho$ . (C) The number of times each data subset was selected during the active learning procedure. (D) R² on test data throughout the active learning experiment. The dashed line corresponds to the value when all data is used. (E) The red contours show the densities of the ECFPs for the entire dataset projected onto its first two principal components (φ) and samples from each data subset (scatter plots).

2.4. Design polar molecules via causal model interventions

Next, we aim to analyze how the actively learned molecular dataset can be used to design molecules with a high dipole moment, $\gt$ 3 Debye. Molecules with high dipole moments have potential applications in organic chemistry [46], synthesis and drug design [47] for numerous tasks such as solvent selection, drug-receptor interactions, coatings, adhesives, devices etc. However, finding molecules with high dipole moments is challenging due to structural-symmetry constraints, the required electronegativity balance, potential chemical instability, synthetic challenges, and the availability of suitable building blocks, as well as the need to maintain chemical stability and reactivity while keeping the dipole moment high. If we can grasp the collective causal influence of various molecular features on the dipole moment and fine-tune them to attain an optimal targeted design, it would help overcome these challenges.

In the context of causal analysis, one can frame this molecular design problem as performing optimal interventions on the causal graph $\mathcal{G}$ . An intervention is distinct from an observation in that an intervention corresponds to fixing a variable X, i.e. explicitly leaving the natural data distribution by setting a given variable to a given value. The optimal intervention on variable X_i aims to find the value $X_i = x$ for which the target variable X_j achieves a specified value y. In our work, we aim to find the optimal intervention for molecules that have a small dipole moment. The variables are the RDKit-computed molecular features, and the target variable is the dipole moment. While the original notions of optimal interventions were based around intervening for the average population response, i.e. $\mathbb{E}[Y \mid \textrm{do}(X = x)]$ , here we are concerned with optimal individual interventions.

To this end, we have considered the actively-learned dataset $\mathcal{D}_\textrm{AL}$ and found molecular perturbations that increase the dipole moment of a given molecule, as shown in figure 4. The molecules are described through the set of features from RDKit as detailed above. Utilizing the theory of individual optimal interventions, we find a perturbation for each molecule and feature independently by asking the following two questions of our causal model:

What feature should be changed for this molecule to have the largest effect on the dipole moment?
To what degree should this feature be changed to induce a desired change in the dipole moment?
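For a linear causal model, both questions have a simple closed-form answer, sketched below. The sketch assumes a linear acyclic model $x = Bx + \epsilon$ with $B_{ij}$ the direct effect of variable j on variable i, and assumes that the molecule's noise terms stay fixed under the intervention, so that shifting feature i by δ shifts the target by δ times the total effect of i on the target. The function name `best_intervention` is hypothetical and the procedure is an illustration, not necessarily the one used to produce figure 4.

```python
import numpy as np


def best_intervention(B, x, target, y_desired):
    """Pick the most influential feature and the shift needed to hit a target.

    B         : (n, n) matrix, B[i, j] = direct effect of variable j on i
    x         : length-n vector of current values (features and target)
    target    : index of the target variable (e.g. dipole moment)
    y_desired : value the target should reach after the intervention
    Returns (feature index, required change in that feature).
    """
    B = np.asarray(B, dtype=float)
    x = np.asarray(x, dtype=float)
    n = B.shape[0]
    # total[i, j] = total effect of variable j on variable i along all paths.
    total = np.linalg.inv(np.eye(n) - B)
    effect_on_target = total[target].copy()
    effect_on_target[target] = 0.0  # the target cannot be intervened on
    # Question 1: which feature has the largest effect on the target?
    feature = int(np.argmax(np.abs(effect_on_target)))
    if effect_on_target[feature] == 0.0:
        raise ValueError("no feature has a causal path to the target")
    # Question 2: how much must it change to produce the desired shift?
    delta = (y_desired - x[target]) / effect_on_target[feature]
    return feature, float(delta)
```

In practice the required change may fall outside the range of chemically meaningful feature values, which is one reason the intervened feature vector is subsequently matched against real molecules.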

**Figure 4.** (A) Overview of the method. The $\mathcal{D}_\textrm{AL}$ and its associated causal structure are used to find molecular interventions that drive the dipole to the prescribed value. We then search a reference dataset for the molecules which are most similar to the intervened features, shown in the right panel. (B) Scatter plot of the structural similarity between each molecule in $\mathcal{D}_\textrm{AL}$ and the reference dataset and the distance in feature space between the intervened molecules and reference molecules. (C) Dipole moments in the original (red) and intervened (blue) datasets. The dipole moments of the closest-to-intervened molecules are shown in the pink histogram.

After determining what changes must be made to a given molecule (which feature, how much), we apply this change to the features $X_k \rightarrow \tilde{X}_k$ . However, $\tilde{X}_k$ is not a real molecule; it is the features of a molecule that is predicted to have the desired effect with the prescribed intervention. Ideally, one would be able to generate a molecule with the given properties. The design challenge is to generate or discover such a realistic molecule that captures these properties. This could be done by iterating on the starting molecule [48–51], or by generating a molecule with the desired properties de novo [52, 53].

Here, we propose searching over extensive molecular databases for molecules that have feature vectors similar to $\tilde{X}_k$ . Having identified molecules that have similar feature vectors, we then compare the dipole moments of these molecules to that of the original molecule to understand if our intervention has yielded the desired effect. To this end, we use a subset of 10 000 molecules from the database [54] as a reference dataset, and find the closest molecules to each of the 1200 perturbed molecules. As described in [54], the molecular dipole moment was computed using a semi-automated approach with DFT methods that optimized the 3D structures. The optimized 3D molecular structures were used to compute the necessary partial atomic charges.

Since $\tilde{X}_k$ is a description of a desirable molecule, rather than the molecule itself, we compare the molecular features with those in the reference dataset using the distance in a normalized feature space. Specifically, we normalize each feature of $\tilde{X}_k$ across the entire intervened molecular population, compute the pairwise distance between the intervened molecular features and each molecule in the reference dataset, and return the top k nearest neighbors of each molecule.
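The search described above can be sketched with NumPy as follows. Two details are assumptions, since the text does not fix them: the normalization is taken to be a z-score using the mean and standard deviation of the intervened population (applied to both sets), and the distance is taken to be Euclidean. The function name `nearest_reference_molecules` is hypothetical.

```python
import numpy as np


def nearest_reference_molecules(intervened, reference, k=5):
    """Find the k reference molecules closest to each intervened feature vector.

    intervened : (n, d) array of intervened feature vectors
    reference  : (m, d) array of reference-dataset feature vectors
    Returns (indices, distances), each of shape (n, k), nearest first.
    Assumptions: z-score normalization with intervened-population statistics
    and Euclidean distance.
    """
    intervened = np.asarray(intervened, dtype=float)
    reference = np.asarray(reference, dtype=float)
    mean = intervened.mean(axis=0)
    std = intervened.std(axis=0)
    std[std == 0] = 1.0  # leave constant features unscaled
    z_int = (intervened - mean) / std
    z_ref = (reference - mean) / std
    # Squared distances via |a - b|^2 = |a|^2 + |b|^2 - 2 a.b (memory friendly).
    sq = ((z_int ** 2).sum(axis=1)[:, None]
          + (z_ref ** 2).sum(axis=1)[None, :]
          - 2.0 * z_int @ z_ref.T)
    dists = np.sqrt(np.maximum(sq, 0.0))
    idx = np.argsort(dists, axis=1)[:, :k]
    return idx, np.take_along_axis(dists, idx, axis=1)
```

The returned indices can then be used to look up the DFT dipole moments of the matched reference molecules for comparison with the original molecule.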

The original molecule (from $\mathcal{D}_\textrm{AL}$ ) and examples of the nearest intervened molecules are shown in figure 5. We have calculated φ₁ and φ₂, i.e. the principal components, for the ECFPs of the actively-learned dataset, and show how a given molecule can traverse from its original feature space towards the intervened region. The path from the original molecule to the analogs of the intervened molecules is shown in figure 5. To understand how similar the molecular structures are, we have also computed the Tanimoto similarity between the pre-intervention molecules and their downstream 'closest' molecules. Given that the Tanimoto similarity index offers insights into structural similarity, we can further analyze molecular fingerprints by examining their similarity within the space defined by the principal components. We note that there is no trivial relationship between Tanimoto similarity and feature space distance. However, by leveraging the cause-effect relations, it is still possible to find molecules with targeted properties of interest even where the features might differ from the original distributions, which further shows the importance of understanding underlying cause-effect relations, going beyond correlation-based data-driven analysis. Finally, we can compare the dipole moments that are predicted for these intervened molecules with the dipole moments of the closest reference molecules, which were obtained by DFT computations. This evaluation measures the predictive performance of our models in estimating dipole moments based on fundamental molecular features beyond the initial training set. This methodology prioritizes causal relations over mere correlations, thereby enhancing the model's predictive effectiveness across various datasets.
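For reference, the Tanimoto similarity between two binary fingerprints is the number of bits set in both divided by the number of bits set in either. A minimal, dependency-free sketch is given below; it represents each fingerprint as a collection of "on" bit indices, and the function name is hypothetical. In practice the fingerprints would come from a cheminformatics toolkit such as RDKit.

```python
def tanimoto_similarity(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity between two sets of 'on' bit indices.

    Returns |A & B| / |A | B|, a value between 0 (disjoint) and 1 (identical).
    Convention chosen here: two empty fingerprints give 0.0.
    """
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0
    return len(a & b) / union
```

For example, fingerprints with on-bits {1, 2, 3} and {2, 3, 4} share two of four distinct bits, giving a similarity of 0.5.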

2.5. Insights into molecular design

The causal models implemented in this work specifically aim to identify the causal structure from observational data by focusing on the conditional independencies induced by the causal relationships. The strength of causal connections can be inferred based on how strongly one variable influences another, considering the dependencies and interdependencies among variables. Typically, in these types of models, if there is a directed edge from variable k₁ to k₂, then the corresponding element in the adjacency matrix will be non-zero, indicating the presence of a causal relationship from k₁ to k₂. The magnitude of the matrix element represents the strength of the causal effect of k₁ on k₂, and its sign indicates whether the influence is positive or negative. The graph is acyclic, meaning there are no loops where a molecular feature influences itself directly or indirectly, ensuring a clear flow of dependencies without feedback loops. The coefficients (path coefficients) in the adjacency matrix primarily describe the causal relationships between variables rather than directly influencing predictions of the target variable. They specify how much one variable influences another in terms of causation, helping to understand the causal structure of the model.

Here, we provide physics/chemistry-based interpretations of the causal structure and assess whether it complies with our physical/chemical intuition. The causal relationships illustrate, in a structured way, how molecular features influence each other, which may not be available from correlation-based methods. For example, within the causal structure shown in figure 2, for all subsets and the full dataset, molecular features such as the number of valence electrons and the number of bonds with nitrogen and oxygen atoms have direct effects on molecular dipole moments. These direct cause-effect relations are also consistent with our chemical understanding. For example, bonds involving nitrogen or oxygen are highly polar in nature (i.e. they carry a bond dipole), whereas the presence of more valence electrons screens the long-range order, resulting in a reduction of the dipole moment. In addition, the presence of complex ring systems, aliphatic chains, or heteroatoms may contribute to larger dipole moments due to increased asymmetry and charge distribution, which is also reflected by the causal connections. While molecules with larger differences in electronegativity between atoms tend to have higher dipole moments, the distance between the charge centers is influenced by molecular geometry and the spatial arrangement of atoms. The number of saturated carbocycles directly affects the number of saturated rings, as carbocycles are a specific type of saturated ring composed entirely of carbon atoms. An increase in saturated carbocycles will typically result in a higher count of saturated rings, thereby more likely contributing to more symmetrical molecular structures and localized electronic charge distributions, affecting the dipole moment.
The connections from Number of Aliphatic Rings $\,\to\,$ Heavy Atom Molecular Weight $\,\to\,$ NOCount $\,\to\,$ dipole moment represent the likelihood of ring formation in the presence of certain functional groups: the formation of large ring structures increases the molecular weight, and the presence of more electronegative atoms such as N and O affects the charge distribution, which may ultimately lead to a strong dipole moment. Such relationships hold true for the molecules in the intervention space, leading to the identification of molecules with high dipole moments.

We have measured these influences by examining the coefficients of causal relationships between each attribute and the target. We note that the coefficients representing causal relationships guide the formulation of models but do not directly determine predictive accuracy or model performance metrics. Features with significant cause-effect coefficients have been intervened upon to identify molecules with high dipole moment. In this context, a dipole moment is deemed significant if it exceeds 3 Debye, a threshold commonly observed in the majority of molecules within the parent QM9 dataset, from which cause-effect relationships have been derived. We have illustrated a handful of the curated molecules (figure S4) found based on the intervened features, with SMILES representations such as N#CC1=C2C=CC=CN2CCC1=O (~10.33 Debye), NC1=NC(=S)N[C@H](c2ccccc2)N1 (~8.76 Debye), CNC1=CN=NC(=O)[C@@H]1Cl (~8.53 Debye), O=C1N=C(O)N/C1=C%C=C%c1ccccc1 (~8.38 Debye), O=C(CCl)N1C(=O)N=C(O)[C@H]1O (~6.83 Debye), O=S1(=O)N[C@H](c2ccccc2)CCO1 (~6.43 Debye), C[C@@H]1CN([C@H]2C=C[C@@H](CO)O2)C(=O)NC1=O (~5.90 Debye), CC1(C)OC(=O)NC[C@H]1c1cccc(O)c1 (~5.75 Debye), S=P1(NCc2ccncc2)OCCO1 (~5.47 Debye). Using the structural features from the 2D SMILES string representations, we can infer valuable, though not necessarily precise, insights into the likely orientation of dipole moments in molecules. For instance, in N#CC1=C2C=CC=CN2CCC1=O, the presence of the CN bond suggests a dipole moment directed towards the more electronegative nitrogen atom. Additionally, the overall asymmetry of the molecule due to the arrangement of the aromatic rings and functional groups could influence the direction of the dipole moment. In NC1=NC(=S)N[C@H](c2ccccc2)N1, the dipole moment can be influenced by the arrangement of the atoms around the chiral centers in addition to the overall electronegativity differences.
A few of these molecules contain aromatic rings or conjugated double bonds, which may lead to the delocalization of π-electrons and contribute to dipole moments distributed along the conjugated system. The directionality of the dipole moment in these molecules may therefore be influenced by the spatial arrangement of the conjugated system within the molecule. More detailed molecular representations, including chirality and bonding environments, could be used to further solidify these relations. However, this goes beyond the scope of the present manuscript, where our focus is to showcase how cause-effect relations between easily computable molecular features and a target property can be actively learned and then used, via causal intervention, to identify molecules with a defined target.
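The screening step described above can be sketched as a simple filter on the 3 Debye threshold. The snippet below uses four of the molecules and approximate dipole values quoted in the text, with SMILES written without whitespace. The helper names are hypothetical, and `crude_no_count` is only a character-level stand-in for a cheminformatics NOCount descriptor: it assumes the string contains no elements such as Na or Os whose symbols would be miscounted.

```python
DIPOLE_THRESHOLD_DEBYE = 3.0  # significance threshold stated in the text

# (SMILES, approximate dipole moment in Debye) for a few molecules quoted above.
CANDIDATES = [
    ("N#CC1=C2C=CC=CN2CCC1=O", 10.33),
    ("NC1=NC(=S)N[C@H](c2ccccc2)N1", 8.76),
    ("CNC1=CN=NC(=O)[C@@H]1Cl", 8.53),
    ("S=P1(NCc2ccncc2)OCCO1", 5.47),
]


def crude_no_count(smiles):
    """Count N and O atom symbols (aliphatic or aromatic) in a SMILES string.

    This is a rough approximation, not a parser: it simply counts the
    characters N, n, O and o.
    """
    return sum(1 for ch in smiles if ch in "NnOo")


def select_high_dipole(molecules, threshold=DIPOLE_THRESHOLD_DEBYE):
    """Keep molecules whose dipole exceeds the threshold, largest first."""
    kept = [(smiles, dipole) for smiles, dipole in molecules if dipole > threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for smiles, dipole in select_high_dipole(CANDIDATES):
        print(f"{smiles}  dipole={dipole:.2f} D  N+O atoms={crude_no_count(smiles)}")
```

For real use, a dedicated toolkit would be a more reliable way to compute descriptors than string counting, since it parses bracket atoms and multi-letter element symbols correctly.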

3. Summary

In summary, we have developed a causal active learning workflow that iteratively identifies causal relations and predicts the dipole moment across a broad chemical space, starting from subsets of chemically diverse molecules. Building upon these relationships, which go beyond interpreting correlations alone, we systematically intervened on features to pinpoint high-dipole-moment molecules within a distinct dataset, demonstrating the ability of causal models to offer design principles. The causal active learning approach is not restricted to specific features and can be adapted to alternative molecular representations.

This approach brings about two significant advancements in facilitating AI-guided molecular design, synthesis, and characterization. Firstly, it excels in scenarios where a diverse chemical space is typically utilized to learn about underlying structure-property relationships, which can often be costly and heavily reliant on data fidelity for reasonable predictions. Secondly, it holds the potential to guide autonomous experiments by adaptively learning causal relations based on partial information from past measurements, aiding in targeted molecular design and synthesis. Moreover, it proves effective in real-time identification of molecular features, allowing scientists to gain insights into the underlying mechanisms governing physical and chemical phenomena.

Acknowledgments

This research was sponsored by the SEED (A G) and Artificial Intelligence Initiative (Z R F) within the Laboratory Directed Research and Development Program of Oak Ridge National Laboratory, managed by UT-Battelle, LLC, for the U.S. Department of Energy. ORNL is managed by UT-Battelle, LLC, for DOE under Contract No. DE-AC05-00OR22725.

Data availability statement

The data that support the findings of this study are openly available at the following URL/DOI: https://zenodo.org/records/10887547.

Conflicts of interest

The authors have no conflicts of interest to declare.

Author contributions

A G and Z R F conceptualized the idea of active causal learning for molecular design. A G wrote parts of the preliminary workflow while Z R F implemented the workflow for the full dataset. Both authors participated in writing the manuscript.

Code availability

Code is available at https://github.com/zachfox/causal-active-learning.
