Zhichao Chen1, Hao Wang1, Yiran Ma1, Cheng Qiu1, Le Yao2, Xinmin Zhang1, and Zhihuan Song1

1Zhejiang University, College of Control Science and Engineering, State Key Laboratory of Industrial Control Technology, 38 Zheda Rd. (Yuquan Campus), Hangzhou, Zhejiang Province, 310027, P. R. China

2Hangzhou Normal University, School of Mathematics, 2318 Yuhangtang Rd., Hangzhou, Zhejiang Province, 311121, P. R. China

Inferential sensors are pivotal in process automation, contributing significantly to modeling, control, and optimization [1, 2]. Their relevance extends across many dimensions of the process industry. One critical driver is the demand to boost profitability, reduce material usage, enhance safety, and safeguard the environment, which has spurred the extensive adoption of artificial intelligence methods, especially deep learning (DL). Concurrently, the growing scale of process manufacturing has intensified the need to manage large volumes of industrial big data efficiently [3]. Consequently, integrating DL techniques into inferential sensors is essential for precise control and optimal decision-making in the process industries.

In recent decades, data-driven inferential sensor modeling, especially DL-based modeling, has seen substantial research and development, fueled by advances in machine learning algorithms, distributed control systems, and database technologies [1]. DL-based inferential sensor techniques can be classified into several distinct model architectures: (stacked) autoencoders (SAEs)/feedforward neural networks (FNNs) [4], convolutional neural networks (CNNs) [5], recurrent neural networks (RNNs), and transformers. Each architecture has its own advantages and disadvantages. SAEs/FNNs excel at handling process nonlinearity but are less effective with dynamic data. CNNs, with their convolutional kernel structure, offer local memory capabilities but are limited by the predefined size of their receptive field. RNNs adeptly handle dynamic processes [6], yet they are prone to gradient explosion or vanishing during training [7]. Transformers, employing self-attention mechanisms (SAMs), greatly enhance parallelization in sequence modeling, but the resulting “position insensitivity” can be a drawback [8]. This diversity among models supports a well-rounded strategy for addressing the various challenges encountered in inferential sensor modeling.

Despite the significant strides made by current DL-based inferential sensor modeling approaches, a critical question persists: how can we effectively integrate established knowledge of industrial processes with these DL models? Industrial processes are subject to stringent evaluation before operation, so their data often conform to empirical rules specific to areas such as unit operations and reaction engineering. Incorporating this well-established knowledge into DL models holds the potential to greatly improve their efficacy and relevance. However, unlike physics-informed neural networks (PINNs) [9], which model industrial processes at the transport-process scale, inferential sensor models operate at the unit-operation scale and often cannot capture all the coefficients required by semi-empirical equations.
For instance, accurate computation in applications such as an absorber column depends critically on knowledge of mass transport coefficients. However, these coefficients are typically not obtainable through standard instrumentation, resulting in a gap in the necessary data. This chapter refers to this gap as “incomplete knowledge.” One of the primary challenges we face is the effective articulation and incorporation of this incomplete knowledge into data-driven methodologies, which is essential for improving the performance of inferential sensors. This integration demands not only a deep understanding of the underlying industrial processes but also innovative strategies to compensate for the missing information in a way that enhances the overall efficacy of the inferential sensor models.

Recent research increasingly adopts graph neural networks (GNNs) to encapsulate prior knowledge in industrial processes [10–13]. This approach leverages the fact that industrial processes can be represented graphically, with instruments functioning as nodes within this framework. Such a configuration significantly enhances modeling efficacy by enforcing interconnections among the various instruments. The methodology mirrors the principles used in recommender systems, where modeling user–item pairs leads to performance improvements [14]. However, the direct adoption of predefined graphs can encounter two specific challenges, referred to in this chapter as the data shifting problem and the knowledge selection problem. To address the first issue effectively, it is advisable to incorporate prior knowledge dynamically within the prior term of the loss function and subsequently exclude it during the model’s inference stage. This contrasts with the conventional method of statically embedding prior knowledge into the model’s input graph and maintaining it throughout the inference stage. Addressing the second issue involves designing a model that inherently and seamlessly adjusts to the evolving prior knowledge. In response to these challenges, this chapter introduces the “variational inference over graph” module, a novel approach designed to handle both data shifting and knowledge selection in dynamic modeling environments.

This chapter is organized as follows: we begin with a comprehensive review of the relevant literature, identifying and discussing the existing technical gaps in the field. Then, in Section 7.3, we derive the proposed methodology and detail the architecture of the introduced module. Section 7.4 is dedicated to the empirical validation of our approach, showcasing its application and effectiveness in a real-world catalytic shift unit. The chapter concludes with Section 7.5, where we summarize our key findings and offer insights derived from our research. This chapter is derived from the works of Chen et al. [16, 17].

The integration of prior knowledge into industrial process modeling to enhance model performance is a well-explored domain in data-driven approaches specific to industrial processes. Unlike other fields, industrial processes are subject to rigorous design and rating protocols before they begin operation. As a result, process measurement data generally conform to established (semi)empirical guidelines, facilitating the seamless integration of prior knowledge into inferential models for industrial process data.
In Sections 7.2.1 and 7.2.2, this discussion is divided into two main areas: the transport process scale and the unit operation scale, each offering unique insights into the application of prior knowledge in industrial process modeling.

At the transport process scale, industrial operations are governed by well-established conservation laws of momentum, heat, and mass. Reflecting this foundation, PINNs, which replace nonlinear terms in partial differential equations (PDEs) or ordinary differential equations (ODEs) with neural networks, have gained substantial attention in recent research, as highlighted in works by Raissi et al. [18] and Zhiyong et al. [19]. A prominent example is the prediction of turbulent flows, where Wang et al. [20] developed the turbulent-flow net, integrating trainable spectral filters with Reynolds-averaged Navier–Stokes equations and large eddy simulation in a specialized U-net structure. Addressing few-shot learning challenges, Zheng et al. [21] proposed a physics-informed RNN within a model predictive control framework, ensuring compliance with physical laws while substituting key equations with an RNN. While PINN-based models have found considerable application in process control, as noted by Alhajeri et al. [22] and Zheng et al. [21], applying process measurement data directly in PINNs is difficult because of the complex transport phenomena inside unit operations. This complexity limits the broader application of PINN-based methods in industrial process inferential sensors.

At the unit operation scale, practitioners typically use graph structures rather than strict ODE/PDE formulations to represent process knowledge. In this context, Bayesian networks (BNs), as discussed by Zeng et al. [23] and Khosbayar et al. [24], are a natural choice for modeling the correlations between process variables. Khosbayar et al. [24], for instance, designed BN structures based on flowsheets, developed a corresponding parameter estimation algorithm using the expectation–maximization algorithm, and validated its effectiveness in a bitumen upgrading process. However, the training and inference of BNs, typically based on the expectation propagation algorithm, are not easily integrated into the mini-batch stochastic gradient descent-based DL framework. To address this, recent studies have increasingly focused on GNN-based models [11, 12, 25]. Notably, Ren et al. [12] used normalized mutual information to extract the graph of process variables, applied a multi-level knowledge graph to model large-scale industrial process data, and demonstrated its effectiveness in a cobalt–nickel removal process.

Despite significant advancements in embedding prior knowledge in industrial process inferential sensors, two key issues remain unresolved: the data shifting problem and the knowledge selection problem. To bridge these technical gaps, we introduce the variational inference over graph (VIOG) framework in Section 7.3. Our approach begins by reinterpreting the graph within a Bayesian framework and modifying the loss function to include graph constraints, thereby addressing the data shifting problem. Next, we consider the knowledge representation problem in detail. Specifically, we examine the process of information selection among the various variables, leading to the development of a SAM aimed at resolving the knowledge selection problem. Additionally, we analyze the parallels between this SAM and graph convolution operations to underscore its validity.
The section culminates with the presentation of the overall architectural design of the VIOG framework.

In essence, the task of inferential sensing can be regarded as a special case of supervised learning. Consequently, if we represent the process variables by x, the quality variable by y, and the parameters by θ, our learning objective can be articulated as follows: Based on Eq. (7.1), we can reformulate the learning objective as follows, where the third line is built upon Jensen’s inequality [26, 27]. Furthermore, we can divide the last line of Eq. (7.2) and formulate a novel optimization problem as follows: Comparing Eqs. (7.2) and (7.3), it becomes evident that our learning objective transforms the conventional constraint term, typically a constant in most GNN-based models, into a penalty term. This transformation can be understood through the application of the celebrated Lagrangian multiplier method, where β functions as the Lagrangian multiplier, and the learning objective of the β-VAE appears on the right-hand side of Eq. (7.4).

A key innovation within our variational inference framework is the conversion of the standard predefined graph, a common feature in GNN-based models, into a penalty form. This adaptation enables our model to operate without reliance on a specific graph structure, enhancing its versatility, and effectively addresses the “data shifting” issue discussed in Section 7.1. Section 7.3.2 explores the representation and integration of prior knowledge in our model, further clarifying its significance and operational mechanism.

In this subsection, we aim to elucidate the methodology for integrating incomplete knowledge into the prior term of the loss function. Echoing approaches in the natural language processing domain, where sentences are delineated using graph-like grammar trees [28], our model in GNN-based industrial process modeling similarly employs graphs to represent knowledge regarding process covariates. However, as previously mentioned, this prior knowledge is characteristically uncertain. The essence of utilizing this uncertain knowledge effectively lies in its probabilistic representation. Consequently, this subsection introduces an approach for articulating prior knowledge from a probabilistic perspective.

According to previous works [29, 30], the graph adjacency matrix can be written as $\mathbf{A} \in \{0, 1\}^{\nu \times \nu}$, where ν indicates the number of process variables. In light of this, each edge (denoted as α) in the graph may be present (α = 1) or absent (α = 0). This indicates that the Bernoulli distribution can be adopted to describe the state of an edge, as shown in Eq. (7.6), where ρ represents the expected value of the Bernoulli distribution. Given our comprehension of the process, knowledge can be classified into three discrete states: definitively present, ambiguously present, and definitively absent. To address this, we employ the 2σ bound of the Gaussian distribution, following the Pauta criterion, to determine the value of ρ. This method enables us to formulate the subsequent disjunctive expression, where ⊻ indicates that only one square bracket takes effect.
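To make this edge prior concrete, the following minimal Python sketch maps the three knowledge states to Bernoulli expectations via the 2σ bound and assembles a ν × ν prior matrix of the kind shown later in Figure 7.6a; the function names and the edge-list representation are illustrative assumptions rather than the chapter’s implementation.

```python
import math
import numpy as np

# Probability mass inside the 2-sigma bound of a standard Gaussian (Pauta criterion):
# P(|Z| < 2) ~= 0.9545, complement ~= 0.0455 (rounded to 0.955/0.045 in the text).
P_2SIGMA = math.erf(2.0 / math.sqrt(2.0))

def edge_prior(state: str) -> float:
    """Expected value rho of the Bernoulli prior on a single edge (cf. Eq. (7.6))."""
    if state == "present":            # definitively present
        return P_2SIGMA
    if state == "absent":             # definitively absent
        return 1.0 - P_2SIGMA
    return 0.5                        # ambiguously present: uninformative prior

def prior_matrix(nu: int, known_edges) -> np.ndarray:
    """nu x nu matrix of edge priors: known edges and the diagonal (self-loops)
    receive ~0.955, every other entry ~0.045 (cf. Figure 7.6a)."""
    rho = np.full((nu, nu), 1.0 - P_2SIGMA)
    np.fill_diagonal(rho, P_2SIGMA)
    for i, j in known_edges:
        rho[i, j] = P_2SIGMA
    return rho
```

A row-wise log-softmax of such a matrix would then serve as the mean of the prior log-normal distribution, as discussed for Figure 7.6b.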
Having delineated our prior graph, we now turn to the variational distribution. In summary, based on our analysis, the following design requirements are essential to address the “knowledge selection” issue. To address requirement (1), we can conduct a similarity measurement as follows, where $s_{ij}$ denotes an element of the resulting similarity matrix. The abovementioned graph construction resembles the SAM [32] in the transformer model, as depicted in Figure 7.1. In the next part, we discuss the similarity between GNN and transformer models.

In Section 7.3.2.3, our primary focus is a close examination of the similarities between GNN-based networks and transformer models. For simplicity, we take graph convolutional networks (GCNs), the most widely used GNNs in the process industry, as the object of analysis. As articulated in Ref. [33], the GNN can be described within the framework of message passing neural networks, as elucidated in Eq. (7.11), where the symbol φ represents an operator that transforms the features of node i and those of its neighboring nodes j into a single, unified message. Concurrently, ζ denotes a permutation-invariant operator responsible for aggregating all messages associated with node i; this aggregation can be executed through various methods, such as summation, mean calculation, or maximization. Finally, the γ operator maps the features of node i onto the aggregated message generated by the ζ operator, effectively integrating node-specific information with broader network insights. As such, the feature at node i in the kth GC layer is obtained.

Figure 7.1 Model structure of (a) the transformer model, (b) the SAM, and (c) the multi-head SAM.

Equation (7.11) can be rewritten as Eq. (7.12) for the GC operator [34], where U and W are learnable weights, addition (+) is adopted as the γ operator in Eq. (7.11), summation (∑) is chosen as the ζ operator, and the φ operator is a linear transformation. Besides, in Eq. (7.12), a self-loop (node i itself is included in the set of its neighboring nodes ℕ(i)) exists for all nodes. By analyzing Eq. (7.12), the corresponding structure can be drawn as Figure 7.2a shows. Meanwhile, as shown in Figure 7.2b, the SAM [32] can be described by Eqs. (7.13)–(7.15). Equation (7.13) shows that the input data are mapped into the vector spaces named Q (query), K (key), and V (value), respectively. The similarity between vectors in the Q space and the K space can be transformed into weights on vectors in the V space, considering the scaling factor D as shown in Eq. (7.14), where the weight is called the attention value, and the matrix formed by the attention values is denoted as the attention matrix.

Figure 7.2 The structure of (a) the message passing of GNN, (b) the self-attention mechanism.

This analysis leads to the conclusion that the similarity computation achieved through the inner product and the weighted summation operation, as described in Eq. (7.13), is essentially analogous to the operation of summing over the neighboring nodes detailed in Eq. (7.12). In essence, this implies that the SAM can be equated to GC. This equivalence suggests that the attention matrix obtained by the SAM plays the role of a data-driven graph over the process variables.

In Section 7.3.2.3, we analyzed the motivation for introducing the SAM and its similarity to the GC operator. Nevertheless, model training remains a challenge: since the loss function is derived within the variational inference framework, obtaining samples from the posterior distribution is vital for proceeding with model training.
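Before turning to posterior sampling, the following PyTorch sketch makes the SAM-as-graph-convolution analogy of Eqs. (7.12)–(7.15) concrete for ν process variables: the row-normalized attention matrix acts as a learned adjacency, and the weighted sum over the value vectors mirrors the neighborhood aggregation of a GC layer. The class name, the single-head design, and the tensor shapes are illustrative assumptions, not the chapter’s exact implementation.

```python
import torch
import torch.nn as nn

class VariableSelfAttention(nn.Module):
    """Single-head scaled dot-product attention over nu process variables.
    The (nu x nu) attention matrix plays the role of a data-driven adjacency,
    mirroring the neighborhood aggregation of a graph-convolution layer."""
    def __init__(self, d_model: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.scale = d_model ** 0.5

    def forward(self, x):                                   # x: [batch, nu, d_model]
        Q, K, V = self.q(x), self.k(x), self.v(x)
        scores = Q @ K.transpose(-2, -1) / self.scale        # [batch, nu, nu] similarities
        A = torch.softmax(scores, dim=-1)                    # row-stochastic "graph"
        h = A @ V                                            # weighted neighbor sum (cf. Eq. (7.12))
        return x + h, A                                      # residual output, as in Eq. (7.15)
```

Returning both the updated features and A exposes the attention matrix, which Section 7.3.2.4 treats as a random graph to be sampled rather than a deterministic output.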
To this end, the remainder of this part focuses on how to obtain samples from the posterior distribution. First, let us review the normalization constraint introduced by the SAM. Building on this framework, each row of the attention matrix must lie on the probability simplex, i.e., its entries are nonnegative and sum to one. Although the Dirichlet distribution meets these criteria and promotes sparsity, making it an intuitive choice, it is not readily reparameterizable in gradient descent-driven DL backends. Consequently, in alignment with the approach suggested in Ref. [35], a strategy is adopted wherein random variables are drawn from a nonnegative distribution and subsequently normalized to satisfy the simplex constraint. This process yields the normalized attention vector, where the subscript i indicates the row vector at row i. In this chapter, the log-normal distribution defined in Eq. (7.18) is chosen as the nonnegative distribution, and the σ of the log-normal distribution is treated as a global hyperparameter. Since sampling from a log-normal distribution is equivalent to sampling from a normal distribution and then exponentiating the result, the unnormalized attention weight can be obtained accordingly. Note that the sampling operation ensures that Eq. (7.21) holds, which means that the gradient estimate of the unnormalized graph is unbiased.

Based on the abovementioned approaches, we name our module VIOG: it is derived within the variational inference framework, and the resulting model can be treated as a special kind of GNN. Moreover, we have not introduced any assumption about the model structure in the derivation, which indicates that the proposed approach can serve as a plug-in module adaptable to most current model structures. On this basis, our proposed model architecture is given as follows.

Figure 7.3 presents the illustration of the VIOG, which consists of the VIOG module and downstream DL models. The Bayesian SAM, parameterized by φ, serves as the inference network of the VIOG module, which leverages and reconciles the prior knowledge, while the downstream DL models, parameterized by θ, serve the prediction task. Suppose the input tensors have a shape like [batch size, sequence length, ν]. Before all operations, the reshape operation in Eq. (7.22) is first executed to avoid breaking the time-series structure. After that, a linear projection for input tensor embedding is conducted on each dimension of shape [batch size × sequence length, 1], as in Eq. (7.23). Typically, instead of performing a single attention function, it is beneficial to linearly project the queries, keys, and values H times with different, learned linear transformations; thereby, Eq. (7.23) is executed in parallel across H heads. Thereafter, the similarity sij can be obtained as in Eq. (7.24). On this basis, the expected value is calculated row by row as in Eq. (7.25), where the subscript j signifies that the log-softmax operator is applied over the column index j within each row. The reparameterized sample is then drawn, and thus the nonnegative random variable αij can be obtained as per Eq. (7.27).

Figure 7.3 The illustration of the VIOG module.

The entry αij of the attention matrix is then normalized row by row. In this procedure, the unnormalized attention weight, which is instrumental in calculating the KL divergence term, is provided as depicted in Eq. (7.29). Note that the inference of the label also depends on the input x; in the corresponding equations, h is the hidden feature and the subscript res stands for residual. Besides, to avoid overfitting of the VIOG module, it is suggested to apply a dropout layer after obtaining the final feature hfinal [32, 36]. Finally, the loss function for DL models equipped with the VIOG module can be rewritten as Eq. (7.32), where β acts as a balancing mechanism, harmonizing the trade-off between the prediction error and the knowledge-based regularization.
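The sampling and regularization steps above can be summarized in a short PyTorch sketch. It assumes the raw similarities s_ij are already computed, uses the row-wise log-softmax as the log-normal mean, reparameterizes with Gaussian noise, and adds a β-weighted KL term; the closed-form KL shown here assumes the variational and prior log-normals share the global σ, which may differ from the chapter’s exact expression.

```python
import torch
import torch.nn.functional as F

def sample_attention_graph(scores: torch.Tensor, sigma: float = 1.0):
    """Reparameterized sample of the attention graph (cf. Eqs. (7.25)-(7.28)).
    `scores` holds the raw similarities s_ij with shape [..., nu, nu]."""
    mu = F.log_softmax(scores, dim=-1)            # row-wise log-softmax -> log-normal mean
    eps = torch.randn_like(mu)                    # standard Gaussian noise
    alpha = torch.exp(mu + sigma * eps)           # nonnegative (log-normal) sample
    A = alpha / alpha.sum(dim=-1, keepdim=True)   # row-wise normalization onto the simplex
    return A, mu

def viog_loss(y_hat, y, mu, prior_mu, sigma: float = 1.0, beta: float = 1.0):
    """Prediction error plus beta-weighted knowledge regularization (cf. Eq. (7.32)).
    With a shared sigma, the KL between two log-normals reduces to a scaled
    squared difference of their means."""
    kl = 0.5 * ((mu - prior_mu) ** 2).sum() / sigma ** 2
    return F.mse_loss(y_hat, y) + beta * kl
```

In a full model, A would weight the value vectors and feed the residual connection of Eqs. (7.30)–(7.31) before the downstream predictor produces y_hat.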
In this section, the effectiveness of the VIOG module for industrial process inferential sensors is demonstrated on a catalytic shift conversion (CSC) unit. We devise experiments on an industrial inferential sensor dataset to verify the superiority of the VIOG module and to answer the following research questions:

The root mean square error (RMSE), coefficient of determination (R2), mean absolute error (MAE), and mean absolute percentage error (MAPE) are used as the evaluation indices; their detailed expressions are given in Eqs. (7.33)–(7.36), respectively.

To showcase the effectiveness of the proposed method, we conduct a case study focusing on quality prediction in a real CSC unit, which belongs to an ammonia synthesis process. In the following, we offer an overview of the technical background pertinent to this industrial process. Figure 7.4 illustrates the process flow diagram of the CSC unit, as described in Ref. [37], which is integral to an actual ammonia synthesis process. The chemical reaction, specified in Eq. (7.37), occurs within fixed-bed reactors. A critical aspect of ammonia synthesis is maintaining an optimal carbon–hydrogen ratio. The unit comprises two isothermal fixed-bed reactors arranged sequentially. The initial step compresses the reactant gas into the first high-temperature reactor, ensuring thermodynamic equilibrium and an appropriate reaction rate. Subsequently, the gas is cooled in the second reactor, shifting the equilibrium to favor increased hydrogen yield. This process results in a final product gas that is ready for further processing in downstream operations.

Equation (7.37) describes an exothermic reaction, as detailed in Ref. [38] (where ΔH indicates the enthalpy change), so elevated temperatures may shift the equilibrium toward the reverse reaction. This shift, in accordance with Le Chatelier’s principle, typically results in a decreased conversion rate of carbon monoxide. Conversely, the reaction kinetics, governed by the Arrhenius equation, indicate that the reaction rate is significantly reduced at lower temperatures. Consequently, to optimize the conversion efficiency, it is practical to conduct the reaction over different catalysts in different temperature ranges. This approach balances the thermodynamic and kinetic considerations, ensuring optimal conversion from both perspectives.

The technological specifications described above necessitate accurate, real-time monitoring of carbon monoxide levels, specifically at the outlet of the low-temperature bed, which is highlighted in green in Figure 7.4. Traditional measurement techniques, such as gas chromatography, introduce significant delays in carbon monoxide content detection. To enhance the control of the reactor and enable immediate monitoring of carbon monoxide concentrations, hard sensors have been deployed to gather essential process variables. For the purpose of constructing a robust quality prediction model, 13 key variables have been selected; these variables and their respective roles in the process are detailed in Table 7.1.

Figure 7.4 Flowsheet of the CSC unit.

Based on the principles of chemical reaction engineering, the reactors are analyzed employing the plug-flow reactor model, as depicted in Figure 7.5.
By analyzing the infinitesimal element (dV) of length dl marked in the white zone, the mass conservation of reactant γ can be derived as shown in Eq. (7.38), where ζ is the bed voidage, ℱ is the flow rate, and ℛ is the reaction rate.

Table 7.1 Variables of the CSC process.

Figure 7.5 The plug-flow reactor model.

Meanwhile, the heat conservation [39] can be given in Eq. (7.39), where Gc indicates the gas heat capacity, the subscript c is the abbreviation of capacity, and T indicates temperature. According to the Ergun equation, the pressure drop (ΔP) can be derived as in Eq. (7.40), where Re is the Reynolds number, ρg is the reactant gas density, ug is the gas velocity, and the subscript g is the abbreviation of “gas.” Note that the gas velocity determines the ratio of the gas reactant diffusion rate to the reaction rate (also known as the Thiele modulus). Thus, at the reactor design stage, the velocity and the voidage should be chosen so that the pressure drop is as small as possible, yielding higher reactant conversion [40].

Through the analysis of the CSC process, the following prior knowledge can be obtained:

Figure 7.6 (a) The prior knowledge adjacency matrix of CSC; (b) the normalized prior knowledge adjacency matrix of CSC.

Consequently, the prior knowledge of the CSC process can be encoded as Figure 7.6a shows, where the entries corresponding to prior knowledge and the diagonal are set to 0.955, and all other elements to 0.045. The graph normalized row-wise by the log-softmax function is given in Figure 7.6b and is adopted as the value of μ in the prior log-normal distribution.

We choose the following kinds of baseline models to demonstrate the effectiveness of the VIOG module. The hyperparameter settings and other training protocols are reported in Appendix 7.A. To facilitate comprehension, the final column of the results table shows the “win counts” as per Ref. [41], representing the number of metrics on which the model featuring the VIOG module outperforms the baseline models:

Table 7.2 presents the evaluation metric results for the different models. The following observations can be obtained:

Observation 1 highlights the efficacy and superiority of the proposed VIOG module across most downstream models in industrial data modeling tasks. However, as noted in Observation 2, the VIOG module demonstrates only marginal improvements when compared with other GNN models. This limited enhancement could be attributed to the complexity of the model architecture: the transformer model, characterized by its larger parameter count and more intricate encoder and layer structure, presents greater optimization challenges than the other models, so the VIOG module’s impact on improving the transformer model is less pronounced than it is on other downstream models. Further supporting this, Observation 3 points out the similarity between traffic flow prediction tasks and inferential sensor tasks. In traffic flow prediction, adaptive GNN models such as Graph WaveNet [44], MTGNN [45], and FC–GAGA [46], which can be considered variants of synthesizer-based models [43], align with the findings of Observation 3.

Table 7.2 Comparison to baseline models.

The VIOG module incorporates a prior knowledge regularization term into its learning objective, a feature that inherently enhances model performance. In this subsection, we conduct a comparative analysis between this prior knowledge-based regularization term and the conventional L1 and L2 regularization terms. This comparison is intended to further demonstrate our method’s superiority.
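Before presenting that comparison, the following sketch shows, under illustrative assumptions, how the two kinds of regularizers enter the training loss: the L1/L2 baselines penalize the weight norms of the downstream model, whereas the VIOG term penalizes the divergence between the inferred graph and the knowledge prior. The helper name and the use of a mean-squared prediction error are assumptions, not the chapter’s exact code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def weight_norm_penalty(model: nn.Module, l1: float = 0.0, l2: float = 0.0) -> torch.Tensor:
    """Conventional L1/L2 weight-norm regularizers used as baselines in Table 7.3."""
    l1_term = sum(p.abs().sum() for p in model.parameters())
    l2_term = sum((p ** 2).sum() for p in model.parameters())
    return l1 * l1_term + l2 * l2_term

# Training-step sketch: the two approaches differ only in the regularizer added
# to the prediction error.
#   loss = F.mse_loss(y_hat, y) + weight_norm_penalty(model, l1=0.001)   # L1/L2 baseline
#   loss = F.mse_loss(y_hat, y) + beta * kl_to_prior_graph               # VIOG (Eq. (7.32))
```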
Following the approach used in Section 7.4.5, we implement a grid search for the L1 and L2 regularization terms and present the findings in Table 7.3. From Table 7.3, we find that the VIOG module consistently outperforms the conventional L1 and L2 regularization terms across most downstream models, which highlights the effectiveness and superiority of the knowledge regularization term introduced by VIOG. However, the prior knowledge regularization term underperforms when applied to the LSTM and transformer downstream models on the CSC dataset. This can likely be attributed to reasons analogous to those discussed in Section 7.4.5. Specifically, the CSC unit’s direct connection to downstream production units, which maintains high reactant purity and stable production content at set points, can be equated to “concept drift” in machine learning terminology. As a result, the VIOG-augmented downstream model exhibits diminished performance compared with traditional L1 and L2 regularization when tested on the CSC dataset. Nonetheless, in most tested scenarios, the VIOG regularization term demonstrably enhances the performance of downstream models beyond what is achieved with standard L1 and L2 terms, emphasizing the broad superiority of the VIOG module in enhancing model performance.

Table 7.3 Comparison with L1 and L2 regularization terms.

In this subsection, we conduct a sensitivity analysis of the VIOG model paired with a transformer downstream model on the CSC dataset. The outcomes of this analysis are depicted in Figure 7.7, which shows how the model’s performance varies with the regularization strength and the batch size. As evident from this figure, despite some fluctuations in performance as the hyperparameters change, the model consistently exhibits strong performance. This indicates that the VIOG module is effectively designed and remains robust across a broad range of hyperparameter settings.

Figure 7.7 The sensitivity analysis results on the transformer downstream model: (a) RMSE, (b) R2, (c) MAE, and (d) MAPE along the batch size for the CSC dataset; (e) RMSE, (f) R2, (g) MAE, and (h) MAPE along the regularization strength β for the CSC dataset.

In this chapter, we introduced the VIOG module, designed to automatically encode and select knowledge as DL features in data-driven industrial process modeling. The VIOG module enhances traditional DL models by incorporating a VIOG inference network and a novel prior knowledge regularization term. Our approach involved analyzing the probabilistic graph structures inherent in conventional GNNs and DL models. This analysis guided the development of the variational inference technique, which utilizes prior knowledge as a regularizer. Furthermore, we crafted the SAM as a spatial-based GNN variant aimed at assimilating prior knowledge effectively, and we derived the loss function of the VIOG module to align with these objectives. To demonstrate its efficacy, the VIOG module was applied in inferential sensor experiments on the CSC process, where it exhibited superior performance.

While the VIOG module shows promising potential, it stands to benefit from advancements in two key areas. First, the computation of attention scores in the VIOG module is computationally demanding, requiring significant time and memory resources.
A potential solution lies in adopting sampling techniques similar to those proposed in Ref. [41], which could significantly reduce the training cost of VIOG. Second, there is room for improvement in the precision of the probabilistic density estimation. The current reliance of the VIOG module on amortized variational inference leads to what is known as an “amortization gap” – a discrepancy between the log-likelihood and the evidence lower bound (ELBO), as detailed in Ref. [50]. This gap might be narrowed by integrating variational inference with importance sampling methods, thereby tightening the ELBO, a concept explored in Ref. [51].

For simplicity, we set the regularization strength β of the VIOG module to 1.0. We conduct a grid search on the learning rate in [0.0001, 0.001, 0.01, 0.1, 1.0] and the batch size in [32, 64, 128, 256, 512, 1024]. For the L1 and L2 regularization terms, we select a regularization strength of 0.001 by applying a grid search over [0.001, 0.005, 0.010, 0.020]. The learning rates and batch sizes adopted for the models are listed in Table 7.A.1. Other hyperparameters are listed as follows:

In addition, both GCN and GAT require a predefined graph structure during the model inference phase. To satisfy this requirement, we consider the existence of edges in the prior graph as being governed by a Bernoulli distribution. Following this approach, the edges of the graph are independently sampled from their respective Bernoulli distributions, in line with the mean-field assumption on graphs outlined in Ref. [30].

We organize the dataset in ascending order of timestamps. Subsequently, we allocate the first 60% of the data for training, the next 10% (from 60% to 70%) for validation, and the remaining portion for testing. For model optimization, we employ the Adam optimizer [52]. All experiments are performed on a high-specification workstation equipped with four Intel Xeon E5 processors, eight NVIDIA GTX 1080 graphics cards, and 128 GB of RAM. Model training and inference are conducted using Python 3.8, with PyTorch 1.10 [53] serving as the DL backend. To ensure robustness and reduce randomness in the results, each experiment is replicated a minimum of three times, each time using one of seven distinct random seeds.

Table 7.A.1 Hyperparameters of the CSC dataset.
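As a minimal sketch of the data-splitting protocol described above, the following Python function performs the chronological 60%/10%/30% partition; the function name and the array-based interface are illustrative assumptions.

```python
import numpy as np

def chronological_split(X: np.ndarray, y: np.ndarray, train: float = 0.6, val: float = 0.1):
    """Split time-ordered data into train/validation/test sets without shuffling,
    matching the 60%/10%/30% protocol used for the CSC dataset."""
    n = len(X)
    i_tr, i_va = int(n * train), int(n * (train + val))
    return (X[:i_tr], y[:i_tr]), (X[i_tr:i_va], y[i_tr:i_va]), (X[i_va:], y[i_va:])

# Each configuration is then trained repeatedly with fixed random seeds, and the
# metrics in Tables 7.2 and 7.3 are reported as the mean and spread over the runs.
```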
7 Integrating Incomplete Prior Knowledge into Data-Driven Inferential Sensor Models Under Variational Bayesian Framework
7.1 Introduction
7.2 Literature Review
7.2.1 Transport Process Scale
7.2.2 Unit Operation Scale
7.2.3 Overall Summary and Technical Gap
7.3 Proposed Approach
7.3.1 Loss Function Derivation
7.3.2 Knowledge Representation
(knowledge description) and to outline the framework for such a representation within our model (knowledge utilization).
7.3.2.1 Knowledge Description
can be decomposed using the mean-field assumption:
7.3.2.2 Knowledge Selection via Self-Attention Mechanism
, our focus now shifts to the variational distribution
. In line with existing literature, graph construction hinges on nodal similarity, which, in our case, translates to the similarity among different process variables. However, solely measuring node similarity is insufficient to tackle the problem of knowledge selection. For instance, consider a source node denoted by i ∈ {1, 2, …, ν} and a target node represented as j ∈ {1, 2, …, ν}. Each target node j might receive input from ν source nodes, potentially leading to redundant information. Therefore, it is imperative to implement a competitive mechanism. This mechanism should not only limit the information received by the target node j but also regulate the information emitted from the source node i.
, Q, and K represent the similarity matrix, the information from the source, and target nodes, respectively. To impose restrictions on this information, we employ a row-by-row normalization using the softmax operator:
. It is important to note that directly applying a competitive approach to a matrix
could lead to a “winner takes all” scenario, where the rows of the matrix
transform into one-hot vectors. Such an occurrence could complicate the model training process and induce significant variance during the gradient estimation phase [31]. To mitigate these issues, the matrix
is derived from
:
7.3.2.3 Similarity of GCN and SAM
. Equation (7.15) stands for the residual block which means that the final output is the sum of the original input data x and the h vector obtained from Eq. (7.14).
, derived through SAM, can represent the interconnections between different nodes in the network. The process of obtaining this attention matrix can thus be viewed as a data-driven approach to knowledge discovery. Based on these insights, it becomes feasible to design our model architecture with a foundation in SAM.
7.3.2.4 Sampling from Posterior
is subject to certain constraints, specifically:
for the attention matrix or graph
, as delineated in Eq. (7.6):
is a random sample drawn from a nonnegative distribution.
can be obtained via the expected value μ of the normal distribution, and the expected value μ can be obtained as Eq. (7.20):
(the entity is denoted as
) is unbiased:
7.3.3 Model Expressions
, which undertakes the task of leveraging and reconciling the prior knowledge, while the DL models, parameterized by θ, serve as
support tasks like soft sensors and process monitoring.
is calculated row-by-row as (7.25):
is reparametrized with noise sampled from
as per (7.20):
is obtained via row-by-row normalization as per (7.17). It is worth noting that (7.27) and (7.17) are functionally equivalent to the softmax operation:
is conditioned on A and x. While the abovementioned description mainly concentrates on
. To include the information of x, the weighted sum in (7.30) and residual operations (7.31) similar to the SAM is also indispensable:
7.4 Experimental Results
7.4.1 Evaluation Metrics
is the mean value of the label,
is the predicted value, y is the real value, and N is the number of samples in the testing dataset. For RMSE, MAE, and MAPE, the smaller the value, the more accurate the model; conversely, the closer R2 is to 1, the better the model fits.
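A compact Python sketch of these four metrics is given below; it assumes the standard definitions, which Eqs. (7.33)–(7.36) are expected to follow, and reports MAPE as a fraction rather than a percentage.

```python
import numpy as np

def evaluate(y_true, y_pred):
    """RMSE, R2, MAE, and MAPE for a test set (cf. Eqs. (7.33)-(7.36))."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    err = y_true - y_pred
    rmse = float(np.sqrt(np.mean(err ** 2)))
    r2 = float(1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2))
    mae = float(np.mean(np.abs(err)))
    mape = float(np.mean(np.abs(err / y_true)))   # reported as a fraction
    return {"RMSE": rmse, "R2": r2, "MAE": mae, "MAPE": mape}
```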
7.4.2 Process Description
7.4.3 Prior Knowledge Analysis
Variable   Description
U1         High-temperature bed, temperature 1
U2         High-temperature bed, temperature 2
U3         High-temperature bed, temperature 3
U4         Outlet temperature of high-temperature bed
U5         Outlet temperature of cooling water
U6         Split-gas temperature
U7         Inlet temperature of low-temperature bed
U8         Low-temperature bed temperature 1
U9         Low-temperature bed temperature 2
U10        Low-temperature bed temperature 3
U11        Outlet temperature of low-temperature bed
U12        Outlet pressure of low-temperature bed
U13        Product gas pressure
Y          Carbon monoxide concentration
indicates temperature.
) can be derived as follows:
7.4.4 Baseline Models
7.4.5 Model Performance Comparisons
Structure    RMSE                R2                  MAPE                MAE
VIOG         0.144 ± 2.414E-6    0.939 ± 3.986E-6    0.058 ± 5.842E-7    0.114 ± 2.249E-6
GCN          0.161 ± 9.606E-5    0.904 ± 2.689E-4    0.065 ± 1.614E-5    0.127 ± 6.230E-5
GAT          0.179 ± 9.385E-4    0.877 ± 4.352E-3    0.069 ± 1.340E-4    0.137 ± 4.929E-4
SynDense     0.174 ± 4.127E-4    0.898 ± 7.115E-4    0.069 ± 5.540E-5    0.135 ± 2.136E-4
SynRandom    0.147 ± 3.648E-5    0.934 ± 6.423E-5    0.059 ± 7.612E-6    0.116 ± 2.933E-5
SVAE         0.159 ± 1.997E-5    0.928 ± 1.681E-5    0.064 ± 2.517E-6    0.125 ± 9.625E-6
ANP          0.239 ± 2.121E-4    0.836 ± 4.134E-4    0.096 ± 3.981E-5    0.188 ± 1.537E-4
7.4.6 Comparison with L1 and L2 Regularization Terms
Downstream    Regularization    RMSE                R2                  MAPE                 MAE
FNN           VIOG              0.144 ± 2.414E-6    0.939 ± 3.986E-6    0.058 ± 5.842E-7     0.114 ± 2.249E-6
FNN           L1                0.177 ± 2.043E-4    0.907 ± 1.656E-4    0.070 ± 3.487E-5     0.128 ± 5.882E-4
FNN           L2                0.188 ± 1.171E-4    0.908 ± 1.563E-4    0.075 ± 1.481E-5     0.147 ± 5.726E-5
CNN           VIOG              0.143 ± 3.583E-5    0.934 ± 8.610E-5    0.057 ± 5.353E-6     0.112 ± 2.048E-5
CNN           L1                0.147 ± 4.017E-5    0.931 ± 8.462E-5    0.058 ± 5.514E-6     0.115 ± 2.122E-5
CNN           L2                0.151 ± 1.104E-4    0.928 ± 9.043E-5    0.060 ± 1.616E-5     0.118 ± 6.235E-5
LSTM          VIOG              0.150 ± 5.858E-5    0.921 ± 1.965E-4    0.060 ± 9.164E-6     0.117 ± 3.531E-5
LSTM          L1                0.145 ± 3.775E-5    0.930 ± 9.769E-5    0.058 ± 5.670E-6     0.113 ± 2.194E-5
LSTM          L2                0.149 ± 4.249E-5    0.924 ± 1.376E-4    0.059 ± 5.456E-6     0.116 ± 2.102E-5
Transformer   VIOG              0.173 ± 8.454E-5    0.906 ± 1.676E-4    0.0692 ± 1.426E-5    0.1361 ± 5.497E-5
Transformer   L1                0.180 ± 3.464E-4    0.890 ± 1.211E-3    0.0737 ± 6.530E-5    0.1425 ± 2.508E-4
Transformer   L2                0.172 ± 3.250E-4    0.902 ± 8.558E-4    0.0694 ± 5.509E-5    0.1364 ± 2.117E-4
7.4.7 Sensitivity Analysis
7.5 Conclusions
Experimental Settings
              Batch size                          Learning rate
Structure     FNN    CNN    LSTM   Transformer    FNN    CNN    LSTM   Transformer
VIOG          1024   64     64     32             0.01   0.001  0.001  0.005
GCN           1024   64     64     32             0.01   0.001  0.001  0.005
GAT           1024   64     64     32             0.01   0.001  0.001  0.005
SynDense      1024   64     256    64             0.01   0.001  0.001  0.001
SynRandom     1024   128    256    1024           0.01   0.01   0.001  0.001
SVAE          32     256    512    1024           0.01   0.01   0.01   0.01
ANP           128    64     64     32             0.01   0.01   0.01   0.01
L1            1024   64     64     32             0.01   0.001  0.001  0.005
L2            1024   64     64     32             0.01   0.001  0.001  0.005
References