Introduction

Multiple imputation1 is a widely applied approach for the analysis of incomplete datasets. It involves replacing each missing cell with several plausible imputed values that are drawn from the corresponding posterior predictive distributions. There are two dominant approaches to arrive at those posterior distributions under multivariate missing data: joint modeling (JM) and fully conditional specification (FCS).

Joint modeling requires a specified joint model for the complete data. Schafer2 illustrated joint modeling imputation under the multivariate normal model, the saturated multinomial model, the log-linear model, and the general location model. However, with an increasing number of variables and different levels of measurement, it can be challenging to formulate the joint distribution of the data.

Fully conditional specification offers a solution to this challenge by allowing a flexible specification of the imputation model for each partially observed variable. The imputation procedure then starts by imputing missing values with a random draw from the marginal distribution. Each incomplete variable is then iteratively imputed with a specified univariate imputation model.

Fully conditional specification has been proposed under a variety of names: chained equations, stochastic relaxation, variable-by-variable imputation, switching regression, sequential regressions, ordered pseudo-Gibbs sampler, partially incompatible MCMC and iterated univariate imputation3. Fully conditional specification can be of great value in practice because of its flexibility in model specification. FCS has become a standard in practice and has been widely implemented in software (e.g. mice and mi in R, IVEWARE in SAS, ice in Stata and module MVA in SPSS)4.

Although many simulation studies demonstrated that fully conditional specification yields plausible imputations in various cases, the theoretical properties of fully conditional specification are not thoroughly understood5. A sequence of conditional models may not imply a joint distribution to which the algorithm converges. In such a case, the imputation results may differ systematically across visit sequences, a phenomenon known as “order effects”6.

Van Buuren3 stated two cases in which FCS converges to a joint distribution. First, if all imputation models are linear with a homogeneous, normally distributed response, the implicit joint model is the multivariate normal distribution. Second, if three incomplete binary variables are imputed with a logistic regression model containing two-way interactions, FCS is equivalent to joint modeling under a log-linear model with zero three-way interaction. Liu et al.7 illustrated a series of sufficient conditions under which the imputation distribution for FCS converges in total variation to the posterior distribution of a joint Bayesian model as the sample size tends to infinity. Complementing the work of Liu et al.7, Hughes et al.6 pointed out that, in addition to compatibility, a “non-informative margins” condition is another sufficient condition for the equivalence of FCS and joint modeling for finite samples. Hughes et al.6 also showed that with multivariate normally distributed data and a non-informative prior, both the compatibility and the non-informative margins conditions are satisfied. In that case, fully conditional specification and joint modeling provide imputations from the same posterior distribution. Zhu & Raghunathan8 discussed conditions for convergence and assessed the properties of FCS. Many authors illustrated convergence properties of FCS when the prior for conditional models is non-informative. However, the case of informative priors has not received much attention. Therefore, we consider the equivalent prior specification for informative priors under a sequence of conditional models and the corresponding joint model. This additional investigation allows the imputer to perform imputations under FCS even if only prior information about the joint distribution of the incomplete dataset is available.

As an initial step in evaluating the convergence properties of FCS with informative priors, it is sensible to focus on Bayesian normal linear models and the typical informative prior, the normal inverse-gamma prior. This paper briefly reviews joint modeling, fully conditional specification, compatibility, and non-informative margins. Then, we derive a theoretical result and perform a simulation study to evaluate the non-informative margins condition. We also consider the prior for the target joint density of a sequence of normal linear models with normal inverse-gamma priors. Finally, we conclude with some remarks.

Background

Joint modeling

Let \(Y^{obs}\) and \(Y^{mis}\) denote the observed and missing data in the dataset Y. Joint modeling involves specifying a parametric joint model \(p(Y^{obs}, Y^{mis}|\theta )\) for the complete data and an appropriate prior distribution \(p(\theta )\) for the parameter \(\theta\). Incomplete cases are partitioned into groups according to various missing patterns and then imputed with different sub-models. Under the assumption of ignorability, the imputation model for each group is the corresponding conditional distribution derived from the assumed joint model

$$\begin{aligned} p(Y^{mis}|Y^{obs}) = \int _{}p(Y^{mis}| Y^{obs}, \theta )p(\theta |Y^{obs})d\theta . \end{aligned}$$

Since the joint modeling algorithm converges to the specified multivariate distribution, once the joint imputation model is correctly specified, results will be valid and theoretical properties are satisfactory.

Fully conditional specification

Fully conditional specification attempts to define the joint distribution \(p(Y^{obs}, Y^{mis}|\theta )\) by positing a univariate imputation model for each partially observed variable. The imputation model is typically a generalized linear model selected based on the nature of the missing variable (e.g. continuous, semi-continuous, categorical and count). Starting from some simple imputation methods, such as mean imputation or a random draw from the sampled values, FCS algorithms iteratively repeat imputations over all missing variables. Precisely, the tth iteration for the incomplete variable \(Y_{j}^{mis}\) consists of the following draws:

$$\begin{aligned}{} & {} \theta _{j}^{t} \sim f(\theta _{j})f(Y_{j}^{obs}|Y_{-j}^{t-1}, \theta _{j})\\{} & {} Y_{j}^{mis(t)} \sim f(Y_{j}^{mis}|Y_{j}^{obs},Y_{-j}^{t}, \theta _{j}^{t}), \end{aligned}$$

where \(f(\theta _{j})\) is generally specified with a non-informative prior. After a sufficient number of iterations, typically ranging from 5 to 10 iterations3,9, the stationary distribution is achieved. The final iteration generates a single imputed dataset, and the multiple imputations are created by applying FCS in parallel m times with different seeds. If the underlying joint distribution defined by separate conditional models exists, the algorithm is equivalent to a Gibbs sampler.
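The two draws above can be sketched in Python for the special case in which every conditional model is a normal linear regression. This is a minimal illustration, not the implementation used by any particular package. The function names `draw_linear_imputation` and `fcs_impute` are hypothetical, and we assume the standard non-informative prior \(f(\beta _{j}, \sigma _{j}^2) \propto \sigma _{j}^{-2}\), at least one observed value per variable, and more observed cases than regression coefficients.

```python
import numpy as np


def draw_linear_imputation(y, X, missing, rng):
    """Draw (beta, sigma^2) from the posterior given observed rows, then impute y."""
    Xo, yo = X[~missing], y[~missing]
    n_obs, k = Xo.shape
    XtX_inv = np.linalg.inv(Xo.T @ Xo)
    beta_hat = XtX_inv @ Xo.T @ yo
    resid = yo - Xo @ beta_hat
    # Assumed prior: f(beta, sigma^2) proportional to 1 / sigma^2, which gives
    # sigma^2 | data = RSS / chi-square(n_obs - k)
    sigma2 = resid @ resid / rng.chisquare(n_obs - k)
    # beta | sigma^2, data ~ N(beta_hat, sigma^2 (X'X)^{-1})
    beta = rng.multivariate_normal(beta_hat, sigma2 * XtX_inv)
    y_new = y.copy()
    noise = rng.normal(0.0, np.sqrt(sigma2), size=int(missing.sum()))
    y_new[missing] = X[missing] @ beta + noise
    return y_new


def fcs_impute(data, n_iter=10, seed=0):
    """Return one completed dataset after n_iter FCS iterations (NaN marks missing)."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    miss = np.isnan(data)
    filled = data.copy()
    n, p = data.shape
    # Initialise each incomplete variable with random draws from its observed values
    for j in range(p):
        n_miss = int(miss[:, j].sum())
        if n_miss > 0:
            filled[miss[:, j], j] = rng.choice(data[~miss[:, j], j], size=n_miss)
    for _ in range(n_iter):
        for j in range(p):  # visit sequence: column order
            if not miss[:, j].any():
                continue
            # Predictors are the current values of all other variables plus an intercept
            X = np.column_stack([np.ones(n), np.delete(filled, j, axis=1)])
            filled[:, j] = draw_linear_imputation(filled[:, j], X, miss[:, j], rng)
    return filled
```

Under these assumptions, m imputed datasets would be obtained by calling `fcs_impute` m times with different values of `seed`.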

The attractive feature of fully conditional specification is the flexibility of model specification, which allows models to preserve features in the data, such as skip patterns, constraints, and logical and consistency bounds5. Such restrictions would be difficult to formulate when applying joint modeling. One could conveniently construct a sequence of conditional models and avoid the specification of a parametric multivariate distribution, which may not be appropriate for the data in practice.

Compatibility

The definition of compatibility is given by Liu et al.7: let \(Y = (Y_1, Y_2, \dots , Y_p)\) be a vector of random variables and \(Y_{-j} = (Y_1, Y_2, \dots , Y_{j-1}, Y_{j+1}, \dots , Y_{p})\). A set of conditional models \(\{f_{j}(Y_j|Y_{-j}, \theta _{j}) : \theta _{j} \in \Theta _{j}, j = 1, 2, \dots , p\}\) is said to be compatible if there exists a joint model \(\{f (Y|\theta ) : \theta \in \Theta \}\) and a collection of surjective maps \(\{t_{j} : \Theta \rightarrow \Theta _{j}\}\) such that for each j, \(\theta _{j} \in \Theta _{j}\) and \(\theta \in t_{j}^{-1}(\theta _{j}) = \{\theta : t_{j}(\theta ) = \theta _{j}\}\),

$$\begin{aligned} f_{j}(Y_j|Y_{-j}, \theta _{j}) = f (Y_j|Y_{-j}, \theta ). \end{aligned}$$

Otherwise, \(\{f_{j}, j = 1, 2, \dots , p\}\) is said to be incompatible. A simple example of compatible models is a set of normal linear models for a vector of continuous data:

$$\begin{aligned} Y_j \sim N(({\textbf {1}}, Y_{-j})\beta _{j}, \sigma _{j}^2), \end{aligned}$$

where \(\beta _{j}\) is the vector of coefficients and \({\textbf {1}}\) is a vector of ones. In such a case, the joint model of \((Y_1, Y_2, \dots , Y_p)\) would be a multivariate normal distribution and the map \(t_j\) is derived from the conditional multivariate normal formula. On the other hand, the typical example of an incompatible model would be the linear model with squared terms7,10.

Incompatibility is a theoretical weakness of fully conditional specification since, in some cases, it is unclear whether the algorithm indeed converges to the desired multivariate distribution11,12,13,14. Consideration of compatibility is important when the multivariate density is of scientific interest. Both Hughes et al.6 and Liu et al.7 stated the necessity of model compatibility for the algorithm to converge to a joint distribution. Several papers introduced some cases in which FCS models are compatible with joint distributions3,15. Van Buuren14 also performed some simulation studies of fully conditional specification with strongly incompatible models and concluded that the effects of incompatibility are negligible. However, further work is necessary to investigate the adverse effects of incompatibility in more general scenarios.

Non-informative margins

Hughes et al.6 showed that the non-informative margins condition is sufficient for fully conditional specification to converge to a multivariate distribution. Suppose \(\pi (\theta _{j})\) is the prior distribution of the conditional model \(p(Y_j|Y_{-j}, \theta _{j})\) and \(\pi (\theta _{-j})\) is the prior distribution of the marginal model \(p(Y_{-j}|\theta _{-j})\). The non-informative margins condition is then satisfied if the joint prior can be factorized into independent priors \(\pi (\theta _{j}, \theta _{-j}) = \pi (\theta _{j})\pi (\theta _{-j})\). It is worthwhile to note that the non-informative margins condition does not hold if \(p(Y_j|Y_{-j}, \theta _{j})\) and \(p(Y_{-j}|\theta _{-j})\) have the same parameter space. When the non-informative margins condition is violated, an order effect appears. In such a case, the inference of parameters would have systematic differences depending on the sequence of the variables in the FCS algorithm. Simulations performed by Hughes et al.6 demonstrated that such an order effect is subtle. However, more research is needed to verify such claims, and it is necessary to be aware of the existence of the order effect.

Theoretical results

This section proves the convergence of fully conditional specification under the normal linear model with normal inverse-gamma priors to a joint distribution. Since the compatibility of the normal linear model is well understood, we will check the satisfaction of the non-informative margins condition.

Starting with the problem of Bayesian inference for \(\theta = (\mu , \Sigma )\) under a multivariate normal model, let us apply the following prior distribution. Suppose that, given \(\Sigma\), the prior distribution of \(\mu\) is conditionally multivariate normal,

$$\begin{aligned} \mu | \Sigma \sim N(\mu _{0}, \tau ^{-1}\Sigma ), \end{aligned}$$
(1)

where the hyperparameters \(\mu _{0} \in \mathscr {R}^{p}\) and \(\tau > 0\) are fixed and known and where p denotes the number of variables. Moreover, suppose that the prior distribution of \(\Sigma\) is an inverse-Wishart,

$$\begin{aligned} \Sigma \sim W^{-1}(m, \Lambda ) \end{aligned}$$
(2)

for fixed hyperparameters \(m \ge p\) and \(\Lambda\). The prior density for \(\theta\) can then be written as

$$\begin{aligned} \begin{array}{ll} \pi (\theta ) \propto &{}|\Sigma |^{-(\frac{m+p+2}{2})}\;\exp \;\{-\frac{1}{2}tr(\Lambda ^{-1}\Sigma ^{-1})\}\\ &{} \times \;\exp \;\{-\frac{\tau }{2}(\mu -\mu _{0})^{T}\Sigma ^{-1}(\mu -\mu _{0})\}. \end{array} \end{aligned}$$
(3)
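To make the prior in (1) to (3) concrete, the following Python sketch draws one \((\mu , \Sigma )\) pair from it. It is an illustration under stated assumptions rather than part of the derivation: we read the density in (3) as implying that \(\Sigma ^{-1}\) follows a Wishart distribution with m degrees of freedom and scale matrix \(\Lambda\), we take m to be an integer, and the function name `draw_prior` is hypothetical.

```python
import numpy as np


def draw_prior(mu0, tau, m, Lam, rng):
    """Draw (mu, Sigma) assuming Sigma^{-1} ~ Wishart(m, Lam) and mu | Sigma ~ N(mu0, Sigma / tau)."""
    mu0 = np.asarray(mu0, dtype=float)
    Lam = np.asarray(Lam, dtype=float)
    p = mu0.size
    # Wishart draw for integer m >= p: sum of m outer products of N(0, Lam) vectors
    Z = rng.multivariate_normal(np.zeros(p), Lam, size=m)
    precision = Z.T @ Z
    Sigma = np.linalg.inv(precision)
    Sigma = (Sigma + Sigma.T) / 2.0  # remove numerical asymmetry from the inversion
    mu = rng.multivariate_normal(mu0, Sigma / tau)
    return mu, Sigma
```

Repeated calls with the same hyperparameters give a Monte Carlo picture of how much prior mass is placed on different means and covariance matrices.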

For each variable \(Y_{j}\), we partition the mean vector \(\mu\) as \((\mu _j, \mu _{-j})^T\) and the covariance matrix \(\Sigma\) as

$$\begin{aligned} \left( \begin{array}{cc} \omega _{j} &{} \xi _{j}^T\\ \xi _{j} &{} \Sigma _{-j} \end{array}\right) , \end{aligned}$$

such that \(Y_j \sim \mathscr {N}(\mu _j, \omega _{j})\) and \(Y_{-j} \sim \mathscr {N}(\mu _{-j}, \Sigma _{-j})\). Similarly, we partition the hyperparameter \(\mu _{0}\) as \((\mu _{0j}, \mu _{0-j})^T\) and \(\Lambda\) as

$$\begin{aligned} \left( \begin{array}{cc} \Lambda _{j} &{} \psi _{j}^T\\ \psi _{j} &{} \Lambda _{-j} \end{array}\right) . \end{aligned}$$

The conditional model of \(Y_j\) given \(Y_{-j}\) is the normal linear regression \(Y_{j} = \alpha _j + \beta _{j}^TY_{-j} + \epsilon _{j}\) with \(\epsilon _{j} \sim \mathscr {N}(0, \sigma _{j})\), where \(\beta _{j}^T = \xi _{j}^T\Sigma _{-j}^{-1}\), \(\alpha _j = \mu _j - \xi _{j}^T\Sigma _{-j}^{-1}\mu _{-j}\) and \(\sigma _{j} = \omega _{j} - \xi _{j}^T\Sigma _{-j}^{-1}\xi _{j}\). The corresponding vectors of parameters \(\theta _{j}\) and \(\theta _{-j}\) would be

$$\begin{aligned} \begin{array}{cc} \theta _{j} &{}= (\alpha _j, \beta _{j}, \sigma _{j})\\ \theta _{-j} &{}= (\mu _{-j}, \Sigma _{-j}). \end{array} \end{aligned}$$
(4)
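The map from \((\mu , \Sigma )\) to \(\theta _{j} = (\alpha _j, \beta _{j}, \sigma _{j})\) can be checked numerically. The Python sketch below applies the conditional multivariate normal formulas stated above; the function name `conditional_parameters` and the zero-based index `j` are our own conventions for illustration.

```python
import numpy as np


def conditional_parameters(mu, Sigma, j):
    """Return (alpha_j, beta_j, sigma_j) of the regression of Y_j on Y_{-j}."""
    mu = np.asarray(mu, dtype=float)
    Sigma = np.asarray(Sigma, dtype=float)
    rest = [k for k in range(mu.size) if k != j]
    omega_j = Sigma[j, j]                       # marginal variance of Y_j
    xi_j = Sigma[rest, j]                       # covariances between Y_{-j} and Y_j
    Sigma_rest = Sigma[np.ix_(rest, rest)]      # covariance matrix of Y_{-j}
    beta_j = np.linalg.solve(Sigma_rest, xi_j)  # Sigma_{-j}^{-1} xi_j
    alpha_j = mu[j] - beta_j @ mu[rest]
    sigma_j = omega_j - beta_j @ xi_j           # residual variance
    return alpha_j, beta_j, sigma_j
```

For example, with illustrative values \(\mu = (1, 4, 9)^T\) and a covariance matrix with rows (4, 2, 2), (2, 4, 2) and (2, 2, 9), calling `conditional_parameters(mu, Sigma, 1)` gives \(\alpha _j = 2.4375\), \(\beta _{j} = (0.4375, 0.125)\) and \(\sigma _{j} = 2.875\).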

By applying the partition function16 and by block diagonalization of a partitioned matrix, the joint prior for \(\theta _{j}\) and \(\theta _{-j}\) can be derived from \(\pi (\theta )\) as:

$$\begin{aligned} \pi (\theta _{j}, \theta _{-j}) & = p(\sigma _{j})p(\beta _{j}|\sigma _{j})p(\Sigma _{-j})\\&\quad \times \exp \;\{-\frac{\tau }{2}(\alpha _{j} + \beta _{j}\mu _{0-j}\ - \mu _{0j})^{T}(\sigma _{j})^{-1}(\alpha _{j} + \beta _{j}\mu _{0-j}\ - \mu _{0j})\}\\&\quad \times \exp \{-\frac{\tau }{2}(\mu _{-j}-\mu _{0-j})^{T}\Sigma _{-j}^{-1}(\mu _{-j}-\mu _{0-j})\} \times |\Sigma _{-j}|\\ &=\pi (\theta _{j})\pi (\theta _{-j}), \end{aligned}$$
(5)

where

$$\begin{aligned}\pi (\theta _{j}) & = p(\sigma _{j})p(\beta _{j}|\sigma _{j}) \nonumber \\&\quad\times \exp \;\{-\frac{\tau }{2}(\alpha _{j} + \beta _{j}\mu _{0-j}\ - \mu _{0j})^{T}(\sigma _{j})^{-1}(\alpha _{j} + \beta _{j}\mu _{0-j}\ - \mu _{0j})\},\end{aligned}$$
(6)
$$\begin{aligned}{}&\pi (\theta _{-j}) = p(\Sigma _{-j})\times \exp \{-\frac{\tau }{2}(\mu _{-j}-\mu _{0-j})^{T}\Sigma _{-j}^{-1}(\mu _{-j}-\mu _{0-j})\} \times |\Sigma _{-j}| \end{aligned}$$
(7)

and

\(p(\sigma _{j}) \sim W^{-1}(m, \lambda _j)\), \(p(\beta _{j}|\sigma _{j}) \sim \mathscr {N}(\psi _{j}^T\Lambda _{-j}^{-1}, \lambda _j\Lambda _{-j}^{-1})\), \(p(\Sigma _{-j}) \sim W^{-1}(m-1, \Lambda _{-j})\), \(\lambda _j = \Lambda _{j} - \psi _{j}^T\Lambda _{-j}^{-1}\psi _{j}\)16. Since the joint prior distribution factorizes into independent priors, the “non-informative” margins condition is satisfied. Based on equations (6) and (7), we could derive the prior for the conditional linear model from the prior for the multivariate distribution:

$$\begin{aligned} \begin{array}{l} p(\sigma _{j}) \sim W^{-1}(m, \lambda _j)\\ p(\beta _{j}|\sigma _{j}) \sim \mathscr {N}(\psi _{j}^T\Lambda _{-j}^{-1}, \lambda _j\Lambda _{-j}^{-1})\\ p(\alpha _{j}|\sigma _{j}) \sim \mathscr {N}(\mu _{0j} - \psi _{j}^T\Lambda _{-j}^{-1}\mu _{0-j}, \tau ^{-1}\sigma _{j} - (\mu _{0-j})^{2}\lambda _j\Lambda _{-j}^{-1}). \end{array} \end{aligned}$$
(8)
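As a small worked illustration of the prior means and scales in (8), consider a hypothetical bivariate prior whose numbers are chosen only for easy arithmetic and are not used elsewhere in this paper: \(\mu _{0} = (\mu _{0j}, \mu _{0-j})^T = (1, 3)^T\), \(\Lambda _{j} = 4\), \(\psi _{j} = 2\) and \(\Lambda _{-j} = 2\). Then

$$\begin{aligned} \lambda _j &= \Lambda _{j} - \psi _{j}^T\Lambda _{-j}^{-1}\psi _{j} = 4 - 2 \cdot \tfrac{1}{2} \cdot 2 = 2,\\ \psi _{j}^T\Lambda _{-j}^{-1} &= 2 \cdot \tfrac{1}{2} = 1,\\ \lambda _j\Lambda _{-j}^{-1} &= 2 \cdot \tfrac{1}{2} = 1,\\ \mu _{0j} - \psi _{j}^T\Lambda _{-j}^{-1}\mu _{0-j} &= 1 - 1 \cdot 3 = -2, \end{aligned}$$

so that, under these assumed values, the prior for \(\beta _{j}\) is centered at 1 with scale 1 and the prior for \(\alpha _{j}\) is centered at \(-2\).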

Since the conditional \(\beta _{j} | \sigma _{j}\) follows a normal distribution, the marginal distribution of \(\beta _{j}\) would be a Student’s t-distribution \(\beta _{j} \sim t(\psi _{j}^T\Lambda _{-j}^{-1}, m\Lambda _{-j}^{-1} \lambda _{j}^{-1}, 2m-p+1)\). When the sample size increases, \(\beta _{j}\) tends to the normal distribution \(N(\psi _{j}^T\Lambda _{-j}^{-1}, \frac{\lambda _{j}\Lambda _{-j}}{m-1})\). Similarly, the marginal distribution of \(\alpha _{j}\) would be \(t(\mu _{0j} - \psi _{j}^T\Lambda _{-j}^{-1}\mu _{0-j}, m(\tau ^{-1} - (\mu _{0-j})^{2}\Lambda _{-j}^{-1})\Lambda _{j}^{-1}, 2m-p+1)\). When the sample size increases, \(\alpha _{j}\) tends to the normal distribution \(N(\mu _{0j} - \psi _{j}^T\Lambda _{-j}^{-1}\mu _{0-j}, \frac{1}{(\tau ^{-1} - (\mu _{0-j})^{2}\Lambda _{-j}^{-1})(m-1)}\Lambda _{j})\). Usually, when the sample size is over 30, the difference between the Student’s t-distribution and the corresponding normal approximation is negligible. With the prior transformation formula, one could apply Bayesian imputation under the normal linear model with normal inverse-gamma priors. This holds for both the prior information about the distribution of the data (e.g. location and scale of variables) and the scientific model (e.g. regression coefficients).

Simulation

We perform a simulation study to demonstrate the validity and the convergence of fully conditional specification when the conditional models are simple linear regressions with an inverse gamma prior for the error term and a multivariate normal prior for regression weights. In addition, we check for the absence of order effects, which would indicate convergence of fully conditional specification to a multivariate distribution.

We repeat the simulation 500 times and generate a dataset with 200 cases for every simulation according to the following multivariate distribution:

$$\begin{aligned} \begin{pmatrix}x\\ y\\ z \end{pmatrix}\sim & {} \mathscr {N}\left[ \left( \begin{array}{c} 1\\ 4\\ 9 \end{array}\right) ,\left( \begin{array}{ccc} 4 &{} 2 &{} 2\\ 2 &{} 4 &{} 2\\ 2 &{} 2 &{} 9 \end{array}\right) \right] \\ \end{aligned}$$

Fifty percent missingness is induced on either variable x, y or z. The proportion of the three missing patterns is equal. When evaluating whether it is appropriate to specify a normal inverse gamma prior, we consider both missing completely at random (MCAR) mechanisms and right-tailed missing at random (MARr) mechanisms where higher values have a larger probability of being unobserved. When investigating the existence of order effects, we only conduct the simulation under the MCAR missingness mechanism to ensure that the missingness does not contribute to any order effects. We specify a weakly informative prior for two reasons. First, with a weakly informative prior, frequentist inference is still plausible by applying Rubin’s rules1. Second, Goodrich et al.17 suggested that, compared with flat non-informative priors, weakly informative priors place warranted weight on extreme parameter values. The prior under the joint model is specified as: \(\mu _{0} = (0, 0, 0)^T\), \(\tau = 1\), \(m = 3\) and

$$\begin{aligned} \Lambda = \left( \begin{array}{ccc} 60 &{} 0 &{} 0\\ 0 &{} 60 &{} 0\\ 0 &{} 0 &{} 60 \end{array}\right) \end{aligned}$$

and the corresponding prior for each separate linear regression model would be the same, with \(\pi (\sigma ) \sim W^{-1}(3, 60)\) and

$$\begin{aligned} (\alpha , \beta )^T\sim & {} \mathscr {N}\left[ \left( \begin{array}{c} 0\\ 0\\ 0 \end{array}\right) ,\left( \begin{array}{ccc} 60 &{} 0 &{} 0\\ 0 &{} 3600 &{} 0\\ 0 &{} 0 &{} 3600 \end{array}\right) \right] .\\ \end{aligned}$$
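The data-generating step and the MCAR mechanism described above can be sketched in Python as follows. This is one possible reading of the design, not the code used for the reported results: we assume that half of the cases are made incomplete, that each incomplete case misses exactly one of x, y or z, and that the three patterns are assigned in (near) equal numbers. The function name `simulate_incomplete` is hypothetical, and the MARr mechanism is not covered.

```python
import numpy as np


def simulate_incomplete(n=200, seed=0):
    """Generate one complete dataset and an MCAR-incomplete copy (NaN marks missing)."""
    rng = np.random.default_rng(seed)
    mean = np.array([1.0, 4.0, 9.0])
    cov = np.array([[4.0, 2.0, 2.0],
                    [2.0, 4.0, 2.0],
                    [2.0, 2.0, 9.0]])
    complete = rng.multivariate_normal(mean, cov, size=n)
    incomplete = complete.copy()
    # MCAR: pick half of the cases completely at random (assumed reading of "fifty percent")
    rows = rng.choice(n, size=n // 2, replace=False)
    # Assign the three single-variable missing patterns in near-equal numbers
    pattern = rng.permutation(np.arange(rows.size) % 3)
    incomplete[rows, pattern] = np.nan
    return complete, incomplete
```

Calling `simulate_incomplete` with 500 different seeds would give the 500 replications under this reading of the design.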

Scalar inference for the mean of variable Y

The aim is to assess whether Bayesian imputation under a normal linear model with normal inverse gamma priors would yield unbiased estimates and exact coverage of the nominal 95% confidence intervals. Table 1 shows that with a weakly informative prior, fully conditional specification also provides valid imputations. The estimates are unbiased, and the coverage of the nominal 95% confidence intervals is correct under both MCAR and MARr. Without the validity of a normal inverse gamma prior specification, further investigations into the convergence would be redundant. Complete case analysis (CCA) gives a biased estimate and reduces the coverage of confidence intervals, demonstrating poor performance in analyzing the incomplete data set without addressing missingness.
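For reference, pooling a scalar estimate such as the mean of y over m imputed datasets with Rubin’s rules1 can be sketched in Python as below. The function name `pool_rubin` is hypothetical; the sketch returns the pooled estimate, its total variance and the classical degrees of freedom, from which a confidence interval could be formed with a t quantile.

```python
import numpy as np


def pool_rubin(estimates, variances):
    """Pool m point estimates and their squared standard errors with Rubin's rules."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = q.size
    q_bar = q.mean()                    # pooled point estimate
    u_bar = u.mean()                    # within-imputation variance
    b = q.var(ddof=1)                   # between-imputation variance
    total = u_bar + (1.0 + 1.0 / m) * b
    if b > 0:
        df = (m - 1) * (1.0 + u_bar / ((1.0 + 1.0 / m) * b)) ** 2
    else:
        df = np.inf                     # no between-imputation variability
    return q_bar, total, df
```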

Table 1 Bias of the estimates (E(Y)), coverage of nominal 95% confidence intervals (Cov) and the corresponding confidence intervals widths (Ciw) under MCAR and MARr.

Order effect evaluation

The visit sequence laid upon the simulation is z, x and y. To identify the presence of any systematic order effect, we estimate the regression coefficient directly after updating variable z and after updating variable x. Specifically, the ith iteration of fully conditional specification would be augmented as:

  1. Impute z given \(x^{i-1}\) and \(y^{i-1}\).

  2. Build the linear regression \(y = \alpha + \beta _{1}x + \beta _{2}z + \epsilon\) and collect the coefficient \(\beta _{1}\), denoted as \(\hat{\beta _{1}}^z\).

  3. Impute x given \(z^{i}\) and \(y^{i-1}\).

  4. Build the linear regression \(y = \alpha + \beta _{1}x + \beta _{2}z + \epsilon\) and collect the coefficient \(\beta _{1}\), denoted as \(\hat{\beta _{1}}^x\).

  5. Impute y given \(z^{i}\) and \(x^{i}\).

After a burn-in period of 10 iterations, the fully conditional specification algorithm was run for an additional 1000 iterations, in which the differences between the estimates \(\hat{\beta _{1}}^z - \hat{\beta _{1}}^x\) are recorded. The estimates from the first 10 iterations are omitted since FCS algorithms commonly reach convergence after around 5 to 10 iterations. Estimates from the additional 1000 iterations are partitioned into subsequences of equal size, which are used for variance calculation. We calculate the nominal 95% confidence interval of the difference. The standard error of the difference is estimated with the batch-means method18. Under the hypothesis of no order effect, the mean of \(\hat{\beta _{1}}^z - \hat{\beta _{1}}^x\) is zero. Since only three of the 95% confidence intervals derived from the 500 repetitions do not include zero, there is no indication of any order effects. We also monitor the posterior distribution of the coefficient under both joint modeling and fully conditional specification. Figure 1 shows a quantile-quantile plot demonstrating the closeness of the posterior distribution for \(\beta _{1}\) derived from both joint modeling and fully conditional specification. Since the posterior distributions for \(\beta _{1}\) under joint modeling and FCS are very similar, any differences may be considered negligible in practice.
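The batch-means interval for the recorded differences can be sketched in Python as follows. The number of batches and the normal critical value 1.96 are assumptions made for illustration, since the batch size is not stated here, and the function name `batch_means_ci` is hypothetical.

```python
import numpy as np


def batch_means_ci(x, n_batches=20, z=1.96):
    """Batch-means estimate of the mean of a correlated sequence with an approximate 95% CI."""
    x = np.asarray(x, dtype=float)
    batch_size = x.size // n_batches
    x = x[: batch_size * n_batches]          # drop a possible remainder
    batch_means = x.reshape(n_batches, batch_size).mean(axis=1)
    estimate = x.mean()
    # The variance of the overall mean is estimated from the spread of the batch means
    se = batch_means.std(ddof=1) / np.sqrt(n_batches)
    return estimate, (estimate - z * se, estimate + z * se)
```

Applied to the 1000 recorded values of \(\hat{\beta _{1}}^z - \hat{\beta _{1}}^x\) from one repetition, an interval that excludes zero would count as evidence of an order effect in that repetition.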

Figure 1 Quantile-quantile plot demonstrating the closeness of the posterior distribution of JM and FCS for \(\beta _{1}\).

All these results confirm that under the normal inverse gamma prior, Bayesian imputation under the normal linear model converges to the corresponding multivariate normal distribution.

Conclusion

Based on the theory of the non-informative margins condition proposed by Hughes et al.6, we prove the convergence of fully conditional specification under the normal linear model with normal-inverse-gamma prior distributions. Since it has been shown that a sequence of normal linear models is compatible with a multivariate normal density, we only focus on the non-informative margins condition for the prior. The transformation of the prior between a normal inverse gamma for fully conditional specification and a normal inverse Wishart for joint modeling is useful. With transformation, one could apply fully conditional specification when having prior information about statistical moments (e.g., mean and variance of some variables) rather than prior information about parameters of fully conditional models.

The prior reflects the analyst’s pre-data knowledge about the data or the model. The analyst specifies the prior when only a small sample size is available, for instance, patients in clinical research. Generally, prior distributions are determined by location and variance parameters. The location parameters (for example, \(\mu _{0}\) in (1) and m in (2)) are commonly based on the results of previous studies. The variance parameters (for example, \(\tau ^{-1}\Sigma\) in (1) and \(\Lambda\) in (2)) are specified based on the exchangeability of the prior and current study19. Exchangeability indicates that the prior and current studies come from the same population, in which case lower variance parameters can be applied. Otherwise, higher variance parameters can be used to cover a larger support of the parameters.

We perform simulations under the case when the number of variables is smaller than the sample size. However, based on Bayesian theory, the result is also valid when the number of variables is larger than the sample size. For example, Huang et al.20,21 proposed to generate “synthetic data” under a simpler prior distribution to augment the sample size. In this case, the statistical inference heavily depends on the prior specification.

Fully conditional specification is an appealing imputation method because it allows one to specify a sequence of flexible and simple conditional models and bypass the difficulty of multivariate modeling in practice. The default prior for normal linear regression is the Jeffreys prior, which satisfies the non-informative margins condition. However, it is worth developing other types of priors for fully conditional specification so that one can select the prior that best describes the available prior knowledge. Many researchers have discussed the convergence condition of FCS. However, no general result identifies the family of posterior distributions that satisfies the condition of convergence. Therefore, when including new kinds of priors in fully conditional specification algorithms, it is necessary to investigate the convergence of the algorithm with the new posterior distributions. Specifically, one should study the non-informative margins condition for new priors. Compatibility should also be considered if the imputation model is novel. Our work takes steps in this direction.

Although a series of investigations have shown that the adverse effects of violating the compatibility and non-informative margins conditions may be small, all of these investigations rely on pre-defined simulation settings. More research is needed to verify conditions under which the fully conditional specification algorithm converges to a multivariate distribution and cases in which the violation of compatibility and non-informative margins has negligible adverse impacts on the result.

There are several directions for future research. First, it is possible to develop a prior setting to eliminate order effects of the fully conditional specification algorithm under the general location model since the compatibility and non-informative margins conditions are satisfied under the saturated multinomial distribution. Moreover, various types of priors for the generalized linear model (e.g., non-linear normal regression) under fully conditional specification, together with the corresponding joint modeling rationales, could be developed. Another open problem is the convergence condition and properties of block imputation, which partitions missing variables into several blocks and iteratively imputes blocks3. Block imputation is a more flexible and user-friendly method. However, its properties have yet to be studied. Finally, it is necessary to investigate the implementation of prior specifications in software.