1. Quasi-likelihood functions

1.1 Independent observations

1.1.1 Covariance functions

Suppose the components of the response vector $\mathbf{Y}$ are independent with mean vector $\boldsymbol{\mu}$ and covariance matrix $\sigma^2 \mathbf{V}(\boldsymbol{\mu})$, where $\sigma^2$ may be unknown but $\mathbf{V}(\boldsymbol{\mu})$ is a matrix of known functions. The parameters of interest, $\boldsymbol{\beta}$, describe the dependence of $\boldsymbol{\mu}$ on covariates $x$. Under the independence assumption, $\mathbf{V}(\boldsymbol{\mu})=\operatorname{diag}\left\{V_{1}(\boldsymbol{\mu}), \ldots, V_{n}(\boldsymbol{\mu})\right\}$, and in the majority of applications the functions $V_1,\ldots,V_n$ may be taken to be identical.

1.1.2 Construction of the quasi-likelihood function

1.1.2.1 Preliminary knowledge: Fisher information

The partial derivative with respect to $\theta$ of the natural logarithm of the likelihood function is called the "score". Under certain regularity conditions, if $\theta$ is the true parameter (i.e. $X$ is actually distributed as $f(X; \theta)$), it can be shown that the expected value (the first moment) of the score is 0.

Proof:
$$
\begin{aligned} \mathrm{E}\left[\frac{\partial}{\partial \theta} \log f(X ; \theta) | \theta\right] &=\int \frac{\frac{\partial}{\partial \theta} f(x ; \theta)}{f(x ; \theta)} f(x ; \theta) d x \\ &=\frac{\partial}{\partial \theta} \int f(x ; \theta) d x \\ &=\frac{\partial}{\partial \theta} 1=0 \end{aligned}
$$
The variance of the score is defined to be the Fisher information:
$$
\mathcal{I}(\theta)=\mathrm{E}\left[\left(\frac{\partial}{\partial \theta} \log f(X ; \theta)\right)^{2} | \theta\right]=\int\left(\frac{\partial}{\partial \theta} \log f(x ; \theta)\right)^{2} f(x ; \theta) d x
$$
The Fisher information may also be written as:
$$
\mathcal{I}(\theta)=-\mathrm{E}\left[\frac{\partial^{2}}{\partial \theta^{2}} \log f(X ; \theta) | \theta\right]
$$
Since
$$
\frac{\partial^{2}}{\partial \theta^{2}} \log f(X ; \theta)=\frac{\frac{\partial^{2}}{\partial \theta^{2}} f(X ; \theta)}{f(X ; \theta)}-\left(\frac{\frac{\partial}{\partial \theta} f(X ; \theta)}{f(X ; \theta)}\right)^{2}=\frac{\frac{\partial^{2}}{\partial \theta^{2}} f(X ; \theta)}{f(X ; \theta)}-\left(\frac{\partial}{\partial \theta} \log f(X ; \theta)\right)^{2}
$$
and
$$
\mathrm{E}\left[\frac{\frac{\partial^{2}}{\partial \theta^{2}} f(X ; \theta)}{f(X ; \theta)} | \theta\right]=\frac{\partial^{2}}{\partial \theta^{2}} \int f(x ; \theta) d x=0
$$
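As a quick added check of these two expressions (not in the original notes), take a single Poisson observation with mean $\lambda$: $\log f(x ; \lambda)=x \log \lambda-\lambda-\log x!$, so the score is $X/\lambda-1$ and
$$
\mathcal{I}(\lambda)=\mathrm{E}\left[\left(\frac{X}{\lambda}-1\right)^{2}\right]=\frac{\operatorname{var}(X)}{\lambda^{2}}=\frac{1}{\lambda}, \qquad -\mathrm{E}\left[\frac{\partial^{2}}{\partial \lambda^{2}} \log f(X ; \lambda)\right]=\mathrm{E}\left[\frac{X}{\lambda^{2}}\right]=\frac{1}{\lambda},
$$
so both forms of the Fisher information agree.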

1.1.2.2 Cramer-Rao bound

Scalar case

In estimation theory and statistics, the Cramér–Rao bound (CRB), Cramér–Rao lower bound (CRLB), Cramér–Rao inequality, Fréchet–Darmois–Cramér–Rao inequality, or information inequality expresses a lower bound on the variance of unbiased estimators of a deterministic (fixed, though unknown) parameter. In its simplest form, the bound states that the variance of any unbiased estimator is at least as high as the inverse of the Fisher information.

Suppose $\theta$ is the parameter and $x$ is the measurement, and let $I(\theta)$ denote the Fisher information. For any unbiased estimator $\hat{\theta}$ of $\theta$ we have:
$$
\operatorname{var}(\hat{\theta}) \geq \frac{1}{I(\theta)}
$$
where, for $n$ i.i.d. observations with per-observation log-likelihood $\ell(x ; \theta)$,
$$
I(\theta)=n\, \mathrm{E}\left[\left(\frac{\partial \ell(x ; \theta)}{\partial \theta}\right)^{2}\right]=-n\, \mathrm{E}\left[\frac{\partial^{2} \ell(x ; \theta)}{\partial \theta^{2}}\right]
$$
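For example (an added illustration), for $n$ i.i.d. observations from $N(\mu, \sigma^{2})$ with $\sigma^{2}$ known, $\partial \ell / \partial \mu=(x-\mu) / \sigma^{2}$, so
$$
I(\mu)=n\, \mathrm{E}\left[\left(\frac{X-\mu}{\sigma^{2}}\right)^{2}\right]=\frac{n}{\sigma^{2}}, \qquad \operatorname{var}(\bar{X})=\frac{\sigma^{2}}{n}=\frac{1}{I(\mu)},
$$
so the sample mean attains the bound.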

Multivariate case

Extending the Cramér–Rao bound to multiple parameters, define a parameter column vector with
$$
\boldsymbol{\theta}=\left[\theta_{1}, \theta_{2}, \ldots, \theta_{d}\right]^{T} \in \mathbb{R}^{d}
$$
The Fisher information matrix is a $d\times d$ matrix with element $I_{m,k}$ defined as
$$
I_{m, k}=\mathrm{E}\left[\frac{\partial}{\partial \theta_{m}} \log f(x ; \boldsymbol{\theta}) \frac{\partial}{\partial \theta_{k}} \log f(x ; \boldsymbol{\theta})\right]=-\mathrm{E}\left[\frac{\partial^{2}}{\partial \theta_{m} \partial \theta_{k}} \log f(x ; \boldsymbol{\theta})\right]
$$
Let $\boldsymbol{T}(X)$ be an estimator of any vector function of the parameters,
$$
\boldsymbol{T}(X)=\left(T_{1}(X), \ldots, T_{d}(X)\right)^{T}
$$
and denote its expectation vector by $\psi(\boldsymbol{\theta})$. The Cramer-Rao bound then states that the covariance matrix of $\boldsymbol{T}(X)$ satisfies
$$
\operatorname{cov}_{\boldsymbol{\theta}}(\boldsymbol{T}(X)) \geq \frac{\partial \boldsymbol{\psi}(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}}[I(\boldsymbol{\theta})]^{-1}\left(\frac{\partial \boldsymbol{\psi}(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\right)^{T}
$$
Where:

  1. The matrix inequality $A \geq B$ is understood to mean that the matrix $A - B$ is positive semidefinite.

  2. $\partial \psi(\boldsymbol{\theta}) / \partial \boldsymbol{\theta}$ is the Jacobian matrix whose $ij$ element is given by $\partial \psi_{i}(\boldsymbol{\theta}) / \partial \theta_{j}$.

If $\boldsymbol{\psi}(\boldsymbol{\theta})=\boldsymbol{\theta}$, then the Cramer-Rao bound reduces to
$$
\operatorname{cov}_{\boldsymbol{\theta}}(\boldsymbol{T}(X)) \geq I(\boldsymbol{\theta})^{-1}
$$

Single-parameter proof

Let $T = t(X)$ be an estimator satisfying $\mathrm{E}(T)=\psi(\theta)$. The goal is to prove that, for all $\theta$,
$$
\operatorname{var}(t(X)) \geq \frac{\left[\psi^{\prime}(\theta)\right]^{2}}{I(\theta)}
$$
Define $V$ to be the score for $\theta$, so that $E(V) = 0$; then
$$
\begin{aligned} \operatorname{cov}(V, T) &=\mathrm{E}\left(T \cdot\left[\frac{1}{f(X ; \theta)} \frac{\partial}{\partial \theta} f(X ; \theta)\right]\right) \\ &=\int t(x)\left[\frac{1}{f(x ; \theta)} \frac{\partial}{\partial \theta} f(x ; \theta)\right] f(x ; \theta) d x \\ &=\frac{\partial}{\partial \theta}\left[\int t(x) f(x ; \theta) d x\right]=\psi^{\prime}(\theta) \end{aligned}
$$
The Cauchy–Schwarz inequality shows that:
$$
\sqrt{\operatorname{var}(T) \operatorname{var}(V)} \geq|\operatorname{cov}(V, T)|=\left|\psi^{\prime}(\theta)\right|
$$
Therefore,
$$
\operatorname{var}(T) \geq \frac{\left[\psi^{\prime}(\theta)\right]^{2}}{\operatorname{var}(V)}=\frac{\left[\psi^{\prime}(\theta)\right]^{2}}{I(\theta)}
$$
which completes the proof.

1.1.2.3 A single component of the vector Y

Suppose we have only one observation, and define the function $U=u(\mu ; Y)=\frac{Y-\mu}{\sigma^{2} V(\mu)}$. This function $U$ has the following properties in common with a log-likelihood derivative:
$$
\begin{aligned} E(U) &=0 \\ \operatorname{var}(U) &=1 /\left\{\sigma^{2} V(\mu)\right\} \\ -E\left(\frac{\partial U}{\partial \mu}\right) &=1 /\left\{\sigma^{2} V(\mu)\right\} \end{aligned}
$$
since $E(Y) = \mu$, and
$$
\begin{aligned}-E\left(\frac{\partial U}{\partial \mu}\right) &=-E\left(-\frac{1}{\sigma^{2} V(\mu)}-\frac{(Y-\mu) V^{\prime}(\mu)}{\sigma^{2} V^{2}(\mu)}\right) \\ &=\frac{1}{\sigma^{2} V(\mu)} \end{aligned}
$$
So the function $U$ behaves just like the first-order derivative of a log-likelihood with respect to $\mu$, which suggests defining the quasi-likelihood for a single data point as
$$
Q(\mu ; y)=\int_{y}^{\mu} \frac{y-t}{\sigma^{2} V(t)} d t
$$
since differentiating $Q(\mu ; y)$ with respect to $\mu$ gives back $U$.
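For concreteness (an added example), take $V(\mu)=\mu$ and $\sigma^{2}=1$. Then
$$
Q(\mu ; y)=\int_{y}^{\mu} \frac{y-t}{t} d t=y \log \mu-\mu-(y \log y-y),
$$
which, up to terms not involving $\mu$, is the Poisson log-likelihood; the quasi-likelihood therefore reproduces the ordinary likelihood whenever the variance function matches an exponential-family model.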

1.1.2.4 Sum of the individual contributions

Since the components of Y are independent by assumption, the quasi-likelihood for the complete data is the sum of the individual contributions:
$$
Q(\boldsymbol{\mu} ; \mathbf{y})=\sum Q_{i}\left(\mu_{i} ; y_{i}\right)
$$
(Image omitted: a table of quasi-likelihoods $Q(\mu ; y)$ associated with some simple variance functions.)

1.1.3 Quasi-deviance function

By analogy, the quasi-deviance function corresponding to a single observation is
$$
D(y ; \mu)=-2 \sigma^{2} Q(\mu ; y)=2 \int_{\mu}^{y} \frac{y-t}{V(t)} d t
$$
The total deviance, obtained by adding over the components, is a computable function depending on $y$ and $\mu$ alone; it does not depend on $\sigma^2$.
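As a sanity check (added), with $V(t)=1$,
$$
D(y ; \mu)=2 \int_{\mu}^{y}(y-t) d t=2\left[y t-\tfrac{1}{2} t^{2}\right]_{\mu}^{y}=(y-\mu)^{2},
$$
the familiar squared-error deviance of the normal model.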

1.1.4 Parameter estimation

1.1.4.1 Newton-Raphson iteration for estimating $\beta$

The quasi-likelihood estimating equations for the regression parameters $\beta$, obtained by differentiating $Q(\mu;y)$, may be written in the form $U(\hat{\beta})=0$, where
$$
\mathbf{U}(\boldsymbol{\beta})=\mathbf{D}^{T} \mathbf{V}^{-1}(\mathbf{Y}-\boldsymbol{\mu}) / \sigma^{2}
$$
The matrix $\mathbf{D}$, of order $n \times p$, has components $D_{i r}=\partial \mu_{i} / \partial \beta_{r}$, and $\mathbf{V}$ is the matrix of variance functions, so that $\operatorname{cov}(\mathbf{Y})=\sigma^{2} \mathbf{V}$. The formula can be verified by an example: take $n=2$, $p=2$ and the identity link $\mu_{i}=x_{i 1} b_{1}+x_{i 2} b_{2}$ (so that $\mathbf{D}=X$). We have
$$
T=Q(\boldsymbol{\mu} ; \mathbf{y})=\int_{y_{1}}^{x_{11} b_{1}+x_{12} b_{2}} \frac{y_{1}-t}{\sigma^{2} V(t)} d t+\int_{y_{2}}^{x_{21} b_{1}+x_{22} b_{2}} \frac{y_{2}-t}{\sigma^{2} V(t)} d t
$$

$$
\left(\begin{array}{l}{\frac{\partial T}{\partial b_{1}}} \\ {\frac{\partial T}{\partial b_{2}}}\end{array}\right)=\mathbf{U}(\boldsymbol{\beta})=\mathbf{D}^{T} \mathbf{V}^{-1}(\mathbf{Y}-\boldsymbol{\mu}) / \sigma^{2}
$$

This follows since, by the fundamental theorem of calculus and the chain rule,
$$
\frac{\partial}{\partial b} \int_{a}^{f(b)} g(t)\, d t=g(f(b))\, f^{\prime}(b)
$$
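Spelling out the differentiation (an added step), with $g(t)=\left(y_{i}-t\right) /\left\{\sigma^{2} V(t)\right\}$ and upper limit $\mu_{i}=x_{i 1} b_{1}+x_{i 2} b_{2}$, we get, for $r=1,2$,
$$
\frac{\partial T}{\partial b_{r}}=\frac{y_{1}-\mu_{1}}{\sigma^{2} V\left(\mu_{1}\right)} x_{1 r}+\frac{y_{2}-\mu_{2}}{\sigma^{2} V\left(\mu_{2}\right)} x_{2 r},
$$
which is exactly the $r$-th component of $\mathbf{D}^{T} \mathbf{V}^{-1}(\mathbf{Y}-\boldsymbol{\mu}) / \sigma^{2}$ with $\mathbf{D}=X$ and $\mathbf{V}=\operatorname{diag}\left\{V\left(\mu_{1}\right), V\left(\mu_{2}\right)\right\}$.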
The covariance matrix of $\mathbf{U}(\boldsymbol{\beta})$ plays the role of the Fisher information for $\boldsymbol{\beta}$ (cf. the Cramér–Rao discussion in the preliminaries). Since $\operatorname{cov}(\mathbf{Y})=\sigma^{2} \mathbf{V}$,
$$
\mathbf{i}_{\beta}=\operatorname{cov}\{\mathbf{U}(\boldsymbol{\beta})\}=\mathbf{D}^{T} \mathbf{V}^{-1} \operatorname{cov}(\mathbf{Y}) \mathbf{V}^{-1} \mathbf{D} / \sigma^{4}=\mathbf{D}^{T} \mathbf{V}^{-1} \mathbf{D} / \sigma^{2}
$$
and, by analogy with the Cramér–Rao bound,

$$
\operatorname{cov}(\hat{\boldsymbol{\beta}}) \simeq \mathbf{i}_{\beta}^{-1}=\sigma^{2}\left(\mathbf{D}^{T} \mathbf{V}^{-1} \mathbf{D}\right)^{-1}
$$

Beginning with an arbitrary value $\hat{\boldsymbol{\beta}}_{0}$ sufficiently close to $\hat{\boldsymbol{\beta}}$, the sequence of parameter estimates generated by the Newton-Raphson method with Fisher scoring is
$$
\hat{\boldsymbol{\beta}}_{1}=\hat{\boldsymbol{\beta}}_{0}+\left(\hat{\mathbf{D}}_{0}^{T} \hat{\mathbf{V}}_{0}^{-1} \hat{\mathbf{D}}_{0}\right)^{-1} \hat{\mathbf{D}}_{0}^{T} \hat{\mathbf{V}}_{0}^{-1}\left(\mathbf{y}-\hat{\boldsymbol{\mu}}_{0}\right)
$$

1.1.4.2 Details of the Newton-Raphson iteration used in the method of scoring

Let $l(\theta)$ be the log-likelihood function, let $S(\theta)=\partial l(\theta) / \partial \theta$ be the score function, and let
$$
J(\theta)=J(\theta ; X)=-\frac{\partial}{\partial \theta} S(\theta)=-\frac{\partial^{2}}{\partial \theta^{2}} l(\theta)
$$
Then, we take an initial value $\theta_0$ and write
$$
\theta_{1}=\theta_{0}+J\left(\theta_{0}\right)^{-1} S\left(\theta_{0}\right)
$$
and we keep repeating this update,
$$
\theta_{k+1}=\theta_{k}+J\left(\theta_{k}\right)^{-1} S\left(\theta_{k}\right),
$$
as long as $\left|S\left(\theta_{k}\right)\right|>\epsilon$.
If $\hat{\theta}$ is a local maximum for the likelihood function, we must have
$$
J(\hat{\theta})=-\frac{\partial^{2}}{\partial \theta^{2}} l(\hat{\theta})>0
$$
The quantity $J(\hat{\theta})$ determines the sharpness of the peak of the likelihood function around the maximum. It is also known as the observed information.

So in the section above, $(\hat{\mathbf{D}}_{0}^{T} \hat{\mathbf{V}}_{0}^{-1} \hat{\mathbf{D}}_{0})^{-1}$ plays the role of $J(\theta_{k})^{-1}$ (with the observed information replaced by the expected information, which is what Fisher scoring does; the factors of $\sigma^{2}$ cancel), and $\hat{\mathbf{D}}_{0}^{T} \hat{\mathbf{V}}_{0}^{-1}\left(\mathbf{y}-\hat{\boldsymbol{\mu}}_{0}\right)$ plays the role of $S(\theta_{0})$.
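As a toy illustration of this scoring iteration (not from the paper; the data and the Poisson example below are made up), consider estimating the mean $\lambda$ of a Poisson sample, for which $S(\lambda)=\sum x_{i} / \lambda-n$ and $J(\lambda)=\sum x_{i} / \lambda^{2}$. A minimal C++ sketch:

#include <cmath>
#include <iostream>
#include <numeric>
#include <vector>

// Newton-Raphson / scoring iteration theta_{k+1} = theta_k + J(theta_k)^{-1} S(theta_k)
// for the MLE of a Poisson mean; it converges to the sample mean.
int main() {
    std::vector<double> x = {3, 1, 4, 1, 5, 9, 2, 6};   // made-up data
    double sum_x = std::accumulate(x.begin(), x.end(), 0.0);
    double n = static_cast<double>(x.size());

    double theta = 1.0;            // initial value theta_0
    const double eps = 1e-8;
    for (int k = 0; k < 100; ++k) {
        double S = sum_x / theta - n;          // score
        if (std::abs(S) <= eps) break;         // stop when the score is (nearly) zero
        double J = sum_x / (theta * theta);    // observed information
        theta += S / J;                        // scoring update
    }
    std::cout << "MLE of lambda: " << theta
              << " (sample mean = " << sum_x / n << ")\n";
    return 0;
}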

1.1.4.3 Estimation of $\sigma^{2}$

The dispersion parameter $\sigma^{2}$ is estimated by the generalized Pearson statistic divided by the residual degrees of freedom:
$$
\tilde{\sigma}^{2}=\frac{1}{n-p} \sum_{i}\left(Y_{i}-\hat{\mu}_{i}\right)^{2} / V_{i}\left(\hat{\mu}_{i}\right)=X^{2} /(n-p)
$$
where $p$ is the number of regression parameters.
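A minimal Armadillo sketch of this estimator (the responses, fitted means and $p$ below are made-up placeholders, assuming a Poisson-type variance function $V(\mu)=\mu$):

#include <armadillo>
#include <iostream>
using namespace arma;

// Moment estimator of the dispersion: generalized Pearson statistic over (n - p).
int main() {
    vec y      = {2.0, 0.0, 3.0, 1.0, 4.0};   // made-up responses
    vec mu_hat = {1.8, 0.5, 2.6, 1.2, 3.9};   // made-up fitted means
    vec V_mu   = mu_hat;                      // assumed variance function V(mu) = mu
    int p = 2;                                // number of regression parameters

    double n = y.n_elem;
    double X2 = sum(square(y - mu_hat) / V_mu);   // generalized Pearson statistic
    double sigma2_tilde = X2 / (n - p);
    std::cout << "sigma2_tilde = " << sigma2_tilde << std::endl;
    return 0;
}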

1.2 Dependent observations

1.2.1 Quasi-likelihood estimating equations

Suppose now that $\operatorname{cov}(\mathbf{Y})=\sigma^{2} \mathbf{V}(\boldsymbol{\mu})$, where $\mathbf{V}(\boldsymbol{\mu})$ is a symmetric positive-definite $n \times n$ matrix of known functions $V_{i j}(\boldsymbol{\mu})$, no longer diagonal. The quasi-score function $\mathbf{U}(\boldsymbol{\beta})$, with components $U_{r}(\boldsymbol{\beta})$, has the following properties:
$$
E\left\{U_{r}(\boldsymbol{\beta})\right\}=0
$$

$$
\begin{aligned} \operatorname{cov}\{\mathbf{U}(\boldsymbol{\beta})\} &=\mathbf{D}^{T} \mathbf{V}^{-1} \mathbf{D} / \sigma^{2}=\mathbf{i}_{\beta} \\ -E\left(\frac{\partial U_{r}(\boldsymbol{\beta})}{\partial \beta_{s}}\right) &=\left(\mathbf{D}^{T} \mathbf{V}^{-1} \mathbf{D}\right)_{r s} / \sigma^{2} \end{aligned}
$$

As before, the estimate $\hat{\boldsymbol{\beta}}$ solving
$$
\mathbf{U}(\hat{\boldsymbol{\beta}})=\hat{\mathbf{D}}^{T} \hat{\mathbf{V}}^{-1}(\mathbf{Y}-\hat{\boldsymbol{\mu}})=\mathbf{0}
$$
is approximately unbiased for $\boldsymbol{\beta}$ and asymptotically normally distributed with limiting variance $\operatorname{cov}(\hat{\boldsymbol{\beta}}) \simeq \sigma^{2}\left(\mathbf{D}^{T} \mathbf{V}^{-1} \mathbf{D}\right)^{-1}=\mathbf{i}_{\beta}^{-1}$.
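For illustration, a small Armadillo sketch (all numbers below are made up; in practice $\mathbf{D}$ and $\boldsymbol{\mu}$ are functions of the current $\boldsymbol{\beta}$) computing the quasi-score, the information matrix and the approximate covariance of $\hat{\boldsymbol{\beta}}$:

#include <armadillo>
#include <iostream>
using namespace arma;

int main() {
    // Made-up ingredients for n = 3 observations and p = 2 parameters.
    mat D = {{1.0, 0.5}, {1.0, 1.5}, {1.0, 2.5}};                   // D_ir = d mu_i / d beta_r
    mat V = {{2.0, 0.3, 0.1}, {0.3, 2.5, 0.2}, {0.1, 0.2, 3.0}};    // V(mu), no longer diagonal
    vec y  = {1.0, 3.0, 6.0};
    vec mu = {1.2, 2.8, 5.5};
    double sigma2 = 1.0;

    mat Vinv = inv(V);
    vec U      = D.t() * Vinv * (y - mu) / sigma2;   // quasi-score U(beta)
    mat i_beta = D.t() * Vinv * D / sigma2;          // cov{U(beta)} = i_beta
    mat cov_b  = inv(i_beta);                        // approx cov(betahat) = sigma2 (D'V^{-1}D)^{-1}

    U.print("U(beta):");
    i_beta.print("i_beta:");
    cov_b.print("approx cov(betahat):");
    return 0;
}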

2. Main idea of the paper

Candidate model:

The paper addresses a clustering problem, which has two aspects: how to determine the number of clusters, and, once the number of clusters is given, how to assign the units. For the latter problem the paper uses the idea of candidate models: a collection of candidate clustering schemes is specified, some statistics are computed and decision criteria are set up, and these are used to judge which candidate clustering is best.

Model and Estimation

$$
E\left(Y_{i} | \epsilon_{i}\right)=h\left(\alpha+\boldsymbol{x}_{i}^{\prime} \boldsymbol{\beta}+\gamma_{\omega} \omega_{i}+\epsilon_{i}\right)
$$

$\alpha$ denotes the intercept, $\boldsymbol{\beta}=\left(\beta_{1}, \dots, \beta_{p}\right)^{\prime}$ denotes the vector of $p$ slopes associated with the covariates, and $\gamma_\omega$ denotes a "jump coefficient" associated with the cluster $\Omega$. The error $\epsilon_i$ has variance $\sigma^2$ and spatial correlation $\rho_{i, i^{\prime}}$ between locations $s_{i}$ and $s_{i^{\prime}}$.
$$
\lambda_{i}=E\left\{E\left(Y_{i} | \epsilon_{i}\right)\right\}=h\left(\alpha+\boldsymbol{x}_{i}^{\prime} \boldsymbol{\beta}+\gamma_{\omega} \omega_{i}+\sigma_{\tau}\right)
$$
where $\sigma_{\tau}^{2}=(1 / 2) \sigma^{2}$ when the link function $h$ is exponential.

Let $\boldsymbol{V}=\operatorname{var}(\boldsymbol{Y})$ denote the marginal covariance matrix of the response vector. Let a candidate model be indexed by the status vector $\omega=\left(\omega_{1}, \ldots, \omega_{n}\right)^{\prime}$ and its complementary model by $\omega^{c}=\left(\omega_{1}^{c}, \ldots, \omega_{n}^{c}\right)^{\prime}$.

We have $\boldsymbol{\lambda}_{\omega^{c}}=h\left(\alpha^{c}+\boldsymbol{X} \boldsymbol{\beta}+\gamma_{\omega^{c}} \boldsymbol{\omega}^{c}+\sigma_{\tau}\right)$, where $\alpha^{c}=\alpha+\gamma_{\omega}$ and $\gamma_{\omega^{c}}=-\gamma_{\omega}$, since $\boldsymbol{\lambda}_{\omega} \equiv \boldsymbol{\lambda}_{\omega^{c}}$.

Apply the quasi-likelihood estimation (QLE) function to estimate $\boldsymbol{\beta}$, $\gamma_{\omega}$ and $\alpha$ in the candidate model, and $\gamma_{u_{j}^{c}}$, $\alpha^{c}$ and $\boldsymbol{\beta}_{u_{j}^{c}}$ in the complementary model.

Determine which candidate model is the best

If a candidate model is good enough, the term $\left\{h\left(\hat{\gamma}_{u_{j}}\right)-h\left(\hat{\gamma}_{u_{j}^{c}}\right)\right\}^{2}$ should be large, because the best possible outcome is $\left\{h\left(\gamma_{\omega}\right)-h\left(-\gamma_{\omega}\right)\right\}^{2}$. We therefore use
$$
\hat{\omega}=\arg \max _{\boldsymbol{u}_{j} \in \Upsilon}\left\{h\left(\hat{\gamma}_{u_{j}}\right)-h\left(\hat{\gamma}_{u_{j}^{c}}\right)\right\}^{2}
$$
to decide which candidate model is best.

Three-step algorithm to wrap it up

The purpose of this algorithm is to iteratively refine the estimate of the covariance matrix of Y, the choice of the best candidate model, and the estimates of the model parameters. The specific steps are:

  1. Given a collection of candidate models and a working covariance matrix for the errors, estimate the parameters of each model by QLE, use the criterion above to find the best model, and record it.
  2. Fit the data with this best model, record the residuals, and use them to re-estimate the error covariance matrix.
  3. With the newly estimated covariance matrix, determine again which candidate model is best. If it is the same one obtained in the previous iteration, the algorithm terminates and the parameters estimated at the current step are returned as the final result; otherwise, return to step 1.

Identification of multiple clusters

Use the quasi-deviance (QDEV) method for model selection. The quantity $D\left(\boldsymbol{\lambda}_{\omega_{1}, \omega_{2}}, \boldsymbol{\lambda}_{\omega_{1}}\right)=\left(\boldsymbol{\lambda}_{\omega_{1}, \omega_{2}}-\boldsymbol{\lambda}_{\omega_{1}}\right)^{\prime}\left\{\boldsymbol{V}_{\omega_{1}, \omega_{2}}^{-1}\left(\boldsymbol{Y}-\boldsymbol{\lambda}_{\omega_{1}, \omega_{2}}\right)+\boldsymbol{V}_{\omega_{1}}^{-1}\left(\boldsymbol{Y}-\boldsymbol{\lambda}_{\omega_{1}}\right)\right\}$ can be used to compare models with different clusters. Lin (2011) showed that $D\left(\boldsymbol{\lambda}_{\omega_{1}, \omega_{2}}, \boldsymbol{\lambda}_{\omega_{1}}\right)$ converges in distribution to a chi-squared distribution $\chi_{1}^{2}$ with one degree of freedom.

In more detail:

To determine whether an additional cluster exists, we update the collection of status vectors to $\Upsilon_{1}=\left\{\boldsymbol{u} \cdot \hat{\boldsymbol{\omega}}_{1}^{c}: \boldsymbol{u} \in \Upsilon\right\}$, where $\hat{\boldsymbol{\omega}}_{1}^{c}=1-\hat{\boldsymbol{\omega}}_{1}$. That is, we exclude the spatial units already identified from each status vector in $\Upsilon$.

  • Let $\boldsymbol{\lambda}_{\hat{\omega}_{1}}=h\left\{\left(\alpha+\sigma_{\tau}\right)+\boldsymbol{X} \boldsymbol{\beta}+\gamma_{\hat{\omega}_{1}} \hat{\boldsymbol{\omega}}_{1}\right\}$ denote the single-cluster model associated with the estimated status vector $\hat{\boldsymbol{\omega}}_{1}$.

  • Let $\boldsymbol{u}_{j}^{*}, j=1, \ldots, J-1$, denote a candidate vector in $\Upsilon_{1}$, and let $\boldsymbol{u}_{j}^{* c}=\left(1-\boldsymbol{u}_{j}^{*}\right) \cdot \hat{\boldsymbol{\omega}}_{1}^{c}$ denote its complement. We denote the two-cluster model as $\boldsymbol{\lambda}_{\hat{\omega}_{1}, u_{j}^{*}}=h\left\{\left(\alpha+\sigma_{\tau}\right)+\boldsymbol{X} \boldsymbol{\beta}+\gamma_{\hat{\omega}_{1}} \hat{\boldsymbol{\omega}}_{1}+\gamma_{u_{j}^{*}} \boldsymbol{u}_{j}^{*}\right\}$. The three-step computational algorithm can be applied to fit the model $\boldsymbol{\lambda}_{\hat{\omega}_{1}, u_{j}^{*}}$.

Simulation based on a hierarchical generalized linear model

Given $\epsilon_{i}$, the response $Y_i$ was independently Poisson with mean $E\left(Y_{i} | \epsilon_{i}\right)=\exp \left(\beta_{0}^{*}+\beta_{1} x_{i, 1}+\beta_{2} x_{i, 2}+\beta_{3} x_{i, 3}+\epsilon_{i}\right)$. The model then had:
$$
\begin{aligned} \theta_{i} &=E\left(Y_{i}\right)=\exp \left\{\beta_{0}+\beta_{1} x_{i, 1}+\beta_{2} x_{i, 2}+\beta_{3} x_{i, 3}\right\} \\ \operatorname{var}\left(Y_{i}\right) &=\theta_{i}+\theta_{i}^{2}\{\exp (1)-1\} \quad \text { and } \quad \operatorname{cov}\left(Y_{i}, Y_{j}\right)=\theta_{i} \theta_{j}\left\{\exp \left(\rho_{i j}\right)-1\right\} \end{aligned}
$$
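A rough Armadillo sketch of how such data could be generated (an assumed toy setup: locations on a line, an exponential correlation $\rho_{i j}=\exp \left(-\left|s_{i}-s_{j}\right| / \phi\right)$, and made-up covariates and coefficients; only the Poisson–lognormal structure itself comes from the model above):

#include <armadillo>
#include <cmath>
#include <iostream>
#include <random>
using namespace arma;

// Toy generator:  eps ~ N(0, Sigma) with Sigma_ij = sigma^2 * rho_ij (sigma^2 = 1 here),
//                 Y_i | eps_i ~ Poisson( exp(beta0* + x_i' beta + eps_i) ).
int main() {
    const int n = 50;
    std::mt19937 rng(42);

    // Made-up locations and an assumed exponential correlation rho_ij = exp(-d_ij / phi).
    vec s = linspace(0.0, 10.0, n);
    double phi = 2.0, sigma2 = 1.0;
    mat Sigma(n, n);
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            Sigma(i, j) = sigma2 * std::exp(-std::abs(s(i) - s(j)) / phi);

    // Correlated random effects via the Cholesky factor.
    mat L = chol(Sigma, "lower");
    vec eps = L * randn<vec>(n);

    // Made-up covariates and coefficients.
    mat X = randn<mat>(n, 3);
    vec beta = {0.5, -0.3, 0.2};
    double beta0_star = 0.1;

    vec eta = beta0_star + X * beta + eps;
    vec Y(n);
    for (int i = 0; i < n; ++i) {
        std::poisson_distribution<int> pois(std::exp(eta(i)));
        Y(i) = pois(rng);
    }
    vec first5 = Y.head(5);
    first5.print("first five simulated responses:");
    return 0;
}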

Tricks in the actual code implementation:

  • Initially the covariance matrix is simply an identity matrix. Sweep over all possible candidate models, maximizing the value function by QLE for each one and recording the result; finally select the candidate model that attains the largest value and obtain its parameter estimates by QLE.

  • Plug these estimated parameters in to compute residuals, use the residuals to re-estimate the covariance matrix, and on the basis of this re-estimated covariance matrix sweep over the candidate models again, maximizing the value function with QLE and recording the results; select the candidate with the largest value and check whether it is the same candidate obtained in the first step. If it is, the algorithm terminates; if not, steps one and two are repeated until convergence.

  • After steps one and two are finished, the third step checks whether a new cluster should be added. Using the covariance-matrix estimate from step two, an additional vector is appended to the existing candidate model; all possible values of this vector are swept over and the candidate vector attaining the largest QLE value is chosen. The QDEV function is then used to judge whether adding this vector improves the fit substantially (according to a decision criterion). If not, the vector is not added; if it does, the vector is added and the covariance matrix is re-estimated. In principle one should sweep over all candidate vectors once more under the new covariance matrix to confirm this vector (as in steps one and two), but this step is omitted in the code.

3. Rcpp basics

RcppArmadillo example code (this is a C++ implementation of the quasi-likelihood estimation function):

#include <RcppArmadillo.h>
using namespace arma;
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::export]]
Rcpp::List qle(vec beta_in, mat X_in, vec R_in, double n_i, mat Corr,
               double sigma2, vec fix_par) {

  int p = beta_in.size();      // number of regression parameters
  int n = X_in.n_rows;         // number of observations
  vec beta_in0 = beta_in;      // keep the starting value in case of failure
  mat X = X_in;

  // Off-diagonal factor of the marginal covariance: exp(sigma2 * rho_ij) - 1
  mat V1 = exp(mat(n, n).fill(sigma2) % Corr) - mat(n, n).fill(1);
  mat invI = zeros<mat>(p, p);
  mat V(n, n), D(n, p), invV(n, n);
  vec mu(n);
  int iter = 0;

  try {
    while (iter <= 400) {
      iter++;

      // Marginal mean: mu_i = n_i * exp(x_i' beta + 0.5 * sigma2 + offset)
      mu = vec(n).fill(n_i) % exp(X * beta_in + vec(n).fill(0.5 * sigma2)) %
           exp(vec(n).fill(sum(fix_par)));

      // D_ir = d mu_i / d beta_r = mu_i * x_ir under the log link
      D = repmat(mu, 1, p) % X;

      // Marginal covariance of Y: off-diagonals mu_i mu_j {exp(sigma2 rho_ij) - 1},
      // diagonal mu_i + mu_i^2 {exp(sigma2) - 1}
      V = V1 % (mu * mu.t());
      V.diag() = mu + vec(n).fill(std::exp(sigma2)) % mu % mu - mu % mu;
      invV = inv(V);

      // Fisher-scoring update: beta <- beta + (D'V^{-1}D)^{-1} D'V^{-1}(R - mu)
      vec beta1 = beta_in;
      invI = inv(D.t() * invV * D);
      beta_in = beta_in + invI * D.t() * invV * (R_in - mu);

      // Stop when the change in beta is small
      if (sum((beta1 - beta_in) % (beta1 - beta_in)) < 0.01) { break; }
    }
    if (iter > 400) {
      // No convergence within the iteration limit: fall back to the starting value
      beta_in = beta_in0;
    }
  } catch (...) {
    // e.g. V is not positive definite / not invertible
    Rcpp::Rcout << "matrix is not positive definite\n";
    iter = 0;
    beta_in = beta_in0;
  }

  Rcpp::Rcout << "iteration number = " << iter << "\n";

  // Variance associated with the last column of X (the cluster indicator),
  // using invV from the last completed iteration
  mat Sigma_u = inv(X_in.col(p - 1).t() * invV * X_in.col(p - 1));

  return Rcpp::List::create(Rcpp::Named("iteration") = iter,
                            Rcpp::Named("betahat") = beta_in,
                            Rcpp::Named("covbetahat") = invI,
                            Rcpp::Named("Sigma_u") = Sigma_u);
}

4. Homogeneous approach

Apply the homogeneous approach's code to the evis71 dataset.

Results obtained using the heterogeneous approach:

Rcpp implementation:

50.298 sec elapsed

number of clusters: 6

Estimated coefficients:

para.fix
[,1]
[1,] -1.063637
[2,] -0.123927
[3,] 3.142743
[4,] 2.664509
[5,] 2.300295
[6,] 1.947950
[7,] 1.753625
[8,] 1.435032

Original R code

158.233 sec elapsed

number of clusters = 6

Estimated coefficients:

para.fix
[,1]
[1,] -1.063637
[2,] -0.123927
[3,] 3.142743
[4,] 2.664509
[5,] 2.300295
[6,] 1.947950
[7,] 1.753625
[8,] 1.435032

Results obtained using the homogeneous approach:

Original R code

43.5 sec elapsed

number of clusters = 14

Rcpp implementation

28.823 sec elapsed

number of clusters = 14