# 3 OpenTURNS' methods for Step B: quantification of the uncertainty sources

This section is organized in three parts. The first one gives the list of probabilistic uncertainty models proposed by OpenTURNS. The second part gives an overview of the content of the statistical toolbox that may be used to build these uncertainty models if data are available. The last part is dedicated to the mathematical description of each method.

## 3.1 Probabilistic models proposed in OpenTURNS

OpenTURNS proposes two different types of probabilistic models: non-parametric and parametric ones.

## 3.2 Classical statistical tools for uncertainty quantification

Building a dataset may require aggregating several data sources; OpenTURNS offers techniques to check beforehand whether these data sources do follow the same probability distribution.

Moreover, when a parametric model is used, OpenTURNS provides statistical tools to estimate the parameters, validate the resulting model and address the important issue of dependencies among uncertainty sources.

## 3.3 Methods description

### 3.3.1 Step B  – Empirical cumulative distribution function

Mathematical description

Goal

The empirical cumulative distribution function provides a graphical representation of the probability distribution of a random vector without any prior assumption about the form of this distribution. It is a non-parametric approach, which makes it possible to describe complex behaviour that parametric approaches do not necessarily detect.

Therefore, using general notation, this means that we are looking for an estimator ${\stackrel{^}{F}}_{N}$ for the cumulative distribution function ${F}_{X}$ of the random variable $\underline{X}=\left({X}^{1},...,{X}^{{n}_{X}}\right)$:

 $\begin{array}{c}\hfill {\stackrel{^}{F}}_{N}↔{F}_{X}\end{array}$

Principle of the method for ${n}_{X}=1$

Let us first consider the uni-dimensional case, and let us denote $\underline{X}={X}^{1}=X$. The empirical probability distribution is the distribution created from a sample of observed values $\left\{{x}_{1},{x}_{2},...,{x}_{N}\right\}$. It corresponds to a discrete uniform distribution on $\left\{{x}_{1},{x}_{2},...,{x}_{N}\right\}$: if ${X}^{\text{'}}$ follows this distribution, then

 $\begin{array}{c}\hfill \forall \phantom{\rule{0.277778em}{0ex}}i\in \left\{1,...,N\right\},\phantom{\rule{4pt}{0ex}}\mathrm{Pr}\left({X}^{\text{'}}={x}_{i}\right)=\frac{1}{N}\end{array}$

The empirical cumulative distribution function ${\stackrel{^}{F}}_{N}$ associated with this distribution is constructed as follows:

 $\begin{array}{c}\hfill {\stackrel{^}{F}}_{N}\left(x\right)=\frac{1}{N}\sum _{i=1}^{N}{\mathbf{1}}_{\left\{{x}_{i}\le x\right\}}\end{array}$

The empirical cumulative distribution function ${\stackrel{^}{F}}_{N}\left(x\right)$ is defined as the proportion of observations that are less than (or equal to) $x$ and is thus an approximation of the cumulative distribution function ${F}_{X}\left(x\right)$, which is the probability that an observation is less than (or equal to) $x$.

 $\begin{array}{c}\hfill {F}_{X}\left(x\right)=\mathrm{Pr}\left(X\le x\right)\end{array}$

The diagram below provides an illustration of an ordered sample $\left\{5,6,10,22,27\right\}$.
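
As a complement, the construction above can be sketched in a few lines of Python for the ordered sample $\left\{5,6,10,22,27\right\}$. This is a minimal illustration using only the standard library; the helper name `make_ecdf` is an assumption made for this sketch and is not part of the OpenTURNS API.

```python
from bisect import bisect_right


def make_ecdf(sample):
    """Build the empirical CDF of a one-dimensional sample."""
    ordered = sorted(sample)
    n = len(ordered)
    if n == 0:
        raise ValueError("the sample must contain at least one observation")

    def ecdf(x):
        # bisect_right returns the number of observations that are <= x
        return bisect_right(ordered, x) / n

    return ecdf


ecdf = make_ecdf([5, 6, 10, 22, 27])
for x in (4, 5, 8, 10, 27):
    print(f"F_N({x}) = {ecdf(x)}")
```

The resulting function is a staircase that jumps by $1/N=1/5$ at each observation: it is 0 below 5 and reaches 1 at 27.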

Principle of the method for ${n}_{X}>1$

The method is similar for the case ${n}_{X}>1$. The empirical probability distribution is a distribution created from a sample $\left\{{\underline{x}}_{1},{\underline{x}}_{2},...,{\underline{x}}_{N}\right\}$. It corresponds to a discrete uniform distribution on $\left\{{\underline{x}}_{1},{\underline{x}}_{2},...,{\underline{x}}_{N}\right\}$: if ${\underline{X}}^{\text{'}}$ follows this distribution, then

 $\begin{array}{c}\hfill \forall \phantom{\rule{0.277778em}{0ex}}i\in \left\{1,...,N\right\},\phantom{\rule{4pt}{0ex}}\mathrm{Pr}\left({\underline{X}}^{\text{'}}={\underline{x}}_{i}\right)=\frac{1}{N}\end{array}$

Thus we have:

 $\begin{array}{c}\hfill {\stackrel{^}{F}}_{N}\left(\underline{x}\right)=\frac{1}{N}\sum _{i=1}^{N}{\mathbf{1}}_{\left\{{x}_{i}^{1}\le {x}^{1},...,{x}_{i}^{{n}_{X}}\le {x}^{{n}_{X}}\right\}}\end{array}$

in comparison with the theoretical cumulative distribution function ${F}_{X}$:

 $\begin{array}{c}\hfill {F}_{X}\left(\underline{x}\right)=ℙ\left({X}^{1}\le {x}^{1},...,{X}^{{n}_{X}}\le {x}^{{n}_{X}}\right)\end{array}$

Other notations

This method is also referred to in the literature as the empirical distribution function.

Link with OpenTURNS methodology

This method is used in step B "Quantifying Sources of Uncertainty". It provides a representation of the distribution of the vector $\underline{X}$ of uncertain variables defined in step A "Specifying Criteria and the Case Study", without any a priori modelling hypotheses.

References and theoretical basics

This method has the advantage of depending only on the observed values, without any other modelling assumptions (as in the [kernel smoothing method]). Nevertheless, when little data is available, the estimation of the criteria defined in step A can be less precise with this non-parametric method than with a parametric approach (e.g. the models described in [standard parametric models]).

The following bibliographical references provide main starting points for further study of this method:

• Saporta G. (1990). "Probabilités, Analyse de données et Statistique", Technip

• Dixon W.J. & Massey F.J. (1983) "Introduction to statistical analysis (4th ed.)", McGraw-Hill

### 3.3.2 Step B  – Kernel Smoothing

Mathematical description

Kernel smoothing is a non-parametric method for estimating the probability density function of a distribution.

In dimension 1, the kernel smoothed probability density function $\stackrel{^}{p}$ has the following expression, where $K$ is the univariate kernel, $n$ the numerical sample size and $\left({X}_{1},\cdots ,{X}_{n}\right)\in {ℝ}^{n}$ the univariate random sample with $\forall i,\phantom{\rule{0.166667em}{0ex}}\phantom{\rule{0.166667em}{0ex}}{X}_{i}\in ℝ$ :

 $\stackrel{^}{p}\left(x\right)=\frac{1}{nh}\sum _{i=1}^{n}K\left(\frac{x-{X}_{i}}{h}\right)$ (1)

The kernel $K$ is a function satisfying $\int K\left(x\right)\phantom{\rule{0.166667em}{0ex}}dx=1$. Usually, $K$ is chosen to be a unimodal probability density function that is symmetric about 0.

The parameter $h$ is called the bandwidth.
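
Relation (1) can be illustrated with a short Python sketch. It assumes a standard normal kernel and an arbitrary bandwidth chosen only for the illustration; the function names are hypothetical and this is not how OpenTURNS implements the estimator.

```python
import math


def gaussian_kernel(u):
    """Standard normal density: a symmetric, unimodal kernel integrating to 1."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kernel_density(x, sample, h, kernel=gaussian_kernel):
    """Evaluate relation (1) at the point x for a bandwidth h > 0."""
    if h <= 0:
        raise ValueError("the bandwidth h must be positive")
    n = len(sample)
    return sum(kernel((x - xi) / h) for xi in sample) / (n * h)


sample = [5.0, 6.0, 10.0, 22.0, 27.0]
h = 4.0  # arbitrary illustrative bandwidth, not an optimal one

# Riemann sum over a wide grid: the estimate should integrate to about 1
step = 0.05
grid = [-40.0 + step * k for k in range(2201)]
total_mass = step * sum(kernel_density(x, sample, h) for x in grid)
print(f"integral of the estimate ~ {total_mass:.6f}")
```

Because each term of the sum is a rescaled copy of the kernel carrying mass $1/n$, the estimate is itself a probability density, which the numerical integral above checks.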

In dimension $d>1$, the kernel may be defined as a product kernel ${K}_{d}$, as follows where $\underline{x}=\left({x}_{1},\cdots ,{x}_{d}\right)\in {ℝ}^{d}$ :

 $\begin{array}{c}\hfill {K}_{d}\left(\underline{x}\right)=\prod _{j=1}^{d}K\left({x}_{j}\right)\end{array}$

which leads to the kernel smoothed probability density function in dimension $d$, where $\left({\underline{X}}_{1},\cdots ,{\underline{X}}_{n}\right)$ is the $d$-variate random sample whose components are denoted ${\underline{X}}_{i}=\left({X}_{i1},\cdots ,{X}_{id}\right)$ :

 $\begin{array}{c}\hfill \stackrel{^}{p}\left(\underline{x}\right)=\frac{1}{n{\prod }_{j=1}^{d}{h}_{j}}\sum _{i=1}^{n}{K}_{d}\left(\frac{{x}_{1}-{X}_{i1}}{{h}_{1}},\cdots ,\frac{{x}_{d}-{X}_{id}}{{h}_{d}}\right)\end{array}$

Note that the bandwidth is the vector $\underline{h}=\left({h}_{1},\cdots ,{h}_{d}\right)$.

The quality of the approximation may be controlled by the AMISE (Asymptotic Mean Integrated Square Error) criterion, defined as :

 $\begin{array}{c}\hfill \left\{\begin{array}{ccc}AMISE\left(\stackrel{^}{p}\right)\hfill & =& \text{first two terms of the series expansion of}\phantom{\rule{4.pt}{0ex}}MISE\left(\stackrel{^}{p}\right)\phantom{\rule{4.pt}{0ex}}\text{with respect to}\phantom{\rule{4.pt}{0ex}}n\hfill \\ MISE\left(\stackrel{^}{p}\right)\hfill & =& {𝔼}_{\underline{X}}\left[||\stackrel{^}{p}-p{||}_{{L}_{2}}^{2}\right]=\int \phantom{\rule{0.166667em}{0ex}}MSE\left(\stackrel{^}{p},\underline{x}\right)\phantom{\rule{0.166667em}{0ex}}d\underline{x}\hfill \\ MSE\left(\stackrel{^}{p},\underline{x}\right)\hfill & =& {\left[{𝔼}_{\underline{X}}\left[\stackrel{^}{p}\left(\underline{x}\right)\right]-p\left(\underline{x}\right)\right]}^{2}+{\mathrm{Var}}_{\underline{X}}\left[\stackrel{^}{p}\left(\underline{x}\right)\right]\hfill \end{array}\right.\end{array}$

The quality of the estimation essentially depends on the value of the bandwidth $h$. The bandwidth that minimizes the AMISE criterion has the following expression (given in dimension 1) :

 ${h}_{AMISE}\left(K\right)={\left[\frac{R\left(K\right)}{{\mu }_{2}{\left(K\right)}^{2}R\left({p}^{\left(2\right)}\right)}\right]}^{\frac{1}{5}}{n}^{-\frac{1}{5}}$ (2)

where $R\left(K\right)=\int K{\left(\underline{x}\right)}^{2}\phantom{\rule{0.166667em}{0ex}}d\underline{x}$ and ${\mu }_{2}\left(K\right)=\int {\underline{x}}^{2}K\left(\underline{x}\right)\phantom{\rule{0.166667em}{0ex}}d\underline{x}={\sigma }_{K}^{2}$.

If we note that $R\left({p}^{\left(r\right)}\right)={\left(-1\right)}^{r}{\Phi }_{2r}$ with ${\Phi }_{r}=\int {p}^{\left(r\right)}\left(x\right)p\left(x\right)\phantom{\rule{0.166667em}{0ex}}dx={𝔼}_{X}\left[{p}^{\left(r\right)}\left(X\right)\right]$, then relation (2) can be written :

 ${h}_{AMISE}\left(K\right)={\left[\frac{R\left(K\right)}{{\mu }_{2}{\left(K\right)}^{2}{\Phi }_{4}}\right]}^{\frac{1}{5}}{n}^{-\frac{1}{5}}$ (3)

Several rules exist to evaluate the optimal bandwidth ${h}_{AMISE}\left(K\right)$ : all efforts are concentrated on the evaluation of the term ${\Phi }_{4}$. We give here the most usual rules :

• the Silverman rule in dimension 1,

• the plug-in bandwidth selection - Solve-the-equation method in dimension 1,

• the Scott rule in dimension d.

Silverman rule (dimension 1)

In the case where the density $p$ is normal with standard deviation $\sigma$, the term ${\Phi }_{4}$ can be evaluated exactly. In that particular case, the optimal bandwidth of relation (3) with respect to the AMISE criterion is :

 ${h}_{AMISE}^{p=normal}\left(K\right)={\left[\frac{8\sqrt{\pi }R\left(K\right)}{3{\mu }_{2}{\left(K\right)}^{2}}\right]}^{\frac{1}{5}}\sigma {n}^{-\frac{1}{5}}$ (4)

An estimator of ${h}_{AMISE}^{p=normal}\left(K\right)$ is obtained by replacing $\sigma$ by its estimator ${\stackrel{^}{\sigma }}^{n}$, evaluated from the numerical sample $\left({X}_{1},\cdots ,{X}_{n}\right)$ :

 ${\stackrel{^}{h}}_{AMISE}^{p=normal}\left(K\right)={\left[\frac{8\sqrt{\pi }R\left(K\right)}{3{\mu }_{2}{\left(K\right)}^{2}}\right]}^{\frac{1}{5}}{\stackrel{^}{\sigma }}^{n}{n}^{-\frac{1}{5}}$ (5)

The Silverman rule consists in considering ${\stackrel{^}{h}}_{AMISE}^{p=normal}\left(K\right)$ of relation (5) even if the density $p$ is not normal :

 ${h}^{Silver}\left(K\right)={\left[\frac{8\sqrt{\pi }R\left(K\right)}{3{\mu }_{2}{\left(K\right)}^{2}}\right]}^{\frac{1}{5}}{\stackrel{^}{\sigma }}^{n}{n}^{-\frac{1}{5}}$ (6)

Relation (6) is empirical and gives good results when the density is not far from a normal one.
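
Relation (6) can be sketched as follows. The kernel constants are passed explicitly: for the standard normal kernel $R\left(K\right)=1/\left(2\sqrt{\pi }\right)$ and ${\mu }_{2}\left(K\right)=1$, so the leading coefficient reduces to ${\left(4/3\right)}^{1/5}\simeq 1.06$; for the Epanechnikov kernel $R\left(K\right)=3/5$ and ${\mu }_{2}\left(K\right)=1/5$. The source does not say which estimator of $\sigma$ is used, so the unbiased sample standard deviation is taken here as an assumption, and the function name is hypothetical.

```python
import math
import statistics

# Constants of the standard normal kernel: R(K) = 1 / (2 sqrt(pi)), mu_2(K) = 1
R_NORMAL = 1.0 / (2.0 * math.sqrt(math.pi))
MU2_NORMAL = 1.0


def silverman_bandwidth(sample, r_k=R_NORMAL, mu2_k=MU2_NORMAL):
    """Relation (6): Silverman bandwidth in dimension 1 for a given kernel."""
    n = len(sample)
    sigma_hat = statistics.stdev(sample)  # assumption: unbiased (n - 1) estimator
    coefficient = (8.0 * math.sqrt(math.pi) * r_k / (3.0 * mu2_k ** 2)) ** 0.2
    return coefficient * sigma_hat * n ** (-0.2)


sample = [5.0, 6.0, 10.0, 22.0, 27.0]
h_normal = silverman_bandwidth(sample)                        # normal kernel
h_epanechnikov = silverman_bandwidth(sample, r_k=0.6, mu2_k=0.2)
print(h_normal, h_epanechnikov)
```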

Plug-in bandwidth selection - Solve-the-equation method (dimension 1)

Relation (3) requires the evaluation of the quantity ${\Phi }_{4}$. As a general rule, we use the estimator ${\stackrel{^}{\Phi }}_{r}$ of ${\Phi }_{r}$ defined by :

 ${\stackrel{^}{\Phi }}_{r}=\frac{1}{n}\sum _{i=1}^{n}{\stackrel{^}{p}}^{\left(r\right)}\left({X}_{i}\right)$ (7)

Differentiating relation (1) leads to :

 ${\stackrel{^}{p}}^{\left(r\right)}\left(x\right)=\frac{1}{n{h}^{r+1}}\sum _{i=1}^{n}{K}^{\left(r\right)}\left(\frac{x-{X}_{i}}{h}\right)$ (8)

and then the estimator ${\stackrel{^}{\Phi }}_{r}\left(h\right)$ is defined as :

 ${\stackrel{^}{\Phi }}_{r}\left(h\right)=\frac{1}{{n}^{2}{h}^{r+1}}\sum _{i=1}^{n}\sum _{j=1}^{n}{K}^{\left(r\right)}\left(\frac{{X}_{i}-{X}_{j}}{h}\right)$ (9)

We note that ${\stackrel{^}{\Phi }}_{r}\left(h\right)$ depends on the parameter $h$, which can be chosen to minimize the AMSE (Asymptotic Mean Square Error) criterion evaluated between ${\Phi }_{r}$ and ${\stackrel{^}{\Phi }}_{r}\left(h\right)$. The optimal parameter $h$ is :

 ${h}_{AMSE}^{\left(r\right)}={\left(\frac{-2{K}^{\left(r\right)}\left(0\right)}{{\mu }_{2}\left(K\right){\Phi }_{r+2}}\right)}^{\frac{1}{r+3}}{n}^{-\frac{1}{r+3}}$ (10)
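
The double summation of relation (9) can be sketched as below for the standard normal kernel, whose even derivatives are ${K}^{\left(r\right)}\left(u\right)=H{e}_{r}\left(u\right)K\left(u\right)$ with $H{e}_{r}$ the probabilists' Hermite polynomial. Only the orders $r=4$ and $r=6$ needed later are covered; the names are hypothetical and the sketch makes no claim about the OpenTURNS implementation.

```python
import math

# Hermite polynomials He_4 and He_6: for the standard normal density phi,
# the even derivatives are phi^(r)(u) = He_r(u) * phi(u).
HERMITE = {
    4: lambda u: u ** 4 - 6.0 * u ** 2 + 3.0,
    6: lambda u: u ** 6 - 15.0 * u ** 4 + 45.0 * u ** 2 - 15.0,
}


def normal_kernel_derivative(r, u):
    """r-th derivative of the standard normal kernel at u, for r in {4, 6}."""
    phi = math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)
    return HERMITE[r](u) * phi


def phi_hat(sample, r, h):
    """Estimator (9) of Phi_r with bandwidth h; the cost is O(n^2)."""
    n = len(sample)
    double_sum = sum(
        normal_kernel_derivative(r, (xi - xj) / h)
        for xi in sample
        for xj in sample
    )
    return double_sum / (n * n * h ** (r + 1))


print(phi_hat([5.0, 6.0, 10.0, 22.0, 27.0], 4, 4.0))
```

With a single observation the double sum reduces to its diagonal term, so the estimator equals ${K}^{\left(r\right)}\left(0\right)/{h}^{r+1}$, which gives a simple sanity check.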

Given these preliminary results, the solve-the-equation plug-in method proceeds as follows :

1. Relation (3) defines ${h}_{AMISE}\left(K\right)$ as a function of ${\Phi }_{4}$, which we denote here as :

 ${h}_{AMISE}\left(K\right)=t\left({\Phi }_{4}\right)$ (11)

2. The term ${\Phi }_{4}$ is approximated by its estimator defined in (9) evaluated with its optimal parameter ${h}_{AMSE}^{\left(4\right)}$ defined in (10) :

 ${h}_{AMSE}^{\left(4\right)}={\left(\frac{-2{K}^{\left(4\right)}\left(0\right)}{{\mu }_{2}\left(K\right){\Phi }_{6}}\right)}^{\frac{1}{7}}{n}^{-\frac{1}{7}}$ (12)

which leads to a relation of type :

 ${\Phi }_{4}\simeq {\stackrel{^}{\Phi }}_{4}\left({h}_{AMSE}^{\left(4\right)}\right)$ (13)

3. Relations (3) and (12) lead to the new one :

 ${h}_{AMSE}^{\left(4\right)}={\left(\frac{-2{K}^{\left(4\right)}\left(0\right){\mu }_{2}\left(K\right){\Phi }_{4}}{R\left(K\right){\Phi }_{6}}\right)}^{\frac{1}{7}}{h}_{AMISE}{\left(K\right)}^{\frac{5}{7}}$ (14)

which rewrites :

 ${h}_{AMSE}^{\left(4\right)}=l\left({h}_{AMISE}\left(K\right)\right)$ (15)

4. Relation (14) depends on both terms ${\Phi }_{4}$ and ${\Phi }_{6}$, which are evaluated with their estimators defined in (9), using respectively their AMSE optimal parameters ${g}_{1}$ and ${g}_{2}$ (see relation (10)). This leads to the expressions :

 $\left\{\begin{array}{ccc}{g}_{1}\hfill & =& {\left(\frac{-2{K}^{\left(4\right)}\left(0\right)}{{\mu }_{2}\left(K\right){\Phi }_{6}}\right)}^{\frac{1}{7}}{n}^{-\frac{1}{7}}\hfill \\ {g}_{2}\hfill & =& {\left(\frac{-2{K}^{\left(6\right)}\left(0\right)}{{\mu }_{2}\left(K\right){\Phi }_{8}}\right)}^{\frac{1}{9}}{n}^{-\frac{1}{9}}\hfill \end{array}\right.$ (16)

5. In order to evaluate ${\Phi }_{6}$ and ${\Phi }_{8}$, we suppose that the density $p$ is normal with a variance ${\sigma }^{2}$ which is approximated by the empirical variance of the numerical sample, which leads to :

 $\left\{\begin{array}{ccc}{\stackrel{^}{\Phi }}_{6}\hfill & =& \frac{-15}{16\sqrt{\pi }}{\stackrel{^}{\sigma }}^{-7}\hfill \\ {\stackrel{^}{\Phi }}_{8}\hfill & =& \frac{105}{32\sqrt{\pi }}{\stackrel{^}{\sigma }}^{-9}\hfill \end{array}\right.$ (17)

To summarize, thanks to relations (11), (13), (15), (16) and (17), the optimal bandwidth is the solution of the equation :

 ${h}_{AMISE}\left(K\right)=t\circ {\stackrel{^}{\Phi }}_{4}\circ l\left({h}_{AMISE}\left(K\right)\right)$ (18)
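
The two pilot bandwidths ${g}_{1}$ and ${g}_{2}$ of step 4 can be computed in closed form once the normal-reference values of relation (17) are plugged into relation (10) with $r=4$ and $r=6$. The sketch below assumes the standard normal kernel, for which ${K}^{\left(4\right)}\left(0\right)=3/\sqrt{2\pi }$, ${K}^{\left(6\right)}\left(0\right)=-15/\sqrt{2\pi }$ and ${\mu }_{2}\left(K\right)=1$; the function name is hypothetical.

```python
import math

SQRT_2PI = math.sqrt(2.0 * math.pi)
K4_AT_0 = 3.0 / SQRT_2PI     # 4th derivative of the normal kernel at 0
K6_AT_0 = -15.0 / SQRT_2PI   # 6th derivative of the normal kernel at 0
MU2 = 1.0                    # second moment of the normal kernel


def pilot_bandwidths(sigma_hat, n):
    """Return (g1, g2) from relation (10) with r = 4 and r = 6."""
    # Normal-reference values of relation (17)
    phi6 = -15.0 / (16.0 * math.sqrt(math.pi)) * sigma_hat ** -7
    phi8 = 105.0 / (32.0 * math.sqrt(math.pi)) * sigma_hat ** -9
    # Both ratios are positive: K4(0) > 0 with phi6 < 0, K6(0) < 0 with phi8 > 0
    g1 = (-2.0 * K4_AT_0 / (MU2 * phi6)) ** (1.0 / 7.0) * n ** (-1.0 / 7.0)
    g2 = (-2.0 * K6_AT_0 / (MU2 * phi8)) ** (1.0 / 9.0) * n ** (-1.0 / 9.0)
    return g1, g2


g1, g2 = pilot_bandwidths(sigma_hat=1.0, n=100)
print(g1, g2)
```

Both pilot bandwidths scale linearly with $\stackrel{^}{\sigma }$, which is expected since they are lengths expressed in the units of the sample.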

Scott rule (dimension d)

The Scott rule is a simplification of the Silverman rule generalized to dimension $d$, which is optimal when the density $p$ is normal with independent components. In all other cases, it is an empirical rule that gives good results when the density $p$ is not far from the normal one. For example, the Scott bandwidth may be too large when $p$ has several maxima.

The Silverman rule given in dimension 1 in relation (6) can be generalized to dimension $d$ as follows : if we suppose that the density $p$ is normal with independent components in dimension $d$, and that we use the normal kernel $N\left(0,1\right)$ to estimate it, then the optimal bandwidth vector $\underline{h}$ with respect to the AMISE criterion is :

 ${\underline{h}}^{Silver}\left(N\right)={\left({\left(\frac{4}{d+2}\right)}^{1/\left(d+4\right)}{\stackrel{^}{\sigma }}_{i}^{n}{n}^{-1/\left(d+4\right)}\right)}_{i}$ (19)

where ${\stackrel{^}{\sigma }}_{i}^{n}$ is the standard deviation of the $i$-th component of the sample $\left({\underline{X}}_{1},\cdots ,{\underline{X}}_{n}\right)$, and ${\sigma }_{K}$ the standard deviation of the 1D kernel $K$.

The Scott proposition is a simplification of the Silverman rule, based on the fact that the coefficient ${\left(\frac{4}{d+2}\right)}^{1/\left(d+4\right)}$ remains in $\left[0.924,1.059\right]$ when the dimension $d$ varies. Thus, Scott fixed it to 1 :

 ${\left(\frac{4}{d+2}\right)}^{1/\left(d+4\right)}\simeq 1$ (20)

which leads to the simplified expression :

 ${\underline{h}}^{Silver}\left(N\right)\simeq {\left({\stackrel{^}{\sigma }}_{i}^{n}{n}^{-1/\left(d+4\right)}\right)}_{i}$ (21)

Furthermore, in the general case, we have from relation (2) :

 $\frac{{h}_{AMISE}\left({K}_{1}\right)}{{h}_{AMISE}\left({K}_{2}\right)}=\frac{{\sigma }_{{K}_{2}}}{{\sigma }_{{K}_{1}}}{\left[\frac{{\sigma }_{{K}_{1}}R\left({K}_{1}\right)}{{\sigma }_{{K}_{2}}R\left({K}_{2}\right)}\right]}^{1/5}$ (22)

Considering that ${\sigma }_{K}R\left(K\right)\simeq 1$ whatever the kernel $K$, relation (22) simplifies to :

 ${h}_{AMISE}\left({K}_{1}\right)\simeq {h}_{AMISE}\left({K}_{2}\right)\frac{{\sigma }_{{K}_{2}}}{{\sigma }_{{K}_{1}}}$ (23)

If we consider the normal kernel $N\left(0,1\right)$ for ${K}_{2}$, then relation (23) writes in a more general notation :

 ${h}_{AMISE}\left(K\right)\simeq {h}_{AMISE}\left(N\right)\frac{1}{{\sigma }_{K}}$ (24)

If ${h}_{AMISE}\left(N\right)$ is evaluated with the Silverman rule, (24) rewrites :

 ${h}^{Silver}\left(K\right)\simeq {h}^{Silver}\left(N\right)\frac{1}{{\sigma }_{K}}$ (25)

Finally, from relations (21) and (25) applied in each direction $i$, we deduce the Scott rule :

 ${\underline{h}}^{Scott}={\left(\frac{{\stackrel{^}{\sigma }}_{i}^{n}}{{\sigma }_{K}}{n}^{-1/\left(d+4\right)}\right)}_{i}$ (26)
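
Relation (26) translates directly into code. The sketch below takes the sample as a list of $d$-dimensional points and returns one bandwidth per component. The kernel standard deviation ${\sigma }_{K}$ defaults to 1 (normal kernel); for the Epanechnikov kernel it is $\sqrt{1/5}$. The use of the unbiased standard deviation estimator and the function name are assumptions of this illustration.

```python
import statistics


def scott_bandwidth(sample, sigma_kernel=1.0):
    """Relation (26): one bandwidth per component of a d-dimensional sample."""
    n = len(sample)
    d = len(sample[0])
    factor = n ** (-1.0 / (d + 4))
    bandwidth = []
    for j in range(d):
        component = [point[j] for point in sample]
        sigma_hat = statistics.stdev(component)  # assumption: (n - 1) estimator
        bandwidth.append(sigma_hat / sigma_kernel * factor)
    return bandwidth


sample = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
h_scott = scott_bandwidth(sample)                           # normal kernel
h_scott_epan = scott_bandwidth(sample, sigma_kernel=0.2 ** 0.5)
print(h_scott, h_scott_epan)
```

Each component of the bandwidth is proportional to the spread of the corresponding component of the sample, so rescaling one coordinate rescales only its own bandwidth.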

Boundary treatment

In dimension 1, the boundary effects may be taken into account in OpenTURNS : the boundaries are automatically detected from the numerical sample (with the min and max functions) and the kernel smoothed PDF is corrected in the boundary areas to remain within the boundaries, according to the mirroring technique :

• the Scott bandwidth $h$ is evaluated from the numerical sample,

• two subsamples are extracted from the initial numerical sample, containing all the points within the ranges $[min,min+h[$ and $]max-h,max]$,

• both subsamples are transformed into their symmetric samples with respect to their respective boundary : this results in two samples within the ranges $]min-h,min]$ and $[max,max+h[$,

• a kernel smoothed PDF is built from the new numerical sample composed of the initial one and the two new ones, with the previous bandwidth $h$,

• this last kernel smoothed PDF is truncated within the initial range $\left[min,max\right]$ (conditional PDF).
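
The mirroring steps listed above can be sketched as follows. This is only an illustration under stated assumptions: a standard normal kernel (so ${\sigma }_{K}=1$ in the Scott rule), the unbiased standard deviation estimator, and a midpoint rule to renormalise the truncated density. The function name and the `cells` parameter are hypothetical; OpenTURNS may proceed differently internally.

```python
import math
import statistics


def gaussian_kernel(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def mirrored_kde(sample, cells=2000):
    """Return a PDF on [min, max] built with the mirroring steps listed above."""
    n = len(sample)
    lo, hi = min(sample), max(sample)
    # Step 1: Scott bandwidth in dimension 1 for the normal kernel
    h = statistics.stdev(sample) * n ** (-0.2)
    # Steps 2-3: reflect the points lying within h of each boundary
    left = [2.0 * lo - x for x in sample if lo <= x < lo + h]
    right = [2.0 * hi - x for x in sample if hi - h < x <= hi]
    augmented = list(sample) + left + right
    m = len(augmented)

    # Step 4: kernel smoothed PDF of the augmented sample, same bandwidth h
    def raw(x):
        return sum(gaussian_kernel((x - xi) / h) for xi in augmented) / (m * h)

    # Step 5: truncate to [lo, hi] and renormalise (midpoint rule for the mass)
    width = (hi - lo) / cells
    mass = width * sum(raw(lo + (k + 0.5) * width) for k in range(cells))

    def pdf(x):
        return raw(x) / mass if lo <= x <= hi else 0.0

    return pdf


sample = [0.1, 0.4, 0.5, 0.9, 1.3, 2.0, 2.2, 3.1]
pdf = mirrored_kde(sample)
print(pdf(0.1), pdf(1.5), pdf(3.1), pdf(3.2))
```

The reflected points compensate for the kernel mass that would otherwise leak outside the range, so the corrected density does not drop artificially near the two boundaries.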

Implementation in OpenTURNS

The choice of kernel is free in OpenTURNS : it is possible to select any 1D distribution and to define it as a kernel. However, in order to optimize the efficiency of the kernel smoothing fitting (that is, to minimise the AMISE error), it is recommended to select a symmetric distribution for the kernel.

All the distribution default constructors of OpenTURNS create a symmetric default distribution when possible. It is also possible to work with the Epanechnikov kernel, which is a $Beta\left(r=2,t=4,a=-1,b=1\right)$.

The default kernel is a product of standard Normal distributions. The dimension of the product is automatically evaluated from the random sample.

The bandwidth $\underline{h}$ may be fixed by the User. However, it is recommended to let OpenTURNS evaluate it automatically from the numerical sample according to the following rules :

In dimension $d$, OpenTURNS automatically applies the Scott rule.

In dimension 1, the automatic bandwidth selection method depends on the size $n$ of the numerical sample. Indeed, the computational bottleneck is the evaluation of the estimators ${\stackrel{^}{\Phi }}_{r}$, which requires a double summation over the numerical sample, at a cost of $𝒪\left({n}^{2}\right)$.

• if $n\le 250$, the Solve-the-equation plug-in method is used on the entire numerical sample. The optimal bandwidth ${h}_{AMISE}\left(N\right)$ is first evaluated when considering a normal kernel, by solving equation (18) for the Normal kernel. Then relation (24) is applied in order to evaluate ${h}_{AMISE}\left(K\right)$.

• if $n>250$, the Solve-the-equation plug-in method is too computationally expensive. Then, OpenTURNS proceeds as follows :

1. OpenTURNS evaluates the bandwidth ${h}_{AMISE}^{{n}_{1},PI}\left(N\right)$ with the plug-in method applied on the first ${n}_{1}=250$ points of the numerical sample, using the Normal kernel $N$ (by solving equation (18) with $K=N$);

2. OpenTURNS evaluates the bandwidth ${h}^{{n}_{1},Silver}\left(N\right)$ with the Silverman rule applied on the first ${n}_{1}=250$ points of the numerical sample, using the Normal kernel $N$ (relation (6) with $K=N$);

3. OpenTURNS evaluates the bandwidth ${h}^{n,Silver}\left(N\right)$ with the Silverman rule applied on the entire numerical sample, using the Normal kernel $N$ (relation (6) with $K=N$);

4. Considering from relation (3) that :

 $\frac{{h}^{Silver}\left(K\right)}{{h}_{AMISE}\left(K\right)}={\left[\frac{{\Phi }_{4}\left(p\right)}{{\Phi }_{4}\left(p=normal\right)}\right]}^{\frac{1}{5}}$ (27)

which is independent of the size $n$, we have the final relation :

 ${h}_{AMISE}^{n,PI}\left(N\right)=\frac{{h}_{AMISE}^{{n}_{1},PI}\left(N\right)}{{h}^{{n}_{1},Silver}\left(N\right)}{h}^{n,Silver}\left(N\right)$ (28)

Then, if the User has chosen the kernel $K$ rather than the normal kernel $N$, relation (24) is used, which leads to :

 ${h}_{AMISE}^{n,PI}\left(K\right)=\frac{1}{{\sigma }_{K}}\frac{{h}_{AMISE}^{{n}_{1},PI}\left(N\right)}{{h}^{{n}_{1},Silver}\left(N\right)}{h}^{n,Silver}\left(N\right)$ (29)
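
The rescaling of relations (28) and (29) can be sketched as below. The plug-in solver itself is not reproduced: it is represented by a hypothetical callable `plugin_rule` that returns the bandwidth solving equation (18) for the normal kernel on a given sample. The Silverman bandwidth uses the coefficient ${\left(4/3\right)}^{1/5}$ of the normal kernel, with the unbiased standard deviation estimator as an assumption.

```python
import statistics


def silverman_normal(sample):
    """Relation (6) for the normal kernel: (4/3)^(1/5) * sigma_hat * n^(-1/5)."""
    n = len(sample)
    return (4.0 / 3.0) ** 0.2 * statistics.stdev(sample) * n ** (-0.2)


def large_sample_bandwidth(sample, plugin_rule, n1=250, sigma_kernel=1.0):
    """Relations (28)-(29): plug-in bandwidth on n1 points, rescaled to n points."""
    head = sample[:n1]
    # Ratio plug-in / Silverman, assumed independent of the sample size
    ratio = plugin_rule(head) / silverman_normal(head)
    return ratio * silverman_normal(sample) / sigma_kernel


def toy_plugin_rule(sub_sample):
    # Placeholder standing in for the solver of equation (18); NOT the real rule
    return 0.9 * silverman_normal(sub_sample)


sample = [((37 * k) % 101) / 10.0 for k in range(300)]
h_large = large_sample_bandwidth(sample, toy_plugin_rule)
print(h_large)
```

The expensive $𝒪\left({n}_{1}^{2}\right)$ computation is thus performed only on the first ${n}_{1}$ points, while the dependence on the full sample enters through the cheap Silverman bandwidth.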

Other notations

-

Link with OpenTURNS methodology

This kernel smoothing method can be used to estimate the probability density function :

• of the distribution of the input random variable (Step B of the global methodology),

• of the distribution of the output variable of interest (Step C of the global methodology).

References and theoretical basics

The following references give details on the method :

• Kernel Smoothing, M.P. Wand and M.C. Jones, Chapman & Hall/CRC edition, ISBN 0-412-55270-1.

• Multivariate Density Estimation: Theory, Practice, and Visualization, David W. Scott, Wiley edition.

Examples

Choice of the bandwidth $h$

This example illustrates the effect of the choice of the bandwidth $h$ on the estimation of the pdf compared to the optimal one. Depending on the choice of $h$, one could observe for the same size $N$ of input values over-smoothing effects or under-smoothing effects.

Oversmoothing effect

In this case, $h$ is bigger than the optimal choice ${h}_{opt}$. The effect of each value is spread more widely than in the optimal case.

Undersmoothing effect

In this case, $h$ is smaller than the optimal choice ${h}_{opt}$. The effect of the values is more locally focused on the values obtained in the data set than in the optimal case.

Optimal smoothing

In this case, $h$ is chosen following the Silverman rule given above, for a Gaussian distribution.

### 3.3.3 Step B  – Standard parametric models

Mathematical description

Objective

Parametric models aim to describe probability distributions of a random variable with the aid of a limited number of parameters $\underline{\theta }$. Therefore, in the case of continuous variables (i.e. where all possible values are continuous), this means that the probability density of $\underline{X}=\left({X}^{1},...,{X}^{{n}_{X}}\right)$ can be expressed as ${f}_{X}\left(\underline{x};\underline{\theta }\right)$. In the case of discrete variables (i.e. those which take only discrete values), their probabilities can be described in the form $ℙ\left(\underline{X}=\underline{x};\underline{\theta }\right)$.

The available distributions of OpenTURNS are listed in this section. We start with continuous distributions.

• Arcsine distribution: $\underline{\theta }=\left(a,b\right)$, with the constraint $a<b$. The probability density function writes:

 ${f}_{X}\left(x;\underline{\theta }\right)=\frac{1}{\pi \frac{b-a}{2}\sqrt{1-{\left(\frac{x-\frac{a+b}{2}}{\frac{b-a}{2}}\right)}^{2}}}$ (30)

The support is $\left[a,b\right]$.

• Beta distribution: Univariate distribution. $\underline{\theta }=\left(r,t,a,b\right)$, with the constraints $r>0$, $t>r$, $b>a$. The probability density function writes:

 ${f}_{X}\left(x;\underline{\theta }\right)=\frac{{\left(x-a\right)}^{r-1}{\left(b-x\right)}^{t-r-1}}{{\left(b-a\right)}^{t-1}B\left(r,t-r\right)}{\mathbf{1}}_{a\le x\le b}$ (31)

where $B$ denotes the Beta function. The support is $\left[a,b\right]$.

Note that the Epanechnikov distribution is a particular Beta distribution : $Beta\left(a=-1,b=1,r=2,t=4\right)$. It is useful within the kernel smoothing theory (see [kernel smoothing] ).
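
The identification of the Epanechnikov kernel with $Beta\left(r=2,t=4,a=-1,b=1\right)$ can be checked numerically. Substituting these parameters in relation (31) gives $\frac{\left(x+1\right)\left(1-x\right)}{{2}^{3}B\left(2,2\right)}=\frac{3}{4}\left(1-{x}^{2}\right)$ on $\left[-1,1\right]$, since $B\left(2,2\right)=1/6$. The sketch below compares both expressions; the function names are chosen for this illustration only.

```python
import math


def beta_pdf(x, r, t, a, b):
    """Density (31) of the Beta distribution in the (r, t, a, b) parametrization."""
    if not (r > 0 and t > r and b > a):
        raise ValueError("constraints r > 0, t > r and b > a are required")
    if x < a or x > b:
        return 0.0
    # Beta function written with the gamma function: B(r, t - r)
    beta_fn = math.gamma(r) * math.gamma(t - r) / math.gamma(t)
    return (x - a) ** (r - 1) * (b - x) ** (t - r - 1) / ((b - a) ** (t - 1) * beta_fn)


def epanechnikov_pdf(x):
    """Usual closed form of the Epanechnikov kernel."""
    return 0.75 * (1.0 - x * x) if abs(x) <= 1.0 else 0.0


for x in (-1.0, -0.5, 0.0, 0.3, 1.0):
    print(x, beta_pdf(x, 2, 4, -1, 1), epanechnikov_pdf(x))
```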

Figure 1: PDF of a Beta distribution.

Figure 2: PDF of a Beta distribution.

Figure 3: PDF of a Beta distribution.

Figure 4: PDF of a Beta distribution; PDF of an Epanechnikov distribution.

• Burr distribution: Univariate distribution. $\underline{\theta }=\left(c,k\right)$, with the constraints $c>0$, $k>0$. The probability density function writes:

 ${f}_{X}\left(x;\underline{\theta }\right)=ck\frac{{x}^{\left(c-1\right)}}{{\left(1+{x}^{c}\right)}^{\left(k+1\right)}}{\mathbf{1}}_{x>0}$ (32)

Figure 5: PDF of a Burr distribution.

• Chi: Univariate distribution. $\underline{\theta }=\nu$ with the constraint $\nu >0$. The probability density function writes:

 ${f}_{X}\left(x;\underline{\theta }\right)={x}^{\nu -1}{e}^{-{x}^{2}/2}\frac{{2}^{1-\nu /2}}{\Gamma \left(\nu /2\right)}{\mathbf{1}}_{[0,+\infty [}\left(x\right)$ (33)

Figure 6: PDF of a Chi distribution.

• ChiSquare: Univariate distribution. $\underline{\theta }=\nu$ with the constraint $\nu >0$. The probability density function writes:

 ${f}_{X}\left(x;\underline{\theta }\right)=\frac{{2}^{-\nu /2}}{\Gamma \left(\nu /2\right)}{x}^{\left(\nu /2-1\right)}{e}^{-x/2}{\mathbf{1}}_{[0,+\infty [}\left(x\right)$ (34)

The support is $[0,+\infty [$.

Figure 7: PDF of a Chi Square distribution.

Figure 8: PDF of a Chi Square distribution.

• Dirichlet distribution: Multivariate $d$-dimensional distribution. $\underline{\theta }=\left({\theta }_{1},\cdots ,{\theta }_{d+1}\right)$, with the constraints $d\ge 1$ and ${\theta }_{i}>0$. The probability density function writes:

 ${f}_{X}\left(\underline{x};\underline{\theta }\right)=\frac{\Gamma \left({\sum }_{j=1}^{d+1}{\theta }_{j}\right)}{{\prod }_{j=1}^{d+1}\Gamma \left({\theta }_{j}\right)}{\left[1-\sum _{j=1}^{d}{x}_{j}\right]}^{\left({\theta }_{d+1}-1\right)}\prod _{j=1}^{d}{x}_{j}^{\left({\theta }_{j}-1\right)}{\mathbf{1}}_{\Delta }\left(\underline{x}\right)$ (35)

with $\Delta =\left\{\underline{x}\in {ℝ}^{d}/\forall i,{x}_{i}\ge 0,{\sum }_{i=1}^{d}{x}_{i}\le 1\right\}$.

• Epanechnikov distribution: Univariate distribution. The Epanechnikov distribution is a particular Beta distribution : $Beta\left(a=-1,b=1,r=2,t=4\right)$. It is useful within the kernel smoothing theory (see [kernel smoothing] ). See Figure 4 for the graph of its pdf.

• Exponential distribution: $\underline{\theta }=\left(\lambda ,\gamma \right)$, with the constraint $\lambda >0$. The probability density function writes:

 ${f}_{X}\left(x;\underline{\theta }\right)=\lambda exp\left(-\lambda \left(x-\gamma \right)\right){\mathbf{1}}_{\gamma \le x}$ (36)

The support is $[\gamma ,+\infty [$, and the distribution is right skewed. The expected value of the distribution is $\gamma +1/\lambda$. The coefficient of variation (standard deviation / mean) is equal to $\frac{1}{1+\gamma \lambda }$ and does not depend on $\lambda$ if $\gamma =0$.

Figure 9: PDF of an Exponential distribution.

• Fisher-Snedecor distribution: $\underline{\theta }=\left({d}_{1},{d}_{2}\right)$, with the constraint ${d}_{i}>0$. The probability density function writes:

 ${f}_{X}\left(x;\underline{\theta }\right)=\frac{1}{xB\left({d}_{1}/2,{d}_{2}/2\right)}\left[{\left(\frac{{d}_{1}x}{{d}_{1}x+{d}_{2}}\right)}^{{d}_{1}/2}{\left(1-\frac{{d}_{1}x}{{d}_{1}x+{d}_{2}}\right)}^{{d}_{2}/2}\right]{\mathbf{1}}_{x\ge 0}$ (37)

The support is $[0,+\infty [$, and the distribution is right skewed.

Figure 10: PDF of a Fisher-Snedecor distribution.

• Gamma distribution: Univariate distribution. $\underline{\theta }=\left(\lambda ,k,\gamma \right)$, with the constraints $\lambda >0$, $k>0$. The probability density function writes:

 ${f}_{X}\left(x;\underline{\theta }\right)=\frac{\lambda }{\Gamma \left(k\right)}{\left(\lambda \left(x-\gamma \right)\right)}^{k-1}exp\left(-\lambda \left(x-\gamma \right)\right){\mathbf{1}}_{\gamma \le x}$ (38)

where $\Gamma$ is the gamma function. The support is $[\gamma ,+\infty [$, and the distribution is right skewed.

Figure 11: PDF of a Gamma distribution.

Figure 12: PDF of a Gamma distribution.

• Generalized Pareto distribution: Univariate distribution. $\underline{\theta }=\left(\xi ,\sigma \right)$, with the constraint $\sigma >0$. The cumulative distribution function writes:

 ${F}_{X}\left(x;\underline{\theta }\right)=\left\{\begin{array}{cc}1-{\left(1+\frac{\xi x}{\sigma }\right)}^{-1/\xi }\hfill & \text{if}\phantom{\rule{4.pt}{0ex}}\xi \ne 0\hfill \\ 1-exp\left(-\frac{x}{\sigma }\right)\hfill & \text{if}\phantom{\rule{4.pt}{0ex}}\xi =0\hfill \end{array}\right.$ (39)

The support is ${ℝ}^{+}$ if $\xi \ge 0$ and $\left[0,-\frac{\sigma }{\xi }\right]$ if $\xi <0$.
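
The two branches of relation (39) can be sketched as follows. The function returns 0 for $x\le 0$ and 1 beyond the point where $1+\xi x/\sigma$ reaches zero, which only happens when $\xi <0$. The function name is hypothetical and the sketch is not the OpenTURNS implementation.

```python
import math


def generalized_pareto_cdf(x, xi, sigma):
    """Cumulative distribution function (39) of the Generalized Pareto distribution."""
    if sigma <= 0:
        raise ValueError("the constraint sigma > 0 is required")
    if x <= 0:
        return 0.0
    if xi == 0:
        return 1.0 - math.exp(-x / sigma)
    base = 1.0 + xi * x / sigma
    if base <= 0:
        # Only possible for xi < 0: x is at or beyond the upper end of the support
        return 1.0
    return 1.0 - base ** (-1.0 / xi)


print(generalized_pareto_cdf(1.0, 0.0, 2.0))    # exponential branch
print(generalized_pareto_cdf(1.0, 1e-6, 2.0))   # close to the exponential branch
print(generalized_pareto_cdf(2.0, -0.5, 2.0))   # bounded support case
```

The second call illustrates that the branch $\xi \ne 0$ tends to the exponential branch as $\xi$ tends to 0, which is why the case $\xi =0$ is defined by continuity.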

Figure 13: PDF of a Generalized Pareto distribution.

• Gumbel distribution: Univariate distribution. $\underline{\theta }=\left(\alpha ,\beta \right)$, with the constraint $\alpha >0$. The probability density function writes:

 ${f}_{X}\left(x;\underline{\theta }\right)=\alpha exp\left(-\alpha \left(x-\beta \right)-{e}^{-\alpha \left(x-\beta \right)}\right)$ (40)

The support is $ℝ$. $\beta$ describes the most likely value, but this is less than the expected value of the distribution because the distribution is asymmetric (right skewed): the probability values in the distribution's right tail (i.e. values greater than $\beta$) decrease more gradually than those in the left tail (i.e. values less than $\beta$). $\alpha$ provides a measure of dispersion: the probability density function flattens as $\alpha$ decreases.

Figure 14: PDF of a Gumbel distribution.

• Histogram distribution: Univariate distribution. $\underline{\theta }=\left({\left({l}_{i},{h}_{i}\right)}_{i}\right)$, with the constraint ${h}_{i}>0$ and ${l}_{i}>0$. The probability density function writes:

 ${f}_{X}\left(x;\underline{\theta }\right)=\sum _{i=1}^{n}{H}_{i}\phantom{\rule{0.277778em}{0ex}}{1}_{\left[{x}_{i},{x}_{i+1}\right]}\left(x\right)$ (41)

where

• ${H}_{i}={h}_{i}/S$ are the normalized heights, with $S={\sum }_{i=1}^{n}{h}_{i}\phantom{\rule{0.166667em}{0ex}}{l}_{i}$ being the initial surface of the histogram.

• ${l}_{i}={x}_{i+1}-{x}_{i}$, $1\le i\le n$

• $n$ is the size of the HistogramPairCollection
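
The normalization of relation (41) can be sketched as follows. The left bound of the first class is not specified above, so it is passed as a hypothetical parameter `first`; the classes are also treated as half-open intervals so that each point belongs to a single class, which is a choice made for this illustration.

```python
def histogram_pdf(first, widths, heights):
    """Density (41) from class widths l_i and heights h_i, starting at `first`."""
    if any(l <= 0 for l in widths) or any(h <= 0 for h in heights):
        raise ValueError("widths and heights must be positive")
    # Initial surface S of the histogram and normalized heights H_i = h_i / S
    surface = sum(l * h for l, h in zip(widths, heights))
    normalized = [h / surface for h in heights]
    # Class bounds x_1, ..., x_{n+1} with l_i = x_{i+1} - x_i
    edges = [first]
    for l in widths:
        edges.append(edges[-1] + l)

    def pdf(x):
        for i, height in enumerate(normalized):
            if edges[i] <= x < edges[i + 1]:
                return height
        return 0.0

    return pdf


pdf = histogram_pdf(0.0, widths=[1.0, 2.0, 1.0], heights=[2.0, 1.0, 3.0])
print(pdf(0.5), pdf(2.0), pdf(3.5), pdf(-1.0))
```

In this example $S=2+2+3=7$, so the normalized heights are $2/7$, $1/7$ and $3/7$, and the area under the density is 1.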

Figure 15: PDF of a Histogram distribution.

• Inverse ChiSquare distribution: Univariate distribution. $\underline{\theta }=\left(\nu \right)$, with the constraint $\nu >0$. The probability density function writes:

 ${f}_{X}\left(x;\underline{\theta }\right)=\frac{exp\left(-\frac{1}{2x}\right)}{\Gamma \left(\frac{\nu }{2}\right){2}^{\frac{\nu }{2}}{x}^{\frac{\nu }{2}+1}}{\mathbf{1}}_{x>0}$ (42)

An Inverse ChiSquare distribution with parameter $\nu$ is exactly the InverseGamma distribution with parameters $\left(\frac{\nu }{2},2\right)$.

Figure 16: PDF of an Inverse ChiSquare distribution.

• Inverse Gamma distribution: Univariate distribution. $\underline{\theta }=\left(k,\lambda \right)$, with the constraint $k>0$ and $\lambda >0$. The probability density function writes:

 ${f}_{X}\left(x;\underline{\theta }\right)=\frac{exp\left(-\frac{1}{\lambda x}\right)}{\Gamma \left(k\right){\lambda }^{k}{x}^{k+1}}{\mathbf{1}}_{x>0}$ (43)
 PDF of an Inverse Gamma distribution.
Figure 17
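The equivalence between the Inverse ChiSquare distribution with parameter $\nu$ and the InverseGamma distribution with parameters $\left(\frac{\nu }{2},2\right)$ can be checked by evaluating both densities. This is a sketch in plain Python under stated assumptions: the function names are illustrative, it does not call OpenTURNS, and density (42) is written with the constant ${2}^{\nu /2}$ that the equivalence implies.

```python
import math

def inverse_gamma_pdf(x, k, lam):
    """Density (43) of the InverseGamma(k, lambda) distribution."""
    if x <= 0.0:
        return 0.0
    return math.exp(-1.0 / (lam * x)) / (math.gamma(k) * lam**k * x**(k + 1))

def inverse_chi_square_pdf(x, nu):
    """Density (42) of the Inverse ChiSquare(nu) distribution."""
    if x <= 0.0:
        return 0.0
    half = 0.5 * nu
    return math.exp(-0.5 / x) / (math.gamma(half) * 2.0**half * x**(half + 1))

nu = 5.0  # illustrative value
points = [0.05, 0.2, 1.0, 3.5]
max_gap = max(
    abs(inverse_chi_square_pdf(x, nu) - inverse_gamma_pdf(x, 0.5 * nu, 2.0))
    for x in points
)
```

The largest difference `max_gap` is at the level of floating point rounding, as expected if the two parameterizations describe the same distribution.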
• Inverse Normal distribution: Univariate distribution. $\underline{\theta }=\left(\lambda ,\mu \right)$, with the constraint $\lambda >0$ and $\mu >0$. The probability density function writes:

 ${f}_{X}\left(x;\underline{\theta }\right)={\left(\frac{\lambda }{2\pi {x}^{3}}\right)}^{1/2}{e}^{-\lambda {\left(x-\mu \right)}^{2}/\left(2{\mu }^{2}x\right)}{\mathbf{1}}_{x>0}$ (44)
 PDF of an Inverse Normal distribution.
Figure 18
• Inverse Wishart distribution: Multivariate distribution. $\underline{\theta }=\left(\underline{\underline{V}},\nu \right)$ where $\underline{\underline{V}}$ is a symmetric positive definite matrix of dimension $p$ and $\nu >p-1$. The probability density function writes:

 ${f}_{\underline{x}}\left(\underline{x};\underline{\theta }\right)=\frac{|\underline{\underline{V}}{|}^{\frac{\nu }{2}}{e}^{-\frac{\mathrm{tr}{\left(\underline{\underline{V}}m{\left(\underline{x}\right)}^{-1}\right)}^{\phantom{\left(}}}{2}}}{{2}^{\frac{\nu p}{2}}{|m\left(\underline{x}\right)|}^{\frac{\nu +p+1}{2}}{\Gamma }_{p}{\left(\frac{\nu }{2}\right)}_{\phantom{\left(}}}{\mathbf{1}}_{{ℳ}_{p}^{+}\left(ℝ\right)}\left(m\left(\underline{x}\right)\right)$ (45)

where $\underline{x}\in {ℝ}^{\frac{p\left(p+1\right)}{2}}$, ${ℳ}_{p}^{+}\left(ℝ\right)$ is the set of symmetric positive matrices of dimension $p$ and $m:{ℝ}^{\frac{p\left(p+1\right)}{2}}\to {ℳ}_{p}^{+}\left(ℝ\right)$ is given by:

 $\begin{array}{c}\hfill m\left(\underline{x}\right)=\left(\begin{array}{cccc}{x}_{1}& {x}_{2}& \cdots & {x}_{1+p\left(p-1\right)/2}\\ {x}_{2}& {x}_{3}& & ⋮\\ ⋮& & \ddots & ⋮\\ {x}_{1+p\left(p-1\right)/2}& \cdots & \cdots & {x}_{p\left(p+1\right)/2}\end{array}\right)\end{array}$ (46)
• Laplace distribution: Univariate distribution. $\underline{\theta }=\left(\lambda ,\mu \right)$, with the constraint $\lambda >0$. The probability density function writes:

 ${f}_{X}\left(x;\underline{\theta }\right)=\frac{\lambda }{2}{e}^{-\lambda |x-\mu |}$ (47)

The Laplace distribution is the generalisation of the Exponential distribution to the range $ℝ$.

 PDF of a Laplace distribution.
Figure 19
• Logistic distribution: Univariate distribution. $\underline{\theta }=\left(\alpha ,\beta \right)$, with the constraint $\beta >0$. The probability density function writes:

 ${f}_{X}\left(x;\underline{\theta }\right)=\frac{exp\left(-\frac{x-\alpha }{\beta }\right)}{\beta {\left[1+exp\left(-\frac{x-\alpha }{\beta }\right)\right]}^{2}}$ (48)

The support is $ℝ$. $\alpha$ describes the most likely value. $\beta$ provides a measure of dispersion: the probability density function flattens as $\beta$ increases.

 PDF of a logistic distribution.
Figure 20
• LogNormal distribution: Univariate distribution. $\underline{\theta }=\left({\mu }_{\ell },{\sigma }_{\ell },\gamma \right)$, with the constraint ${\sigma }_{\ell }>0$. The probability density function writes:

 ${f}_{X}\left(x;\underline{\theta }\right)=\frac{1}{{\sigma }_{\ell }\left(x-\gamma \right)\sqrt{2\pi }}exp\left(-\frac{1}{2}{\left(\frac{\mathrm{ln}\left(x-\gamma \right)-{\mu }_{\ell }}{{\sigma }_{\ell }}\right)}^{2}\right){\mathbf{1}}_{\gamma \le x}$ (49)

The support is $\left[\gamma ,+\infty \left[$, and the distribution is right skewed.

 PDF of a LogNormal distribution.
Figure 21
• LogUniform distribution: Univariate distribution. $\underline{\theta }=\left({a}_{\ell },{b}_{\ell }\right)$, with the constraint ${b}_{\ell }>{a}_{\ell }$. The probability density function writes:

 ${f}_{X}\left(x;\underline{\theta }\right)=\frac{1}{x\left({b}_{\ell }-{a}_{\ell }\right)}{\mathbf{1}}_{{a}_{\ell }\le log\left(x\right)\le {b}_{\ell }}$ (50)

The support is $\left[exp\left({a}_{\ell }\right),exp\left({b}_{\ell }\right)\right]$, and the distribution is right skewed.

 PDF of a LogUniform distribution.
Figure 22
• Maximum entropy statistics distribution: Multivariate distribution, parametrized by $d$ marginals, with bounds $a,b$ verifying ${a}_{i}\le {a}_{i+1}$ and ${b}_{i}\le {b}_{i+1}$. The probability density function writes:

 ${f}_{X}\left(x\right)={f}_{1}\left({x}_{1}\right)\prod _{k=2}^{d}{\phi }_{k}\left({x}_{k}\right)exp\left(-{\int }_{{x}_{k-1}}^{{x}_{k}}{\phi }_{k}\left(s\right)ds\right){\mathbf{1}}_{{x}_{1}\le \cdots \le {x}_{d}}$ (51)

with

 ${\phi }_{k}\left({x}_{k}\right)=\frac{{f}_{k}\left({x}_{k}\right)}{{F}_{k-1}\left({x}_{k}\right)-{F}_{k}\left({x}_{k}\right)}$ (52)
• Meixner distribution: Univariate distribution. $\underline{\theta }=\left(\alpha ,\beta ,\delta ,\mu \right)$, with the constraint $\alpha >0$, $\beta \in \left]-\pi ,\pi \right[$ and $\delta >0$. The probability density function writes:

 ${f}_{X}\left(x;\underline{\theta }\right)=\frac{{\left[2cos\left(\beta /2\right)\right]}^{2\delta }}{2\alpha \pi \Gamma \left(2\delta \right)}{e}^{\frac{\beta \left(x-\mu \right)}{\alpha }}{|\Gamma \left(\delta +i\frac{x-\mu }{\alpha }\right)|}^{2}$ (53)

where ${i}^{2}=-1$. The support is $ℝ$.

 PDF of a MeixnerDistribution distribution.
Figure 23
• Non Central Chi Square: Univariate distribution. $\underline{\theta }=\left(\nu ,\lambda \right)$, with the constraint $\nu >0$ and $\lambda \ge 0$. The probability density function writes:

 ${f}_{X}\left(x;\underline{\theta }\right)=\sum _{j=0}^{\infty }{e}^{-\lambda }\frac{{\lambda }^{j}}{j!}{p}_{{\chi }^{2}\left(\nu +2j\right)}\left(x\right)$ (54)

where ${p}_{{\chi }^{2}\left(q\right)}$ is the probability density function of a ${\chi }^{2}\left(q\right)$ random variate.

 PDF of a Non Central Chi Square distribution.
Figure 24
• Non Central Student: Univariate distribution. $\underline{\theta }=\left(\nu ,\delta ,\gamma \right)$. Let's note that a random variable $X$ is said to have a standard non-central student distribution $𝒯\left(\nu ,\delta \right)$ if it can be written as:

 $X=\frac{N}{\sqrt{C/\nu }}$ (55)

where $N$ has the normal distribution $𝒩\left(\delta ,1\right)$ and $C$ has the ${\chi }^{2}\left(\nu \right)$ distribution, $N$ and $C$ being independent.

The non-central Student distribution in OpenTURNS has an additional parameter $\gamma$ such that the random variable $X$ is said to have a non-central Student distribution $𝒯\left(\nu ,\delta ,\gamma \right)$ if $X-\gamma$ has a standard $𝒯\left(\nu ,\delta \right)$ distribution.

The probability density function of the Non Central Student distribution writes:

 ${p}_{NCS}\left(x\right)=\frac{exp\left(-{\delta }^{2}/2\right)}{\sqrt{\nu \pi }\Gamma \left(\nu /2\right)}{\left(\frac{\nu }{\nu +{\left(x-\gamma \right)}^{2}}\right)}^{\left(\nu +1\right)/2}\sum _{j=0}^{\infty }\frac{\Gamma \left(\frac{\nu +j+1}{2}\right)}{\Gamma \left(j+1\right)}{\left(\delta \left(x-\gamma \right)\sqrt{\frac{2}{\nu +{\left(x-\gamma \right)}^{2}}}\right)}^{j}$ (56)
 PDF of a Non Central Student distribution.
Figure 25
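The constructive definition (55) translates directly into a sampling procedure. The sketch below uses only the Python standard library and is not the OpenTURNS implementation; the function name, the seed and the parameter values are illustrative assumptions. The closed form used for the mean, $\gamma +\delta \sqrt{\nu /2}\,\Gamma \left(\frac{\nu -1}{2}\right)/\Gamma \left(\frac{\nu }{2}\right)$ for $\nu >1$, is a standard result that is not stated in this section.

```python
import math
import random

def sample_non_central_student(nu, delta, gamma, rng):
    """Draw gamma + N / sqrt(C / nu), with N ~ Normal(delta, 1) and C ~ ChiSquare(nu)."""
    n = rng.gauss(delta, 1.0)
    c = rng.gammavariate(0.5 * nu, 2.0)  # ChiSquare(nu) = Gamma(shape nu/2, scale 2)
    return gamma + n / math.sqrt(c / nu)

nu, delta, gamma = 10.0, 1.0, 3.0  # illustrative values
rng = random.Random(42)
size = 100_000
sample = [sample_non_central_student(nu, delta, gamma, rng) for _ in range(size)]
mc_mean = sum(sample) / size

# Standard closed form of the mean (valid for nu > 1), shifted by gamma
exact_mean = gamma + delta * math.sqrt(0.5 * nu) * math.gamma(0.5 * (nu - 1.0)) / math.gamma(0.5 * nu)
```

The Monte Carlo mean should agree with the closed form up to sampling error.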
• Normal Gamma: Bivariate distribution. $\underline{\theta }=\left(\mu ,\kappa ,\alpha ,\beta \right)$.

The Normal Gamma distribution is the distribution of the random vector $\left(X,Y\right)$ where $Y$ follows the distribution $\Gamma \left(\alpha ,\beta \right)$ with $\alpha >0$ and $\beta >0$, $X|Y$ follows the distribution $𝒩\left(\mu ,\frac{1}{\sqrt{\kappa Y}}\right)$.

The probability density function of the Normal Gamma distribution writes:

 ${p}_{NG}\left(x,y\right)=\frac{{\beta }^{\alpha }}{\Gamma \left(\alpha \right)}\sqrt{\frac{\kappa }{2\pi }}{y}^{\alpha -1/2}exp\left(-\frac{y}{2}\left[\kappa {\left(x-\mu \right)}^{2}+2\beta \right]\right)$ (57)
• Normal distribution (or Gaussian distribution): Multivariate $n$-dimensional distribution. In the case $n=1$, $\underline{\theta }=\left(\mu ,\sigma \right)$, with the constraint $\sigma >0$. The probability density is given as:

 ${f}_{X}\left(x;\underline{\theta }\right)=\frac{1}{\sigma \sqrt{2\pi }}exp\left(-\frac{1}{2}{\left(\frac{x-\mu }{\sigma }\right)}^{2}\right)$ (58)

The support is $ℝ$. $\mu$ provides the most likely value (for which the probability density function is at its highest), and the density function is symmetric around this value (the values $\mu -a$ and $\mu +a$ are equally likely); $\mu$ is also the expected value (mean) of this distribution. $\sigma$ provides a measure of dispersion: the larger it is, the flatter the probability density function is (i.e. values far away from $\mu$ are still likely, or in other words possible values are more spread out).

 PDF of a Normal distribution.
Figure 26

In dimension $n>1$, the Multi-Normal Distribution (or Multivariate Normal Distribution) writes :

 ${f}_{X}\left(x;\underline{\theta }\right)=\frac{1}{{\left(2\pi \right)}^{\frac{n}{2}}{\left(\mathrm{det}\underline{\underline{\Sigma }}\right)}^{\frac{1}{2}}}{e}^{-\frac{1}{2}{\left(\underline{x}-\underline{\mu }\right)}^{t}{\underline{\underline{\Sigma }}}^{-1}\left(\underline{x}-\underline{\mu }\right)}$ (59)

where $\underline{\underline{\Sigma }}={\underline{\underline{\Lambda }}}_{\underline{\sigma }}\underline{\underline{R}}{\underline{\underline{\Lambda }}}_{\underline{\sigma }}$, ${\underline{\underline{\Lambda }}}_{\underline{\sigma }}=\mathrm{diag}\left(\underline{\sigma }\right)$, $\underline{\underline{R}}$ SPD, ${\sigma }_{i}>0$. The distribution is parameterized by $\left(\underline{\mu },\underline{\sigma },\underline{\underline{R}}\right)$ or $\left(\underline{\mu },\underline{\underline{\Sigma }}\right)$.
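The two parameterizations are linked by ${\Sigma }_{ij}={\sigma }_{i}{R}_{ij}{\sigma }_{j}$. The following sketch, in plain Python with illustrative names and values that are assumptions of this example, builds $\underline{\underline{\Sigma }}$ from $\left(\underline{\sigma },\underline{\underline{R}}\right)$ and evaluates density (59) for $n=2$ with the inverse of the $2×2$ matrix written out explicitly.

```python
import math

def covariance_from_correlation(sigma, R):
    """Sigma = diag(sigma) * R * diag(sigma), i.e. Sigma_ij = sigma_i * R_ij * sigma_j."""
    n = len(sigma)
    return [[sigma[i] * R[i][j] * sigma[j] for j in range(n)] for i in range(n)]

def binormal_pdf(x, mu, Sigma):
    """Density (59) for n = 2."""
    (a, b), (c, d) = Sigma
    det = a * d - b * c
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    # Quadratic form (x - mu)^t Sigma^{-1} (x - mu) for a 2x2 matrix
    quad = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))

sigma = [2.0, 0.5]                 # illustrative standard deviations
R = [[1.0, 0.3], [0.3, 1.0]]       # illustrative correlation matrix
Sigma = covariance_from_correlation(sigma, R)
peak = binormal_pdf([1.0, -1.0], [1.0, -1.0], Sigma)
```

At $\underline{x}=\underline{\mu }$ the quadratic form vanishes, so `peak` reduces to $1/\left(2\pi \sqrt{\mathrm{det}\underline{\underline{\Sigma }}}\right)$.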

• Random Mixture distribution: Univariate distribution. A Random Mixture $Y$ is defined as an affine combination of random variables ${X}_{i}$ as follows:

 $Y={a}_{0}+\sum _{i=1}^{n}{a}_{i}{X}_{i}$ (60)

where ${\left({a}_{i}\right)}_{0\le i\le n}\in {ℝ}^{n+1}$ and ${\left({X}_{i}\right)}_{1\le i\le n}$ are some independent univariate distributions.

For example,

 $Y=2+5{X}_{1}+{X}_{2}$ (61)

where :

• ${X}_{1}$ follows a $ℰ\left(\lambda =1.5\right)$,

• ${X}_{2}$ follows a $𝒩\left(\mu =4,Variance=1\right)$.

The pdf and cdf of this distribution are drawn in Figure 27.

 Probability density function of a Random Mixture. Cumulative distribution function of a Random Mixture.
Figure 27
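Because the ${X}_{i}$ are independent, the mean and variance of a Random Mixture follow from those of its components: $𝔼\left[Y\right]={a}_{0}+\sum {a}_{i}𝔼\left[{X}_{i}\right]$ and $\mathrm{Var}\left[Y\right]=\sum {a}_{i}^{2}\mathrm{Var}\left[{X}_{i}\right]$. The sketch below applies this to example (61), reading the second component as the Normal variable with mean 4 and variance 1, and cross-checks the result by simulation. It uses only the Python standard library; the seed and sample size are arbitrary assumptions.

```python
import random

# Example (61): Y = 2 + 5*X1 + X2, X1 ~ Exponential(lambda=1.5), X2 ~ Normal(mu=4, variance=1)
a0, a1, a2 = 2.0, 5.0, 1.0
lam, mu, var2 = 1.5, 4.0, 1.0

# Exact moments, using the independence of X1 and X2
mean_y = a0 + a1 * (1.0 / lam) + a2 * mu
var_y = a1**2 * (1.0 / lam**2) + a2**2 * var2

# Monte Carlo cross-check
rng = random.Random(0)
n = 200_000
sample = [a0 + a1 * rng.expovariate(lam) + a2 * rng.gauss(mu, var2**0.5) for _ in range(n)]
mc_mean = sum(sample) / n
mc_var = sum((y - mc_mean) ** 2 for y in sample) / (n - 1)
```

Only the first two moments are obtained this way; the full pdf and cdf of $Y$ require a convolution of the component distributions, which this sketch does not attempt.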
• Rice distribution: Univariate distribution. $\underline{\theta }=\left(\sigma ,\nu \right)$, with the constraint $\nu \ge 0$ and $\sigma >0$. The probability density is given as:

 ${f}_{X}\left(x;\underline{\theta }\right)=2\frac{x}{{\sigma }^{2}}{p}_{{\chi }^{2}\left(2,\frac{{\nu }^{2}}{{\sigma }^{2}}\right)}\left(\frac{{x}^{2}}{{\sigma }^{2}}\right)$ (62)

where ${p}_{{\chi }^{2}\left(\nu ,\lambda \right)}$ is the probability density function of a Non Central Chi Square distribution.

 Probability density function of a Rice distribution. Cumulative distribution function of a Rice distribution.
Figure 28
• Rayleigh distribution: Univariate distribution. $\underline{\theta }=\left(\sigma ,\gamma \right)$, with the constraint $\sigma >0$. The probability density is given as:

 ${f}_{X}\left(x;\underline{\theta }\right)=\frac{\left(x-\gamma \right)}{{\sigma }^{2}}{e}^{-\frac{{\left(x-\gamma \right)}^{2}}{2{\sigma }^{2}}}{1}_{\left[\gamma ,+\infty \left[}\left(x\right)$ (63)
 PDF of a Rayleigh distribution.
Figure 29
• Student distribution: Multivariate $d$-dimensional distribution. $\underline{\theta }=\left(\nu ,\underline{\mu },\underline{\sigma },\underline{\underline{R}}\right)$, with the constraint $\nu >2$. In dimension $d$, the Student distribution has the following probability density function:

 ${p}_{T}\left(\underline{x}\right)=\frac{\Gamma \left(\frac{\nu +d}{2}\right)}{{\left(\pi \nu \right)}^{\frac{d}{2}}\Gamma \left(\frac{\nu }{2}\right)}\frac{{\left|\mathrm{det}\left(\underline{\underline{R}}\right)\right|}^{-1/2}}{{\prod }_{k=1}^{d}{\sigma }_{k}}{\left(1+\frac{{\underline{z}}^{t}{\underline{\underline{R}}}^{-1}\underline{z}}{\nu }\right)}^{-\frac{\nu +d}{2}}$ (64)

where $\underline{z}={\underline{\underline{\Delta }}}^{-1}\left(\underline{x}-\underline{\mu }\right)$ with $\underline{\underline{\Delta }}=\underline{\underline{\mathrm{diag}}}\left(\underline{\sigma }\right)$.

In dimension $d=1$, $\underline{\theta }=\left(\nu ,\mu ,\sigma \right)$ and the probability density function writes:

 ${p}_{T}\left(x\right)=\frac{\Gamma \left(\frac{\nu +1}{2}\right)}{\sqrt{\pi \nu }\Gamma \left(\frac{\nu }{2}\right)}\frac{1}{\sigma }{\left(1+\frac{{\left(x-\mu \right)}^{2}}{\nu {\sigma }^{2}}\right)}^{-\frac{\nu +1}{2}}$ (65)

The parameter $\mu$ describes the most likely value. $\nu$ is a measure of dispersion: the probability density function flattens as $\nu$ decreases.

 PDF of a Student distribution.
Figure 30
• Trapezoidal distribution: $\underline{\theta }=\left(a,b,c,d\right)$, with the constraint $a\le b<c\le d$. The probability density function writes:

 ${f}_{X}\left(x;\underline{\theta }\right)=\left\{\begin{array}{cc}h\frac{x-a}{b-a}\hfill & \mathrm{if}\phantom{\rule{4pt}{0ex}}a\le x\le b\hfill \\ h\hfill & \mathrm{if}\phantom{\rule{4pt}{0ex}}b\le x\le c\hfill \\ h\frac{d-x}{d-c}\hfill & \mathrm{if}\phantom{\rule{4pt}{0ex}}c\le x\le d\hfill \\ 0\hfill & \mathrm{otherwise}\hfill \end{array}\right.$ (66)

with $h=\frac{2}{d+c-a-b}$

The support is $\left[a,d\right]$.
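The value of $h$ is the one that makes the three pieces of (66) integrate to 1. A short numerical check is sketched below in plain Python; the function name and the parameter values are illustrative assumptions, and the sketch assumes $a<b<c<d$ so that neither slope is degenerate.

```python
def trapezoidal_pdf(x, a, b, c, d):
    """Density (66); assumes a < b < c < d so that no slope is degenerate."""
    h = 2.0 / (d + c - a - b)
    if a <= x < b:
        return h * (x - a) / (b - a)
    if b <= x <= c:
        return h
    if c < x <= d:
        return h * (d - x) / (d - c)
    return 0.0

a, b, c, d = 0.0, 1.0, 3.0, 4.0  # illustrative values
n = 40_000
step = (d - a) / n
# Midpoint rule over the support [a, d]
area = sum(trapezoidal_pdf(a + (i + 0.5) * step, a, b, c, d) for i in range(n)) * step
```

Here $h=1/3$, and the three pieces contribute $1/6$, $2/3$ and $1/6$ to the total area.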

• Triangular distribution: Univariate distribution. $\underline{\theta }=\left(a,b,m\right)$, with the constraints $a\le m$, $m\le b$, $b>a$. The probability density function writes:

 ${f}_{X}\left(x;\underline{\theta }\right)=\left\{\begin{array}{cc}2\frac{x-a}{\left(m-a\right)\left(b-a\right)}\hfill & \mathrm{if}\phantom{\rule{4pt}{0ex}}a\le x\le m\hfill \\ 2\frac{b-x}{\left(b-m\right)\left(b-a\right)}\hfill & \mathrm{if}\phantom{\rule{4pt}{0ex}}m\le x\le b\hfill \\ 0\hfill & \mathrm{otherwise}\hfill \end{array}\right.$ (67)

The support is $\left[a,b\right]$. $m$ describes the most likely value.

 PDF of a Triangular distribution.
Figure 31
• Truncated Normal Distribution: Univariate distribution. $\underline{\theta }=\left({\mu }_{n},{\sigma }_{n},a,b\right)$, with the constraints ${\sigma }_{n}>0$, $b>a$. The probability density function writes:

 ${f}_{X}\left(x;\underline{\theta }\right)=\frac{\varphi \left(\frac{x-{\mu }_{n}}{{\sigma }_{n}}\right)/{\sigma }_{n}}{\Phi \left(\frac{b-{\mu }_{n}}{{\sigma }_{n}}\right)-\Phi \left(\frac{a-{\mu }_{n}}{{\sigma }_{n}}\right)}{\mathbf{1}}_{a\le x\le b}$ (68)

where $\varphi$ and $\Phi$ represent the probability density function and the cumulative distribution function, respectively, of the standard Normal distribution (mean 0 and standard deviation 1). The support is $\left[a,b\right]$. ${\mu }_{n}$ is the most likely value when it lies in $\left[a,b\right]$, and ${\sigma }_{n}$ provides a measure of dispersion: the probability density function flattens as ${\sigma }_{n}$ increases (the probability density is zero for values outside the interval $\left[a,b\right]$).

 PDF of a TruncatedNormal distribution.
Figure 32
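Density (68) is the Normal density restricted to $\left[a,b\right]$ and divided by the probability mass that the untruncated Normal distribution assigns to that interval. The sketch below, in plain Python with illustrative names and values (assumptions of this example, not OpenTURNS calls), evaluates it using `math.erf` for $\Phi$ and checks the normalization by a midpoint rule.

```python
import math

def std_normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def std_normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def truncated_normal_pdf(x, mu_n, sigma_n, a, b):
    """Density (68): Normal density renormalized by the mass of [a, b]."""
    if x < a or x > b:
        return 0.0
    mass = std_normal_cdf((b - mu_n) / sigma_n) - std_normal_cdf((a - mu_n) / sigma_n)
    return std_normal_pdf((x - mu_n) / sigma_n) / sigma_n / mass

mu_n, sigma_n, a, b = 0.5, 1.0, 0.0, 2.0  # illustrative values
n = 20_000
step = (b - a) / n
area = sum(
    truncated_normal_pdf(a + (i + 0.5) * step, mu_n, sigma_n, a, b) for i in range(n)
) * step
```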
• Uniform distribution: Univariate distribution. $\underline{\theta }=\left(a,b\right)$, with the constraint $a<b$. The probability density function writes:

 ${f}_{X}\left(x;\underline{\theta }\right)=\frac{1}{b-a}{\mathbf{1}}_{a\le x\le b}$ (69)

The support is $\left[a,b\right]$. All values in this interval are equally-likely.

 PDF of a Uniform distribution.
Figure 33
• Weibull distribution: Univariate distribution. $\underline{\theta }=\left(\alpha ,\beta ,\gamma \right)$, with the constraints $\alpha >0$, $\beta >0$. The probability density function writes:

 ${f}_{X}\left(x;\underline{\theta }\right)=\frac{\beta }{\alpha }{\left(\frac{x-\gamma }{\alpha }\right)}^{\beta -1}exp\left(-{\left(\frac{x-\gamma }{\alpha }\right)}^{\beta }\right){\mathbf{1}}_{\gamma \le x}$ (70)

The support is $\left[\gamma ,+\infty \left[$, and the distribution is right skewed. Both $\alpha$ and $\beta$ influence the dispersion. We note that the distribution becomes more skewed as $\beta$ decreases. The case $\beta =1$ corresponds to the Exponential distribution.

 PDF of a Weibull distribution.
Figure 34
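The special case $\beta =1$ can be verified by comparing density (70) with an Exponential density of rate $1/\alpha$ shifted by $\gamma$. This is a sketch in plain Python; the function names and the parameter values are illustrative assumptions.

```python
import math

def weibull_pdf(x, alpha, beta, gamma):
    """Density (70) of the Weibull(alpha, beta, gamma) distribution."""
    if x < gamma:
        return 0.0
    z = (x - gamma) / alpha
    return (beta / alpha) * z ** (beta - 1.0) * math.exp(-(z ** beta))

def shifted_exponential_pdf(x, rate, gamma):
    """Exponential density with the given rate, shifted to start at gamma."""
    if x < gamma:
        return 0.0
    return rate * math.exp(-rate * (x - gamma))

alpha, gamma = 2.0, 1.0  # illustrative values
points = [1.0, 1.5, 3.0, 10.0]
max_gap = max(
    abs(weibull_pdf(x, alpha, 1.0, gamma) - shifted_exponential_pdf(x, 1.0 / alpha, gamma))
    for x in points
)
```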
• Wishart distribution: Multivariate distribution. $\underline{\theta }=\left(\underline{\underline{V}},\nu \right)$ where $\underline{\underline{V}}$ is a symmetric positive definite matrix of dimension $p$ and $\nu >p-1$. The probability density function writes:

 ${f}_{\underline{x}}\left(\underline{x};\underline{\theta }\right)=\frac{|m\left(\underline{x}\right){|}^{\frac{\nu -p-1}{2}}{e}^{-\frac{\mathrm{tr}{\left({\underline{\underline{V}}}^{-1}m\left(\underline{x}\right)\right)}^{\phantom{\left(}}}{2}}}{{2}^{\frac{\nu p}{2}}{|\underline{\underline{V}}|}^{\frac{\nu }{2}}{\Gamma }_{p}{\left(\frac{\nu }{2}\right)}_{\phantom{\left(}}}{\mathbf{1}}_{{ℳ}_{p}^{+}\left(ℝ\right)}\left(m\left(\underline{x}\right)\right)$ (71)

where $\underline{x}\in {ℝ}^{\frac{p\left(p+1\right)}{2}}$, ${ℳ}_{p}^{+}\left(ℝ\right)$ is the set of symmetric positive matrices of dimension $p$ and $m:{ℝ}^{\frac{p\left(p+1\right)}{2}}\to {ℳ}_{p}^{+}\left(ℝ\right)$ is given by:

 $\begin{array}{c}\hfill m\left(\underline{x}\right)=\left(\begin{array}{cccc}{x}_{1}& {x}_{2}& \cdots & {x}_{1+p\left(p-1\right)/2}\\ {x}_{2}& {x}_{3}& & ⋮\\ ⋮& & \ddots & ⋮\\ {x}_{1+p\left(p-1\right)/2}& \cdots & \cdots & {x}_{p\left(p+1\right)/2}\end{array}\right)\end{array}$ (72)

OpenTURNS also proposes some Discrete Distributions.

• Bernoulli distribution: Univariate distribution. $\underline{\theta }=p$, with the constraint $0<p<1$. The Bernoulli distribution takes only 2 values: 0 and 1.

 $ℙ\left(X=1\right)=p,\phantom{\rule{0.166667em}{0ex}}ℙ\left(X=0\right)=1-p$ (73)
 Distribution of a Bernoulli distribution. CDF of a Bernoulli distribution.
Figure 35
• Binomial distribution: Univariate distribution. $\underline{\theta }=\left(n,p\right)$, with the constraints $n\in ℕ$ and $0\le p\le 1$. The Binomial distribution takes integer values between 0 and $n$.

 $P\left(X=k\right)={C}_{n}^{k}{p}^{k}{\left(1-p\right)}^{n-k}$ (74)

where

 $\begin{array}{c}{k}^{\phantom{\left(}}\in \left\{0,\cdots ,n\right\}\hfill \\ n\in ℕ\hfill \\ p\in \left[0,1\right]\hfill \end{array}$ (75)
 Distribution of a Binomial distribution. CDF of a Binomial distribution.
Figure 36
• Dirac distribution: Multivariate distribution. $\underline{\theta }=point$. The Dirac distribution takes only one value : $point\in {ℝ}^{n}$.

 $ℙ\left(\underline{X}=point\right)=1$ (76)
 Distribution of a Dirac distribution. CDF of a Dirac distribution.
Figure 37
• Geometric distribution: Univariate distribution. $\underline{\theta }=p$, with the constraint $0<p\le 1$. For all natural numbers $k\in {ℕ}^{*}$,

 $ℙ\left(X=k;\underline{\theta }\right)=p{\left(1-p\right)}^{k-1}$ (77)

The support is ${ℕ}^{*}$.

 Distribution of a Geometric distribution. CDF of a Geometric distribution.
Figure 38
• KPermutationsDistribution distribution: Multivariate $k$-dimensional distribution. $\underline{\theta }=\left(k,n\right)$, with the constraints $n\ge 1$ and $1\le k\le n$. The KPermutationsDistribution is the discrete uniform distribution on the set of injective functions $\left({i}_{0},\cdots ,{i}_{k-1}\right)$ from $\left\{0,\cdots ,k-1\right\}$ into $\left\{0,\cdots ,n-1\right\}$:

 $P\left(\underline{X}=\left({i}_{0},\cdots ,{i}_{k-1}\right)\right)=\frac{1}{d}$ (78)

where $d={A}_{n}^{k}=\frac{n!}{\left(n-k\right)!}$.

• Multinomial distribution: Multivariate $n$-dimensional distribution. $\underline{\theta }=\left({\left({p}_{k}\right)}_{1\le k\le n},N\right)$, with the constraints on ${p}_{k}$ and ${x}_{i}$ listed after the formula:

 $P\left(\underline{X}=\underline{x}\right)=\frac{N!}{{x}_{1}!\cdots {x}_{n}!\left(N-s\right)!}{p}_{1}^{{x}_{1}}\cdots {p}_{n}^{{x}_{n}}{\left(1-q\right)}^{N-s}$ (79)

where

 $\begin{array}{c}{0}^{\phantom{\left(}}\le {p}_{i}\le 1\hfill \\ {x}_{i}\in ℕ\hfill \\ q=\sum _{k=1}^{n}{p}_{k}\le 1\hfill \\ s=\sum _{k=1}^{n}{x}_{k}\le {N}_{\phantom{\left(}}\hfill \end{array}$ (80)

In dimension $n=1$, this definition corresponds to the Binomial distribution.
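The reduction to the Binomial distribution in dimension $n=1$ can be checked term by term. The sketch below implements (79) and (74) in plain Python; the function names and the numerical values are illustrative assumptions, and no OpenTURNS call is involved.

```python
import math

def multinomial_pmf(x, p, N):
    """Probability (79) under the constraints (80); returns 0 outside the support."""
    s, q = sum(x), sum(p)
    if any(xi < 0 for xi in x) or s > N:
        return 0.0
    coeff = math.factorial(N)
    for xi in x:
        coeff //= math.factorial(xi)   # exact at each step
    coeff //= math.factorial(N - s)
    prob = float(coeff) * (1.0 - q) ** (N - s)
    for xi, pk in zip(x, p):
        prob *= pk ** xi
    return prob

def binomial_pmf(k, n, p):
    """Probability (74)."""
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)

N, p = 10, 0.3  # illustrative values
max_gap = max(abs(multinomial_pmf([k], [p], N) - binomial_pmf(k, N, p)) for k in range(N + 1))

# In dimension 2, the probabilities over the whole support sum to 1
total = sum(multinomial_pmf([i, j], [0.2, 0.5], 6) for i in range(7) for j in range(7))
```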

• Negative Binomial distribution: Univariate distribution. $\underline{\theta }=\left(r,p\right)$, with the constraint $0<p<1$ and $r>0$. The Negative Binomial distribution takes its values in the non-negative integers $0,1,\cdots$

 $P\left(X=k\right)=\frac{\Gamma \left(k+r\right)}{\Gamma \left(r\right)\Gamma \left(k+1\right)}{p}^{k}{\left(1-p\right)}^{r}$ (81)

where $k\in ℕ$

 Distribution of a Negative Binomial distribution. CDF of a Negative Binomial distribution.
Figure 39
• Poisson distribution: Univariate distribution. $\underline{\theta }=\lambda$, with the constraint $\lambda >0$. For all $k\in ℕ$,

 $ℙ\left(X=k;\underline{\theta }\right)=\frac{{\lambda }^{k}}{k!}exp\left(-\lambda \right)$ (82)

The support is $ℕ$.

 Distribution of a Poisson distribution. CDF of a Poisson distribution.
Figure 40
• Skellam distribution: Univariate distribution. $\underline{\theta }=\left({\lambda }_{1},{\lambda }_{2}\right)$, with the constraint ${\lambda }_{i}>0$. The Skellam distribution takes its values in $ℤ$. It is the distribution of $\left({X}_{1}-{X}_{2}\right)$ for $\left({X}_{1},{X}_{2}\right)$ independent and respectively distributed according to $Poisson\left({\lambda }_{i}\right)$. The probability distribution function is:

 $\forall k\in ℤ,\phantom{\rule{1.em}{0ex}}ℙ\left(X=k\right)=2{p}_{Y}\left(2{\lambda }_{1}\right)$ (83)

where ${p}_{Y}$ is the probability density function of $Y$, which is distributed according to the non central chi-square distribution ${\chi }_{\nu ,\delta }^{2}$, with $\nu =2\left(k+1\right)$ and $\delta =2{\lambda }_{2}$.

 Distribution of a Skellam distribution. CDF of a Skellam distribution.
Figure 41
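The Skellam probabilities can also be computed directly from the definition as a difference of two independent Poisson variables, by summing over the value taken by ${X}_{2}$. The sketch below follows this route rather than the chi-square relation (83); it is plain Python, and the function names, the truncation of the series and the parameter values are assumptions of this example. The mean ${\lambda }_{1}-{\lambda }_{2}$ follows from the linearity of the expectation.

```python
import math

def poisson_pmf(k, lam):
    """Probability (82); zero for negative k."""
    if k < 0:
        return 0.0
    return lam ** k / math.factorial(k) * math.exp(-lam)

def skellam_pmf(k, lam1, lam2, terms=100):
    """P(X1 - X2 = k) as a truncated sum over the value j taken by X2."""
    return sum(poisson_pmf(k + j, lam1) * poisson_pmf(j, lam2) for j in range(terms))

lam1, lam2 = 3.0, 1.5  # illustrative values
support = range(-40, 41)  # the tails beyond this range are negligible here
total = sum(skellam_pmf(k, lam1, lam2) for k in support)
mean = sum(k * skellam_pmf(k, lam1, lam2) for k in support)
```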
• UserDefined: Multivariate $n$-dimensional distribution. $\underline{\theta }={\left(\underline{{x}_{k}},{p}_{k}\right)}_{1\le k\le N}$, with the constraints $0\le {p}_{k}\le 1$ and $\sum _{k=1}^{N}{p}_{k}=1$.

 $P\left(\underline{X}={\underline{x}}_{k}\right)={p}_{k},\phantom{\rule{1.em}{0ex}}1\le k\le N$ (84)

The support is the set of points $\left\{{\underline{x}}_{1},\cdots ,{\underline{x}}_{N}\right\}$.

 Distribution of a UserDefined distribution. CDF of a UserDefined distribution.
Figure 42
• Zipf-Mandelbrot distribution: Univariate distribution. $\underline{\theta }=\left(N,q,s\right)$, with the constraints $N\ge 1$, $q\ge 0$ and $s>0$. For all $k\in \left[1,N\right]$, $k$ integer,

 $\forall k\in \left[1,N\right],P\left(X=k\right)=\frac{1}{{\left(k+q\right)}^{s}}\frac{1}{H\left(N,q,s\right)}$ (85)

where $H\left(N,q,s\right)$ is the Generalized Harmonic Number : $H\left(N,q,s\right)=\sum _{i=1}^{N}\frac{1}{{\left(i+q\right)}^{s}}$.

 Distribution of a Zipf-Mandelbrot distribution. CDF of a Zipf-Mandelbrot distribution.
Figure 43
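Formula (85) is straightforward to evaluate once the Generalized Harmonic Number is available. The following sketch in plain Python uses illustrative function names and parameter values, which are assumptions of this example.

```python
def generalized_harmonic(N, q, s):
    """H(N, q, s) = sum_{i=1}^{N} 1 / (i + q)^s."""
    return sum(1.0 / (i + q) ** s for i in range(1, N + 1))

def zipf_mandelbrot_pmf(k, N, q, s):
    """Probability (85); zero outside {1, ..., N}."""
    if not (1 <= k <= N):
        return 0.0
    return 1.0 / ((k + q) ** s * generalized_harmonic(N, q, s))

N, q, s = 15, 0.3, 1.6  # illustrative values
probabilities = [zipf_mandelbrot_pmf(k, N, q, s) for k in range(1, N + 1)]
total = sum(probabilities)
```

The probabilities decrease with $k$ and sum to 1 by construction, since $H\left(N,q,s\right)$ is exactly the normalizing constant.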

Standard representative of distributions

OpenTURNS associates with each distribution a standard representative, corresponding to a specific set of its parameters. The following tables detail this specific set of parameters and give the expression of its non centered moments of order $n$.

Other notations

-

Link with OpenTURNS methodology

These probability distributions can be used in step B "Quantifying Sources of Uncertainty". Choosing a probability distribution is equivalent to implicitly making a hypothesis on the type of uncertainty of one of the variables $\underline{X}$ defined in step A "Specifying Criteria and the Case Study".
References and theoretical basics

This parametric approach has the advantage of describing the uncertainty with a reduced number of parameters. This is particularly useful when there is little data available for the unknown variables (a situation in which a non-parametric approach would be limited – see [empirical distribution function] and [kernel smoothing] ) and even when there is no data (the analysis can then only rely on expert judgement, which is easier to interpret when there are few distribution parameters).

Moreover, a parametric approach is often preferable when the uncertainty study criterion defined in step A deals with a rare event: obtaining a precise evaluation of the criterion then generally requires extrapolating $\underline{X}$ values beyond the observed data. Beware, however: an unwise modelling assumption (a bad choice of distribution) can lead to an erroneous extrapolation, and thus to false results.

The correct choice of probability distribution is thus crucial. Statistical tools are available to validate or invalidate the choice of distribution given a set of data (see for example [Graphical analysis] [Kolmogorov-Smirnov test] ). But consideration of the underlying context is also recommended. For example:

• the Normal distribution is relevant in metrology to represent certain measures of uncertainty.

• the Exponential distribution is useful for modelling uncertainty when considering the life duration of material that is not subject to ageing,

• the Gumbel distribution is defined to describe extreme phenomena (e.g. the maximal annual flow of a river or the maximal annual wind speed).

Some distributions are often used to express expert judgement in simple terms:

• the Uniform distribution expresses knowledge concerning the absolute limits of variables (i.e. the probability to exceed these limits is strictly zero) without any other prior assumption about the distribution (such as, for example the mean value or the most likely value),

• the Triangular distribution expresses knowledge concerning the absolute limits of variables and the most likely value.

Finally, an important point concerns the multi-dimensional case where ${n}_{X}>1$. Choosing the type of distribution implies an assumption about the uncertainty of each of the variables ${X}^{i}$, but also about the potential inter-dependencies between variables. These inter-dependencies between unknown variables can have an impact on the results of the uncertainty study.

Readers wishing to consider the dependencies in their study more deeply are referred to, for example, [copula method] , [linear correlation] , [rank correlation] .

The following bibliographical references provide main starting points for further study of this method:

• Saporta, G. (1990). "Probabilités, Analyse de données et Statistique", Technip

• Dixon, W.J. & Massey, F.J. (1983) "Introduction to statistical analysis (4th ed.)", McGraw-Hill

• Bhattacharyya, G.K., and R.A. Johnson, (1997). "Statistical Concepts and Methods", John Wiley and Sons, New York.

### 3.3.4 Step B  – Copula

Mathematical description

Goal

To define the joint probability density function of the random input vector $\underline{X}$ by composition, one needs:

• the specification of the copula of interest $C$ with its parameters,

• the specification of the ${n}_{X}$ marginal laws of interest ${F}_{{X}_{i}}$ of the ${n}_{X}$ input variables ${X}_{i}$.

The joint cumulative distribution function is therefore defined by:

 $\begin{array}{c}\hfill ℙ\left({X}^{1}\le {x}^{1},{X}^{2}\le {x}^{2},\cdots ,{X}^{{n}_{X}}\le {x}^{{n}_{X}}\right)=C\left({F}_{{X}^{1}}\left({x}^{1}\right),{F}_{{X}^{2}}\left({x}^{2}\right),\cdots ,{F}_{{X}^{{n}_{X}}}\left({x}^{{n}_{X}}\right)\right)\end{array}$

Within this part, we define the concept of copula and its use within OpenTURNS.

Principles

Copulas represent the part of the joint cumulative distribution function which is not described by the marginal laws, that is, the dependence structure of the input variables. A copula is a special cumulative distribution function defined on ${\left[0,1\right]}^{{n}_{X}}$ whose marginal distributions are uniform on $\left[0,1\right]$. The choice of the dependence structure is thus separated from the choice of the marginal distributions.
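The composition of a copula with marginal laws can be illustrated in dimension 2. The sketch below is plain Python, not the OpenTURNS API: it combines the Farlie-Gumbel-Morgenstern copula, defined later in this section, with an Exponential and a Uniform marginal. The function names, the choice of marginals and the parameter values are assumptions made for illustration only.

```python
import math

def fgm_copula(u1, u2, theta):
    """Farlie-Gumbel-Morgenstern copula, theta in [-1, 1]."""
    return u1 * u2 * (1.0 + theta * (1.0 - u1) * (1.0 - u2))

def exponential_cdf(x, lam):
    return 1.0 - math.exp(-lam * x) if x > 0.0 else 0.0

def uniform_cdf(x, a, b):
    return min(1.0, max(0.0, (x - a) / (b - a)))

def joint_cdf(x1, x2, theta=0.5, lam=1.5, a=0.0, b=2.0):
    """P(X1 <= x1, X2 <= x2) = C(F1(x1), F2(x2))."""
    return fgm_copula(exponential_cdf(x1, lam), uniform_cdf(x2, a, b), theta)

# Letting one argument go to the upper end of its range recovers the other marginal
marginal_1 = joint_cdf(1.0, 5.0)   # equals F1(1.0) since F2(5.0) = 1
marginal_2 = joint_cdf(1e3, 1.0)   # equals F2(1.0) since F1(1e3) = 1 in floating point
```

Changing `theta` modifies the dependence between the two variables without altering either marginal, which is the separation described above.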

Basic properties of copulas

A copula, restricted to ${\left[0,1\right]}^{{n}_{U}}$, is an ${n}_{U}$-dimensional cumulative distribution function with uniform marginals. It satisfies:

• $C\left(\underline{u}\right)\ge 0$, $\forall \underline{u}\in {\left[0,1\right]}^{{n}_{U}}$

• $C\left(\underline{u}\right)={u}_{i}$, $\forall \underline{u}=\left(1,...,1,{u}_{i},1,...,1\right)$

• For every ${n}_{U}$-box $ℬ=\left[{a}_{1},{b}_{1}\right]×\cdots ×\left[{a}_{{n}_{U}},{b}_{{n}_{U}}\right]\subset {\left[0,1\right]}^{{n}_{U}}$, we have ${𝒱}_{C}\left(ℬ\right)\ge 0$, where:

• ${𝒱}_{C}\left(ℬ\right)={\sum }_{i=1,\cdots ,{2}^{{n}_{U}}}sign\left({\underline{v}}_{i}\right)×C\left({\underline{v}}_{i}\right)$, the summation being made over the ${2}^{{n}_{U}}$ vertices $\underline{{v}_{i}}$ of $ℬ$.

• $sign\left({\underline{v}}_{i}\right)=+1$ if ${v}_{i}^{k}={a}_{k}$ for an even number of indices $k$, $sign\left({\underline{v}}_{i}\right)=-1$ otherwise.
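The volume ${𝒱}_{C}\left(ℬ\right)$ is the probability that the copula assigns to the box $ℬ$, computed as a signed sum over its vertices. The sketch below implements this sum in plain Python for an arbitrary dimension; the function names and the boxes are illustrative assumptions.

```python
from itertools import product

def copula_volume(C, lower, upper):
    """V_C(B) for the box B = [a_1, b_1] x ... x [a_n, b_n]."""
    n = len(lower)
    volume = 0.0
    for picks in product((0, 1), repeat=n):   # 0 -> coordinate a_k, 1 -> coordinate b_k
        vertex = [upper[k] if picks[k] else lower[k] for k in range(n)]
        n_lower = n - sum(picks)              # number of coordinates equal to a_k
        sign = 1.0 if n_lower % 2 == 0 else -1.0
        volume += sign * C(*vertex)
    return volume

def independent_copula(*u):
    result = 1.0
    for ui in u:
        result *= ui
    return result

def min_copula(*u):
    return min(u)

vol_indep_2d = copula_volume(independent_copula, [0.2, 0.1], [0.5, 0.7])
vol_indep_3d = copula_volume(independent_copula, [0.0, 0.0, 0.0], [0.5, 0.5, 0.5])
vol_min = copula_volume(min_copula, [0.2, 0.6], [0.5, 0.9])
```

For the independent copula the volume is the product of the side lengths of the box. For the Min copula, a box that does not meet the diagonal, such as the one used here, has zero volume.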

Copulas available within OpenTURNS

Different copulas are available within OpenTURNS:

Ali-Mikhail-Haq Copula: The Ali-Mikhail-Haq copula is archimedean, parameterized by a scalar $\theta \in \left[-1,1\right[$ . The Ali-Mikhail-Haq copula is thus defined by:

 $\begin{array}{c}\hfill C\left({u}_{1},{u}_{2}\right)=\frac{{u}_{1}{u}_{2}}{1-\theta \left(1-{u}_{1}\right)\left(1-{u}_{2}\right)}\end{array}$
 Iso-PDF of a Ali-Mikhail-Haq copula.
Figure 44

Clayton Copula: The Clayton copula is parameterized by a scalar $\theta \ge 0$ . The Clayton copula is thus defined by:

 $\begin{array}{c}\hfill C\left({u}_{1},{u}_{2}\right)={\left({u}_{1}^{-\theta }+{u}_{2}^{-\theta }-1\right)}^{-1/\theta }\end{array}$
 Iso-PDF of a Clayton copula.
Figure 45

Composed Copula: A copula may be defined as the product of other copulas : if ${C}_{1}$ and ${C}_{2}$ are two copulas respectively of random vectors in ${ℝ}^{{n}_{1}}$ and ${ℝ}^{{n}_{2}}$, we can create the copula of a random vector of ${ℝ}^{{n}_{1}+{n}_{2}}$, noted $C$ as follows :

 $\begin{array}{c}\hfill C\left({u}_{1},\cdots ,{u}_{n}\right)={C}_{1}\left({u}_{1},\cdots ,{u}_{{n}_{1}}\right){C}_{2}\left({u}_{{n}_{1}+1},\cdots ,{u}_{{n}_{1}+{n}_{2}}\right)\end{array}$

It means that the two subvectors $\left({u}_{1},\cdots ,{u}_{{n}_{1}}\right)$ and $\left({u}_{{n}_{1}+1},\cdots ,{u}_{{n}_{1}+{n}_{2}}\right)$ of ${ℝ}^{{n}_{1}}$ and ${ℝ}^{{n}_{2}}$ are independent of each other.

Farlie-Gumbel-Morgenstern Copula: The Farlie-Gumbel-Morgenstern copula is parameterized by a scalar $\theta \in \left[-1,1\right]$ . The Farlie-Gumbel-Morgenstern copula is thus defined by:

 $\begin{array}{c}\hfill C\left({u}_{1},{u}_{2}\right)={u}_{1}{u}_{2}\left(1+\theta \left(1-{u}_{1}\right)\left(1-{u}_{2}\right)\right)\end{array}$
 Iso-PDF of a Farlie-Gumbel-Morgenstern copula.
Figure 46

Frank Copula: The Frank copula is parameterized by a scalar $\theta \ne 0$ . The Frank copula is thus defined by:

 $\begin{array}{c}\hfill C\left({u}_{1},{u}_{2}\right)=-\frac{1}{\theta }log\left(1+\frac{\left({e}^{-\theta {u}_{1}}-1\right)\left({e}^{-\theta {u}_{2}}-1\right)}{{e}^{-\theta }-1}\right)\end{array}$
 Iso-PDF of a Frank copula.
Figure 47

Gumbel Copula: The Gumbel copula is parameterized by a scalar $\theta \ge 1$ . The Gumbel copula is thus defined by:

 $\begin{array}{c}\hfill C\left({u}_{1},{u}_{2}\right)=exp\left(-{\left({\left(-log\left({u}_{1}\right)\right)}^{\theta }+{\left(-log\left({u}_{2}\right)\right)}^{\theta }\right)}^{1/\theta }\right)\end{array}$
 Iso-PDF of a Gumbel copula.
Figure 48

Independent Copula: All the input variables are mutually independent. The independent copula is defined by:

 $\begin{array}{c}\hfill C\left({u}_{1},{u}_{2},\cdots ,{u}_{{n}_{\phantom{\rule{0.222222em}{0ex}}U}}\right)=\prod _{i=1}^{{n}_{\phantom{\rule{0.222222em}{0ex}}U}}{u}_{i}\end{array}$
 Iso-PDF of an Independent copula.
Figure 49

Maximum-entropy statistics copula: The density function is defined by:

 $\begin{array}{c}\hfill {f}_{U}\left(u\right)=\prod _{k=2}^{d}\frac{exp\left(-{\int }_{{\partial }_{k-1}^{-1}\left({u}_{k-1}\right)}^{{\partial }_{k}^{-1}\left({u}_{k}\right)}{\phi }_{k}\left(s\right)ds\right)}{{\partial }_{k-1}\left({\partial }_{k}^{-1}\left({u}_{k}\right)\right)-{u}_{k}}{\mathbf{1}}_{{F}_{1}^{-1}\left({u}_{1}\right)\le \cdots \le {F}_{d}^{-1}\left({u}_{d}\right)}\end{array}$
 $\begin{array}{c}\hfill \text{with}\phantom{\rule{4.pt}{0ex}}{\partial }_{k}\left(t\right)={F}_{k}\left({G}^{-1}\left(t\right)\right)\phantom{\rule{4.pt}{0ex}}\text{and}\phantom{\rule{4.pt}{0ex}}G\left(t\right)=\frac{1}{t}\sum _{k=1}^{d}{F}_{k}\left(t\right)\end{array}$

Min Copula: The Min copula is the upper Fréchet-Hoeffding bound defined by:

 $\begin{array}{c}\hfill C\left({u}_{1},\cdots ,{u}_{n}\right)=min\left({u}_{1},\cdots ,{u}_{n}\right)\end{array}$

Normal Copula: The Normal copula is parameterized by a correlation matrix $𝐑$. The Normal copula is thus defined by:

 $\begin{array}{c}\hfill C\left({u}_{1},\cdots ,{u}_{n}\right)={\Phi }_{𝐑}^{{n}_{\phantom{\rule{0.222222em}{0ex}}U}}\left({\Phi }^{-1}\left({u}_{1}\right),{\Phi }^{-1}\left({u}_{2}\right),\cdots ,{\Phi }^{-1}\left({u}_{{n}_{\phantom{\rule{0.222222em}{0ex}}U}}\right)\right)\end{array}$

where:

• ${\Phi }_{𝐑}^{{n}_{X}}$ is the multinormal cumulative distribution function in dimension ${n}_{X}$:

 $\begin{array}{c}\hfill {\Phi }_{𝐑}^{{n}_{X}}\left(\underline{x}\right)={\int }_{-\infty }^{{x}_{1}}...{\int }_{-\infty }^{{x}_{{n}_{X}}}\frac{1}{{\left(2\pi \right)}^{\frac{{n}_{X}}{2}}\sqrt{\mathrm{det}\phantom{\rule{0.166667em}{0ex}}𝐑}}\phantom{\rule{0.222222em}{0ex}}{e}^{-\frac{{}^{t}\underline{u}\phantom{\rule{0.166667em}{0ex}}{𝐑}^{-1}\underline{u}}{2}}\phantom{\rule{0.222222em}{0ex}}d{u}_{1}...d{u}_{{n}_{X}}\end{array}$
• $\Phi$ is the cumulative distribution function of the normal law in dimension 1:

 $\begin{array}{c}\hfill \Phi \left(x\right)={\int }_{-\infty }^{x}\frac{1}{\sqrt{2\pi }}\phantom{\rule{0.222222em}{0ex}}{e}^{-\frac{{t}^{2}}{2}}\phantom{\rule{0.222222em}{0ex}}dt\end{array}$
• $𝐑$ is the correlation matrix. This matrix is characterized by its algebraic properties: it is symmetric and positive definite.

The correlation matrix $𝐑$ can be obtained by different means:

• If one knows the Spearman correlation matrix, that is to say,

 $\begin{array}{c}\hfill {\rho }_{ij}^{S}={\rho }^{S}\left({X}_{i},{X}_{j}\right)={\rho }^{P}\left({F}_{{X}_{i}}\left({X}_{i}\right),{F}_{{X}_{j}}\left({X}_{j}\right)\right)\end{array}$

the correlation matrix $𝐑$ is deduced by the following formula:

 $\begin{array}{c}\hfill {𝐑}_{ij}=2sin\left(\frac{\pi }{6}{\rho }_{ij}^{S}\right)\end{array}$
• If one knows the Kendall measure of correlation, that is to say,

 $\begin{array}{c}\hfill {\tau }_{ij}=\tau \left({X}_{i},{X}_{j}\right)=ℙ\left(\left({X}_{{i}_{1}}-{X}_{{i}_{2}}\right).\left({X}_{{j}_{1}}-{X}_{{j}_{2}}\right)>0\right)-ℙ\left(\left({X}_{{i}_{1}}-{X}_{{i}_{2}}\right).\left({X}_{{j}_{1}}-{X}_{{j}_{2}}\right)<0\right)\end{array}$

where $\left({X}_{{i}_{1}},{X}_{{j}_{1}}\right)$ and $\left({X}_{{i}_{2}},{X}_{{j}_{2}}\right)$ follow the law of $\left({X}_{i},{X}_{j}\right)$, the correlation matrix $𝐑$ is deduced by the following formula:

 $\begin{array}{c}\hfill {𝐑}_{ij}=sin\left(\frac{\pi }{2}.{\tau }_{ij}\right)\end{array}$
• If one knows the Pearson correlation matrix ${𝐑}^{P}$, there are two possibilities:

1. If all the marginal distributions are Normal (and only in that case),

 $\begin{array}{c}\hfill 𝐑\equiv {𝐑}^{P}\end{array}$
2. In the other cases, one has to build the correlation matrix $𝐑$ by inverting the following formula, which expresses the Pearson correlation matrix ${𝐑}^{P}$ as a function of $𝐑$:

 $\begin{array}{c}\hfill {𝐑}_{ij}^{P}=\int {\int }_{{ℝ}^{2}}\left({x}^{i}-𝔼\left[{X}^{i}\right]\right)\left({x}^{j}-𝔼\left[{X}^{j}\right]\right){\Phi }_{ij}\left({x}^{i},{x}^{j},{𝐑}_{ij}\right)d{x}^{i}d{x}^{j}\end{array}$
 Iso-PDF of a Normal copula.
Figure 50
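The two closed-form conversions above (from the Spearman and Kendall measures to the correlation matrix of the Normal copula) can be sketched in a few lines of Python. The helper names below are hypothetical and do not come from OpenTURNS. The Pearson case with non-Normal marginals requires a numerical inversion and is not covered here.

```python
import math

def normal_copula_corr_from_spearman(rho_s):
    """R_ij = 2 sin(pi / 6 * rho_S_ij)."""
    return 2.0 * math.sin(math.pi * rho_s / 6.0)

def normal_copula_corr_from_kendall(tau):
    """R_ij = sin(pi / 2 * tau_ij)."""
    return math.sin(math.pi * tau / 2.0)

def convert_matrix(matrix, convert):
    """Apply a conversion to the off-diagonal terms, keeping a unit diagonal."""
    n = len(matrix)
    return [[1.0 if i == j else convert(matrix[i][j]) for j in range(n)]
            for i in range(n)]

spearman = [[1.0, 0.5], [0.5, 1.0]]
kendall = [[1.0, 0.5], [0.5, 1.0]]
r_from_spearman = convert_matrix(spearman, normal_copula_corr_from_spearman)
r_from_kendall = convert_matrix(kendall, normal_copula_corr_from_kendall)
print(r_from_spearman[0][1], r_from_kendall[0][1])
```

Both conversions map 0 to 0 and 1 to 1, so independence and perfect positive dependence are preserved. In dimension greater than 2, the converted matrix should still be checked for positive definiteness before use.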

Sklar Copula: The Sklar copula is obtained directly from the expression of the $n$-dimensional distribution whose cumulative distribution function is $F$, with marginals ${F}_{i}$:

 $\begin{array}{c}\hfill C\left({u}_{1},\cdots ,{u}_{n}\right)=F\left({F}_{1}^{-1}\left({u}_{1}\right),\cdots ,{F}_{n}^{-1}\left({u}_{n}\right)\right)\end{array}$

Figure 51 shows the iso-PDF of a Sklar copula extracted from a bidimensional Student distribution.

 Iso-PDF of a Sklar copula.
Figure 51

Other notations

-

Link with OpenTURNS methodology

This method of modelling the dependencies between the input variables is part of step B of the global methodology ("quantify sources of uncertainty"). It makes it possible to build an expression of the probability density function of the input variables $\underline{X}$ defined in step A ("specification of the model and criteria") by composition with the marginal distributions of each ${X}^{i}$. This method requires the knowledge of the Spearman correlation matrix or the Kendall correlation measure. It can also be used if one knows the Pearson correlation matrix, but only with the assumption of Normal marginal laws for all the input variables.

References and theoretical basics

Note that the composition of the marginal distributions and the copulas available in OpenTURNS is not sufficient to represent all types of dependencies (see examples in the next section). Prior statistical analysis and/or other justification should support this choice of dependence model. Besides, as previously discussed, the use of a copula is totally decoupled from the knowledge of the marginal laws of the input variables.

The following references give a first entry point to copulas:

• Nelsen, R.B., "An Introduction to Copulas", Springer

• Embrechts, P., Lindskog, F. and McNeil, A. (2001). "Modelling dependence with copulas and applications to risk management", ETH Zürich

Examples

-

### 3.3.5 Step B  – Random Mixture : affine combination of independent univariate distributions

Mathematical description

Goal

A multivariate random variable $\underline{Y}$ may be defined as an affine transform of $n$ independent univariate random variables, as follows:

 $\underline{Y}={\underline{y}}_{0}+\underline{\underline{M}}\phantom{\rule{0.166667em}{0ex}}\underline{X}$ (86)

where ${\underline{y}}_{0}\in {ℝ}^{d}$ is a deterministic vector with $d\in \left\{1,2,3\right\}$, $\underline{\underline{M}}\in {ℳ}_{d,n}\left(ℝ\right)$ is a deterministic matrix and ${\left({X}_{k}\right)}_{1\le k\le n}$ are independent univariate random variables.

In such a case, it is possible to evaluate the distribution of $\underline{Y}$ directly and then to ask $\underline{Y}$ any request compatible with a distribution: moments, probability density and cumulative distribution functions, quantiles (in dimension 1 only), etc.

Principle

Evaluation of the probability density function of the Random Mixture

As the univariate random variables ${X}_{k}$ are independent, the characteristic function of $\underline{Y}$, denoted ${\phi }_{Y}$, is easily obtained from the characteristic functions of the ${X}_{k}$, denoted ${\phi }_{{X}_{k}}$, as follows:

 ${\phi }_{Y}\left({u}_{1},\cdots ,{u}_{d}\right)=\prod _{j=1}^{d}{e}^{i{u}_{j}{{y}_{0}}_{j}}\prod _{k=1}^{n}{\phi }_{{X}_{k}}\left({\left({M}^{t}u\right)}_{k}\right),\phantom{\rule{4.pt}{0ex}}\text{for}\phantom{\rule{4.pt}{0ex}}\underline{u}\in {ℝ}^{d}$ (87)

Once ${\phi }_{Y}$ has been evaluated, the probability density function of $Y$, denoted ${p}_{Y}$, can be computed. Several techniques are possible, such as the inversion of the Fourier transform, but this technique is not easy to implement.

OpenTURNS uses another technique, based on the Poisson summation formula, defined as follows:

 $\sum _{{j}_{1}\in ℤ}\cdots \sum _{{j}_{d}\in ℤ}{p}_{Y}\left({y}_{1}+\frac{2\pi {j}_{1}}{{h}_{1}},\cdots ,{y}_{d}+\frac{2\pi {j}_{d}}{{h}_{d}}\right)=\prod _{j=1}^{d}\frac{{h}_{j}}{2\pi }\sum _{{k}_{1}\in ℤ}\cdots \sum _{{k}_{d}\in ℤ}{\phi }_{Y}\left({k}_{1}{h}_{1},\cdots ,{k}_{d}{h}_{d}\right){e}^{-ı\left({\sum }_{m=1}^{d}{k}_{m}{h}_{m}{y}_{m}\right)}$ (88)

By fixing ${h}_{1},\cdots ,{h}_{d}$ small enough, $\frac{2k\pi }{{h}_{j}}\approx +\infty$ and ${p}_{Y}\left(\cdots ,\frac{2k\pi }{{h}_{j}},\cdots \right)\approx 0$ for $k\ne 0$, because ${p}_{Y}$ decays at infinity. Thus the nested sums of the left-hand side of (88) reduce to the central term ${j}_{1}=\cdots ={j}_{d}=0$: the left-hand side is approximately equal to ${p}_{Y}\left(y\right)$.

Furthermore, the right-hand side of (88) is a series which converges very fast: only a few terms of the series are enough to get machine-precision accuracy. Note that the factors ${\phi }_{Y}\left({k}_{1}{h}_{1},\cdots ,{k}_{d}{h}_{d}\right)$, which are expensive to evaluate, do not depend on $y$ and are evaluated only once.
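To make the mechanism concrete, here is a minimal one-dimensional sketch of equations (87) and (88) in Python. It is not the OpenTURNS implementation. It assumes $Y=2+5{X}_{1}+{X}_{2}$ with ${X}_{1}$ exponential of rate 1.5 and ${X}_{2}$ Normal with mean 4 and standard deviation 1. The step `h = 0.05` and the truncation order 256 are illustrative values, and all function names are hypothetical. For this particular mixture the density is known in closed form (an exponentially modified Gaussian), which gives a reference value.

```python
import cmath
import math

def phi_exponential(u, rate):
    """Characteristic function of an exponential variable with the given rate."""
    return rate / (rate - 1j * u)

def phi_normal(u, mu, sigma):
    """Characteristic function of a Normal(mu, sigma) variable."""
    return cmath.exp(1j * u * mu - 0.5 * (sigma * u) ** 2)

def phi_y(u):
    """Equation (87) with d = 1, y0 = 2 and M = (5, 1)."""
    return cmath.exp(2j * u) * phi_exponential(5.0 * u, 1.5) * phi_normal(u, 4.0, 1.0)

def pdf_poisson_sum(y, phi, h, n_terms):
    """Right-hand side of (88) in dimension 1, truncated to |k| <= n_terms."""
    total = sum(phi(k * h) * cmath.exp(-1j * k * h * y)
                for k in range(-n_terms, n_terms + 1))
    return h / (2.0 * math.pi) * total.real

def pdf_reference(y):
    """Closed-form density: 5 X1 is exponential of rate 0.3, 2 + X2 is Normal(6, 1)."""
    rate, mu, sigma = 0.3, 6.0, 1.0
    return (0.5 * rate
            * math.exp(0.5 * rate * (2.0 * mu + rate * sigma ** 2 - 2.0 * y))
            * math.erfc((mu + rate * sigma ** 2 - y) / (math.sqrt(2.0) * sigma)))

approx = pdf_poisson_sum(8.0, phi_y, h=0.05, n_terms=256)
print(approx, pdf_reference(8.0))
```

With these settings the period $2\pi /h$ is about 126, so the shifted copies of ${p}_{Y}$ on the left-hand side of (88) are negligible at $y=8$, and the Gaussian factor in ${\phi }_{Y}$ makes the truncated terms negligible as well.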

It is also possible to greatly improve the performance of the algorithm by noticing that equation (88) is linear between ${p}_{Y}$ and ${\phi }_{Y}$. We denote ${q}_{Y}$ and ${\psi }_{Y}$ respectively the density and the characteristic function of the multivariate normal distribution with the same mean $\underline{\mu }$ and same covariance matrix $\underline{C}$ as the random mixture. By applying this multivariate normal distribution to the equation (88), we obtain by subtraction:
 ${p}_{Y}\left(y\right)=\sum _{j\in {ℤ}^{d}}{q}_{Y}\left({y}_{1}+\frac{2\pi {j}_{1}}{{h}_{1}},\cdots ,{y}_{d}+\frac{2\pi {j}_{d}}{{h}_{d}}\right)+\frac{H}{{2}^{d}{\pi }^{d}}\sum _{|{k}_{1}|\le N}\cdots \sum _{|{k}_{d}|\le N}{\delta }_{Y}\left({k}_{1}{h}_{1},\cdots ,{k}_{d}{h}_{d}\right){e}^{-ı\left({\sum }_{m=1}^{d}{k}_{m}{h}_{m}{y}_{m}\right)}$ (89)

where $H={h}_{1}×\cdots ×{h}_{d}$, $j=\left({j}_{1},\cdots ,{j}_{d}\right)$, ${\delta }_{Y}:={\phi }_{Y}-{\psi }_{Y}$

In the case where $n\gg 1$, by the central limit theorem, the distribution of $\underline{Y}$ tends to the normal distribution with density $q$, which drastically reduces $N$. The sum on $q$ then becomes the most CPU-intensive part, because in the general case more terms than the central one must be kept in this sum, since the parameters ${h}_{1},\cdots ,{h}_{d}$ were calibrated with respect to $p$ and not $q$.

The parameters ${h}_{1},\cdots {h}_{d}$ are calibrated using the following formula:

 $\begin{array}{c}\hfill {h}_{\ell }=\frac{2\pi }{\left(\beta +4\alpha \right){\sigma }_{\ell }}\end{array}$ (90)

where ${\sigma }_{\ell }=\sqrt{\mathrm{Cov}{\left[\underline{Y}\right]}_{\ell ,\ell }}$, $\alpha$ is the number of standard deviations covered by the marginal distribution ($\alpha =5$ by default) and $\beta$ is the number of marginal deviations beyond which the density is negligible ($\beta =8.5$ by default).

The parameter $N$ is calibrated dynamically: we start with $N=8$, then double the value of $N$ until the total contribution of the additional terms is negligible.
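The calibration of $h$ by formula (90) and the doubling strategy for $N$ can be sketched as follows in dimension 1. This is a simplified illustration under stated assumptions: the series is applied directly to ${\phi }_{Y}$ as in (88) rather than to ${\delta }_{Y}$ as in (89), the stopping tolerance `tol` and the cap `n_max` are hypothetical, and the test case is a standard Normal variable whose density at 0 is known exactly.

```python
import cmath
import math

def calibrate_h(sigma, alpha=5.0, beta=8.5):
    """Formula (90): h = 2 pi / ((beta + 4 alpha) sigma)."""
    return 2.0 * math.pi / ((beta + 4.0 * alpha) * sigma)

def pdf_adaptive(y, phi, h, n_start=8, tol=1e-12, n_max=2 ** 16):
    """Truncated series of (88) in dimension 1, doubling N until convergence."""
    def pair(k):
        # The terms of index k and -k are complex conjugates of each other
        return 2.0 * (phi(k * h) * cmath.exp(-1j * k * h * y)).real

    scale = h / (2.0 * math.pi)
    total = phi(0.0).real + sum(pair(k) for k in range(1, n_start + 1))
    n = n_start
    while n < n_max:
        extra = sum(pair(k) for k in range(n + 1, 2 * n + 1))
        total += extra
        n *= 2
        if abs(scale * extra) < tol:  # contribution of the new terms is negligible
            break
    return scale * total, n

def phi_std_normal(u):
    """Characteristic function of the standard Normal distribution."""
    return cmath.exp(-0.5 * u * u)

h = calibrate_h(1.0)
value, n_used = pdf_adaptive(0.0, phi_std_normal, h)
print(h, n_used, value, 1.0 / math.sqrt(2.0 * math.pi))
```

Testing only the size of the last block of terms is a crude criterion: it can stop too early if those terms cancel each other, so a production implementation would need a more careful test.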

Evaluation of the moments of the Random Mixture

The relation (86) makes it possible to evaluate all the moments of the random mixture, provided they are mathematically defined. For example, we have:

 $\begin{array}{c}\hfill \left\{\begin{array}{ccc}𝔼\left[\underline{Y}\right]\hfill & =& \underline{{y}_{0}}+\underline{\underline{M}}𝔼\left[\underline{X}\right]\hfill \\ \mathrm{Cov}\left[\underline{Y}\right]\hfill & =& \underline{\underline{M}}\phantom{\rule{0.166667em}{0ex}}\mathrm{Cov}\left[\underline{X}\right]{\underline{\underline{M}}}^{t}\hfill \end{array}\right.\end{array}$
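As an illustration, the sketch below applies these two relations with plain Python lists. Because the ${X}_{k}$ are independent, $\mathrm{Cov}\left[\underline{X}\right]$ is diagonal and only the variances are needed. The helper name is hypothetical, and the numerical case assumes $Y=2+5{X}_{1}+{X}_{2}$ with ${X}_{1}$ exponential of rate 1.5 (mean $1/1.5$, variance $1/{1.5}^{2}$) and ${X}_{2}$ Normal with mean 4 and unit variance.

```python
def mixture_moments(y0, m, mean_x, var_x):
    """Mean vector and covariance matrix of Y = y0 + M X for independent X_k.

    y0: list of length d, m: d x n matrix as nested lists,
    mean_x and var_x: means and variances of the n independent X_k.
    """
    d, n = len(m), len(mean_x)
    mean_y = [y0[a] + sum(m[a][k] * mean_x[k] for k in range(n))
              for a in range(d)]
    # Cov[X] is diagonal, so (M Cov[X] M^t)_ab = sum_k M_ak Var[X_k] M_bk
    cov_y = [[sum(m[a][k] * var_x[k] * m[b][k] for k in range(n))
              for b in range(d)] for a in range(d)]
    return mean_y, cov_y

rate = 1.5
mean_y, cov_y = mixture_moments(
    y0=[2.0],
    m=[[5.0, 1.0]],
    mean_x=[1.0 / rate, 4.0],
    var_x=[1.0 / rate ** 2, 1.0],
)
print(mean_y, cov_y)
```

Here the mean is $2+5/1.5+4$ and the variance is $25/{1.5}^{2}+1$, which is the standard deviation that enters the calibration of $h$.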

Computation on a regular grid

The goal is to compute the density function on a regular grid in order to quickly obtain an approximation. The regular grid is of the form:

 $\begin{array}{c}\hfill \forall r\in \left\{1,\cdots ,d\right\},\forall m\in \left\{0,\cdots ,M-1\right\},\phantom{\rule{0.222222em}{0ex}}{y}_{r,m}={\mu }_{r}+b\left(\frac{2m+1}{M}-1\right){\sigma }_{r}\end{array}$ (91)

By denoting ${p}_{{m}_{1},\cdots ,{m}_{d}}={p}_{\underline{Y}}\left({y}_{1,{m}_{1}},\cdots ,{y}_{d,{m}_{d}}\right)$:

 $\begin{array}{c}\hfill {p}_{{m}_{1},\cdots ,{m}_{d}}={Q}_{{m}_{1},\cdots ,{m}_{d}}+{S}_{{m}_{1},\cdots ,{m}_{d}}\end{array}$ (92)

for which the term ${S}_{{m}_{1},\cdots ,{m}_{d}}$ is the most CPU-intensive. This term can be rewritten as:

 $\begin{array}{cc}\hfill {S}_{{m}_{1},\cdots ,{m}_{d}}=& \frac{H}{{2}^{d}{\pi }^{d}}\sum _{{k}_{1}=-N}^{N}\cdots \sum _{{k}_{d}=-N}^{N}\delta \left({k}_{1}{h}_{1},\cdots ,{k}_{d}{h}_{d}\right){E}_{{m}_{1},\cdots ,{m}_{d}}\left({k}_{1},\cdots ,{k}_{d}\right)\hfill \end{array}$ (3.3)

with:

 $\begin{array}{cc}\hfill \delta \left({k}_{1}{h}_{1},\cdots ,{k}_{d}{h}_{d}\right)& =\left(\phi -\psi \right)\left({k}_{1}{h}_{1},\cdots ,{k}_{d}{h}_{d}\right)\hfill \\ \hfill {E}_{{m}_{1},\cdots ,{m}_{d}}\left({k}_{1},\cdots ,{k}_{d}\right)& ={e}^{-i{\sum }_{j=1}^{d}{k}_{j}{h}_{j}\left({\mu }_{j}+b\left(\frac{2{m}_{j}+1}{M}-1\right){\sigma }_{j}\right)}\hfill \end{array}$ (3.3)

The aim is to rewrite the previous expression as a $d$-dimensional discrete Fourier transform, in order to apply the Fast Fourier Transform (FFT) for its evaluation.

We set $M=N$ and $\forall j\in \left\{1,\cdots ,d\right\},\phantom{\rule{0.222222em}{0ex}}{h}_{j}=\frac{\pi }{b{\sigma }_{j}}$ and ${\tau }_{j}=\frac{{\mu }_{j}}{b{\sigma }_{j}}$. For convenience, we introduce the functions:

 ${f}_{j}\left(k\right)={e}^{-i\pi \left(k+1\right)\left({\tau }_{j}-1+\frac{1}{N}\right)}$

We use $k+1$ instead of $k$ in this function to simplify expressions below.

We obtain:

 $\begin{array}{cc}\hfill {E}_{{m}_{1},\cdots ,{m}_{d}}\left({k}_{1},\cdots ,{k}_{d}\right)& ={e}^{-i{\sum }_{j=1}^{d}{k}_{j}{h}_{j}b{\sigma }_{j}\left(\frac{{\mu }_{j}}{b{\sigma }_{j}}+\frac{2{m}_{j}}{N}+\frac{1}{N}-1\right)}\hfill \\ & ={e}^{-2i\pi \left(\frac{{\sum }_{j=1}^{d}{k}_{j}{m}_{j}}{N}\right)}{e}^{-i\pi {\sum }_{j=1}^{d}{k}_{j}\left({\tau }_{j}-1+\frac{1}{N}\right)}\hfill \\ & ={e}^{-2i\pi \left(\frac{{\sum }_{j=1}^{d}{k}_{j}{m}_{j}}{N}\right)}{f}_{1}\left({k}_{1}-1\right)×\cdots ×{f}_{d}\left({k}_{d}-1\right)\hfill \end{array}$ (3.3)

For performance reasons, we want to use the discrete Fourier transform with the following convention in dimension 1 :

 ${A}_{m}=\sum _{k=0}^{N-1}{a}_{k}{e}^{-2i\pi \frac{km}{N}}$

whose extensions to dimensions 2 and 3 are respectively:

 ${A}_{m,n}=\sum _{k=0}^{N-1}\sum _{l=0}^{N-1}{a}_{k,l}{e}^{-2i\pi \frac{km}{N}}{e}^{-2i\pi \frac{ln}{N}}$
 ${A}_{m,n,p}=\sum _{k=0}^{N-1}\sum _{l=0}^{N-1}\sum _{s=0}^{N-1}{a}_{k,l,s}{e}^{-2i\pi \frac{km}{N}}{e}^{-2i\pi \frac{ln}{N}}{e}^{-2i\pi \frac{sp}{N}}$

In the expression of ${S}_{{m}_{1},\cdots ,{m}_{d}}$ above, we decompose each sum over ${k}_{j}$ on the interval $\left[-N,N\right]$ into three parts:

 $\begin{array}{cc}\hfill \sum _{{k}_{j}=-N}^{N}\delta \left({k}_{1}{h}_{1},\cdots ,{k}_{d}{h}_{d}\right){E}_{{m}_{1},\cdots ,{m}_{d}}\left({k}_{1},\cdots ,{k}_{d}\right)=& \sum _{{k}_{j}=-N}^{-1}\delta \left({k}_{1}{h}_{1},\cdots ,{k}_{d}{h}_{d}\right){E}_{{m}_{1},\cdots ,{m}_{d}}\left({k}_{1},\cdots ,{k}_{d}\right)\hfill \\ & +\delta \left({k}_{1}{h}_{1},\cdots ,0,\cdots ,{k}_{d}{h}_{d}\right){E}_{{m}_{1},\cdots ,0,\cdots ,{m}_{d}}\left({k}_{1},\cdots ,0,\cdots ,{k}_{d}\right)\hfill \\ & +\sum _{{k}_{j}=1}^{N}\delta \left({k}_{1}{h}_{1},\cdots ,{k}_{d}{h}_{d}\right){E}_{{m}_{1},\cdots ,{m}_{d}}\left({k}_{1},\cdots ,{k}_{d}\right)\hfill \end{array}$ (3.3)

If we already computed $E$ for dimension $d-1$, then the middle term in this sum is trivial.

To compute the last sum of this decomposition (${k}_{j}$ from $1$ to $N$), we apply the change of variable ${k}_{j}^{\text{'}}={k}_{j}-1$:

 $\begin{array}{cc}\hfill \sum _{{k}_{j}=1}^{N}\delta \left({k}_{1}{h}_{1},\cdots ,{k}_{d}{h}_{d}\right){E}_{{m}_{1},\cdots ,{m}_{d}}\left({k}_{1},\cdots ,{k}_{d}\right)=& \sum _{{k}_{j}=0}^{N-1}\delta \left({k}_{1}{h}_{1},\cdots ,\left({k}_{j}+1\right){h}_{j},\cdots ,{k}_{d}{h}_{d}\right)×\hfill \\ & \phantom{\rule{85.35826pt}{0ex}}{E}_{{m}_{1},\cdots ,{m}_{d}}\left({k}_{1},\cdots ,{k}_{j}+1,\cdots ,{k}_{d}\right)\hfill \end{array}$ (3.3)

The factorized expression of ${E}_{{m}_{1},\cdots ,{m}_{d}}$ obtained above gives:

 $\begin{array}{cc}\hfill {E}_{{m}_{1},\cdots ,{m}_{d}}\left({k}_{1},\cdots ,{k}_{j}+1,\cdots ,{k}_{d}\right)& ={e}^{-2i\pi \left(\frac{{\sum }_{l=1}^{d}{k}_{l}{m}_{l}}{N}+\frac{{m}_{j}}{N}\right)}{f}_{1}\left({k}_{1}-1\right)×\cdots ×{f}_{j}\left({k}_{j}\right)×\cdots ×{f}_{d}\left({k}_{d}-1\right)\hfill \\ & ={e}^{-2i\pi \left(\frac{{m}_{j}}{N}\right)}{e}^{-2i\pi \left(\frac{{\sum }_{l=1}^{d}{k}_{l}{m}_{l}}{N}\right)}{f}_{1}\left({k}_{1}-1\right)×\cdots ×{f}_{j}\left({k}_{j}\right)×\cdots ×{f}_{d}\left({k}_{d}-1\right)\hfill \end{array}$ (3.3)

Thus

 $\begin{array}{cc}\hfill \sum _{{k}_{j}=1}^{N}\delta \left({k}_{1}{h}_{1},\cdots ,{k}_{d}{h}_{d}\right){E}_{{m}_{1},\cdots ,{m}_{d}}& \left({k}_{1},\cdots ,{k}_{d}\right)={e}^{-2i\pi \left(\frac{{m}_{j}}{N}\right)}\sum _{{k}_{j}=0}^{N-1}\delta \left({k}_{1}{h}_{1},\cdots ,\left({k}_{j}+1\right){h}_{j},\cdots ,{k}_{d}{h}_{d}\right)×\hfill \\ & {e}^{-2i\pi \left(\frac{{\sum }_{l=1}^{d}{k}_{l}{m}_{l}}{N}\right)}{f}_{1}\left({k}_{1}-1\right)×\cdots ×{f}_{j}\left({k}_{j}\right)×\cdots ×{f}_{d}\left({k}_{d}-1\right)\hfill \end{array}$ (3.3)

To compute the first sum of the decomposition (${k}_{j}$ from $-N$ to $-1$), we apply the change of variable ${k}_{j}^{\text{'}}=N+{k}_{j}$:

 $\begin{array}{cc}\hfill \sum _{{k}_{j}=-N}^{-1}\delta \left({k}_{1}{h}_{1},\cdots ,{k}_{d}{h}_{d}\right){E}_{{m}_{1},\cdots ,{m}_{d}}\left({k}_{1},\cdots ,{k}_{d}\right)=& \sum _{{k}_{j}=0}^{N-1}\delta \left({k}_{1}{h}_{1},\cdots ,\left({k}_{j}-N\right){h}_{j},\cdots ,{k}_{d}{h}_{d}\right)×\hfill \\ & \phantom{\rule{85.35826pt}{0ex}}{E}_{{m}_{1},\cdots ,{m}_{d}}\left({k}_{1},\cdots ,{k}_{j}-N,\cdots ,{k}_{d}\right)\hfill \end{array}$ (3.3)

The factorized expression of ${E}_{{m}_{1},\cdots ,{m}_{d}}$ obtained above gives:

 $\begin{array}{cc}\hfill {E}_{{m}_{1},\cdots ,{m}_{d}}\left({k}_{1},\cdots ,{k}_{j}-N,\cdots ,{k}_{d}\right)& ={e}^{-2i\pi \left(\frac{{\sum }_{l=1}^{d}{k}_{l}{m}_{l}}{N}-{m}_{j}\right)}{f}_{1}\left({k}_{1}-1\right)×\cdots ×{f}_{j}\left({k}_{j}-1-N\right)×\cdots ×{f}_{d}\left({k}_{d}-1\right)\hfill \\ & ={e}^{-2i\pi \left(\frac{{\sum }_{l=1}^{d}{k}_{l}{m}_{l}}{N}\right)}{f}_{1}\left({k}_{1}-1\right)×\cdots ×{\overline{f}}_{j}\left(N-1-{k}_{j}\right)×\cdots ×{f}_{d}\left({k}_{d}-1\right)\hfill \end{array}$ (3.3)

Thus:

 $\begin{array}{cc}\hfill \sum _{{k}_{j}=-N}^{-1}\delta \left({k}_{1}{h}_{1},\cdots ,{k}_{d}{h}_{d}\right){E}_{{m}_{1},\cdots ,{m}_{d}}& \left({k}_{1},\cdots ,{k}_{d}\right)=\sum _{{k}_{j}=0}^{N-1}\delta \left({k}_{1}{h}_{1},\cdots ,\left({k}_{j}-N\right){h}_{j},\cdots ,{k}_{d}{h}_{d}\right)×\hfill \\ & {e}^{-2i\pi \left(\frac{{\sum }_{l=1}^{d}{k}_{l}{m}_{l}}{N}\right)}{f}_{1}\left({k}_{1}-1\right)×\cdots ×{\overline{f}}_{j}\left(N-1-{k}_{j}\right)×\cdots ×{f}_{d}\left({k}_{d}-1\right)\hfill \end{array}$ (3.3)

To summarize:

1. In order to compute the sum from ${k}_{1}=1$ to $N$, we multiply by ${e}^{-2i\pi \left(\frac{{m}_{1}}{N}\right)}$ and consider $\delta \left(\left({k}_{1}+1\right){h}_{1},\cdots \right){f}_{1}\left({k}_{1}\right)$

2. In order to compute the sum from ${k}_{1}=-N$ to $-1$, we consider $\delta \left(\left({k}_{1}-N\right){h}_{1},\cdots \right){\overline{f}}_{1}\left(N-1-{k}_{1}\right)$
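The decomposition summarized above can be checked numerically in dimension 1. The Python sketch below computes ${S}_{m}$ on the grid in two ways: directly from its definition, and through two discrete Fourier transforms plus the central term. It rests on several assumptions made only for illustration: the variable is Uniform on $\left[0,1\right]$, so that $\delta$ is the difference between its characteristic function and that of the Normal distribution with the same mean and variance, $N=16$, $b=4$, and a naive $O\left({N}^{2}\right)$ transform stands in for the FFT. All names are hypothetical.

```python
import cmath
import math

def naive_dft(a):
    """A_m = sum_k a_k exp(-2 i pi k m / N); stands in for an FFT routine."""
    n = len(a)
    return [sum(a[k] * cmath.exp(-2j * math.pi * k * m / n) for k in range(n))
            for m in range(n)]

def delta(u, mu, sigma):
    """phi - psi for a Uniform(0, 1) variable and the Normal with same moments."""
    phi = 1.0 if u == 0.0 else (cmath.exp(1j * u) - 1.0) / (1j * u)
    psi = cmath.exp(1j * u * mu - 0.5 * (sigma * u) ** 2)
    return phi - psi

def s_direct(n, b, mu, sigma):
    """S_m from its definition, with M = N and h = pi / (b sigma)."""
    h = math.pi / (b * sigma)
    values = []
    for m in range(n):
        y = mu + b * ((2 * m + 1) / n - 1) * sigma
        total = sum(delta(k * h, mu, sigma) * cmath.exp(-1j * k * h * y)
                    for k in range(-n, n + 1))
        values.append(h / (2.0 * math.pi) * total)
    return values

def s_dft(n, b, mu, sigma):
    """S_m from the three-part decomposition and two discrete Fourier transforms."""
    h = math.pi / (b * sigma)
    tau = mu / (b * sigma)

    def f(k):
        return cmath.exp(-1j * math.pi * (k + 1) * (tau - 1 + 1 / n))

    a_plus = [delta((k + 1) * h, mu, sigma) * f(k) for k in range(n)]
    a_minus = [delta((k - n) * h, mu, sigma) * f(n - 1 - k).conjugate()
               for k in range(n)]
    dft_plus, dft_minus = naive_dft(a_plus), naive_dft(a_minus)
    center = delta(0.0, mu, sigma)  # term k = 0, for which E = 1
    return [h / (2.0 * math.pi)
            * (dft_minus[m] + center
               + cmath.exp(-2j * math.pi * m / n) * dft_plus[m])
            for m in range(n)]

direct = s_direct(16, 4.0, 0.5, 12 ** -0.5)
fast = s_dft(16, 4.0, 0.5, 12 ** -0.5)
print(max(abs(x - y) for x, y in zip(direct, fast)))
```

Replacing `naive_dft` with a real FFT routine using the same sign convention would give the expected performance gain without changing the rest of the sketch.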

OpenTURNS

In version 0.13 of OpenTURNS, the distributions which are able to evaluate their characteristic function are the following: ${\chi }^{2}$, Exponential, Gamma, Laplace, Logistic, Mixture, univariate Normal, Rayleigh, Triangular, univariate TruncatedNormal, Uniform, KernelMixture (which is the distribution coming from a kernel smoothing method without treatment of bounds), RandomMixture.

Thus, the requests to $Y$ that require the evaluation of the probability density function can be satisfied only if the univariate random variables ${X}_{i}$ follow distributions whose characteristic function has been implemented.

Until version 1.5 of OpenTURNS, only univariate random mixtures were available. For all the other requests, no restriction applies.

Other notations

Link with OpenTURNS methodology

Within the global methodology, random mixtures may be used in step B to define the output variable of interest from some independent univariate random variables.

References and theoretical basics

• Abate, J. and Whitt, W. (1992). "The Fourier-series method for inverting transforms of probability distributions", Queueing Systems 10, 5–88 (formula 5.5)

Examples

The example here is an output variable of interest defined as the following combination :
 $\begin{array}{c}\hfill Y=2+5{X}_{1}+{X}_{2}\end{array}$

where ${X}_{1}$ and ${X}_{2}$ are independent and :

• ${X}_{1}$ follows a $ℰ\left(1.5\right)$,

• ${X}_{2}$ follows a $𝒩\left(4,1\right)$.

The pdf and cdf graphs are the following ones.

### 3.3.6 Step B  – Using QQ-plot to compare two samples

Mathematical description

Goal

Let $X$ be a scalar uncertain variable modelled as a random variable. This method deals with the construction of a dataset prior to the choice of a probability distribution for $X$. A QQ-plot (where "QQ" stands for "quantile-quantile") is a tool that may be used to compare two samples $\left\{{x}_{1},...,{x}_{N}\right\}$ and $\left\{{x}_{1}^{\text{'}},...,{x}_{M}^{\text{'}}\right\}$; the goal is to determine graphically whether these two samples come from the same probability distribution or not. If this is the case, the two samples should be aggregated in order to increase the robustness of further statistical analyses.

Principle of the method

A QQ-plot is based on the notion of quantile. The $\alpha$-quantile ${q}_{X}\left(\alpha \right)$ of $X$, where $\alpha \in \left(0,1\right)$, is defined as follows:

 $\begin{array}{c}\hfill ℙ\left(X\le {q}_{X}\left(\alpha \right)\right)=\alpha \end{array}$

If a sample $\left\{{x}_{1},...,{x}_{N}\right\}$ of $X$ is available, the quantile can be estimated empirically:

1. the sample $\left\{{x}_{1},...,{x}_{N}\right\}$ is first placed in ascending order, which gives the sample $\left\{{x}_{\left(1\right)},...,{x}_{\left(N\right)}\right\}$;

2. then, an estimate of the $\alpha$-quantile is:

 $\begin{array}{c}\hfill {\stackrel{^}{q}}_{X}\left(\alpha \right)={x}_{\left(\left[N\alpha \right]+1\right)}\end{array}$

where $\left[N\alpha \right]$ denotes the integer part of $N\alpha$.

Thus, the ${j}^{\mathrm{th}}$ smallest value of the sample ${x}_{\left(j\right)}$ is an estimate ${\stackrel{^}{q}}_{X}\left(\alpha \right)$ of the $\alpha$-quantile where $\alpha =\left(j-1\right)/N$ ($1<j\le N$). Let us then consider our second sample $\left\{{x}_{1}^{\text{'}},...,{x}_{M}^{\text{'}}\right\}$; this one also provides an estimate ${\stackrel{^}{q}}_{X}^{\text{'}}\left(\alpha \right)$ of this same quantile:

 $\begin{array}{c}\hfill {\stackrel{^}{q}}_{X}^{\text{'}}\left(\alpha \right)={x}_{\left(\left[M×\left(j-1\right)/N\right]+1\right)}^{\text{'}}\end{array}$

If the two samples correspond to the same probability distribution, then ${\stackrel{^}{q}}_{X}\left(\alpha \right)$ and ${\stackrel{^}{q}}_{X}^{\text{'}}\left(\alpha \right)$ should be close. Thus, graphically, the points $\left\{\left({\stackrel{^}{q}}_{X}\left(\alpha \right),{\stackrel{^}{q}}_{X}^{\text{'}}\left(\alpha \right)\right),\phantom{\rule{4pt}{0ex}}\alpha =\left(j-1\right)/N,\phantom{\rule{4pt}{0ex}}1<j\le N\right\}$ should be close to the diagonal.
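The construction of the QQ-plot points can be sketched in Python as follows. This is an illustration of the empirical quantile rule given above, not the OpenTURNS implementation. The function name is hypothetical, and the index $j$ is assumed to run over $1<j\le N$ so that $\alpha =\left(j-1\right)/N$ stays in $\left(0,1\right)$. Integer arithmetic is used for $\left[M×\left(j-1\right)/N\right]$ to avoid floating-point rounding of the integer part.

```python
def qq_points(sample_x, sample_xp):
    """Pairs of empirical quantiles (q_X(alpha), q'_X(alpha)), alpha = (j - 1) / N."""
    xs = sorted(sample_x)
    xps = sorted(sample_xp)
    n, m = len(xs), len(xps)
    points = []
    for j in range(2, n + 1):          # 1 < j <= N
        q = xs[j - 1]                  # x_(j), stored at 0-based index j - 1
        # x'_([M (j - 1) / N] + 1), stored at 0-based index [M (j - 1) / N]
        qp = xps[(m * (j - 1)) // n]
        points.append((q, qp))
    return points

# Second sample is ten times an evenly spread version of the first one
print(qq_points([3, 1, 2, 4], [10, 30, 20, 40, 50, 60, 70, 80]))
# Identical samples give points exactly on the diagonal
print(qq_points([0.2, 1.5, 0.7], [0.7, 0.2, 1.5]))
```

Plotting these pairs together with the first diagonal gives the QQ-plot. The two samples do not need to have the same size.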

The following figure illustrates the principle of a QQ-plot with two samples of size $M=50$ and $N=50$. Note that the unit of the two axes is that of the variable $X$ studied. In this example, the points remain close to the diagonal and the hypothesis "the two samples come from the same distribution" does not seem irrelevant, even if a more quantitative analysis (see [Smirnov test]) should be carried out to confirm this.

In this second example, the two samples clearly arise from two different distributions.

Other notations

Link with OpenTURNS methodology

This method is used in step B "Quantifying Sources of Uncertainty". It is a tool for the construction of a dataset that can be used afterwards to choose a probability distribution for some uncertain variables defined in step A "Specifying Criteria and the Case Study".
References and theoretical basics
A QQ-plot is a graphical analysis, the conclusion of which remains obviously subjective. The reader is referred to [Smirnov test] for a more quantitative analysis. The following bibliographical references provide main starting points for further study of this method:
• Saporta, G. (1990). "Probabilités, Analyse de données et Statistique", Technip

• Dixon, W.J. & Massey, F.J. (1983) "Introduction to statistical analysis (4th ed.)", McGraw-Hill

• D'Agostino, R.B. and Stephens, M.A. (1986). "Goodness-of-Fit Techniques", Marcel Dekker, Inc., New York.

• Bhattacharyya, G.K., and R.A. Johnson, (1997). "Statistical Concepts and Methods", John Wiley and Sons, New York.

• Sprent, P., and Smeeton, N.C. (2001). "Applied Nonparametric Statistical Methods – Third edition", Chapman & Hall

### 3.3.7 Step B  – Comparison of two samples using Smirnov test

Mathematical description

Goal

Let $X$ be a scalar uncertain variable modelled as a random variable. This method deals with the construction of a dataset prior to the choice of a probability distribution for $X$. Smirnov's test is a tool that may be used to compare two samples $\left\{{x}_{1},...,{x}_{N}\right\}$ and $\left\{{x}_{1}^{\text{'}},...,{x}_{M}^{\text{'}}\right\}$; the goal is to determine whether these two samples come from the same probability distribution or not. If this is the case, the two samples should be aggregated in order to increase the robustness of further statistical analyses.

Principle of the method

Smirnov's test is a statistical test based on the maximum distance between the empirical cumulative distribution functions ${\stackrel{^}{F}}_{N}$ and ${\stackrel{^}{F}}_{M}^{\text{'}}$ of the samples $\left\{{x}_{1},...,{x}_{N}\right\}$ and $\left\{{x}_{1}^{\text{'}},...,{x}_{M}^{\text{'}}\right\}$ (see [empirical cumulative distribution function]). This distance is expressed as follows:

 $\begin{array}{c}\hfill {\stackrel{^}{D}}_{M,N}=\underset{x}{sup}\left|{\stackrel{^}{F}}_{N}\left(x\right)-{\stackrel{^}{F}}_{M}^{\text{'}}\left(x\right)\right|\end{array}$

The probability distribution of the distance ${\stackrel{^}{D}}_{M,N}$ is asymptotically known (i.e. as the size of the samples tends to infinity). If $M$ and $N$ are sufficiently large, this means that for a probability $\alpha$, one can calculate the threshold / critical value ${d}_{\alpha }$ such that:

• if ${\stackrel{^}{D}}_{M,N}>{d}_{\alpha }$, we conclude that the two samples are not identically distributed, with a risk of error $\alpha$,

• if ${\stackrel{^}{D}}_{M,N}\le {d}_{\alpha }$, it is reasonable to say that both samples arise from the same distribution.

An important notion is the so-called "$p$-value" of the test. This quantity is equal to the limit error probability ${\alpha }_{\mathrm{lim}}$ under which the "identically-distributed" hypothesis is rejected. Thus, the two samples will be supposed identically distributed if and only if ${\alpha }_{\mathrm{lim}}$ is greater than the value $\alpha$ desired by the user. Note that the higher ${\alpha }_{\mathrm{lim}}-\alpha$, the more robust the decision.
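A minimal Python sketch of the decision rule is given below. The statistic is computed exactly from the two empirical cumulative distribution functions. The threshold uses the classical large-sample approximation ${d}_{\alpha }\approx \sqrt{-\frac{1}{2}\mathrm{ln}\left(\alpha /2\right)}\sqrt{\frac{N+M}{NM}}$, which is an assumption of this example and not necessarily the formula implemented in OpenTURNS. The function names are hypothetical.

```python
import bisect
import math

def smirnov_statistic(sample_x, sample_xp):
    """D = sup_x |F_N(x) - F'_M(x)| for the two empirical CDFs."""
    xs, xps = sorted(sample_x), sorted(sample_xp)
    n, m = len(xs), len(xps)
    d = 0.0
    # Both CDFs are right-continuous step functions, so the supremum
    # is reached at one of the observed values
    for x in xs + xps:
        f_n = bisect.bisect_right(xs, x) / n
        f_m = bisect.bisect_right(xps, x) / m
        d = max(d, abs(f_n - f_m))
    return d

def asymptotic_threshold(alpha, n, m):
    """Large-sample approximation of the critical value d_alpha."""
    c = math.sqrt(-0.5 * math.log(alpha / 2.0))
    return c * math.sqrt((n + m) / (n * m))

def same_distribution_not_rejected(sample_x, sample_xp, alpha=0.05):
    d = smirnov_statistic(sample_x, sample_xp)
    return d <= asymptotic_threshold(alpha, len(sample_x), len(sample_xp))

print(smirnov_statistic([1, 2, 3], [4, 5, 6]))   # disjoint samples
print(asymptotic_threshold(0.05, 50, 50))
```

For small samples the asymptotic threshold is only indicative, which is consistent with the asymptotic nature of the underlying theory.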

Other notations
This test is also referred to as the Kolmogorov-Smirnov's test for two samples.

Link with OpenTURNS methodology

This method is used in step B "Quantifying Sources of Uncertainty". It is a tool for the construction of a dataset that can be used afterwards to choose a probability distribution for some uncertain variables defined in step A "Specifying Criteria and the Case Study".
References and theoretical basics
The test deals with the maximum deviation between the two empirical distributions; it is by nature highly sensitive to the presence of local deviations (the hypothesis that two samples are identically distributed may be rejected even if they seem similar over almost the whole domain of variation).

We remind the reader that the underlying theoretical results of the test are asymptotic. There is no rule to determine the minimum number of data values one needs to use this test; but it is often considered a reasonable approximation when $N$ is of an order of a few dozen.

The following bibliographical references provide main starting points for further study of this method:

• Saporta G. (1990). "Probabilités, Analyse de données et Statistique", Technip

• Dixon W.J. & Massey F.J. (1983) "Introduction to statistical analysis (4th ed.)", McGraw-Hill

### 3.3.8 Step B  – Maximum Likelihood Principle

Mathematical description

Goal

This method deals with the parametric modelling of a probability distribution for a random vector $\underline{X}=\left({X}^{1},...,{X}^{{n}_{X}}\right)$. The appropriate probability distribution is found by using a sample of data $\left\{{\underline{x}}_{1},...,{\underline{x}}_{N}\right\}$. Such an approach can be described in two steps as follows:

• Choose a probability distribution (e.g. the Normal distribution, or any other distribution available in OpenTURNS see [standard parametric models] ),

• Find the parameter values $\underline{\theta }$ that characterize the probability distribution (e.g. the mean and standard deviation for the Normal distribution) which best describes the sample $\left\{{\underline{x}}_{1},...,{\underline{x}}_{N}\right\}$.

The maximum likelihood method is used for the second step.

Principle

In the current version of OpenTURNS this method is restricted to the case where ${n}_{X}=1$ and continuous probability distributions. Please note therefore that $\underline{X}={X}^{1}=X$ in the following text. The maximum likelihood estimate (MLE) of $\underline{\theta }$ is defined as the value of $\underline{\theta }$ which maximizes the likelihood function $L\left(X,\underline{\theta }\right)$:

 $\begin{array}{c}\hfill \stackrel{^}{\underline{\theta }}=\mathrm{argmax}\phantom{\rule{4pt}{0ex}}L\left(X,\underline{\theta }\right)\end{array}$

Given that $\left\{{x}_{1},...,{x}_{N}\right\}$ is a sample of independent identically distributed (i.i.d) observations, $L\left({x}_{1},...,{x}_{N},\underline{\theta }\right)$ represents the probability of observing such a sample assuming that they are taken from a probability distribution with parameters $\underline{\theta }$. In concrete terms, the likelihood $L\left({x}_{1},...,{x}_{N},\underline{\theta }\right)$ is calculated as follows:

 $L\left({x}_{1},...,{x}_{N},\underline{\theta }\right)=\prod _{j=1}^{N}{f}_{X}\left({x}_{j};\underline{\theta }\right)$

if the distribution is continuous, with density ${f}_{X}\left(x;\underline{\theta }\right)$.

For example, if we suppose that $X$ follows a Gaussian distribution with parameters $\underline{\theta }=\left\{\mu ,\sigma \right\}$ (i.e. the mean and standard deviation),

 $\begin{array}{ccc}\hfill L\left({x}_{1},...,{x}_{N},\underline{\theta }\right)& =& \prod _{j=1}^{N}\frac{1}{\sigma \sqrt{2\pi }}exp\left[-\frac{1}{2}{\left(\frac{{x}_{j}-\mu }{\sigma }\right)}^{2}\right]\hfill \\ & =& \frac{1}{{\sigma }^{N}{\left(2\pi \right)}^{N/2}}exp\left[-\frac{1}{2{\sigma }^{2}}\sum _{j=1}^{N}{\left({x}_{j}-\mu \right)}^{2}\right]\hfill \end{array}$

The following figure graphically illustrates the maximum likelihood method, in the particular case of a Gaussian probability distribution.

In general, in order to maximize the likelihood function classical optimization algorithms (e.g. gradient type) can be used. The Gaussian distribution case is an exception to this, as the maximum likelihood estimators are obtained analytically:

 $\begin{array}{c}\hfill \stackrel{^}{\mu }=\frac{1}{N}\sum _{i=1}^{N}{x}_{i},\phantom{\rule{4pt}{0ex}}\stackrel{^}{{\sigma }^{2}}=\frac{1}{N}\sum _{i=1}^{N}{\left({x}_{i}-\stackrel{^}{\mu }\right)}^{2}\end{array}$
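The closed-form Gaussian estimators can be checked against the likelihood itself with a short Python sketch. The function names and the small sample are illustrative only. The log-likelihood is used instead of the likelihood, which has the same maximizer and avoids numerical underflow for large $N$.

```python
import math

def gaussian_mle(sample):
    """Closed-form maximum likelihood estimates (mu_hat, sigma_hat)."""
    n = len(sample)
    mu = sum(sample) / n
    var = sum((x - mu) ** 2 for x in sample) / n   # divisor N, not N - 1
    return mu, math.sqrt(var)

def gaussian_log_likelihood(sample, mu, sigma):
    """Logarithm of L(x_1, ..., x_N, theta) for the Gaussian density."""
    n = len(sample)
    return (-n * math.log(sigma) - 0.5 * n * math.log(2.0 * math.pi)
            - sum((x - mu) ** 2 for x in sample) / (2.0 * sigma ** 2))

sample = [1.0, 2.0, 3.0, 4.0, 5.0]
mu_hat, sigma_hat = gaussian_mle(sample)
best = gaussian_log_likelihood(sample, mu_hat, sigma_hat)
print(mu_hat, sigma_hat ** 2, best)

# Moving away from the estimates can only decrease the log-likelihood
for mu, sigma in [(3.1, sigma_hat), (3.0, 1.3), (3.0, 1.5), (2.8, 1.6)]:
    print(mu, sigma, gaussian_log_likelihood(sample, mu, sigma) <= best)
```

For distributions without closed-form estimators, the same log-likelihood function would be passed to a numerical optimizer instead.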

Other notations

-

Link with OpenTURNS methodology

Having specified the variable of interest and having defined a criterion (step A "Specifying Criteria and the Case Study"), the uncertainty of the input variable ${X}^{i}$ must be then quantified in step B. The superscript $i$ is omitted, as only a single component is used here, that is a single unknown variable (or source of uncertainty).

Input:

$\left\{{x}_{1},...,{x}_{N}\right\}$: sample data

Distribution: Distribution type chosen from the proposed continuous 1-dimensional distributions in [standard parametric models]

Output :

$\stackrel{^}{\underline{\theta }}$: maximum likelihood estimate of $\underline{\theta }$

References and theoretical basics
The sample size used in the maximum likelihood method has an effect on the quality of results. In fact:
• as $N$ tends to infinity, asymptotic theory ensures, under certain regularity assumptions on the model, that the MLE is the best possible estimator (its bias tends towards 0, i.e. there is no tendency towards under- or over-estimation, and the uncertainty of $\stackrel{^}{\underline{\theta }}$ is smaller than that of any other unbiased estimation method); in practice, one often considers the asymptotic behaviour to be reached when $N$ is at least a few dozen, even if no theoretical rule can guarantee this.

• if $N$ is smaller, the MLE is still useful but $\stackrel{^}{\underline{\theta }}$ is less robust (uncertainty greater and bias possible).

A more advanced study of the goodness-of-fit of the selected probability distribution with the given sample data is described in [Graphical analysis], [Kolmogorov-Smirnov test], [Cramer-Von Mises test], [Anderson-Darling test] and [BIC criterion].

The following bibliographical references provide main starting points for further study of this method:

• Saporta G. (1990). "Probabilités, Analyse de données et Statistique", Technip

• Dixon W.J. & Massey F.J. (1983) "Introduction to statistical analysis (4th ed.)", McGraw-Hill

### 3.3.9 Step B  – Bayesian Calibration

Mathematical description

Goal

We consider a computer model $h$ (i.e. a deterministic function) to calibrate:

 $\begin{array}{c}\hfill \underline{z}=h\left(\underline{x},{\underline{\theta }}_{h}\right),\end{array}$

where

• $\underline{x}\in {ℝ}^{{d}_{x}}$ is the input vector;

• $\underline{z}\in {ℝ}^{{d}_{z}}$ is the output vector;

• ${\underline{\theta }}_{h}\in {ℝ}^{{d}_{h}}$ are the unknown parameters of $h$ to calibrate.

Our goal here is to estimate ${\underline{\theta }}_{h}$, based on a certain set of $n$ inputs $\left({\underline{x}}^{1},...,{\underline{x}}^{n}\right)$ (an experimental design) and some associated observations $\left({\underline{y}}^{1},...,{\underline{y}}^{n}\right)$ which are regarded as the realizations of some random vectors $\left({\underline{Y}}^{1},...,{\underline{Y}}^{n}\right)$, such that, for all $i$, the distribution of ${\underline{Y}}^{i}$ depends on ${\underline{z}}^{i}=h\left({\underline{x}}^{i},{\underline{\theta }}_{h}\right)$. Typically, ${\underline{Y}}^{i}={\underline{z}}^{i}+{\underline{\epsilon }}^{i}$ where ${\underline{\epsilon }}^{i}$ is a random measurement error.

For the sake of clarity, lower case letters are used for both random variables and realizations in the following (the notation no longer distinguishes the two), as is usual in the Bayesian literature.

In fact, the Bayesian procedure which is implemented makes it possible to infer some unknown parameters $\underline{\theta }\in {ℝ}^{{d}_{\theta }}$ from some data $\underline{\underline{y}}=\left({\underline{y}}^{1},...,{\underline{y}}^{n}\right)$ as soon as the conditional distribution of each ${\underline{y}}^{i}$ given $\underline{\theta }$ is specified. Therefore $\underline{\theta }$ can be made up of some computer model parameters ${\underline{\theta }}_{h}$ together with some other parameters ${\underline{\theta }}_{\epsilon }$: $\underline{\theta }={\left({{\underline{\theta }}_{h}}^{t},{{\underline{\theta }}_{\epsilon }}^{t}\right)}^{t}$. For example, ${\underline{\theta }}_{\epsilon }$ may represent the unknown standard deviation $\sigma$ of an additive centered Gaussian measurement error affecting the data (see the example hereafter). Besides, the procedure can be used to estimate the parameters of a distribution from direct observations (no computer model to calibrate: $\underline{\theta }={\underline{\theta }}_{\epsilon }$).

More formally, the likelihood $L\left(\underline{\underline{y}}|\underline{\theta }\right)$ is defined by, firstly, a family $\left\{{𝒫}_{\underline{w}},\underline{w}\in {ℝ}^{{d}_{w}}\right\}$ of probability distributions parametrized by $\underline{w}$, which is specified in practice by a conditional distribution $f\left(.|\underline{w}\right)$ given $\underline{w}$ ($f$ is a PDF or a probability mass function), and, secondly, a function $g:{ℝ}^{{d}_{\theta }}⟶{ℝ}^{n\phantom{\rule{0.166667em}{0ex}}{d}_{w}}$ such that $g\left(\underline{\theta }\right)={\left({{g}^{1}\left(\underline{\theta }\right)}^{t},...,{{g}^{n}\left(\underline{\theta }\right)}^{t}\right)}^{t}$, which expresses the parameter ${\underline{w}}^{i}$ of the $i$-th observation ${\underline{y}}^{i}\sim f\left(.|{\underline{w}}^{i}\right)$ as a function of $\underline{\theta }$: ${g}^{i}\left(\underline{\theta }\right)={\underline{w}}^{i}$, thus ${\underline{y}}^{i}\sim f\left(.|{g}^{i}\left(\underline{\theta }\right)\right)$ and

 $\begin{array}{c}\hfill L\left(\underline{\underline{y}}|\underline{\theta }\right)=\prod _{i=1}^{n}f\left({\underline{y}}^{i}|{g}^{i}\left(\underline{\theta }\right)\right).\end{array}$

Considering the issue of the calibration of some computer model parameters ${\underline{\theta }}_{h}$, the full statistical model can be seen as a two-level hierarchical model, with a single level of latent variables $\underline{z}$. A classical example is given by the nonlinear Gaussian regression model:

 $\begin{array}{ccc}\hfill {y}_{i}& =\hfill & \hfill h\left({\underline{x}}_{i}|{\underline{\theta }}_{h}\right)+{\epsilon }_{i},\phantom{\rule{4.pt}{0ex}}\text{where}\phantom{\rule{4.pt}{0ex}}{\epsilon }_{i}\stackrel{i.i.d.}{\sim }𝒩\left(0,{\sigma }^{2}\right),\phantom{\rule{1.em}{0ex}}i=1,...,n.\end{array}$

It can be implemented with $f\left(.|{\left(\mu ,\sigma \right)}^{t}\right)$ the PDF of the Gaussian distribution $𝒩\left(\mu ,{\sigma }^{2}\right)$, with ${g}^{i}\left(\underline{\theta }\right)={\left(h\left({\underline{x}}^{i},{\underline{\theta }}_{h}\right),\phantom{\rule{0.222222em}{0ex}}\sigma \right)}^{t}$, and with $\underline{\theta }={\underline{\theta }}_{h}$ if $\sigma$ is considered known, or $\underline{\theta }={\left({{\underline{\theta }}_{h}}^{t},\sigma \right)}^{t}$ if $\sigma$ is unknown.
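The decomposition of the likelihood into a conditional density $f$ and a link function $g$ can be sketched as follows for the nonlinear Gaussian regression model. This is an illustrative sketch only: the computer model `h` (an exponential decay with two parameters) is a hypothetical stand-in, and the names `g`, `log_f` and `log_likelihood` are not OpenTURNS functions.

```python
import math


def h(x, theta_h):
    """Hypothetical computer model used for illustration: a * exp(-b * x)."""
    a, b = theta_h
    return a * math.exp(-b * x)


def g(theta, xs, sigma_known=None):
    """Link function: maps theta to the parameters w^i = (mean_i, sigma) of each observation."""
    if sigma_known is None:
        *theta_h, sigma = theta  # sigma unknown: it is the last component of theta
    else:
        theta_h, sigma = theta, sigma_known  # sigma known: theta = theta_h
    return [(h(x, theta_h), sigma) for x in xs]


def log_f(y, w):
    """Log-PDF of the Gaussian distribution N(mu, sigma^2) with w = (mu, sigma)."""
    mu, sigma = w
    return -math.log(sigma * math.sqrt(2.0 * math.pi)) - 0.5 * ((y - mu) / sigma) ** 2


def log_likelihood(ys, theta, xs, sigma_known=None):
    """Logarithm of L(y | theta): sum over i of log f(y^i | g^i(theta))."""
    return sum(log_f(y, w) for y, w in zip(ys, g(theta, xs, sigma_known)))


xs = [0.0, 1.0, 2.0]
ys = [h(x, (2.0, 0.5)) for x in xs]  # noise-free synthetic observations
ll_unknown_sigma = log_likelihood(ys, (2.0, 0.5, 1.0), xs)
ll_known_sigma = log_likelihood(ys, (2.0, 0.5), xs, sigma_known=1.0)
```

Working with the logarithm of the likelihood rather than the product itself is a common numerical choice, since the product of many small densities may underflow.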

Given a distribution modelling the uncertainty on $\underline{\theta }$ prior to the data, Bayesian inference is used to perform the inference of $\underline{\theta }$, hence the name Bayesian calibration.

Principle

Contrary to the maximum likelihood approach described in [Maximum Likelihood Principle], which provides a single "best estimate" value $\stackrel{^}{\underline{\theta }}$, together with confidence bounds accounting for the uncertainty remaining on the true value $\underline{\theta }$, the Bayesian approach derives a full distribution of possible values for $\underline{\theta }$ given the available data $\underline{\underline{y}}$. Known as the posterior distribution of $\underline{\theta }$ given the data $\underline{\underline{y}}$, its density can be expressed according to Bayes' theorem:

 $\begin{array}{ccc}\hfill \pi \left(\underline{\theta }|\underline{\underline{y}}\right)& =\hfill & \hfill \frac{L\left(\underline{\underline{y}}|\underline{\theta }\right)×\pi \left(\underline{\theta }\right)}{m\left(\underline{\underline{y}}\right)},\end{array}$ (3.3)

where

• $L\left(\underline{\underline{y}}|\underline{\theta }\right)$ is the (data) likelihood;

• $\pi \left(\underline{\theta }\right)$ is the so-called prior distribution of $\underline{\theta }$ (with support $\Theta$), which encodes all possible $\underline{\theta }$ values weighted by their prior probabilities, before consideration of any experimental data (this makes it possible, for instance, to incorporate expert information or known physical constraints on the calibration parameter);

• $m\left(\underline{\underline{y}}\right)$ is the marginal likelihood:

 $\begin{array}{ccc}\hfill m\left(\underline{\underline{y}}\right)& =\hfill & \hfill {\int }_{\underline{\theta }\in \Theta }L\left(\underline{\underline{y}}|\underline{\theta }\right)\pi \left(\underline{\theta }\right)d\underline{\theta },\end{array}$

which is the necessary normalizing constant ensuring that the posterior density integrates to 1.

Except in very simple cases, (3.3) has no closed form. It must therefore be approximated, either using numerical integration when the parameter space dimension ${d}_{\theta }$ is low, or more generally through stochastic sampling techniques known as Markov Chain Monte Carlo (MCMC) methods. See [The Metropolis-Hastings Algorithm].
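When ${d}_{\theta }=1$, the numerical integration route can be sketched with a simple grid and the trapezoidal rule. The example below is an assumption-laden illustration, not the OpenTURNS implementation: it estimates the mean $\theta$ of Gaussian data with known standard deviation under a Gaussian prior, a case chosen because the exact posterior mean is known and can be used as a check. All names and numerical values are hypothetical.

```python
import math


def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def grid_posterior(ys, sigma, prior_mu, prior_sigma, lo, hi, n_points=2001):
    """Posterior density of theta on a regular grid, normalized by the marginal likelihood m(y)."""
    step = (hi - lo) / (n_points - 1)
    thetas = [lo + i * step for i in range(n_points)]
    unnormalized = []
    for theta in thetas:
        likelihood = 1.0
        for y in ys:
            likelihood *= normal_pdf(y, theta, sigma)  # L(y | theta)
        unnormalized.append(likelihood * normal_pdf(theta, prior_mu, prior_sigma))
    # Trapezoidal approximation of the marginal likelihood m(y).
    m = step * (sum(unnormalized) - 0.5 * (unnormalized[0] + unnormalized[-1]))
    posterior = [u / m for u in unnormalized]
    return thetas, posterior, step


ys = [1.2, 0.8, 1.1, 0.9]  # illustrative observations
thetas, posterior, step = grid_posterior(ys, sigma=0.5, prior_mu=0.0, prior_sigma=1.0,
                                         lo=-3.0, hi=4.0)
posterior_mean = step * sum(t * p for t, p in zip(thetas, posterior))
```

The cost of such a grid grows exponentially with ${d}_{\theta }$, which is why sampling techniques are preferred in higher dimension.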

The following bibliographical references provide main starting points for further study of this method:

• Berger, J.O. (1985). "Statistical Decision Theory and Bayesian Analysis", Springer.

• Marin J.M. & Robert C.P. (2007) "Bayesian Core: A Practical Approach to Computational Bayesian Statistics", Springer.

Other notations

-

### 3.3.10 Step B  – The Metropolis-Hastings Algorithm

A rigorous and complete treatment of Markov Chain Monte Carlo sampling is beyond the scope of this section, which provides a short introduction to the Metropolis-Hastings algorithm. In particular, the Metropolis-Hastings algorithm is only introduced hereafter in the context of the simulation of a homogeneous Markov chain (no dynamical adaptation). The reader is invited to refer to the monographs suggested below for further explanations or details.

Mathematical description

Definitions and notation

Markov chain. Considering a $\sigma$-algebra $𝒜$ on $\Omega$, a Markov chain is a process ${\left({X}_{k}\right)}_{k\in ℕ}$ such that

 $\begin{array}{c}\hfill \forall \left(A,{x}_{0},...,{x}_{k-1}\right)\in 𝒜×{\Omega }^{k}\phantom{\rule{1.em}{0ex}}ℙ\left({X}_{k}\in A\phantom{\rule{0.166667em}{0ex}}|\phantom{\rule{0.166667em}{0ex}}{X}_{0}={x}_{0},...,{X}_{k-1}={x}_{k-1}\right)=ℙ\left({X}_{k}\in A\phantom{\rule{0.166667em}{0ex}}|\phantom{\rule{0.166667em}{0ex}}{X}_{k-1}={x}_{k-1}\right).\end{array}$

An example is the random walk for which ${X}_{k}={X}_{k-1}+{\epsilon }_{k}$ where the steps ${\epsilon }_{k}$ are independent and identically distributed.

Transition kernel. A transition kernel on $\left(\Omega ,𝒜\right)$ is a mapping $K:\Omega ×𝒜\to \left[0,1\right]$ such that

• $\forall A\in 𝒜\phantom{\rule{1.em}{0ex}}K\left(.,A\right)$ is measurable;

• $\forall x\in \Omega \phantom{\rule{1.em}{0ex}}K\left(x,.\right)$ is a probability distribution on $\left(\Omega ,𝒜\right)$.

The kernel $K$ has density $k$ if $\forall \left(x,A\right)\in \Omega ×𝒜\phantom{\rule{1.em}{0ex}}K\left(x,A\right)={\int }_{A}\phantom{\rule{0.222222em}{0ex}}k\left(x,y\right)\text{d}y$.

${\left({X}_{k}\right)}_{k\in ℕ}$ is a homogeneous Markov Chain of transition $K$ if $\forall \left(A,x\right)\in 𝒜×\Omega \phantom{\rule{1.em}{0ex}}ℙ\left({X}_{k}\in A|{X}_{k-1}=x\right)=K\left(x,A\right)$.

Some Notations. Let ${\left({X}_{k}\right)}_{k\in ℕ}$ be a homogeneous Markov Chain of transition $K$ on $\left(\Omega ,𝒜\right)$ with initial distribution $\nu$ (that is ${X}_{0}\sim \nu$):

• ${K}_{\nu }$ denotes the probability distribution of the Markov Chain ${\left({X}_{k}\right)}_{k\in ℕ}$;

• $\nu {K}^{k}$ denotes the probability distribution of ${X}_{k}$ (${X}_{k}\sim \nu {K}^{k}$);

• ${K}^{k}$ denotes the mapping defined by ${K}^{k}\left(x,A\right)=ℙ\left({X}_{k}\in A|{X}_{0}=x\right)$ for all $\left(x,A\right)\in \Omega ×𝒜$.

Total variation convergence. A Markov Chain of distribution ${K}_{\nu }$ is said to converge in total variation distance towards the distribution $t$ if

 $\begin{array}{c}\hfill \underset{k\to +\infty }{lim}\underset{A\in 𝒜}{sup}\left|\nu {K}^{k}\left(A\right)-t\left(A\right)\right|=0.\end{array}$

Then the notation used here is $\nu {K}^{k}{\to }_{TV}t$.

Some interesting properties. Let $t$ be a (target) distribution on $\left(\Omega ,𝒜\right)$, then a transition kernel $K$ is said to be:

• $t$-invariant if $tK=t$;

• $t$-irreducible if, $\forall \left(x,A\right)\in \Omega ×𝒜$ such that $t\left(A\right)>0$, $\exists k\in {ℕ}^{*}$ such that ${K}^{k}\left(x,A\right)>0$.

Goal

Markov Chain Monte Carlo techniques make it possible to sample and integrate according to a distribution $t$ which is only known up to a multiplicative constant. This situation is common in Bayesian statistics where the "target" distribution, the posterior one $t\left(\underline{\theta }\right)=\pi \left(\underline{\theta }|\underline{\underline{y}}\right)$, is proportional to the product of prior and likelihood: see equation (3.3).

In particular, given a "target" distribution $t$ and a $t$-irreducible transition kernel $Q$, the Metropolis-Hastings algorithm produces a Markov chain ${\left({X}_{k}\right)}_{k\in ℕ}$ of distribution ${K}_{\nu }$ with the following properties:

• the transition kernel of the Markov chain is $t$-invariant;

• $\nu {K}^{k}{\to }_{TV}t$;

• the Markov chain satisfies the ergodic theorem: let $\phi$ be a real-valued function such that ${𝔼}_{X\sim t}\left[|\phi \left(X\right)|\right]<+\infty$, then, whatever the initial distribution $\nu$ is:

 $\begin{array}{c}\hfill \frac{1}{n}\sum _{k=1}^{n}\phantom{\rule{0.222222em}{0ex}}\phi \left({X}_{k}\right)\underset{n\to +\infty }{\to }{𝔼}_{X\sim t}\left[\phi \left(X\right)\right]\phantom{\rule{4.pt}{0ex}}\text{almost}\phantom{\rule{4.pt}{0ex}}\text{surely}.\end{array}$

In that sense, simulating ${\left({X}_{k}\right)}_{k\in ℕ}$ amounts to sampling according to $t$ and can be used to integrate with respect to the probability measure $t$. Let us remark that the ergodic theorem implies that $\frac{1}{n}\sum _{k=1}^{n}\phantom{\rule{0.222222em}{0ex}}{\mathbf{1}}_{A}\left({X}_{k}\right)\underset{n\to +\infty }{\to }{ℙ}_{X\sim t}\left(X\in A\right)$ almost surely.

Principle

By abuse of notation, $t\left(x\right)$ represents, in the remainder of this section, a function of $x$ which is proportional to the PDF of the target distribution $t$. Given a transition kernel $Q$ of density $q$, the scheme of the Metropolis-Hastings algorithm is the following (lower case letters are used hereafter for both random variables and realizations, as is usual in the Bayesian literature):

0) Draw ${x}_{0}\sim \nu$ and set $k=1$.

1) Draw a candidate for ${x}_{k}$ according to the given transition kernel $Q$: $\stackrel{˜}{x}\sim Q\left({x}_{k-1},.\right)$.

2) Compute the ratio $\rho =\frac{t\left(\stackrel{˜}{x}\right)/q\left({x}_{k-1},\stackrel{˜}{x}\right)}{t\left({x}_{k-1}\right)/q\left(\stackrel{˜}{x},{x}_{k-1}\right)}$.

3) Draw $u\sim 𝒰\left(\left[0,1\right]\right)$; if $u\le \rho$ then set ${x}_{k}=\stackrel{˜}{x}$, otherwise set ${x}_{k}={x}_{k-1}$.

4) Set $k=k+1$ and go back to 1).
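The scheme above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the OpenTURNS implementation: the function `metropolis_hastings` and its arguments are hypothetical names, the ratio $\rho$ is handled on a logarithmic scale for numerical stability, the target is an unnormalized standard Gaussian, the kernel $Q$ is a Gaussian random walk, and the initial distribution $\nu$ is taken as a Dirac mass at ${x}_{0}=0$.

```python
import math
import random


def metropolis_hastings(log_t, draw_candidate, log_q, x0, n_iter, rng):
    """Simulate a chain with target proportional to exp(log_t).

    draw_candidate(x, rng) samples from Q(x, .); log_q(x, y) is log q(x, y).
    Returns the chain and the fraction of accepted candidates.
    """
    chain = [x0]
    x = x0
    n_accepted = 0
    for _ in range(n_iter):
        candidate = draw_candidate(x, rng)  # step 1
        # Step 2: log of rho = [t(cand) / q(x, cand)] / [t(x) / q(cand, x)].
        log_rho = (log_t(candidate) - log_q(x, candidate)) - (log_t(x) - log_q(candidate, x))
        # Step 3: accept if u <= rho, with u uniform (1 - random() lies in (0, 1]).
        if math.log(1.0 - rng.random()) <= log_rho:
            x = candidate
            n_accepted += 1
        chain.append(x)  # step 4: move to the next index
    return chain, n_accepted / n_iter


rng = random.Random(0)
rw_step = 2.0  # standard deviation of the random walk increments (illustrative value)
chain, acceptance = metropolis_hastings(
    log_t=lambda x: -0.5 * x * x,  # unnormalized N(0, 1) target
    draw_candidate=lambda x, r: x + r.gauss(0.0, rw_step),
    log_q=lambda x, y: -0.5 * ((y - x) / rw_step) ** 2,
    x0=0.0,
    n_iter=20000,
    rng=rng,
)
chain_mean = sum(chain) / len(chain)
chain_var = sum((x - chain_mean) ** 2 for x in chain) / len(chain)
```

By the ergodic theorem, `chain_mean` and `chain_var` approximate the mean and variance of the target. For a symmetric random walk the two `log_q` terms cancel, but they are kept here so that the sketch also covers non-symmetric kernels.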

Of course, if $t$ is replaced by a different function of $x$ which is proportional to it, the algorithm remains unchanged, since $t$ only enters it through the ratio $\frac{t\left(\stackrel{˜}{x}\right)}{t\left({x}_{k-1}\right)}$. Moreover, if $Q$ proposes candidates in a uniform manner (constant density $q$), the candidate $\stackrel{˜}{x}$ is accepted according to a ratio $\rho$ which reduces to the previous "natural" ratio $\frac{t\left(\stackrel{˜}{x}\right)}{t\left({x}_{k-1}\right)}$ of PDF. The introduction of $q$ in the ratio $\rho$ corrects the bias that a non-uniform proposal of candidates would otherwise introduce by favoring some areas of $\Omega$.

The $t$-invariance is ensured by the symmetry of the expression of $\rho$ ($t$-reversibility).

In practice, $Q$ is specified as a random walk ($\exists {q}_{RW}$ such that $q\left(x,y\right)={q}_{RW}\left(x-y\right)$), as an independent sampler ($\exists {q}_{IS}$ such that $q\left(x,y\right)={q}_{IS}\left(y\right)$), or as a mixture of random walk and independent sampling.

The important property the practitioner has to keep in mind when choosing the transition kernel $Q$ is $t$-irreducibility. Moreover, for efficient convergence, $Q$ has to be chosen so as to quickly explore the whole support of $t$ without leading to too small an acceptance ratio (the ratio of accepted candidates $\stackrel{˜}{x}$). It is usually recommended that this ratio be about $0.2$, but such a value is neither a guarantee of efficiency nor a substitute for a convergence diagnosis.

The following bibliographical references provide main starting points for further study of this method:

• Robert, C.P. and Casella, G. (2004). "Monte Carlo Statistical Methods" (Second Edition), Springer.

• Meyn, S. and Tweedie R.L. (2009). "Markov Chains and Stochastic Stability" (Second Edition), Cambridge University Press.

Other notations

-

### 3.3.11 Step B  – Parametric Estimation

Mathematical description

Goal

The objective is to estimate the value of the parameters based on a sample of an unknown distribution, supposed to be a member of a parametric family of distributions. We describe here the estimators implemented in OpenTURNS for the various parametric models. They are all derived from either the maximum likelihood method or the method of moments, except for the bound parameters, which are systematically modified so as to strictly include the extreme realizations of the underlying sample $\left({x}_{1},\cdots ,{x}_{n}\right)$.

We suppose that we have a realization $\left({\underline{x}}_{1},\cdots ,{\underline{x}}_{n}\right)$ of a sample $\left({\underline{X}}_{1},\cdots ,{\underline{X}}_{n}\right)$ of size $n$, with the ${X}_{i}$ being iid, with common distribution $𝒟\left(\underline{\theta }\right)$. The objective is to build an estimator ${\stackrel{^}{\theta }}_{n}$ of $\underline{\theta }$, based on the realization $\left({\underline{x}}_{1},\cdots ,{\underline{x}}_{n}\right)$. We adopt the following notations:

• ${\overline{\underline{x}}}_{n}=\frac{1}{n}{\sum }_{i=1}^{n}{\underline{x}}_{i}$ the sample mean (${\overline{x}}_{n}$ in the 1D case);

• ${\sigma }_{n}=\sqrt{\frac{1}{n-1}{\sum }_{i=1}^{n}{\left({x}_{i}-\overline{x}\right)}^{2}}$ the sample standard deviation in the 1D case;

• ${x}_{\left(1,n\right)}={min}_{i=1,\cdots ,n}{x}_{i}$ the minimum of the realization in the 1D case;

• ${x}_{\left(n,n\right)}={max}_{i=1,\cdots ,n}{x}_{i}$ the maximum of the realization in the 1D case;

• ${x}_{1/2}$ the median of the sample in the 1D case;

Continuous univariate distributions:

| Distribution | Estimator |
|---|---|
| Arcsine | $\begin{array}{c}\stackrel{^}{\mu }={\stackrel{^}{\mu }}_{x}\hfill \\ \stackrel{^}{\sigma }={\stackrel{^}{\sigma }}_{x}\hfill \end{array}$ |
| Beta | $\begin{array}{c}{\stackrel{^}{a}}_{n}=\left(1-\mathrm{sign}\left({x}_{\left(1,n\right)}\right)/\left(2+n\right)\right){x}_{\left(1,n\right)}\hfill \\ {\stackrel{^}{b}}_{n}=\left(1+\mathrm{sign}\left({x}_{\left(n,n\right)}\right)/\left(2+n\right)\right){x}_{\left(n,n\right)}\hfill \\ {\stackrel{^}{t}}_{n}=\frac{\left({\stackrel{^}{b}}_{n}-{\overline{x}}_{n}\right)\left({\overline{x}}_{n}-{\stackrel{^}{a}}_{n}\right)}{{\left({\sigma }_{n}^{X}\right)}^{2}}-1\hfill \\ {\stackrel{^}{r}}_{n}=\frac{{\stackrel{^}{t}}_{n}\left({\overline{x}}_{n}-{\stackrel{^}{a}}_{n}\right)}{{\stackrel{^}{b}}_{n}-{\stackrel{^}{a}}_{n}}\hfill \end{array}$ |
| Burr | ${\stackrel{^}{c}}_{n}$ is the solution of the nonlinear equation $1+\frac{c}{n}\left[SR-\frac{n}{{\sum }_{i=1}^{n}\log \left(1+{x}_{i}^{c}\right)}SSR\right]=0$ where $SR=\sum _{i=1}^{n}\frac{\log \left({x}_{i}\right)}{1+{x}_{i}^{c}}$ and $SSR=\sum _{i=1}^{n}\frac{{x}_{i}^{c}\log \left({x}_{i}\right)}{1+{x}_{i}^{c}}$. Then ${\stackrel{^}{k}}_{n}=\frac{n}{{\sum }_{i=1}^{n}\log \left(1+{x}_{i}^{c}\right)}$. |
| Chi | ${\stackrel{^}{\nu }}_{n}={\overline{{x}^{2}}}_{n}$ |
| ChiSquare | ${\stackrel{^}{\nu }}_{n}={\overline{x}}_{n}$ |
| Dirichlet | Maximum likelihood estimators, according to the reference J. Huang |
| Epanechnikov | No parameter to estimate |
| Exponential | $\begin{array}{c}{\stackrel{^}{\gamma }}_{n}=\left(1-\mathrm{sign}\left({x}_{\left(1,n\right)}\right)/\left(2+n\right)\right){x}_{\left(1,n\right)}\hfill \\ {\stackrel{^}{\lambda }}_{n}=\frac{1}{{\overline{x}}_{n}-{\stackrel{^}{\gamma }}_{n}}\hfill \end{array}$ |
| Fisher-Snedecor | No factory method implemented so far |
| Gamma | $\begin{array}{c}{\stackrel{^}{\gamma }}_{n}=\left(1-\mathrm{sign}\left({x}_{\left(1,n\right)}\right)/\left(2+n\right)\right){x}_{\left(1,n\right)}\hfill \\ {\stackrel{^}{\lambda }}_{n}=\frac{{\overline{x}}_{n}-{\stackrel{^}{\gamma }}_{n}}{{\left({\sigma }_{n}^{X}\right)}^{2}}\hfill \\ {\stackrel{^}{k}}_{n}=\frac{{\left({\overline{x}}_{n}-{\stackrel{^}{\gamma }}_{n}\right)}^{2}}{{\left({\sigma }_{n}^{X}\right)}^{2}}\hfill \end{array}$ |
| Generalized Pareto | See details below |
| Gumbel | $\begin{array}{c}{\stackrel{^}{\alpha }}_{n}=\frac{\pi }{{\sigma }_{n}^{X}\sqrt{6}}\hfill \\ {\stackrel{^}{\beta }}_{n}={\overline{x}}_{n}-\frac{\gamma \sqrt{6}}{\pi }{\sigma }_{n}^{X}\hfill \end{array}$ where $\gamma \simeq 0.57721$ is Euler's constant |
| Histogram | The bandwidth is the AMISE-optimal one: $h=\frac{{\left(24\sqrt{\pi }\right)}^{1/3}{\sigma }_{n}}{{n}^{1/3}}$ where ${\sigma }_{n}^{2}$ is the unbiased variance of the data. The range is $\left[\min \left(data\right),\max \left(data\right)\right]$. |
| Inverse ChiSquare | No factory method implemented so far |
| Inverse Gamma | No factory method implemented so far |
| Inverse Normal | $\begin{array}{c}{\stackrel{^}{\mu }}_{n}={\overline{x}}_{n}\hfill \\ {\stackrel{^}{\lambda }}_{n}={\left(\frac{1}{n}\sum _{i=1}^{n}\frac{1}{{x}_{i}}-\frac{1}{{\overline{x}}_{n}}\right)}^{-1}\hfill \end{array}$ |
| Laplace | $\begin{array}{c}{\stackrel{^}{\mu }}_{n}={x}_{1/2}\hfill \\ {\stackrel{^}{\lambda }}_{n}=\frac{1}{n}\sum _{i=1}^{n}\left\vert {x}_{i}-{\stackrel{^}{\mu }}_{n}\right\vert \hfill \end{array}$ |
| Logistic | $\begin{array}{c}\stackrel{^}{\alpha }={\overline{x}}_{n}\hfill \\ {\stackrel{^}{\beta }}_{n}={\sigma }_{n}^{X}\hfill \end{array}$ |
| LogNormal | See details below |
| LogUniform | $\begin{array}{c}{\stackrel{^}{a}}_{n}=\left(1-1/\left(2+n\right)\right){x}_{\left(1,n\right)}\hfill \\ {\stackrel{^}{b}}_{n}=\left(1+1/\left(2+n\right)\right){x}_{\left(n,n\right)}\hfill \end{array}$ |
| Meixner | Moments method. See details below. |
| Non Central Chi Square | No factory method implemented so far |
| Non Central Student | No factory method implemented so far |
| Normal | Maximum likelihood estimators |
| Normal Gamma | No factory method implemented so far |
| Rayleigh | $\begin{array}{c}{\stackrel{^}{\gamma }}_{n}=\left(1-\mathrm{sign}\left({x}_{\left(1,n\right)}\right)/\left(2+n\right)\right){x}_{\left(1,n\right)}\hfill \\ {\stackrel{^}{\sigma }}_{n}=\sqrt{\frac{2}{n}{\sum }_{i=1}^{n}{\left({x}_{i}-{\stackrel{^}{\gamma }}_{n}\right)}^{2}}\hfill \end{array}$ |
| Rice | Moments estimators, according to the reference C.G. Koay |
| Student (1d) | Moments estimators |
| Trapezoidal | Numerical resolution of maximum likelihood estimators |
| Triangular | $\begin{array}{c}{\stackrel{^}{a}}_{n}=\left(1-\mathrm{sign}\left({x}_{\left(1,n\right)}\right)/\left(2+n\right)\right){x}_{\left(1,n\right)}\hfill \\ {\stackrel{^}{b}}_{n}=\left(1+\mathrm{sign}\left({x}_{\left(n,n\right)}\right)/\left(2+n\right)\right){x}_{\left(n,n\right)}\hfill \\ {\stackrel{^}{m}}_{n}=3{\overline{x}}_{n}-{\stackrel{^}{a}}_{n}-{\stackrel{^}{b}}_{n}\hfill \end{array}$ |
| TruncatedNormal | Numerical maximum likelihood estimation |
| Uniform | $\begin{array}{c}{\stackrel{^}{a}}_{n}=\left(1-\mathrm{sign}\left({x}_{\left(1,n\right)}\right)/\left(2+n\right)\right){x}_{\left(1,n\right)}\hfill \\ {\stackrel{^}{b}}_{n}=\left(1+\mathrm{sign}\left({x}_{\left(n,n\right)}\right)/\left(2+n\right)\right){x}_{\left(n,n\right)}\hfill \end{array}$ |
| Weibull | $\begin{array}{c}{\stackrel{^}{\gamma }}_{n}=\left(1-\mathrm{sign}\left({x}_{\left(1,n\right)}\right)/\left(2+n\right)\right){x}_{\left(1,n\right)}\hfill \\ \left({\stackrel{^}{\alpha }}_{n},{\stackrel{^}{\beta }}_{n}\right)\phantom{\rule{4.pt}{0ex}}\text{solution}\phantom{\rule{4.pt}{0ex}}\text{of}\phantom{\rule{4.pt}{0ex}}\left\{\begin{array}{c}{\overline{x}}_{n}={\stackrel{^}{\gamma }}_{n}+{\stackrel{^}{\alpha }}_{n}\phantom{\rule{0.166667em}{0ex}}\Gamma \left(1+1/{\stackrel{^}{\beta }}_{n}\right)\hfill \\ {\left({\sigma }_{n}^{X}\right)}^{2}={\stackrel{^}{\alpha }}_{n}^{2}\left(\Gamma \left(1+2/{\stackrel{^}{\beta }}_{n}\right)-\Gamma {\left(1+1/{\stackrel{^}{\beta }}_{n}\right)}^{2}\right)\hfill \end{array}\right.\hfill \end{array}$ |
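The bound estimators shared by several of the distributions above (Beta, Triangular, Uniform, and the location parameter of Exponential, Gamma, Rayleigh and Weibull) can be sketched as follows. This is a plain Python illustration with hypothetical function names, not the OpenTURNS factory code; it only transcribes the formulas of the table for the Uniform and Triangular cases.

```python
def sign(x):
    """Sign of x: -1, 0 or 1."""
    return (x > 0) - (x < 0)


def lower_bound(sample):
    """Estimated lower bound: the sample minimum pushed away by a factor 1 / (2 + n)."""
    n = len(sample)
    x_min = min(sample)
    return (1.0 - sign(x_min) / (2.0 + n)) * x_min


def upper_bound(sample):
    """Estimated upper bound: the sample maximum pushed away by a factor 1 / (2 + n)."""
    n = len(sample)
    x_max = max(sample)
    return (1.0 + sign(x_max) / (2.0 + n)) * x_max


def uniform_estimate(sample):
    return lower_bound(sample), upper_bound(sample)


def triangular_estimate(sample):
    a, b = lower_bound(sample), upper_bound(sample)
    mean = sum(sample) / len(sample)
    return a, 3.0 * mean - a - b, b  # (a, m, b) with the mode m from the sample mean


sample = [-2.0, 1.0, 3.0, 0.5]  # illustrative data only
a_hat, b_hat = uniform_estimate(sample)
```

The role of the sign term is to enlarge the interval whatever the sign of the extreme value: a negative minimum is multiplied by a factor greater than 1, a positive minimum by a factor smaller than 1. Note that the formula leaves an extreme value equal to 0 unchanged.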

Details for the Generalized Pareto distribution :

OpenTURNS implements three parametric estimation methods: the classical moments method, the exponential regression method and the probability weighted moments method, according to the reference G. Matthys & J. Beirlant. The default strategy is to use the probability weighted moments method when the sample size is smaller than the threshold defined in the ResourceMap object (key GeneralizedParetoFactory-SmallSize); if this method fails, the exponential regression method is used instead. When the sample size exceeds this threshold, the exponential regression method is used directly. The classical moments method is available but is not used by default.

Details for the LogNormal distribution :

We note :

 $\begin{array}{ccc}\hfill {S}_{0}& =& \sum _{i=1}^{n}\frac{1}{{x}_{i}-\gamma }\hfill \\ \hfill {S}_{1}& =& \sum _{i=1}^{n}\log \left({x}_{i}-\gamma \right)\hfill \\ \hfill {S}_{2}& =& \sum _{i=1}^{n}{\log }^{2}\left({x}_{i}-\gamma \right)\hfill \\ \hfill {S}_{3}& =& \sum _{i=1}^{n}\frac{\log \left({x}_{i}-\gamma \right)}{{x}_{i}-\gamma }\hfill \end{array}$

These sums depend on $\gamma$ and are written ${S}_{k}\left(\gamma \right)$ below. OpenTURNS first tries to evaluate the parameters using the Local Maximum Likelihood based estimators of $\left({\mu }_{\ell },{\sigma }_{\ell },\gamma \right)$ defined by:

 $\begin{array}{ccc}\hfill {\stackrel{^}{\mu }}_{\ell ,n}& =& \frac{{S}_{1}\left(\stackrel{^}{\gamma }\right)}{n}\hfill \\ \hfill {\stackrel{^}{\sigma }}_{\ell ,n}^{2}& =& \frac{{S}_{2}\left(\stackrel{^}{\gamma }\right)}{n}-{\stackrel{^}{\mu }}_{\ell ,n}^{2}\hfill \end{array}$

Thus, ${\stackrel{^}{\gamma }}_{n}$ satisfies the relation:

 ${S}_{0}\left(\gamma \right)\left({S}_{2}\left(\gamma \right)-{S}_{1}\left(\gamma \right)\left(1+\frac{{S}_{1}\left(\gamma \right)}{n}\right)\right)+n{S}_{3}\left(\gamma \right)=0$ (96)

under the constraint $\gamma \le \min {x}_{i}$.

OpenTURNS tries to solve (96) by the step doubling bracketing method followed by the bisection method. Once ${\stackrel{^}{\gamma }}_{n}$ is evaluated, $\left({\stackrel{^}{\mu }}_{\ell ,n},{\stackrel{^}{\sigma }}_{\ell ,n}\right)$ are evaluated from their expressions in terms of ${S}_{1}\left(\stackrel{^}{\gamma }\right)$ and ${S}_{2}\left(\stackrel{^}{\gamma }\right)$ given above.

If the resolution of (96) is not possible, OpenTURNS sends a message to the user and evaluates the parameters from the Modified Moments based estimators, using ${\overline{x}}_{n}$, ${\sigma }_{n}^{2}$ and the additional modified moment equation:

 $𝔼\left[log\left({X}_{\left(1\right)}-\gamma \right)\right]=log\left({x}_{\left(1\right)}-\gamma \right)$ (97)

The quantity $E{Z}_{1}\left(n\right)=\frac{𝔼\left[log\left({X}_{\left(1\right)}-\gamma \right)\right]-{\mu }_{\ell }}{{\sigma }_{\ell }}$ is the mean of the first order statistics of a standard normal sample of size $n$. We have the following relation :

 $E{Z}_{1}\left(n\right)={\int }_{ℝ}nz\varphi \left(z\right){\left(1-\Phi \left(z\right)\right)}^{n-1}\phantom{\rule{0.166667em}{0ex}}\mathrm{d}z$ (98)
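The integral (98) has no simple closed form for general $n$, but it is easy to approximate numerically. The sketch below uses a trapezoidal rule on a truncated interval; the function name `ez1`, the truncation at 12 standard deviations and the number of points are illustrative assumptions, not the quadrature used by OpenTURNS.

```python
import math


def standard_normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def standard_normal_survival(z):
    """1 - Phi(z), computed with erfc to avoid cancellation for large z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def ez1(n, half_width=12.0, n_points=24001):
    """Mean of the smallest order statistic of a standard normal sample of size n."""
    step = 2.0 * half_width / (n_points - 1)
    total = 0.0
    for i in range(n_points):
        z = -half_width + i * step
        weight = 0.5 if i in (0, n_points - 1) else 1.0  # trapezoidal end weights
        total += weight * n * z * standard_normal_pdf(z) * standard_normal_survival(z) ** (n - 1)
    return total * step


ez1_of_2 = ez1(2)
ez1_of_10 = ez1(10)
```

For $n=2$ the exact value is $-1/\sqrt{\pi }$, which provides a convenient check of the quadrature; the value decreases as $n$ grows, since the minimum of a larger sample lies further in the lower tail.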

where $\varphi$ and $\Phi$ are the PDF and CDF of the standard normal distribution.

The estimator ${\stackrel{^}{\omega }}_{n}$ of $\omega ={e}^{{\sigma }_{\ell }^{2}}$ is obtained as solution of :

 $\omega \left(\omega -1\right)-{\kappa }_{n}{\left[\sqrt{\omega }-{e}^{E{Z}_{1}\left(n\right)\sqrt{log\omega }}\right]}^{2}=0$ (99)

where ${\kappa }_{n}=\frac{{s}_{n}^{2}}{{\left({\overline{x}}_{n}-{x}_{\left(1\right)}\right)}^{2}}$.

Then $\left({\stackrel{^}{{\mu }_{\ell }}}_{n},{\stackrel{^}{{\sigma }_{\ell }}}_{n},{\stackrel{^}{\gamma }}_{n}\right)$ are evaluated from:

 $\begin{array}{cc}\hfill {\stackrel{^}{{\mu }_{\ell }}}_{n}=& \log {\stackrel{^}{\beta }}_{n}\hfill \\ \hfill {\stackrel{^}{{\sigma }_{\ell }}}_{n}=& \sqrt{\log {\stackrel{^}{\omega }}_{n}}\hfill \\ \hfill {\stackrel{^}{\gamma }}_{n}=& {\overline{x}}_{n}-{\stackrel{^}{\beta }}_{n}\sqrt{{\stackrel{^}{\omega }}_{n}}\hfill \end{array}$

where ${\stackrel{^}{\beta }}_{n}=\frac{{s}_{n}}{\sqrt{{\stackrel{^}{\omega }}_{n}\left({\stackrel{^}{\omega }}_{n}-1\right)}}$.

If the resolution of (99) is not possible, OpenTURNS sends a message to the user and evaluates the parameters from the Moments based estimators, which are always defined.

The estimator ${\stackrel{^}{\omega }}_{n}$ of $\omega ={e}^{{\sigma }_{\ell }^{2}}$ is the positive root of:

 ${\omega }^{3}+3{\omega }^{2}-\left(4+{a}_{3,n}^{2}\right)=0$ (101)

which is always defined, ${a}_{3,n}$ denoting the empirical skewness of the sample. Then $\left({\stackrel{^}{{\mu }_{\ell }}}_{n},{\stackrel{^}{{\sigma }_{\ell }}}_{n},{\stackrel{^}{\gamma }}_{n}\right)$ are obtained from the same relations as for the Modified Moments based estimators, which express them in terms of ${\stackrel{^}{\omega }}_{n}$ and ${\stackrel{^}{\beta }}_{n}$.
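The Moments based estimators can be sketched as follows. The cubic (101) is increasing for $\omega >0$ and non-positive at $\omega =1$, so its positive root can be bracketed and found by bisection. This is an illustrative sketch with hypothetical names, assuming that ${a}_{3,n}$ is the sample skewness and that it is strictly positive; it is not the OpenTURNS implementation.

```python
import math


def lognormal_from_moments(mean, std, skewness):
    """Return (mu_l, sigma_l, gamma) from the mean, standard deviation and skewness."""
    if skewness <= 0.0:
        raise ValueError("a strictly positive skewness is assumed in this sketch")
    target = 4.0 + skewness ** 2

    def cubic(w):
        return w ** 3 + 3.0 * w ** 2 - target

    lo, hi = 1.0, 2.0  # cubic(1) = -skewness^2 < 0
    while cubic(hi) < 0.0:  # step doubling until the root is bracketed
        hi *= 2.0
    for _ in range(200):  # bisection
        mid = 0.5 * (lo + hi)
        if cubic(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    omega = 0.5 * (lo + hi)
    beta = std / math.sqrt(omega * (omega - 1.0))
    return math.log(beta), math.sqrt(math.log(omega)), mean - beta * math.sqrt(omega)


# Check on the exact moments of a LogNormal with mu_l = 0, sigma_l = 0.5, gamma = 1.
omega_true = math.exp(0.25)
mu_l, sigma_l, gamma = lognormal_from_moments(
    mean=1.0 + math.exp(0.125),
    std=math.sqrt(omega_true * (omega_true - 1.0)),
    skewness=(omega_true + 2.0) * math.sqrt(omega_true - 1.0),
)
```

In practice the three inputs would be replaced by their empirical counterparts computed from the sample.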

Details for the Meixner distribution :

We use the following estimators:

 $\begin{array}{cc}\hfill {\stackrel{^}{{\gamma }_{1}}}_{n}=& \frac{\frac{1}{n}{\sum }_{i=1}^{n}{\left({x}_{i}-{\stackrel{^}{x}}_{n}\right)}^{3}}{{\stackrel{^}{\sigma }}_{n}^{3}}\hfill \\ \hfill {\stackrel{^}{{\gamma }_{2}}}_{n}=& \frac{\frac{1}{n}{\sum }_{i=1}^{n}{\left({x}_{i}-{\stackrel{^}{x}}_{n}\right)}^{4}}{{\stackrel{^}{\sigma }}_{n}^{4}}\hfill \\ \hfill {\stackrel{^}{\delta }}_{n}=& \frac{1}{{\stackrel{^}{{\gamma }_{2}}}_{n}-{\stackrel{^}{{\gamma }_{1}}}_{n}^{2}-3}\hfill \\ \hfill {\stackrel{^}{\beta }}_{n}=& \mathrm{sign}\left({\stackrel{^}{{\gamma }_{1}}}_{n}\right)\arccos \left(2-{\stackrel{^}{\delta }}_{n}\left({\stackrel{^}{{\gamma }_{2}}}_{n}-3\right)\right)\hfill \\ \hfill {\stackrel{^}{\alpha }}_{n}=& {\left({\stackrel{^}{\sigma }}_{n}^{2}\left(\cos {\stackrel{^}{\beta }}_{n}+1\right)\right)}^{1/3}\hfill \end{array}$

These estimators, built from the empirical skewness ${\stackrel{^}{{\gamma }_{1}}}_{n}$ and kurtosis ${\stackrel{^}{{\gamma }_{2}}}_{n}$, are defined if ${\stackrel{^}{{\gamma }_{2}}}_{n}\ge 2{\stackrel{^}{{\gamma }_{1}}}_{n}+3$.

Continuous multivariate distributions:

| Distribution | Estimator |
|---|---|
| Dirichlet | Maximum likelihood estimators |
| Normal | $\begin{array}{c}{\stackrel{^}{\underline{\mu }}}_{n}={\overline{\underline{x}}}_{n}\hfill \\ {\stackrel{^}{\mathrm{Cov}}}_{n}=\frac{1}{n-1}\sum _{i=1}^{n}\left({\underline{X}}_{i}-{\stackrel{^}{\underline{\mu }}}_{n}\right){\left({\underline{X}}_{i}-{\stackrel{^}{\underline{\mu }}}_{n}\right)}^{t}\hfill \end{array}$ |
| Student | Not yet implemented |

Discrete univariate distributions :

| Distribution | Estimator |
|---|---|
| Bernoulli | ${\stackrel{^}{p}}_{n}={\overline{x}}_{n}$ |
| Binomial | See details below. |
| Dirac | ${\stackrel{^}{point}}_{n}={x}_{1}$ |
| Geometric | ${\stackrel{^}{p}}_{n}=\frac{1}{{\overline{x}}_{n}}$ |
| Multinomial | Data: $\left({\underline{x}}^{1},\cdots ,{\underline{x}}^{n}\right)$; $\begin{array}{c}N={\max }_{i,k}\phantom{\rule{0.166667em}{0ex}}{x}_{i}^{k}\hfill \\ {p}_{i}=\frac{1}{nN}\sum _{k=1}^{n}{x}_{i}^{k}\hfill \end{array}$ |
| Negative Binomial | Data: $\left({\underline{x}}^{1},\cdots ,{\underline{x}}^{n}\right)$; ${\stackrel{^}{p}}_{n}=\frac{{\overline{x}}_{n}}{{\stackrel{^}{r}}_{n}+{\overline{x}}_{n}}$, with ${\stackrel{^}{r}}_{n}$ solution of $n\left(\log \left(\frac{{\stackrel{^}{r}}_{n}}{{\stackrel{^}{r}}_{n}+{\overline{x}}_{n}}\right)-\psi \left({\stackrel{^}{r}}_{n}\right)\right)+\sum _{i=1}^{n}\psi \left({x}^{i}+{\stackrel{^}{r}}_{n}\right)=0$. The resolution is done using Brent's method. |
| Poisson | ${\stackrel{^}{\lambda }}_{n}={\overline{x}}_{n}$ |
| Skellam | Moments estimators: see details below. |
| UserDefined | Uniform distribution over the sample. |

Details for the Binomial distribution :

We initialize the value of $\left(n,{p}_{n}\right)$ to $\left(⌈\frac{{\stackrel{^}{x}}_{n}^{2}}{{\stackrel{^}{x}}_{n}-{\stackrel{^}{\sigma }}_{n}^{2}}⌉,\frac{{\stackrel{^}{x}}_{n}}{n}\right)$ where ${\stackrel{^}{x}}_{n}$ is the empirical mean of the sample $\left({x}_{1},\cdots ,{x}_{n}\right)$, and ${\stackrel{^}{\sigma }}_{n}^{2}$ its unbiased empirical variance.

Then, we evaluate the likelihood of the sample with respect to the Binomial distribution parameterized with $\left(⌈\frac{{\stackrel{^}{x}}_{n}^{2}}{{\stackrel{^}{x}}_{n}-{\stackrel{^}{\sigma }}_{n}^{2}}⌉,\frac{{\stackrel{^}{x}}_{n}}{n}\right)$. By testing successively $n+1$ and $n-1$ instead of $n$, we determine the variation of the likelihood of the sample with respect to the Binomial distribution parameterized with $\left(n+1,{p}_{n+1}\right)$ and $\left(n-1,{p}_{n-1}\right)$. We then iterate in the direction that makes the likelihood increase, until the likelihood stops increasing. The last couple is the one selected.

Details for the Skellam distribution :

The estimators of $\left({\lambda }_{1},{\lambda }_{2}\right)$ are:

 $\begin{array}{c}\hfill \begin{array}{ccc}{\stackrel{^}{{\lambda }_{1}}}_{n}\hfill & =& \frac{1}{2}\left({\stackrel{^}{\sigma }}_{n}^{2}+{\stackrel{^}{x}}_{n}\right)\hfill \\ {\stackrel{^}{{\lambda }_{2}}}_{n}\hfill & =& \frac{1}{2}\left({\stackrel{^}{\sigma }}_{n}^{2}-{\stackrel{^}{x}}_{n}\right)\hfill \end{array}\end{array}$
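These moments estimators follow from the fact that a Skellam distribution with parameters $\left({\lambda }_{1},{\lambda }_{2}\right)$ has mean ${\lambda }_{1}-{\lambda }_{2}$ and variance ${\lambda }_{1}+{\lambda }_{2}$, so that matching the empirical mean and variance gives half their sum and half their difference. A minimal sketch is given below; the function name is hypothetical and the use of the unbiased empirical variance is an assumption, since the text does not specify which variance estimator is meant here.

```python
def skellam_moments(sample):
    """Moments estimators (lambda1_hat, lambda2_hat) of a Skellam distribution."""
    n = len(sample)
    mean = sum(sample) / n
    # Assumption: unbiased empirical variance (division by n - 1).
    variance = sum((x - mean) ** 2 for x in sample) / (n - 1)
    lambda1 = 0.5 * (variance + mean)
    lambda2 = 0.5 * (variance - mean)
    return lambda1, lambda2


sample = [-1, 0, 2, 3, 1]  # illustrative integer-valued data
lambda1_hat, lambda2_hat = skellam_moments(sample)
```

On this sample the empirical mean is 1 and the unbiased variance is 2.5. Note that ${\stackrel{^}{{\lambda }_{2}}}_{n}$ becomes negative, hence invalid, whenever the empirical variance is smaller than the empirical mean.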

Discrete multivariate distributions:

• Dirac: ${\stackrel{^}{point}}_{n}={\underline{x}}_{1}$

• Multinomial: maximum likelihood estimators.

• UserDefined: uniform distribution over the sample.

Copula distributions :

We denote by ${\stackrel{^}{\tau }}_{n}$ the Kendall $\tau$ of the sample and by ${\stackrel{^}{\rho }}_{n}$ its Spearman correlation coefficient. AMH is the Ali-Mikhail-Haq copula and FGM the Farlie-Gumbel-Morgenstern one.

• AMH: ${\stackrel{^}{\theta }}_{n}$ is the solution of ${\stackrel{^}{\tau }}_{n}=\frac{3\theta -2}{3\theta }-\frac{2{\left(1-\theta \right)}^{2}ln\left(1-\theta \right)}{3{\theta }^{2}}$.

• Clayton: ${\stackrel{^}{\theta }}_{n}=\frac{2{\stackrel{^}{\tau }}_{n}}{1-{\stackrel{^}{\tau }}_{n}}$.

• FGM: ${\stackrel{^}{\theta }}_{n}=\frac{9}{2}{\stackrel{^}{\tau }}_{n}$ if $|{\stackrel{^}{\theta }}_{n}|<1$. Otherwise, ${\stackrel{^}{\theta }}_{n}=3{\stackrel{^}{\rho }}_{n}$ if $|{\stackrel{^}{\theta }}_{n}|<1$. Otherwise, the estimation is not possible.

• Frank: ${\stackrel{^}{\theta }}_{n}$ is the solution of ${\stackrel{^}{\tau }}_{n}=1-4\left(\frac{1-D\left({\stackrel{^}{\theta }}_{n},1\right)}{{\stackrel{^}{\theta }}_{n}}\right)$ where $D$ is the Debye function defined as $D\left(x,n\right)=\frac{n}{{x}^{n}}{\int }_{0}^{x}\frac{{t}^{n}}{{e}^{t}-1}dt$.

• Gumbel: ${\stackrel{^}{\theta }}_{n}=\frac{1}{1-{\stackrel{^}{\tau }}_{n}}$.

• Normal: the correlation matrix $\underline{\underline{R}}$ is such that ${R}_{ij}=sin\left(\frac{\pi }{2}{\stackrel{^}{\tau }}_{n,ij}\right)$.
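The closed-form inversions (Clayton, Gumbel, Normal) can be illustrated with a short sketch. The function names are hypothetical, the Kendall $\tau$ is computed by direct pair counting under the assumption that the sample contains no ties, and the AMH and Frank cases are left out because they require a numerical root search.

```python
import math


def kendall_tau(xs, ys):
    """Kendall tau by pair counting (assumes no ties in xs or ys)."""
    n = len(xs)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (xs[i] - xs[j]) * (ys[i] - ys[j])
            score += (product > 0) - (product < 0)  # +1 concordant, -1 discordant
    return 2.0 * score / (n * (n - 1))


def copula_parameters_from_tau(tau):
    """Closed-form parameter estimates deduced from the Kendall tau."""
    return {
        "Clayton": 2.0 * tau / (1.0 - tau),
        "Gumbel": 1.0 / (1.0 - tau),
        "Normal": math.sin(0.5 * math.pi * tau),  # off-diagonal term R_12
    }


tau = kendall_tau([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
print(tau, copula_parameters_from_tau(tau))
```

In this toy sample, 8 of the 10 pairs are concordant and 2 are discordant, so the Kendall $\tau$ equals 0.6.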

Other notations

Link with OpenTURNS methodology

When the amount of data is sufficient, parametric estimation may be used within Step B "Quantification of Uncertainties" to model the uncertainty of some input random vectors or the output random vector.
References and theoretical basics
The following bibliographical references provide main starting points for further study of this method:
• Huang J., "Maximum Likelihood Estimation of Dirichlet Distribution Parameters".

• Koay C.G., Basser P.J., "Analytically exact correction scheme for signal extraction from noisy magnitude MR signals", Journal of Magnetic Resonance 179, 317-322, 2006.

• Matthys G. & Beirlant J., "Estimating the extreme value index and high quantiles with exponential regression models", Statistica Sinica, 13, 850-880, 2003.

• Saporta G. (1990). "Probabilités, Analyse de données et Statistique", Technip.

• Dixon W.J. & Massey F.J. (1983) "Introduction to statistical analysis (4th ed.)", McGraw-Hill.

Examples

-

### 3.3.12 Step B  – Graphical goodness-of-fit tests: QQ-plot, Kendall plot and Henry's line

Mathematical description

Goal

This method deals with the modelling of a probability distribution of a random vector $\underline{X}=\left({X}^{1},...,{X}^{{n}_{X}}\right)$. It seeks to verify the compatibility between a sample of data $\left\{{\underline{x}}_{1},{\underline{x}}_{2},...,{\underline{x}}_{N}\right\}$ and a previously chosen candidate probability distribution. OpenTURNS enables the use of graphical tools to answer this question in the one-dimensional case ${n}_{X}=1$, and with a continuous distribution.

Principle

The QQ-plot and Henry's line tests are defined in the case ${n}_{X}=1$. Thus we denote $\underline{X}={X}^{1}=X$. The first graphical tool provided by OpenTURNS is a QQ-plot (where "QQ" stands for "quantile-quantile"). In the specific case of a Normal distribution (see [standard parametric models] ), Henry's line may also be used.

QQ-plot

A QQ-Plot is based on the notion of quantile. The $\alpha$-quantile ${q}_{X}\left(\alpha \right)$ of $X$, where $\alpha \in \left(0,1\right)$, is defined as follows:

 $\begin{array}{c}\hfill ℙ\left(X\le {q}_{X}\left(\alpha \right)\right)=\alpha \end{array}$

If a sample $\left\{{x}_{1},...,{x}_{N}\right\}$ of $X$ is available, the quantile can be estimated empirically:

1. the sample $\left\{{x}_{1},...,{x}_{N}\right\}$ is first placed in ascending order, which gives the sample $\left\{{x}_{\left(1\right)},...,{x}_{\left(N\right)}\right\}$;

2. then, an estimate of the $\alpha$-quantile is:

 $\begin{array}{c}\hfill {\stackrel{^}{q}}_{X}\left(\alpha \right)={x}_{\left(\left[N\alpha \right]+1\right)}\end{array}$

where $\left[N\alpha \right]$ denotes the integer part of $N\alpha$.

Thus, the ${j}^{\mathrm{th}}$ smallest value of the sample ${x}_{\left(j\right)}$ is an estimate ${\stackrel{^}{q}}_{X}\left(\alpha \right)$ of the $\alpha$-quantile where $\alpha =\left(j-1\right)/N$ ($1<j\le N$).

Let us then consider the candidate probability distribution being tested, and let us denote by $F$ its cumulative distribution function. An estimate of the $\alpha$-quantile can be also computed from $F$:

 $\begin{array}{c}\hfill {\stackrel{^}{q}}_{X}^{\text{'}}\left(\alpha \right)={F}^{-1}\left(\left(j-1\right)/N\right)\end{array}$

If $F$ is really the cumulative distribution function of $X$, then ${\stackrel{^}{q}}_{X}\left(\alpha \right)$ and ${\stackrel{^}{q}}_{X}^{\text{'}}\left(\alpha \right)$ should be close. Thus, graphically, the points $\left\{\left({\stackrel{^}{q}}_{X}\left(\alpha \right),{\stackrel{^}{q}}_{X}^{\text{'}}\left(\alpha \right)\right),\phantom{\rule{4pt}{0ex}}\alpha =\left(j-1\right)/N,\phantom{\rule{4pt}{0ex}}1<j\le N\right\}$ should be close to the diagonal.
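The construction of the QQ-plot points can be sketched in a few lines. This is an illustration only, not the OpenTURNS implementation: the function name `qq_points` is hypothetical, and the standard Normal candidate is an assumption chosen because its quantile function is available in the Python standard library. The index $j$ runs over $1<j\le N$ so that $\alpha =\left(j-1\right)/N$ stays strictly between 0 and 1.

```python
from statistics import NormalDist


def qq_points(sample, inverse_cdf):
    """Pairs (x_(j), F^{-1}((j - 1) / N)) for 1 < j <= N."""
    ordered = sorted(sample)
    n = len(ordered)
    return [(ordered[j - 1], inverse_cdf((j - 1) / n)) for j in range(2, n + 1)]


# Candidate distribution: standard Normal (assumption made for this example).
points = qq_points([0.3, -1.2, 0.8, -0.1, 1.5], NormalDist(0.0, 1.0).inv_cdf)
for empirical, theoretical in points:
    print(f"{empirical:6.2f}  {theoretical:6.2f}")
```

Plotting the first coordinate against the second and comparing with the diagonal gives the QQ-plot; both coordinates are expressed in the unit of $X$.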

The following figure illustrates the principle of a QQ-plot with a sample of size $N=50$. Note that the unit of the two axes is that of the variable $X$ studied; the quantiles determined via $F$ are called here "value of $T$". In this example, the points remain close to the diagonal and the hypothesis "$F$ is the cumulative distribution function of $X$" does not seem irrelevant, even if a more quantitative analysis (see for instance [Kolmogorov-Smirnov goodness-of-fit test] ) should be carried out to confirm this.

In this second example, the candidate distribution function is clearly irrelevant.

Henry's line

This second graphical tool is only relevant if the candidate distribution function being tested is Gaussian. It also uses the ordered sample $\left\{{x}_{\left(1\right)},...,{x}_{\left(N\right)}\right\}$ introduced for the QQ-plot, and the empirical cumulative distribution function ${\stackrel{^}{F}}_{N}$ presented in [empirical cumulative distribution function] .

By definition,

 $\begin{array}{c}\hfill {x}_{\left(j\right)}={\stackrel{^}{F}}_{N}^{-1}\left(\frac{j}{N}\right)\end{array}$

Then, let us denote by $\Phi$ the cumulative distribution function of a Normal distribution with mean 0 and standard deviation 1. The quantity ${t}_{\left(j\right)}$ is defined as follows:

 $\begin{array}{c}\hfill {t}_{\left(j\right)}={\Phi }^{-1}\left(\frac{j}{N}\right)\end{array}$

If $X$ is distributed according to a normal probability distribution with mean $\mu$ and standard deviation $\sigma$, then the points $\left\{\left({x}_{\left(j\right)},{t}_{\left(j\right)}\right),\phantom{\rule{4pt}{0ex}}1\le j\le N\right\}$ should be close to the line defined by $t=\left(x-\mu \right)/\sigma$. This comes from a property of the normal distribution: if the distribution of $X$ is really $𝒩\left(\mu ,\sigma \right)$, then the distribution of $\left(X-\mu \right)/\sigma$ is $𝒩\left(0,1\right)$.
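A minimal sketch of the points used for Henry's line is given below, with a hypothetical function name. One practical detail is an assumption of this sketch rather than something stated above: the last index $j=N$ is skipped, because ${\Phi }^{-1}\left(1\right)$ is infinite and cannot be drawn.

```python
from statistics import NormalDist


def henry_points(sample):
    """Pairs (x_(j), Phi^{-1}(j / N)) for 1 <= j < N (j = N gives an infinite value)."""
    ordered = sorted(sample)
    n = len(ordered)
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    return [(ordered[j - 1], standard_normal.inv_cdf(j / n)) for j in range(1, n)]


for x, t in henry_points([2.0, 1.0, 4.0, 3.0]):
    print(f"{x:5.2f}  {t:6.3f}")
```

If the sample is Gaussian, a straight line fitted through these points has slope $1/\sigma$ and crosses $t=0$ at $x=\mu$, which gives a quick graphical reading of both parameters.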

The following figure illustrates the principle of Henry's graphical test with a sample of size $N=50$. Note that only the unit of the horizontal axis is that of the variable $X$ studied. In this example, the points remain close to a line and the hypothesis "the distribution function of $X$ is a Gaussian one" does not seem irrelevant, even if a more quantitative analysis (see for instance [Kolmogorov-Smirnov goodness-of-fit test] ) should be carried out to confirm this.

In this second example, the hypothesis of a Gaussian distribution seems far less relevant because of the behaviour for small values of $X$.

Kendall plot

In the bivariate case, the Kendall plot test makes it possible to validate the choice of a specific copula model or to verify that two samples share the same copula model.

Let $\underline{X}$ be a bivariate random vector whose copula is denoted by $C$.

Let ${\left({\underline{X}}^{i}\right)}_{1\le i\le N}$ be a sample of $\underline{X}$.

We define:

 $\begin{array}{c}\hfill \forall 1\le i\le N,\ {H}_{i}=\frac{1}{N-1}Card\left\{j\in \left[1,N\right],j\ne i\,|\,{x}_{1}^{j}\le {x}_{1}^{i}\ \text{and}\ {x}_{2}^{j}\le {x}_{2}^{i}\right\}\end{array}$

and $\left({H}_{\left(1\right)},\cdots ,{H}_{\left(N\right)}\right)$ the ordered statistics of $\left({H}_{1},\cdots ,{H}_{N}\right)$.

The statistic ${W}_{i}$ is defined by :

 ${W}_{i}=N{C}_{N-1}^{i-1}{\int }_{0}^{1}t{K}_{0}{\left(t\right)}^{i-1}{\left(1-{K}_{0}\left(t\right)\right)}^{N-i}\,d{K}_{0}\left(t\right)$ (103)

where ${K}_{0}\left(t\right)$ is the cumulative distribution function of ${H}_{i}$. We can show that this is the cumulative distribution function of the random variate $C\left(U,V\right)$ when $U$ and $V$ are independent and follow $Uniform\left(0,1\right)$ distributions.

In OpenTURNS 0.15.0, Eq. (103) is evaluated with the Monte Carlo sampling method: OpenTURNS generates $n$ samples of size $N$ from the bivariate copula $C$, in order to obtain $n$ realisations of the statistics ${H}_{\left(i\right)},\forall 1\le i\le N$, and hence an estimate of ${W}_{i}=E\left[{H}_{\left(i\right)}\right],\forall i\le N$.

When testing a specific copula with respect to a sample, the Kendall Plot test draws the points ${\left({W}_{i},{H}_{\left(i\right)}\right)}_{1\le i\le N}$. If the points are on the first diagonal, the copula model is validated.
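The two ingredients of the Kendall plot, the statistics ${H}_{i}$ and the Monte Carlo estimate of ${W}_{i}=E\left[{H}_{\left(i\right)}\right]$, can be sketched as follows. All names are hypothetical, and the independent copula is used here only as a stand-in for the candidate copula $C$ because it can be sampled with the standard library alone; this is not the OpenTURNS implementation.

```python
import random


def kendall_h(points):
    """H_i: proportion of the other points that lie below point i in both components."""
    n = len(points)
    return [
        sum(1 for j, (a, b) in enumerate(points) if j != i and a <= x and b <= y) / (n - 1)
        for i, (x, y) in enumerate(points)
    ]


def kendall_w_monte_carlo(copula_sampler, size, replications, rng):
    """Estimate W_i = E[H_(i)] by averaging ordered H statistics over simulated samples."""
    totals = [0.0] * size
    for _ in range(replications):
        simulated = [copula_sampler(rng) for _ in range(size)]
        for i, h in enumerate(sorted(kendall_h(simulated))):
            totals[i] += h
    return [total / replications for total in totals]


def independent_copula(rng):
    """One realisation of the independent copula (stand-in for the candidate C)."""
    return (rng.random(), rng.random())


rng = random.Random(0)
data = [independent_copula(rng) for _ in range(20)]
h_ordered = sorted(kendall_h(data))
w = kendall_w_monte_carlo(independent_copula, size=20, replications=200, rng=rng)
kendall_plot_points = list(zip(w, h_ordered))  # to be compared with the first diagonal
```

The cost of the Monte Carlo step grows with the number of replications times the square of the sample size, since each evaluation of `kendall_h` compares every pair of points.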

When testing whether two samples have the same copula, the Kendall Plot test draws the points ${\left({H}_{\left(i\right)}^{1},{H}_{\left(i\right)}^{2}\right)}_{1\le i\le N}$ respectively associated to the first and second sample. Note that the two samples must have the same size.

In Figures 52 and 53, data 1 and data 2 have been generated from a $Frank\left(1.5\right)$ copula, and data 3 from a $Gumbel\left(4.5\right)$ copula.

The two panels of Figure 52 respectively validate and invalidate the Frank copula model for data 1 and data 2.

The two panels of Figure 53 respectively validate that data 1 and data 2 share the same copula, and show that data 1 and data 3 do not share the same copula.

Figure 52. First panel: the Kendall Plot test validates the use of the Frank copula model for the data 1. Second panel: the Kendall Plot test invalidates the use of the Frank copula model for the data 1.

Figure 53. First panel: the Kendall Plot test validates that data 1 and data 2 have the same copula model. Second panel: the Kendall Plot test invalidates that data 1 and data 3 have the same copula model.

Remark: when testing a sample with respect to a specific copula, if the size of the sample is greater than 500, we recommend using the second form of the Kendall plot test: generate a sample of the same size from the copula and then test both samples against each other. This approach is more efficient.

Other notations

-

Link with OpenTURNS methodology

This method is used in step B "Quantifying Sources of Uncertainty", to verify if the probability distribution is appropriate to describe the uncertainty of a component ${X}^{i}$ of the vector of unknown variables defined in step A "Specifying Criteria and the Case Study". The Kendall Plot is used to validate a copula model.
References and theoretical basics
Since the QQ-plot and Henry's line are graphical analyses, their conclusions remain subjective. The reader is referred to [Kolmogorov-Smirnov test], [Cramer-Von-Mises test], [Anderson-Darling test] for a more quantitative analysis.

The following bibliographical references provide main starting points for further study of this method:

• Saporta G. (1990). "Probabilités, Analyse de données et Statistique", Technip

• Dixon W.J. & Massey F.J. (1983) "Introduction to statistical analysis (4th ed.)", McGraw-Hill

### 3.3.13 Step B  – Chi-squared goodness of fit test

Mathematical description

Goal

This method deals with the modelling of a probability distribution of a random vector $\underline{X}=\left({X}^{1},...,{X}^{{n}_{X}}\right)$. It seeks to verify the compatibility between a sample of data $\left\{{\underline{x}}_{1},{\underline{x}}_{2},...,{\underline{x}}_{N}\right\}$ and a previously chosen candidate probability distribution. OpenTURNS enables the use of the ${\chi }^{2}$ Goodness-of-Fit test to answer this question in the one-dimensional case ${n}_{X}=1$, and with a discrete distribution.

Principle

Let us limit the case to ${n}_{X}=1$. Thus we denote $\underline{X}={X}^{1}=X$. Since we are considering discrete distributions, i.e. those for which the possible values of $X$ belong to a discrete set $ℰ$, the candidate distribution is characterised by the probabilities ${\left\{p\left(x;\underline{\theta }\right)\right\}}_{x\in ℰ}$.

The chi-squared test is based on the fact that if the candidate distribution is appropriate, the number of values in the sample $\left\{{x}_{1},{x}_{2},...,{x}_{N}\right\}$ that are equal to $x$ should be on average equal to $Np\left(x;\underline{\theta }\right)$. The idea is therefore to compare these "theoretical values" with the actual observed values. This comparison is performed with the aid of the following "distance":

 $\begin{array}{c}\hfill {\stackrel{^}{D}}_{N}^{2}=\sum _{x\in {ℰ}_{N}}\frac{{\left(Np\left(x\right)-n\left(x\right)\right)}^{2}}{n\left(x\right)}\end{array}$

where ${ℰ}_{N}$ denotes the elements of $ℰ$ which have been observed at least once in the data sample and where $n\left(x\right)$ denotes the number of data values in the sample that are equal to $x$.
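The distance defined above can be computed directly from the observed counts. The sketch below follows the formula exactly as written in this section (sum over observed values, observed count in the denominator); the function name and the Binomial(2, 0.5) candidate are hypothetical choices made for illustration.

```python
import math
from collections import Counter


def chi_squared_distance(sample, pmf):
    """Sum over observed values x of (N p(x) - n(x))^2 / n(x)."""
    counts = Counter(sample)  # n(x) for every value observed at least once
    size = len(sample)
    return sum((size * pmf(x) - n_x) ** 2 / n_x for x, n_x in counts.items())


def binomial_2_half(x):
    """Probability mass function of Binomial(2, 0.5), the candidate in this example."""
    return math.comb(2, x) * 0.25


distance = chi_squared_distance([0, 1, 1, 1, 1, 1, 2, 2], binomial_2_half)
print(distance)
```

Here the theoretical counts are 2, 4 and 2 while the observed counts are 1, 5 and 2, so the distance is $1/1+1/5+0=1.2$.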

The probability distribution of the distance ${\stackrel{^}{D}}_{N}^{2}$ is asymptotically known (i.e. as the size of the sample tends to infinity), and this asymptotic distribution does not depend on the candidate distribution being tested. If $N$ is sufficiently large, this means that for a probability $\alpha$, one can calculate the threshold (critical value) ${d}_{\alpha }$ such that:

• if ${\stackrel{^}{D}}_{N}^{2}>{d}_{\alpha }$, we reject the candidate distribution with a risk of error $\alpha$,

• if ${\stackrel{^}{D}}_{N}^{2}\le {d}_{\alpha }$, the candidate distribution is considered acceptable.

An important notion is the so-called "$p$-value" of the test. This quantity is equal to the limit error probability ${\alpha }_{\mathrm{lim}}$ under which the candidate distribution is rejected. Thus, the candidate distribution will be accepted if and only if ${\alpha }_{\mathrm{lim}}$ is greater than the value $\alpha$ desired by the user. Note that the higher ${\alpha }_{\mathrm{lim}}-\alpha$, the more robust the decision.

Other notations

Link with OpenTURNS methodology

This method is used in step B "Quantifying Sources of Uncertainty", to verify if the probability distribution is appropriate to describe the uncertainty of a component ${X}^{i}$ of the vector of unknown variables defined in step A "Specifying Criteria and the Case Study".

Input data:

$\left\{{x}_{1},...,{x}_{N}\right\}$ : data sample

Distribution: probability distribution that we are testing for goodness-of-fit

Parameters:

$\alpha$ : Level of significance for the test

Outputs:

Result: Binary variable specifying whether the candidate distribution is rejected (0) or not (1)

${\alpha }_{\mathrm{lim}}$ : $p$-value of the test

References and theoretical basics

The test is suitable for discrete distributions. It cannot be used for continuous distributions except by means of an arbitrary discretisation of possible values of $X$, an important source of potential error. Readers interested in Goodness of Fit tests for continuous variables are referred to [Kolmogorov-Smirnov test] , [Cramer-Von Mises test] , [Anderson-Darling test] in the reference documentation.

Even for discrete distributions, certain precautions must be taken when using this test. Firstly, the critical value ${d}_{\alpha }$ is only valid for a sufficiently large sample size. No rule exists to determine the minimum number of data values necessary in order to use this test; it is often thought, however, that the approximation is reasonable when $N$ is of the order of a few dozen. But whatever the value of $N$, the distance – and similarly the $p$-value – remains a useful tool for comparing different probability distributions to a sample. The distribution which minimizes ${\stackrel{^}{D}}_{N}$ – or maximizes the $p$-value – will be of interest to the analyst.

On the other hand, the calculation of ${d}_{\alpha }$ and of the $p$-value should in theory be modified if we are testing the goodness of fit of a parametric model and if the parameters of the candidate distribution have been estimated from the same sample. The current version of OpenTURNS, however, does not permit such a modification, and so the results must be used with care when the $p$-value ${\alpha }_{\mathrm{lim}}$ and the desired error risk $\alpha$ are very close.

The following bibliographical references provide main starting points for further study of this method:

• Saporta, G. (1990). "Probabilités, Analyse de données et Statistique", Technip

• Dixon, W.J. & Massey, F.J. (1983) "Introduction to statistical analysis (4th ed.)", McGraw-Hill

• D'Agostino, R.B. and Stephens, M.A. (1986). "Goodness-of-Fit Techniques", Marcel Dekker, Inc., New York.

• Bhattacharyya, G.K., and R.A. Johnson, (1997). "Statistical Concepts and Methods", John Wiley and Sons, New York.

• Sprent, P., and Smeeton, N.C. (2001). "Applied Nonparametric Statistical Methods – Third edition", Chapman & Hall

### 3.3.14 Step B  – Kolmogorov-Smirnov goodness-of-fit test

Mathematical description

Goal

This method deals with the modelling of a probability distribution of a random vector $\underline{X}=\left({X}^{1},...,{X}^{{n}_{X}}\right)$. It seeks to verify the compatibility between a sample of data $\left\{{\underline{x}}_{1},{\underline{x}}_{2},...,{\underline{x}}_{N}\right\}$ and a previously chosen candidate probability distribution. OpenTURNS enables the use of the Kolmogorov-Smirnov Goodness-of-Fit test to answer this question in the one-dimensional case ${n}_{X}=1$, and with a continuous distribution.

Principle

Let us limit the case to ${n}_{X}=1$. Thus we denote $\underline{X}={X}^{1}=X$. This goodness-of-fit test is based on the maximum distance between the cumulative distribution function ${\stackrel{^}{F}}_{N}$ of the sample $\left\{{x}_{1},{x}_{2},...,{x}_{N}\right\}$ (see [empirical cumulative distribution function] ) and that of the candidate distribution, denoted $F$. This distance may be expressed as follows:

 $\begin{array}{c}\hfill D=\underset{x}{sup}\left|{\stackrel{^}{F}}_{N}\left(x\right)-F\left(x\right)\right|\end{array}$

With a sample $\left\{{x}_{1},{x}_{2},...,{x}_{N}\right\}$ placed in increasing order $\left\{{x}_{\left(1\right)},...,{x}_{\left(N\right)}\right\}$, the distance is estimated by:

 $\begin{array}{c}\hfill {\stackrel{^}{D}}_{N}=\underset{i=1...N}{max}\ max\left(F\left({x}_{\left(i\right)}\right)-\frac{i-1}{N},\ \frac{i}{N}-F\left({x}_{\left(i\right)}\right)\right)\end{array}$

The probability distribution of the distance ${\stackrel{^}{D}}_{N}$ is asymptotically known (i.e. as the size of the sample tends to infinity). If $N$ is sufficiently large, this means that for a probability $\alpha$ and a candidate distribution type, one can calculate the threshold / critical value ${d}_{\alpha }$ such that:

• if ${\stackrel{^}{D}}_{N}>{d}_{\alpha }$, we reject the candidate distribution with a risk of error $\alpha$,

• if ${\stackrel{^}{D}}_{N}\le {d}_{\alpha }$, the candidate distribution is considered acceptable.

Note that ${d}_{\alpha }$ does not depend on the candidate distribution $F$ being tested, and the test is therefore relevant for any continuous distribution.

An important notion is the so-called "$p$-value" of the test. This quantity is equal to the limit error probability ${\alpha }_{\mathrm{lim}}$ under which the candidate distribution is rejected. Thus, the candidate distribution will be accepted if and only if ${\alpha }_{\mathrm{lim}}$ is greater than the value $\alpha$ desired by the user. Note that the higher ${\alpha }_{\mathrm{lim}}-\alpha$, the more robust the decision.

The diagram below illustrates the principle of comparison with the empirical cumulative distribution function for an ordered sample $\left\{5,6,10,22,27\right\}$; the candidate distribution considered here is the Exponential distribution with parameters $\lambda =0.07$, $\gamma =0$ (see [standard parametric models] ).
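For the ordered sample and the Exponential candidate quoted above, the distance can be checked numerically. This is a sketch with hypothetical function names, assuming the Exponential cumulative distribution function $F\left(x\right)=1-{e}^{-\lambda \left(x-\gamma \right)}$ for $x>\gamma$; the sample is sorted before the estimate is computed.

```python
import math


def ks_distance(sample, cdf):
    """Largest gap between the candidate CDF and the empirical CDF steps."""
    ordered = sorted(sample)
    n = len(ordered)
    return max(
        max(cdf(x) - (i - 1) / n, i / n - cdf(x))
        for i, x in enumerate(ordered, start=1)
    )


def exponential_cdf(x, rate=0.07, shift=0.0):
    """CDF of the Exponential distribution with lambda = rate and gamma = shift."""
    return 1.0 - math.exp(-rate * (x - shift)) if x > shift else 0.0


d_hat = ks_distance([5, 6, 10, 22, 27], exponential_cdf)
print(round(d_hat, 4))
```

With these values the largest gap is found at the smallest observation: $F\left(5\right)\approx 0.295$ while the empirical cumulative distribution function is still 0 just before that point, so ${\stackrel{^}{D}}_{N}\approx 0.295$.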

Other notations
This method is also referred to in the literature as Kolmogorov's Test.

Link with OpenTURNS methodology

This method is used in step B "Quantifying Sources of Uncertainty", to verify if the probability distribution is appropriate to describe the uncertainty of a component ${X}^{i}$ of the vector of unknown variables defined in step A "Specifying Criteria and the Case Study".

Input data:

$\left\{{x}_{1},...,{x}_{N}\right\}$ : data sample

Distribution: probability distribution that we are testing for goodness-of-fit

Parameters:

$\alpha$ : Level of significance for the test

Outputs:

Result: Binary variable specifying whether the candidate distribution is rejected (0) or not (1)

${\alpha }_{\mathrm{lim}}$ : $p$-value of the test

References and theoretical basics
Since the test deals with the maximum deviation between the empirical distribution and the candidate distribution, it is by nature highly sensitive to the presence of local deviations (a candidate distribution may be rejected even if it correctly describes the sample over almost the whole domain of variation).

We remind the reader that the underlying theoretical results of the test are asymptotic. There is no rule to determine the minimum number of data values one needs to use this test; but it is often considered a reasonable approximation when $N$ is of an order of a few dozen. But whatever the value of $N$, the distance – and similarly the $p$-value – remains a useful tool for comparing different probability distributions to a sample. The distribution which minimizes ${\stackrel{^}{D}}_{N}$ – or maximizes the $p$-value – will be of interest to the analyst.

We also point out that the calculation of ${d}_{\alpha }$ should in theory be modified if one is testing the goodness-of-fit to a parametric model where the parameters have been estimated from the same sample. The current version of OpenTURNS does not allow this modification, and the results should therefore be used with caution when the $p$-value ${\alpha }_{\mathrm{lim}}$ and the desired error risk $\alpha$ are very close.

Readers interested in Goodness of Fit tests for continuous distributions are referred to [Cramer-Von Mises test] and [Anderson-Darling test] in the reference documentation.

The following bibliographical references provide main starting points for further study of this method:

• Saporta, G. (1990). "Probabilités, Analyse de données et Statistique", Technip

• Dixon, W.J. & Massey, F.J. (1983) "Introduction to statistical analysis (4th ed.)", McGraw-Hill

• NIST/SEMATECH e-Handbook of Statistical Methods, http://www.itl.nist.gov/div898/handbook/

• D'Agostino, R.B. and Stephens, M.A. (1986). "Goodness-of-Fit Techniques", Marcel Dekker, Inc., New York.

• Bhattacharyya, G.K., and R.A. Johnson, (1997). "Statistical Concepts and Methods", John Wiley and Sons, New York.

• Sprent, P., and Smeeton, N.C. (2001). "Applied Nonparametric Statistical Methods – Third edition", Chapman & Hall

### 3.3.15 Step B  – Cramer-Von Mises goodness-of-fit test

Mathematical description

Objective

This method deals with the modelling of a probability distribution of a random vector $\underline{X}=\left({X}^{1},...,{X}^{{n}_{X}}\right)$. It seeks to verify the compatibility between a sample of data $\left\{{\underline{x}}_{1},{\underline{x}}_{2},...,{\underline{x}}_{N}\right\}$ and a previously chosen candidate probability distribution. OpenTURNS enables the use of the Cramer-von-Mises Goodness-of-Fit test to answer this question in the one-dimensional case ${n}_{X}=1$, and with a continuous distribution. The current version is limited to the case of the Normal distribution.

Principle

Let us limit the case to ${n}_{X}=1$. Thus we denote $\underline{X}={X}^{1}=X$. This goodness-of-fit test is based on the distance between the cumulative distribution function ${\stackrel{^}{F}}_{N}$ of the sample $\left\{{x}_{1},{x}_{2},...,{x}_{N}\right\}$ (see [empirical cumulative distribution function] ) and that of the candidate distribution, denoted $F$. This distance is no longer the maximum deviation as in the [Kolmogorov-Smirnov test] but the distance squared and integrated over the entire variation domain of the distribution:

 $\begin{array}{c}\hfill D={\int }_{-\infty }^{\infty }{\left[F\left(x\right)-{\stackrel{^}{F}}_{N}\left(x\right)\right]}^{2}\phantom{\rule{0.166667em}{0ex}}dF\end{array}$

With a sample $\left\{{x}_{1},{x}_{2},...,{x}_{N}\right\}$ placed in increasing order $\left\{{x}_{\left(1\right)},...,{x}_{\left(N\right)}\right\}$, the distance is estimated by:

 $\begin{array}{c}\hfill {\stackrel{^}{D}}_{N}=\frac{1}{12N}+\sum _{i=1}^{N}{\left[\frac{2i-1}{2N}-F\left({x}_{\left(i\right)}\right)\right]}^{2}\end{array}$
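The estimate can be sketched as follows, with a hypothetical function name. The sample is sorted first, which is an assumption made explicit here because the term $\left(2i-1\right)/\left(2N\right)$ is the mid-height of the $i$-th step of the empirical cumulative distribution function and is only meaningful for ordered values. The standard Normal candidate matches the restriction stated in this section.

```python
from statistics import NormalDist


def cramer_von_mises_distance(sample, cdf):
    """1 / (12 N) + sum over the ordered sample of ((2i - 1) / (2N) - F(x_(i)))^2."""
    ordered = sorted(sample)
    n = len(ordered)
    return 1.0 / (12 * n) + sum(
        ((2 * i - 1) / (2 * n) - cdf(x)) ** 2
        for i, x in enumerate(ordered, start=1)
    )


# Candidate: standard Normal distribution (parameters chosen for illustration only).
d_hat = cramer_von_mises_distance([0.3, -1.2, 0.8, -0.1, 1.5], NormalDist().cdf)
print(d_hat)
```

The smallest possible value is $1/\left(12N\right)$, reached when every $F\left({x}_{\left(i\right)}\right)$ falls exactly on the mid-height $\left(2i-1\right)/\left(2N\right)$.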

The probability distribution of the distance ${\stackrel{^}{D}}_{N}$ is asymptotically known (i.e. as the size of the sample tends to infinity). If $N$ is sufficiently large, this means that for a probability $\alpha$ and a candidate distribution type, one can calculate the threshold / critical value ${d}_{\alpha }$ such that:

• if ${\stackrel{^}{D}}_{N}>{d}_{\alpha }$, we reject the candidate distribution with a risk of error $\alpha$,

• if ${\stackrel{^}{D}}_{N}\le {d}_{\alpha }$, the candidate distribution is considered acceptable.

Note that ${d}_{\alpha }$ depends on the candidate distribution $F$ being tested; the current version of OpenTURNS is limited to the case of the Normal distribution.

An important notion is the so-called "$p$-value" of the test. This quantity is equal to the limit error probability ${\alpha }_{\mathrm{lim}}$ under which the candidate distribution is rejected. Thus, the candidate distribution will be accepted if and only if ${\alpha }_{\mathrm{lim}}$ is greater than the value $\alpha$ desired by the user. Note that the higher ${\alpha }_{\mathrm{lim}}-\alpha$, the more robust the decision.

Other notations

-

Link with OpenTURNS methodology

This method is used in step B "Quantifying Sources of Uncertainty", to verify if the probability distribution is appropriate to describe the uncertainty of a component ${X}^{i}$ of the vector of unknown variables defined in step A "Specifying Criteria and the Case Study".

Input data:

$\left\{{x}_{1},...,{x}_{N}\right\}$ : data sample

Distribution: normal probability distribution that we are testing for goodness-of-fit

Parameters:

$\alpha$ : Level of significance for the test

Outputs:

Result: Binary variable specifying whether the candidate distribution is rejected (0) or not (1)

${\alpha }_{\mathrm{lim}}$ : $p$-value of the test

References and theoretical basics
Since the test concerns the deviation squared and integrated over the entire variation domain, it often appears to be more robust than the Kolmogorov-Smirnov test.

We remind the reader that the underlying theoretical results of the test are asymptotic. There is no rule to determine the minimum number of data values one needs to use this test; but it is often considered a reasonable approximation when $N$ is of an order of a few dozen. But whatever the value of $N$, the distance – and similarly the $p$-value – remains a useful tool for comparing different probability distributions to a sample. The distribution which minimizes ${\stackrel{^}{D}}_{N}$ – or maximizes the $p$-value – will be of interest to the analyst.

We also point out that the calculation of ${d}_{\alpha }$ should in theory be modified if one is testing the goodness-of-fit to a parametric model where the parameters have been estimated from the same sample. The current version of OpenTURNS does not allow this modification, and the results should therefore be used with caution when the $p$-value ${\alpha }_{\mathrm{lim}}$ and the desired error risk $\alpha$ are very close.

Readers interested in Goodness of Fit tests for continuous distributions are referred to [Kolmogorov-Smirnov test] and [Anderson-Darling test] in the reference documentation.

The following bibliographical references provide main starting points for further study of this method:

• Saporta, G. (1990). "Probabilités, Analyse de données et Statistique", Technip

• Dixon, W.J. & Massey, F.J. (1983) "Introduction to statistical analysis (4th ed.)", McGraw-Hill

• D'Agostino, R.B. and Stephens, M.A. (1986). "Goodness-of-Fit Techniques", Marcel Dekker, Inc., New York.

• Bhattacharyya, G.K., and R.A. Johnson, (1997). "Statistical Concepts and Methods", John Wiley and Sons, New York.

• Sprent, P., and Smeeton, N.C. (2001). "Applied Nonparametric Statistical Methods – Third edition", Chapman & Hall

### 3.3.16 Step B  – Anderson-Darling goodness-of-fit test

Mathematical description

Objective

This method deals with the modelling of a probability distribution of a random vector $\underline{X}=\left({X}^{1},...,{X}^{{n}_{X}}\right)$. It seeks to verify the compatibility between a sample of data $\left\{{\underline{x}}_{1},{\underline{x}}_{2},...,{\underline{x}}_{N}\right\}$ and a previously chosen candidate probability distribution. OpenTURNS enables the use of the Anderson-Darling Goodness-of-Fit test to answer this question in the one-dimensional case ${n}_{X}=1$, and with a continuous distribution. The current version is limited to the case of the Normal distribution.

Principle

Let us limit the case to ${n}_{X}=1$. Thus we denote $\underline{X}={X}^{1}=X$. This goodness-of-fit test is based on the distance between the cumulative distribution function ${\stackrel{^}{F}}_{N}$ of the sample $\left\{{x}_{1},{x}_{2},...,{x}_{N}\right\}$ (see [empirical cumulative distribution function] ) and that of the candidate distribution, denoted $F$. This distance is of quadratic type, as in the [Cramer-Von Mises test], but gives more weight to deviations of extreme values:

 $\begin{array}{c}\hfill D={\int }_{-\infty }^{\infty }\frac{{\left[F\left(x\right)-{\stackrel{^}{F}}_{N}\left(x\right)\right]}^{2}}{F\left(x\right)\left(1-F\left(x\right)\right)}\phantom{\rule{0.166667em}{0ex}}dF\left(x\right)\end{array}$

With a sample $\left\{{x}_{1},{x}_{2},...,{x}_{N}\right\}$, the distance is estimated by:

 $\begin{array}{c}\hfill {\stackrel{^}{D}}_{N}=-N-\sum _{i=1}^{N}\frac{2i-1}{N}\left[lnF\left({x}_{\left(i\right)}\right)+ln\left(1-F\left({x}_{\left(N+1-i\right)}\right)\right)\right]\end{array}$

where $\left\{{x}_{\left(1\right)},...,{x}_{\left(N\right)}\right\}$ describes the sample placed in increasing order.
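A minimal sketch of this estimate is given below, with a hypothetical function name. It assumes that $F$ takes values strictly between 0 and 1 at every sample point, otherwise the logarithms are undefined; the standard Normal candidate is chosen for illustration, in line with the restriction stated in this section.

```python
import math
from statistics import NormalDist


def anderson_darling_distance(sample, cdf):
    """-N - sum_i (2i - 1) / N * [ln F(x_(i)) + ln(1 - F(x_(N + 1 - i)))]."""
    ordered = sorted(sample)
    n = len(ordered)
    total = 0.0
    for i in range(1, n + 1):
        low = cdf(ordered[i - 1])   # F(x_(i))
        high = cdf(ordered[n - i])  # F(x_(N + 1 - i))
        total += (2 * i - 1) / n * (math.log(low) + math.log(1.0 - high))
    return -n - total


d_hat = anderson_darling_distance([0.3, -1.2, 0.8, -0.1, 1.5], NormalDist().cdf)
print(d_hat)
```

The logarithms grow without bound when $F\left({x}_{\left(i\right)}\right)$ approaches 0 or 1, which is how the statistic gives more weight to both tails than the Cramer-Von Mises distance.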

The probability distribution of the distance ${\stackrel{^}{D}}_{N}$ is asymptotically known (i.e. as the size of the sample tends to infinity). If $N$ is sufficiently large, this means that for a probability $\alpha$ and a candidate distribution type, one can calculate the threshold / critical value ${d}_{\alpha }$ such that:

• if ${\stackrel{^}{D}}_{N}>{d}_{\alpha }$, we reject the candidate distribution with a risk of error $\alpha$,

• if ${\stackrel{^}{D}}_{N}\le {d}_{\alpha }$, the candidate distribution is considered acceptable.

Note that ${d}_{\alpha }$ depends on the candidate distribution $F$ being tested; the current version of OpenTURNS is limited to the case of the Normal distribution.

An important notion is the so-called "$p$-value" of the test. This quantity is equal to the limit error probability ${\alpha }_{\mathrm{lim}}$ under which the candidate distribution is rejected. Thus, the candidate distribution will be accepted if and only if ${\alpha }_{\mathrm{lim}}$ is greater than the value $\alpha$ desired by the user. Note that the higher ${\alpha }_{\mathrm{lim}}-\alpha$, the more robust the decision.

Other notations
-

Link with OpenTURNS methodology

This method is used in step B "Quantifying Sources of Uncertainty", to verify if the probability distribution is appropriate to describe the uncertainty of a component ${X}^{i}$ of the vector of unknown variables defined in step A "Specifying Criteria and the Case Study".

Input data:

$\left\{{x}_{1},...,{x}_{N}\right\}$ : data sample

Distribution: normal probability distribution that we are testing for goodness-of-fit

Parameters:

$\alpha$ : Level of significance for the test

Outputs:

${\stackrel{^}{D}}_{N}$ : Distance between theoretical and empirical values

${d}_{\alpha }$ : Threshold / critical value above which the candidate distribution is rejected

Result: Binary variable specifying whether the candidate distribution is rejected or not

References and theoretical basics
The Anderson-Darling test is theoretically designed to be more sensitive to the quality of fit in the tails of the distribution. A user interested in the extreme values of the source of uncertainty being studied will find this particularly interesting but we stress that both tails of the distribution, upper and lower, will influence the test results.

We remind the reader that the underlying theoretical results of the test are asymptotic. There is no rule to determine the minimum number of data values one needs to use this test, but the asymptotic approximation is often considered reasonable when $N$ is of the order of a few dozen. Whatever the value of $N$, the distance (and similarly the $p$-value) remains a useful tool for comparing different probability distributions to a sample. The distribution which minimizes ${\stackrel{^}{D}}_{N}$ (or maximizes the $p$-value) will be of interest to the analyst.

We also point out that the calculation of ${d}_{\alpha }$ should in theory be modified if one is testing the goodness-of-fit to a parametric model whose parameters have been estimated from the same sample. The current version of OpenTURNS does not allow this modification, and the results should therefore be used with caution when the $p$-value ${\alpha }_{\mathrm{lim}}$ and the desired error risk $\alpha$ are very close.

Readers interested in Goodness of Fit tests for continuous distributions are referred to [Kolmogorov-Smirnov test] and [Cramer-von-Mises test] in the reference documentation.

The following bibliographical references provide main starting points for further study of this method:

• NIST/SEMATECH e-Handbook of Statistical Methods, http://www.itl.nist.gov/div898/handbook/

• D'Agostino, R.B. and Stephens, M.A. (1986). "Goodness-of-Fit Techniques", Marcel Dekker, Inc., New York.

• Sprent, P., and Smeeton, N.C. (2001). "Applied Nonparametric Statistical Methods – Third edition", Chapman & Hall

### 3.3.17 Step B  – Bayesian Information Criterion (BIC)

Mathematical description

Goal

This method deals with the modelling of a probability distribution of a random vector $\underline{X}=\left({X}^{1},...,{X}^{{n}_{X}}\right)$. It seeks to rank various candidate distributions by using a sample of data $\left\{{\underline{x}}_{1},{\underline{x}}_{2},...,{\underline{x}}_{N}\right\}$. OpenTURNS enables the use of the Bayesian Information Criterion (BIC) to answer this question in the one dimensional case ${n}_{X}=1$.

Principle

Let us restrict ourselves to the case ${n}_{X}=1$. Thus we denote $\underline{X}={X}^{1}=X$. Moreover, let us denote by ${ℳ}_{1}$,..., ${ℳ}_{K}$ the parametric models envisaged by the user among the [standard parametric models] . We suppose here that the parameters of these models have been estimated previously by the [maximum likelihood method] on the basis of the sample $\left\{{\underline{x}}_{1},{\underline{x}}_{2},...,{\underline{x}}_{N}\right\}$. We denote by ${L}_{i}$ the maximized likelihood for the model ${ℳ}_{i}$.

By definition of the likelihood, the higher ${L}_{i}$, the better the model describes the sample. However, using the likelihood as a criterion to rank the candidate probability distributions would involve a risk: one would almost always favour complex models involving many parameters. Although such models do provide a large number of degrees of freedom that can be used to fit the sample, one has to keep in mind that complex models may be less robust than simpler models with fewer parameters. Indeed, the limited available information ($N$ data points) does not allow a robust estimation of too many parameters.

The BIC criterion can be used to avoid this problem. The principle is to rank ${ℳ}_{1}$,..., ${ℳ}_{K}$ according to the following quantity:

 $\begin{array}{c}\hfill {\mathrm{BIC}}_{i}=\log \left({L}_{i}\right)-\frac{{p}_{i}}{2}\log \left(N\right)\end{array}$

where ${p}_{i}$ denotes the number of parameters being adjusted for the model ${ℳ}_{i}$ and $N$ the sample size. The larger ${\mathrm{BIC}}_{i}$, the better the model. Note that the idea is to introduce a penalization term that increases with the number of parameters to be estimated. A complex model will then have a good score only if the gain in terms of likelihood is high enough to justify the number of parameters used.
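
The ranking can be sketched as follows. This is a hedged illustration in plain Python, not the OpenTURNS implementation: the helper names, the two candidate models (Normal with two parameters, Exponential with one) and the sample values are assumptions chosen for the example. The maximized log-likelihoods use the standard closed-form maximum likelihood estimates for these two models.

```python
import math


def bic(log_likelihood, n_params, n_points):
    """BIC score with the sign convention of the text: larger is better."""
    return log_likelihood - 0.5 * n_params * math.log(n_points)


def normal_max_log_likelihood(sample):
    """Maximized log-likelihood of a Normal model (mean and variance fitted)."""
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / n  # maximum likelihood variance
    return -0.5 * n * (math.log(2.0 * math.pi * var) + 1.0)


def exponential_max_log_likelihood(sample):
    """Maximized log-likelihood of an Exponential model (rate fitted)."""
    n = len(sample)
    rate = n / sum(sample)  # maximum likelihood rate
    return n * math.log(rate) - rate * sum(sample)


# Hypothetical positive, right-skewed sample.
sample = [0.1, 0.2, 0.3, 0.5, 0.9, 1.4, 2.5, 4.1]
n = len(sample)
scores = {
    "Normal": bic(normal_max_log_likelihood(sample), 2, n),
    "Exponential": bic(exponential_max_log_likelihood(sample), 1, n),
}
ranking = sorted(scores, key=scores.get, reverse=True)  # best model first
print(ranking, scores)
```

On this sample the Exponential model obtains the larger score: it fits the skewed data better and is penalized for one parameter only.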

The term "Bayesian Information Criterion" comes from the interpretation of the quantity ${\mathrm{BIC}}_{i}$. In a Bayesian context, the unknown "true" model may be seen as a random variable. Suppose now that the user does not have any informative prior information on which model is more relevant among ${ℳ}_{1}$,..., ${ℳ}_{K}$; all the models are thus equally likely from the point of view of the user. Then, one can show that ${\mathrm{BIC}}_{i}$ is an approximation of the logarithm of the posterior probability of the model ${ℳ}_{i}$.

Other notations

-

Link with OpenTURNS methodology

This method is used in step B "Quantifying Sources of Uncertainty", to verify if the probability distribution is appropriate to describe the uncertainty of a component ${X}^{i}$ of the vector of unknown variables defined in step A "Specifying Criteria and the Case Study".

References and theoretical basics
Compared to other criteria proposed in the literature for model selection and based on the same idea of penalization (such as the AIC criterion), the BIC criterion tends to favour models with a small number of parameters. Moreover, note that the underlying hypothesis is that the user does not have any significant prior information on which model is more relevant; if such prior information is available (for instance via literature or expert judgement), the BIC criterion becomes less relevant.

Readers interested in other ways to rank candidate models are referred to [Kolmogorov-Smirnov test] , [Cramer-Von Mises test] and [Anderson-Darling test] in the reference documentation.

The following bibliographical references provide main starting points for further study of this method:

• Saporta, G. (1990). "Probabilités, Analyse de données et Statistique", Technip

• Dixon, W.J. & Massey, F.J. (1983) "Introduction to statistical analysis (4th ed.)", McGraw-Hill

• D'Agostino, R.B. and Stephens, M.A. (1986). "Goodness-of-Fit Techniques", Marcel Dekker, Inc., New York.

• Bhattacharyya, G.K., and R.A. Johnson, (1997). "Statistical Concepts and Methods", John Wiley and Sons, New York.

• Burnham, K.P., and Anderson, D.R (2002). "Model Selection and Multimodel Inference: A Practical Information Theoretic Approach", Springer

### 3.3.18 Step B  – Pearson Correlation Coefficient

Mathematical description

Goal

This method deals with the parametric modelling of a probability distribution for a random vector $\underline{X}=\left({X}^{1},...,{X}^{{n}_{X}}\right)$. It aims to measure a type of dependence (here a linear correlation) which may exist between two components ${X}^{i}$ and ${X}^{j}$.

Principle

The Pearson's correlation coefficient ${\rho }_{U,V}$ aims to measure the strength of a linear relationship between two random variables $U$ and $V$. It is defined as follows:

 $\begin{array}{c}\hfill {\rho }_{U,V}=\frac{\mathrm{Cov}\left[U,V\right]}{{\sigma }_{U}{\sigma }_{V}}\end{array}$

where $\mathrm{Cov}\left[U,V\right]=𝔼\left[\left(U-{m}_{U}\right)\left(V-{m}_{V}\right)\right]$, ${m}_{U}=𝔼\left[U\right]$, ${m}_{V}=𝔼\left[V\right]$, ${\sigma }_{U}=\sqrt{\mathrm{Var}\left[U\right]}$ and ${\sigma }_{V}=\sqrt{\mathrm{Var}\left[V\right]}$. If we have a sample made up of a set of $N$ pairs $\left\{\left({u}_{1},{v}_{1}\right),\left({u}_{2},{v}_{2}\right),...,\left({u}_{N},{v}_{N}\right)\right\}$, Pearson's correlation coefficient can be estimated using the formula:

 $\begin{array}{c}\hfill {\stackrel{^}{\rho }}_{U,V}=\frac{\sum _{i=1}^{N}\left({u}_{i}-\overline{u}\right)\left({v}_{i}-\overline{v}\right)}{\sqrt{\sum _{i=1}^{N}{\left({u}_{i}-\overline{u}\right)}^{2}\sum _{i=1}^{N}{\left({v}_{i}-\overline{v}\right)}^{2}}}\end{array}$

where $\overline{u}$ and $\overline{v}$ represent the empirical means of the samples $\left({u}_{1},...,{u}_{N}\right)$ and $\left({v}_{1},...,{v}_{N}\right)$.

Pearson's correlation coefficient takes values between -1 and 1. The closer its absolute value is to 1, the stronger the indication is that a linear relationship exists between variables $U$ and $V$. The sign of Pearson's coefficient indicates if the two variables vary in the same direction (positive coefficient) or in opposite directions (negative coefficient). We note that a correlation coefficient equal to 0 does not necessarily imply the independence of variables $U$ and $V$: this property is in fact theoretically guaranteed only if the pair $\left(U,V\right)$ follows a bivariate Normal distribution (Normal marginals alone are not sufficient). In all other cases, there are two possible situations in the event of a zero Pearson's correlation coefficient:

• the variables $U$ and $V$ are in fact independent,

• or a non-linear relationship exists between $U$ and $V$.
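
The estimator and the zero-correlation caveat can be illustrated with a minimal sketch in plain Python (the function name `pearson` and the data are hypothetical, not part of OpenTURNS). The estimate is the sum of cross-products of deviations divided by the square root of the product of the two sums of squared deviations.

```python
import math


def pearson(u, v):
    """Empirical Pearson correlation coefficient of two paired samples."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    cross = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    ss_u = sum((a - mean_u) ** 2 for a in u)
    ss_v = sum((b - mean_v) ** 2 for b in v)
    return cross / math.sqrt(ss_u * ss_v)


u = [-2.0, -1.0, 0.0, 1.0, 2.0]
linear = [3.0 * x + 1.0 for x in u]  # exact linear relationship
quadratic = [x ** 2 for x in u]      # exact but non-linear relationship

r_linear = pearson(u, linear)
r_quadratic = pearson(u, quadratic)
print(r_linear, r_quadratic)
```

The second pair of samples has a coefficient of zero although one variable is a deterministic function of the other, which is the second situation listed above.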

Other notations

The estimate $\stackrel{^}{\rho }$ of Pearson's correlation coefficient is sometimes denoted by $r$.

Link with OpenTURNS methodology

Pearson's correlation coefficient can be used in step B "Quantifying Sources of Uncertainty". Having defined the vector $\underline{X}$ of input variables in step A "Specifying Criteria and the Case Study", [Pearson's Independence Test] shows how to test for the existence of a linear type of dependency between two components ${X}^{i}$ and ${X}^{j}$. Such a relationship should in fact be taken into account so as not to distort the results of step C "Propagation of Uncertainty".

Pearson's correlation coefficient is also used in step C' "Sensitivity Analysis and Ranking of Sources of Uncertainty". If a propagation of uncertainty with Monte-Carlo simulation (step C, [Mean and Variance Estimation using Standard Monte Carlo] ) has been carried out, [Pearson's Ranking] shows the user how to rank the components of the input vector $\underline{X}$ according to their impact on the uncertainty of a final variable / output variable defined in step A.

References and theoretical basics

Regardless of the method used in step B or step C', we recall that Pearson's coefficient is only useful in measuring a linear relationship between two variables. Readers are referred to the following references:

• Saporta, G. (1990). "Probabilités, Analyse de données et Statistique", Technip

• Dixon, W.J. & Massey, F.J. (1983) "Introduction to statistical analysis (4th ed.)", McGraw-Hill

• Bhattacharyya, G.K., and R.A. Johnson, (1997). "Statistical Concepts and Methods", John Wiley and Sons, New York.

### 3.3.19 Step B  – Pearson's correlation test

Mathematical description

Goal

This method deals with the modelling of a probability distribution of a random vector $\underline{X}=\left({X}^{1},...,{X}^{{n}_{X}}\right)$. It seeks to find a type of dependency (here a linear correlation) which may exist between two components ${X}^{i}$ and ${X}^{j}$.

Principle

Pearson's correlation coefficient ${\rho }_{U,V}$, defined in [Pearson's Coefficient] , measures the strength of a linear relationship between two random variables $U$ and $V$. If we have a sample made up of $N$ pairs $\left\{\left({u}_{1},{v}_{1}\right),\left({u}_{2},{v}_{2}\right),...,\left({u}_{N},{v}_{N}\right)\right\}$, we denote by ${\stackrel{^}{\rho }}_{U,V}$ the estimated coefficient.

Even in the case where two variables $U$ and $V$ have a Pearson's coefficient ${\rho }_{U,V}$ equal to zero, the estimate ${\stackrel{^}{\rho }}_{U,V}$ obtained from the sample may be non-zero: the limited sample size does not provide the perfect image of the real correlation. Pearson's test nevertheless enables one to determine if the value obtained by ${\stackrel{^}{\rho }}_{U,V}$ is significantly different from zero. More precisely, the user first chooses a probability $\alpha$. From this value the critical value ${d}_{\alpha }$ is calculated such that:

• if $\left|{\stackrel{^}{\rho }}_{U,V}\right|>{d}_{\alpha }$, one can conclude that the real Pearson's correlation coefficient ${\rho }_{U,V}$ is not zero; the risk of error in making this assertion is controlled and equal to $\alpha$;

• if $\left|{\stackrel{^}{\rho }}_{U,V}\right|\le {d}_{\alpha }$, there is insufficient evidence to reject the null hypothesis ${\rho }_{U,V}=0$.

An important notion is the so-called "$p$-value" of the test. This quantity is equal to the limit error probability ${\alpha }_{\mathrm{lim}}$, i.e. the smallest risk $\alpha$ at which the null correlation hypothesis is rejected. Thus, Pearson's coefficient is considered non-zero if and only if ${\alpha }_{\mathrm{lim}}$ is smaller than the value $\alpha$ chosen by the user. Note that the larger the gap between ${\alpha }_{\mathrm{lim}}$ and $\alpha$, the more robust the decision.
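
The text does not specify how the critical value or the $p$-value is computed. As a hedged illustration only, the sketch below uses one common large-sample approximation, Fisher's $z$ transform, under which $\mathrm{atanh}\left(\stackrel{^}{\rho }\right)\sqrt{N-3}$ is approximately standard Normal when the true coefficient is zero and the pair is Normal. This is an assumption made for the example and is not necessarily what OpenTURNS implements; the function names are hypothetical. The null hypothesis is rejected when the $p$-value is smaller than $\alpha$.

```python
import math
from statistics import NormalDist


def pearson_test_p_value(r, n):
    """Approximate two-sided p-value for H0: rho = 0 (Fisher z approximation).

    Requires |r| < 1 and n > 3.
    """
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def reject_zero_correlation(r, n, alpha):
    """Return True when the zero-correlation hypothesis is rejected at risk alpha."""
    return pearson_test_p_value(r, n) < alpha


# Hypothetical estimates obtained from samples of size 30.
print(pearson_test_p_value(0.1, 30), reject_zero_correlation(0.1, 30, 0.05))
print(pearson_test_p_value(0.9, 30), reject_zero_correlation(0.9, 30, 0.05))
```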

Other notations

-

Link with OpenTURNS methodology

The Pearson's test is used in step B "Quantifying Sources of Uncertainty". It enables us to verify if a linear type of dependency exists between the two components ${X}^{i}$ and ${X}^{j}$ of the input variable vector $\underline{X}$ defined in step A "Specifying Criteria and the Case Study". Such a relationship should in fact be taken into account to avoid distortion of results in step C "Propagation of Uncertainty".

Input data:

Two samples $\left\{{x}_{1}^{i},...,{x}_{N}^{i}\right\}$ and $\left\{{x}_{1}^{j},...,{x}_{N}^{j}\right\}$ of variables ${X}^{i}$ and ${X}^{j}$, each pair $\left({x}_{k}^{i},{x}_{k}^{j}\right)$ corresponding to a simultaneous sampling of the two variables

Parameters:

a probability $\alpha$ taking values strictly between 0 and 1, defining the risk of permissible decision error (significance level)

Outputs:

Result: Binary variable specifying whether the hypothesis of a correlation coefficient equal to 0 is rejected (0) or not (1)

${\alpha }_{\mathrm{lim}}$ : $p$-value of the test

References and theoretical basics
Certain precautions should be taken when interpreting the results of Pearson's test.

• The underlying theory of the Pearson test assumes in fact that the variables ${X}^{i}$ and ${X}^{j}$ are both normally distributed. In all other cases, the decision produced by the test is only valid if the sample size $N$ is sufficiently large (in practice, $N$ of at least a few dozen, even if there is no theoretical result that enables us to prove that asymptotic behaviour has been attained).

• Still considering the case of distributions other than the Normal distribution, whatever the value of $N$, we recall that ${\rho }_{{X}^{i},{X}^{j}}=0$ does not enable us to conclude that ${X}^{i}$ and ${X}^{j}$ are independent (see [Pearson's Correlation Coefficient] ).

• More generally, the numerical value of Pearson's correlation coefficient can only be interpreted when the two variables studied ${X}^{i}$ and ${X}^{j}$ are related in a linear way; the scatter plot of points $\left\{\left({x}_{1}^{i},{x}_{1}^{j}\right),...,\left({x}_{N}^{i},{x}_{N}^{j}\right)\right\}$ provides some indication concerning the validity of this hypothesis.

The following pages describe methods which enable us to test the hypothesis of the Normal distribution using the available samples $\left\{{x}_{1}^{i},...,{x}_{N}^{i}\right\}$ and $\left\{{x}_{1}^{j},...,{x}_{N}^{j}\right\}$: [Kolmogorov-Smirnov Goodness of Fit Test] , [Cramer-von Mises Goodness of Fit Test] , [Anderson-Darling Goodness of Fit Test] .

Outside the validity domain of Pearson's test (i.e. linear relationship, Normal distributions), [Spearman's test] provides some answers.

The following bibliographical references provide main starting points for further study of this method:

• Saporta, G. (1990). "Probabilités, Analyse de données et Statistique", Technip

• Dixon, W.J. & Massey, F.J. (1983) "Introduction to statistical analysis (4th ed.)", McGraw-Hill

• Bhattacharyya, G.K., and R.A. Johnson, (1997). "Statistical Concepts and Methods", John Wiley and Sons, New York.

### 3.3.20 Step B  – Spearman correlation coefficient

Mathematical description

Goal

This method deals with the parametric modelling of a probability distribution for a random vector $\underline{X}=\left({X}^{1},...,{X}^{{n}_{X}}\right)$. It aims to measure a type of dependence (here a monotonic correlation) which may exist between two components ${X}^{i}$ and ${X}^{j}$.

Principle

The Spearman's correlation coefficient ${\rho }_{U,V}^{S}$ aims to measure the strength of a monotonic relationship between two random variables $U$ and $V$. It is in fact equivalent to the Pearson's correlation coefficient after having transformed $U$ and $V$ to linearize any monotonic relationship (remember that Pearson's correlation coefficient may only be used to measure the strength of linear relationships, see [Pearson's Correlation Coefficient] ):

 $\begin{array}{c}\hfill {\rho }_{U,V}^{S}={\rho }_{{F}_{U}\left(U\right),{F}_{V}\left(V\right)}\end{array}$

where ${F}_{U}$ and ${F}_{V}$ denote the cumulative distribution functions of $U$ and $V$.

If we have a sample made up of $N$ pairs $\left\{\left({u}_{1},{v}_{1}\right),\left({u}_{2},{v}_{2}\right),...,\left({u}_{N},{v}_{N}\right)\right\}$, the estimation of Spearman's correlation coefficient first of all requires a ranking of the two samples $\left({u}_{1},...,{u}_{N}\right)$ and $\left({v}_{1},...,{v}_{N}\right)$. The rank ${u}_{\left[i\right]}$ of the observation ${u}_{i}$ is defined as the position of ${u}_{i}$ in the sample reordered in ascending order: if ${u}_{i}$ is the smallest value in the sample $\left({u}_{1},...,{u}_{N}\right)$, its rank equals 1; if ${u}_{i}$ is the second smallest value in the sample, its rank equals 2, and so forth. The ranking transformation is a procedure that takes the sample $\left({u}_{1},...,{u}_{N}\right)$ as input data and produces the sample $\left({u}_{\left[1\right]},...,{u}_{\left[N\right]}\right)$ as an output result.

For example, let us consider the sample $\left({u}_{1},{u}_{2},{u}_{3},{u}_{4}\right)=\left(1.5,0.7,5.1,4.3\right)$. We therefore have $\left({u}_{\left[1\right]},{u}_{\left[2\right]},{u}_{\left[3\right]},{u}_{\left[4\right]}\right)=\left(2,1,4,3\right)$. ${u}_{1}=1.5$ is in fact the second smallest value in the original sample, ${u}_{2}=0.7$ the smallest, etc.
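
The ranking transformation of this example can be reproduced with a short sketch in plain Python. The function name `ranks` is hypothetical, and the sketch assumes that all values in the sample are distinct (ties are not handled).

```python
def ranks(sample):
    """Return the rank of each observation (1 = smallest), assuming no ties."""
    # Indices of the observations, sorted by increasing value.
    order = sorted(range(len(sample)), key=lambda k: sample[k])
    result = [0] * len(sample)
    for position, k in enumerate(order, start=1):
        result[k] = position
    return result


print(ranks([1.5, 0.7, 5.1, 4.3]))  # [2, 1, 4, 3]
```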

The estimation of Spearman's correlation coefficient is therefore equal to Pearson's coefficient estimated with the aid of the $N$ pairs $\left({u}_{\left[1\right]},{v}_{\left[1\right]}\right)$, $\left({u}_{\left[2\right]},{v}_{\left[2\right]}\right)$, ..., $\left({u}_{\left[N\right]},{v}_{\left[N\right]}\right)$:

 $\begin{array}{c}\hfill {\stackrel{^}{\rho }}_{U,V}^{S}=\frac{\sum _{i=1}^{N}\left({u}_{\left[i\right]}-{\overline{u}}_{\left[\right]}\right)\left({v}_{\left[i\right]}-{\overline{v}}_{\left[\right]}\right)}{\sqrt{\sum _{i=1}^{N}{\left({u}_{\left[i\right]}-{\overline{u}}_{\left[\right]}\right)}^{2}\sum _{i=1}^{N}{\left({v}_{\left[i\right]}-{\overline{v}}_{\left[\right]}\right)}^{2}}}\end{array}$

where ${\overline{u}}_{\left[\right]}$ and ${\overline{v}}_{\left[\right]}$ represent the empirical means of the samples $\left({u}_{\left[1\right]},...,{u}_{\left[N\right]}\right)$ and $\left({v}_{\left[1\right]},...,{v}_{\left[N\right]}\right)$.

The Spearman's correlation coefficient takes values between -1 and 1. The closer its absolute value is to 1, the stronger the indication is that a monotonic relationship exists between variables $U$ and $V$. The sign of Spearman's coefficient indicates if the two variables increase or decrease in the same direction (positive coefficient) or in opposite directions (negative coefficient). We note that a correlation coefficient equal to 0 does not necessarily imply the independence of variables $U$ and $V$. There are two possible situations in the event of a zero Spearman's correlation coefficient:

• the variables $U$ and $V$ are in fact independent,

• or a non-monotonic relationship exists between $U$ and $V$.
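
The following minimal sketch in plain Python estimates Spearman's coefficient as Pearson's coefficient computed on the ranks. The function names and data are hypothetical. Tied values are given the average of their positions, which is a common convention assumed here; the text above does not say how ties are handled.

```python
import math


def average_ranks(sample):
    """Ranks starting at 1; tied values share the average of their positions."""
    order = sorted(range(len(sample)), key=lambda k: sample[k])
    result = [0.0] * len(sample)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value equals the current one.
        while end + 1 < len(order) and sample[order[end + 1]] == sample[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for k in order[start:end + 1]:
            result[k] = shared
        start = end + 1
    return result


def pearson(u, v):
    """Empirical Pearson correlation coefficient of two paired samples."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    cross = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    ss_u = sum((a - mean_u) ** 2 for a in u)
    ss_v = sum((b - mean_v) ** 2 for b in v)
    return cross / math.sqrt(ss_u * ss_v)


def spearman(u, v):
    """Spearman's coefficient: Pearson's coefficient of the rank samples."""
    return pearson(average_ranks(u), average_ranks(v))


u = [-2.0, -1.0, 0.0, 1.0, 2.0]
rho_cubic = spearman(u, [x ** 3 for x in u])      # monotonic, non-linear
rho_quadratic = spearman(u, [x ** 2 for x in u])  # non-monotonic
print(rho_cubic, rho_quadratic)
```

The cubic relationship is monotonic, so the coefficient equals 1 even though the relationship is not linear. The quadratic relationship is not monotonic and yields a coefficient of zero despite the deterministic dependence.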

Other notations

Spearman's coefficient is often referred to as the rank correlation coefficient.

Link with OpenTURNS methodology

Spearman's correlation coefficient can be used in step B "Quantifying Sources of Uncertainty". Having defined the vector $\underline{X}$ of input variables in step A "Specifying Criteria and the Case Study", [Spearman's Independence Test] shows how to test for the existence of a monotonic type of dependency between two components ${X}^{i}$ and ${X}^{j}$. Such a relationship should in fact be taken into account so as not to distort the results of step C "Propagation of Uncertainty".

Spearman's correlation coefficient is also used in step C' "Sensitivity Analysis and Ranking of Sources of Uncertainty". If a propagation of uncertainty with Monte-Carlo simulation (step C, [Mean and Variance Estimation using Standard Monte Carlo] ) has been carried out, [Spearman's Ranking] shows the user how to rank the components of the input vector $\underline{X}$ according to their impact on the uncertainty of a final variable / output variable defined in step A.

References and theoretical basics
Regardless of the method used in step B or step C', we recall that Spearman's coefficient is only useful in measuring a monotonic relationship between two variables. Readers are referred to the following references:

• Saporta, G. (1990). "Probabilités, Analyse de données et Statistique", Technip

• Dixon, W.J. & Massey, F.J. (1983) "Introduction to statistical analysis (4th ed.)", McGraw-Hill

• Bhattacharyya, G.K., and R.A. Johnson, (1997). "Statistical Concepts and Methods", John Wiley and Sons, New York.

• Sprent, P., and Smeeton, N.C. (2001). "Applied Nonparametric Statistical Methods – Third edition", Chapman & Hall

### 3.3.21 Step B  – Spearman correlation test

Mathematical description

Goal

This method deals with the modelling of a probability distribution of a random vector $\underline{X}=\left({X}^{1},...,{X}^{{n}_{X}}\right)$. It seeks to find a type of dependency (here a monotonic correlation) which may exist between two components ${X}^{i}$ and ${X}^{j}$.

Principle

Spearman's correlation coefficient ${\rho }_{U,V}^{S}$, defined in [Spearman's Coefficient] , measures the strength of a monotonic relationship between two random variables $U$ and $V$. If we have a sample made up of $N$ pairs $\left\{\left({u}_{1},{v}_{1}\right),\left({u}_{2},{v}_{2}\right),...,\left({u}_{N},{v}_{N}\right)\right\}$, we denote by ${\stackrel{^}{\rho }}_{U,V}^{S}$ the estimated coefficient.

Even in the case where two variables $U$ and $V$ have a Spearman's coefficient ${\rho }_{U,V}^{S}$ equal to zero, the estimate ${\stackrel{^}{\rho }}_{U,V}^{S}$ obtained from the sample may be non-zero: the limited sample size does not provide a perfect image of the real correlation. Spearman's test nevertheless enables one to determine if the value obtained for ${\stackrel{^}{\rho }}_{U,V}^{S}$ is significantly different from zero. More precisely, the user first chooses a probability $\alpha$. From this value the critical value ${d}_{\alpha }$ is calculated automatically such that:

• if $\left|{\stackrel{^}{\rho }}_{U,V}^{S}\right|>{d}_{\alpha }$, one can conclude that the real Spearman's correlation coefficient ${\rho }_{U,V}^{S}$ is not zero; the risk of error in making this assertion is controlled and equal to $\alpha$;

• if $\left|{\stackrel{^}{\rho }}_{U,V}^{S}\right|\le {d}_{\alpha }$, there is insufficient evidence to reject the null hypothesis ${\rho }_{U,V}^{S}=0$.

An important notion is the so-called "$p$-value" of the test. This quantity is equal to the limit error probability ${\alpha }_{\mathrm{lim}}$, i.e. the smallest risk $\alpha$ at which the null correlation hypothesis is rejected. Thus, Spearman's coefficient is considered non-zero if and only if ${\alpha }_{\mathrm{lim}}$ is smaller than the value $\alpha$ chosen by the user. Note that the larger the gap between ${\alpha }_{\mathrm{lim}}$ and $\alpha$, the more robust the decision.

Other notations

-

Link with OpenTURNS methodology

Spearman's test is used in step B "Quantifying Sources of Uncertainty". It enables us to verify if a monotonic type of dependency exists between the two components ${X}^{i}$ and ${X}^{j}$ of the input variable vector $\underline{X}$ defined in step A "Specifying Criteria and the Case Study". Such a relationship should in fact be taken into account to avoid distortion of results in step C "Propagation of Uncertainty".

Input data:

Two samples $\left\{{x}_{1}^{i},...,{x}_{N}^{i}\right\}$ and $\left\{{x}_{1}^{j},...,{x}_{N}^{j}\right\}$ of variables ${X}^{i}$ and ${X}^{j}$, each pair $\left({x}_{k}^{i},{x}_{k}^{j}\right)$ corresponding to a simultaneous sampling of the two variables

Parameters:

a probability $\alpha$ taking values strictly between 0 and 1, defining the risk of permissible decision error (significance level)

Outputs:

Result: Binary variable specifying whether the hypothesis of a correlation coefficient equal to 0 is rejected (0) or not (1)

${\alpha }_{\mathrm{lim}}$ : $p$-value of the test

References and theoretical basics
Certain precautions should be taken when interpreting the results of Spearman's test.

• Remember that ${\rho }_{{X}^{i},{X}^{j}}^{S}=0$ does not enable us to conclude that ${X}^{i}$ and ${X}^{j}$ are independent (see [Spearman's correlation coefficient] ).

• More generally, the numerical value of Spearman's correlation coefficient can only be interpreted when the two variables studied ${X}^{i}$ and ${X}^{j}$ are related in a monotonic way; the scatter plot of points $\left\{\left({x}_{1}^{i},{x}_{1}^{j}\right),...,\left({x}_{N}^{i},{x}_{N}^{j}\right)\right\}$ provides some indication concerning the validity of this hypothesis.

The following bibliographical references provide main starting points for further study of this method:

• Saporta, G. (1990). "Probabilités, Analyse de données et Statistique", Technip

• Dixon, W.J. & Massey, F.J. (1983) "Introduction to statistical analysis (4th ed.)", McGraw-Hill

• Bhattacharyya, G.K., and R.A. Johnson, (1997). "Statistical Concepts and Methods", John Wiley and Sons, New York.

• Sprent, P., and Smeeton, N.C. (2001). "Applied Nonparametric Statistical Methods – Third edition", Chapman & Hall

### 3.3.22 Step B  – Chi-squared test for independence

Mathematical description

Goal

This method deals with the parametric modelling of a probability distribution for a random vector $\underline{X}=\left({X}^{1},...,{X}^{{n}_{X}}\right)$. We seek here to detect possible dependencies that may exist between two components ${X}^{i}$ and ${X}^{j}$. In response to this, OpenTURNS offers the use of the ${\chi }^{2}$ test for Independence for discrete probability distributions.

Principle

As we are considering discrete distributions, the possible values for ${X}^{i}$ and ${X}^{j}$ respectively belong to the discrete sets ${ℰ}_{i}$ and ${ℰ}_{j}$. The ${\chi }^{2}$ test of independence can be applied when we have a sample consisting of $N$ pairs $\left\{\left({x}_{1}^{i},{x}_{1}^{j}\right),\left({x}_{2}^{i},{x}_{2}^{j}\right),...,\left({x}_{N}^{i},{x}_{N}^{j}\right)\right\}$. We denote:

• ${n}_{u,v}$ the number of pairs in the sample such that ${x}_{k}^{i}=u$ and ${x}_{k}^{j}=v$,

• ${n}_{u}^{i}$ the number of pairs in the sample such that ${x}_{k}^{i}=u$,

• ${n}_{v}^{j}$ the number of pairs in the sample such that ${x}_{k}^{j}=v$.

The test thus uses the quantity denoted ${\stackrel{^}{D}}_{N}^{2}$:

 $\begin{array}{c}\hfill {\stackrel{^}{D}}_{N}^{2}=\sum _{u\in {ℰ}_{i}}\sum _{v\in {ℰ}_{j}}\frac{{\left({p}_{u,v}-{p}_{v}^{j}{p}_{u}^{i}\right)}^{2}}{{p}_{u}^{i}{p}_{v}^{j}}\end{array}$

where:

 $\begin{array}{c}\hfill {p}_{u,v}=\frac{{n}_{u,v}}{N},\phantom{\rule{4pt}{0ex}}{p}_{u}^{i}=\frac{{n}_{u}^{i}}{N},\phantom{\rule{4pt}{0ex}}{p}_{v}^{j}=\frac{{n}_{v}^{j}}{N}\end{array}$
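
The quantity ${\stackrel{^}{D}}_{N}^{2}$ can be computed from the empirical frequencies as in the following minimal sketch in plain Python (the function name and data are hypothetical, and the sets ${ℰ}_{i}$ and ${ℰ}_{j}$ are taken to be the values actually observed in the sample). Note that $N{\stackrel{^}{D}}_{N}^{2}$ is the usual Pearson ${\chi }^{2}$ statistic of a contingency table.

```python
from collections import Counter


def chi2_independence_distance(pairs):
    """Distance between the joint frequencies and the product of the marginals."""
    n = len(pairs)
    joint = Counter(pairs)                  # n_{u,v}
    count_u = Counter(u for u, _ in pairs)  # n_u^i
    count_v = Counter(v for _, v in pairs)  # n_v^j
    distance = 0.0
    for u, n_u in count_u.items():
        for v, n_v in count_v.items():
            p_uv = joint[(u, v)] / n  # zero when the pair (u, v) was never observed
            expected = (n_u / n) * (n_v / n)
            distance += (p_uv - expected) ** 2 / expected
    return distance


# Hypothetical samples of pairs (x^i, x^j).
balanced = [(0, 0), (0, 1), (1, 0), (1, 1)]   # joint frequencies = product of marginals
identical = [(0, 0), (1, 1), (0, 0), (1, 1)]  # x^j always equals x^i

d_balanced = chi2_independence_distance(balanced)
d_identical = chi2_independence_distance(identical)
print(d_balanced, d_identical)
```

The distance is zero when the joint frequencies coincide with the product of the marginal frequencies, and grows as the two differ.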

The probability distribution of the distance ${\stackrel{^}{D}}_{N}^{2}$ is asymptotically known (i.e. as the size of the sample tends to infinity). If $N$ is sufficiently large, this means that for a probability $\alpha$, one can calculate the threshold (critical value) ${d}_{\alpha }$ such that:

• if ${\stackrel{^}{D}}_{N}^{2}>{d}_{\alpha }$, we conclude, with the risk of error $\alpha$, that a dependency exists between ${X}^{i}$ and ${X}^{j}$,

• if ${\stackrel{^}{D}}_{N}^{2}\le {d}_{\alpha }$, the independence hypothesis is considered acceptable.

An important notion is the so-called "$p$-value" of the test. This quantity is equal to the limit error probability ${\alpha }_{\mathrm{lim}}$ under which the independence hypothesis is rejected. Thus, independence is assumed if and only if ${\alpha }_{\mathrm{lim}}$ is greater than the value $\alpha$ desired by the user. Note that the higher ${\alpha }_{\mathrm{lim}}-\alpha$, the more robust the decision.

Other notations

This method is also referred to in the literature as the ${\chi }^{2}$ test of contingency.

Link with OpenTURNS methodology

The ${\chi }^{2}$ independence test is used in step B "Quantifying Sources of Uncertainty". It enables the existence of a dependency between two components ${X}^{i}$ and ${X}^{j}$ of the input vector $\underline{X}$, defined in step A "Specifying Criteria and the Case Study", to be verified.

Input data:

Two samples $\left\{{x}_{1}^{i},...,{x}_{N}^{i}\right\}$ and $\left\{{x}_{1}^{j},...,{x}_{N}^{j}\right\}$ of variables ${X}^{i}$ and ${X}^{j}$, each pair $\left({x}_{k}^{i},{x}_{k}^{j}\right)$ corresponding to a simultaneous sampling of the two variables

Parameters:

a probability $\alpha$ taking values strictly between 0 and 1, defining the risk of permissible decision error (significance level)

Outputs:

Result: Binary variable specifying whether the hypothesis of independence is rejected (0) or not (1)

${\alpha }_{\mathrm{lim}}$ : $p$-value of the test

References and theoretical basics
The ${\chi }^{2}$ test of independence can be applied when the two variables of study are discrete. Its use for continuous distributions is only possible by means of an arbitrary discretisation of possible values of $X$, a high source of potential error.

On the other hand, no hypothesis is made on the form of the relationship between the two tested variables. Readers interested in the detection of dependencies between two continuous variables are referred to [Pearson's Test] and [Spearman's test] in the reference documentation.

The following bibliographical references provide main starting points for further study of this method:

• Saporta, G. (1990). "Probabilités, Analyse de données et Statistique", Technip

• Dixon, W.J. & Massey, F.J. (1983) "Introduction to statistical analysis (4th ed.)", McGraw-Hill

• Bhattacharyya, G.K., and R.A. Johnson, (1997). "Statistical Concepts and Methods", John Wiley and Sons, New York.

• Sprent, P., and Smeeton, N.C. (2001). "Applied Nonparametric Statistical Methods – Third edition", Chapman & Hall

### 3.3.23 Step B  – Linear regression

Mathematical description

Goal

This method deals with the parametric modelling of a probability distribution for a random vector $\underline{X}=\left({X}^{1},...,{X}^{{n}_{X}}\right)$. It aims to measure a type of dependence (here a linear relation) which may exist between a component ${X}^{i}$ and other uncertain variables ${X}^{j}$.

Principle of the method

The principle of the multiple linear regression model is to find the function that links the variable ${X}^{i}$ to other variables ${X}^{{j}_{1}}$,...,${X}^{{j}_{K}}$ by means of a linear model:

 $\begin{array}{c}\hfill {X}^{i}={a}_{0}+\sum _{j\in \left\{{j}_{1},...,{j}_{K}\right\}}{a}_{j}{X}^{j}+\epsilon \end{array}$

where $\epsilon$ is a random variable with zero mean and standard deviation $\sigma$, independent of the input variables ${X}^{{j}_{1}}$,...,${X}^{{j}_{K}}$. For given values of ${X}^{{j}_{1}}$,...,${X}^{{j}_{K}}$, the average forecast of ${X}^{i}$ is denoted by ${\stackrel{^}{X}}^{i}$ and is defined as:

 $\begin{array}{c}\hfill {\stackrel{^}{X}}^{i}={a}_{0}+\sum _{j\in \left\{{j}_{1},...,{j}_{K}\right\}}{a}_{j}{X}^{j}\end{array}$

The estimators for the regression coefficients ${\stackrel{^}{a}}_{0},{\stackrel{^}{a}}_{1},...,{\stackrel{^}{a}}_{K}$, and the standard deviation $\sigma$ are obtained from a sample of $\left({X}^{i},{X}^{{j}_{1}},...,{X}^{{j}_{K}}\right)$, that is, a set of $n$ values $\left({x}_{1}^{i},{x}_{1}^{{j}_{1}},...,{x}_{1}^{{j}_{K}}\right)$,...,$\left({x}_{n}^{i},{x}_{n}^{{j}_{1}},...,{x}_{n}^{{j}_{K}}\right)$. They are determined via the least-squares method:

 $\begin{array}{c}\hfill \left\{{\stackrel{^}{a}}_{0},{\stackrel{^}{a}}_{1},...,{\stackrel{^}{a}}_{K}\right\}=\mathrm{argmin}\sum _{k=1}^{n}{\left[{x}_{k}^{i}-{a}_{0}-\sum _{j\in \left\{{j}_{1},...,{j}_{K}\right\}}{a}_{j}{x}_{k}^{j}\right]}^{2}\end{array}$

In other words, the principle is to minimize the total quadratic distance between the observations ${x}_{k}^{i}$ and the linear forecast ${\stackrel{^}{x}}_{k}^{i}$.
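As an illustration of the least-squares criterion, the sketch below solves the normal equations of the problem in pure Python. It is a teaching example under stated assumptions, not the algorithm used by OpenTURNS: the helper names `solve_linear_system` and `least_squares` are hypothetical, and the data are made up so that the exact coefficients are known.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    size = len(b)
    # Augmented matrix [a | b], copied so that the inputs are left untouched.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular normal equations (collinear regressors?)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, size):
            factor = m[r][col] / m[col][col]
            for c in range(col, size + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * size
    for r in range(size - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, size))
        x[r] = (m[r][size] - tail) / m[r][r]
    return x


def least_squares(y, regressors):
    """y: n observations of X^i; regressors: n rows (x^{j_1}, ..., x^{j_K}).

    Returns the estimates [a_0, a_1, ..., a_K] minimising the sum of squared errors.
    """
    design = [[1.0] + list(row) for row in regressors]   # column of ones for a_0
    p = len(design[0])
    gram = [[sum(row[r] * row[c] for row in design) for c in range(p)] for r in range(p)]
    moment = [sum(row[r] * yk for row, yk in zip(design, y)) for r in range(p)]
    return solve_linear_system(gram, moment)


# Made-up noiseless data generated from X^i = 1 + 2 X^{j_1} - 3 X^{j_2}.
regressors = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (1, 2)]
y = [1.0 + 2.0 * u - 3.0 * v for u, v in regressors]
coefficients = least_squares(y, regressors)
print(coefficients)
```

Solving the normal equations directly is the simplest choice for a short example; for ill-conditioned problems, a QR or SVD based solver is generally preferred.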

Some estimated coefficients ${\stackrel{^}{a}}_{\ell }$ may be close to zero, which may indicate that the variable ${X}^{{j}_{\ell }}$ does not bring valuable information to forecast ${X}^{i}$. OpenTURNS includes a classical statistical test to identify such situations: Fisher's test. For each estimated coefficient ${\stackrel{^}{a}}_{\ell }$, an important characteristic is the so-called "$p$-value" of Fisher's test, denoted ${\alpha }_{\ell \mathrm{lim}}$. With the usual convention, where the null hypothesis is ${a}_{\ell }=0$, the coefficient is said to be "significant" if and only if ${\alpha }_{\ell \mathrm{lim}}$ is lower than a value $\alpha$ chosen by the user (typically 5% or 10%). The lower the $p$-value, the more significant the coefficient.

Another important characteristic of the adjusted linear model is the coefficient of determination ${R}^{2}$. This quantity indicates the part of the variance of ${X}^{i}$ that is explained by the linear model:

 $\begin{array}{c}\hfill {R}^{2}=\frac{\sum _{k=1}^{n}{\left({x}_{k}^{i}-{\overline{x}}^{i}\right)}^{2}-\sum _{k=1}^{n}{\left({x}_{k}^{i}-{\stackrel{^}{x}}_{k}^{i}\right)}^{2}}{{\sum }_{k=1}^{n}{\left({x}_{k}^{i}-{\overline{x}}^{i}\right)}^{2}}\end{array}$

where ${\overline{x}}^{i}$ denotes the empirical mean of the sample $\left\{{x}_{1}^{i},...,{x}_{n}^{i}\right\}$.

Thus, $0\le {R}^{2}\le 1$. A value close to 1 indicates a good fit of the linear model, whereas a value close to 0 indicates that the linear model does not provide a relevant forecast. A statistical test makes it possible to detect significant values of ${R}^{2}$. Again, a $p$-value is provided: with the usual convention, the lower the $p$-value, the more significant the coefficient of determination.
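The definition of ${R}^{2}$ translates directly into a few lines of code. The sketch below is a plain transcription of the formula; the function name `coefficient_of_determination` and the numerical values are illustrative assumptions.

```python
def coefficient_of_determination(observed, predicted):
    """R^2 = (total sum of squares - residual sum of squares) / total sum of squares."""
    mean = sum(observed) / len(observed)
    total = sum((x - mean) ** 2 for x in observed)
    residual = sum((x - x_hat) ** 2 for x, x_hat in zip(observed, predicted))
    return (total - residual) / total


observed = [1.0, 2.0, 3.0, 4.0]
r2_perfect = coefficient_of_determination(observed, observed)                # forecast equals data
r2_mean = coefficient_of_determination(observed, [2.5, 2.5, 2.5, 2.5])       # forecast equals the mean
r2_partial = coefficient_of_determination(observed, [1.5, 1.5, 3.5, 3.5])    # intermediate case
print(r2_perfect, r2_mean, r2_partial)
```

The bounds $0\le {R}^{2}\le 1$ are guaranteed when the forecast comes from a least-squares fit that includes the constant term ${a}_{0}$; for an arbitrary forecast the same formula can return a negative value.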

By definition, the multiple regression model is only relevant for linear relationships, as in the following simple example where ${X}^{2}={a}_{0}+{a}_{1}{X}^{1}$.

In this second example (still in dimension 1), the linear model is not relevant because of the exponential shape of the relation. But a linear approach would be useful on the transformed problem ${X}^{2}={a}_{0}+{a}_{1}\mathrm{exp}\left({X}^{1}\right)$. In other words, what is important is that the relationship between ${X}^{i}$ and the variables ${X}^{{j}_{1}}$,...,${X}^{{j}_{K}}$ is linear with respect to the regression coefficients ${a}_{j}$.
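The effect of such a transformation can be checked numerically. The following sketch uses made-up noiseless data following ${X}^{2}=2+3\,\mathrm{exp}\left({X}^{1}\right)$ and compares a fit on ${X}^{1}$ with a fit on $\mathrm{exp}\left({X}^{1}\right)$; the helper names `simple_regression` and `r_squared` are hypothetical.

```python
import math


def simple_regression(x, y):
    """Least-squares intercept and slope of y = a0 + a1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((u - mean_x) ** 2 for u in x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r_squared(x, y, intercept, slope):
    """Coefficient of determination of the fitted line."""
    mean_y = sum(y) / len(y)
    total = sum((v - mean_y) ** 2 for v in y)
    residual = sum((v - intercept - slope * u) ** 2 for u, v in zip(x, y))
    return (total - residual) / total


# Made-up data with an exponential relation between X^1 and X^2.
x1 = [0.5 * k for k in range(7)]
x2 = [2.0 + 3.0 * math.exp(u) for u in x1]

# Fit on the raw variable: the straight line cannot follow the curvature.
a0_raw, a1_raw = simple_regression(x1, x2)
r2_raw = r_squared(x1, x2, a0_raw, a1_raw)

# Fit on the transformed regressor exp(X^1): the model is linear in a0 and a1.
z = [math.exp(u) for u in x1]
a0_t, a1_t = simple_regression(z, x2)
r2_t = r_squared(z, x2, a0_t, a1_t)
print(r2_raw, r2_t, a0_t, a1_t)
```

On these data the transformed fit recovers the generating coefficients, while the fit on the raw variable leaves part of the variance unexplained.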

The value of ${R}^{2}$ is a good indication of the goodness of fit of the linear model. However, several other verifications have to be carried out before concluding that the linear model is satisfactory. For instance, one has to pay attention to the "residuals" $\left\{{u}_{1},...,{u}_{n}\right\}$ of the regression:

 $\begin{array}{c}\hfill {u}_{k}={x}_{k}^{i}-{\stackrel{^}{x}}_{k}^{i}\end{array}$

A residual is thus equal to the difference between the observed value of ${X}^{i}$ and the average forecast provided by the linear model. A key assumption for the robustness of the model is that the characteristics of the residuals do not depend on the value of ${X}^{i},{X}^{{j}_{1}}$,...,${X}^{{j}_{K}}$: the mean value should be close to 0 and the standard deviation should be constant. Thus, plotting the residuals versus these variables can be fruitful.

In the following example, the behaviour of the residuals is satisfactory: no particular trend can be detected either in the mean or in the standard deviation.

The next example illustrates a less favourable situation: the mean value of the residuals seems to be close to 0 but the standard deviation tends to increase with $X$. In such a situation, the linear model should be abandoned, or at least used very cautiously.
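Besides plotting, a rough numerical check of these two situations is to sort the residuals by the value of a regressor, split them into a few groups and compare the mean and standard deviation of each group. The sketch below does this on made-up data whose dispersion grows with the regressor; the helper names and the choice of three groups are assumptions made for illustration.

```python
import statistics


def simple_regression(x, y):
    """Least-squares intercept and slope of y = a0 + a1 * x."""
    mean_x = statistics.mean(x)
    mean_y = statistics.mean(y)
    sxx = sum((u - mean_x) ** 2 for u in x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def grouped_residual_stats(x, residuals, groups=3):
    """Mean and standard deviation of the residuals within groups of increasing x."""
    pairs = sorted(zip(x, residuals))
    size = len(pairs) // groups
    stats = []
    for g in range(groups):
        end = (g + 1) * size if g < groups - 1 else len(pairs)
        chunk = [u for _, u in pairs[g * size:end]]
        stats.append((statistics.mean(chunk), statistics.pstdev(chunk)))
    return stats


# Made-up data: linear trend plus a deterministic perturbation whose amplitude grows with x.
x = [0.1 * k for k in range(1, 61)]
y = [1.0 + 2.0 * u + 0.5 * u * (-1) ** k for k, u in enumerate(x)]

a0, a1 = simple_regression(x, y)
residuals = [v - (a0 + a1 * u) for u, v in zip(x, y)]
stats = grouped_residual_stats(x, residuals, groups=3)
for mean_u, std_u in stats:
    print(round(mean_u, 3), round(std_u, 3))
```

A standard deviation that clearly increases from the first group to the last one, as it does for these data, corresponds to the less favourable situation described above.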

Other notations

Link with OpenTURNS methodology

Multiple linear regression can be used in step B "Quantifying Sources of Uncertainty". Having defined the vector $\underline{X}$ of input variables in step A "Specifying Criteria and the Case Study", linear regression makes it possible to detect a linear type of dependency between uncertain variables. Such a relationship should in fact be taken into account so as not to bias the results of step C "Propagation of Uncertainty".

References and theoretical basics
As we have seen in the mathematical description, there is a substantial list of verifications that have to be carried out to validate the linear model. In particular, underlying assumptions on the residuals are important to ensure the robustness of the average forecast. Detecting non-conforming behaviour of the residuals can also provide leads on transformations that could be carried out before applying linear regression (such as considering the logarithm of a variable instead of the variable itself).

The following bibliographical references provide main starting points for further study of this method:

• Saporta, G. (1990). "Probabilités, Analyse de données et Statistique", Technip

• Dixon, W.J. & Massey, F.J. (1983) "Introduction to statistical analysis (4th ed.)", McGraw-Hill

• NIST/SEMATECH e-Handbook of Statistical Methods, http://www.itl.nist.gov/div898/handbook/

• Bhattacharyya, G.K., and R.A. Johnson, (1997). "Statistical Concepts and Methods", John Wiley and Sons, New York.
