2 Global methodology of an uncertainty study

2.1 Step A: specification of the case-study

The first step of an uncertainty study can be roughly described as "the definition of the problem". This may seem obvious, but starting an uncertainty study requires an analysis of some key issues – the foundations that will ensure that the industrial goals have been correctly translated into mathematical terms.

2.1.1 Variables of interest, model and input variables

In our framework, a variable of interest denotes a scalar variable on which the uncertainty is to be quantified. A model denotes a mathematical function that enables the computation of a set of variables of interest, given several input variables on which the User may have data and/or expert/engineering judgement. The basis of the uncertainty study is the following mathematical equation:

y̲ = h(x̲, d̲)

where y̲ denotes the vector of variables of interest, h denotes the model, x̲ denotes the vector of uncertain input variables, and d̲ denotes the vector of fixed input variables.

Illustration on the flood example

A/ Modelling with random vector

A key variable to be studied is the annual maximum water level; in addition, one may also want to consider the annual cost including damage caused by possible floods and maintenance of the dyke. Therefore, two variables of interest y̲ = (y1, y2) can be studied: y1 denotes the annual maximum water level, and y2 denotes the overall annual cost. y1 can be evaluated via more or less complex hydrological models, the main input factors being the river flow and some characteristics of the river bed (such as Strickler's coefficient, which represents the friction, i.e. the bed roughness). y2 requires in addition an economic model to assess the costs (systematic maintenance and damage repair).
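These two models can be sketched together as one Python function h(x, d). Everything below (the hydraulic relation, the cost rule and all numerical values) is an illustrative assumption, not the reference OpenTURNS flood model:

```python
import math

def h(x, d):
    """Toy model mapping uncertain inputs x and fixed inputs d to y = (y1, y2)."""
    flow, strickler = x                         # uncertain inputs (m3/s, -)
    width, slope, dyke_height, cost_per_m = d   # fixed scenario parameters
    # Manning-Strickler-type relation for the annual maximum water level y1 (m)
    y1 = (flow / (strickler * width * math.sqrt(slope))) ** 0.6
    # Crude annual cost y2: fixed maintenance plus damage if the dyke overflows
    overflow = max(0.0, y1 - dyke_height)
    y2 = 1.0 + cost_per_m * overflow
    return (y1, y2)

y1, y2 = h((1000.0, 30.0), (300.0, 1e-3, 4.0, 100.0))
```

With these (assumed) values the water level stays below the dyke, so the cost reduces to the fixed maintenance term.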

Some of the model's input variables are uncertain: the river flow and the bed's characteristics naturally vary from year to year, and the damage cost may not be well known. They are therefore part of x̲, even if some of them may be put in d̲ by using penalized values (e.g. a maximal damage cost or a "worst possible" Strickler's coefficient). This last approach could be chosen if only scarce information is available on these sources of uncertainty.

Note that every model is a simplified view of reality, which introduces another source of uncertainty into the analysis. Thus, one has to keep in mind the importance of a compromise between model uncertainty (complex models usually offer a more accurate evaluation of the variable of interest) and input variable uncertainty (complex models may involve many more uncertain factors on which information has to be available).

B/ Modelling with stochastic processes

One can also be interested in the variation of the water level over time t. In that context, the variable y̲ is indexed by time and denoted y̲(t) = (y1(t), y2). The difference from the previous paragraph is that y̲(t) will no longer be modelled as a simple random vector, but will be considered as a stochastic process.

Among the uncertain input variables, some can also depend on time. For example, the river flow is obviously not the same in winter as in summer. The bed's characteristics can also evolve according to some spatial parameters. The vector of inputs will then be indexed by time t and spatial position p, as x̲(t, p). It will also be modelled as a stochastic process.

Modelling the uncertain inputs as a stochastic process does not necessarily mean that the output must be modelled as a stochastic process. For example, even though x̲(t, p) is a stochastic process, the output variable y2, representing the annual cost due to damage to the dyke, remains a simple random variable.

2.1.2 Criteria of the uncertainty study

Now that the general context has been staged, one major question is still to be addressed before moving to the core of the uncertainty study. The variable(s) of interest for the User are known to be uncertain, and this uncertainty is to be quantified; but what exactly could we or should we use to measure uncertainty? OpenTURNS' methodology proposes deterministic and probabilistic criteria that meet the requirements of many industrial cases.

Deterministic criteria

In a deterministic context, one may want to assess the range of possible values of y̲, that is to say a subset D_y ⊂ ℝ^ny in which we are sure to find y̲. In the following, we will refer to this type of uncertainty measurement as a deterministic criterion; OpenTURNS proposes methods that can be used to estimate the minimum and the maximum of a variable of interest.

This approach is the easiest to understand from a conceptual point of view, easier anyway than the probabilistic approach that we will now address. But we will see in step C that it is not always the least demanding approach in terms of CPU time.

A/ Criteria for random vectors

Probabilistic criteria: probability of exceeding a threshold / failure probability, and quantile

Most of the methods proposed in OpenTURNS use a probabilistic framework. In such a context, the vector y̲ of variables of interest is seen as a mathematical object called a random vector, usually noted in capital letters Y̲. Roughly speaking, this means that one associates a probability to each interval (and more generally to each subset of values). Note that in such an approach, the range of possible values of Y̲ may be infinite: e.g. the water level in our flood problem may be somewhere between 0 and +∞, even if very large values will be associated to probabilities that are extremely close to zero.

The most complete measure of uncertainty when dealing with a random vector is the probability distribution. One way to characterize a probability distribution is the following function F_Y, called the cumulative distribution function:

F_Y(y1, ..., y_nY) = ℙ(Y1 ≤ y1, ..., Y_nY ≤ y_nY)

In an uncertainty study, one may want to assess the value of the cumulative distribution function at least at certain points. More precisely, focus may be placed on the following quantities.

These criteria are very rich in terms of industrial meanings. But their assessment may sometimes be quite demanding in terms of CPU time (step C) and/or knowledge on the sources of uncertainty (step B). This is why in some applications, practitioners may be interested in simpler probabilistic criteria.

Probabilistic criteria: central dispersion

The expectation/average value μ i and variance σ i 2 of a variable of interest Y i are defined as follows:

μi = 𝔼[Yi],  σi² = 𝔼[(Yi − μi)²]
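These two moments can be estimated by simple sampling. The sketch below uses a toy water-level model and assumed input distributions (all values are illustrative assumptions) to compute the mean, the standard deviation and the normalized dispersion indicator σ/μ:

```python
import random, statistics

random.seed(0)

def water_level(flow, strickler):
    # toy monotone water-level model (an assumption for illustration)
    return (flow / (strickler * 300.0 * 0.0316)) ** 0.6

# Assumed input distributions for the river flow and Strickler coefficient
sample = [
    water_level(max(random.gauss(1000.0, 200.0), 0.0), random.uniform(20.0, 40.0))
    for _ in range(10_000)
]

mu = statistics.fmean(sample)      # estimate of the expectation of Y1
sigma = statistics.stdev(sample)   # estimate of the standard deviation of Y1
cv = sigma / mu                    # normalized dispersion indicator sigma/mu
```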

Except in very particular cases, these two quantities are not sufficient to compute the probability of exceeding a threshold, or a quantile. But they provide an "order of magnitude" of uncertainty: the standard deviation σi (square root of the variance) – normalized by the average value μi in order to remove scale effects – is an indicator of the dispersion of the variable of interest Yi. Values distant from μi are more likely if σi is large.

Illustration on the flood example

In our flood example, practitioners may be interested in the probability of a flood over a year. Since Y1 denotes the annual maximum water level:

ℙ(Y1 > dyke height) = 1 − F_Y1(dyke height)

Another probabilistic quantity of interest would be the 99%-quantile of the variable of interest Y1, that is to say the level of water that is exceeded only once per century on average (probability of exceeding the threshold equal to 1%). Note that here, one has in mind very low probabilities. But even if the description of the methods proposed in OpenTURNS often places the focus on low probability assessment – which raises specific difficulties – it is obviously possible to use these methods to address "non-rare" events.
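Both criteria can be sketched with a plain Monte Carlo estimate over a sorted sample; the toy model, the input distributions and the dyke height below are all assumptions:

```python
import random

random.seed(42)
N = 100_000
DYKE = 3.0  # hypothetical dyke height (m)

def water_level(flow, strickler):
    # toy monotone water-level model (an assumption for illustration)
    return (flow / (strickler * 300.0 * 0.0316)) ** 0.6

levels = sorted(
    water_level(max(random.gauss(1000.0, 300.0), 0.0), random.uniform(20.0, 40.0))
    for _ in range(N)
)

# P(Y1 > dyke height) = 1 - F_Y1(dyke height), via the empirical CDF
p_flood = sum(1 for y in levels if y > DYKE) / N

# 99%-quantile: the level exceeded on average once per century
q99 = levels[int(0.99 * N)]
```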

The value of these indicators (probability of flood and quantiles) is relevant only if one is able to provide an accurate probabilistic model of the uncertainty sources (e.g. the river flow and the bed's characteristics), a problem that will be addressed in step B. If information on the uncertainty sources is scarce or difficult to collect, a first uncertainty study could focus on the expectation and standard deviation of the variable Y1, which will bring some first useful – even though limited – information on uncertainty.

B/ Criteria for stochastic processes

Same deterministic and probabilistic criteria as for random vectors

The same criteria as the ones used for random vectors can be adapted to stochastic processes simply by fixing an instant of interest tI or a duration of interest TI.

For example, the probability of exceeding a threshold and the quantile defined previously can be rewritten as follows.

Specific criteria for stochastic processes

Other criteria related to the particular characteristics of the time dependence of stochastic processes can be defined.

Illustration on the flood example

In our flood example, if heavy precipitation has been observed, practitioners may be interested in the prediction of the date of the flood. In some cases, this may be of great help in organising the evacuation of people. This date is a stopping time, since it will be defined as the first time at which the level of the river is greater than the dyke height.

The duration of the flood may also be of interest. The consequences are obviously not the same if the flood lasts several hours or several days.
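Both quantities (stopping time and flood duration) can be illustrated on a simulated trajectory. The AR(1)-style hourly water-level process below is a toy assumption, deliberately drifting upwards so that a flood actually occurs:

```python
import random

random.seed(1)
DYKE = 3.0  # hypothetical dyke height (m)

# Toy autocorrelated water-level trajectory on an hourly grid
level, path = 2.0, []
for _ in range(24 * 10):  # ten days
    level = 2.0 + 0.9 * (level - 2.0) + random.gauss(0.3, 0.2)
    path.append(level)

# Stopping time: first instant at which the level exceeds the dyke height
first_flood = next((t for t, y in enumerate(path) if y > DYKE), None)

# Flood duration: total number of hours spent above the dyke height
flood_hours = sum(1 for y in path if y > DYKE)
```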

2.2 Step B: quantification of the uncertainty sources

Once step A has been carried out, the next step is to define a model to represent the uncertainties on the vector x̲. The methods to be used depend mainly on the type of criteria chosen (deterministic or probabilistic) and on the information available (statistical datasets and/or expert/engineering judgement).

Deterministic criteria

In a deterministic framework, the range of possible values has to be determined for each component of the uncertainty sources x ̲.

A/ Criteria for random vectors

Probabilistic criteria

In a probabilistic framework, the vector x̲ of uncertainty sources is seen as a random vector denoted by X̲. The uncertainty study then requires assessing the probability distribution of X̲.

The first question that has to be investigated concerns the possible dependencies between uncertain variables. A common physical phenomenon may link several components of vector X̲; then obtaining information on Xi would change our knowledge of Xj. If such dependencies are suspected, a multi-dimensional analysis is required in order not to bias the results of the uncertainty study. In case of independence, a uni-dimensional analysis of each Xi is sufficient.

In this version, OpenTURNS proposes a way of building a multi-dimensional probability distribution of X ̲ in two sub-steps.

In the uni-dimensional case, the way to build a probability distribution depends on the available data.

Illustration on the flood example

In a deterministic framework, note that the upper limit for the river flow is always relative: whatever "realistic" value is proposed, one has to be aware that there is still a residual risk of exceeding this limit.

If a probabilistic framework is considered, some uncertainty sources can reasonably be assumed independent: there is no physical reason that may justify a dependency between the river flow and Strickler's friction coefficient (knowing the flow of arriving water does not give any information on the state of the river bed). But if several uncertain variables characterize the river bed (e.g. Strickler's coefficient and some indicators of topography), the question of dependency should be investigated in order not to bias the results of the study, even if it is an additional source of complexity.
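When a dependency between two river-bed descriptors must be modelled, one classical sketch is to correlate Gaussian variables through a 2x2 Cholesky factor; the correlation value below is an assumption:

```python
import math, random

random.seed(7)
rho = 0.6  # assumed correlation between two river-bed descriptors

def correlated_pair():
    # Standard Gaussian pair with correlation rho (2x2 Cholesky factor)
    z1, z2 = random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)
    return z1, rho * z1 + math.sqrt(1.0 - rho ** 2) * z2

pairs = [correlated_pair() for _ in range(50_000)]

# Empirical check of the induced correlation
n = len(pairs)
mx = sum(a for a, _ in pairs) / n
my = sum(b for _, b in pairs) / n
cov = sum((a - mx) * (b - my) for a, b in pairs) / n
sx = math.sqrt(sum((a - mx) ** 2 for a, _ in pairs) / n)
sy = math.sqrt(sum((b - my) ** 2 for _, b in pairs) / n)
r_hat = cov / (sx * sy)
```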

Finally, note that some relationships between the variables of interest and some uncertain variables are monotonic. For instance, the maximum value of the water level will be reached for the highest possible value considered for the river flow, since a non-decreasing relation intuitively exists between these variables. Therefore, studying a high quantile of the water level requires a good confidence in the probabilistic model of extreme river flow values.

B/ Criteria for stochastic processes

In the following, the input vector is considered as a stochastic process X̲(t). As previously, its probability distribution is required to perform an uncertainty study. A stochastic process is the mathematical generalisation of the notion of random vector. In our contexts of application, it enables the representation of a spatial or temporal evolution of a random phenomenon. The mathematical model represents, at each time step or spatial position, the associated uncertainty. This uncertainty at time t or position p takes into account both the effects of the uncertainty at the other time steps or positions and its own characteristics. This requires two sub-steps.

For the determination of the stochastic process, an analogy can be made with paragraph A/. It can be identified through expertise, or by data analysis.

Remarks

Illustration on the flood example

In the flood example, the recordings of the flow of the river can be used for the identification of the stochastic process. First, to respect the hypotheses required for the computations, a Box-Cox transformation and an extraction of the trend due to seasonality need to be performed. A parametric assumption on the time dependence of the obtained process is made, and its parameters are then estimated from the data. As mentioned in the remarks, one will have to pay attention to the time grid which is considered (hourly, daily, weekly, ...). This basically depends on the objective of the study.
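The Box-Cox transformation mentioned above has a standard closed form; a minimal sketch (the λ value and the flow recordings are illustrative):

```python
import math

def box_cox(x, lam):
    """Standard Box-Cox transform of a positive value x with parameter lam."""
    if lam == 0.0:
        return math.log(x)
    return (x ** lam - 1.0) / lam

# Hypothetical flow recordings (m3/s); lam = 0.5 is an arbitrary choice here
flows = [450.0, 800.0, 1200.0, 2300.0]
transformed = [box_cox(q, 0.5) for q in flows]
```

The transform is monotone, so the ordering of the recordings is preserved while large values are compressed.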

2.3 Step C: uncertainty propagation

Now that the analysis of the uncertainty sources has been carried out, the next goal is to translate the model chosen in step B into uncertainty on the variables of interest via the relation:

y̲(t) = h(x̲(t), d̲(t))

The method to be used depends on the criteria of the study, and on some characteristics of the model h.

Deterministic criteria

In this situation, ranges of values have been determined for x̲(t). Finding the minimum and maximum values of y̲(t) is quite easy if the model h is monotonic with respect to x̲(t) (one only has to consider the boundary values of x̲(t)). But in a more general context, this is a potentially complex optimization problem. OpenTURNS proposes a simplified approach based on designs of experiments to estimate extreme values of y̲(t).

Probabilistic criteria

Step B has provided the probability distribution of X̲(t). The objective is then to assess some characteristics of interest of the distribution of Y̲(t) = h(X̲(t), d̲(t)): probability of exceeding a threshold, quantile, or expectation and variance. OpenTURNS proposes a set of relevant methods for each of these quantities.

Illustration on the flood example

In a deterministic framework, the computation of extremum values is facilitated by the fact that some relationships between the variables of interest and some uncertain variables are monotonic, as mentioned above: the maximum value of the water level will be reached for the highest possible value considered for the river flow.
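For a monotonic model, this boundary-value remark can be sketched by evaluating the model at all corners of the input box (the toy model and the bounds are assumptions):

```python
from itertools import product

def h(flow, strickler):
    # toy monotone water-level model (an assumption for illustration)
    return (flow / (strickler * 300.0 * 0.0316)) ** 0.6

# Assumed deterministic ranges for the two uncertain inputs
bounds = {"flow": (200.0, 3000.0), "strickler": (20.0, 40.0)}

# For a monotonic model, the extrema are reached at corners of the input box
corners = [h(q, ks) for q, ks in product(bounds["flow"], bounds["strickler"])]
y_min, y_max = min(corners), max(corners)
```

The maximum is reached at the highest flow and lowest Strickler coefficient, as the monotonicity argument predicts.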

In a probabilistic framework, the complexity of the hydrological model h plays an important role in the choice of the propagation method. If a simple model with a low CPU cost is used, robust sampling methods are the most natural candidates. Otherwise, approximation methods and/or accelerated sampling methods may be attractive. Note that one does not have to choose a unique method: cross-validating the results by using several propagation methods may be fruitful!

2.4 Step C': Ranking uncertainty sources / Sensitivity analysis (only for random vectors)

In a probabilistic framework, a better understanding of uncertainties can be achieved by analysing the contribution of the different uncertainty sources to the uncertainty of the variables of interest. For each couple "criteria of the study / propagation method used in step C", post-treatment procedures are proposed by OpenTURNS in order to rank the uncertainty sources.

It is important to note that an uncertainty study rarely stops after a first processing of steps A, B, C and C', and the last step then plays a crucial role. Indeed, the ranking results highlight the variables that truly determine the relevance of the final results of the study. If the uncertainty model of some of these variables has been chosen a bit roughly in step B, e.g. because of time constraints or any practical difficulties, collecting further information on these meaningful sources would be a relevant move to refine the analysis.

Illustration on the flood example

It is important to note that the result of the uncertainty ranking is strongly linked to the type of criterion considered. For instance, suppose that the central dispersion of the annual maximum water level is studied. Suppose also that the river flow is pointed out by uncertainty ranking as the most important uncertain variable, the other ones having an almost negligible impact. However, it would be dangerous to say without further investigation that the ranking would be the same if the focus shifted towards extreme values of the variable of interest (high quantile or rare probability). It is quite possible that the role of the bed's roughness uncertainty would increase, since extreme values of the water level may come only from the conjunction of a high flow and a high roughness.
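A crude ranking sketch for the central-dispersion criterion: correlate each input with the output over a Monte Carlo sample (the toy model and the distributions are assumptions). A different criterion, e.g. a rare-event probability, could indeed produce a different ranking:

```python
import math, random

random.seed(5)

def h(flow, strickler):
    # toy monotone water-level model (an assumption for illustration)
    return (flow / (strickler * 300.0 * 0.0316)) ** 0.6

def pearson(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    return cov / math.sqrt(vx * vy)

n = 20_000
flows = [max(random.gauss(1000.0, 300.0), 1.0) for _ in range(n)]
stricklers = [random.uniform(20.0, 40.0) for _ in range(n)]
ys = [h(q, ks) for q, ks in zip(flows, stricklers)]

# Rank sources by absolute input/output correlation (central-dispersion view)
ranking = sorted([("flow", abs(pearson(flows, ys))),
                  ("strickler", abs(pearson(stricklers, ys)))],
                 key=lambda item: item[1], reverse=True)
```

With these assumed distributions the river flow dominates the dispersion of the water level, in line with the scenario described above.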
