Week 9: Predictability through the lens of Information Theory

Perfect model experiments

Information theory plays a central role in the field of predictability (and probably the most essential one in our entire class). Our goal today is to leverage information theory to find the most predictable feature in a dynamical system. But first, let's briefly walk through some ground rules.

In week 1, we learned the definition of predictability: given a forecast PDF (probability density function) and a climatological PDF, once we can no longer tell the difference between the two, we have reached the so-called predictability limit. However, there are many ways of testing the difference between two PDFs, so the question is: can we reach any agreement on the definition of predictability? The answer is yes, as long as two a priori assumptions are satisfied.

  • The minimum predictability is invariant under linear transformations.

  • The first two moments are the most important metrics for estimating predictability; higher moments are not considered.

When both criteria are satisfied, the system can be treated as a Markov process, i.e., the future state depends only on the current state and the past does not influence the future. Mathematically, it can be written as

(120)\[\mathbf{x}_{t+\tau}=\alpha\mathbf{x}_{t}+\mathbf{\epsilon} \]

where \(\mathbf{x}_{t+\tau}\) is the future state, \(\mathbf{x}_{t}\) is the current state and \(\mathbf{\epsilon}\) is random white noise. \(\alpha\) is the memory passed from the current time step to the next. An alternative way of writing (120) is with a conditional PDF.

(121)\[p(\mathbf{x}_{t+\tau}|\mathbf{x}_{t}) \]

where (121) is read as "the PDF of x at \(t+\tau\), given x at \(t\)". (121) is also called the transition probability. We know that the initial state changes discontinuously when we observe it. For example, climatological data tell us that the summer temperature around 2 pm at National Taiwan University is about \(35\pm3\) Celsius. However, as soon as we have an observation at a given moment, the probability of the observed value instantaneously changes to 100\(\%\). The new distribution of \(\mathbf{x}_{t}\) is defined as \(p(\mathbf{x}_{t}|\mathbf{\theta}_t)\) and is called the analysis probability.

To reduce the notational burden, the variables at the verification time, the initial time, and the observation will be represented as

(122)\[\begin{split}\begin{align} \mathbf{x}_{t+\tau} & \equiv \mathbf{v} \\ \mathbf{x}_{t} & \equiv \mathbf{i} \\ \mathbf{\theta}_{t} & \equiv \mathbf{\theta} \\ \end{align}\end{split}\]

With (122), the analysis probability is written as \(p(\mathbf{i}|\mathbf{\theta})\) and the transition probability is written as \(p(\mathbf{v}|\mathbf{i})\). The assumption that the transition probability (121) is known exactly defines a perfect model experiment. For simplicity, we adopt the perfect model assumption in the analysis that follows.

We then define the forecast distribution as

(123)\[p(\mathbf{v}|\mathbf{\theta}) = \int p(\mathbf{v}|\mathbf{i})\, p(\mathbf{i}|\mathbf{\theta})\, d\mathbf{i}\]

Physically, (123) says that if the system is predictable (i.e., the initial and final states retain a certain relation), we can obtain the final-state PDF by integrating over all possible initial conditions. (123) follows from the theorem of compound probability, given that \(p(\mathbf{v}|\mathbf{i},\mathbf{\theta}) = p(\mathbf{v}|\mathbf{i})\), i.e., given that the observation does not perturb the system significantly. In practice, an ensemble forecast is generated by repeatedly sampling from the analysis distribution and evolving each drawn \(\mathbf{i}\) with the transition probability. The mean of the forecast probability is the signal and the spread of the forecast probability is the noise.
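To make this concrete, here is a minimal sketch of a perfect model ensemble for the AR(1) process in (120): initial conditions are drawn from an assumed Gaussian analysis distribution \(p(\mathbf{i}|\mathbf{\theta})\), every member is advanced with the (perfectly known) transition rule, and the ensemble mean and spread give the signal and noise. All parameter values here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up parameters for illustration
alpha = 0.8          # memory passed from one step to the next
sigma_eps = 0.5      # white-noise standard deviation
mu_analysis = 1.0    # mean of the analysis distribution p(i|theta)
sigma_analysis = 0.2 # spread of the analysis distribution
n_members = 1000     # ensemble size
n_steps = 20         # forecast lead (number of AR(1) steps)

# Sample initial conditions from the analysis distribution
x = rng.normal(mu_analysis, sigma_analysis, size=n_members)

# Advance every member with the known transition rule (120)
for _ in range(n_steps):
    x = alpha * x + rng.normal(0.0, sigma_eps, size=n_members)

signal = x.mean()   # ensemble mean   -> "signal"
noise = x.std()     # ensemble spread -> "noise"
print(f"lead {n_steps}: signal = {signal:.3f}, noise = {noise:.3f}")
# As the lead time grows, the signal decays toward 0 and the noise
# approaches the climatological spread sqrt(sigma_eps**2 / (1 - alpha**2)).
print("climatological spread:", (sigma_eps**2 / (1 - alpha**2))**0.5)
```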

For the reference state, we define a climatological distribution, in which the specific observation is unavailable,

(124)\[p(\mathbf{v}) = \int p(\mathbf{v}|\mathbf{\theta}) p(\mathbf{\theta}) d\mathbf{\theta}\]

One should notice that the climatological distribution still depends on time. For example, climatologically, we know the midnight temperature is lower than the afternoon temperature, and the winter temperature is on average lower than the summer temperature.

Some studies (in fact, most) choose the saturated forecast PDF (i.e., \(p(\mathbf{v}|\mathbf{\theta})\) at very long lead time) as the climatological PDF. This is a good choice in most cases, but it can be a terrible choice under certain circumstances. For example, if a given dynamical system is purely damped and linear, the saturated PDF will be a delta function, which is obviously not ideal given that we want to approximate the forecast uncertainty at the longest lead times.
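As a quick illustration of this caveat, here is a toy sketch (the damping factor and ensemble are assumed): with no noise, every member of a damped linear system decays toward the same fixed point, so the "saturated" ensemble variance collapses to zero.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=1000)   # ensemble spread at tau = 0
alpha = 0.9                           # damped linear dynamics, no noise

for tau in (0, 10, 50, 200):
    print(f"tau = {tau:3d}, ensemble variance = {np.var(alpha**tau * x):.2e}")
# The variance shrinks as alpha**(2*tau): the saturated PDF of this noise-free
# damped system is a delta function, a poor reference for forecast uncertainty.
```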

Shannon Entropy and Predictability Components

Now we have all the necessary ingredients to define predictability (the forecast probability and the climatological probability). To quantify predictability, we still need some metrics. Usually, we define the predictability limit as the moment when \(p(\mathbf{v})\) and \(p(\mathbf{v}|\mathbf{\theta})\) are not statistically different from each other. In other words, when \(\mathbf{\theta}\) and \(\mathbf{v}\) have no connection, we hit the predictability limit.

Traditionally, we can compare two PDFs and estimate whether their means are significantly different from each other by calculating t-scores. The t-score/z-score contains information about both the mean and the standard deviation. However, the t- or z-score does not tell us how different the two PDFs are as a whole and does not account for the conservation of probability density (unless the two PDFs are Gaussian, as we assumed at the very beginning). Therefore, we will try something more general here.

Recall that the Liouville equation ((64)) predicts the time evolution of the PDF. If we rewrite (64), we have

(125)\[-\frac{d \mathrm{ln}(\rho)}{dt}=\frac{\partial \mathbf{\Phi}}{\partial \mathbf{x}}\]

Integrating over the entire domain, the right-hand side of (125) vanishes, suggesting the conservation of probability density, or information. The question is how to physically interpret the left-hand side of (125).

Here are a few examples. If we cast a six-sided die and learn the outcome (say it is 1), the probabilities of the remaining faces are determined (the other 5 faces have probability 0). On the other hand, if we only have a 2-sided coin, knowing the outcome of one side determines the outcome of the other side at the same time. From these two examples, we see that the six-sided die provides more information once a particular outcome is determined; on the other hand, the probability of any single outcome of the six-sided die is also lower. Combining both quantities, Claude Shannon introduced Shannon entropy in 1948, which is defined as the expected value of information.

(126)\[H(\mathbf{x}) = -\int p(\mathbf{x}) \mathrm{log} [p(\mathbf{x})] d\mathbf{x} \]

The first term in the integral is the probability of a given scenario and the second term is how much information is carried by that scenario. We can see that the conservation in (125) is equivalent to the conservation of the second term inside the integral. In computer science, we mostly use binary information (i.e., 1 or 0) to transmit data. Thus, for a fair binary outcome, the corresponding Shannon entropy is \(\sum \frac{1}{2}\mathrm{log_b} 2 = \mathrm{log_b} 2\). When \(b=2\), this equals 1, which is the widely used unit of information in computer science, the bit.
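A minimal numerical check of the coin and die examples (base-2 logarithm, so the unit is the bit; both distributions are assumed to be fair/uniform):

```python
import numpy as np

def shannon_entropy(p, base=2):
    """Discrete Shannon entropy H = -sum p * log_b(p), ignoring zero bins."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p)) / np.log(base)

coin = np.full(2, 1 / 2)   # fair two-sided coin
die = np.full(6, 1 / 6)    # fair six-sided die

print("H(coin) =", shannon_entropy(coin), "bits")  # 1.0 bit
print("H(die)  =", shannon_entropy(die), "bits")   # ~2.58 bits
# A single outcome of the die is less probable (1/6 < 1/2), so observing it
# removes more uncertainty; the die's entropy is correspondingly larger.
```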

Note

When Claude Shannon proposed using (126) to quantify the dispersion of a PDF, he consulted John von Neumann about how to name this metric. von Neumann told him that entropy was a good name, because a similar concept had already been used in statistical mechanics and because nobody really knows what entropy is.

Because an unlikely event carries more information, learning it lowers the corresponding entropy. It's like buying a lottery ticket: if someone tells you that a certain combination of numbers will win the lottery, you would buy that combination rather than invest your money in other combinations. Knowing this information is analogous to how the probability changes when we have an observation, and the change in the expected prize (because you now expect to get the money) reflects the change in the term \(\mathrm{log} [p(\mathbf{x})]\). Similar to thermodynamics, the entropy increases with increasing forecast lead time and the process is irreversible. Here we use the following quantity to describe the growth of entropy

(127)\[P_\theta = H(\mathbf{v})- H_\theta(\mathbf{v}|\mathbf{\theta})\]

where

(128)\[H_\theta(\mathbf{v}|\mathbf{\theta}) = -\int p(\mathbf{v}|\mathbf{\theta}) \mathrm{log} [p(\mathbf{v}|\mathbf{\theta})] d \mathbf{v}\]

is the entropy of the forecast distribution. The quantity \(P_{\theta}\) is the predictive information, which measures how much room we still have before we hit the predictability limit. In general, \(P_\theta\) varies with time through its dependence on \(\theta\).
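For later use, it helps to recall the standard closed form of (126) for a univariate Gaussian \(p(x)=\mathcal{N}(\mu,\sigma^{2})\):

\[H = -\int \mathcal{N}(\mu,\sigma^{2})\,\mathrm{log}\left[\mathcal{N}(\mu,\sigma^{2})\right] dx = \frac{1}{2}\,\mathrm{log}\left(2\pi e\,\sigma^{2}\right)\]

so a difference of two Gaussian entropies, as in (127), reduces to the logarithm of a variance ratio; this is the origin of the \(\mathrm{log}(\sigma_{v|\theta}^{2}/\sigma_{v}^{2})\) form that appears in (137) below (up to the factor of \(\tfrac{1}{2}\), which depends on convention).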

An alternative metric for estimating the difference between two PDFs is the Kullback-Leibler divergence (often loosely called a distance), which is defined as

(129)\[R_\theta = \int p(\mathbf{v}|\mathbf{\theta})\, \mathrm{log}\left(\frac{p(\mathbf{v}|\mathbf{\theta})}{p(\mathbf{v})}\right) d\mathbf{v}\]

Technically, the Kullback-Leibler divergence is not really a distance, but it still measures how different the two PDFs are. One key element of the Kullback-Leibler divergence is the second term inside the integral, the logarithm of the ratio between the two PDFs. Therefore, the Kullback-Leibler divergence is also called the relative entropy. For a Markov problem, the relative entropy decreases monotonically.
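Here is a small sketch (with assumed, illustrative Gaussians) comparing a numerical evaluation of (129) with the well-known closed form for two univariate Gaussians; the closed form matches the \(R_\theta\) expression in (137) up to the same factor-of-\(\tfrac{1}{2}\) convention.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Illustrative (made-up) forecast and climatological Gaussians
mu_f, sig_f = 0.5, 0.6    # forecast    p(v|theta)
mu_c, sig_c = 0.0, 1.0    # climatology p(v)

def kl_integrand(v):
    pf, pc = norm.pdf(v, mu_f, sig_f), norm.pdf(v, mu_c, sig_c)
    return pf * np.log(pf / pc)

# Numerical evaluation of (129)
kl_numeric, _ = quad(kl_integrand, -10, 10)

# Closed form for two univariate Gaussians
kl_closed = (np.log(sig_c / sig_f)
             + (sig_f**2 + (mu_f - mu_c)**2) / (2 * sig_c**2) - 0.5)

print(f"numerical : {kl_numeric:.6f}")
print(f"closed    : {kl_closed:.6f}")   # the two values agree
```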

Remarkably, predictive information and relative entropy are invariant with respect to linear transformations of the state. This allows us to combine variables with different units into a single measure of predictability without introducing weighting factors, since such factors cannot affect the final values. Relative entropy has the additional property of being invariant under nonlinear, invertible transformations of the state. Another remarkable point is that if we average both metrics over \(\mathbf{\theta}\), we obtain exactly the same value, i.e.,

(130)\[M = \int P_\theta\, p(\mathbf{\theta})\, d\mathbf{\theta} = \int R_\theta\, p(\mathbf{\theta})\, d\mathbf{\theta}\]

which means the estimated predictability limit will be the same regardless of which metric is used. We can take a more detailed look at the first part of (130).

(131)\[M = \int P_\theta\, p(\mathbf{\theta})\, d\mathbf{\theta} = \int \left[H(\mathbf{v})- H_\theta(\mathbf{v}|\mathbf{\theta})\right] p(\mathbf{\theta})\, d\mathbf{\theta} = H(\mathbf{v})- H(\mathbf{v}|\mathbf{\theta})\]

where \(H(\mathbf{v}|\mathbf{\theta})\) is the conditional entropy. (130) and (131) rigorously state that predictability can be expressed in two different ways: (1) the difference between the forecast and climatological distributions and (2) the statistical dependence between \(\mathbf{v}\) and \(\mathbf{\theta}\). When \(\mathbf{v}\) and \(\mathbf{\theta}\) lose their statistical connection, \(H(\mathbf{v}|\mathbf{\theta})\) reduces to \(H(\mathbf{v})\) and the difference between \(H(\mathbf{v})\) and \(H_\theta(\mathbf{v}|\mathbf{\theta})\) vanishes.
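A quick discrete-case check of (130) and (131), using a toy joint probability table assumed purely for illustration: averaging the predictive information and the relative entropy over \(\mathbf{\theta}\) gives the same number.

```python
import numpy as np

# Toy joint probability table p(v, theta): rows index v, columns index theta
p_vt = np.array([[0.30, 0.05],
                 [0.10, 0.15],
                 [0.05, 0.35]])
p_theta = p_vt.sum(axis=0)          # p(theta)
p_v = p_vt.sum(axis=1)              # climatological p(v)
p_v_given_t = p_vt / p_theta        # forecast p(v|theta), one column per theta

def entropy(p):
    """Discrete Shannon entropy (natural logarithm)."""
    return -np.sum(p * np.log(p))

H_clim = entropy(p_v)
P = np.array([H_clim - entropy(p_v_given_t[:, j])          # predictive information
              for j in range(len(p_theta))])
R = np.array([np.sum(p_v_given_t[:, j]                     # relative entropy
                     * np.log(p_v_given_t[:, j] / p_v))
              for j in range(len(p_theta))])

print("average of P_theta:", np.sum(p_theta * P))
print("average of R_theta:", np.sum(p_theta * R))   # the two averages agree
```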

To simplify the story, here we only focus on normal distributions. The climatological distribution, the observation and the forecast distribution are then defined as

(132)\[\begin{split}\begin{align} p(\mathbf{v}) & = \mathcal{N}(\mu_\mathbf{v},\Sigma_\mathbf{v}) \\ p(\mathbf{\theta}) & = \mathcal{N}(\mu_\mathbf{\theta},\Sigma_\mathbf{\theta}) \\ p(\mathbf{v|\theta}) & = \mathcal{N}(\mu_\mathbf{v|\theta},\Sigma_\mathbf{v|\theta}) \\ \end{align}\end{split}\]

where \(\mu\) is the mean and \(\Sigma\) is the covariance. Since the metrics we use are invariant under linear transformations, we can easily standardize the distributions, and thus (132) can be written as

(133)\[\begin{split}\begin{align} p(\mathbf{v}) & = \mathcal{N}(\mu_\mathbf{v},\sigma_\mathbf{v}^2) \\ p(\mathbf{\theta}) & = \mathcal{N}(\mu_\mathbf{\theta},\sigma_\mathbf{\theta}^2) \\ p(\mathbf{v|\theta}) & = \mathcal{N}(\mu_\mathbf{v|\theta},\sigma_\mathbf{v|\theta}^2) \\ \end{align}\end{split}\]

Now we will formulate a single equation that reconciles the concepts behind the different metrics and can be used to calculate predictability.

But first, let's take a look at the old-school method. Traditionally, the measure of predictability is based on the mean square difference between the forecast and the truth. In a perfect model, the truth is chosen as a randomly sampled ensemble member (like parallel universes!). Let \(\epsilon\) be the difference between two randomly chosen members, \(\mathbf{f_1}\) and \(\mathbf{f_2}\) (one forecast and one truth). Then the mean square error is defined as

(134)\[\begin{split}\begin{align} \mathrm{MSE} & = \overline{||\epsilon||^2} = \overline{||\mathbf{f_1}-\mathbf{f_2}||^2} = \overline{||(\mathbf{f_1}-\mathbf{\mu})-(\mathbf{f_2}-\mathbf{\mu})||^2} \\ & = \overline{||(\mathbf{f_1}-\mathbf{\mu})||^2}+\overline{||(\mathbf{f_2}-\mathbf{\mu})||^2} = 2\mathrm{Tr}[\Sigma_{\mathbf{v}|\mathbf{\theta}}] \end{align}\end{split}\]
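The cross term in (134) vanishes because the two members are drawn independently. A Monte Carlo sanity check of the final identity, with an assumed forecast covariance:

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed forecast covariance Sigma_{v|theta} (any symmetric PD matrix works)
Sigma = np.array([[1.0, 0.3],
                  [0.3, 0.5]])
mu = np.array([2.0, -1.0])
n = 200_000

f1 = rng.multivariate_normal(mu, Sigma, size=n)   # "forecast" members
f2 = rng.multivariate_normal(mu, Sigma, size=n)   # independently drawn "truth"

mse = np.mean(np.sum((f1 - f2) ** 2, axis=1))
print("Monte Carlo MSE :", mse)
print("2 * Tr(Sigma)   :", 2 * np.trace(Sigma))   # ~ the same value
```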

As discussed previously, a measure of predictability quantifies the degree to which the forecast and climatological distributions differ, i.e.,

(135)\[\Delta_\theta = \frac{\overline{||\epsilon||^2}}{2\,\overline{||\mathbf{v}-\mathbf{\mu}||^2}} = \frac{2\mathrm{Tr}[\Sigma_{\mathbf{v}|\mathbf{\theta}}]}{2\mathrm{Tr}[\Sigma_{\mathbf{v}}]}\]

The above metric has some limitations. First, the result depends on the coordinates chosen to represent the dynamical system. Second, if the chosen variable is not representative enough, it fails to capture the difference between the forecast and climatological PDFs. However, we will show later that, with a certain linear transformation, the traditional method and the information theory-based method give the same result.

For the information theory-based method, suppose there is a basis vector (spatial pattern), \(\mathbf{q}\), that maximizes the signal at a given forecast lead time. The mean and variance of the projected time series can then be written as

(136)\[\begin{split}\begin{align} \mu_v &= \mathbf{q}^{T}\mathbf{\mu}_v \\ \sigma_v^{2} &= \mathbf{q}^{T}\mathbf{\sigma}_v^{2}\mathbf{q} \\ \mu_{v|\theta} &= \mathbf{q}^{T}\mathbf{\mu}_{v|\theta} \\ \sigma_{v|\theta}^{2} &= \mathbf{q}^{T}\mathbf{\sigma}_{v|\theta}^{2}\mathbf{q} \\ \end{align}\end{split}\]

Then the metrics we used previously can be written as

(137)\[\begin{split}\begin{align} P_\theta & = -\mathrm{log} (\frac{\sigma_{v|\theta}^{2}}{\sigma_{v}^{2}}) \\ R_\theta & = -\mathrm{log} (\frac{\sigma_{v|\theta}^{2}}{\sigma_{v}^{2}})+\frac{\sigma_{v|\theta}^{2}}{\sigma_{v}^{2}}-1+\frac{(\mu_{v|\theta}-\mu_v)^{2}}{\sigma_v^{2}} \\ M & = -<\mathrm{log} (\frac{\sigma_{v|\theta}^{2}}{\sigma_{v}^{2}})> \\ \Delta_\theta & = \frac{\sigma_{v|\theta}^{2}}{\sigma_{v}^{2}} \end{align}\end{split}\]

We can see that all four metrics are controlled by a single parameter: \(\frac{\sigma_{v|\theta}^{2}}{\sigma_{v}^{2}}\). Given that the variable is whitened (standardized, so the long-term variance equals 1), \(\sigma_{v}^{2}\) will be 1. Thus, the only term that determines predictability is the whitened covariance of the forecast distribution (i.e., \(\sigma_{v|\theta}^{2}\)). At this point, the importance of \(\sigma_{v|\theta}^{2}\) should be clear.
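A minimal helper (names and numbers are mine) that evaluates the scalar Gaussian forms in (137) from the forecast and climatological moments:

```python
import numpy as np

def predictability_metrics(mu_f, var_f, mu_c=0.0, var_c=1.0):
    """Scalar Gaussian metrics following (137); var_c = 1 after whitening."""
    ratio = var_f / var_c
    P = -np.log(ratio)                                          # predictive information
    R = -np.log(ratio) + ratio - 1 + (mu_f - mu_c)**2 / var_c   # relative entropy
    D = ratio                                                   # Delta_theta
    return P, R, D

# Example with assumed numbers: a sharp forecast early in the lead time...
print(predictability_metrics(mu_f=0.8, var_f=0.2))
# ...and a nearly saturated forecast: all metrics approach their no-skill values
print(predictability_metrics(mu_f=0.0, var_f=0.99))
```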

For all four equations, if we can find a pattern that minimizes \(\sigma_{v|\theta}^{2}\), it will be the most predictable pattern. This is equivalent to solving an eigenvalue problem: the eigenvector with the smallest eigenvalue is the pattern with the longest predictability, i.e.,

(138)\[\mathbf{\sigma}_{v|\theta}^{2}\mathbf{\Phi} = \lambda \mathbf{\Phi}\]
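A sketch of solving (138), using a synthetic whitened forecast covariance with a known eigen-structure (all values assumed for illustration; in practice the state would first be whitened with the climatological covariance so that these eigenvalues are the variance ratios of (137)):

```python
import numpy as np
from scipy.stats import ortho_group

# Build a synthetic whitened forecast covariance with prescribed eigenvalues
Q = ortho_group.rvs(5, random_state=3)            # random orthogonal patterns
forecast_vars = np.array([0.15, 0.3, 0.5, 0.7, 0.9])
Sigma_f = Q @ np.diag(forecast_vars) @ Q.T

eigvals, eigvecs = np.linalg.eigh(Sigma_f)        # eigenvalues in ascending order
most_predictable = eigvecs[:, 0]                  # smallest whitened forecast variance

print("whitened forecast variances:", np.round(eigvals, 3))
print("most predictable pattern   :", np.round(most_predictable, 3))
```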

Average Predictability Time

From (137), we know that the whitened \(\mathbf{\sigma_{v|\theta}}^{2}\) approaches 1 as \(\tau\) approaches \(\infty\), because at a long enough forecast lead time the forecast error approaches the variance of the climatological distribution.

In addition, for the Gaussian Markov (red noise) process considered here, the forecast signal decays exponentially with increasing lead time, i.e., \(\mu_{v|\theta}=\mathbf{q}^{T}\mathbf{\mu_{v|\theta}} = \mu_{v|\theta,\tau=0}\,\mathrm{e}^{a\tau}\), where \(\mu_{v|\theta,\tau=0}\) is the ensemble mean at \(\tau=0\) and \(a<0\) is the damping rate.

Combining \(\mu_{v|\theta}=\mu_{v|\theta,\tau=0}\mathrm{e}^{a\tau}\) with the definition of \(\mathbf{\sigma_{v|\theta}}^{2}\), we have

(139)\[\begin{split}\begin{align} \mathbf{\sigma}_{v|\theta}^{2} & = \langle(v|\theta-\mu_{v|\theta})^2\rangle \\ & = \langle(v|\theta-\mu_{v|\theta,\tau=0}\mathrm{e}^{a\tau})^2\rangle \\ & = \mathrm{e}^{2a\tau}\langle (v|\theta_{\tau=0}-\mu_{v|\theta,\tau=0})^2\rangle + \mathrm{e}^{2a\tau}\int_{s=0}^{s=\tau}\int_{s'=0}^{s'=\tau} \mathrm{e}^{-as}\, \langle\epsilon(s)\epsilon(s')\rangle\, \mathrm{e}^{-as'}\, ds\, ds' \\ & = \mathrm{e}^{2a\tau}\, \mathbf{\sigma}_{v|\theta,\tau=0}^{2} + \sigma_{\epsilon}^2\, \mathrm{e}^{2a\tau} \int_{s=0}^{\tau} \mathrm{e}^{-2as}\, ds \\ & = \mathrm{e}^{2a\tau}\, \mathbf{\sigma}_{v|\theta,\tau=0}^{2} + \frac{1-\mathrm{e}^{2a\tau}}{-2a}\sigma_{\epsilon}^2 \\ & = \mathrm{e}^{2a\tau}\, \mathbf{\sigma}_{v|\theta,\tau=0}^{2} + \frac{1-\mathrm{e}^{-2|a|\tau}}{2|a|}\sigma_{\epsilon}^2 \end{align}\end{split}\]

where the white-noise property \(\langle\epsilon(s)\epsilon(s')\rangle=\sigma_{\epsilon}^2\,\delta(s-s')\) is used and the last line takes \(a<0\). If we specify \(\mathbf{\sigma_{v|\theta,\tau=0}}^{2}=0\) (which holds for the red noise problem), we find that \(\mathbf{\sigma_{v|\theta}}^{2} = \frac{1-\mathrm{e}^{-2|a|\tau}}{2|a|}\sigma_{\epsilon}^2\). Thus, \(|a|\) tells us how long we can still leverage \(v|\theta\) to inform the future state.
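A short sketch of this red noise result (the damping rate is assumed, and the noise variance is chosen so that the saturated, i.e. climatological, variance is 1 in whitened units): the forecast variance grows as \(1-\mathrm{e}^{-2|a|\tau}\) and the predictive information shrinks toward zero on a timescale set by \(1/(2|a|)\).

```python
import numpy as np

a = -0.2                   # assumed damping rate (a < 0, red noise)
sigma_eps2 = 2 * abs(a)    # noise variance chosen so the saturated variance is 1

tau = np.array([0.0, 1.0, 5.0, 10.0, 25.0, 100.0])
var_forecast = (1 - np.exp(-2 * abs(a) * tau)) / (2 * abs(a)) * sigma_eps2

for t, v in zip(tau, var_forecast):
    # P_theta from (137); it shrinks toward 0 as the variance saturates
    P = -np.log(v) if v > 0 else np.inf
    print(f"tau = {t:6.1f}  forecast variance = {v:.3f}  P_theta = {P:.3f}")
```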

Note

In the homework assignment, you will use the ECMWF reforecast data to find the predictable components and the corresponding average predictability time. What pattern shows up over the Pacific-North America region?