Important asymptotic theorems

Machine learning algorithms are very populuar. However, machine learing algorithms are not stable/consistant on the performance because lots of them are not using statistical inference. Thus, statistical theory for estimating function which has established hundreds of years ago becomes a more and more interesting research direction.

In this blog, I will introduce a few important asymptotic theorems that are fundamental to prove some machine learning algorithms, such as SVM and Markov Chain.

Fatou-Lebesgue Lemma: if the random variable \(X_n \xrightarrow{a.s} X\) and if for all n \(X_n \geq Y\) with \(E|Y| < \infty\), then \[E(\liminf_{n \to \infty} X_n) \leq \liminf_{n \to \infty} E(X_n)\] It holds if \(X_n \geq 0\) for all n. 

By using Fatou-Lebesgue Lemma, we can prove the (a) Monotone convergence Theorem, and the (b) Lebesgue Dominated Convergence Theorem.

  • (a) Monotone convergence Theorem: If \(X_n\) is a sequence of nonnegative measurable functions denoted by \(0 \leq x_1 \leq x_2 \dots \leq x_n \leq x_{n+1}\) and \(X_n \xrightarrow{a.s} X\), then \(\lim_{0\to\infty}E(X_n)=E(\lim_{0\to\infty}X_n) = EX\)

  • (b) Lebesgue Dominated Convergence Theorem:If the random variables \(X_n \to X\), then we have \(|X_n| \leq Y\), almost surely for all n. Then \(X_n \in L^1\), \(X \in L^1\), and \(\lim_{0\to\infty}E(X_n) = E(X)\).

Partical converge relation\(X_n \xrightarrow{P}X\) if and only if every subsequence \(n_1, N_2, \dots \epsilon \{1,2,\dots\}\) has a sub-sequence \(m_1, m_2, \dots \epsilon \{n_1,n_2,\dots \}\) such that \(X_{m_j} \xrightarrow{a.s}X\) as \(j\to\infty\).

Borel-Cantelli Lemma: for {\(A_n: n \geq 1\)} a sequence of events in a probability space if \(\sum_{n=1}^{\infty}P(A_n) < \infty\) then \(P(A_n i.o.)=0\); only a finite number of the events occur, with probability 1. Conversely, if the \(A_n\) are independent and \(\sum_{n=1}^{\infty}P(A_n) = \infty\), then \(P(A_n i.o.)=1\); an infinite number of the events occur, with probability 1.

Borel-Cantelli Lemma is useful in problems related to the a.s. convergence. It could be written as \(P(|X_n - X|>\epsilon i.o.) = 0, \forall \epsilon > 0\).

Laws of Large Numbers: When the convergence is in probability or law, this is known as weak law of large numbers (WLLN). if \(E|X| < \infty\), then \(\bar{X_n} \xrightarrow{P} \mu = EX\). When the convergence is almost surely, it is the strong laws of large nubmers (SLLN). \(\bar{X_n} \xrightarrow{a.s.} \mu \Leftrightarrow EX < \infty\) and \(\mu = EX\)

Central Limit Theorems: Let \(X_1,X_2,\dots\) be i.i.d. random vectors with mean \(\mu\) and finite covariance matrix, \(\Sigma\). Then \(\sqrt{n}(\bar{X_n} - \mu) \xrightarrow{L} N(0,\Sigma)\).

Slutsky’s Theorem: Let \(\{X_n\}, \{Y_n\}\) be sequences of scalar/vector/matrix random elements. If \(X_n\) converges in distribution to a random element X; and \(Y_n\) converges in probability to a constant c, then \(X_n + Y_n \xrightarrow{d} X + c\), \(X_nY_n \xrightarrow{d} cX\), \(\frac{X_n}{Y_n} \xrightarrow{d} \frac{X_n}{c}\).

SVM1could be an application of Lebesgue Dominated Convergence Theorem and Central Limit Theorem. We can use the theorem and Partical converge relation to prove the hinge loss function, when the data is not linearly separable. By limiting on Hilbert space, a weakly convergent subsequence. We can apply asymptotic normality property on the regularization parameter \(\lambda_i \to 0\) and thus to solve the miminization problem on the hyperplane \(\omega_n\), where the solution \(\tilde{\omega}= \omega_*\) because \(\omega_*(\lambda_i) \xrightarrow{a.s.} \omega_*\).

The Markov chain2 can be proved by using Borel-Cantelli Lemma. The probability of having state from \(i\) and eventually return to \(i\) is 1. If this probability is strictly less than 1, \(i\) is called transient.

  1. SVM is supervised learning model with associated learning algorithm that analyzes data used for classification and regression analysis.

  2. A Markov chain is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event. In probability theory and related fields, a Markov process, named after the Russian mathematician Andrey Markov, is a stochastic process that satisfies the Markov property. The defining characteristic of a Markov chain is that no matter how the process arrived at its present state, the possible future states are fixed. In other words, the probability of transitioning to any particular state is dependent solely on the current state and time elapsed. The state space, or set of all possible states, can be anything: letters, numbers, weather conditions, sales volume,etc.

Yuan Du
Yuan Du
Senior Data Scientist

My interests include applied Statistics, Machine Learning, Deep Learning and Healthcare.