Important asymptotic theorems

Machine learning algorithms are very popular. However, many machine learning algorithms are not stable or consistent in their performance because they are not grounded in statistical inference. Thus, the statistical theory of estimating functions, which was established hundreds of years ago, is becoming a more and more interesting research direction.

In this blog, I will introduce a few important asymptotic theorems that are fundamental to proving properties of some machine learning methods, such as the SVM and Markov chains.

Fatou–Lebesgue Lemma: if the random variables $X_n \xrightarrow{a.s.} X$ and if $X_n \ge Y$ for all $n$ with $E|Y| < \infty$, then $E(\liminf_{n} X_n) \le \liminf_{n} E(X_n)$. In particular, it holds if $X_n \ge 0$ for all $n$.
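As a quick sanity check, here is a minimal Monte Carlo sketch (assuming NumPy) of the classic example $X_n = n \cdot 1\{U \le 1/n\}$ with $U \sim \mathrm{Uniform}(0,1)$: each $X_n$ has mean 1, yet $X_n \to 0$ almost surely, so Fatou's inequality is strict here.

```python
import numpy as np

rng = np.random.default_rng(0)
U = rng.uniform(size=100_000)          # one uniform draw per sample path

# X_n = n * 1{U <= 1/n}: X_n -> 0 almost surely, but E(X_n) = 1 for every n
for n in [10, 100, 1_000, 10_000]:
    X_n = n * (U <= 1 / n)
    print(f"n={n:>6}:  E(X_n) ~ {X_n.mean():.3f}")

# liminf_n X_n = 0 on (almost) every path, so E(liminf X_n) = 0 <= liminf E(X_n) = 1
```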

By using the Fatou–Lebesgue Lemma, we can prove (a) the Monotone Convergence Theorem and (b) the Lebesgue Dominated Convergence Theorem.

  • (a) Monotone Convergence Theorem: If $X_n$ is a sequence of nonnegative measurable functions with $0 \le X_1 \le X_2 \le \dots \le X_n \le X_{n+1} \le \dots$ and $X_n \xrightarrow{a.s.} X$, then $\lim_{n \to \infty} E(X_n) = E(\lim_{n \to \infty} X_n) = E(X)$.

  • (b) Lebesgue Dominated Convergence Theorem: If the random variables $X_n \xrightarrow{a.s.} X$ and $|X_n| \le Y$ almost surely for all $n$ with $E|Y| < \infty$, then $X_n \in L^1$, $X \in L^1$, and $\lim_{n \to \infty} E(X_n) = E(X)$. (A small numerical sketch of both results follows below.)
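Both results can be illustrated numerically. Below is a small sketch (assuming NumPy) using $X \sim \mathrm{Exponential}(1)$: the truncations $\min(X, n)$ increase to $X$ (monotone convergence), and $X e^{-X/n} \to X$ is dominated by the integrable $X$ itself (dominated convergence), so both sequences of expectations converge to $E(X) = 1$.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.exponential(scale=1.0, size=1_000_000)   # E(X) = 1

# Monotone convergence: X_n = min(X, n) increases to X, so E(X_n) -> E(X) = 1
for n in [1, 2, 5, 10]:
    print(f"E(min(X, {n:>2})) ~ {np.minimum(X, n).mean():.4f}")

# Dominated convergence: X_n = X * exp(-X/n) -> X with |X_n| <= X and E(X) < inf,
# so E(X_n) -> E(X) = 1 as well
for n in [1, 10, 100, 1000]:
    print(f"E(X * exp(-X/{n:>4})) ~ {(X * np.exp(-X / n)).mean():.4f}")
```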

Partial converse relation: $X_n \xrightarrow{P} X$ if and only if every subsequence $n_1, n_2, \ldots \in \{1, 2, \ldots\}$ has a further subsequence $m_1, m_2, \ldots \in \{n_1, n_2, \ldots\}$ such that $X_{m_j} \xrightarrow{a.s.} X$ as $j \to \infty$.
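The standard example behind this result is the "typewriter" sequence of indicators, which converges in probability but not almost surely, while a dyadic subsequence does converge almost surely. A rough simulation sketch (assuming NumPy; the indexing scheme below is just one way to code the example):

```python
import numpy as np

rng = np.random.default_rng(2)
U = rng.uniform(size=50_000)

# "Typewriter" sequence: write n = 2^k + j with 0 <= j < 2^k and set
# X_n = 1{ U in [j/2^k, (j+1)/2^k) }.  Then P(X_n = 1) = 2^{-k} -> 0
# (convergence in probability), but every path takes the value 1 infinitely
# often, so X_n does not converge almost surely.
def X(n, U):
    k = int(np.floor(np.log2(n)))
    j = n - 2**k
    return ((U >= j / 2**k) & (U < (j + 1) / 2**k)).astype(float)

for n in [3, 10, 100, 1000]:
    print(f"P(X_{n} = 1) ~ {X(n, U).mean():.4f}")

# Along the subsequence n_k = 2^k the indicator interval is pinned at [0, 2^{-k}),
# so for each fixed path X_{2^k}(U) is eventually 0: almost sure convergence.
```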

Borel–Cantelli Lemma: for $\{A_n : n \ge 1\}$ a sequence of events in a probability space, if $\sum_{n=1}^{\infty} P(A_n) < \infty$, then $P(A_n \text{ i.o.}) = 0$; only a finite number of the events occur, with probability 1. Conversely, if the $A_n$ are independent and $\sum_{n=1}^{\infty} P(A_n) = \infty$, then $P(A_n \text{ i.o.}) = 1$; an infinite number of the events occur, with probability 1.

The Borel–Cantelli Lemma is useful in problems related to a.s. convergence: $X_n \xrightarrow{a.s.} X$ can be written as $P(|X_n - X| > \epsilon \text{ i.o.}) = 0$ for all $\epsilon > 0$.
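A small simulation sketch of both directions (assuming NumPy; the event probabilities $1/n^2$ and $1/n$ are just illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
N, paths = 50_000, 100            # finite horizon and number of sample paths
n = np.arange(1, N + 1)

# Convergent case: independent events with P(A_n) = 1/n^2, so sum P(A_n) < inf
# and only finitely many events occur per path (about pi^2/6 ~ 1.64 on average)
occ = rng.uniform(size=(paths, N)) < 1 / n**2
print("sum finite  :", occ.sum(axis=1).mean(), "occurrences per path on average")

# Divergent case: independent events with P(A_n) = 1/n, so sum P(A_n) = inf
# and the number of occurrences keeps growing with N (roughly log N ~ 10.8 here)
occ = rng.uniform(size=(paths, N)) < 1 / n
print("sum infinite:", occ.sum(axis=1).mean(), "occurrences per path on average")
```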

Laws of Large Numbers: When the convergence is in probability or in law, this is known as the weak law of large numbers (WLLN): if $E|X| < \infty$, then $\bar{X}_n \xrightarrow{P} \mu = E(X)$. When the convergence is almost sure, it is the strong law of large numbers (SLLN): $\bar{X}_n \xrightarrow{a.s.} \mu$ if and only if $E|X| < \infty$ and $\mu = E(X)$.
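A minimal sketch of the law of large numbers (assuming NumPy), with i.i.d. Exponential(1) draws so that $\mu = 1$:

```python
import numpy as np

rng = np.random.default_rng(4)

# Running mean of i.i.d. Exponential(1) draws should settle near mu = 1
X = rng.exponential(scale=1.0, size=100_000)
running_mean = np.cumsum(X) / np.arange(1, X.size + 1)
for n in [10, 100, 1_000, 100_000]:
    print(f"n={n:>7}:  X_bar = {running_mean[n - 1]:.4f}")
```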

Central Limit Theorem: Let $X_1, X_2, \ldots$ be i.i.d. random vectors with mean $\mu$ and finite covariance matrix $\Sigma$. Then $\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{L} N(0, \Sigma)$.
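A one-dimensional sketch (assuming NumPy): with $X_i \sim \mathrm{Exponential}(1)$ we have $\mu = 1$ and $\sigma^2 = 1$, so $\sqrt{n}(\bar{X}_n - \mu)$ should look approximately $N(0, 1)$ for large $n$.

```python
import numpy as np

rng = np.random.default_rng(5)

n, reps = 500, 5_000
X = rng.exponential(scale=1.0, size=(reps, n))        # mu = 1, sigma^2 = 1
Z = np.sqrt(n) * (X.mean(axis=1) - 1.0)               # should be roughly N(0, 1)

print("mean ~", round(Z.mean(), 3), " variance ~", round(Z.var(), 3))
print("P(Z <= 1.96) ~", round((Z <= 1.96).mean(), 3), "(standard normal value: 0.975)")
```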

Slutsky’s Theorem: Let $\{X_n\}, \{Y_n\}$ be sequences of scalar/vector/matrix random elements. If $X_n$ converges in distribution to a random element $X$ and $Y_n$ converges in probability to a constant $c$, then $X_n + Y_n \xrightarrow{d} X + c$, $X_n Y_n \xrightarrow{d} cX$, and $X_n / Y_n \xrightarrow{d} X / c$ (provided $c$ is invertible, i.e. nonzero in the scalar case).
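A scalar sketch (assuming NumPy) of the most common use of Slutsky's theorem, the studentized mean: $X_n = \sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} N(0, \sigma^2)$ and the sample standard deviation $Y_n \xrightarrow{P} \sigma$, so the ratio $X_n / Y_n \xrightarrow{d} N(0, 1)$.

```python
import numpy as np

rng = np.random.default_rng(6)

n, reps = 1_000, 2_000
data = rng.exponential(scale=1.0, size=(reps, n))     # mu = 1, sigma = 1
X_n = np.sqrt(n) * (data.mean(axis=1) - 1.0)          # ->d N(0, sigma^2)
Y_n = data.std(axis=1, ddof=1)                        # ->P sigma = 1 (a constant)
T = X_n / Y_n                                         # Slutsky: ->d N(0, 1)

print("variance of X_n / Y_n ~", round(T.var(), 3), "(should be close to 1)")
```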

The SVM1 could be an application of the Lebesgue Dominated Convergence Theorem and the Central Limit Theorem. We can use these theorems together with the partial converse relation above to study the hinge loss when the data are not linearly separable. By passing to limits in a Hilbert space, we can extract a weakly convergent subsequence. We can then apply the asymptotic normality property as the regularization parameters $\lambda_i \to 0$ to solve the minimization problem for the hyperplane $\omega_n$, where the solution $\tilde{\omega} = \omega$ because $\omega(\lambda_i) \xrightarrow{a.s.} \omega$.
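This is only a proof sketch, but the stabilization of the hyperplane as the regularization vanishes can be seen empirically. Below is a rough illustration (assuming scikit-learn and NumPy; the synthetic dataset, the mapping $C = 1/\lambda$, and the chosen $\lambda_i$ values are hypothetical choices for the demo, not part of the argument above):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Fit linear soft-margin SVMs with shrinking regularization lambda (C = 1/lambda)
# and watch the fitted hyperplane w(lambda) settle down as lambda -> 0.
X, y = make_classification(n_samples=500, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

for lam in [1.0, 0.1, 0.01, 0.001]:
    clf = SVC(kernel="linear", C=1.0 / lam).fit(X, y)
    print(f"lambda = {lam:<6} w = {np.round(clf.coef_.ravel(), 3)}")
```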

Recurrence of a Markov chain2 can be proved by using the Borel–Cantelli Lemma: a state $i$ is recurrent if the probability of starting from state $i$ and eventually returning to $i$ is 1. If this probability is strictly less than 1, $i$ is called transient.
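A small simulation sketch (assuming NumPy; the two-state transition matrix is an arbitrary example) that estimates the return probability to state 0, truncated at a finite number of steps:

```python
import numpy as np

rng = np.random.default_rng(7)

# Irreducible finite chain: every state is recurrent, so the return probability is 1
P = np.array([[0.3, 0.7],
              [0.4, 0.6]])

def returns_to_start(P, start=0, max_steps=10_000):
    """Simulate one path and report whether it returns to `start` within max_steps."""
    state = rng.choice(len(P), p=P[start])       # first step away from `start`
    for _ in range(max_steps):
        if state == start:
            return True
        state = rng.choice(len(P), p=P[state])
    return False

trials = 2_000
print("estimated return probability:",
      np.mean([returns_to_start(P) for _ in range(trials)]))
# ~1.0, so state 0 is recurrent; a transient state would give a value strictly below 1
```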


  1. The SVM is a supervised learning model with an associated learning algorithm that analyzes data for classification and regression analysis.

  2. A Markov chain is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event. In probability theory and related fields, a Markov process, named after the Russian mathematician Andrey Markov, is a stochastic process that satisfies the Markov property. The defining characteristic of a Markov chain is that no matter how the process arrived at its present state, the possible future states are fixed. In other words, the probability of transitioning to any particular state depends solely on the current state and the time elapsed. The state space, or set of all possible states, can be anything: letters, numbers, weather conditions, sales volume, etc.
