Ayed, F., Battiston, M., Camerlenghi, F., Favaro, S. (2021). On consistent and rate optimal estimation of the missing mass. Annales de l'Institut Henri Poincaré, Probabilités et Statistiques, 57(3), 1476-1494 [10.1214/20-AIHP1126].
On consistent and rate optimal estimation of the missing mass
Camerlenghi, Federico
2021
Abstract
Given n samples from a population of individuals belonging to different types with unknown proportions, how do we estimate the probability of discovering a new type at the (n + 1)th draw? This is a classical problem in statistics, commonly referred to as the missing mass estimation problem. Recent results have shown: (i) the impossibility of estimating the missing mass without imposing further assumptions on the type proportions; (ii) the consistency of the Good–Turing estimator of the missing mass under the assumption that the tail of the type proportions decays to zero as a regularly varying function with parameter α ∈ (0, 1); (iii) the rate of convergence n^(−α/2) for the Good–Turing estimator under the class of α ∈ (0, 1) regularly varying P. In this paper we introduce an alternative, and remarkably shorter, proof of the impossibility of a distribution-free estimation of the missing mass. Besides being of independent interest, our alternative proof suggests a natural approach to strengthen, and expand, the recent results on the rate of convergence of the Good–Turing estimator under α ∈ (0, 1) regularly varying type proportions. In particular, we show that the convergence rate n^(−α/2) is the best rate that any estimator can achieve, up to a slowly varying function. Furthermore, we prove that a lower bound to the minimax estimation risk must scale at least as n^(−α/2), which leads us to conjecture that the Good–Turing estimator is a rate optimal minimax estimator under regularly varying type proportions.
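For readers unfamiliar with the Good–Turing estimator discussed in the abstract, the following is a minimal sketch of its standard form for the missing mass: the number of types observed exactly once, divided by the sample size n. The function name `good_turing_missing_mass` and the toy sample are illustrative assumptions and do not come from the paper.

```python
from collections import Counter


def good_turing_missing_mass(sample):
    """Good-Turing estimate of the missing mass.

    Returns (number of types seen exactly once) / n, where n is the
    sample size. The function name is hypothetical, chosen for this sketch.
    """
    n = len(sample)
    if n == 0:
        raise ValueError("sample must be non-empty")
    # Count how many times each type appears in the sample.
    counts = Counter(sample)
    # Singletons are the types observed exactly once.
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / n


# Toy sample (illustrative only): types "c" and "d" each appear once.
toy_sample = ["a", "b", "a", "c", "d", "a", "b"]
estimate = good_turing_missing_mass(toy_sample)
print(estimate)
```

In the toy sample, two of the seven observations are singletons, so the estimated probability that the next draw reveals a new type is 2/7.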