Computes the data-driven regularization parameter \(\hat\lambda^{\rm PDB}_\alpha\) using the Pivotal Detection Boundary (PDB) principle. The selected value is defined as the empirical \((1 - \alpha)\) quantile of a null-distribution gradient statistic $$\hat\lambda^{\rm PDB}_\alpha =q_{1-\alpha}\left(\left\|\nabla \ell_0\right\|_\infty \right),$$ where \(\ell_0\) denotes the loss evaluated under the null model.
Usage
lambda_pdb(
X,
family,
n_simu = 5000L,
alpha = 0.05,
method = c("mc_exact", "mc_gaussian", "analytical")
)Arguments
- X
Numeric design matrix. Columns should typically be standardized to zero mean and unit variance.
- family
A PIC family specification. Can be either a family object or a character string accepted by
get_family().- n_simu
Number of Monte Carlo simulations used by the stochastic methods. Ignored when
method = "analytical".- alpha
Nominal tail probability used to define the quantile level.
- method
One of "mc_exact", "mc_gaussian", or "analytical".
Value
An object of class "pic.lambda_pdb".
- value
Selected regularization parameter \(\hat\lambda^{\rm PDB}_\alpha\).
- statistics
Simulated null statistics used to estimate the quantile.
NULLfor the analytical method.- method
Estimation method used.
- alpha
Quantile level parameter.
- n_simu
Number of Monte Carlo simulations.
Details
Under the null \(\beta = 0\), the gradient of the loss carries only sampling noise.
The smallest \(\lambda\) large enough to dominate this noise is the natural threshold
separating signal from noise, i.e. the value above which a coefficient should be
kept rather than shrunk to zero. Calibrating \(\hat\lambda\) this way has two
consequences. First, the quantile depends only on the design matrix \(X\),
the family, and the level \(\alpha\) (not on the response \(y\)), so
cross-validation is no longer needed. Second, and more importantly, it leads
to sharper support recovery than prediction-error-based selectors such as
cross-validated lasso: by targeting the noise level of the gradient directly,
PDB controls the inclusion of noise variables and more reliably identifies
the true non-zero coefficients. Computing the quantile requires only the
distribution of \(\|\nabla \ell_0\|_\infty\), which lambda_pdb() estimates
through one of the three methods below.
Details on method option
The empirical quantile is obtained using one of the three following methods:
"mc_exact"Family-aware Monte Carlo. For each of the
n_simudraws, a response vector is sampled under the null model (\(\beta = 0\)) of the chosen family; the family-specific gradient of the loss at \(\beta = 0\) is evaluated on the fixed design \(X\), and its supremum norm is recorded. The empirical \((1 - \alpha)\) quantile of then_simurecorded norms is returned. Most accurate but slowest of the three."mc_gaussian"Monte Carlo under the Gaussian approximation of the null gradient. A central-limit argument gives \(\nabla \ell_0 \approx \mathcal{N}(0,\, c(n)\, \Sigma_X / n)\) with \(\Sigma_X = X^\top X / n\). Each of the
n_simudraws samples directly from this Gaussian and records its supremum norm — no family-specific evaluation needed. Family-agnostic and noticeably faster than"mc_exact"; valid in the regime where the CLT kicks in (moderate to large \(n\))."analytical"Closed-form Bonferroni bound on the Gaussian tail. Combining a union bound over the \(p\) coordinates of the gradient with the standard Gaussian tail bound gives $$\hat\lambda_\alpha^{\rm analytical} = \Phi^{-1}\!\left(1 - \alpha / (2p)\right)\, \sqrt{c(n) / n}.$$ Deterministic and \(O(1)\) — no simulation. Conservative (slightly over-estimates the true quantile) and gets looser as \(p\) grows, but useful when speed matters or when Monte Carlo is overkill.
The "mc_gaussian" and "analytical" methods use a variance scaling
factor \(c(n)\) depending on the family:
Gaussian / Binomial / Poisson / Exponential: \(c(n) = 1\)
Gumbel: \(c(n) = \exp(2(\gamma + 1))\), with \(\gamma\) the Euler-Mascheroni constant.
Cox: \(c(n) = 1 / (4 \log n)\)
Computational cost
Both Monte Carlo methods are dominated by a single \(p \times n_{\rm simu}\)
matrix product \(X^\top R\), where \(R\) stacks the simulated
residuals. This product is dispatched to BLAS as a single \(\tt gemm\),
which is essentially as fast as a Monte Carlo selector can be on dense
designs. For very large \(n\) or \(p\), however, the constant
becomes large and "mc_exact" may be unnecessarily expensive:
As \(n\) grows, the central-limit approximation tightens and
"mc_gaussian"gives essentially the same \(\hat\lambda\) as"mc_exact"at a fraction of the cost (no family-specific residual generation, fewer dependencies on \(y\)-draws)."analytical"is \(O(1)\) and useful as a quick upper bound — slightly conservative, but accurate enough for triage and for very high \(p\) where the Bonferroni tail is tight.
Practical rule of thumb: prefer "mc_exact" for small to moderate
problems (default), "mc_gaussian" once \(n\) grows and the
design is dense, and "analytical" when even the Monte Carlo cost is
a concern.
Details on alpha option
The level \(\alpha\) is the nominal Type-I error of the test of \(H_0\!\colon \beta = 0\). By construction of the quantile, $$\Pr\!\left(\left\|\nabla \ell_0\right\|_\infty > \hat\lambda^{\rm PDB}_\alpha \,\big|\, H_0\right) = \alpha,$$ so under the null model no variable enters the active set with probability at most \(\alpha\). With the default \(\alpha = 0.05\), this caps the false-discovery rate at \(5\%\) under the null: when the data carry no signal, picreg returns the empty support \(95\%\) of the time.