Skip to contents

Computes the data-driven regularization parameter \(\hat\lambda^{\rm PDB}_\alpha\) using the Pivotal Detection Boundary (PDB) principle. The selected value is defined as the empirical \((1 - \alpha)\) quantile of a null-distribution gradient statistic $$\hat\lambda^{\rm PDB}_\alpha =q_{1-\alpha}\left(\left\|\nabla \ell_0\right\|_\infty \right),$$ where \(\ell_0\) denotes the loss evaluated under the null model.

Usage

lambda_pdb(
  X,
  family,
  n_simu = 5000L,
  alpha = 0.05,
  method = c("mc_exact", "mc_gaussian", "analytical")
)

Arguments

X

Numeric design matrix. Columns should typically be standardized to zero mean and unit variance.

family

A PIC family specification. Can be either a family object or a character string accepted by get_family().

n_simu

Number of Monte Carlo simulations used by the stochastic methods. Ignored when method = "analytical".

alpha

Nominal tail probability used to define the quantile level.

method

One of "mc_exact", "mc_gaussian", or "analytical".

Value

An object of class "pic.lambda_pdb".

value

Selected regularization parameter \(\hat\lambda^{\rm PDB}_\alpha\).

statistics

Simulated null statistics used to estimate the quantile. NULL for the analytical method.

method

Estimation method used.

alpha

Quantile level parameter.

n_simu

Number of Monte Carlo simulations.

Details

Under the null \(\beta = 0\), the gradient of the loss carries only sampling noise. The smallest \(\lambda\) large enough to dominate this noise is the natural threshold separating signal from noise, i.e. the value above which a coefficient should be kept rather than shrunk to zero. Calibrating \(\hat\lambda\) this way has two consequences. First, the quantile depends only on the design matrix \(X\), the family, and the level \(\alpha\) (not on the response \(y\)), so cross-validation is no longer needed. Second, and more importantly, it leads to sharper support recovery than prediction-error-based selectors such as cross-validated lasso: by targeting the noise level of the gradient directly, PDB controls the inclusion of noise variables and more reliably identifies the true non-zero coefficients. Computing the quantile requires only the distribution of \(\|\nabla \ell_0\|_\infty\), which lambda_pdb() estimates through one of the three methods below.

Details on method option

The empirical quantile is obtained using one of the three following methods:

"mc_exact"

Family-aware Monte Carlo. For each of the n_simu draws, a response vector is sampled under the null model (\(\beta = 0\)) of the chosen family; the family-specific gradient of the loss at \(\beta = 0\) is evaluated on the fixed design \(X\), and its supremum norm is recorded. The empirical \((1 - \alpha)\) quantile of the n_simu recorded norms is returned. Most accurate but slowest of the three.

"mc_gaussian"

Monte Carlo under the Gaussian approximation of the null gradient. A central-limit argument gives \(\nabla \ell_0 \approx \mathcal{N}(0,\, c(n)\, \Sigma_X / n)\) with \(\Sigma_X = X^\top X / n\). Each of the n_simu draws samples directly from this Gaussian and records its supremum norm — no family-specific evaluation needed. Family-agnostic and noticeably faster than "mc_exact"; valid in the regime where the CLT kicks in (moderate to large \(n\)).

"analytical"

Closed-form Bonferroni bound on the Gaussian tail. Combining a union bound over the \(p\) coordinates of the gradient with the standard Gaussian tail bound gives $$\hat\lambda_\alpha^{\rm analytical} = \Phi^{-1}\!\left(1 - \alpha / (2p)\right)\, \sqrt{c(n) / n}.$$ Deterministic and \(O(1)\) — no simulation. Conservative (slightly over-estimates the true quantile) and gets looser as \(p\) grows, but useful when speed matters or when Monte Carlo is overkill.

The "mc_gaussian" and "analytical" methods use a variance scaling factor \(c(n)\) depending on the family:

  • Gaussian / Binomial / Poisson / Exponential: \(c(n) = 1\)

  • Gumbel: \(c(n) = \exp(2(\gamma + 1))\), with \(\gamma\) the Euler-Mascheroni constant.

  • Cox: \(c(n) = 1 / (4 \log n)\)

Computational cost

Both Monte Carlo methods are dominated by a single \(p \times n_{\rm simu}\) matrix product \(X^\top R\), where \(R\) stacks the simulated residuals. This product is dispatched to BLAS as a single \(\tt gemm\), which is essentially as fast as a Monte Carlo selector can be on dense designs. For very large \(n\) or \(p\), however, the constant becomes large and "mc_exact" may be unnecessarily expensive:

  • As \(n\) grows, the central-limit approximation tightens and "mc_gaussian" gives essentially the same \(\hat\lambda\) as "mc_exact" at a fraction of the cost (no family-specific residual generation, fewer dependencies on \(y\)-draws).

  • "analytical" is \(O(1)\) and useful as a quick upper bound — slightly conservative, but accurate enough for triage and for very high \(p\) where the Bonferroni tail is tight.

Practical rule of thumb: prefer "mc_exact" for small to moderate problems (default), "mc_gaussian" once \(n\) grows and the design is dense, and "analytical" when even the Monte Carlo cost is a concern.

Details on alpha option

The level \(\alpha\) is the nominal Type-I error of the test of \(H_0\!\colon \beta = 0\). By construction of the quantile, $$\Pr\!\left(\left\|\nabla \ell_0\right\|_\infty > \hat\lambda^{\rm PDB}_\alpha \,\big|\, H_0\right) = \alpha,$$ so under the null model no variable enters the active set with probability at most \(\alpha\). With the default \(\alpha = 0.05\), this caps the false-discovery rate at \(5\%\) under the null: when the data carry no signal, picreg returns the empty support \(95\%\) of the time.

Examples

X <- scale(matrix(rnorm(100 * 20), 100, 20))
lam <- lambda_pdb(
  X,
  family = "gaussian",
  method = "mc_exact"
)
#> Warning: lambda_pdb: design matrix X does not appear standardized; results may be unreliable.
print(lam)
#> pic lambda_pdb (method = mc_exact, alpha = 0.05)
#>   lambda = 0.295355