CDE API Guide: Conditional Density Estimation ============================================= This is the comprehensive API documentation for the Conditional Density Estimation (CDE) module in VBI, which provides implementations of Mixture Density Networks (MDN) and Masked Autoregressive Flows (MAF) for parameter inference in brain models. .. note:: **Looking for a quick start?** See :doc:`inference_cde_mdn_basic` for a minimal working example with MDN. Introduction to Conditional Density Estimation ----------------------------------------------- **What is Conditional Density Estimation?** Conditional Density Estimation (CDE) addresses the problem of learning the full conditional probability density .. math:: p(y \mid x), where :math:`x \in \mathcal{X}` denotes observed input variables and :math:`y \in \mathcal{Y}` denotes a target variable. In contrast to standard regression methods, which estimate a single point prediction (e.g. the conditional mean :math:`\mathbb{E}[Y \mid X=x]`), CDE aims to recover the entire conditional distribution of :math:`Y` given :math:`X=x`. This enables explicit modeling of uncertainty, heteroskedasticity, skewness, and multimodality in the output space. **Problem Setting** Let .. math:: \mathcal{D} = \{(x_i, y_i)\}_{i=1}^N be a dataset of input–output pairs drawn from an unknown joint distribution :math:`p(x, y)`. The goal of CDE is to learn a parametric model :math:`\hat{p}_\phi(y \mid x)` that approximates the true conditional density :math:`p(y \mid x)` and generalizes to unseen inputs :math:`x^*`. **Model Formulation** In CDE, the conditional density is parameterized by a function approximator, typically a neural network, which maps each input :math:`x` to the parameters of a conditional distribution over :math:`y`: .. math:: x \;\mapsto\; \hat{p}_\phi(y \mid x). Depending on the chosen model class, this mapping may yield: - Parameters of a parametric distribution (e.g. mean and variance of a Gaussian) - Parameters of a mixture model (e.g. mixture weights, component means, and covariances) - Parameters of an invertible transformation defining a normalized density (e.g. conditional normalizing flows) **Training Objective** CDE models are commonly trained using maximum conditional likelihood estimation. The parameters :math:`\phi` are optimized by minimizing the negative log-likelihood (NLL): .. math:: \mathcal{L}(\phi) = -\frac{1}{N} \sum_{i=1}^N \log \hat{p}_\phi(y_i \mid x_i). Minimizing this objective encourages the model to assign high probability mass to observed targets :math:`y_i` conditioned on their corresponding inputs :math:`x_i`. **Expressive CDE Models** Several model families are commonly used for conditional density estimation: - **Mixture Density Networks (MDNs):** Represent :math:`p(y \mid x)` as a mixture of parametric distributions whose parameters depend on :math:`x`. - **Normalizing Flows:** Construct highly expressive conditional densities via invertible transformations of a simple base distribution. - **Conditional Kernel Density Estimators:** Estimate densities using kernel smoothing conditioned on the input variables. - **Implicit or likelihood-free models:** Approximate conditional densities using simulation-based or amortized inference techniques. **Inference and Usage** After training, a CDE model can be used to: - Evaluate conditional likelihoods :math:`\hat{p}(y \mid x)` - Compute summary statistics such as conditional means, variances, or quantiles - Draw samples from the conditional distribution :math:`p(y \mid x)` - Quantify predictive uncertainty for downstream decision-making **Relation to Bayesian Inference** Conditional Density Estimation naturally encompasses Bayesian posterior inference. In simulation-based inference, the posterior :math:`p(\theta \mid x)` is itself a conditional density, where model parameters :math:`\theta` are conditioned on observed data :math:`x`. Consequently, many simulation-based inference methods can be interpreted as specialized instances of CDE. **Summary** Conditional Density Estimation provides a principled framework for learning full predictive distributions conditioned on observed inputs. By modeling :math:`p(y \mid x)` directly, CDE enables uncertainty-aware predictions and forms a core component of modern probabilistic machine learning and inference pipelines. VBI Implementation Overview ---------------------------- The CDE module provides two main approaches for conditional density estimation: **Mixture Density Networks (MDN):** - Uses Gaussian mixture models for density approximation - Fast training and inference - Good for simpler parameter relationships - Interpretable mixture components - Best for: Low-to-moderate dimensional problems (< 10 parameters) **Masked Autoregressive Flows (MAF):** - Uses normalizing flows for flexible density modeling - More expressive for complex distributions - Naturally captures parameter dependencies - Better for multimodal posteriors - Best for: Complex, high-dimensional problems (> 10 parameters) Both approaches inherit from a common base class that provides standardized training, sampling, and evaluation methods. Architecture ------------ Base Class: ConditionalDensityEstimator ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ All density estimators inherit from ``ConditionalDensityEstimator``, which provides: - **Unified Training Interface**: Adam optimizer with early stopping - **Dimension Inference**: Automatic parameter/feature dimension detection - **Data Validation**: Comprehensive input checking and preprocessing - **Standardized API**: Consistent ``train()``, ``sample()``, and ``log_prob()`` methods .. code-block:: python from vbi.cde import ConditionalDensityEstimator # Base class provides common functionality estimator = ConditionalDensityEstimator(param_dim=2, feature_dim=3) Key Features: - **Automatic Dimension Inference**: Set ``param_dim=None`` and ``feature_dim=None`` for auto-detection - **Robust Training**: Handles non-finite values, provides convergence monitoring - **Early Stopping**: Optional plateau detection with customizable patience - **Progress Monitoring**: tqdm integration for training progress visualization Mixture Density Network (MDN) ------------------------------ The ``MDNEstimator`` class implements conditional density estimation using Gaussian mixture models. Class Parameters ~~~~~~~~~~~~~~~~ .. code-block:: python from vbi.cde import MDNEstimator mdn = MDNEstimator( param_dim=2, # Target parameter dimensionality feature_dim=3, # Conditional feature dimensionality n_components=5, # Number of mixture components (default: 5) hidden_sizes=(32, 32) # Hidden layer sizes (default: (32, 32)) ) **Parameter Details:** - ``param_dim``: Dimensionality of parameters to estimate (θ) - ``feature_dim``: Dimensionality of conditional features (x) - ``n_components``: Number of Gaussian mixture components (K) - ``hidden_sizes``: Tuple of hidden layer sizes for the MLP Training ~~~~~~~~ .. code-block:: python # Train the MDN loss_history = mdn.train( params=theta_train, # Shape: (N, param_dim) features=x_train, # Shape: (N, feature_dim) n_iter=2000, # Training iterations learning_rate=1e-3, # Adam learning rate seed=42, # Random seed for reproducibility use_tqdm=True, # Progress bar patience=100, # Early stopping patience min_delta=1e-4 # Minimum improvement threshold ) **Training Features:** - **Adam Optimization**: Adaptive learning rate with momentum - **Early Stopping**: Stops when loss improvement < min_delta for patience iterations - **Loss Monitoring**: Tracks negative log-likelihood throughout training - **Data Preprocessing**: Automatic handling of non-finite values Inference Methods ~~~~~~~~~~~~~~~~~ **Log Probability Evaluation:** .. code-block:: python # Compute log p(θ|x) for each sample log_probs = mdn.log_prob( features=x_test, # Shape: (N, feature_dim) params=theta_test # Shape: (N, param_dim) ) # Returns: array of shape (N,) with log probabilities **Sampling:** .. code-block:: python # Generate samples from posterior p(θ|x) samples = mdn.sample( features=x_obs, # Shape: (n_conditions, feature_dim) n_samples=1000, # Samples per condition rng=np.random.RandomState(42), log_prob_threshold=None, # Optional rejection sampling oversample_factor=5 # Oversampling for rejection ) # Returns: shape (n_conditions, n_samples, param_dim) **Advanced Sampling Features:** - **Rejection Sampling**: Filter low-probability samples using ``log_prob_threshold`` - **Oversampling**: Generate extra candidates to account for rejections - **Fallback Handling**: Graceful degradation when sampling fails Masked Autoregressive Flow (MAF) --------------------------------- The ``MAFEstimator`` class provides a more flexible approach using normalizing flows. Class Parameters ~~~~~~~~~~~~~~~~ .. code-block:: python from vbi.cde import MAFEstimator maf = MAFEstimator( param_dim=2, # Target parameter dimensionality feature_dim=3, # Conditional feature dimensionality n_flows=4, # Number of flow layers hidden_units=64, # Hidden units per MADE block activation='tanh', # Activation function z_score_theta=True, # Standardize parameters z_score_x=True, # Standardize features use_actnorm=True, # Use ActNorm layers embedding_dim=None, # Optional PCA embedding ) **Parameter Details:** - ``n_flows``: Number of autoregressive transformation layers - ``hidden_units``: Number of hidden units in each MADE block - ``activation``: Activation function ('tanh', 'relu', 'elu') - ``z_score_theta``/``z_score_x``: Internal standardization of parameters/features - ``use_actnorm``: Data-dependent initialization of normalization layers - ``embedding_dim``: Optional PCA dimensionality reduction for features Preprocessing ~~~~~~~~~~~~~ MAF requires preprocessing for optimal performance: .. code-block:: python # Compute normalization statistics (call before training) maf.prepare_normalizers( features=x_train, params=theta_train, rng=np.random.RandomState(42) ) # Reinitialize weights and masks maf.reinitialize(rng=np.random.RandomState(42)) **Preprocessing Steps:** 1. **Standardization**: Z-score normalization of parameters and features 2. **PCA Embedding**: Optional dimensionality reduction for high-dimensional features 3. **Weight Initialization**: Proper initialization of MADE masks and flow parameters Training ~~~~~~~~ .. code-block:: python # Train the MAF maf.train( params=theta_train, features=x_train, n_iter=2000, learning_rate=1e-3, seed=42, use_tqdm=True, validation_fraction=0.1, # Validation split stop_after_epochs=20, # Early stopping patience early_stopping_delta=0.0, # Minimum improvement clip_max_norm=5.0 # Gradient clipping ) **Advanced Training Features:** - **Train/Validation Split**: Automatic data splitting for monitoring - **Gradient Clipping**: Prevents exploding gradients - **ActNorm Warmup**: Data-dependent initialization of normalization layers - **Convergence Monitoring**: Validation loss tracking with early stopping Inference ~~~~~~~~~ **Log Probability:** .. code-block:: python # Compute log probability under the flow log_probs = maf.log_prob(features=x_test, params=theta_test) **Sampling:** .. code-block:: python # Sample from the learned distribution samples = maf.sample( features=x_obs, n_samples=1000, rng=np.random.RandomState(42) ) # Returns samples in original parameter space Comparison: MAF vs MDN ----------------------- .. list-table:: Backend Performance Comparison :header-rows: 1 :class: color-caption * - **Aspect** - **MDN** - **MAF** * - Expressiveness - Limited to mixture of Gaussians - Highly flexible via flows * - Speed - Fast training/inference - Slower but more accurate * - Interpretability - Clear mixture components - Less interpretable * - Dependencies - Assumes independence - Captures dependencies * - Convergence - Usually stable - May require careful tuning * - Memory - Lower memory usage - Higher memory for flows Best Practices -------------- **Data Preparation:** 1. **Scale your data**: Both methods benefit from properly scaled inputs 2. **Handle outliers**: Remove or robustly handle extreme values 3. **Check dimensions**: Ensure consistent feature/parameter dimensions 4. **Sufficient samples**: Use adequate training data for reliable estimation **Training Tips:** 1. **Monitor convergence**: Use validation splits and early stopping 2. **Tune learning rate**: Start with 1e-3, adjust based on convergence 3. **Gradient clipping**: Essential for MAF to prevent instability 4. **Batch considerations**: Larger batches may improve stability **Hyperparameter Selection:** - **MDN**: Focus on ``n_components`` (3-10) and ``hidden_sizes`` - **MAF**: Tune ``n_flows`` (3-6), ``hidden_units`` (32-128), and ``activation`` Example Usage ------------- Complete example for brain model parameter inference: .. code-block:: python import numpy as np from vbi.cde import MAFEstimator # Load your brain model simulation data theta = np.load('simulation_parameters.npy') # Shape: (N, param_dim) features = np.load('simulation_features.npy') # Shape: (N, feature_dim) # Initialize and configure MAF maf = MAFEstimator( param_dim=theta.shape[1], feature_dim=features.shape[1], n_flows=4, hidden_units=64 ) # Preprocessing maf.prepare_normalizers(features, theta) maf.reinitialize() # Training with monitoring maf.train( params=theta, features=features, n_iter=1000, validation_fraction=0.2, stop_after_epochs=10 ) # Inference on new observations observed_features = np.load('experimental_data.npy') posterior_samples = maf.sample( features=observed_features, n_samples=5000 ) # Analyze posterior print(f"Posterior shape: {posterior_samples.shape}") print(f"Mean parameters: {np.mean(posterior_samples, axis=1)}") Troubleshooting --------------- **Common Issues:** - **Non-finite losses**: Check for NaN/inf in your data - **Poor convergence**: Try lower learning rate or gradient clipping - **Memory errors**: Reduce batch size or model complexity - **Sampling failures**: Check for singular matrices in MDN **Performance Optimization:** - Use ``float32`` precision for faster computation - Enable GPU acceleration if available - Consider PCA embedding for high-dimensional features - Monitor validation loss for overfitting References ---------- The CDE implementations are based on: 1. **MDN**: Bishop, C. M. (1994). Mixture density networks. Technical Report NCRG/94/004 2. **MAF**: Papamakarios, G., et al. (2017). Masked autoregressive flow for density estimation. NeurIPS 3. **ActNorm**: Kingma, D. P., & Dhariwal, P. (2018). Glow: Generative flow with invertible 1x1 convolutions. NeurIPS For brain model applications, see the examples directory for complete notebooks demonstrating parameter inference in Jansen-Rit, Wilson-Cowan, and other neural mass models.