ComplyHat returns deterministic, reproducible numbers, the kind a regulator can re-derive from your inputs. It does not synthesize prose, run an internal LLM, or call your model’s prediction function. Your MCP-capable host (Claude Code, Claude Desktop, Codex Desktop, Codex CLI, OpenClaw, NemoClaw, or any client speaking streamable-HTTP MCP) brings the reasoning. This page is the canonical reference for what the platform measures, how it measures it, and why the defaults are what they are. Every report carries the metric values, thresholds, dataset row counts, subgroup sizes, data-quality warnings, engine semver, and random seeds, so any finding is replayable from the same inputs.
ComplyHat has zero internal model calls. The bias, drift, and explainability engines are pure statistical libraries (@complyhat/bias-engine, @complyhat/drift-engine, @complyhat/explainability-engine). Host agents bring the reasoning; ComplyHat returns structured citations and audit-tagged prose ([EXTRACTED] / [INFERRED] / [AMBIGUOUS]).
At a glance
Bias
Four fairness metrics: Four-Fifths (EEOC), statistical parity, equal opportunity, predictive parity. Each with a default threshold tracing to a legal or academic source.
Drift
Three distribution tests: PSI, KS, chi-squared. Picked per feature type and the kind of shift you need to catch.
Explainability
Two model-agnostic local explainers: LIME and a coalition-attribution proxy. Both ship completeness scores so a noisy single-sample run is caught before it enters an audit trail.
Adversarial robustness
Boundary perturbation (smallest flip-inducing input change) and data-quality robustness (graceful degradation under realistic production-data errors). The latter is what EU AI Act Article 15 asks for.
Bias evaluation
Four fairness metrics. Each operates on a tabular dataset with an outcome column, a predicted-outcome column (where applicable), and one or more protected-class columns. All four return a pass / fail ruling against a configurable threshold; defaults trace to a legal or academic source.
- Disparate impact
- Statistical parity
- Equal opportunity
- Predictive parity
Metric. For each subgroup g inside a protected class, compute the favorable outcome rate favorable_rate(g) = favorable_count(g) / total(g). The adverse impact ratio is favorable_rate(g) / favorable_rate(reference) where the reference is the highest-favorable-rate subgroup.
Ruling. Fail if any subgroup’s adverse impact ratio is below 0.80 (the “Four-Fifths” threshold).
Origin. Uniform Guidelines on Employee Selection Procedures, 29 C.F.R. §1607.4(D), 1978. Adopted into model-risk practice by NIST AI RMF (AI RMF 1.0, MAP-5) and reused by NYC Local Law 144 audit rules (§20-870, 2023).
What it catches. Systematic underselection of a subgroup at the decision boundary. Commonly the first fairness screen because it requires no ground-truth labels.
What it misses. Equal selection rates can still mask per-error-type harms; a model can clear Four-Fifths while rejecting qualified members of one group at higher rates. Equal opportunity and predictive parity exist to catch that.
Data quality gate
Before any of the four metrics run, the engine checks the dataset:
- Subgroup sample sizes. Warn if any subgroup has n < 30 (statistical parity unstable) or n < 100 (all four tests unstable).
- Class imbalance. Warn if the smallest subgroup is less than 5% of the total dataset.
- Missing values. Warn if more than 10% of rows are missing the protected-class column.
pass is statistically meaningful.
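
The disparate impact metric and the data quality gate described above can be sketched together in a few lines of Python. This is a minimal illustration under stated assumptions, not the `@complyhat/bias-engine` API: the function names, the row-as-dict layout, the `group` and `hired` column names, and the warning strings are all hypothetical.

```python
from collections import Counter

FOUR_FIFTHS = 0.80  # default threshold stated in the text


def adverse_impact_ratios(rows, group_key, favorable_key):
    """Map each subgroup to favorable_rate(g) / favorable_rate(reference)."""
    totals, favorable = Counter(), Counter()
    for row in rows:
        group = row.get(group_key)
        if group is None:  # assumption: rows without a protected-class value are skipped
            continue
        totals[group] += 1
        if row[favorable_key]:
            favorable[group] += 1
    rates = {g: favorable[g] / totals[g] for g in totals}
    reference = max(rates.values())  # highest-favorable-rate subgroup
    if reference == 0:
        raise ValueError("no favorable outcomes in any subgroup")
    return {g: rate / reference for g, rate in rates.items()}


def quality_warnings(rows, group_key):
    """Apply the three data-quality checks and return warning strings."""
    warnings = []
    total = len(rows)
    sizes = Counter(r[group_key] for r in rows if r.get(group_key) is not None)
    for group, n in sorted(sizes.items()):
        if n < 100:
            warnings.append(f"{group}: n={n} < 100, all four tests unstable")
        if n < 30:
            warnings.append(f"{group}: n={n} < 30, statistical parity unstable")
    if sizes and min(sizes.values()) / total < 0.05:
        warnings.append("smallest subgroup is under 5% of the dataset")
    missing = total - sum(sizes.values())
    if total and missing / total > 0.10:
        warnings.append("over 10% of rows lack the protected-class value")
    return warnings


# Illustrative data: group A is selected at 60%, group B at 42%.
rows = [{"group": "A", "hired": i < 60} for i in range(100)]
rows += [{"group": "B", "hired": i < 42} for i in range(100)]

ratios = adverse_impact_ratios(rows, "group", "hired")
failing = sorted(g for g, r in ratios.items() if r < FOUR_FIFTHS)
```

With this data the ratio for group B is 0.42 / 0.60, about 0.70, so B falls below the Four-Fifths threshold while the gate raises no warnings (both subgroups have n = 100 and no protected-class values are missing).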
Distribution drift
Drift is evaluated by comparing a baseline snapshot (typically training) against a production snapshot. Three methods, picked per feature type: PSI (numerical or categorical), KS (numerical), chi-squared (categorical).
- PSI
- KS test
- Chi-squared
Population Stability Index. Bin baseline and production into B bins (default 10, quantile-based on the baseline). For each bin: psi_i = (p_prod_i − p_base_i) × ln(p_prod_i / p_base_i). PSI = sum(psi_i).
| PSI range | Interpretation |
|---|---|
| < 0.10 | No material change |
| 0.10 - 0.25 | Moderate drift, monitor |
| ≥ 0.25 | Significant drift, investigate |
Origin. Yurdakul and Naranjo, Statistical Properties of the Population Stability Index, 2019. Formalized a practice predating it by two decades in credit scoring (Siddiqi, Credit Risk Scorecards, 2006).
Floor handling. Bins with zero counts are smoothed to 1/N to keep the log term finite. The smoothing constant is exposed so audits can reproduce the calculation.
Explainability
Two model-agnostic local explainers. Both produce per-feature attribution scores for a single prediction. Hosts pass in the decision plus a set of neighbor / background decisions and their precomputed outcomes; ComplyHat does not call the host’s prediction function f.
LIME (Local Interpretable Model-agnostic Explanations)
Method. Weight each neighbor decision by an exponential kernel of its Euclidean distance to the target decision in feature space, then fit a weighted least-squares linear surrogate with a leading intercept column. The intercept is returned alongside per-feature slopes; the slopes are the local attributions.
Origin. Ribeiro, Singh, Guestrin, “Why Should I Trust You?” Explaining the Predictions of Any Classifier, KDD 2016.
Defaults. Kernel width 0.75 (override via kernel_width); up to 50,000 neighbors retained.
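
The LIME method above (kernel weights, then a weighted least-squares fit with a leading intercept column) can be sketched as follows. This is an illustration, not the `@complyhat/explainability-engine` implementation: the function names are hypothetical, and the exact kernel form `exp(-d² / width²)` is an assumption, since the text only says "exponential kernel".

```python
import math


def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: neighbors do not span the feature space")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                for c in range(col, n + 1):
                    aug[r][c] -= factor * aug[col][c]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def lime_attributions(target, neighbors, outcomes, kernel_width=0.75):
    """Fit a kernel-weighted linear surrogate; return (intercept, slopes)."""
    design = [[1.0] + list(x) for x in neighbors]  # leading intercept column
    # Assumed kernel form: exp(-distance^2 / width^2) on Euclidean distance.
    weights = [
        math.exp(-(math.dist(x, target) ** 2) / (kernel_width ** 2))
        for x in neighbors
    ]
    k = len(design[0])
    # Normal equations for weighted least squares: (X^T W X) beta = X^T W y
    xtwx = [
        [sum(w * row[i] * row[j] for w, row in zip(weights, design)) for j in range(k)]
        for i in range(k)
    ]
    xtwy = [
        sum(w * row[i] * y for w, row, y in zip(weights, design, outcomes))
        for i in range(k)
    ]
    beta = solve_linear_system(xtwx, xtwy)
    return beta[0], beta[1:]


# Precomputed outcomes that happen to be exactly linear: y = 0.2 + 0.5*x0 - 0.3*x1
neighbors = [(0.5, 0.0), (0.0, 0.5), (-0.5, 0.5), (0.5, -0.5), (0.25, 0.75)]
outcomes = [0.2 + 0.5 * x0 - 0.3 * x1 for x0, x1 in neighbors]
intercept, slopes = lime_attributions((0.0, 0.0), neighbors, outcomes)
```

Because the example outcomes are exactly linear, the surrogate recovers the intercept 0.2 and slopes 0.5 and -0.3 up to floating-point error regardless of the kernel weights. On real decisions the weights determine which neighbors dominate the fit.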
Coalition-attribution proxy
Method. Enumerate or sample feature coalitions, weight each by the Kernel-SHAP kernel (M − 1) / (C(M, |S|) × |S| × (M − |S|)), and solve weighted least squares against per-coalition outcome blends. Because ComplyHat cannot call f, the per-coalition outcome is approximated as (|S| / M) · y_decision + (1 − |S| / M) · y_background_avg. The kernel weights still produce a defensible per-feature ranking, but the resulting values are coalition-fraction-weighted attributions, not Shapley values.
Defaults. Up to 50,000 coalitions sampled; 10,000 background decisions retained.
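
The two formulas in the method above, the kernel weight and the per-coalition outcome blend, are easy to check by hand for a small feature count. The sketch below evaluates both for M = 4; the function names and the example outcomes (0.9 for the decision, 0.3 for the background average) are illustrative assumptions, and the weighted least-squares step is omitted.

```python
from math import comb


def kernel_shap_weight(m, size):
    """(M - 1) / (C(M, |S|) * |S| * (M - |S|)), defined for 0 < |S| < M."""
    if not 0 < size < m:
        # The formula divides by zero for the empty and full coalitions.
        raise ValueError("weight is undefined for the empty and full coalitions")
    return (m - 1) / (comb(m, size) * size * (m - size))


def blended_outcome(m, size, y_decision, y_background_avg):
    """(|S| / M) * y_decision + (1 - |S| / M) * y_background_avg."""
    fraction = size / m
    return fraction * y_decision + (1 - fraction) * y_background_avg


M = 4
table = [
    (size, kernel_shap_weight(M, size), blended_outcome(M, size, 0.9, 0.3))
    for size in range(1, M)
]
```

For M = 4 the weights are 0.25, 0.125, and 0.25 for coalition sizes 1, 2, and 3, so the smallest and largest coalitions carry the most weight per coalition. How the empty and full coalitions are handled is not specified in the text, so the sketch simply rejects them.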
Consistency check (completeness)
Both explainers report a completeness score in [0, 1]: how closely the sum of attributions matches actual_prediction − baseline_prediction. Borrowed from the Shapley completeness property; applies equally well to LIME slopes and the coalition-attribution proxy.
Why it matters. Both methods are Monte-Carlo at scale and a single low-sample run can produce noisy attributions. The completeness score is a cheap red flag a reviewer can act on before the explanation enters an audit trail.
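
The text does not give the exact completeness formula, so the following is one plausible sketch rather than the engine's definition: it assumes the score is one minus the relative gap between the attribution sum and actual_prediction − baseline_prediction, clamped to [0, 1]. The function name and the `eps` guard are hypothetical.

```python
def completeness_score(attributions, actual_prediction, baseline_prediction, eps=1e-12):
    """Assumed form: 1 - |sum(attributions) - delta| / |delta|, clamped to [0, 1]."""
    delta = actual_prediction - baseline_prediction
    gap = abs(sum(attributions) - delta)
    if abs(delta) < eps:
        # Nothing to explain: complete only if the attributions also sum to ~0.
        return 1.0 if gap < eps else 0.0
    return max(0.0, 1.0 - gap / abs(delta))


full = completeness_score([0.3, 0.2, 0.1], 0.9, 0.3)     # attributions cover the whole delta
partial = completeness_score([0.3, 0.0], 0.9, 0.3)        # attributions cover half the delta
overshoot = completeness_score([2.0], 0.9, 0.3)           # far off, clamped to zero
```

Under this assumed definition a reviewer could treat a low score as a cue to rerun the explanation with more neighbors or coalitions before it is recorded.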
Adversarial robustness
Two test families. Each probes whether a model’s prediction is stable under perturbations, but they target different failure modes.
Boundary robustness
Method. For each test point, find the smallest perturbation (in L∞ or L2 norm) that flips the model’s prediction. Report the median and 10th-percentile perturbation magnitudes across the test set.
Origin. Szegedy et al., Intriguing properties of neural networks, ICLR 2014 (the original adversarial-examples paper). The method used is a black-box variant of Carlini and Wagner, Towards Evaluating the Robustness of Neural Networks, IEEE S&P 2017.
Ruling. Regulatory-use-case-dependent. The platform reports the median perturbation; audit teams set the pass threshold based on the plausible perturbation range for the use case (pixel noise tolerance for vision, rounding tolerance for tabular).
Data-quality robustness
Method. Inject realistic corruptions (missing values, out-of-range numeric, mis-typed categorical) at controlled rates (1%, 5%, 10%). Report the delta in prediction distribution per corruption type.
Purpose. Distinct from adversarial robustness. This measures graceful degradation under ordinary production-data errors, which is what operational teams actually face. EU AI Act Article 15 (§1, §3) explicitly requires this kind of robustness evidence for high-risk systems.
Data governance (supporting)
Not statistical methods, but supporting artifacts produced alongside every report because regulators (EU AI Act Article 10; SR 26-2 §5) require data-lineage evidence to interpret any finding.
- Data lineage. For each dataset used in bias / drift / explainability evaluations, the report includes provenance (source, capture timestamp, transformations applied), retention window, and consent basis where applicable.
- Training / test separation. All bias and explainability evaluations run on held-out test sets; the platform rejects runs that would evaluate a model on its own training data unless the caller explicitly opts in.
- Protected-class handling. Protected-class columns are never persisted in the same row as prediction outputs. They live in a separate table joined at evaluation time, so the production model pipeline never sees them.
Threshold rationale
Every threshold above is conservative by default and traces to a legal or industry source. The rationale exists in one place so an audit team can pick a more appropriate value for their use case without guessing.
| Threshold | Default | Source |
|---|---|---|
| Four-Fifths (disparate impact) | 0.80 | EEOC legal floor (29 C.F.R. §1607.4(D), 1978). The academic consensus is that 0.80 is permissive. |
| Statistical parity gap | 0.10 | IBM AI Fairness 360 default (Bellamy et al., 2018). Tighter values (0.05) are available via configuration. |
| PSI investigate | 0.25 | Siddiqi (2006) credit-scoring industry default; validated against real regulatory practice for two decades. |
| KS / chi-squared significance | p < 0.05 plus effect size | Dual gate is intentional: with large samples, even a negligible shift in a feature becomes statistically significant, so the gate requires a meaningful effect size as well as significance. |
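
As an illustration of how the PSI row above is applied, the sketch below computes PSI as defined in the drift section (quantile bins on the baseline, zero-count bins smoothed) and maps the result to the interpretation bands. It is a minimal sketch, not the `@complyhat/drift-engine` code: the function names are hypothetical, the nearest-rank quantile rule is an assumption, and N in the 1/N floor is assumed to be the row count of the snapshot being binned.

```python
import math


def quantile_edges(baseline, bins=10):
    """Interior cut points at baseline quantiles (nearest-rank rule, an assumption)."""
    ordered = sorted(baseline)
    n = len(ordered)
    return [ordered[min(n - 1, (i * n) // bins)] for i in range(1, bins)]


def bin_shares(values, edges):
    """Share of values per bin; zero-count bins are smoothed to 1/N."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[sum(1 for e in edges if v >= e)] += 1  # index = edges at or below v
    n = len(values)
    return [c / n if c else 1 / n for c in counts]


def psi(baseline, production, bins=10):
    """PSI = sum over bins of (p_prod - p_base) * ln(p_prod / p_base)."""
    edges = quantile_edges(baseline, bins)
    p_base = bin_shares(baseline, edges)
    p_prod = bin_shares(production, edges)
    return sum((q - p) * math.log(q / p) for p, q in zip(p_base, p_prod))


def interpret(value):
    """Map a PSI value to the interpretation bands."""
    if value < 0.10:
        return "No material change"
    if value < 0.25:
        return "Moderate drift, monitor"
    return "Significant drift, investigate"


baseline = list(range(100))
shifted = [v + 30 for v in baseline]  # every production value moved up by 30

unchanged_psi = psi(baseline, baseline)
shifted_psi = psi(baseline, shifted)
verdict = interpret(shifted_psi)
```

In this example the shift empties the three lowest bins and piles 40% of production into the top bin, giving a PSI of roughly 1.04, well past the 0.25 investigate threshold. Heavily tied baselines can produce duplicate quantile edges, which a production implementation would need to handle.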
Reproducibility
Every report persists, by design:
- Metric value, threshold, pass/fail ruling.
- Dataset row count, subgroup sizes, data-quality warnings.
- Method version (engine semver), so a report produced in 2026-04 can be replayed against the same engine version in 2028.
- Random seeds for any Monte-Carlo step.
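
To make the replay idea concrete, the sketch below stores the persisted fields in a small record and shows that a Monte-Carlo draw made with a dedicated, seeded generator repeats exactly. Everything here is hypothetical: the `ReportRecord` name, its field names, the placeholder version string, and the sampling helper are illustrative and do not describe ComplyHat's report schema.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportRecord:
    """Hypothetical container for the fields a report persists."""
    metric_value: float
    threshold: float
    passed: bool
    row_count: int
    subgroup_sizes: tuple   # e.g. (("A", 100), ("B", 100))
    warnings: tuple
    engine_version: str     # engine semver; placeholder value below
    seed: int


def sample_rows(row_count, k, seed):
    """Draw k row indices from a generator seeded only by the stored seed."""
    rng = random.Random(seed)  # isolated from the global random state
    return rng.sample(range(row_count), k)


record = ReportRecord(
    metric_value=0.70,
    threshold=0.80,
    passed=False,
    row_count=200,
    subgroup_sizes=(("A", 100), ("B", 100)),
    warnings=(),
    engine_version="1.4.2",
    seed=1234,
)

first = sample_rows(record.row_count, 5, record.seed)
replay = sample_rows(record.row_count, 5, record.seed)
```

The two draws are identical because they share a seed and run the same code. A seed only guarantees an identical replay when the sampling code is also unchanged, which is why pinning the engine version alongside the seed matters.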
References
Legal & regulatory
- 29 C.F.R. §1607.4(D), Uniform Guidelines on Employee Selection Procedures (1978).
- NIST AI RMF 1.0, AI Risk Management Framework, NIST AI 100-1, 2023.
- NYC Local Law 144 (2023), Automated Employment Decision Tools.
- Joint Fed/FDIC/OCC SR 26-2, Supervisory Guidance on Model Risk Management, 2026.
- EU AI Act, Regulation (EU) 2024/1689, Articles 10, 15, 17.
Bias / fairness
- Dwork, C., Hardt, M., Pitassi, T., Reingold, O., Zemel, R. Fairness Through Awareness. ITCS 2012.
- Bellamy, R. K. E., et al. AI Fairness 360: An Extensible Toolkit for Detecting, Understanding, and Mitigating Unwanted Algorithmic Bias. IBM Research, 2018.
- Hardt, M., Price, E., Srebro, N. Equality of Opportunity in Supervised Learning. NeurIPS 2016.
- Chouldechova, A. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. FATML 2016.
- Kleinberg, J., Mullainathan, S., Raghavan, M. Inherent Trade-Offs in the Fair Determination of Risk Scores. ITCS 2017.
Drift / distribution testing
- Yurdakul, B., Naranjo, J. Statistical Properties of the Population Stability Index. 2019.
- Siddiqi, N. Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring. Wiley, 2006.
- Massey, F. J. The Kolmogorov-Smirnov Test for Goodness of Fit. JASA 1951.
- Pearson, K. On the criterion that a given system of deviations from the probable. Philosophical Magazine, 1900.
Explainability
- Ribeiro, M. T., Singh, S., Guestrin, C. “Why Should I Trust You?” Explaining the Predictions of Any Classifier. KDD 2016.
- Lundberg, S. M., Lee, S.-I. A Unified Approach to Interpreting Model Predictions. NeurIPS 2017.
Adversarial robustness
- Szegedy, C., et al. Intriguing properties of neural networks. ICLR 2014.
- Carlini, N., Wagner, D. Towards Evaluating the Robustness of Neural Networks. IEEE S&P 2017.