ComplyHat: Docs

ComplyHat returns deterministic, reproducible numbers, the kind a regulator can re-derive from your inputs. It does not synthesize prose, run an internal LLM, or call your model’s prediction function. Your MCP-capable host (Claude Code, Claude Desktop, Codex Desktop, Codex CLI, OpenClaw, NemoClaw, or any client speaking streamable-HTTP MCP) brings the reasoning. This page is the canonical reference for what the platform measures, how it measures it, and why the defaults are what they are. Every report carries the metric values, thresholds, dataset row counts, subgroup sizes, data-quality warnings, engine semver, and random seeds, so any finding is replayable from the same inputs.

ComplyHat has zero internal model calls. The bias, drift, and explainability engines are pure statistical libraries (@complyhat/bias-engine, @complyhat/drift-engine, @complyhat/explainability-engine). Host agents bring the reasoning; ComplyHat returns structured citations and audit-tagged prose ([EXTRACTED] / [INFERRED] / [AMBIGUOUS]).

At a glance

Bias

Four fairness metrics: Four-Fifths (EEOC), statistical parity, equal opportunity, predictive parity. Each with a default threshold tracing to a legal or academic source.

Drift

Three distribution tests: PSI, KS, chi-squared. Picked per feature type and the kind of shift you need to catch.

Explainability

Two model-agnostic local explainers: LIME and a coalition-attribution proxy. Both ship completeness scores so a noisy single-sample run is caught before it enters an audit trail.

Adversarial robustness

Boundary perturbation (smallest flip-inducing input change) and data-quality robustness (graceful degradation under realistic production-data errors). The latter is what EU AI Act Article 15 asks for.

Skip ahead to Threshold rationale for the defaults at a glance, or jump straight to References for the cited primary sources.

Bias evaluation

Four fairness metrics. Each operates on a tabular dataset with an outcome column, a predicted-outcome column (where applicable), and one or more protected-class columns. All four return a pass / fail ruling against a configurable threshold; defaults trace to a legal or academic source.

Disparate impact
Statistical parity
Equal opportunity
Predictive parity

Disparate impact

Metric. For each subgroup g inside a protected class, compute the favorable outcome rate favorable_rate(g) = favorable_count(g) / total(g). The adverse impact ratio is favorable_rate(g) / favorable_rate(reference), where the reference is the subgroup with the highest favorable rate.
Ruling. Fail if any subgroup’s adverse impact ratio is below 0.80 (the “Four-Fifths” threshold).
Origin. Uniform Guidelines on Employee Selection Procedures, 29 C.F.R. §1607.4(D), 1978. Adopted into model-risk practice by NIST AI RMF (AI RMF 1.0, MAP-5) and reused by NYC Local Law 144 audit rules (§20-870, 2023).
What it catches. Systematic underselection of a subgroup at the decision boundary. Commonly the first fairness screen because it requires no ground-truth labels.
What it misses. Equal selection rates can still mask per-error-type harms; a model can clear Four-Fifths while rejecting qualified members of one group at higher rates. Equal opportunity and predictive parity exist to catch that.
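The ratio and ruling described above can be sketched in a few lines. This is an illustrative sketch on toy data, not the @complyhat/bias-engine API; the function names and the input layout (one subgroup label and one boolean per row) are assumptions.

```python
from collections import Counter

def adverse_impact_ratios(groups, favorable):
    """Return {subgroup: favorable_rate(g) / favorable_rate(reference)}.

    groups: subgroup label per row; favorable: bool per row.
    The reference is the subgroup with the highest favorable rate.
    """
    totals = Counter(groups)
    hits = Counter(g for g, f in zip(groups, favorable) if f)
    rates = {g: hits[g] / totals[g] for g in totals}
    reference_rate = max(rates.values())
    if reference_rate == 0:
        raise ValueError("no subgroup has a favorable outcome; ratio undefined")
    return {g: rate / reference_rate for g, rate in rates.items()}

def four_fifths_ruling(ratios, threshold=0.80):
    """Return 'fail' if any subgroup's ratio is below the threshold."""
    return "fail" if any(r < threshold for r in ratios.values()) else "pass"

# Toy data: group A is selected 50 times out of 100, group B 30 out of 100.
groups = ["A"] * 100 + ["B"] * 100
favorable = [True] * 50 + [False] * 50 + [True] * 30 + [False] * 70

ratios = adverse_impact_ratios(groups, favorable)
ruling = four_fifths_ruling(ratios)
print(ratios, ruling)
```

Group B's rate (0.30) is 60% of group A's (0.50), which falls below the 0.80 threshold, so the toy dataset fails.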

Statistical parity

Metric. statistical_parity_difference = max_group_favorable_rate − min_group_favorable_rate.
Ruling. Fail if the absolute difference exceeds 0.10 by default. Configurable; some regulators use 0.05.
Origin. Dwork et al., Fairness Through Awareness, ITCS 2012 (“demographic parity” / “group fairness”). Formalized in the IBM AI Fairness 360 toolkit metric catalog (Bellamy et al., 2018).
What it catches. Same signal as Four-Fifths but expressed additively. Useful when the reference group’s rate is small and the ratio gets numerically unstable.
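A small sketch of the additive gap, with toy rates chosen to show why it complements the ratio when favorable rates are small. The function name and the example rates are assumptions for illustration only.

```python
def statistical_parity_difference(rates):
    """max_group_favorable_rate - min_group_favorable_rate."""
    return max(rates.values()) - min(rates.values())

# Toy favorable rates: both groups are rarely selected.
rates = {"A": 0.02, "B": 0.01}

spd = statistical_parity_difference(rates)
spd_ruling = "fail" if abs(spd) > 0.10 else "pass"

# The ratio view of the same data: one percentage point of difference
# halves the ratio because the reference rate is so small.
ratio = min(rates.values()) / max(rates.values())
ratio_ruling = "fail" if ratio < 0.80 else "pass"

print(spd, spd_ruling, ratio, ratio_ruling)
```

With these rates the additive gap is 0.01 and passes, while the ratio is 0.5 and fails Four-Fifths; reporting both makes that sensitivity visible.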

Equal opportunity

Metric. True positive rate per subgroup: TPR(g) = true_positives(g) / (true_positives(g) + false_negatives(g)). The equal-opportunity ratio is min_TPR / max_TPR.
Ruling. Fail if the ratio is below 0.80.
Origin. Hardt, Price, Srebro, Equality of Opportunity in Supervised Learning, NeurIPS 2016.
What it catches. Whether qualified applicants in each subgroup are accepted at equal rates. Directly addresses “disparate treatment of the qualified”, the harm regulators actually care about when a model is consequential (lending, hiring, healthcare).
Data requirement. Ground-truth labels must be present. Skip if the model’s outcome is deployed but not yet validated against labels.
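The per-subgroup TPR and the resulting ratio can be sketched as follows. The function names and input layout are assumptions, not the engine's interface.

```python
def tpr_by_group(groups, y_true, y_pred):
    """TPR(g) = true_positives(g) / (true_positives(g) + false_negatives(g))."""
    tp, fn = {}, {}
    for g, truth, pred in zip(groups, y_true, y_pred):
        if not truth:
            continue  # TPR only looks at rows whose ground truth is positive
        bucket = tp if pred else fn
        bucket[g] = bucket.get(g, 0) + 1
    return {
        g: tp.get(g, 0) / (tp.get(g, 0) + fn.get(g, 0))
        for g in set(tp) | set(fn)
    }

def equal_opportunity_ratio(tprs):
    """min_TPR / max_TPR across subgroups."""
    highest = max(tprs.values())
    if highest == 0:
        raise ValueError("no subgroup has a true positive; ratio undefined")
    return min(tprs.values()) / highest

# Toy data: 10 truly positive rows per group; A accepts 9, B accepts 6.
groups = ["A"] * 10 + ["B"] * 10
y_true = [True] * 20
y_pred = [True] * 9 + [False] * 1 + [True] * 6 + [False] * 4

tprs = tpr_by_group(groups, y_true, y_pred)
ratio = equal_opportunity_ratio(tprs)
ruling = "fail" if ratio < 0.80 else "pass"
print(tprs, ratio, ruling)
```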

Predictive parity

Metric. Positive predictive value per subgroup: PPV(g) = true_positives(g) / predicted_positives(g). Predictive parity holds when PPV is equal (within threshold) across subgroups.
Ruling. Fail if max_PPV − min_PPV exceeds 0.10.
Origin. Chouldechova, Fair prediction with disparate impact, FATML 2016. See also the impossibility result: Kleinberg, Mullainathan, Raghavan, Inherent Trade-Offs in the Fair Determination of Risk Scores, ITCS 2017.
What it catches. Whether the model’s positive predictions are equally reliable across groups. If PPV is higher for one group, the model is “getting more credit” when it says yes for that group.
Trade-off. When base rates differ across groups, predictive parity cannot hold together with equal false-positive and false-negative rates unless the classifier is perfect (Chouldechova 2016). A model that satisfies both predictive parity and equal opportunity must therefore have unequal false-positive rates. Report both, and let the audit context decide which matters more for the use case.
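A matching sketch for the PPV gap, again on toy data with assumed function names:

```python
def ppv_by_group(groups, y_true, y_pred):
    """PPV(g) = true_positives(g) / predicted_positives(g)."""
    tp, predicted = {}, {}
    for g, truth, pred in zip(groups, y_true, y_pred):
        if not pred:
            continue  # PPV only looks at rows the model predicted positive
        predicted[g] = predicted.get(g, 0) + 1
        if truth:
            tp[g] = tp.get(g, 0) + 1
    return {g: tp.get(g, 0) / n for g, n in predicted.items()}

def predictive_parity_gap(ppvs):
    """max_PPV - min_PPV across subgroups."""
    return max(ppvs.values()) - min(ppvs.values())

# Toy data: 10 predicted positives per group; 8 are correct for A, 6 for B.
groups = ["A"] * 10 + ["B"] * 10
y_pred = [True] * 20
y_true = [True] * 8 + [False] * 2 + [True] * 6 + [False] * 4

ppvs = ppv_by_group(groups, y_true, y_pred)
gap = predictive_parity_gap(ppvs)
ruling = "fail" if gap > 0.10 else "pass"
print(ppvs, gap, ruling)
```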

Data quality gate

Before any of the four metrics run, the engine checks the dataset:

Subgroup sample sizes. Warn if any subgroup has n < 30 (statistical parity unstable) or n < 100 (all four tests unstable).
Class imbalance. Warn if the smallest subgroup is less than 5% of the total dataset.
Missing values. Warn if more than 10% of rows are missing the protected-class column.

Gate warnings do not fail the test, but they are persisted alongside the result so a reviewer sees whether a pass is statistically meaningful.
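The three gate checks above could look roughly like this. The warning codes, the function name, and the use of None for a missing protected-class value are hypothetical; only the thresholds come from the list above.

```python
from collections import Counter

def gate_warnings(protected):
    """Return (code, detail) warnings for one protected-class column.

    protected: one subgroup label per row, None where the value is missing.
    Warnings are returned, never raised, so the metrics still run.
    """
    warnings = []
    total = len(protected)
    counts = Counter(g for g in protected if g is not None)

    for group, n in sorted(counts.items()):
        if n < 30:
            warnings.append(("small_subgroup_n_lt_30", f"{group}: n={n}"))
        if n < 100:
            warnings.append(("small_subgroup_n_lt_100", f"{group}: n={n}"))

    if counts:
        smallest_share = min(counts.values()) / total
        if smallest_share < 0.05:
            warnings.append(("class_imbalance", f"share={smallest_share:.3f}"))

    missing_share = (total - sum(counts.values())) / total
    if missing_share > 0.10:
        warnings.append(("missing_protected_class", f"share={missing_share:.3f}"))
    return warnings

# Toy column: 900 rows in A, 20 in B, 80 rows with the value missing.
protected = ["A"] * 900 + ["B"] * 20 + [None] * 80
warnings = gate_warnings(protected)
codes = [code for code, _ in warnings]
print(warnings)
```

Here subgroup B trips both size warnings and the imbalance warning, while 8% missing values stays under the 10% limit.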

Distribution drift

Drift is evaluated by comparing a baseline snapshot (typically training) against a production snapshot. Three methods, picked per feature type: PSI (numerical or categorical), KS (numerical), chi-squared (categorical).

PSI
KS test
Chi-squared

Population Stability Index. Bin baseline and production into B bins (default 10, quantile-based on the baseline). For each bin: psi_i = (p_prod_i − p_base_i) × ln(p_prod_i / p_base_i). PSI = sum(psi_i).

PSI range	Interpretation
`< 0.10`	No material change
`0.10 - 0.25`	Moderate drift, monitor
`≥ 0.25`	Significant drift, investigate

Origin. Yurdakul and Naranjo, Statistical Properties of the Population Stability Index, 2019. Formalized a practice predating it by two decades in credit scoring (Siddiqi, Credit Risk Scorecards, 2006).
Floor handling. Bins with zero counts are smoothed to 1/N to keep the log term finite. The smoothing constant is exposed so audits can reproduce.
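A minimal PSI sketch following the formula and interpretation table above. It is not the drift engine's code: the function names are assumptions, and reading N in the floor rule as the snapshot's row count is also an assumption.

```python
import bisect
import math
import random
import statistics

def psi(baseline, production, bins=10):
    """Population Stability Index with quantile bins taken from the baseline."""
    cuts = statistics.quantiles(baseline, n=bins)  # bins - 1 interior cut points

    def proportions(values):
        counts = [0] * bins
        for v in values:
            counts[bisect.bisect_right(cuts, v)] += 1
        floor = 1 / len(values)  # assumed reading of the 1/N smoothing rule
        return [c / len(values) if c else floor for c in counts]

    p_base = proportions(baseline)
    p_prod = proportions(production)
    return sum((q - p) * math.log(q / p) for p, q in zip(p_base, p_prod))

def interpret(value):
    """Map a PSI value to the bands in the table above."""
    if value < 0.10:
        return "No material change"
    if value < 0.25:
        return "Moderate drift, monitor"
    return "Significant drift, investigate"

rng = random.Random(0)
baseline = [rng.gauss(0.0, 1.0) for _ in range(5000)]
shifted = [rng.gauss(1.0, 1.0) for _ in range(5000)]  # mean moved by one sd

psi_same = psi(baseline, baseline)
psi_shift = psi(baseline, shifted)
print(psi_same, interpret(psi_same))
print(psi_shift, interpret(psi_shift))
```

Comparing a snapshot with itself gives exactly zero, and a one-standard-deviation mean shift lands well inside the investigate band.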

Kolmogorov-Smirnov. KS = max_x |F_prod(x) − F_base(x)| where F is the empirical CDF. The two-sample p-value is also returned.
Ruling. Fail if both p < 0.05 and KS > 0.10. The dual threshold is intentional: with large production samples, any real-world numeric feature produces a statistically significant but trivially small KS. We want effect size, not just significance.
Origin. Massey, The Kolmogorov-Smirnov Test for Goodness of Fit, JASA 1951. Standard test in model risk where joint Fed/FDIC/OCC SR 26-2 requires ongoing monitoring of input data.
Applicability. Numeric features only. Categorical features use chi-squared.
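The dual gate is easiest to see on a case where significance and effect size disagree. The sketch below computes the two-sample statistic directly and uses the large-sample Kolmogorov series for the p-value; that approximation and the function names are assumptions, and the engine may compute the p-value differently.

```python
import bisect
import math

def ks_statistic(base, prod):
    """max_x |F_prod(x) - F_base(x)| over the two empirical CDFs."""
    a, b = sorted(base), sorted(prod)
    d = 0.0
    for x in a + b:  # the gap between step functions peaks at an observed value
        f_a = bisect.bisect_right(a, x) / len(a)
        f_b = bisect.bisect_right(b, x) / len(b)
        d = max(d, abs(f_a - f_b))
    return d

def ks_pvalue_asymptotic(d, n, m, terms=100):
    """Large-sample two-sided p-value: 2 * sum((-1)^(k-1) * exp(-2 k^2 lam^2))."""
    lam = d * math.sqrt(n * m / (n + m))
    if lam < 0.2:
        return 1.0  # the series is numerically 1 here and converges slowly
    s = sum((-1) ** (k - 1) * math.exp(-2 * k * k * lam * lam)
            for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2 * s))

def ks_ruling(base, prod, alpha=0.05, min_effect=0.10):
    d = ks_statistic(base, prod)
    p = ks_pvalue_asymptotic(d, len(base), len(prod))
    return d, p, "fail" if (p < alpha and d > min_effect) else "pass"

n = 20000
base = [i / n for i in range(n)]
small_shift = [x + 0.02 for x in base]  # significant at this n, but tiny
large_shift = [x + 0.20 for x in base]

d_small, p_small, ruling_small = ks_ruling(base, small_shift)
d_large, p_large, ruling_large = ks_ruling(base, large_shift)
print(d_small, p_small, ruling_small)
print(d_large, p_large, ruling_large)
```

The 0.02 shift is statistically significant at this sample size but stays under the 0.10 effect-size gate, so it passes; the 0.20 shift fails both conditions.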

Categorical drift. Two-way contingency table of category × (baseline vs production). χ² = sum((observed − expected)² / expected) with k − 1 degrees of freedom.
Ruling. Fail if p < 0.05. Effect size is reported separately via Cramér’s V (V = sqrt(χ² / (n × (k − 1)))); V > 0.1 small, > 0.3 moderate, > 0.5 strong.
Origin. Pearson, On the criterion that a given system of deviations from the probable, Philosophical Magazine 1900.
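A stdlib-only sketch of the categorical test. The p-value uses the closed-form chi-squared tail for integer degrees of freedom, and the effect size is computed exactly as the formula above writes it. Function names and the dict-of-counts input are assumptions.

```python
import math

def chi2_sf(x, df):
    """P(chi-squared with integer df > x), via closed forms of the gamma tail."""
    if x <= 0:
        return 1.0
    z = x / 2
    if df % 2 == 0:
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= z / j
            total += term
        return math.exp(-z) * total
    total = math.erfc(math.sqrt(z))
    for j in range(1, (df - 1) // 2 + 1):
        total += math.exp(-z) * z ** (j - 0.5) / math.gamma(j + 0.5)
    return total

def chi_squared_drift(base_counts, prod_counts, alpha=0.05):
    """Chi-squared test on the 2 x k table of category x snapshot."""
    categories = sorted(set(base_counts) | set(prod_counts))
    rows = [[c.get(k, 0) for k in categories] for c in (base_counts, prod_counts)]
    row_totals = [sum(r) for r in rows]
    col_totals = [sum(col) for col in zip(*rows)]
    n = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(len(categories)):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (rows[i][j] - expected) ** 2 / expected
    k = len(categories)
    p = chi2_sf(chi2, k - 1)
    v = math.sqrt(chi2 / (n * (k - 1)))  # effect size as written above
    return {"chi2": chi2, "p": p, "v": v, "ruling": "fail" if p < alpha else "pass"}

base_counts = {"a": 500, "b": 300, "c": 200}
prod_counts = {"a": 400, "b": 300, "c": 300}
result = chi_squared_drift(base_counts, prod_counts)
print(result)
```

On this toy table the test fails on significance while the effect size stays below 0.1, which is why the two numbers are reported separately.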

Explainability

Two model-agnostic local explainers. Both produce per-feature attribution scores for a single prediction. Hosts pass in the decision plus a set of neighbor / background decisions and their precomputed outcomes; ComplyHat does not call the host’s prediction function f.

LIME (Local Interpretable Model-agnostic Explanations)

Method. Weight each neighbor decision by an exponential kernel of its Euclidean distance to the target decision in feature space, then fit a weighted least-squares linear surrogate with a leading intercept column. The intercept is returned alongside per-feature slopes; the slopes are the local attributions. Origin. Ribeiro, Singh, Guestrin, “Why Should I Trust You?” Explaining the Predictions of Any Classifier, KDD 2016. Defaults. Kernel width 0.75 (override via kernel_width); up to 50,000 neighbors retained.
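The surrogate fit described above can be sketched without any model call, since it only needs neighbor features and their precomputed outcomes. The exact kernel form exp(−d² / kernel_width²) is an assumption (the text only says "exponential kernel"), as are the function names; the small linear solver is there to keep the example dependency-free.

```python
import math

def solve(a, b):
    """Solve a . x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system; neighbors do not span the features")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def lime_attributions(target, neighbors, outcomes, kernel_width=0.75):
    """Weighted least-squares linear surrogate; returns (intercept, slopes)."""
    rows = [[1.0] + list(x) for x in neighbors]  # leading intercept column
    weights = [math.exp(-math.dist(x, target) ** 2 / kernel_width ** 2)
               for x in neighbors]
    p = len(rows[0])
    xtwx = [[sum(w * r[i] * r[j] for w, r in zip(weights, rows))
             for j in range(p)] for i in range(p)]
    xtwy = [sum(w * r[i] * y for w, r, y in zip(weights, rows, outcomes))
            for i in range(p)]
    beta = solve(xtwx, xtwy)
    return beta[0], beta[1:]

# Toy neighbors on a grid around the target, with exactly linear outcomes.
target = (0.0, 0.0)
neighbors = [(a, b) for a in (-0.5, 0.0, 0.5) for b in (-0.5, 0.0, 0.5)]
outcomes = [0.2 + 0.5 * a - 0.3 * b for a, b in neighbors]

intercept, slopes = lime_attributions(target, neighbors, outcomes)
print(intercept, slopes)
```

Because the toy outcomes are exactly linear, the fit recovers the intercept 0.2 and slopes 0.5 and −0.3 regardless of the kernel weights.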

Coalition-attribution proxy

Method. Enumerate or sample feature coalitions, weight each by the Kernel-SHAP kernel (M − 1) / (C(M, |S|) × |S| × (M − |S|)), and solve weighted least squares against per-coalition outcome blends. Because ComplyHat cannot call f, the per-coalition outcome is approximated as (|S| / M) · y_decision + (1 − |S| / M) · y_background_avg. The kernel weights still produce a defensible per-feature ranking, but the resulting values are coalition-fraction-weighted attributions, not Shapley values.

Lundberg and Lee’s Kernel SHAP (NeurIPS 2017) substitutes absent features by sampling from the background and re-evaluating f. Without f, the substitution collapses to a coalition-size blend and the local-accuracy / consistency guarantees no longer hold. ComplyHat reports labelled coalition_attribution are explicit about this; they should not be presented to a regulator as Shapley values.

Defaults. Up to 50,000 coalitions sampled; 10,000 background decisions retained.
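The two formulas that define the proxy, the kernel weight and the coalition-size blend, can be written out directly. This sketch covers only those two pieces, not the weighted least-squares solve, and the function names are assumptions.

```python
import math

def kernel_weight(m, size):
    """Kernel-SHAP weight (M - 1) / (C(M, |S|) * |S| * (M - |S|))."""
    if not 0 < size < m:
        raise ValueError("weight is unbounded for the empty and full coalitions")
    return (m - 1) / (math.comb(m, size) * size * (m - size))

def blended_outcome(m, size, y_decision, y_background_avg):
    """(|S| / M) * y_decision + (1 - |S| / M) * y_background_avg."""
    fraction = size / m
    return fraction * y_decision + (1 - fraction) * y_background_avg

# Toy case: 4 features, decision outcome 1.0, background average 0.2.
m = 4
weights = {size: kernel_weight(m, size) for size in range(1, m)}
blends = {size: blended_outcome(m, size, 1.0, 0.2) for size in range(0, m + 1)}
print(weights)
print(blends)
```

The blend depends only on the coalition's size, not on which features are in it, which is the sense in which the result is not a Shapley value.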

Consistency check (completeness)

Both explainers report a completeness score in [0, 1]: how closely the sum of attributions matches actual_prediction − baseline_prediction. Borrowed from the Shapley completeness property; applies equally well to LIME slopes and the coalition-attribution proxy. Why it matters. Both methods are Monte-Carlo at scale and a single low-sample run can produce noisy attributions. The completeness score is a cheap red flag a reviewer can act on before the explanation enters an audit trail.
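The text above does not give the formula behind the score, so the following is only one plausible way to turn "how closely the sum matches" into a number in [0, 1]: one minus the relative residual, clipped at zero. Both the normalization and the function name are assumptions.

```python
def completeness_score(attributions, actual_prediction, baseline_prediction):
    """Assumed scoring: 1 - |sum(attributions) - gap| / |gap|, clipped to [0, 1]."""
    gap = actual_prediction - baseline_prediction
    residual = abs(sum(attributions) - gap)
    if gap == 0:
        return 1.0 if residual == 0 else 0.0
    return max(0.0, 1.0 - residual / abs(gap))

# Attributions that account for the whole 0.4 gap between prediction and baseline.
good = completeness_score([0.30, 0.15, -0.05], 0.9, 0.5)

# A noisy run whose attributions explain only a quarter of the gap.
noisy = completeness_score([0.10, 0.02, -0.02], 0.9, 0.5)

print(good, noisy)
```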

Adversarial robustness

Two test families. Each probes whether a model’s prediction is stable under perturbations, but they target different failure modes.

Boundary robustness

Method. For each test point, find the smallest perturbation (in L∞ or L2 norm) that flips the model’s prediction. Report the median and 10th-percentile perturbation magnitudes across the test set. Origin. Szegedy et al., Intriguing properties of neural networks, ICLR 2014 (the original adversarial-examples paper). The method used is a black-box variant of Carlini and Wagner, Towards Evaluating the Robustness of Neural Networks, IEEE S&P 2017. Ruling. Regulatory-use-case-dependent. The platform reports the median perturbation; audit teams set the pass threshold based on the plausible perturbation range for the use case (pixel noise tolerance for vision, rounding tolerance for tabular).
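To make "smallest flip-inducing perturbation" concrete, here is a deliberately simple black-box search: bisection on the step size along every sign direction of the L∞ ball, against a toy classifier. It is not the Carlini-Wagner variant the platform uses and it only gives an upper bound on the true minimum; the classifier, names, and budget are assumptions.

```python
import itertools
import statistics

def min_flip_linf(predict, x, eps_max=1.0, tol=1e-4):
    """Smallest L-infinity step along any sign direction that flips predict(x).

    Returns None if no direction flips the prediction within eps_max.
    """
    original = predict(x)
    best = None
    for signs in itertools.product((-1.0, 1.0), repeat=len(x)):
        def moved(eps, signs=signs):
            return [xi + eps * s for xi, s in zip(x, signs)]
        if predict(moved(eps_max)) == original:
            continue  # no flip inside the budget along this direction
        lo, hi = 0.0, eps_max
        while hi - lo > tol:
            mid = (lo + hi) / 2
            if predict(moved(mid)) == original:
                lo = mid
            else:
                hi = mid
        best = hi if best is None else min(best, hi)
    return best

def predict(x):
    """Toy black-box classifier: positive when the two features sum above 1."""
    return x[0] + x[1] > 1.0

test_points = [[0.2, 0.2], [0.45, 0.45], [0.9, 0.9], [0.5, 0.3], [0.0, 0.5]]
flips = [min_flip_linf(predict, x) for x in test_points]
median_flip = statistics.median(flips)
p10_flip = statistics.quantiles(flips, n=10)[0]
print(flips, median_flip, p10_flip)
```

The median describes typical stability, while the 10th percentile shows how fragile the most exposed points are.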

Data-quality robustness

Method. Inject realistic corruptions (missing values, out-of-range numeric, mis-typed categorical) at controlled rates (1%, 5%, 10%). Report the delta in prediction distribution per corruption type. Purpose. Distinct from adversarial robustness. This measures graceful degradation under ordinary production-data errors, which is what operational teams actually face. EU AI Act Article 15 (§1, §4) explicitly requires this kind of robustness evidence for high-risk systems.
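A sketch of one corruption type, missing values, injected at the three rates with a fixed seed. The stand-in model, its rule of treating a missing value as 0.0, and the use of the positive-prediction rate as the "prediction distribution" are all assumptions for illustration.

```python
import random

def inject_missing(rows, rate, seed):
    """Replace each cell with None with probability `rate`, reproducibly."""
    rng = random.Random(seed)
    return [[None if rng.random() < rate else v for v in row] for row in rows]

def predict(row):
    """Hypothetical stand-in model; a missing value is treated as 0.0."""
    return sum(0.0 if v is None else v for v in row) > 1.5

def positive_rate(rows):
    return sum(predict(r) for r in rows) / len(rows)

data_rng = random.Random(0)
rows = [[data_rng.random() for _ in range(3)] for _ in range(2000)]

baseline_rate = positive_rate(rows)
deltas = {
    rate: positive_rate(inject_missing(rows, rate, seed=7)) - baseline_rate
    for rate in (0.01, 0.05, 0.10)
}
print(baseline_rate, deltas)
```

Because the seed is fixed, the same corrupted dataset and the same deltas come back on every run.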

Data governance (supporting)

Not statistical methods, but supporting artifacts produced alongside every report because regulators (EU AI Act Article 10; SR 26-2 §5) require data-lineage evidence to interpret any finding.

Data lineage. For each dataset used in bias / drift / explainability evaluations, the report includes provenance (source, capture timestamp, transformations applied), retention window, and consent basis where applicable.
Training / test separation. All bias and explainability evaluations run on held-out test sets; the platform rejects runs that would evaluate a model on its own training data unless the caller explicitly opts in.
Protected-class handling. Protected-class columns are never persisted in the same row as prediction outputs. They live in a separate table joined at evaluation time, so the production model pipeline never sees them.

Threshold rationale

Every threshold above is conservative by default and traces to a legal or industry source. The rationale exists in one place so an audit team can pick a more appropriate value for their use case without guessing.

Threshold	Default	Source
Four-Fifths (disparate impact)	`0.80`	EEOC legal floor (29 C.F.R. §1607.4(D), 1978). The academic consensus is that 0.80 is permissive.
Statistical parity gap	`0.10`	IBM AI Fairness 360 default (Bellamy et al., 2018). Tighter values (0.05) are available via configuration.
PSI investigate	`0.25`	Siddiqi (2006) credit-scoring industry default; validated against real regulatory practice for two decades.
KS / chi-squared significance	`p < 0.05` plus effect size	Dual gate is intentional: large samples make any real feature trivially significant. We want effect size, not just significance.

Every report includes the threshold used. Auditors who disagree can re-run with their own.

Reproducibility

Every report persists, by design:

Metric value, threshold, pass/fail ruling.
Dataset row count, subgroup sizes, data-quality warnings.
Method version (engine semver), so a report produced in 2026-04 can be replayed against the same engine version in 2028.
Random seeds for any Monte-Carlo step.

The engines are deterministic under fixed seeds. Given the same inputs and engine version, a third party can re-derive every number in any report.
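The replay property can be illustrated with a toy Monte-Carlo step. The report field names, the version string, and the bootstrap metric below are hypothetical; the point is only that a persisted seed and engine version let a third party reproduce the number exactly.

```python
import random

ENGINE_VERSION = "0.0.0-example"  # hypothetical semver, not a real release

def bootstrap_mean(values, seed):
    """A Monte-Carlo step: mean of one bootstrap resample, driven by the seed."""
    rng = random.Random(seed)
    sample = [rng.choice(values) for _ in values]
    return sum(sample) / len(sample)

def build_report(values, seed, threshold):
    """Persist everything needed to re-derive the number later."""
    metric_value = bootstrap_mean(values, seed)
    return {
        "metric_value": metric_value,
        "threshold": threshold,
        "ruling": "pass" if metric_value <= threshold else "fail",
        "row_count": len(values),
        "engine_version": ENGINE_VERSION,
        "seed": seed,
    }

values = [0.1, 0.4, 0.35, 0.8, 0.05, 0.6]
report = build_report(values, seed=1234, threshold=0.5)

# A third party replays the run from the persisted seed and the same inputs.
replayed = build_report(values, seed=report["seed"], threshold=report["threshold"])
print(report == replayed)
```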

References

Legal & regulatory

29 C.F.R. §1607.4(D), Uniform Guidelines on Employee Selection Procedures (1978).
NIST AI RMF 1.0, AI Risk Management Framework, NIST AI 100-1, 2023.
NYC Local Law 144 (2023), Automated Employment Decision Tools.
Joint Fed/FDIC/OCC SR 26-2, Supervisory Guidance on Model Risk Management, 2026.
EU AI Act, Regulation (EU) 2024/1689, Articles 10, 15, 17.

Bias / fairness

Dwork, C., Hardt, M., Pitassi, T., Reingold, O., Zemel, R. Fairness Through Awareness. ITCS 2012.
Bellamy, R. K. E., et al. AI Fairness 360: An Extensible Toolkit for Detecting, Understanding, and Mitigating Unwanted Algorithmic Bias. IBM Research, 2018.
Hardt, M., Price, E., Srebro, N. Equality of Opportunity in Supervised Learning. NeurIPS 2016.
Chouldechova, A. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. FATML 2016.
Kleinberg, J., Mullainathan, S., Raghavan, M. Inherent Trade-Offs in the Fair Determination of Risk Scores. ITCS 2017.

Drift / distribution testing

Yurdakul, B., Naranjo, J. Statistical Properties of the Population Stability Index. 2019.
Siddiqi, N. Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring. Wiley, 2006.
Massey, F. J. The Kolmogorov-Smirnov Test for Goodness of Fit. JASA 1951.
Pearson, K. On the criterion that a given system of deviations from the probable. Philosophical Magazine, 1900.

Explainability

Ribeiro, M. T., Singh, S., Guestrin, C. “Why Should I Trust You?” Explaining the Predictions of Any Classifier. KDD 2016.
Lundberg, S. M., Lee, S.-I. A Unified Approach to Interpreting Model Predictions. NeurIPS 2017.

Adversarial robustness

Szegedy, C., et al. Intriguing properties of neural networks. ICLR 2014.
Carlini, N., Wagner, D. Towards Evaluating the Robustness of Neural Networks. IEEE S&P 2017.

For which metrics each framework requires, see supported frameworks. For the MCP entry points, see the tool reference.
