Analyzing data¶

The analyze module fits regression models for inference. Two unified entrypoints — association_study for EWAS / PheWAS / GWAS and interaction_study for pairwise GxE / GxG via likelihood-ratio test — accept lists of outcomes and regressors, auto-detect the regression family (linear / logistic), support five genotype encodings (additive / dominant / recessive / codominant / edge), and parallelise the scan via joblib. Survey- aware estimation (sample weights with cluster-robust standard errors) is one flag away. Results land in an immutable RegressionResults container with chainable .with_correction → .passing → .top → .annotate → .summary methods, ready to be filtered, annotated against the IGEM knowledge graph, and persisted.

Tip

Every function in this module returns a :class:RegressionResults — an immutable, chainable container with built-in .with_correction → .passing → .top → .annotate → .summary fluency. Inputs are never mutated; each transformation returns a fresh result.

Note

No CLI commands for this module. Inference produces in-memory results for downstream filtering, plotting, or report generation within the same Python session. Use Python or a notebook.

By the end of this page you will know how to:

Choose between association_study (unified EWAS / PheWAS / GWAS), interaction_study (pairwise GxE / GxG via LRT), and the legacy specialized entrypoints (ewas, gwas, lrt).
Run single-outcome and multi-outcome PheWAS with per-outcome multiple-testing correction.
Test genotype regressors under five different encoding conventions (additive / dominant / recessive / codominant / edge).
Apply survey-aware estimation (weights + cluster-robust SE) for NHANES-style designs.
Parallelise the regression scan with n_jobs.
Read the canonical output schema — Wald / LRT p-values, ΔAIC, convergence flags — and chain transformations on it.

All examples below use the IGEM facade, consistent with the rest of the user guide:

from igem import IGEM

with IGEM() as igem:
    phen = igem.data.read_phenotypes(
        "nhanes.csv", outcomes=["GLUCOSE"],
    )

    res = igem.analyze.interaction_study(
        phen,
        outcomes="GLUCOSE",                  # what we model
        covariates=["AGE", "SEX"],           # adjustment set
        interactions=[                       # candidate pairs to test
            ("BMI", "SMOKING"),
            ("BMI", "DIET_QUALITY"),
            ("PHYSICAL_ACTIVITY", "DIET_QUALITY"),
        ],
        n_jobs=-1,                           # parallelise across cores
    ).with_correction("fdr_bh")              # Benjamini-Hochberg FDR

Each interaction pair becomes one likelihood-ratio test; the result is a RegressionResults indexed by (outcome, term1, term2) ready for further filtering or annotation.

Free functions in igem.modules.analyze remain importable when an IGEM instance is not needed.

1. When to use what¶

Function	Best for
`association_study(...)`	EWAS / PheWAS / small-to-medium GWAS with full flexibility (multi-outcome, all encodings, parallel, custom backend)
`interaction_study(...)`	Pairwise GxE / GxG screening via LRT
`ewas(...)`	Single-outcome EWAS, ergonomic shortcut over `association_study`
`gwas(...)`	Biobank-scale additive linear GWAS via sgkit’s vectorised regression — order-of-magnitude faster than `association_study(geno=...)` for the additive-linear case
`lrt(...)`	Standalone likelihood-ratio test on two arbitrary nested models (utility)

The first two are the canonical entrypoints — everything else is either a thin wrapper (ewas) or an optimised special case (gwas).

2. `association_study` — unified entrypoint¶

The minimum-viable call¶

res = igem.analyze.association_study(
    phen, outcome="GLUCOSE", regression_variables=["BMI"],
)

For each (outcome, regressor) pair the function fits two nested models — full with the regressor of interest plus covariates, and null with covariates only — and reports both Wald and LRT statistics in a long-format DataFrame.

Linear regression (continuous outcome):

\[ y_i = \beta_0 + \beta\, x_i + \boldsymbol{\gamma}^\top \boldsymbol{c}_i + \epsilon_i, \qquad \epsilon_i \sim \mathcal{N}(0, \sigma^2) \]

Logistic regression (binary outcome):

\[ \mathrm{logit}\!\big(\mathbb{P}(y_i = 1)\big) = \beta_0 + \beta\, x_i + \boldsymbol{\gamma}^\top \boldsymbol{c}_i \]

The family is auto-detected from the outcome dtype unless overridden. With $n$ samples and $k$ parameters, statsmodels fits via maximum-likelihood and reports:

Field	Definition
`beta`	$\hat{\beta}$, coefficient for the regressor
`se`	$\sqrt{\hat{V}_{\hat{\beta}}}$ from the inverse Fisher information
`ci_low`/`ci_high`	$\hat{\beta} \pm 1.96 \cdot \text{SE}$
`beta_pvalue`	Wald p-value: $p = 2,\Phi\big(-
`lrt_pvalue`	Likelihood-ratio test p-value (see §3 formulas)
`diff_aic`	$\Delta\mathrm{AIC} = \mathrm{AIC}_F - \mathrm{AIC}_N$, (Akaike, 1974)
`n`	Sample count after listwise deletion
`converged`	statsmodels convergence flag
`error`	First line of the error message if the fit failed
`variable_type`	`continuous` / `binary` / `categorical` / `genotype`

Multi-outcome PheWAS¶

Pass a list of outcomes — each outcome × regressor pair is fitted independently. Combine with groupby correction so multiple-testing adjustment is applied within each outcome:

res = igem.analyze.association_study(
    phen,
    outcomes=["GLUCOSE", "LDL", "HDL", "TG", "BP_SYS"],
    regression_variables=phen.exposures,
    n_jobs=-1,                  # parallelise across cores
).with_correction("fdr_bh", groupby="outcome")

Genotype regressors¶

Pass a Genotypes wrapper via geno= — regression_variables becomes the list of variant_ids to test. Five encodings cover the common genetic-architecture assumptions:

res_add = igem.analyze.association_study(
    phen, outcome="GLUCOSE",
    regression_variables=["rs429358", "rs7412"],
    geno=geno,
    encoding="additive",       # default
)

res_dom = igem.analyze.association_study(
    phen, outcome="GLUCOSE",
    regression_variables=["rs429358"],
    geno=geno,
    encoding="dominant",
)

Encoding	Transformation	Notes
`additive`	dosage $x \in \{0, 1, 2\}$	Linear allele-count effect (default)
`dominant`	$x' = \mathbb{1}[x \geq 1]$	Any alt allele triggers
`recessive`	$x' = \mathbb{1}[x = 2]$	Two alt alleles required
`codominant`	One-hot het / hom-alt vs hom-ref	Multi-df LRT joins both contrasts
`edge`	per-variant lookup from `edge_encoding_info`	Biologically-informed scoring

Edge encoding lets you replace the dosage with a biologically meaningful score per genotype. Pair with the IGEM knowledge graph to score variants by functional impact, conservation, or pathway membership:

edge_info = pd.DataFrame(
    {
        "score_0": [0.0,  0.0],     # hom-ref
        "score_1": [0.5,  1.0],     # het
        "score_2": [1.0,  1.5],     # hom-alt
    },
    index=["rs429358", "rs7412"],
)
res = igem.analyze.association_study(
    phen, "GLUCOSE",
    regression_variables=["rs429358", "rs7412"],
    geno=geno,
    encoding="edge",
    edge_encoding_info=edge_info,
)

Tip

gwas() vs association_study(geno=...). Both fit the same model for the additive-linear case, but gwas() uses sgkit’s vectorised SVD-based regression — single matrix decomposition over all variants — while association_study loops per variant with statsmodels. At biobank scale (10⁶+ variants) the speed difference is hours vs days. Use gwas() for additive-linear GWAS at scale; use association_study(geno=...) when you need a non-additive encoding, logistic outcome, or a custom regression backend.

Survey-aware analysis¶

For NHANES-style complex sample designs, set weights_col / cluster_col / strata_col at load time and pass use_survey=True:

phen = igem.data.read_phenotypes(
    "nhanes.csv", outcomes=["GLUCOSE"],
    weights_col="WTMEC", cluster_col="SDMVPSU", strata_col="SDMVSTRA",
)

res = igem.analyze.association_study(
    phen, "GLUCOSE", phen.exposures,
    use_survey=True,
)

The default backend (regression_kind="auto" → glm_engine) applies:

Sample weights as freq_weights in the GLM, giving the weighted point estimate

\[ \hat{\boldsymbol{\beta}}_w = (X^\top W X)^{-1} X^\top W \boldsymbol{y}, \qquad W = \mathrm{diag}(w_i) \]
Cluster-robust (“sandwich”) variance when cluster_col is set (White, 1980; Liang & Zeger, 1986):

\[ \hat{V}_{\mathrm{CR}}\!\left(\hat{\boldsymbol{\beta}}\right) = (X^\top X)^{-1} \!\left(\sum_g X_g^\top \hat{\boldsymbol{u}}_g \hat{\boldsymbol{u}}_g^\top X_g\right)\! (X^\top X)^{-1} \]

where $g$ indexes clusters and $\hat{\boldsymbol{u}}_g$ are the cluster’s residuals.

Note

Stratified Taylor-series variance (the gold-standard NHANES estimator) is not implemented in glm_engine. For production NHANES analysis pass regression_kind="r_survey" to use R’s survey package via rpy2. Install with poetry install --with r-survey.

Parallelism¶

res = igem.analyze.association_study(
    phen, outcomes=phen.outcomes, regression_variables=phen.exposures,
    n_jobs=-1,                  # all cores
)

n_jobs follows joblib semantics — 1 is sequential, -1 uses every available core, integers in between are explicit pool sizes. Tasks are distributed at the (outcome, regressor) granularity, so parallelism scales with the number of pairs.

Other parameters¶

Parameter	Effect
`min_n=200`	Skip pairs with fewer complete cases (default matches CLARITE; failed pairs land in `result.errors`)
`standardize_data=True`	Z-score continuous regressors before fitting (effect-size comparable across features)
`report_categorical_betas=True`	Emit one row per category dummy in addition to the LRT summary
`regression_kind="weighted_glm"`	Force survey-weighted GLM (auto-selected when `use_survey=True`)
`regression_kind=callable`	Custom regression backend (must return :class:`igem.modules.analyze._engines.EngineResult`)

3. `interaction_study` — pairwise interactions¶

For every pair of variables $(v_1, v_2)$ and each outcome, fit two nested models and compare via LRT:

Restricted (additive only): $\mathrm{outcome} \sim v_1 + v_2 + \boldsymbol{c}$
Full (additive + interaction): $\mathrm{outcome} \sim v_1 + v_2 + v_1{:}v_2 + \boldsymbol{c}$

The likelihood-ratio test statistic (Wilks, 1938):

\[ \Lambda = -2\,\big(\ell_R - \ell_F\big) \;\sim_{H_0}\; \chi^2_{\,df} \]

with $df$ equal to the number of additional parameters in the full model:

$df = 1$ for continuous × continuous
$df = (k_1 - 1)(k_2 - 1)$ for categorical × categorical with arities $k_1$ and $k_2$
mixed cases reduce accordingly (e.g. continuous × 3-level categorical gives $df = 2$)

res = igem.analyze.interaction_study(
    phen, "GLUCOSE",
    interactions=[("SMOKING", "RACE"), ("BMI", "AGE")],
    n_jobs=-1,
)
res.df[["term1", "term2", "n", "lrt_chi2", "lrt_df", "lrt_pvalue"]]

Three input forms for `interactions=`¶

# All unordered pairs of phen.exposures (capped by max_pairs).
interactions=None

# Anchor: every pair (anchor, other) with other ∈ phen.exposures \ {anchor}.
interactions="BMI"

# Explicit list of tuples.
interactions=[("BMI", "AGE"), ("SMOKING", "RACE")]

When interactions=None and the universe of pairs exceeds max_pairs (default 1000), the function raises rather than running silently — pre-filter via the knowledge graph or pass an explicit list.

`report_betas=True`¶

By default the result has one summary row per pair (LRT-only). With report_betas=True each interaction term gets its own row with the coefficient, SE, and Wald p-value of that contrast — useful for inspecting the direction of interaction effects in categorical pairs.

Tip

This is IGEM’s core value proposition. The combinatorial explosion of pairwise GxE / GxG tests dominates power-loss to multiple-testing correction. By filtering candidate pairs upstream through biological knowledge (gene–gene relationships, pathway co-membership) the universe of tests shrinks to a tractable size, and interaction_study runs LRT on the shortlisted pairs.

4. Multiple-testing correction¶

Apply correction post-hoc on a RegressionResults:

out = (
    igem.analyze.association_study(phen, ["GLUCOSE", "LDL"], phen.exposures)
    .with_correction("fdr_bh", groupby="outcome")
    .passing(p_corrected=0.05)
    .top(20)
)

Six methods are exposed via statsmodels’ multipletests:

`method`	Adjustment	Reference
`bonferroni`	$p^{\mathrm{adj}}_i = \min(1, m\,p_i)$	Bonferroni, 1936
`holm`	Step-down Bonferroni	Holm, 1979
`sidak`	$p^{\mathrm{adj}}_i = 1 - (1 - p_i)^m$	Šidák, 1967
`hommel`	Step-up	Hommel, 1988
`fdr_bh`	$p^{\mathrm{adj}}_{(i)} = \min_{k \geq i}\!\frac{m\,p_{(k)}}{k}$	Benjamini & Hochberg, 1995
`fdr_by`	BH adjusted for arbitrary dependence	Benjamini & Yekutieli, 2001

bonferroni controls FWER strictly but is conservative; fdr_bh controls the false discovery rate, generally recommended for EWAS / PheWAS where many tests are expected to be correlated and a moderate false-positive rate is acceptable.

Per-outcome correction (`groupby=`)¶

Critical for multi-outcome PheWAS. With groupby="outcome" (or any column of the result), the correction is applied within each group rather than across the entire result, preventing the false inflation of the test count when multiple outcomes are scanned in one call:

# Wrong (inflates m by n_outcomes):
res.with_correction("fdr_bh")

# Right:
res.with_correction("fdr_bh", groupby="outcome")

5. Family auto-detection¶

The family is inferred from the outcome column unless overridden:

from igem.modules.analyze import infer_family
infer_family(phen.df["GLUCOSE"])      # → "linear"  (continuous numeric)
infer_family(phen.df["DIABETES"])     # → "logistic" (binary 0/1 or bool)

Rule:

Outcome dtype / values	Family
`bool`	`logistic`
numeric with non-NaN values $\subseteq \{0, 1\}$	`logistic`
numeric with $\geq 3$ distinct values	`linear`
object / category	`ValueError` (encode first with `igem.modify.recode`)

Override explicitly when needed:

# Force linear regression on a binary outcome (e.g. for linear-probability
# model in robustness checks).
res = igem.analyze.association_study(
    phen, "DIABETES", phen.exposures, family="linear",
)

6. Choosing a regression engine¶

regression_kind selects the inference backend:

Value	Engine
`"auto"` (default)	`glm_engine` — statsmodels GLM
`"glm"`	Same as auto without survey weights
`"weighted_glm"`	`glm_engine` with `freq_weights` populated
`"r_survey"`	`r_survey_engine` — R’s `survey` via rpy2 (Taylor-series stratified variance, gold standard NHANES). Requires `poetry install --with r-survey`
callable	Custom backend returning :class:`EngineResult`

A custom callable is the escape hatch for research backends:

def my_engine(y, X, family, *, weights=None, cluster=None):
    fit = my_special_fitter(y, X, family)
    from igem.modules.analyze._engines import EngineResult
    return EngineResult(
        params=fit.params, bse=fit.bse, pvalues=fit.pvalues,
        conf_int=fit.conf_int, log_likelihood=fit.llf,
        aic=fit.aic, n=fit.nobs, converged=fit.converged,
    )

res = igem.analyze.association_study(
    phen, "GLUCOSE", phen.exposures, regression_kind=my_engine,
)

7. Result chaining¶

RegressionResults is immutable — every transformation returns a new instance. The result composes naturally into a fluent pipeline:

publishable = (
    igem.analyze.association_study(phen, "GLUCOSE", phen.exposures, n_jobs=-1)
    .with_correction("fdr_bh")
    .passing(p_corrected=0.05)
    .annotate(igem)               # gene-symbol / HGNC merge from the IGEM KG
    .top(20, by="lrt_pvalue")
)
publishable.to_csv("ewas_top20_annotated.csv")

Method	Effect
`.with_correction(method, groupby=None)`	Adds `p_corrected` column
`.passing(p=…, p_corrected=…)`	Filter rows by raw or corrected p-value
`.top(n, by=None)`	Sort + head; `by=None` auto-picks `beta_pvalue`
`.annotate(client)`	Left-join gene-annotation columns from the KG
`.summary()`	One-line dict for log lines / report headers
`.to_csv(path)`	Persist

8. Specialised entrypoints¶

`ewas` — single-outcome EWAS shortcut¶

A thin wrapper of association_study that pins to a single outcome, defaults regression_variables to phen.exposures, and emits the legacy result schema (variable, n, beta, se, ci_low, ci_high, p_value) for retrocompatibility with notebooks written before the unified entrypoint:

res = igem.analyze.ewas(phen, "GLUCOSE")  # equivalent to:
# igem.analyze.association_study(phen, "GLUCOSE", phen.exposures)
# adapted to the legacy column names.

`gwas` — sgkit-vectorised additive-linear GWAS¶

Order-of-magnitude faster than association_study(geno=...) for the additive-linear case at biobank scale. Restricted to:

Continuous outcome (linear). Logistic raises NotImplementedError.
Additive encoding only.

res = igem.analyze.gwas(geno, phen, "GLUCOSE")

For other encodings or logistic outcome, route through association_study(geno=geno, ...).

`lrt` — standalone likelihood-ratio test¶

For comparing two arbitrary nested models on the same outcome and data — useful when you need the LRT machinery outside of association / interaction:

out = igem.analyze.lrt(
    phen, "GLUCOSE",
    full=["AGE", "SEX", "BMI", "SMOKING"],
    nested=["AGE", "SEX"],
)
out["chi2"], out["df"], out["p_value"]

9. Quick reference¶

Main entrypoints¶

Function	Purpose
`igem.analyze.association_study(phen, outcomes, …)`	Unified EWAS / PheWAS / GWAS
`igem.analyze.interaction_study(phen, outcomes, …)`	Pairwise GxE / GxG via LRT
`igem.analyze.ewas(phen, outcome)`	Single-outcome EWAS shortcut (legacy schema)
`igem.analyze.gwas(geno, phen, outcome)`	Biobank-scale additive linear GWAS via sgkit
`igem.analyze.lrt(phen, outcome, full=, nested=)`	Standalone likelihood-ratio test

Result chaining¶

Method	Returns
`.with_correction("fdr_bh", groupby="outcome")`	`RegressionResults`
`.passing(p=0.01)`	`RegressionResults`
`.top(20)`	`RegressionResults`
`.annotate(igem)`	`RegressionResults`
`.summary()`	`dict`
`.to_csv(path)`	`Path`

Module-level utilities¶

Function	Purpose
`igem.modules.analyze.infer_family(series)`	Returns `"linear"` / `"logistic"`
`igem.modules.analyze.apply_correction(pvalues, method)`	Direct numpy-array correction (no wrapper)
`igem.modules.analyze.list_methods()`	Available correction methods