Analyzing data¶
The analyze module fits regression models for inference. Two
unified entrypoints — association_study for EWAS / PheWAS /
GWAS and interaction_study for pairwise GxE / GxG via
likelihood-ratio test — accept lists of outcomes and regressors,
auto-detect the regression family (linear / logistic), support
five genotype encodings (additive / dominant / recessive /
codominant / edge), and parallelise the scan via joblib. Survey-
aware estimation (sample weights with cluster-robust standard
errors) is one flag away. Results land in an immutable
RegressionResults container with chainable
.with_correction → .passing → .top → .annotate → .summary
methods, ready to be filtered, annotated against the IGEM knowledge
graph, and persisted.
Tip
Every function in this module returns a :class:RegressionResults —
an immutable, chainable container with built-in
.with_correction → .passing → .top → .annotate → .summary
fluency. Inputs are never mutated; each transformation returns a
fresh result.
Note
No CLI commands for this module. Inference produces in-memory results for downstream filtering, plotting, or report generation within the same Python session. Use Python or a notebook.
By the end of this page you will know how to:
Choose between
association_study(unified EWAS / PheWAS / GWAS),interaction_study(pairwise GxE / GxG via LRT), and the legacy specialized entrypoints (ewas,gwas,lrt).Run single-outcome and multi-outcome PheWAS with per-outcome multiple-testing correction.
Test genotype regressors under five different encoding conventions (additive / dominant / recessive / codominant / edge).
Apply survey-aware estimation (weights + cluster-robust SE) for NHANES-style designs.
Parallelise the regression scan with
n_jobs.Read the canonical output schema — Wald / LRT p-values, ΔAIC, convergence flags — and chain transformations on it.
All examples below use the IGEM facade, consistent with the rest
of the user guide:
from igem import IGEM
with IGEM() as igem:
phen = igem.data.read_phenotypes(
"nhanes.csv", outcomes=["GLUCOSE"],
)
res = igem.analyze.interaction_study(
phen,
outcomes="GLUCOSE", # what we model
covariates=["AGE", "SEX"], # adjustment set
interactions=[ # candidate pairs to test
("BMI", "SMOKING"),
("BMI", "DIET_QUALITY"),
("PHYSICAL_ACTIVITY", "DIET_QUALITY"),
],
n_jobs=-1, # parallelise across cores
).with_correction("fdr_bh") # Benjamini-Hochberg FDR
Each interaction pair becomes one likelihood-ratio test; the result
is a RegressionResults indexed by (outcome, term1, term2) ready
for further filtering or annotation.
Free functions in igem.modules.analyze remain importable when an
IGEM instance is not needed.
1. When to use what¶
Function |
Best for |
|---|---|
|
EWAS / PheWAS / small-to-medium GWAS with full flexibility (multi-outcome, all encodings, parallel, custom backend) |
|
Pairwise GxE / GxG screening via LRT |
|
Single-outcome EWAS, ergonomic shortcut over |
|
Biobank-scale additive linear GWAS via sgkit’s vectorised regression — order-of-magnitude faster than |
|
Standalone likelihood-ratio test on two arbitrary nested models (utility) |
The first two are the canonical entrypoints — everything else
is either a thin wrapper (ewas) or an optimised special case
(gwas).
2. association_study — unified entrypoint¶
The minimum-viable call¶
res = igem.analyze.association_study(
phen, outcome="GLUCOSE", regression_variables=["BMI"],
)
For each (outcome, regressor) pair the function fits two nested
models — full with the regressor of interest plus covariates,
and null with covariates only — and reports both Wald and LRT
statistics in a long-format DataFrame.
Linear regression (continuous outcome):
Logistic regression (binary outcome):
The family is auto-detected from the outcome dtype unless overridden. With \(n\) samples and \(k\) parameters, statsmodels fits via maximum-likelihood and reports:
Field |
Definition |
|---|---|
|
\(\hat{\beta}\), coefficient for the regressor |
|
\(\sqrt{\hat{V}_{\hat{\beta}}}\) from the inverse Fisher information |
|
\(\hat{\beta} \pm 1.96 \cdot \text{SE}\) |
|
Wald p-value: $p = 2,\Phi\big(- |
|
Likelihood-ratio test p-value (see §3 formulas) |
|
\(\Delta\mathrm{AIC} = \mathrm{AIC}_F - \mathrm{AIC}_N\), (Akaike, 1974) |
|
Sample count after listwise deletion |
|
statsmodels convergence flag |
|
First line of the error message if the fit failed |
|
|
Multi-outcome PheWAS¶
Pass a list of outcomes — each outcome × regressor pair is fitted
independently. Combine with groupby correction so multiple-testing
adjustment is applied within each outcome:
res = igem.analyze.association_study(
phen,
outcomes=["GLUCOSE", "LDL", "HDL", "TG", "BP_SYS"],
regression_variables=phen.exposures,
n_jobs=-1, # parallelise across cores
).with_correction("fdr_bh", groupby="outcome")
Genotype regressors¶
Pass a Genotypes wrapper via geno= — regression_variables
becomes the list of variant_ids to test. Five encodings cover the
common genetic-architecture assumptions:
res_add = igem.analyze.association_study(
phen, outcome="GLUCOSE",
regression_variables=["rs429358", "rs7412"],
geno=geno,
encoding="additive", # default
)
res_dom = igem.analyze.association_study(
phen, outcome="GLUCOSE",
regression_variables=["rs429358"],
geno=geno,
encoding="dominant",
)
Encoding |
Transformation |
Notes |
|---|---|---|
|
dosage \(x \in \{0, 1, 2\}\) |
Linear allele-count effect (default) |
|
\(x' = \mathbb{1}[x \geq 1]\) |
Any alt allele triggers |
|
\(x' = \mathbb{1}[x = 2]\) |
Two alt alleles required |
|
One-hot het / hom-alt vs hom-ref |
Multi-df LRT joins both contrasts |
|
per-variant lookup from |
Biologically-informed scoring |
Edge encoding lets you replace the dosage with a biologically meaningful score per genotype. Pair with the IGEM knowledge graph to score variants by functional impact, conservation, or pathway membership:
edge_info = pd.DataFrame(
{
"score_0": [0.0, 0.0], # hom-ref
"score_1": [0.5, 1.0], # het
"score_2": [1.0, 1.5], # hom-alt
},
index=["rs429358", "rs7412"],
)
res = igem.analyze.association_study(
phen, "GLUCOSE",
regression_variables=["rs429358", "rs7412"],
geno=geno,
encoding="edge",
edge_encoding_info=edge_info,
)
Tip
gwas() vs association_study(geno=...). Both fit the same
model for the additive-linear case, but gwas() uses sgkit’s
vectorised SVD-based regression — single matrix decomposition over
all variants — while association_study loops per variant with
statsmodels. At biobank scale (10⁶+ variants) the speed difference
is hours vs days. Use gwas() for additive-linear GWAS at scale;
use association_study(geno=...) when you need a non-additive
encoding, logistic outcome, or a custom regression backend.
Survey-aware analysis¶
For NHANES-style complex sample designs, set weights_col /
cluster_col / strata_col at load time and pass use_survey=True:
phen = igem.data.read_phenotypes(
"nhanes.csv", outcomes=["GLUCOSE"],
weights_col="WTMEC", cluster_col="SDMVPSU", strata_col="SDMVSTRA",
)
res = igem.analyze.association_study(
phen, "GLUCOSE", phen.exposures,
use_survey=True,
)
The default backend (regression_kind="auto" → glm_engine) applies:
Sample weights as
freq_weightsin the GLM, giving the weighted point estimate\[ \hat{\boldsymbol{\beta}}_w = (X^\top W X)^{-1} X^\top W \boldsymbol{y}, \qquad W = \mathrm{diag}(w_i) \]Cluster-robust (“sandwich”) variance when
cluster_colis set (White, 1980; Liang & Zeger, 1986):\[ \hat{V}_{\mathrm{CR}}\!\left(\hat{\boldsymbol{\beta}}\right) = (X^\top X)^{-1} \!\left(\sum_g X_g^\top \hat{\boldsymbol{u}}_g \hat{\boldsymbol{u}}_g^\top X_g\right)\! (X^\top X)^{-1} \]where \(g\) indexes clusters and \(\hat{\boldsymbol{u}}_g\) are the cluster’s residuals.
Note
Stratified Taylor-series variance (the gold-standard NHANES
estimator) is not implemented in glm_engine. For production
NHANES analysis pass regression_kind="r_survey" to use R’s
survey package via rpy2. Install with
poetry install --with r-survey.
Parallelism¶
res = igem.analyze.association_study(
phen, outcomes=phen.outcomes, regression_variables=phen.exposures,
n_jobs=-1, # all cores
)
n_jobs follows joblib semantics — 1 is sequential, -1 uses
every available core, integers in between are explicit pool sizes.
Tasks are distributed at the (outcome, regressor) granularity, so
parallelism scales with the number of pairs.
Other parameters¶
Parameter |
Effect |
|---|---|
|
Skip pairs with fewer complete cases (default matches CLARITE; failed pairs land in |
|
Z-score continuous regressors before fitting (effect-size comparable across features) |
|
Emit one row per category dummy in addition to the LRT summary |
|
Force survey-weighted GLM (auto-selected when |
|
Custom regression backend (must return :class: |
3. interaction_study — pairwise interactions¶
For every pair of variables \((v_1, v_2)\) and each outcome, fit two nested models and compare via LRT:
Restricted (additive only): \(\mathrm{outcome} \sim v_1 + v_2 + \boldsymbol{c}\)
Full (additive + interaction): \(\mathrm{outcome} \sim v_1 + v_2 + v_1{:}v_2 + \boldsymbol{c}\)
The likelihood-ratio test statistic (Wilks, 1938):
with \(df\) equal to the number of additional parameters in the full model:
\(df = 1\) for continuous × continuous
\(df = (k_1 - 1)(k_2 - 1)\) for categorical × categorical with arities \(k_1\) and \(k_2\)
mixed cases reduce accordingly (e.g. continuous × 3-level categorical gives \(df = 2\))
res = igem.analyze.interaction_study(
phen, "GLUCOSE",
interactions=[("SMOKING", "RACE"), ("BMI", "AGE")],
n_jobs=-1,
)
res.df[["term1", "term2", "n", "lrt_chi2", "lrt_df", "lrt_pvalue"]]
Three input forms for interactions=¶
# All unordered pairs of phen.exposures (capped by max_pairs).
interactions=None
# Anchor: every pair (anchor, other) with other ∈ phen.exposures \ {anchor}.
interactions="BMI"
# Explicit list of tuples.
interactions=[("BMI", "AGE"), ("SMOKING", "RACE")]
When interactions=None and the universe of pairs exceeds
max_pairs (default 1000), the function raises rather than running
silently — pre-filter via the knowledge graph or pass an explicit
list.
report_betas=True¶
By default the result has one summary row per pair (LRT-only). With
report_betas=True each interaction term gets its own row with the
coefficient, SE, and Wald p-value of that contrast — useful for
inspecting the direction of interaction effects in categorical
pairs.
Tip
This is IGEM’s core value proposition. The combinatorial
explosion of pairwise GxE / GxG tests dominates power-loss to
multiple-testing correction. By filtering candidate pairs upstream
through biological knowledge (gene–gene relationships, pathway
co-membership) the universe of tests shrinks to a tractable size,
and interaction_study runs LRT on the shortlisted pairs.
4. Multiple-testing correction¶
Apply correction post-hoc on a RegressionResults:
out = (
igem.analyze.association_study(phen, ["GLUCOSE", "LDL"], phen.exposures)
.with_correction("fdr_bh", groupby="outcome")
.passing(p_corrected=0.05)
.top(20)
)
Six methods are exposed via statsmodels’ multipletests:
|
Adjustment |
Reference |
|---|---|---|
|
\(p^{\mathrm{adj}}_i = \min(1, m\,p_i)\) |
Bonferroni, 1936 |
|
Step-down Bonferroni |
Holm, 1979 |
|
\(p^{\mathrm{adj}}_i = 1 - (1 - p_i)^m\) |
Šidák, 1967 |
|
Step-up |
Hommel, 1988 |
|
\(p^{\mathrm{adj}}_{(i)} = \min_{k \geq i}\!\frac{m\,p_{(k)}}{k}\) |
|
|
BH adjusted for arbitrary dependence |
bonferroni controls FWER strictly but is conservative;
fdr_bh controls the false discovery rate, generally
recommended for EWAS / PheWAS where many tests are expected to be
correlated and a moderate false-positive rate is acceptable.
Per-outcome correction (groupby=)¶
Critical for multi-outcome PheWAS. With groupby="outcome" (or any
column of the result), the correction is applied within each
group rather than across the entire result, preventing the false
inflation of the test count when multiple outcomes are scanned in
one call:
# Wrong (inflates m by n_outcomes):
res.with_correction("fdr_bh")
# Right:
res.with_correction("fdr_bh", groupby="outcome")
5. Family auto-detection¶
The family is inferred from the outcome column unless overridden:
from igem.modules.analyze import infer_family
infer_family(phen.df["GLUCOSE"]) # → "linear" (continuous numeric)
infer_family(phen.df["DIABETES"]) # → "logistic" (binary 0/1 or bool)
Rule:
Outcome dtype / values |
Family |
|---|---|
|
|
numeric with non-NaN values \(\subseteq \{0, 1\}\) |
|
numeric with \(\geq 3\) distinct values |
|
object / category |
|
Override explicitly when needed:
# Force linear regression on a binary outcome (e.g. for linear-probability
# model in robustness checks).
res = igem.analyze.association_study(
phen, "DIABETES", phen.exposures, family="linear",
)
6. Choosing a regression engine¶
regression_kind selects the inference backend:
Value |
Engine |
|---|---|
|
|
|
Same as auto without survey weights |
|
|
|
|
callable |
Custom backend returning :class: |
A custom callable is the escape hatch for research backends:
def my_engine(y, X, family, *, weights=None, cluster=None):
fit = my_special_fitter(y, X, family)
from igem.modules.analyze._engines import EngineResult
return EngineResult(
params=fit.params, bse=fit.bse, pvalues=fit.pvalues,
conf_int=fit.conf_int, log_likelihood=fit.llf,
aic=fit.aic, n=fit.nobs, converged=fit.converged,
)
res = igem.analyze.association_study(
phen, "GLUCOSE", phen.exposures, regression_kind=my_engine,
)
7. Result chaining¶
RegressionResults is immutable — every transformation returns
a new instance. The result composes naturally into a fluent
pipeline:
publishable = (
igem.analyze.association_study(phen, "GLUCOSE", phen.exposures, n_jobs=-1)
.with_correction("fdr_bh")
.passing(p_corrected=0.05)
.annotate(igem) # gene-symbol / HGNC merge from the IGEM KG
.top(20, by="lrt_pvalue")
)
publishable.to_csv("ewas_top20_annotated.csv")
Method |
Effect |
|---|---|
|
Adds |
|
Filter rows by raw or corrected p-value |
|
Sort + head; |
|
Left-join gene-annotation columns from the KG |
|
One-line dict for log lines / report headers |
|
Persist |
8. Specialised entrypoints¶
ewas — single-outcome EWAS shortcut¶
A thin wrapper of association_study that pins to a single
outcome, defaults regression_variables to phen.exposures, and
emits the legacy result schema (variable, n, beta, se, ci_low, ci_high, p_value) for retrocompatibility with notebooks written
before the unified entrypoint:
res = igem.analyze.ewas(phen, "GLUCOSE") # equivalent to:
# igem.analyze.association_study(phen, "GLUCOSE", phen.exposures)
# adapted to the legacy column names.
gwas — sgkit-vectorised additive-linear GWAS¶
Order-of-magnitude faster than association_study(geno=...) for
the additive-linear case at biobank scale. Restricted to:
Continuous outcome (linear). Logistic raises
NotImplementedError.Additive encoding only.
res = igem.analyze.gwas(geno, phen, "GLUCOSE")
For other encodings or logistic outcome, route through
association_study(geno=geno, ...).
lrt — standalone likelihood-ratio test¶
For comparing two arbitrary nested models on the same outcome and data — useful when you need the LRT machinery outside of association / interaction:
out = igem.analyze.lrt(
phen, "GLUCOSE",
full=["AGE", "SEX", "BMI", "SMOKING"],
nested=["AGE", "SEX"],
)
out["chi2"], out["df"], out["p_value"]
9. Quick reference¶
Main entrypoints¶
Function |
Purpose |
|---|---|
|
Unified EWAS / PheWAS / GWAS |
|
Pairwise GxE / GxG via LRT |
|
Single-outcome EWAS shortcut (legacy schema) |
|
Biobank-scale additive linear GWAS via sgkit |
|
Standalone likelihood-ratio test |
Result chaining¶
Method |
Returns |
|---|---|
|
|
|
|
|
|
|
|
|
|
|
|
Module-level utilities¶
Function |
Purpose |
|---|---|
|
Returns |
|
Direct numpy-array correction (no wrapper) |
|
Available correction methods |