Quickstart

A five-minute tour from zero to your first knowledge-graph query. By the end you will have installed IGEM, connected to the public server, retrieved real biological annotations for a set of genes, and seen how the same workflow is available both from the command line and from Python.

Note

Requirement: Python 3.11 or newer. Everything else is one pip install away.

1. Install and connect

Install the client from PyPI (a virtual environment is recommended):

python -m venv .venv
source .venv/bin/activate
pip install igem

Point it at the public IGEM server maintained by the Hall Lab:

igem config set server-url https://geneexposure.org/api
# set server-url = https://geneexposure.org/api
#   → /your/cwd/.igem.toml

Verify the connection:

igem health
# status: ok

If you see status: ok, you are ready to query the knowledge graph.

Tip

The command above writes a project-scoped ./.igem.toml. To set the URL once for any directory, use igem config set --global server-url which writes ~/.igem.toml instead. See Installation for the full configuration reference, troubleshooting, and the optional [embedded] extras for HPC use.

2. Discover the knowledge graph

Ask the server which reports it exposes:

igem report list
Name                   Version   Description
--------------------------------------------------------
gene_annotations       1.0       Gene metadata for HGNC symbols
go_annotations         1.0       Gene Ontology terms by gene
disease_annotations    1.0       Disease associations by gene
pathway_annotations    1.0       Reactome pathways by gene
protein_annotations    1.0       UniProt proteins by gene

Read the documentation for any of them:

igem report explain --name gene_annotations

3. Run your first report

Annotate three well-known cancer genes — BRCA1, TP53, MYC — and trim the output to four columns so it fits your terminal:

igem report run --name gene_annotations \
  --input BRCA1 --input TP53 --input MYC \
  --columns gene_symbol,entrez_id,chromosome,gene_locus_type
[INFO] ============================================
[INFO] IGEM — Client
[INFO]   Server URL : https://geneexposure.org/api
[INFO] ============================================
[INFO] [report] Running 'gene_annotations'...
[INFO] [report] 'gene_annotations' complete: 3 rows in 0.3s

gene_symbol  entrez_id  chromosome  gene_locus_type
BRCA1        672        17          gene with protein product
TP53         7157       17          gene with protein product
MYC          4609       8           gene with protein product

The server resolved each input symbol to its HGNC entity, joined the genomic-coordinate metadata, and returned a typed result in under a second.

Save to a file

For larger inputs, redirect to CSV:

igem report run --name gene_annotations \
  --input-file my_genes.txt \
  --output annotations.csv

my_genes.txt is a plain text file, one identifier per line.

4. The same flow in Python

The CLI is a thin wrapper around the IGEM Python facade. Anything the CLI does is one method call away:

from igem import IGEM

with IGEM() as igem:
    result = igem.report.gene_annotations(
        input_values=["BRCA1", "TP53", "MYC"],
        columns=["gene_symbol", "entrez_id", "chromosome",
                 "gene_locus_type"],
    )

print(result.df.head())

result.df is a pandas.DataFrame — chain it directly into downstream analysis:

result.df.to_parquet("annotations.parquet")
result.df.query("chromosome == '17'")

The IGEM() constructor reads the same ./.igem.toml (or ~/.igem.toml) that the CLI uses, so no further configuration is needed inside scripts or notebooks.

What’s next

You have just used IGEM as a pure knowledge-graph client. The full surface is split between the client (analyst-facing, the igem package) and the server (sysadmin-facing, the igem-server package). A quick map of capabilities and where each is documented:

IGEM client — analytical surface

  • Data loadingigem.data.read_plink, read_vcf, read_phenotypes. PLINK 1.x BED/BIM/FAM, VCF, and phenotype CSV / TSV. → Loading data

  • Quality controligem.describe.* for summaries, missingness reports, type inference, and skewness; igem.modify.* for recoding, outlier removal, categorisation, and column / row filters. → Describing data · Modifying data

  • Single-feature analysisigem.analyze.gwas and igem.analyze.ewas for genome- and environment-wide association scans, with built-in Bonferroni and FDR corrections.

  • Interaction analysisigem.analyze.lrt runs likelihood-ratio tests between nested and full models, the canonical mechanism for GxG, GxE, and ExE interaction terms.

  • Knowledge-graph reportsigem.report.*: gene, GO, disease, pathway, and protein annotations queried from the server (what this Quickstart used). → Reporting data · Report catalogue

  • HPC orchestration — LSF / SLURM submission helpers and offline embedded mode against a Parquet snapshot. → Container and HPC workflows

  • Visualisation (roadmap) — Manhattan plots (with Bonferroni and FDR thresholds), top-results plots, and exposome distribution plots, ported from the CLARITE lineage.

IGEM server — backend surface

  • Databasege.db: PostgreSQL schema, Alembic migrations, and connection management for the knowledge graph. → Server setup

  • ETLge.etl: ingestion pipelines for HGNC, UniProt, Reactome, GO, CTD, HMDB, PharmGKB, NHANES, and the other authoritative sources behind the graph. → ETL pipeline

  • NLPge.nlp: entity normalisation and synonym / alias resolution across the supported ontologies, used during ingestion and at query time.

  • Reportsge.report: server-side report registry that surfaces typed analytical endpoints (gene_annotations, pathway_annotations, …) to any client.

  • Snapshot generation — frozen Parquet exports of the graph, versioned by tag and used by clients in offline / embedded mode. → Snapshot generation

  • Container deployment — Docker / Apptainer image bundling the client and the server for single-host or HPC deployment. → Container deployment

If you would prefer to understand the moving parts before diving in, the next page — Concepts — explains the client / server / snapshot model and the filter-then-test loop at the centre of an IGEM study.