arwx API
A small, content-addressed experiment tracker. One project carries a goal; every run, experiment, decision, and artifact hangs off it. The API is plain JSON over HTTPS — use the Python SDK or call it directly.
Overview
Every object lives under a project, which holds the goal — an objective metric, a direction (minimize / maximize), and an optional target. Underneath the project:
- Run — a process that streams metrics (loss curves, lr, etc.).
- Session — one line of inquiry; a sequence of experiments.
- Experiment — one tried change: a note (the flaw→fix story), a diff, and an optional code snapshot.
- Decision — keep / drop verdict on an experiment, with the score it reached.
- Artifact — a versioned, content-addressed output (model, code, dataset) with movable refs like winner and lineage back to the experiment that produced it.
Two ways in: the Python SDK (arwandb) wraps all of this and works offline-first; the raw HTTP API is what the SDK speaks, and you can use it from any language.
Authentication
Every /v1/* endpoint requires an API key sent as a bearer token. Keys are created and revoked from the dashboard (Connection panel) or via /v1/keys. The service is fail-closed: no valid, enabled key → 401.
Authorization: Bearer arwx_live_xxxxxxxxxxxx
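As a minimal sketch of attaching the bearer token from Python with only the standard library: the helper name `authed_request` is made up for illustration and is not part of the SDK. The request is built but not sent.

```python
import urllib.request

def authed_request(base_url: str, path: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a GET request carrying the bearer token."""
    url = base_url.rstrip("/") + path
    req = urllib.request.Request(url, method="GET")
    req.add_header("Authorization", f"Bearer {api_key}")
    return req

# Placeholder key from the docs above; substitute a real one.
req = authed_request("https://api.bullmask.com", "/v1/projects", "arwx_live_xxxxxxxxxxxx")
print(req.full_url)
print(req.get_header("Authorization"))
```

Passing `req` to `urllib.request.urlopen` would send it; with a missing or disabled key the 401 surfaces as `urllib.error.HTTPError`.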
Organizations & sign-in
arwx is multi-tenant. People sign in to the dashboard with email + password; machines and CI use API keys. Every account belongs to one or more organizations, and each org owns its own projects, runs, and keys — fully isolated from other orgs.
Quickstart (Python)
The SDK isn't on PyPI; it's served straight from this site. It's pure Python with zero dependencies (only the standard library), so installing it is just a wheel fetch. It requires Python >= 3.10.
```sh
# install the wheel hosted here (no PyPI, no extra deps)
pip install https://app.bullmask.com/sdk/arwandb-1.3.1-py3-none-any.whl

# …or always grab the latest (find-links index, no pinned version)
pip install --upgrade arwandb --find-links https://app.bullmask.com/sdk/ --no-deps
```
Or skip pip entirely: unpack the wheel next to your code and import arwandb with nothing to install:
```sh
curl -L https://app.bullmask.com/sdk/arwandb-1.3.1-py3-none-any.whl -o arwandb.whl
unzip -o arwandb.whl 'arwandb/*'   # leaves an ./arwandb package next to your code
```
```sh
# point the SDK at your server + key (once, in your shell)
export ARWANDB_BASE_URL=https://api.bullmask.com
export ARWANDB_API_KEY=arwx_live_xxxxxxxxxxxx
export ARWANDB_PROJECT=char-lm
```
```python
import arwandb

run = arwandb.init(project="char-lm", config={"lr": 3e-4})
for step in range(1000):
    # loss and bpb come from your own training step
    arwandb.log({"loss": loss, "val_bpb": bpb})  # batched + flushed in the background
arwandb.finish()
```
The SDK spools to disk first and uploads asynchronously on a background thread, so a network blip never stalls or crashes training. log() is a memory-only append — it never blocks on the network and never fsyncs on your training thread — so logging on every step costs essentially nothing. Set ARWANDB_MODE=offline to record without uploading, then arwandb.sync(path) later.
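The non-blocking behaviour described above can be illustrated with a toy model. This is a sketch of the general pattern (in-memory append plus a background drain thread), not the SDK's actual implementation; the class name `BufferedLogger` and the `upload` callback are assumptions made for the example, and the disk spool is left out.

```python
import threading

class BufferedLogger:
    """Toy model: log() appends in memory, a background thread drains to `upload`."""

    def __init__(self, upload, flush_interval=0.05):
        self._upload = upload              # callable that receives a list of points
        self._buf = []
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._interval = flush_interval
        self._step = 0
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def log(self, metrics, step=None):
        # Hot path: take the lock, append, return. No network, no disk.
        with self._lock:
            if step is None:
                step = self._step          # auto-increment when omitted
            self._step = step + 1
            self._buf.append({"step": step, **metrics})

    def _drain(self):
        # Swap the buffer out under the lock, upload outside it.
        with self._lock:
            batch, self._buf = self._buf, []
        if batch:
            self._upload(batch)

    def _run(self):
        # Event.wait() returns True once stop is set, which ends the loop.
        while not self._stop.wait(self._interval):
            self._drain()

    def finish(self):
        self._stop.set()
        self._thread.join()
        self._drain()                      # final flush of anything left

uploaded = []                              # stands in for the network
logger = BufferedLogger(upload=uploaded.append)
for i in range(1000):
    logger.log({"loss": 1.0 / (i + 1)})
logger.finish()
total = sum(len(batch) for batch in uploaded)
print(total)
```

The training loop only ever pays for a lock and a list append; everything slow happens on the other thread.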
Logging metrics
arwandb.log(metrics, step=None) takes a flat dict of name → number. Steps auto-increment if you omit them. Under the hood the SDK batches points and posts them to /v1/ingest (or streams NDJSON to /v1/ingest/stream for large bursts).
log() just appends points to an in-memory buffer (microseconds, no network, no disk wait); a background thread coalesces them into a few large batches (one request per ~500 points, flushed every flush_interval) and uploads over a kept-alive connection. So logging on every training step is one cheap stream of batches, not a request per step, and throttling to "every N steps" buys you nothing but coarser graphs. No need to spare the server; that's the SDK's job. Transient errors back off and retry; delivered points are exactly-once.
If you log() it, it shows up. (The optional autoresearch layer below adds the keep/drop frontier on top.)
GET /v1/runs/:id/integrity cross-checks the durability ledger against the metric store and returns a receipt with points_committed, points_stored, duplicates_collapsed, lost (0), and intact, which the dashboard shows as a one-line badge above the panels. A NaN/inf or non-numeric value is skipped, never raised, so a stray metric can't crash training.
For distributed or multi-process runs that share one run id, pass arwandb.init(..., writer_id="rank0") or set ARWANDB_WRITER_ID=rank$RANK per process. A random id is used if you don't set one, and single-process runs need nothing. (Requires SDK ≥ 1.3.1.)
| Call | What it does |
|---|---|
| init(project, name?, config?) | Start a run; registers it via POST /v1/runs. |
| log({...}, step?) | Append metric points (background upload). |
| finish(status="finished") | Flush and close the run. |
| sync(path?) | Upload a spool recorded in offline mode. |
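Two of the behaviours described in this section, skipping non-finite or non-numeric values and coalescing points into batches of about 500, can be sketched in a few lines. The function names `clean` and `batches` are hypothetical, and treating booleans as non-numeric is a choice made for this example rather than documented SDK behaviour.

```python
import math

def clean(metrics):
    """Keep only finite numeric values; skip NaN, inf, and non-numbers silently."""
    out = {}
    for name, value in metrics.items():
        # bool is a subclass of int in Python; this sketch chooses to skip it.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            continue
        if not math.isfinite(value):
            continue
        out[name] = value
    return out

def batches(points, size=500):
    """Split a list of points into consecutive chunks of at most `size`."""
    return [points[i:i + size] for i in range(0, len(points), size)]

kept = clean({"loss": 0.5, "grad": float("nan"), "lr": float("inf"), "tag": "x"})
sizes = [len(b) for b in batches(list(range(1200)))]
print(kept, sizes)
```

With 1200 buffered points this yields three requests instead of 1200, which is why logging on every step stays cheap.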
System metrics (automatic)
From the moment you init(), the SDK auto-collects host telemetry on a background thread — no code, just like W&B. NVIDIA GPU utilization, memory, power, and temperature (via nvidia-smi when present), plus host CPU, memory, disk usage + I/O, and network, logged under system/* keys. The dashboard groups them into their own collapsible System section below your training charts.
Unlike W&B (which runs a separate wandb-service daemon and pulls in psutil/pynvml), arwx reads /proc + os.statvfs directly and shells out to nvidia-smi only if it exists: no extra package, no second process. Sampling runs on its own thread on a separate step axis, so it never perturbs your training step numbering and never touches the hot path. A missing source (no GPU, an iGPU, a sandboxed /proc) is silently skipped, never an error.
```python
# on by default; tune or disable per run…
run = arwandb.init(project="char-lm", system_interval=30)    # sample every 30s (default 15)
run = arwandb.init(project="char-lm", system_metrics=False)  # …or turn it off
```
```sh
# …or from the environment
export ARWANDB_SYSTEM_METRICS=off
export ARWANDB_SYSTEM_INTERVAL=30
```
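To make the "skip what is missing" approach concrete, here is a rough standard-library sketch of one sampling pass. It is not the SDK's collector: the function name `sample_system` and the specific `system/*` key names are invented for illustration, and only disk usage, load average, and an nvidia-smi presence check are covered.

```python
import os
import shutil

def sample_system(path="/"):
    """Collect a few host metrics; any source that fails is skipped, not raised."""
    sample = {}
    try:
        st = os.statvfs(path)
        total = st.f_blocks * st.f_frsize
        free = st.f_bavail * st.f_frsize
        if total > 0:
            sample["system/disk_used_pct"] = 100.0 * (total - free) / total
    except (OSError, AttributeError):
        pass  # no statvfs on this platform, or the path is unreadable
    try:
        with open("/proc/loadavg") as fh:
            sample["system/load_1m"] = float(fh.read().split()[0])
    except (OSError, ValueError, IndexError):
        pass  # sandboxed or non-Linux /proc
    # Only consider the GPU path when nvidia-smi is actually on PATH.
    sample["system/has_nvidia_smi"] = 1.0 if shutil.which("nvidia-smi") else 0.0
    return sample

snapshot = sample_system()
print(sorted(snapshot))
```

Each source sits in its own `try` block, so one unavailable source never prevents the others from being recorded.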
The autoresearch loop
The arwandb.autoresearch module models a research story: a goal, sessions of experiments, and keep/drop decisions that build the running-best frontier you see on the dashboard.
```python
from arwandb import autoresearch as ar

# 1. define the objective (minimize val_bpb toward 2.45)
ar.goal(metric="val_bpb", lower_is_better=True, target=2.45)

# 2. open a line of inquiry
ar.session(tag="tune-warmup")

# 3. record a tried change. Inside a git repo, commit + parent + the
#    UNCOMMITTED working-tree diff are captured automatically (the code
#    that actually ran), so no arguments are needed. note = the flaw→fix story.
exp = ar.experiment(
    note="lr too hot early; added 200-step warmup",
    snapshot=".",  # optional: full code tree, content-addressed
)

# 4. judge it: the score feeds the frontier
ar.decision(experiment_id=exp["id"], status="keep", score=2.41, metric="val_bpb")
```
| Function | Purpose |
|---|---|
| goal(metric, lower_is_better, target?) | Set/update the project objective. |
| session(tag?, branch?) | Start a session; becomes the active one. |
| experiment(commit?, note?, diff?, snapshot?) | Record one change under the active session. In a git repo, commit / parent / uncommitted diff auto-capture (pass to override; diff="" suppresses). Stamps a repro_key = hash(commit, diff, config). |
| decision(experiment_id, status, score, metric?) | Keep/drop verdict (status="keep") + score. |
| progress(session_id?) / frontier(session_id?) | Read the running-best staircase. |
| context(project?, limit=5) | Compact summary for an agent's next step. |
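The table describes repro_key only as hash(commit, diff, config). One plausible way to build such a key is sketched below; the choice of SHA-256, the JSON canonicalization, and the length-prefixing are all assumptions made for the example, so keys produced here will not necessarily match the ones the SDK stamps.

```python
import hashlib
import json

def repro_key(commit: str, diff: str, config: dict) -> str:
    """Hash commit + diff + canonicalized config into one stable hex key."""
    # Sorting keys makes the key independent of dict insertion order.
    canonical_config = json.dumps(config, sort_keys=True, separators=(",", ":"))
    h = hashlib.sha256()
    for part in (commit, diff, canonical_config):
        data = part.encode("utf-8")
        # Length-prefix each part so ("ab", "c") and ("a", "bc") hash differently.
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    return h.hexdigest()

a = repro_key("abc123", "+warmup=200", {"lr": 3e-4, "seed": 1})
b = repro_key("abc123", "+warmup=200", {"seed": 1, "lr": 3e-4})
c = repro_key("abc123", "+warmup=400", {"lr": 3e-4, "seed": 1})
print(a == b, a == c)
```

Two experiments with the same commit, diff, and config get the same key, while any change to the code or the config produces a different one.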
Artifacts & lineage
An artifact is a named, versioned output addressed like a container image: name:ref. Four orthogonal primitives:
- Version — immutable, auto-numbered from v1. Identical bytes dedupe to the same version (content-addressed).
- Ref — a movable name (winner, production, latest) pointing at a version.
- Lineage — produced_by ties a version to the experiment that made it.
- latest is a built-in ref that always resolves to the highest version.
```python
from arwandb import autoresearch as ar

# log a new version from files; content-addressed, so re-logging same bytes is a no-op
v = ar.artifact("char-lm", "out/model", type="model", produced_by=exp["id"])
ar.promote("char-lm", "winner", version=v["version"])  # move the ref
ar.get_artifact("char-lm", "winner")                   # resolve ref → version
ar.artifacts(type="model")                             # list
```
The dashboard shows these under the Artifacts lens of each project: every version, its score, and which one carries the winner badge.
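The four primitives are easier to reason about with a small in-memory model. The class below is a teaching sketch, not the server's data model: `ArtifactStore` and its method names are hypothetical, SHA-256 over raw bytes stands in for the real content address, and refusing to move `latest` is an assumption.

```python
import hashlib

class ArtifactStore:
    """In-memory toy: immutable versions, movable refs, built-in `latest`."""

    def __init__(self):
        self.versions = []   # index 0 is v1; entries hold digest + produced_by
        self.refs = {}       # ref name -> version number

    def log(self, data: bytes, produced_by=None) -> int:
        digest = hashlib.sha256(data).hexdigest()
        for number, version in enumerate(self.versions, start=1):
            if version["digest"] == digest:
                return number            # identical bytes: reuse that version
        self.versions.append({"digest": digest, "produced_by": produced_by})
        return len(self.versions)

    def promote(self, ref: str, version: int) -> None:
        if ref == "latest":
            raise ValueError("latest is built in and cannot be moved")
        if not 1 <= version <= len(self.versions):
            raise KeyError(f"no such version: v{version}")
        self.refs[ref] = version

    def resolve(self, ref: str) -> int:
        if ref == "latest":
            if not self.versions:
                raise KeyError("no versions logged yet")
            return len(self.versions)    # always the highest version
        return self.refs[ref]

store = ArtifactStore()
v1 = store.log(b"weights-a", produced_by="exp-1")
v2 = store.log(b"weights-b", produced_by="exp-2")
again = store.log(b"weights-a")          # same bytes, so no new version
store.promote("winner", v2)
print(v1, v2, again, store.resolve("winner"), store.resolve("latest"))
```

Versions never change once written; only the refs move, which is what lets `winner` point at a different version over time without rewriting history.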
Code snapshots
Pass snapshot="." (a dir or file list) to experiment(...) and the SDK hashes every file, asks the server which blobs are missing (POST /v1/blobs/missing), uploads only those, and attaches a tree manifest. That makes any experiment reproducible without storing redundant bytes. Caches, datasets, and model weights are ignored by default; a snapshot failure warns but never breaks the experiment.
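The hash, ask, upload-only-missing flow can be sketched locally. This is an illustration under assumptions: SHA-256 is used as the content hash, the manifest is a plain path-to-digest dict, and the server's answer to POST /v1/blobs/missing is simulated with a set. The function names are hypothetical.

```python
import hashlib
import os
import tempfile

def hash_tree(root):
    """Map each relative file path under `root` to the SHA-256 of its bytes."""
    manifest = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            with open(full, "rb") as fh:
                digest = hashlib.sha256(fh.read()).hexdigest()
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            manifest[rel] = digest
    return manifest

def missing_blobs(manifest, server_has):
    """Return the digests the server lacks; only these need uploading."""
    return sorted(set(manifest.values()) - set(server_has))

with tempfile.TemporaryDirectory() as tmp:
    files = [("train.py", "print('hi')\n"), ("copy.py", "print('hi')\n"), ("cfg.txt", "lr=3e-4\n")]
    for name, text in files:
        with open(os.path.join(tmp, name), "w") as fh:
            fh.write(text)
    manifest = hash_tree(tmp)
    server_has = {manifest["cfg.txt"]}   # pretend the server already stores this blob
    to_upload = missing_blobs(manifest, server_has)

print(len(manifest), len(to_upload))
```

Three files produce only one upload here: `cfg.txt` is already on the server, and the two identical scripts share a single blob.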
HTTP API reference
All paths are under the base URL and require the Authorization header (except /health). Bodies and responses are JSON.
Runs & metrics
| Endpoint | Description |
|---|---|
| POST /v1/runs | Create a run. |
| POST /v1/runs/:id/finish | Mark a run finished. |
| GET /v1/runs/:id/metrics | Full run history (W&B scan_history): ?keys=, min_step, max_step, limit. |
| GET /v1/runs/:id/integrity | Integrity receipt: points_committed (durability ledger) vs points_stored (deduped), duplicates_collapsed, lost, intact, and the run's commit/diff_sha fingerprint. |
| POST /v1/ingest | Batch-append metric points (JSON). |
| POST /v1/ingest/stream | Stream points as NDJSON (no body limit). |
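A client-side check of the integrity receipt might look like the sketch below. The field names come from the endpoint description; their types, the helper name `integrity_badge`, and the sample values are assumptions for illustration, not real server output.

```python
def integrity_badge(receipt: dict) -> str:
    """Render a one-line summary from an integrity receipt."""
    healthy = receipt["intact"] and receipt["lost"] == 0
    status = "intact" if healthy else "DATA LOSS"
    return (
        f"{status}: {receipt['points_stored']} stored / "
        f"{receipt['points_committed']} committed, "
        f"{receipt['duplicates_collapsed']} duplicates collapsed, "
        f"{receipt['lost']} lost"
    )

# Illustrative values only.
sample = {
    "points_committed": 1003,
    "points_stored": 1000,
    "duplicates_collapsed": 3,
    "lost": 0,
    "intact": True,
}
badge = integrity_badge(sample)
print(badge)
```

A CI job could fetch the receipt after a run finishes and fail the build whenever the summary does not start with `intact`.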
Projects & goals
| Endpoint | Description |
|---|---|
| POST /v1/projects | Create/update a project & its goal. |
| GET /v1/projects | List projects. |
| GET /v1/projects/:name | One project. |
| GET /v1/projects/:name/sessions | Sessions in a project. |
| GET /v1/projects/:name/progress | Running-best frontier (powers the Experiments chart). |
| GET /v1/projects/:name/runs | Runs in a project (id, name, status, config) — the runs table. |
| GET /v1/projects/:name/charts | Auto-chart data (W&B history): downsampled per-metric series for every run + each run's latest value. ?keys=, samples (def 500), since_step, max_step. |
Sessions & experiments
| Endpoint | Description |
|---|---|
| POST /v1/sessions | Open a session. |
| POST /v1/experiments | Record an experiment. |
| GET /v1/experiments/:id | Experiment detail. |
| GET /v1/experiments/:id/diff | Stored diff text. |
| POST /v1/experiments/:id/decision | Record keep/drop + score. |
| GET /v1/experiments/:id/tree, PUT /v1/experiments/:id/tree | Read / set the code-snapshot tree. |
| GET /v1/experiments/:a/diff/:b | Tree diff between two experiments. |
| GET /v1/sessions/:id/progress | Session progress (.svg variant too). |
| GET /v1/sessions/:id/frontier | Best-so-far frontier. |
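The running-best frontier returned by the progress and frontier endpoints can be thought of as a staircase over decisions. The sketch below assumes that only kept experiments count and that a step is added only on strict improvement; the server's exact rules may differ, and the scores are made up.

```python
def frontier(decisions, lower_is_better=True):
    """Return (index, score) for each kept decision that improves on the best so far."""
    best = None
    steps = []
    for index, (status, score) in enumerate(decisions):
        if status != "keep":
            continue                     # dropped experiments never move the frontier
        if best is None:
            improved = True
        elif lower_is_better:
            improved = score < best
        else:
            improved = score > best
        if improved:
            best = score
            steps.append((index, score))
    return steps

history = [("keep", 2.60), ("drop", 2.55), ("keep", 2.48), ("keep", 2.50), ("keep", 2.41)]
stairs = frontier(history)
print(stairs)
```

The dropped 2.55 and the kept-but-worse 2.50 leave no mark, so the staircase only ever steps toward the goal.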
Blobs & files
| Endpoint | Description |
|---|---|
| POST /v1/blobs/missing | Which of these SHAs are absent? (dedup check) |
| POST /v1/blobs | Upload one blob's bytes. |
| GET /v1/blobs/:sha | Fetch a blob by content hash. |
| POST /v1/files, POST /v1/files/upload | Register / upload an external file artifact. |
Artifacts
| Endpoint | Description |
|---|---|
| POST /v1/artifacts/log | Log a version (dedupes on tree hash). |
| GET /v1/artifacts | List artifacts (filter ?type=). |
| GET /v1/artifacts/:name | Versions + refs for one artifact. |
| PUT /v1/artifacts/:name/refs/:ref | Promote: move a ref to a version. |
| GET /v1/artifacts/:name/:ref | Resolve name:ref → version (latest built in). |
Agent & keys
| Endpoint | Description |
|---|---|
| GET /v1/agent/context | Compact project state for an agent. |
| GET /v1/bootstrap | One call that hydrates the whole dashboard. |
| POST /v1/keys, GET /v1/keys | Create / list API keys. |
| DELETE /v1/keys/:id | Revoke a key. |
| GET /health | Liveness probe (public, no auth). |
cURL example
```sh
# set the goal for a project
curl -s https://api.bullmask.com/v1/projects \
  -H "Authorization: Bearer $ARWANDB_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"name":"char-lm","objective_metric":"val_bpb","lower_is_better":true,"target":2.45}'

# resolve the current winner
curl -s "https://api.bullmask.com/v1/artifacts/char-lm/winner?project=char-lm" \
  -H "Authorization: Bearer $ARWANDB_API_KEY"
```
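For callers who prefer Python without the SDK, the first cURL call translates roughly as follows. The body fields are copied from the cURL example; the helper name `set_goal_request` is hypothetical, and the request is only constructed here, not sent.

```python
import json
import os
import urllib.request

BASE_URL = os.environ.get("ARWANDB_BASE_URL", "https://api.bullmask.com")
API_KEY = os.environ.get("ARWANDB_API_KEY", "arwx_live_xxxxxxxxxxxx")  # placeholder

def set_goal_request(name, metric, lower_is_better, target=None):
    """Build the POST /v1/projects request shown in the cURL example."""
    body = {"name": name, "objective_metric": metric, "lower_is_better": lower_is_better}
    if target is not None:
        body["target"] = target
    return urllib.request.Request(
        BASE_URL.rstrip("/") + "/v1/projects",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = set_goal_request("char-lm", "val_bpb", True, target=2.45)
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req)
```

Building the request separately from sending it keeps the example safe to run offline and makes the payload easy to inspect.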
Environment variables
| Variable | Meaning |
|---|---|
| ARWANDB_BASE_URL | API origin. Default http://127.0.0.1:8090. |
| ARWANDB_API_KEY | Bearer token sent with every /v1/* request. |
| ARWANDB_PROJECT | Default project when not passed explicitly. |
| ARWANDB_MODE | online (default) or offline (spool only). |
| ARWANDB_SYSTEM_METRICS | Auto-collect system/* telemetry — on by default; off/0/false disables it. |
| ARWANDB_SYSTEM_INTERVAL | System sampling cadence in seconds (default 15, floor 1). |
| ARWANDB_WRITER_ID | Per-process writer id for distributed/multi-process runs sharing one run id (default random). |
| ARWANDB_SNAPSHOT_MAX_FILE | Per-file snapshot size cap (bytes, default 5 MiB). |
| ARWANDB_MAX_DIFF_BYTES | Cap on the auto-captured experiment diff (bytes, default 1 MiB); over it, a marker is stored instead. |
The WANDB_* equivalents are also read, so existing W&B scripts work with only the base URL changed.
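How these variables might be resolved is sketched below. The defaults and the off/0/false and floor-of-1 rules come from the table; the lookup order (ARWANDB_ first, then the WANDB_ name with the same suffix) and the function names are assumptions, since the source does not spell out precedence.

```python
def setting(env, name, default=None):
    """Look up ARWANDB_<name>, then WANDB_<name>, then fall back to the default."""
    for prefix in ("ARWANDB_", "WANDB_"):
        value = env.get(prefix + name)
        if value:
            return value
    return default

def system_config(env):
    """Parse the system-metrics switches as the table describes them."""
    raw = (env.get("ARWANDB_SYSTEM_METRICS") or "on").strip().lower()
    enabled = raw not in ("off", "0", "false")
    try:
        interval = float(env.get("ARWANDB_SYSTEM_INTERVAL", 15))
    except ValueError:
        interval = 15.0                  # unparseable value: keep the default
    return enabled, max(1.0, interval)   # floor of 1 second

env = {"WANDB_PROJECT": "legacy-proj", "ARWANDB_SYSTEM_INTERVAL": "0.2"}
project = setting(env, "PROJECT")
base_url = setting(env, "BASE_URL", "http://127.0.0.1:8090")
print(project, base_url, system_config(env))
```

Passing the environment in as a dict (for example `os.environ`) keeps the parsing easy to test without touching the real process environment.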
Errors & retries
- 401 — missing/invalid/disabled API key.
- 404 — unknown project, experiment, or artifact ref.
- 409 / dedup — re-logging identical content is an idempotent no-op, not an error.
- 5xx / network — the SDK spools locally and retries; transient failures don't lose data or stop training.
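If you call the HTTP API directly, you need to reproduce this policy yourself. The sketch below shows one way to do it: retry on 5xx and connection errors with exponential backoff and jitter, treat 409 as success, and fail fast on other client errors. The function names, attempt count, and delays are illustrative choices, not the SDK's actual values.

```python
import random
import time

def classify(status):
    """Map an HTTP status (or None for a network failure) to the next action."""
    if status is None or status >= 500:
        return "retry"       # network failure or server error
    if status == 409 or 200 <= status < 300:
        return "ok"          # 409 here means the content already exists
    return "fail"            # 401, 404, and other client errors

def send_with_retry(send, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call send() until it succeeds, fails permanently, or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        try:
            status = send()
        except OSError:
            status = None    # connection-level errors are OSError subclasses
        outcome = classify(status)
        if outcome != "retry":
            return outcome, attempt
        if attempt < max_attempts:
            # Exponential backoff with jitter: base, 2x, 4x, ... scaled by 0.5-1.0
            sleep(base_delay * 2 ** (attempt - 1) * random.uniform(0.5, 1.0))
    return "gave_up", max_attempts

replies = iter([503, None, 200])         # simulated server responses
result = send_with_retry(lambda: next(replies), sleep=lambda seconds: None)
print(result)
```

Injecting `send` and `sleep` keeps the policy testable without a server or real delays; in production `send` would wrap the actual HTTP call and return its status code.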
arwx — simple, ultra-efficient, reliable, reproducible, evaluatable.