arwx API

A small, content-addressed experiment tracker. One project carries a goal; every run, experiment, decision, and artifact hangs off it. The API is plain JSON over HTTPS — use the Python SDK or call it directly.

Base URL https://api.bullmask.com · Dashboard app.bullmask.com · Health GET /health (no auth)

Overview

Every object lives under a project, which holds the goal — an objective metric, a direction (minimize / maximize), and an optional target. Underneath the project:

Run — a process that streams metrics (loss curves, lr, etc.).
Session — one line of inquiry; a sequence of experiments.
Experiment — one tried change: a note (the flaw→fix story), a diff, and an optional code snapshot.
Decision — keep / drop verdict on an experiment, with the score it reached.
Artifact — a versioned, content-addressed output (model, code, dataset) with movable refs like winner and lineage back to the experiment that produced it.

Two ways in: the Python SDK (arwandb) wraps all of this and works offline-first; the raw HTTP API is what the SDK speaks, and you can use it from any language.

Authentication

Every /v1/* endpoint requires an API key sent as a bearer token. Keys are created and revoked from the dashboard (Connection panel) or via /v1/keys. The service is fail-closed: no valid, enabled key → 401.

Authorization: Bearer arwx_live_xxxxxxxxxxxx

Keep keys server-side. A key is a write credential for your project. Prefer one named, revocable key per machine/agent so you can cut off a single source without rotating everything.
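
As a minimal sketch of the bearer-token header described above, the following builds (but does not send) an authenticated request using only the standard library. The helper name build_request is hypothetical; the base URL, path, and key format are the ones shown on this page.

```python
import urllib.request

BASE_URL = "https://api.bullmask.com"  # base URL from this page


def build_request(path: str, api_key: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Return an unsent GET request for `path` carrying the bearer token."""
    req = urllib.request.Request(base_url.rstrip("/") + path, method="GET")
    req.add_header("Authorization", f"Bearer {api_key}")
    return req


req = build_request("/v1/projects", "arwx_live_xxxxxxxxxxxx")
print(req.full_url)
print(req.get_header("Authorization"))
```

Sending is left out on purpose. With urllib.request.urlopen(req), a missing, invalid, or disabled key would surface as urllib.error.HTTPError with code 401.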

Organizations & sign-in

arwx is multi-tenant. People sign in to the dashboard with email + password; machines and CI use API keys. Every account belongs to one or more organizations, and each org owns its own projects, runs, and keys — fully isolated from other orgs.

Signing up creates a new org. Registering makes a brand-new organization (named whatever you type) with you as its admin — it never auto-joins an existing org, since that would expose another tenant's data. So if your data lives in a different org than the one your signup created, have an admin of that org add you as a member (Admin → Members). Your active org — which data the dashboard shows — lives in the Connection panel (bottom-left), where you can switch orgs and sign out; the dashboard remembers your last-used org across logins. API keys are bound to one org at creation and ignore the switcher.

Quickstart (Python)

The SDK isn't on PyPI; it's served straight from this site. It's pure Python with zero dependencies (only the standard library), so installing is just a wheel fetch. It requires Python 3.10 or newer (requires-python >= 3.10).

# install the wheel hosted here (no PyPI, no extra deps)
pip install https://app.bullmask.com/sdk/arwandb-1.3.1-py3-none-any.whl

# …or always grab the latest (find-links index, no pinned version)
pip install --upgrade arwandb --find-links https://app.bullmask.com/sdk/ --no-deps

No pip? Just vendor it. Because the SDK has no dependencies, you can drop the package folder into your project and import arwandb with nothing to install:

curl -L https://app.bullmask.com/sdk/arwandb-1.3.1-py3-none-any.whl -o arwandb.whl
unzip -o arwandb.whl 'arwandb/*'   # leaves an ./arwandb package next to your code

# point the SDK at your server + key (once, in your shell)
export ARWANDB_BASE_URL=https://api.bullmask.com
export ARWANDB_API_KEY=arwx_live_xxxxxxxxxxxx
export ARWANDB_PROJECT=char-lm

import arwandb

run = arwandb.init(project="char-lm", config={"lr": 3e-4})
for step in range(1000):
    loss, bpb = train_step()   # placeholder: your own training step, returning two floats
    arwandb.log({"loss": loss, "val_bpb": bpb})   # batched + flushed in the background
arwandb.finish()

The SDK spools to disk first and uploads asynchronously on a background thread, so a network blip never stalls or crashes training. log() is a memory-only append — it never blocks on the network and never fsyncs on your training thread — so logging on every step costs essentially nothing. Set ARWANDB_MODE=offline to record without uploading, then arwandb.sync(path) later.

Built for long runs (1.2.0). The uploader holds one keep-alive connection and drains a whole batch over it (no TCP+TLS handshake per request), fsyncs periodically rather than per event, and compacts the on-disk spool as data is delivered, so the spool stays roughly the size of the undelivered backlog rather than the run's whole history. A persistently failing event is dead-lettered (recoverable) instead of wedging the queue. Net: a days-long run that logs every step stays cheap and bounded.

Logging metrics

arwandb.log(metrics, step=None) takes a flat dict of name → number. Steps auto-increment if you omit them. Under the hood the SDK batches points and posts them to /v1/ingest (or streams NDJSON to /v1/ingest/stream for large bursts).

Log every step — don't throttle. log() just appends points to an in-memory buffer (microseconds, no network, no disk wait); a background thread coalesces them into a few large batches (one request per ~500 points, flushed every flush_interval) and uploads over a kept-alive connection. So logging on every training step is one cheap stream of batches, not a request per step — and throttling to "every N steps" buys you nothing but coarser graphs. No need to spare the server; that's the SDK's job. Transient errors back off and retry; delivered points are exactly-once.
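
To make the buffering idea concrete, here is a toy sketch of an in-memory append plus a background thread that coalesces points into batches. It is not the SDK's implementation: the class name BatchingLogger, the send callback, the point field names, and the step auto-increment rule are all assumptions made for illustration.

```python
import threading
from typing import Callable


class BatchingLogger:
    """Hypothetical stand-in for the SDK's buffer; not its real implementation."""

    def __init__(self, send: Callable[[list], None], batch_size: int = 500,
                 flush_interval: float = 1.0):
        self._send = send
        self._batch_size = batch_size
        self._flush_interval = flush_interval
        self._buf: list = []
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._step = 0
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def log(self, metrics: dict, step=None) -> None:
        """Append points to memory only; never touches the network."""
        with self._lock:
            if step is None:
                step = self._step          # auto-increment when omitted
            self._step = step + 1
            for name, value in metrics.items():
                # point field names are assumptions, not the wire format
                self._buf.append({"step": step, "name": name, "value": value})

    def _drain(self) -> None:
        with self._lock:
            pending, self._buf = self._buf, []
        for i in range(0, len(pending), self._batch_size):
            self._send(pending[i:i + self._batch_size])   # one call per batch

    def _run(self) -> None:
        # wake up every flush_interval seconds until finish() is called
        while not self._stop.wait(self._flush_interval):
            self._drain()

    def finish(self) -> None:
        self._stop.set()
        self._thread.join()
        self._drain()                      # deliver whatever is left


batches = []
logger = BatchingLogger(batches.append, batch_size=500, flush_interval=0.05)
for _ in range(1200):
    logger.log({"loss": 0.5})
logger.finish()
print(sum(len(b) for b in batches), max(len(b) for b in batches))
```

The training loop only ever pays for a lock and a list append; how many batches result depends on timing, but none exceeds batch_size.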

It auto-charts — zero setup. Open the project on the dashboard → Metrics: every metric key you log gets its own live panel (runs overlaid), plus a runs table showing each run's config (hyperparameters) and latest values. No goal, experiment, or decision needed — if you log() it, it shows up. (The optional autoresearch layer below adds the keep/drop frontier on top.)

Verify it, don't just trust it. Logging is exactly-once by construction (durable disk spool → idempotent sink), and you can prove it per run: GET /v1/runs/:id/integrity cross-checks the durability ledger against the metric store and returns a receipt — points_committed, points_stored, duplicates_collapsed, lost (0), intact — which the dashboard shows as a one-line badge above the panels. A NaN/inf or non-numeric value is skipped, never raised, so a stray metric can't crash training.
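
The skip-don't-raise rule for bad values can be sketched as a small filter. This is an illustration of the behavior described above, not the SDK's code; the function name clean_metrics is hypothetical, and treating booleans as non-numeric is an assumption.

```python
import math
from numbers import Real


def clean_metrics(metrics: dict) -> dict:
    """Keep only finite real numbers; skip everything else without raising."""
    kept = {}
    for name, value in metrics.items():
        if isinstance(value, bool) or not isinstance(value, Real):
            continue  # non-numeric (excluding bool is an assumption)
        if not math.isfinite(value):
            continue  # NaN or +/-inf
        kept[name] = float(value)
    return kept


print(clean_metrics({"loss": 0.42, "grad_norm": float("nan"),
                     "lr": float("inf"), "tag": "warmup"}))
# -> {'loss': 0.42}
```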

Distributed / multi-process runs. If several processes log to the same run id (e.g. distributed data-parallel training), give each one a stable writer id so their points can't collide in the exactly-once ledger: arwandb.init(..., writer_id="rank0") or set ARWANDB_WRITER_ID=rank$RANK per process. A random id is used if you don't set one, and single-process runs need nothing. (Requires SDK ≥ 1.3.1.)
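
One way to derive a stable writer id per process is from the launcher's rank variable. The sketch below assumes a RANK environment variable (as set by launchers such as torchrun) and falls back to "0"; the helper name writer_id_from_env is hypothetical.

```python
import os


def writer_id_from_env(env=os.environ) -> str:
    """Build a stable per-process writer id from the launcher's RANK variable."""
    # Defaulting to rank 0 when RANK is unset is an assumption.
    return f"rank{env.get('RANK', '0')}"


print(writer_id_from_env({"RANK": "3"}))  # -> rank3

# Then, in each process (writer_id is the argument named in the text above):
# arwandb.init(project="char-lm", writer_id=writer_id_from_env())
```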

Call	What it does
init(project, name?, config?)	Start a run; registers it via `POST /v1/runs`.
log({...}, step?)	Append metric points (background upload).
finish(status="finished")	Flush and close the run.
sync(path?)	Upload a spool recorded in offline mode.

System metrics (automatic)

From the moment you init(), the SDK auto-collects host telemetry on a background thread with no extra code, just like W&B. It covers NVIDIA GPU utilization, memory, power, and temperature (via nvidia-smi when present), plus host CPU, memory, disk usage + I/O, and network, all logged under system/* keys. The dashboard groups them into their own collapsible System section below your training charts.

Zero-dependency, zero training cost. Unlike W&B (which forks a wandb-service daemon and pulls psutil/pynvml), arwx reads /proc + os.statvfs directly and shells out to nvidia-smi only if it exists — no extra package, no second process. Sampling runs on its own thread on a separate step axis, so it never perturbs your training step numbering and never touches the hot path. A missing source (no GPU, an iGPU, a sandboxed /proc) is silently skipped, never an error.

# on by default; tune or disable per run…
run = arwandb.init(project="char-lm", system_interval=30)   # sample every 30s (default 15)
run = arwandb.init(project="char-lm", system_metrics=False)  # …or turn it off

# …or from the environment
export ARWANDB_SYSTEM_METRICS=off
export ARWANDB_SYSTEM_INTERVAL=30
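
For a feel of how dependency-free sampling from /proc and os.statvfs can work, here is a rough sketch. It is not the SDK's collector: the function name sample_host and the exact key names under system/ are assumptions, and only two readings are shown.

```python
import os


def sample_host(path: str = "/") -> dict:
    """Collect a few host readings; any source that is unavailable is skipped."""
    out = {}
    try:
        st = os.statvfs(path)                      # Unix only
        total = st.f_blocks * st.f_frsize
        free = st.f_bavail * st.f_frsize
        if total > 0:
            out["system/disk_used_pct"] = 100.0 * (total - free) / total
    except (AttributeError, OSError):
        pass  # no statvfs on this platform, or the path is not readable
    try:
        fields = {}
        with open("/proc/meminfo") as fh:
            for line in fh:
                key, _, rest = line.partition(":")
                fields[key] = float(rest.split()[0])   # sizes are reported in kB
        total_kb = fields["MemTotal"]
        out["system/mem_used_pct"] = (
            100.0 * (total_kb - fields["MemAvailable"]) / total_kb
        )
    except (OSError, KeyError, ValueError, IndexError, ZeroDivisionError):
        pass  # sandboxed or missing /proc
    return out


print(sample_host())
```

Each source sits in its own try block, so a machine without /proc still reports disk usage, and a failure never propagates to the caller.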

The autoresearch loop

The arwandb.autoresearch module models a research story: a goal, sessions of experiments, and keep/drop decisions that build the running-best frontier you see on the dashboard.

from arwandb import autoresearch as ar

# 1. define the objective (minimize val_bpb toward 2.45)
ar.goal(metric="val_bpb", lower_is_better=True, target=2.45)

# 2. open a line of inquiry
ar.session(tag="tune-warmup")

# 3. record a tried change. Inside a git repo, commit + parent + the
#    UNCOMMITTED working-tree diff are captured automatically (the code
#    that actually ran) — no arguments needed. note = the flaw→fix story.
exp = ar.experiment(
    note="lr too hot early; added 200-step warmup",
    snapshot=".",                 # optional: full code tree, content-addressed
)

# 4. judge it — the score feeds the frontier
ar.decision(experiment_id=exp["id"], status="keep", score=2.41, metric="val_bpb")

Function	Purpose
goal(metric, lower_is_better, target?)	Set/update the project objective.
session(tag?, branch?)	Start a session; becomes the active one.
experiment(commit?, note?, diff?, snapshot?)	Record one change under the active session. In a git repo, `commit` / parent / uncommitted `diff` auto-capture (pass to override; `diff=""` suppresses). Stamps a `repro_key` = hash(commit, diff, config).
decision(experiment_id, status, score, metric?)	Keep/drop verdict (`status="keep"`) + score.
progress(session_id?) / frontier(session_id?)	Read the running-best staircase.
context(project?, limit=5)	Compact summary for an agent's next step.
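
The running-best staircase can be pictured as a best-so-far scan over scores in the order they were recorded. The sketch below is only that picture; how the server actually treats dropped experiments or ties is not specified here, and the function name running_best and all scores except 2.41 are made up for illustration.

```python
def running_best(scores, lower_is_better=True):
    """Return the best-so-far value after each score, in order."""
    frontier = []
    best = None
    for score in scores:
        improved = best is None or (score < best if lower_is_better else score > best)
        if improved:
            best = score
        frontier.append(best)
    return frontier


print(running_best([2.60, 2.52, 2.55, 2.41]))
# -> [2.6, 2.52, 2.52, 2.41]
```

With the goal from the example above (minimize val_bpb toward 2.45), the last step of this staircase is the first one to pass the target.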

Artifacts & lineage

An artifact is a named, versioned output addressed like a container image: name:ref. Three orthogonal primitives, plus one built-in ref:

Version — immutable, auto-numbered from v1. Identical bytes dedupe to the same version (content-addressed).
Ref — a movable name (winner, production, latest) pointing at a version.
Lineage — produced_by ties a version to the experiment that made it.
latest is a built-in ref that always resolves to the highest version.

from arwandb import autoresearch as ar

# log a new version from files; content-addressed, so re-logging same bytes is a no-op
v = ar.artifact("char-lm", "out/model", type="model", produced_by=exp["id"])

ar.promote("char-lm", "winner", version=v["version"])   # move the ref
ar.get_artifact("char-lm", "winner")                     # resolve ref → version
ar.artifacts(type="model")                              # list

The dashboard shows these under the Artifacts lens of each project: every version, its score, and which one carries the winner badge.
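
To see how versions, refs, lineage, and latest fit together, here is a toy in-memory model of a single artifact. It is a sketch under assumptions, not the server's data model: the class ArtifactStore, the use of SHA-256 over raw bytes, and the rule that latest cannot be moved by hand are all illustrative choices.

```python
import hashlib


class ArtifactStore:
    """Toy in-memory model of one artifact's versions and refs."""

    def __init__(self):
        self._digests = []      # index i holds the digest of version i + 1
        self._refs = {}         # ref name -> version number
        self.produced_by = {}   # version number -> experiment id (lineage)

    def log(self, data: bytes, produced_by=None) -> int:
        digest = hashlib.sha256(data).hexdigest()
        if digest in self._digests:               # identical bytes: same version
            return self._digests.index(digest) + 1
        self._digests.append(digest)
        version = len(self._digests)              # auto-numbered from 1
        if produced_by is not None:
            self.produced_by[version] = produced_by
        return version

    def promote(self, ref: str, version: int) -> None:
        if ref == "latest":
            raise ValueError("latest is built in (assumed not movable)")
        if not 1 <= version <= len(self._digests):
            raise KeyError(f"unknown version v{version}")
        self._refs[ref] = version

    def resolve(self, ref: str) -> int:
        if ref == "latest":
            if not self._digests:
                raise KeyError("no versions logged yet")
            return len(self._digests)             # always the highest version
        return self._refs[ref]


store = ArtifactStore()
v1 = store.log(b"weights-a", produced_by="exp-1")
v2 = store.log(b"weights-b", produced_by="exp-2")
assert store.log(b"weights-a") == v1              # re-logging is a no-op
store.promote("winner", v1)
print(store.resolve("winner"), store.resolve("latest"))  # -> 1 2
```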

Code snapshots

Pass snapshot="." (a dir or file list) to experiment(...) and the SDK hashes every file, asks the server which blobs are missing (POST /v1/blobs/missing), uploads only those, and attaches a tree manifest. That makes any experiment reproducible without storing redundant bytes. Caches, datasets, and model weights are ignored by default; a snapshot failure warns but never breaks the experiment.
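
The hash, ask, upload-only-missing flow can be sketched locally as follows. This illustrates the dedup idea only: the function names are hypothetical, SHA-256 is an assumed hash choice (the API tables just say "SHAs"), and the server's answer is simulated with a plain set rather than a call to POST /v1/blobs/missing.

```python
import hashlib
import os
import tempfile


def tree_manifest(root: str) -> dict:
    """Map each file's path (relative to root) to the SHA-256 of its bytes."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            full = os.path.join(dirpath, name)
            with open(full, "rb") as fh:
                digest = hashlib.sha256(fh.read()).hexdigest()
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            manifest[rel] = digest
    return manifest


def blobs_to_upload(manifest: dict, server_has: set) -> set:
    """Hashes the server lacks; duplicate files collapse to a single upload."""
    return set(manifest.values()) - server_has


with tempfile.TemporaryDirectory() as root:
    files = [("train.py", "print('hi')\n"), ("copy.py", "print('hi')\n"),
             ("util.py", "X = 1\n")]
    for name, text in files:
        with open(os.path.join(root, name), "w") as fh:
            fh.write(text)
    manifest = tree_manifest(root)
    # pretend the server already stores util.py's blob
    missing = blobs_to_upload(manifest, server_has={manifest["util.py"]})
    print(len(manifest), len(missing))  # -> 3 1
```

Three files produce only one upload here: util.py is already known, and the two identical files share a single blob.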

HTTP API reference

All paths are under the base URL and require the Authorization header (except /health). Bodies and responses are JSON.

Runs & metrics

Endpoint	Description
POST /v1/runs	Create a run.
POST /v1/runs/:id/finish	Mark a run finished.
GET /v1/runs/:id/metrics	Full run history (W&B `scan_history`): `?keys=`, `min_step`, `max_step`, `limit`.
GET /v1/runs/:id/integrity	Integrity receipt: `points_committed` (durability ledger) vs `points_stored` (deduped), `duplicates_collapsed`, `lost`, `intact`, and the run's `commit`/`diff_sha` fingerprint.
POST /v1/ingest	Batch-append metric points (JSON).
POST /v1/ingest/stream	Stream points as NDJSON (no body limit).
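
NDJSON means one JSON object per line, which lets a client stream points without building one large array. The sketch below shows only the encoding; the point field names (run_id, step, name, value) are assumptions, since the request schema is not given on this page.

```python
import json


def to_ndjson(points) -> bytes:
    """Encode an iterable of point dicts as newline-delimited JSON."""
    return "".join(
        json.dumps(p, separators=(",", ":")) + "\n" for p in points
    ).encode("utf-8")


# Field names below are hypothetical, not the documented schema.
points = [
    {"run_id": "r1", "step": 0, "name": "loss", "value": 2.9},
    {"run_id": "r1", "step": 1, "name": "loss", "value": 2.7},
]
body = to_ndjson(points)
print(body.decode("utf-8"), end="")
```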

Projects & goals

Endpoint	Description
POST /v1/projects	Create/update a project & its goal.
GET /v1/projects	List projects.
GET /v1/projects/:name	One project.
GET /v1/projects/:name/sessions	Sessions in a project.
GET /v1/projects/:name/progress	Running-best frontier (powers the Experiments chart).
GET /v1/projects/:name/runs	Runs in a project (id, name, status, config) — the runs table.
GET /v1/projects/:name/charts	Auto-chart data (W&B `history`): downsampled per-metric series for every run + each run's latest value. `?keys=`, `samples` (def 500), `since_step`, `max_step`.

Sessions & experiments

Endpoint	Description
POST /v1/sessions	Open a session.
POST /v1/experiments	Record an experiment.
GET /v1/experiments/:id	Experiment detail.
GET /v1/experiments/:id/diff	Stored diff text.
POST /v1/experiments/:id/decision	Record keep/drop + score.
GET /v1/experiments/:id/tree · PUT /v1/experiments/:id/tree	Read / set the code-snapshot tree.
GET /v1/experiments/:a/diff/:b	Tree diff between two experiments.
GET /v1/sessions/:id/progress	Session progress (`.svg` variant too).
GET /v1/sessions/:id/frontier	Best-so-far frontier.

Blobs & files

Endpoint	Description
POST /v1/blobs/missing	Which of these SHAs are absent? (dedup check)
POST /v1/blobs	Upload one blob's bytes.
GET /v1/blobs/:sha	Fetch a blob by content hash.
POST /v1/files · POST /v1/files/upload	Register / upload an external file artifact.

Artifacts

Endpoint	Description
POST /v1/artifacts/log	Log a version (dedupes on tree hash).
GET /v1/artifacts	List artifacts (filter `?type=`).
GET /v1/artifacts/:name	Versions + refs for one artifact.
PUT /v1/artifacts/:name/refs/:ref	Promote: move a ref to a version.
GET /v1/artifacts/:name/:ref	Resolve `name:ref` → version (`latest` built in).

Agent & keys

Endpoint	Description
GET /v1/agent/context	Compact project state for an agent.
GET /v1/bootstrap	One call that hydrates the whole dashboard.
POST /v1/keys · GET /v1/keys	Create / list API keys.
DELETE /v1/keys/:id	Revoke a key.
GET /health	Liveness probe (public, no auth).

cURL example

# set the goal for a project
curl -s https://api.bullmask.com/v1/projects \
  -H "Authorization: Bearer $ARWANDB_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"name":"char-lm","objective_metric":"val_bpb","lower_is_better":true,"target":2.45}'

# resolve the current winner
curl -s "https://api.bullmask.com/v1/artifacts/char-lm/winner?project=char-lm" \
  -H "Authorization: Bearer $ARWANDB_API_KEY"

Environment variables

Variable	Meaning
ARWANDB_BASE_URL	API origin. Default `http://127.0.0.1:8090`.
ARWANDB_API_KEY	Bearer token, required on every /v1/* request.
ARWANDB_PROJECT	Default project when not passed explicitly.
ARWANDB_MODE	`online` (default) or `offline` (spool only).
ARWANDB_SYSTEM_METRICS	Auto-collect `system/*` telemetry — on by default; `off`/`0`/`false` disables it.
ARWANDB_SYSTEM_INTERVAL	System sampling cadence in seconds (default 15, floor 1).
ARWANDB_WRITER_ID	Per-process writer id for distributed/multi-process runs sharing one run id (default random).
ARWANDB_SNAPSHOT_MAX_FILE	Per-file snapshot size cap (bytes, default 5 MiB).
ARWANDB_MAX_DIFF_BYTES	Cap on the auto-captured experiment diff (bytes, default 1 MiB); over it, a marker is stored instead.

The WANDB_* equivalents are also read, so existing W&B scripts work with only the base URL changed.
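
The defaults in the table can be read as a small resolution function. The sketch below is not the SDK's loader: the names load_settings and _get are hypothetical, the ARWANDB-before-WANDB precedence is an assumption, and the WANDB_ fallback is limited to four variables that W&B itself defines (WANDB_BASE_URL, WANDB_API_KEY, WANDB_PROJECT, WANDB_MODE).

```python
import os


def _get(env, name, default=None, wandb_fallback=False):
    """Read ARWANDB_<name>, optionally falling back to WANDB_<name>."""
    value = env.get(f"ARWANDB_{name}")
    if not value and wandb_fallback:
        value = env.get(f"WANDB_{name}")   # precedence is an assumption
    return value or default


def load_settings(env=os.environ) -> dict:
    try:
        interval = max(1.0, float(_get(env, "SYSTEM_INTERVAL", "15")))  # floor 1
    except ValueError:
        interval = 15.0                    # fallback on bad input is an assumption
    metrics_flag = _get(env, "SYSTEM_METRICS", "on").strip().lower()
    return {
        "base_url": _get(env, "BASE_URL", "http://127.0.0.1:8090", wandb_fallback=True),
        "api_key": _get(env, "API_KEY", wandb_fallback=True),
        "project": _get(env, "PROJECT", wandb_fallback=True),
        "mode": _get(env, "MODE", "online", wandb_fallback=True),
        "system_metrics": metrics_flag not in {"off", "0", "false"},
        "system_interval": interval,
    }


s = load_settings({"WANDB_API_KEY": "k", "ARWANDB_SYSTEM_METRICS": "off",
                   "ARWANDB_SYSTEM_INTERVAL": "0.2"})
print(s["base_url"], s["system_metrics"], s["system_interval"])
# -> http://127.0.0.1:8090 False 1.0
```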

Errors & retries

401 — missing/invalid/disabled API key.
404 — unknown project, experiment, or artifact ref.
409 / dedup — re-logging identical content is an idempotent no-op, not an error.
5xx / network — the SDK spools locally and retries; transient failures don't lose data or stop training.
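
If you call the HTTP API directly rather than through the SDK, a retry wrapper along these lines is one way to handle the 5xx / network case. The policy shown (exponential backoff with jitter, five attempts) is an assumption, not the SDK's documented schedule, and TransientError and send_with_retry are hypothetical names.

```python
import random
import time


class TransientError(Exception):
    """Stands in for a 5xx response or a dropped connection."""


def send_with_retry(send, batch, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call send(batch), backing off exponentially with jitter on transient failures."""
    for attempt in range(max_attempts):
        try:
            return send(batch)
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: the caller keeps the batch for later
            sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.0))


calls = {"n": 0}


def flaky(batch):
    """Fails twice, then succeeds; simulates a brief outage."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("503")
    return len(batch)


delays = []
print(send_with_retry(flaky, [1, 2, 3], sleep=delays.append), len(delays))  # -> 3 2
```

Passing sleep as a parameter keeps the example fast and makes the delays observable; real code would leave the default time.sleep in place.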

arwx: simple, ultra-efficient, reliable, reproducible, evaluatable.