arwx API

A small, content-addressed experiment tracker. One project carries a goal; every run, experiment, decision, and artifact hangs off it. The API is plain JSON over HTTPS — use the Python SDK or call it directly.

Base URL  https://api.bullmask.com  ·  Dashboard app.bullmask.com  ·  Health GET /health (no auth)

Overview

Every object lives under a project, which holds the goal: an objective metric, a direction (minimize / maximize), and an optional target. Underneath the project sit its runs, sessions, experiments, decisions, and artifacts.

Two ways in: the Python SDK (arwandb) wraps all of this and works offline-first; the raw HTTP API is what the SDK speaks, and you can use it from any language.

Authentication

Every /v1/* endpoint requires an API key sent as a bearer token. Keys are created and revoked from the dashboard (Connection panel) or via /v1/keys. The service is fail-closed: no valid, enabled key → 401.

Authorization: Bearer arwx_live_xxxxxxxxxxxx
Keep keys server-side. A key is a write credential for your project. Prefer one named, revocable key per machine/agent so you can cut off a single source without rotating everything.
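
As a minimal sketch of the bearer-token scheme above, the helper below builds (but does not send) an authenticated request using only the standard library. The helper name authed_request and the client-side refusal when no key is set are illustrative assumptions, not part of the SDK.

```python
import os
import urllib.request
from typing import Optional

BASE_URL = "https://api.bullmask.com"

def authed_request(path: str, api_key: Optional[str] = None) -> urllib.request.Request:
    """Build, but do not send, a request that carries the bearer token."""
    key = api_key or os.environ.get("ARWANDB_API_KEY")
    if not key:
        # Hypothetical client-side guard: without a key the server would answer 401 anyway.
        raise RuntimeError("ARWANDB_API_KEY is not set")
    req = urllib.request.Request(BASE_URL + path)
    req.add_header("Authorization", f"Bearer {key}")
    return req

req = authed_request("/v1/projects", api_key="arwx_live_xxxxxxxxxxxx")
print(req.full_url, req.get_header("Authorization"))
```

Passing the result to urllib.request.urlopen(req) would perform the call; a 401 then surfaces as urllib.error.HTTPError.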

Organizations & sign-in

arwx is multi-tenant. People sign in to the dashboard with email + password; machines and CI use API keys. Every account belongs to one or more organizations, and each org owns its own projects, runs, and keys — fully isolated from other orgs.

Signing up creates a new org. Registering makes a brand-new organization (named whatever you type) with you as its admin — it never auto-joins an existing org, since that would expose another tenant's data. So if your data lives in a different org than the one your signup created, have an admin of that org add you as a member (Admin → Members). Your active org — which data the dashboard shows — lives in the Connection panel (bottom-left), where you can switch orgs and sign out; the dashboard remembers your last-used org across logins. API keys are bound to one org at creation and ignore the switcher.

Quickstart (Python)

The SDK isn't on PyPI; it's served straight from this site. It's pure Python with zero dependencies (only the standard library), so installing it is just a wheel fetch. It requires Python >= 3.10.

# install the wheel hosted here (no PyPI, no extra deps)
pip install https://app.bullmask.com/sdk/arwandb-1.3.1-py3-none-any.whl

# …or always grab the latest (find-links index, no pinned version)
pip install --upgrade arwandb --find-links https://app.bullmask.com/sdk/ --no-deps
No pip? Just vendor it. Because the SDK has no dependencies, you can drop the package folder into your project and import arwandb with nothing to install:
curl -L https://app.bullmask.com/sdk/arwandb-1.3.1-py3-none-any.whl -o arwandb.whl
unzip -o arwandb.whl 'arwandb/*'   # leaves an ./arwandb package next to your code
# point the SDK at your server + key (once, in your shell)
export ARWANDB_BASE_URL=https://api.bullmask.com
export ARWANDB_API_KEY=arwx_live_xxxxxxxxxxxx
export ARWANDB_PROJECT=char-lm
import arwandb

run = arwandb.init(project="char-lm", config={"lr": 3e-4})
for step in range(1000):
    loss, bpb = 0.0, 0.0   # placeholders: use the values your training step produces
    arwandb.log({"loss": loss, "val_bpb": bpb})   # batched + flushed in the background
arwandb.finish()

The SDK spools to disk first and uploads asynchronously on a background thread, so a network blip never stalls or crashes training. log() is a memory-only append — it never blocks on the network and never fsyncs on your training thread — so logging on every step costs essentially nothing. Set ARWANDB_MODE=offline to record without uploading, then arwandb.sync(path) later.

Built for long runs (1.2.0). The uploader holds one keep-alive connection and drains a whole batch over it (no TCP+TLS handshake per request), fsyncs periodically rather than per event, and compacts the on-disk spool as data is delivered — so the spool stays ~the undelivered backlog, not the run's whole history. A persistently-failing event is dead-lettered (recoverable) instead of wedging the queue. Net: a days-long run logging every step is cheap and bounded.
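
The dead-lettering idea can be illustrated with a small sketch: an event that keeps failing is set aside so the events behind it still go out. This is not the SDK's uploader; the function names, the max_attempts limit, and the ok flag on events are assumptions made for the illustration, and a real uploader would back off between attempts.

```python
from collections import deque
from typing import Callable

def drain(queue: deque, send: Callable[[dict], bool], max_attempts: int = 3):
    """Deliver events in order; park one that keeps failing instead of blocking the rest."""
    delivered, dead_letter = [], []
    while queue:
        event = queue.popleft()
        for _ in range(max_attempts):
            if send(event):
                delivered.append(event)
                break
        else:
            # Every attempt failed: keep the event recoverable, but move on.
            dead_letter.append(event)
    return delivered, dead_letter

def flaky_send(event: dict) -> bool:
    # Hypothetical transport: fails whenever the event carries ok=False.
    return event.get("ok", True)

events = deque([{"step": 0}, {"step": 1, "ok": False}, {"step": 2}])
delivered, dead = drain(events, flaky_send)
print([e["step"] for e in delivered], [e["step"] for e in dead])   # [0, 2] [1]
```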

Logging metrics

arwandb.log(metrics, step=None) takes a flat dict of name → number. Steps auto-increment if you omit them. Under the hood the SDK batches points and posts them to /v1/ingest (or streams NDJSON to /v1/ingest/stream for large bursts).
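
For callers speaking raw HTTP, one way to prepare an NDJSON body is sketched below: one JSON object per line, with non-numeric and NaN/inf values dropped before sending. The field names (run_id, step, key, value) are assumptions for illustration; the wire schema of /v1/ingest/stream is not specified in this document.

```python
import json
import math

def to_ndjson(run_id: str, metrics: dict, step: int) -> str:
    """Serialize one log() call as newline-delimited JSON, one point per line."""
    lines = []
    for key, value in metrics.items():
        if not isinstance(value, (int, float)):
            continue                      # non-numeric: skip rather than raise
        if not math.isfinite(value):
            continue                      # NaN / inf: skip rather than raise
        # Hypothetical point shape; check the real schema before relying on it.
        lines.append(json.dumps({"run_id": run_id, "step": step, "key": key, "value": value}))
    return "\n".join(lines) + ("\n" if lines else "")

payload = to_ndjson("run-1", {"loss": 0.5, "bad": float("nan"), "tag": "x", "val_bpb": 2.41}, step=7)
print(payload, end="")
```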

Log every step — don't throttle. log() just appends points to an in-memory buffer (microseconds, no network, no disk wait); a background thread coalesces them into a few large batches (one request per ~500 points, flushed every flush_interval) and uploads over a kept-alive connection. So logging on every training step is one cheap stream of batches, not a request per step — and throttling to "every N steps" buys you nothing but coarser graphs. No need to spare the server; that's the SDK's job. Transient errors back off and retry; delivered points are exactly-once.
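
A rough sketch of the coalescing idea: log() only appends to an in-memory list under a lock, and a separate drain step cuts the backlog into batches of at most 500 points. The class and method names are hypothetical, and the real SDK drains on a background thread on a timer, which this sketch leaves out.

```python
import threading

class MetricBuffer:
    """In-memory buffer: cheap appends on the training thread, batching on drain."""

    def __init__(self, max_batch: int = 500):
        self._lock = threading.Lock()
        self._points = []
        self.max_batch = max_batch

    def log(self, metrics: dict, step: int) -> None:
        # Hot path: a list append under a lock, no network and no disk access.
        with self._lock:
            self._points.extend((step, key, value) for key, value in metrics.items())

    def take_batches(self) -> list:
        # Uploader side: swap the backlog out, then slice it into request-sized batches.
        with self._lock:
            points, self._points = self._points, []
        return [points[i:i + self.max_batch] for i in range(0, len(points), self.max_batch)]

buf = MetricBuffer()
for step in range(600):
    buf.log({"loss": 1.0 / (step + 1), "val_bpb": 2.5}, step)
batches = buf.take_batches()
print([len(b) for b in batches])   # [500, 500, 200]
```

Here 600 steps with two metrics each become three requests rather than 600.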
It auto-charts, zero setup. Open the project on the dashboard and go to Metrics: every metric key you log gets its own live panel (runs overlaid), plus a runs table showing each run's config (hyperparameters) and latest values. No goal, experiment, or decision is needed; if you log() it, it shows up. (The optional autoresearch layer below adds the keep/drop frontier on top.)
Verify it, don't just trust it. Logging is exactly-once by construction (durable disk spool → idempotent sink), and you can prove it per run: GET /v1/runs/:id/integrity cross-checks the durability ledger against the metric store and returns a receipt — points_committed, points_stored, duplicates_collapsed, lost (0), intact — which the dashboard shows as a one-line badge above the panels. A NaN/inf or non-numeric value is skipped, never raised, so a stray metric can't crash training.
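
To make the receipt fields concrete, here is one plausible way they could relate to each other. This is an interpretation, not the server's implementation: it assumes the ledger may record the same point more than once after a retry, and that a point is identified by a (writer_id, step, key) tuple.

```python
def integrity_receipt(committed: list, stored: set) -> dict:
    """committed: point ids in the durability ledger (repeats allowed);
    stored: point ids present in the metric store."""
    unique = set(committed)
    lost = len(unique - stored)          # committed by the client but absent from the store
    return {
        "points_committed": len(committed),
        "points_stored": len(stored),
        "duplicates_collapsed": len(committed) - len(unique),
        "lost": lost,
        "intact": lost == 0,
    }

ledger = [("rank0", 0, "loss"), ("rank0", 0, "loss"), ("rank0", 1, "loss")]   # one retried point
store = {("rank0", 0, "loss"), ("rank0", 1, "loss")}
print(integrity_receipt(ledger, store))
```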
Distributed / multi-process runs. If several processes log to the same run id (e.g. distributed data-parallel training), give each one a stable writer id so their points can't collide in the exactly-once ledger: arwandb.init(..., writer_id="rank0") or set ARWANDB_WRITER_ID=rank$RANK per process. A random id is used if you don't set one, and single-process runs need nothing. (Requires SDK ≥ 1.3.1.)
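
A small sketch of deriving the writer id in Python rather than in the shell. It assumes the launcher exports RANK (as torchrun does) and that the variable is set before arwandb.init() runs; the helper name is hypothetical.

```python
import os

def writer_id_for_rank(environ=os.environ) -> str:
    """Derive a stable per-process writer id from the launcher's RANK variable."""
    return f"rank{environ.get('RANK', '0')}"

# Only set it when the caller has not already chosen an id.
os.environ.setdefault("ARWANDB_WRITER_ID", writer_id_for_rank())
print(os.environ["ARWANDB_WRITER_ID"])
```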
| Call | What it does |
| --- | --- |
| init(project, name?, config?) | Start a run; registers it via POST /v1/runs. |
| log({...}, step?) | Append metric points (background upload). |
| finish(status="finished") | Flush and close the run. |
| sync(path?) | Upload a spool recorded in offline mode. |

System metrics (automatic)

From the moment you init(), the SDK auto-collects host telemetry on a background thread, with no code changes, just like W&B. It samples NVIDIA GPU utilization, memory, power, and temperature (via nvidia-smi when present), plus host CPU, memory, disk usage + I/O, and network, and logs them under system/* keys. The dashboard groups them into their own collapsible System section below your training charts.

Zero-dependency, zero training cost. Unlike W&B (which forks a wandb-service daemon and pulls psutil/pynvml), arwx reads /proc + os.statvfs directly and shells out to nvidia-smi only if it exists — no extra package, no second process. Sampling runs on its own thread on a separate step axis, so it never perturbs your training step numbering and never touches the hot path. A missing source (no GPU, an iGPU, a sandboxed /proc) is silently skipped, never an error.
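
The approach described above (read /proc and os.statvfs directly, skip whatever is missing) might look roughly like this. The metric key names under system/ are invented for the sketch, and the SDK's real sampler covers more sources than the two shown.

```python
import os
import shutil

def sample_system(path: str = "/") -> dict:
    """Collect a few host metrics with the standard library only; skip missing sources."""
    sample = {}
    try:
        st = os.statvfs(path)
        total = st.f_blocks * st.f_frsize
        sample["system/disk_total_bytes"] = total                       # hypothetical key
        sample["system/disk_used_bytes"] = total - st.f_bfree * st.f_frsize
    except (AttributeError, OSError):
        pass   # no statvfs on this platform, or the path is unreadable
    try:
        with open("/proc/meminfo") as fh:
            fields = dict(line.split(":", 1) for line in fh if ":" in line)
        sample["system/mem_total_kb"] = int(fields["MemTotal"].split()[0])
        sample["system/mem_available_kb"] = int(fields["MemAvailable"].split()[0])
    except (OSError, KeyError, ValueError):
        pass   # sandboxed or non-Linux host: silently skipped
    return sample

def has_nvidia_smi() -> bool:
    # GPU sampling would shell out only when the binary is actually on PATH.
    return shutil.which("nvidia-smi") is not None

print(sample_system(), has_nvidia_smi())
```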
# on by default; tune or disable per run…
run = arwandb.init(project="char-lm", system_interval=30)   # sample every 30s (default 15)
run = arwandb.init(project="char-lm", system_metrics=False)  # …or turn it off

# …or from the environment
export ARWANDB_SYSTEM_METRICS=off
export ARWANDB_SYSTEM_INTERVAL=30

The autoresearch loop

The arwandb.autoresearch module models a research story: a goal, sessions of experiments, and keep/drop decisions that build the running-best frontier you see on the dashboard.

from arwandb import autoresearch as ar

# 1. define the objective (minimize val_bpb toward 2.45)
ar.goal(metric="val_bpb", lower_is_better=True, target=2.45)

# 2. open a line of inquiry
ar.session(tag="tune-warmup")

# 3. record a tried change. Inside a git repo, commit + parent + the
#    UNCOMMITTED working-tree diff are captured automatically (the code
#    that actually ran) — no arguments needed. note = the flaw→fix story.
exp = ar.experiment(
    note="lr too hot early; added 200-step warmup",
    snapshot=".",                 # optional: full code tree, content-addressed
)

# 4. judge it — the score feeds the frontier
ar.decision(experiment_id=exp["id"], status="keep", score=2.41, metric="val_bpb")
| Function | Purpose |
| --- | --- |
| goal(metric, lower_is_better, target?) | Set/update the project objective. |
| session(tag?, branch?) | Start a session; becomes the active one. |
| experiment(commit?, note?, diff?, snapshot?) | Record one change under the active session. In a git repo, commit / parent / uncommitted diff auto-capture (pass to override; diff="" suppresses). Stamps a repro_key = hash(commit, diff, config). |
| decision(experiment_id, status, score, metric?) | Keep/drop verdict (status="keep") + score. |
| progress(session_id?) / frontier(session_id?) | Read the running-best staircase. |
| context(project?, limit=5) | Compact summary for an agent's next step. |
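
The running-best staircase that progress() and frontier() read can be sketched locally as follows. This assumes that only kept experiments which improve on the best score so far add a step; the server's exact rule is not documented here, and the dict shape of a decision is illustrative.

```python
def frontier(decisions: list, lower_is_better: bool = True) -> list:
    """Return the (experiment_id, score) steps where the running best improved."""
    best = None
    steps = []
    for d in decisions:
        if d["status"] != "keep":
            continue                     # dropped experiments never move the frontier
        score = d["score"]
        improved = best is None or (score < best if lower_is_better else score > best)
        if improved:
            best = score
            steps.append((d["experiment_id"], score))
    return steps

decisions = [
    {"experiment_id": "e1", "status": "keep", "score": 2.60},
    {"experiment_id": "e2", "status": "drop", "score": 2.50},
    {"experiment_id": "e3", "status": "keep", "score": 2.48},
    {"experiment_id": "e4", "status": "keep", "score": 2.55},
    {"experiment_id": "e5", "status": "keep", "score": 2.41},
]
print(frontier(decisions))   # [('e1', 2.6), ('e3', 2.48), ('e5', 2.41)]
```

With a target of 2.45 and lower_is_better=True, the last step (2.41) is the first one to meet the goal.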

Artifacts & lineage

An artifact is a named, versioned output addressed like a container image: name:ref. Four orthogonal primitives:

from arwandb import autoresearch as ar

# log a new version from files; content-addressed, so re-logging same bytes is a no-op
v = ar.artifact("char-lm", "out/model", type="model", produced_by=exp["id"])

ar.promote("char-lm", "winner", version=v["version"])   # move the ref
ar.get_artifact("char-lm", "winner")                     # resolve ref → version
ar.artifacts(type="model")                              # list

The dashboard shows these under the Artifacts lens of each project: every version, its score, and which one carries the winner badge.
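
The name:ref model can be illustrated with a toy in-memory store: logging identical bytes returns the existing version, and promoting just moves a pointer. Everything here (the class, integer version numbers, SHA-256 as the tree hash) is an assumption for the sketch and not the server's data model.

```python
import hashlib

class ArtifactStore:
    """Toy content-addressed store: versions are tree hashes, refs are movable pointers."""

    def __init__(self):
        self.versions = {}   # name -> list of tree hashes; index + 1 is the version number
        self.refs = {}       # (name, ref) -> version number

    def log(self, name: str, files: dict) -> int:
        digest = hashlib.sha256()
        for path in sorted(files):
            digest.update(path.encode() + b"\0" + hashlib.sha256(files[path]).digest())
        tree = digest.hexdigest()
        versions = self.versions.setdefault(name, [])
        if tree not in versions:          # same bytes logged again: no new version
            versions.append(tree)
        self.refs[(name, "latest")] = len(versions)
        return versions.index(tree) + 1

    def promote(self, name: str, ref: str, version: int) -> None:
        if not 1 <= version <= len(self.versions.get(name, [])):
            raise KeyError(f"{name} has no version {version}")
        self.refs[(name, ref)] = version

    def resolve(self, name: str, ref: str) -> int:
        return self.refs[(name, ref)]

store = ArtifactStore()
v1 = store.log("char-lm", {"model.bin": b"weights-a"})
again = store.log("char-lm", {"model.bin": b"weights-a"})   # deduped
v2 = store.log("char-lm", {"model.bin": b"weights-b"})
store.promote("char-lm", "winner", v1)
print(v1, again, v2, store.resolve("char-lm", "winner"), store.resolve("char-lm", "latest"))
```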

Code snapshots

Pass snapshot="." (a dir or file list) to experiment(...) and the SDK hashes every file, asks the server which blobs are missing (POST /v1/blobs/missing), uploads only those, and attaches a tree manifest. That makes any experiment reproducible without storing redundant bytes. Caches, datasets, and model weights are ignored by default; a snapshot failure warns but never breaks the experiment.
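
The hash-then-upload-only-what-is-missing flow can be sketched without touching the network. The sketch assumes SHA-256 as the content hash and a manifest shaped as path to hash; the set passed as server_has stands in for the answer from POST /v1/blobs/missing, and the default ignore rules are left out.

```python
import hashlib
import tempfile
from pathlib import Path

def tree_manifest(root) -> dict:
    """Map each file's relative path to the hash of its bytes."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*")) if p.is_file()
    }

def blobs_to_upload(manifest: dict, server_has: set) -> list:
    """Unique blob hashes the server does not hold yet."""
    return sorted(set(manifest.values()) - set(server_has))

with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "train.py").write_text("print('train')\n")
    Path(tmp, "copy_of_train.py").write_text("print('train')\n")   # identical bytes, same hash
    Path(tmp, "config.json").write_text("{}\n")
    manifest = tree_manifest(tmp)

missing = blobs_to_upload(manifest, server_has={manifest["config.json"]})
print(len(manifest), "files,", len(missing), "blob(s) to upload")   # 3 files, 1 blob(s) to upload
```

Two files with the same bytes share one blob, which is why three files need only a single upload here.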

HTTP API reference

All paths are under the base URL and require the Authorization header (except /health). Bodies and responses are JSON.

Runs & metrics

| Endpoint | Description |
| --- | --- |
| POST /v1/runs | Create a run. |
| POST /v1/runs/:id/finish | Mark a run finished. |
| GET /v1/runs/:id/metrics | Full run history (W&B scan_history): ?keys=, min_step, max_step, limit. |
| GET /v1/runs/:id/integrity | Integrity receipt: points_committed (durability ledger) vs points_stored (deduped), duplicates_collapsed, lost, intact, and the run's commit/diff_sha fingerprint. |
| POST /v1/ingest | Batch-append metric points (JSON). |
| POST /v1/ingest/stream | Stream points as NDJSON (no body limit). |
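
As an illustration of the query parameters on the run-history endpoint, the helper below assembles a URL with the standard library. It assumes keys is passed as a comma-separated list, which this document does not confirm; the helper name and the run id are made up.

```python
from urllib.parse import quote, urlencode

def metrics_url(run_id, keys=None, min_step=None, max_step=None, limit=None,
                base_url="https://api.bullmask.com"):
    """Build the GET /v1/runs/:id/metrics URL, omitting unset filters."""
    params = {}
    if keys:
        params["keys"] = ",".join(keys)   # assumption: comma-separated metric names
    if min_step is not None:
        params["min_step"] = min_step
    if max_step is not None:
        params["max_step"] = max_step
    if limit is not None:
        params["limit"] = limit
    url = f"{base_url}/v1/runs/{quote(str(run_id), safe='')}/metrics"
    return f"{url}?{urlencode(params)}" if params else url

print(metrics_url("run-123", keys=["loss", "val_bpb"], min_step=100, limit=50))
```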

Projects & goals

| Endpoint | Description |
| --- | --- |
| POST /v1/projects | Create/update a project & its goal. |
| GET /v1/projects | List projects. |
| GET /v1/projects/:name | One project. |
| GET /v1/projects/:name/sessions | Sessions in a project. |
| GET /v1/projects/:name/progress | Running-best frontier (powers the Experiments chart). |
| GET /v1/projects/:name/runs | Runs in a project (id, name, status, config): the runs table. |
| GET /v1/projects/:name/charts | Auto-chart data (W&B history): downsampled per-metric series for every run + each run's latest value. ?keys=, samples (def 500), since_step, max_step. |

Sessions & experiments

| Endpoint | Description |
| --- | --- |
| POST /v1/sessions | Open a session. |
| POST /v1/experiments | Record an experiment. |
| GET /v1/experiments/:id | Experiment detail. |
| GET /v1/experiments/:id/diff | Stored diff text. |
| POST /v1/experiments/:id/decision | Record keep/drop + score. |
| GET, PUT /v1/experiments/:id/tree | Read / set the code-snapshot tree. |
| GET /v1/experiments/:a/diff/:b | Tree diff between two experiments. |
| GET /v1/sessions/:id/progress | Session progress (.svg variant too). |
| GET /v1/sessions/:id/frontier | Best-so-far frontier. |

Blobs & files

| Endpoint | Description |
| --- | --- |
| POST /v1/blobs/missing | Which of these SHAs are absent? (dedup check) |
| POST /v1/blobs | Upload one blob's bytes. |
| GET /v1/blobs/:sha | Fetch a blob by content hash. |
| POST /v1/files, POST /v1/files/upload | Register / upload an external file artifact. |

Artifacts

| Endpoint | Description |
| --- | --- |
| POST /v1/artifacts/log | Log a version (dedupes on tree hash). |
| GET /v1/artifacts | List artifacts (filter ?type=). |
| GET /v1/artifacts/:name | Versions + refs for one artifact. |
| PUT /v1/artifacts/:name/refs/:ref | Promote: move a ref to a version. |
| GET /v1/artifacts/:name/:ref | Resolve name:ref → version (latest built in). |

Agent & keys

| Endpoint | Description |
| --- | --- |
| GET /v1/agent/context | Compact project state for an agent. |
| GET /v1/bootstrap | One call that hydrates the whole dashboard. |
| POST /v1/keys, GET /v1/keys | Create / list API keys. |
| DELETE /v1/keys/:id | Revoke a key. |
| GET /health | Liveness probe (public, no auth). |

cURL example

# set the goal for a project
curl -s https://api.bullmask.com/v1/projects \
  -H "Authorization: Bearer $ARWANDB_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"name":"char-lm","objective_metric":"val_bpb","lower_is_better":true,"target":2.45}'

# resolve the current winner
curl -s "https://api.bullmask.com/v1/artifacts/char-lm/winner?project=char-lm" \
  -H "Authorization: Bearer $ARWANDB_API_KEY"

Environment variables

| Variable | Meaning |
| --- | --- |
| ARWANDB_BASE_URL | API origin. Default http://127.0.0.1:8090. |
| ARWANDB_API_KEY | Bearer token for all writes. |
| ARWANDB_PROJECT | Default project when not passed explicitly. |
| ARWANDB_MODE | online (default) or offline (spool only). |
| ARWANDB_SYSTEM_METRICS | Auto-collect system/* telemetry. On by default; off/0/false disables it. |
| ARWANDB_SYSTEM_INTERVAL | System sampling cadence in seconds (default 15, floor 1). |
| ARWANDB_WRITER_ID | Per-process writer id for distributed/multi-process runs sharing one run id (default random). |
| ARWANDB_SNAPSHOT_MAX_FILE | Per-file snapshot size cap (bytes, default 5 MiB). |
| ARWANDB_MAX_DIFF_BYTES | Cap on the auto-captured experiment diff (bytes, default 1 MiB); over it, a marker is stored instead. |

The WANDB_* equivalents are also read, so existing W&B scripts work with only the base URL changed.
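
One way such a fallback could be resolved is sketched below. The lookup order (ARWANDB_* first, then WANDB_*, then the documented default) is an assumption; this document only states that both families are read. The helper name is hypothetical.

```python
import os

# Defaults taken from the table above; settings without a documented default resolve to None.
DEFAULTS = {"BASE_URL": "http://127.0.0.1:8090", "MODE": "online"}

def resolve_setting(name: str, environ=os.environ):
    """Look up ARWANDB_<name>, then WANDB_<name>, then the documented default."""
    for prefix in ("ARWANDB_", "WANDB_"):   # assumed precedence
        value = environ.get(prefix + name)
        if value:
            return value
    return DEFAULTS.get(name)

env = {"WANDB_API_KEY": "legacy-key", "ARWANDB_PROJECT": "char-lm", "WANDB_PROJECT": "old"}
print(resolve_setting("API_KEY", env), resolve_setting("PROJECT", env), resolve_setting("BASE_URL", env))
```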

Errors & retries


arwx: simple, ultra-efficient, reliable, reproducible, evaluatable.