Arbio System Design - OldHost Sync Layer

📊 Architecture Diagrams

Hand-Drawn Style (Excalidraw)

Simple, clean diagram for whiteboard presentation:

Detailed Architecture

Full system with all components:

Simplified Flow

Quick reference version:

1. Requirements (5 min)

Functional Requirements

Instant UI Feedback: User/Agent actions must reflect in UI within 200ms
Eventual Sync: All state changes must sync to OldHost (legal system of record)
Priority Handling: Emergency events (guest locked out) must sync within 30 seconds
Crash Recovery: System must recover from writer crashes without data loss or nonce issues

Non-Functional Requirements

Consistency SLA: OldHost must be consistent with Arbio within 4 hours
Availability: UI must remain responsive even if OldHost is slow or down
Durability: Zero data loss — every event must eventually reach OldHost

Scale & Constraints

Metric	Value
Units managed	50,000
Peak input rate	300+ ops/sec (10K events in 30 min)
OldHost throughput	2.5 ops/sec (400ms latency, single-threaded)
OldHost constraints	Single TCP socket, sequential nonce (N→N+1→N+2)
Nonce violation penalties	Gap = 400 error, Replay = 15-minute ban

The Core Problem: We need to compress 300+ ops/sec down to 2.5 ops/sec while maintaining correctness.

2. Core Entities & API (5 min)

Data Model

-- Shadow State (instant reads)
units (
  id, name, property_id,
  status,                    -- Clean/Dirty/Cleaning
  sync_status,               -- PENDING_SYNC / SYNCED / SYNC_FAILED
  synced_at
)

-- Transactional Outbox (durable writes)
outbox_events (
  id, idempotency_key UNIQUE,
  entity_type, entity_id,
  event_type, payload JSON,
  priority_class,            -- EMERGENCY / TXN / LWW
  created_at, processed_at
)

-- Nonce Ledger (crash-safe nonce state)
nonce_ledger (
  id DEFAULT 1,              -- Singleton row
  next_nonce BIGINT,
  last_applied_nonce BIGINT
)

nonce_requests (
  nonce BIGINT PRIMARY KEY,
  idempotency_key UNIQUE,
  status,                    -- RESERVED / SENT / APPLIED / FAILED
  created_at, sent_at, completed_at
)

-- Leader Election
writer_lease (
  id DEFAULT 1,              -- Singleton row
  owner_id, epoch INT,       -- Fencing token
  acquired_at, expires_at
)

API Design

Internal (Agent → Sync Layer)

POST /api/units/{id}/status
  Body: { status: "Clean" }
  Response: { sync_status: "PENDING_SYNC" }  // < 200ms

External (Sync Layer → OldHost)

POST /oldhost/sync
  Headers: X-Nonce: 501
  Body: { unit_id, status, ... }
  Response: 200 OK | 400 (gap/replay)
  
GET /oldhost/expected-nonce
  Response: { expected_nonce: 502 }

3. High Level Design (10 min)

┌─────────────────────────────────────────────────────────────────────────────┐
│                              HIGH LEVEL DESIGN                              │
└─────────────────────────────────────────────────────────────────────────────┘

  ┌──────────┐     ┌─────────────────────────────────┐
  │  Agents  │────▶│         Shadow State            │◀─── UI reads (< 200ms)
  │ (300/s)  │     │  (MariaDB: status, sync_status) │
  └──────────┘     └───────────────┬─────────────────┘
                                   │
                                   ▼
                         ┌─────────────────┐
                         │  Outbox Table   │
                         │ (same DB, txn)  │
                         └────────┬────────┘
                                  │
                                  ▼
                         ┌─────────────────┐
                         │   Classifier    │
                         └───┬─────┬─────┬─┘
                             │     │     │
              ┌──────────────┘     │     └──────────────┐
              ▼                    ▼                    ▼
    ┌──────────────┐     ┌──────────────┐     ┌──────────────┐
    │ Emergency Q  │     │Transactional │     │LWW Coalescer │
    │    (~2%)     │     │    (~18%)    │     │   (~80%)     │
    │  SLO: 30s    │     │ No compress  │     │ Debounce 2m  │
    └──────┬───────┘     └──────┬───────┘     └──────┬───────┘
           │                    │                    │
           └────────────────────┼────────────────────┘
                                ▼
                    ┌───────────────────────┐
                    │  Weighted Scheduler   │
                    │  Emergency = ABSOLUTE │
                    │  Txn : LWW = 3 : 1    │
                    └───────────┬───────────┘
                                │
                                ▼
              ┌─────────────────────────────────────┐
              │         SINGLETON WRITER            │
              │  • Lease + Epoch (fencing)          │
              │  • Nonce Ledger (crash-safe)        │
              │  • Circuit Breaker (ban)            │
              │  • Single TCP Socket                │
              └─────────────────┬───────────────────┘
                                │
                                ▼
                      ┌─────────────────┐
                      │    OldHost      │
                      │  (2.5 ops/sec)  │
                      └─────────────────┘

Key Design Decisions

Decision	Rationale
Shadow State	Decouples UI speed from OldHost latency
Outbox Pattern	Atomic: business write + event publish in one transaction
LWW Coalescing	Compresses 8,000 status updates → 2,500 ops
3 Priority Queues	Prevents housekeeping from starving emergencies
Singleton Writer	OldHost requires single socket + sequential nonces
Lease + Epoch	Prevents split-brain after failover

4. Deep Dives (15-20 min)

Deep Dive 1: Traffic Shaping (The 300→2.5 Compression)

The Math

Peak: 10,000 events in 30 minutes

Event Classification:
├── LWW (80%):     8,000 events → ~2,500 unique entities → 2,500 ops ✓
├── Transactional: 1,800 events → 1,800 ops (no compression)
└── Emergency:       200 events →   200 ops (no compression)

Total after coalescing: 2,500 + 1,800 + 200 = 4,500 ops
Capacity: 2.5 ops/sec × 1,800 sec = 4,500 ops ✓ EXACTLY AT LIMIT

LWW Coalescer Implementation

class LWWCoalescer:
    def __init__(self):
        self.buffer = {}  # entity_key → (event, timestamp, timer)
        self.debounce_ms = 120_000  # 2 minutes
    
    def add(self, event):
        key = f"{event.entity_type}:{event.entity_id}"
        
        if key in self.buffer:
            old_event, _, timer = self.buffer[key]
            timer.cancel()  # Cancel pending flush
            
            if event.timestamp <= old_event.timestamp:
                return  # Discard older event (Last Write Wins)
        
        # Store latest, start new debounce timer
        timer = Timer(self.debounce_ms, lambda: self.flush(key))
        self.buffer[key] = (event, event.timestamp, timer)
        timer.start()

Deep Dive 2: Singleton Writer & Nonce Safety

Why Singleton?

OldHost kills existing connection when new one opens
Nonces must be strictly sequential (N, N+1, N+2)
Two writers = guaranteed nonce collision = guaranteed ban

Lease-Based Leader Election

-- Acquire lease (atomic compare-and-swap)
UPDATE writer_lease
SET owner_id = :new_owner,
    epoch = epoch + 1,          -- Fencing token increments!
    acquired_at = NOW(),
    expires_at = NOW() + INTERVAL 30 SECOND
WHERE expires_at < NOW()        -- Only if expired
   OR owner_id = :new_owner;    -- Or we already own it

Fencing Token Enforcement

class SingletonWriter:
    def send_to_oldhost(self, request):
        # Before EVERY request, verify our epoch is still valid
        current_epoch = db.query(
            "SELECT epoch FROM writer_lease WHERE owner_id = ?", 
            self.owner_id
        )
        
        if current_epoch != self.my_epoch:
            raise FencedOutError("Another writer took over!")
        
        return self.socket.send(request)

Deep Dive 3: Crash Recovery

Scenario: Writer sends nonce 500, crashes before recording response

1. Writer A crashes after sending nonce 500
   DB state: nonce_requests[500] = SENT
   OldHost: May or may not have processed it

2. Writer B acquires lease (epoch = old_epoch + 1)

3. Writer B runs recovery:
   a. Query: SELECT * FROM nonce_requests WHERE status IN ('RESERVED', 'SENT')
   b. Found: nonce 500, status SENT
   c. Call: GET /expected-nonce → returns 501
   d. 501 > 500, so nonce 500 was applied!
   e. Update: nonce_requests[500].status = APPLIED
   f. Continue from nonce 501

Recovery Code

def recover_from_crash():
    pending = db.query(
        "SELECT * FROM nonce_requests WHERE status IN ('RESERVED', 'SENT')"
    )
    
    if not pending:
        return  # Clean state
    
    # Call OldHost once to reconcile
    expected = oldhost.get_expected_nonce()
    
    for req in pending:
        if req.nonce < expected:
            # OldHost processed it
            db.update("nonce_requests", {status: "APPLIED"}, where={nonce: req.nonce})
        elif req.nonce == expected:
            # Lost in flight, resend
            resend(req)
        else:
            # req.nonce > expected = GAP (should never happen!)
            alert_ops("P0", f"Nonce gap: expected {expected}, have {req.nonce}")
            raise CriticalError()

Deep Dive 4: Ban Handling (Circuit Breaker)

class BanCircuitBreaker:
    def __init__(self):
        self.state = "CLOSED"
        self.ban_duration = 15 * 60  # 15 minutes
        self.opened_at = None
    
    def on_replay_detected(self):
        self.state = "OPEN"
        self.opened_at = time.now()
        alert_ops("P0", "Nonce replay! Circuit breaker OPEN for 15 min")
    
    def is_open(self):
        if self.state == "CLOSED":
            return False
        
        elapsed = time.now() - self.opened_at
        if elapsed >= self.ban_duration:
            self.state = "CLOSED"
            return False
        
        return True

Backlog During Ban

Ban: 15 min = 900 seconds
Incoming: 5.5 ops/sec × 900 = 4,950 raw events
After coalescing: ~2,000 events

Drain after ban: 2,000 / 2.5 = 800 sec = ~13 min
Total recovery: 15 + 13 = 28 minutes (within 4h SLA ✓)

5. Summary

Component	Purpose
Shadow State	Instant UI response (< 200ms)
Outbox Pattern	Atomic, durable event capture
3 Priority Queues	Emergency (30s SLO), Transactional, LWW
LWW Coalescer	Compresses 80% of traffic
Weighted Scheduler	Fair scheduling with emergency priority
Singleton Writer	Single socket + sequential nonces
Lease + Epoch	Fencing for crash recovery
Nonce Ledger	Crash-safe nonce state
Circuit Breaker	Ban recovery

Key Numbers to Remember

120x

Compression ratio

80/18/2

LWW/TXN/Emergency %

30s

Lease TTL

2min

Debounce window

🏗️ Arbio System Design

📑 Table of Contents