Running Reliable Telegram Bots on a Raspberry Pi
How I built a production-grade orchestration system for Telegram bots using Redis, systemd, and health monitoring—turning fragile keystroke injection into a resilient platform.

I run several Telegram bots on a Raspberry Pi. For a long time, the setup worked—orders arrived, messages were processed, Claude Code executed commands. But it was fragile. When a bot crashed, I wouldn't notice for minutes. Commands sent to the bots sometimes disappeared silently. There was no way to see what was happening without SSH'ing in.
This is the story of how I rebuilt the system to be production-grade, and what I learned along the way.
The Problem
My original setup was simple:
- Systemd services — each bot runs as a service and restarts on crash
- Tmux sessions — for Claude Code, which needs an interactive terminal
- Direct keystroke injection — send commands by typing into tmux directly
The critical flaw: no queue, no acknowledgment, no retry logic. When the admin bot wanted to send a command to Claude, it would:
tmux send-keys -t session "your command" Enter
That's it. If Claude crashed between the keystroke and execution, the command was lost. If Claude was processing something, the input might get mangled. If I restarted Claude for a fresh session, any pending commands evaporated.
The real problem: I had no visibility. I couldn't tell if Claude was alive, processing, or stuck. I couldn't see if commands were queued or lost. There was no feedback loop.
Why Raspberry Pi?
You might ask: why run always-on services on a Pi instead of a cloud VM? A few reasons:
- Always available — no cold starts, consistent state
- Low power — runs 24/7 for ~$10/year in electricity
- Local storage — easy file access, no complex cloud storage
- Direct access — low latency, full control
The tradeoff: limited RAM (~4 GB), no built-in redundancy. This meant any solution had to be lean and fault-tolerant.
The Solution: Orchestration as a Service
Instead of patching the tmux-based system, I rebuilt it from first principles. The goal was simple: make the system observable, reliable, and debuggable without adding much complexity.
I added three components:
- Redis — central state bus and command queue
- Health monitor — watches for crashes and auto-restarts
- Status dashboard — real-time visibility into system state
Component 1: Redis as a State Bus
Redis became the single source of truth. Instead of bots talking directly via keystroke injection, they communicate through Redis keys:
claude:status → "idle" | "processing" | "starting" | "error"
claude:heartbeat → timestamp (TTL 30s)
claude:uptime → startup epoch
claude:error_count → number of crashes
claude:inbox → FIFO queue of pending commands
claude:history → last 20 executed commands
The daemon process (which manages Claude Code) now:
- On startup: writes status "starting", uptime timestamp, resets error count
- Every 10 seconds: writes a heartbeat with 30-second TTL
- When idle: watches the queue for new commands
- On shutdown: increments error count, sets status "stopped"
This solved the visibility problem immediately. I could SSH in and ask: redis-cli GET claude:status. Done.
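The daemon's side of this contract fits in a few small functions. Here's a minimal sketch, assuming a redis-py-style client `r`; the function names and wiring are illustrative, not the actual daemon code:

```python
import time

HEARTBEAT_TTL = 30  # seconds; must comfortably exceed the 10-second write interval

def on_startup(r):
    """Called once when the daemon launches Claude Code."""
    r.set("claude:status", "starting")
    r.set("claude:uptime", int(time.time()))
    r.set("claude:error_count", 0)

def beat(r):
    """Called every 10 seconds while the daemon is alive."""
    # SETEX writes the key with a TTL: if the daemon dies, the key
    # simply expires within 30 seconds and the health monitor notices.
    r.setex("claude:heartbeat", HEARTBEAT_TTL, int(time.time()))

def on_shutdown(r):
    """Called when the managed process exits unexpectedly."""
    r.incr("claude:error_count")
    r.set("claude:status", "stopped")
```

The key design choice: liveness is expressed by the TTL, not by the value. Nobody ever has to delete the heartbeat key; absence of writes is the failure signal.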
Component 2: The Command Queue
Instead of direct tmux injection, commands now go into a Redis FIFO queue:
# In the admin bot
import json, time, uuid

def queue_command(command):
    r.lpush("claude:inbox", json.dumps({
        "id": str(uuid.uuid4()),  # uuid4() isn't JSON-serializable; stringify it
        "command": command,
        "timestamp": time.time()
    }))
    return True  # queued durably in Redis, delivered once Claude is idle
The daemon's queue worker loop constantly polls:
while true; do
  if [ "$(redis-cli LLEN claude:inbox)" -gt 0 ] && [ "$(redis-cli GET claude:status)" = "idle" ]; then
    item=$(redis-cli RPOP claude:inbox)
    command=$(echo "$item" | python3 -c "import sys,json; print(json.loads(sys.stdin.read())['command'])")
    # Inject into tmux
    tmux send-keys -t session "$command" Enter
    # Record in history for debugging
    redis-cli LPUSH claude:history "$command"
    redis-cli LTRIM claude:history 0 19
  fi
  sleep 2
done
The beauty of this approach: commands persist across crashes. If Claude goes down mid-restart, the queue survives in Redis. When it comes back up, the commands are still waiting.
And there's graceful degradation: if Redis isn't available, the admin bot falls back to direct keystroke injection. The system doesn't break; it just loses the queue's persistence.
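That fallback can be as small as one try/except. A sketch of the idea, with illustrative names (`send_command` and the tmux session name are assumptions, not the actual bot code):

```python
import json
import subprocess
import time
import uuid

def send_command(r, command, session="session"):
    """Queue a command via Redis, falling back to direct tmux injection
    if Redis is unreachable. Returns how the command was delivered.
    `r` is any redis-py-style client."""
    try:
        r.lpush("claude:inbox", json.dumps({
            "id": str(uuid.uuid4()),
            "command": command,
            "timestamp": time.time(),
        }))
        return "queued"
    except Exception:
        # Redis is down: degrade to the old direct path rather than fail.
        subprocess.run(["tmux", "send-keys", "-t", session, command, "Enter"])
        return "injected"
```

The caller doesn't need to know which path was taken; the return value is only useful for logging which mode the system is in.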
Key insight: The queue isn't a complex message broker. It's Redis's native list data structure. Simple, fast, and good enough.
Component 3: Health Monitoring & Auto-Recovery
Now comes the magic: a simple health monitor that watches the heartbeat:
# Health monitor loop (runs every 30 seconds)
heartbeat_misses = 0

def check_heartbeat():
    global heartbeat_misses
    ttl = r.ttl("claude:heartbeat")
    if ttl == -2:  # TTL of -2 means the key doesn't exist: heartbeat not renewed
        heartbeat_misses += 1
        if heartbeat_misses >= 3:
            # Three consecutive misses = Claude is dead
            subprocess.run(["sudo", "systemctl", "restart", "claude-code"])
            send_telegram_alert("Claude auto-restarted after heartbeat miss")
            heartbeat_misses = 0
    else:
        heartbeat_misses = 0  # heartbeat seen again; reset the counter
This single pattern solved the detection problem. Claude crashes → heartbeat expires → monitor counts three consecutive misses (about 90 seconds) → auto-restart triggers → Telegram alert sent. No manual intervention.
The monitor also checks:
- Queue depth — alert if commands are backing up
- All critical services — confirm they're still running
- System resources — memory % and CPU temperature
Rate-limited to avoid alert spam (5-minute cooldown per alert type).
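The cooldown is just a dictionary of last-sent timestamps per alert type. A minimal sketch (the `send` callback stands in for the Telegram notifier; names are illustrative):

```python
import time

ALERT_COOLDOWN = 300  # seconds between alerts of the same type

_last_sent = {}  # alert type -> timestamp of the last alert we actually sent

def maybe_alert(kind, message, send, now=time.time):
    """Send at most one alert per `kind` every 5 minutes.
    `now` is injectable so the cooldown logic is testable."""
    t = now()
    if t - _last_sent.get(kind, float("-inf")) < ALERT_COOLDOWN:
        return False  # suppressed: this alert type fired too recently
    _last_sent[kind] = t
    send(message)
    return True
```

Because the cooldown is keyed by alert type, a queue-depth alert never masks a CPU-temperature alert; only repeats of the same problem are suppressed.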
Component 4: The Status Dashboard
Finally, a /status command in Telegram that shows the whole system state:
Claude Code — Status
──────────────────────
✅ idle · up 2h 14m · queue: 0 · errors: 0
Services
✅ claude-code ✅ admin-bot
✅ ledger ✅ reflect
✅ redis ✅ health-monitor
System
🟢 Memory 58% 🟢 CPU 47.9°C
Today €0.42
Last commands
1. /clear
2. write me a haiku
3. summarise this file
This gives me complete visibility from Telegram itself. No need to SSH in. Just open Telegram and ask for status.
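Because all the state already lives in Redis, the summary line at the top of the dashboard is just a handful of GETs formatted into a string. A sketch, assuming a redis-py-style client `r` that returns decoded strings (key names follow the schema above; formatting is illustrative):

```python
import time

def status_line(r, now=time.time):
    """Build the one-line summary shown at the top of /status."""
    status = r.get("claude:status") or "unknown"
    started = int(r.get("claude:uptime") or now())
    queue = r.llen("claude:inbox")
    errors = int(r.get("claude:error_count") or 0)
    minutes = (int(now()) - started) // 60
    h, m = divmod(minutes, 60)
    icon = "✅" if status == "idle" else "⚠️"
    return f"{icon} {status} · up {h}h {m}m · queue: {queue} · errors: {errors}"
```

The rest of the dashboard is assembled the same way: systemd unit states from `systemctl is-active`, memory and temperature from the OS, history from `claude:history`.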
Handling Edge Cases: The Startup Prompt
About halfway through implementation, I ran into an annoying edge case. Claude Code shows a security prompt on first startup: "Is this a project you created or one you trust?"
The daemon would start Claude, then sit waiting for the "Listening for channels" message. But it never came—Claude was stuck on the prompt, waiting for user input.
The health monitor would detect the heartbeat missing and trigger a restart. Which would start Claude again. Which would get stuck on the prompt again. Infinite loop.
The fix was simple but crucial: detect the prompt and auto-confirm it:
output=$(tmux capture-pane -t session -p)
if echo "$output" | grep -q "Is this a project you created"; then
  tmux send-keys -t session "Enter"
fi
This taught me an important lesson: even with great monitoring, you need to handle startup edge cases gracefully. A system that crashes on startup will never reach a healthy state, no matter how good the recovery logic is.
Lessons Learned
1. Observability is Infrastructure
Before: I had logs. After: I had state. State is better.
A log tells you what happened. State tells you what's happening. Redis keys with TTLs let me answer "Is this alive right now?" instantly, without parsing logs or running commands.
2. Graceful Degradation > Perfect Reliability
Every integration with Redis is wrapped in try/except. If Redis goes down, the bots keep working. They lose the queue benefit, but they don't crash.
This is better than building a system that requires Redis to be up. Simpler, more testable, more resilient.
3. Simple Patterns Scale
I didn't use RabbitMQ, Kafka, or Celery. I used Redis's native data structures: strings (for state), lists (for queues), TTLs (for heartbeats).
The whole system is ~400 lines of bash and Python. Understandable. Debuggable. No ops overhead.
4. Visibility First, Complexity Second
The status dashboard took an hour to build. It saved me dozens of hours of debugging. Build observability early.
5. Test Your Edge Cases
The startup prompt issue would have been caught immediately if I'd tested what happens when Claude starts but can't reach the "ready" state. Always test your recovery paths.
The Results
After this rebuild:
- ✅ No silent losses — every command queued is eventually executed
- ✅ Fast detection — crashes detected within 90 seconds
- ✅ Auto-recovery — health monitor restarts the bot before I notice
- ✅ Full visibility — status dashboard shows everything, from Telegram
- ✅ Lean system — under 30 MB of additional memory, <0.3% CPU overhead
The bots are now production-grade, not just "works most of the time."
What's Next
A few ideas for the future:
- Distributed bots — this pattern would work across multiple Pis with shared Redis
- Command replay — store all executed commands to replay after crashes
- Circuit breakers — detect if a bot is in a bad state and pause its queue
- Metrics export — push state to Prometheus for long-term trends
But for now, the system does what it needs to. It's observable, resilient, and simple enough to maintain alone.
Running production services doesn't require enterprise infrastructure. It requires understanding your system deeply and building observability first.
The full orchestration system is documented in detail here. If you're running services on a resource-constrained device (Pi, VPS, VM), the patterns here—Redis state bus, heartbeat monitoring, health dashboards—work for anything you're running.