Running Reliable Telegram Bots on a Raspberry Pi
How I built a production-grade orchestration system for Telegram bots using Redis, systemd, and health monitoring—turning fragile keystroke injection into a resilient platform.

I run several Telegram bots on a Raspberry Pi. For a long time, the setup worked—orders arrived, messages were processed, Claude Code executed commands. But it was fragile. When a bot crashed, I wouldn't notice for minutes. Commands sent to the bots sometimes disappeared silently. There was no way to see what was happening without SSH'ing in.
This is the story of how I rebuilt the system to be production-grade, and what I learned along the way.
The Problem
My original setup was simple:
- Systemd services — each bot runs as a service and restarts on crash
- Tmux sessions — for Claude Code, which needs an interactive terminal
- Direct keystroke injection — send commands by typing into tmux directly
The critical flaw: no queue, no acknowledgment, no retry logic. When the admin bot wanted to send a command to Claude, it would:
tmux send-keys -t session "your command" Enter
That's it. If Claude crashed between the keystroke and execution, the command was lost. If Claude was processing something, the input might get mangled. If I restarted Claude for a fresh session, any pending commands evaporated.
The real problem: I had no visibility. I couldn't tell if Claude was alive, processing, or stuck. I couldn't see if commands were queued or lost. There was no feedback loop.
Why Raspberry Pi?
You might ask: why run always-on services on a Pi instead of a cloud VM? A few reasons:
- Always available — no cold starts, consistent state
- Low power — runs 24/7 for ~$10/year in electricity
- Local storage — easy file access, no complex cloud storage
- Direct access — low latency, full control
The tradeoff: limited RAM (~4 GB), no built-in redundancy. This meant any solution had to be lean and fault-tolerant.
The Solution: Orchestration as a Service
Instead of patching the tmux-based system, I rebuilt it from first principles. The goal was simple: make the system observable, reliable, and debuggable without adding much complexity.
I added three components:
- Redis — central state bus and command queue
- Health monitor — watches for crashes and auto-restarts
- Status dashboard — real-time visibility into system state
Component 1: Redis as a State Bus
Redis became the single source of truth. Instead of bots talking directly via keystroke injection, they communicate through Redis keys:
claude:status → "idle" | "processing" | "starting" | "error"
claude:heartbeat → timestamp (TTL 30s)
claude:uptime → startup epoch
claude:error_count → number of crashes
claude:inbox → FIFO queue of pending commands
claude:history → last 20 executed commands
The daemon process (which manages Claude Code) now:
- On startup: writes status "starting", uptime timestamp, resets error count
- Every 10 seconds: writes a heartbeat with 30-second TTL
- When idle: watches the queue for new commands
- On shutdown: increments error count, sets status "stopped"
This solved the visibility problem immediately. I could SSH in and ask: redis-cli GET claude:status. Done.
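The daemon's side of this contract fits in a few small functions. Here's a minimal sketch, assuming a redis-py-style client `r`; the function names and wiring are illustrative, not the actual daemon code:

```python
import time

HEARTBEAT_TTL = 30  # seconds; must comfortably exceed the 10-second write interval

def on_startup(r):
    """Called once when the daemon launches Claude Code."""
    r.set("claude:status", "starting")
    r.set("claude:uptime", int(time.time()))
    r.set("claude:error_count", 0)

def beat(r):
    """Called every 10 seconds while the daemon is alive."""
    # SETEX writes the key with a TTL: if the daemon dies, the key
    # simply expires within 30 seconds and the health monitor notices.
    r.setex("claude:heartbeat", HEARTBEAT_TTL, int(time.time()))

def on_shutdown(r):
    """Called when the managed process exits unexpectedly."""
    r.incr("claude:error_count")
    r.set("claude:status", "stopped")
```

The key design choice: liveness is expressed by the TTL, not by the value. Nobody ever has to delete the heartbeat key; absence of writes is the failure signal.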
Component 2: The Command Queue
Instead of direct tmux injection, commands now go into a Redis FIFO queue:
# In the admin bot
import json, time, uuid

def queue_command(command):
    r.lpush("claude:inbox", json.dumps({
        "id": str(uuid.uuid4()),  # uuid4() isn't JSON-serializable; stringify it
        "command": command,
        "timestamp": time.time()
    }))
    return True  # queued durably in Redis, delivered once Claude is idle
The daemon's queue worker loop constantly polls:
while true; do
  if [ "$(redis-cli LLEN claude:inbox)" -gt 0 ] && [ "$(redis-cli GET claude:status)" = "idle" ]; then
    item=$(redis-cli RPOP claude:inbox)
    command=$(echo "$item" | python3 -c "import sys,json; print(json.loads(sys.stdin.read())['command'])")
    # Inject into tmux
    tmux send-keys -t session "$command" Enter
    # Record in history for debugging
    redis-cli LPUSH claude:history "$command"
    redis-cli LTRIM claude:history 0 19
  fi
  sleep 2
done
The beauty of this approach: commands persist across crashes. If Claude goes down mid-restart, the queue survives in Redis. When it comes back up, the commands are still waiting.
And there's graceful degradation: if Redis isn't available, the admin bot falls back to direct keystroke injection. The system doesn't break; it just loses the queue's persistence.
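That fallback can be as small as one try/except. A sketch of the idea, with illustrative names (`send_command` and the tmux session name are assumptions, not the actual bot code):

```python
import json
import subprocess
import time
import uuid

def send_command(r, command, session="session"):
    """Queue a command via Redis, falling back to direct tmux injection
    if Redis is unreachable. Returns how the command was delivered.
    `r` is any redis-py-style client."""
    try:
        r.lpush("claude:inbox", json.dumps({
            "id": str(uuid.uuid4()),
            "command": command,
            "timestamp": time.time(),
        }))
        return "queued"
    except Exception:
        # Redis is down: degrade to the old direct path rather than fail.
        subprocess.run(["tmux", "send-keys", "-t", session, command, "Enter"])
        return "injected"
```

The caller doesn't need to know which path was taken; the return value is only useful for logging which mode the system is in.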
Key insight: The queue isn't a complex message broker. It's Redis's native list data structure. Simple, fast, and good enough.
Component 3: Health Monitoring & Auto-Recovery
Now comes the magic: a simple health monitor that watches the heartbeat:
# Health monitor loop (runs every 30 seconds)
heartbeat_misses = 0

def check_heartbeat():
    global heartbeat_misses
    ttl = r.ttl("claude:heartbeat")
    if ttl == -2:  # TTL of -2 means the key doesn't exist: heartbeat not renewed
        heartbeat_misses += 1
        if heartbeat_misses >= 3:
            # Three consecutive misses = Claude is dead
            subprocess.run(["sudo", "systemctl", "restart", "claude-code"])
            send_telegram_alert("Claude auto-restarted after heartbeat miss")
            heartbeat_misses = 0
    else:
        heartbeat_misses = 0  # heartbeat seen again; reset the counter
This single pattern solved the detection problem. Claude crashes → heartbeat expires → monitor counts three consecutive misses (about 90 seconds) → auto-restart triggers → Telegram alert sent. No manual intervention.
The monitor also checks:
- Queue depth — alert if commands are backing up
- All critical services — confirm they're still running
- System resources — memory % and CPU temperature
Rate-limited to avoid alert spam (5-minute cooldown per alert type).
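The cooldown is just a dictionary of last-sent timestamps per alert type. A minimal sketch (the `send` callback stands in for the Telegram notifier; names are illustrative):

```python
import time

ALERT_COOLDOWN = 300  # seconds between alerts of the same type

_last_sent = {}  # alert type -> timestamp of the last alert we actually sent

def maybe_alert(kind, message, send, now=time.time):
    """Send at most one alert per `kind` every 5 minutes.
    `now` is injectable so the cooldown logic is testable."""
    t = now()
    if t - _last_sent.get(kind, float("-inf")) < ALERT_COOLDOWN:
        return False  # suppressed: this alert type fired too recently
    _last_sent[kind] = t
    send(message)
    return True
```

Because the cooldown is keyed by alert type, a queue-depth alert never masks a CPU-temperature alert; only repeats of the same problem are suppressed.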
Component 4: The Status Dashboard
Finally, a /status command in Telegram that shows the whole system state:
Claude Code — Status
──────────────────────
✅ idle · up 2h 14m · queue: 0 · errors: 0
Services
✅ claude-code ✅ admin-bot
✅ ledger ✅ reflect
✅ redis ✅ health-monitor
System
🟢 Memory 58% 🟢 CPU 47.9°C
Today €0.42
Last commands
1. /clear
2. write me a haiku
3. summarise this file
This gives me complete visibility from Telegram itself. No need to SSH in. Just open Telegram and ask for status.
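Because all the state already lives in Redis, the summary line at the top of the dashboard is just a handful of GETs formatted into a string. A sketch, assuming a redis-py-style client `r` that returns decoded strings (key names follow the schema above; formatting is illustrative):

```python
import time

def status_line(r, now=time.time):
    """Build the one-line summary shown at the top of /status."""
    status = r.get("claude:status") or "unknown"
    started = int(r.get("claude:uptime") or now())
    queue = r.llen("claude:inbox")
    errors = int(r.get("claude:error_count") or 0)
    minutes = (int(now()) - started) // 60
    h, m = divmod(minutes, 60)
    icon = "✅" if status == "idle" else "⚠️"
    return f"{icon} {status} · up {h}h {m}m · queue: {queue} · errors: {errors}"
```

The rest of the dashboard is assembled the same way: systemd unit states from `systemctl is-active`, memory and temperature from the OS, history from `claude:history`.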
Handling Edge Cases: The Startup Prompt
About halfway through implementation, I ran into an annoying edge case. Claude Code shows a security prompt on first startup: "Is this a project you created or one you trust?"
The daemon would start Claude, then sit waiting for the "Listening for channels" message. But it never came—Claude was stuck on the prompt, waiting for user input.
The health monitor would detect the heartbeat missing and trigger a restart. Which would start Claude again. Which would get stuck on the prompt again. Infinite loop.
The fix was simple but crucial: detect the prompt and auto-confirm it:
output=$(tmux capture-pane -t session -p)
if echo "$output" | grep -q "Is this a project you created"; then
  tmux send-keys -t session "Enter"
fi
This taught me an important lesson: even with great monitoring, you need to handle startup edge cases gracefully. A system that crashes on startup will never reach a healthy state, no matter how good the recovery logic is.
Lessons Learned
1. Observability is Infrastructure
Before: I had logs. After: I had state. State is better.
A log tells you what happened. State tells you what's happening. Redis keys with TTLs let me answer "Is this alive right now?" instantly, without parsing logs or running commands.
2. Graceful Degradation > Perfect Reliability
Every integration with Redis is wrapped in try/except. If Redis goes down, the bots keep working. They lose the queue benefit, but they don't crash.
This is better than building a system that requires Redis to be up. Simpler, more testable, more resilient.
3. Simple Patterns Scale
I didn't use RabbitMQ, Kafka, or Celery. I used Redis's native data structures: strings (for state), lists (for queues), TTLs (for heartbeats).
The whole system is ~400 lines of bash and Python. Understandable. Debuggable. No ops overhead.
4. Visibility First, Complexity Second
The status dashboard took an hour to build. It saved me dozens of hours of debugging. Build observability early.
5. Test Your Edge Cases
The startup prompt issue would have been caught immediately if I'd tested what happens when Claude starts but can't reach the "ready" state. Always test your recovery paths.
The Results
After this rebuild:
- ✅ No silent losses — every command queued is eventually executed
- ✅ Fast detection — crashes detected within 90 seconds
- ✅ Auto-recovery — health monitor restarts the bot before I notice
- ✅ Full visibility — status dashboard shows everything, from Telegram
- ✅ Lean system — under 30 MB of additional memory, <0.3% CPU overhead
The bots are now production-grade, not just "works most of the time."
What's Next
A few ideas for the future:
- Distributed bots — this pattern would work across multiple Pis with shared Redis
- Command replay — store all executed commands to replay after crashes
- Circuit breakers — detect if a bot is in a bad state and pause its queue
- Metrics export — push state to Prometheus for long-term trends
But for now, the system does what it needs to. It's observable, resilient, and simple enough to maintain alone.
Running production services doesn't require enterprise infrastructure. It requires understanding your system deeply and building observability first.
The full orchestration system is documented in detail here. If you're running services on a resource-constrained device (Pi, VPS, VM), the patterns here—Redis state bus, heartbeat monitoring, health dashboards—work for anything you're running.