Introduction
At Siden, we continuously refine our infrastructure to ensure optimal performance, reliability, and scalability. Our edge computing platform depends on efficient key-value storage to manage content distribution and operational data. Recently, a Redis cluster failure exposed a critical redundancy gap, prompting us to re-evaluate our storage solutions. This blog post outlines our evaluation process, key findings, and the rationale behind our final decision.
The Challenge: Redis Cluster Failure
Redis has been central to our architecture, serving multiple roles, including:
- Storing Target State Data: Maintaining lists of content an edge node must download, structured as sorted sets indexed by device IDs.
- Distributed Lock Management: Ensuring “run only once” execution of tasks in a multi-worker environment.
- Queueing Mechanism: Facilitating a distributed work queue for task processing.
However, our reliance on Redis came under scrutiny when a single machine failure disrupted our cluster. This failure was unexpected, revealing that our Redis deployment lacked redundancy.
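The "run only once" role listed above is commonly built on a set-if-absent-with-expiry primitive (in Redis, `SET key value NX PX <milliseconds>`). The following is a minimal Go sketch of that idea, assuming an in-memory stand-in instead of a real Redis or NATS client. `LockStore`, `TryAcquire`, `Release`, and `RunOnce` are hypothetical names introduced for illustration only.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// lockEntry records who holds a lock and when it expires.
type lockEntry struct {
	owner   string
	expires time.Time
}

// LockStore is a hypothetical in-memory stand-in for a key-value store
// that supports "set if absent, with expiry" semantics.
type LockStore struct {
	mu      sync.Mutex
	entries map[string]lockEntry
}

func NewLockStore() *LockStore {
	return &LockStore{entries: make(map[string]lockEntry)}
}

// TryAcquire claims key for owner unless an unexpired entry already exists.
func (s *LockStore) TryAcquire(key, owner string, ttl time.Duration, now time.Time) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if e, ok := s.entries[key]; ok && now.Before(e.expires) {
		return false
	}
	s.entries[key] = lockEntry{owner: owner, expires: now.Add(ttl)}
	return true
}

// Release removes the lock only if owner still holds it.
func (s *LockStore) Release(key, owner string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if e, ok := s.entries[key]; ok && e.owner == owner {
		delete(s.entries, key)
		return true
	}
	return false
}

// RunOnce lets `workers` goroutines race for the same lock and returns
// how many of them acquired it (and would therefore run the task).
func RunOnce(store *LockStore, key string, workers int) int {
	var wg sync.WaitGroup
	var mu sync.Mutex
	ran := 0
	now := time.Now()
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			owner := fmt.Sprintf("worker-%d", id)
			if store.TryAcquire(key, owner, time.Minute, now) {
				mu.Lock()
				ran++
				mu.Unlock()
			}
		}(i)
	}
	wg.Wait()
	return ran
}

func main() {
	store := NewLockStore()
	fmt.Println("workers that ran the task:", RunOnce(store, "nightly-job", 8))
}
```

The expiry matters as much as the atomic claim: if the lock holder crashes, the entry ages out and another worker can take over. The sketch also shows why the store behind the lock needs redundancy, since losing the store loses every lock at once.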
Defining Our Storage Requirements
A new key-value storage system needed to meet several critical criteria:
- Scalability: Must support seamless horizontal scaling with increasing workloads.
- Redundancy: Node failures must not cause data loss, and failover must be automatic.
- Performance: Must efficiently handle high update rates, especially for append-heavy workloads.
- Operational Simplicity: Should allow easy node additions and removals without data loss.
- Geo-Redundancy: Should support multi-data-center clustering without requiring costly commercial licenses.
Evaluating Alternative Key-Value Storage Solutions
With these requirements in mind, we evaluated several options:
- Redis: The incumbent solution, offering fast operations but requiring a commercial license for advanced clustering.
- ValKey: A relatively new Redis fork with promised performance gains, but we encountered clustering issues with it.
- DragonFly: Another high-performance Redis alternative, but it lacked built-in clustering.
- NATS: A distributed messaging system we already use extensively, making it a viable candidate.
Performance Testing and Observations
We built a custom Go-based performance testing client that simulated workloads with varying message sizes, concurrency levels, and request rates. Our testing infrastructure consisted of three Dell R730 servers running Kubernetes on Talos Linux.
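As a rough illustration of what such a client measures, here is a minimal Go sketch of a load generator that varies message size and concurrency and reports operations per second and mean time per operation. It is not the actual test client: the `Store` interface, `memStore`, `Run`, and `Result` are hypothetical names, the in-memory store stands in for a real Redis or NATS connection, and request-rate limiting is left out for brevity.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Store is a hypothetical minimal interface; a real Redis or NATS client
// would be wrapped behind it.
type Store interface {
	Put(key string, value []byte) error
}

// memStore is an in-memory stand-in so the sketch runs without a server.
type memStore struct {
	mu   sync.Mutex
	data map[string][]byte
}

func newMemStore() *memStore { return &memStore{data: make(map[string][]byte)} }

func (m *memStore) Put(key string, value []byte) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.data[key] = value
	return nil
}

// Result summarizes one benchmark run.
type Result struct {
	Ops          int           // successful operations
	Errors       int           // failed operations
	Elapsed      time.Duration // wall-clock time for the whole run
	TotalLatency time.Duration // sum of per-operation latencies
}

// OpsPerSecond is throughput over the wall-clock duration of the run.
func (r Result) OpsPerSecond() float64 {
	if r.Elapsed <= 0 {
		return 0
	}
	return float64(r.Ops) / r.Elapsed.Seconds()
}

// MillisPerOp is the mean latency of a single operation in milliseconds.
func (r Result) MillisPerOp() float64 {
	if r.Ops == 0 {
		return 0
	}
	return float64(r.TotalLatency) / float64(time.Millisecond) / float64(r.Ops)
}

// Run issues opsPerWorker writes of msgSize bytes from each of `workers`
// goroutines and aggregates the timings.
func Run(s Store, workers, opsPerWorker, msgSize int) Result {
	payload := make([]byte, msgSize)
	var wg sync.WaitGroup
	var mu sync.Mutex
	var res Result
	start := time.Now()
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			var lat time.Duration
			ok, failed := 0, 0
			for i := 0; i < opsPerWorker; i++ {
				key := fmt.Sprintf("w%d-k%d", w, i)
				t0 := time.Now()
				err := s.Put(key, payload)
				lat += time.Since(t0)
				if err != nil {
					failed++
				} else {
					ok++
				}
			}
			// Merge this worker's counters into the shared result.
			mu.Lock()
			res.Ops += ok
			res.Errors += failed
			res.TotalLatency += lat
			mu.Unlock()
		}(w)
	}
	wg.Wait()
	res.Elapsed = time.Since(start)
	return res
}

func main() {
	for _, size := range []int{64, 1024, 16384} {
		r := Run(newMemStore(), 8, 10000, size)
		fmt.Printf("size=%dB ops=%d errors=%d ops/s=%.0f mean ms/op=%.4f\n",
			size, r.Ops, r.Errors, r.OpsPerSecond(), r.MillisPerOp())
	}
}
```

Keeping the store behind an interface is what makes a like-for-like comparison possible: the same workload loop can be pointed at each candidate by swapping the implementation.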
[Charts: Comparing Operations Per Second; Comparing Time Per Operation (milliseconds)]
Key takeaways:
- Redis vs. DragonFly: Performance was nearly identical, but DragonFly’s lack of built-in clustering ruled it out.
- ValKey: Failed early in testing due to unresolved clustering issues.
- NATS: Performed at roughly half the speed of Redis in raw throughput but provided operational advantages.
The Decision: Why We Chose NATS
Despite lower raw performance, we selected NATS due to:
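- Infrastructure Consolidation: Standardizing on NATS, which we already run, reduces the number of separate infrastructure services we operate.
- Built-in Geo-Redundancy: Unlike Redis, which requires a commercial license for advanced clustering, NATS provides scalable multi-data-center clustering out of the box.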
- Resilience & Simplicity: NATS offers a streamlined, self-healing design aligned with our operational goals.
Future Considerations and Next Steps
As we integrate NATS, we will continue monitoring its real-world performance and scalability. Future optimizations include:
- Fine-tuning NATS configurations to maximize throughput.
- Investigating caching strategies to offset lower raw performance.
- Revisiting emerging alternatives as the landscape evolves.
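One caching strategy that could offset slower reads is a read-through cache with a time-to-live in front of the store. The Go sketch below shows the general shape under stated assumptions: `ReadThrough`, `Fetch`, and the counters are hypothetical names, the loader function stands in for a real store lookup, and nothing here reflects a decision we have made.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Fetch is a hypothetical loader that reads a key from the slower backing store.
type Fetch func(key string) (string, error)

type cacheEntry struct {
	value   string
	expires time.Time
}

// ReadThrough serves repeated reads from memory and falls back to fetch
// when a key is missing or its entry has expired.
type ReadThrough struct {
	mu      sync.Mutex
	ttl     time.Duration
	fetch   Fetch
	entries map[string]cacheEntry
	Hits    int
	Misses  int
}

func NewReadThrough(ttl time.Duration, fetch Fetch) *ReadThrough {
	return &ReadThrough{ttl: ttl, fetch: fetch, entries: make(map[string]cacheEntry)}
}

// Get returns the cached value if it is still fresh, otherwise loads it.
// The lock is held during the fetch, which keeps the sketch simple but
// serializes misses; a production cache would avoid that.
func (c *ReadThrough) Get(key string) (string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if e, ok := c.entries[key]; ok && time.Now().Before(e.expires) {
		c.Hits++
		return e.value, nil
	}
	c.Misses++
	v, err := c.fetch(key)
	if err != nil {
		return "", err
	}
	c.entries[key] = cacheEntry{value: v, expires: time.Now().Add(c.ttl)}
	return v, nil
}

func main() {
	backendCalls := 0
	cache := NewReadThrough(time.Minute, func(key string) (string, error) {
		backendCalls++ // stands in for a round trip to the store
		return "value-for-" + key, nil
	})
	for i := 0; i < 3; i++ {
		v, _ := cache.Get("device-42")
		fmt.Println(v)
	}
	fmt.Println("backend calls:", backendCalls, "hits:", cache.Hits, "misses:", cache.Misses)
}
```

The trade-off is staleness: a cached value can lag the store by up to the TTL, so this pattern suits read-heavy data better than locks or queue state.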
Conclusion
Our journey from Redis to NATS highlights the importance of proactive failure testing and infrastructure adaptability. While Redis remains a high-performance option, its limitations in clustering and geo-redundancy led us to consolidate around NATS.
At Siden, we embrace continuous learning and evolution. We will share further insights on optimizing distributed systems for large-scale content delivery. Stay tuned!
About Siden
Founded in 2018, Siden is at the forefront of revolutionizing connectivity in the aviation, maritime, and home broadband industries. Siden optimizes connectivity platforms through intelligent caching, enabling higher-quality content delivery while reducing network costs.