Transient Bottlenecks in Distributed Systems

Thu, 27 Feb 2025 00:00:00 +0000

Figure 1. A representative long call chain in the SocialNetwork microservices application. Sudden workload increases and resource contention are common factors that cause transient bottlenecks.

Maintaining consistently low response times is crucial for mission-critical, web-facing applications (e.g., e-commerce), which are typically implemented using distributed systems such as microservices architectures. Through extensive benchmarking of a microservices application in a cloud environment, we find that response time stability is fragile, exhibiting significant variations ranging from milliseconds to seconds.

Our detailed timeline analysis shows that even a millibottleneck (a bottleneck lasting less than a second) can trigger a queuing effect in a downstream service that propagates to upstream services, resulting in dropped requests and TCP retransmissions lasting several seconds at the weakest link in the chain.
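The propagation described above can be illustrated with a toy discrete-time model. This is only a sketch under stated assumptions, not the measurement setup behind the figures: it models two tiers with bounded queues, a constant arrival rate, and a downstream service that stops serving for a short interval. The function name `simulate` and all capacities and rates are hypothetical, and TCP retransmissions are not modeled.

```python
def simulate(stall_start, stall_ms, total_ms=3000,
             arrivals_per_ms=1, service_per_ms=2,
             up_cap=100, down_cap=50):
    """Toy two-tier chain stepped in 1 ms ticks (all parameters are assumptions).

    Returns (dropped_requests, time_of_first_drop_or_None).
    """
    up = down = dropped = 0
    first_drop = None
    for t in range(total_ms):
        # Downstream serves requests unless it is inside the millibottleneck.
        stalled = stall_start <= t < stall_start + stall_ms
        if not stalled:
            down -= min(down, service_per_ms)
        # Upstream forwards only as many requests as the downstream queue can hold.
        moved = min(up, down_cap - down)
        up -= moved
        down += moved
        # New arrivals are dropped once the upstream queue is full.
        for _ in range(arrivals_per_ms):
            if up < up_cap:
                up += 1
            else:
                dropped += 1
                if first_drop is None:
                    first_drop = t
    return dropped, first_drop


if __name__ == "__main__":
    for stall in (0, 100, 300):
        print(stall, simulate(stall_start=1000, stall_ms=stall))
```

In this model a short stall is absorbed by the queues, while a longer sub-second stall fills the downstream queue first, then the upstream queue, and only then produces drops at the upstream tier, some time after the stall began.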

Figure 2. An illustration showing how a single traffic burst triggers multiple waves of dropped requests and TCP retransmissions over a ten-second span, leading to significant response time fluctuations.

External bursty workloads occur when a microservice receives a sudden increase in requests, causing the system to become temporarily overloaded and response times to spike.

Here we show a representative 10-second snapshot, capturing each metric with fine-grained monitors using a 50ms time window. This figure demonstrates how a bursty workload induces substantial response time fluctuations. Notably, even very short resource saturation in a deep downstream microservice can significantly degrade performance, highlighting the critical impact of transient bottlenecks on system stability.
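As a rough illustration of fine-grained monitoring, the sketch below groups raw latency samples into fixed 50 ms windows and summarizes each window. The function name `bucket_by_window` and the `(timestamp_ms, latency_ms)` sample layout are assumptions made for this example, not the monitors used in the study.

```python
from collections import defaultdict


def bucket_by_window(samples, window_ms=50):
    """Group (timestamp_ms, latency_ms) samples into fixed-size time windows.

    Returns a dict mapping each window's start time to its request count,
    maximum latency, and mean latency.
    """
    buckets = defaultdict(list)
    for ts, latency in samples:
        # Integer division aligns every timestamp to the start of its window.
        buckets[(ts // window_ms) * window_ms].append(latency)
    return {
        start: {
            "count": len(values),
            "max_ms": max(values),
            "mean_ms": sum(values) / len(values),
        }
        for start, values in sorted(buckets.items())
    }


if __name__ == "__main__":
    # Hypothetical samples: one slow request lands in the second window.
    samples = [(0, 5), (49, 7), (50, 100), (120, 3)]
    for start, stats in bucket_by_window(samples).items():
        print(start, stats)
```

A window this small keeps a sub-second spike visible; averaging the same samples over one-second intervals would largely hide it.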

Attacking Microservices by Exploiting Execution Dependencies

Sun, 25 Feb 2024 00:00:00 +0000

Figure 1. Test performance interference between a pair of different requests to profile their execution dependencies.

Building on our understanding of execution dependencies in microservices, we propose a black-box approach that leverages legitimate HTTP requests to precisely profile the internal pairwise dependencies across all supported execution paths in the target microservices. The resulting profile shows that overloading just a few microservices can significantly degrade the performance of the entire system, revealing potential performance vulnerabilities within the microservices.

Figure 2. By exploiting execution dependencies in microservices, the Grunt attack triggers millibottlenecks alternately among different paths, causing persistent blocking effects and system-wide long response times.

To better understand performance vulnerabilities in microservices, we present the Grunt attack, a novel low-volume DDoS attack that exploits execution dependencies in microservice applications. By systematically grouping and characterizing execution paths based on their pairwise dependencies, the Grunt attack can target only a few well-selected execution paths to launch a low-volume DDoS attack that achieves substantial, widespread performance degradation of the system. To enhance stealth, the attacker avoids creating a persistent bottleneck by dynamically alternating target execution paths within their dependency group.
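The text does not specify how paths are grouped. One plausible reading, useful for operators who want to see which paths share a bottleneck, treats pairwise dependencies as edges in a graph and takes its connected components. The sketch below does this with a small union-find; the function name `group_paths` and the path names are hypothetical.

```python
def group_paths(paths, dependent_pairs):
    """Group execution paths into connected components of the dependency graph.

    `dependent_pairs` lists (path_a, path_b) tuples that were observed to
    interfere with each other. Using connected components is an assumption;
    the original work may group paths differently.
    """
    parent = {p: p for p in paths}

    def find(x):
        # Follow parent links to the root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in dependent_pairs:
        parent[find(a)] = find(b)

    groups = {}
    for p in paths:
        groups.setdefault(find(p), []).append(p)
    return sorted(sorted(g) for g in groups.values())


if __name__ == "__main__":
    # Hypothetical paths and measured pairwise dependencies.
    paths = ["/compose", "/home", "/login", "/search"]
    pairs = [("/compose", "/home"), ("/home", "/search")]
    print(group_paths(paths, pairs))
```

Paths that end up in the same group are candidates for shared capacity planning or isolation, since pressure on one can block the others.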

Figure 3. Conceptual illustration of cross-service queue blocking.

As a result, the Grunt attack consumes less than 20% additional CPU resources on the target system while increasing its average response time by more than 10x.