Optimizing Networking Costs in AWS EKS: Managing Cross-AZ Database Traffic

Managing AWS EKS clusters provides excellent scalability and reliability for running containerized applications. However, without careful cost monitoring, expenses can quickly add up. This article breaks down practical strategies to help you reduce AWS EKS costs while keeping your applications fast and dependable.

Recently, we faced a significant challenge with our AWS EKS setup: networking costs escalated due to cross-AZ (Availability Zone) traffic.

Our high-workload application, built using Node.js and PHP, processes a large number of concurrent requests. The dynamic nature of this workload often triggers auto-scaling events, further adding to the complexity of resource management.

The application interacts with Redis, MySQL, and MongoDB, all deployed as StatefulSets with PersistentVolumes (PVs) and PersistentVolumeClaims (PVCs). These databases are distributed across three AZs to ensure high availability and fault tolerance.

However, the frequent queries from the application combined with database traffic crossing AZ boundaries significantly contributed to increased networking costs.

This blog outlines the root causes, the solutions we explored, and actionable steps to optimize networking costs while maintaining high availability and resilience.

Breaking Down EKS Costs

When using AWS EKS, several cost components come into play:

1. EKS Management Fee:

AWS charges a flat fee of $0.10 per hour per cluster, which translates to approximately $73 per month per cluster, regardless of the size of the cluster. This fee covers the managed control plane, which includes components like the Kubernetes API server and etcd for managing cluster state.

2. EC2 Node Costs:

EKS requires worker nodes (EC2 instances) to run your workloads. The cost of these nodes depends on:

  • Instance type (e.g., t2.medium, m5.large).

  • Number of nodes in your cluster.

  • Uptime of these nodes.
    Additionally, costs can increase if you enable auto-scaling, as the cluster adds nodes dynamically based on workload demand.

3. Networking Costs:

Networking is amongst the most significant contributors to EKS costs, especially in setups using multiple AZs. Key contributors include:

  • Cross-AZ Data Transfer: AWS charges $0.01 per GB in each direction for data transfer between AZs, so every gigabyte that crosses a zone boundary effectively costs $0.02. For example, a service that pushes 5 TB across AZs in a month adds roughly $100 to the bill. This cost can quickly add up when running services like databases that frequently communicate across AZs.

  • Inter-VPC Traffic: If your EKS cluster interacts with other VPCs or external services, you’ll incur additional charges.

  • Elastic Load Balancer (ELB): Any ingress traffic into the cluster through an ELB incurs data processing and data transfer charges on top of the load balancer's hourly fee, and these are billed separately from EKS.

4. Persistent Storage Costs:

Stateful workloads, such as databases, rely on storage like Amazon EBS (Elastic Block Store) for data persistence. Costs depend on:

  • Volume size and type (e.g., General Purpose SSD, Provisioned IOPS).

  • Snapshot storage for backups.

  • IOPS charges for higher-performance workloads.

5. Additional Services:

  • CloudWatch: For logging and monitoring, AWS charges for log ingestion, storage, and data retrieval.

  • Data Transfer Out: Traffic leaving AWS (e.g., to the internet or external systems) is billed at $0.09 per GB for the first 10TB each month.

Tip: To manage DNS configurations with your AWS EKS, you can refer to our article on Deploying External DNS with AWS EKS, which explores efficient integration strategies for managing a large number of DNS records.

Solutions Explored

To address these issues, we considered multiple approaches:

1. Topology-Aware Service Routing

  • Objective: Ensure traffic remains within the same AZ whenever possible.

  • Implementation: Enable Topology Aware Routing in Kubernetes by annotating Services so traffic is preferentially routed to endpoints in the same AZ (the older Service Topology feature was removed in Kubernetes 1.22).

Benefits:

  • Reduced cross-AZ traffic for database queries.

  • Improved latency due to AZ-local traffic.

2. Caching Frequently Accessed Data

  • Objective: Reduce the frequency of repetitive queries to the database.

  • Implementation:

  • Integrate Redis as a cache layer for read-heavy operations.

  • Use application-level caching with TTLs for frequent queries (a minimal Redis cache configuration is sketched after this list).

Benefits:

  • Significantly reduced database queries.

  • Improved response times for cached data.
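
To make the idea concrete, here is a minimal sketch of a Redis cache configuration, assuming a hypothetical ConfigMap named redis-cache-config and placeholder sizing; the TTLs themselves are set by the Node.js/PHP code when it writes query results (for example with Redis's SET key value EX 300):

apiVersion: v1
kind: ConfigMap
metadata:
  name: redis-cache-config            # hypothetical name
data:
  redis.conf: |
    # Cap cache memory and evict least-recently-used keys when full
    maxmemory 2gb
    maxmemory-policy allkeys-lru
    # A pure cache layer does not need persistence
    save ""
    appendonly no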

3. Read Replica Configuration

  • Objective: Offload read traffic to replicas deployed in each AZ.

Implementation:

  • Configure read replicas for MySQL and MongoDB in all AZs.

  • Update the application to route read queries to the nearest replica (a per-AZ Service sketch follows this list).

Benefits:

  • Distributed read traffic.

  • Minimized cross-AZ reads for applications.
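
As a hedged sketch of the routing side, one per-AZ Service can expose only the replicas running in that zone, and each application instance is then pointed at the Service for its own AZ. The Service name and the role and zone labels below are illustrative choices, not a standard API:

apiVersion: v1
kind: Service
metadata:
  name: mysql-read-us-east-1a          # illustrative per-AZ read endpoint
spec:
  selector:
    app: mysql
    role: replica                      # assumes replica pods are labeled with their role
    zone: us-east-1a                   # assumes pods are labeled with their zone
  ports:
    - protocol: TCP
      port: 3306

Application pods in us-east-1a then read from mysql-read-us-east-1a, while writes continue to go to the primary.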

4. Monitor and Analyze Traffic

  • Objective: Identify high-cost traffic patterns and optimize them.

Implementation:

  • Use tools like eBPF (via Cilium or similar solutions) to monitor pod-level traffic.

  • Visualize traffic patterns with Prometheus and Grafana (a sample recording rule follows this list).

Benefits:

  • Insights into cross-AZ traffic contributors.

  • Better-informed optimization decisions.
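
As an illustration, a Prometheus recording rule can pre-aggregate per-pod transmit rates from the cAdvisor metric container_network_transmit_bytes_total, which a Grafana panel can then break down by namespace and pod (the group and record names are our own choices, not a standard):

groups:
  - name: pod-network-traffic
    rules:
      - record: namespace_pod:network_tx_bytes:rate5m
        # 5-minute average transmit rate per pod, from cAdvisor metrics
        expr: sum by (namespace, pod) (rate(container_network_transmit_bytes_total[5m]))

If you run the Prometheus Operator, the same group can be shipped as a PrometheusRule resource instead of a plain rule file.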

5. Auto-Scaling with Reserved Instances

  • Used Reserved Instances for predictable workloads to reduce EC2 costs.

  • For dynamic scaling, relied on Spot Instances with fallback to On-Demand Instances to balance cost and availability.
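
Reserved Instances are a billing-level commitment, so they require no manifest changes; the discount applies automatically to matching On-Demand usage. For the Spot/On-Demand mix, the sketch below shows roughly how a self-managed eksctl node group can keep a small On-Demand base and run everything above it on Spot (cluster name, sizes, and instance types are placeholders):

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster                     # placeholder
  region: us-east-1
nodeGroups:
  - name: mixed-workers
    minSize: 2
    maxSize: 10
    instancesDistribution:
      instanceTypes: ["m5.large", "m5a.large", "m4.large"]
      onDemandBaseCapacity: 2                    # always keep two On-Demand nodes
      onDemandPercentageAboveBaseCapacity: 0     # everything above the base runs on Spot
      spotAllocationStrategy: capacity-optimized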

6. Monitoring and Traffic Analysis

  • Deployed tools like Prometheus and Grafana to monitor cross-AZ traffic patterns.

  • Used eBPF-based tools to analyze network flows and identify expensive traffic routes.

  • Optimized Node.js and PHP query patterns based on insights.

7. Efficient EBS Volume Management

  • Migrated to GP3 volumes for better cost-performance balance.

  • Reduced snapshot frequency to save on storage costs while maintaining adequate backup intervals.
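
For reference, a gp3 StorageClass for the EBS CSI driver looks roughly like this; the throughput and IOPS values shown are the gp3 baseline, not our production settings:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  throughput: "125"                    # MiB/s; gp3 lets you tune this independently of size
  iops: "3000"
volumeBindingMode: WaitForFirstConsumer   # creates the volume in the same AZ as the pod
reclaimPolicy: Delete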

Implementation Example: Topology-Aware Routing

Here’s a quick implementation of topology-aware routing for a MySQL Stateful Set:

1. Annotate the Kubernetes Service:

apiVersion: v1
kind: Service
metadata:
  name: mysql
  annotations:
    # Topology Aware Routing (Kubernetes 1.27+); on 1.23-1.26 use
    # service.kubernetes.io/topology-aware-hints: "auto"
    service.kubernetes.io/topology-mode: "Auto"
spec:
  selector:
    app: mysql
  ports:
    - protocol: TCP
      port: 3306
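
Note: Topology Aware Routing only adds zone hints to the EndpointSlices when there are enough ready endpoints spread across the zones; with very few replicas, kube-proxy silently falls back to cluster-wide routing. It is worth checking that the EndpointSlices actually carry hints before assuming traffic stays AZ-local.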

2. Deploy Pods with Node Affinity:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mysql
spec:
  serviceName: mysql
  replicas: 1
  selector:
    matchLabels:
      app: mysql
  template:
    metadata:
      labels:
        app: mysql
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: topology.kubernetes.io/zone
                    operator: In
                    values:
                      - us-east-1a   # pin this replica (and its EBS volume) to one AZ
      containers:
        - name: mysql
          image: mysql:8.0
          ports:
            - containerPort: 3306

3. Migrate Architecture to 2 AZs:

As part of our optimization strategy, we decided to reduce the deployment replicas from three AZs to two AZs. This architectural shift was motivated by the need to:

  • Reduce cross-AZ networking costs: With three AZs, application pods and database instances often communicated across zones, leading to significant cross-AZ data transfer charges.

  • Simplify operational overhead: Managing resources across three AZs introduced complexity in terms of monitoring, scaling, and maintaining consistency.

Steps Taken During Migration:

Align Backend Applications and Databases in the Same AZs:

  • We grouped our Node.js and PHP application pods and their associated Redis, MySQL, and MongoDB databases within the same AZ.

  • This ensured that most traffic between application pods and databases remained within a single AZ, minimizing cross-AZ traffic.
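
A hedged sketch of that grouping: a preferred pod-affinity rule on the application Deployment asks the scheduler to place app pods in the same zone as the MySQL pods. The Deployment name, image, and labels are illustrative:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend-api                    # illustrative application Deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: backend-api
  template:
    metadata:
      labels:
        app: backend-api
    spec:
      affinity:
        podAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                topologyKey: topology.kubernetes.io/zone   # co-locate by AZ, not by node
                labelSelector:
                  matchLabels:
                    app: mysql                             # assumes the DB pods carry this label
      containers:
        - name: backend-api
          image: node:20-alpine                            # placeholder image
          ports:
            - containerPort: 3000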

Reconfigure StatefulSets for Zone Awareness:

  • The StatefulSets for MySQL and MongoDB were reconfigured with node affinity rules to limit pod placement to the chosen two AZs.

  • For example, the affinity configuration ensured MySQL replicas in us-east-1a were primarily accessed by application pods running in us-east-1a.

Redistribute Traffic:

  • Traffic patterns were adjusted to route requests to backend pods residing in the same AZ as the database.

  • This was achieved using Topology Aware Routing and per-AZ DNS names that prioritize instances in the local AZ.

Leverage Read Replicas for High Availability:

  • To maintain availability and fault tolerance after reducing AZs, read replicas were deployed in both AZs.

  • This setup ensured that each AZ could independently handle read-heavy traffic in case of a single AZ failure.

Benefits Achieved:

1. Significant Reduction in Networking Costs:

  • By ensuring backend pods and database instances resided in the same AZ, we reduced cross-AZ data transfer by over 60%. The remaining cross-AZ traffic was primarily for write replication between database nodes.

2. Improved Application Performance:

  • Latency decreased as the traffic between application pods and databases no longer crossed AZ boundaries, leading to faster query execution.

3. Optimized Resource Utilization:

  • Consolidating resources across two AZs allowed us to better utilize EC2 instances and scale horizontally within those zones without over-provisioning.

4. Resilient Architecture:

  • Despite reducing to two AZs, the use of read replicas and distributed services ensured that the application could still handle AZ failures effectively.

Key Considerations:

  • Trade-off Between Availability and Cost: While migrating to two AZs reduced costs, we carefully analyzed the risk of reduced fault tolerance compared to a three-AZ setup. For our workload, the benefits outweighed the risks, as our architecture still adhered to AWS’s high-availability guidelines.

  • Database Write Replication Traffic: Since database write operations still required cross-AZ replication to maintain consistency, we evaluated and optimized replication intervals and volumes.
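
As one hedged example of trimming replication volume, MySQL's row-based replication can log only the changed columns, and the replication stream can be compressed; the ConfigMap name is hypothetical, and the settings should be validated against your MySQL version (slave_compressed_protocol is deprecated in newer 8.0 releases):

apiVersion: v1
kind: ConfigMap
metadata:
  name: mysql-replication-tuning       # hypothetical name
data:
  my.cnf: |
    [mysqld]
    # Log only the primary key and changed columns for row-based replication
    binlog_row_image = MINIMAL
    # Compress the primary-to-replica replication stream
    slave_compressed_protocol = ON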

Verify Traffic Patterns:

After migration, we used Prometheus and Grafana to monitor key metrics such as:

  • network_tx_bytes: Measured overall traffic between pods and identified any lingering cross-AZ traffic.

  • request_latency: Ensured that application response times improved post-migration.

  • Custom Dashboards: Created dashboards to visualize traffic patterns, helping us confirm that most communication was now AZ-local.

This migration not only helped reduce costs but also streamlined our operational overhead while ensuring performance and reliability. By taking a systematic approach to reduce the number of AZs and optimize resource placement, we were able to maintain the high availability of our workloads without incurring unnecessary expenses.

Results

After implementing these solutions:

  • Cross-AZ traffic dropped by 40%, significantly lowering costs.

  • Application response times improved by 20% due to reduced latency.

  • The caching layer offloaded 60% of read-heavy operations from the databases.

Summary

In this article, we tackled the challenges of managing cross-AZ traffic costs in an AWS EKS cluster. By implementing topology-aware routing, caching frequently accessed data, and optimizing database configurations, we reduced networking costs and improved application performance.

Monitoring tools like eBPF, Prometheus, and Grafana played a crucial role in identifying and resolving traffic inefficiencies. These strategies can be a starting point for optimizing Kubernetes workloads, especially in multi-AZ deployments.

Developing scalable backend solutions requires a deep understanding of cloud-native architectures and DevOps best practices. It also demands broad expertise, robust tooling, and a collaborative approach with a leading DevOps service provider.

By partnering with experienced DevOps engineers, you can access a comprehensive suite of cloud and backend development services, leverage agile methodologies, and implement advanced optimization techniques tailored to your business needs.

If you’re facing similar challenges or have any questions, feel free to share your experience in the comments!
