Interview about a k8s setup at Amazon
- origin: This is a verbatim copy of the post, since Medium does everything it can to prevent saving a local copy.
- I Thought I Knew Kubernetes — Until I Met an Amazon L7 Interviewer | by F8010 | Feb, 2026 | Medium
About Kubernetes.
How a “simple” phone screen made me realize I was still playing in the shallow end of the orchestration pool
I still remember the exact moment my confidence evaporated. It was Tuesday, 9:47 AM, and I was sitting in my home office, freshly brewed coffee steaming beside my keyboard, ready to dazzle an Amazon L7 engineering manager with my Kubernetes prowess.
After all, I’d spent three years running production clusters. I’d debugged more CrashLoopBackOff states than I could count. I'd even given a talk at a local meetup about "Advanced K8s Patterns." I was ready.
Spoiler: I was not ready.
The Calm Before the Storm
The interview started innocently enough. We exchanged pleasantries. The interviewer — let’s call him David — had that calm, assessing demeanor that Amazon L7s are famous for. He started with the basics: “Walk me through how you’d design a highly available microservices platform on Kubernetes.”
Easy. I rattled off Deployments, Services, Horizontal Pod Autoscalers, maybe throw in a StatefulSet for the database. I mentioned Prometheus for monitoring, fluent-bit for logging. I was hitting all the notes.
David nodded. Then he leaned in.
“That’s a solid foundation,” he said. “But let’s say you’re running this across three regions, with compliance requirements that mandate data residency in the EU. Your payment service in us-east-1 needs to talk to a fraud detection service in eu-west-1, but with sub-100ms latency and end-to-end encryption. Also, you can’t use overlapping CIDR blocks because of a legacy acquisition. How do you architect the service discovery and traffic routing?”
I blinked.
“Uh… Services with ExternalNames?” I ventured.
David smiled. It was not a reassuring smile.
The Deep End: Multi-Cluster Service Mesh
What followed was a 45-minute masterclass in how much I didn’t know about production-grade Kubernetes at scale.
David wasn’t asking about basic Services. He was probing multi-cluster service mesh architectures, specifically how to handle service discovery, mTLS, and traffic management across geographically distributed clusters with non-overlapping networks.
“Most people stop at single-cluster Istio,” David explained. “But at our scale, you’re dealing with multi-network, multi-mesh federation. You need to decide: are you going flat network with Cilium Cluster Mesh, or are you using Istio’s multi-primary setup with east-west gateways? Do you replicate control planes or centralize them? Because if you centralize, you’ve created a massive blast radius — one control plane outage takes down global traffic management.”
I learned there are entire layers of complexity I’d never touched:
- Namespace sameness vs. multi-tenancy: Whether services in different clusters share namespace identities or remain isolated
- Service mesh federation: Connecting independent meshes across organizational boundaries while maintaining zero-trust policies
- The Kubernetes Multi-Cluster Services (MCS) API: The standardized way to export and import services across clusters using ServiceExport and ServiceImport resources
“Here’s a hint,” David said. “If you’re using Cilium, you annotate a service with service.cilium.io/global: "true". But if you’re using Istio, services are implicitly multi-cluster within the same trust domain. The APIs are completely different, and your choice locks you into operational patterns for years.”
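To make the two APIs David contrasted concrete for myself, I later wrote out what each one looks like as a manifest. This is a minimal sketch that builds the objects as Python dictionaries. The service name `fraud-detection`, the namespace `payments`, and port 8443 are my own hypothetical values, not anything from the interview.

```python
import json


def service_export(name: str, namespace: str) -> dict:
    """MCS API: a Service is exported by creating a ServiceExport
    with the same name and namespace as the Service itself."""
    return {
        "apiVersion": "multicluster.x-k8s.io/v1alpha1",
        "kind": "ServiceExport",
        "metadata": {"name": name, "namespace": namespace},
    }


def cilium_global_service(name: str, namespace: str, port: int) -> dict:
    """Cilium Cluster Mesh: an ordinary Service carrying the global
    annotation, defined with the same name/namespace in each cluster."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {
            "name": name,
            "namespace": namespace,
            "annotations": {"service.cilium.io/global": "true"},
        },
        "spec": {
            "selector": {"app": name},  # hypothetical label selector
            "ports": [{"port": port, "protocol": "TCP"}],
        },
    }


# Hypothetical names, for illustration only.
export = service_export("fraud-detection", "payments")
global_svc = cilium_global_service("fraud-detection", "payments", 8443)

print(json.dumps(export, indent=2))
print(json.dumps(global_svc, indent=2))
```

The difference is visible in the shape alone: the MCS API adds a separate resource next to the Service, while Cilium changes the Service's own metadata.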
I was taking notes furiously, ego bruised but fascinated.
The Operator Pattern Rabbit Hole
Just when I thought we’d exhausted networking, David pivoted.
“Okay, let’s talk about the database. You’re running a distributed SQL database across these clusters. How do you automate failover when a region goes down?”
“Custom scripts?” I guessed weakly.
“Operators,” David said. “But not just using them — building them.”
He explained that at Amazon’s scale, off-the-shelf Helm charts don’t cut it. You need Custom Resource Definitions (CRDs) with complex reconciliation loops that understand your application’s specific failure modes.
“Your operator needs to watch custom resources across multiple clusters, coordinate distributed consensus for failover decisions, and handle split-brain scenarios. It needs to integrate with the service mesh for traffic shifting during migrations. And it needs to do all this while respecting the Kubernetes control plane’s eventual consistency model.”
I realized I’d been treating Kubernetes like a deployment tool. David was talking about it as a distributed systems platform.
The Scheduling Problem That Broke My Brain
For his final act, David asked about a problem I’d never even considered:
“You have a machine learning training workload that needs 4 GPUs, 64GB of memory, and must be co-located with a specific dataset in eu-west-1. But you also have real-time inference pods that need guaranteed QoS with CPU pinning. Your cluster autoscaler keeps provisioning the wrong node types. How do you fix this?”
I mumbled something about node selectors and taints.
“You’re thinking too small,” David said. “You need custom schedulers. The default kube-scheduler uses a scoring algorithm that doesn’t understand your hardware topology or data locality. You need to extend the scheduler framework with plugins that implement Filter, Score, and Reserve phases. You need to write a scheduler extender or build a completely separate scheduling controller that understands GPU topology, NUMA nodes, and data gravity.”
He continued: “And that’s just scheduling. You also need to think about topology-aware routing: ensuring that traffic stays within availability zones when possible to avoid cross-AZ data transfer costs, but failing over gracefully when capacity is constrained. Kubernetes has had topologySpreadConstraints as a stable feature since 1.19, but they only control where pods land. Implementing them correctly across a multi-cluster fleet requires understanding the relationship between pod topology and service mesh locality load balancing.”
I was silent for a moment. Then I asked the question that had been building up: “How does anyone actually learn all this?”
Aftermath
David laughed — not unkindly. “You don’t learn it from tutorials. You learn it from failing at scale. The difference between a senior engineer and a staff engineer isn’t knowing the answers — it’s knowing which questions to ask when the system behaves unexpectedly.”
I didn’t get the job. But I got something better: a roadmap for what I didn’t know.
What I Learned (And What You Should Know)
If you’re interviewing for senior platform roles — or just want to stop feeling like an imposter when someone mentions “east-west traffic” — here are the deep topics worth mastering:
1. Multi-Cluster Networking Patterns
Don’t just know that multi-cluster exists. Understand the trade-offs between:
- Single network (flat): Pods can reach each other directly, but requires careful IPAM and creates larger blast radiuses
- Multi-network with gateways: More secure, but introduces latency and complexity in service discovery
- Service mesh federation: Istio’s multi-primary vs. remote cluster patterns, trust domains, and certificate management
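The “careful IPAM” part of the flat-network option is easy to check mechanically. Here is a small sketch, using only Python’s standard `ipaddress` module, that flags overlapping pod CIDRs before you try to connect clusters. The cluster names and address ranges are made up for illustration.

```python
import ipaddress
from itertools import combinations


def find_overlaps(cluster_cidrs: dict) -> list:
    """Return every pair of cluster names whose CIDR blocks overlap."""
    networks = {
        name: ipaddress.ip_network(cidr)
        for name, cidr in cluster_cidrs.items()
    }
    return [
        (a, b)
        for (a, net_a), (b, net_b) in combinations(networks.items(), 2)
        if net_a.overlaps(net_b)
    ]


# Hypothetical pod CIDRs; the acquired company's range sits inside eu-west-1's.
pod_cidrs = {
    "us-east-1": "10.0.0.0/16",
    "eu-west-1": "10.1.0.0/16",
    "acquired-co": "10.1.128.0/17",
}

overlaps = find_overlaps(pod_cidrs)
print(overlaps)  # [('eu-west-1', 'acquired-co')]
```

Any pair this reports cannot share a flat network without renumbering, which is usually what pushes a design toward gateways instead.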
2. Service Mesh at Scale
If you mention Istio in an interview, be ready to discuss:
- Ambient mode vs. sidecar architecture
- mTLS and identity federation across clusters
- Canary deployments using traffic splitting and fault injection
- The operational overhead of running a control plane that can handle thousands of services
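For the canary bullet, it helped me to see how traffic splitting and fault injection sit in the same Istio VirtualService. The sketch below builds one as a Python dictionary. The host name, the `stable` and `canary` subset names (which would need a matching DestinationRule), and the 2s delay are my assumptions. I also keep the weights summing to 100 as a convention, which not every Istio version strictly requires.

```python
def canary_virtual_service(host: str, stable_weight: int, canary_weight: int,
                           delay_percent: float = 0.0) -> dict:
    """Build a VirtualService that splits traffic between two subsets
    and optionally injects a fixed delay into a percentage of requests."""
    if stable_weight + canary_weight != 100:
        raise ValueError("route weights must sum to 100")

    http_route = {
        "route": [
            {"destination": {"host": host, "subset": "stable"},
             "weight": stable_weight},
            {"destination": {"host": host, "subset": "canary"},
             "weight": canary_weight},
        ]
    }
    if delay_percent > 0:
        # Fault injection: delay a share of requests to test timeouts.
        http_route["fault"] = {
            "delay": {
                "percentage": {"value": delay_percent},
                "fixedDelay": "2s",  # hypothetical delay value
            }
        }

    return {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "VirtualService",
        "metadata": {"name": f"{host}-canary"},
        "spec": {"hosts": [host], "http": [http_route]},
    }


# Send 10% of traffic to the canary and delay 5% of requests.
vs = canary_virtual_service("fraud-detection", 90, 10, delay_percent=5.0)
weights = [r["weight"] for r in vs["spec"]["http"][0]["route"]]
print(weights)  # [90, 10]
```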
3. Custom Operators and CRDs
Understand the reconciliation loop deeply. Know how to:
- Design CRDs that represent complex application state
- Handle multi-cluster reconciliation (yes, operators can watch resources across clusters)
- Implement finalizers for graceful shutdown
- Manage CRD versioning and migration
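The reconciliation loop and finalizers clicked for me once I wrote a toy version. This is a simplified, in-memory sketch, not a real operator: there is no API server, and the finalizer name, the resource shape, and the action strings are all hypothetical. What it does show is the level-triggered idea, where each call decides what to do from current state only.

```python
FINALIZER = "example.com/db-cleanup"  # hypothetical finalizer name


def reconcile(resource: dict, observed_replicas: int) -> list:
    """Compare desired state with observed state and return the actions
    needed to converge. Safe to call repeatedly (idempotent)."""
    meta = resource["metadata"]
    finalizers = meta.setdefault("finalizers", [])
    actions = []

    # Deletion requested: clean up external state, then release the
    # finalizer so the object can actually be removed.
    if meta.get("deletionTimestamp"):
        if FINALIZER in finalizers:
            actions.append("cleanup-external-state")
            finalizers.remove(FINALIZER)
            actions.append("remove-finalizer")
        return actions

    # Register the finalizer before doing any work that needs cleanup.
    if FINALIZER not in finalizers:
        finalizers.append(FINALIZER)
        actions.append("add-finalizer")

    desired = resource["spec"]["replicas"]
    if observed_replicas < desired:
        actions.append(f"scale-up:{desired - observed_replicas}")
    elif observed_replicas > desired:
        actions.append(f"scale-down:{observed_replicas - desired}")
    return actions


db = {"metadata": {"name": "orders-db"}, "spec": {"replicas": 3}}

first = reconcile(db, observed_replicas=1)   # converging
second = reconcile(db, observed_replicas=3)  # already converged
db["metadata"]["deletionTimestamp"] = "2026-02-01T00:00:00Z"
third = reconcile(db, observed_replicas=3)   # graceful shutdown

print(first)   # ['add-finalizer', 'scale-up:2']
print(second)  # []
print(third)   # ['cleanup-external-state', 'remove-finalizer']
```

The empty second result is the point: a reconcile that finds nothing to do must do nothing, otherwise the controller never settles.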
4. Advanced Scheduling
Beyond nodeSelector and affinity:
- Writing scheduler plugins using the Kubernetes Scheduling Framework
- Topology-aware provisioning and routing
- GPU and specialized hardware scheduling
- Custom resource models for heterogeneous clusters
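Real scheduler plugins are written in Go against the Scheduling Framework, but the Filter and Score phases are simple enough to model in a few lines. The sketch below is my own toy version in Python. The node fields, the scoring weights, and the node names are assumptions chosen to mirror David’s GPU and data-locality example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    region: str
    free_gpus: int
    free_memory_gib: int
    has_dataset: bool  # is the training dataset local to this node?


@dataclass
class PodRequest:
    gpus: int
    memory_gib: int
    region: str


def filter_node(pod: PodRequest, node: Node) -> bool:
    """Filter phase: hard constraints. A node either fits or it does not."""
    return (
        node.region == pod.region
        and node.free_gpus >= pod.gpus
        and node.free_memory_gib >= pod.memory_gib
    )


def score_node(pod: PodRequest, node: Node) -> int:
    """Score phase: soft preferences among the feasible nodes."""
    score = 100 if node.has_dataset else 0  # reward data locality
    score -= node.free_gpus - pod.gpus      # prefer tight GPU packing
    return score


def schedule(pod: PodRequest, nodes: List[Node]) -> Optional[str]:
    feasible = [n for n in nodes if filter_node(pod, n)]
    if not feasible:
        return None  # pod stays pending
    return max(feasible, key=lambda n: score_node(pod, n)).name


nodes = [
    Node("gpu-a", "us-east-1", 8, 256, False),  # wrong region
    Node("gpu-b", "eu-west-1", 8, 128, False),  # fits, no dataset
    Node("gpu-c", "eu-west-1", 4, 64, True),    # fits, dataset is local
    Node("gpu-d", "eu-west-1", 2, 64, True),    # too few GPUs
]
training_job = PodRequest(gpus=4, memory_gib=64, region="eu-west-1")

chosen = schedule(training_job, nodes)
print(chosen)  # gpu-c
```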
5. The Control Plane’s Dark Corners
Know how etcd stores state, how the API server handles admission webhooks, and how to debug a controller that’s stuck in a hot loop. Understand that everything in Kubernetes is eventually consistent, and design for that reality.
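One concrete way to design for that reality is optimistic concurrency: read an object, change it, and write it back only if nobody else changed it in between. The sketch below fakes that with an in-memory store. `FakeAPIServer` and the integer `resourceVersion` are simplifications I made up for illustration. In a real cluster the version is an opaque string and a stale write comes back as an HTTP 409 Conflict.

```python
class Conflict(Exception):
    """Raised when a write carries a stale resourceVersion."""


class FakeAPIServer:
    """Tiny in-memory stand-in for the API server (illustration only)."""

    def __init__(self):
        self._objects = {}

    def create(self, name, spec):
        self._objects[name] = {"resourceVersion": 1, "spec": dict(spec)}

    def get(self, name):
        obj = self._objects[name]
        return {"resourceVersion": obj["resourceVersion"],
                "spec": dict(obj["spec"])}

    def update(self, name, obj):
        current = self._objects[name]
        if obj["resourceVersion"] != current["resourceVersion"]:
            raise Conflict(name)
        self._objects[name] = {
            "resourceVersion": current["resourceVersion"] + 1,
            "spec": dict(obj["spec"]),
        }


def update_with_retry(api, name, mutate, max_attempts=5, before_write=None):
    """Re-read and re-apply the change on every conflict, with a bounded
    number of attempts so a busy object cannot cause a hot loop."""
    for attempt in range(1, max_attempts + 1):
        obj = api.get(name)
        mutate(obj["spec"])
        if before_write:
            before_write(attempt)  # test hook to simulate a racing writer
        try:
            api.update(name, obj)
            return attempt
        except Conflict:
            continue
    raise RuntimeError(f"gave up updating {name} after {max_attempts} attempts")


api = FakeAPIServer()
api.create("orders-db", {"replicas": 3, "paused": False})


def racing_writer(attempt):
    # Another controller sneaks in a write during our first attempt.
    if attempt == 1:
        other = api.get("orders-db")
        other["spec"]["paused"] = True
        api.update("orders-db", other)


attempts = update_with_retry(
    api, "orders-db",
    mutate=lambda spec: spec.update(replicas=5),
    before_write=racing_writer,
)
final = api.get("orders-db")
print(attempts, final)
```

Because the retry re-reads before re-applying, the other controller’s `paused` change survives alongside ours instead of being overwritten.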
The Humility Check
Three years of running Kubernetes in production, and I’d barely scratched the surface. I’d optimized for the “happy path” — deploying apps, scaling them, keeping them running. I’d never had to design for the “sad path” of multi-region failover, or the “complex path” of federated identity across organizational boundaries.
The Amazon interview was brutal. It was also the best technical conversation I’ve had in years. It taught me that Kubernetes isn’t just a tool — it’s a platform for building platforms. And building platforms requires thinking in systems, trade-offs, and failure modes.
If you’re preparing for a senior Kubernetes interview, don’t just study the docs. Study the architectural decisions that went into them. Ask yourself: why does the Multi-Cluster Services API use ServiceExport instead of just extending the Service resource? Why does Istio support both single and multi-network modes?
The answers lie in the tension between simplicity and power, between local consistency and global availability, between what Kubernetes is and what we need it to become.
I’m still learning. But now I know what I don’t know — and that’s a start.