
Senior Kafka Platform Engineer

Balyasny Asset Management L.P.
  • Chicago, Illinois, United States

About

We're seeking a seasoned Kafka engineer to design, operate, and scale our event streaming platform. You'll own the Kafka core (brokers, storage, security, observability) and the automation that powers it: infrastructure-as-code, operators and Helm charts, and CI/CD pipelines that enable safe, self-service provisioning. You'll run Kafka on Kubernetes and/or cloud-managed offerings, ensure reliability and performance, and partner with application teams on best practices.

What you'll do

  • Architect, deploy, and operate production-grade Kafka clusters (self-managed and/or Confluent/MSK), including upgrades, capacity planning, multi-AZ/region DR, and performance tuning.
  • Run Kafka on Kubernetes using Operators (e.g., Strimzi or Confluent for Kubernetes), Helm, and GitOps; manage StatefulSets, storage, PDBs, affinities, and rolling strategies.
  • Build and maintain automation infrastructure: Terraform/Helm modules, CI/CD pipelines, policy-as-code, and guardrails for repeatable, compliant Kafka provisioning.
  • Implement and manage Kafka Connect, Schema Registry, and MirrorMaker 2/Cluster Linking; standardize connectors (e.g., Debezium) and build self-service patterns.
  • Drive reliability: define SLOs/error budgets, on-call rotations, incident response, postmortems, runbooks, and automated remediation.
  • Implement observability: metrics, logs, traces, lag monitoring, and capacity dashboards (e.g., Prometheus/Grafana, Burrow, Cruise Control, OpenTelemetry).
  • Secure the platform: TLS/mTLS, SASL (OAuth/SCRAM), RBAC/ACLs, secrets management, network policies, audit, and compliance automation.
  • Guide event-streaming best practices: topic design, partitioning, compaction/retention, idempotency, ordering, schema evolution/compatibility, DLQs, EOS semantics.
  • Partner with app, data, and SRE teams; provide enablement, documentation, and internal tooling for a great developer experience.
  • Lead/mentor engineers and contribute to roadmap, standards, and platform strategy (including ZooKeeper-to-KRaft migrations where applicable).
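To make the "guardrails for repeatable, compliant Kafka provisioning" responsibility concrete, a policy check of this kind can be sketched as a validation step in a self-service topic-provisioning pipeline. The limits and naming convention below are purely illustrative assumptions, not an actual policy:

```python
# Hypothetical policy-as-code guardrail for self-service topic provisioning.
# All limits and the naming convention are illustrative assumptions.

MAX_PARTITIONS = 120
MIN_REPLICATION_FACTOR = 3  # needed for multi-AZ durability
ALLOWED_CLEANUP_POLICIES = {"delete", "compact", "compact,delete"}

def validate_topic_request(name: str, partitions: int,
                           replication_factor: int,
                           cleanup_policy: str = "delete") -> list[str]:
    """Return a list of policy violations; an empty list means the request passes."""
    violations = []
    # Enforce a team-prefixed naming convention, e.g. "risk.trades.v1".
    if not name or "." not in name:
        violations.append(f"topic '{name}' must follow <team>.<dataset> naming")
    if not 1 <= partitions <= MAX_PARTITIONS:
        violations.append(f"partitions={partitions} outside 1..{MAX_PARTITIONS}")
    if replication_factor < MIN_REPLICATION_FACTOR:
        violations.append(f"replication.factor={replication_factor} below "
                          f"minimum {MIN_REPLICATION_FACTOR}")
    if cleanup_policy not in ALLOWED_CLEANUP_POLICIES:
        violations.append(f"cleanup.policy='{cleanup_policy}' not allowed")
    return violations

if __name__ == "__main__":
    print(validate_topic_request("risk.trades.v1", 12, 3))       # passes: []
    print(validate_topic_request("scratch", 500, 1, "truncate")) # 4 violations
```

In practice checks like this would run in CI (or as an admission policy via OPA/Gatekeeper) before Terraform or an operator applies the topic resource.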

Required qualifications

  • Deep hands-on experience operating Kafka in production at scale (brokers, controllers, partitions, ISR, tiered storage/retention, rebalancing, replication, recovery).
  • Strong Kubernetes expertise running stateful systems: storage classes, StatefulSets, node/pod tuning, PodDisruptionBudgets, topology spread, network policies.
  • Automation first: Infrastructure as Code (Terraform), Helm, Operators, GitOps (Argo CD/Flux), and CI/CD (e.g., GitHub Actions/Jenkins) for platform lifecycle.
  • Proficiency with one or more languages for tooling/automation: Python, Go, or Java; plus Bash and solid Linux fundamentals (networking, filesystems, JVM tuning basics).
  • Observability and reliability engineering for Kafka: Prometheus/Grafana, logging, alerting, lag monitoring, capacity/throughput modeling, performance tuning.
  • Security for data in motion: TLS/mTLS, SASL/OAuth, ACL/RBAC, secrets management (e.g., Vault), and audit/compliance practices.
  • Experience with Kafka ecosystem components: Kafka Connect, Schema Registry, MirrorMaker 2/Cluster Linking; familiarity with Cruise Control.
  • Cloud experience (AWS/Azure/GCP) with networking, IAM, and one or more managed offerings (e.g., Confluent Cloud or AWS MSK).
  • Proven track record designing runbooks, leading incidents/postmortems, and driving platform roadmaps.
  • Excellent communication and partnership skills with platform and application teams.
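The capacity/throughput modeling called for above often reduces to simple arithmetic: partition count sized from target producer throughput, and cluster storage from ingest rate, retention window, and replication factor. A rough sketch, where the per-partition write rate is an assumed figure that should be benchmarked on real hardware:

```python
import math

def required_partitions(target_mb_per_s: float,
                        per_partition_mb_per_s: float = 10.0) -> int:
    """Partitions needed to sustain a target producer throughput.

    per_partition_mb_per_s is an assumed sustainable write rate per
    partition; measure it for your brokers before relying on it.
    """
    return max(1, math.ceil(target_mb_per_s / per_partition_mb_per_s))

def retained_gb(ingest_mb_per_s: float, retention_hours: float,
                replication_factor: int = 3) -> float:
    """Total cluster storage in GB for time-based retention, pre-compression."""
    seconds = retention_hours * 3600
    return ingest_mb_per_s * seconds * replication_factor / 1024

if __name__ == "__main__":
    # 250 MB/s target at ~10 MB/s per partition -> 25 partitions.
    print(required_partitions(250.0))
    # 50 MB/s ingest, 72 h retention, RF=3 -> ~37,969 GB across the cluster.
    print(round(retained_gb(50.0, 72), 1))
```

Real sizing also budgets headroom for broker loss, rebalancing traffic, and consumer re-reads, but back-of-envelope models like this anchor the conversation with application teams.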

Nice to have

  • Experience migrating ZK-based clusters to KRaft and/or cross-cluster replication designs.
  • Data processing frameworks (Kafka Streams, Flink, Spark Structured Streaming) and EOS semantics.
  • Policy-as-code (OPA/Gatekeeper), secrets rotation automation, and compliance-as-code.
  • Experience with Strimzi or Confluent for Kubernetes in production.
  • Knowledge of CDC patterns and tools (e.g., Debezium) and database connectors at scale.
  • Multi-region architectures, cluster linking strategies, and disaster recovery drills.
  • Service mesh familiarity (mTLS, ingress/egress controls) and advanced network tuning.

Languages

  • English

Notice for Users

This job comes from a TieTalent partner platform. Click "Apply Now" to submit your application directly on their site.