Running Guidewire InsuranceSuite on Kubernetes

A Guidewire-to-AKS Migration Case Study

The vendor says don't. We did it anyway. All four centers of Guidewire InsuranceSuite, monolithic Java applications designed for VMs and not officially supported on Kubernetes, now run on AKS alongside 90+ microservices. Zero policy transactions lost. This is how, and what nearly broke.

The System We Inherited

When I took ownership of Beesafe's infrastructure, Guidewire InsuranceSuite was running exactly the way it had been running for years: four monolithic Java applications deployed on dedicated Windows Server VMs, managed through SSH and RDP sessions, updated quarterly through a process that involved printed runbooks and prayer.

The four centers each served a distinct purpose in the insurance lifecycle. PolicyCenter handled core policy management — quotes, binds, renewals, endorsements. ClaimCenter processed claims from first notice of loss through settlement. BillingCenter managed payment schedules, invoicing, and commission calculations. ContactManager maintained customer and agent relationship data across all three operational centers.

Each center was its own universe of complexity: 8-16 GB JVM heap sizes, deep MSSQL dependencies with stored procedures and SQL Agent jobs, local file storage for document generation, sticky session requirements for the web tier, and long-running batch jobs — nightly policy renewal processing alone ran for four hours or more.

Meanwhile, the rest of Beesafe's engineering team was building modern microservices. Over 90 services were running on Azure Kubernetes Service. We had GitOps with ArgoCD, full observability with Prometheus and Grafana, automated deployments dozens of times per day. But Guidewire sat in its own world. Two platforms. Two deployment processes. Two monitoring systems. Two sets of operational knowledge. Double the overhead for a single infrastructure engineer.

The business pressure was real. Poland's insurance regulator, KNF, conducts rigorous audits. Having two fundamentally different infrastructure paradigms doubled our compliance surface. The microservice count was growing. Something had to give.

Figure: the two platforms before migration. Dedicated VMs: PolicyCenter, ClaimCenter, BillingCenter, ContactManager, MSSQL (VM). Azure AKS: 90+ microservices, ArgoCD / GitOps, Prometheus / Grafana.

The Decisions That Shaped Everything

Migrating Guidewire to Kubernetes is not something you find a guide for. The vendor's official position ranges from "not supported" to "talk to your account representative." There's no Helm chart, no reference architecture, no community blog posts. Every architectural decision was a bet, and getting any of them wrong meant rolling back to VMs with months of work lost.

Four decisions defined the trajectory of the entire project.

Decision 1: Azure SQL Managed Instance, Not Azure SQL Database

Context

Guidewire relies heavily on MSSQL-specific features for batch processing: Service Broker for async messaging between centers, SQL Agent jobs for scheduled maintenance tasks, CLR integrations for custom business logic. Azure SQL Database doesn't support these features.

Decision

Azure SQL Managed Instance — full SQL Server compatibility in a managed service. No need to rewrite Guidewire's database layer.

Tradeoff

Roughly 3x the cost of Azure SQL Database. VNet integration with AKS required private endpoints and custom CoreDNS configuration to resolve the managed instance FQDN from inside pods. Weeks of networking work before writing a single deployment manifest.
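
To make the DNS part concrete, here is a minimal sketch of the kind of CoreDNS override AKS supports through its coredns-custom ConfigMap. The zone, the key name sqlmi.server, and the forwarder address 10.0.0.10 are illustrative assumptions, not the values we used.

```yaml
# AKS reads extra CoreDNS server blocks from this ConfigMap in kube-system.
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns-custom
  namespace: kube-system
data:
  # The key must end in ".server" to be loaded as a CoreDNS server block.
  sqlmi.server: |
    database.windows.net:53 {
        errors
        cache 30
        # 10.0.0.10 is a placeholder for a resolver inside the VNet that
        # can answer for the managed instance's private DNS zone.
        forward . 10.0.0.10
    }
```

After applying a change like this, the CoreDNS pods need a restart to pick it up, for example with kubectl -n kube-system rollout restart deployment coredns.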

Decision 2: StatefulSets with Tiered Storage, Not Ephemeral Pods

Context

All four Guidewire centers write to local disk: temporary files during policy rating calculations, generated PDF documents, batch job output files, staging data for inter-center communication. Treating these as stateless Deployments was not an option.

Decision

StatefulSets backed by Azure Premium SSD Persistent Volume Claims for hot data (active processing), Azure Files NFS shares for warm data (generated documents, batch output that other systems consume).

Tradeoff

NFS on Azure Files has real latency issues under high IOPS — we hit this during nightly batch runs. Pod rescheduling is slow because disk attach/detach takes 30-60 seconds. If a node goes down, recovery is minutes, not seconds.
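
The tiering can be sketched roughly as follows. This assumes a per-pod Premium SSD volume created from the built-in AKS storage class managed-csi-premium and one shared Azure Files NFS claim. The storage class name azurefile-nfs-premium, the claim name guidewire-documents, the mount paths, the sizes, and the image are all hypothetical.

```yaml
# Storage class for NFS-protocol Azure Files shares (name is hypothetical).
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: azurefile-nfs-premium
provisioner: file.csi.azure.com
parameters:
  protocol: nfs
  skuName: Premium_LRS
---
# Warm tier: one shared claim that several pods and consumers can mount.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: guidewire-documents
spec:
  accessModes: ["ReadWriteMany"]
  storageClassName: azurefile-nfs-premium
  resources: { requests: { storage: 500Gi } }   # size is a placeholder
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: policycenter
spec:
  serviceName: policycenter
  replicas: 3
  selector:
    matchLabels: { app: policycenter }
  template:
    metadata:
      labels: { app: policycenter }
    spec:
      containers:
        - name: policycenter
          image: registry.example.com/guidewire/policycenter:placeholder
          volumeMounts:
            - { name: hot, mountPath: /var/guidewire/work }        # hypothetical path
            - { name: warm, mountPath: /var/guidewire/documents }  # hypothetical path
      volumes:
        - name: warm
          persistentVolumeClaim: { claimName: guidewire-documents }
  # Hot tier: each replica gets its own Premium SSD disk.
  volumeClaimTemplates:
    - metadata: { name: hot }
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: managed-csi-premium
        resources: { requests: { storage: 128Gi } }   # size is a placeholder
```

The ReadWriteOnce disk is what makes rescheduling slow: it has to detach from the old node and attach to the new one before the pod can start.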

Decision 3: Envoy Sidecar for Session Affinity, Not K8s Native

Context

PolicyCenter and ClaimCenter require session affinity for their web tier. K8s native sessionAffinity: ClientIP is unreliable when traffic comes through Cloudflare WAF — the source IP is Cloudflare's, not the end user's.

Decision

Envoy sidecar proxy with cookie-based session affinity keyed on Guidewire's JSESSIONID. Also gave us circuit breaking, retry logic, and per-route observability for free.

Tradeoff

Debuggability suffers with five layers of routing: Cloudflare → Azure Front Door → NGINX Ingress → Envoy sidecar → Guidewire. When a request fails, tracing which layer caused the issue requires correlating logs across all five. Custom Envoy filter configuration for JSESSIONID extraction is poorly documented — took several iterations to get right.

Decision 4: Blue-Green Deployment with Manual Promotion Gates

Context

Canary deployments are too risky for insurance workloads. If the canary is calculating premiums differently from the stable version, some customers get incorrect quotes. That's a regulatory violation, not just a bad user experience.

Decision

Blue-green deployments via Argo Rollouts. The green environment runs a full synthetic test suite (policy create, quote, bind, payment) before a human promotes it to production.

Tradeoff

Double infrastructure cost during deployment windows. A deploy that could take 5 minutes now takes 45 minutes with validation. But zero policy transactions lost during any deployment since we started.
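
A manual promotion gate of this kind can be expressed in an Argo Rollouts manifest roughly like the sketch below. The service names, image, and scale-down delay are hypothetical. A Rollout manages ReplicaSets, and how that was combined with the StatefulSet-based storage described earlier is not covered here, so the pod template is a bare placeholder.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: policycenter
spec:
  replicas: 3
  selector:
    matchLabels: { app: policycenter }
  template:
    metadata:
      labels: { app: policycenter }
    spec:
      containers:
        - name: policycenter
          image: registry.example.com/guidewire/policycenter:placeholder
  strategy:
    blueGreen:
      activeService: policycenter-active     # receives production traffic
      previewService: policycenter-preview   # target for the synthetic suite
      autoPromotionEnabled: false            # wait for a human to promote
      scaleDownDelaySeconds: 600             # keep blue around briefly (placeholder)
```

With autoPromotionEnabled set to false, the new version stays behind the preview service until someone promotes it, for example with kubectl argo rollouts promote policycenter.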

Migration approach comparison

Criterion               | VM (status quo)    | Standard K8s | Hybrid K8s (chosen)
Guidewire compatibility | Full               | Impossible   | Custom StatefulSets
Deploy frequency        | Quarterly          | N/A          | 2x/month
Operational overhead    | High (2 platforms) | N/A          | Medium (1 platform)
Cost                    | Baseline           | N/A          | -25% after VM decommission

Making It Work

JVM Tuning in Containers

PolicyCenter is the largest of the four centers. In production, it needs a 12 GB JVM heap. Setting the container memory limit to 12 GB and calling it a day will get you OOMKilled pods — I learned this the fast way.

A JVM inside a container doesn't just consume heap. You need to account for metaspace (class metadata, typically 256-512 MB for Guidewire), native memory allocations, thread stacks (each at ~1 MB, and Guidewire spawns hundreds of threads), and the OS-level page cache. Guidewire's own tuning guides assume a dedicated VM where the JVM is the only significant process. In a container, you share the kernel with the node, and the memory limit is hard.

The flag -XX:MaxRAMPercentage=75 tells the JVM to use 75% of the container's memory limit for heap, leaving 25% for everything else. For PolicyCenter with a 16 GB container limit, that's 12 GB heap and 4 GB headroom. It sounds generous until you realize Guidewire's class loading alone consumes close to 500 MB of metaspace.

# PolicyCenter StatefulSet — resource configuration
resources:
  requests:
    memory: "14Gi"    # guaranteed allocation
    cpu: "4"
  limits:
    memory: "16Gi"    # hard ceiling — OOMKill above this
    cpu: "8"         # burst allowed during batch jobs
env:
  - name: JAVA_OPTS
    value: "-XX:MaxRAMPercentage=75 -XX:+UseG1GC -XX:MaxGCPauseMillis=200"

The Sidecar Integration Pattern

The hardest constraint with Guidewire is that you cannot meaningfully modify its codebase. It's vendor software with a proprietary build system. Every customization you make increases the cost and risk of future upgrades. The moment you start adding Kafka producers or gRPC clients into Guidewire's Java code, you've created an upgrade nightmare.

The sidecar pattern solved this cleanly. Each Guidewire pod runs the main Guidewire container plus three sidecar containers: an event publisher that tails Guidewire's application log and publishes structured events to Kafka, a Prometheus exporter that scrapes Guidewire's JMX metrics endpoint and exposes them in Prometheus format, and a Vault agent that injects database credentials and API keys into a shared volume that Guidewire reads as flat files.

From Guidewire's perspective, nothing changed. It writes logs to stdout. It exposes JMX on localhost. It reads config files from a directory. It doesn't know it's in a container, let alone that three other containers are orbiting it. This preserves full upgradeability — when Guidewire ships a new version, we swap the main container image and the sidecars keep working.
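
The pod layout described above might look something like this fragment of the StatefulSet's pod template. Every image name, port, and path is a placeholder, and the shared log volume assumes the application log is also available as a file the publisher can tail.

```yaml
# Fragment of spec.template.spec: one main container plus three sidecars.
containers:
  - name: policycenter
    image: registry.example.com/guidewire/policycenter:placeholder
    volumeMounts:
      - { name: app-logs, mountPath: /var/log/guidewire }
      - { name: secrets, mountPath: /etc/guidewire/secrets, readOnly: true }
  - name: event-publisher            # tails the log, publishes events to Kafka
    image: registry.example.com/platform/gw-event-publisher:placeholder
    volumeMounts:
      - { name: app-logs, mountPath: /var/log/guidewire, readOnly: true }
  - name: jmx-exporter               # reads JMX on localhost, serves /metrics
    image: registry.example.com/platform/jmx-exporter:placeholder
    ports:
      - { name: metrics, containerPort: 9404 }   # port is a placeholder
  - name: vault-agent                # renders credentials as flat files
    image: registry.example.com/platform/vault-agent:placeholder
    volumeMounts:
      - { name: secrets, mountPath: /etc/guidewire/secrets }
volumes:
  - name: app-logs
    emptyDir: {}
  - name: secrets
    emptyDir: { medium: Memory }     # keep credentials off the node's disk
```

Containers in a pod share a network namespace, which is why the exporter can reach JMX on localhost without Guidewire exposing it outside the pod.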

Database Migration Orchestration

Guidewire upgrades are database-first. The new application version expects a new schema, and the old application version cannot run against the new schema. There's no backward compatibility window. The migration must succeed completely before the application starts, or everything breaks.

ArgoCD sync waves made this manageable. Wave 0 runs a Kubernetes Job that executes Guidewire's database migration tooling against the managed instance. Wave 1 deploys the updated StatefulSet. If the migration Job fails, ArgoCD halts the sync — the old StatefulSet keeps running against the old schema, and nothing is broken. We get an alert, investigate, fix, and retry.
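
The ordering relies on the argocd.argoproj.io/sync-wave annotation. A minimal sketch, with a hypothetical Job name, image, and migration command standing in for Guidewire's actual tooling:

```yaml
# Wave 0: schema migration must finish successfully first.
apiVersion: batch/v1
kind: Job
metadata:
  name: policycenter-db-migrate
  annotations:
    argocd.argoproj.io/sync-wave: "0"
spec:
  backoffLimit: 0                  # fail fast, no automatic retries
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: registry.example.com/guidewire/policycenter:placeholder
          command: ["/opt/guidewire/bin/run-db-upgrade.sh"]   # hypothetical
---
# Wave 1: the application is only synced after wave 0 is healthy.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: policycenter
  annotations:
    argocd.argoproj.io/sync-wave: "1"
# spec omitted for brevity
```

Because a Job's pod template is immutable once created, a real setup also has to decide how the Job is recreated for each release. That detail is left out of this sketch.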

The other critical piece was startup probes. Guidewire applications take 3-5 minutes to start. They load thousands of classes, build in-memory caches, warm connection pools, and run self-diagnostic checks. A standard liveness probe with a 30-second timeout would kill the pod repeatedly, creating an infinite restart loop. We configured startup probes with a failureThreshold of 40 and a periodSeconds of 10 — giving each pod up to ~7 minutes to become ready before Kubernetes considers it failed.
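
In manifest form, those probe settings look roughly like this. The health path and port are hypothetical, and the liveness values are illustrative. Only failureThreshold and periodSeconds on the startup probe come from our configuration.

```yaml
# Container-level probes for a slow-starting Guidewire center.
startupProbe:
  httpGet:
    path: /pc/ping          # hypothetical health endpoint
    port: 8080              # hypothetical port
  periodSeconds: 10
  failureThreshold: 40      # 40 x 10 s = 400 s startup budget
livenessProbe:
  httpGet:
    path: /pc/ping
    port: 8080
  periodSeconds: 30         # illustrative; only runs after startup succeeds
  timeoutSeconds: 5
```

Kubernetes holds off the liveness probe until the startup probe has succeeded once, so the long startup budget does not weaken failure detection afterwards.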

The Incident We Didn't Expect

Production Incident — Connection Pool Exhaustion

During the first production rolling update of PolicyCenter, we triggered a cascading failure that took down database connectivity for all four Guidewire centers simultaneously.

The scenario was straightforward in hindsight. PolicyCenter runs three replicas. During a rolling update with default settings, Kubernetes was restarting all three pods in rapid succession. Each pod, on startup, attempts to establish its full connection pool — 50 database connections per instance. Three pods starting simultaneously means 150 new connection attempts hitting Azure SQL Managed Instance at once.

The managed instance has a maximum connection limit. We hit it. Hard. The three PolicyCenter pods were competing for connections, and in the process they consumed the connection budget that BillingCenter and ClaimCenter needed for their existing workloads. Both centers started throwing connection timeout exceptions. Claims processing stopped. The billing batch died three-quarters of the way through and had to be re-run. My phone lit up with two Slack channels.

We caught it fast. A custom Grafana alert on database connection count — set to fire at 90% of the managed instance's maximum — triggered within two minutes. The fix was a combination of operational changes: staggering the rolling update with maxSurge=1 and maxUnavailable=0 so only one pod restarts at a time, adding a connection pool warm-up delay so pods gradually establish connections over 60 seconds instead of all at once, and implementing per-center connection monitoring with pre-deployment headroom checks.
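
For reference, the staggering part of the fix corresponds to a strategy block like the one below. This is a sketch under one assumption: maxSurge and maxUnavailable are fields of a Deployment's rollingUpdate strategy, so the fragment is written in that form. A StatefulSet's updateStrategy exposes different fields, such as partition.

```yaml
# Deployment-style rolling update: replace one pod at a time and never
# drop below the desired replica count while doing so.
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1         # at most one extra pod during the update
    maxUnavailable: 0   # an old pod goes away only once its replacement is ready
```

With three replicas at 50 connections each, this caps the extra load during an update at one additional pool of 50 connections instead of three pools opening at once.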

Total impact: 8 minutes of degraded database connectivity. Zero lost transactions — Guidewire's internal retry mechanisms handled the transient failures. But it was a sharp reminder that Kubernetes primitives designed for stateless microservices need careful adaptation for stateful enterprise workloads.

The Numbers

After 18 months of running Guidewire on AKS, the results speak clearly — with one honest asterisk.

  • Deploy Frequency: 6x (quarterly → 2x/month)
  • Deploy Duration: -81% to -87% (4-6 hrs → 45 min)
  • Infrastructure Cost: -25% (after VM decommission)
  • MTTR: -88% (~2.5 hrs → ~18 min)
  • Platforms to Manage: 1 (was 2: VMs + K8s)
  • Deploy Complexity: Higher (but version-controlled)
  • Policy Transactions Lost During Deploy: Zero (across all deployments since migration)

That last metric is the one that matters most. The deployment process is objectively more complex. There are more moving parts, more configuration to maintain, more things that can go wrong in the pipeline. But the complexity is encoded in version-controlled manifests, not in someone's head. When I'm on vacation, ArgoCD doesn't forget the deployment procedure.

What I'd Do Differently

  • "I'd negotiate Guidewire Kubernetes support with the vendor BEFORE migrating, not during. We hit licensing gray areas that took months to resolve. The conversation is easier when it's hypothetical than when you're already running production workloads in an unsupported configuration."
  • "I'd implement connection pooling middleware — something like PgBouncer but for MSSQL — from day one. The connection storm incident during our first rolling update was entirely predictable and preventable. We added connection governance after the fact, but it should have been part of the initial architecture."
  • "We went straight for the crown jewels. It worked, but we earned scars we didn't need to earn. I'd start with ContactManager as the pilot migration — the least critical of the four centers. If it goes down for an hour, agents lose contact history but no policies are affected. Proving the pattern there would have given us confidence before touching PolicyCenter."

Guidewire on Kubernetes is not a solved problem. It's a maintained one. Every upgrade cycle tests the architecture. 18 months in, nobody wants to go back.

The question we're exploring now: can this hybrid architecture model (Kubernetes for everything, but with workload-specific operational profiles) extend to other enterprise monoliths? If you can run Guidewire on K8s, you can run anything. That's the next case study.