Date and Time: 17 December 2024

Topics discussed:

OpenAI Kubernetes failure Dec 11
https://status.openai.com/incidents/ctrsv3lwd797
[Quick takes on the recent OpenAI public incident write-up – Surfing Complexity](https://surfingcomplexity.blog/2024/12/14/quick-takes-on-the-recent-openai-public-incident-write-up/)
- It was fun to use a well-written incident report from a different company to ask questions of our local cluster team
- Control Plane can create brittleness. E.g. people claim about AZ robustness, but often AWS control planes run in us-east-1, so even if people "don’t deploy to us-east-1" they can still be vulnerable to failures in that AWS region because of AWS control plane
- Designers often overlook the control plane when designing robustness for their systems
- Question to stimulate wider design thinking: "if you lost access to the system, how would you restart with just a key to the datacenter?"

LLM-based automation to manage infrastructure and its impact on safety
- Known uses of LLMs have been a buddy or assistant
- Seen proposals to have LLM power decision making. Haven’t seen stories of anyone doing this
- We know LLMs have non-deterministic behaviors.
- When this happens, what will be the consequences for safety and reliability? For understanding the incidents?

Did seasonal holiday spending affect anyone?
- could automate scaling. but not databases. cannot reshard under load; that requires additional capacity. 
- Use historical, not current data to predict seasonal load + lead time.
incidents might be more damaging to reputation and revenue than the cost of over-provisioning
- live adjustment to prices is similar to control plane capabilities on top of the primary functionality