Natalia Chechina is passionate about distributed systems, scalability, and fault tolerance, and as a result is in love with the Erlang Ecosystem. Before joining Erlang Solutions as a senior developer, she was a research fellow and lecturer at UK universities, focusing on approaches and techniques that enable scaling and efficient performance on commodity hardware, where components are loosely coupled, communication is significant, and any component may fail or disconnect at any time. She has a PhD from Heriot-Watt University, UK.
Observability is about understanding your system – its performance and the reasons for its actions, inactions, and failures – and about being able to act pre-emptively on system limitations before they become a problem. The main tools of observability are metrics and logs. To empower developers instead of introducing overhead and endless useless data, both metrics and logs should be designed into the system rather than added on afterwards. This is particularly true for large-scale systems.
In this talk, I will share my experience and the rules of thumb for working with metrics and logs at scale. I will also cover the theory behind these concepts.