Baseline(s).. [Part 1]
Merriam-Webster defines a baseline as "a line serving as a basis"; you can also think of it as a starting point used for comparisons.
In site reliability engineering (SRE), establishing a performance baseline for workloads is a common practice. This baseline serves as a reference point for monitoring and also improving system performance over time.
In my previous experience as an SRE, particularly in supporting critical e-commerce websites, I learned that it's beneficial to establish more than one baseline for a more holistic view and a more accurate analysis.
I have found two key types of baseline to be quite useful:
[A] Performance Baseline Influenced by External Factors: This baseline takes into account real-world conditions such as network variability, user device performance, and other external influences. It reflects the performance metrics as experienced by the end-users, providing insights into how these factors impact the user experience.
[B] Performance Baseline Detached from External Factors: This is a controlled baseline that isolates the system's performance from external variables, offering a pure look at the system's capabilities. It helps in identifying the inherent performance characteristics of the raw system, independent of external conditions. Note that external conditions are always variable, which is why a baseline measured without them is a steadier reference.
Let's take a simple example: latency. For this metric, the two baselines offer different insights. The first, influenced by external factors, shows how latency is experienced by users in various conditions, which is critical for understanding and optimising the user experience. The second, detached from external factors, reveals the system's own baseline latency, helping to identify potential improvements in the infrastructure or the application itself.
Combined, they provide a holistic view of where performance stands and where it can be tuned.
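To make the latency example concrete, here is a minimal sketch of how the two baselines could be summarised and compared. The function names (`percentile`, `summarize`, `external_overhead`) and the choice of p50 and p95 are my own assumptions for illustration; the input is simply a list of latency samples in milliseconds for each baseline.

```python
from statistics import quantiles


def percentile(samples, p):
    """Return the p-th percentile (1..99) of the samples.

    Uses the "inclusive" method, which treats the samples as the
    full population and interpolates between neighbouring values.
    """
    cuts = quantiles(samples, n=100, method="inclusive")
    return cuts[p - 1]


def summarize(samples):
    """Summarise a set of latency samples (ms) as p50 and p95."""
    return {"p50": percentile(samples, 50), "p95": percentile(samples, 95)}


def external_overhead(baseline_a_ms, baseline_b_ms):
    """Difference between baseline [A] (real users) and baseline [B]
    (controlled), per percentile. A large gap suggests the latency is
    mostly added outside the system; a small gap points back at the
    system itself.
    """
    a = summarize(baseline_a_ms)
    b = summarize(baseline_b_ms)
    return {name: a[name] - b[name] for name in a}
```

Calling `external_overhead(real_user_samples, controlled_samples)` with your own measurements gives a rough idea of how much of the user-perceived latency comes from outside the system, which is one simple way of reading the two baselines together.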
Real User Monitoring (RUM) tools are very useful for understanding baseline [A]. For baseline [B], however, I tend to rely on other mechanisms; more on this later (Part 2 of the article).
But first I want to highlight the importance of establishing multiple baselines. In my SRE days I was often asked why we need to do this, and here are my thoughts.
Without contextual information, a baseline adds little value: Different baselines capture different aspects of your system's performance under a range of contexts. For example, having one baseline for server response times under stable network conditions and another reflecting real-world user conditions allows for a more nuanced understanding of performance across different scenarios.
Improved visibility and control: Multiple baselines enable more accurate analysis and troubleshooting. You can compare performance across different dimensions, which helps in identifying specific scenarios where performance degrades and requires tuning.
Targeted tuning: With baselines for different conditions and components of your service, you can target tuning more effectively. For instance, if one baseline shows degraded performance due to server load while another shows issues related to client-side rendering, you can prioritise resources and efforts to address these specific areas.
Risk mitigation: Multiple baselines also help in risk management. For example, if a new deployment badly affects performance, having established baselines helps you quickly quantify the impact and roll back changes if necessary. Workloads and their usage patterns also evolve over time, so what constitutes "normal" performance can change. Multiple baselines help track this evolution more accurately, allowing for adjustments in performance expectations and optimisation strategies.
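As a rough illustration of the risk-mitigation point, the sketch below compares latency after a deployment against an established baseline and flags a regression. The function names and the 20% threshold are hypothetical choices of mine, not a recommendation; pick a metric and a threshold that suit your own workload.

```python
from statistics import median


def relative_change(baseline_value, current_value):
    """Fractional change of the current value relative to the baseline."""
    return (current_value - baseline_value) / baseline_value


def check_deployment(baseline_ms, post_deploy_ms, threshold=0.20):
    """Compare median latency (ms) after a deployment with the baseline.

    `threshold` is an assumed tolerance: 0.20 means a 20% increase
    in median latency is treated as a regression.
    """
    base = median(baseline_ms)
    current = median(post_deploy_ms)
    change = relative_change(base, current)
    return {
        "baseline_ms": base,
        "current_ms": current,
        "change": change,
        "regressed": change > threshold,
    }
```

Running the same check against both baseline [A] and baseline [B] helps separate a regression introduced by the deployment itself from one caused by a shift in external conditions.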
Establishing and maintaining multiple baselines for performance monitoring takes more effort, but it can lead to deeper insights and more effective management of site reliability.
[ To be continued..]