Beyond Logging: A New Approach to Fixing Errors

Logging is the most popular approach for troubleshooting errors. However, in over a decade in the software industry, I have repeatedly run into problems when relying solely on logs to fix errors in a production environment. In this blog post, I explain the limitations of using logs for troubleshooting production errors and introduce the principles that inspired the creation of Errsole.

Why is logging such a common practice?

Old habits die hard.

During development, the codebase resides on our local machine. We add log statements to our code, restart the app, test it, and view the logs in real time. If we start the app in live reload mode, we don't even need to restart it after adding log statements. Without a doubt, logging is the fastest and most efficient approach for troubleshooting errors in the local environment.

Due to its efficiency in the local environment, we also use logging for troubleshooting errors in the production environment. However, the production environment is not the same as the local environment.

Logs alone are not enough to troubleshoot errors

The map is not the territory. What works best in the local environment might prove useless in the production environment.

In the local environment, the developer is the judge, jury, and executioner. They have complete control because the codebase resides on their local machine.

Errors typically occur in the code that the developer is actively working on. So, when an error occurs, the developer can quickly navigate to the specific section of the code and fix the error.
It's the developer who tests the app. So, if an error occurs during testing, they can easily reproduce it by performing the same actions.
The logs are small and capture only the developer's own activity, making it easy to trace errors and logged variables.

However, in the production environment, the code of multiple developers is merged, adding complexity to the system.

Errors that occur in the production environment are triggered by the end users' actions. Since developers do not know which specific actions the end users performed, they cannot reproduce the errors.
The process of adding log statements and restarting the app is not as straightforward as in the local environment. It involves going through the entire deployment process.
Logs in the production environment are a mess: entries from every user session are interleaved, so it is hard to isolate the variables that belong to a single error.

As a result, when troubleshooting errors in the production environment, we end up doing this:

Find the error stack in the logs.
Add log statements in the code.
Deploy the changes to the production environment.
Wait for the error to occur again and inspect the logged variables.
If the root cause is still unclear, repeat the process by adding more log statements.

How do we fix errors in the production environment?

Irrespective of the environment, troubleshooting any error involves three steps: capturing the error, reproducing the error, and inspecting variables.

Most web frameworks log unhandled errors automatically as they occur. However, the stack trace alone is not enough to reproduce the error and inspect variables.

Capture the request along with the error: Most popular web frameworks follow a stateless request-response model: receive an HTTP request, process it, and send back a response. So, in web applications, when an error occurs, it is typically triggered by an HTTP request. By capturing that specific request (method, URL, headers, and body), we can usually reproduce the error at any time simply by replaying it.
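To make the idea concrete, here is a minimal sketch of request capture written as a Python WSGI middleware. It is an illustration under stated assumptions, not how Errsole itself is implemented: the class name CaptureRequestOnError, the sink callable, and the shape of the captured record are all hypothetical.

```python
import io
import traceback


class CaptureRequestOnError:
    """WSGI middleware that records the incoming request when the app raises.

    Hypothetical sketch: `sink` is any callable that stores a dict
    (for example, a function that writes it to a database).
    """

    def __init__(self, app, sink):
        self.app = app
        self.sink = sink

    def __call__(self, environ, start_response):
        # Read the body up front so it can be captured, then put it back
        # so the wrapped app can still read it.
        length = int(environ.get("CONTENT_LENGTH") or 0)
        body = environ["wsgi.input"].read(length) if length else b""
        environ["wsgi.input"] = io.BytesIO(body)
        try:
            # Note: errors raised later, while the response is being
            # iterated, are not caught by this simple sketch.
            return self.app(environ, start_response)
        except Exception:
            self.sink({
                "method": environ["REQUEST_METHOD"],
                "path": environ.get("PATH_INFO", ""),
                "query": environ.get("QUERY_STRING", ""),
                "headers": self._headers(environ),
                "body": body.decode("utf-8", errors="replace"),
                "stack": traceback.format_exc(),
            })
            raise  # let the framework handle the error as usual

    @staticmethod
    def _headers(environ):
        # WSGI exposes request headers as HTTP_* keys; convert them back.
        headers = {
            key[5:].replace("_", "-").title(): value
            for key, value in environ.items()
            if key.startswith("HTTP_")
        }
        if environ.get("CONTENT_TYPE"):
            headers["Content-Type"] = environ["CONTENT_TYPE"]
        return headers
```

Wrapping an app is a one-liner, for example app = CaptureRequestOnError(app, captured_requests.append). Captured headers and bodies can contain credentials and personal data, so a real implementation would need to store them securely.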

Maintain a sandbox server: Maintain a sandbox server within the production environment. This server should host a copy of the live app, but it should not receive any user traffic and must not be accessible from the Internet. The purpose of this sandbox server is to provide a safe environment to reproduce production errors.

Replay the request in the sandbox server: Start the app in debug mode on the sandbox server, set breakpoints in the code, and replay the captured HTTP request. By doing this, we can reproduce the error in the sandbox server. During this process, inspect variables to identify the root cause of the error. Once the problem is understood, make the necessary code edits directly within the sandbox server to fix the error. After implementing the fix, replay the request again in the sandbox to verify that the error is no longer occurring.
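The replay step can be sketched in a few lines of Python using only the standard library. This is an illustration under assumptions, not Errsole's actual mechanism: the record shape (method, path, query, headers, body), the function names, and the sandbox address are hypothetical.

```python
import urllib.error
import urllib.request


def build_replay_request(captured, sandbox_base_url):
    """Turn a captured request record into a request aimed at the sandbox."""
    url = sandbox_base_url.rstrip("/") + captured["path"]
    if captured.get("query"):
        url += "?" + captured["query"]
    body = captured.get("body")
    data = body.encode("utf-8") if body else None
    # Drop headers tied to the original connection; urllib sets them
    # again for the new target.
    skipped = {"host", "content-length"}
    headers = {
        name: value
        for name, value in captured.get("headers", {}).items()
        if name.lower() not in skipped
    }
    return urllib.request.Request(
        url, data=data, headers=headers, method=captured["method"]
    )


def replay(captured, sandbox_base_url, timeout=30):
    """Send the captured request to the sandbox and return the status code."""
    request = build_replay_request(captured, sandbox_base_url)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # A 4xx/5xx response is an expected outcome when reproducing an error.
        return err.code
```

With the app running under a debugger on the sandbox, calling replay(record, "http://sandbox.internal:8000") (a made-up internal address) would trigger the same code path and stop at any breakpoints. Calling it again after the fix checks whether the error still occurs.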

Conclusion

For a faster and more efficient debugging process in the production environment:

Capture the HTTP request along with the error. The request is essential for reproducing the error.
Maintain a secure sandbox server dedicated to debugging the production code.

If you can’t reproduce the error, you can’t resolve the error.

These guiding principles have inspired the development of Errsole, an error logger and remote debugger for server apps. To learn more about Errsole, read our blog post "Why Use Errsole" or try out a live demo by visiting our website at https://www.errsole.com.