Watson on the Web

Any user of Microsoft software has seen that now-ubiquitous prompt when something goes wrong: “Software X encountered a problem. Would you like to report this to Microsoft?” That prompt is part of a system called Watson, named after Dr. Watson of Sherlock Holmes fame. When a user clicks “Yes” on the report dialog, information identifying the fault is sent back to Microsoft, where the product teams can use the data to analyze which bugs are the most common and the most important to fix.

In the M3 release of the Kahuna Mail Beta, we launched Watson support for web applications. It has already paid huge dividends by helping us identify which bugs to fix first in our M4 release. When our servers or our clients hit an error, we display an inline message telling the user that we had an issue with their request and asking whether they’d like to report the error.

If the error happens on the server, we encrypt the error details (to prevent any personal data from leaking) and send them back down to the user, who can then click “Report it”. If the error happens on the client, we encrypt it on the client and then let the user click “Report it”. If the user decides to report the error, that encrypted blob is sent back up to the server, where we decrypt the data, remove any personally identifiable data from it, and then send it to the Watson data warehouse.
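To make the “remove any personally identifiable data” step a little more concrete, here is a minimal Python sketch of what scrubbing a decrypted report could look like. It is only an illustration: the function name `scrub_report`, the report fields, and the idea of masking anything shaped like an e-mail address are all assumptions, not a description of the actual Kahuna code.

```python
import re

# Assumption: anything shaped like an e-mail address counts as personal data.
# A real scrubber would need to cover far more than this one pattern.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def scrub_report(report: dict) -> dict:
    """Return a copy of a decrypted error report with e-mail-like text masked.

    The report layout (a flat dict of field name to value) is hypothetical.
    """
    cleaned = {}
    for key, value in report.items():
        if isinstance(value, str):
            # Mask every e-mail-like substring in string fields.
            cleaned[key] = EMAIL_RE.sub("[redacted]", value)
        else:
            # Non-string fields (line numbers, counters) pass through unchanged.
            cleaned[key] = value
    return cleaned


if __name__ == "__main__":
    raw = {"message": "Failed to render mail for alice@example.com", "line": 42}
    print(scrub_report(raw))
```

The sketch returns a new dictionary instead of modifying the input, so the original report stays available to whatever code called it.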

Once it’s in Watson, we’re able to mine the data for the specific sections of code that cause issues. The dumps provide us with the source file that failed, along with the line number and stack trace. Watson builds a bucket for each unique combination of those items and then tracks hits in each bucket. Basically, when we look to fix bugs, we can find which ones our users hit the most and attack them first.
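The bucketing idea can be sketched in a few lines of Python. This is a simplified illustration of grouping reports by source file, line, and stack trace, then ranking buckets by hit count. The names `BucketKey`, `bucket_reports`, and `top_buckets`, and the shape of the report dictionaries, are made up for the example and say nothing about how Watson itself is implemented.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class BucketKey(NamedTuple):
    """Hypothetical bucket identity: one bucket per unique combination."""
    source_file: str
    line: int
    stack: tuple  # stack frames, outermost first


def bucket_reports(reports: Iterable[dict]) -> Counter:
    """Count how many reports land in each bucket."""
    hits = Counter()
    for report in reports:
        key = BucketKey(
            report["source_file"],
            report["line"],
            tuple(report["stack"]),  # lists are unhashable, so freeze the frames
        )
        hits[key] += 1
    return hits


def top_buckets(hits: Counter, n: int = 10) -> list:
    """Return the n most-hit buckets, most frequent first."""
    return hits.most_common(n)


if __name__ == "__main__":
    sample = [
        {"source_file": "mime.js", "line": 120, "stack": ["render", "parseMime"]},
        {"source_file": "mime.js", "line": 120, "stack": ["render", "parseMime"]},
        {"source_file": "inbox.js", "line": 7, "stack": ["loadInbox"]},
    ]
    for key, count in top_buckets(bucket_reports(sample)):
        print(count, key.source_file, key.line)
```

Sorting buckets by hit count is what turns a pile of individual reports into a prioritized fix list: the bucket at the top is the failure users run into most often.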

To give you an example of the kind of bugs we fixed: when we first launched M3, we started to receive a number of hits in our MIME parser when it was being used to render a message. We were able to track down the lines of code that were causing the issue (we weren’t handling mis-encoded MIME messages correctly) and release an update of Kahuna, which in turn made that issue disappear from the hit list.