Back in the day I worked on, among other things, the globalization support for Kahuna/Windows Live Mail. The current Hotmail product had limited support for globalized email, so we took this opportunity to improve how our product sends and receives email in various languages and markets. I wanted to share some of the scenarios and rules that we use under the hood to make Windows Live Mail work well in the global email arena.
On the sending side, the easiest solution would have been to send everything as UTF8 and call it a day. But that’s cheating. For starters, many mail clients in many countries don’t support UTF8 encoding. Further, many national standards bodies request that mail in certain languages be sent in specific national character sets (GB18030 is a good example) and transfer encodings. Here are some of the rules we use when we generate mail that gets sent to the internet.
- Detect the user’s UI settings and country settings first
- Auto detect against the entire From, Subject, and Body of the message
- If one language is detected, then the native charset is the default charset for the outbound email. Thus we send email in native charset to be compatible with other email services.
- If multiple languages related to multiple charsets are detected (e.g. English, French and Japanese), then the outbound mail needs to be encoded with UTF8
- If the From, Subject or Body are too short, detection may fail or guess incorrectly, so it’s possible that the wrong outbound encoding is used
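To make the "native charset first, UTF8 as the fallback" idea concrete, here is a minimal Python sketch. It is not the Windows Live Mail implementation: instead of real language detection it simply tests whether the text can be encoded in a candidate charset, and the function name `choose_outbound_charset` and the `native_charset` parameter (standing in for whatever the user's UI and country settings suggest) are assumptions made for illustration.

```python
def choose_outbound_charset(text: str, native_charset: str) -> str:
    """Pick an outbound charset for `text`.

    Hypothetical helper: `native_charset` stands in for the charset implied
    by the user's UI/country settings. Encodability is used as a crude
    substitute for language detection.
    """
    for candidate in ("us-ascii", native_charset):
        try:
            text.encode(candidate)
        except (UnicodeEncodeError, LookupError):
            # Either the text does not fit this charset, or Python does not
            # know the charset name; try the next candidate.
            continue
        return candidate
    # Text mixes scripts that no single candidate charset covers.
    return "utf-8"


if __name__ == "__main__":
    for sample, native in [
        ("Hello", "shift_jis"),
        ("こんにちは", "shift_jis"),
        ("こんにちは 안녕하세요", "shift_jis"),
    ]:
        print(repr(sample), "->", choose_outbound_charset(sample, native))
```

A real detector looks at language statistics rather than encodability, which is why very short messages can still be guessed wrong.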
On the receiving side, a mail client has absolutely no idea what kind of mail it is receiving: what the character encoding is, what the transfer encoding is, or even whether the message is in proper RFC format. Thus, to properly display globalized data, we have a lot of rules for making our best effort at decoding a message.
- If the message comes with complete MIME charset info, convert the inbound email from that charset to UTF8 for display
- While reading a message, if the header is RFC2047 encoded but there is no body charset info, we apply the header charset to the body without running autodetection. It is uncommon for the header and body to use different charsets.
- While reading a message, if the header is not RFC2047 encoded but the body is correctly tagged, we can apply the body charset to the header without running autodetection
- While reading a message, if the body contains 8-bit characters, ignore any US-ASCII tagging
- If the header is not RFC2047 encoded correctly, we apply the user default charset based on the user’s language and country, since running detection on the header may be too expensive. Remember, we’re talking about millions of transactions per second here.
- If there is no MIME charset tag, or the MIME charset is unknown, or the charset is US-ASCII and the message contains 8-bit characters, then the encoding selector should be shown so the user can pick a new encoding on the read message page and the preview pane page. After the user selects an encoding, Windows Live Mail reloads the message based on that selection and converts it to UTF8 for display.
- In all cases of ambiguity, show the encoding selection drop-down
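The body-decoding rules above can be sketched roughly as follows. This is a simplified illustration under my own assumptions, not the production code: the names `decode_body`, `declared`, `header_charset` and `user_default` are hypothetical, the header charset is assumed to have already been extracted from an RFC2047 encoded word, and the returned flag stands in for "show the encoding selector".

```python
import codecs
from typing import Optional, Tuple


def _has_8bit(raw: bytes) -> bool:
    """True if any byte falls outside the 7-bit ASCII range."""
    return any(b > 0x7F for b in raw)


def _trustworthy(charset: Optional[str], raw: bytes) -> bool:
    """A charset tag is usable if Python knows it and it is not a
    US-ASCII tag sitting on top of 8-bit data."""
    if not charset:
        return False
    try:
        name = codecs.lookup(charset).name
    except LookupError:
        return False  # unknown MIME charset
    return not (name == "ascii" and _has_8bit(raw))


def decode_body(
    raw: bytes,
    declared: Optional[str],
    header_charset: Optional[str],
    user_default: str,
) -> Tuple[str, bool]:
    """Return (text, show_selector) for a message body.

    Order of preference: the body's own charset tag, then the charset
    seen in the RFC2047 header, then the user's default charset.
    """
    for charset in (declared, header_charset):
        if _trustworthy(charset, raw):
            return raw.decode(charset, errors="replace"), False
    # Nothing reliable: fall back to the user default and flag the message
    # as ambiguous whenever it actually contains 8-bit data.
    return raw.decode(user_default, errors="replace"), _has_8bit(raw)


if __name__ == "__main__":
    body = "日本語".encode("shift_jis")
    print(decode_body(body, "us-ascii", "shift_jis", "utf-8"))
```

Once the bytes are decoded into a Python string, re-encoding to UTF8 for display is a plain `text.encode("utf-8")`; the hard part is choosing which charset to decode from.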
There’s a lot of complexity here but we’re quite happy with the way it works. I can read and write emails in multiple languages, which is a huge improvement over Hotmail.