White spots in HTML5's encoding sniffing algorithm

HTML5’s ‘encoding sniffing algorithm’ says that the default encoding of an HTML document is “an implementation-defined or user-specified default character encoding … typically dependent on the user’s locale”.

However, it turns out that Web browsers operate with a second default — namely the parent browsing context-defined default. This default, which overrides the locale default, steps in when the document of a ‘child browsing context’ fails to provide encoding information.

Another hot spot of HTML encoding is the proposal to make the Byte Order Mark (BOM) override any other encoding setting, that is, both HTTP/MIME and the meta element. The same might even be done for XML files.
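To make the BOM proposal concrete, here is a minimal Python sketch of a BOM check that is allowed to win over both the HTTP charset and the meta element. The function names, the parameter names and the `windows-1252` fallback are my own assumptions for illustration; they are not taken from HTML5 or from any browser.

```python
import codecs

# Byte order marks and the encodings they signal (UTF-8 and both UTF-16 flavours).
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def bom_encoding(data):
    """Return the encoding signalled by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


def pick_encoding(data, http_charset=None, meta_charset=None, fallback="windows-1252"):
    """Hypothetical 'BOM overrides everything' order: BOM, then HTTP, then meta."""
    return bom_encoding(data) or http_charset or meta_charset or fallback
```

With this order, a file that starts with the bytes FF FE is treated as UTF-16LE even if the server labels it as ISO-8859-1.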

A third hot issue is the way some browsers (partly) apply their HTML encoding sniffing algorithm to XML documents.

The second and third issues can be seen together: an XML document in an iframe of an HTML document should be unaffected by the encoding of the HTML document, even if the XML document lacks encoding info, because in that case the XML document defaults to UTF-8. But, alas, the Webkittens currently ignore this, which makes it tempting to solve the problem with a BOM in XML files too.

How HTML implementations prioritize encoding info

To clarify (or possibly blur) my mind, I decided to go through all the browsers I have access to and document their behaviors. Here you have it! It is at least the most complete table I myself have ever made. (But it is still not 100% complete: it lacks info on how the encoding of documents kept within data URIs is determined, for instance.)

Status for documents with an HTML MIME type as of 23 July 2012.
The number in each cell indicates the priority of the determination method.
An empty cell means that the implementation does not rely on this method.
Columns (layers), in order: document is XML format | user override | inherit override from parent | explicit MIME charset | BOM signature | native markup label | markup label of the sister language | info on “the likely encoding” | UTF-8 detection etc. | parent browsing context | implicit markup language default | info from OS | locale default

Rows (HTML implementations), non-empty cells in order:
HTML5: opts out | 1. | 2. | 3. bug 15359 | 4. | 5. ask Hixie | 6. | 7.
Firefox 10: opts out | 1. | 2. | 3. | 4. | 5. | ask vendors | 6. some locales | 7. | 8.
IE 10 pre-release: opts out | 1. | 2. | 3. | 4. | 5. | 6. check! | 7. check! | 8.
IE 9: opts out | 2. | 3. | 4. | 1. | 5. | check! | 7. check! | 8.
IE 6: 2. | 3. | 1. | 4. | 5.
Chrome 22: 2. | 3. | 4. | 1. | 5. | 6. | 8. | 7. | 9.
Safari 5: 2. | 3. | 4. | 1. | 5. | 6. | 7. | 8. | 9.
Opera 10: 1. | 2. | 3. | 4. | 5. | 6. some locales | 7. | 8.
Opera 12: 1. | 2. | 4. | 3. | 5. | 6. | 7. all locales | 8. | 9.
Who wins? Your own tests.
PS: Inasmuch as your UA allows it at all(!), the tests require that you override the encoding yourself.

  • XML vs your encoding override
  • xml vs override take 2
  • BOM signature vs your encoding override
  • iframe doc vs your encoding override of the parent doc
  • HTTP vs meta element
  • BOM vs meta vs HTTP vs chardet
  • meta vs xml encoding declaration vs chardet
  • xml encoding declaration vs locale default
  • declared encoding vs reload after an encoding override - does it stick?
  • UTF-8 file vs locale default
  • locale default vs doc of parent browsing context
  • Does not apply to HTML
  • OSX: Via file://, load a TextEdit made UTF-8 file in Safari 5
  • check the locale default

How XML implementations prioritize encoding info

Because I made the HTML table (or, actually, it was the other way around …), I wanted to document how XML is handled too.

Status for documents with an XML MIME type as of 23 July 2012.
The number in each cell indicates the priority of the determination method.
An empty cell means that the implementation does not rely on this method.
Columns (layers), in order: user override | explicit MIME charset | BOM signature | native markup label | markup label of the sister language | info on “the likely encoding” | UTF-8 detection etc. | parent browsing context | implicit markup language default | info from OS | locale default

Rows (XML implementations), non-empty cells in order:
HTML5: HTML5 defers encoding determination of XML to XML 1.0
XML 1.0: 1. | 2. | 3. | 4. | 5.
Firefox 10: 1. | 2. | 3. | 4. check! | 5.
IE 9: check! | 2. | 1. | 3. | 4. check! | check! | 5.
IE 10 pre-release: check! | 1. | 2. | 3. | 4. check! | check! | 5.
Chrome 22: 2. | 3. | 1. | 4. | 6. check! | 5. | 7.
Safari 5: 2. | 3. | 1. | 4. | 5. check! | 6. | 7.
Opera 10: 1. | 2. | 3. | 4. some locales | 5. | 6.
Opera 12: 2. | 3. | 1. | 4. | 5. all locales | 6. | 7.

Key points from the tables

HTML key points

  1. If bug 15359 is acted upon, then HTML5 will align with IE, Chrome and Safari by making encoding signatures (the Byte Order Mark) override everything else.
  2. If a document is a frame, that is, if it is a nested browsing context, then all browsers (except IE6 and IE7? IE8? IE9? IE10?) let the encoding of the parent browsing context (and not the locale default!) serve as the last default. This is not documented in HTML5!
  3. Only Chrome offers (UTF-8) encoding sniffing by default, but Opera and Firefox have (UTF-8) encoding sniffing for some locales. Firefox also offers it as an option.
  4. Chrome thinks the parent encoding should outdo the encoding sniffing, whereas Firefox and Opera (for the locales where they offer it!) let the sniffing win over the parent.
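The difference described in point 4 comes down to the order in which two fallbacks are consulted. The Python sketch below illustrates only that ordering; the list names, the `resolve` helper and the sample encodings are hypothetical and do not reproduce any browser's actual code.

```python
def resolve(order, candidates):
    """Return (method, encoding) for the first method in `order` that yields an encoding."""
    for method in order:
        encoding = candidates.get(method)
        if encoding:
            return method, encoding
    return None


# Chrome-like order: the parent browsing context beats UTF-8 detection.
PARENT_FIRST = ["parent browsing context", "UTF-8 detection", "locale default"]
# Firefox/Opera-like order (where sniffing is on): detection beats the parent.
SNIFF_FIRST = ["UTF-8 detection", "parent browsing context", "locale default"]

# Assumed situation: an unlabelled UTF-8 document framed by a windows-1251 parent.
candidates = {
    "UTF-8 detection": "utf-8",
    "parent browsing context": "windows-1251",
    "locale default": "windows-1252",
}
```

For the assumed situation, `resolve(PARENT_FIRST, candidates)` picks the parent's windows-1251, while `resolve(SNIFF_FIRST, candidates)` picks the detected utf-8.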

XML key points in the table

  1. HTML5 defers to XML 1.0 w.r.t. how browsers should decide the encoding of an XML document. However, in practice, many browsers partly align with their HTML behaviour. This is especially true with regard to a) the priority of the byte order mark and b) nested browsing contexts.

NB regarding XML! Note that e.g. SVG images inside the <img> element do not count as browsing contexts, whereas an SVG in an iframe does count as a browsing context. Hence, for SVG images, the parent encoding does not impact the encoding of the SVG image, whereas if the same SVG is served inside an iframe, it may inherit the encoding of the parent document rather than falling back to the encoding default of the format (UTF-8).
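The img-versus-iframe distinction can be sketched as a small fallback function for documents that carry no encoding information at all. Everything here is an assumption made for illustration: the function name, the `embedded_via` values, the `xml_inherits_from_parent` switch (which stands in for the browser behaviour described above) and the `windows-1252` locale default.

```python
def fallback_encoding(is_xml, embedded_via, parent_encoding,
                      locale_default="windows-1252",
                      xml_inherits_from_parent=False):
    """Pick the last-resort encoding for a document without any encoding info."""
    # Only an iframe creates a nested browsing context here; an <img> does not.
    nested = embedded_via == "iframe"
    if is_xml:
        # Browsers that reuse HTML behaviour may inherit from the parent ...
        if nested and xml_inherits_from_parent and parent_encoding:
            return parent_encoding
        # ... otherwise the format default applies.
        return "utf-8"
    if nested and parent_encoding:
        return parent_encoding
    return locale_default
```

Called for the same unlabelled SVG, `embedded_via="img"` always yields utf-8, while `embedded_via="iframe"` yields the parent's encoding when `xml_inherits_from_parent` is set.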

Consequences

The consequences I draw from this tiresome research are that HTML5’s encoding sniffing algorithm should be updated with

  • a step which simply says “opt out if XML”
  • two new steps that take into account the encoding default from the parent browsing context
  • clarification of the step “information on the likely encoding”

Bugs actually filed

These are the bugs I filed based on the above findings:
#1 Encoding Sniffing Algorithm: parent browsing context defines encoding default

PROPOSAL: Add a new second-to-last step, like so:

  • If the document lives in a 'nested browsing context', then return the encoding of the 'parent browsing context', as a parent browsing context dictated default encoding, and abort these steps.
    [nested browsing context = iframe etc]
#2 Encoding Sniffing Algorithm: Overrides apply to nested browsing contexts

PROPOSAL: Add a new step after the current first step (about user overriding), like so:

  • If the current document lives in the 'nested browsing context' of a document in a 'parent browsing context' whose encoding has been overridden at the request of the user, then return the encoding of the parent browsing context, and abort these steps.
#3 Encoding Sniffing Algorithm: Add an XML check as a step zero

PROPOSAL: Add this step as a step zero:

  • If the document is an XML document, abort these steps.

[Purpose: to prevent an HTML encoding sniffing algorithm from (sometimes) being applied to XML.]
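Proposals #1 to #3 can be sketched together as one simplified sniffing function. This is a rough Python illustration under my own assumptions: the `Doc` fields, the `sniff` name and the `windows-1252` locale default are hypothetical, the existing HTML5 steps are collapsed into a single loop, and only the direct parent is checked for a user override.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Doc:
    is_xml: bool = False
    user_override: Optional[str] = None
    http_charset: Optional[str] = None
    bom: Optional[str] = None
    meta_charset: Optional[str] = None
    parent: Optional["Doc"] = None  # parent browsing context's document, if nested


def sniff(doc, locale_default="windows-1252"):
    # Proposal #3, step zero: XML documents opt out of HTML sniffing.
    if doc.is_xml:
        return None
    # Existing first step: the user's own override.
    if doc.user_override:
        return doc.user_override
    # Proposal #2: a user override on the parent applies to the nested document.
    if doc.parent is not None and doc.parent.user_override:
        return doc.parent.user_override
    # Existing steps, heavily simplified: HTTP charset, BOM, meta element.
    for encoding in (doc.http_charset, doc.bom, doc.meta_charset):
        if encoding:
            return encoding
    # Proposal #1, second-to-last step: default to the parent's encoding.
    if doc.parent is not None:
        parent_encoding = sniff(doc.parent, locale_default)
        if parent_encoding:
            return parent_encoding
    # Last step: the locale default.
    return locale_default
```

A return value of `None` means that the HTML algorithm was aborted and the XML rules are expected to take over.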

#4 Encoding Sniffing Algorithm: Clarify what "information on the likely encoding" covers