Converting a Latin1 encoded HTML Document with Elixir (1)

Posted on February 09, 2017 by Carsten Zimmermann

Elixir/Erlang gets a bad rap when it comes to string handling and the encoding of character data. On top of that, detecting or guessing the content encoding based on heuristics alone is a reasonably hard problem to solve.

A recent Elixir/Phoenix project involved getting a remote HTML page with HTTPoison, modifying parts of it and returning the changed HTML document. It works perfectly fine when the source document is in UTF-8 unicode, but not so much when the source is ISO-8859-X (Latin1 and its siblings). This two-parter illustrates two mechanisms to get hints about the source document’s encoding.
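To see why this bites at the byte level: Latin1 encodes characters like umlauts as a single byte, and those bytes are not valid UTF-8 sequences on their own. A quick sketch, using nothing beyond the standard library:

```elixir
# "ä" is the single byte 0xE4 in Latin1, but two bytes (0xC3 0xA4) in UTF-8.
String.valid?(<<0xE4>>)
# => false — read as UTF-8, a lone 0xE4 byte is an invalid sequence
```

So a Latin1 response body passed around as if it were UTF-8 will trip up anything that expects valid unicode strings.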

Let’s assume that the HTML document contains a <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" /> element in its document headers.

I’ll cover two cases: first, when the Content-Type response header from the remote webserver is missing (is that even allowed?) or doesn’t correspond to the actual encoding of the response body; and second, when the header and the encoding match. Here goes the first part:

Part 1: The proper http-equiv meta element

Guess the content type with this one weird trick

Let’s employ Floki to get the header element from within the HTML document:

defmodule Latin1Convert do

  @doc """
  Retrieves the content type indication from html, e.g.
  "text/html; charset=iso-8859-1".

  iex> Latin1Convert.meta_http_equiv_encoding("<html></html>")
  ""
  """
  @spec meta_http_equiv_encoding(String.t) :: String.t
  def meta_http_equiv_encoding(html) do
    html
    |> String.downcase
    |> Floki.find("meta[http-equiv=content-type]")
    |> Floki.attribute("content")
    |> List.first
    |> to_string
  end
end


It’s easier on my brain to map this to atoms:

defmodule Latin1Convert do

  @doc """
  Looks for a <meta http-equiv="Content-Type"> node in the input
  string's HTML header and returns an atom representing the encoding.
  """
  @spec content_type_from_header(String.t) :: atom | nil
  def content_type_from_header(html) do
    encoding = meta_http_equiv_encoding(html)

    cond do
      Regex.match?(~r(iso-8859)i, encoding) ->
        :latin1
      Regex.match?(~r(utf-8)i, encoding) ->
        :unicode
      true ->
        nil
    end
  end

  def meta_http_equiv_encoding(html) do
    # See above
  end
end
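The mapping itself is plain regex matching on the charset string and doesn't need Floki at all, so the cond can be exercised in isolation. A standalone sketch (the sample strings are made up for illustration):

```elixir
# Standalone version of the charset-to-atom mapping above.
to_atom = fn encoding ->
  cond do
    Regex.match?(~r(iso-8859)i, encoding) -> :latin1
    Regex.match?(~r(utf-8)i, encoding) -> :unicode
    true -> nil
  end
end

to_atom.("text/html; charset=ISO-8859-1")  # => :latin1
to_atom.("text/html; charset=UTF-8")       # => :unicode
to_atom.("")                               # => nil
```

The `i` modifier on the sigils keeps the match case-insensitive, so `iso-8859` covers `ISO-8859-1`, `ISO-8859-15` and friends.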


Convert and purge (now) erroneous markup

The last step would be to convert the HTML input to UTF-8 using the underlying Erlang library. However, we don’t want the HTML to identify as Latin1 anymore, so we have to remove the meta http-equiv tag:

defmodule Latin1Convert do

  @doc """
  Convert an input HTML string to UTF-8 unicode.
  """
  @spec call(String.t) :: String.t
  def call(html) do
    content_type = content_type_from_header(html)

    cond do
      content_type == :latin1 ->
        html
        |> :unicode.characters_to_binary(:latin1)
        |> remove_meta_http_equiv_encoding
      true ->
        html
    end
  end

  # Caveat: not a case-insensitive check for the DOM node.
  # Floki doesn't seem to understand $=foo i queries. We can't
  # String.downcase here as that would mess up the markup we return.
  defp remove_meta_http_equiv_encoding(html) do
    html
    |> Floki.filter_out("meta[http-equiv='Content-Type']")
    |> Floki.raw_html
  end
end
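The workhorse here, `:unicode.characters_to_binary/2`, is part of OTP and operates on plain binaries, so the conversion step can be tried without any HTML at hand:

```elixir
# "Käse" encoded in Latin1: the umlaut is the single byte 0xE4.
latin1 = <<?K, 0xE4, ?s, ?e>>

utf8 = :unicode.characters_to_binary(latin1, :latin1)
# => "Käse" — the umlaut is now the two bytes 0xC3 0xA4

byte_size(latin1) # => 4
byte_size(utf8)   # => 5
```

Note that the converted document grows by one byte per non-ASCII character, which is exactly why the stale `charset=ISO-8859-1` meta tag has to go.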


The next part will cover how to use a matching Content-Type HTTP header to short-circuit the guesswork.