[ANN] `idna2008` 1.0 — pure-Haskell IDNA 2008 with UCD 17.0.0 tables

idna2008 is a ground-up, pure-Haskell implementation of
Internationalized Domain Names in Applications (IDNA 2008):

  • RFC 5891 (Protocol)
  • RFC 5892 (Tables)
  • RFC 5893 (Right-to-Left Scripts / Bidi)
  • RFC 5895 (Mappings)
  • RFC 3492 (Punycode)

Codepoint tables are derived from the Unicode Character Database
(17.0.0) by an in-tree generator script, so later Unicode editions
are easy to swap in. There are no external C dependencies:
neither libicu nor libidn is required.

The motivations for this implementation were:

  • To implement IDNA2008 faithfully, without some of the
    questionable additions from UTS #46.

  • To model the transformations as either parsing of textual
    inputs to a ShortByteString DNS wire-form, or decoding
    from a DNS wire-form to presentation-form text. This is
    more flexible than just the “toASCII” or “fromASCII”
    text-to-text APIs.

  • To handle application-configurable mixtures of label “forms”,
    allowing e.g. parsing of names like “*.αβγ.gr” if the application
    also wants to admit “wildcard” labels, and perhaps to
    examine the classification of each parsed label.

  • To avoid external C library dependencies.
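
As a rough illustration of the wire-form target mentioned above, a list
of already-ASCII labels can be packed into a length-prefixed
ShortByteString as sketched below. This is a standalone sketch, not
idna2008's API: `encodeLabel` and `toWire` are hypothetical names, and
the only checks made are the usual DNS limits of 63 octets per label
and 255 octets per name.

```haskell
import qualified Data.ByteString.Short as SBS
import Data.ByteString.Short (ShortByteString)
import Data.Char (isAscii, ord)
import Data.Word (Word8)

-- | Hypothetical helper: one label as a length octet followed by its bytes.
encodeLabel :: String -> Either String [Word8]
encodeLabel l
  | null l              = Left "empty label"
  | length l > 63       = Left ("label longer than 63 octets: " ++ l)
  | not (all isAscii l) = Left ("non-ASCII label (convert to an A-label first): " ++ l)
  | otherwise           = Right (fromIntegral (length l) : map (fromIntegral . ord) l)

-- | Hypothetical helper: concatenate the encoded labels and append the
-- zero-length root label that terminates every wire-form name.
toWire :: [String] -> Either String ShortByteString
toWire labels = do
  encoded <- traverse encodeLabel labels
  let octets = concat encoded ++ [0]
  if length octets > 255
    then Left "name longer than 255 octets"
    else Right (SBS.pack octets)
```

For example, `SBS.unpack <$> toWire ["gr"]` evaluates to
`Right [2,103,114,0]`.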

Two configuration knobs drive most of the behaviour:

  • LabelFormSet — a constraint on which kinds of labels are
    admissible when parsing or decoding (after any mappings
    are applied when parsing).

  • IDNAOpts — a flag set controlling validation strictness and
    use of mappings.
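
To make the idea of an admissible-form set concrete, here is a minimal
standalone sketch. The `LabelForm` type, `classifyAscii` and `admit`
are hypothetical and are not idna2008's actual types. The sketch only
covers ASCII labels, borrows the LDH and R-LDH definitions from
RFC 5890, and makes no attempt to tell a valid A-label from a fake one.

```haskell
import Data.Char (isAsciiLower, isAsciiUpper, isDigit, toLower)
import Data.List (isPrefixOf)
import qualified Data.Set as Set

-- | Hypothetical, coarse label classification (ASCII input only).
data LabelForm = WildLabel | AttrLeaf | Octet | RLDH | XNLabel | LDH
  deriving (Eq, Ord, Show)

-- | Letters, digits and hyphens; 1 to 63 octets; no leading or trailing hyphen.
isLDH :: String -> Bool
isLDH l =
  not (null l) && length l <= 63 && all ok l
    && take 1 l /= "-" && take 1 (reverse l) /= "-"
  where
    ok c = isAsciiLower c || isAsciiUpper c || isDigit c || c == '-'

classifyAscii :: String -> LabelForm
classifyAscii "*" = WildLabel
classifyAscii l
  | isLDH l, "xn--" `isPrefixOf` map toLower l = XNLabel  -- may or may not be a real A-label
  | isLDH l, take 2 (drop 2 l) == "--"         = RLDH     -- reserved: "--" in positions 3 and 4
  | isLDH l                                    = LDH
  | ('_' : rest) <- l, isLDH rest              = AttrLeaf -- underscore-prefixed, e.g. "_tcp"
  | otherwise                                  = Octet    -- anything else

-- | Accept a label only if its form is in the admitted set.
admit :: Set.Set LabelForm -> String -> Either String LabelForm
admit allowed l
  | form `Set.member` allowed = Right form
  | otherwise = Left ("label form " ++ show form ++ " not admitted: " ++ l)
  where
    form = classifyAscii l
```

With `Set.fromList [LDH, WildLabel]` as the admitted set, `*` and
`example` are accepted while `_tcp` is rejected.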

The deliberately extreme example below exercises every label form. It
shows, as JSON, the parsed input, the DNS presentation form of the
resulting wire-form domain, and the result of decoding that domain
back to text, under non-default settings that admit all of these forms:

{
  "input": {
    "text": "*._tcp.abc$def.la--la--la.xn--ls8h.хn--нет.αβγ.example",
    "forms": [
      "WILDLABEL",
      "ATTRLEAF",
      "OCTET",
      "RLDH",
      "FAKEA",
      "LAXULABEL",
      "ULABEL",
      "LDH"
    ]
  },
  "presentation": "*._tcp.abc\\$def.la--la--la.xn--ls8h.xn--n---tdd3b5ap.xn--mxacd.example",
  "output": {
    "text": "*._tcp.abc\\$def.la--la--la.💩.хn--нет.αβγ.example",
    "forms": [
      "WILDLABEL",
      "ATTRLEAF",
      "OCTET",
      "RLDH",
      "LAXULABEL",
      "LAXULABEL",
      "ULABEL",
      "LDH"
    ]
  }
}

The “input” element’s “text” field shows what the parser read, and its
“forms” field lists the classification of each label. The “output”
element shows the result of decoding the wire-form domain whose DNS
zone presentation form is given in “presentation”.

The two LAXULABEL forms in the output above are admitted only when explicitly
requested. In normal use those labels would be left as ASCII and classified
as FAKEA (a label that looks like an A-label because of its xn-- prefix, but
does not decode to a valid U-label), provided that form is allowed; otherwise
the decoder would report an error.
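
The xn-- labels in the presentation form above are RFC 3492 Punycode.
As a cross-check, here is a compact standalone encoder written directly
from the RFC's pseudocode. It is a sketch rather than idna2008's
implementation: the function names are illustrative, and it omits the
RFC's overflow checks on the assumption that Int is wide enough for
DNS-sized labels.

```haskell
import Data.Char (chr, ord)

-- Bootstring parameters for Punycode (RFC 3492, section 5).
base, tMin, tMax, skew, damp, initialBias, initialN :: Int
base = 36
tMin = 1
tMax = 26
skew = 38
damp = 700
initialBias = 72
initialN = 128

-- | Digit values 0..25 map to 'a'..'z', 26..35 to '0'..'9'.
digit :: Int -> Char
digit d
  | d < 26    = chr (ord 'a' + d)
  | otherwise = chr (ord '0' + d - 26)

-- | Bias adaptation (RFC 3492, section 6.1).
adapt :: Int -> Int -> Bool -> Int
adapt delta numPoints firstTime = go 0 (d0 + d0 `div` numPoints)
  where
    d0 = if firstTime then delta `div` damp else delta `div` 2
    go k d
      | d > ((base - tMin) * tMax) `div` 2 = go (k + base) (d `div` (base - tMin))
      | otherwise = k + ((base - tMin + 1) * d) `div` (d + skew)

-- | Emit one delta as a generalized variable-length integer.
encodeVarInt :: Int -> Int -> String
encodeVarInt bias = go base
  where
    go k q
      | q < t     = [digit q]
      | otherwise = digit (t + (q - t) `mod` (base - t))
                      : go (k + base) ((q - t) `div` (base - t))
      where t = max tMin (min tMax (k - bias))

-- | Punycode-encode one label, without the "xn--" prefix.
punycode :: String -> String
punycode input = prefix ++ go initialN 0 initialBias b
  where
    cps    = map ord input
    basics = filter (< initialN) cps
    b      = length basics
    prefix = map chr basics ++ (if b > 0 then "-" else "")
    -- n: next code point to handle; h: code points emitted so far.
    go n delta bias h
      | h >= length cps = ""
      | otherwise       = out ++ go (m + 1) (delta' + 1) bias' h'
      where
        m = minimum (filter (>= n) cps)  -- smallest code point not yet handled
        (out, delta', bias', h') =
          foldl step ("", delta + (m - n) * (h + 1), bias, h) cps
        step (acc, d, bi, hh) c
          | c < m     = (acc, d + 1, bi, hh)
          | c == m    = (acc ++ encodeVarInt bi d, 0, adapt d (hh + 1) (hh == b), hh + 1)
          | otherwise = (acc, d, bi, hh)
```

`punycode "αβγ"` yields `mxacd`, `punycode "💩"` yields `ls8h`, and
`punycode "хn--нет"` (with a leading Cyrillic х) yields
`n---tdd3b5ap`, matching the A-labels above once the xn-- prefix is
added.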

The recently announced dnsbase library can use idna2008 as the parser
behind its dnLit Template Haskell splice for compile-time domain
literals, and idna2008 can also be used to render wire-form
domains to text with valid A-labels converted to U-label form.

Links

Feedback, bug reports and PRs welcome.