Processing XML with Elixir and xmerl

Contents

1 Introduction
- 1.1 The code
2 The tree walk
3 Using xpath
4 Wrap-up

1 Introduction

Elixir makes it easy to call modules from the Erlang standard library. Often doing so is as simple as making a direct call. For example:

iex(19)> greeting = 'hello'
'hello'
iex(20)> name = "dave"
"dave"
iex(21)> :io.fwrite("~s, ~s~n", [greeting, name])
hello, dave
:ok

Notes:

We turn the name of the Erlang module ("io", in this case) into an atom by prefixing it with a colon. And, instead of using a colon to reference a function within a module (as we'd do in Erlang), we use the standard Elixir dot operator.
Notice that in this case, :io.fwrite accepts both binary strings (Elixir strings, created with double quotes) and character lists (created in Elixir with single quotes or, in recent Elixir versions, with the ~c sigil). In other cases, however, you will need to distinguish between the two and may need to convert from one to the other.
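To make the conversion between the two string types concrete, here is a minimal sketch. The variable names are only for illustration; String.to_charlist/1 and to_string/1 are standard Elixir functions.

```elixir
name = "dave"          # a binary string
greeting = ~c"hello"   # a character list (the same value as 'hello')

# Binary string -> character list, for Erlang functions that insist on one.
as_charlist = String.to_charlist(name)

# Character list -> binary string, for Elixir functions such as Regex.match?/2.
as_binary = to_string(greeting)

IO.inspect(is_list(as_charlist))   # true
IO.inspect(is_binary(as_binary))   # true
```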

However, xmerl and its associated modules are complex enough that we need a few techniques beyond the most obvious ones. This post attempts to help you learn some of those techniques.

So that we can talk about real code, this post discusses two versions of roughly the same application. That application does regular expression searches of various fields in a list of XML files. The user can specify (1) which fields are to be searched, (2) a regular expression for the search, and (3) a list of files (XML documents) to be searched. The fields that can be searched are (1) tag (element names), (2) attribute (attribute values), and (3) text (element text content). The "--help" command line option and the comments at the top of our two sample code files give this usage information.

1.1 The code

In this post we look at two approaches to searching XML files with xmerl: (1) one example module uses a tree walk; (2) the other uses xpath. Here are the modules:

The tree walk -- search_xml04.ex
Using xpath -- search_xml05.ex

Here are Elixir scripts for running each of the above:

And, shell scripts for running them:

2 The tree walk

Once again, you can view the tree walk code here: search_xml04.ex

Some notes:

The parse -- To parse an XML document, we use:

{root, []} = :xmerl_scan.file(infilename)

The above returns a nested structure of Erlang tuples and lists.
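If you want to see that structure without creating a file, :xmerl_scan.string/1 parses a document held in a character list and returns the same kind of result. A small sketch (the sample document is made up for illustration):

```elixir
# xmerl works with character lists, so the document is written with ~c.
{root, rest} = :xmerl_scan.string(~c"<greeting lang=\"en\">hello</greeting>")

# The root node is a plain tuple. Its first element is the record name
# and its second element is the tag name, as an atom.
IO.inspect(elem(root, 0))   # :xmlElement
IO.inspect(elem(root, 1))   # :greeting

# Whatever input follows the document is returned as the second item.
IO.inspect(rest)            # []
```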

The records -- The tuples in this nested structure are actually Erlang records. Those records are defined in xmerl/include/xmerl.hrl. So that we can use those record definitions in Elixir, we include the following code in our module:

require Record
Record.defrecord :xmlElement,
  Record.extract(:xmlElement, from_lib: "xmerl/include/xmerl.hrl")
Record.defrecord :xmlText,
  Record.extract(:xmlText, from_lib: "xmerl/include/xmerl.hrl")
Record.defrecord :xmlAttribute,
  Record.extract(:xmlAttribute, from_lib: "xmerl/include/xmerl.hrl")

See the following for more on using Erlang records in Elixir: https://hexdocs.pm/elixir/Record.html.

Note that there are xmerl record definitions in addition to the ones I've defined (xmlElement, xmlText, and xmlAttribute), but these are the only ones I'll need here.

In order to use each of these record definitions, you will need to know the names of the fields defined in each one. You can look in xmerl/include/xmerl.hrl to learn those names. For example, here is the Erlang code that defines xmlElement:

-record(xmlElement,{
      name,         % atom()
      expanded_name = [],   % string() | {URI,Local} | {"xmlns",Local}
      nsinfo = [],          % {Prefix, Local} | []
      namespace=#xmlNamespace{},
      parents = [],     % [{atom(),integer()}]
      pos,          % integer()
      attributes = [],  % [#xmlAttribute()]
      content = [],
      language = "",    % string()
      xmlbase="",           % string() XML Base path, for relative URI:s
      elementdef=undeclared % atom(), one of [undeclared | prolog | external | element]
     }).

Given the above definition, we can access the fields of an xmlElement record with expressions such as the following:

tag = xmlElement(element, :name)
children = xmlElement(element, :content)

Since each Erlang record is a tuple, and the first element of that tuple is the record name, we can determine the type of one of these tuples or nodes in the XML tree with something like the following:

case elem(node, 0) do
  :xmlElement ->
    # handle an xmlElement node
    handle_element(node)
  :xmlText ->
    # handle an xmlText node
    handle_text(node)
  :xmlAttribute ->
    # handle an xmlAttribute node
    handle_attribute(node)
  _ ->
    # handle any other kind of node
    handle_the_others(node)
end
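An alternative to inspecting elem(node, 0) is to let the record macros do the dispatch in function heads, since they also work as patterns. This is only a sketch of that idea; NodeKindSketch and its functions are hypothetical names, and the sample code linked from this post uses its own approach.

```elixir
defmodule NodeKindSketch do
  require Record
  Record.defrecord :xmlElement,
    Record.extract(:xmlElement, from_lib: "xmerl/include/xmerl.hrl")
  Record.defrecord :xmlText,
    Record.extract(:xmlText, from_lib: "xmerl/include/xmerl.hrl")

  # Each clause matches one record type and pulls out the field it needs.
  def describe(xmlElement(name: name)), do: {:element, name}
  def describe(xmlText(value: value)), do: {:text, to_string(value)}
  # Any other record: report its record name.
  def describe(other), do: {:other, elem(other, 0)}

  # Parse a document and describe the immediate children of its root.
  def describe_children(xml_charlist) do
    {xmlElement(content: children), _rest} = :xmerl_scan.string(xml_charlist)
    Enum.map(children, &describe/1)
  end
end

IO.inspect(NodeKindSketch.describe_children(~c"<a>hi<b/>there</a>"))
# [text: "hi", element: :b, text: "there"]
```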

The recursive, nested walk -- The recursive tree traversal happens in the walk function. It looks like this:

fn_walk_children = fn (elm) ->
  walk(elm, re_pattern, fields, infilename, path)
end # fn
children = xmlElement(element, :content)
Enum.each(children, fn_walk_children)

We create an anonymous function to use with Enum.each/2. This function does the recursive call to walk. Then we get the immediate children of the current node (an xmlElement). And, finally, we use Enum.each/2 to iterate over the children and perform the recursive call.
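To show the shape of the recursion in isolation, here is a stripped-down sketch. It is not the walk/5 from search_xml04.ex: the module TreeWalkSketch and its walk/2 are hypothetical, and instead of searching and printing it uses Enum.flat_map/2 to collect the tag path of every element so that the result is easy to inspect.

```elixir
defmodule TreeWalkSketch do
  require Record
  Record.defrecord :xmlElement,
    Record.extract(:xmlElement, from_lib: "xmerl/include/xmerl.hrl")

  # Returns the path of tag names for every element, in document order.
  def walk(node, path \\ [])

  def walk(xmlElement(name: name, content: children), path) do
    here = path ++ [name]
    # Recurse into the immediate children, passing the extended path along.
    [here | Enum.flat_map(children, &walk(&1, here))]
  end

  # Text nodes and anything else have no element children to visit.
  def walk(_other, _path), do: []
end

{root, _rest} = :xmerl_scan.string(~c"<a><b>one</b><c><d/></c></a>")
IO.inspect(TreeWalkSketch.walk(root))
# [[:a], [:a, :b], [:a, :c], [:a, :c, :d]]
```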

Search each node -- What's left is to search each node. That means that we need to be able to retrieve (1) the value of each text node, (2) the value of each attribute node, and (3) the name/tag of each element node. Because we are using Regex.match?/2 to search the actual text values, and because, in some cases, xmerl records contain character lists rather than binary strings, we need to convert those character lists to binary strings with to_string/1. This is done in the function walk/5.
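The following sketch shows one way the conversion and the match could fit together for a single element. The names NodeSearchSketch, field_value/1, local_nodes/1, and hits/2 are hypothetical and are not the ones used in walk/5.

```elixir
defmodule NodeSearchSketch do
  require Record
  Record.defrecord :xmlElement,
    Record.extract(:xmlElement, from_lib: "xmerl/include/xmerl.hrl")
  Record.defrecord :xmlText,
    Record.extract(:xmlText, from_lib: "xmerl/include/xmerl.hrl")
  Record.defrecord :xmlAttribute,
    Record.extract(:xmlAttribute, from_lib: "xmerl/include/xmerl.hrl")

  # The searchable value of each node type, always as a binary string.
  def field_value(xmlElement(name: name)), do: Atom.to_string(name)
  def field_value(xmlText(value: value)), do: to_string(value)
  def field_value(xmlAttribute(value: value)), do: to_string(value)

  # The element itself, its attributes, and its immediate text children.
  def local_nodes(xmlElement(attributes: attrs, content: content) = element) do
    texts = Enum.filter(content, &Record.is_record(&1, :xmlText))
    [element | attrs] ++ texts
  end

  # The values on this element that match the regular expression.
  def hits(element, regex) do
    element
    |> local_nodes()
    |> Enum.map(&field_value/1)
    |> Enum.filter(&Regex.match?(regex, &1))
  end
end

{root, _rest} = :xmerl_scan.string(~c"<note lang=\"en\">hello world</note>")
IO.inspect(NodeSearchSketch.hits(root, ~r/o/))
# ["note", "hello world"]
```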

3 Using xpath

This second Elixir module performs roughly the same task as the previous one. It uses xmerl_xpath to find the parts of the document that are of interest (elements and their tags, attributes, and text content).

User advisory: Because of the way this module implements the search, it seems to me to be a very inefficient way to do it. For example, when you search the tags, this algorithm first retrieves a list of all elements (xmlElement) in the document, then searches the tag of each of those elements. For a large XML document containing many elements, this would likely be a very large list. Therefore, although this module seems to me to be a reasonable example of how to use xmerl_xpath, it is perhaps not such a good example of what to use it for.

It would be an interesting improvement if we could get :xmerl_xpath.string/2 to return an Elixir stream instead of a list. Maybe someday ...

Once again, you can view the xpath search module here: search_xml05.ex

If you read that code, you will notice a line containing #require IEx and another containing #IEx.pry. That's left-over debugging code, and enabled me to break during execution of the code, drop into an Elixir iex shell, and inspect variables. If you try to use it yourself, you will need to run the code inside the Elixir iex shell. There is a test/0 function in the module that can serve as a test harness, if you do so. And, there are some lines containing #IO.inspect(...), which were also used for debugging and viewing the contents of variables.

The actual searches are performed by three functions: search_file_tag/3, which searches for tags containing the pattern; search_file_attrib/3, which searches attribute values; and search_file_text/3, which searches text content. They all follow a common pattern: (1) collect a list of nodes using :xmerl_xpath.string/2; (2) filter that list with Enum.filter/2 using an anonymous function that calls Regex.match?/2; and (3) display information about the nodes that satisfied the match, including the file name, the matched text, and a list or path of the parents of the node. We use the Elixir pipe operator ("|>") to pipe the list created by each of these steps to the next step.
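As a rough illustration of that collect, filter, and report pipeline, here is a simplified sketch. It is not the code from search_xml05.ex: XpathSearchSketch, search_tags/2, and search_text/2 are hypothetical, they take an already parsed root rather than a file name, and they return tuples instead of printing. I am also assuming that the parents field lists the nearest parent first, which is why it is reversed.

```elixir
defmodule XpathSearchSketch do
  require Record
  Record.defrecord :xmlElement,
    Record.extract(:xmlElement, from_lib: "xmerl/include/xmerl.hrl")
  Record.defrecord :xmlText,
    Record.extract(:xmlText, from_lib: "xmerl/include/xmerl.hrl")

  # Elements whose tag matches the regex, with the names of their parents.
  def search_tags(root, regex) do
    ~c"//*"
    |> :xmerl_xpath.string(root)
    |> Enum.filter(fn node ->
      Regex.match?(regex, Atom.to_string(xmlElement(node, :name)))
    end)
    |> Enum.map(fn node ->
      {xmlElement(node, :name), parent_names(xmlElement(node, :parents))}
    end)
  end

  # Text nodes whose content matches the regex, with the names of their parents.
  def search_text(root, regex) do
    ~c"//text()"
    |> :xmerl_xpath.string(root)
    |> Enum.filter(fn node -> Regex.match?(regex, to_string(xmlText(node, :value))) end)
    |> Enum.map(fn node ->
      {to_string(xmlText(node, :value)), parent_names(xmlText(node, :parents))}
    end)
  end

  # parents is a list of {tag, position} tuples; keep only the tags.
  # Assumption: nearest parent comes first, so reverse to start at the root.
  defp parent_names(parents) do
    parents |> Enum.reverse() |> Enum.map(fn {name, _pos} -> name end)
  end
end

{root, _rest} =
  :xmerl_scan.string(~c"<library><book>Elixir</book><book>Erlang</book><shelf/></library>")

IO.inspect(XpathSearchSketch.search_tags(root, ~r/^b/))
IO.inspect(XpathSearchSketch.search_text(root, ~r/^Er/))
```

Note that both functions build the complete node list before filtering it, so the efficiency concern raised in the user advisory applies to this sketch too.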

4 Wrap-up

So, that's about it. Perhaps you will agree with me that Elixir is quite convenient as a language for processing XML using Erlang's xmerl modules.

Now, I just wish that I could find an interactive debugger for Elixir. Maybe someday ... In the meantime, there are some suggestions for Elixir debugging techniques here: http://blog.plataformatec.com.br/2016/04/debugging-techniques-in-elixir-lang/.
