1   Introduction

In this article, we'll focus on two capabilities:

  • The ability to call Python from Elixir.
  • The ability to keep a Python process alive, along with the data referenced by any global variables.

Python has a huge number of libraries and modules. As I write this, the home page for the Python Package Index says "315,762 projects" (https://pypi.org/). Many of them are excellent and powerful. The ability to use them from within your Elixir code can be a huge gain.

This article will discuss Lxml, a package for processing XML, but you could just as well make use of Numpy, Scipy, Pandas, Dask, scikit-learn, nltk, and thousands more.

We'll also show what it means to keep a Python process alive, and how that causes data referenced by a global variable to persist across calls into functions in the module that implements that process.

2   Some preliminaries

To gain access to a Python module from Elixir, we'll use Export, which is a wrapper around Erlport.

Create a new app, if you need it, with the following:

$ mix new my_app

Then change directories into my_app, or whatever you've called it.

We include Export (which pulls in its dependency Erlport) in our mix application with the following in our mix.exs. We also add Jason, since the sample code below uses it to decode JSON:

defp deps do
  [
    {:export, "~> 0.1.0"},
    # Jason is used below to decode JSON strings returned from Python.
    {:jason, "~> 1.2"},
  ]
end

Then, don't forget to do:

$ mix deps.get
$ mix deps.compile

The documentation for Erlport and Export is quite good, so I won't repeat it here, although you will find several examples of their use below. I suggest that you read the Export documentation for an overview and examples of Elixir code, then read the Erlport documentation for the details. Since Export is a reasonably thin layer on top of Erlport, when you have questions, you will be able to see the connections between them fairly easily.

You can also read the code. If you have included the above dependencies in your mix.exs and then done $ mix deps.get, you will be able to look at deps/export/lib/export/python.ex in your project directory. One hint in advance: the implementation of Python.call uses pattern matching to select among definitions, and one of those definitions is implemented with defmacro.
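
To make that concrete, here is a minimal sketch of the two call forms. The Python module "test" and its upcase function are placeholders, assumed to live under lib/python:

defmodule CallForms do
  use Export.Python

  def demo do
    # Start a Python OS process that looks for modules under lib/python.
    {:ok, py} = Python.start(python_path: Path.expand("lib/python"))

    # Plain form: module name, function name, and argument list.
    result1 = Python.call(py, "test", "upcase", ["hello"])

    # Macro form: reads like an ordinary function call; this is the
    # definition implemented with defmacro mentioned above.
    result2 = Python.call(py, upcase("hello"), from_file: "test")

    Python.stop(py)
    {result1, result2}
  end
end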

However, for our purposes it is important to understand what happens when you use Erlport, by way of Export, to create a "process". The process that is created is an operating system process. For those of you who are used to thinking of processes in the Erlang and Elixir way, keep in mind that an operating system process is different from an Erlang/Elixir process. For one thing, it shows up in your task manager, for example in top/htop on Linux and the Task Manager on MS Windows. For another, operating system processes are quite heavyweight in comparison with Erlang/Elixir processes. You will not want to create hundreds of them, and you may even shy away from creating dozens, as you freely would with Erlang/Elixir processes.

3   A few problems and their solutions

3.1   Preserving Python Data

In Elixir, we create that operating system process with Python.start and we kill it with Python.stop. You can check this, by the way, by using Process.alive?(pid). After starting a process, we can make as many calls as we want to functions defined in Python modules on the python_path we gave to Python.start. And, while that process is alive, any global variables in any of those modules are preserved, for as long as we do not kill the process by calling Python.stop.
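
For example (a minimal sketch; the comments show what I would expect Process.alive?/1 to report):

{:ok, py} = Python.start(python_path: Path.expand("lib/python"))
Process.alive?(py)   # true while the OS process is running
Python.stop(py)
Process.alive?(py)   # false once the process has shut down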

This means, for example, that we can reload a complex data structure or perform some time-intensive computation, then make multiple calls that reuse the resulting data items without recomputing them.
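
As a concrete sketch, using the LxmlTasks module defined in section 4 below, we can parse a file once and then run several searches against the tree cached in the Python process:

xml_path = "Data/weather-data-provider-1-0.xml"
{:ok, pid} = LxmlTasks.start()

# Parse once; the element tree is stored in a global dictionary on the Python side.
:ok = LxmlTasks.xpath_load_tree(pid, xml_path)

# Each search reuses the already-parsed tree; nothing is re-parsed.
{:ok, columns} = LxmlTasks.xpath_search_one_pattern(pid, xml_path, ".//Column")
{:ok, ids} = LxmlTasks.xpath_search_one_pattern(pid, xml_path, ".//Identification")

# Stopping the Python process discards the cached tree.
LxmlTasks.stop(pid)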

3.2   Passing complex data between Elixir and Python

One solution to this need is to serialize the data to a string at one end, and de-serialize it at the other. You can use JSON as your serialization format. So:

  • If, in your Elixir code, you have a complex data object (for example, an Elixir map or keyword list) and you want to pass that data item to Python, use the Elixir Jason module to convert it to a JSON string and pass that to Python.
  • If, in your Python code, you have a complex data object (for example, a Python dictionary or list of tuples) and you want to pass that data item to Elixir, use the Python json module to convert it to a JSON string and pass that (back) to Elixir.

But not all data items can be encoded as JSON strings. JSON can represent things that are composed of dictionaries/maps, lists, and simple types (strings, numbers, booleans, and null); see https://en.wikipedia.org/wiki/JSON#Data_types. So, converting your data to something that can be encoded as JSON is "left as an exercise for the user". One suggestion is to experiment with json.dumps and json.loads from the Python json module, and with Jason.encode and Jason.decode from the Elixir Jason module, in order to learn what form your data should take.
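
Here is a minimal Elixir-side sketch of that round trip. The Python module my_module and its function json_echo are hypothetical; the assumption is that json_echo decodes the bytes it receives with json.loads and returns a json.dumps string:

{:ok, py} = Python.start(python_path: Path.expand("lib/python"))

payload = %{"name" => "station-1", "readings" => [1.5, 2.25, 3.0]}

# Elixir -> Python: encode to a JSON string before making the call.
json_in = Jason.encode!(payload)
json_out = Python.call(py, "my_module", "json_echo", [json_in])

# Python -> Elixir: decode the JSON string that came back.
{:ok, result} = Jason.decode(json_out)

Python.stop(py)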

Another option is to serialize to and de-serialize from Yaml. In Elixir, consider using fast_yaml, yaml_elixir, or the Erlang yamerl module. In Python, consider using pyyaml or pylibyaml (either of which can be installed from https://pypi.org using pip).
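
On the Elixir side, decoding a Yaml string with yaml_elixir might look like the following sketch; it assumes yaml_elixir is in your deps, and in practice the string would come back from a Python call that used yaml.dump:

yaml_string = """
name: station-1
readings:
  - 1.5
  - 2.25
"""

# read_from_string/1 returns {:ok, data} on success.
{:ok, data} = YamlElixir.read_from_string(yaml_string)
# data is %{"name" => "station-1", "readings" => [1.5, 2.25]}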

4   Some sample code

Here is some code that (1) starts a Python process; (2) parses and loads an XML file using Lxml in Python; (3) does several xpath searches in the loaded Lxml element tree; and, finally, stops/kills the Python process.

4.1   Elixir

Here is the Elixir code:

# lib/lxml_tasks.ex

defmodule LxmlTasks do

  use Export.Python

  @moduledoc """
  Various XML tasks using Lxml.
  """

  @doc """
  Run a test that loads several XML files and does several xpath searches.

  ## Options

  * `:verbose` - print extra info if `true`.  Default is `false`.

  ## Examples

      iex> LxmlTasks.test()

  """
  @spec test(keyword()) :: [[map()]]
  def test(opts \\ []) do
    verbose = Keyword.get(opts, :verbose, false)
    xml_path1 = "Data/weather-data-provider-1-0.xml"
    xml_path2 = "Data/verification-method-1-0.xml"
    xpath_patterns = [".//Column", ".//Identification", ]
    {:ok, pid} = LxmlTasks.start()
    xpath_load_tree(pid, xml_path1)
    xpath_load_tree(pid, xml_path2)
    results = Enum.map(xpath_patterns, fn pattern ->
      {:ok, items} = xpath_search_one_pattern(pid, xml_path1, pattern, verbose: verbose)
      items
    end)
    stop(pid)
    results
  end

  @doc """
  Open/start the Python process with the path to our Python files.
  """
  @spec start() :: {:ok, pid()}
  def start() do
    {:ok, py_pid} = Python.start(python_path: Path.expand("lib/python"))
    {:ok, py_pid}
  end

  @doc """
  Close/stop the Python process.
  """
  @spec stop(pid()) :: :ok
  def stop(py_pid) do
    py_pid |> Python.stop()
    :ok
  end

  @doc """
  Parse and load an XML Etree tree.  Save it in the Python process.

  ## Args

  * `pid` -- process ID of the Python process.
  * `xml_path` -- path to the XML file to be loaded.

  """
  @spec xpath_load_tree(pid(), iodata()) :: :ok | list()
  def xpath_load_tree(pid, xml_path) do
    # Returns :ok on success, or a list [:error, reason] from the Python side.
    Python.call(pid, load_xml_tree(xml_path), from_file: "lxml_tasks")
  end

  @doc """
  Do an XPath search and return a list of found items.

  ## Options

  * `:verbose` - print extra info if `true`.  Default is `false`.

  ## Examples

      {:ok, pid} = start()
      :ok = xpath_load_tree(pid, xml_path)
      results = Enum.map(xpath_patterns, fn pattern ->
        {:ok, items} = xpath_search_one_pattern(pid, xml_path, pattern)
        items
      end)

  """
  @spec xpath_search_one_pattern(pid(), iodata(), iodata(), keyword()) :: {:ok, [map()]}
  def xpath_search_one_pattern(pid, xml_path, xpath_pattern, opts \\ []) do
    verbose = Keyword.get(opts, :verbose, false)
    json_items = Python.call(pid, xpath_search_one_pattern(xml_path, xpath_pattern), from_file: "lxml_tasks")
    if verbose do
      IO.inspect(json_items, label: "xpath_search_one/json_items")
    end
    {:ok, items} = Jason.decode(json_items)
    if verbose do
      IO.inspect(items, label: "xpath_search_one/items")
    end
    {:ok, items}
  end

end

Notes on the code (above):

  • Look at function test/0 for an overview of how we perform these tasks.
  • Function start/0 creates and starts up the Python process.
  • Function xpath_load_tree/2 makes a call to Python requesting that it parse an XML document and save the resulting element tree in a global dictionary for later use.
  • Function xpath_search_one_pattern/4 is where we actually do the XPath search. This function calls Python to search the element tree with XPath and to return a JSON string representing a list of dictionaries that describe the found elements. It then uses the Jason module to decode that JSON string and turn it into Elixir data structures.
  • Finally, since we're finished using the Python process, we call stop/1 to close and kill it. Note that, at this point, any element trees that we've loaded from XML documents will be lost. Something analogous would happen if you were calling Python using packages like Numpy, SciPy, Pandas, Dask, scikit-learn, etc. to load and process large data sets.
  • Note that you could call start/0 multiple times to create multiple Python processes and then distribute the work load across them, effectively eliminating worries that Python's global interpreter lock (the GIL) might slow down compute-intensive work; a rough sketch of this idea appears just below. In a later blog article, I hope to show how Poolboy, an Erlang pooling library usable from Elixir, could be used to manage such a pool.
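
Here is that rough sketch, without Poolboy, using Task.async_stream and the LxmlTasks functions above; the path and patterns are the ones used in test/1:

xml_path = "Data/weather-data-provider-1-0.xml"
patterns = [".//Column", ".//Identification"]

# Start one Python OS process per pattern (keep the count small).
pids =
  Enum.map(patterns, fn _ ->
    {:ok, pid} = LxmlTasks.start()
    :ok = LxmlTasks.xpath_load_tree(pid, xml_path)
    pid
  end)

# Each search runs in its own Python interpreter, so one interpreter's GIL
# cannot slow down the others.
results =
  Enum.zip(pids, patterns)
  |> Task.async_stream(fn {pid, pattern} ->
    {:ok, items} = LxmlTasks.xpath_search_one_pattern(pid, xml_path, pattern)
    items
  end)
  |> Enum.map(fn {:ok, items} -> items end)

Enum.each(pids, &LxmlTasks.stop/1)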

4.2   Python

And, here is the Python code that is used by the above Elixir code:

# lib/python/lxml_tasks.py

"""
Synopsis:
    Python functions for processing XML from Elixir.
See:
    Module LxmlTasks in `lib/lxml_tasks.ex`.
"""

from lxml import etree
from erlport.erlterms import Atom
import json


XmlTreeDict = {}


def element_to_dict(el):
    """Capture the tag, attributes, and child tags of one element."""
    d1 = {
        'tag': el.tag,
        'attrib': dict(el.attrib),
        'children': [child.tag for child in el],
    }
    return d1


def load_xml_tree(xml_path):
    """Parse an XML file and save the tree in a global dictionary.

    The path to the file is used as the dictionary key.
    """
    try:
        doc = etree.parse(xml_path)
        root = doc.getroot()
        XmlTreeDict[xml_path] = (doc, root)
        return Atom(b"ok")
    except (OSError, etree.XMLSyntaxError) as exp:
        return [Atom(b"error"), str(exp)]


def xpath_search_one_pattern(xml_path, xpath_pattern):
    """Search a previously loaded tree with XPath; return the results as JSON."""
    spec = XmlTreeDict.get(xml_path)
    if spec is None:
        return Atom(b"error")
    _, root = spec
    # lxml's xpath() does not accept a None prefix, so drop any default
    # namespace from nsmap before passing it in.
    namespaces = {k: v for k, v in root.nsmap.items() if k is not None}
    items = root.xpath(xpath_pattern, namespaces=namespaces)
    dict_items = [element_to_dict(el) for el in items]
    json_items = json.dumps(dict_items)
    return json_items

Notes on the code (above):

  • Function load_xml_tree is used to parse an XML document and to store the resulting tree in a global dictionary, using the document's path in the file system as the dictionary key. That's a little careless, I suppose, because relative paths to the same file can be different. An improvement would be to convert relative paths to absolute paths before passing them to our Python functions; Elixir function Path.absname/1 will do that. A small sketch follows after these notes.
  • Function xpath_search_one_pattern does the actual XPath search. It then calls function element_to_dict to create a dictionary that saves some data from each found element (data that might be relevant for some application's use). Then we encode that list of dictionaries as a JSON string and return that representation to the caller, which, in this case, is our Elixir code.
  • A simple enhancement to xpath_search_one_pattern would be to have it parse and save any document/tree that is not already in its global dictionary of element trees.
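
For example, with pid an already-started Python process, normalizing the path on the Elixir side before every call keeps the dictionary keyed consistently; this is only a sketch of the suggestion in the first note above:

# Convert to an absolute path so that relative and absolute references to the
# same file end up under the same dictionary key on the Python side.
xml_path = Path.absname("Data/weather-data-provider-1-0.xml")
:ok = LxmlTasks.xpath_load_tree(pid, xml_path)
{:ok, items} = LxmlTasks.xpath_search_one_pattern(pid, xml_path, ".//Column")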
