py-pagexml: Python wrapper for the PageXML C++ library
The py-pagexml python package is a library of functions that eases working with omni:us Pages Format XML files (referred to as OPF XML files). It allows you from python to read an OPF file, extract information contained within, modify or add content, create an OPF from scratch, crop parts for page images, etc.
By default OPF XML files are validated against the XSD schema when reading, saving, or on request by calling a function. The documentation the XSD schema and the schema itself included in py-pagexml can be found at:
The official online documentation of py-pagexml is available at https://omni-us.github.io/pagexml/py-pagexml.
The py-pagexml package can be built with two modes: normal and slim. As the name
implies, the slim build is smaller but more importantly it has less library
dependencies. This also means that there are some features which are not
available, namely: functions related to images, e.g.
PageXML.crop
; and functions that perform intersections of polygons, e.g.
PageXML.selectByOverlap
.
Software dependencies
The core of py-pagexml is a compiled C++ library that links with a few libraries, so it requires installation of dependencies that cannot be automatically obtained from pypi servers.
There are docker images available at docker hub which include both the runtime and the build dependencies already installed. In particular the runtime docker images are intended to be used as base images for applications that use pagexml. The specific list of dependencies both for runtime and building are listed below.
Runtime dependencies
- Slim:
python3
- Normal (in addition to the previous):
libopencv-imgcodecs (Ubuntu 18.04/20.04) | libopencv-highgui (Ubuntu 16.04)
libopencv-imgproc
libopencv-core
libgdal
Building dependencies
- Slim:
python3-setuptools
python3-pkgconfig
python3-wheel
python3-dev
swig
- Normal (in addition to the previous):
libopencv-dev
libgdal-dev
libboost-all-dev
Installation from wheel binary file
If you have configured a pypi server that includes pagexml, installation is as simple as:
pip3 install pagexml
The slim build has a different name, thus the install comand would be:
pip3 install pagexml_slim
Otherwise you can install it from a github release. Each release includes multiple wheel files. One for python 3.5 which is built for Ubuntu 16.04, another for python 3.6 built for Ubuntu 18.04 and another for python 3.8 built for Ubuntu 20.04. Once you have located the appropriate wheel file, copy the link and run as follows replacing the URL with the one you copied:
pip3 install https://github.com/omni-us/pagexml/releases/download/20*/pagexml-20*-linux_x86_64.whl
Building the wheel file from source
Clone the github repository https://github.com/omni-us/pagexml.git, go to the py-pagexml directory and then run:
pip3 install --editable .[dev]
./setup.py bdist_wheel
To build the slim package, give the --slim
command line option, e.g.:
./setup.py bdist_wheel --slim
Simple usage examples
Create a new Page XML adding regions, text and properties
import pagexml
pxml = pagexml.PageXML()
# Create a new page xml
file = 'example_image.jpg'
width = 400
height = 200
pxml.newXml('name-and-version-of-tool', file, width, height)
# Add a text region to the Page
page = pxml.selectNth('//_:Page', 0)
reg = pxml.addTextRegion(page)
# Set text region bounding box with a confidence
pxml.setCoordsBBox(reg, 10, 20, 80, 60, 0.8)
# Set the text for the text region with a confidence
pxml.setTextEquiv(reg, 'lorem ipsum', 0.9)
# Add property to text region
pxml.setProperty(reg, 'key', 'value')
# Add a second page with a text region and specific id
page = pxml.addPage('example_image_2.jpg', 300, 300)
reg = pxml.addTextRegion(page, 'regA')
pxml.setCoordsBBox(reg, 15, 12, 76, 128)
# Write XML to file
pxml.write('example_image.xml')
Modify an existing Page XML
# Load an existing XML
import pagexml
pxml = pagexml.PageXML('example_image.xml')
# Add content to loaded XML
pxml.setProperty(pxml.selectNth('//_:Page', 0), 'key', 'value')
# Write XML to file
pxml.write('example_image_2.xml')
Crop an element and save image to disk
# Load an existing XML
import pagexml
pxml = pagexml.PageXML('examples/lorem.xml')
# Crop element with specific ID
cropped = pxml.crop('//*[@id="r1_l1"]/_:Coords')[0]
# Save image to disk
pagexml.imwrite(cropped.name+'.png', cropped.image)