Architecture

High-Level Architecture

The following diagram provides a high-level view of the Readux system architecture. The direction of arrows indicates the flow of data.

digraph {
    rankdir=BT;
    // using clusters to separate out layers/access
    subgraph cluster_0 {
        style=invis;
        User[shape=box];
    }
    subgraph cluster_1 {
        style=invis;
        Webserver -> User;
    }
    subgraph cluster_2 {
        style=invis;
        Loris -> Webserver;
        Fedora -> Loris;
        SQLDB -> Webserver;
        Webserver -> SQLDB;
        Fedora -> Webserver;
        Fedora -> eulindexer;
        eulindexer -> Fedora;
        Webserver -> eulindexer;
        eulindexer -> Solr;
        Solr -> Webserver;
    }
}

Readux is a Django application running on a web server. It uses a SQL database for user accounts, collection banner images, and annotations. Collection and digitized book content is stored in a Fedora Commons 3.x repository and accessed through its REST APIs using eulfedora. In normal operation, Readux does not ingest or modify content in Fedora (although the codebase currently includes scripts for ingesting content).

Readux uses the Loris IIIF Image Server to serve page images in various sizes and to support deep zoom. Loris is configured to pull images from Fedora by URL. Currently, Loris is not directly exposed to users; Loris IIIF output is mediated through the Readux application (although this may be revisited). The page image URLs do matter: Readux image annotations reference the image URL, so changing URLs will require updating existing annotations. Using more semantic image URLs (thumbnail, page, full) within Readux may be a more durable choice than exposing specific IIIF URLs with hard-coded sizes.
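To illustrate the semantic URL idea, here is a minimal sketch of how the three semantic names could be translated into IIIF Image API 2.0 URLs behind the scenes. The function name, the base URL, and the pixel sizes are assumptions made for illustration; they are not taken from the Readux codebase.

```python
from urllib.parse import quote

# Hypothetical mapping of semantic image names to IIIF Image API 2.0 size
# parameters; the pixel dimensions are illustrative, not Readux's values.
SEMANTIC_SIZES = {
    "thumbnail": "!300,300",  # best fit within a 300x300 box
    "page": "1000,",          # scale to a width of 1000 pixels
    "full": "full",           # image at its full size
}


def iiif_image_url(base_url, image_id, mode):
    """Build a IIIF Image API URL for a semantic image mode.

    URL pattern: {base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}
    """
    size = SEMANTIC_SIZES[mode]
    # Identifiers are percent-encoded, including any slashes or colons.
    identifier = quote(image_id, safe="")
    return f"{base_url.rstrip('/')}/{identifier}/full/{size}/0/default.jpg"
```

With a scheme like this, annotations could reference the stable semantic name, and the size behind "thumbnail" or "page" could change without invalidating them.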

Readux uses Apache Solr for search and browse functionality. This includes:

  • searching across all volumes (see VolumeSearch())
  • searching within a single volume (see VolumeDetail())
  • browsing volumes by collection (see CollectionDetail() or CollectionCoverDetail())
  • browsing pages within a volume (see VolumePageList())

We use eulindexer to manage and update Fedora-based Solr indexes. eulindexer loads the Solr configuration from Readux and listens to the Fedora messaging queue for updates. When an update event occurs, eulindexer queries Fedora to determine the content type (based on content model), and then queries the relevant application for the index data to be sent to Solr. Readux uses the eulfedora.indexdata views and extends the default eulfedora.models.DigitalObject.index_data() method to include additional fields needed for Readux-specific functionality; see the code for readux.books.models.Volume.index_data() and readux.books.models.Page.index_data() for specifics. The current Solr schema is included in the deploy/solr directory.
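The pattern of extending the default index data can be sketched as follows. To keep the example self-contained, DigitalObject here is a simplified stand-in rather than the real eulfedora class, and the extra field names (page_count, collection_id) are illustrative assumptions; the actual fields are defined in readux.books.models.

```python
class DigitalObject:
    """Stand-in for eulfedora.models.DigitalObject (not the real class)."""

    def __init__(self, pid, label):
        self.pid = pid
        self.label = label

    def index_data(self):
        # Mimics a default implementation that returns a dictionary of
        # common fields; only two fields are included in this sketch.
        return {"pid": self.pid, "label": self.label}


class Volume(DigitalObject):
    """Hypothetical volume that adds fields to the default index data."""

    def __init__(self, pid, label, page_count, collection_id):
        super().__init__(pid, label)
        self.page_count = page_count
        self.collection_id = collection_id

    def index_data(self):
        # Start from the default fields, then add application-specific ones.
        data = super().index_data()
        data["page_count"] = self.page_count        # illustrative field name
        data["collection_id"] = self.collection_id  # illustrative field name
        return data
```

Calling super().index_data() first keeps the common fields consistent across content types, so each model only needs to describe what is specific to it.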

Book Content Models

The following diagram shows how Readux digitized book content is structured in Fedora.

digraph {
    rankdir=RL;
    subgraph cluster_0 {
        style=invis;
        Volume2 -> Book[label="constituent", style=dashed];
        Volume -> Book[label="constituent"];
        Volume3 -> Book[label="constituent", style=dashed];
    }
    subgraph cluster_1 {
        color=lightgray;
        label="Pages";
        node [shape=box];
        Page1 -> Volume[label="constituent"];
        Page2 -> Volume[label="constituent"];
        Page3 -> Volume[label="constituent"];
    }
    Volume -> Page1[label="primary image"];
}

For consistency between single and multi-volume works, every Volume is associated with a Book. The Book object contains the MARC XML metadata; each Volume includes a PDF, and may include OCR XML. For Volumes with pages loaded, each page is stored as an individual object with a relation to the parent volume. Any volume with a cover loaded has at least one page, and uses the special hasPrimaryImage relationship to indicate which page image should be used as the cover or for thumbnails (this may or may not be the first page). The book-level object is not currently exposed directly in Readux, but it is used to associate volumes with collections, and volumes from the same book are linked as “related volumes” from each individual volume landing page.

Pages are ordered within a volume using a pageOrder property set in the RELS-EXT of each page.
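As a rough sketch of this structure, the following uses plain dataclasses in place of Fedora objects. The class and attribute names are illustrative stand-ins for the pageOrder property and the hasPrimaryImage relationship, not the actual readux.books.models API.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Page:
    pid: str
    page_order: int  # stands in for the pageOrder value in RELS-EXT


@dataclass
class Volume:
    pid: str
    pages: List[Page] = field(default_factory=list)
    primary_image: Optional[Page] = None  # stands in for hasPrimaryImage

    def ordered_pages(self):
        """Return the pages sorted by their page order value."""
        return sorted(self.pages, key=lambda page: page.page_order)


cover = Page("page:2", page_order=2)
volume = Volume(
    "vol:1",
    pages=[Page("page:3", 3), Page("page:1", 1), cover],
    primary_image=cover,  # the cover need not be the first page
)
page_pids = [page.pid for page in volume.ordered_pages()]
```

Because the order is stored on each page rather than on the volume, pages can be loaded in any sequence and still be displayed correctly.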

For implementation specifics, see code documentation for:

  • readux.books.models.Book
  • readux.books.models.Volume
  • readux.books.models.Page

Volume and Page variants

Readux currently includes two different variants of Volume and Page objects. The primary difference is where the OCR lives. ScannedVolume-1.0 objects contain a single ABBYY OCR XML file with the OCR for the entire volume, and their ScannedPage-1.0 pages hold plain text OCR content. ScannedVolume-1.1 objects have no volume-level OCR; instead, each page has a METS/ALTO OCR XML file.

graph {
    node [shape=record];
    rankdir=BT;
    // set rank the same so variants will be displayed side by side
    {rank=same v1:id v11:id}
    {rank=same p1:id p11:id}
    subgraph cluster_0 {
        color=lightgray;
        v1 [label="{<id>Volume 1.0|<pdf>PDF|<ocr>Abbyy OCR}"];
        p1 [label="{<id>Page 1.0|<img>Image|<ocr>OCR text|<pos>Word positions|<tei>TEI facsimile}"];
        p1 -- v1;
    }
    subgraph cluster_1 {
        color=lightgray;
        v11 [label="{<id>Volume 1.1|<pdf>PDF}"];
        p11 [label="{<id>Page 1.1|<img>Image|<ocr>ALTO OCR|<tei>TEI facsimile}"];
        p11 -- v11;
    }
    // datastream equivalencies: including all of these causes a segfault,
    // so leaving out the obvious ones
    // v1:pdf -- v11:pdf [dir=none, color="blue"];
    // p1:img -- p11:img [dir=none, color="blue"];
    // p1:tei -- p11:tei [dir=none, color="blue"];
    // dashed because not exactly equivalent
    p11:ocr -- v1:ocr [dir=none, style="dashed", color="blue"];
}

Readux uses TEI facsimile to provide a consistent format for positioned OCR text data across these variations. Readux includes scripts and XSLT to generate TEI from the volume-level ABBYY OCR or the page-level ALTO, and adds the TEI to the page object in Fedora. In addition, Readux adds XML ids to the original OCR XML; these ids are carried through to the TEI and then to the HTML displayed on the Readux site for annotation, so that annotations remain durably correlated with the content they reference.
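The general technique of adding ids to OCR XML can be sketched with the Python standard library. This is a simplified illustration, not the Readux implementation: the function name, the sample markup, and the id scheme (a prefix plus a counter) are all assumptions.

```python
import xml.etree.ElementTree as ET

# Qualified attribute name for xml:id as ElementTree represents it.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def add_xml_ids(root, prefix="ocr"):
    """Assign an xml:id to every element that lacks one, in document order."""
    count = 0
    for element in root.iter():
        if XML_ID not in element.attrib:
            count += 1
            element.set(XML_ID, f"{prefix}.{count:04d}")
    return root


# Tiny stand-in for an OCR document; real ABBYY or ALTO files are namespaced
# and far richer.
sample = ET.fromstring(
    "<page><line><word>Readux</word><word>books</word></line></page>"
)
add_xml_ids(sample)
ids = [element.get(XML_ID) for element in sample.iter()]
```

Elements that already carry an id are left untouched, so ids assigned once stay stable if the function is run again on the same document.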

Fedora pids

Readux is intended for display and access, and not as a management tool. However, for historical reasons it currently includes some scripts for importing covers and pages, and also a preliminary script for importing a new Volume-1.1 work with pages and metadata (see import_volume). Prior to Readux, existing Emory Libraries digitized book content in the repository only included Book and Volume records. The import_covers and import_pages manage commands add covers and pages to that content, but their current implementation depends on a legacy Digitization Workflow (readux.books.digwf).

Following our standard practice, any objects ingested via Readux have Archival Resource Keys (ARKs) generated via our PID manager application, which are then used as the basis for Fedora object pids. The ARK is stored in the object metadata and displayed on the website as a permalink.
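As a hedged illustration of basing a pid on an ARK, the sketch below takes the name portion of an ARK and combines it with a pid namespace. The exact mapping, the function name, and the example values are assumptions; the source only states that the ARK is used as the basis for the pid.

```python
import re

# ARK syntax: ark:/NAAN/Name, optionally followed by qualifiers.
ARK_PATTERN = re.compile(r"ark:/(?P<naan>[^/]+)/(?P<name>[^/?#]+)")


def pid_from_ark(ark, pid_namespace):
    """Derive a namespace:id style pid from the name portion of an ARK.

    The mapping is an assumption made for illustration only.
    """
    match = ARK_PATTERN.search(ark)
    if match is None:
        raise ValueError(f"no ARK found in {ark!r}")
    return f"{pid_namespace}:{match.group('name')}"


# Hypothetical resolver URL and namespace.
example_pid = pid_from_ark("https://pid.example.org/ark:/12345/abc9z", "example")
```

Deriving the pid from the ARK means the permalink and the repository identifier can be traced to each other without a separate lookup table.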