Getting Started: Apache UIMA DUCC

This guide is a user-oriented overview of Apache UIMA™ DUCC.
DUCC is short for Distributed UIMA Cluster Computing.

High Level Overview of the UIMA DUCC Platform

DUCC is a Linux cluster controller designed to scale out any UIMA pipeline, both for high-throughput collection processing jobs and for low-latency real-time applications. DUCC is particularly well suited to running large-memory Java analytics in multiple threads in order to fully utilize multicore machines. DUCC manages the life cycle of all processes deployed across the cluster, including non-UIMA processes such as Tomcat servers or VNC sessions.

DUCC has an extensive web interface providing details on all user activity. Because DUCC is built for UIMA-based analytics from the ground up, it automatically makes available such details as which annotators are currently initializing, as well as the timing breakdown for each primitive annotator in a pipeline.

DUCC's resource manager uses the combination of the specified memory requirement and scheduling class to find space to run each new managed process instance. The resource manager does not overcommit a machine's RAM, and DUCC uses Linux cgroups to prevent managed processes from interfering with each other. Only processes that exceed their requested memory allocation are subject to paging. The resource manager employs process preemption to dynamically rebalance compute resources between collection processing jobs.

DUCC is primarily intended for research and development activities where multiple users need to share cluster resources efficiently across a wide variety of computational activities. All processes run with the credentials of the submitting user. Process logfiles and DUCC-collected performance data are stored in the user's filesystem space. This allows (and requires) each user to decide how to manage the metadata associated with work submitted to DUCC.

Visit the UIMA-DUCC live demo description and the UIMA-DUCC live demo itself.

The following sections describe each of the three types of DUCC-managed processes (collection processing jobs, services and arbitrary processes) and contrast some differences between DUCC and Hadoop for scaling out UIMA applications.

DUCC Collection Processing Jobs

A classic UIMA pipeline starts with a Collection Reader (CR) that defines how to segment the input collection into separate artifacts for analysis, reads the input data, initializes a new CAS with each artifact and returns the CAS to be sent to downstream analytic components. Because a single CR supplying artifacts to a large number of analysis pipelines would be a bottleneck, DUCC implements collection level scale out according to the design in Figure 1.

Figure 1 - DUCC Collection Processing Job Model

In a DUCC collection processing job the role of collection segmentation is implemented by the CR run in the Job Driver. Figure 1 shows the CR inspecting the input collection to determine how to segment the data into work items to be sent to the analysis pipeline. The CR may also inspect the target output location to see which work items have already been done. Then the CR outputs small CASes containing references to input work items and the associated output locations.
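
As an illustration of this pattern, here is a minimal sketch of a work-item Collection Reader of the kind the Job Driver runs. It assumes the input collection is a directory of files and that a work item is identified simply by its file path; the class name, the "InputDirectory" parameter and the convention of carrying the path in the document text are illustrative assumptions, not part of DUCC.

import java.io.File;
import java.io.IOException;
import org.apache.uima.cas.CAS;
import org.apache.uima.collection.CollectionException;
import org.apache.uima.collection.CollectionReader_ImplBase;
import org.apache.uima.resource.ResourceInitializationException;
import org.apache.uima.util.Progress;
import org.apache.uima.util.ProgressImpl;

// Illustrative only: a CR that segments a directory of input files into work items
// and emits small CASes carrying just a reference (the file path) to each work item.
public class WorkItemReader extends CollectionReader_ImplBase {

  private File[] inputFiles;
  private int next = 0;

  @Override
  public void initialize() throws ResourceInitializationException {
    // "InputDirectory" is a hypothetical configuration parameter name.
    String inputDir = (String) getUimaContext().getConfigParameterValue("InputDirectory");
    inputFiles = new File(inputDir).listFiles();
  }

  @Override
  public boolean hasNext() {
    return next < inputFiles.length;
  }

  @Override
  public void getNext(CAS cas) throws IOException, CollectionException {
    // The CAS holds only a reference to the work item, not the data itself;
    // the CAS Multiplier in the Job Process does the actual reading.
    cas.setDocumentText(inputFiles[next++].getAbsolutePath());
  }

  @Override
  public Progress[] getProgress() {
    return new Progress[] { new ProgressImpl(next, inputFiles.length, Progress.ENTITIES) };
  }

  @Override
  public void close() throws IOException {
  }
}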

DUCC wraps the user's CR in a Job Driver, which sends the CASes to a queue feeding one or more instances of a Job Process containing the analysis pipeline. Input data reading, artifact extraction and CAS initialization are implemented by the CAS Multiplier (CM) running in the Job Process. Each artifact CAS is then passed through the Analysis Engine (AE) and CAS Consumer (CC) components.
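
As a companion to the reader sketch above, here is a minimal sketch of a work-item CAS Multiplier of the kind that runs at the front of each Job Process pipeline. For brevity it emits a single artifact CAS per work item; the class name and the path-in-document-text convention are the same illustrative assumptions as above.

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.uima.analysis_component.JCasMultiplier_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.cas.AbstractCas;
import org.apache.uima.jcas.JCas;

// Illustrative only: a CM that turns a work-item reference (a file path carried in
// the incoming CAS) into an artifact CAS ready for the AE and CC components.
public class WorkItemCasMultiplier extends JCasMultiplier_ImplBase {

  private String pendingText;

  @Override
  public void process(JCas workItemCas) throws AnalysisEngineProcessException {
    try {
      // The Job Process, not the Job Driver, reads the actual input data.
      String path = workItemCas.getDocumentText();
      pendingText = new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8);
    } catch (Exception e) {
      throw new AnalysisEngineProcessException(e);
    }
  }

  @Override
  public boolean hasNext() {
    return pendingText != null;
  }

  @Override
  public AbstractCas next() {
    // Initialize a fresh artifact CAS; one per work item here, though a real CM
    // could emit many artifacts from a single work item.
    JCas artifactCas = getEmptyJCas();
    artifactCas.setDocumentText(pendingText);
    pendingText = null;
    return artifactCas;
  }
}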

A DUCC job specification includes the number of pipeline instances to run in each Job Process; each instance runs in a separate thread. During the job DUCC will automatically scale the number of running Job Processes based on the number of work items left to do, the number of threads per Job Process, and the amount of resources available to the job.
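
For illustration, the following sketch writes such a job specification as a Java properties file. The property names are assumptions based on the duccbook and have varied across releases (for example, process_thread_count versus process_pipeline_count), so check the duccbook for your release; the descriptor values refer to the hypothetical components sketched above.

import java.io.FileWriter;
import java.util.Properties;

// Illustrative only: assembling a DUCC job specification as a properties file.
// Property names are assumptions; consult the duccbook for the exact set.
public class WriteJobSpec {
  public static void main(String[] args) throws Exception {
    Properties spec = new Properties();
    spec.setProperty("description", "Sample DUCC collection processing job");
    spec.setProperty("driver_descriptor_CR", "WorkItemReader");          // CR run in the Job Driver
    spec.setProperty("process_descriptor_CM", "WorkItemCasMultiplier");  // CM run in each Job Process
    spec.setProperty("process_descriptor_AE", "MyAnalysisEngine");       // hypothetical AE descriptor
    spec.setProperty("process_descriptor_CC", "MyCasConsumer");          // hypothetical CC descriptor
    spec.setProperty("process_memory_size", "8");       // GB per Job Process; not overcommitted
    spec.setProperty("process_thread_count", "4");      // pipeline instances (threads) per Job Process
    spec.setProperty("scheduling_class", "normal");     // preemptable class, rebalanced by the resource manager
    try (FileWriter out = new FileWriter("myjob.job")) {
      spec.store(out, "DUCC job specification (sketch)");
    }
  }
}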

Jobs are tracked on the Jobs page of the DUCC web server.

DUCC Services

A DUCC service is a process whose life cycle DUCC will manage and/or whose health DUCC will monitor. The service can be configured to be automatically started as soon as resources are available, or it can be started on demand, when DUCC is asked to start a job or another service that has declared a dependency on it. Another configuration parameter specifies the default number of service processes to start. Users can also start or stop services manually, as well as manually change the number of running instances.

There are two types of DUCC services: UIMA-AS and CUSTOM. Any existing UIMA-AS service can use DUCC for service life cycle management. Each DUCC service must have a "service pinger" class that DUCC calls periodically to get the status of the service. A dependent job or service will not be given resources to run until the service pingers indicate that all of the services it depends on are available. DUCC provides a default pinger for UIMA-AS services, which is used if none is specified; CUSTOM services must register a pinger class.

Services are tracked on the Services page of the DUCC web server.

DUCC Arbitrary Processes

DUCC can be used to run an arbitrary process on a DUCC worker node. Resources are allocated according to the memory size and scheduling class requested. The allocated resource is freed when the process terminates.

A command-line script, viaducc, can be used to launch processes on a DUCC worker node. With a symlink named "java-viaducc" pointing at $DUCC_HOME/bin/viaducc, Java commands can be run remotely from the command line. If java-viaducc is put into $JAVA_HOME/bin, Eclipse can be configured to launch processes onto remote machines.

Arbitrary processes are tracked on the Reservations page of the DUCC web server.

Scaling UIMA with DUCC vs Hadoop

DUCC offers a number of potential advantages over Hadoop for many UIMA applications.

Threading
Hadoop mapper processes are intended to run a single analytic thread. DUCC is designed to run multiple UIMA pipelines in a single Job Process, allowing static Java objects to be shared across pipeline threads.

Guaranteed RAM
DUCC's allocation model and use of cgroups guarantee every managed process the full amount of RAM it requested.

Application Interface
The application interfaces for a UIMA application continue to be UIMA-standard components: CollectionReader, CasConsumer, and CasMultiplier. Hadoop requires integrating a new set of interface components.

Collection Processing Errors
If a Hadoop mapper fails to handle any work item in a collection, the entire collection must be reprocessed after fixing the mapper problem; there is no way to make incremental progress. DUCC jobs are designed to preserve previous results where appropriate.

Performance Analysis
For every job DUCC automatically provides the performance breakdown for every UIMA component.

Other Workloads
DUCC has support for managing a wide range of processes, including non-UIMA processes. For example, DUCC could dynamically start a Hadoop instance on a subset of DUCC worker machines.

Debug Support
DUCC offers tight integration with Eclipse debugging. All or part of a UIMA application can be run in the Eclipse debugger by adding a single parameter to the Job submission.

DUCC on Clouds

DUCC has been run in various cloud deployments. The documentation for this is still in progress, but some information is available here.

DUCC - What next?

Go to https://uima.apache.org/d/uima-ducc-current/duccbook.html for the full documentation.

Go to the UIMA Download Page and get the "UIMA DUCC" install package.

The DUCC install package includes sample applications that demonstrate two useful Collection Processing applications.