Memory Referencing Characteristics and Caching Performance of AND-parallel Prolog on Shared-Memory Multiprocessors

M. Hermenegildo
MCC
E. Tick
University of Oregon

Abstract This paper presents the performance analysis results for the RAP-WAM AND-Parallel Prolog architecture on shared-memory multiprocessor organizations. The goal of this parallel model is to provide inference speeds beyond those attainable in sequential systems, while supporting conventional logic programming semantics. Special emphasis is placed on sequential performance, storage efficiency, and low control overhead. First, the concepts and techniques used in the parallel execution model are described, along with the general methodology, benchmarks, and simulation tools used for its evaluation. Results are given both at the memory reference level and at the memory organization level. A two-level shared-memory architecture model is presented together with an analysis of various solutions to the cache coherency problem. Finally, RAP-WAM shared-memory simulation results are presented. It is argued that the RAP-WAM model can exploit coherent caches and attain speeds in excess of 2 MLIPS with current shared-memory multiprocessing technology for real applications that exhibit medium degrees of parallelism.

§1 Introduction

Logic programs offer many opportunities for parallelism. The RAP-WAM execution model is aimed at providing inference speeds to logic programs beyond those attainable in sequential systems through the use of parallelism, while supporting the conventional semantics of logic languages. In particular, the model supports parallel execution of Prolog programs by exploiting Restricted/Independent AND-parallelism. The approach taken in this model is that of combining parallelism and advanced compiler technology. Therefore, in RAP-WAM the techniques used for supporting parallel execution
are efficient extensions of those used in the Warren Abstract Machine (WAM), which have already brought high inferencing speeds to sequential implementations. Special attention is given to the preservation of WAM sequential performance and storage efficiency, and to the use of low overhead mechanisms for controlling parallel execution.

This paper offers an overview of the RAP-WAM model and the type of parallelism exploited in it, and presents simulation results supporting the claims of performance, low overhead, and storage efficiency. High-performance processing elements (PEs) are often limited in execution speed by available memory bandwidth. The bandwidth requirement is an even more critical factor in parallel systems. For this reason, the simulations were performed at the memory reference level. Because the RAP-WAM model (as the WAM) is specified at a level above that of memory organization, simulations were first performed under the ideal assumption of a uniform, single shared-memory with no contention, and some results are presented at this level. Practical memory organizations deviate from this ideal behavior, however, and it is thus important to assess the effect of this deviation if realistic speedup figures are to be obtained. The results of such a study represent the core of this paper. A special emphasis is put on the issue of cache coherency and results from the simulation of two such protocols are presented. These results are of special importance because of the unique referencing characteristics that Prolog exhibits on cache memories.

The memory organization studied is characteristic of many new-generation high-performance shared-memory multiprocessors. The results obtained are thus helpful in estimation of the performance of AND-parallel Prolog/RAP-WAM on such machines. The results can be used to determine the advantages and shortcomings of the parallel implementation of other don't-know non-deterministic logic programming languages. In addition, the memory model and simulation results can also be used as an aid in the design of special-purpose multiprocessors.

Organization of the paper is as follows: Section 2 introduces the types of parallelism currently exploited in the RAP-WAM model and Section 3 gives an overall description of the RAP-WAM architecture. Section 4 summarizes the results obtained from high-level simulations of the architecture. Section 5 then presents the two-level shared-memory organization model, introduces the problems associated with cache coherency maintenance in such a model, and proposes solutions for them. Simulation results for the different coherency protocols proposed are then presented and discussed.

§2 Goal Independence and Restricted AND-parallelism

Of the various sources of parallelism present in Logic Programs AND- and OR-parallelism (or their combination) offer special promise and are currently being considered in several parallel logic programming systems. Efficient techniques for implementing OR-parallelism
have been proposed and are currently under development by various groups, as reported by Warren. AND-parallelism, although offering advantages such as being able to exploit parallelism in determinate programs and inherent efficiency, has until recently been difficult to implement due to the overhead involved in handling shared variable bindings and because of its interaction with "don't know" non-determinism. Consequently, many parallel logic programming systems which exploit this type of parallelism do not implement the conventional "don't know" non-deterministic semantics of logic programs. Instead, these systems implement committed-choice (i.e., "don't care") non-determinism. Our interest, however, is in directly supporting the conventional ("don't know") semantics. Although the use of program transformations has been suggested for implementing a form of "don't know" non-determinism with committed-choice languages, it is argued that a more direct approach (such as the RAP-WAM model considered in this paper or the models described by Warren) has a higher performance potential while guaranteeing that full "don't know" non-determinism will be supported.

An alternative way of dealing with variable binding conflicts which makes support for both AND-parallelism and "don't know" non-determinism amenable to an efficient implementation is that of Independent AND-parallelism. In this form of AND-parallelism, only sets of goals which are determined to be independent (i.e., which do not share any non-ground variables) can be executed in parallel. Goal independence can be postulated through annotations. Goal independence can also be determined statically by the compiler (guided perhaps by some information provided by the user as in Chang), or completely dynamically at run-time as in Conery, or through a combination of the above mentioned techniques.

Restricted AND-parallelism, as presented in DeGroot's seminal work, represents such a combination offering more opportunity for parallelism than static approaches at a lower cost than dynamic ones. Also, this model restricts the types of execution graphs that can be exploited in exchange for implementation efficiency. The RAP-WAM model uses a generalization of these ideas which completes DeGroot's original description (by providing backward execution semantics). The RAP-WAM model also introduces Conditional Graph Expressions (CGEs) that overcome the difficulty (within DeGroot's framework) in expressing 'sufficient' conditions for independence. In this approach, the

*1 It might be noted that, in the opinion of the authors, such programs are also easier to debug than their committed choice translations.

*2 The model is also currently being extended to support OR-parallelism—using techniques similar to those proposed by other researchers, see for example—and their references—and a form of dependent AND-parallelism.

*3 CGEs offer Prolog syntax and permit conjunctive checks, thus lifting limitations in the expressions proposed by DeGroot: given "f(X, Y, Z); g(X, Y), h(Y, Z)." the most natural annotation for this clause, that g and h can run in parallel if the terms in X and Z don't share variables and Y is bound to a ground term, can be expressed easily with CGEs ("f(X, Y, Z); (indep(X, Z), ground(Y) || g(X, Y) & h(Y, Z)).") but is cumbersome with DeGroot's expressions.
parallelism itself can be generated by the user by including annotations (CGEs) in the program, or automatically by the compiler, which performs a global (abstract interpretation-based) analysis of the program that often makes run-time independence checks unnecessary. In addition, module entry mode annotations are permitted that are also used to reduce or eliminate run-time tests and guide the global analysis. As an example consider the following clause from the deriv (symbolic derivation) benchmark:

\[
\text{deriv}(U + V, X, DU + DV) :- \text{deriv}(U, X, DU), \text{deriv}(V, X, DV).
\]

Without any further information from the user or any knowledge about the rest of the program, and if forced to work only locally (at the clause level) the RAP-WAM compiler performs a worst case analysis by simply applying the "independence conditions". Such analysis determines that deriv(U, X, DU) and deriv(V, X, DV) will be independent (and therefore executable in parallel) provided that at run-time two conditions are met. First, the variable X must be instantiated to a ground term. Second, the variables in the two goals (U, V, and DU, DV) must not "share" variables, i.e., they contain only terms that are pairwise independent. The result of this compile-time analysis can be encoded in a CGE and the deriv clause rewritten as shown below.

\[
\text{deriv}(U + V, X, DU + DV) :- (\text{ground}(X), \text{indep}([U, V], [U, DV], [DU, V], [DU, DV])) | \text{deriv}(U, X, DU) & \text{deriv}(V, X, DV)).
\]

If at run-time the ground and indep checks fail, the goals in the body of the clause are executed sequentially. The backward execution semantics (actions taken in case of failure) supports "don't know" non-determinism and is given by Hermenegildo. ground and indep tests can be expensive operations and it is therefore desirable to keep them to a minimum. As mentioned above this is done through a global analysis of the program and/or user annotation. For example, a typical user annotation for deriv may postulate that it will always be called with the first two arguments ground. The checks are then immediately reduced to

\[
\text{deriv}(U + V, X, DU + DV) :- (\text{indep}(DU, DV) | \text{deriv}(U, X, DU) & \text{deriv}(V, X, DV)).
\]

which is often a simple address comparison. Such annotations are, however, seldom necessary: as shown in Reference 44 the compiler can determine input and output modes through abstract interpretation-based global analysis typically for 70-80% of the arguments. For example, given a typical usage of deriv within a larger program the compiler can determine not only the groundness information but also that the third argument will be a variable, so that all checks are eliminated:
deriv(U + V, X, DU + DV):- (true | |
deriv(U, X, DU) & deriv(V, X, DV)).

§3 The RAP-WAM Architecture

The RAP-WAM architecture is an extension of the Warren Abstract Machine (WAM) architecture\(^4\) capable of executing logic programs in parallel as determined by CGE's. A fundamental design objective of RAP-WAM is fast sequential execution for cases where there is no available (AND) parallelism. To this end, the CGE semantics has been integrated naturally into the WAM storage model in the form of specialized stack frames and storage areas that are used during parallel execution. Thus the default (sequential) model is that of a standard WAM exhibiting the same high sequential performance. Special emphasis has also been given to efficiency in the management of parallelism so that most of the WAM performance and storage optimizations are still supported during parallel execution. Figure 1 shows one of the abstract machines of the RAP-WAM architecture mapped onto a single processing element (PE). Each of these abstract machines is similar to a standard WAM (with a complete set of registers and data areas, called a Stack Set), with the addition of a Goal Stack, a Message Buffer, and two new types of stack frames: Parcall Frames and Markers. During an inference, goals that are ready to be executed in parallel are pushed by the abstract machine onto its own Goal Stack. From there the goals can be picked-up by this or other abstract machines for execution. Each entry in the Goal Stack is called a Goal Frame. Parcall Frames are used for coordinating and synchronizing the parallel execution of the goals inside a parallel call, both during forward execution and during backtracking. Markers are used to delimit Stack Sections (horizontal cuts through the Stack Set of a given abstract machine, corresponding to the execution of different goals) and they implement the storage recovery mechanisms during backtracking\(^23\) in a similar manner to choice-points in the WAM. In practice, the stack is divided into separate Control stacks (Choice Point and Markers) and Local stacks (Environments) for reasons of locality and locking. Table 1 summarizes the types of objects allocated in these areas and their locality.

The instruction set of the RAP-WAM architecture includes all WAM instructions and several additional instructions related to parallel execution. Table 2 lists the additional instructions that currently support AND-parallelism and the WAM instructions that are modified in RAP-WAM. Although the check instructions are somewhat particular to the implementation of RAP-WAM, the instruction set is also suitable for other AND-parallel systems. Control instructions are responsible for scheduling and synchronization of parallel goals. Control instructions of the form ...det... are optimized for determinate execution and they provide significant performance advantage when no backtracking is needed. Instructions like pop_foreign_goal, idle, redo, etc.
Fig. 1 Data Areas and Registers for One RAP-WAM Abstract Machine
Table 1 Characteristics of RAP-WAM Storage Objects

<table>
<thead>
<tr>
<th>Frame Type</th>
<th>Location</th>
<th>In WAM?</th>
<th>Need lock?</th>
<th>Locality</th>
</tr>
</thead>
<tbody>
<tr>
<td>Envts./Control</td>
<td>Stack</td>
<td>Yes</td>
<td>No</td>
<td>Local</td>
</tr>
<tr>
<td>Envts./Perm. Vars.</td>
<td>Stack</td>
<td>Yes</td>
<td>No</td>
<td>Global</td>
</tr>
<tr>
<td>Choice Points</td>
<td>Stack</td>
<td>Yes</td>
<td>No</td>
<td>Local</td>
</tr>
<tr>
<td>Heap</td>
<td>Heap</td>
<td>Yes</td>
<td>No</td>
<td>Global</td>
</tr>
<tr>
<td>Trail entries</td>
<td>Trail</td>
<td>Yes</td>
<td>No</td>
<td>Local</td>
</tr>
<tr>
<td>PDL entries</td>
<td>PDL</td>
<td>Yes</td>
<td>No</td>
<td>Local</td>
</tr>
<tr>
<td>Parcall F./Local</td>
<td>Stack</td>
<td>No</td>
<td>No</td>
<td>Local</td>
</tr>
<tr>
<td>Parcall F./Global</td>
<td>Stack</td>
<td>No</td>
<td>Yes</td>
<td>Global</td>
</tr>
<tr>
<td>Parcall F./Wait, Sched</td>
<td>Stack</td>
<td>No</td>
<td>No</td>
<td>Local</td>
</tr>
<tr>
<td>Markers</td>
<td>Stack</td>
<td>No</td>
<td>Yes</td>
<td>Global</td>
</tr>
<tr>
<td>Goal Frames</td>
<td>Goal Stack</td>
<td>No</td>
<td>Yes</td>
<td>Global</td>
</tr>
<tr>
<td>Messages</td>
<td>M. Buffer</td>
<td>No</td>
<td>Yes</td>
<td>Global</td>
</tr>
</tbody>
</table>

Table 2 Parallel Abstract Machine-Specific Instructions

<table>
<thead>
<tr>
<th>Control Instructions</th>
<th>Modified WAM Instructions</th>
<th>Control Instructions (Det. Exec.)</th>
</tr>
</thead>
<tbody>
<tr>
<td>push_call Procedure/Arity, Slot#</td>
<td>proceed</td>
<td>push_det_call Procedure/Arity, Slot#</td>
</tr>
<tr>
<td>allocate_pcall #_of_slots, M</td>
<td>fail</td>
<td>allocate_det_pcall #_of_slots, M</td>
</tr>
<tr>
<td>check_ready Slot#, Label</td>
<td>pop_pending_goal</td>
<td>pop_pending_goal</td>
</tr>
<tr>
<td>pop_pending_goal</td>
<td>cut_merge</td>
<td></td>
</tr>
<tr>
<td>wait_on_siblings</td>
<td>check_me_else Label</td>
<td></td>
</tr>
<tr>
<td></td>
<td>check_ground Vn</td>
<td></td>
</tr>
<tr>
<td></td>
<td>check_independent Vn, Vm</td>
<td></td>
</tr>
</tbody>
</table>

Pseudo-Instructions

- idle
- pop_foreign_goal
- kill
- redo
- unwind

are pseudo-instructions that represent the actions taken upon failure and during distributed scheduling and backtracking. Space limitations here make a complete description of the execution model impossible. The reader is referred to Hermenegildo[21] for a complete description of the RAP-WAM instruction set and storage model.

§4 High-Level Simulation Results

As mentioned in Section 1, high-performance processing elements (PEs) in parallel systems can be severely limited in execution speed by available memory bandwidth. This is often a result of the gap in both technology and cost between PEs and interconnection networks, in addition to physical packaging/ interface limitations. Because bandwidth can play a key role in accessing system performance, the simulations presented were performed at the memory reference level. Although the evaluation of the implementation of the model could be done on an existing shared-memory machine, this approach only provides a single data point corresponding to a particular organization. Also, some statistics are difficult to gather from an actual implementation. However, simulation, the approach taken in this evaluation, provides data over a wide range of
architectural and organizational parameters. The RAP-WAM architecture (as is its relative, the WAM) is a model described at a level above *that of memory organization design*. A series of measurement tools have been built in order to evaluate the potential performance of the execution model and the associated architectural tradeoffs at this level\(^\text{21}\). The present configuration of these tools is shown in Figure 2. Application programs are translated into parallel RAP-WAM code and then fed into the RAP-WAM emulator together with emulation parameters such as the number of PEs, scheduling strategy used, and storage area sizes. The emulator then executes the program generating instrumentation data such as instruction and pseudo-instruction frequencies, number of references to the different data areas, ratios of local vs. remote references for each type of object, maximum amount of storage used in each area, estimated timings, and speedups. At this level the measurements are made under the assumption of an idealized memory model with a uniform address space and no contention. The measurements are thereby made independent of the particular architectural organization on which the model is implemented. In this section results are presented that correspond to the execution of the following set of benchmarks:

- **Symbolic derivation (deriv)**: this is a version of the symbolic derivation program in Warren's thesis\(^\text{40}\). It finds the symbolic derivative of a given arithmetic expression.
- **Takeuchi (tak)**: this program computes Takeuchi's function as described in Gabriel\(^\text{16}\).
- **Quick-sort (qsort)**: the quick-sort algorithm written using difference lists
Table 3  Statistics for the Benchmarks (on 8 Processors)

<table>
<thead>
<tr>
<th>Parameter</th>
<th>deriv</th>
<th>tak</th>
<th>qsort</th>
<th>matrix</th>
</tr>
</thead>
<tbody>
<tr>
<td>Instructions executed</td>
<td>33520</td>
<td>75254</td>
<td>237884</td>
<td>95349</td>
</tr>
<tr>
<td>Failures</td>
<td>2010</td>
<td>1186</td>
<td>10368</td>
<td>0</td>
</tr>
<tr>
<td>Unifications</td>
<td>1232</td>
<td>3560</td>
<td>27804</td>
<td>258</td>
</tr>
<tr>
<td>Bindings</td>
<td>784</td>
<td>7117</td>
<td>26203</td>
<td>8321</td>
</tr>
<tr>
<td>References (-locks)</td>
<td>85477</td>
<td>178967</td>
<td>502717</td>
<td>96013</td>
</tr>
<tr>
<td>References (WAM)</td>
<td>82519</td>
<td>169599</td>
<td>499526</td>
<td>95357</td>
</tr>
<tr>
<td>Overhead (refs.)</td>
<td>2958</td>
<td>10268</td>
<td>3191</td>
<td>656</td>
</tr>
<tr>
<td>Overhead (%)</td>
<td>3.58</td>
<td>6.05</td>
<td>0.63</td>
<td>0.68</td>
</tr>
<tr>
<td>Goals in parallel</td>
<td>97</td>
<td>263</td>
<td>97</td>
<td>24</td>
</tr>
<tr>
<td>Idle-time (%)</td>
<td>13.58</td>
<td>7.79</td>
<td>29.60</td>
<td>9.72</td>
</tr>
</tbody>
</table>

as given in Clocksin and Mellish\(^8\).

- Matrix multiplication (**matrix**): a naive matrix multiplication program where matrices are represented as lists of lists.

Each benchmark was executed on relatively large input data. Table 3 shows some statistics regarding the benchmarks used, running on eight PEs. Some observations regarding this table are appropriate: the table lists the number of instructions, failures, unifications, variable bindings, and memory references for each benchmark. The number of references is seen to be sufficient for vigorously exercising the memory simulators that will be described in the next section. For comparison, the number of references generated by a sequential WAM running the benchmarks using the Van-Roy compiler\(^31\) is also listed. The difference between these two numbers represents the overhead incurred due to control of parallelism (for eight PEs). The overhead running sequentially on one RAP-WAM PE is typically 1-3%, i.e., only 1-3% more references are generated by the parallel versions of the benchmarks running on one RAP-WAM PE than by the sequential versions running on WAM. This low overhead is attained by supporting all the WAM performance optimizations within RAP-WAM sequential and parallel execution: customization of unification, clause indexing, hashing on arguments, efficient management of backtracking, storage recovery on backtracking and of local storage, etc.

The number of references listed in Table 3 does not include references generated by spin locks, since these references are expected to be absorbed by the caching mechanism, as confirmed in the results presented in the next sections. The number of goals actually executed in parallel and the percentage of time that processors are idle (because there is no work available in the system) are also listed. These benchmarks, and the input data used, were chosen for several reasons: first, because of their small granularity (except for matrix) their execution offers a worst-case type of analysis with respect to parallelism management overhead. They also offer reasonable degrees of parallelism so that the parallel portion of the abstract machine is exercised. Finally, as will be shown in Section 5.2, the benchmarks' sequential memory referencing behavior and characteristics resemble those of much larger Prolog programs, such as the ones studied by Tick\(^35\).
Table 4 offers a summary of the memory referencing behavior of RAP-WAM running the *deriv* benchmark on eight PEs. Rows classify references according to the area of the stack set and type of frame being accessed. These include choice points (in the control stack), environments (in the local stack), heap objects (in the heap), trail entries (in the trail), parcall frames (in the local stack), markers (in the control stack), and goal frames (in the goal stack). Areas not referenced in this benchmark (such as the message buffer, only used during deep, non-deterministic backtracking between parallel goals) are not listed. References to environments are further divided into references to permanent variables, and references to the control parts of the environment. References to parcall frames are also further divided into references to the control part of the frame and references to the locks used for synchronization. The Totals line adds up references from the different storage areas.

Columns in Table 4 classify references into reads and writes, and totals (reads + writes). Within each of these categories, four numbers are listed: the number of references in absolute and the percentage they represent of the total number of references, and the number of these references that are remote (i.e., access the address space—stack set—of another PE) and their percentage. The last row represents the number of references that could not be determined statically (at compile-time) to be local and had to be labeled as "potentially remote". This labeling is used in the hybrid cache coherency algorithms studied in the next sections.

It is interesting to note that 65% of the total number of references could be determined to be local at compile-time. However, of the remaining 35% potentially remote references, only 7.3% actually ended up accessing data within other PEs' spaces. Note also that most remote references correspond to accesses to the heap and are mostly reads, due to structure copying. The referencing characteristics at this level reflect those observed by Tick for large Prolog programs during sequential WAM execution. In fact, RAP-WAM inherits both advantageous and disadvantageous characteristics, such as the larger than normal proportion of writes to reads (around 1:1) and the large number of

<table>
<thead>
<tr>
<th>Frame</th>
<th>Reads</th>
<th>%</th>
<th>rem</th>
<th>%</th>
<th>Writes</th>
<th>%</th>
<th>rem</th>
<th>%</th>
<th>R+W</th>
<th>%</th>
<th>rem</th>
<th>%</th>
</tr>
</thead>
<tbody>
<tr>
<td>choice-p</td>
<td>19062</td>
<td>20.8</td>
<td>0</td>
<td>0</td>
<td>19566</td>
<td>21.3</td>
<td>0</td>
<td>0</td>
<td>38628</td>
<td>42.1</td>
<td>0</td>
<td>0.0</td>
</tr>
<tr>
<td>env-cont</td>
<td>5778</td>
<td>6.3</td>
<td>0</td>
<td>0</td>
<td>12475</td>
<td>13.6</td>
<td>0</td>
<td>0</td>
<td>18253</td>
<td>19.9</td>
<td>0</td>
<td>0.0</td>
</tr>
<tr>
<td>env-pvar</td>
<td>4748</td>
<td>5.2</td>
<td>0</td>
<td>0</td>
<td>3391</td>
<td>3.7</td>
<td>0</td>
<td>0</td>
<td>8139</td>
<td>8.9</td>
<td>0</td>
<td>0.0</td>
</tr>
<tr>
<td>heap</td>
<td>5724</td>
<td>6.2</td>
<td>5230</td>
<td>5.7</td>
<td>10182</td>
<td>11.1</td>
<td>97</td>
<td>0.1</td>
<td>13906</td>
<td>17.3</td>
<td>5327</td>
<td>5.8</td>
</tr>
<tr>
<td>trail</td>
<td>0</td>
<td>0.0</td>
<td>0</td>
<td>0</td>
<td>1554</td>
<td>1.7</td>
<td>0</td>
<td>0</td>
<td>1554</td>
<td>1.7</td>
<td>0</td>
<td>0.0</td>
</tr>
<tr>
<td>pf-contr</td>
<td>262</td>
<td>0.3</td>
<td>0</td>
<td>0</td>
<td>337</td>
<td>0.4</td>
<td>97</td>
<td>0.1</td>
<td>599</td>
<td>0.6</td>
<td>97</td>
<td>0.1</td>
</tr>
<tr>
<td>pf-locks</td>
<td>5697</td>
<td>6.2</td>
<td>291</td>
<td>0.3</td>
<td>558</td>
<td>0.6</td>
<td>291</td>
<td>0.3</td>
<td>6255</td>
<td>6.8</td>
<td>582</td>
<td>0.6</td>
</tr>
<tr>
<td>markers</td>
<td>194</td>
<td>0.2</td>
<td>0</td>
<td>0</td>
<td>776</td>
<td>0.8</td>
<td>0</td>
<td>0</td>
<td>970</td>
<td>1.1</td>
<td>0</td>
<td>0.0</td>
</tr>
<tr>
<td>goal stk</td>
<td>714</td>
<td>0.8</td>
<td>679</td>
<td>0.7</td>
<td>714</td>
<td>0.8</td>
<td>0</td>
<td>0</td>
<td>1428</td>
<td>1.6</td>
<td>679</td>
<td>0.7</td>
</tr>
<tr>
<td>Totals</td>
<td>42179</td>
<td>46.0</td>
<td>6200</td>
<td>6.8</td>
<td>49553</td>
<td>54.0</td>
<td>485</td>
<td>0.5</td>
<td>91732</td>
<td>100</td>
<td>6685</td>
<td>7.3</td>
</tr>
<tr>
<td>Pot. rem</td>
<td>16985</td>
<td>18.5</td>
<td>6200</td>
<td>6.8</td>
<td>14942</td>
<td>16.3</td>
<td>485</td>
<td>0.5</td>
<td>31927</td>
<td>34.8</td>
<td>6685</td>
<td>7.3</td>
</tr>
</tbody>
</table>

Table 4 ofers a summary of the memory refrencing behavior of RAP-WAM running the *deriv* benchmark on eight PEs. Rows classify references according to the area of the stack set and type of frame being accessed. These include choice points (in the control stack), environments (in the local stack), heap objects (in the heap), trail entries (in the trail), parcall frames (in the local stack), markers (in the control stack), and goal frames (in the goal stack). Areas not referenced in this benchmark (such as the message buffer, only used during deep, non-deterministic backtracking between parallel goals) are not listed. References to environments are further divided into references to permanent variables, and references to the control parts of the environment. References to parcall frames are also further divided into references to the control part of the frame and references to the locks used for synchronization. The Totals line adds up references from the different storage areas.

Columns in Table 4 classify references into reads and writes, and totals (reads + writes). Within each of these categories, four numbers are listed: the number of references in absolute and the percentage they represent of the total number of references, and the number of these references that are remote (i.e., access the address space—stack set—of another PE) and their percentage. The last row represents the number of references that could not be determined statically (at compile-time) to be local and had to be labeled as "potentially remote". This labeling is used in the hybrid cache coherency algorithms studied in the next sections.

It is interesting to note that 65% of the total number of references could be determined to be local at compile-time. However, of the remaining 35% potentially remote references, only 7.3% actually ended up accessing data within other PEs' spaces. Note also that most remote references correspond to accesses to the heap and are mostly reads, due to structure copying. The referencing characteristics at this level reflect those observed by Tick for large Prolog programs during sequential WAM execution. In fact, RAP-WAM inherits both advantageous and disadvantageous characteristics, such as the larger than normal proportion of writes to reads (around 1:1) and the large number of
Table 5  Referencing Characteristics (Summary), on 8 Precessors

<table>
<thead>
<tr>
<th>Frame</th>
<th>Reads</th>
<th>%</th>
<th>rem</th>
<th>%</th>
<th>Writes</th>
<th>%</th>
<th>rem</th>
<th>%</th>
<th>R+W</th>
<th>%</th>
<th>rem</th>
<th>%</th>
</tr>
</thead>
<tbody>
<tr>
<td>&quot;tak&quot;</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Totals</td>
<td>70790</td>
<td>39.1</td>
<td>3948</td>
<td>2.2</td>
<td>110422</td>
<td>60.9</td>
<td>1315</td>
<td>0.7</td>
<td>181212</td>
<td>100.0</td>
<td>5263</td>
<td>2.9</td>
</tr>
<tr>
<td>Pot. rem</td>
<td>27638</td>
<td>15.3</td>
<td>3948</td>
<td>2.2</td>
<td>23177</td>
<td>12.8</td>
<td>1315</td>
<td>0.7</td>
<td>50815</td>
<td>28.0</td>
<td>5263</td>
<td>2.9</td>
</tr>
<tr>
<td>&quot;sort&quot;</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Totals</td>
<td>210110</td>
<td>41.7</td>
<td>11862</td>
<td>2.4</td>
<td>293480</td>
<td>58.3</td>
<td>485</td>
<td>0.1</td>
<td>503590</td>
<td>100.0</td>
<td>12347</td>
<td>2.5</td>
</tr>
<tr>
<td>Pot. rem</td>
<td>66750</td>
<td>13.3</td>
<td>11862</td>
<td>2.4</td>
<td>91134</td>
<td>18.1</td>
<td>485</td>
<td>0.1</td>
<td>157884</td>
<td>31.4</td>
<td>12347</td>
<td>2.5</td>
</tr>
<tr>
<td>&quot;matrix&quot;</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Totals</td>
<td>50060</td>
<td>52.0</td>
<td>16608</td>
<td>17.3</td>
<td>46143</td>
<td>48.0</td>
<td>120</td>
<td>0.1</td>
<td>96203</td>
<td>100.0</td>
<td>16728</td>
<td>17.4</td>
</tr>
<tr>
<td>Pot. rem</td>
<td>37587</td>
<td>39.1</td>
<td>16608</td>
<td>17.3</td>
<td>29444</td>
<td>30.6</td>
<td>120</td>
<td>0.1</td>
<td>67031</td>
<td>69.7</td>
<td>16728</td>
<td>17.4</td>
</tr>
</tbody>
</table>

references to choice points. However, the same compilation techniques that hold promise for alleviating these defects in the WAM (e.g., \(28,36,2,32\)) can similarly be used with RAP-WAM.

Table 5 offers a summary of the data contained in Table 4 for the other benchmarks further supporting the observations made above. Note the unusually large number of potentially shared (70%) and actually shared (17.4%) references in matrix. This is due to the extensive use and sharing of the heap, because of the implementation of the arrays as lists of lists. Other conclusions regarding the RAP-WAM model based on these and other results from the memory reference level simulations performed are summarized as follows:

- Again, the overhead in the RAP-WAM model due to the management of parallelism is low. For example, overhead has been observed to be less than 15% for up to 40 processors even for fine granularity (i.e., high overhead) cases such as that of the deriv benchmark, as shown in Figure 3. In this figure, work time represents the number of cycles spent by all PEs doing actual processing (i.e., not waiting or idle). Overhead, the

---

![Fig. 3 RAP-WAM Overheads for deriv](image-url)
difference between the work done by RAP-WAM and the work done by WAM, is the distance between the work curve and a horizontal line corresponding to WAM work, represented by the "uniprocessor" line in Figure 3. All data in this figure is presented as percentages of WAM work (i.e., 100% = the number of cycles necessary to execute the sequential version of benchmark on the WAM). Note that the RAP-WAM work on one PE is within 2% of the work done by WAM. Therefore, it appears that the RAP-WAM model can achieve actual speedup (i.e., significantly faster execution than a high-performance sequential implementation—WAM—for similar performance PEs) even if the application exhibits only low levels of parallelism. This is because the RAP-WAM offers sequential speed essentially equivalent to the WAM and low overhead in the management of parallelism. Speedup measurements are presented in Figure 4 for the deriv benchmark. The main limitation to speedup in this example is the parallelism available in the benchmark, dependent on input data size (c.f., the "idle" curve in Figure 3). Speedups are attainable under the assumption, of course, that the memory organization can handle the associated traffic. This issue is treated in the following sections.

The stack-based memory management approach appears to be very efficient. Its implementation is, on the other hand, more complicated than that of other solutions, such as a simple heap. However, it is well worth the effort: it extends to a parallel execution environment the storage management optimizations of the WAM, i.e., recovery of storage on backtracking, recovery of local storage upon exit from a procedure, last call optimization, etc. This means that the storage requirements for the RAP-WAM system running in parallel are essentially the same as those in the sequential version.
of the WAM, and that, as in the WAM, the invocation of garbage collection is minimized.

Although these results are encouraging, further evaluation taking into consideration the characteristics of a particular memory organization (such as available bandwidths, cache coherence maintenance overhead, etc.) on the performance of the model is necessary in order to guarantee actual speedup. This subject is addressed in the next sections.

§5 The Two-Level Shared-Memory Architecture Model

In contrast with the idealized memory organization used in the high-level simulations presented previously, practical shared-memory systems generally present a two-level structure where a local cache memory is located between each PE and the system bus, as shown in Figure 5. Such a hierarchical organization serves two main purposes. First, it allows faster execution because of the generally lower effective memory access time seen by a PE, in much the same way as in a sequential system. Second, it absorbs a (hopefully) significant part of the traffic to main memory that needs to go through the system bus. In shared-memory multiprocessors the latter is particularly important because the system bus is often the most significant bottleneck in the system. The former is essential in obtaining performance that is competitive with that of sequential systems. The locality of Prolog/WAM was shown by Tick\(^{35}\). In the next sections it is shown that Prolog/RAP-WAM also offers sufficient locality to take advantage of cache memories.

![Diagram](image)

**Fig. 5** The Two-Level Shared-Memory Architecture Model
5.1 Cache Coherency

Many of the local memory designs used in conventional or special-purpose sequential machines for the implementation of logic programs (e.g., PSI-II) cannot be used directly in a shared-memory Multiprocessor because of coherency problems. Coherent caches ensure that all the PEs in the system have a consistent view of the storage model. Recall that the RAP-WAM architecture ensures that processes running in parallel will not attempt to make conflicting bindings. Thus cache coherency during process execution is not required; however, prior to a process spawn or upon process completion, cache coherency is required at least for the correct management of process control structures, such as the Parcall Frames. It appears, however, that ensuring coherency continually is easier than enforcing coherency only at specific points (and has the additional benefit of generality). Therefore, traditional coherent caches are considered in this study.

Historically, the first coherent caches, proposed by Gibson, used a write-through strategy, where all write references were issued to both the local cache and shared memory. In addition, remote copies of a write target, i.e., copies residing on a cache other than the cache issuing the write request, were invalidated. This coherency protocol is inexpensive in terms of hardware, but offers low performance because of excessive traffic on the system bus.

Recently, a family of fully distributed broadcast cache protocols have been proposed and built. Broadcast protocols are based on the ability of the cache organization to modify all copies of a cached item in all caches which share this item in a single bus cycle. Information is maintained for each cache block as to whether it is private or shared, making it possible to avoid coherency overheads for private blocks. In addition, it is also possible to issue write references to the local cache only (and the caches which also contain this item) and not to shared memory. Shared memory may be independently updated during replacement of a modified cache block (write-back), for instance, and need not be consistent at all times with the local caches. The broadcasting mechanism guarantees, however, that local caches will still be consistent at all times among themselves. Different designs differ essentially in the treatment of a write to a possibly shared block. A write-through broadcast strategy updates remote copies and possibly shared memory. A write-in broadcast strategy invalidates remote copies. Descriptions and measurements of the relative performance of various broadcast protocol attributes for conventional architectures are given in Archibald.

The broadcast protocol offers high performance at the expense of additional hardware. With the objective of reducing this expense by exploiting attributes of the RAP-WAM architecture, a (firmware) controlled hybrid cache

* Note that simple buffers which hold only local data, such as a choice point buffer, do not suffer from this coherency problem and can be included without modification in a parallel system.
protocol was developed. This scheme attempts to combine the efficiency of broadcast caches with the simplicity and low cost of a traditional write-through cache using information provided by the PE (in the form of tags, derived from the information in Table 1) as to the locality characteristics of each reference. The protocol is referred to as “hybrid” because based on these tags potentially shared (global) data is written-through and local data is copied back*. An underlying tenet of the hybrid protocol is to avoid some of the complexity of broadcast caches by keeping shared memory consistent with local memory. The cost associated with this simplification is the traffic required to write-through to memory the write requests marked as global that are not actually shared. The gain with respect to the traditional write-through approach is that data marked as local is not written-through.

5.2 Simulations

In order to compare the performance of the various types of caches presented above, the RAP-WAM emulator was modified to generate a trace file of memory references. These references are marked with a PE identifier, a tag describing the particular storage area and object being accessed, and a read/write flag. All of the coherent cache models are simulated with the same parameterized multiprocessor cache simulator which can be reconfigured to support the various consistency protocols. The simulator takes the memory trace file as input (see Figure 2) and processes trace records sequentially, using the cache corresponding to the record’s PE identifier. Each private cache contains a number of blocks of a given size (in words). Caches are modeled as fully associative memories with perfect LRU replacement. Cache consistency is maintained for each reference. The simulator models a system with no cache-to-cache transfer capability. Therefore in the broadcast models, if the most up-to-date version of a miss target is held by a remote cache, the line is first copied back to memory, and then transferred to the requesting cache. Note that this is a pessimistic assumption since many systems can perform these two actions simultaneously. On the other hand, the simulator makes the optimistic assumption that no unnecessary invalidations/write-throughs are generated for lines that are assumed to be shared but all remote copies have been replaced.

The results presented in this section were obtained with execution traces for the four benchmarks listed in Section 4. Each benchmark was executed on large input data in order to generate address traces of sufficient size to fully exercise all caches in the multiprocessor model (a “cold start” was measured). It is hypothesized that the conclusions drawn from this study can be extrapolated to RAP-WAM Prolog programs in general. At the very least, the experiments allow comparisons between alternative coherent memory designs. Evidence supporting the above hypothesis is offered below. First, because the benchmarks

*1 A similar capability, of both write-through and copy-back, is supported by the Fairchild Clipper microprocessor.'
### Table 6  Fit of Small Benchmarks to Large Benchmarks (Traffic Ratio)

<table>
<thead>
<tr>
<th>cache size (words)</th>
<th>$E_{tr}$</th>
<th>$\sigma_{tr}$</th>
<th>deriv</th>
<th>tak</th>
<th>qsort</th>
<th>(matrix)</th>
<th>mean</th>
</tr>
</thead>
<tbody>
<tr>
<td>512</td>
<td>0.164</td>
<td>0.0626</td>
<td>1.1</td>
<td>-1.9</td>
<td>0.83</td>
<td>(16.0)</td>
<td>1.3</td>
</tr>
<tr>
<td>1024</td>
<td>0.108</td>
<td>0.0569</td>
<td>2.0</td>
<td>-1.1</td>
<td>1.6</td>
<td>(5.2)</td>
<td>1.6</td>
</tr>
</tbody>
</table>

**Fig. 6**  Traffic of Coherency Schemes
measured have small granularity, the study represents a worst case analysis of parallelism management overhead. Second, the benchmarks' sequential memory referencing behavior resembles that of much larger Prolog programs. Third, the *locality* characteristics of the benchmarks also resemble those of the larger Prolog benchmarks. Table 6 gives the mean and standard deviation of the traffic ratios of large WAM benchmarks for four-word-block copyback data caches. Also shown are the errors, in number of standard deviations, of the small benchmarks studied here. The fit is quite good, except for *matrix*, which performs poorly on a uniprocessor (Prolog has no arrays), but which performs quite well on a multiprocessor because of data partitioning. The average does not include *matrix*. This fit ensures that the benchmarks exercise the sequential storage model (the foundation of the RAP-WAM storage model) in a reasonable, typical way.

Figure 6 shows the mean traffic ratios (averaged over the four benchmarks) of the write-in broadcast, hybrid, and conventional write-through cache protocols, using four word lines. Caches of sizes 64, 128, and 256 words were
simulated with no-write-allocate (a write miss does not fetch the corresponding block to cache). Caches of sizes 512 and 1024 words were simulated with write-allocate, except for hybrid caches which used no-write-allocate for 512 words. These selections were made on the basis of the policy which produced the lowest traffic. A clear result of the simulations is that no-write-allocate is best for small caches; however, miss ratio increases with no-write-allocate. Another result is that a more efficient replacement policy (e.g., copyback) produced lower traffic with write-allocate than a less efficient policy (e.g., hybrid) for the same cache size.

The write-through broadcast cache statistics (not shown in Figure 6) are almost identical to those of the write-in broadcast cache, an indication that communication traffic in RAP-WAM is low. For each type of cache a family of curves is represented, corresponding to different numbers of PEs. Each curve plots traffic ratio as a function of total local memory size (i.e., the sum of the individual PE cache sizes).

A result seen from the curves is that the hybrid cache does quite well in reducing traffic, almost to the level of the copyback cache. The copyback cache does exceedingly well for 1024 word caches, and this trend is expected to continue with larger sizes, because the hybrid caches have already bottomed-out.

The idiosyncrasies in the curves are due to the effects of averaging the benchmarks. In addition, the advantageous effect (that of reducing memory traffic) of partitioning an algorithm's working set across several caches is seen to sometimes outweigh the increase in communication overheads. Figure 7 shows separate curves for each of the benchmarks for hybrid and copy-back coherent caches (for eight PEs).

5.3 Discussion

The hierarchical memory organization serves the dual purpose of lowering the effective memory access time and reducing the memory bandwidth requirement of a PE. According to the results of the simulations presented in the previous section, the hybrid cache generates an amount of traffic between that generated by the broadcast and conventional write-through caches. The broadcast schemes retain a (sometimes slight) advantage throughout the range of caches simulated.

It should be noted that these results measure performance only in terms of traffic ratio. For example, the simulation data shows that eight PEs with write-in broadcast caches (of 128 words or greater) generate a traffic ratio of less than 0.3 (the hybrid cache is also close to this performance). This means that more than 70% of the traffic generated by the processors is captured in the local memories and will not appear on the bus. In order to accurately estimate the actual performance of a multiprocessor the time penalty to access shared memory due to contention must also be taken into account. This analysis can be done statistically with a queueing model. Although beyond the scope of this
paper such a model for RAP-WAM execution is presented in Tick\textsuperscript{35}.

It is of obvious interest, if only to stimulate further research, to speculate about the potential performance levels attainable given the results presented in the previous sections. Even current low- to medium-cost shared-memory systems offer high PE to memory bandwidths by implementing multiple or overlapped busses and interleaved memories. This makes it reasonable to predict that speeds in the order of two million application\textsuperscript{1} inferences per second are possible on shared-memory multiprocessors built using current technology\textsuperscript{2}. A "back of the envelope" calculation, in order to justify this claim and based on the results obtained from the present and previous studies can be made as follows. Studies of large Prolog benchmarks show that in the average 15 (WAM or RAP-WAM) instructions are executed per logical inference (LI) and that each instruction averages 3 (word) references. This represents 45 words/LI, or 180 bytes/LI for a 32 bit word size. Therefore, a system executing at a speed of 2 MLIPS would require a cumulative memory bandwidth of 360 Mbytes/sec. If the caches are able to capture 70\% of this traffic, about 110 Mbytes/sec have to be delivered by the bus/memory system, a performance which is achievable with current technology (and expected of the upcoming generation of shared-memory multiprocessors)\textsuperscript{3}.

§6 Conclusions

The paper presents memory referencing characteristics of a parallel logic programming architecture, RAP-WAM, based on Independent/Restricted AND-parallel execution of Prolog. We describe its behavior and potential performance first for an idealized memory model and then for shared-memory multiprocessor organizations. The goal of the RAP-WAM model is to provide inferencing speeds to logic programs beyond those attainable in sequential systems through the use of parallelism, while preserving the conventional semantics of logic languages. The main objectives in developing such a multiprocessor architecture are to ensure high performance single PE execution, storage efficiency, and minimal communication and synchronization overheads.

The measurements presented here indicate that RAP-WAM is well-suited for high performance execution on tightly-coupled shared-memory multiprocessors, from cost-effective small-scale systems to higher performance medium-sized systems. It is argued that speeds of two million application inferences per second are possible with currently available technology for applications which exhibit medium degrees of parallelism. It has been shown that the architecture offers

\textsuperscript{1} Application inferences refer to inference steps of the average size found in large Prolog programs, i.e., on the order of 15 WAM instructions\textsuperscript{35}. This results in much lower but more realistic figures than those obtained using the conventional "LIPS" measurement based on "naive reverse."

\textsuperscript{2} Note that the Japanese FGCS Project is also predicting similar inferencing speeds for the PIM\textsuperscript{19}.

\textsuperscript{3} These conclusions are in disagreement with a related study by Fagin\textsuperscript{15} and his contention that Prolog programs cannot effectively make use of multiprocessing. Our results do agree, however, with those of Lin\textsuperscript{26}. 
high memory referencing locality so that it can take advantage of two-level memory organizations. The memory referencing characteristics presented included comparison of cache coherency protocols, indicating how well alternative designs perform under the expected communication loads. The "broadcast" and "hybrid" protocols are shown to offer superior performance to write-through mechanisms, present in some multiprocessors.

Because the memory organizations studied are characteristic of many current and next-generation multiprocessors, it is argued that the results obtained are relevant to the estimation of the performance of AND-parallel Prolog/RAP-WAM on these multiprocessors. The results also help determine the advantages and shortcomings of the parallel implementation of other don't-know non-deterministic logic programming languages. In addition, the memory model and simulation results can also be used as an aid in the design of small to medium-sized special-purpose multiprocessors. Although the goal of small to medium systems may seem rather unambitious, it is important to have evidence of actual speedups at these levels before attempting the design of large-scale systems.

§7 Acknowledgements

The authors would like to thank Richard Warren for his comments on earlier drafts of this paper and the Microelectronics and Computer Technology Corporation for its support. Evan Tick was affiliated with Stanford University during this research and supported by an IBM Graduate Fellowship and by NASA-Ames under consortium NCA2-1R745-406.

References

7) Ciepielewski, A. and Haridi, S., Control of Activities in the OR-Parallel Token


