Exploring the interfaces for User Defined Aggregates

image courtesy of wikipedia

Whilst I was working on the functional specification for the LISTAGG extensions that we implemented in 12c Release 2, I came across Tom Kyte’s stragg function which uses the User Defined Aggregate API introduced in database 9i. Tom’s comprehensive answer covers the two important areas that need to considered when using the data cartridge API: 1) a UDA can run as serial process and 2) a UDA can run as a parallel process. Therefore, you need to code for both these eventualities. Dealing with both scenarios can be a little challenging - as I have discovered over the last few weeks. having looked at a number of posts there is a common theme for explaining how the various interfaces for user defined aggregate actually work. One the clearest examples is on Tim Hall’s blog: String Aggregation Techniques.

This got me thinking…would it be possible to take the new extensions we made to LISTAGG and incorporate them into custom string manipulation function built using the UDA interfaces? Essentially providing a pre-12.2 solution to prevent the text-string overflow error “ORA-01489: result of string concatenation is too long”? After some initial coding I managed to get a solution that worked perfectly as long as I did not try and run the query using parallel execution. Eventually I managed to get the parallel execution process coded but it was not deterministic and the results differed from the results from the serial query.

After revisiting Tom Kyte’s stragg function solution I think I have finally created a fully working solution and here is what I have learned along the way…

Overview

Before the going into the code, let’s explore the data cartridge interface for creating user defined aggregates. As far as I can tell this is not a very well known or well used feature in that I have not seen any presentations at user conferences that explain why and how to build these functions. It is a very interesting feature because User-defined aggregate functions can be used in SQL DML statements just like Oracle’s built-in aggregates. Most importantly they allow you to work with and manipulate complex data types such as multimedia data stored using object types, opaque types, and LOBs.

Each user-defined aggregate function is made up of three mandatory and three optional ODCIAggregate interfaces, or steps, to define internal operations that any aggregate function performs. The four mandatory interfaces are: initialization, iteration, merging, and termination.

Initialization is accomplished by the ODCIAggregateInitialize() routine, which is invoked by Oracle to initialize the computation of the user-defined aggregate. The initialized aggregation context is passed back to Oracle as an object type instance.
Iteration is performed through the ODCIAggregateIterate() routine, which is repeatedly invoked by Oracle. On each invocation, a new value or a set of new values and the current aggregation context are passed in. The routine processes the new values and returns the updated aggregation context. This routine is invoked for every non-NULL value in the underlying group. NULL values are ignored during aggregation and are not passed to the routine.
Merging is an optional step and is performed by ODCIAggregateMerge(), a routine invoked by Oracle to combine two aggregation contexts. This routine takes the two contexts as inputs, combines them, and returns a single aggregation context.
Termination takes place when the ODCIAggregateTerminate() routine is invoked by Oracle as the final step of aggregation. The routine takes the aggregation context as input and returns the resulting aggregate value.

This is how these four main functions fit together…

The most important observations I would make are:

You need to think very carefully about where and how you want to process your data.
There are essentially two options: 1) during the Iterate stage and/or 2) during the Terminate stage. Of course you need to remember that code used in the Iterate stage needs to replicated at the Merge stage.
It’s tempting but don’t ignore the Merge stagebut if you do then when the function is run in parallel you won’t see any results!

What’s missing….

In the code sample below you will notice that I am missing two interfaces: ODCIAggregateDelete() and ODCIAggregateWrapContext(). Both functions are optional and for the purposes of creating a replacement for the LISTAGG function, these two functions were not needed. But for the sake of completeness below is a brief description of each function:

ODCIAggregateDelete() removes an input value from the current group. The routine is invoked by Oracle by passing in the aggregation context and the value of the input to be removed. It processes the input value, updates the aggregation context, and returns the context. This is an optional routine and is implemented as a member method.

ODCIAggregateWrapContext() integrates all external pieces of the current aggregation context to make the context self-contained. Invoked by Oracle if the user-defined aggregate has been declared to have external context and is transmitting partial aggregates from slave processes. This is an optional routine and is implemented as a member method.

I have searched the internet for examples of when and how to use these two optional functions and there is not much out there. One example was posted by Gary Myers http://blog.sydoracle.com/2005/09/analytics-with-order-by-and-distinct.html which is another derivation of the STRAGG function. In Gary’s example the objective is to return rows until 5 distinct values of a specific column had been returned. I think it is possible to do this without resorting to implementing this requirement within the ODCIAggregateTerminate function but I will leave you to under that one!

Schema - Sales History

For this simple LISTAGG alternative I am using the sample sales history schema and I am going to create a concatenated list of the first name and last name of each customer within each sales region. To make the code examples a little easier to read I have created a view over the CUSTOMERS and COUNTRIES table:

CREATE OR REPLACE VIEW CG_LIST AS
SELECT
 g.country_region_id,
 c.cust_first_name,
 c.cust_last_name
FROM countries g, customers c
WHERE g.country_id = c.country_id
ORDER BY g.country_region_id;

If we try to run the following SQL query we will get the usual LISTAGG overflow error:

SELECT 
 country_region_id,
 LISTAGG(cust_first_name||' '||cust_last_name) WITHIN GROUP (ORDER BY country_region_id) AS "customers"
FROM MY_LIST
GROUP BY country_region_id 
ORDER BY country_region_id;

ORA-01489: result of string concatenation is too long
01489. 00000 - "result of string concatenation is too long"
*Cause: String concatenation result is more than the maximum size.
*Action: Make sure that the result is less than the maximum size.

Now let’s use the User Defined Aggregates framework to resolve this issue…

Stage 1 - Building the basic framework

First we need a storage object to hold the results generated during the Iterate stage. Following Tom Kyte’s example I am using an array/table as my storage object as this will ensure that I never hit the limits of the VARCHAR2 object when I am building my list of string values. Here I have checked the maximum size of the concatenated first name and last name combinations which is 21 characters. This means I can set the limit for the varchar2 column at 25 characters (…giving myself a little bit of headroom, just in case…).

CREATE OR REPLACE TYPE string_varray AS TABLE OF VARCHAR2(25);

Note that if we did not use an array we would be forced to use a single VARCHAR2 variable to hold the string values being pieced together. This would mean testing the length of the VARCHAR2 object before adding the next string value within both the Iterate and Merge functions, which causes all sorts of problems! As with most things in life, when working with the user defined aggregate interfaces it pays to keep things as simple as possible and put the processing in the most appropriate place. Therefore, in this case the best solution is to use an array because it makes the code simple and processing more efficient!

Here is the definition of for the interfaces:

CREATE OR REPLACE TYPE t_string_agg AS OBJECT
(
 a_string_data string_varray,
 
 STATIC FUNCTION ODCIAggregateInitialize(sctx IN OUT t_string_agg)
 RETURN NUMBER,

MEMBER FUNCTION ODCIAggregateIterate(self IN OUT t_string_agg,
 value IN VARCHAR2 )
 RETURN NUMBER,

MEMBER FUNCTION ODCIAggregateTerminate(self IN t_string_agg,
 returnValue OUT VARCHAR2,
 flags IN NUMBER)
 RETURN NUMBER,

MEMBER FUNCTION ODCIAggregateMerge(self IN OUT t_string_agg,
 ctx2 IN t_string_agg)
 RETURN NUMBER
);
/
SHOW ERRORS

Initially I included some additional variables at the start of the OBJECT definition for managing various counters and the maximum string length. However, it’s not possible to pass parameters into this function except for the string value to be processed. Therefore, I removed the counter definitions from the header and just instantiated my array. This actually makes perfect sense because the counters and maximum string length variables are only needed at one specific point within the process flow - more on this later - plus this keeps the definition stage nice and simple.

Now we have the definition for our interfaces which means we can move on to the actual coding associated with each interface.

Stage 2 - Coding the Initialization phase

The code for this interface is relatively simple because all we need to do is instantiate the array that will hold our list of string values being passed into the aggregation process. We do this using a call to the object named in the previous section: sctx := t_string_agg(string_varray() ); This is the start of our processing flow…

STATIC FUNCTION ODCIAggregateInitialize(sctx IN OUT t_string_agg)
 RETURN NUMBER IS
 BEGIN
     sctx := t_string_agg(string_varray() );


     RETURN ODCIConst.Success;
 END;

Stage 3 - Coding the Iterate phase

The iterate phase will get called multiple times as we process the string values that need to be aggregated. At this point all we need to do is to collect the string values being passed in and insert them into the array object that was created at the start of stage 1. We use the extend function to add a new row into the array and assign the string value. The important point to note here is that the Iterate process will also form part of the parallel execution framework, therefore, we need to keep the processing as simple as possible.

MEMBER FUNCTION ODCIAggregateIterate(self IN OUT t_string_agg,
 value IN VARCHAR2)
 RETURN NUMBER IS
 BEGIN
    a_string_data.extend;
    a_string_data(a_string_data.count) := value;
 
    RETURN ODCIConst.Success;
 END;

Given that we want to try and ensure that we don’t exceed the limits of the VARCHAR2 object and generate an “ORA-01489“ it’s tempting to code this logic within the Iterate function. The issue with placing the processing logic within this function is that all the logic will need to be replicated within the ODCIAggregateMerge function to cope parallel execution where multiple Iterate functions are executed and the results merged to generate a final, single resultset that can be passed to the ODCIAggregateTerminate function.

Stage 4 - Coding the Merge phase

This function gets called during parallel execution and is used to merge results from multiple Iterate processes to generate a final, single resultset that can be passed to the ODCIAggregateTerminate function. The basic logic needs to replicate the logic from the ODCIAggregateIterate function but needs to take account of multiple values coming into the function via the CTX2 instantiation of our string function rather than the single string value being passed into the Iterate function. In essence the CTX2 provides the data from the various parallel execution processes and the self object simply accumulates the results being passed in from the various processes.

MEMBER FUNCTION ODCIAggregateMerge(self IN OUT t_string_agg,
 ctx2 IN t_string_agg)
 RETURN NUMBER IS 
 BEGIN
   FOR i IN 1 .. ctx2.a_string_data.count
    LOOP
       a_string_data.extend;
       a_string_data(a_string_data.count) := ctx2.a_string_data(i);
    END LOOP;
 
    RETURN ODCIConst.Success;
 END;

If we tried to enforce the maximum string length functionality within Iterate function then the same processing would need to be enforced within this function as well and having tried to do it I can tell you that the code very quickly gets very complicated. This is not the right place to do complex processing that is better done in the final, terminate, phase. If you do find yourself putting a lot of code within either or both the Iterate and/or Merge functions then I would recommend taking a very hard look at whether the code would be better and simpler if it was placed within the Terminate stage.

Stage 5 - Coding the Terminate phase

Given that we want to avoid blowing the limits of the VARCHAR2 object this is the most obvious place to code our processing logic. By placing the code here, we can ensure that our string processing function will work when the query is run in both serial and parallel. Here is the code:

MEMBER FUNCTION ODCIAggregateTerminate(self IN t_string_agg,
 returnValue OUT VARCHAR2,
 flags IN NUMBER)
 RETURN NUMBER IS
 
 l_data varchar2(32000);
 ctx_len NUMBER;
 string_max NUMBER;

BEGIN
 ctx_len := 0;
 string_max := 100;
 
 FOR x IN (SELECT column_value FROM TABLE(a_string_data) order by 1)
 LOOP
    IF LENGTH(l_data || ',' || x.column_value) <= string_max THEN
         l_data := l_data || ',' || x.column_value;
    ELSE 
       ctx_len := ctx_len + 1;
    END IF;
 END LOOP;
 
 IF ctx_len > 1 THEN
     l_data := l_data || '...(' || ctx_len||')';
 END IF;
 
 returnValue := LTRIM(l_data, ',');
 
 RETURN ODCIConst.Success;
 END;

Note that the function returns the list of string values from the array sorted in alphabetical order: (SELECT column_value FROM TABLE(a_string_data) order by 1). If you don’t need a sorted list then you can remove the ORDER BY clause. Cycling through the ordered list of strings allows us to add a comma to separate each value from the next but first we need to check the length of the list of strings to see if we have reached the maximum string length set by the variable string_max. As soon as we reach this maximum value we start counting the number of values that are excluded from the final list:

    IF LENGTH(l_data || ',' || x.column_value) <= string_max THEN
         l_data := l_data || ',' || x.column_value;
    ELSE 
       ctx_len := ctx_len + 1;
    END IF;

For this code sample the maximum string length is set as 100 characters. Once we reach that value then we simply continue looping through the array but now we increment our counter (ctx_len). Given that we have sorted the list of string values, the next question is: can we remove duplicate values? The answer is yes! It is a relatively simple change:

 FOR x IN (SELECT DISTINCT column_value FROM TABLE(a_string_data) order by 1)

There is an obvious cost to pay here because the use of the DISTINCT keyword will require additional processing which will have a performance impact so think very carefully about whether this is actually needed.

Full code sample

Here is the complete definition of the code:

CREATE OR REPLACE TYPE BODY t_string_agg IS
 STATIC FUNCTION ODCIAggregateInitialize(sctx IN OUT t_string_agg)
 RETURN NUMBER IS
 BEGIN
    sctx := t_string_agg(string_varray() );


    RETURN ODCIConst.Success;
 END;

MEMBER FUNCTION ODCIAggregateIterate(self IN OUT t_string_agg,
 value IN VARCHAR2)
 RETURN NUMBER IS
 BEGIN
    a_string_data.extend;
    a_string_data(a_string_data.count) := value;
 
    RETURN ODCIConst.Success;
 END;

MEMBER FUNCTION ODCIAggregateTerminate(self IN t_string_agg,
 returnValue OUT VARCHAR2,
 flags IN NUMBER)
 RETURN NUMBER IS
 
   l_data varchar2(32000);
   ctx_len NUMBER;
   string_max NUMBER;

BEGIN
   ctx_len := 0;
   string_max := 100;
 
   FOR x IN (SELECT DISTINCT column_value FROM TABLE(a_string_data) order by 1)
    LOOP
       IF LENGTH(l_data || ',' || x.column_value) <= string_max THEN
          l_data := l_data || ',' || x.column_value;
       ELSE 
          ctx_len := ctx_len + 1;
       END IF;
   END LOOP;
 
   IF ctx_len > 1 THEN
        l_data := l_data || '...(' || ctx_len||')';
    END IF;
 
    returnValue := LTRIM(l_data, ',');
 
    RETURN ODCIConst.Success;
 END;

MEMBER FUNCTION ODCIAggregateMerge(self IN OUT t_string_agg,
 ctx2 IN t_string_agg)
 RETURN NUMBER IS 
 BEGIN
    FOR i IN 1 .. ctx2.a_string_data.count
     LOOP
        a_string_data.EXTEND;
        a_string_data(a_string_data.COUNT) := ctx2.a_string_data(i);
     END LOOP;
 
    RETURN ODCIConst.Success;
 END;
END;
/
SHOW ERRORS

Stage 6 - Coding the actual string function

The last step is to create a function that calls our string processing object and takes a string (VARCHAR2) object as it’s input:

CREATE OR REPLACE FUNCTION string_agg (p_input VARCHAR2)
RETURN VARCHAR2
PARALLEL_ENABLE AGGREGATE USING t_string_agg;

Bringing it all together…

Now let’s run a query using our replacement for LISTAGG()..

SELECT 
 country_region_id,
 string_agg(cust_first_name||' '||cust_last_name) AS s_str_list
FROM MY_LIST
GROUP BY country_region_id 
ORDER BY country_region_id;

and here is the output:

Results from serial execution of query

the length of the concatenated string for each region is limited to a maximum of 100 characters and we have a count of the number of values excluded from the list - shown in brackets at the end of the list. Just so you can check that the truncation is occurring correctly I have included an additional column in the resultset below which shows the length of the final string (including the three dots + the numbers showing the count of truncated values).

Count of final string length

Conclusion

First a huge vote of thanks to Tom Kyte for providing the original code and ideas behind this particular blog post.

If you don’t have access to Database 12c Release 2 and the enhancements that we made to LISTAGG to control “ORA-01489“ errors then here is a “usable” alternative that provides a lot more control over string concatenation. Obviously if you want string concatenation with a count of missing values and without a count then you will need to create two separate functions. If you need a distinct list of values for some queries and not others then you will need to create separate functions to handle the different processing.

This example also provides a great introduction to the area of user defined aggregates, which were introduced in Oracle 9i. A quick Google search returned a lot of similar examples but no “real-world” use cases so I assume that UDAs are a little used gem within the long list of Oracle Database features. UDAs are a very powerful and flexible feature that you should definitely add to your toolbox of skills.

Technorati Tags: Analytics, Database 12c, Oracle Database 12c, SQL, SQL Analytics

Search This Blog

SQL for Data Warehousing and Analytics