Managing overflows in LISTAGG
This is an interesting problem that has come up a few times in discussions (and I think it has been mentioned on the SQL forums as well). When using LISTAGG on very large data sets you can sometimes create a list that is too long and consequently get an ORA-01489: result of string concatenation is too long error. Obviously, it is possible to determine in advance if the error is going to occur using some bespoke PL/SQL code and then take appropriate action within your application to manage the rows that contain stings that exceed the VARCHAR2 limit. Many customers have implemented workarounds to overcome the ORA-01489 error, however, this has mostly involved the use of complex code which has impacted performance. |
Wouldn’t it be great if there was a simple yet elegant way to resolve this issue? Actually there is and we can use a few of the most recent analytical SQL functions. If you are using Database 12c you can make use of the MATCH_RECOGNIZE function to effectively create chunks of strings that do not exceed the VARCHAR2 limit.
If you are using an earlier version of the Oracle Database then we can make use of the powerful SQL Model clause that was introduced in Database 10g.
In fact the beauty of the workarounds outlined in this blog post is that you can define your own maximum size for the string returned by LISTAGG so you have complete control over the size/length of the result set that is returned.
For example, let’s assume that we have the following statement (to keep things relatively simple let’s use the EMP table in the schema SCOTT)
SELECT
deptno,
LISTAGG(ename, ';') WITHIN GROUP (ORDER BY empno) AS namelist
FROM emp
GROUP BY deptno;
this statement will generate the following output:
DEPTNO NAMELIST
---------- ----------------------------------------
10 CLARK;KING;MILLER
20 SMITH;JONES;SCOTT;ADAMS;FORD
30 ALLEN;WARD;MARTIN;BLAKE;TURNER;JAMES
Let’s assume that the above statement does not run and that we have a limit of 15 characters that can be returned by each row in our LISTAGG function. We can use the Database 12c SQL pattern matching function, MATCH_RECOGNIZE, to return a list of values that does not exceed 15 characters. First step is to wrap the processing in a view so that we can then get data from this view to feed our LISTAGG function. Here is the view that contains the MATCH_RECOGNIZE clause:
CREATE OR REPLACE VIEW emp_mr AS
SELECT * FROM emp MATCH_RECOGNIZE(
PARTITION BY deptno ORDER BY empno
MEASURES match_number() AS mno
ALL ROWS PER MATCH
AFTER MATCH SKIP PAST LAST ROW
PATTERN (S B*)
DEFINE B AS LENGTHB(S.ename) + SUM(LENGTHB(B.ename)) + LENGTHB(‘;’) < = 15
);
You might well ask: why don’t we put the LISTAGG function inside the measure clause? At the moment it is not possible to include analytical functions such as LISTAGG in the measure clause. Therefore, we have put the LISTAGG function in a separate SQL statement:
SELECT
deptno,
LISTAGG(ename, ';') WITHIN GROUP (ORDER BY empno) AS namelist
FROM emp_mr
GROUP BY deptno, mno;
The above code will produce the following output:
DEPTNO NAMELIST
---------- ----------------------------------------
10 CLARK;KING
10 MILLER
20 SMITH;JONES
20 SCOTT;ADAMSbr
20 FORD
30 ALLEN;WARD
30 MARTIN;BLAKE
30 TURNER;JAMES
Using 12c MATCH_RECOGNIZE, we are able to get all the values by sending just one query. Now not everyone who is hitting this ORA-01489 error is running 12c, so what do you do if you are running an earlier version of the database and need an efficient fix for this error? Prior to 12.1, the overflow can be handled using SQL Model clause. This approach is a little more complicated. First we create a view that returns the row number:
CREATE OR REPLACE VIEW emp_sqm_view AS
SELECT
emp.*,
row_number() OVER (PARTITION BY deptno ORDER BY empno) as rn
FROM emp;
the above statement generates the following output with the row number added for each employee within each department. Thinking about this in Excel terms, we will use this view to feed our SQL Model clause, therefore, the view will allow us to create a “worksheet” for each department and then list the employees for each department as a row on that “worksheet”:
EMPNO ENAME JOB MGR HIREDARE SAL COMM DEPTNO RN
===== ====== ========= ===== ========= ==== ==== ======= ==
7782 CLARK MANAGER 7839 09-JUN-81 2450 10 1
7839 KING PRESIDENT 17-NOV-81 5000 10 2
7934 MILLER CLERK 7782 23-JAN-82 1300 10 3
7369 SMITH CLERK 7902 17-DEC-80 800 20 1
7566 JONES MANAGER 7839 02-APR-81 2975 20 2
7788 SCOTT ANALYST 7566 19-APR-87 3000 20 3
7876 ADAMS CLERK 7788 23-MAY-87 1100 20 4
7902 FORD ANALYST 7566 03-DEC-81 3000 20 5
7499 ALLEN SALESMAN 7698 20-FEB-81 1600 300 30 1
7521 WARD SALESMAN 7698 22-FEB-81 1250 500 30 2
7654 MARTIN SALESMAN 7698 28-SEP-81 1250 1400 30 3
7698 BLAKE MANAGER 7839 01-MAY-81 2850 30 4
7844 TURNER SALESMAN 7698 08-SEP-81 1500 0 30 5
7900 JAMES CLERK 7698 03-DEC-81 950 30 6
Now the next step is to calculate the running total of the length of the text string of employee names within each department. The SQL Model clause allows us to calculate the running total and reset the total if it exceeds the number of characters that we have set as the limit - in this case 15 characters. This code is wrapped in the first of the views that we are going to build:
CREATE OR REPLACE VIEW emp_model_1 AS (
SELECT *
FROM emp_sqm_view
MODEL
PARTITION BY (deptno)
DIMENSION BY (rn)
MEASURES (empno, ename, LENGTHB(ename) len, 0 cum_len)
RULES (
cum_len[1] = LEN[1] + LENGTH(';'),
cum_len[rn > 1] = CASE WHEN cum_len[cv()-1] + LEN[cv()] + LENGTHB(‘;’) <= 15
THEN cum_len[cv()-1] + LEN[cv()] + LENGTHB(';')
ELSE LEN[cv()] + LENGTH(';')
END))
Next we need an id that we can use to group together the list of names that fit within our boundary of 15 characters. Again we will use the SQL Model clause to compute the break points and increment the counter within each department:
CREATE OR REPLACE VIEW emp_model_2 AS
SELECT *
FROM emp_model_1
MODEL PARTITION BY (dept no) DIMENSION BY (rn)
MEASURES (empno, ename, len, cum_len, 0 sub_id)
RULES (
sub_id[1] = 1,
sub_id[rn > 1] = CASE WHEN cum_len[cv()] = len[cv()] + LENGTHB(‘;’)
THEN sub_id[cv() - 1] + 1
ELSE sub_id[cv() - 1]
END));
now we have the following output:
DEPTNO RN EMPNO ENAME LEN CUM_LEN SUB_ID
====== == ===== ====== === ======= ======
10 2 7839 KING 4 11 1
10 1 7782 CLARK 5 6 1
10 3 7934 MILLER 6 7 2
20 1 7369 SMITH 5 6 1
20 5 7902 FORD 4 5 3
20 4 7876 ADAMS 5 12 2
20 3 7788 SCOTT 5 6 2
20 2 7566 JONES 5 12 1
30 1 7499 ALLEN 5 6 1
30 2 7521 WARD 4 11 1
30 3 7654 MARTIN 6 7 2
30 6 7900 JAMES 5 13 3
30 5 7844 TURNER 6 7 3
30 4 7698 BLAKE 5 13 2
we can now use the column headed sub_id along with the department number to create GROUP BY clause which will finally generate our list of employees within each department where each row in the column namelist will contain a maximum of 15 characters:
SELECT
deptno,
LISTAGG(ename, ';') WITHIN GROUP (ORDER BY empno) AS namelist
FROM emp_model_2
group by deptno, sub_id;
The above query creates the following output which matches the final output we achieved using the MATCH_RECOGNIZE clause:
DEPTNO NAMELIST
---------- ----------------------------------------
10 CLARK;KING
10 MILLE
20 SMITH;JONES
20 SCOTT;ADAMS
20 FORD
30 ALLEN;WARD
30 MARTIN;BLAKE
£0 TURNER;JAMES
If you are using the LISTAGG function and sometimes getting an ORA-01489: result of string concatenation is too long error then hopefully the above workarounds are useful.
The code samples shown above were created by Lei Sheng who is a Principal Member of Technical Staff, Database Development. Lei is worked on many of the really cool analytical SQL features including the SQL Model clause and the new MATCH_RECOGNIZE clause. So big thanks to Lei for sharing these workarounds.
NEW UPDATE - January 2016
I have been working on some 12.2 projects and had to revisit the SQL Model code. While working on a different problem I spotted that the results from the model clause, while giving the right outcome, were not quite correct. It seems that the computed number of characters was actually incorrect. The model was adding the ';' character too many times so the final count of the string length was actually too high.
Therefore, I have modified the code to correctly compute the full length of the string taking into account the inclusion of only one instance of the ';' character. Here is the amended code for the first view emp_model_1
CREATE OR REPLACE VIEW emp_model_1 AS (
SELECT *
FROM emp_sqm_view
MODEL
PARTITION BY (deptno)
DIMENSION BY (rn)
MEASURES (empno, ename, LENGTHB(ename) len, 0 cum_len)
RULES (
cum_len[1] = LEN[1],
cum_len[rn > 1] = CASE WHEN cum_len[cv()-1] + LEN[cv()] + LEN[cv()+1] + LENGTHB(';') > 15
AND cum_len[cv()-1] + LEN[cv()] + LENGTHB(';')
THEN cum_len[cv()-1] + len[cv()] + LENGTH(';')
WHEN cum_len[cv()-1] + LEN[cv()] + LENGTHB(';') THEN cum_len[cv()-1] + len[cv()]
ELSE len[cv()]
END));
I have used bold text to highlight the change that I have mode to the code. This now generates the following ouptut:
DEPTNO RN EMPNO ENAME LEN CUM_LEN
---------- ---------- ---------- ---------- ---------- ----------
30 1 7499 ALLEN 5 5
30 2 7521 WARD 4 10
30 3 7654 MARTIN 6 6
30 4 7698 BLAKE 5 12
30 5 7844 TURNER 6 6
30 6 7900 JAMES 5 11
20 1 7369 SMITH 5 5
20 2 7566 JONES 5 11
20 3 7788 SCOTT 5 5
20 4 7876 ADAMS 5 10
20 5 7902 FORD 4 14
10 1 7782 CLARK 5 5
10 2 7839 KING 4 10
10 3 7934 MILLER 6 6
There is a minor change to the second view emp_model_2 as follows, again changes highlighted in bold
CREATE OR REPLACE VIEW emp_model_2 AS (
SELECT *
FROM emp_model_1
MODEL
PARTITION BY (deptno)
DIMENSION BY (rn)
MEASURES (empno, ename, len, cum_len, 0 sub_id)
RULES (
sub_id[1] = 1,
sub_id[rn > 1] = CASE WHEN cum_len[cv()] = len[cv()]
THEN THEN sub_id[cv()-1] + 1
ELSE sub_id[cv()-1]
END));
The final output is the same but I feel a bit happier that the numbers used to split up the list are now accurate and there is no double counting of the ';' character.
Technorati Tags: Analytics, Data Warehousing, Database 12c, Oracle, Oracle Database 12c, Pattern Matching, SQL Analytics
Hi Keith,
ReplyDeleteGood stuff! Here is a slightly simpler, or at least more compact, MODEL variant:
select deptno, sub_id,
listagg(ename,';') within group(order by rn) name_list
from (
select * from scott.emp
MODEL
PARTITION BY (deptno)
DIMENSION BY (row_number() over(partition by deptno order by ename) rn)
MEASURES (1 sub_id, ename, LENGTHB(ename) cum_len, LENGTHB(ename) len)
RULES (
cum_len[rn>1] = len[cv()] +
case when len[cv()] + cum_len[cv()-1] + 1 <= 15
then cum_len[cv()-1] + 1
else 0
end,
sub_id[rn>1] = sub_id[cv()-1] +
case when cum_len[cv()] = len[cv()]
then 1
else 0
end
)
)
group by deptno, sub_id
order by deptno, sub_id;
Thanks Stew, that is definitely a much neater, more compact solution.
ReplyDelete