Posts

Showing posts from January, 2017

How to intelligently aggregate approximations

Image
The growth of low-cost storage platforms has allowed many companies to actively seeking out new external data sets and combine them with internal historical data that goes back over a very long time frame. Therefore, as both the type of data and the volume of data continue to grow the challenge for many businesses is how to process this every expanding pool of data and at the same time, make timely decisions based on all the available data.
(Image above courtesy of http://taneszkozok.hu/) In previous posts I have discussed whether an approximate answer is just plain wrong and whether approximate answers really are the best way to analyze big data. As with the vast majority of data analysis at some point there is going to be a need to aggregate a data set to get a higher level view across various dimensions. When working with results from approximate queries, dealing with aggregations can get a little complicated because it is not possible to “reuse” an aggregated approximate result as…

Dealing with very very long string lists using Database 12.2

Image
Oracle RDBMS 11gR2 introduced the LISTAGG function for working with string values. It can be used to aggregate values from groups of rows and return a concatenated string where the values are typically separated by a comma or semi-colon - you can determine this yourself within the code by supplying your own separator symbol.

Based on the number of posts across various forums and blogs, it is widely used by developers. However, there is one key issue that has been highlighted by many people: when using LISTAGG on data sets that contain very large strings it is possible to create a list that is too long. This causes the following overflow error to be generated:
ORA-01489: result of string concatenation is too long. Rather annoyingly for developers and DBAs, it is very difficult to determine ahead of time if the concatenation of the values within the specified LISTAGG measure_expr will cause an ORA-01489 error. Many people have posted workarounds to resolve this problem - including mysel…

MATCH_RECOGNIZE - What should I include in the MEASURES clause?

Image
Image courtesy of wikipediaThis post is the result of reviewing a post on stackoverflow.com: http://stackoverflow.com/questions/41649178/getting-error-ora-00918-when-using-match-recognize. Here is my version of the code which includes the same issues/errors as the original, however, I am using the TICKER schema table that I always use for the tutorials that I post on liveSQL :SELECT symbol, tstamp, price
FROM ticker
MATCH_RECOGNIZE(
PARTITION BY symbol
ORDER BY symbol, tstamp
MEASURES
a.symbol AS a_symbol,
a.tstamp AS a_date,
a.price AS a_price
ALL ROWS PER MATCH
PATTERN(A B*)
DEFINE
B AS (price < PREV(price))
);The above example will not run because of the following error:ORA-00918: column ambiguously defined00918. 00000 - "column ambiguously defined"
*Cause:
*Action:
Error at Line: 1 Column: 8So what is wrong with our code? As MATHGUY pointed out in his reply on stackoverflow.com - quite a lot actually! Let’s start by differentiating between “won’t run” and “wrong”…

Are approximate answers the best way to analyze big data

Image
Image courtesy of pixabay.comIn my previous post I reviewed some reasons why people seem reluctant to accept approximate results as being correct and useful. The general consensus is that approximate results are wrong which is very strange when you consider how often we interact with approximations as part of our everyday life.Most of the use cases in my first post on this topic covered situations where distinct counts were the primary goal - how many click throughs did an advert generate, how many unique sessions were recorded for a web site etc. The use cases that I outlined provided some very good reasons for using approximations of distinct counts. As we move forward into the era of Analytics-of-Things the use of approximations in queries will expand and this approach to processing data will become an accepted part of our analytical workflows.To support Analytics-of-Things, Database 12c Release 12.2 includes even more approximate functions. In this release we have added approximat…

SQL Pattern Matching Deep Dive - Part 6, state machines

Image
The obvious way to start this particular post is to pose a couple of simple questions: what is a state machine and why should you care? In general I would say that you don't need to know about or care about state machines. That's the beauty of using SQL for pattern matching. The MATCH_RECOGNIZE clause encapsulates all the deep technical modelling and processing that has to be performed to run pattern matching on a data set. However, there are times when it is useful, probably vital, that you understand what is going on behind the scenes and one of the most obvious situations is when backtracking happens.Therefore, the content covered in this post is a going to be a gently lead-in into my next post where I am going to discuss the concept of “backtracking”  and the dreaded ORA-30009 error.Let’s start our voyage of discovery…when you attempt to run a SQL statement containing a MATCH_RECOGNIZE clause during the compilation phase we generate a finite state machine based on the PATT…

Exploring the interfaces for User Defined Aggregates

Image
image courtesy of wikipediaWhilst I was working on the functional specification for the LISTAGG extensions that we implemented in 12c Release 2, I came across Tom Kyte’s stragg function which uses the User Defined Aggregate API introduced in database 9i. Tom’s comprehensive answer covers the two important areas that need to considered when using the data cartridge API: 1) a UDA can run as serial process and 2) a UDA can run as a parallel process. Therefore, you need to code for both these eventualities. Dealing with both scenarios can be a little challenging - as I have discovered over the last few weeks. having looked at a number of posts there is a common theme for explaining how the various interfaces for user defined aggregate actually work. One the clearest examples is on Tim Hall’s blog: String Aggregation Techniques.This got me thinking…would it be possible to take the new  extensions we made to LISTAGG and incorporate them into custom string manipulation function built using t…

Simplifying your data validation code with Database 12.2

Image
Image courtesy of pixabay.com Doesn’t matter who much testing you do (well, it actually does but that’s a whole different issue) you can almost guarantee that at some point your beautiful data validation code, that parses data input from a web form or loads data from some external file, will pop up with the error:SQL Error: ORA-01722: invalid number01722. 00000 - "invalid number"
*Cause: The specified number was invalid.
*Action: Specify a valid number.Of course, what’s is really annoying at this point is that you don’t know which column value of the record failed (assuming that you have more than one numeric column)Managing conversion errors during data loadsWhat’s to do? Of course the sensible thing is to add lots of data validation checks into your code to try and catch the situations where the wrong type of data  arrives from your data source. It’s likely that all the additional validation checks will slow down the process of inserting data, which is not a great result.If y…