Are approximate answers the best way to analyze big data?
In my previous post I reviewed some reasons why people seem reluctant to accept approximate results as being correct and useful. The general consensus is that approximate results are wrong, which is very strange when you consider how often we interact with approximations in our everyday life.
Most of the use cases in my first post on this topic covered situations where distinct counts were the primary goal: how many click-throughs did an advert generate, how many unique sessions were recorded for a web site etc. Those use cases provided some very good reasons for using approximations of distinct counts. As we move forward into the era of Analytics-of-Things the use of approximations in queries will expand and this approach to processing data will become an accepted part of our analytical workflows.
To support Analytics-of-Things, Database 12c Release 12.2 includes even more approximate functions. In this release we have added approximations for median and percentile computations and support for aggregating approximate results (counts, median and percentiles).
What are medians and percentiles?
A quick refresher course: according to Wikipedia, a percentile is:
a measure used in statistics indicating the value below which a given percentage of observations in a group of observations fall. For example, the 20th percentile is the value (or score) below which 20 percent of the observations may be found.
Percentiles are perfect for locating outliers in your data set. In the vast majority of cases you can start with the assumption that a data set exhibits a normal distribution. Therefore, if you take the data below the 0.13th percentile and above the 99.87th percentile (i.e. more than 3 standard deviations from the mean) then you get the anomalies. Percentiles are also great for quickly eyeballing the distribution of a data set so that you can check for skew, bimodality etc. Probably the most common use case is monitoring service levels, where these anomalies are the values of most interest.
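To make the outlier idea concrete, here is a minimal Python sketch. The response times are made-up, normally distributed values (mean 200 ms, standard deviation 25 ms), and this is an exact calculation using the standard library, not the approximate database functions discussed later.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

# Hypothetical response times in milliseconds, drawn from a normal distribution
response_ms = [random.gauss(200, 25) for _ in range(100_000)]

# quantiles(n=10_000) returns 9,999 cut points in steps of 0.01%
cuts = statistics.quantiles(response_ms, n=10_000)
p_low = cuts[13 - 1]      # 0.13th percentile
p_high = cuts[9987 - 1]   # 99.87th percentile

# Anything outside the two cut-offs is treated as an anomaly
outliers = [x for x in response_ms if x < p_low or x > p_high]
share = len(outliers) / len(response_ms)

print(f"cut-offs: {p_low:.1f} ms and {p_high:.1f} ms")
print(f"{len(outliers)} outliers ({share:.2%} of all rows)")
```

The two cut-offs should land close to 125 ms and 275 ms, i.e. three standard deviations either side of the mean, and only about a quarter of one percent of the rows fall outside them.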
On the other hand, a median is:
the number separating the higher half of a data sample, a population, or a probability distribution, from the lower half.
Why would you use the median rather than the mean? In other words, what are the use cases that require a median? The median is great at removing the impact of outliers because the data is sorted and then the middle value is extracted. The average is susceptible to being skewed by outliers. A great use case for the median is resource planning. If you want to know how many staff you should assign to manage your web-store application you might create a metric based on the number of sessions during the year. With a web-store the number of sessions will peak around key dates such as July 4th and Thanksgiving. Calculating the average number of sessions over the year will be skewed by these two dates and you will probably end up with too many staff looking after your application. Using the median removes these two spikes and will return a more realistic figure for the number of sessions per day during the year.
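A quick way to see the effect is with some invented numbers. The sketch below assumes 363 ordinary days of 10,000 sessions and two holiday spikes; all of the figures are hypothetical.

```python
import statistics

# Hypothetical daily session counts for one year: 363 ordinary days
# plus two holiday spikes (July 4th and Thanksgiving)
ordinary_days = [10_000] * 363
holiday_spikes = [400_000, 500_000]
daily_sessions = ordinary_days + holiday_spikes

mean_sessions = statistics.mean(daily_sessions)
median_sessions = statistics.median(daily_sessions)

print(f"mean:   {mean_sessions:,.0f} sessions per day")
print(f"median: {median_sessions:,.0f} sessions per day")
```

Two days out of 365 are enough to pull the mean roughly 24% above the typical day, while the median stays at 10,000.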
But before you start to consider where, when, how or even if you want to consider using approximate calculations you need to step back for a moment and think about the accuracy of your existing calculations, which I am guessing you think are 100% accurate!
Is your data accurate anyway?
Most business users work on the assumption that the data set they are using is actually 100% accurate. For the vast majority of operational sources flowing into the data warehouse this is probably true, although there will always be parentless dimension values and, in some cases, “Other” bucket dimension members to create some semblance of accuracy.
As we start to explore big data pulled from untrusted external sources and IoT sensor streams, which typically are inherently “noisy”, the level of “accuracy” within the data warehouse starts to become a range rather than a single specific value.
Let’s quickly explore the three key ways that noise gets incorporated into data sets:
1) Human errors
Human input errors: probably the most obvious. They affect both internal and external sources that rely on human input or interpretation of manually prepared data. Free format fields on forms create all sorts of problems because the answers need to be interpreted. Good examples are insurance claim forms, satisfaction surveys, crime reports, sales returns forms etc.
2) Coding errors
ETL errors: Just about every data source feeding a data warehouse goes through some sort of ETL process. Whilst this is sort of linked to the first group of errors it does fall into this group simply because of the number of steps involved in most ETL jobs. There are so many places where errors can be introduced.
Rounding and conversion errors: When an ETL job takes source data, converts it and then aggregates it before pushing it into the warehouse it will always be difficult to trace the aggregated numbers back to the source data because of inherent rounding errors. When dealing with currency exchange rates it can be difficult to tie back source data in one currency to the aggregated data in the common currency due to tiny rounding errors.
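The currency case is easy to reproduce. The following Python sketch uses an invented exchange rate and invented amounts; it simply compares rounding every converted row with rounding once after aggregation.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")
rate = Decimal("1.0873")  # hypothetical EUR -> USD exchange rate

# Hypothetical source rows in EUR: three amounts repeated 1,000 times
amounts_eur = [Decimal("19.99"), Decimal("4.35"), Decimal("7.77")] * 1000

def to_cents(value):
    """Round a currency amount to two decimal places."""
    return value.quantize(CENT, rounding=ROUND_HALF_UP)

# Path 1: the ETL job converts and rounds each row, then aggregates
row_level = sum(to_cents(amount * rate) for amount in amounts_eur)

# Path 2: aggregate in the source currency, then convert and round once
total_level = to_cents(sum(amounts_eur) * rate)

drift = row_level - total_level
print(f"row-level total:   {row_level}")
print(f"total-level total: {total_level}")
print(f"drift:             {drift}")
```

Neither total is "wrong", yet they differ by a few dollars over 3,000 rows, which is exactly the sort of discrepancy that makes it hard to reconcile an aggregate with its source.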
3) Data Errors
Missing data points: Data always gets lost in translation somewhere down the line or is simply out of date. In many cases this is the biggest source of errors. For example, one bank recently put together a marketing campaign to stop customer churn. Before they launched the campaign one of their data scientists did some deeper analysis and discovered that the training data for the model included customers who were getting divorced and this was being flagged as a lost customer. Including this group ended up skewing the results. The data about changes to marital status was not being pushed through fast enough to the data warehouse.
Meaningless or distracting data points: with the growth in interest in the area of IoT it is likely that this type of “noise” will become more prevalent in data sets. Sensor data is rarely 100% accurate mainly because in many cases it does not need to deliver that level of accuracy. The volume of data being sent from the sensor will allow you to easily remove or flag meaningless or distracting data. With weblogs it is relatively easy to ignore click-events where a user clicks on an incorrect link and immediately clicks the back-button.
In other words, in many situations, getting precise answers is nothing but an illusion: even when you process your entire data set, the answer is still an approximate one. So why not use approximation to your computational advantage and in a way where the trade-off between accuracy and efficiency is controlled by you?
Use cases for these new features
There are a lot of really good use cases for these types of approximations but here are my two personal favorites:
Hypothesis testing: a good example of this is A/B testing, which is most commonly used in conjunction with website design and ad design to select the page design or ad that generates the best response. With this type of analysis it is not vital that you have accurate, precise values. What is needed is the ability to reliably compare results, and approximations are normally good enough.
Ranking: How does your ISP calculate your monthly usage so they can bill you fairly? They use a percentile calculation where they remove the top 2% to 5% of your bandwidth peaks and then use the remaining information to calculate your bill. By using data below the 95th-98th percentile they can ignore the infrequent peaks when, say, you are downloading the latest update to your Android or iOS device. Again, having precise numbers for this percentile cut-off is not really necessary. A good enough approximation of the 95th percentile is usually going to be sufficient because it implies that approximately 95% of the time your usage is below the data volume identified around that percentile. And conversely, the remaining 5% of the time your usage creeps above that amount.
Of course all the use cases that we considered for distinct counts in the first post are also valid:
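As a rough illustration of percentile-based billing, the Python sketch below builds a made-up month of bandwidth readings. The 5-minute sampling interval, the traffic levels and the number of bursts are all assumptions for the example, and the percentile is computed exactly rather than approximately.

```python
import random
import statistics

random.seed(7)  # fixed seed so the run is repeatable

# Hypothetical month of 5-minute bandwidth samples in Mbps (30 * 24 * 12 = 8,640)
samples = [random.uniform(5, 20) for _ in range(8_640)]

# Overwrite 200 randomly chosen samples with short bursts, e.g. a device update
for i in random.sample(range(len(samples)), 200):
    samples[i] = random.uniform(400, 900)

# quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile
p95 = statistics.quantiles(samples, n=100)[94]
peak = max(samples)

print(f"peak usage:      {peak:.1f} Mbps")
print(f"95th percentile: {p95:.1f} Mbps")
```

Because the bursts make up only about 2% of the samples, the 95th percentile stays within the normal 5 to 20 Mbps range, and a small error in its value would barely change the bill.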
Discovery analytics: data analysts often slice and dice their dataset in their quest for interesting trends, correlations or outliers. If your application falls into this type of explorative analytics, getting an approximate answer within a second is much better compared to waiting twenty minutes for an exact answer. In fact, research on human-computer interaction has shown that, to keep business users engaged and productive, the response times for queries must be below 10 seconds. In particular, if the user has to wait for the answer to their query for more than a couple of seconds then their level of analytical thinking can be seriously impaired.
Market testing: the most common use case for market testing is around serving ads on websites. This is where two variants of a specific ad (each with a group of slightly different attributes such as animations or colour schemes) are served up to visitors during a session. The objective is to measure which version generates a higher conversion rate (i.e. more click-throughs). The analytics requires counting the number of clicks per ad with respect to the number of times each ad was displayed. Using an approximation of the number of click-throughs is perfectly acceptable. This is similar to the crowd-counting problem where it is not really necessary to report exactly how many people joined a rally or turned up to an event.
Root cause analysis: contrary to perceived wisdom, this can in fact be accomplished using approximations. Typically RCA follows a workflow model where results from one query trigger another query, which in turn triggers another related query. Approximations are used to speed up the decision as to whether or not to continue with a specific line of analysis. Of course you need to incorporate the likelihood of edge cases within your thinking process because there is the danger that the edge values will get lost within the general hashing process.
However, in practice we usually end up merging or blending the first two use cases with the three above to gain a deeper level of insight. So now let’s look at the new approximate statistical functions introduced in Database 12.2.
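To see why an approximate count is good enough for market testing, consider this small Python sketch. The click and impression tallies are invented, and the 1% error bound is an assumption for illustration rather than a documented guarantee.

```python
# Hypothetical exact tallies for two ad variants
impressions = {"A": 1_000_000, "B": 1_000_000}
clicks = {"A": 31_400, "B": 35_900}

# Assumption: every approximate count is within +/-1% of the true count
MAX_ERROR = 0.01

def rate_bounds(variant):
    """Lowest and highest click-through rate consistent with the error bound."""
    low = clicks[variant] * (1 - MAX_ERROR) / (impressions[variant] * (1 + MAX_ERROR))
    high = clicks[variant] * (1 + MAX_ERROR) / (impressions[variant] * (1 - MAX_ERROR))
    return low, high

a_low, a_high = rate_bounds("A")
b_low, b_high = rate_bounds("B")

# B wins even when every error goes against it and in favour of A
b_wins_regardless = b_low > a_high

print(f"A: {a_low:.4%} to {a_high:.4%}")
print(f"B: {b_low:.4%} to {b_high:.4%}")
print(f"B wins regardless of approximation error: {b_wins_regardless}")
```

The two ranges do not overlap, so the decision about which ad to keep is the same whether the counts are exact or approximate.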
Approximate median and percentile
With Database 12c Release 2 we have added two new approximate functions:
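APPROX_PERCENTILE(%_number [DETERMINISTIC] [, 'ERROR_RATE' | 'CONFIDENCE']) WITHIN GROUP (ORDER BY expr [ DESC | ASC ])
This function takes up to three input arguments. The first argument is a numeric value between 0 and 1 that identifies the percentile (0% to 100%). The second parameter is optional: if the DETERMINISTIC keyword is provided, the user requires deterministic results; if it is not provided, deterministic results are not mandatory. The third parameter, also optional, is either 'ERROR_RATE' or 'CONFIDENCE'; when it is supplied the function returns the error rate or the confidence level of the approximation instead of the percentile value. The input expression for the function is derived from the expr in the ORDER BY clause.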
The approx_median function has the following syntax:
APPROX_MEDIAN(expr [DETERMINISTIC] [, 'ERROR_RATE' | 'CONFIDENCE'])
We can use these functions separately or together as shown here using the SH schema:
SELECT calendar_year,
APPROX_PERCENTILE(0.25) WITHIN GROUP (ORDER BY amount_sold ASC) as "p-0.25",
TRUNC(APPROX_PERCENTILE(0.25, 'ERROR_RATE') WITHIN GROUP (ORDER BY amount_sold ASC),2) as "p-0.25-er",
TRUNC(APPROX_PERCENTILE(0.25, 'CONFIDENCE') WITHIN GROUP (ORDER BY amount_sold ASC),2) as "p-0.25-ci",
APPROX_MEDIAN(amount_sold deterministic) as "p-0.50",
TRUNC(APPROX_MEDIAN(amount_sold deterministic, 'ERROR_RATE'),2) as "p-0.50-er",
TRUNC(APPROX_MEDIAN(amount_sold deterministic, 'CONFIDENCE'),2) as "p-0.50-ci",
APPROX_PERCENTILE(0.75 deterministic) WITHIN GROUP (ORDER BY amount_sold ASC) as "p-0.75",
TRUNC(APPROX_PERCENTILE(0.75, 'ERROR_RATE') WITHIN GROUP (ORDER BY amount_sold ASC),2) as "p-0.75-er",
TRUNC(APPROX_PERCENTILE(0.75, 'CONFIDENCE') WITHIN GROUP (ORDER BY amount_sold ASC),2) as "p-0.75-ci"
FROM sales s, times t
WHERE s.time_id = t.time_id
GROUP BY calendar_year
ORDER BY calendar_year
The results from the above query are shown below
Note that for the APPROX_MEDIAN function I have included the keyword “DETERMINISTIC”. What does this actually mean?
Due to the nature of computing approximate percentiles and medians it is not possible to provide a specific and constant value for the error rate or the confidence interval. However, when we used a large-scale real-world customer data set (a manufacturing use case) we saw an error range of around 0.1% to 1.0%. Therefore, in broad general terms, accuracy will not be a major concern.
Error rates and confidence intervals
How closely an approximate answer matches the precise answer is gauged by two important statistics:
- margin of error
- confidence level.
These two pieces of information tell us how well the approximation represents the precise value. For example, a result may have a margin of error of plus or minus 3 percent at a 95 percent level of confidence. These terms simply mean that if the analysis were conducted 100 times, the data would be within a certain number of percentage points above or below the percentage reported in 95 of the 100 runs.
In other words, Company X surveys customers and finds that 50 percent of the respondents say its customer service is “very good.” The confidence level is cited as 95 percent plus or minus 3 percent. This information means that if the survey were conducted 100 times, the percentage who say service is “very good” will range between 47% and 53% most (95%) of the time (for more information see here: https://www.isixsigma.com/tools-templates/sampling-data/margin-error-and-confidence-levels-made-simple/).
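For the survey example, the margin of error can be worked out with the standard normal-approximation formula for a proportion. In the Python sketch below the sample size of 1,067 respondents is my own assumption, chosen because it yields roughly plus or minus 3 percent; 1.96 is the usual z-value for a 95 percent confidence level.

```python
import math

def margin_of_error(p, n, z=1.96):
    """Normal-approximation margin of error for a sample proportion."""
    return z * math.sqrt(p * (1 - p) / n)

# 50% of respondents answered "very good"; 1,067 respondents is a hypothetical sample size
moe = margin_of_error(0.5, 1_067)
low, high = 0.5 - moe, 0.5 + moe

print(f"margin of error: +/-{moe:.1%}")
print(f"95% range: {low:.1%} to {high:.1%}")
```

This reproduces the 47% to 53% range quoted above for the survey.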
Please note that if you search for more information about error rates and confidence levels then a lot of results will talk about sample size and working back from typical or expected error rates and confidence levels to determine the sample size needed. With approximate query processing we do not sample the source data. We always read all the source values; there is no sampling!
Performance - how much faster is an approximate result?
As a test against a real world schema we took a simple query from the customer that computed a number of different median calculations:
SELECT count(*) FROM (SELECT /*+ NO_GBY_PUSHDOWN */ b15, median(b4000), median(b776), median(e), median(f), median(n), median(z) FROM mdv group by b15);
As you can see from the real-time monitoring page, the query accessed 105 million rows and the calculations generated 11GB of temp. That’s a lot of data for one query to spill to disk!
Now if we convert the above query to use the approx_median function and rerun it we can see below that we get very different levels of resource usage:
Looking closely at the resource usage you can see that the query is 13x faster, uses considerably less memory (830Kb vs 1GB) but most importantly there is no usage of temp:
One of the most important take-aways from this post is that we always read all the source data. The approximate functions in Database 12c Release 2 do not use sampling as a way to increase performance. These new features are significantly faster and use fewer resources, which means more resources are available for other queries, allowing you to do more with the same level of resources.