Tuesday, 29 April 2014
Friday, 25 April 2014
OverviewThe need for a private, secure and safe area for data discovery within the data warehouse ecosystem is growing rapidly as many companies start investing in and investigating "big data". Business users need space and resources to evaluate new data sources to determine their value to the business and/or explore news way of analyzing existing datasets to extract even more value. These safe areas are most commonly referred to as "Sandboxes" or "Discovery Sandboxes" or "Discovery Zones". If you are not familiar with the term then Forrester Research defines a "sandbox" as:
“data exploration environment where a power user can analyse production […] with near complete freedom to modify data models, enrich data sets and run the analysis whenever necessary, without much dependency on IT and production environment restrictions.” *1These sandboxes are tremendously useful for business users because they allow them to quickly and informally explore new data sets or new ways of analyzing data without having to go through the formal rigour normally associated with data flowing into the EDW or deploying analytical scripts within the EDW. They provide business users with a high degree of freedom. The real business value is highlighted in a recent article by Ralph Kimball:
In several of the e-commerce enterprises interviewed for this white paper, analytic sandboxes were extremely important, and in some cases hundreds of the sandbox experiments were ongoing simultaneously.
As one interviewee commented “newly discovered patterns have the most disruptive potential, and insights from them lead to the highest returns on investment" *2
Key CharacteristicsSo what are they key characteristics of a sandbox? Essentially there are three:
- Used by skilled business analysts and data scientists
- Environment has fewer rules of engagement
- Time boxed
As I stated before, the normal EDW rules of engagement are significantly relaxed within the sandbox and new data flowing into the sandbox is typically disorganised and dirty. Hence the need for strong SQL skills to create simplified but functional data cleaning and transformation scripts with the emphasis being to make new data usable as quickly as possible. Part of the "transformation" process might be to generate new data points derived from existing attributes. A typical example of this is where a data set contains date-of-birth information, which in itself is quite a useful piece of information, that can be transformed to create a new data point of "age". Obviously the business analysts and data scientist need to be reasonably proficient in SQL to create the required transformation steps - it is not a complicated process but it highlights the point that the business community needs to have the necessary skills so that they are self-sufficient.
Most importantly the sandbox environment needs to have a time limit. In the past this is where most companies have gone wrong! Many companies fail to kill off their sandboxes. Instead these environments evolve and flourish into shadow marts and/or data warehouses which end up causing havoc as users can never be sure which system contains the correct data. Today, most enlightened companies enforce a 90-day timer on their sandboxes. Once the 90 day cycle is complete then ownership of the processes and data are either moved over to the EDW team, who can then start to apply the corporate standards to the various objects and scripts, or the environment and all its data is simply dropped.
The only way a business can support the hundreds of live sandbox experiments described in Kimball's recent report (*2) is by enforcing these three key characteristics.
Choosing your deployment model:Over the years that I have spent working on various data warehouse projects I have seen a wide variety of weird and wonderful deployment models designed to support sandboxing. In very general terms these various deployment models reduce down to one of the following types:
- Desktop sandbox
- Detached sandbox
- Attached sandbox
1. Desktop SandboxesMany business users prefer to use their desktop tools, such as spreadsheet packages, because the simple row-column data model gives them a simplified and easily managed view of their data set. However, this approach places a significant processing load on the desktop computer (laptop or PC) and while some vendors offer a way to off-load some of that processing to bespoke middleware servers this obviously means implementing an additional specialised middleware server on dedicated hardware. Otherwise, companies have to invest large amounts of money upgrading their desktop systems with additional memory and solid-state disks.
Creating a new sandbox is just a question of opening a new, fresh worksheet and loading the required data set. Obviously, the size and breadth of the dataset is limited by the resources on the desktop system and complicated calculations can take a considerable time to run with little or no scope for additional optimisation or tuning. Desktop sandbox are, by default, data-silos and completely disconnected from the enterprise data warehouse which makes it very difficult to do any sort of joined-up analysis.
The main advantage of this approach is that power users can easily run what-if models where they redefine their data model to test new "hierarchies", add new dimensions or new attributes. They can even change the data by simply over-typing existing values. Collaboration is a simple process of emailing the spreadsheet model to other users for comments. The overriding assumption here is that users who receive the spreadsheet are actually authorised to view the data! Of course there is nothing to prevent recipients forwarding the data to other users. Therefore, it is fair to say that data security is non-existent.
For DBAs, the biggest problem with this approach is that it offers no integration points into the existing cloud management infrastructure. Therefore, it is difficult for the IT team to monitor the resources being used and make appropriate x-charges. Of course the DBA has no control over the deletion of desktop based sandboxes so there is a tendency for these environments to take on a life of their own with business users using them to create "shadow" production systems that are never decommissioned.
Overall, the deployment of desktop sandboxes is not recommended.
2. Detached SandboxesUsing a detached, dedicated sandbox platform resolves many of the critical issues related to desktop sandbox platforms most notably the issues relating to: data security and processing scalability. Assuming a relatively robust platform is used to manage the sandboxes then the security profiles implemented in the EDW can be replicated across to the stand-alone platform. This approach still allows users to redefine their data model to test new "hierarchies", add new dimensions or new attributes within what-if models and even change data points but this ability is "granted" by the DBA rather than being automatically taken and enforced by the business user. In terms of sharing results there is no need to distribute data via email and this ensures everyone gets the same consistent view of the results (and by default the original source, should there be a need to work backwards from the results to the source).
Key concerns for business users is the level of latency that occurs from the need to unload and reload not only the required data but also all the supporting technical and business metadata. Unloading, moving and importing large historical data sets can be very time consuming and can require large amounts of resources on the production system - which may or may not be available depending on the timing of the request.
For the DBA issues arise around the need to monitor additional hardware and software services in the data center. For IT this means more costs because additional floor space, network bandwidth, power and cooling may be required. Of course, assuming that the sandbox platform fits into the existing monitoring and control infrastructure then x-charging can be implemented. In this environment the DBA has full control over the deletion of a sandbox so they can prevent the spread of "shadow" production data sets. For important business discoveries, the use of detached sandboxes does provide the IT team with the opportunity to grab the loading and analysis scripts and move them to the production EDW environment. This helps to reduce the amount of time and effort needed to "productionize" discoveries.
While detached sandboxes remove some of the disadvantages of desktop platforms it is still not an ideal way to deliver sandboxes to the business community.
3. Attached SandboxesAttached sandboxes resolve all the problems associated with the other two scenarios. Oracle provides a rich set of in-database features that allow business users to work with in-place data, which in effect, removes the issue of data latency. Oracle Database is able to guarantee complete isolation for any changes to dimensions, hierarchies, attributes and/or even individual data points so there is no need to unload, move and then reload data. All the existing data security policies remain in place which means there is no need to replicate security profiles to other systems where there is the inherent risk that something might be missed in the process.
For the DBA, x-charging can be implemented using existing infrastructure management tools. The DBA has full control over the sandbox in terms of resources (storage space, CPU, I/O) and duration. The only concern that is normally raised regarding the use of attached sandboxes is the impact on the existing operational workloads. Fortunately, Oracle Database, in conjunction with our engineered systems, has a very robust workload management framework (see earlier posts on this topic: https://blogs.oracle.com/datawarehousing/tags/Workload_Management). This means that the DBA can allocate sufficient resource to each sandbox while ensuring that the key operational workloads continue to meet their SLAs. Overall, attached sandboxes, within an Oracle Database environment, is a win-win solution: both the DBA and the business community get what they need.
High degree of local control over data
Quick and easy sharing of results
Reduced data scalability
Not easy to integrate new data
Very costly to implement
Undermines data consistency-governance
Data security is compromised
Reduces workload on EDW
Upload personal/external data to sandbox
Explore large volumes of data without limits
Requires additional hardware and software
Requires replication of corporate data
Replication + increased management of operational metadata
Upload additional data to virtual partitions Easy to mix new data with corporate data
No replication of corporate data
Efficient use of DW platform resources
Data access controlled by enterprise security features
Requires robust workload management tools
B-O-X-D: the lifecycle of a sandboxNow we know what type of sandbox we need to deploy (just in case you were not paying attention - attached sandboxes!) to keep our business users happy the next step is to consider the lifecycle of the sandbox along with the tools and features that support each of the key phases. To make things easier I have broken this down into four key DBA-centric phases as shown below:
Over the next four weeks I will cover these four key phases of the sandbox lifecycle and explain which Oracle tools and Oracle Database features are relevant and how they can be used.
*1 Solve the Data Management Conflict Between Business and IT, by Brad Peters - Information Management Newsletters, July 20, 2010
*2 The Evolving Role of the Enterprise Data Warehouse in the Era of Big Data Analytics by Ralph Kimball
Technorati Tags: Data Warehousing, Exadata, Oracle Database 12c, Sandbox
Thursday, 17 April 2014
Since last year's OpenWorld I have posted quite a few articles on our new 12c pattern matching feature:
- OLL Live webcast - Using SQL for Pattern Matching in Oracle Database
- OOW content for Pattern Matching….
- SQL analytical mash-ups deliver real-time WOW! for big data
- OTN Virtual Developer Day content now available on-demand
- Sessionization with 12c SQL pattern matching is super fast
- Oracle Technology Network's FREE Virtual Database Developer Day in EMEA
To date most of my online demos and tutorials have used a very simple data set consisting of a single table with 60 rows of fictitious trade data. Over the last few months I have been searching around for a slightly bigger and more challenging data set to use. Fortunately, our BIg Data PMs (led by Marty Gubar) have put together a very rich big data schema as part of their recently launched Big Data Virtual Machine that you can download from our OTN web page, see here: http://www.oracle.com/technetwork/database/bigdata-appliance/oracle-bigdatalite-2104726.html.
The data set is based around an online customer movie application. Here is a little background…
Oracle MoviePlex Demo Application
Oracle MoviePlex is a fictitious on-line movie streaming company. Customers log into Oracle MoviePlex where they are presented with a targeted list of movies based on their past viewing behavior. Because of this personalized experience and reliable and fast performance, customers spend a lot of money with the company and it has become extremely profitable.
As the users watch movies, search for their favorite actors, rate their movies the system records all the clicks in a log file. The contents of that log file looks like this:
In its raw state this information is not very helpful. It needs a lot of processing to convert the various data points into usable information. In many cases companies have been turning to Hadoop and its related Java-based programming language MapReduce to process and convert these types of files into usable information. Most business users will want to summarise this information by customer and/or movie and then merge this information with other data sets. So how can we make access to and analysis of this type of data much easier? As part of this post I am going to compare the Hadoop-Java-MapReduce approach with an alternative approach using 12c SQL. My objective is not to determine which is solution is the best because each approach has its own camp of supporters. Once we have reviewed the two solutions I will put everything into context and make some recommendations…..so let's bring on the code!
Sessionization using Java
Accessing the Avro file
At the moment In the context of Big Data, everything seem to revolve around Hadoop, MapReduce and Java. It is quite natural for a big data developer to extend their basic map reduce processing to include more complicated requirements. In the case of our movie demo there is a lot of processing that needs to be done using the native features of Hadoop to collect and process the weblog file being generated by our application. There is an excellent video prepared by Marty Gubar (Director of Product Management for Big Data) which explains this process. This is part four of a six-part series that explains the movieplex demo: Part 4. Turn Clicks into Value - Flume & Hive. The movie demo lasts about 5 mins and you can watch here: https://www.youtube.com/watch?v=IwrjJUoUwXY.
The steps shown in the movie explain how to access the avro file and then how to clean the data to provide some interesting additional metrics.
Calculating the sessionization data
Creating the sessionization analysis is a little more complicated. In fact, it requires 370 lines of Java code. Here is the Java code we created for doing the sessionization analysis (the code window is scrollable so you can review the very long code sample):
The actual logic for the sessionization analysis is about 100 lines of code as shown here (at s before, the code window is scrollable so you can review the very long code sample):
As you can see from the code listing this requires a strong knowledge of Java and with 370 lines of code, if we decide to change the details of the pattern that we are searching for it is going to be a lengthy process to make the required changes. Making anything changes to the code to reflect changing business requirements is definitely going to be beyond the skills of your average business analyst and/or data scientists and this might negatively impact the level of project agility.
Making life easier and richer with SQL
My personal view is that sessionization analysis is quite a sophisticated requirement and is best achieved using SQL because the requirements and parameters for the analysis will naturally evolve over time as new questions are triggered by existing answers. Therefore, you need an agile approach to analytics that can deal with constant changing requirements during the data discovery phase.
Accessing the Avro file
First step is to create a DIRECTORY object to access the raw data file:
CREATE DIRECTORY session_file_dir AS '/home/oracle/applog';
GRANT READ, WRTIE ON DIRECTORY session_file_dir to pmuser;
Next I created an external table over the data file. Notice that in the avro file, each key column has an identifier and a value.
The objective for this external table was to keep the processing as simple as possible:
CREATE TABLE RAW_SESSION_DATA
, MOVIE_ID VARCHAR2(20)
, GENRE_ID VARCHAR2(20)
, SESSION_DATE VARCHAR2(50)
, RECOMMENDED_ID VARCHAR2(20)
, ACTIVITY_ID VARCHAR2(20)
, RATING_ID VARCHAR2(20)
DEFAULT DIRECTORY SESSION_FILE_DIR
(RECORDS DELIMITED BY NEWLINE
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY "'"
MISSING FIELD VALUES ARE NULL)
LOCATION (SESSION_FILE_DIR: 'movieapp_30months.log')
REJECT LIMIT 1;ALTER TABLE RAW_SESSION_DATA
PROJECT COLUMN REFERENCED;
Cleaning the data
Now the next stage is to clean the data and remove the column identifiers such as custId, movieId, genreId etc. To do this processing I simply used the SQL SUBSTR() function. At the same time I decided to breakout the activity ID column to create unique columns for each type of activity. This approach is useful if you are going to be using the data as part of a data mining project because in many cases it useful to uniquely identify specific attributes. In many cases you will want to create a fact table from this process but in this case I have created a view while I do some basic prototyping:CREATE OR REPLACE VIEW VW_SESSION_DATA AS SELECT
SUBSTR(CUST_ID,11) AS CUST_ID
, CASE SUBSTR(MOVIE_ID,11) WHEN 'null' THEN null ELSE SUBSTR(MOVIE_ID,11) END AS MOVIE_ID
, CASE SUBSTR(GENRE_ID,11) WHEN 'null' THEN null ELSE SUBSTR(GENRE_ID,11) END AS GENRE_ID
, TO_DATE(SUBSTR(SESSION_DATE, 9,19), 'YYYY-MM-DD HH24:MI:SS') AS SESS_DATE
, CASE SUBSTR(RECOMMENDED_ID,15) WHEN 'null' THEN null ELSE SUBSTR(RECOMMENDED_ID,16,1) END AS REC_ID , (CASE SUBSTR(ACTIVITY_ID,12,2) WHEN '1' THEN '1' WHEN '2}' THEN '2' WHEN '3}' THEN '3' WHEN '4}' THEN '4' WHEN '5}' THEN '5' WHEN '6}' THEN '6' WHEN '7}' THEN '5' WHEN '8}' THEN '8' WHEN '9}' THEN '9' WHEN '10' THEN '10' WHEN '11' THEN '11' ELSE null END) AS ACTIVITY_ID , CASE SUBSTR(ACTIVITY_ID,12) WHEN 'null' THEN null ELSE SUBSTR(ACTIVITY_ID,12,1) END as ACTIVITY_ID
, CASE SUBSTR(RATING_ID,9) WHEN 'null' THEN null ELSE SUBSTR(RATING_ID,10,1) END as RATING_ID
, case SUBSTR(ACTIVITY_ID,12,1) WHEN '1' THEN 'Y' END as act_rate
, case SUBSTR(ACTIVITY_ID,12,1) WHEN '2' THEN 'Y' END as act_complete
, case SUBSTR(ACTIVITY_ID,12,1) WHEN '3' THEN 'Y' END as act_pause
, case SUBSTR(ACTIVITY_ID,12,1) WHEN '4' THEN 'Y' END as act_start
, case SUBSTR(ACTIVITY_ID,12,1) WHEN '5' THEN 'Y' END as act_browse
, case SUBSTR(ACTIVITY_ID,12,1) WHEN '6' THEN 'Y' END as act_list
, case SUBSTR(ACTIVITY_ID,12,1) WHEN '7' THEN 'Y' END as act_search
, case SUBSTR(ACTIVITY_ID,12,1) WHEN '8' THEN 'Y' END as act_login
, case SUBSTR(ACTIVITY_ID,12,1) WHEN '9' THEN 'Y' END as act_logout
, case SUBSTR(ACTIVITY_ID,12,1) WHEN '10' THEN 'Y' END as act_incomplete
, case SUBSTR(ACTIVITY_ID,12,1) WHEN '11' THEN 'Y' END as act_purchase
Running a query against this view generates our basic data set that we can then feed into our MATCH_RECOGNIZE clause to create the sessionization result set.
Is there a better, simpler way to deal with the avro file? Well yes there is. As part of the Database 12c release programme we will be adding in-database support for JSON data. This will allow us to take the raw avro file and access stand query it directly from within the database without the need to go through the complicated cleaning process! At the moment we have not made any collateral (PPTs etc) on this feature publicly available so I can't provide you with any supporting links that will give you more information. However, once this feature is released I will revisit this little demo to explore how the new JSON feature can be used along side SQL pattern matching.
Calculating the sessionization data
The code to calculate the sessionization information is as follows:
(PARTITION BY cust_id ORDER BY sess_date
MEASURES match_number() session_id,
TO_CHAR(FIRST(sess_date), 'hh24:mi:ss') start_time,
TO_CHAR(LAST(sess_date), 'hh24:mi:ss') end_time,
TO_CHAR(ROUND(TO_NUMBER(LAST(sess_date) - FIRST(sess_date)) * 1440), '999,999') duration,
ONE ROW PER MATCH
PATTERN (strt s+)
s as (round(to_number(sess_date - prev(sess_date)) * 1440) <= 5)
This statement uses many of the MATCH_RECOGNIZE features that I have discussed in previous posts. In this specific example I have decided to set the duration between events within a session as 5 minutes. That means if the user does nothing for about 5 minutes then I will assume that a new session has started even if the user has not logged out.
Beyond the basic sessionization model there is another interesting pattern that we can search for within our data set. If a user starts watching a movie then we might not get another event being logged for over two hours. Therefore, we could look for instances where the time between events is less than 150 minutes (assuming most movies last around 2 hours) and there is at least one (or more) "start" event is logged and at least one (or more) "complete" event is logged,
PATTERN (strt s* f+ c+)
f as act_start = 'Y',
c as act_complete = 'Y',
s as (round(to_number(sess_date - prev(sess_date))*1440) <= 150)
The output from this query is shown below:
Looking at the results what really jumps out is the level of interaction on the site before, during and after the user has watched a movie. Many of the records that are returned show users searching our movie database, going to their lists of favourite movies and rating the movies they have just watched. All this implies that if we can get customers to begin watching a movie there is a huge knock-on effect in terms of how long they remain on the site and the number of "events" they create while on the site. This is the sort of information that is useful to marketing and campaign management teams.
A different view: Calculating the sessionization data for each movie
While developing the above analysis I began to wonder about analysing the data not from a customer perspective but from the point of view of each movie. This would allow me to look at the session activity over time for each movie and understand the usage patterns across movies and genres. For example, are there certain movies or types of movies that are more or less likely to be rated. Is the process of watching a movie the same across all movies in terms of the events that are registered during a session? There are all sorts of new avenues of analysis that could be created from looking at the data in this way.
So how easy would it be to change the focus of the analysis to movies? Well the code took me approximately 30 seconds to write - actually cut & paste from the customer sessionization example, switch the customer id for the movie id and finally removed a few fields. Here is the new code:
FROM SESSION_DATA td
(PARTITION BY movie_id ORDER BY sess_date
MEASURES COUNT(*) no_of_events,
TO_CHAR(first(sess_date), 'hh24:mi:ss') start_time,
TO_CHAR(LAST(sess_date), 'hh24:mi:ss') end_time,
TO_CHAR(ROUND(TO_NUMBER(LAST(sess_date) - FIRST(sess_date)) * 1440), '999,999') duration,
ONE ROW PER MATCH
PATTERN (strt s*)
s as (ROUND(TO_NUMBER(sess_date - PREV(sess_date))*1440) <= 120)
The speed at which you can create these additional/alternative result sets is really great! It is now relatively easy to continue this discovery process by analysing the information by movie genre or any of the other related attributes.
The main take-away from this post is that (hopefully) I shown how easy it is to use SQL for sessionization analysis. Firstly in terms of creating the framework to support the normal "sessionization" transformation process: taking the source log file, extracting the data points and then grouping the information by user. The great news is that processing of the source file will be greatly simplified when the JSON feature is released because it will be possible to query the log file in a much simpler way to extract the required data points. Secondly, SQL provides a much simpler and easier way to support the evolutionary of the discovery process. Changing the focus from customer analysis to movie analysis is both quick and easy. I am not convinced that a Java-based approach to data discovery can offer the same level of simplicity and agility but I will let you be the judge of that last statement….
Tuesday, 15 April 2014
Just a gentle reminder - if you have not submitted a paper for this year's OpenWorld conference then there is still just enough time because the deadline is Today (Tuesday, April 15) at 11:59pm PDT. The call for papers website is here http://www.oracle.com/openworld/call-for-papers/index.html and this provides all the details of how and what to submit.
I have been working with a number of customers on some really exciting papers so I know this year's conference is going to be really interesting for data warehousing and analytics. I would encourage everyone to submit a paper, especially if you have never done this before. Right now both data warehousing and analytics are among the hottest topics in IT and I am sure all of you have some great stories that you could share with your industry peers who will be attending the conference. It is a great opportunity to present to your peers and also learn from them by attending their data warehouse/analytics sessions during this week long conference. And of course you get a week of glorious Californian sunshine and the chance to spend time in one of the World's most beautiful waterfront cities.
If you would like any help submitting a proposal then feel free to email during today and I will do my best to provide answers and/or guidance. My email address is firstname.lastname@example.org.
Have a great day and get those papers entered into our OpenWorld system right now!
Tuesday, 1 April 2014
Oracle has always been at the forefront of efforts to revolutionise your data center. To date, for obvious reasons, the focus has been on optimizing energy and space efficiency. As of today we are moving into an exciting new phase in terms of the look and feel of your data center. Oracle recently added a new fashion design team to its engineered system group to help us re-imagine the next generation data center and the first exciting fruits of this new partnership of both technology and fashion are now available for our customers to order…..
For a short period only, Oracle is offering its data warehouse customers the chance to buy a limited edition EXADATA X4-2C. This new Exadata configuration is going to brighten up your data center with its exciting range of color coordinated racks! Now you can enjoy running those really sophisticated business queries in glorious technicolor. Most importantly, the great news is that we are not charging you anything extra for this fabulous new technicolor data warehouse experience:
HARDWARE, SOFTWARE AND COLOR, ENGINEERED TO WORK TOGETHER
Each color-coded rack comes with its own color-linked version of Enterprise Manager to add more colour, brightness and joy to all those day-to-day tasks as you can see below on these specially designed monitoring screens:
Your Exadata DBA is really going to thank you!
So what happens if you buy a 1/2 rack then slowly add more Exadata nodes? Great question - well, while stocks last you can actually create your own multi-colored Exadata rack. As always we are ahead of the game because we know what our customers want. SO WHY NOT HAVE A TECHNICOLOR DATA WAREHOUSE in your data center! Go on, you know it makes sense….
BUT YOU GOTTA HURRY - This new Exadata X4-2C range is a limited edition, special order only model. Stocks are limited. To brighten up your data center make sure you contact your Oracle Sales Representative right now because you do not want to miss out on this exciting opportunity to put one of these gorgeous, colour-coded dudes in your data center. And don't forget, only Oracle gives you HARDWARE, SOFTWARE AND COLOR, ENGINEERED TO WORK TOGETHER