Thursday, 17 April 2014

Analyzing our Big Data Lite movie app clickstream data

Since last year's OpenWorld I have posted quite a few articles on our new 12c pattern matching feature.

To date most of my online demos and tutorials have used a very simple data set consisting of a single table with 60 rows of fictitious trade data. Over the last few months I have been searching around for a slightly bigger and more challenging data set to use. Fortunately, our Big Data PMs (led by Marty Gubar) have put together a very rich big data schema as part of their recently launched Big Data Virtual Machine, which you can download from our OTN web page here: http://www.oracle.com/technetwork/database/bigdata-appliance/oracle-bigdatalite-2104726.html.

The data set is based around an online customer movie application. Here is a little background…


Oracle MoviePlex Demo Application

Oracle MoviePlex is a fictitious on-line movie streaming company. Customers log into Oracle MoviePlex where they are presented with a targeted list of movies based on their past viewing behavior. Because of this personalized experience and reliable and fast performance, customers spend a lot of money with the company and it has become extremely profitable.

As users watch movies, search for their favorite actors and rate movies, the system records all the clicks in a log file. The contents of that log file look like this:

[Screenshot: sample records from the Avro log file]

In its raw state this information is not very helpful. It needs a lot of processing to convert the various data points into usable information. In many cases companies have been turning to Hadoop and its related Java-based MapReduce programming model to process and convert these types of files into usable information. Most business users will want to summarise this information by customer and/or movie and then merge it with other data sets. So how can we make access to, and analysis of, this type of data much easier? As part of this post I am going to compare the Hadoop-Java-MapReduce approach with an alternative approach using 12c SQL. My objective is not to determine which solution is the best because each approach has its own camp of supporters. Once we have reviewed the two solutions I will put everything into context and make some recommendations… so let's bring on the code!

 

Sessionization using Java

Accessing the Avro file

At the moment, in the context of Big Data, everything seems to revolve around Hadoop, MapReduce and Java. It is quite natural for a big data developer to extend their basic MapReduce processing to include more complicated requirements. In the case of our movie demo there is a lot of processing that needs to be done using the native features of Hadoop to collect and process the weblog file generated by our application. There is an excellent video prepared by Marty Gubar (Director of Product Management for Big Data) which explains this process. It is part four of a six-part series on the MoviePlex demo: Part 4. Turn Clicks into Value - Flume & Hive. The video lasts about 5 minutes and you can watch it here: https://www.youtube.com/watch?v=IwrjJUoUwXY.

The steps shown in the video explain how to access the Avro file and then how to clean the data to provide some interesting additional metrics.
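To give a flavour of the Hadoop side of this process, the sketch below shows the general shape of exposing an Avro log file to Hive so that it can be queried. Treat it purely as a minimal, hypothetical illustration: the table name, HDFS locations and schema file are my own assumptions and not the actual setup used in the Big Data Lite VM.

-- Hypothetical Hive external table over the Avro clickstream files
-- (table name, LOCATION and schema URL are assumptions, not the VM's actual configuration)
CREATE EXTERNAL TABLE movieapp_log_avro
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/user/oracle/moviework/applog_avro'
TBLPROPERTIES ('avro.schema.url'='hdfs:///user/oracle/moviework/schemas/activity.avsc');

-- Quick check that the clickstream records are visible
SELECT * FROM movieapp_log_avro LIMIT 10;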

Calculating the sessionization data

Creating the sessionization analysis is a little more complicated. In fact, it requires 370 lines of Java code. Here is the Java code we created for doing the sessionization analysis:

 

The actual logic for the sessionization analysis is about 100 lines of code, as shown here:

 

As you can see from the code listing, this requires a strong knowledge of Java, and with 370 lines of code any change to the details of the pattern we are searching for is going to take time to implement. Making changes to the code to reflect changing business requirements is definitely going to be beyond the skills of your average business analyst and/or data scientist, and this is likely to have a negative impact on project agility.

 

Making life easier and richer with SQL

My personal view is that sessionization analysis is quite a sophisticated requirement and is best achieved using SQL because the requirements and parameters for the analysis will naturally evolve over time as new questions are triggered by existing answers. Therefore, you need an agile approach to analytics that can deal with constantly changing requirements during the data discovery phase.

Accessing the Avro file

The first step is to create a DIRECTORY object to access the raw data file:

CREATE DIRECTORY session_file_dir AS '/home/oracle/applog';
GRANT READ, WRITE ON DIRECTORY session_file_dir to pmuser;
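If you want to double-check that the directory object exists before going any further, a quick query against the data dictionary will confirm it (this is just an optional sanity test of mine, not a step from the original demo):

SELECT directory_name, directory_path
FROM   all_directories
WHERE  directory_name = 'SESSION_FILE_DIR';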

 

Next I created an external table over the data file. Notice that in the Avro file each column value is prefixed with its key identifier:

{"custId":1077225"movieId":100"genreId":8"time":"2011-07-30:22:55:32""recommended":"N""activity":6}

The objective for this external table was to keep the processing as simple as possible:

CREATE TABLE RAW_SESSION_DATA 
(
CUST_ID VARCHAR2(20)
, MOVIE_ID VARCHAR2(20)
, GENRE_ID VARCHAR2(20)
, SESSION_DATE VARCHAR2(50)
, RECOMMENDED_ID VARCHAR2(20)
, ACTIVITY_ID VARCHAR2(20)
, RATING_ID VARCHAR2(20)
)
ORGANIZATION EXTERNAL
(
TYPE ORACLE_LOADER
DEFAULT DIRECTORY SESSION_FILE_DIR
ACCESS PARAMETERS
(RECORDS DELIMITED BY NEWLINE
NOBADFILE
NODISCARDFILE
NOLOGFILE
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY "'"
MISSING FIELD VALUES ARE NULL)
LOCATION (SESSION_FILE_DIR: 'movieapp_30months.log')
)
REJECT LIMIT 1;
ALTER TABLE RAW_SESSION_DATA 
PROJECT COLUMN REFERENCED;
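With the external table in place, a simple count is a useful first check that the log file can actually be read (again, an optional sanity test of mine rather than part of the original workflow):

SELECT COUNT(*) AS no_of_log_records
FROM   raw_session_data;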

 

Cleaning the data

Now the next stage is to clean the data and remove the column identifiers such as custId, movieId, genreId etc. To do this processing I simply used the SQL SUBSTR() function. At the same time I decided to break out the activity ID column to create unique columns for each type of activity. This approach is useful if you are going to be using the data as part of a data mining project because in many cases it is useful to uniquely identify specific attributes. In many cases you will want to create a fact table from this process, but in this case I have created a view while I do some basic prototyping:

CREATE OR REPLACE VIEW VW_SESSION_DATA AS SELECT 
SUBSTR(CUST_ID,11) AS CUST_ID
, CASE SUBSTR(MOVIE_ID,11) WHEN 'null' THEN null ELSE SUBSTR(MOVIE_ID,11) END AS MOVIE_ID
, CASE SUBSTR(GENRE_ID,11) WHEN 'null' THEN null ELSE SUBSTR(GENRE_ID,11) END AS GENRE_ID
, TO_DATE(SUBSTR(SESSION_DATE, 9,19), 'YYYY-MM-DD HH24:MI:SS') AS SESS_DATE
, CASE SUBSTR(RECOMMENDED_ID,15) WHEN 'null' THEN null ELSE SUBSTR(RECOMMENDED_ID,16,1) END AS REC_ID
, CASE SUBSTR(ACTIVITY_ID,12) WHEN 'null' THEN null ELSE SUBSTR(ACTIVITY_ID,12,1) END as ACTIVITY_ID
, CASE SUBSTR(RATING_ID,9) WHEN 'null' THEN null ELSE SUBSTR(RATING_ID,10,1) END as RATING_ID
, case SUBSTR(ACTIVITY_ID,12,1) WHEN '1' THEN 'Y' END as act_rate
, case SUBSTR(ACTIVITY_ID,12,1) WHEN '2' THEN 'Y' END as act_complete
, case SUBSTR(ACTIVITY_ID,12,1) WHEN '3' THEN 'Y' END as act_pause
, case SUBSTR(ACTIVITY_ID,12,1) WHEN '4' THEN 'Y' END as act_start
, case SUBSTR(ACTIVITY_ID,12,1) WHEN '5' THEN 'Y' END as act_browse
, case SUBSTR(ACTIVITY_ID,12,1) WHEN '6' THEN 'Y' END as act_list
, case SUBSTR(ACTIVITY_ID,12,1) WHEN '7' THEN 'Y' END as act_search
, case SUBSTR(ACTIVITY_ID,12,1) WHEN '8' THEN 'Y' END as act_login
, case SUBSTR(ACTIVITY_ID,12,1) WHEN '9' THEN 'Y' END as act_logout
, case SUBSTR(ACTIVITY_ID,12,1) WHEN '10' THEN 'Y' END as act_incomplete
, case SUBSTR(ACTIVITY_ID,12,1) WHEN '11' THEN 'Y' END as act_purchase
FROM RAW_SESSION_DATA;

 

Running a query against this view generates our basic data set that we can then feed into our MATCH_RECOGNIZE clause to create the sessionization result set.
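For example, a quick illustrative query of my own (using the customer id from the sample record shown earlier) confirms the cleaned-up column values before we move on to the pattern matching step:

SELECT cust_id, movie_id, genre_id, sess_date, activity_id
FROM   vw_session_data
WHERE  cust_id = '1077225'
ORDER  BY sess_date;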

Sessionization

Is there a better, simpler way to deal with the Avro file? Well, yes there is. As part of the Database 12c release programme we will be adding in-database support for JSON data. This will allow us to take the raw Avro file and query it directly from within the database without the need to go through the complicated cleaning process! At the moment we have not made any collateral (PPTs etc.) on this feature publicly available, so I can't provide you with any supporting links that will give you more information. However, once this feature is released I will revisit this little demo to explore how the new JSON feature can be used alongside SQL pattern matching.

 

Calculating the sessionization data

The code to calculate the sessionization information is as follows:

SELECT *
FROM vw_session_data
MATCH_RECOGNIZE
(PARTITION BY cust_id ORDER BY sess_date
MEASURES match_number() session_id,
COUNT(*) no_of_events,
FIRST(sess_date) start_date,
TO_CHAR(FIRST(sess_date), 'hh24:mi:ss') start_time,
LAST(sess_date) end_date,
TO_CHAR(LAST(sess_date), 'hh24:mi:ss') end_time,
TO_CHAR(ROUND(TO_NUMBER(LAST(sess_date) - FIRST(sess_date)) * 1440), '999,999') duration,
LAST(activity_id) last_act_id,
COUNT(act_rate) act_rate,
COUNT(act_complete) act_complete,
COUNT(act_pause) act_pause,
COUNT(act_start) act_start,
COUNT(act_browse) act_browse,
COUNT(act_list) act_list,
COUNT(act_search) act_search,
COUNT(act_login) act_login,
COUNT(act_logout) act_logout,
COUNT(act_incomplete) act_incomplete,
COUNT(act_purchase) act_purchase
ONE ROW PER MATCH
PATTERN (strt s+)
DEFINE
s as (round(to_number(sess_date - prev(sess_date)) * 1440) <= 5)
);

 

This statement uses many of the MATCH_RECOGNIZE features that I have discussed in previous posts. In this specific example I have set the maximum gap between events within a session to 5 minutes. That means that if the user does nothing for more than 5 minutes, I assume that a new session has started even if the user has not logged out.

Beyond the basic sessionization model there is another interesting pattern that we can search for within our data set. If a user starts watching a movie then we might not get another event logged for over two hours. Therefore, we could look for instances where the time between events is less than 150 minutes (assuming most movies last around 2 hours) and where at least one "start" event and at least one "complete" event are logged:

PATTERN (strt s* f+ c+) 
DEFINE
f as act_start = 'Y',
c as act_complete = 'Y',
s as (round(to_number(sess_date - prev(sess_date))*1440) <= 150)
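Plugged into the earlier statement, the full query looks roughly like this - note that I have trimmed the measure list for brevity, so treat this as a sketch rather than the exact statement behind the output shown below:

SELECT *
FROM vw_session_data
MATCH_RECOGNIZE
(PARTITION BY cust_id ORDER BY sess_date
MEASURES match_number() session_id,
COUNT(*) no_of_events,
COUNT(act_start) act_start,
COUNT(act_complete) act_complete
ONE ROW PER MATCH
PATTERN (strt s* f+ c+)
DEFINE
f as act_start = 'Y',
c as act_complete = 'Y',
s as (round(to_number(sess_date - prev(sess_date)) * 1440) <= 150)
);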

 

The output from this query is shown below:

[Screenshot: query output showing sessions that include start and complete events]

 

Looking at the results what really jumps out is the level of interaction on the site before, during and after the user has watched a movie. Many of the records that are returned show users searching our movie database, going to their lists of favourite movies and rating the movies they have just watched. All this implies that if we can get customers to begin watching a movie there is a huge knock-on effect in terms of how long they remain on the site and the number of "events" they create while on the site. This is the sort of information that is useful to marketing and campaign management teams. 

A different view: Calculating the sessionization data for each movie

While developing the above analysis I began to wonder about analysing the data not from a customer perspective but from the point of view of each movie. This would allow me to look at the session activity over time for each movie and understand the usage patterns across movies and genres. For example, are there certain movies or types of movies that are more or less likely to be rated? Is the process of watching a movie the same across all movies in terms of the events that are registered during a session? There are all sorts of new avenues of analysis that could be created from looking at the data in this way.

So how easy would it be to change the focus of the analysis to movies? Well, the code took me approximately 30 seconds to write - I actually cut and pasted the customer sessionization example, switched the customer id for the movie id and removed a few fields. Here is the new code:

SELECT *
FROM vw_session_data td
MATCH_RECOGNIZE
(PARTITION BY movie_id ORDER BY sess_date
MEASURES COUNT(*) no_of_events,
FIRST(sess_date) start_date,
TO_CHAR(first(sess_date), 'hh24:mi:ss') start_time,
LAST(sess_date) end_date,
TO_CHAR(LAST(sess_date), 'hh24:mi:ss') end_time,
TO_CHAR(ROUND(TO_NUMBER(LAST(sess_date) - FIRST(sess_date)) * 1440), '999,999') duration,
MIN(activity_id) act_id,
MAX(activity_id) last_act_id,
COUNT(act_rate) act_rate,
COUNT(act_complete) act_complete,
COUNT(act_pause) act_pause,
COUNT(act_start) act_start,
COUNT(act_browse) act_browse,
COUNT(act_list) act_list,
COUNT(act_search) act_search,
COUNT(act_login) act_login,
COUNT(act_logout) act_logout,
COUNT(act_incomplete) act_incomplete,
COUNT(act_purchase) act_purchase
ONE ROW PER MATCH
PATTERN (strt s*)
DEFINE
s as (ROUND(TO_NUMBER(sess_date - PREV(sess_date))*1440) <= 120)
) MR;

 

The speed at which you can create these additional/alternative result sets is really great!  It is now relatively easy to continue this discovery process by analysing the information by movie genre or any of the other related attributes.
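As an example, the sketch below (an illustrative variant of mine, not a query from the original demo) simply swaps the partition key once more to analyse session activity by genre:

SELECT *
FROM vw_session_data
MATCH_RECOGNIZE
(PARTITION BY genre_id ORDER BY sess_date
MEASURES COUNT(*) no_of_events,
FIRST(sess_date) start_date,
LAST(sess_date) end_date,
COUNT(act_rate) act_rate,
COUNT(act_complete) act_complete
ONE ROW PER MATCH
PATTERN (strt s*)
DEFINE
s as (ROUND(TO_NUMBER(sess_date - PREV(sess_date)) * 1440) <= 120)
);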

Summary

The main take-away from this post is that (hopefully) I have shown how easy it is to use SQL for sessionization analysis. Firstly, in terms of creating the framework to support the normal "sessionization" transformation process: taking the source log file, extracting the data points and then grouping the information by user. The great news is that processing of the source file will be greatly simplified when the JSON feature is released because it will be possible to query the log file in a much simpler way to extract the required data points. Secondly, SQL provides a much simpler and easier way to support the evolutionary nature of the discovery process. Changing the focus from customer analysis to movie analysis is both quick and easy. I am not convinced that a Java-based approach to data discovery can offer the same level of simplicity and agility, but I will let you be the judge of that last statement…

 


Tuesday, 15 April 2014

OpenWorld call for Papers closes today!

Just a gentle reminder - if you have not submitted a paper for this year's OpenWorld conference then there is still just enough time because the deadline is today (Tuesday, April 15) at 11:59pm PDT. The call for papers website is here: http://www.oracle.com/openworld/call-for-papers/index.html and it provides all the details of how and what to submit.

I have been working with a number of customers on some really exciting papers so I know this year's conference is going to be really interesting for data warehousing and analytics. I would encourage everyone to submit a paper, especially if you have never done this before. Right now both data warehousing and analytics are among the hottest topics in IT and I am sure all of you have some great stories that you could share with your industry peers who will be attending the conference. It is a great opportunity to present to your peers and also learn from them by attending their data warehouse/analytics sessions during this week long conference. And of course you get a week of glorious Californian sunshine and the chance to spend time in one of the World's most beautiful waterfront cities.

If you would like any help submitting a proposal then feel free to email me today and I will do my best to provide answers and/or guidance. My email address is keith.laker@oracle.com.

Have a great day and get those papers entered into our OpenWorld system right now!

 

 


Tuesday, 1 April 2014

Limited Edition Exadata X4-2C - Brighten Up Your Data Center

Oracle has always been at the forefront of efforts to revolutionise your data center. To date, for obvious reasons, the focus has been on optimizing energy and space efficiency. As of today we are moving into an exciting new phase in terms of the look and feel of your data center. Oracle recently added a new fashion design team to its engineered system group to help us re-imagine the next generation data center and the first exciting fruits of this new partnership of both technology and fashion are now available for our customers to order…..

For a short period only, Oracle is offering its data warehouse customers the chance to buy a limited edition EXADATA X4-2C. This new Exadata configuration is going to brighten up your data center with its exciting range of color coordinated racks! Now you can enjoy running those really sophisticated business queries in glorious technicolor. Most importantly, the great news is that we are not charging you anything extra for this fabulous new technicolor data warehouse experience:

[Image: the color-coordinated Exadata X4-2C racks]

HARDWARE, SOFTWARE AND COLOR ENGINEERED TO WORK TOGETHER

Each color-coded rack comes with its own color-linked version of Enterprise Manager to add more colour, brightness and joy to all those day-to-day tasks as you can see below on these specially designed monitoring screens: 

[Screenshot: color-linked Enterprise Manager monitoring screens]

 Your Exadata DBA is really going to thank you!

So what happens if you buy a 1/2 rack and then slowly add more Exadata nodes? Great question - well, while stocks last you can actually create your own multi-colored Exadata rack. As always we are ahead of the game because we know what our customers want. SO WHY NOT HAVE A TECHNICOLOR DATA WAREHOUSE in your data center! Go on, you know it makes sense…

[Image: a multi-colored Exadata rack]

BUT YOU GOTTA HURRY - This new Exadata X4-2C range is a limited edition, special order only model. Stocks are limited. To brighten up your data center make sure you contact your Oracle Sales Representative right now because you do not want to miss out on this exciting opportunity to put one of these gorgeous, colour-coded dudes in your data center. And don't forget, only Oracle gives you HARDWARE, SOFTWARE AND COLOR ENGINEERED TO WORK TOGETHER.


Oracle 1-800-633-0738



Friday, 21 March 2014

Open World 2014 - guidelines for call-for-papers…

 


 

Most of you will already have received an email from the OOW team announcing the call for papers for this year's conference: http://www.oracle.com/openworld/call-for-papers/index.html. Each year customers ask me how they can increase their chances of getting their paper accepted. Well, I am going to start by stating that product managers have absolutely no influence over which papers are accepted - even mentioning that a product manager will be co-presenting with you will not increase your chances!

So how do you make sure that your presentation title and abstract catch the eye of the selection committee? Here is my top 10 list of guidelines for submitting proposals:

 

1) Read the "call-for-papers" carefully and follow its instructions - even if you have submitted presentations for lots of Oracle conferences it is always a good idea to carefully read the call for papers and to make sure you follow the instructions. There is an excellent section towards the end of the call-for-papers web page, "Tips and Guidelines".

2) Address the theme of the conference - if this is available when the call for papers is announced then try to address the theme of the conference within your abstract.

3) Address the key data warehouse focus areas - for this year's OOW 2014 the key focus areas for data warehousing will be partitioning, analytical SQL, parallel execution, workload management and logical data warehouse. If possible try to include one or more of these focus areas within your abstract.

4) Have a strong biography - You need to use your biography to differentiate and build credibility. This is an important topic because it allows you to differentiate yourself from all the other presenters who are trying to get speaking slots. Your biography must explain why you are an authority on the topic you have chosen for your presentation and why people will want to listen to what you have to say.

5) Have a strong business case - build your presentation around a strong business case, relevant to your industry and/or your target audience (DBAs, developers, architects etc). Try to explain in clear and simple terms the problem you needed to solve, how you solved it using Oracle technology and the direct technical/business benefits.

6) Make the title and abstract interesting - your title and abstract must be easy to read and make sure you introduce your main idea as early as possible. Review the titles and abstracts from previous conferences as a guide. Ideally, make the issue relevant to the delegates attending OOW, get to the point, and make sure it is easy to read.

7) Look at previous presentations - the content catalog for last year's conference is available online, see here: https://oracleus.activeevents.com/2013/connect/search.ww?eventRef=openworld. You can review all the titles and abstracts that were accepted and use them as guidelines for creating your own title and abstract for this year's conference.

8) Write clear outcomes - The majority of the best presentations have clearly stated outcomes. What do you expect that conference attendees will be able do or know at the end of your session? Consider including a sentence at the end of your abstract such as the following: “At the end of this presentation, participants will be able to . . . .”

9) Don’t submit your paper right away - Once you have a title and abstract show it to a few colleagues. Get some feedback. You probably know many people who’d be happy to give you ideas on making your paper better.

10) Keep the number of submissions low - you do not increase your chances of getting a paper accepted by submitting lots of different papers.

I cannot guarantee you success if you follow these guidelines but I hope they prove helpful. Good luck with your submission(s) and I look forward to seeing you at this year's OpenWorld in San Francisco.

 

 


 


Tuesday, 18 March 2014

Built-in sorting optimizations to support analytical SQL

One of the proof points that I often make for using analytical SQL over more sophisticated SQL-based methods is that we have included specific optimizations within the database engine to support our analytical functions. In this blog post I am going to briefly talk about how the database optimizes the number of sorts that occur when using analytical SQL.

Sort Optimization 1: Ordering Groups

Many analytical functions include a PARTITION BY and/or an ORDER BY clause, both of which imply that an ordering process is going to be required. As each function can have its own PARTITION BY-ORDER BY clause, this can create situations where a lot of different sorts are needed. For example, if we have a SQL statement that includes the following:

Rank() Over (Partition by (x) Order by (w))
Sum(a) Over (Partition by (w,x) Order by (z))
Ntile() Over (Partition by (x) Order by (y))
Sum(b) Over (Partition by (x,y) Order by (z))

this could involve four different sort processes to take into account the use of both PARTITION BY and ORDER BY clauses across the four functions. Performing four separate sort processes on a data set could add a tremendous overhead (depending on the size of the data set). Therefore, we have taken two specific steps to optimize the sorting process.

The first step is to create the notion of "Ordering Groups". This optimization looks for ways to group together sets of analytic functions which can be evaluated with a single sort. The objective is to construct a minimal set of ordering groups, which in turn minimizes the number of sorts. In the example above we would create two ordering groups as follows:

 

[Figure: the two ordering groups derived from the four functions above]

This allows us to reduce the original list of sorts down from 4 to just 2.
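If you want to see this grouping for yourself, the execution plan is the place to look: each WINDOW SORT operation in the plan corresponds to one ordering group. The sketch below is a generic way to inspect the plan; the table and column names are placeholders of my own and the exact number of WINDOW SORT steps you see will depend on your data and database version.

EXPLAIN PLAN FOR
SELECT RANK() OVER (PARTITION BY x ORDER BY w) rnk,
       SUM(a) OVER (PARTITION BY w, x ORDER BY z) sum_a,
       NTILE(4) OVER (PARTITION BY x ORDER BY y) tile,
       SUM(b) OVER (PARTITION BY x, y ORDER BY z) sum_b
FROM   my_table;

-- Count the WINDOW SORT operations that appear in the plan output;
-- fewer sorts than analytic functions indicates ordering groups at work
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY());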

Sort Optimization 2: Eliminating Sorts

We can further reduce the number of sorts that need to be performed by carefully scheduling the execution so that:

  • Ordering groups with sorts corresponding to that in the GROUP BY execute first (immediately after the GROUP BY) 
  • Ordering groups with sorts corresponding to that in the ORDER BY execute last (immediately before the ORDER BY)

In addition, we can also eliminate sorts when an index or join method (sort-merge) makes sorting unnecessary. 
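As a purely illustrative example (my own, using hypothetical table and column names), the following is the kind of statement where the analytic ordering lines up with both the GROUP BY and the final ORDER BY, so that in principle no additional sort should be needed for the window function:

-- The RANK() ordering matches the GROUP BY key and the final ORDER BY,
-- so its sort can potentially be shared rather than performed separately
SELECT dept_id,
       SUM(salary) AS total_sal,
       RANK() OVER (ORDER BY dept_id) AS dept_rank
FROM   employees
GROUP  BY dept_id
ORDER  BY dept_id;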

Sort Optimization 3: RANK Predicates

Where a SQL statement includes RANK() functions there are additional optimizations that kick in. Instead of sorting all the data, adding the RANK and then applying the predicate, the RANK predicate is evaluated as part of the sort process. The net result is that fewer records are actually sorted, resulting in more efficient execution.
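A classic Top-N query is the sort of statement where this kicks in; the sketch below (again with hypothetical table and column names) shows the pattern, where the outer filter on the rank can be evaluated as part of the sort rather than afterwards:

-- Top 10 employees by salary; the rnk <= 10 predicate can be pushed
-- into the sort so the full data set never needs to be completely ranked
SELECT *
FROM  (SELECT emp_id,
              salary,
              RANK() OVER (ORDER BY salary DESC) AS rnk
       FROM   employees)
WHERE rnk <= 10;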

 

Summary 

Overall, these three optimizations ensure that as few sorts as possible are performed when you include SQL analytical functions as part of your SQL statements. 

 


Friday, 21 February 2014

Oracle Technology Network's FREE Virtual Database Developer Day in EMEA on March 4 @9:00am GMT


 

Oracle Technology Network's FREE Virtual Developer Day is coming to EMEA on March 4 starting at 9:00am GMT. This fantastic online event for DBAs and Database Developers is focused on building and deploying Oracle Database apps using modern techniques. My SQL pattern matching workshop will be at 11:45am GMT and lasts for 45 minutes. This is the same event that we ran in the US earlier this month and the response was amazingly positive from all those people that joined us.

To make things amazingly easy for you and to make sure that you get the best experience from this event we are providing a pre-built virtual machine which contains all the software pre-installed and configured along with all the workshops for the two Database 12c tracks: DBA and Developer. You will find the links to download the VM on the event registration page.

The 12c DBA track will be running the following sessions:

09:00 AM - Oracle VM performance with Oracle Database 12c
10:15 AM - Automatic Data Optimization (ADO) features of Oracle 12c
11:00 AM - RAC vs. Multitenant
11:45 AM - Pattern matching in SQL
12:30 PM - Database as a Service in a DBA's World

The 12c Developer track will run the following sessions:

09:00 AM - SQL Data Modeler 4.0
10:15 AM - SQL Developer 4.0
11:00 AM - XML Database
11:45 AM - Building Mobile Web Applications with Oracle Application Express 4.2
12:30 PM - Best Practices for Performance, Scalability, & Availability with Oracle Database12c

If you are interested in big data there are two tracks for big data developers and DBAs that cover all the basic concepts of Hadoop, how to process data (social media streams) on Hadoop and then push data to Oracle Database. 

Of course you can mix and match topics across all four of these streams, and don't worry if you end up wanting to attend two sessions running at the same time - you won't miss out on anything because all the workshops are included in the VM. To register for this event please follow this link: https://oracle.6connex.com/portal/database2014EMEA/login?langR=en_US&mcc=socspk

During my SQL pattern matching workshop (@11:45am, March 4) I will be on the live Q&A chat to answer any questions so mark March 4 in your diaries and I hope to see you online during the event.

 


Thursday, 6 February 2014

Sessionization with 12c SQL pattern matching is super fast

To help us evaluate our new 12c pattern matching feature our development team ran a series of performance tests using the TPC-H schema. The data set contained approximately seven years of data which consumed about 1TB of space. As part of the tests we ran two different searches:

  • the first search was based on a sessionization pattern using the orders table and was compared to using a window function
  • the second search was based on a W-pattern using the line-item table and was also compared to using a window function

Let's examine these two use cases in more detail.

Sessionization

To create the first part of the sessionization workflow we took the original source data and used the USER_ID as the PARTITION BY key and the timestamp for the ORDER BY clause. In this example we defined a session as a sequence of one or more events with the same partition key (USER_ID) where the gap between the timestamps is less than 10 seconds - obviously the figure for the gap is completely arbitrary and could be set to any number as required. The objective for this first step is to detect the various sessions and assign a surrogate session id to each session within each partition (USER_ID). This creates an output result set that delivers a simplified sessionization data set as shown here:

[Figure: simplified sessionization result set]

The SQL to create the initial result set is as follows:
SELECT user_id, session_id, start_time, no_of_events, duration
FROM Events MATCH_RECOGNIZE
 (PARTITION BY User_ID ORDER BY Time_Stamp 
  MEASURES match_number() session_id, 
           count(*) as no_of_events,
           first(time_stamp) start_time, 
           last(time_stamp) - first(time_stamp) duration 
  ONE ROW PER MATCH 
  PATTERN (b s*) 
  DEFINE 
       s as (s.Time_Stamp - prev(s.Time_Stamp) <= 10) 
 )
;
As a comparison, here is how to achieve the above using analytical window functions:
CREATE VIEW Sessionized_Events as
SELECT Time_Stamp, User_ID,
 Sum(Session_Increment) 
 over (partition by User_ID order by Time_Stamp asc) Session_ID
FROM ( SELECT Time_Stamp, User_ID,
 CASE WHEN (Time_Stamp - 
 Lag(Time_Stamp) over (partition by User_ID 
 order by Time_Stamp asc)) < 10
 THEN 0 ELSE 1 END Session_Increment
 FROM Events)
;
SELECT User_ID, Min(Time_Stamp) Start_Time,
 Count(*) No_Of_Events, Max(Time_Stamp) -Min(Time_Stamp) Duration
FROM Sessionized_Events
GROUP BY User_ID, Session_ID;

As you can see the non-12c approach is a little more complex to understand but it produces the same output - i.e. our initial sessionized data set.
However, to get business value from this derived data set we need to do some additional processing. Typically, with this kind of analysis the business value within the data emerges only after aggregation, which in this case needs to be by session. We need to reduce the data set to a single tuple, or row, per session with some derived attributes:
  • Within-partition Session_ID
  • Number of events in a session
  • Total duration 
To do this we can use the MATCH_RECOGNIZE clause to determine how many events are captured within each session. There are actually two ways to do this: 1) we can compare the current record to the previous record, i.e. peek backwards, or 2) we can compare the current record to the next record, i.e. peek forwards.
Here is code based on using the PREV() function to check the current record against the previous record:
select count(*) 
from ( select /* mr_sessionize_prev */ *
 from
 ( select o_pbykey, session_id, start_time, no_of_events, duration
 from orders_v MATCH_RECOGNIZE
 (
  PARTITION BY o_pbykey
  ORDER BY O_custkey, O_Orderdate
  MEASURES match_number() session_id, count(*) as no_of_events, 
           first(o_orderdate) start_time,
           last(o_orderdate) - first(o_orderdate) duration
  PATTERN (b s*)
  DEFINE s as (s.O_Orderdate - prev(O_Orderdate) <= 100)
 )
 )
 where No_Of_Events >= 20
); 
Here is code based on using the NEXT() function to check the current record against the next record:
select count(*) 
from ( select /* mr_sessionize_next */ *
 from
 ( select o_pbykey, session_id, start_time, no_of_events, duration
 from orders_v MATCH_RECOGNIZE
 (
PARTITION BY o_pbykey
ORDER BY O_custkey, O_Orderdate
MEASURES match_number() session_id, count(*) as no_of_events, 
           first(o_orderdate) start_time,
           last(o_orderdate) - first(o_orderdate) duration
PATTERN (s* e)
DEFINE s as (next(s.O_Orderdate) - s.O_Orderdate <= 100)
 )
 )
 where No_Of_Events >= 20
); 
Finally we can compare the 12c MATCH_RECOGNIZE code to the typical 11g code using window functions (which in my opinion is a lot more complex):
select count(*)
from (
 select /* wf */ *
 from (select O_pbykey, Session_ID, min(O_Orderdate) Start_Time, 
               count(*) No_Of_Events,
              (max(O_Orderdate) - Min(O_Orderdate)) Duration
        from ( select O_Orderdate, O_Custkey, o_pbykey, 
                      sum(Session_Increment) 
                      over(partition by o_pbykey order by O_custkey, O_Orderdate) Session_ID
               from ( select O_Custkey, O_Orderdate, o_pbykey,
                              case when (O_Orderdate -
                             Lag(O_Orderdate) 
                                 over(partition by o_pbykey 
                                      order by O_custkey, O_Orderdate)) <= 100 -- Threshold
                             then 0 else 1 end Session_Increment
                      from orders_v
                    )
            )
       group by o_pbykey, Session_ID
    )
 where No_Of_Events >= 20
);

The output would look something like this:
[Figure: aggregated per-session output]

The performance results for these three approaches are shown below: column 3 shows the SQL using window functions and columns 4 & 5 show the performance figures for MATCH_RECOGNIZE based on using either PREV() or NEXT(). As you can see from the results, it is worth investigating whether your comparison test should peek forwards or backwards because there can be differences in performance. In general the 12c MATCH_RECOGNIZE code is between 1.5x and 1.9x faster compared to using window functions.

[Figure: performance comparison of window functions vs. MATCH_RECOGNIZE using PREV() and NEXT()]