Monday, 25 August 2014

OpenWorld on your iPad and iPhone

Most of you probably know that each year I publish a data warehouse guide in iBook and PDF formats for OpenWorld which contains links to the latest data warehouse videos, a calendar for the most important sessions and labs and a section that provides profiles and relevant links for all the most important data warehouse presenters. For this year’s conference I have made all this information available in an HTML app that runs on most smartphones and tablets. The pictures below show the HTML app running on iPad and iPhone.

IPad OOW IPhone OOW

 

This exciting new web app contains information about why you should attend OpenWorld - just in case you have not yet booked your ticket! - as well as the following information:

  • Getting to know 12c - a series of video interviews with George Lumpkin, Vice President of Data Warehouse Product Management
  • Your presenters - full biographies and links to social media sites for all the key data warehouse presenters
  • Must sees sessions - list of all the most important data warehouse presentations at this year’s conference
  • Our customers - profiles our most important data warehouse customers
  • Must attend labs - list of all the most important data warehouse hands-on labs at this year’s conference
  • Links - a list of links to the most important data warehouse sites

If you want to run these web apps on your smartphone and/or tablet then follow these links:

Android users: I have tested the app on Android and there appears to be a bug in the way the Chrome browser displays iframes because scrolling does not work . The app does work correctly if you use either the Android version of the Opera browser or the standard Samsung browser.

If you have any comments about the app (content you would like to see) then please let me know. Enjoy OpenWorld.

  

Technorati Tags: , ,

Friday, 27 June 2014

My Data Warehousing and Big Data Must-See Guide for OpenWorld

Data Warehousing Must-See Session Guide for Open World 2014
IBook2014

There’s so much to learn at this year’s Oracle OpenWorld - it provides more educational and networking opportunities than any other conference dedicated to Oracle business and technology users. To get the most from this year’s event I have prepared an initial guide which lists all the must important data warehousing and big data sessions. The first part of the guide provides links to videos and content from last year’s event so you can either re-live the highlights from last year or see what you missed by not being there! If you have an iPad or use OSX 10.9 then you will want to download the iBook version because this contains video highlights from last year’s database keynote session.

The session guide is divided into the following chapters:

  • Data Warehousing
  • Performance and Scalability
  • Database Analytics
  • Industry Data Models and Data Integration
  • Unstructured Data
  • Big Data

The last part of the booklet lists all the key data warehouse and big data product managers who will be presenting at this year’s conference. Included alongside each speaker’s biography are links to their social media sites and blogs.

Please note that, as usual, there will be hands-on labs at this year’s OpenWorld but these have not been included at in the session catalog. I am expecting them to be added shortly. Details of all our labs will appear in the more detailed guide that I will publish before the conference begins.

The Must-See guide is available in two formats: 

For iPad and OSX 10.9 (Mavericks) users please download the iBook, which is available here: https://dl.dropboxusercontent.com/u/69436560/OOW/Must-See%20OpenWorld%202014_Sessions.ibooks

For all other platforms please download the PDF, which is available here: https://dl.dropboxusercontent.com/u/69436560/OOW/Must-See%20OpenWorld%202014_Sessions.pdf

 

 Hope to see you in September at OpenWorld. 

Friday, 20 June 2014

Part 3: DBAs guide to building and deploying sandboxes

Recently I started a series of blog posts covering the need for a private, secure and safe area for data discovery within the data warehouse ecosystem. The most common name for this type area is a “sandbox”. The demand for sandboxing is growing rapidly as many companies start exploring the new types of data streams linked to “big data”. In the first of my series of blog posts I started with an overview of the basic elements of sandboxes in the cloud. In the second post I looked at the differences between sandboxes vs. data marts.  As part of the first posts I put forward the idea of a sandbox lifecycle model, as shown below, and in this post I am going to explore the first stage of that lifecycle model: the BUILD stage:

 

Many companies are actively working on lots of different big data led projects: customer sentiment analysis, clickstream analysis, product/service recommendations, real time event monitoring etc. Consequently there are lots of different project teams wanting to evaluate a wide variety of new data sources inside their own private space. They are looking for new customers or segments to target, new product opportunities, new ways to retain customers and ways to predict system faults before they happen. Consequently, DBAs are being flooded with requests to create new sandbox environments. The process of building, or deploying, a sandbox can be based on one of two approaches:

  • Deliver an empty environment: Project team then finds the required data sources and loads them into the sandbox before starting their analysis
  • Delivered a pre-populated environment: Project team loads any additional new data sources into the sandbox before starting their analysis

At the moment we can easily support both of these scenarios. Below is a review of the features available in earlier, pre-12c, versions of the database followed by new features that are part of Database 12c. Hopefully, if you are already using sandboxes and looking for a reason to move to Database 12c then this post will help you make the case for moving your data warehouse to 12c.

 

Building Sandboxes pre-Database 12c

There are a number of ways to create sandboxes using pre-12c versions of the Oracle Database. The two most popular choices that I have seen used are Database Configuration Assistant and Workspace Manager. 

Database Configuration Assistant

Creating an empty sandbox is a relatively easy for DBAs and business users. The Database Configuration Assistant (DBCA) provides a wizard for creating and configuring an empty, but ready to go, database. This tool is very flexible and allows DBAs to control all aspects of the database creation process. For most business users the wizard provides far too many parameters to configure so Oracle provides three default configuration templates which provide default values for all the main configuration parameters. The pre-built templates include: general purpose, data warehouse and custom. The DBA can add additional templates for specific types of sandboxes by simply creating new template files and this is covered in the database documentation (see here: http://docs.oracle.com/cd/E16655_01/server.121/e17643/install.htm#BABEGFCG). The information in each template includes details of database options, initialisation parameters, and storage attributes (for data files, tablespaces, control files, and online redo logs). A common approach might be to create templates to configure large, medium and small sandboxes with appropriate settings for the specific hardware platform being used.

Once the new sandbox is ready the business users/data scientists can then start loading data. I am not going to over the data loading process, however, please note that this can be done using a data integration tool such as Oracle Data Integrator or flat files. In fact ODI now has a scripting language called Groovy which allows ETL developers to create easy-to-use dialog for selecting a table/view from a source system and copy its data to a target schema. For more information about ODI’s Groovy scripting language checkout the posts on the data integration blog: https://blogs.oracle.com/warehousebuilder/tags/groovy.

Overall, creating empty sandboxes using DBCA is a simple process. However, this approach does mean that data is replicated from the original database and each instance requires its own dedicated hardware resources (I/O, CPU and memory) which cannot be easily transferred to other sandboxes. Every time a new sandbox is required the DBA and systems administrators have to ensure that sufficient memory for managing the data along with all the typical database background processes, storage and CPU are available. As more and more sandboxes are requested and deployed these replicated overheads quickly consume the available server resources and this limits the number of sandboxes that can be run on a specific platform. Having dedicated, non-sharable, resources is extremely limiting and very costly.

Many teams find that trying to get approval to create a new database is a long and  painful process. Sometimes it can months to work through the bureaucracy and ensure all the pieces are in place ready to create the new database: storage, processing nodes, network connectivity etc.

Key Benefits: relatively simple process for creating new, empty, databases;

Key issues: requires dedicated system resources; cannot copy an existing database with its schemas and data; no simple mechanism for moving discoveries (objects, data etc) to production environment; may take a long time to provision due to internal bureaucracy

Workspace Manager

If projects require pre-populated sandbox environments then creating a sandbox with access to existing data is best done using Oracle Workspace Manager. This provides the ability to manage current, proposed and historical versions of data within the same database instance so there is no need to move data. Workspace Manager creates a self-contained virtual area within a database where users can update existing data without affecting the original data. This virtual area isolates all the changes until they are explicitly merged with production data or discarded. This means there is no need to duplicate existing data and any new data can be loaded in a controlled way without impacting the production data.

Users in a workspace always see a transactionally consistent view of the entire database; that is, they see changes made in their current workspace plus the original data in the database as it existed either when the workspace was created or when the workspace was last refreshed with changes from the parent workspace. Workspace Manager even supports a hierarchy of workspaces for very complex sandbox requirements involving multiple teams and/or project goals.

There is a great example of how to use Workspace Manager for what-if analysis on the Oracle Learning Library. The set-by-step tutorial shows you how to use workspaces to do location-based analysis on current and proposed data. The fictitious company MyCompany is planning to build a new warehouse to provide better service to its customers. Two potential sites are under consideration. The company wants to use SQL to analyse the prospective sites. To do this, data for both prospective sites must be entered into the production warehouse schema. However, this new data must be kept separate from the existing production data so that it does not impact the work of employees who are not part of the site selection team. Using Workspace Manager it is possible to isolate data for each of prospective sites and allow the two site selection teams to work concurrently. This use case creates the following hierarchy of workspaces, as shown below:

 version-enabled warehouse table and associated team workspaces: SITE1, SITE2, and LIVE. 

You can take this sandbox tutorial by following the steps on the Oracle Learning Library page, see here: https://apex.oracle.com/pls/apex/f?p=44785:24:15780601154836::NO:24:P24_CONTENT_ID%2CP24_PREV_PAGE:4553%2C29

One of the challenges of using workspaces relates to resource management and the ability to increase or decrease resources which is important for overall resource utilisation.

Obviously at some point users will want to add new data into the workspace and as with the DBCA overview, ETL developers can use ODI’s scripting language called Groovy to create easy-to-use dialog for selecting a table/view from a source system and copy its data to a target schema. For more information about ODI’s Groovy scripting language checkout the posts on the data integration blog: https://blogs.oracle.com/warehousebuilder/tags/groovy

Key Benefits: relatively simple process; no need to copy/clone data; easy to move discoveries to production

Key issues: requires careful workload management; monitoring and chargeback is difficult to achieve

For more information about Oracle Workspace Manager visit the product home page on OTN: http://www.oracle.com/technetwork/database/enterprise-edition/index-087067.html.

Summary 

In general, it is likely that most customers will use both approaches for delivering sandboxes to their project teams. Where there are hundreds of projects taking place at the same time then both approaches will probably create additional work for the DBA and systems administrators in terms of having to monitor and track lots of workspaces and/or plan-monitor-track resource usage across lots of database instances. Ideally what we need is an approach that can easily support the deployment of thousands of sandboxes, offers simple and fast provisioning, leaves data in place to prevent unnecessary movement of large volumes of data, provides a single centrally managed platform and, most importantly, allows for expansion/reduction of resources as needed across all sandboxes. Is this possible? 

 

Building Sandboxing with Database 12c

The good news is that the new features we introduced as part of Database 12c combine the all best parts of the sandboxing features from the pre-12c approach with none of the drawbacks. Specifically, Database 12c introduced the concept of a multitenant database. This was originally positioned as a way to consolidate databases and applications under the control of a single environment but the approach works amazingly well for sandboxes. Therefore, with 12c you can now deploy sandboxes into the cloud!

This new feature (Oracle Multitenant) delivers a new architecture that provides a multitenant container database to hold a series of pluggable databases, or in this case sandboxes. There is a great quick overview of the new multitenant feature available on YouTube, follow the link below to watch the video:

Video: Oracle Multitenant Architecture

With the new multitenant architecture featuring pluggable databases (sandboxes) it is now possible to create a single container database that allows multiple sandboxes to be simply plugged in. Below is a simple overview of the main multitenant terminology. When we start to create a new multitenant sandbox environment we plug-in our multiple pluggable sandboxes to the root database. The combination of root database and pluggable sandboxes (PDBs) is referred to as a “Container Database” or CDB as shown below:

Architecture

 

The great thing about the multitenant feature us that it offers a very fast and efficient way of creating new sandboxes. It is a bit like the template feature in Database Configuration Assistant with one important difference: deploying a new pluggable sandbox takes seconds as shown here:

Clone

the above is taken from the Mutitenant info graphic which is here: http://www.oracle.com/us/products/database/oracle-multitenant-infographic-1961308.pdf. The actual speed at which we can deploy pluggable sandboxes is outlined in the graph below:

P3

 

The first bar on the graph shows the time taken to create a “standard” database, which involves coping template data files and then configuring the instance which can take a considerable amount of time. The second bar shows the time taken to create a new pluggable sandbox, which is significantly faster. The third bar shows that we can deliver pluggable sandboxes in real time by using the built-in “clone” feature - this allows us to deploy a new pluggable sandbox which is able to “see” the source data in the PDB which acted as the source for the clone. The cloned-pluggable sandbox also gets access to new/refreshed data as it is changed in the source system. This means that our cloned pluggable sandbox is always up to date with the very latest data. 

What is clone?

Provisioning a sandbox that is a copy of an existing database so the project teams have access to existing data sets used to be a lengthy multistep process.  Multitenant provides a capability called cloning. It is now possible to create local clones (from a pluggable database in the same container database) or remote clones (across a database link to a remote container database). What are the benefits of using this feature? Firstly, it is very simple - a single SQL statement. Secondly, it keeps the overall security requirements nice and  simple  because all that is required is database access and not OS access. Lastly, business users and data scientists like it because the whole process is so fast. The process of creating full clones is fast. Even faster—ludicrously fast—is the process for creating snapshot clones. Snapshot clones are built on a capability of the underlying file system—where available—called copy-on-write. This makes it possible to request and deploy a pre-populated pluggable sandbox in a matter of seconds!

Now we understand the overall architecture and speed at which we can deploy new sandboxes, how do you actually request a new sandbox/PDB?

Self-Service provisioning of sandboxes

In prior releases of the Oracle Database a lot of the tools and features that were used for sandboxing were typically aimed at DBAs. Today, many business users and data scientists want to control and own the process of requesting and building a sandbox. As part of Database 12.1.0.2 we have provided a new self-service application that DBAs can make available to their business teams for self-service provisioning.  This new application provides is a simple 4-step process for creating a new pluggable sandbox :

Step 1: Start the APEX self-service provisioning app and click on the “New Database” link:

Build 1

Step 2: Select to create a new sandbox/pluggable database or plug-in an existing database that has been moved from another container. In the majority of cases business users will typically provision a new sandbox:

Build 2

Step 3: Determine if you are going to create a new empty sandbox or clone an existing database (mart or warehouse or existing sandbox). In the example below we have selected and existing sandbox MIKETEST and the option to clone this database is available at the bottom of the screen. 

 

Build 3

 

Step 4: Check the details and confirm the deployment. 

Build 4

 

As mentioned earlier the speed at which pluggable sandboxes can be created means that business users can request a new pluggable sandbox and have the environment ready for use in a matter of seconds (of your course your mileage might vary according to your hardware!).

Workspace Manager and Multitenant

In many cases the will still be a need to have an environment where the same data sets can be shared across multiple project teams with each team work independently on their own view of the data. This is exactly the sort of use case for Workspace Manager, as described earlier, and the good news is that Workspace Manager functions transparently in a multitenant architecture. Workspaces benefit from the efficient administration of one multitenant container database, and the separation and resource prioritisation allowed by multiple pluggable databases. 

SQL Developer and Multitenant

For data scientists and more sophisticated business users the latest version of SQL Developer is now multitenant aware. Therefore, rather than using the self-service application, all the required operations for creating and managing a sandbox are now available from within SQL Developer as shown here:

SQLDeveloper 1

 From within SQLDeveloper you can: deploy a sandbox (CREATE PLUGGABLE or CLONE PLUGGABLE), move a sandbox (UNPLUG and PLUG) and delete a sandbox (DROP PLUGGABLE)

SQLDeveloper 2

 

SQL Developer offers fine grained control and most business users and data scientists will probably need some guidance from their DBA on the various parameter settings for deploying a new sandbox. For more information about using SQL Developer to manage ( creating, modifying, plugging/unplugging) pluggable sandboxes see the online tutorial on the Oracle Learning Library: https://apex.oracle.com/pls/apex/f?p=44785:24:113456037406764:::24:P24_CONTENT_ID%2CP24_PROD_SECTION_GRP_ID%2CP24_PREV_PAGE:7649%2C%2C24

Summary

Database 12c offers significant advantages for delivering sandboxes - deploying pluggable sandboxes into the cloud is now simple and really fast. The fast and efficient deployment of pluggable sandboxes can be delivered as a self-service approach using a variety of tools and applications. These new pluggable sandboxes also support the existing data isolation features such as Workspaces. Overall, Database 12c makes sandboxing faster, simpler and more efficient.

 

Conclusion

In this latest post on sandboxing I have examined the “Build” phase of our BOX’D sandbox lifecycle. For customers who have not yet moved to Database 12c there are a number of tools and features that can be used to deploy sandboxes. Customers who have moved to Database 12c can use the new multitenant feature alongside some of the earlier features to deliver real-time deployment of pre-populated and empty sandboxes.

Wednesday, 28 May 2014

Oracle Data Warehouse and Big Data Magazine MAY Edition for Customers + Partners now available online….

Follow us on Facebook Twitter Blogger
 
The latest edition of our monthly data warehouse and big data magazine for Oracle customers and partners is now available. The content for this magazine is taken from the various data warehouse and big data Oracle product management blogs, Oracle press releases, videos posted on Oracle Media Network and Oracle Facebook pages. Click here to view the May Edition


Please share this link http://flip.it/fKOUS to our magazine with your customers and partners


This magazine is optimized for display on tablets and smartphones using the Flipboard App which is available from the Apple App store and Google Play store





Monday, 12 May 2014

DBAs Guide to Sandboxes vs. Data Marts

I had an interesting response to my first post on the topic of sandboxing (DBA's Guide to Deploying Sandboxes in the Cloud). The following question was asked: what is the difference between a data mart and sandbox?

This is actually a great question so I thought it would be useful to convert my answer into a short blog post. I am sure there will be lots of different opinions on this topic just as there are alternative names for "sandbox environment" (from analytical sandbox, to analytical appliance to discovery zone etc etc) but here is my attempt at an answer:

OLAP1 IndustryDataModels

 

In my experience data marts tend to be a single subject area data repository and/or linked to a specific corporate application (such as finance, HR, CRM, ERP, logistics, sales tracking etc). The source data is pushed to a specific line of business for analysis. The push and loading processes implements all the necessary data cleansing and transformation routines so the data arrives into its destination schema ready for use. Most importantly, the data push happens on a regular basis and is driven by the needs of the business.  Many customers implement a data lifecycle management workflow to ensure that sufficient historical data is available to support the required analysis. In many cases the life of the data mart is largely open-ended and the IT team will ensure that regular backups and all the usual patching and maintenance operations are performed on a regular basis. Where the mart is seen as a mission critical system then high-availability features can be implemented and in extreme cases a disaster recovery site is setup and data synchronised between the production and DR systems.

The schema itself is typically organized to support reporting requirements and will be based around the standard relational models such as star and/or snowflake schemas although this is not a mandatory requirements. Sometimes a 3NF approach is required to support the particular needs of the business. The majority of queries within the mart are "well-defined" and "well-known" and subject to tuning and monitoring by the DBA team. 

A "sandbox" is generally meant as a non-operational environment where business analysts and data scientists can test ideas, manipulate data and model "what if" scenarios without placing an excessive computational load on the core operational processes. It has a finite life expectancy so that when timer runs out the sandbox is deleted and the associated discoveries are either incorporated into the enterprise warehouse, or data mart, or simply abandoned. The primary driver from an organisational perspective is to use a 'fail-fast" approach. At any one point in time an organization might be running any number of analytical experiments spread across hundreds of sandboxes. However, at some point in time those experiments will be halted and evaluated and the "hardware" resources being consumed will be returned to a general pool for reuse by existing projects or used to create environments for new projects. 

In general terms a sandbox environment is never patched or upgraded, except in exceptional circumstances. There is never a great urgency to apply software or operating system patches so the ITM team will  just incorporated these tasks into the normal cycles.

A sandbox should never be considered mission critical so there is no need to implement high availability features or build and manage a DR environment. If a sandbox becomes unavailable due to a fault (hardware or software) there is no pressing urgency to resolve the issue - in Oracle parlance a "sandbox going down" is not a P1 issue for support.

Below is a summary of how I view the differences between marts and sandboxes:


Business Centric Attributes

Data Mart

Sandbox

Business scope

Single subject area

Potentially a mixing pot of data sourced from multiple systems

Core objective

Managing the business

Discovery of new products, markets and/or customer segments

Query scope

Batch reporting for dashboards and pre-configured reports along with limited ad-hoc analysis

Ad –hoc data discovery

Frequency of update

Regular, scheduled data loads

One-off and ad-hoc loads as required

Data Volumes

Driven by external factors such as GRC requirements

Driven by needs of project

Data Quality

Very important – data is fully cleansed and transformed during load process

Raw data is loaded, transformations, cleansing and enhancements are incorporated into discovery process

Typical Output

Historical reports, KPIs, scorecards, multi-dimensional analysis

Data mining models: forecasts, predictions, scoring

Sophisticated analytics (aggregations, spatial, graph etc)

Typical query patterns

Pre-defined patterns returning small data sets, easily tuned

Complex ad hoc queries over massive data volumes



IT Centric Attributes

Data Mart

Sandbox

Mission critical

Yes

No

Performance SLAs

Yes

No

Individual/LOB chargeback for resource usage

Unlikely

Yes

HA features/DR Site

Yes

No

Backups

Full + incremental

No

Patching+Upgrades

As required to resolve specific issues along with scheduled maintenance programs

Only when necessary

Use of beta software

Never

Possibly if project needs specific features

Life expectancy

Unlimited

Limited – typically 90 days or less

Number of instances

Most companies choose to
implent a very small number of subject-specific marts

Large companies likely to have many hundreds of sandoxes running at any given point in time

Table: Key differences between data mart and sandbox

As you can see from this table, in some ways sandboxes are similar to data marts and in other ways they are not. For me, the key difference is in the life expectancy - a sandbox should never outstay its welcome. The best sandbox environments that I have come across are those where strict time limits are enforced on their duration. If you let a sandbox live on too long then you run the danger of it morphing into a shadow data mart and that is a very dangerous situation if you look at the attributes and descriptions listed in the first table (Business Centric Attributes). 

So why is there so much confusion about the differences between marts and sandboxes? Much of this is down to niche vendors trying to jump on specific marketing bandwagons. At the moment the latest marketing bandwagon is the concept of "analytical databases" which in reality is nothing more than a data mart (and in many cases these vendors are simple peddling highly specialised data silos). These niche vendor platforms are simply not designed to run hundreds of environments with resources being continually returned to a centralised pool for redistribution to existing or new projects - which is a core requirement for effective sandboxing.

Over the next 2-3 months I am will share why and how the unique features of Oracle Database 12c provide the perfect platform for supporting environments running hundreds of thousands of sandbox-driven projects.

 

Technorati Tags: , , , , ,

Friday, 9 May 2014

New 12c sessionization analytics workshop now available on OLL

I have just uploaded a new workshop on sessionization analytics using the 12c pattern matching feature, MATCH_RECOGNIZE, to the Oracle Learning Library. The workshop is based on analysis of the log files generated by our the Big Data Lite Movieplex application, which is part of our Big Data Lite virtual machine. Oracle Movieplex is a fictitious on-line movie streaming company. Customers log into Oracle MoviePlex where they are presented with a targeted list of movies based on their past viewing behavior. Because of this personalised experience and reliable and fast performance, customers spend a lot of money with the company and it has become extremely profitable. 

All the activity from our application is captured in a log file and we are going to analyze the data captured in that file by using SQL pattern matching to create a sessionization result set for our business users and data scientists to explore and analyze. The sections in the workshop (I have recorded a video of this workshop, see links below) will step you through the process of creating our sessionization result set using the Database 12c pattern matching features.

BD VM sessionization

 

The workshop and video are available on the Oracle Learning Library using the following links:

For more information (whitepapers, multi-media Apple iBooks, tutorials etc) about SQL pattern matching and analytical SQL then checkout our home page on OTN: http://www.oracle.com/technetwork/database/bi-datawarehousing/sql-analytics-index-1984365.html

Enjoy!

Technorati Tags: , , , , , ,

Tuesday, 29 April 2014

Oracle Data Warehouse and Big Data Magazine April Edition for Customers + Partners

Follow us on Facebook Twitter Blogger
Oracle Data Warehouse and Big Data Magazine APRIL Edition for Customers + Partners

The latest edition of our monthly data warehouse and big data magazine for Oracle customers and partners is now available. The content for this magazine is taken from the various data warehouse and big data Oracle product management blogs, Oracle press releases, videos posted on Oracle Media Network and Oracle Facebook pages. Click here to view the April Edition


Please share this link http://flip.it/fKOUS to our magazine with your customers and partners


This magazine is optimized for display on tablets and smartphones using the Flipboard App which is available from the Apple App store and Google Play store