22##[图]OLAP Cubes and Logical Models

//
OLAP Cubes and Logical Models - Open Knowledge Labs
http://okfnlabs.org/blog/2014/01/20/olap-cubes-and-logical-model.html

Last time we talked about OLAP in general – what it is and why it is useful. Today we are going to look at the data – how they are structured and why? What are cubes? What does it mean “multi-dimensional”?
Data Cubes and Logical Model
Application data might be a mess from user’s perspective. Not only that, data might be scattered all around the place in multiple systems. Even when the data would be put into one place called “data warehouse”, they will still have their original form which is not ready to answer our questions quickly. Purpose of the logical model is to hide physical structure of the data (how applications use it) and provide user-oriented view of the data (how business sees it).
“Answering questions quickly” does not depend only on database performance and amount of data. We might have the fastest database and computation engine in the world, but we will not get the answer quickly because it will take weeks to properly translate the human (business) question into technical terms. The challenges are:
Where are the data stored? What table? Which column?
What are the categories and what can I summarize?
What are the relationships between columns?
Is this category_id
column the same as this pk_prod_cat
?
Does this column contain a key (which is unique) or is the a label (which might be not, due to data evolution)?
How can I group the data?

All this information is collected in metadata called logical model. Analysts or report writers do not have to know where name of an organisation or category is stored, nor he does not have to care whether customer data is stored in single table or spread across multiple tables (customer, customer types, …). They just ask for “customer name” or “category code”.
Cubes
The data structures used in the OLAP are multidimensional data cubes or OLAP cubes:


Cube is a data structure that can be imagined as multi-dimensional spreadsheet. How we can imagine it? Take a spreadsheet, put year on columns, department on rows – that’s two-dimensional cube. Now create multiple sheets with data of the same structure, say one sheet per country. Now you have three-dimensional cube.
Facts and Measures
Fact is most detailed information that can be measured.

Example of a fact might be a contract, a spending, a phone call, a visit. We can measure:
contract: financial amount, discount, planned amount
spending: financial amount, quantity
phone call: duration, cost
visit: duration

Those measurable properties, such as amount, discount or duration are called measures.
We are mostly interested in summarized view: “what was the overall spending?”, “what is the average call duration?” or “how many contracts are there?” Those computed values are called aggregates or aggregated measures.
Facts might have multiple measures or they might even have none. If there are no measures we still can at least answer questions of type “how many?”.
Note: The terminology might differ slightly in various literature and systems. For example, Microsoft calls measure a measure group and they label aggregates as measures.
Dimensions
OLAP is suitable mostly for data which can be categorized – grouped by categories. The categorical view of data should be also the main interest of the data analysis. Example of categories might be: color, department, location or even a date.
The categories are called dimensions.
22
Dimensions provide context for facts:
Where did that happen?
When was the contract signed?
What kind of goods or services was in the contract?

Dimensions are used to filter queries:
What was the spending last year?
How many contracts signed by the department of Health?

They are used to control scope of aggregation of facts:
What was the number of contracts by department?
What was the average visit duration per month?
What are the sales of each product?

Concept Hierarchies
We might be interested in amount per year, then per month for particular year; products can be grouped by categories and subcategories; location might be defined by country, country might have multiple cities… Those are concept hierarchies of dimensions.
Hierarchy has multiple levels and there might be various hierarchical views of any dimension. For example the date might be split by year, month and day. Or it might be split by year, quarter, month and no day (because we have no daily data) or by year and week (for weekly data).
From technical perspective you might associate an attribute with a dimension. Depending on the modelling method a dimension might be composed of just one attribute or multiple attributes grouped by hierarchies.
Note: there are multiple approaches to concept hierarchies. The one described here is: Dimension might be composed of multiple levels and the levels are grouped into hierarchies. Another approach might be “hierarchies are lists of dimensions” where a dimension represents just single attribute.
Slicing and Dicing
We have a data cube full of facts, how can we explore the data? We slice the cube! What does that mean?
Say we have a data cube of contracts with dimensions: time, country and type (of procured subject)
We might be interested in spending in 2010:


… or contracts in Estonia:

… or contracts in Estonia in 2010:

… or just IT contracts in general:
[图片上传中。。。(7)]
IT contractsin Estonia in 2010:
[图片上传中。。。(8)]
Some OLAP systems might have this information readily available in a pre-computed (pre-aggregated), therefore we might get the answer very quickly despite of huge amount of original data. Even if the system does not store the pre-aggregated data cells, it might use some other transparent tricks to achieve fast responses.
Slicing and dicing is an operation that filters the data cells of a cube and narrows our focus from broader view:

Drilling down
How many contracts per year? or Which type of products was most wanted in 2012? are kind of questions that are answered by “drilling down” through the data. Drilling down means changing our focus to more detailed data.
Drilling down can be done by concept hierarchies – for example going from year summary to month summary to daily sales or by going from country level to regional level.
The opposite operation is called “roll-up” – for example going from a monthly view to a yearly view.
Try It
You might try OLAP with light-weight Python framework Cubes. I’ll be talking about the framework in more details in the future, meanwhile here are the main features:
ROLAP – OLAP on top of relational database
quick prototyping on top of existing database schemas
metadata driven with user-oriented metadata
localizable
OLAP API with HTTP server
no need to know Python

The development version includes pluggable datawarehouse (cubes from external sources) and many new backends such as MongoDB.
For reporting and data exploration you might use CubesViewer. More visualisation software being developed.
Summary
Concept of OLAP cubes and multidimensional modeling brings more understandable and usable data to the end-users. It very easy and straightforward to translate business questions into multidimensional query.
The OLAP systems, thanks to the nature of multi-dimensional data cubes, can prepare data by aggregating them up-in-front to provide answers faster.
Moreover, explicit metadata (logical model) allows not only more flexible data navigation but also easy transformation of the data to be used in various reporting software. Some OLAP tools can work with certain database schemas immediately.
To sum it up in few words, the multidimensional modeling of OLAP cubes brings: understandability, better usability, speed and logical data reusability.
Next time we will look at the Cubes – Lightweight Python framework – how to have an OLAP server running “in 15 minutes”.

最后编辑于
©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 151,511评论 1 330
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 64,495评论 1 273
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 101,595评论 0 225
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 42,558评论 0 190
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 50,715评论 3 270
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 39,672评论 1 192
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 31,112评论 2 291
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 29,837评论 0 181
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 33,417评论 0 228
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 29,928评论 2 232
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 31,316评论 1 242
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 27,773评论 2 234
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 32,253评论 3 220
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 25,827评论 0 8
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 26,440评论 0 180
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 34,523评论 2 249
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 34,583评论 2 249

推荐阅读更多精彩内容