Apache Kylin 1.5 Updates

The Evolution of Apache Kylin
Realtime & Plugin Architecture in Kylin 1.5
Li, Yang | 李扬

Agenda
 What’s Apache Kylin?
 New Features in Kylin 1.5
 Plugin Architecture
 Fast Cubing
 Parallel Scan
 Streaming Cubing
 User Defined Aggregation
 Summary

Extreme OLAP Engine for Big Data
Kylin is an open source Distributed Analytics Engine from eBay that
provides SQL interface and multi-dimensional analysis (OLAP) on
Hadoop supporting extremely large datasets
What’s Kylin
kylin / ˈkiːˈlɪn / 麒麟
--n. (in Chinese art) a mythical animal of composite form
• Open Sourced on Oct 1st, 2014
• Accepted as Apache Incubator Project on Nov 25th, 2014

Feature – SQL Interface
Hive Table
Build Cube
(Index)
SQL Query

 eBay
Feature – Big Data
Case Cube Size Raw Records
Session Analysis 20 TB 81+ billion rows
Traffic Analysis 30 TB 28+ billion rows
Transaction Analysis 560 GB 1.2+ billion rows

90% queries <5s
Dark-blue line: 90%tile queries
Light-blue line: 95%tile queries
90%ile query returns in 3 seconds
Feature – Low Latency

Feature – BI Integration via ODBC, JDBC

Linear scale out with more nodes
Feature – Scalable Throughput

Cube Builder (MapReduce…)
SQL
Low Latency -
SecondsRouting
3rd Party App
(Web App, Mobile…)
Metadata
SQL-Based Tool
(BI Tools: Tableau…)
Query Engine
Hadoop
Hive
REST API JDBC/ODBC
 Online Analysis Data Flow
 Offline Data Flow
 Clients/Users interactive with
Kylin via SQL
 OLAP Cube is transparent to
users
Star Schema Data Key Value Data
Data
Cube
OLAP
Cubes
(HBase)
SQL
REST ServerDataSource
Abstraction
Engine
Abstraction
Storage
Abstraction
Plugin Architecture Overview

MR Engine
IN OUT
Hive
Source
HBase
Storage
Cube Metadata
SourceFactory StorageFactoryEngineFactory
Plugin Architecture

MR Engine
Plugin Architecture
Hive Adapter HBase Adapter
load data save cubeHive
Source
HBase
Storage
adapt to IN adapt to OUT

 Engine
 MR V1
 MR V2
 Spark (early)
 Streaming (experimental)
 Source
 Hive
 Kafka
 Spark SQL & DataFrames
 Storage
 HBase
 ? Kudu
 ? Cassandra
Developing Modules

 Freedom
 Zoo break, not bound to Hadoop any more
 Free to go to a better engine or storage
 Extensibility
 Accept any input, e.g. Kafka
 Embrace next-gen distributed platform, e.g. Spark
 Flexibility
 Choose different engine for different data set
The Freedom, Extensibility, Flexibility

Full Data
0-D Cuboid
1-D Cuboid
2-D Cuboid
3-D Cuboid
4-D Cuboid
MR
MR
MR
MR
MR
A,B,C,D
A,B,C A,B,D A,C,D B,C,D
Layered Cubing (MR Engine V1)
 Pros
 Simple implementation, depends
on MR shuffle to merge sort and
then aggregate
 Little requirement on memory
 Cons
 Aggregation happens at reducer
side
 Mapper outputs raw data thus
shuffle is huge
 Multiple rounds of MR overhead
 Shuffle can be 100x of cube size,
big I/O pressure

mapper mapper mapper
reducer
Fast Cubing
 Pros
 In-mem cubing algorithm that can
be reused by Streaming, Spark etc.
 Mapper side aggregation
 Lesser shuffling given the right data
split
 One round MR
 Cons
 Code complexity
 High mapper CPU/Mem
consumption
Data Split Data Split Data Split
……
Final Cube
Merge Sort
(Shuffle)

 If data splits are unique
 Fast cubing wins
 If data splits are common
 Layer cubing wins
 New cube engine chooses
the right algorithm based on
data sampling.
 Overall build time is 1.5x
faster, sum results from 500
jobs.
Fast Cubing (MR Engine V2)

 Slow queries are 5-10x
faster.
 New Hbase storage
enables partition on
cuboids that are big
enough.
 Overall query time is 2x
faster than before, sum
results from 10,000+
queries.
Parallel Scan
Query
Cuboid A
Cuboid B
Query
A1 B1
A2 B2
A3 C
Cuboid C
Server 1
Server 2
Server 3
Server 1
Server 2
Server 3

Near Realtime Incremental Build
 Minutes micro cubes
 Kafka source
 In-mem cubing
 Auto merge

Cube StorageReal-time In-Mem Store
streaming Kafka
SQL Query
minute batch
Latest second
Inverted
Index
Hybrid Storage
Interface
Cube
Future Lambda Architecture for Realtime

Use Case: SEO Operational Dashboard
 eBay Site
 ebay.com, ebay.co.uk, ebay.de
 Buyer Country
 US, CN, RU
 Search Engine
 Google, Bing, Yahoo!
 Referrer
 google.com, google.co.uk
 Page
 Search, View Item, Product
 User Experience
 Desktop, Mobile APP, mWeb
• Visits, GMB $, GMB share,
conversion rate, bounce rate, # of
view items, # of bought items etc.
Dimensions
Measurements

 HyperLogLog Count Distinct
 TopN
 BitMap Precise Count Distinct
 from Sun, Yerui (netease.com)
 Raw Records
 from Wang, Xiaoyu (jd.com)
 Domain specific aggregations now become easy
 aggregate user events to detect time serials or access patterns
 draw a sketch of certain user groups
 pre-calculate clusters of data points
 histogram…
User Defined Aggregation Types

DT,LOC TopN
2015-10-1,CN Item A, $500
Item B, $300
…
TopN Support
select dt, loc, item, sum(gmv)
from test_kylin_fact
where dt=‘2015-10-1’ and loc=‘CN’
group by dt, loc, item
order by 4 desc
limit 100 cube pre-calculation
 TopN as a measure
 Approximate algorithm
 SpaceSaving TopN
 Ahmed Metwally, et al. “Efficient computation of frequent and top-k elements in data streams”.
Proceeding ICDT'05 Proceedings of the 10th international conference on Database Theory, 2005.
 A parallel version
 Massimo Cafaro, et al. “A parallel space saving algorithm for frequent items and the Hurwitz zeta
distribution”. Proceeding arXiv: 1401.0702v12 [cs.DS] 19 Setp 2015.
 Answer TopN queries directly from pre-calculation

 Works with Tableau 9.1
 Works with MS Excel
 Works with MS Power BI
ODBC Enhancement

 New in Apache Kylin 1.5
 Plugin-able architecture
 New MR Cube Engine with fast cubing (1.5x faster)
 New HBase Storage with parallel scan (2x faster)
 Near real-time analysis (experimental)
 User defined aggregations
 Excel / PowerBI / Zeppelin integration
Summary

Apache Kylin 1.5 Updates

Recommended

More Related Content

What's hot (20)

Similar to Apache Kylin 1.5 Updates (17)

Recently uploaded (20)

Apache Kylin 1.5 Updates

Editor's Notes