The Evolution of Apache Kylin
Realtime & Plugin Architecture in Kylin 1.5
Li, Yang | 李扬
Agenda
• What’s Apache Kylin?
• New Features in Kylin 1.5
  • Plugin Architecture
  • Fast Cubing
  • Parallel Scan
  • Streaming Cubing
  • User Defined Aggregation
• Summary
Extreme OLAP Engine for Big Data
Kylin is an open-source distributed analytics engine from eBay that
provides a SQL interface and multi-dimensional analysis (OLAP) on
Hadoop, supporting extremely large datasets.
What’s Kylin
kylin / ˈkiːˈlɪn / 麒麟
--n. (in Chinese art) a mythical animal of composite form
• Open Sourced on Oct 1st, 2014
• Accepted as Apache Incubator Project on Nov 25th, 2014
Feature – SQL Interface
[Diagram: Hive Table → Build Cube (Index) → SQL Query]
Feature – Big Data
• eBay

Case                  Cube Size  Raw Records
Session Analysis      20 TB      81+ billion rows
Traffic Analysis      30 TB      28+ billion rows
Transaction Analysis  560 GB     1.2+ billion rows
Feature – Low Latency
• 90% of queries return in under 5 s
• 90th-percentile query returns in 3 seconds
[Figure: query latency chart; dark-blue line: 90th-percentile queries, light-blue line: 95th-percentile queries]
Feature – BI Integration via ODBC, JDBC
Linear scale out with more nodes
Feature – Scalable Throughput
Plugin Architecture Overview
[Diagram: 3rd Party Apps (Web App, Mobile…) use the REST API and SQL-Based Tools (BI Tools: Tableau…) use JDBC/ODBC to send SQL to the REST Server; Query Engine and Routing answer from OLAP Cubes (HBase, key-value data) with low latency (seconds); the Cube Builder (MapReduce…) builds cubes from star-schema data in Hadoop Hive; Metadata; DataSource Abstraction, Engine Abstraction, Storage Abstraction]
• Online Analysis Data Flow
• Offline Data Flow
• Clients/users interact with Kylin via SQL
• OLAP Cube is transparent to users
Plugin Architecture
[Diagram: Cube Metadata; SourceFactory, EngineFactory, StorageFactory; Hive Source → IN → MR Engine → OUT → HBase Storage]
[Diagram: Hive Source → Hive Adapter (load data, adapt to IN) → MR Engine → HBase Adapter (adapt to OUT, save cube) → HBase Storage]
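The separation between source, engine, and storage can be sketched roughly as below. This is a minimal illustration in Python, not Kylin's actual API: the class names (`Source`, `Storage`, `SumEngine`, `InMemorySource`, `InMemoryStorage`), the registries, and `build_cube` are all hypothetical, and in-memory stand-ins replace Hive and HBase.

```python
from abc import ABC, abstractmethod
from collections import defaultdict


class Source(ABC):
    """Hypothetical source plugin (the slides name Hive and Kafka)."""

    @abstractmethod
    def read_rows(self):
        """Yield (dimension_tuple, measure_value) pairs: the engine's IN."""


class Storage(ABC):
    """Hypothetical storage plugin (the slides name HBase)."""

    @abstractmethod
    def save_cube(self, cube):
        """Persist a dict of dimension tuples to aggregates: the engine's OUT."""


class InMemorySource(Source):
    def __init__(self, rows):
        self.rows = rows

    def read_rows(self):
        yield from self.rows


class InMemoryStorage(Storage):
    def __init__(self):
        self.saved = {}

    def save_cube(self, cube):
        self.saved.update(cube)


class SumEngine:
    """Toy engine: it only sees the IN/OUT interfaces, never a concrete system."""

    def build(self, source, storage):
        cube = defaultdict(int)
        for dims, value in source.read_rows():
            cube[dims] += value
        storage.save_cube(dict(cube))


# Hypothetical registries playing the role of the factories in the diagram.
SOURCES = {"memory": InMemorySource}
STORAGES = {"memory": InMemoryStorage}


def build_cube(metadata, rows):
    """Pick source and storage from cube metadata, then run the engine."""
    source = SOURCES[metadata["source"]](rows)
    storage = STORAGES[metadata["storage"]]()
    SumEngine().build(source, storage)
    return storage.saved
```

Swapping in another source or storage then means registering one more adapter class; the engine code stays unchanged.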
Developing Modules
• Engine
  • MR V1
  • MR V2
  • Spark (early)
  • Streaming (experimental)
• Source
  • Hive
  • Kafka
  • Spark SQL & DataFrames
• Storage
  • HBase
  • ? Kudu
  • ? Cassandra
The Freedom, Extensibility, Flexibility
• Freedom
  • Zoo break, not bound to Hadoop any more
  • Free to go to a better engine or storage
• Extensibility
  • Accept any input, e.g. Kafka
  • Embrace next-gen distributed platforms, e.g. Spark
• Flexibility
  • Choose a different engine for different data sets
Layered Cubing (MR Engine V1)
[Diagram: Full Data → 4-D Cuboid (A,B,C,D) → 3-D Cuboids (A,B,C; A,B,D; A,C,D; B,C,D) → 2-D Cuboid → 1-D Cuboid → 0-D Cuboid, one MR round per layer; in each round several mappers feed a reducer]
• Pros
  • Simple implementation; relies on the MR shuffle to merge-sort and then aggregate
  • Low memory requirement
• Cons
  • Aggregation happens on the reducer side
  • Mappers output raw data, so the shuffle is huge
  • Multiple rounds of MR overhead
  • Shuffle can be 100x the cube size, causing heavy I/O pressure
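The layer-by-layer idea can be sketched in plain Python as follows. This is an illustrative single-process sketch with SUM as the only measure, not the MapReduce implementation; the function name `layered_cubing` and the representation of a cuboid as a tuple of dimension indices are assumptions made for the example. Each pass of the `while` loop corresponds to one MR round in the diagram.

```python
from collections import defaultdict


def layered_cubing(records, n_dims):
    """Build all cuboids layer by layer, each from a parent one layer above.

    records: iterable of (dimension_values_tuple, measure) pairs.
    Returns {cuboid: {key_tuple: summed_measure}}, where a cuboid is the
    tuple of dimension indices it keeps.
    """
    all_dims = tuple(range(n_dims))

    # First round: aggregate the raw data into the base (N-D) cuboid.
    base = defaultdict(int)
    for dims, value in records:
        base[tuple(dims)] += value
    cube = {all_dims: dict(base)}

    layer = [all_dims]
    while layer and layer[0]:  # stop once the 0-D cuboid () is reached
        next_layer = []
        for parent in layer:
            for drop in parent:
                child = tuple(d for d in parent if d != drop)
                if child in cube:
                    continue  # already derived from another parent
                positions = [parent.index(d) for d in child]
                agg = defaultdict(int)
                for key, value in cube[parent].items():
                    agg[tuple(key[p] for p in positions)] += value
                cube[child] = dict(agg)
                next_layer.append(child)
        layer = next_layer
    return cube
```

With N dimensions this produces all 2^N cuboids, and every layer re-reads the output of the previous one, which is where the repeated shuffle cost listed under Cons comes from.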
Fast Cubing
[Diagram: Data Split, Data Split, Data Split, … → Merge Sort (Shuffle) → Final Cube]
• Pros
  • In-memory cubing algorithm that can be reused by Streaming, Spark, etc.
  • Mapper-side aggregation
  • Less shuffling given the right data split
  • One round of MR
• Cons
  • Code complexity
  • High mapper CPU/memory consumption
Fast Cubing (MR Engine V2)
• If data splits are unique (little key overlap between splits)
  • Fast cubing wins
• If data splits are common (the same keys recur across splits)
  • Layered cubing wins
• The new cube engine chooses the right algorithm based on data sampling.
• Overall build time is 1.5x faster, aggregated over results from 500 jobs.
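A rough sketch of the mapper-side idea follows: each data split is cubed completely in memory, and only the pre-aggregated partial cubes are merged afterwards. It is a brute-force illustration, not Kylin's actual in-memory cubing algorithm; `cube_split` and `merge_splits` are hypothetical names, and SUM is the only measure.

```python
from collections import Counter
from itertools import combinations


def cube_split(records, n_dims):
    """Mapper side: aggregate one data split into every cuboid in memory.

    Keys are (cuboid, key_values), where cuboid is a tuple of kept
    dimension indices.
    """
    partial = Counter()
    for dims, value in records:
        for k in range(n_dims + 1):
            for cuboid in combinations(range(n_dims), k):
                key = (cuboid, tuple(dims[i] for i in cuboid))
                partial[key] += value
    return partial


def merge_splits(partials):
    """Reducer side: merge the pre-aggregated partial cubes (one round)."""
    final = Counter()
    for partial in partials:
        final.update(partial)  # Counter.update adds values per key
    return final
```

The size of each partial cube is what gets shuffled. When splits share few keys, a partial cube is already close to final and the merge is cheap; when the same keys recur in every split, each one is emitted once per split, which matches the slide's point that layered cubing can win in that case.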
Parallel Scan
[Diagram: a Query over Cuboid A, Cuboid B, Cuboid C on Servers 1-3, versus a Query over partitions A1, A2, A3, B1, B2 and C spread across Servers 1-3]
• Slow queries are 5-10x faster.
• New HBase storage enables partitioning of cuboids that are big enough.
• Overall query time is 2x faster than before, aggregated over results from 10,000+ queries.
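The effect of partitioning a large cuboid can be sketched as below: each partition is scanned concurrently and the partial aggregates are merged. This is a local thread-pool illustration under assumed names (`scan_shard`, `parallel_scan`), not the HBase-side implementation; it also assumes at least one shard and a SUM measure.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def scan_shard(shard, predicate, group_by):
    """Scan one partition: filter rows and pre-aggregate by one dimension."""
    partial = Counter()
    for key, value in shard:
        if predicate(key):
            partial[key[group_by]] += value
    return partial


def parallel_scan(shards, predicate, group_by):
    """Scan all partitions of a cuboid concurrently and merge the results."""
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        partials = pool.map(
            lambda shard: scan_shard(shard, predicate, group_by), shards
        )
        result = Counter()
        for partial in partials:
            result.update(partial)
    return dict(result)
```

A query that would otherwise walk one large cuboid sequentially now waits only for its slowest partition, which is consistent with the slide's claim that slow queries benefit the most.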
Near Realtime Incremental Build
• Minute-level micro cubes
• Kafka source
• In-memory cubing
• Auto merge
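The micro-cube and auto-merge bullets can be illustrated with a small sketch. It assumes one micro cube per minute batch and a simple "merge everything once a threshold is reached" policy; the class `MicroCubeStore`, its `merge_threshold` parameter, and that policy are hypothetical and chosen only to keep the example short.

```python
from collections import Counter


class MicroCubeStore:
    """Keeps small per-minute cubes and merges them when too many pile up."""

    def __init__(self, merge_threshold=3):
        self.merge_threshold = merge_threshold
        self.segments = []  # list of (start_minute, end_minute, Counter)

    def add_batch(self, minute, events):
        """Build one micro cube in memory from a minute's worth of events."""
        cube = Counter()
        for dims, value in events:
            cube[dims] += value
        self.segments.append((minute, minute, cube))
        self._auto_merge()

    def _auto_merge(self):
        """Collapse all segments into one once the threshold is reached."""
        if len(self.segments) >= self.merge_threshold:
            merged = Counter()
            for _, _, cube in self.segments:
                merged.update(cube)
            start, end = self.segments[0][0], self.segments[-1][1]
            self.segments = [(start, end, merged)]

    def query(self, dims):
        """Sum the measure for one dimension key across all segments."""
        return sum(cube[dims] for _, _, cube in self.segments)
```

Queries see fresh data as soon as a batch is added, while merging keeps the number of segments a query has to touch small.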
Future Lambda Architecture for Realtime
[Diagram: streaming from Kafka feeds a Real-time In-Mem Store (Inverted Index, latest second) and, in minute batches, Cube Storage (Cube); a Hybrid Storage Interface serves SQL queries over both]
Use Case: SEO Operational Dashboard
Dimensions
• eBay Site
  • ebay.com, ebay.co.uk, ebay.de
• Buyer Country
  • US, CN, RU
• Search Engine
  • Google, Bing, Yahoo!
• Referrer
  • google.com, google.co.uk
• Page
  • Search, View Item, Product
• User Experience
  • Desktop, Mobile APP, mWeb
Measurements
• Visits, GMB $, GMB share, conversion rate, bounce rate, # of view items, # of bought items etc.
User Defined Aggregation Types
• HyperLogLog Count Distinct
• TopN
• BitMap Precise Count Distinct
  • from Sun, Yerui (netease.com)
• Raw Records
  • from Wang, Xiaoyu (jd.com)
• Domain-specific aggregations now become easy
  • aggregate user events to detect time series or access patterns
  • draw a sketch of certain user groups
  • pre-calculate clusters of data points
  • histogram…
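What an aggregation type needs in order to be pre-calculated can be sketched with a precise count distinct over integer IDs. The three-method shape (`add`, `merge`, `result`) and the class name `BitmapCountDistinct` are assumptions made for illustration, not Kylin's measure API, and a Python integer stands in for a real bitmap structure.

```python
class BitmapCountDistinct:
    """Precise count distinct over non-negative integer IDs."""

    def __init__(self):
        self.bits = 0  # bit i is set when ID i has been seen

    def add(self, user_id):
        """Fold one raw value into the aggregation state."""
        self.bits |= 1 << user_id

    def merge(self, other):
        """Combine two partial states, e.g. from two cuboid rows or segments."""
        self.bits |= other.bits

    def result(self):
        """Return the number of distinct IDs seen so far."""
        return bin(self.bits).count("1")
```

The key property is that `merge` is lossless: partial states built on different splits or time segments combine to the same answer as a single pass, which is what lets such a measure be stored in a cube and rolled up at query time.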
TopN Support

Cube pre-calculation:
DT,LOC         TopN
2015-10-1,CN   Item A, $500
               Item B, $300
               …

select dt, loc, item, sum(gmv)
from test_kylin_fact
where dt='2015-10-1' and loc='CN'
group by dt, loc, item
order by 4 desc
limit 100

• TopN as a measure
• Approximate algorithm
  • SpaceSaving TopN
    • Ahmed Metwally, et al. “Efficient computation of frequent and top-k elements in data streams”. Proceedings of the 10th International Conference on Database Theory (ICDT'05), 2005.
  • A parallel version
    • Massimo Cafaro, et al. “A parallel space saving algorithm for frequent items and the Hurwitz zeta distribution”. arXiv:1401.0702v12 [cs.DS], 19 Sep 2015.
• Answer TopN queries directly from pre-calculation
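The SpaceSaving idea cited above can be sketched in a few lines. This is a simple dictionary-based version with a linear scan for the minimum counter and a weighted `offer` (so a measure such as GMV can be summed), not the optimized stream-summary structure from the paper and not Kylin's implementation; the class and method names are hypothetical.

```python
class SpaceSaving:
    """Approximate top-N counter that tracks at most `capacity` items."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.counters = {}

    def offer(self, item, weight=1):
        if item in self.counters:
            self.counters[item] += weight
        elif len(self.counters) < self.capacity:
            self.counters[item] = weight
        else:
            # Evict the smallest counter; the newcomer inherits its count,
            # so estimates can overshoot but never undershoot the truth.
            victim = min(self.counters, key=self.counters.get)
            floor = self.counters.pop(victim)
            self.counters[item] = floor + weight

    def top(self, n):
        """Return the n items with the largest estimated totals."""
        ranked = sorted(self.counters.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:n]
```

For example, with `capacity=2` and the stream A, A, B, C, A, the item C takes over B's counter and is reported as 2 although it occurred once, while A is reported exactly as 3. Keeping more counters than the N being asked for reduces this kind of overestimation among the top entries.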
ODBC Enhancement
• Works with Tableau 9.1
• Works with MS Excel
• Works with MS Power BI
Zeppelin Integration
Summary
• New in Apache Kylin 1.5
  • Pluggable architecture
  • New MR Cube Engine with fast cubing (1.5x faster)
  • New HBase Storage with parallel scan (2x faster)
  • Near real-time analysis (experimental)
  • User defined aggregations
  • Excel / Power BI / Zeppelin integration
Thanks!
http://kylin.io

Apache Kylin 1.5 Updates

  • 1. The Evolution of Apache Kylin Realtime & Plugin Architecture in Kylin 1.5 Li, Yang | 李扬
  • 2. Agenda  What’s Apache Kylin?  New Features in Kylin 1.5  Plugin Architecture  Fast Cubing  Parallel Scan  Streaming Cubing  User Defined Aggregation  Summary
  • 3. Extreme OLAP Engine for Big Data Kylin is an open source Distributed Analytics Engine from eBay that provides SQL interface and multi-dimensional analysis (OLAP) on Hadoop supporting extremely large datasets What’s Kylin kylin / ˈkiːˈlɪn / 麒麟 --n. (in Chinese art) a mythical animal of composite form • Open Sourced on Oct 1st, 2014 • Accepted as Apache Incubator Project on Nov 25th, 2014
  • 4. Feature – SQL Interface: Hive Table → Build Cube (Index) → SQL Query
  • 5. Feature – Big Data (eBay cases; case: cube size, raw records). Session Analysis: 20 TB, 81+ billion rows. Traffic Analysis: 30 TB, 28+ billion rows. Transaction Analysis: 560 GB, 1.2+ billion rows.
  • 6. Feature – Low Latency: 90% of queries return in under 5 seconds. Chart legend: dark-blue line, 90th-percentile queries; light-blue line, 95th-percentile queries. The 90th-percentile query returns in 3 seconds.
  • 7. Feature – BI Integration via ODBC, JDBC
  • 8. Feature – Scalable Throughput: linear scale-out with more nodes
  • 9. Agenda: What’s Apache Kylin?; New Features in Kylin 1.5; Plugin Architecture; Fast Cubing; Parallel Scan; Streaming Cubing; User Defined Aggregation; Summary
  • 10. Plugin Architecture Overview (diagram). Clients: 3rd Party App (Web App, Mobile…) via REST API; SQL-Based Tool (BI Tools: Tableau…) via JDBC/ODBC. Kylin: REST Server, Query Engine, Routing, Metadata, Cube Builder (MapReduce…), with DataSource Abstraction, Engine Abstraction and Storage Abstraction. Data: Star Schema Data (Hadoop, Hive); Key Value Data (OLAP Cubes in HBase). Flows: Online Analysis Data Flow (SQL, low latency, seconds); Offline Data Flow. Clients/users interact with Kylin via SQL; the OLAP cube is transparent to users.
  • 11. Plugin Architecture (diagram): Hive Source → IN → MR Engine → OUT → HBase Storage; Cube Metadata; SourceFactory, EngineFactory, StorageFactory
  • 12. Plugin Architecture (diagram): Hive Source → Hive Adapter (load data, adapt to IN) → MR Engine → HBase Adapter (save cube, adapt to OUT) → HBase Storage
  • 13. Developing Modules. Engine: MR V1; MR V2; Spark (early); Streaming (experimental). Source: Hive; Kafka; Spark SQL & DataFrames. Storage: HBase; Kudu (?); Cassandra (?).
  • 14. The Freedom, Extensibility, Flexibility. Freedom: zoo break, not bound to Hadoop any more; free to go to a better engine or storage. Extensibility: accept any input, e.g. Kafka; embrace next-gen distributed platforms, e.g. Spark. Flexibility: choose a different engine for each data set.
  • 15. Layered Cubing (MR Engine V1). Diagram: Full Data is cubed in successive MR rounds, one round per layer, covering the 4-D cuboid (A,B,C,D), the 3-D cuboids (A,B,C; A,B,D; A,C,D; B,C,D), then the 2-D, 1-D and 0-D cuboids. Pros: simple implementation that depends on the MR shuffle to merge-sort and then aggregate; low memory requirement. Cons: aggregation happens on the reducer side; mappers output raw data, so the shuffle is huge; multiple rounds of MR overhead; the shuffle can be 100x the cube size, which puts heavy pressure on I/O.
  • 16. Fast Cubing. Diagram: Data Splits → mappers → Merge Sort (Shuffle) → reducer → Final Cube. Pros: in-memory cubing algorithm that can be reused by Streaming, Spark, etc.; mapper-side aggregation; less shuffling given the right data split; one round of MR. Cons: code complexity; high mapper CPU/memory consumption.
  • 17. Fast Cubing (MR Engine V2). If data splits are unique, fast cubing wins. If data splits are common, layered cubing wins. The new cube engine chooses the right algorithm based on data sampling. Overall build time is 1.5x faster, summed over the results of 500 jobs.
  • 18. Parallel Scan. Slow queries are 5-10x faster. The new HBase storage enables partitioning of cuboids that are big enough. Overall query time is 2x faster than before, summed over the results of 10,000+ queries. Diagram: before, a query scans Cuboid A, Cuboid B and Cuboid C on Servers 1-3; after, a query scans partitions A1, B1, A2, B2, A3 and C spread across Servers 1-3.
  • 19. Near Realtime Incremental Build: minute-level micro cubes; Kafka source; in-memory cubing; auto merge
  • 20. Lambda Architecture for Realtime (diagram). Labels: Kafka, streaming, Real-time In-Mem Store (Inverted Index, latest second), minute batch, Cube Storage (Cube), Hybrid Storage Interface, SQL Query, Future.
  • 21. Use Case: SEO Operational Dashboard. Dimensions: eBay Site (ebay.com, ebay.co.uk, ebay.de); Buyer Country (US, CN, RU); Search Engine (Google, Bing, Yahoo!); Referrer (google.com, google.co.uk); Page (Search, View Item, Product); User Experience (Desktop, Mobile App, mWeb). Measurements: visits, GMB $, GMB share, conversion rate, bounce rate, # of view items, # of bought items, etc.
  • 22. User Defined Aggregation Types: HyperLogLog Count Distinct; TopN; BitMap Precise Count Distinct, from Sun, Yerui (netease.com); Raw Records, from Wang, Xiaoyu (jd.com). Domain-specific aggregations now become easy: aggregate user events to detect time series or access patterns; draw a sketch of certain user groups; pre-calculate clusters of data points; histogram…
  • 23. TopN Support. Query: select dt, loc, item, sum(gmv) from test_kylin_fact where dt='2015-10-1' and loc='CN' group by dt, loc, item order by 4 desc limit 100. Cube pre-calculation: for (DT, LOC) = (2015-10-1, CN), TopN = Item A, $500; Item B, $300; … TopN as a measure. Approximate algorithm: SpaceSaving TopN (Ahmed Metwally, et al. “Efficient computation of frequent and top-k elements in data streams”. ICDT'05: Proceedings of the 10th International Conference on Database Theory, 2005). A parallel version (Massimo Cafaro, et al. “A parallel space saving algorithm for frequent items and the Hurwitz zeta distribution”. arXiv:1401.0702v12 [cs.DS], 19 Sept 2015). TopN queries are answered directly from the pre-calculation.
  • 24. ODBC Enhancement: works with Tableau 9.1; works with MS Excel; works with MS Power BI
  • 26. Agenda: What’s Apache Kylin?; New Features in Kylin 1.5; Plugin Architecture; Fast Cubing; Parallel Scan; Streaming Cubing; User Defined Aggregation; Summary
  • 27. Summary: New in Apache Kylin 1.5. Plugin-able architecture; new MR Cube Engine with fast cubing (1.5x faster); new HBase Storage with parallel scan (2x faster); near real-time analysis (experimental); user defined aggregations; Excel / PowerBI / Zeppelin integration.
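
The contrast between layered cubing (slide 15) and fast cubing (slides 16-17) can be illustrated with a small in-memory sketch. This is not Kylin's implementation: the record layout, the function names and the reuse of the layered routine inside each "mapper" are assumptions made for illustration only. It shows that deriving each layer from the layer above, and building a partial cube per data split followed by one merge, arrive at the same cuboids.

```python
from collections import defaultdict
from itertools import combinations

DIMS = ("A", "B", "C", "D")  # dimension names as drawn on slide 15


def aggregate(rows, keep):
    """Group (key, measure) rows by the dimensions in `keep`, summing the measure.

    Each key is a tuple of (dimension, value) pairs.
    """
    out = defaultdict(int)
    for key, measure in rows:
        reduced = tuple((d, v) for d, v in key if d in keep)
        out[reduced] += measure
    return dict(out)


def layered_cubing(records):
    """One pass per layer: the 4-D base cuboid first, then every (k-1)-D
    cuboid is derived from a k-D parent produced by the previous pass."""
    raw = [(tuple((d, rec[d]) for d in DIMS), rec["measure"]) for rec in records]
    cube = {DIMS: aggregate(raw, set(DIMS))}
    for size in range(len(DIMS) - 1, -1, -1):
        for child in combinations(DIMS, size):
            # Any parent in the layer above that contains the child will do.
            parent = next(
                p for p in list(cube)
                if len(p) == size + 1 and set(child) <= set(p)
            )
            cube[child] = aggregate(cube[parent].items(), set(child))
    return cube


def fast_cubing(splits):
    """Each 'mapper' builds a complete cube for its own split in memory;
    a single merge step (standing in for shuffle + reducer) sums them.
    Reusing layered_cubing per split is a simplification for this sketch."""
    merged = defaultdict(lambda: defaultdict(int))
    for split in splits:
        partial = layered_cubing(split)
        for cuboid, rows in partial.items():
            for key, measure in rows.items():
                merged[cuboid][key] += measure
    return {cuboid: dict(rows) for cuboid, rows in merged.items()}


# Hypothetical sample records.
records = [
    {"A": "a1", "B": "b1", "C": "c1", "D": "d1", "measure": 10},
    {"A": "a1", "B": "b2", "C": "c1", "D": "d1", "measure": 5},
    {"A": "a2", "B": "b1", "C": "c2", "D": "d1", "measure": 7},
    {"A": "a1", "B": "b1", "C": "c1", "D": "d1", "measure": 3},
]
full = layered_cubing(records)
split_build = fast_cubing([records[:2], records[2:]])
```

In this sketch the volume shipped to the merge step depends on how much each split collapses under aggregation, which is the trade-off slide 17 says the new engine weighs through data sampling.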

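The SpaceSaving algorithm cited on slide 23 can be sketched in a few lines. This is a minimal sequential version with weighted updates, written for illustration; the class name, method names and capacity parameter are assumptions and do not reflect Kylin's TopN measure code, and the parallel merge from the second citation is not shown.

```python
class SpaceSaving:
    """Keeps at most `capacity` counters; estimates never undercount."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.counts = {}  # item -> estimated total
        self.errors = {}  # item -> maximum possible overestimation

    def offer(self, item, weight=1):
        if item in self.counts:
            self.counts[item] += weight
        elif len(self.counts) < self.capacity:
            self.counts[item] = weight
            self.errors[item] = 0
        else:
            # Evict the smallest counter; the newcomer inherits its count
            # as an upper bound on what it may have accumulated unseen.
            victim = min(self.counts, key=self.counts.get)
            floor = self.counts.pop(victim)
            self.errors.pop(victim)
            self.counts[item] = floor + weight
            self.errors[item] = floor

    def top(self, n):
        ranked = sorted(self.counts.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:n]


# Hypothetical sales for one (dt, loc) cell, e.g. (2015-10-1, CN).
exact = SpaceSaving(capacity=3)
for item, gmv in [("Item A", 500), ("Item B", 300), ("Item C", 50), ("Item A", 100)]:
    exact.offer(item, gmv)

# With more distinct items than counters, the result becomes approximate.
small = SpaceSaving(capacity=2)
for item, gmv in [("A", 5), ("B", 3), ("C", 1)]:
    small.offer(item, gmv)
```

Storing one such summary per dimension combination is what would let a query of the `order by ... desc limit N` shape be served from pre-calculated data, at the cost of approximate values when the number of distinct items exceeds the counter capacity.
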
Editor's Notes

  • #4: OLAP; big data; vs. Ubuntu Kylin. eBay's first open source project contributed to Apache, and also the first project contributed to Apache entirely by a team from China.
  • #9: Introduce queries: one machine running 4 Tomcat instances can reach around 300 QPS.
  • #13: A high-level architecture for Kylin, which is a standard MOLAP architecture built on Hadoop. The data sources used to build the MOLAP cubes are primarily Hive. A Storage Abstraction Layer is in the works to support other NoSQL stores such as Cassandra/CouchBase. An Engine Abstraction maintains the cube metadata and a cube builder, which today is a set of MapReduce jobs that build the cubes. A storage layer stores the cubes in HBase, primarily through a bulk load of the aggregates into HBase. We are looking for active community participation to build out additional data source, engine and storage plugins for Kylin. A query engine indexes directly into the multi-dimensional arrays built in HBase.
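
The source / engine / storage separation described in note #13 and drawn on slides 11-12 can be sketched as three small interfaces wired together by name from cube metadata. Everything here is hypothetical: the registries stand in for the SourceFactory, EngineFactory and StorageFactory boxes on slide 11, the in-memory classes stand in for Hive, the MR engine and HBase, and none of the names are Kylin APIs.

```python
from typing import Dict, Iterable, Tuple

Row = Tuple[str, str, float]  # (dt, loc, gmv): an assumed fact-table row


class InMemorySource:
    """Stand-in for a Hive source: adapts stored rows to the engine's IN side."""

    def __init__(self, rows: Iterable[Row]):
        self.rows = list(rows)

    def read(self) -> Iterable[Row]:
        return iter(self.rows)


class DictStorage:
    """Stand-in for HBase storage: adapts the engine's OUT side to a dict."""

    def __init__(self):
        self.cells: Dict[Tuple[str, str], float] = {}

    def save(self, cuboid: Dict[Tuple[str, str], float]) -> None:
        self.cells.update(cuboid)


class LocalEngine:
    """Stand-in for a cube engine: reads from IN, aggregates, writes to OUT."""

    def build(self, source, storage) -> None:
        totals: Dict[Tuple[str, str], float] = {}
        for dt, loc, gmv in source.read():
            totals[(dt, loc)] = totals.get((dt, loc), 0.0) + gmv
        storage.save(totals)


# Hypothetical registries playing the role of the three factories.
SOURCES = {"memory": InMemorySource}
ENGINES = {"local": LocalEngine}
STORAGES = {"dict": DictStorage}


def build_cube(metadata: Dict[str, str], rows: Iterable[Row]) -> DictStorage:
    """Pick each plugin by the name recorded in the cube metadata."""
    source = SOURCES[metadata["source"]](rows)
    storage = STORAGES[metadata["storage"]]()
    ENGINES[metadata["engine"]]().build(source, storage)
    return storage


metadata = {"source": "memory", "engine": "local", "storage": "dict"}
rows = [
    ("2015-10-1", "CN", 500.0),
    ("2015-10-1", "CN", 300.0),
    ("2015-10-1", "US", 120.0),
]
result = build_cube(metadata, rows)
```

Because the engine only sees `read()` and `save()`, a different source or storage could be swapped in by registering another class and changing one metadata value, which is the kind of flexibility slide 14 describes.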