SQLstream s-Server is a fully compliant, distributed, scalable and optimized SQL query engine for unstructured machine data streams.

Why SQL? Even early Big Data adopters, impressed with the power of Hadoop, struggled as MapReduce proved a bottleneck in terms of the skills required, the time to develop applications and wider access to data across the Enterprise. SQL has proven to be the answer, as a:

High-level declarative language (applications can be written in just a few lines of powerful SQL queries that are optimized automatically for high performance over distributed hardware);
Data management platform (enhances portability and reusability of data across the Enterprise, essential as Big Data gains wider Enterprise adoption);
Language that can enable new data sources, enterprise integrations and application updates to be added dynamically. This enables applications to be changed on the fly in a 24×7 environment without downtime;
Language that can easily be machine-generated for enhancing productivity, enabling business level applications such as StreamLab.

SQL as a Big Data query language is also blazingly fast. Period.

SQL for stream processing

The benefits of SQL apply equally to the processing of unstructured machine data streams:

The latest SQL standard (SQL:2011) supports time-series data processing with operators such as the WINDOW clause.
For scale-up, SQL supports the optimization of continuous queries over a large set of cores on a multicore processor without requiring the developer to write multi-threaded code, understand lock concurrency or the use of semaphores.
For scale-out, SQL for stream processing supports the optimization of distributed SQL queries over any number of servers with optimization for low latency and high throughput.
Streaming SQL queries execute continuously, processing data as they arrive over row or time-based Windows. A streaming SQL query never ends, ensuring millisecond response times.

The power of SQL for stream processing

SQL and continuous queries

A streaming query is a continuous, standing query that executes over streaming data. Streaming queries are similar to database queries in how they analyze data; they differ by operating continuously on data as they arrive and by updating results incrementally in real-time.

The following simple example illustrates a SQL query executing over a stream of weather sensor data. Unlike a traditional query over stored data, the query executes continuously, processing each arriving record immediately, and outputting a result as the condition is met.

SELECT STREAM
    ROWTIME,
    City,
    Temperature
FROM WEATHERSTREAM
WHERE “City” = ‘San Francisco’ AND “Temperature” > ‘100’;

The result is a record containing the time, location and temperature reading for every input reading matching the selection criteria. The latency from the time the data record was emitted by the sensor to the alert being issued can be as little as a few milliseconds, even where data arrive at millions of records per second. Queries such as this are useful for generating real-time alerts from input data streams.

Tumbling windows and aggregation

There are two types of basic operations on streaming data – streaming aggregation (tumbling windows) and windowed analytics (sliding windows).

Streaming aggregation queries, for example with functions – including AVG, COUNT, MAX, MIN, SUM, STDDEV_POP, STDDEV_SAMP, VAR_POP, VAR_SAMP – output a record with aggregated values for each GROUP of input records.

If we now want to find the minimum and maximum temperature recorded anywhere each minute (globally, regardless of city), the minimum and maximum temperature can be calculated using the aggregate functions MIN and MAX. To indicate that we want this information on a per-minute basis, we use the FLOOR function to round each record’s (or row’s) rowtime down to the nearest minute.

SELECT STREAM
    FLOOR(WEATHERSTREAM.ROWTIME to MINUTE) AS FLOOR_MINUTE,
    MIN(TEMP) AS MIN_TEMP,
    MAX(TEMP) AS MAX_TEMP,
    AVG(TEMP) AS AVG_TEMP
FROM WEATHERSTREAM
GROUP BY FLOOR(WEATHERSTREAM.ROWTIME TO MINUTE);

The result is a stream of new record, one per minute, specifying the maximum, minimum and average readings recorded in that minute. This type of query is particularly useful for generating periodic reports with zero latency from the input data streams.

Streaming analytics and sliding windows

The second basic operation for stream processing is windowed analytics, where an output record (or row) is produced for each new input record. Each field (or column) in the output record may be calculated using a different window or partition. Windows can be time or row-based.

Let’s say we want to find the minimum and maximum temperature recorded in the 60 second period prior to any given reading, globally, regardless of city. To do this, we define a one minute time window using RANGE INTERVAL ’1′ MINUTE PRECEDING, and use it in the OVER clause for the analytic functions.

SELECT STREAM
    ROWTIME,
    MIN(TEMP) OVER W1 AS WMIN_TEMP,
    MAX(TEMP) OVER W1 AS WMAX_TEMP,
    AVG(TEMP) OVER W1 AS WAVG_TEMP
FROM WEATHERSTREAM
WINDOW W1 AS ( RANGE INTERVAL '1' MINUTE PRECEDING );

The query may look similar to the streaming aggregation query above, but the result is very different. In this case, an output record (or row) is generated for each and every newly arriving input record, specifying the updates to the minimum and maximum temperatures over the preceding 60 seconds, plus an incrementally updated average for the temperature over that period. Windowed analytics queries are the most common on stream processing solutions, with the ability to incrementally update answers key to stream processing performance.

Let’s assume we want the same information but broken down by CITY. To do this, we add a PARTITION BY clause on CITY to the window specification. The result will be an output record for each and every new input record for the specific PARTITION, that is, for each CITY.

SELECT STREAM
    ROWTIME,
    CITY,
    MIN(TEMP) over W1 AS WMIN_TEMP,
    MAX(TEMP) over W1 AS WMAX_TEMP,
    AVG(TEMP) over W1 AS WAVG_TEMP
FROM WEATHERSTREAM
WINDOW W1 AS (
    PARTITION BY CITY
    RANGE INTERVAL '1' MINUTE PRECEDING );

In effect, the PARTITION BY operation can be used to multiplex multiple logical output streams from a single input stream, in this case one logical output stream for each CITY in North America. Operations against partitioned data streams apply to all partitions, enabling the developer to write a single code statement and apply it to a large number of data streams. Window partitioning also allows queries to execute in parallel across multiple cores in a server, multiple servers in a data center, or multiple data centers in a cloud.

Streaming correlation and JOINs

The ability to correlate and report over data arriving from multiple sources simultaneously is an essential stream processing requirement. Inner, outer and cross JOINs are supported (as well as UNION and UNION ALL). The following example illustrates a JOIN of two streaming data sources, one for stock orders, and one for the resulting stock trades.

SELECT STREAM
    ROWTIME, o.orderId, o.ticker, 
    o.amount AS orderAmount,
    t.amount AS tradeAmount
FROM Orders AS o
JOIN Trades OVER (RANGE INTERVAL '1' MINUTE FOLLOWING) AS t
ON o.orderId = t.orderId;

This query outputs a stream of all Orders matched by a Trade within one minute of the Order being placed. The output stream is sorted by rowtime, in this case, the rowtime from the Orders stream. Let’s say we wanted the exceptions, that is, the unmatched Trades and Orders. This requires an OUTER JOIN as illustrated below.

SELECT STREAM
    ROWTIME, o.orderId, o.ticker, o.amount AS orderAmount,
    t.ticker, t.tradeId, t.amount AS tradeAmount,
FROM Orders AS o
OUTER JOIN Trades OVER (RANGE INTERVAL '1' MINUTE FOLLOWING) AS t
ON o.orderId = t.orderId;

The output is this case is a stream of Orders where the Trade has not occurred within a one minute time window, and including Trades with missing Orders.

JOINs are essential for real-world stream processing applications, enabling time and geo-spatial patterns to be identified over different types and formats of data.

Libraries and functions

SQL offers a wide range of libraries for manipulation of streaming data. For example, aggregation libraries including AVG, COUNT, MAX, MIN, SUM, MOD, LN and POWER, a statistical analysis library including STDDEV_POP, STDDEV_SAMP, VAR_POP, VAR_SAMP, as well as a general library of functions such as ABS, CAST, CEILING, FLOOR, LOWER, UPPER, LAST_VALUE and FIRST_VALUE.

A full list of libraries and functions are available on the documentation pages at www.sqlstream.com/docs

Extending SQL with Java functions

The SQL UDX (User Defined Transform) mechanism enables streaming operations to be written in Java, and for third party libraries and even entire systems to be included in a streaming SQL query. For example, the following illustrates the syntax of including a custom stream processing function written in Java into a streaming SQL query.

SELECT STREAM
    ticker, bid, ask,
    my_java_stream_processor(ticker, bid, ask)
FROM trades;

New UDXes are straightforward to develop and deploy, with examples and developer documentation available at sqlstream.com/docs, and we offer a range of industry-specific operators, built-in libraries and third-party system support out of the box:

A range of UDXes for industry-specific operators are delivered with the s-Server distribution, including specializations of the SQL Group_Rank()
Built-in libraries provided as standard include operations for geo-spatial analytics, traffic analytics and call detail record processing in telecoms, many of which are provided with the StreamApps solutions.
Entire third party Java modules and systems can also be included in a streaming SQL pipeline.

Latest blog posts

May	JUL	Aug
	06
2013	2014	2015

Stream Processing with SQL