SQLstream s-Server is a fully compliant, distributed, scalable and optimized SQL query engine for unstructured machine data streams.
Why SQL? Even early Big Data adopters, impressed with the power of Hadoop, struggled as MapReduce proved a bottleneck in terms of the skills required, the time to develop applications and wider access to data across the Enterprise. SQL has proven to be the answer, as a:
SQL as a Big Data query language is also blazingly fast. Period.
The benefits of SQL apply equally to the processing of unstructured machine data streams:
A streaming query is a continuous, standing query that executes over streaming data. Streaming queries are similar to database queries in how they analyze data; they differ by operating continuously on data as they arrive and by updating results incrementally in real-time.
The following simple example illustrates a SQL query executing over a stream of weather sensor data. Unlike a traditional query over stored data, the query executes continuously, processing each arriving record immediately, and outputting a result as the condition is met.
SELECT STREAM ROWTIME, City, Temperature FROM WEATHERSTREAM WHERE “City” = ‘San Francisco’ AND “Temperature” > ‘100’;
The result is a record containing the time, location and temperature reading for every input reading matching the selection criteria. The latency from the time the data record was emitted by the sensor to the alert being issued can be as little as a few milliseconds, even where data arrive at millions of records per second. Queries such as this are useful for generating real-time alerts from input data streams.
There are two types of basic operations on streaming data – streaming aggregation (tumbling windows) and windowed analytics (sliding windows).
Streaming aggregation queries, for example with functions – including AVG, COUNT, MAX, MIN, SUM, STDDEV_POP, STDDEV_SAMP, VAR_POP, VAR_SAMP – output a record with aggregated values for each GROUP of input records.
If we now want to find the minimum and maximum temperature recorded anywhere each minute (globally, regardless of city), the minimum and maximum temperature can be calculated using the aggregate functions MIN and MAX. To indicate that we want this information on a per-minute basis, we use the FLOOR function to round each record’s (or row’s) rowtime down to the nearest minute.
SELECT STREAM FLOOR(WEATHERSTREAM.ROWTIME to MINUTE) AS FLOOR_MINUTE, MIN(TEMP) AS MIN_TEMP, MAX(TEMP) AS MAX_TEMP, AVG(TEMP) AS AVG_TEMP FROM WEATHERSTREAM GROUP BY FLOOR(WEATHERSTREAM.ROWTIME TO MINUTE);
The result is a stream of new record, one per minute, specifying the maximum, minimum and average readings recorded in that minute. This type of query is particularly useful for generating periodic reports with zero latency from the input data streams.
The second basic operation for stream processing is windowed analytics, where an output record (or row) is produced for each new input record. Each field (or column) in the output record may be calculated using a different window or partition. Windows can be time or row-based.
Let’s say we want to find the minimum and maximum temperature recorded in the 60 second period prior to any given reading, globally, regardless of city. To do this, we define a one minute time window using RANGE INTERVAL ’1′ MINUTE PRECEDING, and use it in the OVER clause for the analytic functions.
SELECT STREAM ROWTIME, MIN(TEMP) OVER W1 AS WMIN_TEMP, MAX(TEMP) OVER W1 AS WMAX_TEMP, AVG(TEMP) OVER W1 AS WAVG_TEMP FROM WEATHERSTREAM WINDOW W1 AS ( RANGE INTERVAL '1' MINUTE PRECEDING );
The query may look similar to the streaming aggregation query above, but the result is very different. In this case, an output record (or row) is generated for each and every newly arriving input record, specifying the updates to the minimum and maximum temperatures over the preceding 60 seconds, plus an incrementally updated average for the temperature over that period. Windowed analytics queries are the most common on stream processing solutions, with the ability to incrementally update answers key to stream processing performance.
Let’s assume we want the same information but broken down by CITY. To do this, we add a PARTITION BY clause on CITY to the window specification. The result will be an output record for each and every new input record for the specific PARTITION, that is, for each CITY.
SELECT STREAM ROWTIME, CITY, MIN(TEMP) over W1 AS WMIN_TEMP, MAX(TEMP) over W1 AS WMAX_TEMP, AVG(TEMP) over W1 AS WAVG_TEMP FROM WEATHERSTREAM WINDOW W1 AS ( PARTITION BY CITY RANGE INTERVAL '1' MINUTE PRECEDING );
In effect, the PARTITION BY operation can be used to multiplex multiple logical output streams from a single input stream, in this case one logical output stream for each CITY in North America. Operations against partitioned data streams apply to all partitions, enabling the developer to write a single code statement and apply it to a large number of data streams. Window partitioning also allows queries to execute in parallel across multiple cores in a server, multiple servers in a data center, or multiple data centers in a cloud.
The ability to correlate and report over data arriving from multiple sources simultaneously is an essential stream processing requirement. Inner, outer and cross JOINs are supported (as well as UNION and UNION ALL). The following example illustrates a JOIN of two streaming data sources, one for stock orders, and one for the resulting stock trades.
SELECT STREAM ROWTIME, o.orderId, o.ticker, o.amount AS orderAmount, t.amount AS tradeAmount FROM Orders AS o JOIN Trades OVER (RANGE INTERVAL '1' MINUTE FOLLOWING) AS t ON o.orderId = t.orderId;
This query outputs a stream of all Orders matched by a Trade within one minute of the Order being placed. The output stream is sorted by rowtime, in this case, the rowtime from the Orders stream. Let’s say we wanted the exceptions, that is, the unmatched Trades and Orders. This requires an OUTER JOIN as illustrated below.
SELECT STREAM ROWTIME, o.orderId, o.ticker, o.amount AS orderAmount, t.ticker, t.tradeId, t.amount AS tradeAmount, FROM Orders AS o OUTER JOIN Trades OVER (RANGE INTERVAL '1' MINUTE FOLLOWING) AS t ON o.orderId = t.orderId;
The output is this case is a stream of Orders where the Trade has not occurred within a one minute time window, and including Trades with missing Orders.
JOINs are essential for real-world stream processing applications, enabling time and geo-spatial patterns to be identified over different types and formats of data.
SQL offers a wide range of libraries for manipulation of streaming data. For example, aggregation libraries including AVG, COUNT, MAX, MIN, SUM, MOD, LN and POWER, a statistical analysis library including STDDEV_POP, STDDEV_SAMP, VAR_POP, VAR_SAMP, as well as a general library of functions such as ABS, CAST, CEILING, FLOOR, LOWER, UPPER, LAST_VALUE and FIRST_VALUE.
A full list of libraries and functions are available on the documentation pages at www.sqlstream.com/docs
The SQL UDX (User Defined Transform) mechanism enables streaming operations to be written in Java, and for third party libraries and even entire systems to be included in a streaming SQL query. For example, the following illustrates the syntax of including a custom stream processing function written in Java into a streaming SQL query.
SELECT STREAM ticker, bid, ask, my_java_stream_processor(ticker, bid, ask) FROM trades;
New UDXes are straightforward to develop and deploy, with examples and developer documentation available at sqlstream.com/docs, and we offer a range of industry-specific operators, built-in libraries and third-party system support out of the box: