Unifying Stream Processing and Interactive Queries in Apache Kafka
Interactive Queries allows you to get more than just processing from streaming. It allows you to treat the stream processing layer as a lightweight, embedded database and directly query the state of your stream processing application, without needing to materialize that state to external databases or storage. Apache Kafka maintains and manages that state and guarantees high availability and fault tolerance. As such, this new feature enables the hyper-convergence of processing and storage into one easy-to-use application that uses the Apache Kafka’s Streams API.
交互式查询（Interactive Queries）并不仅仅是对流数据进行处理（stream processing），它相当于将流处理这一层看做是一个轻量级的、内置的数据库、直接查询流处理应用程序的状态，它不需要将状态（数据）物化（materialize）到外部数据库或存储系统中。Kafka会自己维护和管理状态，并且保证了高可用、容错。这个新特性使得处理逻辑变得高度聚合，并且可以使用Kafka的Streams API存储（状态）到一个易用的应用程序中。
The idea behind Interactive Queries is not new; a similar concept actually originated in traditional databases where it’s often known as “materialized views.” Though materialized views are very useful, the way they are implemented in databases is not suitable for modern application development – they force you to write your code all in SQL and deploy it into the database server. With our database background combined with our stream processing experience, we visited a key question: can the concept behind materialized views be applied to a modern stream processing engine to create a powerful and general purpose construct for building stateful applications and microservices? In this blog post, we show how Apache Kafka, through Interactive Queries, helps us do exactly that.
When we set out to design the stream processing API for Apache Kafka – Kafka Streams – a key motivation was to rethink the existing solution space for stream processing. Here, our vision has been to move stream processing out of the big data niche and make it available as a mainstream application development model. For us, the key to executing that vision is to radically simplify how users can process data at any scale – small, medium, large – and in fact, one of our mantras is “Build Applications, Not Clusters!” In the past, we wrote about three ways Apache Kafka simplifies the stream processing architecture – by eliminating the need for a separate cluster, having good abstractions for streams and tables and keeping the overall architecture simple. Interactive Queries is another feature to enable this vision.
在为Kafka设计流式处理API时（即Kafka Streams），一个重要的动机是重新思考已有的流处理系统的局限性，我们的焦点已经从大数据领域转到了流处理，并且将其作为可用的主流的应用程序开发模型。执行这个愿景的关键是从根本上简化用户如何处理数据的扩展性问题（不管是小批量的数据、中等规模的、大规模的数据），实际上我们的一个口号是：“构建应用程序，不要集群！”。之前的博文中我们写了Kafka简化流处理的 三种模式
In this blog post, we’ll start by digging deeper into the motivation behind Interactive Queries through a concrete example that outlines its applicability. Then we will describe how Interactive Queries works under the hood and provide a summary of related resources for further reading.
Let’s use an end-to-end example to pick up where we left off in our previous article on why stream processing applications need state. In that article, we described some simple stateful operations, e.g., if you are grouping data by some field and counting, then the state you maintain would be the counts that have accumulated so far. Or if you are joining two streams, the state would be the rows in each stream waiting to find a match in the other stream.
Now as a driving example in this blog, consider a financial institution, like a wealth management firm or a hedge fund, that maintains positions in assets held by the firm and/or its client investors. Maintaining positions means that the bank needs to keep track of the risk associated with those particular assets. The bank continuously collects business events and other data that could potentially influence the risk associated with a given position. This data includes market data fluctuations on the price of the asset, foreign exchange rates, research, or even news information that could influence the reputation of people involved with the asset. Any time this data changes, the risk position needs to be recalculated in order to keep a real-time view of the risk associated with each individual asset as well as on entire portfolios of investments.
Real-time risk management is an example of a stateful application. At a minimum, state is needed to keep track of the latest position for every asset. State is also needed inside the stream processing engine to keep track of various aggregate statistics, like the number of times an asset is traded in a day and the average bid/ask spread. The collected state needs to be continuously updated and queried.