The future of DataFusion is Streaming

Cover Image for The future of DataFusion is Streaming

DataFusion is a wonderfully robust Rust library for working with analytics data using the column-based Apache Arrow memory format. It uses a modular architecture and includes a powerful query planner, logical planner + optimizer, and physical planner + optimizer. You can think of it as an open source library for building high performance analytics databases. In fact, DataFusion is the magic behind such technologies as ParadeDB’s pg_lakehouse (formally pg_analytics) and the Apache Comet projects, which is a Spark accelerator.

While DataFusion is highly performant for batch use-cases, surprisingly, it also has a number of features required for the data streaming use-case. For example, most of the physical operators in DataFusion already have support for an “Unbounded” execution mode. Additionally, the “StreamTableExec” operator makes it trivial to write a data source that reads from Kafka, Kinesis, or any other event system. The library even includes data structures for writing unbounded streams to custom locations. Finally, “SymmetricHashJoinExec” operator works out of the box with Unbounded streams, which means DataFusion already supports streaming joins.

Of course, there are still some core components that do need to be implemented. Watermark tracking is critical for handling late data and building this feature requires carefully convincing the various optimizers not to drop the timestamp column. There’s also the always thorny problem of checkpointing and restoring state, a feature required for any system destined for production use. But, build out these features and sprinkle in a few custom operators to support windowing and stateful aggregations, and you now have a full featured, embeddable stream processing engine.

Rust is quickly overtaking Java to become the de-facto language for performant data systems and DataFusion is at the forefront of this shift. Older products are swapping out proprietary implementations for DataFusion while new data products are being built from the ground up to harness its power. By fully enabling streaming data support, DataFusion can be embedded into even more places and solve entirely new sets of problems.