
Starburst Announces 100GB/second Streaming Ingest from Apache Kafka to Apache Iceberg Tables

Starburst
2024-10-24 20:00

Go from data ingestion to blazing-fast SQL analytics in near real-time with the Starburst Open Hybrid Lakehouse

BOSTON, Oct. 24, 2024 /PRNewswire/ -- Starburst, the Trino company, today announced a range of new capabilities for its Trino-based open hybrid lakehouse platform, Galaxy: the general availability of fully managed streaming ingestion from Apache Kafka to Apache Iceberg tables; the public preview of fully managed ingestion from files landing in Amazon Web Services (AWS) S3 to Iceberg tables; and multiple enhancements to the performance and price-performance of the platform. Galaxy customers can now easily configure and ingest data at a verified scale of up to 100GB/second per Iceberg table at leading price-performance. In addition, Galaxy users can now benefit from faster and more accurate auto-scaling of resources, simplified policy-based routing of user queries, and enhanced performance through improved automatic caching and indexing.

Businesses that need data to be available for analytics in their cloud data lake with minimal delay have traditionally built complex ingestion systems, cobbling together multiple tools and writing custom software to stream data into the lake. Alternatively, these organizations may rely on incomplete solutions that handle only the ingestion process. Both approaches tend to be fragile, difficult to scale, costly to maintain, and solve only part of the problem. After the data lands in the lake, it still needs to be transformed and optimized for efficient querying, which requires even more code, pipelines, and tools, and adds complexity. In addition, the pressure for cost optimization across analytics functions is increasing. CIOs are looking for ways to reduce their operational overhead compared with traditional lakehouses and legacy data warehouses while maintaining control of their data and analytics stack.

"As businesses strive to perform analytics on real-time data, they seek frictionless solutions for continuous data ingestion. They also prioritize open standards like Apache Iceberg to future-proof their environments amid rapidly evolving technologies. Furthermore, reducing complexity and simplifying architectures is critical, helping organizations optimize IT investments and avoid unnecessary costs associated with integrating disparate systems," said Sanjeev Mohan, Principal and Founder of SanjMo. "Starburst's latest announcements are significant because they address these exact needs—delivering improved price performance, simplicity, and efficient elastic scaling for modern data workloads."

Streaming Ingest from Kafka (general availability) - Starburst now enables the easy creation of fully managed ingestion pipelines for Kafka topics at a verified scale of up to 100GB/second, at half the cost of alternative solutions. Configuration is completed in minutes and simply entails selecting the Kafka topic, confirming the auto-generated table schema, and choosing the location of the resulting Iceberg table.

  • Starburst Galaxy's streaming ingestion is serverless and does the heavy lifting without any manual configuration, tuning, or additional tools required by the customer. Galaxy automatically ingests incoming messages from Kafka topics into managed Iceberg tables in S3, compacts and transforms the data, applies the necessary governance, and makes it available to query within about one minute.
  • Starburst's streaming ingestion can connect to Kafka-compliant systems, including Confluent Cloud, Amazon Managed Streaming for Apache Kafka (MSK), and Apache Kafka.
  • Starburst guarantees exactly-once delivery: no message is read twice and no message is missed, so the ingested data stays accurate.
  • It is built for massive scale and has been tested to ingest 100 gigabytes of streaming data per second.
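To illustrate what an exactly-once guarantee means in practice, the following is a minimal sketch of one common technique: committing Kafka offsets together with the data so that a replayed batch is skipped rather than written twice. It is not Starburst's implementation; `TableSink` and `commit_batch` are hypothetical names, and the in-memory lists stand in for an Iceberg table and its commit metadata.

```python
from dataclasses import dataclass, field


@dataclass
class TableSink:
    """Hypothetical stand-in for a table plus the Kafka offsets committed with it."""
    rows: list = field(default_factory=list)
    committed: dict = field(default_factory=dict)  # partition -> next offset to read

    def commit_batch(self, partition, start_offset, messages):
        """Append a batch and advance the offset, skipping already-committed messages."""
        expected = self.committed.get(partition, 0)
        if start_offset > expected:
            # A gap would mean messages were missed, so refuse the batch.
            raise ValueError(f"gap: expected offset {expected}, got {start_offset}")
        # Drop the prefix that an earlier commit already wrote (e.g. a retry).
        fresh = messages[expected - start_offset:]
        self.rows.extend(fresh)
        self.committed[partition] = expected + len(fresh)
        return len(fresh)
```

In this sketch, resending a batch after a failure writes nothing new, and a batch that starts past the committed offset is rejected, which together give the "no duplicates, no gaps" behavior described above.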

Ingest from Files landing in S3 (public preview) - Additionally, Starburst is expanding its ingestion capabilities by introducing file loading, offering customers a powerful, automated alternative to DIY or off-the-shelf solutions. This feature reads, parses, and writes records from files directly into Iceberg tables, which leverage the new ingestion capabilities to automatically optimize the tables for read performance through capabilities like compaction, snapshot retention, orphaned file removal, and statistics collection. The public preview of file loading will be available in November 2024.

Enhanced Auto Scaling (general availability) - Starburst makes auto scaling smarter in Starburst Galaxy. In environments with many concurrent users, demand for compute resources can fluctuate dynamically. The enhanced Auto Scaling monitors both active and pending queries to determine how much compute each query needs, and allocates those resources up to 50% faster. Not only does enhanced Auto Scaling provision additional compute resources faster, but it can also automatically reactivate draining worker nodes, improving the efficiency of resource utilization.

Next Gen Caching (private preview) - Data engineers undertake various labor-intensive data preparation tasks. Starburst Warp Speed helps automate some of those tasks. Still, as business needs evolve and teams turn to a semantic layer approach with tools like dbt, data engineers struggle to provide fast query performance, scalability, and stability for BI and dashboarding without significant overhead. The next-generation caching in Starburst Galaxy applies Warp Speed's smart indexing and caching capabilities to intermediate workload results. Warp Speed will now be able to identify patterns of similar subqueries across different workloads, improving performance by up to 62% compared to non-accelerated queries.

User Role Based Routing (private preview) - Previously, users spent too much effort determining which queries were appropriate for different cluster types, and administrators could not assign groups of users to a cluster via roles and privileges. With User Role Based Routing, Starburst now supports the easy allocation of resources by cluster type. Customers can programmatically route queries to the appropriate Galaxy cluster based on a predefined set of rules. Users can send all queries to a single URL, which routes each query based on the user's role, minimizing human intervention while improving what is already industry-leading price-performance against other leading cloud data warehouses and lakehouses.
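The idea of routing queries by role from a single endpoint can be sketched as a first-match rule table. This is an illustration only, not Galaxy's configuration format or API; the role names, cluster names, and the `route_query` function are all hypothetical.

```python
# Hypothetical rule table, checked in order: (role name, cluster name).
# None of these names come from Starburst Galaxy.
ROUTING_RULES = [
    ("data_engineer", "etl-cluster"),
    ("analyst", "bi-cluster"),
]
DEFAULT_CLUSTER = "adhoc-cluster"


def route_query(user_roles, rules=ROUTING_RULES, default=DEFAULT_CLUSTER):
    """Return the cluster of the first rule whose role the user holds."""
    for role, cluster in rules:
        if role in user_roles:
            return cluster
    # No rule matched, so fall back to the default cluster.
    return default
```

Because the rules are checked in order, a user holding several roles is sent to the cluster of the earliest matching rule, and users with no matching role land on the default cluster.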

"With our new ingestion capabilities to Iceberg, customers don't have to worry about how fast or how much data they need to land in their data lake. At 100GB/second, Galaxy's ingestion can handle the scale of the most demanding use cases. Because it is so easy to configure and cost-effective to operate, customers don't have to artificially limit the number of up-to-date, fresh tables in their lake, enabling them to make the most informed business decisions," said Tobias Ternstrom, Starburst's Chief Product Officer.

Supporting Resources

For more information, read Starburst's Icehouse launch blog.
Download an image of the Starburst Open Data Lakehouse here.

About Starburst

Starburst, the Open Hybrid Lakehouse, is the leading end-to-end data platform to securely access, analyze, and share data for analytics and AI across hybrid, on-premises, and multi-cloud environments. As the leaders in Trino, a modern open-source SQL engine, Starburst empowers the most data-intensive and security-conscious organizations like Comcast, Halliburton, Vectra, EMIS Health, and 7 of the top 10 global banks to democratize data access, enhance analytics performance, and improve architecture optionality. With the Open Hybrid Lakehouse from Starburst, enterprises globally can easily discover and use all their relevant business data to power new applications and analytics across risk mitigation, supply chain, customer experiences, product optimization, streaming, and more.   

For additional information, please visit https://www.starburst.io/

 

Source: Starburst