Introduction
In Part 2, we laid the groundwork for our budget-friendly data lake on DigitalOcean, using Apache Iceberg as the data layer, and showed that DigitalOcean Spaces can serve as low-cost storage for an Iceberg-based data lake. However, we feel the true value of Iceberg lies beyond its storage capabilities. The reality is that in pretty much every data lakehouse, storage isn't the cost bottleneck; compute is.
Our hope for Iceberg is to enable a mix-and-match approach to compute. By decoupling storage from a single dedicated compute platform, Iceberg lets us choose the most efficient and cost-effective engine for each data processing task. We can perform simple transformations and queries using serverless functions or open-source engines, while leveraging the power and integrations of Snowflake for more complex analytics, BI, and visualization.
This final article details our experiment with a hybrid compute model on DigitalOcean. Our primary goal was to use DigitalOcean Functions (their serverless offering, comparable to AWS Lambda) for lightweight processing and as a bridge to their GenAI Platform. At the same time, we aimed to connect our Snowflake Partner Account to the Iceberg catalog and query data directly from Snowflake, validating our vision of a unified data lake accessible through diverse compute engines.
The Price of Compute
Let's take Snowflake as an example. With its powerful query engine and incredible ease of development, it's a popular choice for data warehousing and analytics, and for good reason. However, it's also notorious for getting expensive quickly. The core idea behind Snowflake's compute pricing is that you pay for the time its warehouses (compute engines) are running and performing work. While there are many ways to optimize your consumption on Snowflake and common pitfalls you can avoid, it's difficult to escape the cost of automated tasks such as data cleaning and transformation, each of which requires an active warehouse in frequent increments that add up over time.
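To make "add up" concrete, here's some rough back-of-the-envelope math. The assumptions (an X-Small warehouse consuming one credit per hour, credits at roughly $2-4 each depending on edition and region) are ours, not a quote from any price list.

```python
# Back-of-the-envelope cost of one small recurring task on an X-Small warehouse.
# Assumes 1 credit/hour for the warehouse and ~$2-4 per credit; your rates may differ.
minutes_per_run = 2                     # a modest cleaning/transformation job
runs_per_hour = 4                       # scheduled every 15 minutes
credits_per_month = (minutes_per_run * runs_per_hour / 60) * 24 * 30
print(credits_per_month)                # ~96 credits, i.e. roughly $200-400/month
```

One unremarkable task can quietly become a few hundred dollars a month, and most pipelines run many such tasks.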
This is where smaller organizations, startups, and non-profits can find themselves struggling to keep up with costs as their data needs grow. But what if there were a way to offload those routine tasks to in-house compute instead? Our goal is to minimize reliance on Snowflake's compute for these tasks, reserving its power for what it does best: high-performance analytical queries and reporting.
Hybrid Compute in This Experiment
For this experiment we kept it simple, looking to use just two types of compute:
- DigitalOcean Functions: We envision using Functions for on-demand processing like simple data transformations, scheduled tasks, and potentially offloading compute from orchestration tools like Mage AI or Apache Airflow. For the purposes of this experiment, however, we simply aimed to leverage the "Function Routing" feature of DigitalOcean's GenAI Platform, using Functions as tools for an agentic AI to query the data lake directly.
- Snowflake: As discussed above, we envision using Snowflake for advanced analytical queries and BI/visualization work. In this experiment, however, we would be happy just to see it connect to our Lakekeeper instance at all and refresh or query data directly, with no need for ETL pipelines.
Implementation and Challenges
Implementing this strategy ultimately presented several challenges and limitations, particularly around Functions and the GenAI Platform on DigitalOcean.
Connecting Snowflake to Iceberg
Integrating Snowflake with our Iceberg catalog on DigitalOcean was a surprisingly smooth process. We initially had concerns that it might not be a simple integration, given that Snowflake itself runs only on AWS, GCP, and Azure. However, when we attempted to connect to Lakekeeper, it worked on the first try.
We used Snowflake's Iceberg Tables feature, coupled with its External REST Catalog integration, to connect directly to our Lakekeeper-managed catalog. This allowed us to query data residing in DigitalOcean Spaces as if it were stored directly within Snowflake, which we feel has huge implications: no ingestion pipelines and no duplicate copies of the data.
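For those who want to try it, the setup boils down to two statements: a catalog integration pointing at Lakekeeper's REST endpoint, and an Iceberg table bound to it. The sketch below runs them through the Snowflake Python connector, but the statements themselves are just SQL. All names and URLs are placeholders, it assumes an S3-compatible external volume for our Spaces bucket already exists, and the exact REST_CONFIG parameters may vary with your Lakekeeper setup.

```python
# Rough sketch: wiring Snowflake to a Lakekeeper REST catalog.
# All identifiers/URLs are placeholders; the external volume for DigitalOcean
# Spaces (DO_SPACES_VOL) is assumed to have been created already.
import snowflake.connector

CREATE_CATALOG_INTEGRATION = """
CREATE CATALOG INTEGRATION LAKEKEEPER_INT
  CATALOG_SOURCE = ICEBERG_REST
  TABLE_FORMAT = ICEBERG
  CATALOG_NAMESPACE = 'demo'
  REST_CONFIG = (
    CATALOG_URI = 'https://<your-lakekeeper-host>/catalog'
  )
  REST_AUTHENTICATION = (
    TYPE = BEARER
    BEARER_TOKEN = '<token>'  -- see the caveat on token refresh below
  )
  ENABLED = TRUE
"""

CREATE_ICEBERG_TABLE = """
CREATE ICEBERG TABLE TAXI_TRIPS
  EXTERNAL_VOLUME = 'DO_SPACES_VOL'
  CATALOG = 'LAKEKEEPER_INT'
  CATALOG_TABLE_NAME = 'taxi_trips'
"""

conn = snowflake.connector.connect(
    account="<account>", user="<user>", password="<password>",
    role="SYSADMIN", warehouse="COMPUTE_WH", database="ICEBERG_DB", schema="DEMO",
)
with conn.cursor() as cur:
    cur.execute(CREATE_CATALOG_INTEGRATION)
    cur.execute(CREATE_ICEBERG_TABLE)
    cur.execute("SELECT COUNT(*) FROM TAXI_TRIPS")
    print(cur.fetchone())
conn.close()
```

Once the table exists, it behaves like any other Snowflake table from the query side, while the data itself never leaves DigitalOcean Spaces.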


However, we did identify one area for further investigation in a production setting: authentication token management. Snowflake asks for a bearer token when you create the external catalog integration and seems to expect that token to remain valid forever. You can't update or replace the integration while an Iceberg table is using it, so it's unclear for now how we would implement any kind of token refresh pattern. This merits future investigation, but it didn't hinder the completion of our experiment, so we pressed on with temporary tokens.
DigitalOcean Serverless Functions
Our plan to use DigitalOcean Functions for data processing and AI integration hit a significant roadblock early on: there is currently a 48MB build size limit on Functions. This constraint rules out the usual data processing libraries such as polars, duckdb, and daft, but most importantly pyiceberg itself, which depends on the much larger pyarrow library. As a result, serverless Functions on DigitalOcean can't do much more for us than call other APIs.
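To make the constraint concrete, the kind of Function body we had in mind was only a handful of pyiceberg lines, sketched below with placeholder names and endpoints, yet the pyarrow dependency alone pushes the bundle well past 48MB.

```python
# What we wanted a Function to do: read an Iceberg table straight from the
# Lakekeeper catalog with pyiceberg. The pyarrow dependency makes the bundle
# far larger than 48MB, so this can't currently ship as a DO Function.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "lakekeeper",                                   # placeholder catalog name
    uri="https://<your-lakekeeper-host>/catalog",   # placeholder endpoint
    warehouse="<lakekeeper-warehouse>",
    token="<bearer-token>",
)

table = catalog.load_table("demo.taxi_trips")       # placeholder namespace.table
arrow_table = table.scan().to_arrow()               # this is where pyarrow comes in
print(arrow_table.num_rows)
```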
This limitation put a significant bump in the road for the model as we had initially envisioned it. As things currently stand, running any kind of data processing on DigitalOcean means standing up some kind of server first. We discuss possible workarounds in a later section.
Function Routing in the GenAI Platform
We initially intended to use DigitalOcean's GenAI Platform, specifically its Function Routing capability, to enable AI interactions with our data. The idea was to provide Functions as tools for an AI agent, allowing it to query and manipulate data in our Iceberg tables.
First and foremost, we had to contend with the 48MB size limit. It meant our AI agent would be unable to query the data lake directly; instead, we would need to call some kind of API. Given the nature of this project as an experiment, writing and deploying an entire API just to run a one-off query against our data lake felt out of scope, especially since the actual goal at this point was to test the GenAI Platform's capabilities.
As a temporary workaround for this experiment only, we created a service user within our Snowflake Partner Account and granted it a role with access to our Iceberg tables. This let us query Iceberg through Snowflake, the shortest path to a Function that could feed data to the AI agent. As a bonus, it doubled as a test of querying Iceberg from Snowflake, and it ended up raising an important flag for us later.
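For reference, the tool Function ended up looking roughly like the sketch below. This is a simplified illustration rather than our exact code: the account, role, warehouse, and table names are placeholders, and it assumes the plain snowflake-connector-python client (without the pandas/pyarrow extras) fits under the 48MB bundle limit.

```python
# Sketch of the DigitalOcean Function exposed as a GenAI tool: it runs a
# read-only query through the Snowflake service user and returns rows for the
# agent. All identifiers and environment variable names are placeholders.
import os
import snowflake.connector


def main(args):
    # DO Functions use an OpenWhisk-style main() entry point; invocation
    # parameters arrive as a dict (exact signature may vary by runtime version).
    limit = int(args.get("limit", 10))

    conn = snowflake.connector.connect(
        account=os.environ["SNOWFLAKE_ACCOUNT"],
        user=os.environ["SNOWFLAKE_USER"],       # the service user
        password=os.environ["SNOWFLAKE_PASSWORD"],
        role="GENAI_READONLY",                   # placeholder read-only role
        warehouse="GENAI_XS_WH",                 # placeholder X-Small warehouse
    )
    try:
        with conn.cursor(snowflake.connector.DictCursor) as cur:
            cur.execute(
                "SELECT * FROM ICEBERG_DB.DEMO.TAXI_TRIPS LIMIT %s", (limit,)
            )
            rows = cur.fetchall()
    finally:
        conn.close()

    # Return a JSON-friendly body for the agent; non-serializable values
    # (timestamps, decimals) are stringified for simplicity.
    return {"body": {"rows": [{k: str(v) for k, v in row.items()} for row in rows]}}
```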

The GenAI Platform is a very new DigitalOcean feature, having only recently entered general availability, so running into challenges and limitations is expected. In our case, we quickly discovered that we were unable to get it to reliably execute our Function. There is also currently a lack of debugging tools and visibility into the AI's processing, which made the issue difficult to troubleshoot. We could tell for certain that the Function wasn't being called because running it manually makes the query appear in our Snowflake Query History, whereas asking the AI questions about our data, even when it attempted an answer, produced no query in the history. That told us definitively that the Function was never executed.
Potential Workarounds: Addressing the Limitations
While our experiment highlighted the current limitations of DigitalOcean Functions and the GenAI Platform, there are several potential workarounds we could consider for a production implementation:
- FastAPI Server: We could deploy a lightweight FastAPI server (or similar) with endpoints that handle the data processing, and give the AI agent Functions that simply call those endpoints, though this feels like an unnecessary middleman. A minimal sketch of this option follows the list.
- Alternative Compute Engines: We could explore other open-source query engines like Trino, which would offer more features and flexibility for handling querying, data transformations, etc. Functions used by the AI agent could trigger actions in Trino instead. However, many of these engines come with significant infrastructure overhead, potentially negating the cost-effectiveness we are seeking.
- Leveraging GPU Droplets or External Services: For AI workloads to replace the GenAI Platform, we could utilize DigitalOcean's GPU Droplets or services like Paperspace to run open-source models (such as the new DeepSeek model everyone is talking about).
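To illustrate the FastAPI option, a minimal version of that middleman might look like the sketch below: the heavy pyiceberg/pyarrow stack lives in a small always-on service (a Droplet or App Platform instance), and the Function the agent calls shrinks to a thin HTTP request. The endpoint shape, names, and URLs here are purely illustrative.

```python
# Hypothetical FastAPI "query service": the heavy pyiceberg/pyarrow stack lives
# here on a small Droplet or App Platform instance, and the Function the agent
# calls becomes a thin HTTP client. All names and endpoints are illustrative.
from fastapi import FastAPI
from pyiceberg.catalog import load_catalog

app = FastAPI()

catalog = load_catalog(
    "lakekeeper",
    uri="https://<your-lakekeeper-host>/catalog",
    warehouse="<lakekeeper-warehouse>",
    token="<bearer-token>",
)


@app.get("/tables/{namespace}/{name}/preview")
def preview_table(namespace: str, name: str, limit: int = 20) -> dict:
    # Load the table from the REST catalog and return a small preview.
    table = catalog.load_table(f"{namespace}.{name}")
    arrow_table = table.scan().to_arrow()
    return {"rows": arrow_table.slice(0, limit).to_pylist()}
```

The Function the agent calls would then need nothing heavier than an HTTP client, which sits comfortably under the 48MB limit.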
Ideally, however, these limitations would be addressed by DigitalOcean itself. We feel they will hinder anyone looking to build production-grade AI or data engineering projects on the platform.
Results and Conclusion
Despite the challenges, our experiment yielded valuable insights and demonstrated the potential of our core concept:
- Iceberg for Storage: We proved that S3-compatible services like DigitalOcean Spaces can effectively serve as the storage layer for an Iceberg-based data lake, opening an alternative path to mixing and matching compute engines and cutting significant costs from your data strategy.
- Snowflake and Iceberg: Snowflake's Iceberg Tables, combined with its External REST Catalog integration, connect seamlessly to any Iceberg REST catalog implementation, even outside Snowflake's native clouds of AWS, GCP, and Azure.
- Hybrid Compute: While our initial vision of serverless compute is currently blocked within DigitalOcean, we still believe the experiment confirmed that a hybrid approach is possible. Combining different compute resources should be a viable strategy for optimizing costs and performance.
The Verdict: A Promising but Early Solution
We feel our experiment demonstrates that building a budget-friendly data lake using Apache Iceberg on DigitalOcean, integrated with Snowflake, is a viable and promising approach. However, the current limitations of DigitalOcean's serverless and AI offerings prevent us from fully realizing our initial vision on their platform at the time of writing. That said, as mentioned above, we are optimistic this will change sooner rather than later. We are thrilled to be partnered with a company like DigitalOcean that we can discuss these limitations with directly; that's certainly not something we would expect from other cloud providers.
Despite these limitations, we believe this architecture offers a compelling path forward for organizations seeking to implement data strategies that reduce costs and prevent being locked into a single provider. The ability to mix and match compute, combined with the integration between Iceberg and Snowflake, provides a flexible and scalable foundation for data management and analytics.
Our next steps involve refining our implementation: testing orchestration tools other than Airbyte (such as Mage AI), comparing query performance in Snowflake between Iceberg tables and native tables, and experimenting with tools like Paperspace to run open-source models directly.
When we feel confident enough in our implementation, we also plan to develop and release an open-source StarterKit that lets anyone simply run terraform apply to deploy their own version of this architecture, contributing a blueprint for a budget-friendly data lake to the open-source community and to businesses of any size.