Introduction
This section of our case study delves into the implementation of Apache Iceberg as the core storage layer for our budget-friendly data lake architecture on DigitalOcean. Building upon the vision outlined in Part 1, we detail our journey to establish a functional Iceberg catalog that is connected to DigitalOcean Spaces (their S3-compatible object storage) and capable of supporting data ingestion from various sources via Airbyte. This phase was crucial to validating our hypothesis that a cost-effective, Iceberg-centric data lake could be built on DigitalOcean's infrastructure.
Beyond the Tip of the Iceberg
Before diving into the implementation details, let's take a closer look at what Apache Iceberg actually is and how it works under the hood. While it boasts many advanced features, at its core, Iceberg provides a way to manage and access large datasets stored as files in object storage, in a manner similar to a traditional database.
Iceberg's Advantages
Apache Iceberg is an open table format that is making waves in data lake management. It overcomes limitations of older approaches like Hive tables or directories of raw data files, offering the following features:
- ACID Transactions: Iceberg guarantees data consistency with ACID transaction support, similar to databases. This ensures reliable updates, even during concurrent writes, using optimistic concurrency to prevent conflicts.
- Time Travel: By tracking table snapshots, Iceberg enables querying data as it existed at any past point in time. This is crucial for audits, debugging, and historical analysis.
- Schema and Partition Evolution: Iceberg allows schema changes (adding, renaming columns) and partition scheme modifications without requiring data rewrites. This provides flexibility to adapt to evolving needs.
- Hidden Partitioning: Users can query data efficiently without needing to know the underlying partitioning scheme. Iceberg automatically filters data based on partitions, simplifying queries and improving performance.
- Enhanced Performance: Iceberg avoids expensive file-listing operations by tracking individual data files in its metadata. Combined with metadata-level pruning and the ability to avoid object store throttling by not requiring a rigid directory structure, query planning and execution is sped up significantly, especially at scale.
- Openness and Interoperability: Being an open-source project, Iceberg ensures ongoing development, prevents vendor lock-in, and supports a wide range of compute engines (Spark, Trino, Flink, etc.), giving users more control and flexibility.
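As a small taste of what this looks like in practice, here is a minimal sketch of a schema change using PyIceberg, a Python client that appears later in this post. The catalog name, table name, and columns are hypothetical, and the catalog connection details are assumed to come from a pyiceberg.yaml config file:

```python
# Minimal sketch (not production code) of schema evolution with PyIceberg.
# Catalog and table names are hypothetical; connection details are assumed to
# live in a pyiceberg.yaml config file.
from pyiceberg.catalog import load_catalog
from pyiceberg.types import StringType

catalog = load_catalog("lake")                   # hypothetical catalog name
table = catalog.load_table("analytics.trades")   # hypothetical namespace.table

# Add and rename columns without rewriting any data files; Iceberg only
# commits new metadata describing the updated schema.
with table.update_schema() as update:
    update.add_column("exchange", StringType())
    update.rename_column("ticker", "symbol")
```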
Ok That's Great, But What is Iceberg Really?
Let's try to explain how Iceberg works in simpler terms. We found that many existing blog posts are either too high-level or too low-level in their explanations, and it took some research for us to understand what Iceberg actually is.
At its core, Apache Iceberg is an open table format (an open source way to store data represented as tables) that provides a way to organize and manage large datasets stored as files in cloud storage like Amazon S3 (or DigitalOcean Spaces in our case). Instead of dealing directly with individual data files, you interact with Iceberg tables through a catalog.
Here is a diagram from the Apache Iceberg website:

In short, Iceberg is simply files stored in a storage bucket like S3, with metadata tracking changes.
Let's break down how it works:
- Data Files: Your actual data resides in files in formats like Parquet, ORC, or Avro within your cloud storage bucket. These files are immutable, meaning they are never modified in place.
- Iceberg Catalog: This is a service that acts as the central registry for your Iceberg tables. It stores a pointer to the current metadata file for each table. Common catalog implementations include REST-based services (like Polaris, Nessie, or Lakekeeper), Hive Metastore, or even a relational database via JDBC.
- Metadata Layer: This layer is the heart of Iceberg. It uses a tiered structure of metadata files to track the data files that comprise a table:
- Table Metadata File: This file contains the current schema of the table, the partitioning scheme, and a list of snapshots.
- Snapshot: A snapshot represents the state of a table at a specific point in time. Each snapshot points to a manifest list.
- Manifest List: This file lists all the manifest files associated with a particular snapshot.
- Manifest File: Each manifest file contains a list of data files and delete files that belong to a subset of the snapshot, along with some statistics about those files (like column bounds).
- Writing Data: When you write data to an Iceberg table:
- New data files are created in your storage bucket.
- A new set of manifest files are generated, listing these new data files.
- A new manifest list is created pointing to those manifests.
- A new table metadata file is created containing the new snapshot with the new manifest list.
- Finally, the Iceberg catalog is updated to point to this new table metadata file. This is an atomic operation, ensuring consistency.
- Reading Data: When you query an Iceberg table:
- The query engine (like Spark, Trino, or Snowflake) contacts the Iceberg catalog to get the location of the current table metadata file.
- Most catalogs let you connect to an identity provider to authenticate access and provide role-based access policies.
- The catalog then reads the metadata file to find the relevant snapshot.
- The engine follows the chain through the manifest list and manifest files to identify the specific data files it needs to read, based on the query.
- The catalog then generates temporary credentials (either a pre-signed URL or an STS token, depending on what your storage provider offers) that grant access to only the files needed for the query.
- The engine then reads only the necessary data files directly from cloud storage, using the access provided by the catalog, to execute the query.
Iceberg maintains an immutable, versioned history of the table's state through its metadata files. Each change to the table creates a new snapshot, without modifying existing data or metadata files. The catalog ensures that all readers have a consistent view of the table. This approach enables concurrent writes, efficient data access, and features that we discussed earlier like time travel and schema evolution. It does all this by simply keeping track of files in object storage, and providing a way to access them securely and efficiently.
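To make the write and read paths above concrete, here is a minimal PyIceberg sketch. The catalog endpoint, warehouse, namespace, and table names are all placeholders, authentication properties are omitted, and exact configuration keys vary by catalog and PyIceberg version:

```python
# Minimal sketch of the write and read paths described above, using PyIceberg.
# All names and endpoints are placeholders; auth-related properties are omitted.
import pyarrow as pa
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "lake",
    **{
        "uri": "https://catalog.example.com/catalog",  # REST catalog endpoint (placeholder)
        "warehouse": "demo",                           # warehouse name (placeholder)
    },
)

rows = pa.table({"symbol": ["AAPL", "GOOG"], "price": [190.5, 172.3]})

# Write path: each commit writes new immutable data files, new manifests, a new
# manifest list, and a new table metadata file, then atomically swaps the
# catalog pointer to that new metadata file.
catalog.create_namespace("demo")                                # assumes a fresh catalog
table = catalog.create_table("demo.prices", schema=rows.schema)
table.append(rows)

# Read path: the scan resolves current metadata -> snapshot -> manifest list ->
# manifests -> data files, then reads only the needed files from object storage.
print(table.scan().to_arrow())

# Every commit is a snapshot; pass snapshot_id to scan() for time travel.
print([s.snapshot_id for s in table.metadata.snapshots])
```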
The Journey to a Functional Iceberg Catalog
Our journey to a functional Iceberg catalog on DigitalOcean was marked by a series of trials and tribulations, but in the end we found our way to something that works. We detail that journey here, as the lessons learned may help others seeking to replicate or build upon our work, particularly those focused on making data lake technology accessible to smaller organizations like ours.
Attempt #1: Snowflake's Polaris
Our initial inclination was to use Polaris, Snowflake's open-source implementation of an Iceberg catalog. Given our intention to integrate with Snowflake, it seemed like the natural choice. It also appears, at first glance, to be the most mature (relatively speaking) and most commonly used catalog. However, we quickly discovered a critical limitation: Polaris, at the time of this experiment in January 2025, was only compatible with AWS, GCP, and Azure. While DigitalOcean Spaces offers S3 compatibility, Polaris did not yet natively support it. There is, however, an open pull request to address this limitation.
While we are enthusiastic about Polaris and Snowflake's entry into open source, we unfortunately had to pivot our choice of catalog for this experiment due to the current incompatibility.
This early hurdle also highlighted a broader challenge in working with newer technologies like Iceberg. The ecosystem, while rapidly evolving, still has significant gaps in support for less-common cloud providers, which can impact smaller organizations who may favor these providers for their cost-effectiveness.
Attempt #2: Nessie and Zitadel
Our next attempt involved Nessie, another popular open-source Iceberg catalog. Nessie's promise of a Git-like experience for data resonated with our version control-focused development practices. However, we encountered a new set of challenges, primarily centered around authentication and integration with a third-party identity provider (IdP).
DigitalOcean, unlike AWS or Azure, doesn't have a native IdP like AWS's Cognito. This meant we needed to find an alternative solution for authenticating with Nessie, preferably also open source. Our initial exploration led us to Zitadel, an open-source identity management platform. While initially promising, Zitadel proved somewhat slow and bulky, and its interface confusing - at least in the open-source version. No matter what we tried, we continuously got a vague "invalid_client" error, even when hitting the API directly with cURL commands taken straight from their docs.
We instead shifted our attention to Keycloak, another open-source IdP that we saw referenced in many places online. This proved to be a turning point. Keycloak's interface was much more straightforward, and we were able to get tokens without issue. After more wrestling with Nessie's vexing configuration options, we were finally able to use Keycloak as the identity provider for Nessie.
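Getting a token from Keycloak for a service client boils down to a standard OAuth2 client-credentials request against its token endpoint. Here is a minimal sketch of that request in Python; the realm, client id, and secret are placeholders, and the endpoint path follows recent Keycloak versions (older versions prefix it with /auth):

```python
# Minimal sketch: fetching a service-account token from Keycloak with the
# standard OAuth2 client-credentials grant. Realm, client id, and secret are
# placeholders; recent Keycloak versions use this endpoint layout (older ones
# prefix the path with /auth).
import requests

KEYCLOAK_URL = "https://keycloak.example.com"
REALM = "iceberg"   # placeholder realm
TOKEN_URL = f"{KEYCLOAK_URL}/realms/{REALM}/protocol/openid-connect/token"

resp = requests.post(
    TOKEN_URL,
    data={
        "grant_type": "client_credentials",
        "client_id": "nessie-client",      # placeholder client configured in Keycloak
        "client_secret": "CHANGE_ME",
    },
    timeout=10,
)
resp.raise_for_status()
access_token = resp.json()["access_token"]  # presented to the catalog as a Bearer token
```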

We did observe, however, that Keycloak can be somewhat CPU-intensive. We experienced a few instances where it consumed significant resources within our Kubernetes cluster, leading to node instability until we fine-tuned its resource requests. This highlighted a potential need for optimization and careful resourcing in future deployments, particularly for organizations operating with constrained resources. For context, our Kubernetes cluster is currently quite small, so it's not entirely surprising that a more intensive application strained it.
Despite our success with authentication, we encountered further obstacles when attempting to create tables within Nessie. We consistently received errors related to missing S3 keys, even though we had meticulously followed the documentation and configured the keys in the appropriate locations that matched their examples exactly.
At this point, we began to truly appreciate the youth of the Iceberg ecosystem. The available documentation and community support, while growing, were not as extensive as those for more established technologies. We decided to explore one final option: Lakekeeper.
Attempt #3: Lakekeeper
Lakekeeper, a newer but promising Iceberg catalog built in Rust, caught our attention due to its simplicity, performance claims, and active developer community. This time, our efforts were rewarded with a much smoother experience.
We were able to get Lakekeeper up and running on DigitalOcean with minimal friction. Authentication with Keycloak worked on the first try, and we were able to create tables and access the Lakekeeper UI. This was a significant breakthrough, demonstrating that a functional Iceberg catalog could indeed be deployed on DigitalOcean. However, our success was short-lived: when we attempted to write data to Iceberg, we ran into two different errors with two different tools:
- Airbyte: We wrote a Terraform script to deploy Airbyte to a Droplet, then configured it to ingest stock data from Yahoo and write it to our Lakekeeper-managed Iceberg catalog. This resulted in permission errors, seemingly stemming from Spark, which the Airbyte Iceberg destination appears to use under the hood.
- PyIceberg: We attempted to write data using the PyIceberg library, using the yellow taxi data example from their getting started documentation. This resulted in errors related to S3 headers not being properly signed or a PutRequest missing headers.
These errors were particularly frustrating, as we were so damn close. It was at this point that we turned to the Lakekeeper community on Discord for assistance, where the developers are pleasantly active and responsive. Their insights proved invaluable in resolving these final hurdles:
- Airbyte/Spark Error: The Lakekeeper developers quickly identified that we needed to enable "key-path-access", a setting not exposed in the UI at the time. They indicated this would be resolved shortly in an upcoming release.
- PyIceberg Error: This issue turned out to be more deeply rooted and outside Lakekeeper's control. After extensive debugging and tracing through the PyIceberg source code, we at Polar Labs identified a problem in botocore, a core library used by many AWS-related tools, including s3fs, which PyIceberg relies on deep under the hood. We then found a corresponding issue on the botocore GitHub repository confirming this was a known problem affecting interactions with certain S3-compatible storage services. Downgrading botocore to the previous minor version fixed the issue for the purposes of this experiment. The issue has since been closed with additional settings for S3-compatible storage, but it's unclear how those settings get applied many layers deep within PyIceberg. A problem for another day.
With these fixes in place, we finally achieved our goal:
- Airbyte successfully wrote data to our Iceberg tables managed by Lakekeeper.
- PyIceberg could both write and read data from our Iceberg tables.
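For illustration, a PyIceberg client pointed at this kind of setup (a Lakekeeper REST catalog, Keycloak for auth, and DigitalOcean Spaces for storage) looks roughly like the sketch below. Every value is a placeholder, and the exact property names depend on your PyIceberg version and on whether your catalog vends storage credentials, so treat it as a starting point rather than our exact configuration:

```python
# Sketch only: a PyIceberg REST catalog client for a Lakekeeper-style setup on
# DigitalOcean Spaces. All values are placeholders; property names may differ
# across PyIceberg versions and catalog implementations.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "lakekeeper",
    **{
        "uri": "https://lakekeeper.example.com/catalog",   # REST catalog endpoint (placeholder)
        "warehouse": "demo-warehouse",                      # warehouse registered in the catalog
        # OAuth2 client credentials issued by Keycloak (placeholder values).
        "credential": "pyiceberg-client:CHANGE_ME",
        "oauth2-server-uri": "https://keycloak.example.com/realms/iceberg/protocol/openid-connect/token",
        # If the catalog cannot vend temporary storage credentials (DigitalOcean
        # Spaces has no STS), point the S3 FileIO at Spaces with static keys.
        "s3.endpoint": "https://nyc3.digitaloceanspaces.com",
        "s3.access-key-id": "SPACES_ACCESS_KEY",
        "s3.secret-access-key": "SPACES_SECRET_KEY",
        "s3.region": "nyc3",                                # placeholder region for the S3 client
    },
)

print(catalog.list_namespaces())   # quick connectivity check
```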

Key Learnings and Reflections
This journey to a functional Iceberg catalog on DigitalOcean yielded several crucial learnings:
- The Iceberg ecosystem is still maturing: While powerful, Iceberg and its surrounding tooling are still under active development. Documentation can be sparse, and compatibility with less common cloud providers often requires workarounds.
- Community support is vital: The active and helpful communities around projects like Lakekeeper are invaluable resources for troubleshooting and navigating the complexities of emerging technologies.
- Identity management is crucial: An identity management solution like Keycloak is essential for securing access to the Iceberg catalog, though we learned it's important to keep an eye on its resource usage in smaller environments. For orgs with more resources, a managed solution like Auth0 is likely a better choice.
- DigitalOcean is a viable platform for Iceberg storage: We successfully implemented the storage component of our data lake architecture on DigitalOcean. However, it is worth noting that larger cloud providers like AWS may be a more suitable choice for larger orgs. For example, DigitalOcean doesn't currently support STS on its buckets, while AWS not only offers STS but also recently introduced a managed Iceberg implementation called S3 Tables.
Conclusion
By the end of this phase, we had a solid foundation for the storage layer of our data lake:
- Terraform/Ansible Scripts: A repository containing Infrastructure as Code (IaC) scripts to automatically provision and configure all necessary components, including Airbyte, Lakekeeper, and supporting services on DigitalOcean. We also used our Helm StarterKit to deploy Keycloak into our Kubernetes cluster so we can use it for future endeavors.
- Lakekeeper on DigitalOcean: A running instance of Lakekeeper, connected to our DigitalOcean Spaces bucket, and managing our Iceberg tables.
- Data Ingestion: Airbyte successfully ingesting data and writing it to Iceberg.
- Data Access: PyIceberg successfully writing and reading data from Iceberg.
- Keycloak Integration: Keycloak providing secure authentication for our Iceberg catalog.
While this was a major milestone, our journey is not yet complete. We have at this point proven that we can store and manage data using Iceberg on DigitalOcean. However, to fully realize our vision of a budget-friendly data lake, we still need to validate the compute and AI integration aspects. In Part 3, we explore our ideas to leverage a hybrid compute approach that combines in-house resources and the analytical power of Snowflake.