# Alex Merced - Architecting an Apache Iceberg Lakehouse (Highlights)

![rw-book-cover|256](https://readwise-assets.s3.amazonaws.com/media/uploaded_book_covers/profile_155788/41452166-3787-4a3d-88c1-fb453b6e96f9.jpg)

## Metadata
**Review**:: [readwise.io](https://readwise.io/bookreview/61555345)
**Source**:: #from/readwise #from/manning
**Zettel**:: #zettel/fleeting
**Status**:: #x
**Authors**:: [[Alex Merced]]
**Full Title**:: Architecting an Apache Iceberg Lakehouse
**Category**:: #books #readwise/books
**Category Icon**:: 📚
**Document Tags**:: #from/manning
**Highlighted**:: [[2026-06-22]]
**Created**:: [[2026-07-04]]

## Highlights

### 2 Apache Iceberg and the lakehouse
- Instead of storing everything in one monolithic metadata structure, Iceberg organizes table metadata into layers. ([View Highlight](https://livebook.manning.com/book/architecting-an-apache-iceberg-lakehouse/chapter-2/c20c36f08738bce41747c3d2acb55917)) ^1027799894
- Its snapshot-based design enables time travel so that queries can read historical versions of a table’s data for auditing, debugging, or recovery. ([View Highlight](https://livebook.manning.com/book/architecting-an-apache-iceberg-lakehouse/chapter-2/ff39f57ad5e814e7704b2d24fd562c50)) ^1027799895
- Dremio—Federated SQL and native query acceleration and reflections for fast BI performance on Iceberg tables ([View Highlight](https://livebook.manning.com/book/architecting-an-apache-iceberg-lakehouse/chapter-2/eb4e21ae6f50de4f425649c973c590cc)) ^1027799896
- Apache Iceberg changes this with time travel. It lets you query older table versions exactly as they were at a specific point in time. ([View Highlight](https://livebook.manning.com/book/architecting-an-apache-iceberg-lakehouse/chapter-2/ebfdd7068618922480c518d98e93d589)) ^1027799897
- A lakehouse catalog makes Iceberg queryable and manageable at scale. It acts as the interface between your storage layer and query engines, tracking tables and the locations of their metadata. ([View Highlight](https://livebook.manning.com/book/architecting-an-apache-iceberg-lakehouse/chapter-2/5c8bb6159984982b517e249f726cdfa5)) ^1027799898

### 3 Hands-on with Apache Iceberg
- Operational database—PostgreSQL (as the transactional system); Data lake storage layer—MinIO (S3-compatible object store); Ingestion and processing layer—Apache Spark; Catalog layer—Nessie (tracks Iceberg tables); Query engine—Dremio (runs fast SQL queries); BI and visualization—Apache Superset (dashboards) ([View Highlight](https://livebook.manning.com/book/architecting-an-apache-iceberg-lakehouse/chapter-3/0b80e68c34779b3232f86914f395514c)) ^1027799900
- We’ll use a PostgreSQL database as a simple, familiar source as we walk through a basic lakehouse workflow (see figure 3.1). We’ll extract data from PostgreSQL and use Apache Spark to write it to Apache Iceberg tables. Then we’ll curate those tables in Dremio and define virtual views to analyze sales trends by region, customer segment, and campaign performance. Apache Superset dashboards will then consume those views and provide interactive insights for marketing, sales, and merchandising teams. ([View Highlight](https://livebook.manning.com/book/architecting-an-apache-iceberg-lakehouse/chapter-3/0b80e68c34779b3232f86914f395514c)) ^1027799901

### 7 Implementing the catalog layer
- Apache Iceberg enables reliable and concurrent access to analytical datasets through an atomic, versioned metadata model. Each committed change to a table, such as adding data, removing rows, or evolving the schema, results in a new version of the table’s metadata that references the appropriate data and auxiliary files. These metadata versions are immutable and form a linear history of table states. ([View Highlight](https://livebook.manning.com/book/architecting-an-apache-iceberg-lakehouse/chapter-7/7a678bea86bf91b23679eace4ced12e1)) ^1027799903
- The catalog connects these metadata versions into a cohesive system (see figure 7.1). It acts as the coordination layer for the lakehouse, tracking Iceberg tables and determining which metadata version represents the table’s current state. ([View Highlight](https://livebook.manning.com/book/architecting-an-apache-iceberg-lakehouse/chapter-7/79fa23562ff679cc4e6b3c318eebe15f)) ^1027799904
- Table registration and discovery—The catalog tracks all registered Iceberg tables within a given namespace. When a user queries a table, the engine contacts the catalog to resolve the table’s name into a metadata location. ([View Highlight](https://livebook.manning.com/book/architecting-an-apache-iceberg-lakehouse/chapter-7/1bc47a817837abdcf31ad3c46420402a)) ^1027799905
- Rather than relying on a single fixed metadata root, the catalog serves as the arbiter of which metadata objects are associated with a given table identifier at any point in time, updating this association transactionally during commits. ([View Highlight](https://livebook.manning.com/book/architecting-an-apache-iceberg-lakehouse/chapter-7/e385222cc2d3c6de8cf648074254af69)) ^1027799906
- The catalog validates that the table state has not been modified by another concurrent operation in a conflicting way before accepting the update. If a conflict is detected, the commit is rejected, and the engine must retry using the latest table state. ([View Highlight](https://livebook.manning.com/book/architecting-an-apache-iceberg-lakehouse/chapter-7/f0d440efd609db837450b99fdef7046c)) ^1027799907
- Transaction coordination—During write operations, Iceberg uses an optimistic concurrency model coordinated by the catalog ([View Highlight](https://livebook.manning.com/book/architecting-an-apache-iceberg-lakehouse/chapter-7/f0d440efd609db837450b99fdef7046c)) ^1027799908
- At a minimum, your catalog should support namespacing and access control ([View Highlight](https://livebook.manning.com/book/architecting-an-apache-iceberg-lakehouse/chapter-7/6b193056314aa3deaa8a8ae2a9c54029)) ^1027799909
- Third, some catalogs implement credential vending, a security pattern where the catalog issues short-lived, scoped-down credentials to querying clients. This mechanism ensures that clients can access only the storage paths necessary for their query, reducing blast radius and simplifying auditing. ([View Highlight](https://livebook.manning.com/book/architecting-an-apache-iceberg-lakehouse/chapter-7/d5655bf57fb2bb9ed6bd541b68f94cce)) ^1027799910
- Catalogs should log metadata operations such as table creation, schema updates, and snapshot rollbacks. These logs help enforce internal controls and provide evidence during audits. Catalogs should also record which users or services initiated changes and from where they originated. ([View Highlight](https://livebook.manning.com/book/architecting-an-apache-iceberg-lakehouse/chapter-7/6bda2328d5c519ee1f82a17068cd63b7)) ^1027799911
- REST-based catalogs are especially attractive because they expose a consistent interface, allowing tools and services to communicate with the catalog ([View Highlight](https://livebook.manning.com/book/architecting-an-apache-iceberg-lakehouse/chapter-7/dca3abef1e791e4e80d42d4330627b84)) ^1027799912
    - Follow the REST standard when developing services similar to a catalog.
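
A minimal sketch of the commit flow described in the highlights above: the catalog accepts an update only if the table still points at the metadata version the writer started from, and a rejected writer retries from the latest state. `ToyCatalog`, `CommitConflict`, and `commit_with_retry` are hypothetical names for illustration, not part of Iceberg or any real catalog; an actual catalog performs this check server-side against durable storage.

```python
import threading
from dataclasses import dataclass, field


class CommitConflict(Exception):
    """Raised when the table pointer moved since the writer last read it."""


@dataclass
class ToyCatalog:
    # Hypothetical in-memory catalog: table identifier -> current metadata location.
    pointers: dict = field(default_factory=dict)
    _lock: threading.Lock = field(default_factory=threading.Lock, repr=False)

    def current(self, table):
        return self.pointers.get(table)

    def commit(self, table, expected, new):
        # Compare-and-swap: accept only if nobody else committed in between.
        with self._lock:
            if self.pointers.get(table) != expected:
                raise CommitConflict(table)
            self.pointers[table] = new


def commit_with_retry(catalog, table, build_metadata, max_attempts=3):
    """Re-read the latest state and retry when the catalog rejects a commit."""
    for _ in range(max_attempts):
        base = catalog.current(table)
        new = build_metadata(base)  # writer derives a new metadata version
        try:
            catalog.commit(table, base, new)
            return new
        except CommitConflict:
            continue  # another writer won; start again from the latest state
    raise CommitConflict(f"gave up on {table} after {max_attempts} attempts")
```

The writer never holds a lock while preparing its change; only the final pointer swap is serialized, which is what makes the model optimistic.
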
- In practice, throughput and concurrency tend to matter more than raw request latency for catalog interactions. ([View Highlight](https://livebook.manning.com/book/architecting-an-apache-iceberg-lakehouse/chapter-7/dfb5189bbea8bbca3b1e23735caafbc3)) ^1027799913
- Monitoring and observability also contribute to operational overhead. A catalog that provides clear metrics, structured logs, and integration with alerting systems allows teams to detect issues early and respond quickly. Catalogs that lack this visibility often require more manual troubleshooting, increasing support effort and time to resolution. ([View Highlight](https://livebook.manning.com/book/architecting-an-apache-iceberg-lakehouse/chapter-7/6289c07471456722888741fd07e0807a)) ^1027799914
- Documentation and community support are another part of the operational equation. Catalogs with mature documentation, active maintainers, and responsive forums can accelerate onboarding and reduce dependency on in-house experts. This support becomes especially important when custom integration or advanced tuning is needed. ([View Highlight](https://livebook.manning.com/book/architecting-an-apache-iceberg-lakehouse/chapter-7/319ad5967bf7f5120513aac6be6f4be5)) ^1027799915
- Balancing feature depth with operational simplicity is key. The most technically capable catalog may not be the right fit if it requires constant upkeep or exceeds your budget. Likewise, a lightweight catalog that integrates easily and runs efficiently may offer long-term value even if it lacks advanced features. ([View Highlight](https://livebook.manning.com/book/architecting-an-apache-iceberg-lakehouse/chapter-7/2ac6771880332d1d3f07d3bd4a3a30bc)) ^1027799916
- The lack of a standardized, server-side commit model made multi-engine interoperability fragile in practice. ([View Highlight](https://livebook.manning.com/book/architecting-an-apache-iceberg-lakehouse/chapter-7/1fbae088e4d5293729b0dc6a30efa0bb)) ^1027799917
- The introduction of the Apache Iceberg REST Catalog specification marked a major step forward. By defining a consistent HTTP-based interface for catalog operations, it decouples the client from the catalog implementation. ([View Highlight](https://livebook.manning.com/book/architecting-an-apache-iceberg-lakehouse/chapter-7/b3d7e2469ecc15462ec2afa38df1dba2)) ^1027799918
- Finally, several catalogs are provided and maintained by cloud and analytics vendors. These include AWS Glue Data Catalog, Dremio catalog (based on Apache Polaris), and the Snowflake Open Catalog (also Polaris-based). ([View Highlight](https://livebook.manning.com/book/architecting-an-apache-iceberg-lakehouse/chapter-7/0797904b9006253f38e1b846ca3333a2)) ^1027799919
- The Hadoop catalog is best viewed as a legacy or transitional option rather than a recommended choice for modern Iceberg deployments. While it may still appear in examples or historical setups, it lacks the scalability, correctness guarantees, and centralized coordination required for real-world production environments. ([View Highlight](https://livebook.manning.com/book/architecting-an-apache-iceberg-lakehouse/chapter-7/5767f059e90ea39f0b27f7341d8f0a33)) ^1027799920
- External catalogs allow Polaris to sync metadata from catalogs managed elsewhere (such as Glue or Dremio). These tables are read-only within Polaris, making this model useful for centralized visibility without assuming control of write operations. ([View Highlight](https://livebook.manning.com/book/architecting-an-apache-iceberg-lakehouse/chapter-7/c1da6f45d2b6500edcf75e08dd5c45e2)) ^1027799921
- Nessie operates as a coordination layer only; it does not manage table metadata or data files directly. Instead, it delegates metadata storage and table operations to Iceberg, while maintaining a consistent view of their evolution through its own version log. ([View Highlight](https://livebook.manning.com/book/architecting-an-apache-iceberg-lakehouse/chapter-7/eff734f77b335e5df0ee4a8c944bfd10)) ^1027799922
- One of Nessie’s distinguishing features is its support for branching and merging of data changes at the catalog level. Analytics jobs or ingestion pipelines can operate in isolation by writing to a dedicated branch. Once the job completes successfully, all its changes can be merged atomically into the mainline. This model avoids exposing incomplete data and provides powerful rollback and debugging capabilities. ([View Highlight](https://livebook.manning.com/book/architecting-an-apache-iceberg-lakehouse/chapter-7/53710cc38b3460959aee3b9d4ae59dd7)) ^1027799923
- Nessie enables transactions that span multiple tables, a traditionally difficult task in distributed data lake environments. ([View Highlight](https://livebook.manning.com/book/architecting-an-apache-iceberg-lakehouse/chapter-7/b27d1cffc987a9cb4b6730a07b53da53)) ^1027799924
- When used with Iceberg’s REST protocol, Nessie can push configuration and temporary credentials to client engines like Spark or Trino. ([View Highlight](https://livebook.manning.com/book/architecting-an-apache-iceberg-lakehouse/chapter-7/dc6dae6ac6721f8240a6fa335e4d516c)) ^1027799925
- Lakekeeper is a lightweight, secure, and extensible implementation of the Apache Iceberg REST Catalog specification. Built in Rust, Lakekeeper offers a fast, JVM-free deployment option for Iceberg environments. ([View Highlight](https://livebook.manning.com/book/architecting-an-apache-iceberg-lakehouse/chapter-7/92be89a1fb2e84f9381ac0fbf4ebf749)) ^1027799926

### 8 Designing the federation layer
- As the demand for federated querying grows across industries, two lakehouse-focused platforms stand out for federated queries: Dremio and Trino. Both engines provide high-performance, distributed SQL querying across diverse data sources, but they have distinct architectures, features, and use cases. ([View Highlight](https://livebook.manning.com/book/architecting-an-apache-iceberg-lakehouse/chapter-8/ae2bdb6464e22bce39f57440383b3d3b)) ^1027799928

### 11 Operationalizing Apache Iceberg
- At scale, operational integrity relies on automation: scheduled jobs, metadata-aware triggers, and table-specific logic that ensure the lakehouse remains healthy without constant human oversight. Orchestration is the discipline that connects all these pieces, making maintenance workflows reliable, repeatable, and responsive to changing data conditions. ([View Highlight](https://livebook.manning.com/book/architecting-an-apache-iceberg-lakehouse/chapter-11/70a9c93c928e39378a3748f215272e92)) ^1027799930
- Orchestration tools such as Apache Airflow, dbt (data build tool), Dagster, or custom workflow engines allow teams to manage these procedures in a structured, programmatic way. ([View Highlight](https://livebook.manning.com/book/architecting-an-apache-iceberg-lakehouse/chapter-11/3e74b22a66687c75bed97708214876f4)) ^1027799931
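
One way to picture the "metadata-aware triggers" and "table-specific logic" from the chapter 11 highlights is a small planning function that an orchestrator task could call before scheduling maintenance. This is only a sketch under assumptions: `TableStats`, `Policy`, and `plan_maintenance` are hypothetical names, the thresholds are made up, and the returned strings are plain labels (chosen to echo Iceberg's `rewrite_data_files` and `expire_snapshots` maintenance procedures) that the sketch never executes against an engine.

```python
from dataclasses import dataclass


@dataclass
class TableStats:
    # Hypothetical summary a scheduler might collect from table metadata.
    name: str
    data_files: int
    avg_file_mb: float
    snapshots: int


@dataclass
class Policy:
    # Hypothetical per-table thresholds; tune them for the workload.
    min_avg_file_mb: float = 64.0
    min_files_to_compact: int = 50
    max_snapshots: int = 100


def plan_maintenance(stats, policy=None):
    """Return labels for the maintenance tasks a table currently needs."""
    policy = policy or Policy()
    tasks = []
    many_small_files = (
        stats.data_files >= policy.min_files_to_compact
        and stats.avg_file_mb < policy.min_avg_file_mb
    )
    if many_small_files:
        tasks.append("rewrite_data_files")  # compaction is worth running
    if stats.snapshots > policy.max_snapshots:
        tasks.append("expire_snapshots")  # snapshot history has grown too long
    return tasks
```

Passing a different `Policy` per table is where the table-specific logic would live; an orchestrator such as Airflow or Dagster would then only schedule the tasks this function returns.
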