
Introducing Multi-Catalog Sync in Apache XTable™ (Incubating): Unlocking Catalog Interoperability

7 min read · May 7, 2025

TL;DR: This blog introduces Multi-Catalog Sync in Apache XTable™ (Incubating), a new capability designed to extend interoperability beyond open table formats to the catalog layer. The blog explains the challenges around catalog interoperability and walks through an end-to-end example of syncing a Hudi table from HMS as an Iceberg table in Glue.

Open table formats like Apache Hudi, Apache Iceberg, and Delta Lake have fundamentally shifted how organizations approach data storage and management in data lakehouses. These formats play a crucial role in establishing an open and flexible data foundation, empowering enterprises to select compute engines best suited to their workloads and freeing them from the limitations of proprietary storage formats. Yet, achieving a truly open data architecture goes beyond simply adopting open table formats — it requires seamless interoperability across open table formats, catalogs, and compute engines.

Apache XTable™ (Incubating), open-sourced in late 2023, takes a major step toward this goal by addressing interoperability challenges at the table format layer. It enables users to translate a table’s metadata from one format to another — for instance, translating a table originally written in Hudi into Iceberg, or vice versa. At a fundamental level, Hudi, Iceberg, and Delta Lake share a common architectural pattern: a data layer (typically consisting of Parquet files) and a metadata layer that manages transactional and structural information. XTable uses these shared characteristics to make format interoperability possible without rewriting the underlying data files.


Interoperability in the Catalog Layer

While solutions like XTable have enabled storage format interoperability, the data catalog layer is quickly emerging as the next bottleneck in achieving a truly open lakehouse architecture. At its most basic, a catalog is an organized inventory of data assets within an organization: it tracks table names, schemas, and references to each table’s format-specific metadata. Many vendor platforms today require users to adopt proprietary catalogs in order to fully support open table formats. This creates a significant limitation: true interoperability is compromised, forcing organizations to remain within a single vendor’s ecosystem and constraining their ability to access and manage data freely across different engines.

Beyond vendor lock-in, another growing operational challenge is the fragmentation of catalog usage within organizations. Different teams may rely on distinct catalogs depending on the ecosystems they operate in, sometimes even different implementations of the same specification, such as the Iceberg REST Catalog. While these catalogs may adhere to common APIs or standards, there is no straightforward way to synchronize tables across them without manually recreating or migrating metadata.

If we take the Apache Iceberg ecosystem as an example, the catalog is one of the most critical components. It hosts the reference to the current metadata pointer (latest metadata file), which is then used by engines to enforce ACID guarantees. Iceberg’s architecture assumes that one catalog exclusively manages a table’s metadata at any point in time. There is currently no agreed-upon protocol to:

  • Notify another catalog when a new snapshot or commit occurs
  • Reconcile concurrent writes or updates across multiple catalogs
  • Coordinate garbage collection and metadata cleanup activities

If two catalogs independently track and mutate the same Iceberg table, even if both speak the REST API, ACID guarantees break and data corruption or loss can follow.
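To see why, consider a toy sketch (plain Python, not Iceberg or XTable code) of the compare-and-swap commit a catalog performs on the metadata pointer: a commit succeeds only if the pointer still matches the snapshot the writer started from. With two catalogs each holding their own copy of the pointer, this check can no longer be enforced globally.

```python
# Toy model of an Iceberg-style catalog: for commits, the catalog's job is
# an atomic compare-and-swap (CAS) of the current metadata pointer.
class Catalog:
    def __init__(self, pointer):
        self.pointer = pointer  # current metadata file, e.g. "v1.metadata.json"

    def commit(self, expected, new):
        # Swap the pointer only if no one else committed first.
        if self.pointer != expected:
            raise RuntimeError("commit failed: concurrent update detected")
        self.pointer = new
        return new

# One catalog owning the table: the second, stale writer is rejected,
# so ACID guarantees hold.
catalog = Catalog("v1.metadata.json")
catalog.commit("v1.metadata.json", "v2.metadata.json")       # writer A wins
try:
    catalog.commit("v1.metadata.json", "v2b.metadata.json")  # writer B is stale
except RuntimeError as e:
    print(e)  # concurrent update detected and refused

# Two catalogs independently tracking the same table: each CAS succeeds
# locally, yet the table now has two divergent "current" pointers.
cat_a = Catalog("v1.metadata.json")
cat_b = Catalog("v1.metadata.json")
cat_a.commit("v1.metadata.json", "v2.metadata.json")
cat_b.commit("v1.metadata.json", "v2b.metadata.json")
print(cat_a.pointer, cat_b.pointer)  # divergent history, broken ACID
```

The single-catalog case rejects the stale writer; the two-catalog case silently accepts both commits, which is exactly the failure mode a cross-catalog protocol would need to prevent.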

In summary, while open table formats have opened the door to storage-level interoperability, catalog interoperability remains a missing piece, and solving this problem is essential for building truly open lakehouses.

Multi-Catalog Sync using Apache XTable

To address the growing need for interoperability across catalogs, Apache XTable™ (Incubating) introduces a new capability: Multi-Catalog Sync.


This feature enables organizations to automatically synchronize metadata for a given table from a source catalog to one or more target catalogs without needing to recreate table definitions, copy metadata manually, or modify the underlying data files. This unlocks a powerful new pattern: a lakehouse table written once and exposed safely across multiple catalogs and platforms.

For example, a table registered in Hive Metastore (HMS) can now be made available in AWS Glue Data Catalog with a single configuration and execution step. Additionally, organizations can opt to translate the table format during the sync — such as syncing a Hudi table from HMS as an Iceberg table in Glue, or converting a Delta table into Hudi for consumption by downstream applications.

With the new Multi-Catalog Sync feature, Apache XTable extends this table format-level interoperability into the catalog layer, bridging a gap that has long limited portability and reuse in open lakehouse environments.

Highlights

  • Sync once, expose everywhere: A table created in one catalog can be propagated to others without manual recreation.
  • Cross-format flexibility: Optional format translation during sync (e.g., Iceberg to Hudi).
  • Multi-catalog support: Sync to multiple target catalogs in a single run. Current support includes Hive Metastore and AWS Glue, with more (e.g., Unity Catalog, Polaris) planned.
  • No data movement: Data files remain unchanged; only catalog metadata is updated or created.
  • Declarative config: Declarative, extensible YAML configs that define datasets, source/target formats, partition specs, and multiple catalogs in one place.

This makes XTable’s Multi-Catalog Sync a practical solution for reducing vendor lock-in by exposing tables across ecosystems and supporting hybrid catalog strategies without rewriting pipelines.
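To make the "no data movement" point concrete, here is a toy sketch (plain Python, not the XTable implementation) of what a catalog sync conceptually does: only a catalog entry referencing the existing table location is written to the target, optionally relabeled with a translated format; the data files themselves are never copied.

```python
# Toy catalogs as dicts mapping table identifiers to catalog entries.
# An entry references table metadata on storage; data files stay put.
source_catalog = {
    "my_db.hudi_cow_demo1": {
        "location": "s3://bucket/hudi-tables/hudi_cow_demo1",
        "format": "HUDI",
    }
}
target_catalog = {}

def sync_table(source, target, source_id, target_id, target_format):
    """Register the table in the target catalog; no data files are touched."""
    entry = dict(source[source_id])      # copy only the catalog entry
    entry["format"] = target_format      # optional format translation on sync
    target[target_id] = entry
    return entry

sync_table(source_catalog, target_catalog,
           "my_db.hudi_cow_demo1", "dip_db.iceberg_table_new", "ICEBERG")

# Both catalogs now expose the same storage location under their own entry.
print(target_catalog["dip_db.iceberg_table_new"]["location"])
# s3://bucket/hudi-tables/hudi_cow_demo1
```

The real feature also translates the format-specific metadata files, but the economics are the same: the sync is a metadata operation, so its cost is independent of table size.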

Translating a Hudi Table in Hive Metastore to an Iceberg Table in AWS Glue using Apache XTable

In this example, we will show how to use Apache XTable™ (Incubating) to:

  • Read a Hudi table registered in a Hive Metastore (HMS) catalog.
  • Translate it into an Iceberg table format.
  • Sync the translated table metadata into the AWS Glue Data Catalog.

This showcases the power of XTable’s Multi-Catalog Sync feature, eliminating manual re-registration efforts across catalogs and enabling seamless cross-format and cross-catalog interoperability.

Step 1: Set up the Hive Metastore Catalog

Follow the instructions in the Hive documentation to set up the Hive Metastore. Once Hive is configured, start the metastore service locally:

hive --service metastore
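Setting up the metastore typically involves a hive-site.xml on the classpath. A minimal sketch for a local, embedded-Derby setup might look like the following; the property values here, including the warehouse path, are illustrative assumptions for this walkthrough, not requirements:

```xml
<configuration>
  <!-- Embedded Derby database backing the metastore (local testing only) -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
  </property>
  <!-- Default warehouse location for managed tables -->
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/Users/your-user/hive-warehouse</value>
  </property>
  <!-- Thrift endpoint that Spark and XTable will connect to -->
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://localhost:9083</value>
  </property>
</configuration>
```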

Step 2: Create and Register a Hudi Table in HMS Using Apache Spark

We will use Apache Spark as the engine to write a sample Hudi table into our local or cloud storage and sync it with the Hive Metastore catalog.

First, let’s start a Spark session with the required configurations:

pyspark \
--packages org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.0 \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension \
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog \
--conf spark.hadoop.hive.metastore.uris=thrift://localhost:9083 \
--conf spark.sql.warehouse.dir=file:///Users/your-user/hive-warehouse \
--files /path/to/hive-site.xml

Once the Spark session is configured, we can create a Hudi table and ingest some records. The same write also syncs the table to the Hive Metastore by connecting to the HMS Thrift URI (thrift://localhost:9083):

from pyspark.sql.types import StructType, StructField, StringType, LongType, IntegerType, Row

# Define table and storage path
databaseName = "my_db"
tableName = "hudi_cow_demo1"
basePath = f"/Users/your-user/hudi-tables/{tableName}"

# Define schema
schema = StructType([
    StructField("rowId", StringType(), True),
    StructField("partitionId", StringType(), True),
    StructField("preComb", LongType(), True),
    StructField("name", StringType(), True),
    StructField("versionId", StringType(), True),
    StructField("toBeDeletedStr", StringType(), True),
    StructField("intToLong", IntegerType(), True),
    StructField("longToInt", LongType(), True)
])

# Create sample data
data = [
    Row("row_1", "2021/01/01", 0, "bob", "v_0", "toBeDel0", 0, 1000000),
    Row("row_2", "2021/01/01", 0, "john", "v_0", "toBeDel0", 0, 1000000),
    Row("row_3", "2021/01/02", 0, "tom", "v_0", "toBeDel0", 0, 1000000)
]

# Write the Hudi table with Hive sync enabled
df = spark.createDataFrame(data, schema)

df.write.format("hudi") \
    .option("hoodie.datasource.write.table.type", "COPY_ON_WRITE") \
    .option("hoodie.datasource.write.operation", "upsert") \
    .option("hoodie.datasource.write.precombine.field", "preComb") \
    .option("hoodie.datasource.write.recordkey.field", "rowId") \
    .option("hoodie.datasource.write.partitionpath.field", "partitionId") \
    .option("hoodie.datasource.write.hive_style_partitioning", "true") \
    .option("hoodie.table.name", tableName) \
    .option("hoodie.database.name", databaseName) \
    .option("hoodie.datasource.meta.sync.enable", "true") \
    .option("hoodie.datasource.hive_sync.mode", "hms") \
    .option("hoodie.datasource.hive_sync.metastore.uris", "thrift://localhost:9083") \
    .mode("overwrite") \
    .save(basePath)

At this point, the Hudi table should be registered in Hive Metastore and available for query engines connected to HMS.

Step 3: Set up Apache XTable and Define the Catalog Sync Configuration

First, clone and build the Apache XTable project:

git clone https://github.com/apache/incubator-xtable.git
cd incubator-xtable
# Build the project to produce the bundled utilities jar (Maven build; tests skipped for speed)
mvn install -DskipTests

After the build, you should find the xtable-utilities_2.12-0.2.0-SNAPSHOT-bundled.jar file under xtable-utilities/target. For details, follow the build instructions in the Apache XTable documentation.

Next, create a my_config_catalog.yaml file to configure the source and target catalogs along with dataset details:

sourceCatalog:
  catalogId: "source-catalog-id"
  catalogConversionSourceImpl: "org.apache.xtable.hms.HMSCatalogConversionSource"
  catalogProperties:
    externalCatalog.hms.serverUrl: "thrift://localhost:9083"

targetCatalogs:
  - catalogId: "target-catalog-id-glue"
    catalogSyncClientImpl: "org.apache.xtable.glue.GlueCatalogSyncClient"
    catalogProperties:
      externalCatalog.glue.region: "us-west-2"

datasets:
  - sourceCatalogTableIdentifier:
      tableIdentifier:
        hierarchicalId: "my_db.hudi_cow_demo1"
      partitionSpec: "partitionId:VALUE"
    targetCatalogTableIdentifiers:
      - catalogId: "target-catalog-id-glue"
        tableFormat: "ICEBERG"
        tableIdentifier:
          hierarchicalId: "dip_db.iceberg_table_new"

This configuration defines:

  • Source: Hive Metastore catalog, reading the hudi_cow_demo1 table from the my_db database in HMS.
  • Target: AWS Glue catalog, registering the translated table as Iceberg under the dip_db database.

Step 4: Execute the Multi-Catalog Sync

Now run the catalog sync process using the built XTable utilities JAR:

java -cp xtable-utilities/target/xtable-utilities_2.12-0.2.0-SNAPSHOT-bundled.jar:hudi-hive-bundle-0.14.0.jar \
org.apache.xtable.utilities.RunCatalogSync \
--catalogSyncConfig my_config_catalog.yaml

Step 5: Validate the Translated Iceberg Table in AWS Glue

If we now navigate to the AWS Glue console and check the dip_db database, we should see the newly translated Iceberg table (originally Hudi).


We can also validate the schema of the table.


Building an open lakehouse architecture demands more than adopting open table formats — it requires breaking down silos across storage, table formats, and catalogs. While Apache XTable™ (Incubating) has already made format interoperability possible by bridging Hudi, Iceberg, and Delta Lake, its new Multi-Catalog Sync capability takes this goal further by addressing interoperability challenges at the catalog layer. Check out the official documentation for further reading.


Written by Dipankar Mazumdar

Dipankar is currently the Director of Developer Advocacy at Cloudera, where he leads worldwide developer initiatives focused on Lakehouse Architecture & GenAI.
