
Data Management

Apache Iceberg gets cloud-based ETL pipeline


Startup Etleap is introducing a cloud-based Extract-Transform-Load (ETL) pipeline for getting data into Apache Iceberg tables.

Apache Iceberg is an open source table format for large-scale datasets in data lakes. It sits above file formats such as Parquet, ORC, and Avro, and cloud object stores such as AWS S3, Azure Blob Storage, and Google Cloud Storage. It brings database-like features to data lakes, such as ACID transactions, partitioning, time travel, and schema evolution. Iceberg tables are used in big data environments and support SQL querying; query engines such as Spark, Trino, Flink, Presto, Hive, Impala, StarRocks, and others can work on the same tables simultaneously.
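The time-travel feature above comes from Iceberg's snapshot model: every commit produces a new immutable snapshot of the table, and older snapshots remain queryable. A toy sketch of that idea in plain Python (a conceptual illustration only, not the Iceberg API):

```python
from copy import deepcopy

class SnapshotTable:
    """Toy model of Iceberg-style snapshots: each commit creates a
    new immutable snapshot, so older table states stay readable
    ("time travel"). Not the real Iceberg API."""

    def __init__(self):
        self.snapshots = []  # list of (snapshot_id, rows)

    def commit(self, rows):
        """Record a new snapshot containing the full row set."""
        snapshot_id = len(self.snapshots)
        self.snapshots.append((snapshot_id, deepcopy(rows)))
        return snapshot_id

    def read(self, snapshot_id=None):
        """Read the latest snapshot, or an older one by id."""
        if not self.snapshots:
            return []
        if snapshot_id is None:
            snapshot_id = self.snapshots[-1][0]
        return self.snapshots[snapshot_id][1]

table = SnapshotTable()
s0 = table.commit([{"id": 1, "state": "new"}])
s1 = table.commit([{"id": 1, "state": "updated"}])

latest = table.read()     # current table state
historic = table.read(s0) # time travel to the first snapshot
```

Real Iceberg tracks snapshots as metadata files pointing at immutable data files rather than copying rows, which is what lets multiple engines read consistent table states concurrently.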


Christian Romming, Etleap CEO and founder, said: “Iceberg delivers major benefits for enterprises, but to realize them in practice requires a managed pipeline system around it. We believe our Iceberg pipeline platform meets this need, allowing data platform teams to adopt Iceberg without building and operating a custom pipeline stack.”

Etleap was founded in 2013 by Romming and, by data analytics startup standards, is lightly funded, having raised some $3.22 million across startup and seed rounds in 2017 and 2018.

Romming says Iceberg doesn’t ingest or model data, manage table operations, or coordinate changes across systems. Users have to build their own set of pipeline functions to hook up data sources to Iceberg and do this, having “to assemble a patchwork of ingestion tools, dbt Core jobs, orchestrators, and custom Iceberg maintenance.”

Now Etleap will do it for you, courtesy of a SaaS service. It’s unifying ingestion, transformation, orchestration, and Iceberg operations into a single, managed system that runs entirely inside a customer’s Virtual Private Cloud (VPC).

However, supported data sources are limited. Currently, only the following are supported as sources for Iceberg pipelines:

  • CDC-enabled databases (CDC = change data capture)
  • S3 sources when the “Trigger transformations through events” pipeline source option is enabled
  • Event Streams
  • Salesforce CDC entities
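Most of the sources above are CDC-based: the pipeline reads a stream of insert/update/delete change events from the source system and replays them onto the target table. A minimal sketch of that replay step in plain Python (conceptual only; real CDC pipelines read the database's transaction log, e.g. via tools like Debezium, and the event schema here is hypothetical):

```python
def apply_cdc_events(table, events):
    """Replay a stream of change-data-capture events onto a table,
    modeled as a dict keyed by primary key. Inserts and updates
    upsert the row; deletes remove it."""
    for event in events:
        op, key = event["op"], event["key"]
        if op in ("insert", "update"):
            table[key] = event["row"]
        elif op == "delete":
            table.pop(key, None)
    return table

events = [
    {"op": "insert", "key": 1, "row": {"name": "a"}},
    {"op": "update", "key": 1, "row": {"name": "b"}},
    {"op": "insert", "key": 2, "row": {"name": "c"}},
    {"op": "delete", "key": 2},
]
result = apply_cdc_events({}, events)  # {1: {'name': 'b'}}
```

The hard parts a managed pipeline handles beyond this loop are ordering guarantees, schema changes in the source, and compacting the resulting small writes into healthy Iceberg files.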

There is a limited set of data transforms available, documented by Etleap. There are also limitations on CDC, event-triggered, and event stream Iceberg pipelines; many of these should be resolved in the future.

Etleap also has pipelines for Amazon Redshift, S3/Glue, and Snowflake. Its Iceberg pipeline platform is available now and is being used by customers to run Iceberg pipelines at scale.