# Async Usage
Icechunk includes an optional asynchronous interface for orchestrating repos and sessions. However, the Icechunk core is fully asynchronous and delivers full parallelism and performance whether you choose to use the synchronous or asynchronous interface. Most users, particularly those doing interactive data science and analytics, should use the synchronous interface.
## When to use async
The async interface allows Icechunk operations to run concurrently, without blocking the current thread while waiting for IO. The most common reason to use async is that you are developing a backend service in which work may be happening across multiple Icechunk repositories simultaneously.
## Using the async interface
You can call both sync and async methods on a `Repository`, `Session`, or `Store` as needed. (Of course, to use the async methods, you must be within an async function.) Methods that support async are named with an `_async` suffix:
```python exec="on" session="async_usage" source="material-block"
import icechunk
async def get_branches(storage: icechunk.Storage) -> set[str]:
repo = await icechunk.Repository.open_async(storage)
return await repo.list_branches_async()
```
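Because each `_async` method returns an awaitable, independent operations can run concurrently. For example, here is a sketch reusing `get_branches` from above to list the branches of several repositories at once (the list of `Storage` objects is assumed to come from elsewhere):
```python
import asyncio

import icechunk

async def get_all_branches(storages: list[icechunk.Storage]) -> list[set[str]]:
    # one task per repository; the IO overlaps instead of running serially
    return await asyncio.gather(*(get_branches(s) for s in storages))
```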
# Concurrency
TODO: describe the general approach to concurrency in Icechunk
## Built-in concurrency
Describe the multi-threading and async concurrency in Icechunk / Zarr
## Distributed concurrency within a single transaction
"Cooperative" concurrency
## Concurrency across uncoordinated sessions
### Conflict detection
# Configuration
When creating and opening Icechunk repositories, there are many configuration options available to control the behavior of the repository and the storage backend. This page will guide you through the available options and how to use them.
## [`RepositoryConfig`](./reference.md#icechunk.RepositoryConfig)
The `RepositoryConfig` object is used to configure the repository. For convenience, this can be constructed using some sane defaults:
```python exec="on" session="config" source="material-block"
import icechunk
config = icechunk.RepositoryConfig.default()
```
or it can be optionally loaded from an existing repository:
```python
config = icechunk.Repository.fetch_config(storage)
```
It allows you to configure the following parameters:
### [`inline_chunk_threshold_bytes`](./reference.md#icechunk.RepositoryConfig.inline_chunk_threshold_bytes)
The threshold for when to inline a chunk into a manifest instead of storing it as a separate object in the storage backend.
### [`get_partial_values_concurrency`](./reference.md#icechunk.RepositoryConfig.get_partial_values_concurrency)
The number of concurrent requests to make when getting partial values from storage.
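Both parameters are plain attributes that can be set directly on the config object. The values below are illustrative, not the defaults:
```python
config.inline_chunk_threshold_bytes = 1024   # inline chunks smaller than 1 KiB
config.get_partial_values_concurrency = 10   # up to 10 concurrent partial reads
```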
### [`compression`](./reference.md#icechunk.RepositoryConfig.compression)
Icechunk uses Zstd compression to compress its metadata files. [`CompressionConfig`](./reference.md#icechunk.CompressionConfig) allows you to configure the [compression level](./reference.md#icechunk.CompressionConfig.level) and [algorithm](./reference.md#icechunk.CompressionConfig.algorithm). Currently, the only algorithm available is [`Zstd`](https://facebook.github.io/zstd/).
```python exec="on" session="config" source="material-block"
config.compression = icechunk.CompressionConfig(
level=3,
algorithm=icechunk.CompressionAlgorithm.Zstd,
)
```
### [`caching`](./reference.md#icechunk.RepositoryConfig.caching)
Icechunk caches files (metadata and chunks) to speed up common operations. [`CachingConfig`](./reference.md#icechunk.CachingConfig) allows you to configure the caching behavior for the repository.
```python exec="on" session="config" source="material-block"
config.caching = icechunk.CachingConfig(
num_snapshot_nodes=100,
num_chunk_refs=100,
num_transaction_changes=100,
num_bytes_attributes=10_000,
num_bytes_chunks=1_000_000,
)
```
### [`storage`](./reference.md#icechunk.RepositoryConfig.storage)
This configures how Icechunk interacts with the storage backend. [`StorageSettings`](./reference.md#icechunk.StorageSettings) allows you to configure request concurrency and the storage classes used for new objects.
```python exec="on" session="config" source="material-block"
config.storage = icechunk.StorageSettings(
concurrency=icechunk.StorageConcurrencySettings(
max_concurrent_requests_for_object=10,
ideal_concurrent_request_size=1_000_000,
),
storage_class="STANDARD",
metadata_storage_class="STANDARD_IA",
chunks_storage_class="STANDARD_IA",
)
```
### [`virtual_chunk_containers`](./reference.md#icechunk.RepositoryConfig.virtual_chunk_containers)
Icechunk allows repos to contain [virtual chunks](./virtual.md). To allow for referencing these virtual chunks, you must configure the `virtual_chunk_containers` parameter to specify the storage locations and configurations for any virtual chunks. Each virtual chunk container is specified by a [`VirtualChunkContainer`](./reference.md#icechunk.VirtualChunkContainer) object, which contains a url prefix and a storage configuration. When a container is added to the settings, any virtual chunk with a url that starts with the configured prefix will use the storage configuration of the matching container.
!!! note
Currently only `s3` compatible storage, `gcs`, `local_filesystem` and `http[s]` storages are supported for virtual chunk containers. Other storage backends such as `azure` are on the roadmap.
#### Example
For example, if we wanted to configure an icechunk repo to be able to contain virtual chunks from an `s3` bucket called `my-s3-bucket` in `us-east-1`, we would do the following:
```python exec="on" session="config" source="material-block"
config.set_virtual_chunk_container(
icechunk.VirtualChunkContainer(
"s3://my-s3-bucket/",
store=icechunk.s3_store(region="us-east-1"),
),
)
```
If we also wanted to configure the repo to be able to contain virtual chunks from another `s3` bucket called `my-other-s3-bucket` in `us-west-2`, we would do the following:
```python exec="on" session="config" source="material-block"
config.set_virtual_chunk_container(
icechunk.VirtualChunkContainer(
"s3://my-other-s3-bucket/",
store=icechunk.s3_store(region="us-west-2")
)
)
```
This will add a second `VirtualChunkContainer` but not overwrite the first one that was added because they have different url prefixes. Now at read time, if Icechunk encounters a virtual chunk url that starts with `s3://my-other-s3-bucket/`, it will use the storage configuration for the `my-other-s3-bucket` container.
!!! note
While virtual chunk containers specify the storage configuration for any virtual chunks, they do not contain any authentication information. The credentials must also be specified when opening the repository using the [`authorize_virtual_chunk_access`](./reference.md#icechunk.Repository.open) parameter. This parameter also serves as a way for the user to authorize access to the virtual chunk containers; containers that are not explicitly allowed with `authorize_virtual_chunk_access` won't be able to fetch their chunks. See the [Virtual Chunk Credentials](#virtual-chunk-credentials) section for more information.
### [`manifest`](./reference.md#icechunk.RepositoryConfig.manifest)
The manifest configuration for the repository. [`ManifestConfig`](./reference.md#icechunk.ManifestConfig) allows you to configure behavior for how manifests are loaded. In particular, the `preload` parameter allows you to configure the preload behavior of the manifest using a [`ManifestPreloadConfig`](./reference.md#icechunk.ManifestPreloadConfig). This allows you to control the number of references that are loaded into memory when a session is created, along with which manifests are available to be preloaded.
#### Example
For example, if we have a repo which contains data that we plan to open as an [`Xarray`](./xarray.md) dataset, we may want to configure the manifest preload to only preload manifests that contain arrays that are coordinates, in our case `time`, `latitude`, and `longitude`.
```python exec="on" session="config" source="material-block"
config.manifest = icechunk.ManifestConfig(
preload=icechunk.ManifestPreloadConfig(
max_total_refs=100_000_000,
preload_if=icechunk.ManifestPreloadCondition.name_matches(".*time|.*latitude|.*longitude"),
),
)
```
### Applying Configuration
Now we can create or open an Icechunk repo using our config.
#### Creating a new repo
If no config is provided, the repo will be created with the [default configuration](./reference.md#icechunk.RepositoryConfig.default).
!!! note
Icechunk repos cannot be created in the same location where another store already exists.
=== "Creating with S3 storage"
```python
storage = icechunk.s3_storage(
bucket='earthmover-sample-data',
prefix='icechunk/oisst.2020-2024/',
region='us-east-1',
from_env=True,
)
repo = icechunk.Repository.create(
storage=storage,
config=config,
)
```
=== "Creating with Google Cloud Storage"
```python
storage = icechunk.gcs_storage(
bucket='earthmover-sample-data',
prefix='icechunk/oisst.2020-2024/',
from_env=True,
)
repo = icechunk.Repository.create(
storage=storage,
config=config,
)
```
=== "Creating with Azure Blob Storage"
```python
storage = icechunk.azure_storage(
container='earthmover-sample-data',
prefix='icechunk/oisst.2020-2024/',
from_env=True,
)
repo = icechunk.Repository.create(
storage=storage,
config=config,
)
```
=== "Creating with local filesystem"
```python
repo = icechunk.Repository.create(
storage=icechunk.local_filesystem_storage("/path/to/my/dataset"),
config=config
)
```
#### Opening an existing repo
When opening an existing repo, the config will be loaded from the repo if it exists. If no config exists and no config was specified, the repo will be opened with the [default configuration](./reference.md#icechunk.RepositoryConfig.default).
However, if a config was specified when opening the repo AND a config was previously persisted in the repo, the two configurations will be merged. The config specified when opening the repo will take precedence over the persisted config.
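For example, if a compression level was persisted in the repo, passing a config with a different level when opening wins (a sketch; assumes `storage` is already defined):
```python
config = icechunk.RepositoryConfig.default()
config.compression = icechunk.CompressionConfig(level=5)
# level=5 takes precedence over whatever level was persisted in the repo
repo = icechunk.Repository.open(storage, config=config)
```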
=== "Opening from S3 Storage"
```python
storage = icechunk.s3_storage(
bucket='earthmover-sample-data',
prefix='icechunk/oisst.2020-2024/',
region='us-east-1',
from_env=True,
)
repo = icechunk.Repository.open(
storage=storage,
config=config,
)
```
=== "Opening from Google Cloud Storage"
```python
storage = icechunk.gcs_storage(
bucket='earthmover-sample-data',
prefix='icechunk/oisst.2020-2024/',
from_env=True,
)
repo = icechunk.Repository.open(
storage=storage,
config=config,
)
```
=== "Opening from Azure Blob Storage"
```python
storage = icechunk.azure_storage(
container='earthmover-sample-data',
prefix='icechunk/oisst.2020-2024/',
from_env=True,
)
repo = icechunk.Repository.open(
storage=storage,
config=config,
)
```
=== "Opening from local filesystem"
```python
storage = icechunk.local_filesystem_storage("/path/to/my/dataset")
repo = icechunk.Repository.open(
storage=storage,
config=config,
)
```
### Persisting Configuration
Once the repo is opened, the current config can be persisted to the repo by calling [`save_config`](./reference.md#icechunk.Repository.save_config).
```python
repo.save_config()
```
The next time this repo is opened, the persisted config will be loaded by default.
## Virtual Chunk Credentials
When using virtual chunk containers, the containers must be authorized by the repo user, and the credentials for the storage backend must be specified. This is done using the [`authorize_virtual_chunk_access`](./reference.md#icechunk.Repository.open) parameter when creating or opening the repo. Credentials are specified as a dictionary of container url prefixes mapping to credential objects or `None`. A `None` credential will fetch credentials from the process environment or it will use anonymous credentials if the container allows it. A helper function, [`containers_credentials`](./reference.md#icechunk.containers_credentials), is provided to make it easier to specify credentials for multiple containers.
### Example
Expanding on the example from the [Virtual Chunk Containers](#virtual_chunk_containers) section, we can configure the repo to use the credentials for the `my-s3-bucket` and `my-other-s3-bucket` containers.
```python
credentials = icechunk.containers_credentials(
    {
        "s3://my-s3-bucket/": icechunk.s3_credentials(bucket="my-s3-bucket", region="us-east-1"),
        "s3://my-other-s3-bucket/": icechunk.s3_credentials(bucket="my-other-s3-bucket", region="us-west-2"),
    }
)
repo = icechunk.Repository.open(
storage=storage,
config=config,
authorize_virtual_chunk_access=credentials,
)
```
# Contributing
👋 Hi! Thanks for your interest in contributing to Icechunk!
Icechunk is an open source (Apache 2.0) project and welcomes contributions in the form of:
- Usage questions - [open a GitHub issue](https://github.com/earth-mover/icechunk/issues)
- Bug reports - [open a GitHub issue](https://github.com/earth-mover/icechunk/issues)
- Feature requests - [open a GitHub issue](https://github.com/earth-mover/icechunk/issues)
- Documentation improvements - [open a GitHub pull request](https://github.com/earth-mover/icechunk/pulls)
- Bug fixes and enhancements - [open a GitHub pull request](https://github.com/earth-mover/icechunk/pulls)
## Development
### Python Development Workflow
The Python code is developed in the `icechunk-python` subdirectory. To make changes, first enter that directory:
```bash
cd icechunk-python
```
Create / activate a virtual environment:
=== "Venv"
```bash
python3 -m venv .venv
source .venv/bin/activate
```
=== "Conda / Mamba"
```bash
mamba create -n icechunk python=3.12 rust zarr
mamba activate icechunk
```
=== "uv"
```bash
uv sync
```
Install `maturin`:
=== "Venv"
```bash
pip install maturin
```
Build the project in dev mode:
```bash
maturin develop
# or with the optional dependencies
maturin develop --extras=test,benchmark
```
or build the project in editable mode:
```bash
pip install -e .
```
=== "uv"
uv manages rebuilding as needed, so it will run the Maturin build when using `uv run`.
To explicitly use Maturin, install it globally.
```bash
uv tool install maturin
```
Maturin may need to know it should work with uv, so add `--uv` to the CLI.
```bash
maturin develop --uv --extras=test,benchmark
```
#### Testing
The full Python test suite depends on S3- and Azure-compatible object stores.
These can be started from the root of the repo with `docker compose up` (press `ctrl-c`, then run `docker compose down` when you are done, to clean up).
=== "uv"
```bash
uv run pytest
```
### Rust Development Workflow
TODO
## Roadmap
### Features
- Support more object stores and more of their custom features
- Better Python API and helper functions
- Bindings to other languages: C, Wasm
- Better, faster, more secure distributed sessions
- Savepoints and persistent sessions
- Chunk and repo level statistics and metrics
- More powerful conflict detection and resolution
- Efficient move operation
- Telemetry
- Zarr-less usage from Python and other languages
- Better documentation and examples
### Performance
- Lower changeset memory footprint
- Optimize virtual dataset prefixes
- Bring back manifest joining for small arrays
- Improve performance of `ancestry`, `garbage_collect`, `get_size` and other metrics
- More flexible caching hierarchy
- Better I/O pipeline
- Better GIL management
- Request batching and splitting
- Bringing parts of the codec pipeline to the Rust side
- Chunk compaction
### Zarr-related
We’re very excited about a number of extensions to Zarr that would work great with Icechunk.
- [Variable length chunks](https://zarr.dev/zeps/draft/ZEP0003.html)
- [Chunk-level statistics](https://zarr.dev/zeps/draft/ZEP0005.html)
# Distributed Writes with dask
You can use Icechunk in conjunction with Xarray and Dask to perform large-scale distributed writes from a multi-node cluster.
However, because of how Icechunk works, it's not possible to use the existing [`Dask.Array.to_zarr`](https://docs.dask.org/en/latest/generated/dask.array.to_zarr.html) or [`Xarray.Dataset.to_zarr`](https://docs.xarray.dev/en/latest/generated/xarray.Dataset.to_zarr.html) functions with either the Dask multiprocessing or distributed schedulers. (It is fine with the multithreaded scheduler.)
Instead, Icechunk provides its own specialized functions to make distributed writes with Dask and Xarray.
This page explains how to use these specialized functions.
Start with an icechunk store and dask arrays.
```python exec="on" session="dask" source="material-block"
import icechunk
import tempfile
# initialize the icechunk store
storage = icechunk.local_filesystem_storage(tempfile.TemporaryDirectory().name)
repo = icechunk.Repository.create(storage)
session = repo.writable_session("main")
```
## Icechunk + Dask
Use [`icechunk.dask.store_dask`](./reference.md#icechunk.dask.store_dask) to write a Dask array to an Icechunk store.
The API follows that of [`dask.array.store`](https://docs.dask.org/en/stable/generated/dask.array.store.html) *without*
support for the `compute` kwarg.
First create a dask array to write:
```python exec="on" session="dask" source="material-block"
import dask.array as da
shape = (100, 100)
dask_chunks = (20, 20)
dask_array = da.random.random(shape, chunks=dask_chunks)
```
Now create the Zarr array you will write to.
```python exec="on" session="dask" source="material-block"
import zarr
zarr_chunks = (10, 10)
group = zarr.group(store=session.store, overwrite=True)
zarray = group.create_array(
"array",
shape=shape,
chunks=zarr_chunks,
dtype="f8",
fill_value=float("nan"),
)
session.commit("initialize array")
```
Note that the Zarr chunks in the store evenly divide the Dask chunks. This means each individual
write task touches a distinct set of Zarr chunks and will not conflict with other tasks. It is your
responsibility to ensure that such conflicts are avoided.
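A quick sanity check for this alignment is to confirm that each Zarr chunk size evenly divides the corresponding Dask chunk size:
```python
# every dask chunk must cover a whole number of zarr chunks, so that
# no two dask write tasks ever touch the same zarr chunk
assert all(dc % zc == 0 for dc, zc in zip(dask_chunks, zarr_chunks))
```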
First, remember to fork the session before re-opening the Zarr array.
`store_dask` will merge all the remote write sessions on the cluster before returning
a single merged `ForkSession`.
```python exec="on" session="dask" source="material-block" result="code"
import icechunk.dask
session = repo.writable_session("main")
fork = session.fork()
zarray = zarr.open_array(fork.store, path="array")
remote_session = icechunk.dask.store_dask(
sources=[dask_array],
targets=[zarray]
)
```
Merge the remote session into the local Session:
```python exec="on" session="dask" source="material-block" result="code"
session.merge(remote_session)
```
Finally commit your changes!
```python exec="on" session="dask" source="material-block"
print(session.commit("wrote a dask array!"))
```
## Icechunk + Dask + Xarray
The [`icechunk.xarray.to_icechunk`](./reference.md#icechunk.xarray.to_icechunk) function is functionally identical to Xarray's
[`Dataset.to_zarr`](https://docs.xarray.dev/en/stable/generated/xarray.Dataset.to_zarr.html), including many of the same keyword arguments.
Notably, the ``compute`` kwarg is not supported.
Now roundtrip an Xarray dataset:
```python exec="on" session="dask" source="material-block" result="code"
import distributed
import icechunk.xarray
import xarray as xr
client = distributed.Client()
session = repo.writable_session("main")
dataset = xr.tutorial.open_dataset(
    "rasm",
    chunks={"time": 1},
).isel(time=slice(24))
# `to_icechunk` takes care of handling the forking
icechunk.xarray.to_icechunk(dataset, session, mode="w")
# remember you must commit before executing a distributed read.
print(session.commit("wrote an Xarray dataset!"))
roundtripped = xr.open_zarr(session.store, consolidated=False)
print(dataset.identical(roundtripped))
```
```python exec="on" session="dask"
# handy when running mkdocs serve locally
client.shutdown();
```
# Expiring Data
Over time, an Icechunk Repository will accumulate many snapshots, not all of which need to be kept around.
"Expiration" allows you to mark snapshots as expired, and "garbage collection" deletes all data (manifests, chunks, snapshots, etc.) associated with expired snapshots.
First create a Repository, configured so that there are no "inline" chunks. This will help illustrate that data is actually deleted.
```python exec="on" session="version" source="material-block"
import icechunk
repo = icechunk.Repository.create(
icechunk.in_memory_storage(),
config=icechunk.RepositoryConfig(inline_chunk_threshold_bytes=0),
)
```
## Generate a few snapshots
Let us generate a sequence of snapshots:
```python exec="on" session="version" source="material-block"
import zarr
import time
for i in range(10):
session = repo.writable_session("main")
array = zarr.create_array(
session.store, name="array", shape=(10,), fill_value=-1, dtype=int, overwrite=True
)
array[:] = i
session.commit(f"snap {i}")
time.sleep(0.1)
```
There are 10 snapshots:
```python exec="on" session="version" source="material-block"
ancestry = list(repo.ancestry(branch="main"))
print("\n\n".join([str((a.id, a.written_at)) for a in ancestry]))
```
## Expire snapshots
!!! danger
Expiring snapshots is an irreversible operation. Use it with care.
First we must expire snapshots. Here we will expire any snapshot older than the 5th one.
```python exec="on" session="version" source="material-block"
expiry_time = ancestry[5].written_at
print(expiry_time)
```
```python exec="on" session="version" source="material-block"
expired = repo.expire_snapshots(older_than=expiry_time)
print(expired)
```
This prints out the set of snapshots that were expired.
!!! note
The first snapshot is never expired!
Confirm that these are the right snapshots (remember that `ancestry` lists commits in decreasing order of `written_at` time):
```python exec="on" session="version" source="material-block"
print([a.id for a in ancestry[-5:-1]])
```
Note that ancestry is now shorter:
```python exec="on" session="version" source="material-block"
new_ancestry = list(repo.ancestry(branch="main"))
print("\n\n".join([str((a.id, a.written_at)) for a in new_ancestry]))
```
## Delete expired data
!!! danger
Garbage collection is an irreversible operation that deletes data. Use it with care.
Use `Repository.garbage_collect` to delete data associated with expired snapshots:
```python exec="on" session="version" source="material-block"
results = repo.garbage_collect(expiry_time)
print(results)
```
# FAQ
## Why was Icechunk created?
Icechunk was created by [Earthmover](https://earthmover.io/) as the open-source format for its cloud data platform [Arraylake](https://docs.earthmover.io).
Icechunk builds on the successful [Zarr](https://zarr.dev) project.
Zarr is a great foundation for storing and querying large multidimensional array data in a flexible, scalable way.
But when people started using Zarr together with cloud object storage in a collaborative way, it became clear that Zarr alone could not offer the sort of consistency many users desired.
Icechunk makes Zarr work a little bit more like a database, enabling different users / processes to safely read and write concurrently, while still only using object storage as a persistence layer.
Another motivation for Icechunk was the success of [Kerchunk](https://github.com/fsspec/kerchunk/).
The Kerchunk project showed that it was possible to map many existing archival formats (e.g. HDF5, NetCDF, GRIB) to the Zarr data model without actually rewriting any bytes, by creating "virtual" Zarr datasets referencing binary chunks inside other files.
Doing this at scale requires tracking millions of "chunk references."
Icechunk's storage model allows for these virtual chunks to be stored seamlessly alongside native Zarr chunks.
Finally, Icechunk provides a universal I/O layer for cloud object storage, implementing numerous performance optimizations designed to accelerate data-intensive applications.
Solving these problems in one go via a powerful, open-source, Rust-based library will bring massive benefits
to the cloud-native scientific data community.
## Where does the name "Icechunk" come from?
Icechunk was inspired partly by [Apache Iceberg](https://iceberg.apache.org/), a popular cloud-native table format.
However, instead of storing tabular data, Icechunk stores multidimensional arrays, for which the individual unit of
storage is the _chunk_.
## When should I use Icechunk?
Here are some scenarios where it makes sense to use Icechunk:
- You want to store large, dynamically evolving multi-dimensional arrays (a.k.a. tensors) in cloud object storage.
- You want to allow multiple uncoordinated processes to access your data at the same time (like a database).
- You want to be able to safely roll back failed updates or revert Zarr data to an earlier state.
- You want to use concepts from data version control (e.g. tagging, branching, snapshots) with Zarr data.
- You want to achieve cloud-native performance on archival file formats (HDF5, NetCDF, GRIB) by exposing them as virtual Zarr datasets and need to store chunk references in a robust, scalable, interoperable way.
- You want to get the best possible performance for reading / writing tensor data in AI / ML workflows.
## What are the downsides to using Icechunk?
As with all things in technology, the benefits of Icechunk come with some tradeoffs:
- There may be slightly higher cold-start latency when opening Icechunk datasets compared with regular Zarr.
- The on-disk format is less transparent than regular Zarr.
- The process for distributed writes is more complex to coordinate.
## What is Icechunk's relationship to Zarr?
The Zarr format and protocol is agnostic to the underlying storage system ("store" in Zarr terminology)
and communicates with the store via a simple key / value interface.
Zarr tells the store which keys and values it wants to get or set, and it's the store's job
to figure out how to persist or retrieve the required bytes.
Most existing Zarr stores have a simple 1:1 mapping between Zarr's keys and the underlying file / object names.
For example, if Zarr asks for a key called `myarray/c/0/0`, the store may just look up a key of the same name
in an underlying cloud object storage bucket.
Icechunk is a storage engine which creates a layer of indirection between the
Zarr keys and the actual files in storage.
A Zarr library doesn't have to know explicitly how Icechunk works or how it's storing data on disk.
It just gets / sets keys as it would with any store.
Icechunk figures out how to materialize these keys based on its [storage schema](./spec.md).
- __Standard Zarr + Fsspec__
---
In standard Zarr usage (without Icechunk), [fsspec](https://filesystem-spec.readthedocs.io/) sits
between the Zarr library and the object store, translating Zarr keys directly to object store keys.
```mermaid
flowchart TD
zarr-python[Zarr Library] <-- key / value--> icechunk[fsspec]
icechunk <-- key / value --> storage[(Object Storage)]
```
- __Zarr + Icechunk__
---
With Icechunk, the Icechunk library intercepts the Zarr keys and translates them to the
Icechunk schema, storing data in object storage using its own format.
```mermaid
flowchart TD
zarr-python[Zarr Library] <-- key / value--> icechunk[Icechunk Library]
icechunk <-- icechunk data / metadata files --> storage[(Object Storage)]
```
Implementing Icechunk this way allows Icechunk's specification to evolve independently from Zarr's,
maintaining interoperability while enabling rapid iteration and promoting innovation on the I/O layer.
## Is Icechunk part of the Zarr Spec?
No. At the moment, the Icechunk spec is completely independent of the Zarr spec.
In the future, we may choose to propose Icechunk as a Zarr extension.
However, because it sits _below_ Zarr in the stack, it's not immediately clear how to do that.
## Should I implement Icechunk on my own based on the spec?
No, we do not recommend implementing Icechunk independently of the existing Rust library.
There are two reasons for this:
1. The spec has not yet been stabilized and is still evolving rapidly.
1. It's probably much easier to bind to the Rust library from your language of choice,
rather than re-implement from scratch.
We welcome contributions from folks interested in developing Icechunk bindings for other languages!
## Is Icechunk stable?
Yes! Icechunk 1.0, released in July 2025, is a stable release suitable for production use.
Data written by Icechunk 1.0 and greater will forever be readable by future Icechunk versions.
## What is the backwards compatibility policy for the Icechunk format?
Any data written by Icechunk v1.0 or greater will be readable forever by future Icechunk releases (backwards compatible).
Data written by a _greater major version_ is not guaranteed to be readable by a _lesser major version_ (e.g. data written by v2.2 is not guaranteed to be readable with v1.1 libraries).
## Is Icechunk fast?
Icechunk is at least as fast as the existing Zarr / Dask / fsspec stack
and in many cases achieves significantly lower latency and higher throughput.
Furthermore, Icechunk achieves this without using Dask, by implementing its own asynchronous multithreaded I/O pipeline.
For a demonstration of Icechunk's performance, see the blog post
[Solving NASA’s Cloud Data Dilemma: How Icechunk Revolutionizes Earth Data Access](https://earthmover.io/blog/nasa-icechunk)
## How does Icechunk compare to X?
### Array Formats
Array formats are file formats for storing multi-dimensional array (tensor) data.
Icechunk is an array format.
Here is how Icechunk compares to other popular array formats.
#### [HDF5](https://www.hdfgroup.org/solutions/hdf5/)
HDF5 (Hierarchical Data Format version 5) is a popular format for storing scientific data.
HDF is widely used in high-performance computing.
- __Similarities__
---
Icechunk and HDF5 share the same data model: multidimensional arrays and metadata organized into a hierarchical tree structure.
This data model can accommodate a wide range of different use cases and workflows.
Both Icechunk and HDF5 use the concept of "chunking" to split large arrays into smaller storage units.
- __Differences__
---
HDF5 is a monolithic file format designed first and foremost for POSIX filesystems.
All of the chunks in an HDF5 dataset live within a single file.
The size of an HDF5 dataset is limited to the size of a single file.
HDF5 relies on the filesystem for consistency and is not designed for multiple concurrent yet uncoordinated readers and writers.
Icechunk spreads chunks over many files and is designed first and foremost for cloud object storage.
Icechunk can accommodate datasets of arbitrary size.
Icechunk's optimistic concurrency design allows for safe concurrent access for uncoordinated readers and writers.
#### [NetCDF](https://www.unidata.ucar.edu/software/netcdf/)
> NetCDF (Network Common Data Form) is a set of software libraries and machine-independent data formats that support the creation, access, and sharing of array-oriented scientific data.
NetCDF4 uses HDF5 as its underlying file format.
Therefore, the similarities and differences with Icechunk are fundamentally the same.
Icechunk can accommodate the NetCDF data model.
It's possible to write NetCDF compliant data in Icechunk using [Xarray](https://xarray.dev/).
#### [Zarr](https://zarr.dev)
Icechunk works together with Zarr.
(See [What is Icechunk's relationship to Zarr?](#what-is-icechunks-relationship-to-zarr) for more detail.)
Compared to regular Zarr (without Icechunk), Icechunk offers many benefits, including
- Serializable isolation of updates via transactions
- Data version control (snapshots, branches, tags)
- Ability to store references to chunks in external datasets (HDF5, NetCDF, GRIB, etc.)
- A Rust-optimized I/O layer for communicating with object storage
#### [Cloud Optimized GeoTiff](http://cogeo.org/) (CoG)
> A Cloud Optimized GeoTIFF (COG) is a regular GeoTIFF file, aimed at being hosted on a HTTP file server, with an internal organization that enables more efficient workflows on the cloud.
> It does this by leveraging the ability of clients issuing HTTP GET range requests to ask for just the parts of a file they need.
CoG has become very popular in the geospatial community as a cloud-native format for raster data.
A CoG file contains a single image (possibly with multiple bands), sharded into chunks of an appropriate size.
A CoG also contains "overviews," lower resolution versions of the same data.
Finally, a CoG contains relevant geospatial metadata regarding projection, CRS, etc. which allow georeferencing of the data.
Data identical to what is found in a CoG can be stored in the Zarr data model and therefore in an Icechunk repo.
Furthermore, Zarr / Icechunk can accommodate rasters of arbitrarily large size and facilitate massive-scale concurrent writes (in addition to reads).
A CoG, in contrast, is limited to a single file and thus has limitations on scale and write concurrency.
However, Zarr and Icechunk currently do not offer the same level of broad geospatial interoperability that CoG does.
The [GeoZarr](https://github.com/zarr-developers/geozarr-spec) project aims to change that.
#### [TileDB Embedded](https://docs.tiledb.com/main/background/key-concepts-and-data-format)
TileDB Embedded is an innovative array storage format that bears many similarities to both Zarr and Icechunk.
Like TileDB Embedded, Icechunk aims to provide database-style features on top of the array data model.
Both technologies use an embedded / serverless architecture, where client processes interact directly with
data files in storage, rather than through a database server.
However, there are a number of differences, enumerated below.
The following table compares Zarr + Icechunk with TileDB Embedded in a few key areas:
| feature | **Zarr + Icechunk** | **TileDB Embedded** | Comment |
|---------|---------------------|---------------------|---------|
| *atomicity* | atomic updates can span multiple arrays and groups | _array fragments_ limited to a single array | Icechunk's model allows a writer to stage many updates across interrelated arrays into a single transaction. |
| *concurrency and isolation* | serializable isolation of transactions | [eventual consistency](https://docs.tiledb.com/main/background/internal-mechanics/consistency) | While both formats enable lock-free concurrent reading and writing, Icechunk can catch (and potentially reject) inconsistent, out-of-order updates. |
| *versioning* | snapshots, branches, tags | linear version history | Icechunk's data versioning model is closer to Git's. |
| *unit of storage* | chunk | tile | (basically the same thing) |
| *minimum write* | chunk | cell | TileDB allows atomic updates to individual cells, while Zarr requires writing an entire chunk. |
| *sparse arrays* | :material-close: | :material-check: | Zarr + Icechunk do not currently support sparse arrays. |
| *virtual chunk references* | :material-check: | :material-close: | Icechunk enables references to chunks in other file formats (HDF5, NetCDF, GRIB, etc.), while TileDB does not. |
Beyond this list, there are numerous differences in the design, file layout, and implementation of Icechunk and TileDB Embedded
which may lead to differences in suitability and performance for different workloads.
#### [SafeTensors](https://github.com/huggingface/safetensors)
SafeTensors is a format developed by HuggingFace for storing tensors (arrays) safely, in contrast to Python pickle objects.
By the same criteria, Icechunk and Zarr are also "safe", in that it is impossible to trigger arbitrary code execution when reading data.
SafeTensors is a single-file format, like HDF5.
SafeTensors optimizes for a simple on-disk layout that facilitates mem-map-based zero-copy reading in ML training pipelines,
assuming that the data are being read from a local POSIX filesystem.
Zarr and Icechunk instead allow for flexible chunking and compression to optimize I/O against object storage.
### Tabular Formats
Tabular formats are for storing tabular data.
Tabular formats are extremely prevalent in general-purpose data analytics but are less widely used in scientific domains.
The tabular data model is different from Icechunk's multidimensional array data model, and so a direct comparison is not always apt.
However, Icechunk is inspired by many tabular file formats, and there are some notable similarities.
#### [Apache Parquet](https://parquet.apache.org/)
> Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval.
> It provides high performance compression and encoding schemes to handle complex data in bulk and is supported in many programming language and analytics tools.
Parquet employs many of the same core technological concepts used in Zarr + Icechunk such as chunking, compression, and efficient metadata access in a cloud context.
Both formats support a range of different numerical data types.
Both are "columnar" in the sense that different columns / variables / arrays can be queried efficiently without having to fetch unwanted data from other columns.
Both also support attaching arbitrary key-value metadata to variables.
Parquet supports "nested" types like variable-length lists, dicts, etc. that are currently unsupported in Zarr (but may be possible in the future).
In general, Parquet and other tabular formats can't be substituted for Zarr / Icechunk, due to the lack of multidimensional array support.
On the other hand, tabular data can be modeled in Zarr / Icechunk in a relatively straightforward way: each column as a 1D array, and a table / dataframe as a group of same-sized 1D arrays.
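For instance, a small three-column table could be modeled as a group of equal-length 1D arrays (a sketch; the column names are made up):
```python
import zarr

n_rows = 1_000
table = zarr.group()  # an in-memory group, for illustration
table.create_array("id", shape=(n_rows,), dtype="int64")
table.create_array("value", shape=(n_rows,), dtype="float64")
table.create_array("flag", shape=(n_rows,), dtype="bool")
```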
#### [Apache Iceberg](https://iceberg.apache.org/)
> Iceberg is a high-performance format for huge analytic tables.
> Iceberg brings the reliability and simplicity of SQL tables to big data, while making it possible for engines like Spark, Trino, Flink, Presto, Hive and Impala to safely work with the same tables, at the same time.
Iceberg is commonly used to manage many Parquet files as a single table in object storage.
Iceberg was influential in the design of Icechunk.
Many of the core requirements of the [spec](./spec.md) are similar to Iceberg's.
Specifically, both formats share the following properties:
- Files are written to object storage immutably
- All data and metadata files are tracked explicitly by manifests
- Similar mechanism for staging snapshots and committing transactions
- Support for branches and tags
However, unlike Iceberg, Icechunk _does not require an external catalog_ to commit transactions; it relies solely on the consistency of the object store.
#### [Delta Lake](https://delta.io/)
Delta is another popular table format based on a log of updates to the table state.
Its functionality and design are quite similar to Iceberg's, as is its comparison to Icechunk.
#### [Lance](https://lancedb.github.io/lance/index.html)
> Lance is a modern columnar data format that is optimized for ML workflows and datasets.
Despite its focus on multimodal data, as a columnar format, Lance can't accommodate large arrays / tensors chunked over arbitrary dimensions, making it fundamentally different from Icechunk.
However, the modern design of Lance was very influential on Icechunk.
Icechunk's commit and conflict resolution mechanism is partly inspired by Lance.
### Other Related projects
#### [Xarray](https://xarray.dev/)
> Xarray is an open source project and Python package that introduces labels in the form of dimensions, coordinates, and attributes on top of raw NumPy-like arrays, which allows for more intuitive, more concise, and less error-prone user experience.
>
> Xarray includes a large and growing library of domain-agnostic functions for advanced analytics and visualization with these data structures.
Xarray and Zarr / Icechunk work great together!
Xarray is the recommended way to read and write Icechunk data for Python users in geospatial, weather, climate, and similar domains.
#### [Kerchunk](https://fsspec.github.io/kerchunk/)
> Kerchunk is a library that provides a unified way to represent a variety of chunked, compressed data formats (e.g. NetCDF/HDF5, GRIB2, TIFF, …), allowing efficient access to the data from traditional file systems or cloud object storage.
> It also provides a flexible way to create virtual datasets from multiple files. It does this by extracting the byte ranges, compression information and other information about the data and storing this metadata in a new, separate object.
> This means that you can create a virtual aggregate dataset over potentially many source files, for efficient, parallel and cloud-friendly in-situ access without having to copy or translate the originals.
> It is a gateway to in-the-cloud massive data processing while the data providers still insist on using legacy formats for archival storage
Kerchunk emerged from the [Pangeo](https://www.pangeo.io/) community as an experimental
way of reading archival files, allowing those files to be accessed "virtually" using the Zarr protocol.
Kerchunk pioneered the concept of a "chunk manifest", a file containing references to compressed binary chunks in other files in the form of the tuple `(uri, offset, size)`.
Kerchunk has experimented with different ways of serializing chunk manifests, including JSON and Parquet.
Icechunk provides a highly efficient and scalable mechanism for storing and tracking the references generated by Kerchunk.
Kerchunk and Icechunk are highly complementary.
#### [VirtualiZarr](https://virtualizarr.readthedocs.io/en/latest/)
> VirtualiZarr creates virtual Zarr stores for cloud-friendly access to archival data, using familiar Xarray syntax.
VirtualiZarr provides another way of generating and manipulating Kerchunk-style references.
VirtualiZarr first uses Kerchunk to generate virtual references, but then provides a simple Xarray-based interface for manipulating those references.
As VirtualiZarr can also write virtual references into an Icechunk Store directly, together they form a complete pipeline for generating and storing references to multiple pre-existing files.
#### [LakeFS](https://lakefs.io/)
LakeFS is a solution providing git-style version control on top of cloud object storage.
LakeFS enables git-style commits, tags, and branches representing the state of an entire object storage bucket.
LakeFS is format agnostic and can accommodate any type of data, including Zarr.
LakeFS can therefore be used to create a versioned Zarr store, similar to Icechunk.
Icechunk, however, is designed specifically for array data, based on the Zarr data model.
This specialization enables numerous optimizations and user-experience enhancements not possible with LakeFS.
LakeFS also requires a server to operate.
Icechunk, in contrast, works with just object storage.
#### [TensorStore](https://google.github.io/tensorstore/index.html)
> TensorStore is a library for efficiently reading and writing large multi-dimensional arrays.
TensorStore can read and write a variety of different array formats, including Zarr.
While TensorStore is not yet compatible with Icechunk, it should be possible to implement Icechunk support in TensorStore.
TensorStore implements an [ocdbt](https://google.github.io/tensorstore/kvstore/ocdbt/index.html#ocdbt-key-value-store-driver):
> The ocdbt driver implements an Optionally-Cooperative Distributed B+Tree (OCDBT) on top of a base key-value store.
OCDBT implements a transactional, versioned key-value store suitable for storing Zarr data, thereby supporting some of the same features as Icechunk.
Unlike Icechunk, the OCDBT key-value store is not specialized to Zarr, does not differentiate between chunk and metadata keys, and does not store any metadata about chunks.
## Why do I have to fork a Session for parallel writes?
Icechunk is different from normal Zarr stores because it is stateful. In a distributed setting, you have to be careful to communicate the Session objects back from the remote write tasks, merge them appropriately, and then execute the commit. The explicit `fork` step lets Icechunk hint to the user that this workflow requires care.
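In outline, every distributed write follows the same pattern (a sketch; `run_remote_writes` is a hypothetical stand-in for your task framework, and `repo` is assumed to be an open `Repository`):
```python
session = repo.writable_session("main")
fork = session.fork()                      # pickleable ForkSession for workers
remote_sessions = run_remote_writes(fork)  # hypothetical distributed step
session.merge(*remote_sessions)            # bring the remote changes back
session.commit("distributed write")
```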
## Does `icechunk-python` include logging?
Yes! Set the environment variable `ICECHUNK_LOG=icechunk=debug` to print debug logs to stdout. Available "levels" in order of increasing verbosity are `error`, `warn`, `info`, `debug`, `trace`. The default level is `error`. The Rust library uses the `tracing-subscriber` crate, and the `ICECHUNK_LOG` variable can be used to filter logging following that crate's [documentation](https://docs.rs/tracing-subscriber/latest/tracing_subscriber/filter/struct.EnvFilter.html#directives). For example, `ICECHUNK_LOG=trace` will set both icechunk's and its dependencies' log levels to `trace`, while `ICECHUNK_LOG=icechunk=trace` will enable the `trace` level for icechunk only. For more complex control, `ICECHUNK_LOG=debug,icechunk=trace,rustls=info,h2=info,hyper=info` will set `trace` for `icechunk`, `info` for the `rustls`, `hyper`, and `h2` crates, and `debug` for every other crate.
You can also use Python's `os.environ` to set or change the value of the variable. If you change the environment variable after `icechunk` was
imported, you will need to call `icechunk.set_logs_filter(None)` for changes to take effect.
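For example, to turn on debug logging from within a running interpreter:
```python
import os

import icechunk

os.environ["ICECHUNK_LOG"] = "icechunk=debug"
icechunk.set_logs_filter(None)  # re-read ICECHUNK_LOG after changing it
```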
This function also accepts the filter directive. If you prefer not to use environment variables, you can do:
```python
icechunk.set_logs_filter("debug,icechunk=trace")
```
## How to get a read-only Icechunk store?
Zarr has a few mechanisms to define read-only Zarr stores. These don't always work perfectly with Icechunk,
because Icechunk has a more advanced session model. The safest way to make sure you don't write to
an Icechunk repo is to use `Repository.readonly_session` to create the session. It doesn't matter what
you do to the Zarr store; a read-only session cannot do writes.
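For example:
```python
session = repo.readonly_session("main")
store = session.store  # any write through this store will raise an error
```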
## Does Icechunk work with Zarr sharding?
Yes, as long as you use `zarr.config.set({"async.concurrency": 1})` when _writing_, as the Zarr sharding implementation is [not parallel-safe for _writes_](https://github.com/zarr-developers/zarr-python/pull/3217).
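For example, when writing a sharded array (a sketch; `sharded_array` and `data` are stand-ins for your array and values):
```python
import zarr

# serialize chunk writes so concurrent tasks cannot corrupt a shard
with zarr.config.set({"async.concurrency": 1}):
    sharded_array[:] = data
```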
# How To: Common Icechunk Operations
This page gathers common Icechunk operations into one compact how-to guide.
It is not intended as a deep explanation of how Icechunk works.
## Creating and Opening Repos
Creating and opening repos requires creating a `Storage` object.
See the [Storage guide](./storage.md) for all the details.
### Create a New Repo
```python
storage = icechunk.s3_storage(bucket="my-bucket", prefix="my-prefix", from_env=True)
repo = icechunk.Repository.create(storage)
```
### Open an Existing Repo
```python
repo = icechunk.Repository.open(storage)
```
### Specify Custom Config when Opening a Repo
There are many configuration options available to control the behavior of the repository and the storage backend.
See [Configuration](./configuration.md) for all the details.
```python
config = icechunk.RepositoryConfig.default()
config.caching = icechunk.CachingConfig(num_bytes_chunks=100_000_000)
repo = icechunk.Repository.open(storage, config=config)
```
### Deleting a Repo
Icechunk doesn't provide a way to delete a repo once it has been created.
If you need to delete a repo, just go to the underlying storage and remove the directory where you created the repo.
## Reading, Writing, and Modifying Data with Zarr
Read and write operations occur within the context of a [transaction](./version-control.md).
The general pattern is:
```python
session = repo.writable_session(branch="main")
# interact with the repo via session.store
# ...
session.commit(message="wrote some data")
```
!!! info
In the examples below, we just show the interaction with the `store` object.
Keep in mind that all sessions need to be concluded with a `.commit()`.
Alternatively, you can also use the `.transaction` function as a context manager,
which automatically commits when the context exits.
```python
with repo.transaction(branch="main", message="wrote some data") as store:
# interact with the repo via store
```
### Create a Group
```python
group = zarr.create_group(session.store, path="my-group", zarr_format=3)
```
### Create an Array
```python
array = group.create_array("my_array", shape=(10, 20), dtype='int32')
```
### Write Data to an Array
```python
array[2:5, :10] = 1
```
### Read Data from an Array
```python
data = array[:5, :10]
```
### Resize an Array
```python
array.resize((20, 30))
```
### Add or Modify Array / Group Attributes
```python
array.attrs["standard_name"] = "time"
```
### View Array / Group Attributes
```python
dict(array.attrs)
```
### Delete a Group
```python
del group["subgroup"]
```
### Delete an Array
```python
del group["array"]
```
## Reading and Writing Data with Xarray
### Write an in-memory Xarray Dataset
```python
ds.to_zarr(session.store, group="my-group", zarr_format=3, consolidated=False)
```
### Append to an existing dataset
```python
ds.to_zarr(session.store, group="my-group", append_dim='time', consolidated=False)
```
### Write an Xarray dataset with Dask
Writing with Dask or any other parallel execution framework requires special care.
See [Parallel writes](./parallel.md) and [Xarray](./xarray.md) for more detail.
```python
from icechunk.xarray import to_icechunk
to_icechunk(ds, session)
```
### Read a dataset with Xarray
Reading can be done with a read-only session.
```python
session = repo.readonly_session("main")
ds = xr.open_zarr(session.store, group="my-group", zarr_format=3, consolidated=False)
```
## Transactions and Version Control
For more depth, see [Transactions and Version Control](./version-control.md).
### Create a Snapshot via a Transaction
```python
snapshot_id = session.commit("commit message")
```
### Resolve a Commit Conflict
The case of no actual conflicts:
```python
try:
session.commit("commit message")
except icechunk.ConflictError:
session.rebase(icechunk.ConflictDetector())
session.commit("committed after rebasing")
```
Or if you have conflicts between different commits and want to overwrite the other changes:
```python
try:
session.commit("commit message")
except icechunk.ConflictError:
session.rebase(icechunk.BasicConflictSolver(on_chunk_conflict=icechunk.VersionSelection.UseOurs))
session.commit("committed after rebasing")
```
### Commit with Automatic Rebasing
This will automatically retry the commit until it succeeds
```python
session.commit("commit message", rebase_with=icechunk.ConflictDetector())
```
### List Snapshots
```python
for snapshot in repo.ancestry(branch="main"):
print(snapshot)
```
### Check out a Snapshot
```python
session = repo.readonly_session(snapshot_id=snapshot_id)
```
### Create a Branch
```python
repo.create_branch("dev", snapshot_id=snapshot_id)
```
### List all Branches
```python
branches = repo.list_branches()
```
### Check out a Branch
```python
session = repo.writable_session("dev")
```
### Reset a Branch to a Different Snapshot
```python
repo.reset_branch("dev", snapshot_id=snapshot_id)
```
### Create a Tag
```python
repo.create_tag("v1.0.0", snapshot_id=snapshot_id)
```
### List all Tags
```python
tags = repo.list_tags()
```
### Check out a Tag
```python
session = repo.readonly_session(tag="v1.0.0")
```
### Delete a Tag
```python
repo.delete_tag("v1.0.0")
```
## Repo Maintenance
For more depth, see [Data Expiration](./expiration.md).
### Run Snapshot Expiration
```python
from datetime import datetime, timedelta
expiry_time = datetime.now() - timedelta(days=10)
expired = repo.expire_snapshots(older_than=expiry_time)
```
### Run Garbage Collection
```python
results = repo.garbage_collect(expiry_time)
```
### Usage in async contexts
Most methods in Icechunk have an async counterpart, named with an `_async` suffix. For more info, see [Async Usage](./async.md).
```python
results = await repo.garbage_collect_async(expiry_time)
```
# Icechunk for Git Users
While Icechunk does not work the same way as [git](https://git-scm.com/), it borrows many of the same concepts. This guide walks through the version control features of Icechunk from the perspective of a user who is familiar with git.
## Repositories
The main primitive in Icechunk is the [repository](./reference.md#icechunk.Repository). Similar to git, the repository is the entry point for all operations and the source of truth for the data. However, there are many important differences.
When developing with git, you will commonly have a local and a remote copy of the repository. The local copy is where you do all of your work. The remote copy is where you push your changes when you are ready to share them with others. In Icechunk, there is no local or remote repository: there is a single repository, typically in a cloud storage bucket. This means that every transaction is saved to the same repository that others may be working on. Icechunk uses the consistency guarantees of storage systems to provide strong consistency even when multiple users are working on the same repository.
## Working with branches
Icechunk has [branches](version-control.md#branches) similar to git.
### Creating a branch
In practice, this means the workflow is different from git. For instance, if I wanted to make a new branch based on the `main` branch of my existing git repository and then commit my changes, this is how I would do it in git:
```bash
# Assume currently on main branch
# create branch
git checkout -b my-new-branch
# stage changes
git add myfile.txt
# commit changes
git commit -m "My new branch"
# push to remote
git push origin -u my-new-branch
```
In Icechunk, you would do the following:
```python
# We create the branch
repo.create_branch("my-new-branch", repo.lookup_branch("main"))
# create a writable session
session = repo.writable_session("my-new-branch")
... # make some changes
# commit the changes
session.commit("My new branch")
```
Two things to note:
1. When we create a branch, the branch is immediately available to any other instance of this `Repository` object. It is not a local branch; it is created in the repository's storage backend.
2. When we commit, the changes are immediately visible to other users of the repository. There is no concept of a local commit; all snapshots happen in the storage backend.
### Checking out a branch
In git, you can check out a branch by using the `git checkout` command. Icechunk does not have the concept of checking out a branch; instead, you create [`Session`s](reference.md#icechunk.Session) that are based on the tip of a branch.
We can either check out a branch for [read-only access](reference.md#icechunk.Repository.readonly_session) or for [read-write access](reference.md#icechunk.Repository.writable_session).
```python
# check out a branch for read-only access
session = repo.readonly_session(branch="my-new-branch")
# readonly_session accepts a branch name by default
session = repo.readonly_session("my-new-branch")
# check out a branch for read-write access
session = repo.writable_session("my-new-branch")
```
Once we have checked out a session, the [`store`](reference.md#icechunk.Session.store) method will return a [`Store`](reference.md#icechunk.IcechunkStore) object that we can use to read and write data to the repository with `zarr`.
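For example:
```python
import zarr

session = repo.writable_session("my-new-branch")
group = zarr.open_group(session.store)  # read and write through the zarr API
```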
### Resetting a branch
In git, you can reset a branch to a previous commit. Similarly, in Icechunk you can [reset a branch to a previous snapshot](reference.md#icechunk.Repository.reset_branch).
```python
# reset the branch to the previous snapshot
repo.reset_branch("my-new-branch", "198273178639187")
```
!!! warning
This is a destructive operation. It will overwrite the branch reference with the snapshot immediately. It can only be undone by resetting the branch again.
At this point, the tip of the branch is the snapshot `198273178639187` and any changes made to the branch will be based on this snapshot. This also means the history of the branch is now the same as the ancestry of this snapshot.
### Branch History
In Icechunk, you can view the history of a branch by using the [`repo.ancestry()`](reference.md#icechunk.Repository.ancestry) command, similar to the `git log` command.
```python
[ancestor for ancestor in repo.ancestry(branch="my-new-branch")]
#[Snapshot(id='198273178639187', ...), ...]
```
### Listing branches
We can also [list all branches](reference.md#icechunk.Repository.list_branches) in the repository.
```python
repo.list_branches()
# ['main', 'my-new-branch']
```
You can also view the snapshot that a branch is based on by using the [`repo.lookup_branch()`](reference.md#icechunk.Repository.lookup_branch) command.
```python
repo.lookup_branch("my-new-branch")
# '198273178639187'
```
### Deleting a branch
You can delete a branch by using the [`repo.delete_branch()`](reference.md#icechunk.Repository.delete_branch) command.
```python
repo.delete_branch("my-new-branch")
```
## Working with tags
Icechunk [tags](version-control.md#tags) are also similar to git tags.
### Creating a tag
We [create a tag](reference.md#icechunk.Repository.create_tag) by providing a name and a snapshot id, similar to the `git tag` command.
```python
repo.create_tag("my-new-tag", "198273178639187")
```
Just like git tags, Icechunk tags are immutable and cannot be modified. They can, however, be [deleted like git tags](reference.md#icechunk.Repository.delete_tag):
```python
repo.delete_tag("my-new-tag")
```
However, unlike git tags, a deleted tag cannot be recreated. Attempting to do so raises an error:
```python
repo.create_tag("my-new-tag", "198273178639187")
# IcechunkError: Tag with name 'my-new-tag' already exists
```
### Listing tags
We can also [list all tags](reference.md#icechunk.Repository.list_tags) in the repository.
```python
repo.list_tags()
# ['my-new-tag']
```
### Viewing tag history
We can also view the history of a tag by using the [`repo.ancestry()`](reference.md#icechunk.Repository.ancestry) command.
```python
repo.ancestry(tag="my-new-tag")
```
This will return an iterator of snapshots that are ancestors of the tag. Similar to branches we can lookup the snapshot that a tag is based on by using the [`repo.lookup_tag()`](reference.md#icechunk.Repository.lookup_tag) command.
```python
repo.lookup_tag("my-new-tag")
# '198273178639187'
```
## Merging and Rebasing
Git supports merging and rebasing branches. Icechunk currently does not support merging or rebasing branches; it does, however, support [rebasing sessions that share the same branch](version-control.md#conflict-resolution).
---
title: Rust
---
# Icechunk Rust
The Icechunk Rust library is used internally by Icechunk Python.
It is currently not designed to be used in standalone form.
- [Icechunk Rust Documentation](https://docs.rs/icechunk/latest/icechunk/) at docs.rs
We welcome contributors interested in implementing more Rust functionality!
---
template: home.html
title: Icechunk - Open-source, cloud-native transactional tensor storage engine
---
# Migration guide
## Parallel Writes
Icechunk is a stateful store and requires care when executing distributed writes.
Icechunk 1.0 introduces a new API for safe [_coordinated_ distributed writes](./parallel.md#cooperative-distributed-writes), where a Session is distributed to remote workers:
1. Create a [`ForkSession`](./reference.md#icechunk.ForkSession) using [`Session.fork`](./reference.md#icechunk.Session.fork) instead of the [`Session.allow_pickling`](./reference.md#icechunk.Session.allow_pickling) context manager. ForkSessions can be pickled and written to in remote distributed workers.
1. Only Sessions with _no_ changes can be forked. You may need to insert commits in your current workflows.
1. [`Session.merge`](./reference.md#icechunk.Session.merge) can now merge multiple sessions, so the use of [`merge_sessions`](./reference.md#icechunk.distributed.merge_sessions) is discouraged.
The tabs below highlight typical code changes required:
=== "After"
```python {hl_lines="7 10 17"}
from concurrent.futures import ProcessPoolExecutor
session = repo.writable_session("main")
with ProcessPoolExecutor() as executor:
# obtain a writable session that can be pickled.
fork = session.fork()
# submit the writes, distribute `fork`
futures = [
executor.submit(write_timestamp, itime=i, session=fork)
for i in range(ds.sizes["time"])
]
# grab the Session objects from each individual write task
remote_sessions = [f.result() for f in futures]
# manually merge the remote sessions in to the local session
session.merge(*remote_sessions)
session.commit("finished writes")
```
=== "Before"
```python {hl_lines="2 7 10 17"}
from concurrent.futures import ProcessPoolExecutor
from icechunk.distributed import merge_sessions
session = repo.writable_session("main")
with ProcessPoolExecutor() as executor:
# obtain a writable session that can be pickled.
with session.allow_pickling():
# submit the writes, distribute `session`
futures = [
executor.submit(write_timestamp, itime=i, session=session)
for i in range(ds.sizes["time"])
]
# grab the Session objects from each individual write task
remote_sessions = [f.result() for f in futures]
# manually merge the remote sessions in to the local session
session = merge_sessions(session, *remote_sessions)
session.commit("finished writes")
```
## Virtual Datasets
Icechunk 1.0 gives the user more control over which virtual chunks can be resolved at runtime.
Virtual chunks are associated with a virtual chunk container based on their url. Each virtual
chunk container must declare its url prefix. In versions before 1.0, Icechunk had a list of
default virtual chunk containers. Since version 1.0, no defaults are provided: to give users more
control over what Icechunk will try to resolve, repository creators must explicitly declare their
virtual chunk containers.
You can follow this example to declare a virtual chunk container in your repo:
```python
from icechunk import (
    Repository,
    RepositoryConfig,
    VirtualChunkContainer,
    containers_credentials,
    s3_credentials,
    s3_store,
)

# The store defines the type of Storage that will be used for the virtual chunks;
# other options are gcs_store, local_filesystem_store and http_store
store_config = s3_store(region="us-east-1")
# we create a container by giving it the url prefix and the store config
container = VirtualChunkContainer("s3://testbucket", store_config)
# we add it to the repo config
config = RepositoryConfig.default()
config.set_virtual_chunk_container(container)
# we set credentials for the virtual chunk container
# repo readers will also need this to be able to resolve the virtual chunks
credentials = containers_credentials(
{
# we identify for which container we are passing credentials
# by using its url_prefix
# If the value in the map is None, Icechunk will use the "natural" credential
# type, usually fetching them from the process environment
"s3://testbucket": s3_credentials(
access_key_id="abcd", secret_access_key="0123"
)
}
)
# When we create the repo, its configuration will be saved to disk
# including its virtual chunk containers
repo = Repository.create(
storage=...,
config=config,
authorize_virtual_chunk_access=credentials,
)
```
In previous Icechunk versions, `Repository.create` and `Repository.open` had a
`virtual_chunk_credentials` argument. This argument is replaced by the new
`authorize_virtual_chunk_access`. If a container is not present in the
`authorize_virtual_chunk_access` dictionary, Icechunk will refuse to resolve
chunks matching its url prefix.
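For example, a reader of an existing repository might authorize access to a container like this (a sketch; the bucket name is illustrative, and mapping a prefix to `None` selects the "natural" credentials described above):
```python
repo = Repository.open(
    storage=...,
    authorize_virtual_chunk_access=containers_credentials(
        # None -> use the "natural" credential type, usually from the environment
        {"s3://testbucket": None}
    ),
)
```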
---
title: Overview
---
# Icechunk
Icechunk is an open-source (Apache 2.0), transactional storage engine for tensor / ND-array data designed for use on cloud object storage.
Icechunk works together with **[Zarr](https://zarr.dev/)**, augmenting the Zarr core data model with features
that enhance performance, collaboration, and safety in a cloud-computing context.
## Icechunk Overview
Let's break down what "transactional storage engine for Zarr" actually means:
- **[Zarr](https://zarr.dev/)** is an open source specification for the storage of multidimensional array (a.k.a. tensor) data.
Zarr defines the metadata for describing arrays (shape, dtype, etc.) and the way these arrays are chunked, compressed, and converted to raw bytes for storage. Zarr can store its data in any key-value store.
There are many different implementations of Zarr in different languages. _Right now, Icechunk only supports
[Zarr Python](https://zarr.readthedocs.io/en/stable/)._
If you're interested in implementing Icechunk support, please [open an issue](https://github.com/earth-mover/icechunk/issues) so we can help you.
- **Storage engine** - Icechunk exposes a key-value interface to Zarr and manages all of the actual I/O for getting, setting, and updating both metadata and chunk data in cloud object storage.
Zarr libraries don't have to know exactly how Icechunk works under the hood in order to use it.
- **Transactional** - The key improvement that Icechunk brings on top of regular Zarr is to provide consistent serializable isolation between transactions.
This means that Icechunk data are safe to read and write in parallel from multiple uncoordinated processes.
This allows Zarr to be used more like a database.
The core entity in Icechunk is a repository or **repo**.
A repo is defined as a Zarr hierarchy containing one or more Arrays and Groups, and a repo functions as a
self-contained _Zarr Store_.
The most common scenario is for an Icechunk repo to contain a single Zarr group with multiple arrays, each corresponding to different physical variables but sharing common spatiotemporal coordinates.
However, formally a repo can be any valid Zarr hierarchy, from a single Array to a deeply nested structure of Groups and Arrays.
Users of Icechunk should aim to scope their repos only to related arrays and groups that require consistent transactional updates.
Icechunk supports the following core requirements:
1. **Object storage** - the format is designed around the consistency features and performance characteristics available in modern cloud object storage. No external database or catalog is required to maintain a repo.
(It also works with file storage.)
1. **Serializable isolation** - Reads are isolated from concurrent writes and always use a committed snapshot of a repo. Writes are committed atomically and are never partially visible. No locks are required for reading.
1. **Time travel** - Previous snapshots of a repo remain accessible while and after new snapshots are written.
1. **Data version control** - Repos support both _tags_ (immutable references to snapshots) and _branches_ (mutable references to snapshots).
1. **Chunk sharding** - Chunk storage is decoupled from specific file names. Multiple chunks can be packed into a single object (sharding).
1. **Chunk references** - Zarr-compatible chunks within other file formats (e.g. HDF5, NetCDF) can be referenced.
1. **Schema evolution** - Arrays and Groups can be added, renamed, and removed from the hierarchy with minimal overhead.
## Key Concepts
### Groups, Arrays, and Chunks
Icechunk is designed around the Zarr data model, widely used in scientific computing, data science, and AI / ML.
(The Zarr high-level data model is effectively the same as HDF5.)
The core data structure in this data model is the **array**.
Arrays have two fundamental properties:
- **shape** - a tuple of integers which specify the dimensions of each axis of the array. A 10 x 10 square array would have shape (10, 10).
- **data type** - a specification of what type of data is found in each element, e.g. integer, float, etc. Different data types have different precision (e.g. 16-bit integer, 64-bit float, etc.)
In Zarr / Icechunk, arrays are split into **chunks**.
A chunk is the minimum unit of data that must be read / written from storage, so choices about chunking have strong implications for performance.
Zarr leaves this completely up to the user.
Chunk shape should be chosen based on the anticipated data access pattern for each array.
An Icechunk array is not bounded by an individual file and is effectively unlimited in size.
For further organization of data, Icechunk supports **groups** within a single repo.
Groups are like folders that contain multiple arrays and/or other groups.
Groups enable data to be organized into hierarchical trees.
A common usage pattern is to store multiple arrays in a group representing a NetCDF-style dataset.
Arbitrary JSON-style key-value metadata can be attached to both arrays and groups.
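As a quick sketch of this data model using Zarr Python (the names, shapes, and attributes are illustrative):
```python
import zarr

# an in-memory hierarchy: a group containing one chunked array
root = zarr.group()
tair = root.create_array(
    name="Tair", shape=(365, 720, 1440), chunks=(1, 720, 1440), dtype="float32"
)

# arbitrary JSON-style metadata can be attached to groups and arrays
root.attrs["description"] = "example hierarchy"
tair.attrs["units"] = "K"
```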
### Snapshots
Every update to an Icechunk store creates a new **snapshot** with a unique ID.
Icechunk users must organize their updates into groups of related operations called **transactions**.
For example, appending a new time slice to multiple arrays should be done as a single transaction, comprising the following steps:
1. Update the array metadata to resize the array to accommodate the new elements.
2. Write new chunks for each array in the group.
While the transaction is in progress, none of these changes will be visible to other users of the store.
Once the transaction is committed, a new snapshot is generated.
Readers can only see and use committed snapshots.
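A minimal sketch of such a transaction (assuming a repo whose `main` branch already contains a 1-D array named `t`):
```python
import zarr

session = repo.writable_session("main")
array = zarr.open_array(session.store, path="t", mode="a")

# 1. resize the array to accommodate the new elements
array.resize((array.shape[0] + 1,))
# 2. write the new chunk
array[-1] = 42

# none of the above is visible to readers until this commit succeeds
session.commit("append one time step")
```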
### Branches and Tags
Additionally, snapshots occur in a specific linear (i.e. serializable) order within a **branch**.
A branch is a mutable reference to a snapshot: a pointer that maps the branch name to a snapshot ID.
The default branch is `main`.
Every commit to the main branch updates this reference.
Icechunk's design protects against the race condition in which two uncoordinated sessions attempt to update the branch at the same time; only one can succeed.
Icechunk also defines **tags**--_immutable_ references to a snapshot.
Tags are appropriate for publishing specific releases of a repository or for any application which requires a persistent, immutable identifier to the store state.
### Chunk References
Chunk references are "pointers" to chunks that exist in other files--HDF5, NetCDF, GRIB, etc.
Icechunk can store these references alongside native Zarr chunks as "virtual datasets".
You can then update these virtual datasets incrementally (overwrite chunks, change metadata, etc.) without touching the underlying files.
## How Does It Work?
!!! note
For more detailed explanation, have a look at the [Icechunk spec](./spec.md)
Zarr itself works by storing both metadata and chunk data in an abstract store according to a specified system of "keys".
For example, a 2D Zarr array called `myarray`, within a group called `mygroup`, would generate the following keys:
```
mygroup/zarr.json
mygroup/myarray/zarr.json
mygroup/myarray/c/0/0
mygroup/myarray/c/0/1
```
In standard Zarr stores, these keys map directly to filenames in a filesystem or object keys in an object storage system.
When writing data, a Zarr implementation will create these keys and populate them with data. When modifying existing arrays or groups, a Zarr implementation will potentially overwrite existing keys with new data.
This is generally not a problem, as long as there is only one person or process coordinating access to the data.
However, when multiple uncoordinated readers and writers attempt to access the same Zarr data at the same time, [various consistency problems](https://docs.earthmover.io/concepts/version-control-system#consistency-problems-with-zarr) emerge.
These consistency problems can occur in both file storage and object storage; they are particularly severe in a cloud setting where Zarr is being used as an active store for data that are frequently changed while also being read.
With Icechunk, we keep the same core Zarr data model, but add a layer of indirection between the Zarr keys and the on-disk storage.
The Icechunk library translates between the Zarr keys and the actual on-disk data given the particular context of the user's state.
Icechunk defines a series of interconnected metadata and data files that together enable efficient isolated reading and writing of metadata and chunks.
Once written, these files are immutable.
Icechunk keeps track of every single chunk explicitly in a "chunk manifest".
```mermaid
flowchart TD
zarr-python[Zarr Library] <-- key / value--> icechunk[Icechunk Library]
icechunk <-- data / metadata files --> storage[(Object Storage)]
```
# Parallel Writes
A common pattern with large distributed write jobs is to first initialize the dataset on disk
with all appropriate metadata and any coordinate variables. Following this, a large write job
is kicked off in a distributed setting, where each worker is responsible for an independent
"region" of the output.
## Why is Icechunk different from any other Zarr store?
The reason is that unlike Zarr, Icechunk is a "stateful" store. The Session object keeps a record of all writes, which are then
bundled together in a commit. Thus `Session.commit` must be executed on a Session object that knows about all writes,
including those executed remotely in a multi-processing or any other remote execution context.
!!! info
Learn about Icechunk consistency with a clichéd but instructive example [in this blog post](https://earthmover.io/blog/learning-about-icechunk-consistency)
## Example
Here is how you can execute such writes with Icechunk, illustrated with a `ThreadPoolExecutor`.
First read some example data, and create an Icechunk Repository.
```python exec="on" session="parallel" source="material-block"
import xarray as xr
import tempfile
from icechunk import Repository, local_filesystem_storage
ds = xr.tutorial.open_dataset("rasm").isel(time=slice(24))
repo = Repository.create(local_filesystem_storage(tempfile.TemporaryDirectory().name))
session = repo.writable_session("main")
```
We will orchestrate so that each task writes one timestep.
This is an arbitrary choice but determines what we set for the Zarr chunk size.
```python exec="on" session="parallel" source="material-block" result="code"
chunks = tuple(1 if dim == "time" else ds.sizes[dim] for dim in ds.Tair.dims)
```
Initialize the dataset using [`Dataset.to_zarr`](https://docs.xarray.dev/en/stable/generated/xarray.Dataset.to_zarr.html)
with `compute=False`. This will NOT write any chunked array data, but will write all array metadata and any
in-memory arrays (only `time` in this case).
```python exec="on" session="parallel" source="material-block"
ds.to_zarr(session.store, compute=False, encoding={"Tair": {"chunks": chunks}}, mode="w", consolidated=False)
# this commit is optional, but may be useful in your workflow
print(session.commit("initialize store"))
```
## Multi-threading
First define a function that constitutes one "write task".
```python exec="on" session="parallel" source="material-block"
from icechunk import Session
def write_timestamp(*, itime: int, session: Session) -> None:
# pass a list to isel to preserve the time dimension
ds = xr.tutorial.open_dataset("rasm").isel(time=[itime])
# region="auto" tells Xarray to infer which "region" of the output arrays to write to.
ds.to_zarr(session.store, region="auto", consolidated=False)
```
Now execute the writes.
```python
from concurrent.futures import ThreadPoolExecutor, wait
session = repo.writable_session("main")
with ThreadPoolExecutor() as executor:
# submit the writes
futures = [executor.submit(write_timestamp, itime=i, session=session) for i in range(ds.sizes["time"])]
wait(futures)
print(session.commit("finished writes"))
```
Verify that the writes worked as expected:
```python exec="on" session="parallel" source="material-block" result="code"
ondisk = xr.open_zarr(repo.readonly_session("main").store, consolidated=False)
xr.testing.assert_identical(ds, ondisk)
```
## Distributed writes
There are fundamentally two different modes for distributed writes in Icechunk:
- "Cooperative" distributed writes, in which all of the changes being written are part of the same transaction.
The point of this is to allow large-scale, massively-parallel writing to the store as part of a single coordinated job.
In this scenario, it's the user's job to align the writing process with the Zarr chunks and avoid inconsistent metadata updates.
- "Uncooperative" writes, in which multiple workers are attempting to write to the same store in an uncoordinated way.
This path relies on the optimistic concurrency mechanism to detect and resolve conflicts.
!!! info
This code will not execute with a `ProcessPoolExecutor` without [some changes](https://docs.python.org/3/library/multiprocessing.html#programming-guidelines).
Specifically, it requires wrapping the code in an `if __name__ == "__main__":` block, as sketched below.
See a full executable example [here](https://github.com/earth-mover/icechunk/blob/main/icechunk-python/examples/mpwrite.py).
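A minimal sketch of that structure (the session setup and writes from the following sections are elided):
```python
def main():
    ...  # create the session, fork it, and submit the writes as shown below

if __name__ == "__main__":
    # required so that worker processes can safely re-import this module
    main()
```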
### Cooperative distributed writes
Any task execution framework (e.g. `ProcessPoolExecutor`, Joblib, Lithops, Dask Distributed, Ray, etc.)
can be used instead of the `ThreadPoolExecutor`. However, such workloads should account for
Icechunk being a "stateful" store that records changes executed in a write session.
There are three key points to keep in mind:
1. The `write_timestamp` function *must* return the `Session`. It contains a record of the changes executed by this task.
These changes *must* be manually communicated back to the coordinating process, since each of the distributed processes
works with its own independent `Session` instance.
2. Icechunk requires that users obtain a distributable *writable* `Session` using `Session.fork()`.
This creates a new `ForkSession` object that can be pickled. Sessions can be forked _only_ when they have no uncommitted changes.
3. The user *must* manually merge the `ForkSession` objects into the `Session` to create a meaningful commit.
First we modify `write_timestamp` to return its session:
```python
from icechunk import Session
from icechunk.session import ForkSession
def write_timestamp(*, itime: int, session: ForkSession) -> ForkSession:
# pass a list to isel to preserve the time dimension
ds = xr.tutorial.open_dataset("rasm").isel(time=[itime])
# region="auto" tells Xarray to infer which "region" of the output arrays to write to.
ds.to_zarr(session.store, region="auto", consolidated=False)
return session
```
The steps for making a distributed write are as follows:
1. fork the Session with `Session.fork`,
2. gather the ForkSessions from individual tasks,
3. merge the `Session` with the gathered ForkSessions using [`Session.merge`](./reference.md#icechunk.Session.merge), and finally
4. make a successful commit using [`Session.commit`](./reference.md#icechunk.Session.commit).
```python
from concurrent.futures import ProcessPoolExecutor
session = repo.writable_session("main")
with ProcessPoolExecutor() as executor:
# obtain a writable session that can be pickled.
fork = session.fork()
# submit the writes
futures = [
executor.submit(write_timestamp, itime=i, session=fork)
for i in range(ds.sizes["time"])
]
# grab the Session objects from each individual write task
remote_sessions = [f.result() for f in futures]
# manually merge the remote sessions in to the local session
session.merge(*remote_sessions)
print(session.commit("finished writes"))
```
Verify that the writes worked as expected:
```python
ondisk = xr.open_zarr(repo.readonly_session("main").store, consolidated=False)
xr.testing.assert_identical(ds, ondisk)
```
### Uncooperative distributed writes
!!! warning
Using the multiprocessing start method 'fork' will result in a deadlock when trying to open an existing repository.
This happens because the files behind the repository need to be locked.
The 'fork' start method copies not only the lock, but also the state of the lock.
Thus all child processes inherit the file lock in an acquired state, leaving them hanging indefinitely waiting for the file lock to be released, which never happens.
Polars has a similar issue, described in its [documentation about multiprocessing](https://docs.pola.rs/user-guide/misc/multiprocessing/).
Putting `mp.set_start_method('forkserver')` at the beginning of the script will solve this issue.
This is only necessary on POSIX systems other than macOS, because macOS and Windows do not use the `fork` start method by default.
Here is an example of uncooperative distributed writes using `multiprocessing`, based on [this discussion](https://github.com/earth-mover/icechunk/discussions/802).
```python
import multiprocessing as mp
import tempfile

import icechunk as ic
import zarr


def worker(i, path):
    print(f"Started worker {i}")
    # open the same repository that the parent process created
    repo = ic.Repository.open(ic.local_filesystem_storage(path))
    # keep trying until it succeeds
    while True:
        try:
            session = repo.writable_session("main")
            z = zarr.open(session.store, mode="r+")
            print(f"Opened store for {i} | {dict(z.attrs)}")
            a = z.attrs.get("done", [])
            a.append(i)
            z.attrs["done"] = a
            session.commit(f"wrote from worker {i}")
            break
        except ic.ConflictError:
            print(f"Conflict for {i}, retrying")


def main():
    # This is necessary on linux systems
    mp.set_start_method("forkserver")

    # create a fresh directory for this run; workers receive the path as an
    # argument, so every process opens the *same* repository
    path = tempfile.mkdtemp()
    repo = ic.Repository.create(ic.local_filesystem_storage(path))
    session = repo.writable_session("main")
    zarr.create(
        shape=(10, 10),
        chunks=(5, 5),
        store=session.store,
        overwrite=True,
    )
    session.commit("initialized dataset")

    p1 = mp.Process(target=worker, args=(1, path))
    p2 = mp.Process(target=worker, args=(2, path))
    p1.start()
    p2.start()
    p1.join()
    p2.join()

    session = repo.readonly_session(branch="main")
    z = zarr.open(session.store, mode="r")
    print(z.attrs["done"])
    print(list(repo.ancestry(branch="main")))


if __name__ == "__main__":
    main()
```
This should output something like the following. (Note that the order of the writes is not guaranteed.)
```sh
Started worker 1
Started worker 2
Opened store for 1 | {}
Opened store for 2 | {}
Conflict for 1, retrying
Opened store for 1 | {'done': [2]}
[2, 1]
[SnapshotInfo(id="MGPV1YE1SY0799AZFFB0", parent_id=YAN3D2N7ANCNKCFN3JSG, written_at=datetime.datetime(2025,3,4,21,40,57,19985, tzinfo=datetime.timezone.utc), message="wrote from..."), SnapshotInfo(id="YAN3D2N7ANCNKCFN3JSG", parent_id=0M5H3J6SC8MYBQYWACC0, written_at=datetime.datetime(2025,3,4,21,40,56,734126, tzinfo=datetime.timezone.utc), message="wrote from..."), SnapshotInfo(id="0M5H3J6SC8MYBQYWACC0", parent_id=WKKQ9K7ZFXZER26SES5G, written_at=datetime.datetime(2025,3,4,21,40,56,47192, tzinfo=datetime.timezone.utc), message="initialize..."), SnapshotInfo(id="WKKQ9K7ZFXZER26SES5G", parent_id=None, written_at=datetime.datetime(2025,3,4,21,40,55,868277, tzinfo=datetime.timezone.utc), message="Repository...")]
```
# Performance
!!! info
This is advanced material, and you will need it only if you have arrays with more than a million chunks.
Icechunk aims to provide an excellent experience out of the box.
## Scalability
Icechunk is designed to be cloud native, making it able to take advantage of the horizontal scaling of cloud providers. To learn more, check out [this blog post](https://earthmover.io/blog/exploring-icechunk-scalability) which explores just how well Icechunk can perform when matched with AWS S3.
## Cold buckets and repos
Modern object stores usually reshard their buckets on the fly, based on perceived load. The
strategies they use are not published and are very hard to discover. The details are not critical
anyway; the important takeaway is that on new buckets, and even on new repositories, the scalability
of the object store may not be great from the start. You are expected to ramp up load slowly as you
write data to the repository.
Once you have applied consistently high write/read load to a repository for a few minutes, the object
store will usually reshard your bucket allowing for more load. While this resharding happens, different
object stores can respond in different ways. For example, S3 returns 5xx errors with a "SlowDown"
indication. GCS returns 429 responses.
Icechunk helps this process by retrying failed requests with an exponential backoff. In our
experience, the default configuration is enough to ingest into a fresh bucket using around 100 machines.
If this is not the case for you, you can tune the retry configuration using [`StorageRetriesSettings`](https://icechunk.io/en/latest/icechunk-python/reference/#icechunk.StorageRetriesSettings).
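For example (a sketch; the field names and values shown here are assumptions for illustration, not tuning recommendations):
```python
import icechunk as ic

config = ic.RepositoryConfig(
    storage=ic.StorageSettings(
        retries=ic.StorageRetriesSettings(
            max_tries=20,  # assumed field names; check the reference docs
            initial_backoff_ms=200,
            max_backoff_ms=60_000,
        ),
    ),
)
repo = ic.Repository.open(..., config=config)
```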
To learn more about how Icechunk manages object store prefixes, read our
[blog post](https://earthmover.io/blog/exploring-icechunk-scalability)
on Icechunk scalability.
!!! warning
Currently, Icechunk's implementation of retry logic during resharding is not
[working properly](https://github.com/earth-mover/icechunk/issues/954) on GCS.
We have a [pull request open](https://github.com/apache/arrow-rs-object-store/pull/410) to
one of Icechunk's dependencies that will solve this.
In the meantime, if you get 429 errors from your Google bucket, please lower concurrency and try
again. Increase concurrency slowly until errors disappear.
## Splitting manifests
Icechunk stores chunk references in a chunk manifest file stored in `manifests/`.
By default, Icechunk stores all chunk references in a single manifest file per array.
For very large arrays (millions of chunks), these files can get quite large.
Requesting even a single chunk will require downloading the entire manifest.
In some cases, this can result in a slow time-to-first-byte or large memory usage.
Similarly, appending a small amount of data to a large array requires
downloading and rewriting the entire manifest.
!!! note
Note that the chunk sizes in the following examples are tiny for demonstration purposes.
### Configuring splitting
To solve this issue, Icechunk lets you __split__ the manifest files by specifying a ``ManifestSplittingConfig``.
```python exec="on" session="perf" source="material-block"
import icechunk as ic
from icechunk import ManifestSplitCondition, ManifestSplittingConfig, ManifestSplitDimCondition
split_config = ManifestSplittingConfig.from_dict(
{
ManifestSplitCondition.AnyArray(): {
ManifestSplitDimCondition.DimensionName("time"): 365 * 24
}
}
)
repo_config = ic.RepositoryConfig(
manifest=ic.ManifestConfig(splitting=split_config),
)
```
Then pass the `config` to `Repository.open` or `Repository.create`
```python
repo = ic.Repository.open(..., config=repo_config)
```
!!! important
Once you find a splitting configuration you like, remember to persist it on-disk using `repo.save_config`.
This particular example splits manifests so that each manifest file contains at most `365 * 24` chunks along the time dimension, and all chunks along every other dimension.
Options for specifying the arrays whose manifest you want to split are:
1. [`ManifestSplitCondition.name_matches`](./reference.md#icechunk.ManifestSplitCondition.name_matches) takes a regular expression used to match an array's name;
2. [`ManifestSplitCondition.path_matches`](./reference.md#icechunk.ManifestSplitCondition.path_matches) takes a regular expression used to match an array's path;
3. [`ManifestSplitCondition.and_conditions`](./reference.md#icechunk.ManifestSplitCondition.and_conditions) to combine (1), (2), and (4) together; and
4. [`ManifestSplitCondition.or_conditions`](./reference.md#icechunk.ManifestSplitCondition.or_conditions) to combine (1), (2), and (3) together.
`And` and `Or` may be used to combine multiple path and/or name matches. For example,
```python exec="on" session="perf" source="material-block"
array_condition = ManifestSplitCondition.or_conditions(
[
ManifestSplitCondition.name_matches("temperature"),
ManifestSplitCondition.name_matches("salinity"),
]
)
sconfig = ManifestSplittingConfig.from_dict(
{array_condition: {ManifestSplitDimCondition.DimensionName("longitude"): 3}}
)
```
!!! note
Instead of using `and_conditions` and `or_conditions`, you can use `&` and `|` operators to combine conditions:
```python
array_condition = ManifestSplitCondition.name_matches("temperature") | ManifestSplitCondition.name_matches("salinity")
```
Options for specifying how to split along a specific axis or dimension are:
1. [`ManifestSplitDimCondition.Axis`](./reference.md#icechunk.ManifestSplitDimCondition.Axis) takes an integer axis;
2. [`ManifestSplitDimCondition.DimensionName`](./reference.md#icechunk.ManifestSplitDimCondition.DimensionName) takes a regular expression used to match the dimension names of the array;
3. [`ManifestSplitDimCondition.Any`](./reference.md#icechunk.ManifestSplitDimCondition.Any) matches any _remaining_ dimension name or axis.
For example, for an array with dimensions `time, latitude, longitude`, the following config
```python exec="on" session="perf" source="material-block"
from icechunk import ManifestSplitDimCondition
{
ManifestSplitDimCondition.DimensionName("longitude"): 3,
ManifestSplitDimCondition.Axis(1): 2,
ManifestSplitDimCondition.Any(): 1,
}
```
will result in splitting manifests so that each manifest contains (3 longitude chunks x 2 latitude chunks x 1 time chunk) = 6 chunks per manifest file.
!!! note
Python dictionaries preserve insertion order, so the first condition encountered takes priority.
### Splitting behaviour
By default, Icechunk minimizes the number of chunk refs that are written in a single commit.
Consider this simple example: a 1D array with split size 1 along axis 0.
```python exec="on" session="perf" source="material-block"
import random
import icechunk as ic
from icechunk import (
ManifestSplitCondition,
ManifestSplitDimCondition,
ManifestSplittingConfig,
)
split_config = ManifestSplittingConfig.from_dict(
{ManifestSplitCondition.AnyArray(): {ManifestSplitDimCondition.Any(): 1}}
)
repo_config = ic.RepositoryConfig(manifest=ic.ManifestConfig(splitting=split_config))
storage = ic.local_filesystem_storage(
f"/tmp/splitting-test/{random.randint(100, 20000)}"
)
# Note any config passed to Repository.create is persisted to disk.
repo = ic.Repository.create(storage, config=repo_config)
```
Create an array
```python exec="on" session="perf" source="material-block"
import zarr
session = repo.writable_session("main")
root = zarr.group(session.store)
name = "array"
array = root.create_array(name=name, shape=(10,), dtype=int, chunks=(1,))
```
Now let's write 5 chunk references
```python exec="on" session="perf" source="material-block"
import numpy as np
array[:5] = np.arange(10, 15)
print(session.status())
```
And commit
```python exec="on" session="perf" source="material-block"
snap = session.commit("Add 5 chunks")
```
Use [`repo.lookup_snapshot`](./reference.md#icechunk.Repository.lookup_snapshot) to examine the manifests associated with a Snapshot
```python exec="on" session="perf" source="material-block"
print(repo.lookup_snapshot(snap).manifests)
```
Let's open the Repository again with a different splitting config --- where 5 chunk references are in a single manifest.
```python exec="on" session="perf" source="material-block"
split_config = ManifestSplittingConfig.from_dict(
{ManifestSplitCondition.AnyArray(): {ManifestSplitDimCondition.Any(): 5}}
)
repo_config = ic.RepositoryConfig(manifest=ic.ManifestConfig(splitting=split_config))
new_repo = ic.Repository.open(storage, config=repo_config)
print(new_repo.config.manifest)
```
Now let's append data.
```python exec="on" session="perf" source="material-block"
session = new_repo.writable_session("main")
array = zarr.open_array(session.store, path=name, mode="a")
array[6:9] = [1, 2, 3]
print(session.status())
```
```python exec="on" session="perf" source="material-block"
snap2 = session.commit("appended data")
print(repo.lookup_snapshot(snap2).manifests)
```
Look carefully: only one new manifest, containing the 3 new chunk refs, has been written. Why?
Icechunk minimizes how many chunk references are rewritten at each commit (to save time and memory). The previous splitting configuration (split size of 1) results in manifests that are _compatible_ with the current configuration (split size of 5) because the bounding box of every existing manifest `slice(0, 1)`, `slice(1, 2)`, etc. is fully contained in the bounding boxes implied by the new configuration `[slice(0, 5), slice(5, 10)]`.
Now for a more complex example, let's rewrite the references in `slice(3, 7)`, i.e. spanning the break in manifests
```python exec="on" session="perf" source="material-block"
session = new_repo.writable_session("main")
array = zarr.open_array(session.store, path=name, mode="a")
array[3:7] = [1, 2, 3, 4]
print(session.status())
```
```python exec="on" session="perf" source="material-block"
snap3 = session.commit("rewrite [3,7)")
print(repo.lookup_snapshot(snap3).manifests)
```
This ends up rewriting all refs to two new manifests.
### Rewriting manifests
Remember, by default Icechunk only writes one manifest per array regardless of size.
For large enough arrays, you might see a relative performance hit while committing a new update (e.g. an append),
or when reading from a Repository object that was just created.
At that point, you will want to experiment with different manifest split configurations.
To force Icechunk to rewrite all chunk refs to the current splitting configuration, use [`rewrite_manifests`](./reference.md#icechunk.Repository.rewrite_manifests).
To illustrate, we will use a split size of 3 --- for the current example this will consolidate to two manifests.
```python exec="on" session="perf" source="material-block"
split_config = ManifestSplittingConfig.from_dict(
{ManifestSplitCondition.AnyArray(): {ManifestSplitDimCondition.Any(): 3}}
)
repo_config = ic.RepositoryConfig(
manifest=ic.ManifestConfig(splitting=split_config),
)
new_repo = ic.Repository.open(storage, config=repo_config)
snap4 = new_repo.rewrite_manifests(
    "rewrite_manifests with new config", branch="main"
)
```
`rewrite_manifests` will create a new commit on `branch` with the provided `message`.
```python exec="on" session="perf" source="material-block"
print(repo.lookup_snapshot(snap4).manifests)
```
The splitting configuration is saved in the snapshot metadata.
```python exec="on" session="perf" source="material-block"
print(repo.lookup_snapshot(snap4).metadata)
```
!!! important
Once you find a splitting configuration you like, remember to persist it on-disk using `repo.save_config`.
### Example workflow
Here is an example workflow for experimenting with splitting
```python exec="on" session="perf" source="material-block"
# first define a new config
split_config = ManifestSplittingConfig.from_dict(
{ManifestSplitCondition.AnyArray(): {ManifestSplitDimCondition.Any(): 5}}
)
repo_config = ic.RepositoryConfig(
manifest=ic.ManifestConfig(splitting=split_config),
)
# open the repo with the new config.
repo = ic.Repository.open(storage, config=repo_config)
```
We will rewrite the manifests on a different branch
```python exec="on" session="perf" source="material-block"
repo.create_branch("split-experiment-1", repo.lookup_branch("main"))
snap = repo.rewrite_manifests(
    "rewrite_manifests with new config", branch="split-experiment-1"
)
print(repo.lookup_snapshot(snap).manifests)
```
Now benchmark reads on `main` vs `split-experiment-1`
```python exec="on" session="perf" source="material-block"
store = repo.readonly_session("main").store
store_split = repo.readonly_session("split-experiment-1").store
# ...
```
Assume we decided the configuration on `split-experiment-1` was good.
First we persist that configuration to disk
```python exec="on" session="perf" source="material-block"
repo.save_config()
```
Now point the `main` branch to the commit with rewritten manifests
```python exec="on" session="perf" source="material-block"
repo.reset_branch("main", repo.lookup_branch("split-experiment-1"))
```
Notice that the persisted config is restored when opening a Repository
```python exec="on" session="perf" source="material-block"
print(ic.Repository.open(storage).config.manifest)
```
## Preloading manifests
While [manifest splitting](./performance.md#splitting-manifests) is a great way to control the size of manifests, it can also be useful to configure how manifests are loaded. In Icechunk, manifests are loaded lazily by default: when you read a chunk, Icechunk only loads the manifest for that chunk when it is needed to fetch the chunk data. While this is good for memory performance, it can increase the latency to read the first elements of an array. Once a manifest has been loaded, it will usually remain in memory for future chunks in the same manifest.
To address this, Icechunk provides a way to preload manifests at the time of opening a Session, loading all manifests that match specific conditions into the cache. This means when it is time to read a chunk, the manifest will already be in the cache and the chunk can be found without having to first load the manifest from storage.
### Configuring preloading
To configure manifest preloading, you can use the `ManifestPreloadConfig` class to specify the `ic.RepositoryConfig.manifest.preload` field.
```python exec="on" session="perf" source="material-block"
preload_config = ic.ManifestPreloadConfig(
preload_if=ic.ManifestPreloadCondition.name_matches("^x$"), # preload all manifests with the name "x"
)
repo_config = ic.RepositoryConfig(
manifest=ic.ManifestConfig(
preload=preload_config,
)
)
```
Then pass the `config` to `Repository.open` or `Repository.create`
```python
repo = ic.Repository.open(..., config=repo_config)
```
This example will preload all manifests for arrays named exactly "x" when opening a Session. While this is a simple example, you can use the `ManifestPreloadCondition` class to create more complex preload conditions using the following options:
1. `ManifestPreloadCondition.name_matches` takes a regular expression used to match an array's name;
2. `ManifestPreloadCondition.path_matches` takes a regular expression used to match an array's path;
3. `ManifestPreloadCondition.and_conditions` to combine (1), (2), and (4) together; and
4. `ManifestPreloadCondition.or_conditions` to combine (1), (2), and (3) together.
`And` and `Or` may be used to combine multiple path and/or name matches. For example,
```python exec="on" session="perf" source="material-block"
preload_config = ic.ManifestPreloadConfig(
preload_if=ic.ManifestPreloadCondition.or_conditions(
[
ic.ManifestPreloadCondition.name_matches("^x$"),
ic.ManifestPreloadCondition.path_matches("y"),
]
),
)
```
This will preload all manifests that match the array name "x" or where the array path contains "y".
!!! important
`name_matches` and `path_matches` are regular expressions, so if you only want to match the exact string, you need to use `^x$` instead of just "x". We plan to add more explicit string matching options in the future, see [this issue](https://github.com/earth-mover/icechunk/issues/996).
Preloading can also be limited to manifests that fall within a given size range. This can be useful to limit the amount of memory used by the preload cache when some manifests may be very large, and is configured with the `ManifestPreloadCondition.num_refs` condition.
```python exec="on" session="perf" source="material-block"
preload_config = ic.ManifestPreloadConfig(
preload_if=ic.ManifestPreloadCondition.and_conditions(
[
ic.ManifestPreloadCondition.name_matches("x"),
ic.ManifestPreloadCondition.num_refs(1000, 10000),
]
),
)
```
This will preload all manifests that match the array name "x" and have between 1000 and 10000 chunk references.
!!! note
Like with `ManifestSplitCondition`, you can use `&` and `|` operators to combine conditions instead of `and_conditions` and `or_conditions`:
```python
preload_config = ic.ManifestPreloadConfig(
preload_if=ic.ManifestPreloadCondition.name_matches("x") & ic.ManifestPreloadCondition.num_refs(1000, 10000),
)
```
Lastly, the number of total manifests that can be preloaded can be limited using the `ic.RepositoryConfig.manifest.preload.max_total_refs` field.
```python exec="on" session="perf" source="material-block"
preload_config = ic.ManifestPreloadConfig(
preload_if=ic.ManifestPreloadCondition.name_matches("x"),
max_total_refs=10000,
)
```
This will preload all manifests that match the array name "x", as long as the total number of chunk references preloaded so far is less than 10000.
!!! important
Once you find a preload configuration you like, remember to persist it on-disk using `repo.save_config`. The saved config can be overridden at runtime for different applications.
#### Default preload configuration
Icechunk's default `preload_if` configuration preloads all manifests whose array names match [cf-xarray's coordinate axis regex](https://github.com/xarray-contrib/cf-xarray/blob/1591ff5ea7664a6bdef24055ef75e242cd5bfc8b/cf_xarray/criteria.py#L149-L160).
The default `max_total_refs` is `10_000`.
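To override these defaults for a particular application, pass a runtime config when opening the repository (a sketch reusing the `preload_config` defined above):
```python
repo = ic.Repository.open(
    storage,
    config=ic.RepositoryConfig(
        manifest=ic.ManifestConfig(preload=preload_config),
    ),
)
```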
# Quickstart
Icechunk is designed to be mostly in the background.
As a Python user, you'll mostly be interacting with Zarr.
If you're not familiar with Zarr, you may want to start with the [Zarr Tutorial](https://zarr.readthedocs.io/en/latest/tutorial.html).
## Installation
Icechunk can be installed using pip or conda:
=== "pip"
```bash
python -m pip install icechunk
```
=== "conda"
```bash
conda install -c conda-forge icechunk
```
!!! note
Icechunk is currently designed to support the [Zarr V3 Specification](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html).
Using it today requires installing Zarr Python 3.
## Create a new Icechunk repository
To get started, let's create a new Icechunk repository.
We recommend creating your repo on a cloud storage platform to get the most out of Icechunk's cloud-native design.
However, you can also create a repo on your local filesystem.
=== "S3 Storage"
```python
import icechunk
storage = icechunk.s3_storage(bucket="my-bucket", prefix="my-prefix", from_env=True)
repo = icechunk.Repository.create(storage)
```
=== "Google Cloud Storage"
```python
import icechunk
storage = icechunk.gcs_storage(bucket="my-bucket", prefix="my-prefix", from_env=True)
repo = icechunk.Repository.create(storage)
```
=== "Azure Blob Storage"
```python
import icechunk
storage = icechunk.azure_storage(container="my-container", prefix="my-prefix", from_env=True)
repo = icechunk.Repository.create(storage)
```
=== "Local Storage"
```python exec="on" session="quickstart" source="material-block"
import icechunk
import tempfile
storage = icechunk.local_filesystem_storage(tempfile.TemporaryDirectory().name)
repo = icechunk.Repository.create(storage)
```
## Accessing the Icechunk store
Once the repository is created, we can use `Session`s to read and write data. Since there is no data in the repository yet,
let's create a writable session on the default `main` branch.
```python exec="on" session="quickstart" source="material-block"
session = repo.writable_session("main")
```
Now that we have a session, we can access the `IcechunkStore` from it to interact with the underlying data using `zarr`:
```python exec="on" session="quickstart" source="material-block"
store = session.store # A zarr store
```
## Write some data and commit
We can now use our Icechunk `store` with Zarr.
Let's first create a group and an array within it.
```python exec="on" session="quickstart" source="material-block"
import zarr
group = zarr.group(store)
array = group.create("my_array", shape=10, dtype='int32', chunks=(5,))
```
Now let's write some data
```python exec="on" session="quickstart" source="material-block"
array[:] = 1
```
Now let's commit our update using the session
```python exec="on" session="quickstart" source="material-block" result="code"
snapshot_id_1 = session.commit("first commit")
print(snapshot_id_1)
```
🎉 Congratulations! You just made your first Icechunk snapshot.
!!! note
Once a writable `Session` has been successfully committed to, it becomes read only to ensure that all writing is done explicitly.
If you need to write more data, you have to start a new session.
## Make a second commit
At this point, we have already committed using our session, so we need to get a new session and store to make more changes.
Here we will use an alternative syntax, using the `transaction` context manager.
In this update, we put some new data into our array, overwriting the first five elements.
```python exec="on" session="quickstart" source="material-block"
with repo.transaction("main", message="overwrite some values") as store:
group = zarr.open_group(store)
array = group["my_array"]
array[:5] = 2
```
The transaction is automatically committed when the context exits.
## Explore version history
We can see the full version history of our repo:
```python exec="on" session="quickstart" source="material-block" result="code"
hist = repo.ancestry(branch="main")
for ancestor in hist:
print(ancestor.id, ancestor.message, ancestor.written_at)
```
...and we can go back in time to the earlier version.
```python exec="on" session="quickstart" source="material-block"
# latest version
assert array[0] == 2
# check out earlier snapshot
earlier_session = repo.readonly_session(snapshot_id=snapshot_id_1)
store = earlier_session.store
# get the array
group = zarr.open_group(store, mode="r")
array = group["my_array"]
# verify data matches first version
assert array[0] == 1
```
---
That's it! You now know how to use Icechunk!
For an overview of all of the important operations, check out the [How-to guide](./howto.md).
::: icechunk
::: icechunk.xarray
::: icechunk.dask
---
title: Sample Datasets
---
# Sample Datasets
## Native Datasets
### Weatherbench2 ERA5
A subset of the Weatherbench2 copy of the ERA5 reanalysis dataset.
=== "AWS"
```python
import icechunk as ic
import xarray as xr
storage = ic.s3_storage(
bucket="icechunk-public-data",
prefix="v1/era5_weatherbench2",
region="us-east-1",
anonymous=True,
)
repo = ic.Repository.open(storage=storage)
session = repo.readonly_session("main")
ds = xr.open_dataset(
session.store, group="1x721x1440", engine="zarr", chunks=None, consolidated=False
)
```
=== "Google Cloud"
```python
import icechunk as ic
import xarray as xr
storage = ic.gcs_storage(
bucket="icechunk-public-data-gcs",
prefix="v01/era5_weatherbench2",
)
repo = ic.Repository.open(storage=storage)
session = repo.readonly_session("main")
ds = xr.open_dataset(
session.store, group="1x721x1440", engine="zarr", chunks=None, consolidated=False
)
```
=== "Cloudflare R2"
```python
import icechunk as ic
import xarray as xr
storage = ic.r2_storage(
prefix="v1/era5_weatherbench2",
endpoint_url="https://data.icechunk.cloud",
anonymous=True,
)
repo = ic.Repository.open(storage=storage)
session = repo.readonly_session("main")
ds = xr.open_dataset(
session.store, group="1x721x1440", engine="zarr", chunks=None, consolidated=False
)
```
### GLAD Land Cover Land Use
A copy of the GLAD Land Cover Land Use dataset distributed under a [Creative Commons Attribution 4.0 International License](http://creativecommons.org/licenses/by/4.0/).
Source: https://storage.googleapis.com/earthenginepartners-hansen/GLCLU2000-2020/v2/download.html
=== "AWS"
```python
import icechunk as ic
import xarray as xr
storage = ic.s3_storage(
bucket="icechunk-public-data",
prefix=f"v1/glad",
region="us-east-1",
anonymous=True,
)
repo = ic.Repository.open(storage=storage)
session = repo.readonly_session("main")
ds = xr.open_dataset(
session.store, chunks=None, consolidated=False, engine="zarr"
)
```
---
title: Specification
---
# Icechunk Specification
!!! Note
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC 2119](https://www.rfc-editor.org/rfc/rfc2119.html).
## Introduction
The Icechunk specification is a storage specification for [Zarr](https://zarr-specs.readthedocs.io/en/latest/specs.html) data.
Icechunk is inspired by Apache Iceberg and borrows many concepts and ideas from the [Iceberg Spec](https://iceberg.apache.org/spec/#version-2-row-level-deletes).
This specification describes a single Icechunk **repository**.
A repository is defined as a Zarr store containing one or more Arrays and Groups.
The most common scenario is for a repository to contain a single Zarr group with multiple arrays, each corresponding to different physical variables but sharing common spatiotemporal coordinates.
However, formally a repository can be any valid Zarr hierarchy, from a single Array to a deeply nested structure of Groups and Arrays.
Users of Icechunk should aim to scope their repository only to related arrays and groups that require consistent transactional updates.
Icechunk defines a series of interconnected metadata and data files that together comprise the format.
All the data and metadata for a repository are stored in a directory in object storage or file storage.
## Goals
The goals of the specification are as follows:
1. **Object storage** - the format is designed around the consistency features and performance characteristics available in modern cloud object storage. No external database or catalog is required.
1. **Serializable isolation** - Reads will be isolated from concurrent writes and always use a committed snapshot of a repository. Writes to repositories will be committed atomically and will not be partially visible. Readers will not acquire locks.
1. **Time travel** - Previous snapshots of a repository remain accessible after new ones have been written.
1. **Chunk sharding and references** - Chunk storage is decoupled from specific file names. Multiple chunks can be packed into a single object (sharding). Zarr-compatible chunks within other file formats (e.g. HDF5, NetCDF) can be referenced.
1. **Schema Evolution** - Arrays and Groups can be added and removed from the hierarchy with minimal overhead.
### Non Goals
1. **Low Latency** - Icechunk is designed to support analytical workloads for large repositories. We accept that the extra layers of metadata files and indirection will introduce additional cold-start latency compared to regular Zarr.
1. **No Catalog** - The spec does not extend beyond a single repository or provide a way to organize multiple repositories into a hierarchy.
1. **Access Controls** - Access control is the responsibility of the storage medium.
The spec is not designed to enable fine-grained access restrictions (e.g. only read specific arrays) within a single repository.
### Storage Operations
Icechunk requires that the storage system support the following operations:
- **In-place write** - Strong read-after-write and list-after-write consistency is expected. Files are not moved or altered once they are written.
- **Write-if-not-exists** - For creating new references.
- **Conditional update** - For the commit process to be safe and consistent, the storage system must be able to atomically update a file only if the current version is known to the writer.
- **Seekable reads** - Chunk file formats may require seek support (e.g. shards).
- **Deletes** - Delete files that are no longer used (via a garbage-collection operation).
- **Sorted List** - The storage system must allow the listing of directories / prefixes in lexicographical order.
These requirements are compatible with object stores, like S3, as well as with filesystems.
The storage system is not required to support random-access writes. Once written, most files are immutable until they are deleted. The exceptions to this rule are:
- the repository configuration file doesn't track history; updates are done atomically but in place,
- branch reference files are also atomically updated in place,
- snapshot files can be updated in place by the expiration process (an administrative operation).
## Specification
### Overview
Icechunk uses a series of linked metadata files to describe the state of the repository.
- The **Snapshot file** records all of the different arrays and groups in a specific snapshot of the repository, plus their metadata. Every new commit creates a new snapshot file. The snapshot file contains pointers to one or more chunk manifest files.
- **Chunk manifests** store references to individual chunks. A single manifest may store references for multiple arrays or a subset of all the references for a single array.
- **Chunk files** store the actual compressed chunk data, potentially containing data for multiple chunks in a single file.
- **Transaction log files**, an overview of the operations executed during a session, used for rebase and diffs.
- **Reference files**, also called refs, track the state of branches and tags, containing a lightweight pointer to a snapshot file. Transactions on a branch are committed by atomically updating the branch reference file.
- **Tag tombstones**, tags are immutable in Icechunk but can be deleted. When a tag is deleted, a tombstone file is created so that the
same tag name cannot be reused later.
- **Config file**, a yaml file with the default repository configuration.
When reading from object store, the client opens the latest branch or tag file to obtain a pointer to the relevant snapshot file.
The client then reads the snapshot file to determine the structure and hierarchy of the repository.
When fetching data from an array, the client first examines the chunk manifest file[s] for that array and finally fetches the chunks referenced therein.
When writing a new repository snapshot, the client first writes a new set of chunks and chunk manifests, and then generates a new snapshot file.
Finally, to commit the transaction, it updates the branch reference file using an atomic conditional update operation.
This operation may fail if a different client has already committed the next snapshot.
In this case, the client may attempt to resolve the conflicts and retry the commit.
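The following self-contained sketch simulates this optimistic commit protocol against a toy in-memory "object store"; it is illustrative only and is not Icechunk's implementation:
```python
class FakeObjectStore:
    """A toy store supporting conditional update keyed on a version tag."""

    def __init__(self):
        self.objects = {}  # key -> (version, value)

    def put_if_matches(self, key, value, expected_version):
        version, _ = self.objects.get(key, (0, None))
        if version != expected_version:
            return False  # another writer updated the file first
        self.objects[key] = (version + 1, value)
        return True


store = FakeObjectStore()
ref = "refs/branch.main/ref.json"
store.objects[ref] = (1, {"snapshot": "VY76P925PRY57WFEK410"})

# two sessions both read version 1, write their snapshots, then race to commit
assert store.put_if_matches(ref, {"snapshot": "AAAAAAAAAAAAAAAAAAAA"}, expected_version=1)
assert not store.put_if_matches(ref, {"snapshot": "BBBBBBBBBBBBBBBBBBBB"}, expected_version=1)
# the losing session may rebase onto the winning snapshot and retry its commit
```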
```mermaid
flowchart TD
subgraph metadata[Metadata]
subgraph reference_files[Reference Files]
old_branch[Main Branch File 001]
branch[Main Branch File 002]
end
subgraph snapshots[Snapshots]
snapshot1[Snapshot File 1]
snapshot2[Snapshot File 2]
end
subgraph manifests[Manifests]
manifestA[Chunk Manifest A]
manifestB[Chunk Manifest B]
end
end
subgraph data
chunk1[Chunk File 1]
chunk2[Chunk File 2]
chunk3[Chunk File 3]
chunk4[Chunk File 4]
end
branch -- snapshot ID --> snapshot2
snapshot1 --> manifestA
snapshot2 -->manifestA
snapshot2 -->manifestB
manifestA --> chunk1
manifestA --> chunk2
manifestB --> chunk3
manifestB --> chunk4
```
### File Layout
All data and metadata files are stored within a root directory (typically a prefix within an object store) using the following directory structure.
- `$ROOT` base URI (s3, gcs, local directory, etc.)
- `$ROOT/config.yaml` optional persistent default configuration for the repository
- `$ROOT/refs/` reference files
- `$ROOT/snapshots/` snapshot files
- `$ROOT/manifests/` chunk manifests
- `$ROOT/transactions/` transaction log files
- `$ROOT/chunks/` chunks
### File Formats
#### Reference Files
Similar to Git, Icechunk supports the concept of _branches_ and _tags_.
These references point to a specific snapshot of the repository.
- **Branches** are _mutable_ references to a snapshot.
Repositories may have one or more branches.
The default branch name is `main`.
Repositories must always have a `main` branch, which is used to detect the existence of a valid repository in a given path.
After creation, branches may be updated to point to a different snapshot.
- **Tags** are _immutable_ references to a snapshot.
A repository may contain zero or more tags.
After creation, tags may never be updated, unlike in Git.
References are very important in the Icechunk design.
Creating or updating references is the point at which the consistency and atomicity of Icechunk transactions are enforced.
Different client sessions may simultaneously create two inconsistent snapshots; however, only one session may successfully update a reference to point it to its snapshot.
References (both branches and tags) are stored as JSON files whose content is a JSON object with:
- keys: a single key `"snapshot"`,
- value: a string representation of the snapshot id, using [Base 32 Crockford](https://www.crockford.com/base32.html) encoding. The snapshot id is 12 random bytes, so the encoded string has 20 characters.
Here is an example of a JSON file corresponding to a tag or branch:
```json
{"snapshot":"VY76P925PRY57WFEK410"}
```
##### Creating and Updating Branches
The process of creating and updating branches is designed to use the limited consistency guarantees offered by object storage to ensure transactional consistency.
When a client checks out a branch, it obtains a specific snapshot ID and uses this snapshot as the basis for any changes it creates during its session.
The client creates a new snapshot and then updates the branch reference to point to the new snapshot (a "commit").
However, when updating the branch reference, the client must detect whether a _different session_ has updated the branch reference in the interim, possibly retrying or failing the commit if so.
This is an "optimistic concurrency" strategy; the resolution mechanism can be expensive, but conflicts are expected to be infrequent.
All major object stores support a "conditional update" operation.
In other words, object stores can guard against the race condition which occurs when two sessions attempt to update the same file at the same time. Only one of those will succeed.
This mechanism is used by Icechunk on commits.
When a client checks out a branch, it keeps track of the "version" of the reference file for the branch.
When it tries to commit, it attempts to conditionally update this file in an atomic "all or nothing" operation.
If this succeeds, the commit is successful.
If this fails (because another client updated that file since the session started), the commit fails.
At this point, the client may choose to retry its commit (possibly re-reading the updated data) and then try the conditional update again.
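As an illustration of this pattern (not the actual Icechunk implementation), here is a minimal sketch using boto3, assuming an object store and SDK version that support `If-Match` on `PutObject`; the bucket and key names are hypothetical:
```python
import json

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def try_commit(bucket: str, ref_key: str, snapshot_id: str, expected_etag: str) -> bool:
    """Attempt an atomic branch update; False means another session committed first."""
    body = json.dumps({"snapshot": snapshot_id}).encode()
    try:
        # The PUT succeeds only if the ref file still has the ETag we saw at checkout
        s3.put_object(Bucket=bucket, Key=ref_key, Body=body, IfMatch=expected_etag)
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "PreconditionFailed":
            return False  # lost the race: reconcile and retry
        raise
```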
Branch references are stored in the `refs/` directory within a subdirectory named after the branch, prefixed with the string `branch.`: `refs/branch.$BRANCH_NAME/ref.json`.
Branch names may not contain the `/` character.
Branches are deleted by simply removing their ref file.
##### Tags
Tags are immutable. Their files follow the pattern `refs/tag.$TAG_NAME/ref.json`.
Tag names may not contain the `/` character.
When creating a new tag, the client attempts to create the tag file using a "create if not exists" operation.
If successful, the tag is created.
If not, that means another client has already created that tag.
Tags can also be deleted once created, but we cannot allow a delete followed by a creation, since that would result in an observable mutation of the tag. To solve this issue, we don't allow recreating tags that were deleted.
When a tag is deleted, its reference file is not deleted; instead, a new tombstone file is created at the path:
`refs/tag.$TAG_NAME/ref.json.deleted`.
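The same conditional primitives sketch tag creation ("create if not exists"); again this is illustrative only, using boto3's `IfNoneMatch="*"` conditional create:
```python
import json

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def try_create_tag(bucket: str, tag: str, snapshot_id: str) -> bool:
    """Create a tag ref only if no object already exists at its key."""
    body = json.dumps({"snapshot": snapshot_id}).encode()
    try:
        s3.put_object(
            Bucket=bucket,
            Key=f"refs/tag.{tag}/ref.json",
            Body=body,
            IfNoneMatch="*",  # fail if the tag file already exists
        )
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "PreconditionFailed":
            return False  # tag already exists
        raise
```
A real implementation would also check for a tombstone at `refs/tag.$TAG_NAME/ref.json.deleted` first, so deleted tags are never recreated.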
#### Snapshot Files
The snapshot file fully describes the schema of the repository, including all arrays and groups.
The snapshot file is encoded using [flatbuffers](https://github.com/google/flatbuffers). The IDL for the
on-disk format can be found in [the repository file](https://github.com/earth-mover/icechunk/tree/main/icechunk/flatbuffers/snapshot.fbs).
The most important parts of a snapshot file are:
- An id: 12 random bytes, also encoded in the file name.
- The id of its parent snapshot. All snapshots but the first one in the repository must have a parent.
- The commit time (`flushed_at`), message string (`message`), and metadata map (`metadata`).
- A list of `NodeSnapshot`, one item for each group or array in the repository snapshot.
- A list of `ManifestFileInfo`.
`NodeSnapshot` objects can also be found in the same flatbuffers file. They contain:
- A node id (8 random bytes).
- The node path within the repository hierarchy, for example `foo/bar/baz`.
- `user_data`, any metadata used to create the node; this will usually be the Zarr metadata.
- A `node_data` union, that can be either an `ArrayNodeData` or a `GroupNodeData`.
`GroupNodeData` is empty, so it works as a pure marker signaling that the node is a group.
`ArrayNodeData` is a richer data structure that keeps:
- The array shape, both for the whole array and its chunks.
- The array dimension names.
- A list of `ManifestRef`.
A `ManifestRef` is a pointer to a manifest file. It includes an id that is used to determine the file path,
and, for each array dimension, the range of coordinates contained in the manifest.
Finally, a `ManifestFileInfo` is also a pointer to a manifest file, but it includes information about all the chunks
held in the manifest.
#### Chunk Manifest Files
A chunk manifest file stores chunk references.
Chunk references from multiple arrays can be stored in the same chunk manifest.
The chunks from a single array can also be spread across multiple manifests.
Manifest files are encoded using flatbuffers. The IDL for the
on-disk format can be found in [the repository file](https://github.com/earth-mover/icechunk/tree/main/icechunk/flatbuffers/manifest.fbs).
A manifest file has:
- An id (12 random bytes) that is also encoded in the file name.
- A list of `ArrayManifest` sorted by node id.
Each `ArrayManifest` contains chunk references for a given array. It contains the `node_id`
of the array and a list of `ChunkRef` sorted by the chunk coordinate.
`ChunkRef` is a complex data structure because chunk references in Icechunk can have three different types:
- Native, pointing to a chunk object within the Icechunk repository.
- Inline, an optimization for very small chunks that can be embedded directly in the manifest. Mostly used for coordinate arrays.
- Virtual, pointing to a region of a file outside of the Icechunk repository, for example,
a chunk that is inside a NetCDF file in an object store.
These three types of chunk references are encoded in the same flatbuffers table, using optional fields.
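Conceptually, the union can be modeled like this (an illustrative Python sketch, not the on-disk flatbuffers layout; field names are approximate):
```python
from dataclasses import dataclass

@dataclass
class NativeChunkRef:
    """Points to a chunk object stored inside the Icechunk repository."""
    chunk_id: bytes  # determines the chunk object's path under $ROOT/chunks/
    offset: int
    length: int

@dataclass
class InlineChunkRef:
    """A very small chunk embedded directly in the manifest."""
    data: bytes

@dataclass
class VirtualChunkRef:
    """Points to a byte range in a file outside the repository."""
    location: str  # e.g. "s3://some-bucket/some-file.nc"
    offset: int
    length: int

ChunkRef = NativeChunkRef | InlineChunkRef | VirtualChunkRef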
#### Chunk Files
Chunk files contain the compressed binary chunks of a Zarr array.
Icechunk permits quite a bit of flexibility about how chunks are stored.
Chunk files can be:
- One chunk per chunk file (i.e. standard Zarr)
- Multiple contiguous chunks from the same array in a single chunk file (similar to Zarr V3 shards)
- Chunks from multiple different arrays in the same file
- Other file types (e.g. NetCDF, HDF5) which contain Zarr-compatible chunks
Applications may choose to arrange chunks within files in different ways to optimize I/O patterns.
#### Transaction logs
Transaction logs keep track of the operations done in a commit. They are not used to read objects
from the repo, but they are useful for features such as commit diff and conflict resolution.
Transaction logs are an optimization that provides fast conflict resolution and commit diffs; they are
not strictly required to implement the core Icechunk operations.
Transaction log files are encoded using flatbuffers. The IDL for the
on-disk format can be found in [the repository file](https://github.com/earth-mover/icechunk/tree/main/icechunk/flatbuffers/transaction_log.fbs).
The transaction log file maintains information about the id of modified objects:
- `new_groups`: list of node ids.
- `new_arrays`: list of node ids.
- `deleted_groups`: list of node ids.
- `deleted_arrays`: list of node ids.
- `updated_groups`: list of node ids.
- `updated_arrays`: list of node ids.
- `updated_chunks`: list of node ids and chunk indices.
## Algorithms
### Initialize New Repository
A new repository is initialized by creating a new empty snapshot file and then creating the reference for branch `main`.
The first snapshot has a well-known id that encodes to the file name `1CECHNKREP0F1RSTCMT0`. All object ids are
encoded in paths using Crockford base 32.
If another client attempts to initialize a repository in the same location, only one can succeed.
### Read from Repository
#### From Snapshot ID
If the specific snapshot ID is known, a client can open it directly in read-only mode.
1. Use the specified snapshot ID to fetch the snapshot file.
1. Inspect the snapshot to find the relevant manifest or manifests.
1. Fetch the relevant manifests and the desired chunks they point to.
#### From Branch
Usually, a client will want to read from the latest snapshot on a branch (e.g. `main`); a sketch of this lookup follows the steps below.
1. Resolve the object store prefix `refs/branch.$BRANCH_NAME/ref.json` to obtain the latest ref file.
1. Parse the branch file JSON contents to obtain the snapshot ID.
1. Use the snapshot ID to fetch the snapshot file.
1. Fetch the relevant manifests and the desired chunks they point to.
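A minimal sketch of the ref lookup, assuming `fsspec` can reach the repository root (illustrative only; use the Icechunk library in practice):
```python
import json

import fsspec

def resolve_branch(root: str, branch: str = "main") -> str:
    """Return the snapshot ID that a branch currently points to."""
    with fsspec.open(f"{root}/refs/branch.{branch}/ref.json", "r") as f:
        return json.load(f)["snapshot"]

# e.g. resolve_branch("s3://my-bucket/my-repo")  # -> "VY76P925PRY57WFEK410"
```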
#### From Tag
1. Read the tag file found at `refs/tag.$TAG_NAME/ref.json` to obtain the snapshot ID.
1. Use the snapshot ID to fetch the snapshot file.
1. Fetch the relevant manifests and the desired chunks they point to.
### Write New Snapshot
1. Open a repository at a specific branch as described above, keeping track of the sequence number and branch name in the session context.
1. [optional] Write new chunk files.
1. [optional] Write new chunk manifests.
1. Write a new transaction log file summarizing all changes in the session.
1. Write a new snapshot file with the new repository hierarchy and manifest links.
1. Perform a conditional update to write the new value of the branch reference file.
1. If successful, the commit succeeded and the branch is updated.
1. If unsuccessful, attempt to reconcile and retry the commit.
### Create New Tag
A tag can be created from any snapshot.
1. Open the repository at a specific snapshot.
1. Attempt to create the tag file.
a. If successful, the tag was created.
b. If unsuccessful, the tag already exists.
# Storage
Icechunk can be configured to work with both object storage and filesystem backends. The storage configuration defines the location of an Icechunk store, along with any options or information needed to access data from a given storage type.
### S3 Storage
When using Icechunk with S3-compatible storage systems, credentials must be provided to allow access to the data on the given endpoint. Icechunk allows for creating the storage config for S3 in several ways:
=== "From environment"
With this option, the credentials for connecting to S3 are detected automatically from your environment.
This is usually the best choice if you are connecting from within an AWS environment (e.g. from EC2). [See the API](./reference.md#icechunk.s3_storage)
```python
icechunk.s3_storage(
bucket="icechunk-test",
prefix="quickstart-demo-1",
from_env=True
)
```
=== "Provide credentials"
With this option, you provide your credentials and other details explicitly. [See the API](./reference.md#icechunk.s3_storage)
```python
icechunk.s3_storage(
bucket="icechunk-test",
prefix="quickstart-demo-1",
region='us-east-1',
access_key_id='my-access-key',
secret_access_key='my-secret-key',
# session token is optional
session_token='my-token',
endpoint_url=None, # if using a custom endpoint
allow_http=False, # allow http connections (default is False)
)
```
=== "Anonymous"
With this option, you connect to S3 anonymously (without credentials).
This is suitable for public data. [See the API](./reference.md#icechunk.s3_storage)
```python
icechunk.s3_storage(
bucket="icechunk-test",
prefix="quickstart-demo-1",
    region='us-east-1',
anonymous=True,
)
```
=== "Refreshable Credentials"
With this option, you provide a callback function that will be called to obtain S3 credentials when needed. This is useful for workloads that depend on retrieving short-lived credentials from AWS or similar authority, allowing for credentials to be refreshed as needed without interrupting any workflows. [See the API](./reference.md#icechunk.s3_storage)
```python
from datetime import UTC, datetime, timedelta

import icechunk

def get_credentials() -> icechunk.S3StaticCredentials:
    # In practice, you would use a function that actually fetches the credentials and returns them
    # along with an optional expiration time which will trigger this callback to run again
    return icechunk.S3StaticCredentials(
        access_key_id="xyz",
        secret_access_key="abc",
        expires_after=datetime.now(UTC) + timedelta(days=1),
    )

icechunk.s3_storage(
    bucket="icechunk-test",
    prefix="quickstart-demo-1",
    region='us-east-1',
    get_credentials=get_credentials,
)
```
#### Tigris
[Tigris](https://www.tigrisdata.com/) is available as a storage backend for Icechunk. Icechunk provides a helper function specifically for [creating Tigris storage configurations](./reference.md#icechunk.tigris_storage).
```python
icechunk.tigris_storage(
bucket="icechunk-test",
prefix="quickstart-demo-1",
access_key_id='my-access-key',
secret_access_key='my-secret-key',
)
```
Even though Tigris is API-compatible with S3, this function is needed because Tigris implements a different form of consistency. If you instead use `s3_storage` with the Tigris endpoint, Icechunk won't be able to achieve all of its consistency guarantees.
#### Cloudflare R2
Icechunk can use Cloudflare R2's S3-compatible API. You will need to:
1. provide either the account ID or set the [endpoint URL](https://developers.cloudflare.com/r2/api/s3/api/) specific to your bucket: `https://<ACCOUNT_ID>.r2.cloudflarestorage.com`; and
2. [create an API token](https://developers.cloudflare.com/r2/api/s3/tokens/) to generate a secret access key and access key ID.
```python
icechunk.r2_storage(
bucket="bucket-name",
prefix="icechunk-test/quickstart-demo-1",
access_key_id='my-access-key',
secret_access_key='my-secret-key',
account_id='my-account-id',
)
```
For buckets with public access, only the public endpoint URL is needed:
```python
icechunk.r2_storage(
prefix="icechunk-test/quickstart-demo-1",
    endpoint_url="https://public-url",
)
```
#### Minio
[Minio](https://min.io/) is available as a storage backend for Icechunk. Functionally this storage backend is the same as S3 storage, but with a different endpoint.
For example, if we have a Minio server running at `http://localhost:9000` with access key `minio` and
secret key `minio123` we can create a storage configuration as follows:
```python
icechunk.s3_storage(
bucket="icechunk-test",
prefix="quickstart-demo-1",
region='us-east-1',
access_key_id='minio',
secret_access_key='minio123',
endpoint_url='http://localhost:9000',
allow_http=True,
    force_path_style=True,
)
```
A few things to note:
1. The `endpoint_url` parameter is set to the URL of the Minio server.
2. If the Minio server is running over HTTP and not HTTPS, the `allow_http` parameter must be set to `True`.
3. Even though this is running on a local server, the `region` parameter must still be set to a valid region. [By default use `us-east-1`](https://github.com/minio/minio/discussions/15063).
#### Object stores lacking conditional writes
Some object stores don't support conditional writes, so they don't work with Icechunk out of the box. This is changing rapidly now that AWS has added support for these operations, and most other major object stores have had support for a long time.
If you are trying to use one of these object stores, like [JASMIN](https://help.jasmin.ac.uk/docs/short-term-project-storage/using-the-jasmin-object-store/) for example, you'll need to accept some trade-offs.
Icechunk can work with them, but you'll lose the consistency guarantee in the presence of multiple concurrent committers: if two sessions commit at the same time, one of them could get lost. If you decide
to accept this risk, you can configure Icechunk like so:
```python
storage = icechunk.s3_storage(...)
storage_config = icechunk.StorageSettings(
    unsafe_use_conditional_update=False,
    unsafe_use_conditional_create=False,
)
config = icechunk.RepositoryConfig(
    storage=storage_config,
)
repo = icechunk.Repository.create(
    storage=storage,
    config=config,
)
```
### Google Cloud Storage
Icechunk can be used with [Google Cloud Storage](https://cloud.google.com/storage?hl=en).
=== "From environment"
With this option, the credentials for connecting to GCS are detected automatically from your environment. [See the API](./reference.md#icechunk.gcs_storage)
```python
icechunk.gcs_storage(
bucket="icechunk-test",
prefix="quickstart-demo-1",
from_env=True
)
```
=== "Service Account File"
With this option, you provide the path to a [service account file](https://cloud.google.com/iam/docs/service-account-creds#key-types). [See the API](./reference.md#icechunk.gcs_storage)
```python
icechunk.gcs_storage(
bucket="icechunk-test",
prefix="quickstart-demo-1",
service_account_file="/path/to/service-account.json"
)
```
=== "Service Account Key"
With this option, you provide the service account key as a string. [See the API](./reference.md#icechunk.gcs_storage)
```python
icechunk.gcs_storage(
bucket="icechunk-test",
prefix="quickstart-demo-1",
service_account_key={
"type": "service_account",
"project_id": "my-project",
"private_key_id": "my-private-key-id",
"private_key": "-----BEGIN PRIVATE KEY-----\nmy-private-key\n-----END PRIVATE KEY-----\n",
"client_email": "
},
)
```
=== "Application Default Credentials"
With this option, you use the [application default credentials (ADC)](https://cloud.google.com/docs/authentication/provide-credentials-adc) to authenticate with GCS. Provide the path to the credentials. [See the API](./reference.md#icechunk.gcs_storage)
```python
icechunk.gcs_storage(
bucket="icechunk-test",
prefix="quickstart-demo-1",
application_credentials="/path/to/application-credentials.json"
)
```
=== "Bearer Token"
With this option, you provide a bearer token to use for the object store. This is useful for short-lived workflows where expiration is not relevant or when the bearer token will not expire. [See the API](./reference.md#icechunk.gcs_storage)
```python
icechunk.gcs_storage(
bucket="icechunk-test",
prefix="quickstart-demo-1",
bearer_token="my-bearer-token"
)
```
=== "Refreshable Credentials"
With this option, you provide a callback function that will be called to obtain GCS credentials when needed. This is useful for workloads that depend on retrieving short-lived credentials from GCS or similar authority, allowing for credentials to be refreshed as needed without interrupting any workflows. This works at a lower level than the other methods, and accepts a bearer token and expiration time. These are the same credentials that are created for you when specifying the service account file, key, or ADC. [See the API](./reference.md#icechunk.gcs_storage)
```python
from datetime import UTC, datetime, timedelta

import icechunk

def get_credentials() -> icechunk.GcsBearerCredential:
    # In practice, you would use a function that actually fetches the credentials and returns them
    # along with an optional expiration time which will trigger this callback to run again
    return icechunk.GcsBearerCredential(bearer="my-bearer-token", expires_after=datetime.now(UTC) + timedelta(days=1))

icechunk.gcs_storage(
bucket="icechunk-test",
prefix="quickstart-demo-1",
get_credentials=get_credentials,
)
```
#### Limitations
- The consistency guarantees for GCS function differently than S3. Specifically, GCS uses the [generation](https://cloud.google.com/storage/docs/request-preconditions#compose-preconditions) instead of etag for `if-match` `put` requests. Icechunk has not wired this through yet and thus [configuration updating](https://github.com/earth-mover/icechunk/issues/533) is potentially unsafe. This is not a problem for most use cases that are not frequently updating the configuration.
- GCS does not yet support [`bearer` tokens and auth refreshing](https://github.com/earth-mover/icechunk/issues/637). This means that, currently, auth is limited to service account files.
### Azure Blob Storage
Icechunk can be used with [Azure Blob Storage](https://azure.microsoft.com/en-us/services/storage/blobs/).
=== "From environment"
With this option, the credentials for connecting to Azure Blob Storage are detected automatically from your environment. [See the API](./reference.md#icechunk.azure_storage)
```python
icechunk.azure_storage(
account="my-account-name",
container="icechunk-test",
prefix="quickstart-demo-1",
from_env=True
)
```
=== "Provide credentials"
With this option, you provide your credentials and other details explicitly. [See the API](./reference.md#icechunk.azure_storage)
```python
icechunk.azure_storage(
account_name='my-account-name',
container="icechunk-test",
prefix="quickstart-demo-1",
account_key='my-account-key',
access_token=None, # optional
sas_token=None, # optional
bearer_token=None, # optional
)
```
### Filesystem Storage
Icechunk can also be used on a [local filesystem](./reference.md#icechunk.local_filesystem_storage) by providing a path to the location of the store.
=== "Local filesystem"
```python
icechunk.local_filesystem_storage("/path/to/my/dataset")
```
#### Limitations
!!! warning
Filesystem storage is not safe in the presence of concurrent commits. If two sessions try to commit at the same time, both operations may return successfully, but one of the commits can be lost. Don't use filesystem storage in production if there is any possibility of concurrent commits.
- Icechunk currently does not work with a local filesystem storage backend on Windows. See [this issue](https://github.com/earth-mover/icechunk/issues/665) for more discussion. To work around, try using [WSL](https://learn.microsoft.com/en-us/windows/wsl/about) or a cloud storage backend.
### In Memory Storage
While it should never be used for production data, Icechunk can also be used with an in-memory storage backend. This is useful for testing and development purposes. It is volatile: when the Python process ends, all data is lost.
```python
icechunk.in_memory_storage()
```
# Transactions and Version Control
Icechunk carries over concepts from other version control software (e.g. Git) to multidimensional arrays. Doing so helps ease the burden of managing multiple versions of your data, and helps you be precise about which version of your dataset is being used for downstream purposes.
Core concepts of Icechunk's version control system are:
- A snapshot bundles together related data and metadata changes in a single "transaction".
- A branch points to the latest snapshot in a series of snapshots. Multiple branches can co-exist at a given time, and multiple users can add snapshots to a single branch. One common pattern is to use dev, stage, and prod branches to separate versions of a dataset.
- A tag is an immutable reference to a snapshot, usually used to represent an "important" version of the dataset such as a release.
Snapshots, branches, and tags all refer to specific versions of your dataset. You can time-travel/navigate back to any version of your data as referenced by a snapshot, a branch, or a tag using a snapshot ID, a branch name, or a tag name when creating a new `Session`.
## Setup
To get started, we can create a new `Repository`.
!!! note
This example uses an in-memory storage backend, but you can also use any other storage backend instead.
```python exec="on" session="version" source="material-block"
import icechunk
repo = icechunk.Repository.create(icechunk.in_memory_storage())
```
On creating a new [`Repository`](./reference.md#icechunk.Repository), it will automatically create a `main` branch with an initial snapshot. We can take a look at the ancestry of the `main` branch to confirm this.
```python
for ancestor in repo.ancestry(branch="main"):
print(ancestor)
```
!!! note
The [`ancestry`](./reference.md#icechunk.Repository.ancestry) method can be used to inspect the ancestry of any branch, snapshot, or tag.
We get back an iterator of [`SnapshotInfo`](./reference.md#icechunk.SnapshotInfo) objects, which contain information about the snapshot, including its ID, the ID of its parent snapshot, and the time it was written.
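For example, to look at the latest snapshot (assuming the `id`, `parent_id`, and `written_at` attributes described above):
```python
latest = next(iter(repo.ancestry(branch="main")))
print(latest.id, latest.parent_id, latest.written_at)
```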
## Creating a snapshot
Now that we have a `Repository` with a `main` branch, we can modify the data in the repository and create a new snapshot. First we need to create a writable Session from the `main` branch.
!!! note
Writable `Session` objects are required to create new snapshots, and can only be created from the tip of a branch. Checking out tags or other snapshots is read-only.
```python exec="on" session="version" source="material-block"
session = repo.writable_session("main")
```
We can now access the `zarr.Store` from the `Session` and create a new root group. Then we can modify the attributes of the root group and create a new snapshot.
```python exec="on" session="version" source="material-block" result="code"
import zarr
root = zarr.create_group(session.store)
root.attrs["foo"] = "bar"
print(session.commit(message="Add foo attribute to root group"))
```
Success! We've created a new snapshot with a new attribute on the root group.
Once we've committed the snapshot, the `Session` will become read-only, and we can no longer modify the data using our existing `Session`. If we want to modify the data again, we need to create a new writable `Session` from the branch. Notice that we don't have to refresh the `Repository` to get the updates from the `main` branch. Instead, the `Repository` will automatically fetch the latest snapshot from the branch when we create a new writable `Session` from it.
```python exec="on" session="version" source="material-block" result="code"
session = repo.writable_session("main")
root = zarr.open_group(session.store)
root.attrs["foo"] = "baz"
print(session.commit(message="Update foo attribute on root group"))
```
With a few snapshots committed, we can take a look at the ancestry of the `main` branch:
```python exec="on" session="version" source="material-block" result="code"
for snapshot in repo.ancestry(branch="main"):
print(snapshot)
```
Visually, this looks like the diagram below, where the arrows represent the parent-child relationship between snapshots.
```python exec="1" result="mermaid" session="version"
print("""
gitGraph
commit id: "{}" type: NORMAL
commit id: "{}" type: NORMAL
commit id: "{}" type: NORMAL
""".format(*[snap.id[:6] for snap in repo.ancestry(branch="main")]))
```
## Transaction Context Manager
To simplify the process of updating a repo, Icechunk provides a `transaction` context manager which yields an `IcechunkStore` object directly:
```python exec="on" session="version" source="material-block"
with repo.transaction("main", message="updated from context manager") as store:
root = zarr.open_group(store)
root.attrs["foo"] = "qux"
```
The context manager creates a `Session` in the background and automatically commits it (provided there are no errors within the context).
## Time Travel
Now that we've created a few snapshots, we can time-travel back to the previous snapshot using the snapshot ID.
!!! note
It's important to note that because the store is read-only, we need to pass `mode="r"` to the `zarr.open_group` function.
```python exec="on" session="version" source="material-block" result="code"
session = repo.readonly_session(snapshot_id=list(repo.ancestry(branch="main"))[1].id)
root = zarr.open_group(session.store, mode="r")
print(root.attrs["foo"])
```
## Branches
If we want to modify the data from a previous snapshot, we can create a new branch from that snapshot with [`create_branch`](./reference.md#icechunk.Repository.create_branch).
```python exec="on" session="version" source="material-block"
main_branch_snapshot_id = repo.lookup_branch("main")
repo.create_branch("dev", snapshot_id=main_branch_snapshot_id)
```
We can now create a new writable `Session` from the `dev` branch and modify the data.
```python exec="on" session="version" source="material-block" result="code"
session = repo.writable_session("dev")
root = zarr.open_group(session.store)
root.attrs["foo"] = "balogna"
print(session.commit(message="Update foo attribute on root group"))
```
We can also create a new branch from the tip of the `main` branch if we want to modify our current working branch without modifying the `main` branch.
```python exec="on" session="version" source="material-block" result="code"
repo.create_branch("feature", snapshot_id=main_branch_snapshot_id)
session = repo.writable_session("feature")
root = zarr.open_group(session.store)
root.attrs["foo"] = "cherry"
print(session.commit(message="Update foo attribute on root group"))
```
With these branches created, the hierarchy of the repository now looks like the diagram below.
```python exec="on" result="mermaid" session="version"
main_commits = [s.id[:6] for s in list(repo.ancestry(branch='main'))]
dev_commits = [s.id[:6] for s in list(repo.ancestry(branch='dev'))]
feature_commits = [s.id[:6] for s in list(repo.ancestry(branch='feature'))]
print(
"""
gitGraph
commit id: "{}" type: NORMAL
commit id: "{}" type: NORMAL
branch dev
checkout dev
commit id: "{}" type: NORMAL
checkout main
commit id: "{}" type: NORMAL
checkout main
branch feature
commit id: "{}" type: NORMAL
""".format(*[main_commits[-2], main_commits[-1], dev_commits[0], main_commits[0],feature_commits[0]])
)
```
We can also [list all branches](./reference.md#icechunk.Repository.list_branches) in the repository.
```python exec="on" session="version" source="material-block" result="code"
print(repo.list_branches())
```
If we need to find the snapshot that a branch is based on, we can use the [`lookup_branch`](./reference.md#icechunk.Repository.lookup_branch) method.
```python exec="on" session="version" source="material-block" result="code"
print(repo.lookup_branch("feature"))
```
We can also [delete a branch](./reference.md#icechunk.Repository.delete_branch) with [`delete_branch`](./reference.md#icechunk.Repository.delete_branch).
```python exec="on" session="version" source="material-block"
repo.delete_branch("feature")
```
Finally, we can [reset a branch](./reference.md#icechunk.Repository.reset_branch) to a previous snapshot with [`reset_branch`](./reference.md#icechunk.Repository.reset_branch). This immediately moves the branch tip to the specified snapshot, changing the history of the branch.
```python exec="on" session="version" source="material-block"
repo.reset_branch("dev", snapshot_id=main_branch_snapshot_id)
```
## Tags
Tags are immutable references to a snapshot. They are created with [`create_tag`](./reference.md#icechunk.Repository.create_tag).
For example, to tag the second commit in `main`'s history:
```python exec="on" session="version" source="material-block"
repo.create_tag("v1.0.0", snapshot_id=list(repo.ancestry(branch="main"))[1].id)
```
Because tags are immutable, we need to use a readonly `Session` to access the data referenced by a tag.
```python exec="on" session="version" source="material-block" result="code"
session = repo.readonly_session(tag="v1.0.0")
root = zarr.open_group(session.store, mode="r")
print(root.attrs["foo"])
```
```python exec="1" result="mermaid" session="version"
print("""
gitGraph
commit id: "{}" type: NORMAL
commit id: "{}" type: NORMAL
commit tag: "v1.0.0"
commit id: "{}" type: NORMAL
""".format(*[snap.id[:6] for snap in repo.ancestry(branch="main")]))
```
We can also [list all tags](./reference.md#icechunk.Repository.list_tags) in the repository.
```python exec="on" session="version" source="material-block" result="code"
print(repo.list_tags())
```
and we can look up the snapshot that a tag is based on with [`lookup_tag`](./reference.md#icechunk.Repository.lookup_tag).
```python exec="on" session="version" source="material-block" result="code"
print(repo.lookup_tag("v1.0.0"))
```
And then finally delete a tag with [`delete_tag`](./reference.md#icechunk.Repository.delete_tag).
!!! note
Tags are immutable and once a tag is deleted, it can never be recreated.
```python exec="on" session="version" source="material-block"
repo.delete_tag("v1.0.0")
```
## Conflict Resolution
Icechunk is a serverless distributed system, and as such, it is possible to have multiple users or processes modifying the same data at the same time. Icechunk relies on the consistency guarantees of the underlying storage backends to ensure that the data is always consistent. In situations where two users or processes attempt to modify the same data at the same time, Icechunk will detect the conflict and raise an exception at commit time. This can be illustrated with the following example.
Let's create a fresh repository, add some attributes to the root group and create an array named `data`.
```python exec="on" session="version" source="material-block" result="code"
import icechunk
import numpy as np
import zarr
repo = icechunk.Repository.create(icechunk.in_memory_storage())
session = repo.writable_session("main")
root = zarr.create_group(session.store)
root.attrs["foo"] = "bar"
root.create_dataset("data", shape=(10, 10), chunks=(1, 1), dtype=np.int32)
print(session.commit(message="Add foo attribute and data array"))
```
Let's try to modify the `data` array in two different sessions, created from the `main` branch.
```python exec="on" session="version" source="material-block"
session1 = repo.writable_session("main")
session2 = repo.writable_session("main")
root1 = zarr.group(session1.store)
root2 = zarr.group(session2.store)
root1["data"][0,0] = 1
root2["data"][0,:] = 2
```
and then try to commit the changes.
```python
print(session1.commit(message="Update first element of data array"))
print(session2.commit(message="Update first row of data array"))
# AE9XS2ZWXT861KD2JGHG
# ---------------------------------------------------------------------------
# ConflictError Traceback (most recent call last)
# Cell In[7], line 11
# 8 root2.attrs["foo"] = "baz"
# 10 print(session1.commit(message="Update foo attribute on root group"))
# ---> 11 print(session2.commit(message="Update foo attribute on root group"))
# File ~/Developer/icechunk/icechunk-python/python/icechunk/session.py:224, in Session.commit(self, message, metadata)
# 222 return self._session.commit(message, metadata)
# 223 except PyConflictError as e:
# --> 224 raise ConflictError(e) from None
# ConflictError: Failed to commit, expected parent: Some("BG0W943WSNFMMVD1FXJ0"), actual parent: Some("AE9XS2ZWXT861KD2JGHG")
```
The first session was able to commit successfully, but the second session failed with a [`ConflictError`](./reference.md#icechunk.ConflictError). When the second session was created, the changes made were relative to the tip of the `main` branch, but the tip of the `main` branch had been modified by the first session.
To resolve this conflict, we can use the [`rebase`](./reference.md#icechunk.Session.rebase) functionality.
### Rebasing
To update the second session so it is based off the tip of the `main` branch, we can use the [`rebase`](./reference.md#icechunk.Session.rebase) method.
First, we can try to rebase, without merging any conflicting changes:
```python
session2.rebase(icechunk.ConflictDetector())
# ---------------------------------------------------------------------------
# RebaseFailedError Traceback (most recent call last)
# Cell In[8], line 1
# ----> 1 session2.rebase(icechunk.ConflictDetector())
# File ~/Developer/icechunk/icechunk-python/python/icechunk/session.py:247, in Session.rebase(self, solver)
# 245 self._session.rebase(solver)
# 246 except PyRebaseFailedError as e:
# --> 247 raise RebaseFailedError(e) from None
# RebaseFailedError: Rebase failed on snapshot AE9XS2ZWXT861KD2JGHG: 1 conflicts found
```
This, however, fails because both sessions wrote to the same chunk. We can use the `RebaseFailedError` to get more information about the conflict.
```python
try:
    session2.rebase(icechunk.ConflictDetector())
except icechunk.RebaseFailedError as e:
for conflict in e.conflicts:
print(f"Conflict at {conflict.path}: {conflict.conflicted_chunks}")
# Conflict at /data: [[0, 0]]
```
We get a clear indication of the conflict and of the chunks that are conflicting. In this case we have decided that the first session's changes are correct, so we can rebase again, this time using the [`BasicConflictSolver`](./reference.md#icechunk.BasicConflictSolver) to resolve the conflict.
```python
session2.rebase(icechunk.BasicConflictSolver(on_chunk_conflict=icechunk.VersionSelection.UseOurs))
session2.commit(message="Update first row of data array")
# 'R4WXW2CYNAZTQ3HXTNK0'
```
Success! We have now resolved the conflict and committed the changes.
Let's look at the value of the `data` array to confirm that the conflict was resolved correctly.
```python
session = repo.readonly_session("main")
root = zarr.open_group(session.store, mode="r")
root["data"][0,:]
# array([1, 2, 2, 2, 2, 2, 2, 2, 2, 2], dtype=int32)
```
As you can see, `readonly_session` accepts a string for a branch name, or you can also write:
```python
session = repo.readonly_session(branch="main")
```
Lastly, if you make changes to non-conflicting chunks or attributes, you can rebase without having to resolve any conflicts.
This time we will show how to use rebase automatically during the `commit` call:
```python
session1 = repo.writable_session("main")
session2 = repo.writable_session("main")
root1 = zarr.group(session1.store)
root2 = zarr.group(session2.store)
root1["data"][3,:] = 3
root2["data"][4,:] = 4
session1.commit(message="Update fourth row of data array")
session2.commit(message="Update fifth row of data array", rebase_with=icechunk.ConflictDetector())
print("Rebase+commit succeeded")
```
And now we can see the data in the `data` array to confirm that the changes were committed correctly.
```python
session = repo.readonly_session(branch="main")
root = zarr.open_group(session.store, mode="r")
root["data"][:,:]
# array([[1, 2, 2, 2, 2, 2, 2, 2, 2, 2],
# [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
# [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
# [3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
# [4, 4, 4, 4, 4, 4, 4, 4, 4, 4],
# [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
# [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
# [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
# [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
# [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int32)
```
#### Limitations
At the moment, the rebase functionality is limited to resolving conflicts with chunks in arrays. Other types of conflicts are not able to be resolved by icechunk yet and must be resolved manually.
# Virtual Datasets
While Icechunk works wonderfully with native chunks managed by Zarr, there is lots of archival data out there in other formats already. To interoperate with such data, Icechunk supports "virtual" chunks, where any number of chunks in a given dataset may reference external data in existing archival formats, such as netCDF, HDF, GRIB, or TIFF. Virtual chunks are loaded directly from the original source without copying or modifying the original archival data files. This enables Icechunk to manage large datasets from existing data without needing that data to be in Zarr format already.
!!! note
The concept of a "virtual Zarr dataset" originates from the [Kerchunk](https://fsspec.github.io/kerchunk/) project, which preceded and inspired [VirtualiZarr](https://virtualizarr.readthedocs.io/en/latest/). Like `VirtualiZarr`, the `kerchunk` package provides functionality to scan metadata of existing data files and combine these references into larger virtual datasets, but unlike `VirtualiZarr`, the `kerchunk` package currently has no facility for writing to `Icechunk` stores. If you previously were interested in "Kerchunking" your data, you can now achieve a similar result by using `VirtualiZarr` to create virtual datasets and write them to `icechunk`.
`VirtualiZarr` lets users ingest existing data files into virtual datasets using various different tools under the hood, including `kerchunk`, `xarray`, `zarr`, and now `icechunk`. It does so by creating virtual references to existing data that can be combined and manipulated to create larger virtual datasets using `xarray`. These datasets can then be exported to `kerchunk` reference format or to an `Icechunk` repository, without ever copying or moving the existing data files.
!!! note
Currently, Icechunk supports virtual references to data stored in `s3`-compatible, `gcs`, `http/https`, and `local` storage backends. Support for [`azure`](https://github.com/earth-mover/icechunk/issues/602) is on the roadmap.
## Creating a virtual dataset with VirtualiZarr
We are going to create a virtual dataset pointing to all of the [OISST](https://www.ncei.noaa.gov/products/optimum-interpolation-sst) data for August 2024. This data is distributed publicly as netCDF files on AWS S3, with one netCDF file containing the Sea Surface Temperature (SST) data for each day of the month. We are going to use `VirtualiZarr` to combine all of these files into a single virtual dataset spanning the entire month, then write that dataset to Icechunk for use in analysis.
Before we get started, we need to install `virtualizarr`, and `icechunk`. We also need to install `fsspec` and `s3fs` for working with data on s3.
```shell
pip install virtualizarr icechunk fsspec s3fs
```
First, we need to find all of the files we are interested in. We will do this with fsspec, using a `glob` expression to find every netCDF file in the August 2024 folder in the bucket:
```python
import fsspec
fs = fsspec.filesystem('s3')
oisst_files = fs.glob('s3://noaa-cdr-sea-surface-temp-optimum-interpolation-pds/data/v2.1/avhrr/202408/oisst-avhrr-v02r01.*.nc')
oisst_files = sorted(['s3://'+f for f in oisst_files])
#['s3://noaa-cdr-sea-surface-temp-optimum-interpolation-pds/data/v2.1/avhrr/202408/oisst-avhrr-v02r01.20240801.nc',
# 's3://noaa-cdr-sea-surface-temp-optimum-interpolation-pds/data/v2.1/avhrr/202408/oisst-avhrr-v02r01.20240802.nc',
# 's3://noaa-cdr-sea-surface-temp-optimum-interpolation-pds/data/v2.1/avhrr/202408/oisst-avhrr-v02r01.20240803.nc',
# 's3://noaa-cdr-sea-surface-temp-optimum-interpolation-pds/data/v2.1/avhrr/202408/oisst-avhrr-v02r01.20240804.nc',
#...
#]
```
Now that we have the filenames of the data we need, we can create virtual datasets with `VirtualiZarr`. This may take a minute.
```python
from virtualizarr import open_virtual_dataset
virtual_datasets = [
    open_virtual_dataset(url, indexes={})
    for url in oisst_files
]
```
We can now use `xarray` to combine these virtual datasets into one large virtual dataset (for more details on this operation see [`VirtualiZarr`'s documentation](https://virtualizarr.readthedocs.io/en/latest/usage.html#combining-virtual-datasets)). We know that each of our files shares the same structure but with a different date, so we are going to concatenate these datasets on the `time` dimension.
```python
import xarray as xr
virtual_ds = xr.concat(
virtual_datasets,
dim='time',
coords='minimal',
compat='override',
combine_attrs='override'
)
# <xarray.Dataset> Size: 257MB
# Dimensions: (time: 31, zlev: 1, lat: 720, lon: 1440)
# Coordinates: time, zlev, lat, lon
# Data variables: sst, ice, anom, err (ManifestArray virtual references)
```
Success! We have created our full dataset with 31 timesteps spanning the month of August, all with virtual references to pre-existing data files in object storage. This means we can now version control our dataset, allowing us to update it and roll it back to a previous version without copying or moving any data from the original files.
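Before plotting, we need to write the virtual dataset to an Icechunk repository and open it back up as `ds`. Here is a sketch: it assumes VirtualiZarr's `virtualize.to_icechunk` accessor, uses a hypothetical local path for the repo, and authorizes the NOAA bucket as a virtual chunk container (see the Virtual Reference API section below):
```python
import icechunk
import xarray as xr

# Authorize the source bucket as a virtual chunk container
config = icechunk.RepositoryConfig.default()
config.set_virtual_chunk_container(
    icechunk.VirtualChunkContainer(
        "s3://noaa-cdr-sea-surface-temp-optimum-interpolation-pds",
        icechunk.s3_store(region="us-east-1"),
    )
)

repo = icechunk.Repository.create(
    icechunk.local_filesystem_storage("/tmp/oisst-virtual"),
    config,
)

session = repo.writable_session("main")
virtual_ds.virtualize.to_icechunk(session.store)  # write only the references, not the data
session.commit(message="Add virtual OISST data for August 2024")

# Open it back up as a regular (lazy) xarray dataset
session = repo.readonly_session(branch="main")
ds = xr.open_zarr(session.store, consolidated=False)
```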
Finally, let's make a plot of the sea surface temperature!
```python
ds.sst.isel(time=26, zlev=0).plot(x='lon', y='lat', vmin=0)
```

!!! note
Users of the repo will need to enable the virtual chunk container by passing the `credentials` argument to `Repository.open`. This way, the repo user flags the container as authorized. The `credentials` argument must be a dict using url prefixes as keys and optional credentials as values. If a container requires no credentials, `None` can be used as its value in the map. Failing to authorize a container will generate an error when a chunk is fetched from it.
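For example, a repo user might authorize a container with no credentials like this (a sketch following the note above, with the argument name as described there and the public bucket from this example):
```python
repo = icechunk.Repository.open(
    storage,
    config,
    credentials={"s3://noaa-cdr-sea-surface-temp-optimum-interpolation-pds": None},
)
```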
## Virtual Reference API
While `VirtualiZarr` is the easiest way to create virtual datasets with Icechunk, the Store API that it uses to create the datasets in Icechunk is public. `IcechunkStore` contains a [`set_virtual_ref`](./reference.md#icechunk.IcechunkStore.set_virtual_ref) method that specifies a virtual ref for a specified chunk.
### Virtual Reference Storage Support
Currently, Icechunk supports four types of storage for virtual references:
#### S3 Compatible
References to files accessible via S3 compatible storage.
##### Example
Here is how we can set the chunk at key `c/0` to point to the file `my/data/file.nc` in an s3 bucket named `mybucket`:
```python
config = icechunk.RepositoryConfig.default()
config.set_virtual_chunk_container(icechunk.VirtualChunkContainer("s3://mybucket/my/data", icechunk.s3_store(region="us-east-1")))
repo = icechunk.Repository.create(storage, config)
session = repo.writable_session("main")
session.store.set_virtual_ref('c/0', 's3://mybucket/my/data/file.nc', offset=1000, length=200)
```
##### Configuration
S3 virtual references require configuring credentials for the store to be able to access the specified s3 bucket. See [the configuration docs](./configuration.md#virtual-chunk-credentials) for instructions.
#### GCS
References to files accessible on Google Cloud Storage
##### Example
Here is how we can set the chunk at key `c/0` to point to the file `my/data/file.nc` in a GCS bucket named `mybucket`:
```python
config = icechunk.RepositoryConfig.default()
config.set_virtual_chunk_container(icechunk.VirtualChunkContainer("gcs://mybucket/my/data", icechunk.gcs_store(options={})))
repo = icechunk.Repository.create(storage, config)
session = repo.writable_session("main")
session.store.set_virtual_ref('c/0', 'gcs://mybucket/my/data/file.nc', offset=1000, length=200)
```
#### HTTP
References to files accessible via http(s) protocol
##### Example
Here is how we can set the chunk at key `c/0` to point to the file `my/data/file.nc` on `myserver`:
```python
config = icechunk.RepositoryConfig.default()
config.set_virtual_chunk_container(icechunk.VirtualChunkContainer("https://myserver/my/data", icechunk.http_store(options={})))
repo = icechunk.Repository.create(storage, config)
session = repo.writable_session("main")
session.store.set_virtual_ref('c/0', 'https://myserver/my/data/file.nc', offset=1000, length=200)
```
#### Local Filesystem
References to files accessible via local filesystem. This requires any file paths to be **absolute** at this time.
##### Example
Here is how we can set the chunk at key `c/0` to point to a file on my local filesystem located at `/path/to/my/file.nc`:
```python
config = icechunk.RepositoryConfig.default()
config.set_virtual_chunk_container(icechunk.VirtualChunkContainer("file:///path/to/my", icechunk.local_filesystem_store("/path/to/my")))
repo = icechunk.Repository.create(storage, config)
session = repo.writable_session("main")
session.store.set_virtual_ref('c/0', 'file:///path/to/my/file.nc', offset=20, length=100)
```
No extra configuration is necessary for local filesystem references.
### Virtual Reference File Format Support
Currently, Icechunk supports `HDF5`, `netcdf4`, and `netcdf3` files for use in virtual references with `VirtualiZarr`. Support for other filetypes is under development in the VirtualiZarr project. Below are some relevant issues:
- [meta issue for file format support](https://github.com/zarr-developers/VirtualiZarr/issues/218)
- [Support for GRIB2 files](https://github.com/zarr-developers/VirtualiZarr/issues/312)
- [Support for GRIB2 files with datatree](https://github.com/zarr-developers/VirtualiZarr/issues/11)
- [Support for TIFF files](https://github.com/zarr-developers/VirtualiZarr/issues/291)
# Icechunk + Xarray
Icechunk was designed to work seamlessly with Xarray. Xarray users can read and
write data to Icechunk using [`xarray.open_zarr`](https://docs.xarray.dev/en/latest/generated/xarray.open_zarr.html#xarray.open_zarr)
and `icechunk.xarray.to_icechunk` methods.
!!! warning
Using Xarray and Icechunk together currently requires installing Xarray >= 2025.1.1.
```shell
pip install "xarray>=2025.1.1"
```
!!! note "`to_icechunk` vs `to_zarr`"
[`xarray.Dataset.to_zarr`](https://docs.xarray.dev/en/latest/generated/xarray.Dataset.to_zarr.html#xarray.Dataset.to_zarr)
and [`to_icechunk`](./reference.md#icechunk.xarray.to_icechunk) are nearly functionally identical.
In a distributed context, e.g.
writes orchestrated with `multiprocessing` or a `dask.distributed.Client` and `dask.array`, you *must* use `to_icechunk`.
This will ensure that you can execute a commit that successfully records all remote writes.
See [these docs on orchestrating parallel writes](./parallel.md) and [these docs on dask.array with distributed](./dask.md#icechunk-dask-xarray)
for more.
If using `to_zarr`, remember to set `zarr_format=3, consolidated=False`. Consolidated metadata
is unnecessary (and unsupported) in Icechunk. Icechunk already organizes the dataset metadata
in a way that makes it very fast to fetch from storage.
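If you do use `to_zarr`, the call looks like this (a sketch; `session` is a writable session like the ones created below):
```python
ds.to_zarr(session.store, zarr_format=3, consolidated=False)
```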
In this example, we'll explain how to create a new Icechunk repo, write some sample data
to it, and append a second block of data using Icechunk's version control features.
## Create a new repo
Similar to the example in [quickstart](./quickstart.md), we'll create an
Icechunk repo in S3 or a local file system. You will need to replace the storage config
with a bucket or file path that you have access to.
```python exec="on" session="xarray" source="material-block"
import xarray as xr
import icechunk
```
=== "S3 Storage"
```python
storage_config = icechunk.s3_storage(
bucket="icechunk-test",
prefix="xarray-demo"
)
repo = icechunk.Repository.create(storage_config)
```
=== "Local Storage"
```python exec="on" session="xarray" source="material-block"
import tempfile
storage_config = icechunk.local_filesystem_storage(tempfile.TemporaryDirectory().name)
repo = icechunk.Repository.create(storage_config)
```
## Open tutorial dataset from Xarray
For this demo, we'll open Xarray's RASM tutorial dataset and split it into two blocks.
We'll write the two blocks to Icechunk in separate transactions later in this example.
!!! note
Downloading xarray tutorial data requires pooch and netCDF4. These can be installed with
```shell
pip install pooch netCDF4
```
```python exec="on" session="xarray" source="material-block"
ds = xr.tutorial.open_dataset('rasm')
ds1 = ds.isel(time=slice(None, 18)) # part 1
ds2 = ds.isel(time=slice(18, None)) # part 2
```
## Write Xarray data to Icechunk
Create a new writable session on the `main` branch to get the `IcechunkStore`:
```python exec="on" session="xarray" source="material-block"
session = repo.writable_session("main")
```
Writing Xarray data to Icechunk is as easy as calling `to_icechunk`:
```python exec="on" session="xarray" source="material-block"
from icechunk.xarray import to_icechunk
to_icechunk(ds1, session)
```
After writing, we commit the changes using the session:
```python exec="on" session="xarray" source="material-block" result="code"
first_snapshot = session.commit("add RASM data to store")
print(first_snapshot)
```
## Append to an existing store
Next, we want to add a second block of data to our store. Above, we created `ds2` for just
this reason. Again we'll use `to_icechunk`, this time with `append_dim='time'`.
```python exec="on" session="xarray" source="material-block"
# we have to get a new session after committing
session = repo.writable_session("main")
to_icechunk(ds2, session, append_dim='time')
```
And then we'll commit the changes:
```python exec="on" session="xarray" source="material-block" result="code"
print(session.commit("append more data"))
```
## Reading data with Xarray
```python exec="on" session="xarray" source="material-block" result="code"
xr.set_options(display_style="text")
print(xr.open_zarr(session.store, consolidated=False))
```
We can also read data from previous snapshots by checking out prior versions:
```python exec="on" session="xarray" source="material-block" result="code"
session = repo.readonly_session(snapshot_id=first_snapshot)
print(xr.open_zarr(session.store, consolidated=False))
```
Notice that this second `xarray.Dataset` has a time dimension of length 18 whereas the
first has a time dimension of length 36.
## Next steps
For more details on how to use Xarray's Zarr integration, check out [Xarray's documentation](https://docs.xarray.dev/en/stable/user-guide/io.html#zarr).
# Python API Reference
## icechunk
Modules:
Name | Description
---|---
`credentials` |
`dask` |
`distributed` |
`repository` |
`session` |
`storage` |
`store` |
`xarray` |
Classes:
Name | Description
---|---
`AzureCredentials` | Credentials for an azure storage backend
`AzureStaticCredentials` | Credentials for an azure storage backend
`BasicConflictSolver` | A basic conflict solver that allows for simple configuration of resolution behavior
`CachingConfig` | Configuration for how Icechunk caches its metadata files
`CompressionAlgorithm` | Enum for selecting the compression algorithm used by Icechunk to write its metadata files
`CompressionConfig` | Configuration for how Icechunk compresses its metadata files
`Conflict` | A conflict detected between snapshots
`ConflictDetector` | A conflict solver that can be used to detect conflicts between two stores, but does not resolve them
`ConflictError` | An error that occurs when a conflict is detected
`ConflictSolver` | An abstract conflict solver that can be used to detect or resolve conflicts between two stores
`ConflictType` | Type of conflict detected
`Diff` | The result of comparing two snapshots
`ForkSession` |
`GCSummary` | Summarizes the results of a garbage collection operation on an icechunk repo
`GcsBearerCredential` | Credentials for a google cloud storage backend
`GcsCredentials` | Credentials for a google cloud storage backend
`GcsStaticCredentials` | Credentials for a google cloud storage backend
`IcechunkError` | Base class for all Icechunk errors
`IcechunkStore` |
`ManifestConfig` | Configuration for how Icechunk manifests
`ManifestFileInfo` | Manifest file metadata
`ManifestPreloadCondition` | Configuration for conditions under which manifests will preload on session creation
`ManifestPreloadConfig` | Configuration for how Icechunk manifest preload on session creation
`ManifestSplitCondition` | Configuration for conditions under which manifests will be split into splits
`ManifestSplitDimCondition` | Conditions for specifying dimensions along which to shard manifests.
`ManifestSplittingConfig` | Configuration for manifest splitting.
`RebaseFailedError` | An error that occurs when a rebase operation fails
`Repository` | An Icechunk repository.
`RepositoryConfig` | Configuration for an Icechunk repository
`S3Credentials` | Credentials for an S3 storage backend
`S3Options` | Options for accessing an S3-compatible storage backend
`S3StaticCredentials` | Credentials for an S3 storage backend
`Session` | A session object that allows for reading and writing data from an Icechunk repository.
`SnapshotInfo` | Metadata for a snapshot
`Storage` | Storage configuration for an IcechunkStore
`StorageConcurrencySettings` | Configuration for how Icechunk uses its Storage instance
`StorageRetriesSettings` | Configuration for how Icechunk retries requests.
`StorageSettings` | Configuration for how Icechunk uses its Storage instance
`VersionSelection` | Enum for selecting which version of a conflict to use
`VirtualChunkContainer` | A virtual chunk container is a configuration that allows Icechunk to read virtual references from a storage backend.
`VirtualChunkSpec` | The specification for a virtual chunk reference.
Functions:
Name | Description
---|---
`azure_credentials` | Create credentials for the Azure Blob Storage object store.
`azure_from_env_credentials` | Instruct the Azure Blob Storage object store to fetch credentials from the operating system environment.
`azure_static_credentials` | Create static credentials for the Azure Blob Storage object store.
`azure_storage` | Create a Storage instance that saves data in the Azure Blob Storage object store.
`containers_credentials` | Build a map of credentials for virtual chunk containers.
`gcs_credentials` | Create credentials for the Google Cloud Storage object store.
`gcs_from_env_credentials` | Instruct the Google Cloud Storage object store to fetch credentials from the operating system environment.
`gcs_refreshable_credentials` | Create refreshable credentials for the Google Cloud Storage object store.
`gcs_static_credentials` | Create static credentials for the Google Cloud Storage object store.
`gcs_storage` | Create a Storage instance that saves data in the Google Cloud Storage object store.
`gcs_store` | Build an ObjectStoreConfig instance for Google Cloud Storage object stores.
`http_store` | Build an ObjectStoreConfig instance for HTTP object stores.
`in_memory_storage` | Create a Storage instance that saves data in memory.
`initialize_logs` | Initialize the logging system for the library.
`local_filesystem_storage` | Create a Storage instance that saves data in the local file system.
`local_filesystem_store` | Build an ObjectStoreConfig instance for local file stores.
`r2_storage` | Create a Storage instance that saves data in the Cloudflare R2 object store.
`s3_anonymous_credentials` | Create no-signature credentials for S3 and S3-compatible object stores.
`s3_credentials` | Create credentials for S3 and S3-compatible object stores.
`s3_from_env_credentials` | Instruct S3 and S3-compatible object stores to gather credentials from the operating system environment.
`s3_refreshable_credentials` | Create refreshable credentials for S3 and S3-compatible object stores.
`s3_static_credentials` | Create static credentials for S3 and S3-compatible object stores.
`s3_storage` | Create a Storage instance that saves data in S3 or S3-compatible object stores.
`s3_store` | Build an ObjectStoreConfig instance for S3 or S3-compatible object stores.
`set_logs_filter` | Set filters and log levels for the different modules.
`spec_version` | The version of the Icechunk specification that the library is compatible with.
`tigris_storage` | Create a Storage instance that saves data in the Tigris object store.
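As an illustrative sketch of how these factories fit together (the bucket, prefix, and region below are hypothetical, and credentials are assumed to be available in the environment), a `Storage` instance is built once and handed to a `Repository`:
```python
import icechunk

# Hypothetical bucket and prefix; `from_env=True` gathers AWS
# credentials from the environment.
storage = icechunk.s3_storage(
    bucket="my-bucket",
    prefix="path/to/repo",
    region="us-east-1",
    from_env=True,
)
repo = icechunk.Repository.open_or_create(storage)
```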
### `` AzureCredentials #
Credentials for an Azure storage backend
This can be used to authenticate with an Azure storage backend.
Classes:
Name | Description
---|---
`FromEnv` | Uses credentials from environment variables
`Static` | Uses Azure credentials without expiration
#### `` FromEnv #
Uses credentials from environment variables
#### `` Static #
Uses Azure credentials without expiration
### `` AzureStaticCredentials #
Credentials for an Azure storage backend
Classes:
Name | Description
---|---
`AccessKey` | Credentials for an Azure storage backend using an access key
`BearerToken` | Credentials for an Azure storage backend using a bearer token
`SasToken` | Credentials for an Azure storage backend using a shared access signature token
#### `` AccessKey #
Credentials for an Azure storage backend using an access key
Parameters:
Name | Type | Description | Default
---|---|---|---
`key` | `str` | The access key to use for authentication. | _required_
#### `` BearerToken #
Credentials for an Azure storage backend using a bearer token
Parameters:
Name | Type | Description | Default
---|---|---|---
`token` | `str` | The bearer token to use for authentication. | _required_
#### `` SasToken #
Credentials for an Azure storage backend using a shared access signature token
Parameters:
Name | Type | Description | Default
---|---|---|---
`token` | `str` | The shared access signature token to use for authentication. | _required_
### `` BasicConflictSolver #
Bases: `ConflictSolver`
A basic conflict solver that allows for simple configuration of resolution
behavior
This conflict solver allows for simple configuration of resolution behavior
for conflicts that may occur during a rebase operation. It will attempt to
resolve a limited set of conflicts based on the configuration options
provided.
* When a chunk conflict is encountered, the behavior is determined by the `on_chunk_conflict` option
* When an array is deleted that has been updated, `fail_on_delete_of_updated_array` will determine whether to fail the rebase operation
* When a group is deleted that has been updated, `fail_on_delete_of_updated_group` will determine whether to fail the rebase operation
Methods:
Name | Description
---|---
`__init__` | Create a BasicConflictSolver object with the given configuration options
#### `` __init__ #
__init__(*, on_chunk_conflict=VersionSelection.UseOurs, fail_on_delete_of_updated_array=False, fail_on_delete_of_updated_group=False)
Create a BasicConflictSolver object with the given configuration options
Parameters:
Name | Type | Description | Default
---|---|---|---
`on_chunk_conflict` | `VersionSelection` | The behavior to use when a chunk conflict is encountered, by default VersionSelection.UseOurs | `UseOurs`
`fail_on_delete_of_updated_array` | `bool` | Whether to fail when an array is deleted that has been updated, by default False | `False`
`fail_on_delete_of_updated_group` | `bool` | Whether to fail when a group is deleted that has been updated, by default False | `False`
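A minimal sketch of using this solver, assuming `session` is a writable session whose `commit` just raised a `ConflictError`:
```python
import icechunk

# Rebase onto the new branch tip, keeping our chunk versions when both
# sessions wrote the same chunk, then retry the commit.
solver = icechunk.BasicConflictSolver(
    on_chunk_conflict=icechunk.VersionSelection.UseOurs,
    fail_on_delete_of_updated_array=True,
)
session.rebase(solver)
session.commit("retry after rebase")
```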
### `` CachingConfig #
Configuration for how Icechunk caches its metadata files
Methods:
Name | Description
---|---
`__init__` | Create a new `CachingConfig` object
Attributes:
Name | Type | Description
---|---|---
`num_bytes_attributes` | `int | None` | The number of bytes of attributes to cache.
`num_bytes_chunks` | `int | None` | The number of bytes of chunks to cache.
`num_chunk_refs` | `int | None` | The number of chunk references to cache.
`num_snapshot_nodes` | `int | None` | The number of snapshot nodes to cache.
`num_transaction_changes` | `int | None` | The number of transaction changes to cache.
#### `` num_bytes_attributes `property` `writable` #
num_bytes_attributes
The number of bytes of attributes to cache.
Returns:
Type | Description
---|---
`int | None` | The number of bytes of attributes to cache.
#### `` num_bytes_chunks `property` `writable` #
num_bytes_chunks
The number of bytes of chunks to cache.
Returns:
Type | Description
---|---
`int | None` | The number of bytes of chunks to cache.
#### `` num_chunk_refs `property` `writable` #
num_chunk_refs
The number of chunk references to cache.
Returns:
Type | Description
---|---
`int | None` | The number of chunk references to cache.
#### `` num_snapshot_nodes `property` `writable` #
num_snapshot_nodes
The number of snapshot nodes to cache.
Returns:
Type | Description
---|---
`int | None` | The number of snapshot nodes to cache.
#### `` num_transaction_changes `property` `writable` #
num_transaction_changes
The number of transaction changes to cache.
Returns:
Type | Description
---|---
`int | None` | The number of transaction changes to cache.
#### `` __init__ #
__init__(num_snapshot_nodes=None, num_chunk_refs=None, num_transaction_changes=None, num_bytes_attributes=None, num_bytes_chunks=None)
Create a new `CachingConfig` object
Parameters:
Name | Type | Description | Default
---|---|---|---
`num_snapshot_nodes` | `int | None` | The number of snapshot nodes to cache. | `None`
`num_chunk_refs` | `int | None` | The number of chunk references to cache. | `None`
`num_transaction_changes` | `int | None` | The number of transaction changes to cache. | `None`
`num_bytes_attributes` | `int | None` | The number of bytes of attributes to cache. | `None`
`num_bytes_chunks` | `int | None` | The number of bytes of chunks to cache. | `None`
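For example, the cache sizes can be tuned through the repository config (the numbers below are illustrative, not recommendations):
```python
import icechunk

config = icechunk.RepositoryConfig.default()
config.caching = icechunk.CachingConfig(
    num_chunk_refs=5_000_000,   # illustrative value
    num_snapshot_nodes=50_000,  # illustrative value
)
```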
### `` CompressionAlgorithm #
Bases: `Enum`
Enum for selecting the compression algorithm used by Icechunk to write its
metadata files
Attributes:
Name | Type | Description
---|---|---
`Zstd` | `int` | The Zstd compression algorithm.
Methods:
Name | Description
---|---
`default` | The default compression algorithm used by Icechunk to write its metadata files.
#### `` default `staticmethod` #
default()
The default compression algorithm used by Icechunk to write its metadata
files.
Returns:
Type | Description
---|---
`CompressionAlgorithm` | The default compression algorithm.
### `` CompressionConfig #
Configuration for how Icechunk compresses its metadata files
Methods:
Name | Description
---|---
`__init__` | Create a new `CompressionConfig` object
`default` | The default compression configuration used by Icechunk to write its metadata files.
Attributes:
Name | Type | Description
---|---|---
`algorithm` | `CompressionAlgorithm | None` | The compression algorithm used by Icechunk to write its metadata files.
`level` | `int | None` | The compression level used by Icechunk to write its metadata files.
#### `` algorithm `property` `writable` #
algorithm
The compression algorithm used by Icechunk to write its metadata files.
Returns:
Type | Description
---|---
`CompressionAlgorithm | None` | The compression algorithm used by Icechunk to write its metadata files.
#### `` level `property` `writable` #
level
The compression level used by Icechunk to write its metadata files.
Returns:
Type | Description
---|---
`int | None` | The compression level used by Icechunk to write its metadata files.
#### `` __init__ #
__init__(algorithm=None, level=None)
Create a new `CompressionConfig` object
Parameters:
Name | Type | Description | Default
---|---|---|---
`algorithm` | `CompressionAlgorithm | None` | The compression algorithm to use. | `None`
`level` | `int | None` | The compression level to use. | `None`
#### `` default `staticmethod` #
default()
The default compression configuration used by Icechunk to write its metadata
files.
Returns:
Type | Description
---|---
`CompressionConfig` | The default compression configuration.
### `` Conflict #
A conflict detected between snapshots
Attributes:
Name | Type | Description
---|---|---
`conflict_type` | `ConflictType` | The type of conflict detected
`conflicted_chunks` | `list[list[int]] | None` | If the conflict is a chunk conflict, this will return the list of chunk indices that are in conflict
`path` | `str` | The path of the node that caused the conflict
#### `` conflict_type `property` #
conflict_type
The type of conflict detected
Returns: ConflictType: The type of conflict detected
#### `` conflicted_chunks `property` #
conflicted_chunks
If the conflict is a chunk conflict, this will return the list of chunk
indices that are in conflict
Returns: list[list[int]] | None: The list of chunk indices that are in conflict
#### `` path `property` #
path
The path of the node that caused the conflict
Returns: str: The path of the node that caused the conflict
### `` ConflictDetector #
Bases: `ConflictSolver`
A conflict solver that can be used to detect conflicts between two stores, but
does not resolve them
Where the `BasicConflictSolver` will attempt to resolve conflicts, the
`ConflictDetector` will only detect them. This means that during a rebase
operation the `ConflictDetector` will raise a `RebaseFailedError` if any
conflicts are detected, allowing the rebase operation to be retried with a
different conflict resolution strategy. If no conflicts are detected, the
rebase operation will succeed.
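A minimal sketch of detect-only rebasing, assuming `session` is a writable session with uncommitted changes:
```python
import icechunk

try:
    session.rebase(icechunk.ConflictDetector())
except icechunk.RebaseFailedError as e:
    # Inspect each detected conflict before choosing a resolution strategy.
    for conflict in e.conflicts:
        print(conflict.conflict_type, conflict.path)
```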
### `` ConflictError #
Bases: `Exception`
An error that occurs when a conflict is detected
Attributes:
Name | Type | Description
---|---|---
`actual_parent` | `str` | The actual parent snapshot ID of the branch that the session attempted to commit to.
`expected_parent` | `str` | The expected parent snapshot ID.
#### `` actual_parent `property` #
actual_parent
The actual parent snapshot ID of the branch that the session attempted to
commit to.
When the session is based on a branch, this is the snapshot ID of the branch
tip. If this error is raised, it means the branch was modified and committed
by another session after the session was created.
#### `` expected_parent `property` #
expected_parent
The expected parent snapshot ID.
This is the snapshot ID that the session was based on when the commit
operation was called.
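A sketch of handling this error at commit time, assuming `session` is a writable session:
```python
import icechunk

try:
    session.commit("update temperature array")
except icechunk.ConflictError as e:
    # Another session committed first; rebase and retry, or inspect.
    print(f"expected parent {e.expected_parent}, found {e.actual_parent}")
```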
### `` ConflictSolver #
An abstract conflict solver that can be used to detect or resolve conflicts
between two stores
This should never be used directly, but should be subclassed to provide
specific conflict resolution behavior
### `` ConflictType #
Bases: `Enum`
Type of conflict detected
Attributes:
Name | Type | Description
---|---|---
`ChunkDoubleUpdate` | | A chunk update conflicts with an existing chunk update
`ChunksUpdatedInDeletedArray` | | Chunks are updated in a deleted array
`ChunksUpdatedInUpdatedArray` | | Chunks are updated in an updated array
`DeleteOfUpdatedArray` | | A delete is attempted on an updated array
`DeleteOfUpdatedGroup` | | A delete is attempted on an updated group
`NewNodeConflictsWithExistingNode` | | A new node conflicts with an existing node
`NewNodeInInvalidGroup` | | A new node is in an invalid group
`ZarrMetadataDoubleUpdate` | | A zarr metadata update conflicts with an existing zarr metadata update
`ZarrMetadataUpdateOfDeletedArray` | | A zarr metadata update is attempted on a deleted array
`ZarrMetadataUpdateOfDeletedGroup` | | A zarr metadata update is attempted on a deleted group
#### `` ChunkDoubleUpdate `class-attribute` `instance-attribute` #
ChunkDoubleUpdate = (6,)
A chunk update conflicts with an existing chunk update
#### `` ChunksUpdatedInDeletedArray `class-attribute` `instance-attribute` #
ChunksUpdatedInDeletedArray = (7,)
Chunks are updated in a deleted array
#### `` ChunksUpdatedInUpdatedArray `class-attribute` `instance-attribute` #
ChunksUpdatedInUpdatedArray = (8,)
Chunks are updated in an updated array
#### `` DeleteOfUpdatedArray `class-attribute` `instance-attribute` #
DeleteOfUpdatedArray = (9,)
A delete is attempted on an updated array
#### `` DeleteOfUpdatedGroup `class-attribute` `instance-attribute` #
DeleteOfUpdatedGroup = (10,)
A delete is attempted on an updated group
#### `` NewNodeConflictsWithExistingNode `class-attribute` `instance-attribute` #
NewNodeConflictsWithExistingNode = (1,)
A new node conflicts with an existing node
#### `` NewNodeInInvalidGroup `class-attribute` `instance-attribute` #
NewNodeInInvalidGroup = (2,)
A new node is in an invalid group
#### `` ZarrMetadataDoubleUpdate `class-attribute` `instance-attribute` #
ZarrMetadataDoubleUpdate = (3,)
A zarr metadata update conflicts with an existing zarr metadata update
#### `` ZarrMetadataUpdateOfDeletedArray `class-attribute` `instance-attribute` #
ZarrMetadataUpdateOfDeletedArray = (4,)
A zarr metadata update is attempted on a deleted array
#### `` ZarrMetadataUpdateOfDeletedGroup `class-attribute` `instance-attribute` #
ZarrMetadataUpdateOfDeletedGroup = (5,)
A zarr metadata update is attempted on a deleted group
### `` Diff #
The result of comparing two snapshots
Attributes:
Name | Type | Description
---|---|---
`deleted_arrays` | `set[str]` | The arrays that were deleted in the target ref.
`deleted_groups` | `set[str]` | The groups that were deleted in the target ref.
`new_arrays` | `set[str]` | The arrays that were added to the target ref.
`new_groups` | `set[str]` | The groups that were added to the target ref.
`updated_arrays` | `set[str]` | The arrays that were updated via zarr metadata in the target ref.
`updated_chunks` | `dict[str, list[list[int]]]` | The chunk indices that had data updated in the target ref, keyed by the path to the array.
`updated_groups` | `set[str]` | The groups that were updated via zarr metadata in the target ref.
#### `` deleted_arrays `property` #
deleted_arrays
The arrays that were deleted in the target ref.
#### `` deleted_groups `property` #
deleted_groups
The groups that were deleted in the target ref.
#### `` new_arrays `property` #
new_arrays
The arrays that were added to the target ref.
#### `` new_groups `property` #
new_groups
The groups that were added to the target ref.
#### `` updated_arrays `property` #
updated_arrays
The arrays that were updated via zarr metadata in the target ref.
#### `` updated_chunks `property` #
updated_chunks
The chunk indices that had data updated in the target ref, keyed by the path
to the array.
#### `` updated_groups `property` #
updated_groups
The groups that were updated via zarr metadata in the target ref.
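A short sketch of producing and inspecting a `Diff`, where `repo` is an open `Repository` and `old_snapshot_id` is a placeholder snapshot ID from its history:
```python
# Compare an older snapshot against the tip of `main`.
diff = repo.diff(from_snapshot_id=old_snapshot_id, to_branch="main")
print(diff.new_arrays)
print(diff.updated_chunks)
```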
### `` ForkSession #
Bases: `Session`
Methods:
Name | Description
---|---
`merge_async` | Merge the changes for this fork session with the changes from other fork sessions (async version).
Attributes:
Name | Type | Description
---|---|---
`store` | `IcechunkStore` | Get a zarr Store object for reading and writing data from the repository using zarr python.
#### `` store `property` #
store
Get a zarr Store object for reading and writing data from the repository using
zarr python.
Returns:
Type | Description
---|---
`IcechunkStore` | A zarr Store object for reading and writing data from the repository.
#### `` merge_async `async` #
merge_async(*others)
Merge the changes for this fork session with the changes from other fork
sessions (async version).
Parameters:
Name | Type | Description | Default
---|---|---|---
`others` | `ForkSession` | The other fork sessions to merge changes from. | `()`
### `` GCSummary #
Summarizes the results of a garbage collection operation on an icechunk repo
Attributes:
Name | Type | Description
---|---|---
`attributes_deleted` | `int` | How many attributes were deleted.
`bytes_deleted` | `int` | How many bytes were deleted.
`chunks_deleted` | `int` | How many chunks were deleted.
`manifests_deleted` | `int` | How many manifests were deleted.
`snapshots_deleted` | `int` | How many snapshots were deleted.
`transaction_logs_deleted` | `int` | How many transaction logs were deleted.
#### `` attributes_deleted `property` #
attributes_deleted
How many attributes were deleted.
#### `` bytes_deleted `property` #
bytes_deleted
How many bytes were deleted.
#### `` chunks_deleted `property` #
chunks_deleted
How many chunks were deleted.
#### `` manifests_deleted `property` #
manifests_deleted
How many manifests were deleted.
#### `` snapshots_deleted `property` #
snapshots_deleted
How many snapshots were deleted.
#### `` transaction_logs_deleted `property` #
transaction_logs_deleted
How many transaction logs were deleted.
### `` GcsBearerCredential #
Credentials for a Google Cloud Storage backend
This is a bearer token that has an expiration time.
Methods:
Name | Description
---|---
`__init__` | Create a GcsBearerCredential object
#### `` __init__ #
__init__(bearer, *, expires_after=None)
Create a GcsBearerCredential object
Parameters:
Name | Type | Description | Default
---|---|---|---
`bearer` | `str` | The bearer token to use for authentication. | _required_
`expires_after` | `datetime | None` | The expiration time of the bearer token. | `None`
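A sketch of pairing this with `gcs_refreshable_credentials`; `get_token_from_your_authority` is a hypothetical helper, and the callable is assumed to be invoked again whenever the previous credential expires:
```python
from datetime import datetime, timedelta, timezone

import icechunk

def fetch_credential() -> icechunk.GcsBearerCredential:
    # Hypothetical helper that obtains a fresh OAuth token.
    token = get_token_from_your_authority()
    return icechunk.GcsBearerCredential(
        token,
        expires_after=datetime.now(timezone.utc) + timedelta(minutes=55),
    )

credentials = icechunk.gcs_refreshable_credentials(fetch_credential)
```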
### `` GcsCredentials #
Credentials for a Google Cloud Storage backend
This can be used to authenticate with a Google Cloud Storage backend.
Classes:
Name | Description
---|---
`FromEnv` | Uses credentials from environment variables
`Refreshable` | Allows for an outside authority to pass in a function that can be used to provide credentials.
`Static` | Uses GCS credentials without expiration
#### `` FromEnv #
Uses credentials from environment variables
#### `` Refreshable #
Allows for an outside authority to pass in a function that can be used to
provide credentials.
This is useful for credentials that have an expiration time, or are otherwise
not known ahead of time.
#### `` Static #
Uses GCS credentials without expiration
### `` GcsStaticCredentials #
Credentials for a google cloud storage backend
Classes:
Name | Description
---|---
`ApplicationCredentials` | Credentials for a Google Cloud Storage backend using application default credentials
`BearerToken` | Credentials for a Google Cloud Storage backend using a bearer token
`ServiceAccount` | Credentials for a Google Cloud Storage backend using a service account json file
`ServiceAccountKey` | Credentials for a Google Cloud Storage backend using a serialized service account key
#### `` ApplicationCredentials #
Credentials for a Google Cloud Storage backend using application default
credentials
Parameters:
Name | Type | Description | Default
---|---|---|---
`path` | `str` | The path to the application default credentials (ADC) file. | _required_
#### `` BearerToken #
Credentials for a Google Cloud Storage backend using a bearer token
Parameters:
Name | Type | Description | Default
---|---|---|---
`token` | `str` | The bearer token to use for authentication. | _required_
#### `` ServiceAccount #
Credentials for a Google Cloud Storage backend using a service account json
file
Parameters:
Name | Type | Description | Default
---|---|---|---
`path` | `str` | The path to the service account json file. | _required_
#### `` ServiceAccountKey #
Credentials for a Google Cloud Storage backend using a serialized service
account key
Parameters:
Name | Type | Description | Default
---|---|---|---
`key` | `str` | The serialized service account key. | _required_
### `` IcechunkError #
Bases: `Exception`
Base class for all Icechunk errors
### `` IcechunkStore #
Bases: `Store`, `SyncMixin`
Methods:
Name | Description
---|---
`__init__` | Create a new IcechunkStore.
`clear` | Clear the store.
`delete` | Remove a key from the store
`delete_dir` | Delete a prefix
`exists` | Check if a key exists in the store.
`get` | Retrieve the value associated with a given key.
`get_partial_values` | Retrieve possibly partial values from given key_ranges.
`is_empty` | Check if the directory is empty.
`list` | Retrieve all keys in the store.
`list_dir` | Retrieve all keys and prefixes with a given prefix and which do not contain the character "/" after the given prefix.
`list_prefix` | Retrieve all keys in the store that begin with a given prefix, relative to the root of the store.
`set` | Store a (key, value) pair.
`set_if_not_exists` | Store a key to `value` if the key is not already present.
`set_partial_values` | Store values at a given key, starting at byte range_start.
`set_virtual_ref` | Store a virtual reference to a chunk.
`set_virtual_ref_async` | Store a virtual reference to a chunk asynchronously.
`set_virtual_refs` | Store multiple virtual references for the same array.
`set_virtual_refs_async` | Store multiple virtual references for the same array asynchronously.
`sync_clear` | Clear the store.
Attributes:
Name | Type | Description
---|---|---
`supports_listing` | `bool` | Does the store support listing?
`supports_partial_writes` | `bool` | Does the store support partial writes?
`supports_writes` | `bool` | Does the store support writes?
#### `` supports_listing `property` #
supports_listing
Does the store support listing?
#### `` supports_partial_writes `property` #
supports_partial_writes
Does the store support partial writes?
#### `` supports_writes `property` #
supports_writes
Does the store support writes?
#### `` __init__ #
__init__(store, for_fork, read_only=None, *args, **kwargs)
Create a new IcechunkStore.
This should not be called directly, instead use the `create`, `open_existing`
or `open_or_create` class methods.
#### `` clear `async` #
clear()
Clear the store.
This will remove all contents from the current session, including all groups
and all arrays. But it will not modify the repository history.
#### `` delete `async` #
delete(key)
Remove a key from the store
Parameters:
Name | Type | Description | Default
---|---|---|---
`key` | `str` | | _required_
#### `` delete_dir `async` #
delete_dir(prefix)
Delete a prefix
Parameters:
Name | Type | Description | Default
---|---|---|---
`prefix` | `str` | | _required_
#### `` exists `async` #
exists(key)
Check if a key exists in the store.
Parameters:
Name | Type | Description | Default
---|---|---|---
`key` | `str` | | _required_
Returns:
Type | Description
---|---
`bool` |
#### `` get `async` #
get(key, prototype, byte_range=None)
Retrieve the value associated with a given key.
Parameters:
Name | Type | Description | Default
---|---|---|---
`key` | `str` | | _required_
`byte_range` | `ByteRequest` | ByteRequest may be one of the following. If not provided, all data associated with the key is retrieved.
* RangeByteRequest(int, int): Request a specific range of bytes in the form (start, end). The end is exclusive. If the given range is zero-length or starts after the end of the object, an error will be returned. Additionally, if the range ends after the end of the object, the entire remainder of the object will be returned. Otherwise, the exact requested range will be returned.
* OffsetByteRequest(int): Request all bytes starting from a given byte offset. This is equivalent to bytes={int}- as an HTTP header.
* SuffixByteRequest(int): Request the last int bytes. Note that here, int is the size of the request, not the byte offset. This is equivalent to bytes=-{int} as an HTTP header.
| `None`
Returns:
Type | Description
---|---
`Buffer` |
#### `` get_partial_values `async` #
get_partial_values(prototype, key_ranges)
Retrieve possibly partial values from given key_ranges.
Parameters:
Name | Type | Description | Default
---|---|---|---
`key_ranges` | `Iterable[tuple[str, tuple[int | None, int | None]]]` | Ordered set of key, range pairs, a key may occur multiple times with different ranges | _required_
Returns:
Type | Description
---|---
`list of values, in the order of the key_ranges, may contain null/none for missing keys` |
#### `` is_empty `async` #
is_empty(prefix)
Check if the directory is empty.
Parameters:
Name | Type | Description | Default
---|---|---|---
`prefix` | `str` | Prefix of keys to check. | _required_
Returns:
Type | Description
---|---
`bool` | True if the store is empty, False otherwise.
#### `` list #
list()
Retrieve all keys in the store.
Returns:
Type | Description
---|---
`AsyncIterator[str, None]` |
#### `` list_dir #
list_dir(prefix)
Retrieve all keys and prefixes with a given prefix and which do not contain
the character “/” after the given prefix.
Parameters:
Name | Type | Description | Default
---|---|---|---
`prefix` | `str` | | _required_
Returns:
Type | Description
---|---
`AsyncIterator[str, None]` |
#### `` list_prefix #
list_prefix(prefix)
Retrieve all keys in the store that begin with a given prefix. Keys are
returned relative to the root of the store.
Parameters:
Name | Type | Description | Default
---|---|---|---
`prefix` | `str` | | _required_
Returns:
Type | Description
---|---
`AsyncIterator[str, None]` |
#### `` set `async` #
set(key, value)
Store a (key, value) pair.
Parameters:
Name | Type | Description | Default
---|---|---|---
`key` | `str` | | _required_
`value` | `Buffer` | | _required_
#### `` set_if_not_exists `async` #
set_if_not_exists(key, value)
Store a key to `value` if the key is not already present.
Parameters:
Name | Type | Description | Default
---|---|---|---
`key` | `str` | | _required_
`value` | `Buffer` | | _required_
#### `` set_partial_values `async` #
set_partial_values(key_start_values)
Store values at a given key, starting at byte range_start.
Parameters:
Name | Type | Description | Default
---|---|---|---
`key_start_values` | `list[tuple[str, int, BytesLike]]` | set of key, range_start, values triples, a key may occur multiple times with different range_starts, range_starts (considering the length of the respective values) must not specify overlapping ranges for the same key | _required_
#### `` set_virtual_ref #
set_virtual_ref(key, location, *, offset, length, checksum=None, validate_container=True)
Store a virtual reference to a chunk.
Parameters:
Name | Type | Description | Default
---|---|---|---
`key` | `str` | The chunk to store the reference under. This is the fully qualified zarr key, e.g. 'array/c/0/0/0' | _required_
`location` | `str` | The location of the chunk in storage. This is the absolute path to the chunk in storage, e.g. 's3://bucket/path/to/file.nc' | _required_
`offset` | `int` | The offset in bytes from the start of the file location in storage the chunk starts at | _required_
`length` | `int` | The length of the chunk in bytes, measured from the given offset | _required_
`checksum` | `str | datetime | None` | The etag or last_modified_at field of the object | `None`
`validate_container` | `bool` | If set to true, fail for locations that don't match any existing virtual chunk container | `True`
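A sketch of writing a single virtual reference (the key, location, offset, and length are hypothetical, and the repository must authorize a matching virtual chunk container):
```python
# Point chunk (0, 0, 0) of `array` at 200 bytes starting at offset 100
# of an existing object; `session` is a writable session.
session.store.set_virtual_ref(
    "array/c/0/0/0",
    "s3://bucket/path/to/file.nc",
    offset=100,
    length=200,
)
```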
#### `` set_virtual_ref_async `async` #
set_virtual_ref_async(key, location, *, offset, length, checksum=None, validate_container=True)
Store a virtual reference to a chunk asynchronously.
Parameters:
Name | Type | Description | Default
---|---|---|---
`key` | `str` | The chunk to store the reference under. This is the fully qualified zarr key, e.g. 'array/c/0/0/0' | _required_
`location` | `str` | The location of the chunk in storage. This is the absolute path to the chunk in storage, e.g. 's3://bucket/path/to/file.nc' | _required_
`offset` | `int` | The offset in bytes from the start of the file location in storage the chunk starts at | _required_
`length` | `int` | The length of the chunk in bytes, measured from the given offset | _required_
`checksum` | `str | datetime | None` | The etag or last_modified_at field of the object | `None`
`validate_container` | `bool` | If set to true, fail for locations that don't match any existing virtual chunk container | `True`
#### `` set_virtual_refs #
set_virtual_refs(array_path, chunks, *, validate_containers=True)
Store multiple virtual references for the same array.
Parameters:
Name | Type | Description | Default
---|---|---|---
`array_path` | `str` | The path to the array inside the Zarr store. Example: "/groupA/groupB/outputs/my-array" | _required_
`chunks` | `list[VirtualChunkSpec]` | The list of virtual chunks to add | _required_
`validate_containers` | `bool` | If set to true, ignore virtual references for locations that don't match any existing virtual chunk container | `True`
Returns:
Type | Description
---|---
`list[tuple[int, ...]] | None` | If all virtual references were successfully updated, it returns None. If there were validation errors, it returns the chunk indices of all failed references.
#### `` set_virtual_refs_async `async` #
set_virtual_refs_async(array_path, chunks, *, validate_containers=True)
Store multiple virtual references for the same array asynchronously.
Parameters:
Name | Type | Description | Default
---|---|---|---
`array_path` | `str` | The path to the array inside the Zarr store. Example: "/groupA/groupB/outputs/my-array" | _required_
`chunks` | `list[VirtualChunkSpec]` | The list of virtual chunks to add | _required_
`validate_containers` | `bool` | If set to true, ignore virtual references for locations that don't match any existing virtual chunk container | `True`
Returns:
Type | Description
---|---
`list[tuple[int, ...]] | None` | If all virtual references were successfully updated, it returns None. If there were validation errors, it returns the chunk indices of all failed references.
#### `` sync_clear #
sync_clear()
Clear the store.
This will remove all contents from the current session, including all groups
and all arrays. But it will not modify the repository history.
### `` ManifestConfig #
Configuration for Icechunk manifests
Methods:
Name | Description
---|---
`__init__` | Create a new `ManifestConfig` object
Attributes:
Name | Type | Description
---|---|---
`preload` | `ManifestPreloadConfig | None` | The configuration for how Icechunk manifests will be preloaded.
`splitting` | `ManifestSplittingConfig | None` | The configuration for how Icechunk manifests will be split.
#### `` preload `property` `writable` #
preload
The configuration for how Icechunk manifests will be preloaded.
Returns:
Type | Description
---|---
`ManifestPreloadConfig | None` | The configuration for how Icechunk manifests will be preloaded.
#### `` splitting `property` `writable` #
splitting
The configuration for how Icechunk manifests will be split.
Returns:
Type | Description
---|---
`ManifestSplittingConfig | None` | The configuration for how Icechunk manifests will be split.
#### `` __init__ #
__init__(preload=None, splitting=None)
Create a new `ManifestConfig` object
Parameters:
Name | Type | Description | Default
---|---|---|---
`preload` | `ManifestPreloadConfig | None` | The configuration for how Icechunk manifests will be preloaded. | `None`
`splitting` | `ManifestSplittingConfig | None` | The configuration for how Icechunk manifests will be split. | `None`
### `` ManifestFileInfo #
Manifest file metadata
Attributes:
Name | Type | Description
---|---|---
`id` | `str` | The manifest id
`num_chunk_refs` | `int` | The number of chunk references contained in this manifest
`size_bytes` | `int` | The size in bytes of the manifest file
#### `` id `property` #
id
The manifest id
#### `` num_chunk_refs `property` #
num_chunk_refs
The number of chunk references contained in this manifest
#### `` size_bytes `property` #
size_bytes
The size in bytes of the manifest file
### `` ManifestPreloadCondition #
Configuration for conditions under which manifests will preload on session
creation
Methods:
Name | Description
---|---
`__and__` | Create a preload condition that matches if both this condition and `other` match.
`__or__` | Create a preload condition that matches if either this condition or `other` match.
`and_conditions` | Create a preload condition that matches only if all passed `conditions` match
`false` | Create a preload condition that never matches any manifests
`name_matches` | Create a preload condition that matches if the array's name matches the passed regex.
`num_refs` | Create a preload condition that matches only if the number of chunk references in the manifest is within the given range.
`or_conditions` | Create a preload condition that matches if any of `conditions` matches
`path_matches` | Create a preload condition that matches if the full path to the array matches the passed regex.
`true` | Create a preload condition that always matches any manifest
#### `` __and__ #
__and__(other)
Create a preload condition that matches if both this condition and `other`
match.
#### `` __or__ #
__or__(other)
Create a preload condition that matches if either this condition or `other`
match.
#### `` and_conditions `staticmethod` #
and_conditions(conditions)
Create a preload condition that matches only if all passed `conditions` match
#### `` false `staticmethod` #
false()
Create a preload condition that never matches any manifests
#### `` name_matches `staticmethod` #
name_matches(regex)
Create a preload condition that matches if the array's name matches the passed
regex.
For example, for an array `/model/outputs/temperature`, the following will
match: `name_matches(".*temp.*")`
#### `` num_refs `staticmethod` #
num_refs(from_refs, to_refs)
Create a preload condition that matches only if the number of chunk references
in the manifest is within the given range.
from_refs is inclusive, to_refs is exclusive.
#### `` or_conditions `staticmethod` #
or_conditions(conditions)
Create a preload condition that matches if any of `conditions` matches
#### `` path_matches `staticmethod` #
path_matches(regex)
Create a preload condition that matches if the full path to the array matches
the passed regex.
Array paths are absolute, as in `/path/to/my/array`
#### `` true `staticmethod` #
true()
Create a preload condition that always matches any manifest
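Conditions compose with `&` and `|`. For example, to preload manifests whose array name contains `temp` and which hold fewer than 10,000 references:
```python
import icechunk

# Both sub-conditions must match for the manifest to preload.
cond = icechunk.ManifestPreloadCondition.name_matches(
    ".*temp.*"
) & icechunk.ManifestPreloadCondition.num_refs(0, 10_000)
```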
### `` ManifestPreloadConfig #
Configuration for how Icechunk manifests preload on session creation
Methods:
Name | Description
---|---
`__init__` | Create a new `ManifestPreloadConfig` object
Attributes:
Name | Type | Description
---|---|---
`max_total_refs` | `int | None` | The maximum number of references to preload.
`preload_if` | `ManifestPreloadCondition | None` | The condition under which manifests will be preloaded.
#### `` max_total_refs `property` `writable` #
max_total_refs
The maximum number of references to preload.
Returns:
Type | Description
---|---
`int | None` | The maximum number of references to preload.
#### `` preload_if `property` `writable` #
preload_if
The condition under which manifests will be preloaded.
Returns:
Type | Description
---|---
`ManifestPreloadCondition | None` | The condition under which manifests will be preloaded.
#### `` __init__ #
__init__(max_total_refs=None, preload_if=None)
Create a new `ManifestPreloadConfig` object
Parameters:
Name | Type | Description | Default
---|---|---|---
`max_total_refs` | `int | None` | The maximum number of references to preload. | `None`
`preload_if` | `ManifestPreloadCondition | None` | The condition under which manifests will be preloaded. | `None`
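A sketch of wiring a preload config into a repository config, assuming the manifest settings are set via the config's `manifest` attribute (the cap below is illustrative):
```python
import icechunk

config = icechunk.RepositoryConfig.default()
config.manifest = icechunk.ManifestConfig(
    preload=icechunk.ManifestPreloadConfig(
        max_total_refs=100_000,  # illustrative cap
        preload_if=icechunk.ManifestPreloadCondition.name_matches("time"),
    ),
)
```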
### `` ManifestSplitCondition #
Configuration for conditions under which manifests will be split
Methods:
Name | Description
---|---
`AnyArray` | Create a splitting condition that matches any array.
`__and__` | Create a splitting condition that matches if both this condition and `other` match
`__or__` | Create a splitting condition that matches if either this condition or `other` matches
`and_conditions` | Create a splitting condition that matches only if all passed `conditions` match
`name_matches` | Create a splitting condition that matches if the array's name matches the passed regex.
`or_conditions` | Create a splitting condition that matches if any of `conditions` matches
`path_matches` | Create a splitting condition that matches if the full path to the array matches the passed regex.
#### `` AnyArray `staticmethod` #
AnyArray()
Create a splitting condition that matches any array.
#### `` __and__ #
__and__(other)
Create a splitting condition that matches if both this condition and `other`
match
#### `` __or__ #
__or__(other)
Create a splitting condition that matches if either this condition or `other`
matches
#### `` and_conditions `staticmethod` #
and_conditions(conditions)
Create a splitting condition that matches only if all passed `conditions`
match
#### `` name_matches `staticmethod` #
name_matches(regex)
Create a splitting condition that matches if the array's name matches the
passed regex.
For example, for an array `/model/outputs/temperature`, the following will
match: `name_matches(".*temp.*")`
#### `` or_conditions `staticmethod` #
or_conditions(conditions)
Create a splitting condition that matches if any of `conditions` matches
#### `` path_matches `staticmethod` #
path_matches(regex)
Create a splitting condition that matches if the full path to the array
matches the passed regex.
Array paths are absolute, as in `/path/to/my/array`
### `` ManifestSplitDimCondition #
Conditions for specifying dimensions along which to shard manifests.
Classes:
Name | Description
---|---
`Any` | Split along any other unspecified dimension.
`Axis` | Split along specified integer axis.
`DimensionName` | Split along specified named dimension.
#### `` Any #
Split along any other unspecified dimension.
#### `` Axis #
Split along specified integer axis.
#### `` DimensionName #
Split along specified named dimension.
### `` ManifestSplittingConfig #
Configuration for manifest splitting.
Methods:
Name | Description
---|---
`__init__` | Configuration for how Icechunk manifests will be split.
Attributes:
Name | Type | Description
---|---|---
`split_sizes` | `SplitSizes` | Configuration for how Icechunk manifests will be split.
#### `` split_sizes `property` `writable` #
split_sizes
Configuration for how Icechunk manifests will be split.
Returns:
Type | Description
---|---
`tuple[tuple[ManifestSplitCondition, tuple[tuple[ManifestSplitDimCondition, int], ...]], ...]` | The configuration for how Icechunk manifests will be split.
#### `` __init__ #
__init__(split_sizes)
Configuration for how Icechunk manifests will be split.
Parameters:
Name | Type | Description | Default
---|---|---|---
`split_sizes` | `SplitSizes` | The configuration for how Icechunk manifests will be split. | _required_
Examples:
Split manifests for the `temperature` array, with 3 chunks per shard along the
`longitude` dimension.
>>> ManifestSplittingConfig.from_dict(
... {
... ManifestSplitCondition.name_matches("temperature"): {
... ManifestSplitDimCondition.DimensionName("longitude"): 3
... }
... }
... )
### `` RebaseFailedError #
Bases: `IcechunkError`
An error that occurs when a rebase operation fails
Attributes:
Name | Type | Description
---|---|---
`conflicts` | `list[Conflict]` | The conflicts that occurred during the rebase operation
`snapshot` | `str` | The snapshot ID that the session was rebased to
#### `` conflicts `property` #
conflicts
The conflicts that occurred during the rebase operation
Returns: list[Conflict]: The conflicts that occurred during the rebase
operation
#### `` snapshot `property` #
snapshot
The snapshot ID that the session was rebased to
### `` Repository #
An Icechunk repository.
Methods:
Name | Description
---|---
`ancestry` | Get the ancestry of a snapshot.
`async_ancestry` | Get the ancestry of a snapshot.
`create` | Create a new Icechunk repository.
`create_async` | Create a new Icechunk repository asynchronously.
`create_branch` | Create a new branch at the given snapshot.
`create_branch_async` | Create a new branch at the given snapshot (async version).
`create_tag` | Create a new tag at the given snapshot.
`create_tag_async` | Create a new tag at the given snapshot (async version).
`default_commit_metadata` | Get the current configured default commit metadata for the repository.
`delete_branch` | Delete a branch.
`delete_branch_async` | Delete a branch (async version).
`delete_tag` | Delete a tag.
`delete_tag_async` | Delete a tag (async version).
`diff` | Compute an overview of the operations executed from version `from` to version `to`.
`diff_async` | Compute an overview of the operations executed from version `from` to version `to` (async version).
`exists` | Check if a repository exists at the given storage location.
`exists_async` | Check if a repository exists at the given storage location (async version).
`expire_snapshots` | Expire all snapshots older than a threshold.
`expire_snapshots_async` | Expire all snapshots older than a threshold (async version).
`fetch_config` | Fetch the configuration for the repository saved in storage.
`fetch_config_async` | Fetch the configuration for the repository saved in storage (async version).
`garbage_collect` | Delete any objects no longer accessible from any branches or tags.
`garbage_collect_async` | Delete any objects no longer accessible from any branches or tags (async version).
`list_branches` | List the branches in the repository.
`list_branches_async` | List the branches in the repository (async version).
`list_tags` | List the tags in the repository.
`list_tags_async` | List the tags in the repository (async version).
`lookup_branch` | Get the tip snapshot ID of a branch.
`lookup_branch_async` | Get the tip snapshot ID of a branch (async version).
`lookup_snapshot` | Get the SnapshotInfo given a snapshot ID
`lookup_snapshot_async` | Get the SnapshotInfo given a snapshot ID (async version)
`lookup_tag` | Get the snapshot ID of a tag.
`lookup_tag_async` | Get the snapshot ID of a tag (async version).
`open` | Open an existing Icechunk repository.
`open_async` | Open an existing Icechunk repository asynchronously.
`open_or_create` | Open an existing Icechunk repository or create a new one if it does not exist.
`open_or_create_async` | Open an existing Icechunk repository or create a new one if it does not exist (async version).
`readonly_session` | Create a read-only session.
`readonly_session_async` | Create a read-only session (async version).
`reopen_async` | Reopen the repository with new configuration or credentials (async version).
`reset_branch` | Reset a branch to a specific snapshot.
`reset_branch_async` | Reset a branch to a specific snapshot (async version).
`rewrite_manifests` | Rewrite manifests for all arrays.
`rewrite_manifests_async` | Rewrite manifests for all arrays (async version).
`save_config` | Save the repository configuration to storage; this configuration will be used in future calls to Repository.open.
`save_config_async` | Save the repository configuration to storage (async version).
`set_default_commit_metadata` | Set the default commit metadata for the repository.
`total_chunks_storage` | Calculate the total storage used for chunks, in bytes.
`total_chunks_storage_async` | Calculate the total storage used for chunks, in bytes (async version).
`transaction` | Create a transaction on a branch.
`writable_session` | Create a writable session on a branch.
`writable_session_async` | Create a writable session on a branch (async version).
Attributes:
Name | Type | Description
---|---|---
`config` | `RepositoryConfig` | Get a copy of this repository's config.
`storage` | `Storage` | Get a copy of this repository's Storage instance.
#### `` config `property` #
config
Get a copy of this repository's config.
Returns:
Type | Description
---|---
`RepositoryConfig` | The repository configuration.
#### `` storage `property` #
storage
Get a copy of this repository's Storage instance.
Returns:
Type | Description
---|---
`Storage` | The repository storage instance.
#### `` ancestry #
ancestry(*, branch=None, tag=None, snapshot_id=None)
Get the ancestry of a snapshot.
Parameters:
Name | Type | Description | Default
---|---|---|---
`branch` | `str` | The branch to get the ancestry of. | `None`
`tag` | `str` | The tag to get the ancestry of. | `None`
`snapshot_id` | `str` | The snapshot ID to get the ancestry of. | `None`
Returns:
Type | Description
---|---
`list[SnapshotInfo]` | The ancestry of the snapshot, listing out the snapshots and their metadata.
Notes
Only one of the arguments can be specified.
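For example, to walk the history of `main` (assuming `repo` is an open repository; the `SnapshotInfo` attributes printed here are assumptions based on its described metadata):
```python
for snapshot in repo.ancestry(branch="main"):
    print(snapshot.id, snapshot.written_at, snapshot.message)
```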
#### `` async_ancestry #
async_ancestry(*, branch=None, tag=None, snapshot_id=None)
Get the ancestry of a snapshot.
Parameters:
Name | Type | Description | Default
---|---|---|---
`branch` | `str` | The branch to get the ancestry of. | `None`
`tag` | `str` | The tag to get the ancestry of. | `None`
`snapshot_id` | `str` | The snapshot ID to get the ancestry of. | `None`
Returns:
Type | Description
---|---
`list[SnapshotInfo]` | The ancestry of the snapshot, listing out the snapshots and their metadata.
Notes
Only one of the arguments can be specified.
#### `` create `classmethod` #
create(storage, config=None, authorize_virtual_chunk_access=None)
Create a new Icechunk repository. If one already exists at the given store
location, an error will be raised.
Warning
Attempting to create a Repo concurrently in the same location from multiple
processes is not safe. Instead, create a Repo once and then open it
concurrently.
Parameters:
Name | Type | Description | Default
---|---|---|---
`storage` | `Storage` | The storage configuration for the repository. | _required_
`config` | `RepositoryConfig` | The repository configuration. If not provided, a default configuration will be used. | `None`
`authorize_virtual_chunk_access` | `dict[str, AnyCredential | None]` | Authorize Icechunk to access virtual chunks in these containers. A mapping from container url_prefix to the credentials to use to access chunks in that container. If credential is `None`, they will be fetched from the environment, or anonymous credentials will be used if the container allows it. As a security measure, Icechunk will block access to virtual chunks if the container is not authorized using this argument. | `None`
Returns:
Type | Description
---|---
`Self` | An instance of the Repository class.
#### `` create_async `async` `classmethod` #
create_async(storage, config=None, authorize_virtual_chunk_access=None)
Create a new Icechunk repository asynchronously. If one already exists at the
given store location, an error will be raised.
Warning
Attempting to create a Repo concurrently in the same location from multiple
processes is not safe. Instead, create a Repo once and then open it
concurrently.
Parameters:
Name | Type | Description | Default
---|---|---|---
`storage` | `Storage` | The storage configuration for the repository. | _required_
`config` | `RepositoryConfig` | The repository configuration. If not provided, a default configuration will be used. | `None`
`authorize_virtual_chunk_access` | `dict[str, AnyCredential | None]` | Authorize Icechunk to access virtual chunks in these containers. A mapping from container url_prefix to the credentials to use to access chunks in that container. If credential is `None`, they will be fetched from the environment, or anonymous credentials will be used if the container allows it. As a security measure, Icechunk will block access to virtual chunks if the container is not authorized using this argument. | `None`
Returns:
Type | Description
---|---
`Self` | An instance of the Repository class.
#### `` create_branch #
create_branch(branch, snapshot_id)
Create a new branch at the given snapshot.
Parameters:
Name | Type | Description | Default
---|---|---|---
`branch` | `str` | The name of the branch to create. | _required_
`snapshot_id` | `str` | The snapshot ID to create the branch at. | _required_
Returns:
Type | Description
---|---
`None` |
#### `` create_branch_async `async` #
create_branch_async(branch, snapshot_id)
Create a new branch at the given snapshot (async version).
Parameters:
Name | Type | Description | Default
---|---|---|---
`branch` | `str` | The name of the branch to create. | _required_
`snapshot_id` | `str` | The snapshot ID to create the branch at. | _required_
Returns:
Type | Description
---|---
`None` |
#### `` create_tag #
create_tag(tag, snapshot_id)
Create a new tag at the given snapshot.
Parameters:
Name | Type | Description | Default
---|---|---|---
`tag` | `str` | The name of the tag to create. | _required_
`snapshot_id` | `str` | The snapshot ID to create the tag at. | _required_
Returns:
Type | Description
---|---
`None` |
#### `` create_tag_async `async` #
create_tag_async(tag, snapshot_id)
Create a new tag at the given snapshot (async version).
Parameters:
Name | Type | Description | Default
---|---|---|---
`tag` | `str` | The name of the tag to create. | _required_
`snapshot_id` | `str` | The snapshot ID to create the tag at. | _required_
Returns:
Type | Description
---|---
`None` |
#### `` default_commit_metadata #
default_commit_metadata()
Get the current configured default commit metadata for the repository.
Returns:
Type | Description
---|---
`dict[str, Any]` | The default commit metadata.
#### `` delete_branch #
delete_branch(branch)
Delete a branch.
Parameters:
Name | Type | Description | Default
---|---|---|---
`branch` | `str` | The branch to delete. | _required_
Returns:
Type | Description
---|---
`None` |
#### `` delete_branch_async `async` #
delete_branch_async(branch)
Delete a branch (async version).
Parameters:
Name | Type | Description | Default
---|---|---|---
`branch` | `str` | The branch to delete. | _required_
Returns:
Type | Description
---|---
`None` |
#### `` delete_tag #
delete_tag(tag)
Delete a tag.
Parameters:
Name | Type | Description | Default
---|---|---|---
`tag` | `str` | The tag to delete. | _required_
Returns:
Type | Description
---|---
`None` |
#### `` delete_tag_async `async` #
delete_tag_async(tag)
Delete a tag (async version).
Parameters:
Name | Type | Description | Default
---|---|---|---
`tag` | `str` | The tag to delete. | _required_
Returns:
Type | Description
---|---
`None` |
#### `` diff #
diff(*, from_branch=None, from_tag=None, from_snapshot_id=None, to_branch=None, to_tag=None, to_snapshot_id=None)
Compute an overview of the operations executed from version `from` to version
`to`.
Both versions, `from` and `to`, must be identified. Identification can be done
using a branch, tag or snapshot id. The styles used to identify the `from` and
`to` versions can be different.
The `from` version must be a member of the `ancestry` of `to`.
Returns:
Type | Description
---|---
`Diff` | The operations executed between the two versions
#### `` diff_async `async` #
diff_async(*, from_branch=None, from_tag=None, from_snapshot_id=None, to_branch=None, to_tag=None, to_snapshot_id=None)
Compute an overview of the operations executed from version `from` to version
`to` (async version).
Both versions, `from` and `to`, must be identified. Identification can be done
using a branch, tag or snapshot id. The styles used to identify the `from` and
`to` versions can be different.
The `from` version must be a member of the `ancestry` of `to`.
Returns:
Type | Description
---|---
`Diff` | The operations executed between the two versions
#### `` exists `staticmethod` #
exists(storage)
Check if a repository exists at the given storage location.
Parameters:
Name | Type | Description | Default
---|---|---|---
`storage` | `Storage` | The storage configuration for the repository. | _required_
Returns:
Type | Description
---|---
`bool` | True if the repository exists, False otherwise.
#### `` exists_async `async` `staticmethod` #
exists_async(storage)
Check if a repository exists at the given storage location (async version).
Parameters:
Name | Type | Description | Default
---|---|---|---
`storage` | `Storage` | The storage configuration for the repository. | _required_
Returns:
Type | Description
---|---
`bool` | True if the repository exists, False otherwise.
#### `` expire_snapshots #
expire_snapshots(older_than, *, delete_expired_branches=False, delete_expired_tags=False)
Expire all snapshots older than a threshold.
This processes snapshots found by navigating all references in the repo, tags
first, branches later, both in lexicographical order.
Returns the ids of all snapshots considered expired and skipped from history.
Note that these snapshots are not necessarily available for garbage
collection; they could still be pointed to by other refs.
If `delete_expired_*` is set to True, branches or tags that, after the
expiration process, point to expired snapshots directly, will be deleted.
Danger
This is an administrative operation, it should be run carefully. The
repository can still operate concurrently while `expire_snapshots` runs, but
other readers can get inconsistent views of the repository history.
Parameters:
Name | Type | Description | Default
---|---|---|---
`older_than` | `datetime` | Expire snapshots older than this time. | _required_
`delete_expired_branches` | `bool` | Whether to delete any branches that now have only expired snapshots. | `False`
`delete_expired_tags` | `bool` | Whether to delete any tags associated with expired snapshots | `False`
Returns:
Type | Description
---|---
`set[str]` | The IDs of all expired snapshots.
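A sketch of expiring month-old snapshots, assuming `repo` is an open repository:
```python
from datetime import datetime, timedelta, timezone

expired = repo.expire_snapshots(
    older_than=datetime.now(timezone.utc) - timedelta(days=30)
)
print(f"expired {len(expired)} snapshots")
```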
#### `` expire_snapshots_async `async` #
expire_snapshots_async(older_than, *, delete_expired_branches=False, delete_expired_tags=False)
Expire all snapshots older than a threshold (async version).
This processes snapshots found by navigating all references in the repo, tags
first, branches later, both in lexicographical order.
Returns the ids of all snapshots considered expired and skipped from history.
Note that these snapshots are not necessarily available for garbage
collection; they could still be pointed to by other refs.
If `delete_expired_*` is set to True, branches or tags that, after the
expiration process, point to expired snapshots directly, will be deleted.
Danger
This is an administrative operation, it should be run carefully. The
repository can still operate concurrently while `expire_snapshots` runs, but
other readers can get inconsistent views of the repository history.
Parameters:
Name | Type | Description | Default
---|---|---|---
`older_than` | `datetime` | Expire snapshots older than this time. | _required_
`delete_expired_branches` | `bool` | Whether to delete any branches that now have only expired snapshots. | `False`
`delete_expired_tags` | `bool` | Whether to delete any tags associated with expired snapshots | `False`
Returns:
Type | Description
---|---
`set[str]` | The IDs of all expired snapshots.
#### `` fetch_config `staticmethod` #
fetch_config(storage)
Fetch the configuration for the repository saved in storage.
Parameters:
Name | Type | Description | Default
---|---|---|---
`storage` | `Storage` | The storage configuration for the repository. | _required_
Returns:
Type | Description
---|---
`RepositoryConfig | None` | The repository configuration if it exists, None otherwise.
#### `` fetch_config_async `async` `staticmethod` #
fetch_config_async(storage)
Fetch the configuration for the repository saved in storage (async version).
Parameters:
Name | Type | Description | Default
---|---|---|---
`storage` | `Storage` | The storage configuration for the repository. | _required_
Returns:
Type | Description
---|---
`RepositoryConfig | None` | The repository configuration if it exists, None otherwise.
#### `` garbage_collect #
garbage_collect(delete_object_older_than, *, dry_run=False, max_snapshots_in_memory=50, max_compressed_manifest_mem_bytes=512 * 1024 * 1024, max_concurrent_manifest_fetches=500)
Delete any objects no longer accessible from any branches or tags.
Danger
This is an administrative operation, it should be run carefully. The
repository can still operate concurrently while `garbage_collect` runs, but
other readers can get inconsistent views if they are trying to access the
expired snapshots.
Parameters:
Name | Type | Description | Default
---|---|---|---
`delete_object_older_than` | `datetime` | Delete objects older than this time. | _required_
`dry_run` | `bool` | Report results but don't delete any objects | `False`
`max_snapshots_in_memory` | `int` | Don't prefetch more than this many Snapshots to memory. | `50`
`max_compressed_manifest_mem_bytes` | `int` | Don't use more than this memory to store compressed in-flight manifests. | `512 * 1024 * 1024`
`max_concurrent_manifest_fetches` | `int` | Don't run more than this many concurrent manifest fetches. | `500`
Returns:
Type | Description
---|---
`GCSummary` | Summary of objects deleted.
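A typical invocation runs a dry run first and only then deletes (a sketch, assuming an already-opened `repo`):
```python
import datetime

cutoff = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(days=30)
# Report what would be deleted without touching storage...
summary = repo.garbage_collect(cutoff, dry_run=True)
# ...then delete for real once the summary looks right.
summary = repo.garbage_collect(cutoff)
```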
#### `` garbage_collect_async `async` #
garbage_collect_async(delete_object_older_than, *, dry_run=False, max_snapshots_in_memory=50, max_compressed_manifest_mem_bytes=512 * 1024 * 1024, max_concurrent_manifest_fetches=500)
Delete any objects no longer accessible from any branches or tags (async
version).
Danger
This is an administrative operation, it should be run carefully. The
repository can still operate concurrently while `garbage_collect` runs, but
other readers can get inconsistent views if they are trying to access the
expired snapshots.
Parameters:
Name | Type | Description | Default
---|---|---|---
`delete_object_older_than` | `datetime` | Delete objects older than this time. | _required_
`dry_run` | `bool` | Report results but don't delete any objects | `False`
`max_snapshots_in_memory` | `int` | Don't prefetch more than this many Snapshots to memory. | `50`
`max_compressed_manifest_mem_bytes` | `int` | Don't use more than this memory to store compressed in-flight manifests. | `512 * 1024 * 1024`
`max_concurrent_manifest_fetches` | `int` | Don't run more than this many concurrent manifest fetches. | `500`
Returns:
Type | Description
---|---
`GCSummary` | Summary of objects deleted.
#### `` list_branches #
list_branches()
List the branches in the repository.
Returns:
Type | Description
---|---
`set[str]` | A set of branch names.
#### `` list_branches_async `async` #
list_branches_async()
List the branches in the repository (async version).
Returns:
Type | Description
---|---
`set[str]` | A set of branch names.
#### `` list_tags #
list_tags()
List the tags in the repository.
Returns:
Type | Description
---|---
`set[str]` | A set of tag names.
#### `` list_tags_async `async` #
list_tags_async()
List the tags in the repository (async version).
Returns:
Type | Description
---|---
`set[str]` | A set of tag names.
#### `` lookup_branch #
lookup_branch(branch)
Get the tip snapshot ID of a branch.
Parameters:
Name | Type | Description | Default
---|---|---|---
`branch` | `str` | The branch to get the tip of. | _required_
Returns:
Type | Description
---|---
`str` | The snapshot ID of the tip of the branch.
#### `` lookup_branch_async `async` #
lookup_branch_async(branch)
Get the tip snapshot ID of a branch (async version).
Parameters:
Name | Type | Description | Default
---|---|---|---
`branch` | `str` | The branch to get the tip of. | _required_
Returns:
Type | Description
---|---
`str` | The snapshot ID of the tip of the branch.
#### `` lookup_snapshot #
lookup_snapshot(snapshot_id)
Get the SnapshotInfo given a snapshot ID
Parameters:
Name | Type | Description | Default
---|---|---|---
`snapshot_id` | `str` | The id of the snapshot to look up | _required_
Returns:
Type | Description
---|---
`SnapshotInfo` |
#### `` lookup_snapshot_async `async` #
lookup_snapshot_async(snapshot_id)
Get the SnapshotInfo given a snapshot ID (async version)
Parameters:
Name | Type | Description | Default
---|---|---|---
`snapshot_id` | `str` | The id of the snapshot to look up | _required_
Returns:
Type | Description
---|---
`SnapshotInfo` |
#### `` lookup_tag #
lookup_tag(tag)
Get the snapshot ID of a tag.
Parameters:
Name | Type | Description | Default
---|---|---|---
`tag` | `str` | The tag to get the snapshot ID of. | _required_
Returns:
Type | Description
---|---
`str` | The snapshot ID of the tag.
#### `` lookup_tag_async `async` #
lookup_tag_async(tag)
Get the snapshot ID of a tag (async version).
Parameters:
Name | Type | Description | Default
---|---|---|---
`tag` | `str` | The tag to get the snapshot ID of. | _required_
Returns:
Type | Description
---|---
`str` | The snapshot ID of the tag.
#### `` open `classmethod` #
open(storage, config=None, authorize_virtual_chunk_access=None)
Open an existing Icechunk repository.
If no repository exists at the given storage location, an error will be
raised.
Warning
This method must be used with care in a multiprocessing context. Read more in
our [Parallel Write Guide](../parallel/#uncooperative-distributed-writes).
Parameters:
Name | Type | Description | Default
---|---|---|---
`storage` | `Storage` | The storage configuration for the repository. | _required_
`config` | `RepositoryConfig` | The repository settings. If not provided, a default configuration will be loaded from the repository. | `None`
`authorize_virtual_chunk_access` | `dict[str, AnyCredential | None]` | Authorize Icechunk to access virtual chunks in these containers. A mapping from container url_prefix to the credentials to use to access chunks in that container. If a credential is `None`, it will be fetched from the environment, or anonymous credentials will be used if the container allows it. As a security measure, Icechunk will block access to virtual chunks if the container is not authorized using this argument. | `None`
Returns:
Type | Description
---|---
`Self` | An instance of the Repository class.
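A minimal sketch of opening an existing repository (the local path is hypothetical):
```python
import icechunk

storage = icechunk.local_filesystem_storage("/path/to/repo")
repo = icechunk.Repository.open(storage)
```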
#### `` open_async `async` `classmethod` #
open_async(storage, config=None, authorize_virtual_chunk_access=None)
Open an existing Icechunk repository asynchronously.
If no repository exists at the given storage location, an error will be
raised.
Warning
This method must be used with care in a multiprocessing context. Read more in
our [Parallel Write Guide](../parallel/#uncooperative-distributed-writes).
Parameters:
Name | Type | Description | Default
---|---|---|---
`storage` | `Storage` | The storage configuration for the repository. | _required_
`config` | `RepositoryConfig` | The repository settings. If not provided, a default configuration will be loaded from the repository. | `None`
`authorize_virtual_chunk_access` | `dict[str, AnyCredential | None]` | Authorize Icechunk to access virtual chunks in these containers. A mapping from container url_prefix to the credentials to use to access chunks in that container. If a credential is `None`, it will be fetched from the environment, or anonymous credentials will be used if the container allows it. As a security measure, Icechunk will block access to virtual chunks if the container is not authorized using this argument. | `None`
Returns:
Type | Description
---|---
`Self` | An instance of the Repository class.
#### `` open_or_create `classmethod` #
open_or_create(storage, config=None, authorize_virtual_chunk_access=None)
Open an existing Icechunk repository or create a new one if it does not exist.
Warning
This method must be used with care in a multiprocessing context. Read more in
our [Parallel Write Guide](../parallel/#uncooperative-distributed-writes).
Attempting to create a Repo concurrently in the same location from multiple
processes is not safe. Instead, create a Repo once and then open it
concurrently.
Parameters:
Name | Type | Description | Default
---|---|---|---
`storage` | `Storage` | The storage configuration for the repository. | _required_
`config` | `RepositoryConfig` | The repository settings. If not provided, a default configuration will be loaded from the repository. | `None`
`authorize_virtual_chunk_access` | `dict[str, AnyCredential | None]` | Authorize Icechunk to access virtual chunks in these containers. A mapping from container url_prefix to the credentials to use to access chunks in that container. If a credential is `None`, it will be fetched from the environment, or anonymous credentials will be used if the container allows it. As a security measure, Icechunk will block access to virtual chunks if the container is not authorized using this argument. | `None`
Returns:
Type | Description
---|---
`Self` | An instance of the Repository class.
#### `` open_or_create_async `async` `classmethod` #
open_or_create_async(storage, config=None, authorize_virtual_chunk_access=None)
Open an existing Icechunk repository or create a new one if it does not exist
(async version).
Warning
This method must be used with care in a multiprocessing context. Read more in
our [Parallel Write Guide](../parallel/#uncooperative-distributed-writes).
Attempting to create a Repo concurrently in the same location from multiple
processes is not safe. Instead, create a Repo once and then open it
concurrently.
Parameters:
Name | Type | Description | Default
---|---|---|---
`storage` | `Storage` | The storage configuration for the repository. | _required_
`config` | `RepositoryConfig` | The repository settings. If not provided, a default configuration will be loaded from the repository. | `None`
`authorize_virtual_chunk_access` | `dict[str, AnyCredential | None]` | Authorize Icechunk to access virtual chunks in these containers. A mapping from container url_prefix to the credentials to use to access chunks in that container. If a credential is `None`, it will be fetched from the environment, or anonymous credentials will be used if the container allows it. As a security measure, Icechunk will block access to virtual chunks if the container is not authorized using this argument. | `None`
Returns:
Type | Description
---|---
`Self` | An instance of the Repository class.
#### `` readonly_session #
readonly_session(branch=None, *, tag=None, snapshot_id=None, as_of=None)
Create a read-only session.
This can be thought of as a read-only checkout of the repository at a given
snapshot. When branch or tag are provided, the session will be based on the
tip of the branch or the snapshot ID of the tag.
Parameters:
Name | Type | Description | Default
---|---|---|---
`branch` | `str` | If provided, the branch to create the session on. | `None`
`tag` | `str` | If provided, the tag to create the session on. | `None`
`snapshot_id` | `str` | If provided, the snapshot ID to create the session on. | `None`
`as_of` | `datetime | None` | When combined with the branch argument, it will open the session at the last snapshot that is at or before this datetime | `None`
Returns:
Type | Description
---|---
`Session` | The read-only session, pointing to the specified snapshot, tag, or branch.
Notes
Only one of the arguments can be specified.
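For example (a sketch; the tag name and timestamp are hypothetical):
```python
import datetime

# Check out the tip of a branch...
session = repo.readonly_session("main")
# ...or a tag...
session = repo.readonly_session(tag="v1.0")
# ...or a branch as it was at a point in time.
session = repo.readonly_session(
    "main", as_of=datetime.datetime(2024, 1, 1, tzinfo=datetime.timezone.utc)
)
```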
#### `` readonly_session_async `async` #
readonly_session_async(branch=None, *, tag=None, snapshot_id=None, as_of=None)
Create a read-only session (async version).
This can be thought of as a read-only checkout of the repository at a given
snapshot. When branch or tag are provided, the session will be based on the
tip of the branch or the snapshot ID of the tag.
Parameters:
Name | Type | Description | Default
---|---|---|---
`branch` | `str` | If provided, the branch to create the session on. | `None`
`tag` | `str` | If provided, the tag to create the session on. | `None`
`snapshot_id` | `str` | If provided, the snapshot ID to create the session on. | `None`
`as_of` | `datetime | None` | When combined with the branch argument, it will open the session at the last snapshot that is at or before this datetime | `None`
Returns:
Type | Description
---|---
`Session` | The read-only session, pointing to the specified snapshot, tag, or branch.
Notes
Only one of the arguments can be specified.
#### `` reopen_async `async` #
reopen_async(config=None, authorize_virtual_chunk_access=None)
Reopen the repository with new configuration or credentials (async version).
Parameters:
Name | Type | Description | Default
---|---|---|---
`config` | `RepositoryConfig` | The new repository configuration. If not provided, uses the existing configuration. | `None`
`authorize_virtual_chunk_access` | `dict[str, AnyCredential | None]` | New virtual chunk access credentials. | `None`
Returns:
Type | Description
---|---
`Self` | A new Repository instance with the updated configuration.
#### `` reset_branch #
reset_branch(branch, snapshot_id)
Reset a branch to a specific snapshot.
This will permanently alter the history of the branch such that the tip of the
branch is the specified snapshot.
Parameters:
Name | Type | Description | Default
---|---|---|---
`branch` | `str` | The branch to reset. | _required_
`snapshot_id` | `str` | The snapshot ID to reset the branch to. | _required_
Returns:
Type | Description
---|---
`None` |
#### `` reset_branch_async `async` #
reset_branch_async(branch, snapshot_id)
Reset a branch to a specific snapshot (async version).
This will permanently alter the history of the branch such that the tip of the
branch is the specified snapshot.
Parameters:
Name | Type | Description | Default
---|---|---|---
`branch` | `str` | The branch to reset. | _required_
`snapshot_id` | `str` | The snapshot ID to reset the branch to. | _required_
Returns:
Type | Description
---|---
`None` |
#### `` rewrite_manifests #
rewrite_manifests(message, *, branch, metadata=None)
Rewrite manifests for all arrays.
This method will start a new writable session on the specified branch, rewrite
manifests for all arrays, and then commit with the specified `message` and
`metadata`.
A JSON representation of the currently active splitting configuration will be
stored in the commit's metadata under the key `"splitting_config"`.
Parameters:
Name | Type | Description | Default
---|---|---|---
`message` | `str` | The message to write with the commit. | _required_
`branch` | `str` | The branch to commit to. | _required_
`metadata` | `dict[str, Any] | None` | Additional metadata to store with the commit snapshot. | `None`
Returns:
Type | Description
---|---
`str` | The snapshot ID of the new commit.
#### `` rewrite_manifests_async `async` #
rewrite_manifests_async(message, *, branch, metadata=None)
Rewrite manifests for all arrays (async version).
This method will start a new writable session on the specified branch, rewrite
manifests for all arrays, and then commit with the specified `message` and
`metadata`.
A JSON representation of the currently active splitting configuration will be
stored in the commit's metadata under the key `"splitting_config"`.
Parameters:
Name | Type | Description | Default
---|---|---|---
`message` | `str` | The message to write with the commit. | _required_
`branch` | `str` | The branch to commit to. | _required_
`metadata` | `dict[str, Any] | None` | Additional metadata to store with the commit snapshot. | `None`
Returns:
Type | Description
---|---
`str` | The snapshot ID of the new commit.
#### `` save_config #
save_config()
Save the repository configuration to storage; this configuration will be used
in future calls to `Repository.open`.
Returns:
Type | Description
---|---
`None` |
#### `` save_config_async `async` #
save_config_async()
Save the repository configuration to storage (async version).
Returns:
Type | Description
---|---
`None` |
#### `` set_default_commit_metadata #
set_default_commit_metadata(metadata)
Set the default commit metadata for the repository. This is useful for
providing additional static system context metadata to all commits.
When a commit is made, the metadata will be merged with the metadata provided,
with any duplicate keys being overwritten by the metadata provided in the
commit.
Warning
This metadata is only applied to sessions that are created after this call.
Any open writable sessions will not be affected and will not use the new
default metadata.
Parameters:
Name | Type | Description | Default
---|---|---|---
`metadata` | `dict[str, Any]` | The default commit metadata. Pass an empty dict to clear the default metadata. | _required_
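A sketch of how this interacts with per-commit metadata (the keys are hypothetical):
```python
# Applied to all commits from sessions created after this call.
repo.set_default_commit_metadata({"service": "ingest-worker", "version": "1.2.3"})

session = repo.writable_session("main")
# ... write data via session.store ...
# "run_id" is merged with the defaults; duplicate keys passed here win.
session.commit("hourly update", metadata={"run_id": "abc123"})
```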
#### `` total_chunks_storage #
total_chunks_storage(*, max_snapshots_in_memory=50, max_compressed_manifest_mem_bytes=512 * 1024 * 1024, max_concurrent_manifest_fetches=500)
Calculate the total storage used for chunks, in bytes.
It reports the storage needed to store all snapshots in the repository that
are reachable from any branches or tags. Unreachable snapshots can be
generated by using `reset_branch` or `expire_snapshots`. The chunks for these
snapshots are not included in the result, and they should probably be deleted
using `garbage_collect`.
The result includes only native chunks; virtual and inline chunks are not counted.
Parameters:
Name | Type | Description | Default
---|---|---|---
`max_snapshots_in_memory` | `int` | Don't prefetch more than this many Snapshots to memory. | `50`
`max_compressed_manifest_mem_bytes` | `int` | Don't use more than this memory to store compressed in-flight manifests. | `512 * 1024 * 1024`
`max_concurrent_manifest_fetches` | `int` | Don't run more than this many concurrent manifest fetches. | `500`
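A minimal sketch (assuming the return value is an integer byte count):
```python
nbytes = repo.total_chunks_storage()
print(f"native chunks use {nbytes / 1e9:.2f} GB")
```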
#### `` total_chunks_storage_async `async` #
total_chunks_storage_async(*, max_snapshots_in_memory=50, max_compressed_manifest_mem_bytes=512 * 1024 * 1024, max_concurrent_manifest_fetches=500)
Calculate the total storage used for chunks, in bytes (async version).
It reports the storage needed to store all snapshots in the repository that
are reachable from any branches or tags. Unreachable snapshots can be
generated by using `reset_branch` or `expire_snapshots`. The chunks for these
snapshots are not included in the result, and they should probably be deleted
using `garbage_collect`.
The result includes only native chunks; virtual and inline chunks are not counted.
Parameters:
Name | Type | Description | Default
---|---|---|---
`max_snapshots_in_memory` | `int` | Don't prefetch more than this many Snapshots to memory. | `50`
`max_compressed_manifest_mem_bytes` | `int` | Don't use more than this memory to store compressed in-flight manifests. | `512 * 1024 * 1024`
`max_concurrent_manifest_fetches` | `int` | Don't run more than this many concurrent manifest fetches. | `500`
#### `` transaction #
transaction(branch, *, message, metadata=None, rebase_with=None, rebase_tries=1000)
Create a transaction on a branch.
This is a context manager that creates a writable session on the specified
branch. When the context is exited, the session will be committed to the
branch using the specified message.
Parameters:
Name | Type | Description | Default
---|---|---|---
`branch` | `str` | The branch to create the transaction on. | _required_
`message` | `str` | The commit message to use when committing the session. | _required_
`metadata` | `dict[str, Any] | None` | Additional metadata to store with the commit snapshot. | `None`
`rebase_with` | `ConflictSolver | None` | If other session committed while the current session was writing, use Session.rebase with this solver. | `None`
`rebase_tries` | `int` | If other session committed while the current session was writing, use Session.rebase up to this many times in a loop. | `1000`
Yields:
Name | Type | Description
---|---|---
`store` | `IcechunkStore` | A Zarr Store which can be used to interact with the data in the repository.
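A sketch of a transaction, assuming zarr-python 3 is installed; the commit happens automatically when the context exits without error:
```python
import zarr

with repo.transaction("main", message="add temperature array") as store:
    root = zarr.group(store=store)
    root.create_array("temperature", shape=(100,), dtype="f4")
```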
#### `` writable_session #
writable_session(branch)
Create a writable session on a branch.
Like the read-only session, this can be thought of as a checkout of the
repository at the tip of the branch. However, this session is writable and can
be used to make changes to the repository. When ready, the changes can be
committed to the branch, after which the session will become a read-only
session on the new snapshot.
Parameters:
Name | Type | Description | Default
---|---|---|---
`branch` | `str` | The branch to create the session on. | _required_
Returns:
Type | Description
---|---
`Session` | The writable session on the branch.
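For example (a sketch; assumes zarr-python 3 and an existing "main" branch):
```python
import zarr

session = repo.writable_session("main")
root = zarr.open_group(store=session.store, mode="a")
root.attrs["updated"] = True
snapshot_id = session.commit("update root attributes")
```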
#### `` writable_session_async `async` #
writable_session_async(branch)
Create a writable session on a branch (async version).
Like the read-only session, this can be thought of as a checkout of the
repository at the tip of the branch. However, this session is writable and can
be used to make changes to the repository. When ready, the changes can be
committed to the branch, after which the session will become a read-only
session on the new snapshot.
Parameters:
Name | Type | Description | Default
---|---|---|---
`branch` | `str` | The branch to create the session on. | _required_
Returns:
Type | Description
---|---
`Session` | The writable session on the branch.
### `` RepositoryConfig #
Configuration for an Icechunk repository
Methods:
Name | Description
---|---
`__init__` | Create a new `RepositoryConfig` object
`clear_virtual_chunk_containers` | Clear all virtual chunk containers from the repository.
`default` | Create a default repository config instance
`get_virtual_chunk_container` | Get the virtual chunk container for the repository associated with the given name.
`set_virtual_chunk_container` | Set the virtual chunk container for the repository.
Attributes:
Name | Type | Description
---|---|---
`caching` | `CachingConfig | None` | The caching configuration for the repository.
`compression` | `CompressionConfig | None` | The compression configuration for the repository.
`get_partial_values_concurrency` | `int | None` | The number of concurrent requests to make when getting partial values from storage.
`inline_chunk_threshold_bytes` | `int | None` | The maximum size of a chunk that will be stored inline in the repository. Chunks larger than this size will be written to storage.
`manifest` | `ManifestConfig | None` | The manifest configuration for the repository.
`storage` | `StorageSettings | None` | The storage configuration for the repository.
`virtual_chunk_containers` | `dict[str, VirtualChunkContainer] | None` | The virtual chunk containers for the repository.
#### `` caching `property` `writable` #
caching
The caching configuration for the repository.
Returns:
Type | Description
---|---
`CachingConfig | None` | The caching configuration for the repository.
#### `` compression `property` `writable` #
compression
The compression configuration for the repository.
Returns:
Type | Description
---|---
`CompressionConfig | None` | The compression configuration for the repository.
#### `` get_partial_values_concurrency `property` `writable` #
get_partial_values_concurrency
The number of concurrent requests to make when getting partial values from
storage.
Returns:
Type | Description
---|---
`int | None` | The number of concurrent requests to make when getting partial values from storage.
#### `` inline_chunk_threshold_bytes `property` `writable` #
inline_chunk_threshold_bytes
The maximum size of a chunk that will be stored inline in the repository.
Chunks larger than this size will be written to storage.
#### `` manifest `property` `writable` #
manifest
The manifest configuration for the repository.
Returns:
Type | Description
---|---
`ManifestConfig | None` | The manifest configuration for the repository.
#### `` storage `property` `writable` #
storage
The storage configuration for the repository.
Returns:
Type | Description
---|---
`StorageSettings | None` | The storage configuration for the repository.
#### `` virtual_chunk_containers `property` #
virtual_chunk_containers
The virtual chunk containers for the repository.
Returns:
Type | Description
---|---
`dict[str, VirtualChunkContainer] | None` | The virtual chunk containers for the repository.
#### `` __init__ #
__init__(inline_chunk_threshold_bytes=None, get_partial_values_concurrency=None, compression=None, caching=None, storage=None, virtual_chunk_containers=None, manifest=None)
Create a new `RepositoryConfig` object
Parameters:
Name | Type | Description | Default
---|---|---|---
`inline_chunk_threshold_bytes` | `int | None` | The maximum size of a chunk that will be stored inline in the repository. | `None`
`get_partial_values_concurrency` | `int | None` | The number of concurrent requests to make when getting partial values from storage. | `None`
`compression` | `CompressionConfig | None` | The compression configuration for the repository. | `None`
`caching` | `CachingConfig | None` | The caching configuration for the repository. | `None`
`storage` | `StorageSettings | None` | The storage configuration for the repository. | `None`
`virtual_chunk_containers` | `dict[str, VirtualChunkContainer] | None` | The virtual chunk containers for the repository. | `None`
`manifest` | `ManifestConfig | None` | The manifest configuration for the repository. | `None`
#### `` clear_virtual_chunk_containers #
clear_virtual_chunk_containers()
Clear all virtual chunk containers from the repository.
#### `` default `staticmethod` #
default()
Create a default repository config instance
#### `` get_virtual_chunk_container #
get_virtual_chunk_container(name)
Get the virtual chunk container for the repository associated with the given
name.
Parameters:
Name | Type | Description | Default
---|---|---|---
`name` | `str` | The name of the virtual chunk container to get. | _required_
Returns:
Type | Description
---|---
`VirtualChunkContainer | None` | The virtual chunk container for the repository associated with the given name.
#### `` set_virtual_chunk_container #
set_virtual_chunk_container(cont)
Set the virtual chunk container for the repository.
Parameters:
Name | Type | Description | Default
---|---|---|---
`cont` | `VirtualChunkContainer` | The virtual chunk container to set. | _required_
### `` S3Credentials #
Credentials for an S3 storage backend
Classes:
Name | Description
---|---
`Anonymous` | Does not sign requests, useful for public buckets
`FromEnv` | Uses credentials from environment variables
`Refreshable` | Allows for an outside authority to pass in a function that can be used to provide credentials.
`Static` | Uses s3 credentials without expiration
#### `` Anonymous #
Does not sign requests, useful for public buckets
#### `` FromEnv #
Uses credentials from environment variables
#### `` Refreshable #
Allows for an outside authority to pass in a function that can be used to
provide credentials.
This is useful for credentials that have an expiration time, or are otherwise
not known ahead of time.
Parameters:
Name | Type | Description | Default
---|---|---|---
`pickled_function` | `bytes` | The pickled function to use to provide credentials. | _required_
`current` | `S3StaticCredentials | None` | The initial credentials. They will be returned the first time credentials are requested and then deleted. | `None`
#### `` Static #
Uses s3 credentials without expiration
Parameters:
Name | Type | Description | Default
---|---|---|---
`credentials` | `S3StaticCredentials` | The credentials to use for authentication. | _required_
### `` S3Options #
Options for accessing an S3-compatible storage backend
Methods:
Name | Description
---|---
`__init__` | Create a new `S3Options` object
#### `` __init__ #
__init__(region=None, endpoint_url=None, allow_http=False, anonymous=False, force_path_style=False)
Create a new `S3Options` object
Parameters:
Name | Type | Description | Default
---|---|---|---
`region` | `str | None` | Optional, the region to use for the storage backend. | `None`
`endpoint_url` | `str | None` | Optional, the endpoint URL to use for the storage backend. | `None`
`allow_http` | `bool` | Whether to allow HTTP requests to the storage backend. | `False`
`anonymous` | `bool` | Whether to use anonymous credentials to the storage backend. When `True`, the s3 requests will not be signed. | `False`
`force_path_style` | `bool` | Whether to force use of path-style addressing for buckets. | `False`
### `` S3StaticCredentials #
Credentials for an S3 storage backend
Attributes:
Name | Type | Description
---|---|---
`access_key_id` | `str` | The access key ID to use for authentication.
`secret_access_key` | `str` | The secret access key to use for authentication.
`session_token` | `str | None` | The session token to use for authentication.
`expires_after` | `datetime | None` | Optional, the expiration time of the credentials.
Methods:
Name | Description
---|---
`__init__` | Create a new `S3StaticCredentials` object
#### `` __init__ #
__init__(access_key_id, secret_access_key, session_token=None, expires_after=None)
Create a new `S3StaticCredentials` object
Parameters:
Name | Type | Description | Default
---|---|---|---
`access_key_id` | `str` | The access key ID to use for authentication. | _required_
`secret_access_key` | `str` | The secret access key to use for authentication. | _required_
`session_token` | `str | None` | Optional, the session token to use for authentication. | `None`
`expires_after` | `datetime | None` | Optional, the expiration time of the credentials. | `None`
### `` Session #
A session object that allows for reading and writing data from an Icechunk
repository.
Methods:
Name | Description
---|---
`all_virtual_chunk_locations` | Return the location URLs of all virtual chunks.
`all_virtual_chunk_locations_async` | Return the location URLs of all virtual chunks (async version).
`allow_pickling` | Context manager to allow unpickling this store if writable.
`chunk_coordinates` | Return an async iterator to all initialized chunks for the array at array_path
`commit` | Commit the changes in the session to the repository.
`commit_async` | Commit the changes in the session to the repository (async version).
`discard_changes` | When the session is writable, discard any uncommitted changes.
`merge` | Merge the changes for this session with the changes from another session.
`merge_async` | Merge the changes for this session with the changes from another session (async version).
`rebase` | Rebase the session to the latest ancestry of the branch.
`rebase_async` | Rebase the session to the latest ancestry of the branch (async version).
`status` | Compute an overview of the current session changes
Attributes:
Name | Type | Description
---|---|---
`branch` | `str | None` | The branch that the session is based on. This is only set if the session is writable.
`has_uncommitted_changes` | `bool` | Whether the session has uncommitted changes. This is only possibly true if the session is writable.
`read_only` | `bool` | Whether the session is read-only.
`snapshot_id` | `str` | The base snapshot ID of the session.
`store` | `IcechunkStore` | Get a zarr Store object for reading and writing data from the repository using zarr python.
#### `` branch `property` #
branch
The branch that the session is based on. This is only set if the session is
writable.
Returns:
Type | Description
---|---
`str or None` | The branch that the session is based on if the session is writable, None otherwise.
#### `` has_uncommitted_changes `property` #
has_uncommitted_changes
Whether the session has uncommitted changes. This is only possibly true if the
session is writable.
Returns:
Type | Description
---|---
`bool` | True if the session has uncommitted changes, False otherwise.
#### `` read_only `property` #
read_only
Whether the session is read-only.
Returns:
Type | Description
---|---
`bool` | True if the session is read-only, False otherwise.
#### `` snapshot_id `property` #
snapshot_id
The base snapshot ID of the session.
Returns:
Type | Description
---|---
`str` | The base snapshot ID of the session.
#### `` store `property` #
store
Get a zarr Store object for reading and writing data from the repository using
zarr python.
Returns:
Type | Description
---|---
`IcechunkStore` | A zarr Store object for reading and writing data from the repository.
#### `` all_virtual_chunk_locations #
all_virtual_chunk_locations()
Return the location URLs of all virtual chunks.
Returns:
Type | Description
---|---
`list of str` | The location URLs of all virtual chunks.
#### `` all_virtual_chunk_locations_async `async` #
all_virtual_chunk_locations_async()
Return the location URLs of all virtual chunks (async version).
Returns:
Type | Description
---|---
`list of str` | The location URLs of all virtual chunks.
#### `` allow_pickling #
allow_pickling()
Context manager to allow unpickling this store if writable.
#### `` chunk_coordinates `async` #
chunk_coordinates(array_path, batch_size=1000)
Return an async iterator to all initialized chunks for the array at array_path
Returns:
Type | Description
---|---
`an async iterator to chunk coordinates as tuples` |
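A sketch of consuming the iterator (the array path is hypothetical):
```python
async def count_chunks(session, array_path: str) -> int:
    n = 0
    async for _coord in session.chunk_coordinates(array_path):
        n += 1
    return n
```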
#### `` commit #
commit(message, metadata=None, rebase_with=None, rebase_tries=1000)
Commit the changes in the session to the repository.
When successful, the writable session is completed and the session is now
read-only and based on the new commit. The snapshot ID of the new commit is
returned.
If the session is out of date, this will raise a ConflictError exception
depicting the conflict that occurred. The session will need to be rebased
before committing.
Parameters:
Name | Type | Description | Default
---|---|---|---
`message` | `str` | The message to write with the commit. | _required_
`metadata` | `dict[str, Any] | None` | Additional metadata to store with the commit snapshot. | `None`
`rebase_with` | `ConflictSolver | None` | If other session committed while the current session was writing, use Session.rebase with this solver. | `None`
`rebase_tries` | `int` | If other session committed while the current session was writing, use Session.rebase up to this many times in a loop. | `1000`
Returns:
Type | Description
---|---
`str` | The snapshot ID of the new commit.
Raises:
Type | Description
---|---
`ConflictError` | If the session is out of date and a conflict occurs.
#### `` commit_async `async` #
commit_async(message, metadata=None, rebase_with=None, rebase_tries=1000)
Commit the changes in the session to the repository (async version).
When successful, the writable session is completed and the session is now
read-only and based on the new commit. The snapshot ID of the new commit is
returned.
If the session is out of date, this will raise a ConflictError exception
depicting the conflict that occurred. The session will need to be rebased
before committing.
Parameters:
Name | Type | Description | Default
---|---|---|---
`message` | `str` | The message to write with the commit. | _required_
`metadata` | `dict[str, Any] | None` | Additional metadata to store with the commit snapshot. | `None`
`rebase_with` | `ConflictSolver | None` | If other session committed while the current session was writing, use Session.rebase with this solver. | `None`
`rebase_tries` | `int` | If other session committed while the current session was writing, use Session.rebase up to this many times in a loop. | `1000`
Returns:
Type | Description
---|---
`str` | The snapshot ID of the new commit.
Raises:
Type | Description
---|---
`ConflictError` | If the session is out of date and a conflict occurs.
#### `` discard_changes #
discard_changes()
When the session is writable, discard any uncommitted changes.
#### `` merge #
merge(*others)
Merge the changes for this session with the changes from another session.
Parameters:
Name | Type | Description | Default
---|---|---|---
`others` | `ForkSession` | The forked sessions to merge changes from. | `()`
#### `` merge_async `async` #
merge_async(*others)
Merge the changes for this session with the changes from another session
(async version).
Parameters:
Name | Type | Description | Default
---|---|---|---
`others` | `ForkSession` | The forked sessions to merge changes from. | `()`
#### `` rebase #
rebase(solver)
Rebase the session to the latest ancestry of the branch.
This method will iteratively crawl the ancestry of the branch and apply the
changes from the branch to the session. If a conflict is detected, the
conflict solver will be used to optionally resolve the conflict. When
complete, the session will be based on the latest commit of the branch and the
session will be ready to attempt another commit.
When a conflict is detected and a resolution is not possible with the provided
solver, a RebaseFailed exception will be raised. This exception will contain
the snapshot ID that the rebase failed on and a list of conflicts that
occurred.
Parameters:
Name | Type | Description | Default
---|---|---|---
`solver` | `ConflictSolver` | The conflict solver to use when a conflict is detected. | _required_
Raises:
Type | Description
---|---
`RebaseFailedError` | When a conflict is detected and the solver fails to resolve it.
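A common pattern is to rebase and retry after a failed commit. A sketch, assuming the `ConflictError` exception and `ConflictDetector` solver exported by the library:
```python
import icechunk

try:
    session.commit("second writer")
except icechunk.ConflictError:
    # Fold in the other writer's committed changes; raises if they overlap ours.
    session.rebase(icechunk.ConflictDetector())
    session.commit("second writer")
```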
#### `` rebase_async `async` #
rebase_async(solver)
Rebase the session to the latest ancestry of the branch (async version).
This method will iteratively crawl the ancestry of the branch and apply the
changes from the branch to the session. If a conflict is detected, the
conflict solver will be used to optionally resolve the conflict. When
complete, the session will be based on the latest commit of the branch and the
session will be ready to attempt another commit.
When a conflict is detected and a resolution is not possible with the provided
solver, a RebaseFailed exception will be raised. This exception will contain
the snapshot ID that the rebase failed on and a list of conflicts that
occurred.
Parameters:
Name | Type | Description | Default
---|---|---|---
`solver` | `ConflictSolver` | The conflict solver to use when a conflict is detected. | _required_
Raises:
Type | Description
---|---
`RebaseFailedError` | When a conflict is detected and the solver fails to resolve it.
#### `` status #
status()
Compute an overview of the current session changes
Returns:
Type | Description
---|---
`Diff` | The operations executed in the current session but still not committed.
### `` SnapshotInfo #
Metadata for a snapshot
Attributes:
Name | Type | Description
---|---|---
`id` | `str` | The snapshot ID
`manifests` | `list[ManifestFileInfo]` | The manifests linked to this snapshot
`message` | `str` | The commit message of the snapshot
`metadata` | `dict[str, Any]` | The metadata of the snapshot
`parent_id` | `str | None` | The parent snapshot ID, if any
`written_at` | `datetime` | The timestamp when the snapshot was written
#### `` id `property` #
id
The snapshot ID
#### `` manifests `property` #
manifests
The manifests linked to this snapshot
#### `` message `property` #
message
The commit message of the snapshot
#### `` metadata `property` #
metadata
The metadata of the snapshot
#### `` parent_id `property` #
parent_id
The parent snapshot ID, if any
#### `` written_at `property` #
written_at
The timestamp when the snapshot was written
### `` Storage #
Storage configuration for an IcechunkStore
Currently supports in-memory, local filesystem, S3, Azure Blob Storage, and
Google Cloud Storage backends. Use the following methods to create a Storage
object with the desired backend.
Example:
```python
storage = icechunk.in_memory_storage()
storage = icechunk.local_filesystem_storage("/path/to/root")
storage = icechunk.s3_storage("bucket", "prefix", ...)
storage = icechunk.gcs_storage("bucket", "prefix", ...)
storage = icechunk.azure_storage("container", "prefix", ...)
```
### `` StorageConcurrencySettings #
Configuration for how Icechunk uses its Storage instance
Methods:
Name | Description
---|---
`__init__` | Create a new `StorageConcurrencySettings` object
Attributes:
Name | Type | Description
---|---|---
`ideal_concurrent_request_size` | `int | None` | The ideal concurrent request size.
`max_concurrent_requests_for_object` | `int | None` | The maximum number of concurrent requests for an object.
#### `` ideal_concurrent_request_size `property` `writable` #
ideal_concurrent_request_size
The ideal concurrent request size.
Returns:
Type | Description
---|---
`int | None` | The ideal concurrent request size.
#### `` max_concurrent_requests_for_object `property` `writable` #
max_concurrent_requests_for_object
The maximum number of concurrent requests for an object.
Returns:
Type | Description
---|---
`int | None` | The maximum number of concurrent requests for an object.
#### `` __init__ #
__init__(max_concurrent_requests_for_object=None, ideal_concurrent_request_size=None)
Create a new `StorageConcurrencySettings` object
Parameters:
Name | Type | Description | Default
---|---|---|---
`max_concurrent_requests_for_object` | `int | None` | The maximum number of concurrent requests for an object. | `None`
`ideal_concurrent_request_size` | `int | None` | The ideal concurrent request size. | `None`
### `` StorageRetriesSettings #
Configuration for how Icechunk retries requests.
Icechunk retries failed requests with an exponential backoff algorithm.
Methods:
Name | Description
---|---
`__init__` | Create a new `StorageRetriesSettings` object
Attributes:
Name | Type | Description
---|---|---
`initial_backoff_ms` | `int | None` | The initial backoff duration in milliseconds.
`max_backoff_ms` | `int | None` | The maximum backoff duration in milliseconds.
`max_tries` | `int | None` | The maximum number of tries, including the initial one.
#### `` initial_backoff_ms `property` `writable` #
initial_backoff_ms
The initial backoff duration in milliseconds.
Returns:
Type | Description
---|---
`int | None` | The initial backoff duration in milliseconds.
#### `` max_backoff_ms `property` `writable` #
max_backoff_ms
The maximum backoff duration in milliseconds.
Returns:
Type | Description
---|---
`int | None` | The maximum backoff duration in milliseconds.
#### `` max_tries `property` `writable` #
max_tries
The maximum number of tries, including the initial one.
Returns:
Type | Description
---|---
`int | None` | The maximum number of tries.
#### `` __init__ #
__init__(max_tries=None, initial_backoff_ms=None, max_backoff_ms=None)
Create a new `StorageRetriesSettings` object
Parameters:
Name | Type | Description | Default
---|---|---|---
`max_tries` | `int | None` | The maximum number of tries, including the initial one. Set to 1 to disable retries | `None`
`initial_backoff_ms` | `int | None` | The initial backoff duration in milliseconds | `None`
`max_backoff_ms` | `int | None` | The limit to backoff duration in milliseconds | `None`
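For example, a sketch wiring custom retries into a repository config:
```python
import icechunk

settings = icechunk.StorageSettings(
    retries=icechunk.StorageRetriesSettings(
        max_tries=5,
        initial_backoff_ms=100,
        max_backoff_ms=10_000,
    ),
)
config = icechunk.RepositoryConfig.default()
config.storage = settings
```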
### `` StorageSettings #
Configuration for how Icechunk uses its Storage instance
Methods:
Name | Description
---|---
`__init__` | Create a new `StorageSettings` object
Attributes:
Name | Type | Description
---|---|---
`chunks_storage_class` | `str | None` | Chunk objects in object store will use this storage class or self.storage_class if None
`concurrency` | `StorageConcurrencySettings | None` | The configuration for how much concurrency Icechunk store uses
`metadata_storage_class` | `str | None` | Metadata objects in object store will use this storage class or self.storage_class if None
`minimum_size_for_multipart_upload` | `int | None` | Use object store's multipart upload for objects larger than this size in bytes
`retries` | `StorageRetriesSettings | None` | The configuration for how Icechunk retries failed requests.
`storage_class` | `str | None` | All objects in object store will use this storage class or the default if None
`unsafe_use_conditional_create` | `bool | None` | True if Icechunk will use conditional PUT operations for creation in the object store
`unsafe_use_conditional_update` | `bool | None` | True if Icechunk will use conditional PUT operations for updates in the object store
`unsafe_use_metadata` | `bool | None` | True if Icechunk will write object metadata in the object store
#### `` chunks_storage_class `property` `writable` #
chunks_storage_class
Chunk objects in object store will use this storage class or
self.storage_class if None
#### `` concurrency `property` `writable` #
concurrency
The configuration for how much concurrency Icechunk store uses
Returns:
Type | Description
---|---
`StorageConcurrencySettings | None` | The configuration for how Icechunk uses its Storage instance.
#### `` metadata_storage_class `property` `writable` #
metadata_storage_class
Metadata objects in object store will use this storage class or
self.storage_class if None
#### `` minimum_size_for_multipart_upload `property` `writable` #
minimum_size_for_multipart_upload
Use object store's multipart upload for objects larger than this size in bytes
#### `` retries `property` `writable` #
retries
The configuration for how Icechunk retries failed requests.
Returns:
Type | Description
---|---
`StorageRetriesSettings | None` | The configuration for how Icechunk retries failed requests.
#### `` storage_class `property` `writable` #
storage_class
All objects in object store will use this storage class or the default if None
#### `` unsafe_use_conditional_create `property` `writable` #
unsafe_use_conditional_create
True if Icechunk will use conditional PUT operations for creation in the
object store
#### `` unsafe_use_conditional_update `property` `writable` #
unsafe_use_conditional_update
True if Icechunk will use conditional PUT operations for updates in the object
store
#### `` unsafe_use_metadata `property` `writable` #
unsafe_use_metadata
True if Icechunk will write object metadata in the object store
#### `` __init__ #
__init__(concurrency=None, retries=None, unsafe_use_conditional_create=None, unsafe_use_conditional_update=None, unsafe_use_metadata=None, storage_class=None, metadata_storage_class=None, chunks_storage_class=None, minimum_size_for_multipart_upload=None)
Create a new `StorageSettings` object
Parameters:
Name | Type | Description | Default
---|---|---|---
`concurrency` | `StorageConcurrencySettings | None` | The configuration for how Icechunk uses its Storage instance. | `None`
`retries` | `StorageRetriesSettings | None` | The configuration for how Icechunk retries failed requests. | `None`
`unsafe_use_conditional_update` | `bool | None` | If set to False, Icechunk loses some of its consistency guarantees. This is only useful in object stores that don't support the feature. Use it at your own risk. | `None`
`unsafe_use_conditional_create` | `bool | None` | If set to False, Icechunk loses some of its consistency guarantees. This is only useful in object stores that don't support the feature. Use at your own risk. | `None`
`unsafe_use_metadata` | `bool | None` | Don't write metadata fields in Icechunk files. This is only useful in object stores that don't support the feature. Use at your own risk. | `None`
`storage_class` | `str | None` | Store all objects using this object store storage class If None the object store default will be used. Currently not supported in GCS. Example: STANDARD_IA | `None`
`metadata_storage_class` | `str | None` | Store metadata objects using this object store storage class. Currently not supported in GCS. Defaults to storage_class. | `None`
`chunks_storage_class` | `str | None` | Store chunk objects using this object store storage class. Currently not supported in GCS. Defaults to storage_class. | `None`
`minimum_size_for_multipart_upload` | `int | None` | Use object store's multipart upload for objects larger than this size in bytes. Default: 100 MB if None is passed. | `None`
### `` VersionSelection #
Bases: `Enum`
Enum for selecting which version of a conflict to use
Attributes:
Name | Type | Description
---|---|---
`Fail` | `int` | Fail the rebase operation
`UseOurs` | `int` | Use the version from the source store
`UseTheirs` | `int` | Use the version from the target store
### `` VirtualChunkContainer #
A virtual chunk container is a configuration that allows Icechunk to read
virtual references from a storage backend.
Attributes:
Name | Type | Description
---|---|---
`url_prefix` | `str` | The prefix of URLs that will use this container's configuration for reading virtual references.
`store` | `ObjectStoreConfig` | The storage backend to use for the virtual chunk container.
Methods:
Name | Description
---|---
`__init__` | Create a new `VirtualChunkContainer` object
#### `` __init__ #
__init__(url_prefix, store)
Create a new `VirtualChunkContainer` object
Parameters:
Name | Type | Description | Default
---|---|---|---
`url_prefix` | `str` | The prefix of URLs that will use this container's configuration for reading virtual references. | _required_
`store` | `AnyObjectStoreConfig` | The storage backend to use for the virtual chunk container. | _required_
### `` VirtualChunkSpec #
The specification for a virtual chunk reference.
Attributes:
Name | Type | Description
---|---|---
`etag_checksum` | `str | None` | Optional object store e-tag for the containing object.
`index` | `list[int]` | The chunk index, in chunk coordinates space
`last_updated_at_checksum` | `datetime | None` | Optional timestamp for the containing object.
`length` | `int` | The length of the chunk in bytes
`location` | `str` | The URL to the virtual chunk data, something like 's3://bucket/foo.nc'
`offset` | `int` | The chunk offset within the pointed object, in bytes
#### `` etag_checksum `property` #
etag_checksum
Optional object store e-tag for the containing object.
Icechunk will refuse to serve data from this chunk if the etag has changed.
#### `` index `property` #
index
The chunk index, in chunk coordinates space
#### `` last_updated_at_checksum `property` #
last_updated_at_checksum
Optional timestamp for the containing object.
Icechunk will refuse to serve data from this chunk if it has been modified in
object store after this time.
#### `` length `property` #
length
The length of the chunk in bytes
#### `` location `property` #
location
The URL to the virtual chunk data, something like 's3://bucket/foo.nc'
#### `` offset `property` #
offset
The chunk offset within the pointed object, in bytes
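A sketch of constructing a spec, assuming the constructor mirrors the documented attributes (the location and sizes are hypothetical):
```python
import icechunk

spec = icechunk.VirtualChunkSpec(
    index=[0, 0],                   # chunk coordinates
    location="s3://bucket/foo.nc",  # containing object
    offset=512,                     # byte offset within the object
    length=1024,                    # chunk length in bytes
)
```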
### `` azure_credentials #
azure_credentials(*, access_key=None, sas_token=None, bearer_token=None, from_env=None)
Create credentials for an Azure Blob Storage object store.
If all arguments are None, credentials are fetched from the operating system
environment.
### `` azure_from_env_credentials #
azure_from_env_credentials()
Instruct Azure Blob Storage object store to fetch credentials from the
operating system environment.
### `` azure_static_credentials #
azure_static_credentials(*, access_key=None, sas_token=None, bearer_token=None)
Create static credentials for an Azure Blob Storage object store.
### `` azure_storage #
azure_storage(*, account, container, prefix, access_key=None, sas_token=None, bearer_token=None, from_env=None, config=None)
Create a Storage instance that saves data in Azure Blob Storage object store.
Parameters:
Name | Type | Description | Default
---|---|---|---
`account` | `str` | The account to which the caller must have access privileges | _required_
`container` | `str` | The container where the repository will store its data | _required_
`prefix` | `str` | The prefix within the container that is the root directory of the repository | _required_
`access_key` | `str | None` | Azure Blob Storage credential access key | `None`
`sas_token` | `str | None` | Azure Blob Storage credential SAS token | `None`
`bearer_token` | `str | None` | Azure Blob Storage credential bearer token | `None`
`from_env` | `bool | None` | Fetch credentials from the operating system environment | `None`
`config` | `dict[str, str] | None` | A dictionary of options for the Azure Blob Storage object store. See https://docs.rs/object_store/latest/object_store/azure/enum.AzureConfigKey.html#variants for a list of possible configuration keys. | `None`
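For example (a sketch; account and container names are hypothetical):
```python
import icechunk

storage = icechunk.azure_storage(
    account="myaccount",
    container="mycontainer",
    prefix="path/to/repo",
    from_env=True,  # pick up credentials from the environment
)
```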
### `` containers_credentials #
containers_credentials(m)
Build a map of credentials for virtual chunk containers.
Parameters:
Name | Type | Description | Default
---|---|---|---
`m` | `Mapping[str, AnyS3Credential | AnyGcsCredential | AnyAzureCredential | None]` | A mapping from container url prefixes to credentials. | _required_
Examples:
```python
import icechunk as ic

config = ic.RepositoryConfig.default()
config.inline_chunk_threshold_bytes = 512
virtual_store_config = ic.s3_store(
    region="us-east-1",
    endpoint_url="http://localhost:9000",
    allow_http=True,
    s3_compatible=True,
    force_path_style=True,
)
container = ic.VirtualChunkContainer("s3://somebucket", virtual_store_config)
config.set_virtual_chunk_container(container)
credentials = ic.containers_credentials(
    {"s3://somebucket": ic.s3_credentials(access_key_id="ACCESS_KEY", secret_access_key="SECRET")}
)
repo = ic.Repository.create(
    storage=ic.local_filesystem_storage(store_path),
    config=config,
    authorize_virtual_chunk_access=credentials,
)
```
### `` gcs_credentials #
gcs_credentials(*, service_account_file=None, service_account_key=None, application_credentials=None, bearer_token=None, from_env=None, get_credentials=None, scatter_initial_credentials=False)
Create credentials for a Google Cloud Storage object store.
If all arguments are None, credentials are fetched from the operating system
environment.
### `` gcs_from_env_credentials #
gcs_from_env_credentials()
Instruct Google Cloud Storage object store to fetch credentials from the
operating system environment.
### `` gcs_refreshable_credentials #
gcs_refreshable_credentials(get_credentials, scatter_initial_credentials=False)
Create refreshable credentials for Google Cloud Storage object store.
Parameters:
Name | Type | Description | Default
---|---|---|---
`get_credentials` | `Callable[[], GcsBearerCredential]` | Use this function to get and refresh the credentials. The function must be picklable. | _required_
`scatter_initial_credentials` | `bool` | Immediately call and store the value returned by get_credentials. This is useful if the repo or session will be pickled to generate many copies. Passing scatter_initial_credentials=True will ensure all those copies don't need to call get_credentials immediately. After the initial set of credentials has expired, the cached value is no longer used. Notice that credentials obtained are stored, and they can be sent over the network if you pickle the session/repo. | `False`
### `` gcs_static_credentials #
gcs_static_credentials(*, service_account_file=None, service_account_key=None, application_credentials=None, bearer_token=None)
Create static credentials for a Google Cloud Storage object store.
### `` gcs_storage #
gcs_storage(*, bucket, prefix, service_account_file=None, service_account_key=None, application_credentials=None, bearer_token=None, from_env=None, config=None, get_credentials=None, scatter_initial_credentials=False)
Create a Storage instance that saves data in Google Cloud Storage object
store.
Parameters:
Name | Type | Description | Default
---|---|---|---
`bucket` | `str` | The bucket where the repository will store its data | _required_
`prefix` | `str | None` | The prefix within the bucket that is the root directory of the repository | _required_
`service_account_file` | `str | None` | The path to the service account file | `None`
`service_account_key` | `str | None` | The service account key | `None`
`application_credentials` | `str | None` | The path to the application credentials file | `None`
`bearer_token` | `str | None` | The bearer token to use for the object store | `None`
`from_env` | `bool | None` | Fetch credentials from the operating system environment | `None`
`config` | `dict[str, str] | None` | A dictionary of options for the Google Cloud Storage object store. See https://docs.rs/object_store/latest/object_store/gcp/enum.GoogleConfigKey.html#variants for a list of possible configuration keys. | `None`
`get_credentials` | `Callable[[], GcsBearerCredential] | None` | Use this function to get and refresh object store credentials | `None`
`scatter_initial_credentials` | `bool` | Immediately call and store the value returned by get_credentials. This is useful if the repo or session will be pickled to generate many copies. Passing scatter_initial_credentials=True will ensure all those copies don't need to call get_credentials immediately. After the initial set of credentials has expired, the cached value is no longer used. Notice that credentials obtained are stored, and they can be sent over the network if you pickle the session/repo. | `False`
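For example (a sketch; the bucket name is hypothetical):
```python
import icechunk

storage = icechunk.gcs_storage(
    bucket="my-bucket",
    prefix="path/to/repo",
    from_env=True,  # pick up credentials from the environment
)
```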
### `` gcs_store #
gcs_store(opts=None)
Build an ObjectStoreConfig instance for Google Cloud Storage object stores.
Parameters:
Name | Type | Description | Default
---|---|---|---
`opts` | `dict[str, str] | None` | A dictionary of options for the Google Cloud Storage object store. See https://docs.rs/object_store/latest/object_store/gcp/enum.GoogleConfigKey.html#variants for a list of possible configuration keys. | `None`
### `` http_store #
http_store(opts=None)
Build an ObjectStoreConfig instance for HTTP object stores.
Parameters:
Name | Type | Description | Default
---|---|---|---
`opts` | `dict[str, str] | None` | A dictionary of options for the HTTP object store. See https://docs.rs/object_store/latest/object_store/client/enum.ClientConfigKey.html#variants for a list of possible keys in snake case format. | `None`
### `in_memory_storage`
in_memory_storage()
Create a Storage instance that saves data in memory.
This Storage implementation is used for tests. Data will be lost after the
process finishes, and can only be accessed through the Storage instance
returned. Different instances don't share data.
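Because it needs no configuration, in-memory storage is handy for quick experiments and unit tests. A minimal sketch:

```python
import icechunk

# Data lives only in this process and only in this Storage instance.
storage = icechunk.in_memory_storage()
repo = icechunk.Repository.create(storage)
```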
### `initialize_logs`
initialize_logs()
Initialize the logging system for the library.
Reads the value of the environment variable ICECHUNK_LOG to obtain the
filters. This is automatically called on `import icechunk`.
### `local_filesystem_storage`
local_filesystem_storage(path)
Create a Storage instance that saves data in the local file system.
This Storage instance is not recommended for production data.
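For local experimentation, point the storage at a directory on disk. A minimal sketch; the temporary-directory choice is an illustrative assumption:

```python
import tempfile
import icechunk

# Hypothetical path; suitable for testing, not for production data.
path = tempfile.mkdtemp()
storage = icechunk.local_filesystem_storage(path)
repo = icechunk.Repository.create(storage)
```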
### `local_filesystem_store`
local_filesystem_store(path)
Build an ObjectStoreConfig instance for local file stores.
Parameters:
Name | Type | Description | Default
---|---|---|---
`path` | `str` | The root directory for the store. | _required_
### `r2_storage`
r2_storage(*, bucket=None, prefix=None, account_id=None, endpoint_url=None, region=None, allow_http=False, access_key_id=None, secret_access_key=None, session_token=None, expires_after=None, anonymous=None, from_env=None, get_credentials=None, scatter_initial_credentials=False)
Create a Storage instance that saves data in the Cloudflare R2 object store.
Parameters:
Name | Type | Description | Default
---|---|---|---
`bucket` | `str | None` | The bucket name | `None`
`prefix` | `str | None` | The prefix within the bucket that is the root directory of the repository | `None`
`account_id` | `str | None` | Cloudflare account ID. When provided, a default endpoint URL is constructed as `https://<account_id>.r2.cloudflarestorage.com`. If not provided, `endpoint_url` must be provided instead. | `None`
`endpoint_url` | `str | None` | Endpoint where the object store serves data, for example `https://<account_id>.r2.cloudflarestorage.com` | `None`
`region` | `str | None` | The region to use in the object store, if `None` the default region 'auto' will be used | `None`
`allow_http` | `bool` | If the object store can be accessed using http protocol instead of https | `False`
`access_key_id` | `str | None` | S3 credential access key | `None`
`secret_access_key` | `str | None` | S3 credential secret access key | `None`
`session_token` | `str | None` | Optional S3 credential session token | `None`
`expires_after` | `datetime | None` | Optional expiration for the object store credentials | `None`
`anonymous` | `bool | None` | If set to True requests to the object store will not be signed | `None`
`from_env` | `bool | None` | Fetch credentials from the operating system environment | `None`
`get_credentials` | `Callable[[], S3StaticCredentials] | None` | Use this function to get and refresh object store credentials | `None`
`scatter_initial_credentials` | `bool` | Immediately call `get_credentials` and cache the result. This is useful if the repo or session will be pickled to generate many copies, since those copies won't each need to call `get_credentials` immediately. Once the initial set of credentials expires, the cached value is no longer used. Note that the cached credentials are stored with the object and can be sent over the network if you pickle the session/repo. | `False`
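A minimal sketch of connecting to an R2 bucket; every identifier below is a hypothetical placeholder:

```python
import icechunk

# Hypothetical bucket, account ID, and keys.
storage = icechunk.r2_storage(
    bucket="my-r2-bucket",
    prefix="repos/my-repo",
    account_id="my-account-id",  # used to derive the endpoint URL
    access_key_id="my-access-key",
    secret_access_key="my-secret-key",
)
```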
### `s3_anonymous_credentials`
s3_anonymous_credentials()
Create no-signature credentials for S3 and S3 compatible object stores.
### `s3_credentials`
s3_credentials(*, access_key_id=None, secret_access_key=None, session_token=None, expires_after=None, anonymous=None, from_env=None, get_credentials=None, scatter_initial_credentials=False)
Create credentials for S3 and S3 compatible object stores.
If all arguments are None, credentials are fetched from the environment.
Parameters:
Name | Type | Description | Default
---|---|---|---
`access_key_id` | `str | None` | S3 credential access key | `None`
`secret_access_key` | `str | None` | S3 credential secret access key | `None`
`session_token` | `str | None` | Optional S3 credential session token | `None`
`expires_after` | `datetime | None` | Optional expiration for the object store credentials | `None`
`anonymous` | `bool | None` | If set to True requests to the object store will not be signed | `None`
`from_env` | `bool | None` | Fetch credentials from the operating system environment | `None`
`get_credentials` | `Callable[[], S3StaticCredentials] | None` | Use this function to get and refresh object store credentials | `None`
`scatter_initial_credentials` | `bool` | Immediately call `get_credentials` and cache the result. This is useful if the repo or session will be pickled to generate many copies, since those copies won't each need to call `get_credentials` immediately. Once the initial set of credentials expires, the cached value is no longer used. Note that the cached credentials are stored with the object and can be sent over the network if you pickle the session/repo. | `False`
### `s3_from_env_credentials`
s3_from_env_credentials()
Instruct S3 and S3 compatible object stores to gather credentials from the
operating system environment.
### `s3_refreshable_credentials`
s3_refreshable_credentials(get_credentials, scatter_initial_credentials=False)
Create refreshable credentials for S3 and S3 compatible object stores.
Parameters:
Name | Type | Description | Default
---|---|---|---
`get_credentials` | `Callable[[], S3StaticCredentials]` | Use this function to get and refresh the credentials. The function must be picklable. | _required_
`scatter_initial_credentials` | `bool` | Immediately call `get_credentials` and cache the result. This is useful if the repo or session will be pickled to generate many copies, since those copies won't each need to call `get_credentials` immediately. Once the initial set of credentials expires, the cached value is no longer used. Note that the cached credentials are stored with the object and can be sent over the network if you pickle the session/repo. | `False`
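A sketch of a refreshable-credentials callback. The token-fetching logic is an assumption stubbed out with placeholder keys; the key points are that the function is picklable (defined at module level) and returns an `S3StaticCredentials` with an expiration, after which it is called again:

```python
from datetime import datetime, timedelta, timezone
import icechunk

def get_credentials() -> icechunk.S3StaticCredentials:
    # Hypothetical: fetch short-lived keys from your auth system here.
    return icechunk.S3StaticCredentials(
        access_key_id="my-access-key",
        secret_access_key="my-secret-key",
        expires_after=datetime.now(timezone.utc) + timedelta(minutes=15),
    )

credentials = icechunk.s3_refreshable_credentials(get_credentials)
```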
### `s3_static_credentials`
s3_static_credentials(*, access_key_id, secret_access_key, session_token=None, expires_after=None)
Create static credentials for S3 and S3 compatible object stores.
Parameters:
Name | Type | Description | Default
---|---|---|---
`access_key_id` | `str` | S3 credential access key | _required_
`secret_access_key` | `str` | S3 credential secret access key | _required_
`session_token` | `str | None` | Optional S3 credential session token | `None`
`expires_after` | `datetime | None` | Optional expiration for the object store credentials | `None`
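When keys are fixed for the lifetime of the process, static credentials are enough. A minimal sketch with placeholder values:

```python
import icechunk

# Hypothetical keys; supply these wherever the API accepts S3 credentials.
credentials = icechunk.s3_static_credentials(
    access_key_id="my-access-key",
    secret_access_key="my-secret-key",
)
```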
### `s3_storage`
s3_storage(*, bucket, prefix, region=None, endpoint_url=None, allow_http=False, access_key_id=None, secret_access_key=None, session_token=None, expires_after=None, anonymous=None, from_env=None, get_credentials=None, scatter_initial_credentials=False, force_path_style=False)
Create a Storage instance that saves data in S3 or S3 compatible object
stores.
Parameters:
Name | Type | Description | Default
---|---|---|---
`bucket` | `str` | The bucket where the repository will store its data | _required_
`prefix` | `str | None` | The prefix within the bucket that is the root directory of the repository | _required_
`region` | `str | None` | The region to use in the object store, if `None` a default region will be used | `None`
`endpoint_url` | `str | None` | Optional endpoint where the object store serves data, example: http://localhost:9000 | `None`
`allow_http` | `bool` | If the object store can be accessed using http protocol instead of https | `False`
`access_key_id` | `str | None` | S3 credential access key | `None`
`secret_access_key` | `str | None` | S3 credential secret access key | `None`
`session_token` | `str | None` | Optional S3 credential session token | `None`
`expires_after` | `datetime | None` | Optional expiration for the object store credentials | `None`
`anonymous` | `bool | None` | If set to True requests to the object store will not be signed | `None`
`from_env` | `bool | None` | Fetch credentials from the operating system environment | `None`
`get_credentials` | `Callable[[], S3StaticCredentials] | None` | Use this function to get and refresh object store credentials | `None`
`scatter_initial_credentials` | `bool` | Immediately call `get_credentials` and cache the result. This is useful if the repo or session will be pickled to generate many copies, since those copies won't each need to call `get_credentials` immediately. Once the initial set of credentials expires, the cached value is no longer used. Note that the cached credentials are stored with the object and can be sent over the network if you pickle the session/repo. | `False`
`force_path_style` | `bool` | Whether to force using path-style addressing for buckets | `False`
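A minimal sketch of the most common case; the bucket, prefix, and region are illustrative assumptions, and credentials come from the environment:

```python
import icechunk

# Hypothetical bucket/prefix; credentials are read from the environment
# (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, etc.).
storage = icechunk.s3_storage(
    bucket="my-s3-bucket",
    prefix="repos/my-repo",
    region="us-east-1",
    from_env=True,
)
repo = icechunk.Repository.open_or_create(storage)
```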
### `s3_store`
s3_store(region=None, endpoint_url=None, allow_http=False, anonymous=False, s3_compatible=False, force_path_style=False)
Build an ObjectStoreConfig instance for S3 or S3 compatible object stores.
### `set_logs_filter`
set_logs_filter(log_filter_directive)
Set filters and log levels for the different modules.
Examples:
- `set_logs_filter("trace")` # trace level for all modules
- `set_logs_filter("error")` # error level for all modules
- `set_logs_filter("icechunk=debug,info")` # debug level for icechunk, info for everything else
The full spec for the `log_filter_directive` syntax is documented at
https://docs.rs/tracing-subscriber/latest/tracing_subscriber/filter/struct.EnvFilter.html#directives
Parameters:
Name | Type | Description | Default
---|---|---|---
`log_filter_directive` | `str | None` | The comma separated list of directives for modules and log levels. If None, the directive will be read from the environment variable ICECHUNK_LOG | _required_
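For example, to debug Icechunk's internals while keeping other modules quiet:

```python
import icechunk

# Debug-level logs for icechunk, warnings only for everything else.
icechunk.set_logs_filter("icechunk=debug,warn")
```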
### `spec_version`
spec_version()
The version of the Icechunk specification that the library is compatible with.
Returns:
Type | Description
---|---
`int` | The version of the Icechunk specification that the library is compatible with
### `tigris_storage`
tigris_storage(*, bucket, prefix, region=None, endpoint_url=None, use_weak_consistency=False, allow_http=False, access_key_id=None, secret_access_key=None, session_token=None, expires_after=None, anonymous=None, from_env=None, get_credentials=None, scatter_initial_credentials=False)
Create a Storage instance that saves data in the Tigris object store.
Parameters:
Name | Type | Description | Default
---|---|---|---
`bucket` | `str` | The bucket where the repository will store its data | _required_
`prefix` | `str | None` | The prefix within the bucket that is the root directory of the repository | _required_
`region` | `str | None` | The region to use in the object store, if `None` a default region will be used | `None`
`endpoint_url` | `str | None` | Optional endpoint where the object store serves data, example: http://localhost:9000 | `None`
`use_weak_consistency` | `bool` | If set to True, it returns a read-only Storage instance that can read from the closest Tigris region. Behavior is undefined if objects haven't propagated to that region yet. This option is for experts only. | `False`
`allow_http` | `bool` | If the object store can be accessed using http protocol instead of https | `False`
`access_key_id` | `str | None` | S3 credential access key | `None`
`secret_access_key` | `str | None` | S3 credential secret access key | `None`
`session_token` | `str | None` | Optional S3 credential session token | `None`
`expires_after` | `datetime | None` | Optional expiration for the object store credentials | `None`
`anonymous` | `bool | None` | If set to True requests to the object store will not be signed | `None`
`from_env` | `bool | None` | Fetch credentials from the operating system environment | `None`
`get_credentials` | `Callable[[], S3StaticCredentials] | None` | Use this function to get and refresh object store credentials | `None`
`scatter_initial_credentials` | `bool` | Immediately call `get_credentials` and cache the result. This is useful if the repo or session will be pickled to generate many copies, since those copies won't each need to call `get_credentials` immediately. Once the initial set of credentials expires, the cached value is no longer used. Note that the cached credentials are stored with the object and can be sent over the network if you pickle the session/repo. | `False`
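A minimal sketch, using hypothetical bucket and key placeholders:

```python
import icechunk

# Hypothetical bucket and keys.
storage = icechunk.tigris_storage(
    bucket="my-tigris-bucket",
    prefix="repos/my-repo",
    access_key_id="my-access-key",
    secret_access_key="my-secret-key",
)
```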
## `icechunk.xarray`
Functions:
Name | Description
---|---
`to_icechunk` | Write an Xarray object to a group of an Icechunk store.
### `to_icechunk`
to_icechunk(obj, session, *, group=None, mode=None, safe_chunks=True, append_dim=None, region=None, encoding=None, chunkmanager_store_kwargs=None, split_every=None)
Write an Xarray object to a group of an Icechunk store.
Parameters:
Name | Type | Description | Default
---|---|---|---
`obj` | `DataArray | Dataset` | Xarray object to write | _required_
`session` | `Session` | Writable Icechunk Session | _required_
`mode` | `"w", "w-", "a", "a-", "r+", None` | Persistence mode: "w" means create (overwrite if exists); "w-" means create (fail if exists); "a" means override all existing variables including dimension coordinates (create if does not exist); "a-" means only append those variables that have `append_dim`; "r+" means modify existing array _values_ only (raise an error if any metadata or shapes would change). The default mode is "a" if `append_dim` is set. Otherwise, it is "r+" if `region` is set and "w-" otherwise. | `None`
`group` | `str` | Group path. (a.k.a. `path` in zarr terminology.) | `None`
`encoding` | `dict` | Nested dictionary with variable names as keys and dictionaries of variable specific encodings as values, e.g., `{"my_variable": {"dtype": "int16", "scale_factor": 0.1,}, ...}` | `None`
`append_dim` | `hashable` | If set, the dimension along which the data will be appended. All other dimensions on overridden variables must remain the same size. | `None`
`region` | `dict or auto` | Optional mapping from dimension names to either a) `"auto"`, or b) integer slices, indicating the region of existing zarr array(s) in which to write this dataset's data. If `"auto"` is provided the existing store will be opened and the region inferred by matching indexes. `"auto"` can be used as a single string, which will automatically infer the region for all dimensions, or as dictionary values for specific dimensions mixed together with explicit slices for other dimensions. Alternatively integer slices can be provided; for example, `{'x': slice(0, 1000), 'y': slice(10000, 11000)}` would indicate that values should be written to the region `0:1000` along `x` and `10000:11000` along `y`. Users are expected to ensure that the specified region aligns with Zarr chunk boundaries, and that dask chunks are also aligned. Xarray makes limited checks that these multiple chunk boundaries line up. It is possible to write incomplete chunks and corrupt the data with this option if you are not careful. | `None`
`safe_chunks` | `bool` | If True, only allow writes when there is a many-to-one relationship between Zarr chunks (specified in encoding) and Dask chunks. Set to False to override this restriction; however, data may become corrupted if Zarr arrays are written in parallel. In addition to the many-to-one relationship validation, this also detects partial chunk writes when using the `region` parameter; these partial chunks are considered unsafe in mode "r+" but safe in mode "a". Note: even with these validations it can still be unsafe to write two or more chunked arrays to the same location in parallel unless they write to independent regions. | `True`
`chunkmanager_store_kwargs` | `dict` | Additional keyword arguments passed on to the `ChunkManager.store` method used to store chunked arrays. For example for a dask array additional kwargs will be passed eventually to `dask.array.store()`. Experimental API that should not be relied upon. | `None`
`split_every` | `int | None` | Number of tasks to merge at every level of the tree reduction. | `None`
Returns:
Type | Description
---|---
`None` |
Notes
Two restrictions apply to the use of `region`:
* If `region` is set, _all_ variables in a dataset must have at least one dimension in common with the region. Other variables should be written in a separate single call to `to_icechunk()`.
* Dimensions cannot be included in both `region` and `append_dim` at the same time. To create empty arrays to fill in with `region`, use the `_XarrayDatasetWriter` directly.
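A minimal end-to-end sketch, assuming a small in-memory dataset and repository; the group name and variable are illustrative:

```python
import xarray as xr
import icechunk
from icechunk.xarray import to_icechunk

storage = icechunk.in_memory_storage()  # any Storage works here
repo = icechunk.Repository.create(storage)
session = repo.writable_session("main")

# Hypothetical one-dimensional dataset.
ds = xr.Dataset({"temperature": (("x",), [1.0, 2.0, 3.0])})
to_icechunk(ds, session, group="weather", mode="w")
session.commit("write initial temperature data")
```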
## `icechunk.dask`
Functions:
Name | Description
---|---
`computing_meta` | A decorator to handle the dask-specific `computing_meta` flag.
`store_dask` | A version of `dask.array.store` for Icechunk stores.
### `computing_meta`
computing_meta(func)
A decorator to handle the dask-specific `computing_meta` flag.
If `computing_meta` is True in the keyword arguments, the decorated function
will return a placeholder meta object (np.array([object()], dtype=object)).
Otherwise, it will execute the original function.
### `store_dask`
store_dask(*, sources, targets, regions=None, split_every=None, **store_kwargs)
A version of `dask.array.store` for Icechunk stores.
This method will eagerly execute writes to the Icechunk store, and will merge
the changesets corresponding to each write task. The `store` object passed in
will be updated in-place with the fully merged changeset.
For distributed or multi-processing writes, this method must be called within
the `Session.allow_pickling()` context. All Zarr arrays in `targets` must also
be created within this context since they contain a reference to the Session.
Parameters:
Name | Type | Description | Default
---|---|---|---
`sources` | `list[Array]` | List of dask arrays to write. | _required_
`targets` | `list[zarr.Array]` | Corresponding list of Zarr array objects to write to. | _required_
`regions` | `list[tuple[slice, ...]] | None` | Corresponding region for each of `targets` to write to. | `None`
`split_every` | `int | None` | Number of changesets to merge at a given time. | `None`
`**store_kwargs` | `Any` | Arbitrary keyword arguments passed to `dask.array.store`. Notably `compute`, `return_stored`, `load_stored`, and `lock` are unsupported. | `{}`
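A sketch of writing one dask array into a pre-created Zarr array, assuming the shapes and chunking already line up; the distributed case additionally requires the `Session.allow_pickling()` context noted above:

```python
import dask.array as da
import zarr
import icechunk
from icechunk.dask import store_dask

storage = icechunk.in_memory_storage()
repo = icechunk.Repository.create(storage)
session = repo.writable_session("main")

# Create the target Zarr array, then write a matching dask array into it.
group = zarr.group(store=session.store, overwrite=True)
zarr_array = group.create_array("data", shape=(100, 100), chunks=(20, 20), dtype="f8")
dask_array = da.random.random((100, 100), chunks=(20, 20))

store_dask(sources=[dask_array], targets=[zarr_array])
session.commit("write random data")
```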