
Blosc2 compression with Rust

Stan Malinowski

Introduction

How do scientific applications store and process massive datasets without grinding to a halt? This question drove me to explore the world of file compression and data handling, a world that was nebulous to me when I started working at a synchrotron. I knew that the recorded scientific data was compressed, but the details were a black box to me. I wanted to dive deeper into low-level optimizations with Rust, and to my surprise I also learned about the trade-offs behind making storage and retrieval both fast and efficient.

In this quest, I discovered c-blosc2. It's the second iteration of a powerful C compression library designed for scientific applications, particularly those involving HDF files. Its clever use of features like shuffling and delta encoding piqued my curiosity. What if I could port this library to Rust? Could Rust's safety guarantees and modern abstractions enhance the developer experience?

In this post, I’ll take you through my journey of porting c-blosc2 to Rust—what worked, what didn’t, the challenges I faced, and the lessons I learned. We’ll explore compression techniques, file handling, and where I ultimately landed with this project. Whether you’re an intermediate Rust user or just curious about tackling low-level problems in Rust, there’s something here for you.

On the Compression Library: c-blosc2

When working with large datasets, particularly in scientific computing, compression is a necessity, as data volumes can easily run into terabytes. c-blosc2 builds on its predecessor with a broader feature set aimed at improving speed, configurability, and functionality.

This library is tailored to handle HDF files efficiently, offering advanced techniques like shuffling (reordering data for better compression) and delta encoding (storing only differences between sequential data points). These optimizations make it particularly well-suited for compressing time-series data or scientific arrays.
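To make the shuffle idea concrete, here is a minimal, unoptimized sketch of byte-level shuffling (the real c-blosc2 implementation uses SIMD and works per block; this version only shows the data movement):

```rust
/// Byte-shuffle: regroup the i-th byte of every element together.
/// For little-endian integers with small values, the high bytes are
/// mostly zero, so shuffling produces long runs that compress well.
fn shuffle(typesize: usize, data: &[u8]) -> Vec<u8> {
    assert_eq!(data.len() % typesize, 0);
    let n = data.len() / typesize; // number of elements
    let mut out = vec![0u8; data.len()];
    for i in 0..n {
        for b in 0..typesize {
            out[b * n + i] = data[i * typesize + b];
        }
    }
    out
}

fn main() {
    // Two u32 values, 1 and 2, in little-endian layout.
    let data = [1u8, 0, 0, 0, 2, 0, 0, 0];
    let shuffled = shuffle(4, &data);
    // All low bytes first, then the (all-zero) higher byte planes.
    assert_eq!(shuffled, [1, 2, 0, 0, 0, 0, 0, 0]);
    println!("{:?}", shuffled);
}
```

After shuffling, a generic compressor sees one long run of zeros instead of zeros interleaved with payload bytes, which is exactly what run-length-friendly codecs exploit.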

In this post, I focus on the journey of translating this C-based library into Rust. I wanted to explore how Rust’s strict guarantees around safety and concurrency and modern syntax could make such a project more developer friendly. This post is written with intermediate Rust users and file compression enthusiasts in mind, detailing insights gained from tackling a low-level problem like this.


Understanding HDF Files

To appreciate c-blosc2’s role, it’s important to understand HDF files (Hierarchical Data Format). These files are a standard in scientific computing, allowing researchers to work with big datasets efficiently.

HDF files are designed for scalability, with the aim of storing an entire experiment: multidimensional images, metadata, and time series. They feature a hierarchical structure, akin to a filesystem, making it easier to group related data. However, their size can grow significantly, especially with high-resolution simulations or sensor data.

That’s where c-blosc2 comes in. By offering fast, configurable compression techniques, the library reduces the storage footprint of HDF files without compromising access speed. Compression methods like delta encoding are especially useful for time series of images, ensuring that both storage and retrieval are as fast as possible.


Goals for the Project

My project began as an exploration but soon grew into a structured effort to bring c-blosc2’s capabilities to Rust. Here’s what I set out to achieve:

  1. Feature Replication: I aimed to replicate as much of c-blosc2’s feature set as possible, with a focus on learning Rust for filesystem I/O and on making deliberate design choices.
  2. Publishing a Rust Crate: By creating a Rust implementation, I wanted to make the library accessible to the Rust community.
  3. Python Bindings for my crate: To ensure easy use in Python scientific workflows.
  4. Benchmarking: Writing benchmarks to compare the speed of my Rust implementation with c-blosc2 was a priority. I also wanted to ensure these benchmarks were easy for others to reproduce.
  5. Integration Examples: I planned to provide practical examples showing how the library could be integrated into both Rust and Python servers. This would showcase the potential use cases in real-world applications.

Feature Set of c-blosc2

The strength of c-blosc2 in the scientific use case lies in optimizing for compression speed rather than compressed size, which is the priority of many standard algorithms like zstd. Two techniques stand out:

  • Shuffling Before Encoding: This technique reorders data at the byte level, grouping similar values together, which significantly improves the effectiveness of compression algorithms.
  • Delta Encoding: Designed for time-series data, delta encoding stores the differences between consecutive values rather than the values themselves. Ideal for data with incremental changes.
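Delta encoding is simple enough to sketch in a few lines. This illustrative version works on `i32` samples rather than raw bytes, but the principle is the same:

```rust
/// Delta-encode in place: keep the first value, then store each
/// value as the difference from its predecessor. Slowly changing
/// time series collapse into many small, highly compressible numbers.
fn delta_encode(data: &mut [i32]) {
    for i in (1..data.len()).rev() {
        data[i] -= data[i - 1];
    }
}

/// Inverse transform: a running prefix sum restores the original values.
fn delta_decode(data: &mut [i32]) {
    for i in 1..data.len() {
        data[i] += data[i - 1];
    }
}

fn main() {
    let mut samples = [100, 101, 103, 106, 110];
    delta_encode(&mut samples);
    assert_eq!(samples, [100, 1, 2, 3, 4]);
    delta_decode(&mut samples);
    assert_eq!(samples, [100, 101, 103, 106, 110]);
}
```

Note that encoding iterates in reverse so each subtraction reads the still-unmodified predecessor; decoding runs forward so each addition reads the already-restored predecessor.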

These features are carefully designed to tackle large datasets, offering a balance of speed, memory efficiency, and flexibility.


My Approach

To bring c-blosc2’s features into Rust, I prioritized a methodical approach that allowed me to focus on learning:

  1. Studying the Protocol and Code: I started by diving into the library’s protocol specification and high-level C code. Understanding the design principles and the library’s headers gave me a clearer picture of its architecture.
  2. Incremental Implementation: Instead of a line-by-line translation, I focused on the bigger picture. By identifying core functionality, I implemented features incrementally.
  3. Starting Small: I deliberately began with simpler features, gaining confidence and gradually tackling more complex components. This iterative approach helped me deepen my understanding of Rust’s low-level capabilities and its suitability for systems programming.

This strategy helped me avoid being overwhelmed and kept the project moving forward, even when encountering unexpected challenges.


Challenges

Porting c-blosc2 to Rust was not without its hurdles.

  • Asynchronous Logic: One of the biggest challenges was implementing asynchronous functionality to avoid holding large datasets entirely in memory. I needed to manage disk pointers effectively while maintaining performance.
  • Underspecified Protocol: Certain aspects of the protocol were insufficiently documented or ambiguous, leaving me to rely on experimentation and educated guesses. This is what ultimately made me decide against going for a full feature parity.

Core Technical Elements

At the heart of the project were several key algorithms and data structures that powered the compression process.

  • Algorithms & Data Structures: I dug into the core principles behind the compression methods, particularly shuffling, delta encoding, and the management of compressed data blocks. Understanding these algorithms at a deeper level helped me translate them into Rust efficiently.

The code snippets below illustrate the specific data structures that play a role in the encoding process. Writing these out made it easier for me to understand and communicate the nuances of compression and file handling.

/// Fixed-size chunk header; multi-byte fields are little-endian.
#[derive(Debug)]
struct BloscHeader {
    version: u8,    // Blosc format version
    versionlz: u8,  // codec format version
    flags: u8,      // shuffle, delta and codec flags
    typesize: u8,   // size in bytes of one data element
    nbytes: u32,    // uncompressed size of the chunk
    blocksize: u32, // uncompressed size of each block
    cbytes: u32,    // compressed size, header included
}

/// One compressed block plus its start offset within the chunk.
#[derive(Debug)]
struct BloscBlock {
    start: u32,
    compressed_data: Vec<u8>,
}

/// A chunk: header, block start offsets, and the blocks themselves.
#[derive(Debug)]
struct BloscChunk {
    header: BloscHeader,
    bstarts: Vec<u32>,
    blocks: Vec<BloscBlock>,
}


// possible multidimensional data implementation
#[derive(Debug, Deserialize, Serialize)]
struct B2ndMetadata {
    version: u8,
    shape: Vec<usize>,
    chunk_shape: Vec<usize>,
    dtype: String,
    codec: String,
    other_metadata: serde_json::Value,
}

#[derive(Debug, PartialEq)]
pub enum FrameType {
    Contiguous,
    Sparse,
    Reserved,
    UserDefined(u8), // For user-defined types 4 to 7
}

pub fn decode_frame_type(flags: u8) -> FrameType {
    // The frame type is stored in the low nibble of the flags byte.
    match flags & 0x0F {
        0 => FrameType::Contiguous,
        1 => FrameType::Sparse,
        2..=3 => FrameType::Reserved,
        4..=7 => FrameType::UserDefined(flags & 0x0F),
        _ => FrameType::Reserved, // 8..=15: reserved for future use
    }
}
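To tie the structures together, here is a sketch of parsing the fixed 16-byte chunk header from a byte buffer. The header struct is repeated so the snippet compiles on its own, and `parse_header` is my own illustrative helper, not a c-blosc2 API:

```rust
/// Mirrors the BloscHeader struct shown earlier.
#[derive(Debug)]
struct BloscHeader {
    version: u8,
    versionlz: u8,
    flags: u8,
    typesize: u8,
    nbytes: u32,
    blocksize: u32,
    cbytes: u32,
}

/// Parse the fixed 16-byte chunk header; all multi-byte fields are
/// little-endian. Returns None if the slice is too short.
fn parse_header(buf: &[u8]) -> Option<BloscHeader> {
    if buf.len() < 16 {
        return None;
    }
    let le = |b: &[u8]| u32::from_le_bytes([b[0], b[1], b[2], b[3]]);
    Some(BloscHeader {
        version: buf[0],
        versionlz: buf[1],
        flags: buf[2],
        typesize: buf[3],
        nbytes: le(&buf[4..8]),
        blocksize: le(&buf[8..12]),
        cbytes: le(&buf[12..16]),
    })
}

fn main() {
    // Build a synthetic header: version, versionlz, flags, typesize…
    let mut buf = vec![2u8, 1, 0x11, 4];
    buf.extend_from_slice(&1024u32.to_le_bytes()); // nbytes
    buf.extend_from_slice(&256u32.to_le_bytes());  // blocksize
    buf.extend_from_slice(&512u32.to_le_bytes());  // cbytes
    let header = parse_header(&buf).expect("16-byte header");
    assert_eq!(header.nbytes, 1024);
    assert_eq!(header.typesize, 4);
    println!("{:?}", header);
}
```

Returning `Option` (or `Result` in a real crate) instead of indexing blindly is what makes this kind of binary parsing pleasant in Rust: a truncated file surfaces as a value you must handle, not a crash.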


End Result

After months of work, the project reached a workable outcome, though it didn’t fully replicate every feature of c-blosc2:

  • Thin Wrapper Around lz4_flex: The final product was a thin wrapper around the lz4_flex library. This solution replicated some of c-blosc2’s key data primitives, but not all of them.
  • Feasibility of Full Implementation: The full implementation wasn’t possible due to several challenges: the underspecified protocol, incomplete documentation, and compatibility issues with the possible deployment environments on the synchrotron machines. These constraints led me to focus on a lighter, more manageable solution.
  • Popularity of c-blosc1: Despite the advantages of c-blosc2, the original c-blosc remains the more popular option. This observation influenced my thinking about long-term viability and ease of adoption in the community.

While the project didn’t achieve everything I initially hoped for, it provided valuable lessons and insights into systems programming, compression, and the Rust ecosystem.

What I Liked About Rust in This Use Case

As expected, Rust proved to be an excellent fit for systems programming, especially when dealing with compression and memory management:

  • Existing Ecosystem: The serde and lz4_flex crates made this far simpler than starting from scratch.
  • Clear Error Handling: One of the standout features was Rust’s explicit error handling, which helped in debugging. Being able to catch errors early and manage them with Result and Option types made the development flow smoother.
  • Explicit Data Types: The explicit nature of Rust’s primitive types, like u8 and u64, was invaluable when working with binary files and offsets. These clear, well-defined types made my code more predictable and easier to maintain, especially when translating low-level C code.
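As an example of how Result and explicit integer widths work together when reading binary offsets, here is a small illustrative helper (`read_u64_at` is hypothetical, not part of any crate):

```rust
/// Hypothetical helper: read a little-endian u64 offset at `pos`,
/// failing loudly instead of silently reading past the buffer.
fn read_u64_at(buf: &[u8], pos: usize) -> Result<u64, String> {
    let bytes = buf
        .get(pos..pos + 8)
        .ok_or_else(|| format!("offset {} out of bounds", pos))?;
    // The slice is exactly 8 bytes here, so the conversion cannot fail.
    Ok(u64::from_le_bytes(bytes.try_into().unwrap()))
}

fn main() {
    let buf = 42u64.to_le_bytes();
    assert_eq!(read_u64_at(&buf, 0), Ok(42));
    assert!(read_u64_at(&buf, 4).is_err()); // truncated read caught early
}
```

Compared to C, where an out-of-bounds read is undefined behavior, the `?`-propagated error makes the failure mode explicit at every call site.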

Rust’s approach to these fundamental issues was a breath of fresh air and made the implementation much more robust.


Unexpected Lessons Learned

Beyond the technical challenges specific to porting a compression library, this project revealed some general programming lessons:

  • Compression Algorithms Comparison: While working with c-blosc2, I compared it to other popular compression algorithms like zstd and zlib. This comparison gave me a deeper understanding of the strengths and trade-offs of each, helping me appreciate the nuances of choosing the right tool for the job.
  • Existence of Architecture-Specific C Code: I had not realized that C code dedicated to SIMD (single instruction, multiple data) needs to be architecture specific. I also learned that Rust aims to deliver an architecture-agnostic alternative with portable-simd.
  • Linux Kernel Features: I also gained insights into the role that Linux kernel versions play in system performance, particularly features like io_uring, which is crucial for asynchronous I/O in performance-driven projects. This knowledge was further enriched by exploring projects like DataDog's glommio crate, which leverages these features for efficiency.
  • Resource Constraints: Finally, I had to rethink my approach to system resources. In application development, it's easy to take resources for granted, but this project pushed me to be more mindful of memory usage, disk I/O, and computational limits. This has given me a more holistic view of how software interacts with hardware.

These lessons extended far beyond the specifics of file compression and broadened my understanding of systems programming as a whole.


Future Work and Conclusions

This project has sparked new interest in low-level Rust programming, and I’m eager to continue exploring this domain:

  • Networking and Low-Level Systems: Given the parallels between data structures in networking (e.g., headers and blobs) and filesystems, I plan to look into networking applications using Rust. This will likely involve parsing network packets and handling raw TCP/UDP data.
  • Increased Comfort with Binary Parsing: I now feel much more comfortable with parsing binary files and working with custom encodings, headers, and flags. This experience has been invaluable for developing an approach to handling complex data formats.
  • Looking Forward: Overall, the project has enhanced my understanding of both Rust and systems programming. I look forward to taking on more challenging projects, refining my skills, and exploring further into the world of low-level software development.

This experience has opened up new opportunities for growth, and I’m excited to see where it takes me next.