Text search library in Rust, day 3


I’m rewriting my GNETextSearch C library in Rust.


How to generate Rust bindings for GNETextSearch

After getting GNETextSearch to compile correctly using cc-rs, my next task was to generate Rust bindings for the C interface of GNETextSearch. After generating the Rust bindings and exposing them in the text-search-sys crate, I’ll consume them in text-search.

The common way to generate Rust bindings for C code is via the bindgen crate. bindgen automatically generates Rust Foreign Function Interface (FFI) bindings to C and some C++ code. Automatically generating the bindings can save a lot of error-prone and repetitive typing.

/* C header file */

typedef struct CoolStruct {
  int x;
  int y;
} CoolStruct;

void cool_function(int i, char c, CoolStruct* cs);

/* automatically generated by rust-bindgen */

#[repr(C)]
pub struct CoolStruct {
  pub x: ::std::os::raw::c_int,
  pub y: ::std::os::raw::c_int,
}

extern "C" {
  pub fn cool_function(i: ::std::os::raw::c_int,
                       c: ::std::os::raw::c_char,
                       cs: *mut CoolStruct);
}

(Example from the bindgen User Guide.)
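To make the generated declarations a bit more concrete, here is a minimal sketch of how Rust code might call such a binding. Because there is no C library to link against in a standalone snippet, `cool_function` is given a hypothetical Rust body as a stand-in for the C implementation, and `fill_cool_struct` is an illustrative helper that doesn’t come from bindgen.

```rust
use std::os::raw::{c_char, c_int};

/// Same layout as the C `CoolStruct`, mirroring what bindgen emits.
#[repr(C)]
#[derive(Debug, Default)]
pub struct CoolStruct {
    pub x: c_int,
    pub y: c_int,
}

/// Hypothetical stand-in for the C implementation of `cool_function`.
/// With real bindings the body lives in C and Rust only sees the
/// `extern "C"` declaration.
///
/// # Safety
/// `cs` must be null or point to a valid `CoolStruct`.
pub unsafe extern "C" fn cool_function(i: c_int, c: c_char, cs: *mut CoolStruct) {
    // Assumed behavior, purely for illustration: store the arguments.
    if let Some(cs) = unsafe { cs.as_mut() } {
        cs.x = i;
        cs.y = c_int::from(c);
    }
}

/// Illustrative safe helper that hides the raw-pointer call from callers.
pub fn fill_cool_struct(i: i32, c: u8) -> CoolStruct {
    let mut cs = CoolStruct::default();
    // `&mut cs` coerces to `*mut CoolStruct`, the type the binding expects.
    unsafe { cool_function(i, c as c_char, &mut cs) };
    cs
}
```

The `#[repr(C)]` attribute is what guarantees the struct has the same field order and padding as its C counterpart, which is why the pointer can be handed across the language boundary.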

Configuring bindgen to run during build

I wanted to generate the Rust bindings during the build phase of text-search-sys and save them as src/lib.rs inside that crate. That way, the public interface of the text-search-sys crate would be exactly the generated Rust bindings for GNETextSearch.

The first step was adding bindgen as a build dependency of text-search-sys.

[build-dependencies]
cc = "1.0"
bindgen = "0.52"

The second step was adding a function to build.rs to generate the bindings after building GNETextSearch.

fn generate_bindings(project_dir: &PathBuf, src: &PathBuf) {
  let header = string_from_path(src, Some("GNETextSearch.h"));
  let include_root = format!("-I{}", string_from_path(src, None));
  let include_set = format!("-I{}", string_from_path(src, Some("Set")));
  let include_tree = format!("-I{}", string_from_path(src, Some("Tree")));

  let bindings = bindgen::Builder::default()
    .header(header)
    .clang_arg(include_root)
    .clang_arg(include_set)
    .clang_arg(include_tree)
    .raw_line(
      "#![allow(non_upper_case_globals, non_snake_case, non_camel_case_types, improper_ctypes)]",
    )
    .generate()
    .expect("Unable to generate bindings");

  let lib_rs_path = project_dir.join("src/lib.rs");
  bindings
    .write_to_file(lib_rs_path)
    .expect("Unable to write bindings");
}

fn string_from_path(root: &PathBuf, subpath: Option<&str>) -> String {
  // Append the optional subpath to the root directory.
  let path = match subpath {
    Some(subpath) => root.join(subpath),
    None => root.to_path_buf(),
  };
  // Panics if the path is not valid UTF-8.
  path
    .into_os_string()
    .into_string()
    .expect("Path is not valid UTF-8")
}

Using bindgen::Builder to generate the bindings was pretty straightforward. First, I passed the path to the C library’s umbrella header to the builder via header(). Then, I built the necessary include path flags and passed each one to the builder via clang_arg(). Finally, to silence warnings about names that don’t follow Rust naming conventions, I used raw_line() to put an #![allow(...)] lint attribute at the top of the generated file. After generating the bindings with generate(), I saved them to src/lib.rs inside the text-search-sys crate.
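The include flag step could also be collapsed into a single helper. The sketch below is one possible shape, using only the standard library; `include_flags` is a hypothetical name, not something bindgen or GNETextSearch provides.

```rust
use std::path::Path;

/// Builds one `-I<dir>` flag for `src` itself and one for each subdirectory.
/// Hypothetical helper for illustration only.
pub fn include_flags(src: &Path, subdirs: &[&str]) -> Vec<String> {
    std::iter::once(src.to_path_buf())
        .chain(subdirs.iter().map(|dir| src.join(dir)))
        // `display()` replaces invalid UTF-8 instead of panicking, which is
        // a different trade-off than unwrapping the conversion.
        .map(|path| format!("-I{}", path.display()))
        .collect()
}
```

Calling `include_flags(src, &["Set", "Tree"])` would produce the same three flags as the build script above. As far as I know, bindgen’s builder also has a `clang_args()` method that accepts an iterator of strings, so the resulting vector could be passed in one call instead of three.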

Building the workspace with cargo build generated text-search/text-search-sys/src/lib.rs. The resulting file contained more than 11,000 lines of generated Rust code. I wouldn’t recommend trying to read through it all. However, searching for “tsearch” revealed the specific bindings created by bindgen for GNETextSearch.

Verifying the bindings work

So, bindgen appeared to work. It certainly generated a lot of code. The big question was whether GNETextSearch could actually be used from Rust. To verify that it could, I added a test to text-search/text-search/src/lib.rs.

use text_search_sys;

#[cfg(test)]
mod tests {
  use super::*;

  #[test]
  fn can_use_countedset() {
    let count = unsafe {
      let set = text_search_sys::tsearch_countedset_init();
      text_search_sys::tsearch_countedset_add_int(set, 1);
      text_search_sys::tsearch_countedset_add_int(set, 1);
      text_search_sys::tsearch_countedset_add_int(set, 2);
      let count = text_search_sys::tsearch_countedset_get_count(set);
      text_search_sys::tsearch_countedset_free(set);
      count
    };
    assert_eq!(2, count);
  }
}

The test used the counted set type from GNETextSearch, which is loosely modeled after NSCountedSet. It’s a data structure similar to a normal set, except that it also keeps track of the number of times a given integer has been added to it. This data structure is used by GNETextSearch to keep track of the IDs of documents containing entries in the ternary tree and how often a given entry occurs in each document.
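For readers who haven’t run into counted sets before, here is a small pure-Rust model of the behavior described above. It is only an illustration built on `HashMap`; it is not how GNETextSearch implements its counted set, and the type and method names are made up for this sketch.

```rust
use std::collections::HashMap;

/// Illustrative model of a counted set of integer IDs.
/// Not the GNETextSearch implementation.
#[derive(Debug, Default)]
pub struct CountedSet {
    counts: HashMap<i64, usize>,
}

impl CountedSet {
    pub fn new() -> Self {
        Self::default()
    }

    /// Records one more occurrence of `value`.
    pub fn add(&mut self, value: i64) {
        *self.counts.entry(value).or_insert(0) += 1;
    }

    /// Number of distinct values stored.
    pub fn len(&self) -> usize {
        self.counts.len()
    }

    /// How many times `value` has been added (0 if it never was).
    pub fn count_for(&self, value: i64) -> usize {
        self.counts.get(&value).copied().unwrap_or(0)
    }
}
```

Adding 1, 1, and 2 to this model leaves two distinct values in the set, with 1 recorded twice and 2 recorded once.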

The test was pretty simple. I initialized a counted set, added 1, 1, and 2 to it, and then verified that the number of distinct integers stored inside the counted set was 2 (remember: this is a set, so only one copy of a duplicate value is stored). I needed to wrap the calls to the generated bindings inside of unsafe { } because calling a foreign function is always unsafe in Rust: the compiler can’t verify that GNETextSearch uses memory safely. Running cargo test in the workspace directory succeeded, showing the bindings were generated properly and could be used.
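The init and free pair in the test is exactly the kind of thing the compiler can’t check: nothing stops a caller from forgetting the free or using the pointer afterwards. One common way to contain that risk is to let a Rust type own the pointer and release it in `Drop`. The sketch below shows the pattern under loud assumptions: `RawSet` and the `raw_*` functions are hypothetical Rust stand-ins for the C type and functions, and their signatures are not the ones bindgen generated.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the opaque C counted set type.
pub struct RawSet {
    counts: HashMap<i64, usize>,
}

// Stand-ins for the C functions; the real ones live in GNETextSearch.
unsafe fn raw_init() -> *mut RawSet {
    Box::into_raw(Box::new(RawSet { counts: HashMap::new() }))
}

unsafe fn raw_add_int(set: *mut RawSet, value: i64) {
    let set = unsafe { &mut *set };
    *set.counts.entry(value).or_insert(0) += 1;
}

unsafe fn raw_get_count(set: *const RawSet) -> usize {
    let set = unsafe { &*set };
    set.counts.len()
}

unsafe fn raw_free(set: *mut RawSet) {
    drop(unsafe { Box::from_raw(set) });
}

/// Safe owner of the raw pointer: valid from `new` until it is dropped.
pub struct OwnedCountedSet {
    raw: *mut RawSet,
}

impl OwnedCountedSet {
    pub fn new() -> Self {
        OwnedCountedSet { raw: unsafe { raw_init() } }
    }

    pub fn add(&mut self, value: i64) {
        unsafe { raw_add_int(self.raw, value) }
    }

    /// Number of distinct values stored.
    pub fn count(&self) -> usize {
        unsafe { raw_get_count(self.raw) }
    }
}

impl Drop for OwnedCountedSet {
    // Runs exactly once, so the set can't be leaked or freed twice.
    fn drop(&mut self) {
        unsafe { raw_free(self.raw) }
    }
}
```

With a wrapper like this, the unsafe blocks live in one place and callers only see safe methods.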

The code for this post can be found here.

Next, I’ll need to wrap tsearch_countedset and tsearch_ternarytree in ergonomic, memory-safe Rust types.