# StringZilla 🦖

The world wastes a minimum of $100M annually due to inefficient string operations.
A typical codebase processes strings character by character, resulting in too many branches and data dependencies and neglecting 90% of a modern CPU's potential.
LibC is different.
It attempts to leverage SIMD instructions to boost some operations, and is often used by higher-level languages, runtimes, and databases.
But it isn't perfect.
1️⃣ First, even on common hardware, including over a billion 64-bit ARM CPUs, common functions like `strstr` and `memmem` only achieve 1/3 of the CPU's throughput.
2️⃣ Second, SIMD coverage is inconsistent: acceleration in forward scans does not guarantee speed in reverse-order search.
3️⃣ Finally, most high-level languages can't always use LibC, as their strings are often not NULL-terminated or may contain the Unicode "Zero" character in the middle of the string.
That's why StringZilla was created: to provide predictably high performance, portable to any modern platform, operating system, and programming language.

StringZilla is the GodZilla of string libraries, using [SIMD][faq-simd] and [SWAR][faq-swar] to accelerate string operations on modern CPUs.
It is up to __10x faster than the default and even other SIMD-accelerated string libraries__ in C, C++, Python, and other languages, while covering broad functionality.
It __accelerates exact and fuzzy string matching, edit distance computations, sorting, lazily-evaluated ranges to avoid memory allocations, and even random-string generators__.
[faq-simd]: https://en.wikipedia.org/wiki/Single_instruction,_multiple_data
[faq-swar]: https://en.wikipedia.org/wiki/SWAR
- __[C](#basic-usage-with-c-99-and-newer):__ Upgrade LibC's `<string.h>` to `<stringzilla/stringzilla.h>` in C 99
- __[C++](#basic-usage-with-c-11-and-newer):__ Upgrade STL's `<string>` to `<stringzilla/stringzilla.hpp>` in C++ 11
- __[CUDA](#cuda):__ Process in bulk with `<stringzillas/stringzillas.cuh>` in CUDA C++ 17
- __[Python](#quick-start-python):__ Upgrade your `str` to a faster `Str`
- __[Rust](#quick-start-rust):__ Use the `StringZilla` traits crate
- __[Go](#quick-start-golang):__ Use the `StringZilla` cGo module
- __[Swift](#quick-start-swift):__ Use the `String+StringZilla` extension
- __[JavaScript](#quick-start-javascript):__ Use the `StringZilla` library
- __[Shell][faq-shell]:__ Accelerate common CLI tools with the `sz_` prefix
- Researcher? Jump to [Algorithms & Design Decisions](#algorithms--design-decisions)
- Thinking of contributing? Look for ["good first issues"][first-issues]
- And check the [guide](https://github.com/ashvardanian/StringZilla/blob/main/CONTRIBUTING.md) to set up the environment
- Want more bindings or features? Let [me](https://github.com/ashvardanian) know!
[faq-shell]: https://github.com/ashvardanian/StringZilla/blob/main/cli/README.md
[first-issues]: https://github.com/ashvardanian/StringZilla/issues
__Who is this for?__
- For data-engineers parsing large datasets, like the [CommonCrawl](https://commoncrawl.org/), [RedPajama](https://github.com/togethercomputer/RedPajama-Data), or [LAION](https://laion.ai/blog/laion-5b/).
- For software engineers optimizing strings in their apps and services.
- For bioinformaticians and search engineers looking for edit-distances for [USearch](https://github.com/unum-cloud/usearch).
- For [DBMS][faq-dbms] devs, optimizing `LIKE`, `ORDER BY`, and `GROUP BY` operations.
- For hardware designers, needing a SWAR baseline for string-processing functionality.
- For students studying SIMD/SWAR applications to non-data-parallel operations.
[faq-dbms]: https://en.wikipedia.org/wiki/Database
## Performance
<table style="width: 100%; text-align: center; table-layout: fixed;">
<colgroup>
<col style="width: 25%;">
<col style="width: 25%;">
<col style="width: 25%;">
<col style="width: 25%;">
</colgroup>
<tr>
<th align="center">C</th>
<th align="center">C++</th>
<th align="center">Python</th>
<th align="center">StringZilla</th>
</tr>
<!-- Substrings, normal order -->
<tr>
<td colspan="4" align="center">find the first occurrence of a random word from text, ≅ 5 bytes long</td>
</tr>
<tr>
<td align="center">
<code>strstr</code> <sup>1</sup><br/>
<span style="color:#ABABAB;">x86:</span> <b>7.4</b> ·
<span style="color:#ABABAB;">arm:</span> <b>2.0</b> GB/s
</td>
<td align="center">
<code>.find</code><br/>
<span style="color:#ABABAB;">x86:</span> <b>2.9</b> ·
<span style="color:#ABABAB;">arm:</span> <b>1.6</b> GB/s
</td>
<td align="center">
<code>.find</code><br/>
<span style="color:#ABABAB;">x86:</span> <b>1.1</b> ·
<span style="color:#ABABAB;">arm:</span> <b>0.6</b> GB/s
</td>
<td align="center">
<code>sz_find</code><br/>
<span style="color:#ABABAB;">x86:</span> <b>10.6</b> ·
<span style="color:#ABABAB;">arm:</span> <b>7.1</b> GB/s
</td>
</tr>
<!-- Substrings, reverse order -->
<tr>
<td colspan="4" align="center">find the last occurrence of a random word from text, ≅ 5 bytes long</td>
</tr>
<tr>
<td align="center">⚪</td>
<td align="center">
<code>.rfind</code><br/>
<span style="color:#ABABAB;">x86:</span> <b>0.5</b> ·
<span style="color:#ABABAB;">arm:</span> <b>0.4</b> GB/s
</td>
<td align="center">
<code>.rfind</code><br/>
<span style="color:#ABABAB;">x86:</span> <b>0.9</b> ·
<span style="color:#ABABAB;">arm:</span> <b>0.5</b> GB/s
</td>
<td align="center">
<code>sz_rfind</code><br/>
<span style="color:#ABABAB;">x86:</span> <b>10.8</b> ·
<span style="color:#ABABAB;">arm:</span> <b>6.7</b> GB/s
</td>
</tr>
<!-- Characters, normal order -->
<tr>
<td colspan="4" align="center">split lines separated by <code>\n</code> or <code>\r</code> <sup>2</sup></td>
</tr>
<tr>
<td align="center">
<code>strcspn</code> <sup>1</sup><br/>
<span style="color:#ABABAB;">x86:</span> <b>5.42</b> ·
<span style="color:#ABABAB;">arm:</span> <b>2.19</b> GB/s
</td>
<td align="center">
<code>.find_first_of</code><br/>
<span style="color:#ABABAB;">x86:</span> <b>0.59</b> ·
<span style="color:#ABABAB;">arm:</span> <b>0.46</b> GB/s
</td>
<td align="center">
<code>re.finditer</code><br/>
<span style="color:#ABABAB;">x86:</span> <b>0.06</b> ·
<span style="color:#ABABAB;">arm:</span> <b>0.02</b> GB/s
</td>
<td align="center">
<code>sz_find_byteset</code><br/>
<span style="color:#ABABAB;">x86:</span> <b>4.08</b> ·
<span style="color:#ABABAB;">arm:</span> <b>3.22</b> GB/s
</td>
</tr>
<!-- Characters, reverse order -->
<tr>
<td colspan="4" align="center">find the last occurrence of any of 6 whitespaces <sup>2</sup></td>
</tr>
<tr>
<td align="center">⚪</td>
<td align="center">
<code>.find_last_of</code><br/>
<span style="color:#ABABAB;">x86:</span> <b>0.25</b> ·
<span style="color:#ABABAB;">arm:</span> <b>0.25</b> GB/s
</td>
<td align="center">⚪</td>
<td align="center">
<code>sz_rfind_byteset</code><br/>
<span style="color:#ABABAB;">x86:</span> <b>0.43</b> ·
<span style="color:#ABABAB;">arm:</span> <b>0.23</b> GB/s
</td>
</tr>
<!-- Random Generation -->
<tr>
<td colspan="4" align="center">Random string from a given alphabet, 20 bytes long <sup>5</sup></td>
</tr>
<tr>
<td align="center">
<code>rand() % n</code><br/>
<span style="color:#ABABAB;">x86:</span> <b>18.0</b> ·
<span style="color:#ABABAB;">arm:</span> <b>9.4</b> MB/s
</td>
<td align="center">
<code>std::uniform_int_distribution</code><br/>
<span style="color:#ABABAB;">x86:</span> <b>47.2</b> ·
<span style="color:#ABABAB;">arm:</span> <b>20.4</b> MB/s
</td>
<td align="center">
<code>join(random.choices(...))</code><br/>
<span style="color:#ABABAB;">x86:</span> <b>13.3</b> ·
<span style="color:#ABABAB;">arm:</span> <b>5.9</b> MB/s
</td>
<td align="center">
<code>sz_fill_random</code><br/>
<span style="color:#ABABAB;">x86:</span> <b>56.2</b> ·
<span style="color:#ABABAB;">arm:</span> <b>25.8</b> MB/s
</td>
</tr>
<!-- Mapping characters with lookup table transforms -->
<tr>
<td colspan="4" align="center">Mapping characters with lookup table transforms</td>
</tr>
<tr>
<td align="center">⚪</td>
<td align="center">
<code>std::transform</code><br/>
<span style="color:#ABABAB;">x86:</span> <b>3.81</b> ·
<span style="color:#ABABAB;">arm:</span> <b>2.65</b> GB/s
</td>
<td align="center">
<code>str.translate</code><br/>
<span style="color:#ABABAB;">x86:</span> <b>260.0</b> ·
<span style="color:#ABABAB;">arm:</span> <b>140.0</b> MB/s
</td>
<td align="center">
<code>sz_lookup</code><br/>
<span style="color:#ABABAB;">x86:</span> <b>21.2</b> ·
<span style="color:#ABABAB;">arm:</span> <b>8.5</b> GB/s
</td>
</tr>
<!-- Sorting -->
<tr>
<td colspan="4" align="center">Get sorted order, ≅ 8 million English words <sup>6</sup></td>
</tr>
<tr>
<td align="center">
<code>qsort_r</code><br/>
<span style="color:#ABABAB;">x86:</span> <b>3.55</b> ·
<span style="color:#ABABAB;">arm:</span> <b>5.77</b> s
</td>
<td align="center">
<code>std::sort</code><br/>
<span style="color:#ABABAB;">x86:</span> <b>2.79</b> ·
<span style="color:#ABABAB;">arm:</span> <b>4.02</b> s
</td>
<td align="center">
<code>numpy.argsort</code><br/>
<span style="color:#ABABAB;">x86:</span> <b>7.58</b> ·
<span style="color:#ABABAB;">arm:</span> <b>13.00</b> s
</td>
<td align="center">
<code>sz_sequence_argsort</code><br/>
<span style="color:#ABABAB;">x86:</span> <b>1.91</b> ·
<span style="color:#ABABAB;">arm:</span> <b>2.37</b> s
</td>
</tr>
<!-- Edit Distance -->
<tr>
<td colspan="4" align="center">Levenshtein edit distance, text lines ≅ 100 bytes long</td>
</tr>
<tr>
<td align="center">⚪</td>
<td align="center">⚪</td>
<td align="center">
via <code>NLTK</code> <sup>3</sup> and <code>CuDF</code><br/>
<span style="color:#ABABAB;">x86:</span> <b>1,615,306</b> ·
<span style="color:#ABABAB;">arm:</span> <b>1,349,980</b> ·
<span style="color:#ABABAB;">cuda:</span> <b>6,532,411,354</b> CUPS
</td>
<td align="center">
<code>szs_levenshtein_distances_t</code><br/>
<span style="color:#ABABAB;">x86:</span> <b>3,434,427,548</b> ·
<span style="color:#ABABAB;">arm:</span> <b>1,605,340,403</b> ·
<span style="color:#ABABAB;">cuda:</span> <b>93,662,026,653</b> CUPS
</td>
</tr>
<!-- Alignment Score -->
<tr>
<td colspan="4" align="center">Needleman-Wunsch alignment scores, proteins ≅ 1 K amino acids long</td>
</tr>
<tr>
<td align="center">⚪</td>
<td align="center">⚪</td>
<td align="center">
via <code>biopython</code> <sup>4</sup><br/>
<span style="color:#ABABAB;">x86:</span> <b>575,981,513</b> ·
<span style="color:#ABABAB;">arm:</span> <b>436,350,732</b> CUPS
</td>
<td align="center">
<code>szs_needleman_wunsch_scores_t</code><br/>
<span style="color:#ABABAB;">x86:</span> <b>452,629,942</b> ·
<span style="color:#ABABAB;">arm:</span> <b>520,170,239</b> ·
<span style="color:#ABABAB;">cuda:</span> <b>9,017,327,818</b> CUPS
</td>
</tr>
</table>
StringZilla has a lot of functionality, most of which is covered by benchmarks across C, C++, Python and other languages.
You can find those in the `./scripts` directory, with usage notes listed in the [`CONTRIBUTING.md`](CONTRIBUTING.md) file.
Notably, if the CPU supports misaligned loads, even the 64-bit SWAR backends are faster than either standard library.
> Most benchmarks were conducted on a 1 GB English text corpus, with an average word length of 6 characters.
> The code was compiled with GCC 12, using `glibc` v2.35.
> The benchmarks were performed on Arm-based Graviton3 AWS `c7g` instances and on Intel Sapphire Rapids-based `r7iz` instances.
> Most modern Arm-based 64-bit CPUs will have similar relative speedups.
> Variance within x86 CPUs will be larger.
> For CUDA benchmarks, the Nvidia H100 GPUs were used.
> <sup>1</sup> Unlike other libraries, LibC requires strings to be NULL-terminated.
> <sup>2</sup> Six whitespaces in the ASCII set are: ` \t\n\v\f\r`. Python's and other standard libraries have specialized functions for those.
> <sup>3</sup> Most Python libraries for strings are also implemented in C.
> <sup>4</sup> Unlike the rest of BioPython, the alignment score computation is [implemented in C](https://github.com/biopython/biopython/blob/master/Bio/Align/_pairwisealigner.c).
> <sup>5</sup> All modulo operations were conducted with `uint8_t` to allow compilers more optimization opportunities.
> The C++ STL and StringZilla benchmarks used a 64-bit [Mersenne Twister][faq-mersenne-twister] as the generator.
> For C, C++, and StringZilla, an in-place update of the string was used.
> In Python every string had to be allocated as a new object, which makes it less fair.
> <sup>6</sup> Contrary to popular opinion, Python's default `sorted` function works faster than the C and C++ standard libraries.
> That holds for large lists or tuples of strings, but fails as soon as you need more complex logic, like sorting dictionaries by a string key, or producing the "sorted order" permutation.
> The latter is very common in database engines and is most similar to `numpy.argsort`.
> The current StringZilla solution can be at least 4x faster without loss of generality.
[faq-mersenne-twister]: https://en.wikipedia.org/wiki/Mersenne_Twister
## Functionality
StringZilla is compatible with most modern CPUs, and provides a broad range of functionality.
It is split into two layers:
1. StringZilla: a single-header C library and a C++ wrapper for high-performance string operations.
2. StringZillas: parallel backends for CPUs, GPUs, and other accelerators, used for large-batch operations.
Having a second C++/CUDA layer greatly simplifies the implementation of similarity scoring and fingerprinting functions, which would otherwise require too much error-prone boilerplate code in pure C.
Both layers are designed to be extremely portable:
- [x] across both little-endian and big-endian architectures.
- [x] across 32-bit and 64-bit hardware architectures.
- [x] across operating systems and compilers.
- [x] across ASCII and UTF-8 encoded inputs.
Not all features are available across all bindings.
Consider contributing if you need a feature that's not yet implemented.
| | Maturity | C | C++ | Python | Rust | JS | Swift | Go |
| :----------------------------- | :------: | :---: | :---: | :----: | :---: | :---: | :---: | :---: |
| Substring Search               |    🌳    |   ✅   |   ✅   |   ✅    |   ✅   |   ✅   |   ✅   |   ✅   |
| Character Set Search           |    🌳    |   ✅   |   ✅   |   ✅    |   ✅   |   ✅   |   ✅   |   ✅   |
| Sorting & Sequence Operations  |    🌳    |   ✅   |   ✅   |   ✅    |   ✅   |   ⚪   |   ⚪   |   ⚪   |
| Streaming Hashes               |    🌳    |   ✅   |   ✅   |   ✅    |   ✅   |   ✅   |   ✅   |   ✅   |
| Small String Class             |    🧐    |   ✅   |   ✅   |   ❌    |   ⚪   |   ❌   |   ❌   |   ❌   |
| Lazy Ranges, Compressed Arrays |    🌳    |   ❌   |   ✅   |   ✅    |   ✅   |   ❌   |   ⚪   |   ⚪   |
|                                |          |       |       |        |       |       |       |       |
| Parallel Similarity Scoring    |    🌳    |   ✅   |   ✅   |   ✅    |   ✅   |   ⚪   |   ⚪   |   ⚪   |
| Parallel Rolling Fingerprints  |    🌳    |   ✅   |   ✅   |   ✅    |   ✅   |   ⚪   |   ⚪   |   ⚪   |

> 🌳 parts are used in production.
> 🧐 parts are in beta.
> 🚧 parts are under active development, and are likely to break in subsequent releases.
> ✅ are implemented.
> ⚪ are considered.
> ❌ are not intended.
## Quick Start: Python
Python bindings are available on PyPI for Python 3.8+, and can be installed with `pip`.
```bash
pip install stringzilla # for serial algorithms
pip install stringzillas-cpus # for parallel multi-CPU backends
pip install stringzillas-cuda # for parallel Nvidia GPU backend
```
You can immediately check the installed version and the used hardware capabilities with the following commands:
```bash
python -c "import stringzilla; print(stringzilla.__version__)"
python -c "import stringzillas; print(stringzillas.__version__)"
python -c "import stringzilla; print(stringzilla.__capabilities__)" # for serial algorithms
python -c "import stringzillas; print(stringzillas.__capabilities__)" # for parallel algorithms
```
### Basic Usage
If you've ever used the Python `str`, `bytes`, `bytearray`, or `memoryview` classes, you'll know what to expect.
StringZilla's `Str` class is a hybrid of the above, providing a `str`-like interface to byte arrays.
```python
from stringzilla import Str, File
text_from_str = Str('some-string') # no copies, just a view
text_from_bytes = Str(b'some-array') # no copies, just a view
text_from_file = Str(File('some-file.txt')) # memory-mapped file
import numpy as np
alphabet_array = np.arange(ord("a"), ord("z"), dtype=np.uint8)
text_from_array = Str(memoryview(alphabet_array))
```
The `File` class memory-maps a file from persistent storage without loading its copy into RAM.
The contents of that file would remain immutable, and the mapping can be shared by multiple Python processes simultaneously.
A standard dataset pre-processing use case would be to map a sizable textual dataset like Common Crawl into memory, spawn child processes, and split the job between them.
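As a minimal sketch of that pattern (the file name and worker count here are placeholders), each worker re-maps the same file and walks its own shard of lines:
```python
from multiprocessing import Pool
from stringzilla import Str, File

def count_shard_lines(shard: int, shards: int = 4, path: str = "commoncrawl.txt") -> int:
    text = Str(File(path))              # memory-mapped view, no copy into RAM
    lines = text.split(separator='\n')  # slices into the mapping, not copies
    return len(lines[shard::shards])    # every `shards`-th line belongs to this worker

if __name__ == "__main__":
    with Pool(4) as pool:
        print(sum(pool.starmap(count_shard_lines, [(i,) for i in range(4)])))
```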
### Basic Operations
- Length: `len(text) -> int`
- Indexing: `text[42] -> str`
- Slicing: `text[42:46] -> Str`
- Substring check: `'substring' in text -> bool`
- Hashing: `hash(text) -> int`
- String conversion: `str(text) -> str`
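A short sketch of those operations on the `Str` view from the example above:
```py
from stringzilla import Str

text = Str('some-string')
assert len(text) == 11              # length in bytes
assert text[4] == '-'               # indexing yields a one-character `str`
assert str(text[5:11]) == 'string'  # slicing yields another `Str` view
assert 'string' in text             # substring containment check
assert isinstance(hash(text), int)  # hashable, so usable as a `dict` key
assert str(text) == 'some-string'   # explicit conversion back to `str`
```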
### Advanced Operations
```py
import sys
x: bool = text.contains('substring', start=0, end=sys.maxsize)
x: int = text.find('substring', start=0, end=sys.maxsize)
x: int = text.count('substring', start=0, end=sys.maxsize, allowoverlap=False)
x: str = text.decode(encoding='utf-8', errors='strict')
x: Strs = text.split(separator=' ', maxsplit=sys.maxsize, keepseparator=False)
x: Strs = text.rsplit(separator=' ', maxsplit=sys.maxsize, keepseparator=False)
x: Strs = text.splitlines(keeplinebreaks=False, maxsplit=sys.maxsize)
```
It's important to note that the last function's behavior is slightly different from Python's `str.splitlines`.
The [native version][faq-splitlines] matches `\n`, `\r`, `\v` or `\x0b`, `\f` or `\x0c`, `\x1c`, `\x1d`, `\x1e`, `\x85`, `\r\n`, `\u2028`, `\u2029`, including 3x two-byte-long runes.
The StringZilla version matches only `\n`, `\v`, `\f`, `\r`, `\x1c`, `\x1d`, `\x1e`, `\x85`, avoiding two-byte-long runes.
[faq-splitlines]: https://docs.python.org/3/library/stdtypes.html#str.splitlines
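For example, U+2028 (LINE SEPARATOR) is a line break for CPython but not for StringZilla, which only looks at the byte-level separators listed above. A small sketch, assuming `Str` wraps the UTF-8 bytes of the `str` it is given:
```py
import stringzilla as sz

sample = "first\u2028still the first\nsecond"
assert len(sample.splitlines()) == 3           # CPython also splits on U+2028
assert len(sz.Str(sample).splitlines()) == 2   # StringZilla splits only on the `\n` byte here
```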
### Character Set Operations
Python strings don't natively support character set operations.
This forces people to use regular expressions, which are slow and hard to read.
To avoid the need for `re.finditer`, StringZilla provides the following interfaces:
```py
x: int = text.find_first_of('chars', start=0, end=sys.maxsize)
x: int = text.find_last_of('chars', start=0, end=sys.maxsize)
x: int = text.find_first_not_of('chars', start=0, end=sys.maxsize)
x: int = text.find_last_not_of('chars', start=0, end=sys.maxsize)
x: Strs = text.split_byteset(separator='chars', maxsplit=sys.maxsize, keepseparator=False)
x: Strs = text.rsplit_byteset(separator='chars', maxsplit=sys.maxsize, keepseparator=False)
```
You can also transform the string using Look-Up Tables (LUTs), mapping it to a different character set.
This would result in a copy - `str` for `str` inputs and `bytes` for other types.
```py
x: str = text.translate('chars', {}, start=0, end=sys.maxsize, inplace=False)
x: bytes = text.translate(b'chars', {}, start=0, end=sys.maxsize, inplace=False)
```
For efficiency reasons, pass the LUT as a string or bytes object, not as a dictionary.
This can be useful in high-throughput applications dealing with binary data, including bioinformatics and image processing.
Here is an example:
```py
import stringzilla as sz
look_up_table = bytes(range(256)) # Identity LUT
image = open("/image/path.jpeg", "rb").read()
sz.translate(image, look_up_table, inplace=True)
```
### Hash
Single-shot and incremental hashing are both supported:
```py
import stringzilla as sz
# One-shot - stable 64-bit output across all platforms!
one = sz.hash(b"Hello, world!", seed=42)
# Incremental updates return itself; digest does not consume state
hasher = sz.Hasher(seed=42)
hasher.update(b"Hello, ").update(b"world!")
streamed = hasher.digest() # or `hexdigest()` for a string
assert one == streamed
```
### Collection-Level Operations
Once split into a `Strs` object, you can sort, shuffle, and reorganize the slices with minimal memory footprint.
If all the chunks are located in consecutive memory regions, the memory overhead can be as low as 4 bytes per chunk.
```python
lines: Strs = text.split(separator='\n') # 4 bytes per line overhead for under 4 GB of text
batch: Strs = lines.sample(seed=42) # 10x faster than `random.choices`
lines.shuffle(seed=42) # or shuffle all lines in place and shard with slices
lines_sorted: Strs = lines.sorted() # returns a new Strs in sorted order
order: tuple = lines.argsort() # similar to `numpy.argsort`
```
Working on [RedPajama][redpajama], addressing 20 billion annotated English documents, one will need only 160 GB of RAM instead of terabytes.
Once loaded, the data will be memory-mapped, and can be reused between multiple Python processes without copies.
And of course, you can use slices to navigate the dataset and shard it between multiple workers.
```python
lines[::3] # every third line
lines[1::1] # every odd line
lines[:-100:-1] # last 100 lines in reverse order
```
[redpajama]: https://github.com/togethercomputer/RedPajama-Data
### Iterators and Memory Efficiency
Python's operations like `split()` and `readlines()` immediately materialize a `list` of copied parts.
This can be very memory-inefficient for large datasets.
StringZilla saves a lot of memory by viewing existing memory regions as substrings, but even more memory can be saved by using lazily evaluated iterators.
```py
x: SplitIterator[Str] = text.split_iter(separator=' ', keepseparator=False)
x: SplitIterator[Str] = text.rsplit_iter(separator=' ', keepseparator=False)
x: SplitIterator[Str] = text.split_byteset_iter(separator='chars', keepseparator=False)
x: SplitIterator[Str] = text.rsplit_byteset_iter(separator='chars', keepseparator=False)
```
StringZilla can easily be 10x more memory efficient than native Python classes for tokenization.
With lazy operations, it practically becomes free.
```py
import stringzilla as sz
%load_ext memory_profiler
text = open("enwik9.txt", "r").read() # 1 GB, mean word length 7.73 bytes
%memit text.split() # increment: 8670.12 MiB (152 ms)
%memit sz.split(text) # increment: 530.75 MiB (25 ms)
%memit sum(1 for _ in sz.split_iter(text)) # increment: 0.00 MiB
```
### Low-Level Python API
Aside from calling the methods on the `Str` and `Strs` classes, you can also call the global functions directly on `str` and `bytes` instances.
Assuming StringZilla CPython bindings are implemented [without any intermediate tools like SWIG or PyBind](https://ashvardanian.com/posts/pybind11-cpython-tutorial/), the call latency should be similar to native classes.
```py
import sys
import stringzilla as sz
contains: bool = sz.contains("haystack", "needle", start=0, end=sys.maxsize)
offset: int = sz.find("haystack", "needle", start=0, end=sys.maxsize)
count: int = sz.count("haystack", "needle", start=0, end=sys.maxsize, allowoverlap=False)
```
### Similarity Scores
StringZilla exposes high-performance, batch-oriented similarity via the `stringzillas` module.
Use `DeviceScope` to pick hardware and optionally limit capabilities per engine.
```py
import stringzilla as sz
import stringzillas as szs
cpu_scope = szs.DeviceScope(cpu_cores=4) # force CPU-only
gpu_scope = szs.DeviceScope(gpu_device=0) # pick GPU 0 if available
strings_a = sz.Strs(["kitten", "flaw"])
strings_b = sz.Strs(["sitting", "lawn"])
engine = szs.LevenshteinDistances(
match=0, mismatch=2, # costs don't have to be 1
open=3, extend=1, # may be different in Bio
capabilities=("serial",) # avoid SIMD ๐คญ
)
distances = engine(strings_a, strings_b, device=cpu_scope)
assert int(distances[0]) == 3 and int(distances[1]) == 2
```
Note that this computes byte-level distances.
For UTF-8 codepoints, use a different engine class:
```py
strings_a = sz.Strs(["café", "αβγδ"])
strings_b = sz.Strs(["cafe", "αγδ"])
engine = szs.LevenshteinDistancesUTF8(capabilities=("serial",))
distances = engine(strings_a, strings_b, device=cpu_scope)
assert int(distances[0]) == 1 and int(distances[1]) == 1
```
For alignment scoring, provide a 256×256 substitution matrix using NumPy:
```py
import numpy as np
import stringzilla as sz
import stringzillas as szs
substitution_matrix = np.zeros((256, 256), dtype=np.int8)
substitution_matrix.fill(-1) # mismatch score
np.fill_diagonal(substitution_matrix, 0) # match score
engine = szs.NeedlemanWunsch(substitution_matrix=substitution_matrix, open=1, extend=1)
scores = engine(strings_a, strings_b, device=cpu_scope)
```
Several Python libraries provide edit distance computation.
Most are implemented in C, but they can still be slower than StringZilla on large inputs.
For 100 pairs of proteins, each ~10k characters long:
- [JellyFish](https://github.com/jamesturk/jellyfish): 62.3s
- [EditDistance](https://github.com/roy-ht/editdistance): 32.9s
- StringZilla: __0.8s__
Using the same proteins for Needleman-Wunsch alignment scores:
- [BioPython](https://github.com/biopython/biopython): 25.8s
- StringZilla: __7.8s__
<details>
<summary><b>ยง Example converting from BioPython to StringZilla.</b></summary>
```py
import numpy as np
from Bio import Align
from Bio.Align import substitution_matrices
aligner = Align.PairwiseAligner()
aligner.substitution_matrix = substitution_matrices.load("BLOSUM62")
aligner.open_gap_score = 1
aligner.extend_gap_score = 1
# Convert the matrix to NumPy
subs_packed = np.array(aligner.substitution_matrix).astype(np.int8)
subs_reconstructed = np.zeros((256, 256), dtype=np.int8)
# Initialize all banned characters to the largest possible penalty
subs_reconstructed.fill(127)
for packed_row, packed_row_aminoacid in enumerate(aligner.substitution_matrix.alphabet):
for packed_column, packed_column_aminoacid in enumerate(aligner.substitution_matrix.alphabet):
reconstructed_row = ord(packed_row_aminoacid)
reconstructed_column = ord(packed_column_aminoacid)
subs_reconstructed[reconstructed_row, reconstructed_column] = subs_packed[packed_row, packed_column]
# Let's pick two examples of tripeptides (made of 3 amino acids)
glutathione = "ECG" # Need to rebuild human tissue?
thyrotropin_releasing_hormone = "QHP" # Or to regulate your metabolism?
import stringzilla as sz
import stringzillas as szs
engine = szs.NeedlemanWunsch(substitution_matrix=subs_reconstructed, open=1, extend=1)
score = int(engine(sz.Strs([glutathione]), sz.Strs([thyrotropin_releasing_hormone]))[0])
assert score == aligner.score(glutathione, thyrotropin_releasing_hormone) # Equal to 6
```
</details>
### Rolling Fingerprints
MinHashing is a common technique for Information Retrieval, producing compact representations of large documents.
For $D$ hash-functions and a text of length $L$, in the worst case it involves computing $O(D \cdot L)$ hashes.
```py
import numpy as np
import stringzilla as sz
import stringzillas as szs
texts = sz.Strs([
"quick brown fox jumps over the lazy dog",
"quick brown fox jumped over a very lazy dog",
])
cpu = szs.DeviceScope(cpu_cores=4)
ndim = 1024
window_widths = np.array([4, 6, 8, 10], dtype=np.uint64)
engine = szs.Fingerprints(
ndim=ndim,
window_widths=window_widths, # optional
alphabet_size=256, # default for byte strings
capabilities=("serial",), # defaults to all, can also pass a `DeviceScope`
)
hashes, counts = engine(texts, device=cpu)
assert hashes.shape == (len(texts), ndim)
assert counts.shape == (len(texts), ndim)
assert hashes.dtype == np.uint32 and counts.dtype == np.uint32
```
### Serialization
#### Filesystem
Similar to how `File` can be used to read a large file, other interfaces can be used to dump strings to disk faster.
The `Str` class has `write_to` to write the string to a file, and `offset_within` to obtain the integer offset of a substring view within a larger string for navigation.
```py
web_archive = Str("<html>...</html><html>...</html>")
_, end_tag, next_doc = web_archive.partition("</html>") # or use `find`
next_doc_offset = next_doc.offset_within(web_archive)
web_archive.write_to("next_doc.html") # no GIL, no copies, just a view
```
#### PyArrow
A `Str` is easy to cast to [PyArrow](https://arrow.apache.org/docs/python/arrays.html#string-and-binary-types) buffers.
```py
from pyarrow import foreign_buffer
from stringzilla import Strs
strs = Strs(["alpha", "beta", "gamma"])
arrow = foreign_buffer(strs.address, strs.nbytes, strs)
```
And only slightly harder to convert in the reverse direction:
```py
import pyarrow as pa
arr = pa.Array.from_buffers(
pa.large_string() if strs.offsets_are_large else pa.string(),
len(strs),
[None,
pa.foreign_buffer(strs.offsets_address, strs.offsets_nbytes, strs),
pa.foreign_buffer(strs.tape_address, strs.tape_nbytes, strs)],
)
```
That means you can convert `Str` to `pyarrow.Buffer` and `Strs` to `pyarrow.Array` without extra copies.
## Quick Start: C/C++ 🛠️
The C library is header-only, so you can just copy the `stringzilla.h` header into your project.
Same applies to C++, where you would copy the `stringzilla.hpp` header.
Alternatively, add it as a submodule, and include it in your build system.
```sh
git submodule add https://github.com/ashvardanian/StringZilla.git external/stringzilla
git submodule update --init --recursive
```
Or using a pure CMake approach:
```cmake
FetchContent_Declare(
stringzilla
GIT_REPOSITORY https://github.com/ashvardanian/StringZilla.git
GIT_TAG main # or specify a version tag
)
FetchContent_MakeAvailable(stringzilla)
```
Last but not least, you can also install it as a library and link against it.
This approach is worse for inlining, but brings [dynamic runtime dispatch](#dynamic-dispatch) for the most advanced CPU features.
### Basic Usage with C 99 and Newer
There is a stable C 99 interface, where all function names are prefixed with `sz_`.
Most interfaces are well documented, and come with self-explanatory names and examples.
In some cases, hardware-specific overloads are available, like `sz_find_skylake` or `sz_find_neon`.
Both are companions of `sz_find`: the first for x86 CPUs with AVX-512 support, the second for Arm CPUs with NEON.
```c
#include <stringzilla/stringzilla.h>
// Initialize your haystack and needle
sz_string_view_t haystack = {your_text, your_text_length};
sz_string_view_t needle = {your_subtext, your_subtext_length};
// Perform string-level operations auto-picking the backend or dispatching manually
sz_cptr_t ptr = sz_find(haystack.start, haystack.length, needle.start, needle.length);
sz_size_t substring_position = ptr ? (sz_size_t)(ptr - haystack.start) : SZ_SIZE_MAX; // SZ_SIZE_MAX if not found
// Backend-specific variants return pointers as well
ptr = sz_find_skylake(haystack.start, haystack.length, needle.start, needle.length);
ptr = sz_find_haswell(haystack.start, haystack.length, needle.start, needle.length);
ptr = sz_find_neon(haystack.start, haystack.length, needle.start, needle.length);
// Hash strings at once
sz_u64_t hash = sz_hash(haystack.start, haystack.length, 42); // 42 is the seed
sz_u64_t checksum = sz_bytesum(haystack.start, haystack.length); // or accumulate byte values
// Hash strings incrementally with "init", "update", and "digest":
sz_hash_state_t state;
sz_hash_state_init(&state, 42);
sz_hash_state_update(&state, haystack.start, 1); // first char
sz_hash_state_update(&state, haystack.start + 1, haystack.length - 1); // rest of the string
sz_u64_t streamed_hash = sz_hash_state_digest(&state);
// Perform collection level operations
sz_sequence_t array = {your_handle, your_count, your_get_start, your_get_length};
sz_sequence_argsort(&array, &your_config);
```
<details>
<summary><b>ยง Mapping from LibC to StringZilla.</b></summary>
By design, StringZilla has a couple of notable differences from LibC:
1. All strings are expected to have a length and are not necessarily null-terminated.
2. Every operation has a reverse-order counterpart.
That way `sz_find` and `sz_rfind` are similar to `strstr` and `strrstr` in LibC.
Similarly, `sz_find_byte` and `sz_rfind_byte` replace `memchr` and `memrchr`.
The `sz_find_byteset` maps to `strspn` and `strcspn`, while `sz_rfind_byteset` has no sibling in LibC.
<table>
<tr>
<th>LibC Functionality</th>
<th>StringZilla Equivalents</th>
</tr>
<tr>
<td><code>memchr(haystack, needle, haystack_length)</code>, <code>strchr</code></td>
<td><code>sz_find_byte(haystack, haystack_length, needle)</code></td>
</tr>
<tr>
<td><code>memrchr(haystack, needle, haystack_length)</code></td>
<td><code>sz_rfind_byte(haystack, haystack_length, needle)</code></td>
</tr>
<tr>
<td><code>memcmp</code>, <code>strcmp</code></td>
<td><code>sz_order</code>, <code>sz_equal</code></td>
</tr>
<tr>
<td><code>strlen(haystack)</code></td>
<td><code>sz_find_byte(haystack, SZ_SIZE_MAX, "\0")</code></td>
</tr>
<tr>
<td><code>strcspn(haystack, needles)</code></td>
<td><code>sz_find_byteset(haystack, haystack_length, needles_bitset)</code></td>
</tr>
<tr>
<td><code>strspn(haystack, needles)</code></td>
<td><code>sz_find_byteset(haystack, haystack_length, inverted_needles_bitset)</code></td>
</tr>
<tr>
<td><code>memmem(haystack, haystack_length, needle, needle_length)</code>, <code>strstr</code></td>
<td><code>sz_find(haystack, haystack_length, needle, needle_length)</code></td>
</tr>
<tr>
<td><code>memcpy(destination, source, destination_length)</code></td>
<td><code>sz_copy(destination, source, destination_length)</code></td>
</tr>
<tr>
<td><code>memmove(destination, source, destination_length)</code></td>
<td><code>sz_move(destination, source, destination_length)</code></td>
</tr>
<tr>
<td><code>memset(destination, value, destination_length)</code></td>
<td><code>sz_fill(destination, destination_length, value)</code></td>
</tr>
</table>
</details>
### Basic Usage with C++ 11 and Newer
There is a stable C++ 11 interface available in the `ashvardanian::stringzilla` namespace.
It comes with two STL-like classes: `string_view` and `string`.
The first is a non-owning view of a string, and the second is a mutable string with a [Small String Optimization][faq-sso].
```cpp
#include <stringzilla/stringzilla.hpp>
namespace sz = ashvardanian::stringzilla;
sz::string haystack = "some string";
sz::string_view needle = sz::string_view(haystack).substr(0, 4);
auto substring_position = haystack.find(needle); // Or `rfind`
auto hash = std::hash<sz::string_view>{}(haystack); // Compatible with STL's `std::hash`
haystack.end() - haystack.begin() == haystack.size(); // Or `rbegin`, `rend`
haystack.find_first_of(" \v\t") == 4; // Or `find_last_of`, `find_first_not_of`, `find_last_not_of`
haystack.starts_with(needle) == true; // Or `ends_with`
haystack.remove_prefix(needle.size()); // Why is this operation in-place?!
haystack.contains(needle) == true; // STL has this only from C++ 23 onwards
haystack.compare(needle) == 1; // Or `haystack <=> needle` in C++ 20 and beyond
```
StringZilla also provides string literals for automatic type resolution, [similar to STL][stl-literal]:
```cpp
using sz::literals::operator""_sv;
using std::literals::operator""sv;
auto a = "some string"; // char const *
auto b = "some string"sv; // std::string_view
auto c = "some string"_sv; // sz::string_view
```
[stl-literal]: https://en.cppreference.com/w/cpp/string/basic_string_view/operator%22%22sv
### Similarity Scores
StringZilla exposes high-performance, batch-oriented similarity via the `stringzillas/stringzillas.h` header.
Use `szs_device_scope_t` to pick hardware and optionally limit capabilities per engine.
```cpp
#include <stringzillas/stringzillas.h>
szs_device_scope_t device = NULL;
szs_device_scope_init_default(&device);
szs_levenshtein_distances_t engine = NULL;
szs_levenshtein_distances_init(0, 1, 1, 1, /*alloc*/ NULL, /*caps*/ sz_cap_serial_k, &engine);
sz_sequence_u32tape_t strings_a {data_a, offsets_a, count}; // or `sz_sequence_u64tape_t` for large inputs
sz_sequence_u32tape_t strings_b {data_b, offsets_b, count}; // or `sz_sequence_t` to pass generic containers
sz_size_t distances[count];
szs_levenshtein_distances_u32tape(engine, device, &strings_a, &strings_b, distances, sizeof(distances[0]));
szs_levenshtein_distances_free(engine);
szs_device_scope_free(device);
```
To target a different device, use the appropriate `szs_device_scope_init_{cpu_cores,gpu_device}` function.
When dealing with GPU backends, make sure to use the "unified memory" allocators exposed as `szs_unified_{alloc,free}`.
Similar stable C ABIs are exposed for other workloads as well.
- UTF-8: `szs_levenshtein_distances_utf8_{sequence,u32tape,u64tape}`
- Needleman-Wunsch: `szs_needleman_wunsch_scores_{sequence,u32tape,u64tape}`
- Smith-Waterman: `szs_smith_waterman_scores_{sequence,u32tape,u64tape}`
Moreover, in C++ codebases one can tap into the raw templates implementing that functionality, customizing them with custom executors, SIMD plugins, etc.
For that include `stringzillas/similarities.hpp` for C++ and `stringzillas/similarities.cuh` for CUDA.
```cpp
#include <stringzillas/similarities.hpp>
#include <stringzilla/types.hpp> // tape of strings
#include <fork_union.hpp> // optional thread pool
namespace sz = ashvardanian::stringzilla;
namespace szs = ashvardanian::stringzillas;
// Pack strings into an Arrow-like tape
std::vector<std::string> left = {"kitten", "flaw"};
std::vector<std::string> right = {"sitting", "lawn"};
sz::arrow_strings_tape<char, sz::size_t, std::allocator<char>> tape_a, tape_b;
auto _ = tape_a.try_assign(left.begin(), left.end());
auto _ = tape_b.try_assign(right.begin(), right.end());
// Run on the current thread
using levenshtein_t = szs::levenshtein_distances<char, szs::linear_gap_costs_t, std::allocator<char>, sz_cap_serial_k>;
levenshtein_t engine {szs::uniform_substitution_costs_t{0,1}, szs::linear_gap_costs_t{1}};
std::size_t distances[2];
auto _ = engine(tape_a, tape_b, distances);
// Or run in parallel with a pool
fork_union::basic_pool_t pool;
auto _ = pool.try_spawn(std::thread::hardware_concurrency());
auto _ = engine(tape_a, tape_b, distances, pool);
```
All of the potentially failing StringZillas' interfaces return error codes, and none raise C++ exceptions.
Parallelism is enabled at both collection-level and within individual pairs of large inputs.
### Rolling Fingerprints
StringZilla exposes parallel fingerprinting (Min-Hashes or Count-Min-Sketches) via the `stringzillas/stringzillas.h` header.
Use `szs_device_scope_t` to pick hardware and optionally limit capabilities per engine.
```c
#include <stringzillas/stringzillas.h>
szs_device_scope_t device = NULL;
szs_device_scope_init_default(&device);
szs_fingerprints_t engine = NULL;
sz_size_t const dims = 1024; sz_size_t const window_widths[] = {4, 6, 8, 10};
szs_fingerprints_init(dims, /*alphabet*/ 256, window_widths, 4, /*alloc*/ NULL, /*caps*/ sz_cap_serial_k, &engine);
sz_sequence_u32tape_t texts = {data, offsets, count};
sz_u32_t *min_hashes = (sz_u32_t*)szs_unified_alloc(count * dims * sizeof(*min_hashes));
sz_u32_t *min_counts = (sz_u32_t*)szs_unified_alloc(count * dims * sizeof(*min_counts));
szs_fingerprints_u32tape(engine, device, &texts,
min_hashes, dims * sizeof(*min_hashes), // support strided matrices
min_counts, dims * sizeof(*min_counts)); // for both output arguments
szs_fingerprints_free(engine);
szs_device_scope_free(device);
```
Moreover, in C++ codebases one can tap into the raw templates implementing that functionality, customizing them with custom executors, SIMD plugins, etc.
For that include `stringzillas/fingerprints.hpp` for C++ and `stringzillas/fingerprints.cuh` for CUDA.
```cpp
#include <stringzillas/fingerprints.hpp>
#include <stringzilla/types.hpp> // tape of strings
#include <fork_union.hpp> // optional thread pool
namespace sz = ashvardanian::stringzilla;
namespace szs = ashvardanian::stringzillas;
// Pack strings into an Arrow-like tape
std::vector<std::string> docs = {"alpha beta", "alpha betta"};
sz::arrow_strings_tape<char, sz::size_t, std::allocator<char>> tape;
auto _ = tape.try_assign(docs.begin(), docs.end());
// Run on the current thread with a Rabin-Karp family hasher
constexpr std::size_t dimensions_k = 256;
constexpr std::size_t window_width_k = 7;
using row_t = std::array<sz_u32_t, 256>;
using fingerprinter_t = szs::floating_rolling_hashers<sz_cap_serial_k, dimensions_k>;
fingerprinter_t engine;
auto _ = engine.try_extend(window_width_k, dimensions_k);
std::vector<row_t> hashes(docs.size()), counts(docs.size());
auto _ = engine(tape, hashes, counts);
// Or run in parallel with a pool
fork_union::basic_pool_t pool;
auto _ = pool.try_spawn(std::thread::hardware_concurrency());
auto _ = engine(tape, hashes, counts, pool);
```
### CUDA
StringZilla provides CUDA C++ templates for composable string batch-processing operations.
Different GPUs have varying warp sizes, shared memory capacities, and register counts, affecting algorithm selection, so it's important to query the `gpu_specs_t` via `gpu_specs_fetch`.
For memory management, ensure that you use GPU-visible "unified memory", exposed in an STL-compatible manner as the `unified_alloc` template class.
For error handling, `cuda_status_t` extends the traditional `status_t` with GPU-specific information.
It's implicitly convertible to `status_t`, so you can use it in places expecting a `status_t`.
Most algorithms can load-balance both a large number of small strings and a small number of large strings.
Still, with large H100-scale GPUs, it's best to submit thousands of inputs at once.
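There is no single snippet that covers every template, but as a rough sketch, the stable C ABI from the Similarity Scores section above can target a GPU as follows. The argument order of `szs_device_scope_init_gpu_device` is an assumption here, and `engine`, `strings_a`, `strings_b`, and `count` are reused from that earlier example, assuming the engine was initialized with GPU capabilities enabled:
```c
// Hedged sketch: the exact signature of the GPU scope initializer is assumed.
szs_device_scope_t gpu = NULL;
szs_device_scope_init_gpu_device(/*device_index*/ 0, &gpu);

// Outputs must be GPU-visible, so place them in unified memory:
sz_size_t *distances = (sz_size_t *)szs_unified_alloc(count * sizeof(sz_size_t));
szs_levenshtein_distances_u32tape(engine, gpu, &strings_a, &strings_b, distances, sizeof(distances[0]));

// ... consume `distances`, then release them with `szs_unified_free` and drop the scope:
szs_device_scope_free(gpu);
```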
### Memory Ownership and Small String Optimization
Most operations in StringZilla don't assume any memory ownership.
But in addition to the read-only, search-like operations, StringZilla provides minimalistic C and C++ implementations of a memory-owning string "class".
Like other efficient string implementations, it uses the [Small String Optimization][faq-sso] (SSO) to avoid heap allocations for short strings.
[faq-sso]: https://cpp-optimizations.netlify.app/small_strings/
```c
typedef union sz_string_t {
struct internal {
sz_ptr_t start;
sz_u8_t length;
char chars[SZ_STRING_INTERNAL_SPACE]; /// Ends with a null-terminator.
} internal;
struct external {
sz_ptr_t start;
sz_size_t length;
sz_size_t space; /// The length of the heap-allocated buffer.
sz_size_t padding;
} external;
} sz_string_t;
```
As one can see, a short string can be kept on the stack if it fits within the `internal.chars` array.
Before 2015, the GCC string implementation was just 8 bytes and could only fit 7 characters.
Different STL implementations today have different thresholds for the Small String Optimization.
Similar to GCC, StringZilla is 32 bytes in size, and similar to Clang, it can fit 22 characters on the stack.
Our layout might be preferable if you want to avoid branches.
If you use a different compiler, you may want to check its SSO buffer size with a [simple Gist](https://gist.github.com/ashvardanian/c197f15732d9855c4e070797adf17b21).
| | `libstdc++` in GCC 13 | `libc++` in Clang 17 | StringZilla |
| :-------------------- | ---------------------: | -------------------: | ----------: |
| `sizeof(std::string)` | 32 | 24 | 32 |
| Small String Capacity | 15 | __22__ | __22__ |
This design has since been ported to many high-level programming languages.
Swift, for example, [can store 15 bytes](https://developer.apple.com/documentation/swift/substring/withutf8(_:)#discussion) in the `String` instance itself.
StringZilla implements SSO at the C level, providing the `sz_string_t` union and a simple API for primary operations.
```c
sz_memory_allocator_t allocator;
sz_string_t string;
// Init and make sure we are on stack
sz_string_init(&string);
sz_string_is_on_stack(&string); // == sz_true_k
// Optionally pre-allocate space on the heap for future insertions.
sz_string_grow(&string, 100, &allocator); // == sz_true_k
// Append, erase, insert into the string.
sz_string_expand(&string, 0, "_Hello_", 7, &allocator); // == sz_true_k
sz_string_expand(&string, SZ_SIZE_MAX, "world", 5, &allocator); // == sz_true_k
sz_string_erase(&string, 0, 1);
// Unpacking & introspection.
sz_ptr_t string_start;
sz_size_t string_length;
sz_size_t string_space;
sz_bool_t string_is_external;
sz_string_unpack(string, &string_start, &string_length, &string_space, &string_is_external);
sz_equal(string_start, "Hello_world", 11); // == sz_true_k
// Reclaim some memory.
sz_string_shrink_to_fit(&string, &allocator); // == sz_true_k
sz_string_free(&string, &allocator);
```
Unlike conventional C strings, `sz_string_t` is allowed to contain null characters.
To safely print those, pass the `string_length` to `printf` as well.
```c
printf("%.*s\n", (int)string_length, string_start);
```
### What's Wrong with the C Standard Library?
StringZilla is not a drop-in replacement for the C Standard Library.
It's designed to be a safer and more modern alternative.
Conceptually:
1. LibC strings are expected to be null-terminated, so to use the efficient LibC implementations on slices of larger strings, you'd have to copy them, which is more expensive than the original string operation (see the sketch after this list).
2. LibC functionality is asymmetric - you can find the first and the last occurrence of a character within a string, but you can't find the last occurrence of a substring.
3. LibC function names are typically very short and cryptic.
4. LibC lacks crucial functionality like hashing and doesn't provide primitives for less critical but relevant operations like fuzzy matching.
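To make the first point concrete, here is a small sketch: searching only the first part of a larger buffer with `strstr` requires a NUL-terminated copy, while `sz_find` takes explicit lengths and works on the slice directly.
```c
#include <stdlib.h>
#include <string.h>
#include <stringzilla/stringzilla.h>

char const *buffer = "haystack with a needle in it";
sz_size_t slice_length = 22; // search only the first 22 bytes, which are not NUL-terminated

// LibC: `strstr` needs a terminator, so the slice must be copied first.
char *copy = (char *)malloc(slice_length + 1);
memcpy(copy, buffer, slice_length);
copy[slice_length] = '\0';
char *match_libc = strstr(copy, "needle"); // points into `copy`, so free it only after use
free(copy);

// StringZilla: explicit lengths, no copy, and the same call works for any slice.
sz_cptr_t match_sz = sz_find(buffer, slice_length, "needle", 6);
```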
Something has to be said about its support for UTF-8.
Aside from a single-byte `char` type, LibC provides `wchar_t`:
- The size of `wchar_t` is not consistent across platforms. On Windows, it's typically 16 bits (suitable for UTF-16), while on Unix-like systems, it's usually 32 bits (suitable for UTF-32). This inconsistency can lead to portability issues when writing cross-platform code.
- `wchar_t` is designed to represent wide characters in a fixed-width format (UTF-16 or UTF-32). In contrast, UTF-8 is a variable-length encoding, where each character can take from 1 to 4 bytes. This fundamental difference means that `wchar_t` and UTF-8 are incompatible.
StringZilla [partially addresses those issues](#unicode-utf-8-and-wide-characters).
### What's Wrong with the C++ Standard Library?
| C++ Code | Evaluation Result | Invoked Signature |
| :----------------------------------- | :---------------- | :----------------------------- |
| `"Loose"s.replace(2, 2, "vath"s, 1)` | `"Loathe"` ๐คข | `(pos1, count1, str2, pos2)` |
| `"Loose"s.replace(2, 2, "vath", 1)` | `"Love"` ๐ฅฐ | `(pos1, count1, str2, count