String Manipulation
- class blaise.strings.Segmenter(word_dist: str | list[str] | dict[str, float] = 'en_wiki', n_branch_limit: int | None = None, length_power: float = 1)
Bases: object
A string segmenter that identifies word boundaries in a string of letters.
- Parameters:
word_dist (str | list[str] | dict[str, float]) –
The source of word probabilities used for segmentation.
- If a string, it is interpreted as a word dist name and loaded via load_word_dist().
- If a list of strings, each word is assigned an equal probability (i.e., a uniform distribution).
- If a dictionary mapping words to probabilities, the values are normalised so that the total probability sums to 1.
n_branch_limit (int | None, optional) – When set, limits the number of candidate segmentations kept at each recursion step to the top n_branch_limit by score. If None (the default), all candidates are considered.
length_power (float) – Controls how strongly long words are favoured over short words. A value of 1 means the segmenter is agnostic to length; a value greater than 1 means longer words are preferred over shorter ones.
Examples
If passed a list, words get equal probability:
>>> Segmenter(['HELLO', 'WORLD', 'HELL', 'O']).segment('HELLOWORLD')
shape: (2, 2)
┌──────────────┬───────────┐
│ text         ┆ score     │
│ ---          ┆ ---       │
│ str          ┆ f64       │
╞══════════════╪═══════════╡
│ HELL O WORLD ┆ 13.862944 │
│ HELLO WORLD  ┆ 13.862944 │
└──────────────┴───────────┘
If passed a dict, the dict's values are used as the word probabilities. Results are returned with the most likely segmentation first:
>>> Segmenter({'HELLO': 0.5, 'WORLD': 0.25, 'HELL': 0.2, 'O': 0.05}).segment('HELLOWORLD')
shape: (2, 2)
┌──────────────┬───────────┐
│ text         ┆ score     │
│ ---          ┆ ---       │
│ str          ┆ f64       │
╞══════════════╪═══════════╡
│ HELLO WORLD  ┆ 10.397208 │
│ HELL O WORLD ┆ 16.364956 │
└──────────────┴───────────┘
We can also pass in a length power to make the algorithm favour longer words:
>>> Segmenter(['HELLO', 'WORLD', 'HELL', 'O'], length_power=2).segment('HELLOWORLD')
shape: (2, 2)
┌──────────────┬──────────┐
│ text         ┆ score    │
│ ---          ┆ ---      │
│ str          ┆ f64      │
╞══════════════╪══════════╡
│ HELLO WORLD  ┆ 6.199697 │
│ HELL O WORLD ┆ 7.258732 │
└──────────────┴──────────┘
- segment(text: str) DataFrame
Segments text into words.
- blaise.strings.calculate_ngrams(text: str, n: int) dict[str, float]
Compute the n-gram frequencies of a string.
The function slides a window of length n over the input string and counts how many times each distinct n-character substring occurs. The frequencies are returned as a dictionary mapping each n-gram to its relative frequency (count divided by the total number of n-grams).
>>> calculate_ngrams("ABCABC", 3) {'ABC': 0.5, 'BCA': 0.25, 'CAB': 0.25}
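The sliding-window computation described above can be sketched as follows. This is a minimal illustration of the documented behaviour, not the library's implementation; the function name `calculate_ngrams_sketch` is made up to avoid shadowing the real API:

```python
from collections import Counter

def calculate_ngrams_sketch(text: str, n: int) -> dict[str, float]:
    # Slide a window of length n over the string and count each distinct
    # n-character substring.
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    # Convert raw counts into relative frequencies (count / total n-grams).
    return {gram: count / total for gram, count in counts.items()}

print(calculate_ngrams_sketch("ABCABC", 3))
# -> {'ABC': 0.5, 'BCA': 0.25, 'CAB': 0.25}
```

For "ABCABC" with n=3 the window yields ABC, BCA, CAB, ABC, so ABC occurs twice out of four windows, matching the doctest above.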
- blaise.strings.check_is_alpha(s: str)
- blaise.strings.is_alpha(s: str) bool
Returns True if all letters are in a-z or A-Z.
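A minimal sketch of this check, assuming the intended semantics are that every character of the string must be an ASCII letter (the real implementation may differ, e.g. in how the empty string is treated):

```python
def is_alpha_sketch(s: str) -> bool:
    # True when every character falls in a-z or A-Z.
    # Note: all() over an empty iterable is True, so the empty string
    # passes this check.
    return all('a' <= c <= 'z' or 'A' <= c <= 'Z' for c in s)

print(is_alpha_sketch('HelloWorld'))   # -> True
print(is_alpha_sketch('Hello World'))  # -> False (space is not a letter)
```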
- blaise.strings.normalize_string(s: str) str
Normalize a Unicode string to uppercase ASCII letters A-Z.
The function performs the following steps:
1. Normalizes the string to NFKD form, decomposing characters.
2. Encodes to ASCII, ignoring characters that cannot be represented.
3. Decodes back to a string.
4. Converts the result to uppercase.
5. Removes any characters that are not A-Z.
- Parameters:
s (str) – The input Unicode string.
- Returns:
The normalized string containing only uppercase ASCII letters.
- Return type:
str
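The five steps above can be sketched directly with the standard library; `normalize_string_sketch` is an illustrative stand-in, not the library's own code:

```python
import unicodedata

def normalize_string_sketch(s: str) -> str:
    # 1. Decompose characters into NFKD form, e.g. 'é' -> 'e' + combining accent.
    decomposed = unicodedata.normalize('NFKD', s)
    # 2-3. Encode to ASCII, dropping anything unrepresentable (such as the
    # combining accents), then decode back to a str.
    ascii_only = decomposed.encode('ascii', 'ignore').decode('ascii')
    # 4. Uppercase the result.
    upper = ascii_only.upper()
    # 5. Keep only the characters A-Z.
    return ''.join(c for c in upper if 'A' <= c <= 'Z')

print(normalize_string_sketch("Héllo, wörld!"))  # -> HELLOWORLD
```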
- blaise.strings.restore_string(input_string, result_string)
Restores the non-letter characters present in the input string into the result string. The strings must contain the same number of letters, and the result string must already be normalized. Note: this does not restore the case of the input string.
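One plausible reading of this docstring can be sketched as below: walk the original string, copying non-letters through verbatim and taking letters from the normalized result string in order. This is an assumption about the behaviour, and `restore_string_sketch` is a hypothetical name:

```python
def restore_string_sketch(input_string: str, result_string: str) -> str:
    # Consume letters from the normalized result string one at a time,
    # re-inserting the input string's non-letter characters in place.
    result_letters = iter(result_string)
    out = []
    for ch in input_string:
        if ch.isalpha():
            out.append(next(result_letters))  # next letter from the result
        else:
            out.append(ch)                    # punctuation/space copied as-is
    return ''.join(out)

print(restore_string_sketch("Hello, world!", "HELLOWORLD"))
# -> HELLO, WORLD!
```

Consistent with the note above, the output keeps the normalized (uppercase) letters; only the punctuation and spacing of the input are restored.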