Correctors

Character-level text corrector for post-processing OCR results.

CharLM

class manuscript.correctors.CharLM(weights=None, vocab=None, lexicon=None, device=None, mask_threshold=0.05, apply_threshold=0.95, max_edits=2, min_word_len=4, max_len=32, **kwargs)[source]

Bases: BaseModel

Character-level language model corrector using ONNX Runtime.

CharLM uses a Transformer-based masked language model to correct OCR errors at the character level. It analyzes character confidence and applies corrections based on learned substitution patterns.

Parameters:
  • weights (str or Path, optional) –

    Path or identifier for ONNX model weights. Supports:

    • Local file path: "path/to/model.onnx"

    • HTTP/HTTPS URL: "https://example.com/model.onnx"

    • GitHub release: "github://owner/repo/tag/file.onnx"

    • Google Drive: "gdrive:FILE_ID"

    • Preset name: "prereform_charlm_g1" or "modern_charlm_g1" (from pretrained_registry)

    • None: auto-downloads default preset (prereform_charlm_g1)

  • vocab (str or Path, optional) – Path to vocabulary JSON file. If None, inferred from weights location.

  • lexicon (str, Path, or set, optional) –

    Word list for dictionary-based validation. Supports:

    • Local file path: "path/to/words.txt"

    • Preset name: "prereform_words" or "modern_words" (from lexicon_registry)

    • Python set: {"word1", "word2", ...}

    • None: auto-downloads default lexicon for model preset (prereform_words for prereform_charlm_g1, modern_words for modern_charlm_g1)

  • device ({"cuda", "cpu"}, optional) – Compute device. Default is auto-detected.

  • mask_threshold (float, optional) – Confidence threshold below which characters are considered for correction. Default is 0.05.

  • apply_threshold (float, optional) – Minimum model confidence required to apply a correction. Default is 0.95.

  • max_edits (int, optional) – Maximum number of edits per word. Default is 2.

  • min_word_len (int, optional) – Minimum word length to attempt correction. Default is 4.

  • max_len (int, optional) – Maximum sequence length in characters. Default is 32.

  • **kwargs – Additional configuration options.

Examples

>>> from manuscript.correctors import CharLM
>>> corrector = CharLM()
>>> corrected_page = corrector.predict(page)

Methods

__call__(*args, **kwargs)

Call self as a function.

export(weights_path, vocab_path, output_path)

Export CharLM PyTorch model to ONNX format.

predict(page)

Apply character-level correction to a Page.

runtime_providers()

Get ONNX Runtime execution providers based on device.

train([words_path, text_path, pairs_path, ...])

Train CharLM character-level language model.

default_weights_name: str | None = 'prereform_charlm_g1'
pretrained_registry: Dict[str, str] = {'modern_charlm_g1': 'https://github.com/konstantinkozhin/manuscript-ocr/releases/download/v0.1.0/modern_charlm_g1.onnx', 'prereform_charlm_g1': 'https://github.com/konstantinkozhin/manuscript-ocr/releases/download/v0.1.0/prereform_charlm_g1.onnx'}
vocab_registry = {'modern_charlm_g1': 'https://github.com/konstantinkozhin/manuscript-ocr/releases/download/v0.1.0/modern_charlm_g1.json', 'prereform_charlm_g1': 'https://github.com/konstantinkozhin/manuscript-ocr/releases/download/v0.1.0/prereform_charlm_g1.json'}
lexicon_registry = {'modern_words': 'https://github.com/konstantinkozhin/manuscript-ocr/releases/download/v0.1.0/modern_words.txt', 'prereform_words': 'https://github.com/konstantinkozhin/manuscript-ocr/releases/download/v0.1.0/prereform_words.txt'}
default_lexicon_for_model = {'modern_charlm_g1': 'modern_words', 'prereform_charlm_g1': 'prereform_words'}
predict(page)[source]

Apply character-level correction to a Page.

Parameters:

page (Page) – Input Page object with recognized text.

Returns:

Corrected Page object with updated word texts.

Return type:

Page

static train(words_path=None, text_path=None, pairs_path=None, charset_path=None, *, exp_dir='exp_charlm', max_words=1500000, max_pairs_edits=3, max_len=32, emb_size=192, n_layers=8, n_heads=6, ffn_size=1024, dropout=0.1, batch_size=256, accumulation_steps=2, use_amp=True, compile_model=False, epochs=50, lr=0.001, weight_decay=0.01, grad_clip=1.0, min_len=3, mask_prob=0.3, span_min=1, span_max=3, spans_min=1, spans_max=2, pairs_ratio=0.8, eval_ratio=0.01, seed=42, checkpoint=None, **extra_config)[source]

Train CharLM character-level language model.

Parameters:
  • words_path (str or Path, optional) – Path to words file (one word per line).

  • text_path (str or Path, optional) – Path to text file for n-gram dataset.

  • pairs_path (str or Path, optional) – Path to CSV file with incorrect/correct pairs.

  • charset_path (str or Path) – Path to charset file (allowed characters).

  • exp_dir (str, optional) – Experiment directory. Default is "exp_charlm".

  • max_words (int, optional) – Maximum words to use from words file. Default is 1_500_000.

  • max_pairs_edits (int, optional) – Maximum number of character edits in pairs to include. Default is 3.

  • max_len (int, optional) – Maximum sequence length. Default is 32.

  • emb_size (int, optional) – Embedding size. Default is 192.

  • n_layers (int, optional) – Number of transformer layers. Default is 8.

  • n_heads (int, optional) – Number of attention heads. Default is 6.

  • ffn_size (int, optional) – Feed-forward network size. Default is 1024.

  • dropout (float, optional) – Dropout rate. Default is 0.1.

  • batch_size (int, optional) – Batch size. Default is 256.

  • accumulation_steps (int, optional) – Gradient accumulation steps. Default is 2.

  • use_amp (bool, optional) – Use automatic mixed precision (AMP). Default is True.

  • compile_model (bool, optional) – Use torch.compile for faster training. Default is False.

  • epochs (int, optional) – Number of epochs. Default is 50.

  • lr (float, optional) – Learning rate. Default is 1e-3.

  • weight_decay (float, optional) – Weight decay. Default is 0.01.

  • grad_clip (float, optional) – Gradient clipping. Default is 1.0.

  • min_len (int, optional) – Minimum word length. Default is 3.

  • mask_prob (float, optional) – Probability of using span masking. Default is 0.3.

  • span_min (int, optional) – Minimum span length for masking. Default is 1.

  • span_max (int, optional) – Maximum span length for masking. Default is 3.

  • spans_min (int, optional) – Minimum number of spans. Default is 1.

  • spans_max (int, optional) – Maximum number of spans. Default is 2.

  • pairs_ratio (float, optional) – Ratio of OCR pairs in mixed dataset (0.8 = 80% pairs, 20% ngrams). Default is 0.8.

  • eval_ratio (float, optional) – Evaluation set ratio. Default is 0.01.

  • seed (int, optional) – Random seed. Default is 42.

  • checkpoint (str, optional) – Path to checkpoint to resume from.

  • **extra_config – Additional config options.

Returns:

Path to the final checkpoint.

Return type:

str

static export(weights_path, vocab_path, output_path, max_len=32, emb_size=192, n_layers=8, n_heads=6, ffn_size=1024, opset_version=14, simplify=True)[source]

Export CharLM PyTorch model to ONNX format.

Parameters:
  • weights_path (str or Path) – Path to PyTorch checkpoint (.pt file).

  • vocab_path (str or Path) – Path to vocabulary JSON file.

  • output_path (str or Path) – Path to save ONNX model.

  • max_len (int, optional) – Maximum sequence length. Default is 32.

  • emb_size (int, optional) – Embedding size. Default is 192.

  • n_layers (int, optional) – Number of transformer layers. Default is 8.

  • n_heads (int, optional) – Number of attention heads. Default is 6.

  • ffn_size (int, optional) – Feed-forward network size. Default is 1024.

  • opset_version (int, optional) – ONNX opset version. Default is 14.

  • simplify (bool, optional) – Apply ONNX simplification. Default is True.

Return type:

None

Overview

CharLM is a character-level masked language model corrector that uses a Transformer architecture to fix OCR errors. It analyzes character-level context and applies corrections based on learned patterns.

Key features:

  • Character-level Transformer-based correction

  • Configurable confidence thresholds

  • Support for custom vocabularies and lexicons

  • ONNX Runtime inference for fast correction

  • Optional lexicon filtering to preserve known words
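
The thresholds interact as follows: characters whose OCR confidence falls below mask_threshold become correction candidates, a candidate substitution is applied only when the language model's confidence exceeds apply_threshold, at most max_edits characters per word are changed, and words shorter than min_word_len are left untouched. A simplified, illustrative sketch of that gating logic (not the library's actual implementation):

```python
def select_mask_positions(confidences, mask_threshold=0.05,
                          min_word_len=4, max_edits=2):
    """Illustrative sketch: return indices of the lowest-confidence
    characters in a word that qualify for correction."""
    if len(confidences) < min_word_len:
        return []  # short words are never corrected
    # Candidates: characters the OCR engine was unsure about.
    candidates = [i for i, c in enumerate(confidences) if c < mask_threshold]
    # Keep at most max_edits positions, least confident first.
    candidates.sort(key=lambda i: confidences[i])
    return sorted(candidates[:max_edits])


def accept_correction(model_confidence, apply_threshold=0.95):
    """A proposed substitution is applied only if the language model
    is sufficiently confident in its prediction."""
    return model_confidence >= apply_threshold


# A five-character word with two uncertain characters.
print(select_mask_positions([0.99, 0.02, 0.98, 0.01, 0.97]))  # [1, 3]
print(accept_correction(0.97))  # True
print(accept_correction(0.80))  # False
```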

Available Presets

The following pretrained models are available:

Preset Name              Description
prereform_charlm_g1      Pre-reform Russian text (default)
modern_charlm_g1         Modern Russian text

Quick Example

Use create_page_from_text to quickly test correction on text:

from manuscript.utils import create_page_from_text
from manuscript.correctors import CharLM

# Create page from text with potential OCR errors
page = create_page_from_text(["Привѣтъ міръ", "Тестовая строка"])

# Apply correction (using default prereform model)
corrector = CharLM()
corrected = corrector.predict(page)

# Extract corrected text
for line in corrected.blocks[0].lines:
    text = " ".join(w.text for w in line.words)
    print(text)

Basic Usage

from manuscript import Pipeline
from manuscript.correctors import CharLM

# Create corrector with default preset
corrector = CharLM()

# Create corrector with specific preset
corrector = CharLM(weights="modern_charlm_g1")

# Use in pipeline
pipeline = Pipeline(corrector=corrector)
result = pipeline.predict("document.jpg")

Advanced Configuration

from manuscript.correctors import CharLM

# Fine-tune correction behavior
corrector = CharLM(
    weights="prereform_charlm_g1",
    mask_threshold=0.05,      # Characters with confidence below this are corrected
    apply_threshold=0.95,     # Model must be this confident to apply correction
    max_edits=2,              # Maximum edits per word
    min_word_len=4,           # Minimum word length to attempt correction
    device="cuda"             # Use GPU for inference
)

Using Custom Lexicon

You can provide a lexicon (word list) to prevent corrections of known words:

from manuscript.correctors import CharLM

# From preset
corrector = CharLM(
    weights="prereform_charlm_g1",
    lexicon="prereform_words"  # Use preset lexicon
)

# From file
corrector = CharLM(
    weights="prereform_charlm_g1",
    lexicon="path/to/words.txt"
)

# From set
my_words = {"слово1", "слово2", "слово3"}
corrector = CharLM(
    weights="prereform_charlm_g1",
    lexicon=my_words
)
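
Because lexicon accepts a plain Python set, you can also preprocess a word list yourself before passing it in, for example to normalize case. A minimal sketch (the file "my_words.txt" is hypothetical):

```python
from pathlib import Path

# Hypothetical one-word-per-line lexicon file; written here so the
# snippet is self-contained.
path = Path("my_words.txt")
path.write_text("Слово\nтекстъ\n\nМіръ\n", encoding="utf-8")

# Build a normalized set: strip whitespace, drop blank lines, lowercase.
words = {
    line.strip().lower()
    for line in path.read_text(encoding="utf-8").splitlines()
    if line.strip()
}
print(words)  # {'слово', 'текстъ', 'міръ'}
```

The resulting set can then be passed as lexicon=words when constructing CharLM.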

Training Custom Model

You can train CharLM on your own data:

from manuscript.correctors import CharLM

# Train with OCR pairs dataset
checkpoint_path = CharLM.train(
    pairs_path="ocr_pairs.csv",      # CSV with incorrect,correct columns
    charset_path="charset.txt",       # Allowed characters
    exp_dir="my_charlm_exp",
    epochs=50,
    batch_size=256,
)

# Train with word list (self-supervised)
checkpoint_path = CharLM.train(
    words_path="words.txt",           # One word per line
    charset_path="charset.txt",
    exp_dir="my_charlm_exp",
)
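
The pairs CSV is assumed to contain one OCR error and its correction per row; the incorrect,correct column layout below is inferred from the parameter description, so verify it against your installed version. A minimal sketch of producing such a file:

```python
import csv

# Hypothetical OCR-error / ground-truth pairs.
pairs = [
    ("слбво", "слово"),
    ("тскстъ", "текстъ"),
]

with open("ocr_pairs.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["incorrect", "correct"])
    writer.writerows(pairs)
```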

Export to ONNX

from manuscript.correctors import CharLM

# Export trained model to ONNX
CharLM.export(
    weights_path="exp/checkpoints/charlm_epoch_50.pt",
    vocab_path="exp/vocab.json",
    output_path="my_model.onnx",
)