Correctors

Character-level text corrector for post-processing OCR results.

CharLM

class manuscript.correctors.CharLM(weights=None, vocab=None, lexicon=None, device=None, mask_threshold=0.05, apply_threshold=0.95, max_edits=2, min_word_len=4, max_len=32, **kwargs)[source]

Bases: BaseModel

Character-level language model corrector using ONNX Runtime.

CharLM uses a Transformer-based masked language model to correct OCR errors at the character level. It analyzes character confidence and applies corrections based on learned substitution patterns.

Parameters:
  • weights (str or Path, optional) –

    Path or identifier for ONNX model weights. Supports:

    • Local file path: "path/to/model.onnx"

    • HTTP/HTTPS URL: "https://example.com/model.onnx"

    • GitHub release: "github://owner/repo/tag/file.onnx"

    • Google Drive: "gdrive:FILE_ID"

    • Preset name: "prereform_charlm_g1" or "modern_charlm_g1" (from pretrained_registry)

    • None: auto-downloads default preset (prereform_charlm_g1)

  • vocab (str or Path, optional) – Path to vocabulary JSON file. If None, inferred from weights location.

  • lexicon (str, Path, or set, optional) –

    Word list for dictionary-based validation. Supports:

    • Local file path: "path/to/words.txt"

    • Preset name: "prereform_words" or "modern_words" (from lexicon_registry)

    • Python set: {"word1", "word2", ...}

    • None: auto-downloads default lexicon for model preset (prereform_words for prereform_charlm_g1, modern_words for modern_charlm_g1)

  • device ({"cuda", "cpu"}, optional) – Compute device. Default is auto-detected.

  • mask_threshold (float, optional) – Confidence threshold below which characters are considered for correction. Default is 0.05.

  • apply_threshold (float, optional) – Minimum model confidence required to apply a correction. Default is 0.95.

  • max_edits (int, optional) – Maximum number of edits per word. Default is 2.

  • min_word_len (int, optional) – Minimum word length to attempt correction. Default is 4.

  • max_len (int, optional) – Maximum sequence length in characters. Default is 32.

  • **kwargs – Additional configuration options.

Examples

>>> from manuscript.correctors import CharLM
>>> corrector = CharLM()
>>> corrected_page = corrector.predict(page)

Methods

__call__(*args, **kwargs)

Call self as a function.

export(weights_path, vocab_path, output_path)

Export CharLM PyTorch model to ONNX format.

predict(page)

Apply character-level correction to a Page.

runtime_providers()

Get ONNX Runtime execution providers based on device.

train([words_path, text_path, pairs_path, ...])

Train CharLM character-level language model.

default_weights_name: str | None = 'prereform_charlm_g1'
pretrained_registry: Dict[str, str] = {'modern_charlm_g1': 'https://github.com/konstantinkozhin/manuscript-ocr/releases/download/v0.1.0/modern_charlm_g1.onnx', 'prereform_charlm_g1': 'https://github.com/konstantinkozhin/manuscript-ocr/releases/download/v0.1.0/prereform_charlm_g1.onnx'}
vocab_registry = {'modern_charlm_g1': 'https://github.com/konstantinkozhin/manuscript-ocr/releases/download/v0.1.0/modern_charlm_g1.json', 'prereform_charlm_g1': 'https://github.com/konstantinkozhin/manuscript-ocr/releases/download/v0.1.0/prereform_charlm_g1.json'}
lexicon_registry = {'modern_words': 'https://github.com/konstantinkozhin/manuscript-ocr/releases/download/v0.1.0/modern_words.txt', 'prereform_words': 'https://github.com/konstantinkozhin/manuscript-ocr/releases/download/v0.1.0/prereform_words.txt'}
default_lexicon_for_model = {'modern_charlm_g1': 'modern_words', 'prereform_charlm_g1': 'prereform_words'}
predict(page)[source]

Apply character-level correction to a Page.

Parameters:

page (Page) – Input Page object with recognized text.

Returns:

Corrected Page object with updated word texts.

Return type:

Page

static train(words_path=None, text_path=None, pairs_path=None, charset_path=None, *, exp_dir='exp_charlm', max_words=1500000, max_pairs_edits=3, max_len=32, emb_size=192, n_layers=8, n_heads=6, ffn_size=1024, dropout=0.1, batch_size=256, accumulation_steps=2, use_amp=True, compile_model=False, epochs=50, lr=0.001, weight_decay=0.01, grad_clip=1.0, min_len=3, mask_prob=0.3, span_min=1, span_max=3, spans_min=1, spans_max=2, pairs_ratio=0.8, eval_ratio=0.01, seed=42, checkpoint=None, **extra_config)[source]

Train CharLM character-level language model.

Parameters:
  • words_path (str or Path, optional) – Path to words file (one word per line).

  • text_path (str or Path, optional) – Path to text file for n-gram dataset.

  • pairs_path (str or Path, optional) – Path to CSV file with incorrect/correct pairs.

  • charset_path (str or Path) – Path to charset file (allowed characters).

  • exp_dir (str, optional) – Experiment directory. Default is "exp_charlm".

  • max_words (int, optional) – Maximum words to use from words file. Default is 1_500_000.

  • max_pairs_edits (int, optional) – Maximum number of character edits in pairs to include. Default is 3.

  • max_len (int, optional) – Maximum sequence length. Default is 32.

  • emb_size (int, optional) – Embedding size. Default is 192.

  • n_layers (int, optional) – Number of transformer layers. Default is 8.

  • n_heads (int, optional) – Number of attention heads. Default is 6.

  • ffn_size (int, optional) – Feed-forward network size. Default is 1024.

  • dropout (float, optional) – Dropout rate. Default is 0.1.

  • batch_size (int, optional) – Batch size. Default is 256.

  • accumulation_steps (int, optional) – Gradient accumulation steps. Default is 2.

  • use_amp (bool, optional) – Use automatic mixed precision (AMP). Default is True.

  • compile_model (bool, optional) – Use torch.compile for faster training. Default is False.

  • epochs (int, optional) – Number of epochs. Default is 50.

  • lr (float, optional) – Learning rate. Default is 1e-3.

  • weight_decay (float, optional) – Weight decay. Default is 0.01.

  • grad_clip (float, optional) – Gradient clipping. Default is 1.0.

  • min_len (int, optional) – Minimum word length. Default is 3.

  • mask_prob (float, optional) – Probability of using span masking. Default is 0.3.

  • span_min (int, optional) – Minimum span length for masking. Default is 1.

  • span_max (int, optional) – Maximum span length for masking. Default is 3.

  • spans_min (int, optional) – Minimum number of spans. Default is 1.

  • spans_max (int, optional) – Maximum number of spans. Default is 2.

  • pairs_ratio (float, optional) – Ratio of OCR pairs in mixed dataset (0.8 = 80% pairs, 20% ngrams). Default is 0.8.

  • eval_ratio (float, optional) – Evaluation set ratio. Default is 0.01.

  • seed (int, optional) – Random seed. Default is 42.

  • checkpoint (str, optional) – Path to checkpoint to resume from.

  • **extra_config – Additional config options.

Returns:

Path to the final checkpoint.

Return type:

str

static export(weights_path, vocab_path, output_path, max_len=32, emb_size=192, n_layers=8, n_heads=6, ffn_size=1024, opset_version=14, simplify=True)[source]

Export CharLM PyTorch model to ONNX format.

Parameters:
  • weights_path (str or Path) – Path to PyTorch checkpoint (.pt file).

  • vocab_path (str or Path) – Path to vocabulary JSON file.

  • output_path (str or Path) – Path to save ONNX model.

  • max_len (int, optional) – Maximum sequence length. Default is 32.

  • emb_size (int, optional) – Embedding size. Default is 192.

  • n_layers (int, optional) – Number of transformer layers. Default is 8.

  • n_heads (int, optional) – Number of attention heads. Default is 6.

  • ffn_size (int, optional) – Feed-forward network size. Default is 1024.

  • opset_version (int, optional) – ONNX opset version. Default is 14.

  • simplify (bool, optional) – Apply ONNX simplification. Default is True.

Return type:

None

Overview

CharLM is a character-level masked language model corrector that uses a Transformer architecture to fix OCR errors. It analyzes character-level context and applies corrections based on learned patterns.

Key features:

  • Character-level Transformer-based correction

  • Configurable confidence thresholds

  • Support for custom vocabularies and lexicons

  • ONNX Runtime inference for fast correction

  • Optional lexicon filtering to preserve known words
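
The thresholds interact as follows: characters whose OCR confidence falls below mask_threshold become correction candidates, a candidate substitution is applied only when the language model's confidence exceeds apply_threshold, at most max_edits characters per word are changed, and words shorter than min_word_len are left untouched. A simplified, illustrative sketch of that gating logic (not the library's actual implementation):

```python
def select_mask_positions(confidences, mask_threshold=0.05,
                          min_word_len=4, max_edits=2):
    """Illustrative sketch: return indices of the lowest-confidence
    characters in a word that qualify for correction."""
    if len(confidences) < min_word_len:
        return []  # short words are never corrected
    # Candidates: characters the OCR engine was unsure about.
    candidates = [i for i, c in enumerate(confidences) if c < mask_threshold]
    # Keep at most max_edits positions, least confident first.
    candidates.sort(key=lambda i: confidences[i])
    return sorted(candidates[:max_edits])


def accept_correction(model_confidence, apply_threshold=0.95):
    """A proposed substitution is applied only if the language model
    is sufficiently confident in its prediction."""
    return model_confidence >= apply_threshold


# A five-character word with two uncertain characters.
print(select_mask_positions([0.99, 0.02, 0.98, 0.01, 0.97]))  # [1, 3]
print(accept_correction(0.97))  # True
print(accept_correction(0.80))  # False
```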

Available Presets

The following pretrained models are available:

Preset Name              Description
prereform_charlm_g1      Pre-reform Russian text (default)
modern_charlm_g1         Modern Russian text

Quick Example

Use create_page_from_text to quickly test correction on text:

from manuscript.utils import create_page_from_text
from manuscript.correctors import CharLM

# Create page from text with potential OCR errors
page = create_page_from_text(["Привѣтъ міръ", "Тестовая строка"])

# Apply correction (using default prereform model)
corrector = CharLM()
corrected = corrector.predict(page)

# Extract corrected text
for line in corrected.blocks[0].lines:
    text = " ".join(w.text for w in line.words)
    print(text)

Basic Usage

from manuscript import Pipeline
from manuscript.correctors import CharLM

# Create corrector with default preset
corrector = CharLM()

# Create corrector with specific preset
corrector = CharLM(weights="modern_charlm_g1")

# Use in pipeline
pipeline = Pipeline(corrector=corrector)
result = pipeline.predict("document.jpg")

Advanced Configuration

from manuscript.correctors import CharLM

# Fine-tune correction behavior
corrector = CharLM(
    weights="prereform_charlm_g1",
    mask_threshold=0.05,      # Characters with confidence below this are corrected
    apply_threshold=0.95,     # Model must be this confident to apply correction
    max_edits=2,              # Maximum edits per word
    min_word_len=4,           # Minimum word length to attempt correction
    device="cuda"             # Use GPU for inference
)

Using Custom Lexicon

You can provide a lexicon (word list) to prevent corrections of known words:

from manuscript.correctors import CharLM

# From preset
corrector = CharLM(
    weights="prereform_charlm_g1",
    lexicon="prereform_words"  # Use preset lexicon
)

# From file
corrector = CharLM(
    weights="prereform_charlm_g1",
    lexicon="path/to/words.txt"
)

# From set
my_words = {"слово1", "слово2", "слово3"}
corrector = CharLM(
    weights="prereform_charlm_g1",
    lexicon=my_words
)
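
Because lexicon accepts a plain Python set, you can also preprocess a word list yourself before passing it in, for example to normalize case. A minimal sketch (the file "my_words.txt" is hypothetical):

```python
from pathlib import Path

# Hypothetical one-word-per-line lexicon file; written here so the
# snippet is self-contained.
path = Path("my_words.txt")
path.write_text("Слово\nтекстъ\n\nМіръ\n", encoding="utf-8")

# Build a normalized set: strip whitespace, drop blank lines, lowercase.
words = {
    line.strip().lower()
    for line in path.read_text(encoding="utf-8").splitlines()
    if line.strip()
}
print(words)  # {'слово', 'текстъ', 'міръ'}
```

The resulting set can then be passed as lexicon=words when constructing CharLM.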

Training Custom Model

You can train CharLM on your own data:

from manuscript.correctors import CharLM

# Train with OCR pairs dataset
checkpoint_path = CharLM.train(
    pairs_path="ocr_pairs.csv",      # CSV with incorrect,correct columns
    charset_path="charset.txt",       # Allowed characters
    exp_dir="my_charlm_exp",
    epochs=50,
    batch_size=256,
)

# Train with word list (self-supervised)
checkpoint_path = CharLM.train(
    words_path="words.txt",           # One word per line
    charset_path="charset.txt",
    exp_dir="my_charlm_exp",
)
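
The pairs CSV is assumed to contain one OCR error and its correction per row; the incorrect,correct column layout below is inferred from the parameter description, so verify it against your installed version. A minimal sketch of producing such a file:

```python
import csv

# Hypothetical OCR-error / ground-truth pairs.
pairs = [
    ("слбво", "слово"),
    ("тскстъ", "текстъ"),
]

with open("ocr_pairs.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["incorrect", "correct"])
    writer.writerows(pairs)
```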

Export to ONNX

from manuscript.correctors import CharLM

# Export trained model to ONNX
CharLM.export(
    weights_path="exp/checkpoints/charlm_epoch_50.pt",
    vocab_path="exp/vocab.json",
    output_path="my_model.onnx",
)