Correctors
Character-level text corrector for post-processing OCR results.
CharLM
- class manuscript.correctors.CharLM(weights=None, vocab=None, lexicon=None, device=None, mask_threshold=0.05, apply_threshold=0.95, max_edits=2, min_word_len=4, max_len=32, **kwargs)[source]
Bases: BaseModel
Character-level language model corrector using ONNX Runtime.
CharLM uses a Transformer-based masked language model to correct OCR errors at the character level. It analyzes character confidence and applies corrections based on learned substitution patterns.
- Parameters:
weights (str or Path, optional) –
Path or identifier for ONNX model weights. Supports:
- Local file path: "path/to/model.onnx"
- HTTP/HTTPS URL: "https://example.com/model.onnx"
- GitHub release: "github://owner/repo/tag/file.onnx"
- Google Drive: "gdrive:FILE_ID"
- Preset name: "prereform_charlm_g1" or "modern_charlm_g1" (from pretrained_registry)
- None: auto-downloads the default preset (prereform_charlm_g1)
vocab (str or Path, optional) – Path to vocabulary JSON file. If None, inferred from weights location.
lexicon (str, Path, or set, optional) –
Word list for dictionary-based validation. Supports:
- Local file path: "path/to/words.txt"
- Preset name: "prereform_words" or "modern_words" (from lexicon_registry)
- Python set: {"word1", "word2", ...}
- None: auto-downloads the default lexicon for the model preset (prereform_words for prereform_charlm_g1, modern_words for modern_charlm_g1)
device ({"cuda", "cpu"}, optional) – Compute device. Default is auto-detected.
mask_threshold (float, optional) – Confidence threshold below which characters are considered for correction. Default is 0.05.
apply_threshold (float, optional) – Minimum model confidence required to apply a correction. Default is 0.95.
max_edits (int, optional) – Maximum number of edits per word. Default is 2.
min_word_len (int, optional) – Minimum word length to attempt correction. Default is 4.
**kwargs – Additional configuration options.
max_len (int, optional) – Maximum sequence length. Default is 32.
Examples
>>> from manuscript.correctors import CharLM
>>> corrector = CharLM()
>>> corrected_page = corrector.predict(page)
Methods
| Method | Description |
|---|---|
| __call__(*args, **kwargs) | Call self as a function. |
| export(weights_path, vocab_path, output_path) | Export CharLM PyTorch model to ONNX format. |
| predict(page) | Apply character-level correction to a Page. |
| runtime_providers() | Get ONNX Runtime execution providers based on device. |
| train([words_path, text_path, pairs_path, ...]) | Train CharLM character-level language model. |
- pretrained_registry: Dict[str, str] = {'modern_charlm_g1': 'https://github.com/konstantinkozhin/manuscript-ocr/releases/download/v0.1.0/modern_charlm_g1.onnx', 'prereform_charlm_g1': 'https://github.com/konstantinkozhin/manuscript-ocr/releases/download/v0.1.0/prereform_charlm_g1.onnx'}
- vocab_registry = {'modern_charlm_g1': 'https://github.com/konstantinkozhin/manuscript-ocr/releases/download/v0.1.0/modern_charlm_g1.json', 'prereform_charlm_g1': 'https://github.com/konstantinkozhin/manuscript-ocr/releases/download/v0.1.0/prereform_charlm_g1.json'}
- lexicon_registry = {'modern_words': 'https://github.com/konstantinkozhin/manuscript-ocr/releases/download/v0.1.0/modern_words.txt', 'prereform_words': 'https://github.com/konstantinkozhin/manuscript-ocr/releases/download/v0.1.0/prereform_words.txt'}
- default_lexicon_for_model = {'modern_charlm_g1': 'modern_words', 'prereform_charlm_g1': 'prereform_words'}
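The registries above map preset names to release URLs. Resolution of the weights argument can be sketched roughly as follows (a minimal illustration; resolve_weights is a hypothetical helper, not part of the library's public API):

```python
# Mirror of the class-level pretrained_registry shown above.
PRETRAINED_REGISTRY = {
    "modern_charlm_g1": "https://github.com/konstantinkozhin/manuscript-ocr/releases/download/v0.1.0/modern_charlm_g1.onnx",
    "prereform_charlm_g1": "https://github.com/konstantinkozhin/manuscript-ocr/releases/download/v0.1.0/prereform_charlm_g1.onnx",
}

def resolve_weights(weights=None):
    """Turn a weights argument into a concrete location (sketch only)."""
    if weights is None:
        # None falls back to the default preset.
        weights = "prereform_charlm_g1"
    if weights in PRETRAINED_REGISTRY:
        # Preset name -> release URL (the library would download and cache it).
        return PRETRAINED_REGISTRY[weights]
    # Anything else is treated as a local path or URL as-is.
    return str(weights)

print(resolve_weights())                     # default preset URL
print(resolve_weights("path/to/model.onnx")) # passed through unchanged
```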
- __init__(weights=None, vocab=None, lexicon=None, device=None, mask_threshold=0.05, apply_threshold=0.95, max_edits=2, min_word_len=4, max_len=32, **kwargs)[source]
- static train(words_path=None, text_path=None, pairs_path=None, charset_path=None, *, exp_dir='exp_charlm', max_words=1500000, max_pairs_edits=3, max_len=32, emb_size=192, n_layers=8, n_heads=6, ffn_size=1024, dropout=0.1, batch_size=256, accumulation_steps=2, use_amp=True, compile_model=False, epochs=50, lr=0.001, weight_decay=0.01, grad_clip=1.0, min_len=3, mask_prob=0.3, span_min=1, span_max=3, spans_min=1, spans_max=2, pairs_ratio=0.8, eval_ratio=0.01, seed=42, checkpoint=None, **extra_config)[source]
Train CharLM character-level language model.
- Parameters:
words_path (str or Path, optional) – Path to words file (one word per line).
text_path (str or Path, optional) – Path to text file for n-gram dataset.
pairs_path (str or Path, optional) – Path to CSV file with incorrect/correct pairs.
charset_path (str or Path) – Path to charset file (allowed characters).
exp_dir (str, optional) – Experiment directory. Default is “exp_charlm”.
max_words (int, optional) – Maximum words to use from words file. Default is 1_500_000.
max_pairs_edits (int, optional) – Maximum number of character edits in pairs to include. Default is 3.
max_len (int, optional) – Maximum sequence length. Default is 32.
emb_size (int, optional) – Embedding size. Default is 192.
n_layers (int, optional) – Number of transformer layers. Default is 8.
n_heads (int, optional) – Number of attention heads. Default is 6.
ffn_size (int, optional) – Feed-forward network size. Default is 1024.
dropout (float, optional) – Dropout rate. Default is 0.1.
batch_size (int, optional) – Batch size. Default is 256.
accumulation_steps (int, optional) – Gradient accumulation steps. Default is 2.
use_amp (bool, optional) – Use automatic mixed precision (AMP). Default is True.
compile_model (bool, optional) – Use torch.compile for faster training. Default is False.
epochs (int, optional) – Number of epochs. Default is 50.
lr (float, optional) – Learning rate. Default is 1e-3.
weight_decay (float, optional) – Weight decay. Default is 0.01.
grad_clip (float, optional) – Gradient clipping. Default is 1.0.
min_len (int, optional) – Minimum word length. Default is 3.
mask_prob (float, optional) – Probability of using span masking. Default is 0.3.
span_min (int, optional) – Minimum span length for masking. Default is 1.
span_max (int, optional) – Maximum span length for masking. Default is 3.
spans_min (int, optional) – Minimum number of spans. Default is 1.
spans_max (int, optional) – Maximum number of spans. Default is 2.
pairs_ratio (float, optional) – Ratio of OCR pairs in mixed dataset (0.8 = 80% pairs, 20% ngrams). Default is 0.8.
eval_ratio (float, optional) – Evaluation set ratio. Default is 0.01.
seed (int, optional) – Random seed. Default is 42.
checkpoint (str, optional) – Path to checkpoint to resume from.
**extra_config – Additional config options.
- Returns:
Path to the final checkpoint.
- Return type:
- static export(weights_path, vocab_path, output_path, max_len=32, emb_size=192, n_layers=8, n_heads=6, ffn_size=1024, opset_version=14, simplify=True)[source]
Export CharLM PyTorch model to ONNX format.
- Parameters:
weights_path (str or Path) – Path to PyTorch checkpoint (.pt file).
vocab_path (str or Path) – Path to vocabulary JSON file.
output_path (str or Path) – Path to save ONNX model.
max_len (int, optional) – Maximum sequence length. Default is 32.
emb_size (int, optional) – Embedding size. Default is 192.
n_layers (int, optional) – Number of transformer layers. Default is 8.
n_heads (int, optional) – Number of attention heads. Default is 6.
ffn_size (int, optional) – Feed-forward network size. Default is 1024.
opset_version (int, optional) – ONNX opset version. Default is 14.
simplify (bool, optional) – Apply ONNX simplification. Default is True.
- Return type:
Overview
CharLM is a character-level masked language model corrector that uses a Transformer
architecture to fix OCR errors. It analyzes character-level context and applies
corrections based on learned patterns.
Key features:
Character-level Transformer-based correction
Configurable confidence thresholds
Support for custom vocabularies and lexicons
ONNX Runtime inference for fast correction
Optional lexicon filtering to preserve known words
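The confidence-gated correction flow described above can be sketched as follows. This is an illustration of how mask_threshold, apply_threshold, max_edits, and min_word_len interact, not the library's actual implementation; correct_word and predict_fn are hypothetical names:

```python
def correct_word(chars, confidences, predict_fn,
                 mask_threshold=0.05, apply_threshold=0.95,
                 max_edits=2, min_word_len=4):
    """Sketch of confidence-gated character correction.

    chars: characters of one OCR word
    confidences: per-character OCR confidence in [0, 1]
    predict_fn(chars, i) -> (char, prob): stand-in for the masked-LM
    prediction at position i.
    """
    if len(chars) < min_word_len:
        return "".join(chars)  # word too short to attempt correction
    out = list(chars)
    edits = 0
    for i, conf in enumerate(confidences):
        if edits >= max_edits:
            break  # edit budget for this word exhausted
        if conf >= mask_threshold:
            continue  # OCR was confident enough; leave the char alone
        best_char, prob = predict_fn(out, i)
        if prob >= apply_threshold and best_char != out[i]:
            out[i] = best_char
            edits += 1
    return "".join(out)

# Toy predictor that always proposes "o" with probability 0.99.
fake_predict = lambda chars, i: ("o", 0.99)
print(correct_word(list("w0rd"), [0.9, 0.01, 0.9, 0.9], fake_predict))
# -> "word"
```

Only the low-confidence character (0.01 < mask_threshold) is re-predicted, and the replacement is applied only because the model's probability (0.99) clears apply_threshold.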
Available Presets
The following pretrained models are available:
| Preset Name | Description |
|---|---|
| prereform_charlm_g1 | Pre-reform Russian text (default) |
| modern_charlm_g1 | Modern Russian text |
Quick Example
Use create_page_from_text to quickly test correction on text:
from manuscript.utils import create_page_from_text
from manuscript.correctors import CharLM
# Create page from text with potential OCR errors
page = create_page_from_text(["Привѣтъ міръ", "Тестовая строка"])
# Apply correction (using default prereform model)
corrector = CharLM()
corrected = corrector.predict(page)
# Extract corrected text
for line in corrected.blocks[0].lines:
    text = " ".join(w.text for w in line.words)
    print(text)
Basic Usage
from manuscript import Pipeline
from manuscript.correctors import CharLM
# Create corrector with default preset
corrector = CharLM()
# Create corrector with specific preset
corrector = CharLM(weights="modern_charlm_g1")
# Use in pipeline
pipeline = Pipeline(corrector=corrector)
result = pipeline.predict("document.jpg")
Advanced Configuration
from manuscript.correctors import CharLM
# Fine-tune correction behavior
corrector = CharLM(
weights="prereform_charlm_g1",
mask_threshold=0.05, # Characters with confidence below this are corrected
apply_threshold=0.95, # Model must be this confident to apply correction
max_edits=2, # Maximum edits per word
min_word_len=4, # Minimum word length to attempt correction
device="cuda" # Use GPU for inference
)
Using Custom Lexicon
You can provide a lexicon (word list) to prevent corrections of known words:
from manuscript.correctors import CharLM
# From preset
corrector = CharLM(
weights="prereform_charlm_g1",
lexicon="prereform_words" # Use preset lexicon
)
# From file
corrector = CharLM(
weights="prereform_charlm_g1",
lexicon="path/to/words.txt"
)
# From set
my_words = {"слово1", "слово2", "слово3"}
corrector = CharLM(
weights="prereform_charlm_g1",
lexicon=my_words
)
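Internally, all three accepted forms end up as a set of known words. A sketch of that normalization (load_lexicon is a hypothetical helper, not a documented function):

```python
from pathlib import Path

def load_lexicon(lexicon):
    """Normalize the accepted lexicon forms into a set of words (sketch)."""
    if lexicon is None:
        return None  # the library would auto-download the default preset
    if isinstance(lexicon, set):
        return lexicon  # already a word set
    path = Path(lexicon)
    if path.is_file():
        # One word per line, as in the preset .txt files.
        lines = path.read_text(encoding="utf-8").splitlines()
        return {w.strip() for w in lines if w.strip()}
    # Otherwise assume a preset name, to be resolved via lexicon_registry.
    return lexicon

print(load_lexicon({"слово1", "слово2"}))
```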
Training Custom Model
You can train CharLM on your own data:
from manuscript.correctors import CharLM
# Train with OCR pairs dataset
checkpoint_path = CharLM.train(
pairs_path="ocr_pairs.csv", # CSV with incorrect,correct columns
charset_path="charset.txt", # Allowed characters
exp_dir="my_charlm_exp",
epochs=50,
batch_size=256,
)
# Train with word list (self-supervised)
checkpoint_path = CharLM.train(
words_path="words.txt", # One word per line
charset_path="charset.txt",
exp_dir="my_charlm_exp",
)
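The pairs_path file holds one incorrect/correct pair per row, as noted in the comment above. A minimal way to produce such a CSV (the "incorrect,correct" column names are an assumption based on that comment, not a documented schema):

```python
import csv
import os
import tempfile

# Tiny OCR-pairs CSV: one (incorrect, correct) pair per row.
rows = [
    ("incorrect", "correct"),  # header row; column names are an assumption
    ("сл0во", "слово"),        # digit "0" misread for Cyrillic "о"
    ("тeкст", "текст"),        # Latin "e" misread for Cyrillic "е"
]
fd, pairs_csv = tempfile.mkstemp(suffix=".csv")
with os.fdopen(fd, "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)

print(pairs_csv)  # pass this path as pairs_path=... to CharLM.train
```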
Export to ONNX
from manuscript.correctors import CharLM
# Export trained model to ONNX
CharLM.export(
weights_path="exp/checkpoints/charlm_epoch_50.pt",
vocab_path="exp/vocab.json",
output_path="my_model.onnx",
)