Recognizers

Text recognition models.

class manuscript.recognizers.TRBA(weights=None, config=None, charset=None, device=None, rotate_threshold=1.5, region_preparer='bbox', region_preparer_options=None, min_text_size=5, **kwargs)[source]

Bases: BaseRecognizer

Initialize TRBA text recognition model with ONNX Runtime.

Parameters:
  • weights (str or Path, optional) –

    Path or identifier for ONNX model weights. Supports:

    • Local file path: "path/to/model.onnx"

    • HTTP/HTTPS URL: "https://example.com/model.onnx"

    • GitHub release: "github://owner/repo/tag/file.onnx"

    • Google Drive: "gdrive:FILE_ID"

    • Preset name: "trba_lite_g1", "trba_base_g1", or "trba_lite_g2" (keys of pretrained_registry)

    • None: auto-downloads default preset (trba_lite_g1)

  • config (str or Path, optional) – Path or identifier for model configuration JSON. Same URL schemes as weights. If None, attempts to infer from weights location or uses default config for preset models.

  • charset (str or Path, optional) – Path or identifier for character set file. If None, attempts to find charset near weights or falls back to package default.

  • device ({"cuda", "coreml", "cpu"}, optional) –

    Compute device. If None, automatically selects CPU. For GPU/CoreML acceleration:

    • CUDA (NVIDIA): pip install onnxruntime-gpu

    • CoreML (Apple Silicon M1/M2/M3): pip install onnxruntime-silicon

    Default is None (CPU).

  • rotate_threshold (float or None, optional) – Aspect-ratio threshold for rotating vertical text-span crops before recognition. If height > width * rotate_threshold, the crop is rotated 90 degrees clockwise. Set to 0 or None to disable rotation. Default is 1.5.

  • region_preparer ({"bbox", "polygon_mask", "quad_warp"} or callable, optional) – Strategy used to convert Page polygons into recognition crops. "bbox" extracts axis-aligned bounding boxes for arbitrary polygons. "polygon_mask" masks pixels outside the polygon inside a tight crop and also supports arbitrary polygons. "quad_warp" rectifies only 4-point polygons with a perspective transform before recognition. A custom callable may also be provided and should return a list of prepared text regions. Default is "bbox".

  • region_preparer_options (dict or None, optional) – Optional configuration for built-in region preparers. Defaults to None. Typical options are pad for "bbox" and "polygon_mask", or output_size=(width, height) for "quad_warp". Non-quad polygons passed to "quad_warp" fall back to bbox crops by default.

  • min_text_size (int, optional) – Minimum crop width/height in pixels to run recognition for a text span. Text spans below this threshold are skipped. Default is 5.

  • **kwargs – Additional configuration options (reserved for future use).
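
The rotate_threshold rule above can be sketched in a few lines. This is a minimal illustration of the documented condition only, not the library's code; the helper names should_rotate and rotated_size are hypothetical.

```python
from typing import Optional, Tuple


def should_rotate(width: int, height: int,
                  rotate_threshold: Optional[float] = 1.5) -> bool:
    """Return True when a crop counts as vertical under the documented rule."""
    # 0 or None disables rotation, as described for rotate_threshold.
    if not rotate_threshold:
        return False
    return height > width * rotate_threshold


def rotated_size(width: int, height: int) -> Tuple[int, int]:
    """Size of a crop after a 90-degree rotation: width and height swap."""
    return height, width


print(should_rotate(40, 100))        # True: 100 > 40 * 1.5
print(should_rotate(100, 40))        # False: 40 is not greater than 150
print(should_rotate(40, 100, None))  # False: rotation disabled
```

Because the comparison is strict, a crop whose height equals width * rotate_threshold is left unrotated.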

Notes

The class provides three main public methods:

  • predict - run recognition over text spans in a Page object.

  • train - high-level training entrypoint to train a TRBA model on custom datasets.

  • export - static method to export PyTorch model to ONNX format.

The model uses ONNX Runtime for fast inference on CPU and GPU. For GPU acceleration, install the GPU build: pip install onnxruntime-gpu
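
As a rough illustration of how a device name might translate into ONNX Runtime execution providers, consider the sketch below. The provider strings are standard ONNX Runtime names. The helper providers_for and the CPU fallback ordering are assumptions, and the real runtime_providers() method may behave differently.

```python
def providers_for(device=None):
    """Map a device name to an ordered provider list (hypothetical helper)."""
    table = {
        # Assumption: accelerated devices keep CPU as a fallback provider.
        "cuda": ["CUDAExecutionProvider", "CPUExecutionProvider"],
        "coreml": ["CoreMLExecutionProvider", "CPUExecutionProvider"],
        "cpu": ["CPUExecutionProvider"],
    }
    # None selects CPU, matching the documented default.
    key = "cpu" if device is None else str(device).lower()
    if key not in table:
        raise ValueError(f"unsupported device: {device!r}")
    return table[key]


print(providers_for())        # ['CPUExecutionProvider']
print(providers_for("cuda"))  # ['CUDAExecutionProvider', 'CPUExecutionProvider']
```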

Examples

Create recognizer with default preset (auto-downloads):

>>> from manuscript.recognizers import TRBA
>>> recognizer = TRBA()

Load from local ONNX file:

>>> recognizer = TRBA(weights="path/to/model.onnx")

Load from GitHub release:

>>> recognizer = TRBA(
...     weights="github://owner/repo/v1.0/model.onnx",
...     config="github://owner/repo/v1.0/config.json"
... )

Force CPU execution:

>>> recognizer = TRBA(weights="model.onnx", device="cpu")
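
The identifier schemes accepted by weights can be told apart by prefix. The sketch below only classifies and splits identifiers and downloads nothing. classify_weights is a hypothetical helper, and the order of the checks is an assumption rather than the library's resolution logic.

```python
from urllib.parse import urlparse

# Preset names taken from the keys of pretrained_registry.
PRESETS = {"trba_lite_g1", "trba_base_g1", "trba_lite_g2"}


def classify_weights(weights):
    """Return a (kind, payload) pair describing a weights identifier."""
    if weights is None:
        # Documented default preset.
        return ("preset", "trba_lite_g1")
    text = str(weights)
    if text in PRESETS:
        return ("preset", text)
    if text.startswith("github://"):
        # github://owner/repo/tag/file.onnx
        owner, repo, tag, filename = text[len("github://"):].split("/", 3)
        return ("github", (owner, repo, tag, filename))
    if text.startswith("gdrive:"):
        return ("gdrive", text[len("gdrive:"):])
    if urlparse(text).scheme in ("http", "https"):
        return ("url", text)
    # Anything else is treated as a local file path.
    return ("local", text)


print(classify_weights("github://owner/repo/v1.0/model.onnx"))
print(classify_weights("gdrive:FILE_ID"))
print(classify_weights("path/to/model.onnx"))
```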

Methods

__call__(*args, **kwargs)

Call self as a function.

export(weights_path, config_path, ...[, ...])

Export TRBA PyTorch model to ONNX format.

predict(page[, image, batch_size, ...])

Recognize text for text spans in a Page and return updated Page.

runtime_providers()

Get ONNX Runtime execution providers based on device.

train(train_csvs, train_roots[, val_csvs, ...])

Train TRBA text recognition model on custom datasets.

__init__(weights=None, config=None, charset=None, device=None, rotate_threshold=1.5, region_preparer='bbox', region_preparer_options=None, min_text_size=5, **kwargs)[source]
charset_registry = {'trba_base_g1': 'github://konstantinkozhin/manuscript-ocr/v0.1.0/trba_base_g1.txt', 'trba_lite_g1': 'github://konstantinkozhin/manuscript-ocr/v0.1.0/trba_lite_g1.txt', 'trba_lite_g2': 'github://konstantinkozhin/manuscript-ocr/v0.1.0/trba_lite_g2.txt'}
config_registry = {'trba_base_g1': 'github://konstantinkozhin/manuscript-ocr/v0.1.0/trba_base_g1.json', 'trba_lite_g1': 'github://konstantinkozhin/manuscript-ocr/v0.1.0/trba_lite_g1.json', 'trba_lite_g2': 'github://konstantinkozhin/manuscript-ocr/v0.1.0/trba_lite_g2.json'}
default_weights_name: str | None = 'trba_lite_g1'
static export(weights_path, config_path, charset_path, output_path, opset_version=14, simplify=True)[source]

Export TRBA PyTorch model to ONNX format.

This method converts a trained TRBA model from PyTorch to ONNX format, which can be used for faster inference with ONNX Runtime. The exported model can be loaded using TRBA(weights="model.onnx").

Parameters:
  • weights_path (str or Path) – Path to the PyTorch model weights file (.pth).

  • config_path (str or Path) – Path to the model configuration JSON file. Used to determine model architecture (img_h, img_w, max_len, hidden_size, etc.).

  • charset_path (str or Path) – Path to the charset file (charset.txt). Used to determine num_classes for the model.

  • output_path (str or Path) – Path where the ONNX model will be saved (.onnx).

  • opset_version (int, optional) – ONNX opset version to use for export. Default is 14.

  • simplify (bool, optional) – If True, applies ONNX graph simplification using onnx-simplifier to optimize the model. Requires onnx-simplifier package. Default is True.

Returns:

The ONNX model is saved to output_path.

Return type:

None

Notes

The exported ONNX model has one output:

  • logits: Character predictions with shape (batch, max_length+1, num_classes)

The model uses greedy decoding (argmax) and supports dynamic batch size. The sequence length is fixed to max_length + 1 from the config (same as PyTorch inference mode for compatibility).

Architecture exported:

  • CNN backbone (SE-ResNet-31 or SE-ResNet-31-Lite)

  • Bidirectional LSTM encoder

  • Attention decoder (greedy decoding)

Note: Only the attention decoder is exported. CTC head is used only during training and is not included in the ONNX model.
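
Greedy decoding over a logits tensor of shape (batch, max_length+1, num_classes) takes the argmax at each step. The sketch below uses plain nested lists so it runs without extra packages. The charset layout, the eos_index value, and stopping at the end-of-sequence token are assumptions for illustration, since the source does not describe the token layout.

```python
def greedy_decode(logits, charset, eos_index=0):
    """Decode nested lists shaped (batch, steps, num_classes) into strings."""
    texts = []
    for sequence in logits:
        chars = []
        for step in sequence:
            # argmax over the class dimension
            best = max(range(len(step)), key=step.__getitem__)
            if best == eos_index:
                # Assumption: decoding stops at the end-of-sequence token.
                break
            chars.append(charset[best])
        texts.append("".join(chars))
    return texts


# Hypothetical three-class charset: index 0 marks end of sequence.
charset = ["<eos>", "a", "b"]
logits = [[[0.1, 2.0, 0.3], [0.0, 0.1, 1.5], [3.0, 0.2, 0.1]]]
print(greedy_decode(logits, charset))  # ['ab']
```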

Examples

Export TRBA model to ONNX:

>>> from manuscript.recognizers import TRBA
>>> TRBA.export(
...     weights_path="experiments/best_model/best_acc_weights.pth",
...     config_path="experiments/best_model/config.json",
...     charset_path="configs/charset.txt",
...     output_path="trba_model.onnx"
... )
Loading TRBA model...
=== TRBA ONNX Export ===
Max decoding length: 40
Input size: 64x256
[OK] ONNX model saved to: trba_model.onnx

Export with custom opset:

>>> TRBA.export(
...     weights_path="model.pth",
...     config_path="config.json",
...     charset_path="charset.txt",
...     output_path="model.onnx",
...     opset_version=16,
...     simplify=False
... )

Use the exported model for inference:

>>> from manuscript.detectors import EAST
>>> recognizer = TRBA(weights="trba_model.onnx")
>>> detector = EAST()
>>> det = detector.predict("page.jpg")
>>> result = recognizer.predict(det["page"], image="page.jpg")

See also

TRBA.__init__

Initialize TRBA recognizer with ONNX model.

predict(page, image=None, batch_size=32, debug_save_dir=None)[source]

Recognize text for text spans in a Page and return updated Page.

Parameters:
  • page (Page) – Page object with detected text-span polygons.

  • image (str, Path, numpy.ndarray, or PIL.Image, optional) – Source page image used to extract text regions. If None, recognition is skipped and a deep copy of page is returned.

  • batch_size (int, optional) – Number of prepared text regions to process simultaneously.

  • debug_save_dir (str or Path, optional) – If provided, saves the prepared recognition crops to this directory as *.png files together with index.json. Crops are saved after region_preparer and auto-rotation, i.e. in the same orientation that goes into recognizer inference.

Returns:

New Page object with recognized text and recognition_confidence filled for processed text spans.

Return type:

Page
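
To show what batch_size controls, the sketch below splits a list of prepared regions into consecutive chunks. It assumes simple in-order chunking. iter_batches is a hypothetical helper, and the library may group regions differently.

```python
def iter_batches(items, batch_size=32):
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# 70 placeholder regions with the default batch size give three batches.
regions = list(range(70))
print([len(batch) for batch in iter_batches(regions)])  # [32, 32, 6]
```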

pretrained_registry: Dict[str, str] = {'trba_base_g1': 'github://konstantinkozhin/manuscript-ocr/v0.1.0/trba_base_g1.onnx', 'trba_lite_g1': 'github://konstantinkozhin/manuscript-ocr/v0.1.0/trba_lite_g1.onnx', 'trba_lite_g2': 'github://konstantinkozhin/manuscript-ocr/v0.1.0/trba_lite_g2.onnx'}
static train(train_csvs, train_roots, val_csvs=None, val_roots=None, *, exp_dir=None, charset_path=None, encoding='utf-8', img_h=64, img_w=256, max_len=25, hidden_size=256, num_encoder_layers=3, cnn_in_channels=3, cnn_out_channels=512, cnn_backbone='seresnet31', ctc_weight=0.3, ctc_weight_decay_epochs=50, ctc_weight_min=0.0, max_grad_norm=5.0, batch_size=32, epochs=20, lr=0.001, optimizer='AdamW', scheduler='OneCycleLR', weight_decay=0.0, momentum=0.9, val_interval=1, val_size=3000, train_proportions=None, num_workers=0, seed=42, resume_from=None, save_interval=None, device='cuda', freeze_cnn='none', freeze_enc_rnn='none', freeze_attention='none', pretrain_weights='default', **extra_config)[source]

Train TRBA text recognition model on custom datasets.

Parameters:
  • train_csvs (str, Path or sequence of paths) – Path(s) to training CSV files. Each CSV should have columns: image_path (relative to train_roots) and text (ground truth transcription).

  • train_roots (str, Path or sequence of paths) – Root directory/directories containing training images. Must have same length as train_csvs.

  • val_csvs (str, Path, sequence of paths, or None, optional) – Path(s) to validation CSV files with same format as train_csvs. If None, no validation is performed. Default is None.

  • val_roots (str, Path, sequence of paths, or None, optional) – Root directory/directories for validation images. Must match length of val_csvs if provided. Default is None.

  • exp_dir (str or Path, optional) – Experiment directory where checkpoints and logs will be saved. If None, auto-generated based on timestamp. Default is None.

  • charset_path (str or Path, optional) – Path to character set file. If None, uses default charset from package. Default is None.

  • encoding (str, optional) – Text encoding for reading CSV files. Default is "utf-8".

  • img_h (int, optional) – Target height for input images (pixels). Default is 64.

  • img_w (int, optional) – Target width for input images (pixels). Default is 256.

  • max_len (int, optional) – Maximum sequence length for text recognition. Default is 25.

  • hidden_size (int, optional) – Hidden dimension size for RNN encoder/decoder. Default is 256.

  • num_encoder_layers (int, optional) – Number of bidirectional LSTM layers in the encoder. Default is 3.

  • cnn_in_channels (int, optional) – Number of input channels for CNN backbone (3 for RGB, 1 for grayscale). Default is 3.

  • cnn_out_channels (int, optional) – Number of output channels from CNN backbone. Default is 512.

  • cnn_backbone ({"seresnet31", "seresnet31-lite"}, optional) – CNN backbone variant. "seresnet31" keeps the standard SE-ResNet-31, while "seresnet31-lite" enables a depthwise-lite version. Default is "seresnet31".

  • ctc_weight (float, optional) – Initial weight of the CTC loss during training. CTC is always used for stability: loss = attn_loss * (1 - ctc_weight) + ctc_loss * ctc_weight. The CTC weight decays over epochs toward ctc_weight_min. Default is 0.3.

  • ctc_weight_decay_epochs (int, optional) – Number of epochs for CTC weight to decay to minimum. Default is 50.

  • ctc_weight_min (float, optional) – Minimum value for CTC weight after decay. Default is 0.0.

  • max_grad_norm (float, optional) – Maximum gradient norm for clipping (prevents gradient explosion/NaN). Default is 5.0.

  • batch_size (int, optional) – Training batch size. Default is 32.

  • epochs (int, optional) – Number of training epochs. Default is 20.

  • lr (float, optional) – Learning rate. Default is 1e-3.

  • optimizer ({"Adam", "SGD", "AdamW"}, optional) – Optimizer type. Default is "AdamW".

  • scheduler ({"ReduceLROnPlateau", "CosineAnnealingLR", "OneCycleLR", "None"}, optional) –

    Learning rate scheduler type:

    • "OneCycleLR" - one-cycle policy with cosine annealing (default, recommended)

    • "ReduceLROnPlateau" - reduce LR on validation loss plateau

    • "CosineAnnealingLR" - cosine annealing over epochs

    • "None" or None - constant learning rate

    Default is "OneCycleLR".

  • weight_decay (float, optional) – L2 weight decay coefficient. Default is 0.0.

  • momentum (float, optional) – Momentum for SGD optimizer. Default is 0.9.

  • val_interval (int, optional) – Perform validation every N epochs. Default is 1.

  • val_size (int, optional) – Maximum number of validation samples to use. Default is 3000.

  • train_proportions (sequence of float, optional) – Sampling proportions for multiple training datasets. Must sum to 1.0 and match length of train_csvs. If None, datasets are concatenated equally. Default is None.

  • num_workers (int, optional) – Number of data loading workers. Default is 0.

  • seed (int, optional) – Random seed for reproducibility. Default is 42.

  • resume_from (str or Path, optional) – Path to checkpoint file to resume training from. Default is None.

  • save_interval (int, optional) – Save checkpoint every N epochs. If None, only saves best model. Default is None.

  • device ({"cuda", "cpu"}, optional) – Training device. Default is "cuda".

  • freeze_cnn ({"none", "all", "first", "last"}, optional) – CNN freezing policy. Default is "none".

  • freeze_enc_rnn ({"none", "all", "first", "last"}, optional) – Encoder RNN freezing policy. Default is "none".

  • freeze_attention ({"none", "all"}, optional) – Attention module freezing policy. Default is "none".

  • pretrain_weights (str, Path, bool, or None, optional) –

    Pretrained weights to initialize from:

    • "default" or True - use release weights

    • None or False - train from scratch

    • str/Path - path or URL to custom weights file

    Default is "default".

  • **extra_config (dict, optional) – Additional configuration parameters passed to training config.

Returns:

Path to the best model checkpoint saved during training.

Return type:

str
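
The interaction between ctc_weight, ctc_weight_decay_epochs, and ctc_weight_min can be sketched as follows. The loss formula comes from the ctc_weight description. The linear decay shape and the helper names are assumptions, because the source only says that the weight decays to its minimum over the given number of epochs.

```python
def ctc_weight_at(epoch, ctc_weight=0.3, decay_epochs=50, ctc_weight_min=0.0):
    """CTC weight for a given epoch, assuming a linear decay schedule."""
    if decay_epochs <= 0:
        return ctc_weight_min
    # Fraction of the decay window completed, clamped to [0, 1].
    progress = min(max(epoch, 0) / decay_epochs, 1.0)
    return ctc_weight + (ctc_weight_min - ctc_weight) * progress


def combined_loss(attn_loss, ctc_loss, weight):
    """loss = attn_loss * (1 - ctc_weight) + ctc_loss * ctc_weight"""
    return attn_loss * (1 - weight) + ctc_loss * weight


for epoch in (0, 25, 50, 100):
    print(epoch, round(ctc_weight_at(epoch), 3))

print(combined_loss(2.0, 4.0, 0.25))  # 2.5
```

With the defaults, the CTC term contributes 30% of the loss at epoch 0 and nothing from epoch 50 onward under this assumed schedule.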

Examples

Train on single dataset with validation:

>>> from manuscript.recognizers import TRBA
>>>
>>> best_model = TRBA.train(
...     train_csvs="data/train.csv",
...     train_roots="data/train_images",
...     val_csvs="data/val.csv",
...     val_roots="data/val_images",
...     exp_dir="./experiments/trba_exp1",
...     epochs=50,
...     batch_size=64,
...     img_h=64,
...     img_w=256,
... )
>>> print(f"Best model saved at: {best_model}")

Train on multiple datasets with custom proportions:

>>> train_csvs = ["data/dataset1/train.csv", "data/dataset2/train.csv"]
>>> train_roots = ["data/dataset1/images", "data/dataset2/images"]
>>> train_proportions = [0.7, 0.3]  # 70% from dataset1, 30% from dataset2
>>>
>>> best_model = TRBA.train(
...     train_csvs=train_csvs,
...     train_roots=train_roots,
...     train_proportions=train_proportions,
...     val_csvs="data/val.csv",
...     val_roots="data/val_images",
...     epochs=100,
...     lr=5e-4,
...     optimizer="AdamW",
...     weight_decay=1e-4,
... )

Resume training from checkpoint:

>>> best_model = TRBA.train(
...     train_csvs="data/train.csv",
...     train_roots="data/train_images",
...     resume_from="experiments/trba_exp1/checkpoints/last.pth",
...     epochs=100,
... )

Fine-tune from pretrained weights with frozen CNN:

>>> best_model = TRBA.train(
...     train_csvs="data/finetune.csv",
...     train_roots="data/finetune_images",
...     pretrain_weights="default",
...     freeze_cnn="all",
...     epochs=20,
...     lr=1e-4,
... )

Train with CTC for stability (always enabled):

>>> best_model = TRBA.train(
...     train_csvs="data/train.csv",
...     train_roots="data/train_images",
...     optimizer="AdamW",
...     scheduler="OneCycleLR",
...     lr=1e-3,
...     ctc_weight=0.3,
...     ctc_weight_decay_epochs=50,
...     max_grad_norm=5.0,
...     epochs=100,
... )
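
The CSV layout described for train_csvs and the constraints on train_proportions can be prepared and checked with the standard library. The sketch below assumes a comma-separated file with a header row naming the image_path and text columns. The file names are placeholders, and write_labels_csv and check_proportions are hypothetical helpers, not part of the package.

```python
import csv
import math
import tempfile
from pathlib import Path


def write_labels_csv(path, rows):
    """Write (image_path, text) pairs with a header row (assumed layout)."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["image_path", "text"])
        writer.writerows(rows)


def check_proportions(train_csvs, train_proportions):
    """Validate the two constraints stated for train_proportions."""
    if len(train_proportions) != len(train_csvs):
        raise ValueError("train_proportions must match the length of train_csvs")
    if not math.isclose(sum(train_proportions), 1.0, abs_tol=1e-6):
        raise ValueError("train_proportions must sum to 1.0")


with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "train.csv"
    # image_path values are relative to the matching train_roots entry.
    write_labels_csv(csv_path, [("lines/0001.png", "hello"),
                                ("lines/0002.png", "world")])
    with open(csv_path, newline="", encoding="utf-8") as handle:
        rows = list(csv.DictReader(handle))

print(rows[0]["image_path"], rows[0]["text"])
check_proportions(["a.csv", "b.csv"], [0.7, 0.3])  # passes silently
```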