Detectors
Text detection models.
- class manuscript.detectors.EAST(weights=None, device=None, *, target_size=1280, expand_ratio_w=1.4, expand_ratio_h=1.5, expand_power=0.6, score_thresh=0.6, iou_threshold=0.05, iou_threshold_standard=0.05, score_geo_scale=0.25, quantization=2, axis_aligned_output=True, remove_area_anomalies=False, anomaly_sigma_threshold=5.0, anomaly_min_box_count=30, use_tta=False, tta_iou_thresh=0.1, **kwargs)[source]
Bases: BaseModel

Initialize EAST text detector with ONNX Runtime.
- Parameters:
weights (str or Path, optional) – Path or identifier for ONNX model weights. Supports:
- Local file path: "path/to/model.onnx"
- HTTP/HTTPS URL: "https://example.com/model.onnx"
- GitHub release: "github://owner/repo/tag/file.onnx"
- Google Drive: "gdrive:FILE_ID"
- Preset name: "east_50_g1"
- None: auto-downloads the default preset (east_50_g1)
device (str, optional) – Compute device: "cuda", "coreml", or "cpu". If None, CPU is selected automatically. For GPU/CoreML acceleration:
- CUDA (NVIDIA): pip install onnxruntime-gpu
- CoreML (Apple Silicon M1/M2/M3): pip install onnxruntime-silicon
Default is None (CPU).
target_size (int, optional) – Input image size for inference. Images are resized to (target_size, target_size). Default is 1280.
expand_ratio_w (float, optional) – Horizontal expansion factor applied to detected boxes after NMS. Default is 1.4.
expand_ratio_h (float, optional) – Vertical expansion factor applied to detected boxes after NMS. Default is 1.5.
expand_power (float, optional) – Power for non-linear box expansion; controls how expansion scales with box size:
- 1.0 – linear (small and large boxes expand equally)
- <1.0 – small boxes expand more (recommended for character-level detection)
- >1.0 – large boxes expand more
Default is 0.6.
score_thresh (float, optional) – Confidence threshold for selecting candidate detections before NMS. Default is 0.6.
iou_threshold (float, optional) – IoU threshold for the locality-aware NMS merging phase. Default is 0.05.
iou_threshold_standard (float, optional) – IoU threshold for standard NMS after locality-aware merging. If None, uses the same value as iou_threshold. Default is 0.05.
score_geo_scale (float, optional) – Scale factor for decoding geometry/score maps. Default is 0.25.
quantization (int, optional) – Quantization resolution for point coordinates during decoding. Default is 2.
axis_aligned_output (bool, optional) – If True, outputs axis-aligned rectangles instead of original quads. Default is True.
remove_area_anomalies (bool, optional) – If True, removes quads with extremely large area relative to the distribution. Default is False.
anomaly_sigma_threshold (float, optional) – Sigma threshold for anomaly area filtering. Default is 5.0.
anomaly_min_box_count (int, optional) – Minimum number of boxes required before anomaly filtering. Default is 30.
use_tta (bool, optional) – Enable Test-Time Augmentation (TTA). When enabled, inference is run on both the original and horizontally flipped image, and results are merged. This can improve detection of partially visible or edge text. Default is False.
tta_iou_thresh (float, optional) – IoU threshold for merging boxes from original and flipped images during TTA. Boxes with IoU > threshold are considered matches and merged. Default is 0.1.
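As an illustration of the remove_area_anomalies behaviour described above, here is a minimal sketch of sigma-based area filtering. This mirrors the documented parameters only; the function name and implementation are hypothetical, not the library's internal code:

```python
import statistics

def filter_area_anomalies(areas, sigma_threshold=5.0, min_box_count=30):
    """Drop boxes whose area is an extreme outlier (illustrative sketch).

    Keeps every area unless it exceeds the mean by more than
    `sigma_threshold` standard deviations; the filter only activates
    once at least `min_box_count` boxes are available.
    """
    if len(areas) < min_box_count:
        return areas  # too few boxes for a reliable distribution estimate
    mean = statistics.mean(areas)
    std = statistics.pstdev(areas)
    cutoff = mean + sigma_threshold * std
    return [a for a in areas if a <= cutoff]

# With fewer than min_box_count boxes, nothing is removed:
assert filter_area_anomalies([1, 2, 1000], min_box_count=30) == [1, 2, 1000]
```

This explains why anomaly_min_box_count exists: with only a handful of detections, the area distribution is too noisy for a meaningful sigma cutoff.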
Notes
The class provides two main public methods:
- predict – run inference on a single image and return detections.
- train – high-level training entrypoint to train an EAST model on custom datasets.
The detector uses ONNX Runtime for fast inference on CPU and GPU. For GPU acceleration, install:
pip install onnxruntime-gpu

Methods
__call__(*args, **kwargs) – Call self as a function.
export(weights_path, output_path[, ...]) – Export EAST PyTorch model to ONNX format.
predict(img_or_path[, return_maps, ...]) – Run EAST inference and return detection results.
runtime_providers() – Get ONNX Runtime execution providers based on device.
train(train_images, train_anns, val_images, ...) – Train EAST model on custom datasets.
- pretrained_registry: Dict[str, str] = {'east_50_g1': 'github://konstantinkozhin/manuscript-ocr/v0.1.0/east_50_g1.onnx'}
- __init__(weights=None, device=None, *, target_size=1280, expand_ratio_w=1.4, expand_ratio_h=1.5, expand_power=0.6, score_thresh=0.6, iou_threshold=0.05, iou_threshold_standard=0.05, score_geo_scale=0.25, quantization=2, axis_aligned_output=True, remove_area_anomalies=False, anomaly_sigma_threshold=5.0, anomaly_min_box_count=30, use_tta=False, tta_iou_thresh=0.1, **kwargs)[source]
- Parameters:
device (str | None)
target_size (int)
expand_ratio_w (float)
expand_ratio_h (float)
expand_power (float)
score_thresh (float)
iou_threshold (float)
iou_threshold_standard (float | None)
score_geo_scale (float)
quantization (int)
axis_aligned_output (bool)
remove_area_anomalies (bool)
anomaly_sigma_threshold (float)
anomaly_min_box_count (int)
use_tta (bool)
tta_iou_thresh (float)
- predict(img_or_path, return_maps=False, sort_reading_order=True, split_into_columns=True, max_columns=10)[source]
Run EAST inference and return detection results.
- Parameters:
img_or_path (str or pathlib.Path or numpy.ndarray) – Path to an image file or an RGB image provided as a NumPy array with shape (H, W, 3) in uint8 format.
return_maps (bool, optional) – If True, returns raw model score and geometry maps under keys "score_map" and "geo_map". Default is False.
sort_reading_order (bool, optional) – If True, sorts detected words in natural reading order (left-to-right, top-to-bottom) and groups them into text lines. Default is True.
split_into_columns (bool, optional) – If True and sort_reading_order=True, segments the page into columns (separate Blocks). If False, treats the entire page as a single column. Only used when sort_reading_order=True. Default is True.
max_columns (int, optional) – Maximum number of columns to detect when split_into_columns=True. Higher values allow more columns to be detected. Only used when sort_reading_order=True and split_into_columns=True. Default is 10.
- Returns:
Dictionary with the following keys:
"page"PageParsed detection result as a Page object containing Block(s) with Line(s) of Word objects. Each Word has polygon coordinates and confidence scores. Words and Lines have reading order indices.
"score_map"numpy.ndarray or NoneRaw score map produced by the network if
return_maps=True.
"geo_map"numpy.ndarray or NoneRaw geometry map if
return_maps=True.
- Return type:
Notes
The method performs: (1) image loading, (2) resizing and normalization, (3) model inference, (4) quad decoding, (5) NMS, (6) box expansion, (7) scaling coordinates back to original size, (8) optional reading order sorting into lines.
Test-Time Augmentation (TTA):
When use_tta=True is set during initialization, the method runs inference on both the original and horizontally flipped image, then merges results. Boxes from both views are matched by IoU and merged by taking the union of coordinates with averaged scores. This can improve detection of text near image edges or partially visible text.
For visualization, use the external visualize_page utility:
>>> from manuscript.utils import visualize_page
>>> result = model.predict(img_path)
>>> vis_img = visualize_page(img, result["page"])
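The TTA merge described above can be sketched as follows, using simplified axis-aligned (x1, y1, x2, y2) boxes. This is an illustration of the documented matching/merging rule only; function names are hypothetical and the library's internal merge may differ in detail:

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def merge_tta(orig, flipped, iou_thresh=0.1):
    """Merge (box, score) detections from two TTA views (sketch)."""
    merged, used = [], set()
    for box_a, score_a in orig:
        match = None
        for j, (box_b, _) in enumerate(flipped):
            if j not in used and iou(box_a, box_b) > iou_thresh:
                match = j
                break
        if match is None:
            merged.append((box_a, score_a))  # no counterpart in flipped view
        else:
            box_b, score_b = flipped[match]
            used.add(match)
            # Union of coordinates, averaged score
            union = (min(box_a[0], box_b[0]), min(box_a[1], box_b[1]),
                     max(box_a[2], box_b[2]), max(box_a[3], box_b[3]))
            merged.append((union, (score_a + score_b) / 2))
    # Unmatched boxes from the flipped view are kept as-is
    merged += [fb for j, fb in enumerate(flipped) if j not in used]
    return merged
```

A low tta_iou_thresh (the default 0.1) makes matching permissive, so the same word detected at slightly different positions in the two views is still merged rather than duplicated.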
Examples
Perform inference and get structured output:
>>> from manuscript.detectors import EAST
>>> model = EAST()
>>> img_path = r"example/ocr_example_image.jpg"
>>> result = model.predict(img_path)
>>> page = result["page"]
>>> # Access first line's first word
>>> first_word = page.blocks[0].lines[0].words[0]
>>> print(f"Confidence: {first_word.detection_confidence}")
Visualize results separately:
>>> from manuscript.utils import visualize_page, read_image
>>> result = model.predict(img_path)
>>> img = read_image(img_path)
>>> vis_img = visualize_page(img, result["page"])
>>> vis_img.show()
- static train(train_images, train_anns, val_images, val_anns, *, experiment_root='./experiments', model_name='resnet_quad', backbone_name='resnet50', pretrained_backbone=True, freeze_first=True, target_size=1024, score_geo_scale=None, epochs=500, batch_size=3, accumulation_steps=1, lr=0.001, grad_clip=5.0, early_stop=100, use_sam=True, sam_type='asam', use_lookahead=True, use_ema=False, use_multiscale=True, use_ohem=True, ohem_ratio=0.5, use_focal_geo=True, focal_gamma=2.0, resume_from=None, val_interval=1, num_workers=0, device=None)[source]
Train EAST model on custom datasets.
- Parameters:
train_images (str, Path or sequence of paths) – Path(s) to training image folders.
train_anns (str, Path or sequence of paths) – Path(s) to COCO-format JSON annotation files corresponding to train_images.
val_images (str, Path or sequence of paths) – Path(s) to validation image folders.
val_anns (str, Path or sequence of paths) – Path(s) to COCO-format JSON annotation files corresponding to val_images.
experiment_root (str, optional) – Base directory where experiment folders will be created. Default is "./experiments".
model_name (str, optional) – Folder name inside experiment_root for logs and checkpoints. Default is "resnet_quad".
backbone_name ({"resnet50", "resnet101"}, optional) – Backbone architecture to use. Options:
- "resnet50" – ResNet-50 (faster, fewer parameters)
- "resnet101" – ResNet-101 (slower, more capacity)
Default is "resnet50".
pretrained_backbone (bool, optional) – Use ImageNet-pretrained backbone weights. Default is True.
freeze_first (bool, optional) – Freeze the lowest layers of the backbone. Default is True.
target_size (int, optional) – Resize the shorter side of images to this size. Default is 1024.
score_geo_scale (float, optional) – Multiplier to recover original coordinates from score/geo maps. If None, automatically taken from the model. Default is None.
epochs (int, optional) – Number of training epochs. Default is 500.
batch_size (int, optional) – Batch size per GPU. Default is 3.
accumulation_steps (int, optional) – Number of gradient accumulation steps. The effective batch size is batch_size * accumulation_steps. Use this to train with larger effective batch sizes when GPU memory is limited. For example:
- batch_size=2, accumulation_steps=4 → effective batch size = 8
- batch_size=1, accumulation_steps=8 → effective batch size = 8
Default is 1 (no accumulation).
lr (float, optional) – Learning rate. Default is 1e-3.
grad_clip (float, optional) – Gradient clipping value (L2 norm). Default is 5.0.
early_stop (int, optional) – Patience (epochs without improvement) for early stopping. Default is 100.
use_sam (bool, optional) – Enable the SAM optimizer. Default is True.
sam_type ({"sam", "asam"}, optional) – Variant of SAM to use. Default is "asam".
use_lookahead (bool, optional) – Wrap the optimizer with Lookahead. Default is True.
use_ema (bool, optional) – Maintain an EMA version of the model weights. Default is False.
use_multiscale (bool, optional) – Random multi-scale training. Default is True.
use_ohem (bool, optional) – Online Hard Example Mining. Default is True.
ohem_ratio (float, optional) – Ratio of hard negatives for OHEM. Default is 0.5.
use_focal_geo (bool, optional) – Apply focal loss to geometry channels. Default is True.
focal_gamma (float, optional) – Gamma for focal geometry loss. Default is 2.0.
resume_from (str or Path, optional) – Resume training from a previous experiment: (a) the experiment directory, (b) its checkpoints/ subdirectory, or (c) a direct path to last_state.pt. Default is None.
val_interval (int, optional) – Run validation every N epochs. Default is 1.
num_workers (int, optional) – Number of worker processes for data loading. Set to 0 for single-process loading (safer on Windows). Default is 0.
device (torch.device, optional) – CUDA or CPU device. Auto-selects if None.
- Returns:
Best model weights (EMA if enabled, otherwise base model).
- Return type:
torch.nn.Module
Examples
Train on two datasets with validation:
>>> from manuscript.detectors import EAST
>>>
>>> train_images = [
...     "/data/archive/train_images",
...     "/data/ddi/train_images"
... ]
>>> train_anns = [
...     "/data/archive/train.json",
...     "/data/ddi/train.json"
... ]
>>> val_images = [
...     "/data/archive/test_images",
...     "/data/ddi/test_images"
... ]
>>> val_anns = [
...     "/data/archive/test.json",
...     "/data/ddi/test.json"
... ]
>>>
>>> best_model = EAST.train(
...     train_images=train_images,
...     train_anns=train_anns,
...     val_images=val_images,
...     val_anns=val_anns,
...     backbone_name="resnet50",
...     target_size=256,
...     epochs=20,
...     batch_size=4,
...     use_sam=False,
...     freeze_first=False,
...     val_interval=3,
... )
>>> print("Best checkpoint loaded:", best_model)
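To make the accumulation_steps mechanic concrete, here is a framework-agnostic sketch of the stepping pattern (illustrative only; function name is hypothetical, and the real training loop also handles loss scaling, gradient clipping, etc.):

```python
def count_optimizer_steps(num_batches, accumulation_steps):
    """How often the optimizer steps under gradient accumulation (sketch)."""
    steps = 0
    for i in range(num_batches):
        # loss.backward() would run for every mini-batch, accumulating
        # gradients; the optimizer only steps (and gradients reset)
        # once every `accumulation_steps` batches.
        if (i + 1) % accumulation_steps == 0:
            steps += 1
    return steps

# batch_size=2 with accumulation_steps=4: one optimizer step per
# 4 mini-batches, i.e. an effective batch size of 8.
assert count_optimizer_steps(8, 4) == 2
```

Each optimizer step therefore sees gradients averaged over batch_size * accumulation_steps samples, which is why accumulation is a memory-cheap substitute for a larger batch size.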
- static export(weights_path, output_path, backbone_name=None, input_size=1280, opset_version=14, simplify=True)[source]
Export EAST PyTorch model to ONNX format.
This method converts a trained EAST model from PyTorch to ONNX format, which can be used for faster inference with ONNX Runtime. The exported model can be loaded with EAST(weights="model.onnx").
- Parameters:
weights_path (str or Path) – Path to the PyTorch model weights file (.pth).
output_path (str or Path) – Path where the ONNX model will be saved (.onnx).
backbone_name ({"resnet50", "resnet101"}, optional) – Backbone architecture of the model. If None, it is automatically detected from the checkpoint. Must match the architecture used during training. Default is None (auto-detect).
input_size (int, optional) – Input image size (height and width). The model will accept images of shape (batch, 3, input_size, input_size). Default is 1280.
opset_version (int, optional) – ONNX opset version to use for export. Default is 14.
simplify (bool, optional) – If True, applies ONNX graph simplification using onnx-simplifier to optimize the model. Requires the onnx-simplifier package. Default is True.
- Returns:
None. The ONNX model is saved to output_path.
- Return type:
None
- Raises:
ImportError – If required packages (torch, onnx) are not installed.
FileNotFoundError – If weights_path does not exist.
ValueError – If backbone_name does not match the checkpoint architecture.
Notes
The exported ONNX model has two outputs:
- score_map: Text confidence map with shape (batch, 1, H, W)
- geo_map: Geometry map with shape (batch, 8, H, W)
The model supports dynamic batch size and image dimensions through dynamic axes configuration.
Automatic Backbone Detection:
The method automatically detects the backbone architecture from the checkpoint by analyzing the number of parameters in layer4. This prevents mismatches between checkpoint and architecture that could lead to incorrect exports.
Examples
Export with automatic backbone detection:
>>> from manuscript.detectors import EAST
>>> EAST.export(
...     weights_path="east_resnet50.pth",
...     output_path="east_model.onnx"
... )
Auto-detected backbone: resnet50
Exporting to ONNX (opset 14)...
[OK] ONNX model saved to: east_model.onnx
Export with explicit backbone:
>>> EAST.export(
...     weights_path="custom_weights.pth",
...     output_path="custom_model.onnx",
...     backbone_name="resnet101",
...     input_size=1024,
...     simplify=False
... )
Use the exported model for inference:
>>> detector = EAST(
...     weights="east_model.onnx",
...     device="cuda"
... )
>>> result = detector.predict("image.jpg")
See also
EAST.__init__ – Initialize the EAST detector; ONNX weights are passed via the weights argument.