Detectors

Text detection models.

class manuscript.detectors.EAST(weights=None, device=None, force_download=False, *, target_size=1280, expand_ratio_w=1.4, expand_ratio_h=1.5, expand_power=0.6, score_thresh=0.6, iou_threshold=0.05, iou_threshold_standard=0.05, score_geo_scale=0.25, quantization=2, axis_aligned_output=True, remove_area_anomalies=False, anomaly_sigma_threshold=5.0, anomaly_min_box_count=30, use_tta=False, tta_iou_thresh=0.1, **kwargs)

Bases: BaseDetector

Initialize EAST text detector with ONNX Runtime.

Parameters:

weights (str or Path, optional) –
Path or identifier for ONNX model weights. Supports:
- Local file path: "path/to/model.onnx"
- HTTP/HTTPS URL: "https://example.com/model.onnx"
- GitHub release: "github://owner/repo/tag/file.onnx"
- Google Drive: "gdrive:FILE_ID"
- Preset name: "east_50_g1"
- None: auto-downloads default preset (east_50_g1)
device (str, optional) –
Compute device: "cuda", "coreml", or "cpu". If None, automatically selects CPU. For GPU/CoreML acceleration:
- CUDA (NVIDIA): pip install onnxruntime-gpu
- CoreML (Apple Silicon M1/M2/M3): pip install onnxruntime-silicon
Default is None (CPU).
target_size (int, optional) – Input image size for inference. Images are resized to (target_size, target_size). Default is 1280.
expand_ratio_w (float, optional) – Horizontal expansion factor applied to detected boxes after NMS. Default is 1.4.
expand_ratio_h (float, optional) – Vertical expansion factor applied to detected boxes after NMS. Default is 1.5.
expand_power (float, optional) –
Power for non-linear box expansion. Controls how expansion scales with box size:
- 1.0 = linear (small and large boxes expand equally)
- <1.0 = small boxes expand more (e.g., 0.5, recommended for character-level detection)
- >1.0 = large boxes expand more
Default is 0.6.
score_thresh (float, optional) – Confidence threshold for selecting candidate detections before NMS. Default is 0.6.
iou_threshold (float, optional) – IoU threshold for locality-aware NMS merging phase. Default is 0.05.
iou_threshold_standard (float, optional) – IoU threshold for standard NMS after locality-aware merging. If None, uses the same value as iou_threshold. Default is 0.05.
score_geo_scale (float, optional) – Scale factor for decoding geometry/score maps. Default is 0.25.
quantization (int, optional) – Quantization resolution for point coordinates during decoding. Default is 2.
axis_aligned_output (bool, optional) – If True, outputs axis-aligned rectangles instead of original quads. Default is True.
remove_area_anomalies (bool, optional) – If True, removes quads with extremely large area relative to the distribution. Default is False.
anomaly_sigma_threshold (float, optional) – Sigma threshold for anomaly area filtering. Default is 5.0.
anomaly_min_box_count (int, optional) – Minimum number of boxes required before anomaly filtering. Default is 30.
use_tta (bool, optional) – Enable Test-Time Augmentation (TTA). When enabled, inference is run on both the original and horizontally flipped image, and results are merged. This can improve detection of partially visible or edge text. Default is False.
tta_iou_thresh (float, optional) – IoU threshold for merging boxes from original and flipped images during TTA. Boxes with IoU > threshold are considered matches and merged. Default is 0.1.
force_download (bool)
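
The area-anomaly options (remove_area_anomalies, anomaly_sigma_threshold, anomaly_min_box_count) can be pictured with the sketch below. It assumes the rule is "drop quads whose area exceeds mean + sigma * standard deviation", which the parameter names suggest but this reference does not confirm. The helper names quad_area and drop_area_anomalies are hypothetical and are not part of manuscript.

```python
from statistics import mean, pstdev


def quad_area(quad):
    """Shoelace area of a polygon given as [(x, y), ...]."""
    n = len(quad)
    twice = sum(
        quad[i][0] * quad[(i + 1) % n][1] - quad[(i + 1) % n][0] * quad[i][1]
        for i in range(n)
    )
    return abs(twice) / 2.0


def drop_area_anomalies(quads, sigma_threshold=5.0, min_box_count=30):
    """Drop quads whose area exceeds mean + sigma_threshold * std.

    ASSUMPTION: this mean/std rule is illustrative, not the library's code.
    """
    if len(quads) < min_box_count:
        # Too few boxes to estimate a meaningful area distribution.
        return list(quads)
    areas = [quad_area(q) for q in quads]
    limit = mean(areas) + sigma_threshold * pstdev(areas)
    return [q for q, a in zip(quads, areas) if a <= limit]
```

With fewer than anomaly_min_box_count boxes the sketch returns the input unchanged, mirroring the "minimum number of boxes required" wording above.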

Notes

The class provides two main public methods:

predict — run inference on a single image and return detections.
train — high-level training entrypoint to train an EAST model on custom datasets.

The detector uses ONNX Runtime for fast inference on CPU and GPU. For GPU acceleration, install: pip install onnxruntime-gpu

Methods

`__call__`(*args, **kwargs)	Call self as a function.
`export`(weights_path, output_path[, ...])	Export EAST PyTorch model to ONNX format.
`predict`(img_or_path)	Run EAST inference and return detected page structure.
`runtime_providers`()	Get ONNX Runtime execution providers based on device.
`train`(train_images, train_anns, val_images, ...)	Train EAST model on custom datasets.

default_weights_name: str | None = 'east_50_g1'

pretrained_registry: Dict[str, str] = {'east_50_g1': 'github://konstantinkozhin/manuscript-ocr/v0.1.0/east_50_g1.onnx'}

__init__(weights=None, device=None, force_download=False, *, target_size=1280, expand_ratio_w=1.4, expand_ratio_h=1.5, expand_power=0.6, score_thresh=0.6, iou_threshold=0.05, iou_threshold_standard=0.05, score_geo_scale=0.25, quantization=2, axis_aligned_output=True, remove_area_anomalies=False, anomaly_sigma_threshold=5.0, anomaly_min_box_count=30, use_tta=False, tta_iou_thresh=0.1, **kwargs)

Parameters:

weights (str | Path | None)
device (str | None)
force_download (bool)
target_size (int)
expand_ratio_w (float)
expand_ratio_h (float)
expand_power (float)
score_thresh (float)
iou_threshold (float)
iou_threshold_standard (float | None)
score_geo_scale (float)
quantization (int)
axis_aligned_output (bool)
remove_area_anomalies (bool)
anomaly_sigma_threshold (float)
anomaly_min_box_count (int)
use_tta (bool)
tta_iou_thresh (float)

predict(img_or_path)

Run EAST inference and return detected page structure.

Parameters: img_or_path (str or pathlib.Path or numpy.ndarray) – Path to an image file or an RGB image provided as a NumPy array with shape (H, W, 3) in uint8 format.
Returns: Parsed detection result as a Page object containing a single Block with a single Line of TextSpan objects. Each TextSpan has polygon coordinates, confidence score, and sequential order index.
Return type: Page

Notes

The method performs: (1) image loading, (2) resizing and normalization, (3) model inference, (4) quad decoding, (5) NMS, (6) box expansion, (7) scaling coordinates back to original size.
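
Step (7) can be sketched as follows. Since the input is resized to a square (target_size, target_size), the horizontal and vertical scale factors generally differ. This assumes a plain resize with no padding. The function name scale_quads_to_original is hypothetical and only illustrates the arithmetic.

```python
def scale_quads_to_original(quads, target_size, orig_width, orig_height):
    """Map quad corners from the resized square back to original pixels.

    ASSUMPTION: the image was stretched to (target_size, target_size)
    without letterboxing, so x and y use independent scale factors.
    """
    sx = orig_width / target_size
    sy = orig_height / target_size
    return [[(x * sx, y * sy) for x, y in quad] for quad in quads]
```

For a 2560x1920 image and target_size=1280, x coordinates are multiplied by 2.0 and y coordinates by 1.5.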

Test-Time Augmentation (TTA):

When use_tta=True is set during initialization, the method runs inference on both the original and horizontally flipped image, then merges results. Boxes from both views are matched by IoU and merged by taking the union of coordinates with averaged scores. This can improve detection of text near image edges or partially visible text.
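
A minimal sketch of the merge described above, for axis-aligned boxes given as (x1, y1, x2, y2, score). Two details are assumptions not stated in this reference: greedy best-IoU matching, and keeping unmatched boxes from either view. The names iou, unflip, and merge_tta are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2, ...) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def unflip(box, width):
    """Map a box detected on the horizontally flipped image back."""
    x1, y1, x2, y2, score = box
    return (width - x2, y1, width - x1, y2, score)


def merge_tta(orig_boxes, flipped_boxes, width, iou_thresh=0.1):
    """Merge matches (IoU > iou_thresh) by coordinate union and mean score."""
    candidates = [unflip(b, width) for b in flipped_boxes]
    used, merged = set(), []
    for a in orig_boxes:
        best_j, best_iou = None, iou_thresh
        for j, b in enumerate(candidates):
            if j not in used and iou(a, b) > best_iou:
                best_j, best_iou = j, iou(a, b)
        if best_j is None:
            merged.append(a)  # ASSUMPTION: unmatched boxes are kept
            continue
        b = candidates[best_j]
        used.add(best_j)
        merged.append((min(a[0], b[0]), min(a[1], b[1]),
                       max(a[2], b[2]), max(a[3], b[3]),
                       (a[4] + b[4]) / 2.0))
    merged.extend(b for j, b in enumerate(candidates) if j not in used)
    return merged
```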

For visualization, use the external visualize_page utility:

>>> from manuscript.utils import visualize_page, read_image
>>> page = model.predict(img_path)
>>> img = read_image(img_path)
>>> vis_img = visualize_page(img, page)

Examples

Perform inference and get structured output:

>>> from manuscript.detectors import EAST
>>> model = EAST()
>>> img_path = r"example/ocr_example_image.jpg"
>>> page = model.predict(img_path)
>>> # Access first line's first text span
>>> first_text_span = page.blocks[0].lines[0].text_spans[0]
>>> print(f"Confidence: {first_text_span.detection_confidence}")

Visualize results separately:

>>> from manuscript.utils import visualize_page, read_image
>>> page = model.predict(img_path)
>>> img = read_image(img_path)
>>> vis_img = visualize_page(img, page)
>>> vis_img.show()
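
The Page → Block → Line → TextSpan nesting used above can be walked generically. The sketch below uses small stand-in dataclasses (StubPage, StubBlock, StubLine, StubTextSpan are hypothetical, not the library's classes) that mirror only the attribute names shown in the examples: blocks, lines, text_spans, and detection_confidence. The helper functions rely on nothing else, so they should work on any object exposing those attributes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StubTextSpan:
    detection_confidence: float


@dataclass
class StubLine:
    text_spans: List[StubTextSpan] = field(default_factory=list)


@dataclass
class StubBlock:
    lines: List[StubLine] = field(default_factory=list)


@dataclass
class StubPage:
    blocks: List[StubBlock] = field(default_factory=list)


def iter_text_spans(page):
    """Yield every text span in reading order of the nested structure."""
    for block in page.blocks:
        for line in block.lines:
            yield from line.text_spans


def confident_spans(page, min_confidence):
    """Keep spans whose detection confidence is at least min_confidence."""
    return [s for s in iter_text_spans(page)
            if s.detection_confidence >= min_confidence]
```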

static train(train_images, train_anns, val_images, val_anns, *, experiment_root='./experiments', model_name='resnet_quad', backbone_name='resnet50', pretrained_backbone=True, freeze_first=False, target_size=1024, score_geo_scale=None, score_map_shrink_ratio=0.3, epochs=500, batch_size=3, accumulation_steps=1, lr=0.0001, lr_scheduler='cosine_restart', lr_scheduler_params=None, augmentation_config=None, grad_clip=5.0, early_stop=100, use_sam=False, sam_type='asam', use_lookahead=False, use_ema=False, use_multiscale=True, use_ohem=False, ohem_ratio=0.5, use_focal_geo=True, focal_gamma=2.0, resume_from=None, val_interval=1, num_workers=0, log_collage=True, device=None)

Train EAST model on custom datasets.

Parameters:

train_images (str, Path or sequence of paths) – Path(s) to training image folders.
train_anns (str, Path or sequence of paths) – Path(s) to COCO-format JSON annotation files corresponding to train_images.
val_images (str, Path or sequence of paths) – Path(s) to validation image folders.
val_anns (str, Path or sequence of paths) – Path(s) to COCO-format JSON annotation files corresponding to val_images.
experiment_root (str, optional) – Base directory where experiment folders will be created. Default is "./experiments".
model_name (str, optional) – Folder name inside experiment_root for logs and checkpoints. Default is "resnet_quad".
backbone_name ({"resnet50", "resnet101"}, optional) –
Backbone architecture to use. Options:
- "resnet50": ResNet-50 (faster, fewer parameters)
- "resnet101": ResNet-101 (slower, more capacity)
Default is "resnet50".
pretrained_backbone (bool, optional) – Use ImageNet-pretrained backbone weights. Default True.
freeze_first (bool, optional) – Freeze lowest layers of the backbone. Default False.
target_size (int, optional) – Resize shorter side of images to this size. Default 1024.
score_geo_scale (float, optional) – Multiplier to recover original coordinates from score/geo maps. If None, automatically taken from the model. Default None.
score_map_shrink_ratio (float, optional) – Shrink ratio for score map. Default 0.3.
epochs (int, optional) – Number of training epochs. Default 500.
batch_size (int, optional) – Batch size per GPU. Default 3.
accumulation_steps (int, optional) –
Number of gradient accumulation steps. Effective batch size will be batch_size * accumulation_steps. Use this to train with larger effective batch sizes when GPU memory is limited. For example:
- batch_size=2, accumulation_steps=4 → effective batch size = 8
- batch_size=1, accumulation_steps=8 → effective batch size = 8
Default is 1 (no accumulation).
lr (float, optional) – Learning rate. Default 1e-4.
lr_scheduler ({"cosine_restart", "cosine", "linear", "step", "exponential", "plateau", "none"}, optional) – Learning rate scheduler type. Default "cosine_restart".
lr_scheduler_params (dict, optional) –
Extra scheduler parameters (depends on scheduler type). Common keys:
- cosine_restart: t0, t_mult, eta_min
- cosine: t_max, eta_min
- linear: final_factor
- step: step_size, gamma
- exponential: gamma
- plateau: factor, patience, min_lr
augmentation_config (dict, optional) –
Full augmentation configuration passed to EASTDataset. This is the preferred place for all augmentation settings. Any keys here override the corresponding defaults. Supported keys include:
- quad_source for COCO polygon-to-quad conversion: "auto" preserves 4-point polygons and falls back to "min_area_rect" for longer polygons, "as_is" accepts only 4-point polygons, and "min_area_rect" always fits the minimum-area rectangle
- flip_prob, color_jitter
- small_rotate_prob, small_rotate_deg
- perspective_prob, perspective_scale
- blur_prob, blur_ksize_range
- noise_prob, noise_std, salt_pepper_prob
- jpeg_prob, jpeg_quality_range
- shading_prob, shading_strength
- gamma_prob, gamma_range
- downscale_prob, downscale_range
- negative_prob
grad_clip (float, optional) – Gradient clipping value (L2 norm). Default 5.0.
early_stop (int, optional) – Patience (epochs without improvement) for early stopping. Default 100.
use_sam (bool, optional) – Enable SAM optimizer. Default False.
sam_type ({"sam", "asam"}, optional) – Variant of SAM to use. Default "asam".
use_lookahead (bool, optional) – Wrap optimizer with Lookahead. Default False.
use_ema (bool, optional) – Maintain EMA version of model weights. Default False.
use_multiscale (bool, optional) – Random multi-scale training. Default True.
use_ohem (bool, optional) – Online Hard Example Mining. Default False.
ohem_ratio (float, optional) – Ratio of hard negatives for OHEM. Default 0.5.
use_focal_geo (bool, optional) – Apply focal loss to geometry channels. Default True.
focal_gamma (float, optional) – Gamma for focal geometry loss. Default 2.0.
resume_from (str or Path, optional) – Resume training from a previous experiment or initialize from local weights: a) experiment directory, b) …/checkpoints/, c) direct path to full-state checkpoint (last_state.pt). If a weights-only file is provided (e.g. best.pth), model weights are loaded, but training starts in the current experiment_root/model_name directory. Default None.
val_interval (int, optional) – Run validation every N epochs. Default 1.
num_workers (int, optional) – Number of worker processes for data loading. Set to 0 for single-process loading (safer on Windows). Default 0.
log_collage (bool, optional) – Whether to generate and log validation collage images to TensorBoard. Disable to save memory on GPUs with limited VRAM. Default True.
device (torch.device, optional) – CUDA or CPU device. Auto-selects if None.
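
The accumulation_steps arithmetic can be checked on a toy one-parameter model. The sketch below uses plain Python instead of PyTorch and assumes each micro-batch gradient is scaled by 1/accumulation_steps before summing, which is the common convention. Whether EAST.train scales the loss exactly this way is not stated in this reference. The names mse_grad and accumulated_grad are hypothetical.

```python
def mse_grad(w, batch):
    """Gradient of mean((w * x - y) ** 2) with respect to w over one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, samples, batch_size, accumulation_steps):
    """Sum micro-batch gradients, each scaled by 1 / accumulation_steps."""
    total = 0.0
    for step in range(accumulation_steps):
        micro = samples[step * batch_size:(step + 1) * batch_size]
        total += mse_grad(w, micro) / accumulation_steps
    return total


samples = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2),
           (5.0, 9.8), (6.0, 12.1), (7.0, 14.0), (8.0, 15.9)]
full_batch = mse_grad(0.5, samples)               # one batch of 8
two_by_four = accumulated_grad(0.5, samples, 2, 4)  # batch_size=2, steps=4
one_by_eight = accumulated_grad(0.5, samples, 1, 8)  # batch_size=1, steps=8
```

Under that scaling, both accumulated variants reproduce the gradient of a single batch of 8 up to floating-point rounding, which is why they count as an effective batch size of 8.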

Returns:

Best model weights (EMA if enabled, otherwise base model).

Return type:

torch.nn.Module

Examples

Train on two datasets with validation:

>>> from manuscript.detectors import EAST
>>>
>>> train_images = [
...     "/data/archive/train_images",
...     "/data/ddi/train_images"
... ]
>>> train_anns = [
...     "/data/archive/train.json",
...     "/data/ddi/train.json"
... ]
>>> val_images = [
...     "/data/archive/test_images",
...     "/data/ddi/test_images"
... ]
>>> val_anns = [
...     "/data/archive/test.json",
...     "/data/ddi/test.json"
... ]
>>>
>>> best_model = EAST.train(
...     train_images=train_images,
...     train_anns=train_anns,
...     val_images=val_images,
...     val_anns=val_anns,
...     backbone_name="resnet50",
...     target_size=256,
...     epochs=20,
...     batch_size=4,
...     use_sam=False,
...     freeze_first=False,
...     val_interval=3,
... )
>>> print("Best checkpoint loaded:", best_model)

Configure augmentations via augmentation_config:

>>> aug_cfg = {
...     "quad_source": "auto",
...     "flip_prob": 0.02,
...     "color_jitter": (0.1, 0.1, 0.1, 0.05),
...     "small_rotate_prob": 0.15,
...     "small_rotate_deg": 1.5,
...     "perspective_prob": 0.08,
...     "perspective_scale": 0.015,
...     "blur_prob": 0.1,
...     "blur_ksize_range": (3, 5),
...     "noise_prob": 0.1,
...     "noise_std": 0.008,
...     "salt_pepper_prob": 0.0005,
...     "jpeg_prob": 0.1,
...     "jpeg_quality_range": (75, 95),
...     "shading_prob": 0.1,
...     "shading_strength": 0.1,
...     "gamma_prob": 0.2,
...     "gamma_range": (0.95, 1.05),
...     "downscale_prob": 0.1,
...     "downscale_range": (0.7, 0.95),
...     "negative_prob": 0.05,
... }
>>> best_model = EAST.train(
...     train_images=train_images,
...     train_anns=train_anns,
...     val_images=val_images,
...     val_anns=val_anns,
...     augmentation_config=aug_cfg,
... )

static export(weights_path, output_path, backbone_name=None, input_size=1280, opset_version=14, simplify=True)

Export EAST PyTorch model to ONNX format.

This method converts a trained EAST model from PyTorch to ONNX format, which can be used for faster inference with ONNX Runtime. The exported model can be loaded using EAST(weights="model.onnx").

Parameters:

weights_path (str or Path) – Path to the PyTorch model weights file (.pth).
output_path (str or Path) – Path where the ONNX model will be saved (.onnx).
backbone_name ({"resnet50", "resnet101"}, optional) – Backbone architecture of the model. If None, will be automatically detected from the checkpoint. Must match the architecture used during training. Default is None (auto-detect).
input_size (int, optional) – Input image size (height and width). The model will accept images of shape (batch, 3, input_size, input_size). Default is 1280.
opset_version (int, optional) – ONNX opset version to use for export. Default is 14.
simplify (bool, optional) – If True, applies ONNX graph simplification using onnx-simplifier to optimize the model. Requires onnx-simplifier package. Default is True.

Returns:

The ONNX model is saved to output_path.

Return type:

None

Raises:

ImportError – If required packages (torch, onnx) are not installed.
FileNotFoundError – If weights_path does not exist.
ValueError – If backbone_name doesn’t match the checkpoint architecture.

Notes

The exported ONNX model has two outputs:

score_map: Text confidence map with shape (batch, 1, H, W)
geo_map: Geometry map with shape (batch, 8, H, W)

The model supports dynamic batch size and image dimensions through dynamic axes configuration.

Automatic Backbone Detection:

The method automatically detects the backbone architecture from the checkpoint by analyzing the number of parameters in layer4. This prevents mismatches between checkpoint and architecture that could lead to incorrect exports.

Examples

Export with automatic backbone detection:

>>> from manuscript.detectors import EAST
>>> EAST.export(
...     weights_path="east_resnet50.pth",
...     output_path="east_model.onnx"
... )
Auto-detected backbone: resnet50
Exporting to ONNX (opset 14)...
[OK] ONNX model saved to: east_model.onnx

Export with explicit backbone:

>>> EAST.export(
...     weights_path="custom_weights.pth",
...     output_path="custom_model.onnx",
...     backbone_name="resnet101",
...     input_size=1024,
...     simplify=False
... )

Use the exported model for inference:

>>> detector = EAST(
...     weights="east_model.onnx",
...     device="cuda"
... )
>>> result = detector.predict("image.jpg")

See also

EAST.__init__: Initialize the EAST detector from ONNX weights via the weights argument.

class manuscript.detectors.YOLO(weights=None, config=None, device=None, force_download=False, *, score_thresh=0.1, class_ids=None, target_size=None, axis_aligned_output=True, containment_threshold=0.9, **kwargs)

Bases: BaseDetector

Initialize YOLO text detector with ONNX Runtime.

Parameters:

weights (str or Path, optional) –
Path or identifier for ONNX model weights. Supports:
- Local file path: "path/to/model.onnx"
- HTTP/HTTPS URL: "https://example.com/model.onnx"
- GitHub release: "github://owner/repo/tag/file.onnx"
- Google Drive: "gdrive:FILE_ID"
- Preset name: "yolo26s_obb_text_g1" or "yolo26x_obb_text_g1"
- None: auto-downloads default preset (yolo26x_obb_text_g1)
The ONNX model may return either standard detections in xyxy, score, class_id format with shape [N, 6] / [1, N, 6] or oriented detections in cx, cy, w, h, score, class_id, angle format with shape [N, 7] / [1, N, 7].
config (str or Path, optional) – Path or identifier for model configuration YAML. Same URL schemes as weights. If None, attempts to infer a YAML file next to the weights or uses the preset config from config_registry.
device (str, optional) –
Compute device: "cuda", "coreml", or "cpu". If None, automatically selects CPU. For GPU/CoreML acceleration:
- CUDA (NVIDIA): pip install onnxruntime-gpu
- CoreML (Apple Silicon M1/M2/M3): pip install onnxruntime-silicon
Default is None (CPU).
score_thresh (float, optional) – Confidence threshold applied to model outputs after ONNX inference and before the additional containment cleanup pass. Default is 0.1.
class_ids (sequence of int or None, optional) – Optional whitelist of class IDs to keep. If None, all classes are kept. Default is None.
target_size (int or None, optional) – Square inference size used for letterbox preprocessing. Images are resized into (target_size, target_size) before ONNX inference. If None, the detector tries to read imgsz from a YAML config located next to the weights or downloaded from the preset registry. Unknown/custom weights without YAML fall back to 1280.
axis_aligned_output (bool, optional) – If True (default), OBB detections are converted to standard axis-aligned rectangles. If False, OBB detections are returned as rotated polygons via page / polygons and as cx, cy, w, h, score, class_id, angle rows in boxes. For non-OBB models this flag has no effect.
containment_threshold (float or None, optional) – Removes a smaller box when at least this fraction of its area is covered by a larger box. For example, 0.9 removes boxes that are contained by 90% or more. Set to None to disable this extra cleanup. Default is 0.9.
force_download (bool)
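
The containment_threshold cleanup can be sketched for axis-aligned (x1, y1, x2, y2) boxes as below. This follows the wording above (remove a smaller box when at least the given fraction of its area lies inside a larger box). The library's actual implementation, including how it treats rotated boxes and ties, is not shown in this reference. The names box_area, covered_fraction, and remove_contained are hypothetical.

```python
def box_area(b):
    """Area of an (x1, y1, x2, y2) box; degenerate boxes count as 0."""
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def covered_fraction(small, large):
    """Fraction of `small`'s area that lies inside `large`."""
    ix1, iy1 = max(small[0], large[0]), max(small[1], large[1])
    ix2, iy2 = min(small[2], large[2]), min(small[3], large[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = box_area(small)
    return inter / area if area > 0 else 0.0


def remove_contained(boxes, containment_threshold=0.9):
    """Drop boxes mostly covered by a strictly larger box."""
    if containment_threshold is None:
        return list(boxes)  # cleanup disabled
    keep = []
    for i, box in enumerate(boxes):
        contained = any(
            j != i
            and box_area(other) > box_area(box)
            and covered_fraction(box, other) >= containment_threshold
            for j, other in enumerate(boxes)
        )
        if not contained:
            keep.append(box)
    return keep
```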

Notes

The class provides one main public method:

predict - run inference on a single image and return detections.

Available presets:

"yolo26s_obb_text_g1" - YOLO26-S OBB text detector
"yolo26x_obb_text_g1" - YOLO26-X OBB text detector

Methods

`__call__`(*args, **kwargs)	Call self as a function.
`predict`(img_or_path)	Run YOLO ONNX inference on a single image and return detected page structure.
`runtime_providers`()	Get ONNX Runtime execution providers based on device.
`train`(*args, **kwargs)

export

default_weights_name: str | None = 'yolo26x_obb_text_g1'

default_target_size = 1280

pretrained_registry: Dict[str, str] = {'yolo26s_obb_text_g1': 'https://github.com/konstantinkozhin/manuscript-ocr/releases/download/v0.1.0/yolo26s_obb_text_g1.raw.onnx', 'yolo26x_obb_text_g1': 'https://github.com/konstantinkozhin/manuscript-ocr/releases/download/v0.1.0/yolo26x_obb_text_g1.raw.onnx'}

config_registry: Dict[str, str] = {'yolo26s_obb_text_g1': 'https://github.com/konstantinkozhin/manuscript-ocr/releases/download/v0.1.0/yolo26s_obb_text_g1.raw.yaml', 'yolo26x_obb_text_g1': 'https://github.com/konstantinkozhin/manuscript-ocr/releases/download/v0.1.0/yolo26x_obb_text_g1.raw.yaml'}

__init__(weights=None, config=None, device=None, force_download=False, *, score_thresh=0.1, class_ids=None, target_size=None, axis_aligned_output=True, containment_threshold=0.9, **kwargs)

Parameters:

weights (str | Path | None)
config (str | Path | None)
device (str | None)
force_download (bool)
score_thresh (float)
class_ids (Sequence[int] | None)
target_size (int | None)
axis_aligned_output (bool)
containment_threshold (float | None)

predict(img_or_path)

Run YOLO ONNX inference on a single image and return detected page structure.

Parameters: img_or_path (str or pathlib.Path or numpy.ndarray) – Path to an image file or an RGB image provided as a NumPy array with shape (H, W, 3) in uint8 format.
Returns: Parsed detection result as a Page object containing a single Block with a single Line of TextSpan objects.
Return type: Page
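
For OBB models, the relation between an oriented row (cx, cy, w, h, score, class_id, angle) and the two output forms selected by axis_aligned_output can be sketched as below. It assumes the angle is in radians and measured as a rotation about the box center, which this reference does not specify. The function names obb_to_polygon and polygon_to_xyxy are hypothetical.

```python
import math


def obb_to_polygon(cx, cy, w, h, angle):
    """Four corners of a rotated box.

    ASSUMPTION: `angle` is in radians, rotating about (cx, cy).
    """
    c, s = math.cos(angle), math.sin(angle)
    half = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    return [(cx + dx * c - dy * s, cy + dx * s + dy * c) for dx, dy in half]


def polygon_to_xyxy(polygon):
    """Tightest axis-aligned rectangle enclosing the polygon."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return (min(xs), min(ys), max(xs), max(ys))
```

The rotated polygon corresponds to axis_aligned_output=False, and the enclosing rectangle to axis_aligned_output=True. The enclosing rectangle is always at least as large as the rotated box.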

Examples

Run inference and get structured output:

>>> from manuscript.detectors import YOLO
>>> model = YOLO(weights="yolo26x_obb_text_g1")
>>> page = model.predict("page.jpg")
>>> first_text_span = page.blocks[0].lines[0].text_spans[0]
>>> print(first_text_span.detection_confidence)

EAST: Training Notes

The EAST detector in manuscript-ocr is based on the architecture proposed in EAST: An Efficient and Accurate Scene Text Detector (Zhou et al., CVPR 2017). The training procedure has been significantly reworked compared to the original: the loss weighting scheme, augmentation pipeline, quadrilateral annotation handling, and support for mixed annotations have all been modified. Pretrained weights were produced by the project authors.

EAST Training Quads

EAST training expects quadrilateral targets. When loading COCO segmentation polygons, use augmentation_config["quad_source"] in EAST.train(...) to control how polygons are converted into 4-point training quads:

"auto" keeps existing 4-point polygons as-is and falls back to minAreaRect for longer polygons.
"as_is" accepts only 4-point polygons and skips polygons with a different number of vertices.
"min_area_rect" always fits the minimum-area rectangle and matches the legacy conversion path.
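
The three quad_source modes can be sketched as a small dispatch function. The listing above refers to minAreaRect; to stay dependency-free, this sketch swaps in a pure-Python minimum-area rectangle (convex hull plus a scan over hull edges) as a stand-in. Corner ordering and tie-breaking may differ from what the library produces. All function names here are hypothetical.

```python
import math


def _cross(o, a, b):
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in order around the hull."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and _cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and _cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def min_area_rect(points):
    """Smallest enclosing rectangle; one side lies along some hull edge."""
    hull = convex_hull(points)
    best_area, best_quad = None, None
    for i in range(len(hull)):
        (x0, y0), (x1, y1) = hull[i], hull[(i + 1) % len(hull)]
        theta = math.atan2(y1 - y0, x1 - x0)
        c, s = math.cos(theta), math.sin(theta)
        # Rotate by -theta so the current edge becomes horizontal.
        rot = [(x * c + y * s, -x * s + y * c) for x, y in hull]
        xs, ys = [p[0] for p in rot], [p[1] for p in rot]
        area = (max(xs) - min(xs)) * (max(ys) - min(ys))
        if best_area is None or area < best_area:
            corners = [(min(xs), min(ys)), (max(xs), min(ys)),
                       (max(xs), max(ys)), (min(xs), max(ys))]
            # Rotate the rectangle corners back by +theta.
            best_quad = [(u * c - v * s, u * s + v * c) for u, v in corners]
            best_area = area
    return best_quad


def polygon_to_quad(polygon, quad_source="auto"):
    """Return a 4-point quad, or None when the polygon is skipped."""
    if quad_source not in ("auto", "as_is", "min_area_rect"):
        raise ValueError(f"unknown quad_source: {quad_source!r}")
    pts = [tuple(p) for p in polygon]
    if quad_source != "min_area_rect" and len(pts) == 4:
        return pts  # "auto" and "as_is" keep 4-point polygons unchanged
    if quad_source == "as_is":
        return None  # non-quad polygons are skipped
    return min_area_rect(pts)
```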