Detectors
Text detection models.
- class manuscript.detectors.EAST(weights=None, device=None, *, target_size=1280, expand_ratio_w=1.4, expand_ratio_h=1.5, expand_power=0.6, score_thresh=0.6, iou_threshold=0.05, iou_threshold_standard=0.05, score_geo_scale=0.25, quantization=2, axis_aligned_output=True, remove_area_anomalies=False, anomaly_sigma_threshold=5.0, anomaly_min_box_count=30, use_tta=False, tta_iou_thresh=0.1, **kwargs)[source]
Bases: BaseModel

Initialize EAST text detector with ONNX Runtime.
- Parameters:
weights (str or Path, optional) – Path or identifier for ONNX model weights. Supports:
- Local file path: "path/to/model.onnx"
- HTTP/HTTPS URL: "https://example.com/model.onnx"
- GitHub release: "github://owner/repo/tag/file.onnx"
- Google Drive: "gdrive:FILE_ID"
- Preset name: "east_50_g1"
- None: auto-downloads the default preset (east_50_g1)
device (str, optional) – Compute device: "cuda", "coreml", or "cpu". If None, CPU is selected automatically. For GPU/CoreML acceleration:
- CUDA (NVIDIA): pip install onnxruntime-gpu
- CoreML (Apple Silicon M1/M2/M3): pip install onnxruntime-silicon
Default is None (CPU).
target_size (int, optional) – Input image size for inference. Images are resized to (target_size, target_size). Default is 1280.
expand_ratio_w (float, optional) – Horizontal expansion factor applied to detected boxes after NMS. Default is 1.4.
expand_ratio_h (float, optional) – Vertical expansion factor applied to detected boxes after NMS. Default is 1.5.
expand_power (float, optional) – Power for non-linear box expansion; controls how expansion scales with box size:
- 1.0 – linear (small and large boxes expand equally)
- <1.0 – small boxes expand more (recommended for character-level detection)
- >1.0 – large boxes expand more
Default is 0.6.
score_thresh (float, optional) – Confidence threshold for selecting candidate detections before NMS. Default is 0.6.
iou_threshold (float, optional) – IoU threshold for the locality-aware NMS merging phase. Default is 0.05.
iou_threshold_standard (float, optional) – IoU threshold for standard NMS after locality-aware merging. If None, uses the same value as iou_threshold. Default is 0.05.
score_geo_scale (float, optional) – Scale factor for decoding geometry/score maps. Default is 0.25.
quantization (int, optional) – Quantization resolution for point coordinates during decoding. Default is 2.
axis_aligned_output (bool, optional) – If True, outputs axis-aligned rectangles instead of original quads. Default is True.
remove_area_anomalies (bool, optional) – If True, removes quads with extremely large area relative to the distribution. Default is False.
anomaly_sigma_threshold (float, optional) – Sigma threshold for anomaly area filtering. Default is 5.0.
anomaly_min_box_count (int, optional) – Minimum number of boxes required before anomaly filtering. Default is 30.
use_tta (bool, optional) – Enable Test-Time Augmentation (TTA). When enabled, inference is run on both the original and horizontally flipped image, and results are merged. This can improve detection of partially visible or edge text. Default is False.
tta_iou_thresh (float, optional) – IoU threshold for merging boxes from original and flipped images during TTA. Boxes with IoU > threshold are considered matches and merged. Default is 0.1.
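As an illustration of the remove_area_anomalies behaviour described above, here is a minimal sketch of sigma-based area filtering. This mirrors the documented parameters only; the function name and implementation are hypothetical, not the library's internal code:

```python
import statistics

def filter_area_anomalies(areas, sigma_threshold=5.0, min_box_count=30):
    """Drop boxes whose area is an extreme outlier (illustrative sketch).

    Keeps every area unless it exceeds the mean by more than
    `sigma_threshold` standard deviations; the filter only activates
    once at least `min_box_count` boxes are available.
    """
    if len(areas) < min_box_count:
        return areas  # too few boxes for a reliable distribution estimate
    mean = statistics.mean(areas)
    std = statistics.pstdev(areas)
    cutoff = mean + sigma_threshold * std
    return [a for a in areas if a <= cutoff]

# With fewer than min_box_count boxes, nothing is removed:
assert filter_area_anomalies([1, 2, 1000], min_box_count=30) == [1, 2, 1000]
```

This explains why anomaly_min_box_count exists: with only a handful of detections, the area distribution is too noisy for a meaningful sigma cutoff.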
Notes
The class provides two main public methods:
- predict – run inference on a single image and return detections.
- train – high-level training entrypoint to train an EAST model on custom datasets.
The detector uses ONNX Runtime for fast inference on CPU and GPU. For GPU acceleration, install:
pip install onnxruntime-gpu

Methods
__call__(*args, **kwargs) – Call self as a function.
export(weights_path, output_path[, ...]) – Export EAST PyTorch model to ONNX format.
predict(img_or_path[, return_maps, ...]) – Run EAST inference and return detection results.
runtime_providers() – Get ONNX Runtime execution providers based on device.
train(train_images, train_anns, val_images, ...) – Train EAST model on custom datasets.
- pretrained_registry: Dict[str, str] = {'east_50_g1': 'github://konstantinkozhin/manuscript-ocr/v0.1.0/east_50_g1.onnx'}
- __init__(weights=None, device=None, *, target_size=1280, expand_ratio_w=1.4, expand_ratio_h=1.5, expand_power=0.6, score_thresh=0.6, iou_threshold=0.05, iou_threshold_standard=0.05, score_geo_scale=0.25, quantization=2, axis_aligned_output=True, remove_area_anomalies=False, anomaly_sigma_threshold=5.0, anomaly_min_box_count=30, use_tta=False, tta_iou_thresh=0.1, **kwargs)[source]
- Parameters:
device (str | None)
target_size (int)
expand_ratio_w (float)
expand_ratio_h (float)
expand_power (float)
score_thresh (float)
iou_threshold (float)
iou_threshold_standard (float | None)
score_geo_scale (float)
quantization (int)
axis_aligned_output (bool)
remove_area_anomalies (bool)
anomaly_sigma_threshold (float)
anomaly_min_box_count (int)
use_tta (bool)
tta_iou_thresh (float)
- predict(img_or_path, return_maps=False, sort_reading_order=True, split_into_columns=True, max_columns=10)[source]
Run EAST inference and return detection results.
- Parameters:
img_or_path (str or pathlib.Path or numpy.ndarray) – Path to an image file or an RGB image provided as a NumPy array with shape (H, W, 3) in uint8 format.
return_maps (bool, optional) – If True, returns raw model score and geometry maps under keys "score_map" and "geo_map". Default is False.
sort_reading_order (bool, optional) – If True, sorts detected words in natural reading order (left-to-right, top-to-bottom) and groups them into text lines. Default is True.
split_into_columns (bool, optional) – If True and sort_reading_order=True, segments the page into columns (separate Blocks). If False, treats the entire page as a single column. Only used when sort_reading_order=True. Default is True.
max_columns (int, optional) – Maximum number of columns to detect when split_into_columns=True. Higher values allow more columns to be detected. Only used when sort_reading_order=True and split_into_columns=True. Default is 10.
- Returns:
Dictionary with the following keys:
"page"PageParsed detection result as a Page object containing Block(s) with Line(s) of Word objects. Each Word has polygon coordinates and confidence scores. Words and Lines have reading order indices.
"score_map"numpy.ndarray or NoneRaw score map produced by the network if
return_maps=True.
"geo_map"numpy.ndarray or NoneRaw geometry map if
return_maps=True.
- Return type:
Notes
The method performs: (1) image loading, (2) resizing and normalization, (3) model inference, (4) quad decoding, (5) NMS, (6) box expansion, (7) scaling coordinates back to original size, (8) optional reading order sorting into lines.
Test-Time Augmentation (TTA):
When use_tta=True is set during initialization, the method runs inference on both the original and horizontally flipped image, then merges results. Boxes from both views are matched by IoU and merged by taking the union of coordinates with averaged scores. This can improve detection of text near image edges or partially visible text.
For visualization, use the external visualize_page utility:
>>> from manuscript.utils import visualize_page
>>> result = model.predict(img_path)
>>> vis_img = visualize_page(img, result["page"])
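The TTA merge described above can be sketched as follows, using simplified axis-aligned (x1, y1, x2, y2) boxes. This is an illustration of the documented matching/merging rule only; function names are hypothetical and the library's internal merge may differ in detail:

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def merge_tta(orig, flipped, iou_thresh=0.1):
    """Merge (box, score) detections from two TTA views (sketch)."""
    merged, used = [], set()
    for box_a, score_a in orig:
        match = None
        for j, (box_b, _) in enumerate(flipped):
            if j not in used and iou(box_a, box_b) > iou_thresh:
                match = j
                break
        if match is None:
            merged.append((box_a, score_a))  # no counterpart in flipped view
        else:
            box_b, score_b = flipped[match]
            used.add(match)
            # Union of coordinates, averaged score
            union = (min(box_a[0], box_b[0]), min(box_a[1], box_b[1]),
                     max(box_a[2], box_b[2]), max(box_a[3], box_b[3]))
            merged.append((union, (score_a + score_b) / 2))
    # Unmatched boxes from the flipped view are kept as-is
    merged += [fb for j, fb in enumerate(flipped) if j not in used]
    return merged
```

A low tta_iou_thresh (the default 0.1) makes matching permissive, so the same word detected at slightly different positions in the two views is still merged rather than duplicated.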
Examples
Perform inference and get structured output:
>>> from manuscript.detectors import EAST
>>> model = EAST()
>>> img_path = r"example/ocr_example_image.jpg"
>>> result = model.predict(img_path)
>>> page = result["page"]
>>> # Access first line's first word
>>> first_word = page.blocks[0].lines[0].words[0]
>>> print(f"Confidence: {first_word.detection_confidence}")
Visualize results separately:
>>> from manuscript.utils import visualize_page, read_image
>>> result = model.predict(img_path)
>>> img = read_image(img_path)
>>> vis_img = visualize_page(img, result["page"])
>>> vis_img.show()
- static train(train_images, train_anns, val_images, val_anns, *, experiment_root='./experiments', model_name='resnet_quad', backbone_name='resnet50', pretrained_backbone=True, freeze_first=True, target_size=1024, score_geo_scale=None, epochs=500, batch_size=3, accumulation_steps=1, lr=0.001, grad_clip=5.0, early_stop=100, use_sam=True, sam_type='asam', use_lookahead=True, use_ema=False, use_multiscale=True, use_ohem=True, ohem_ratio=0.5, use_focal_geo=True, focal_gamma=2.0, resume_from=None, val_interval=1, num_workers=0, device=None)[source]
Train EAST model on custom datasets.
- Parameters:
train_images (str, Path or sequence of paths) – Path(s) to training image folders.
train_anns (str, Path or sequence of paths) – Path(s) to COCO-format JSON annotation files corresponding to train_images.
val_images (str, Path or sequence of paths) – Path(s) to validation image folders.
val_anns (str, Path or sequence of paths) – Path(s) to COCO-format JSON annotation files corresponding to val_images.
experiment_root (str, optional) – Base directory where experiment folders will be created. Default is "./experiments".
model_name (str, optional) – Folder name inside experiment_root for logs and checkpoints. Default is "resnet_quad".
backbone_name ({"resnet50", "resnet101"}, optional) – Backbone architecture to use. Options:
- "resnet50" – ResNet-50 (faster, fewer parameters)
- "resnet101" – ResNet-101 (slower, more capacity)
Default is "resnet50".
pretrained_backbone (bool, optional) – Use ImageNet-pretrained backbone weights. Default is True.
freeze_first (bool, optional) – Freeze the lowest layers of the backbone. Default is True.
target_size (int, optional) – Resize the shorter side of images to this size. Default is 1024.
score_geo_scale (float, optional) – Multiplier to recover original coordinates from score/geo maps. If None, automatically taken from the model. Default is None.
epochs (int, optional) – Number of training epochs. Default is 500.
batch_size (int, optional) – Batch size per GPU. Default is 3.
accumulation_steps (int, optional) – Number of gradient accumulation steps. The effective batch size is batch_size * accumulation_steps. Use this to train with larger effective batch sizes when GPU memory is limited. For example:
- batch_size=2, accumulation_steps=4 → effective batch size = 8
- batch_size=1, accumulation_steps=8 → effective batch size = 8
Default is 1 (no accumulation).
lr (float, optional) – Learning rate. Default is 1e-3.
grad_clip (float, optional) – Gradient clipping value (L2 norm). Default is 5.0.
early_stop (int, optional) – Patience (epochs without improvement) for early stopping. Default is 100.
use_sam (bool, optional) – Enable the SAM optimizer. Default is True.
sam_type ({"sam", "asam"}, optional) – Variant of SAM to use. Default is "asam".
use_lookahead (bool, optional) – Wrap the optimizer with Lookahead. Default is True.
use_ema (bool, optional) – Maintain an EMA version of the model weights. Default is False.
use_multiscale (bool, optional) – Random multi-scale training. Default is True.
use_ohem (bool, optional) – Online Hard Example Mining. Default is True.
ohem_ratio (float, optional) – Ratio of hard negatives for OHEM. Default is 0.5.
use_focal_geo (bool, optional) – Apply focal loss to geometry channels. Default is True.
focal_gamma (float, optional) – Gamma for focal geometry loss. Default is 2.0.
resume_from (str or Path, optional) – Resume training from a previous experiment: (a) the experiment directory, (b) its checkpoints/ subdirectory, or (c) a direct path to last_state.pt. Default is None.
val_interval (int, optional) – Run validation every N epochs. Default is 1.
num_workers (int, optional) – Number of worker processes for data loading. Set to 0 for single-process loading (safer on Windows). Default is 0.
device (torch.device, optional) – CUDA or CPU device. Auto-selects if None.
- Returns:
Best model weights (EMA if enabled, otherwise base model).
- Return type:
torch.nn.Module
Examples
Train on two datasets with validation:
>>> from manuscript.detectors import EAST
>>>
>>> train_images = [
...     "/data/archive/train_images",
...     "/data/ddi/train_images"
... ]
>>> train_anns = [
...     "/data/archive/train.json",
...     "/data/ddi/train.json"
... ]
>>> val_images = [
...     "/data/archive/test_images",
...     "/data/ddi/test_images"
... ]
>>> val_anns = [
...     "/data/archive/test.json",
...     "/data/ddi/test.json"
... ]
>>>
>>> best_model = EAST.train(
...     train_images=train_images,
...     train_anns=train_anns,
...     val_images=val_images,
...     val_anns=val_anns,
...     backbone_name="resnet50",
...     target_size=256,
...     epochs=20,
...     batch_size=4,
...     use_sam=False,
...     freeze_first=False,
...     val_interval=3,
... )
>>> print("Best checkpoint loaded:", best_model)
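To make the accumulation_steps mechanic concrete, here is a framework-agnostic sketch of the stepping pattern (illustrative only; function name is hypothetical, and the real training loop also handles loss scaling, gradient clipping, etc.):

```python
def count_optimizer_steps(num_batches, accumulation_steps):
    """How often the optimizer steps under gradient accumulation (sketch)."""
    steps = 0
    for i in range(num_batches):
        # loss.backward() would run for every mini-batch, accumulating
        # gradients; the optimizer only steps (and gradients reset)
        # once every `accumulation_steps` batches.
        if (i + 1) % accumulation_steps == 0:
            steps += 1
    return steps

# batch_size=2 with accumulation_steps=4: one optimizer step per
# 4 mini-batches, i.e. an effective batch size of 8.
assert count_optimizer_steps(8, 4) == 2
```

Each optimizer step therefore sees gradients averaged over batch_size * accumulation_steps samples, which is why accumulation is a memory-cheap substitute for a larger batch size.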
- static export(weights_path, output_path, backbone_name=None, input_size=1280, opset_version=14, simplify=True)[source]
Export EAST PyTorch model to ONNX format.
This method converts a trained EAST model from PyTorch to ONNX format, which can be used for faster inference with ONNX Runtime. The exported model can be loaded with EAST(weights="model.onnx").
- Parameters:
weights_path (str or Path) – Path to the PyTorch model weights file (.pth).
output_path (str or Path) – Path where the ONNX model will be saved (.onnx).
backbone_name ({"resnet50", "resnet101"}, optional) – Backbone architecture of the model. If None, it is automatically detected from the checkpoint. Must match the architecture used during training. Default is None (auto-detect).
input_size (int, optional) – Input image size (height and width). The model will accept images of shape (batch, 3, input_size, input_size). Default is 1280.
opset_version (int, optional) – ONNX opset version to use for export. Default is 14.
simplify (bool, optional) – If True, applies ONNX graph simplification using onnx-simplifier to optimize the model. Requires the onnx-simplifier package. Default is True.
- Returns:
None. The ONNX model is saved to output_path.
- Return type:
None
- Raises:
ImportError – If required packages (torch, onnx) are not installed.
FileNotFoundError – If weights_path does not exist.
ValueError – If backbone_name does not match the checkpoint architecture.
Notes
The exported ONNX model has two outputs:
- score_map: Text confidence map with shape (batch, 1, H, W)
- geo_map: Geometry map with shape (batch, 8, H, W)
The model supports dynamic batch size and image dimensions through dynamic axes configuration.
Automatic Backbone Detection:
The method automatically detects the backbone architecture from the checkpoint by analyzing the number of parameters in layer4. This prevents mismatches between checkpoint and architecture that could lead to incorrect exports.
Examples
Export with automatic backbone detection:
>>> from manuscript.detectors import EAST
>>> EAST.export(
...     weights_path="east_resnet50.pth",
...     output_path="east_model.onnx"
... )
Auto-detected backbone: resnet50
Exporting to ONNX (opset 14)...
[OK] ONNX model saved to: east_model.onnx
Export with explicit backbone:
>>> EAST.export(
...     weights_path="custom_weights.pth",
...     output_path="custom_model.onnx",
...     backbone_name="resnet101",
...     input_size=1024,
...     simplify=False
... )
Use the exported model for inference:
>>> detector = EAST(
...     weights="east_model.onnx",
...     device="cuda"
... )
>>> result = detector.predict("image.jpg")
See also
EAST.__init__ – Initialize the EAST detector; ONNX weights are passed via the weights argument.