Detectors
Text detection models.
- class manuscript.detectors.EAST(weights=None, device=None, force_download=False, *, target_size=1280, expand_ratio_w=1.4, expand_ratio_h=1.5, expand_power=0.6, score_thresh=0.6, iou_threshold=0.05, iou_threshold_standard=0.05, score_geo_scale=0.25, quantization=2, axis_aligned_output=True, remove_area_anomalies=False, anomaly_sigma_threshold=5.0, anomaly_min_box_count=30, use_tta=False, tta_iou_thresh=0.1, **kwargs)[source]
Bases: BaseDetector
Initialize EAST text detector with ONNX Runtime.
- Parameters:
weights (str or Path, optional) –
Path or identifier for ONNX model weights. Supports:
- Local file path: "path/to/model.onnx"
- HTTP/HTTPS URL: "https://example.com/model.onnx"
- GitHub release: "github://owner/repo/tag/file.onnx"
- Google Drive: "gdrive:FILE_ID"
- Preset name: "east_50_g1"
- None: auto-downloads default preset (east_50_g1)
device (str, optional) –
Compute device: "cuda", "coreml", or "cpu". If None, automatically selects CPU. For GPU/CoreML acceleration:
- CUDA (NVIDIA): pip install onnxruntime-gpu
- CoreML (Apple Silicon M1/M2/M3): pip install onnxruntime-silicon
Default is None (CPU).
target_size (int, optional) – Input image size for inference. Images are resized to (target_size, target_size). Default is 1280.
expand_ratio_w (float, optional) – Horizontal expansion factor applied to detected boxes after NMS. Default is 1.4.
expand_ratio_h (float, optional) – Vertical expansion factor applied to detected boxes after NMS. Default is 1.5.
expand_power (float, optional) – Power for non-linear box expansion. Controls how expansion scales with box size:
- 1.0 = linear (small and large boxes expand equally)
- <1.0 = small boxes expand more (e.g., 0.5, recommended for character-level detection)
- >1.0 = large boxes expand more
Default is 0.6.
score_thresh (float, optional) – Confidence threshold for selecting candidate detections before NMS. Default is 0.6.
iou_threshold (float, optional) – IoU threshold for locality-aware NMS merging phase. Default is 0.05.
iou_threshold_standard (float, optional) – IoU threshold for standard NMS after locality-aware merging. If None, uses the same value as iou_threshold. Default is 0.05.
score_geo_scale (float, optional) – Scale factor for decoding geometry/score maps. Default is 0.25.
quantization (int, optional) – Quantization resolution for point coordinates during decoding. Default is 2.
axis_aligned_output (bool, optional) – If True, outputs axis-aligned rectangles instead of original quads. Default is True.
remove_area_anomalies (bool, optional) – If True, removes quads with extremely large area relative to the distribution. Default is False.
anomaly_sigma_threshold (float, optional) – Sigma threshold for anomaly area filtering. Default is 5.0.
anomaly_min_box_count (int, optional) – Minimum number of boxes required before anomaly filtering. Default is 30.
use_tta (bool, optional) – Enable Test-Time Augmentation (TTA). When enabled, inference is run on both the original and horizontally flipped image, and results are merged. This can improve detection of partially visible or edge text. Default is False.
tta_iou_thresh (float, optional) – IoU threshold for merging boxes from original and flipped images during TTA. Boxes with IoU > threshold are considered matches and merged. Default is 0.1.
force_download (bool)
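The interaction between remove_area_anomalies, anomaly_sigma_threshold, and anomaly_min_box_count can be illustrated with a minimal sketch. The exact rule used by the library is not stated in this section; the code below assumes a simple "mean plus k standard deviations" cutoff on box areas, and the function name filter_area_anomalies is hypothetical.

```python
from statistics import mean, pstdev


def filter_area_anomalies(areas, sigma_threshold=5.0, min_box_count=30):
    """Return indices of boxes kept after dropping unusually large areas.

    Assumption: a box is an anomaly when its area exceeds
    mean + sigma_threshold * std of all areas. This is an illustrative
    rule, not necessarily the one the detector applies.
    """
    # With too few boxes the statistics are unreliable, so keep everything.
    if len(areas) < min_box_count:
        return list(range(len(areas)))
    limit = mean(areas) + sigma_threshold * pstdev(areas)
    return [i for i, area in enumerate(areas) if area <= limit]


# 40 ordinary boxes and one very large box.
areas = [100.0] * 40 + [50_000.0]
kept = filter_area_anomalies(areas)
print(len(kept))
```

With fewer than anomaly_min_box_count boxes the sketch returns every index unchanged, which mirrors the documented purpose of that parameter.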
Notes
The class provides two main public methods:
- predict: run inference on a single image and return detections.
- train: high-level training entrypoint to train an EAST model on custom datasets.
The detector uses ONNX Runtime for fast inference on CPU and GPU. For GPU acceleration, install:
pip install onnxruntime-gpu
Methods
__call__(*args, **kwargs): Call self as a function.
export(weights_path, output_path[, ...]): Export EAST PyTorch model to ONNX format.
predict(img_or_path): Run EAST inference and return detected page structure.
runtime_providers(): Get ONNX Runtime execution providers based on device.
train(train_images, train_anns, val_images, ...): Train EAST model on custom datasets.
- pretrained_registry: Dict[str, str] = {'east_50_g1': 'github://konstantinkozhin/manuscript-ocr/v0.1.0/east_50_g1.onnx'}
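The identifier schemes accepted by weights can be told apart by prefix. The sketch below is an assumption about how such dispatch might look, not the library's actual resolver; classify_weights, github_to_url, and PRESETS are hypothetical names. The github:// expansion follows the standard GitHub release download URL layout, which is also the form used by the registry URLs elsewhere in this section.

```python
from urllib.parse import urlparse

# Preset names listed in this section (hypothetical local copy).
PRESETS = {"east_50_g1"}


def classify_weights(weights):
    """Label a weights identifier by the scheme it appears to use."""
    if weights is None:
        return "default_preset"
    text = str(weights)
    if text.startswith("github://"):
        return "github_release"
    if text.startswith("gdrive:"):
        return "google_drive"
    if urlparse(text).scheme in ("http", "https"):
        return "url"
    if text in PRESETS:
        return "preset"
    return "local_path"


def github_to_url(identifier):
    """Expand github://owner/repo/tag/file into a release download URL."""
    owner, repo, tag, filename = identifier[len("github://"):].split("/", 3)
    return f"https://github.com/{owner}/{repo}/releases/download/{tag}/{filename}"


print(classify_weights("gdrive:FILE_ID"))
print(github_to_url("github://konstantinkozhin/manuscript-ocr/v0.1.0/east_50_g1.onnx"))
```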
- __init__(weights=None, device=None, force_download=False, *, target_size=1280, expand_ratio_w=1.4, expand_ratio_h=1.5, expand_power=0.6, score_thresh=0.6, iou_threshold=0.05, iou_threshold_standard=0.05, score_geo_scale=0.25, quantization=2, axis_aligned_output=True, remove_area_anomalies=False, anomaly_sigma_threshold=5.0, anomaly_min_box_count=30, use_tta=False, tta_iou_thresh=0.1, **kwargs)[source]
- Parameters:
device (str | None)
force_download (bool)
target_size (int)
expand_ratio_w (float)
expand_ratio_h (float)
expand_power (float)
score_thresh (float)
iou_threshold (float)
iou_threshold_standard (float | None)
score_geo_scale (float)
quantization (int)
axis_aligned_output (bool)
remove_area_anomalies (bool)
anomaly_sigma_threshold (float)
anomaly_min_box_count (int)
use_tta (bool)
tta_iou_thresh (float)
- predict(img_or_path)[source]
Run EAST inference and return detected page structure.
- Parameters:
img_or_path (str or pathlib.Path or numpy.ndarray) – Path to an image file or an RGB image provided as a NumPy array with shape (H, W, 3) in uint8 format.
- Returns:
Parsed detection result as a Page object containing a single Block with a single Line of TextSpan objects. Each TextSpan has polygon coordinates, confidence score, and sequential order index.
- Return type:
Page
Notes
The method performs: (1) image loading, (2) resizing and normalization, (3) model inference, (4) quad decoding, (5) NMS, (6) box expansion, (7) scaling coordinates back to original size.
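Step (7) can be sketched as follows. This assumes the image is stretched to a square of (target_size, target_size) without padding, as the target_size description suggests; if the real preprocessing letterboxes or pads, the mapping would differ. The helper name scale_quad_to_original is hypothetical.

```python
def scale_quad_to_original(quad, target_size, orig_w, orig_h):
    """Map quad corners from the resized square back to the original image.

    Assumption: the resize was a plain stretch to
    (target_size, target_size), so each axis has its own scale factor.
    """
    sx = orig_w / target_size
    sy = orig_h / target_size
    return [(x * sx, y * sy) for x, y in quad]


# A quad covering the full 1280x1280 network input, for a 2560x1920 image.
quad = [(0, 0), (1280, 0), (1280, 1280), (0, 1280)]
print(scale_quad_to_original(quad, 1280, orig_w=2560, orig_h=1920))
```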
Test-Time Augmentation (TTA):
When use_tta=True is set during initialization, the method runs inference on both the original and horizontally flipped image, then merges results. Boxes from both views are matched by IoU and merged by taking the union of coordinates with averaged scores. This can improve detection of text near image edges or partially visible text.
For visualization, use the external visualize_page utility:
>>> from manuscript.utils import visualize_page, read_image
>>> page = model.predict(img_path)
>>> img = read_image(img_path)
>>> vis_img = visualize_page(img, page)
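The TTA merge described above (match by IoU, take the union of coordinates, average the scores) can be sketched for axis-aligned boxes. This is an illustration under assumptions, not the library's implementation: matching is greedy, unmatched boxes from either view are kept, and the names unflip_box, iou, and merge_tta are hypothetical.

```python
def unflip_box(box, image_width):
    """Map an (x1, y1, x2, y2) box from the flipped image to original coordinates."""
    x1, y1, x2, y2 = box
    return (image_width - x2, y1, image_width - x1, y2)


def iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def merge_tta(orig, flipped, image_width, iou_thresh=0.1):
    """Merge (box, score) lists from the original and flipped views."""
    flipped = [(unflip_box(box, image_width), score) for box, score in flipped]
    used = set()
    merged = []
    for box, score in orig:
        # Greedy choice: the best still-unused flipped box above the threshold.
        best_j, best_iou = None, iou_thresh
        for j, (fbox, _) in enumerate(flipped):
            overlap = iou(box, fbox)
            if j not in used and overlap > best_iou:
                best_j, best_iou = j, overlap
        if best_j is None:
            merged.append((box, score))
            continue
        used.add(best_j)
        fbox, fscore = flipped[best_j]
        union_box = (min(box[0], fbox[0]), min(box[1], fbox[1]),
                     max(box[2], fbox[2]), max(box[3], fbox[3]))
        merged.append((union_box, (score + fscore) / 2))
    # Assumption: detections seen only in the flipped view are kept as-is.
    merged.extend(item for j, item in enumerate(flipped) if j not in used)
    return merged


orig = [((10, 10, 30, 20), 0.9)]
flipped = [((68, 10, 88, 20), 0.7), ((0, 50, 10, 60), 0.5)]
print(merge_tta(orig, flipped, image_width=100))
```

Raising tta_iou_thresh makes matching stricter, so more boxes from the two views are kept as separate detections.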
Examples
Perform inference and get structured output:
>>> from manuscript.detectors import EAST
>>> model = EAST()
>>> img_path = r"example/ocr_example_image.jpg"
>>> page = model.predict(img_path)
>>> # Access first line's first text span
>>> first_text_span = page.blocks[0].lines[0].text_spans[0]
>>> print(f"Confidence: {first_text_span.detection_confidence}")
Visualize results separately:
>>> from manuscript.utils import visualize_page, read_image
>>> page = model.predict(img_path)
>>> img = read_image(img_path)
>>> vis_img = visualize_page(img, page)
>>> vis_img.show()
- static train(train_images, train_anns, val_images, val_anns, *, experiment_root='./experiments', model_name='resnet_quad', backbone_name='resnet50', pretrained_backbone=True, freeze_first=False, target_size=1024, score_geo_scale=None, score_map_shrink_ratio=0.3, epochs=500, batch_size=3, accumulation_steps=1, lr=0.0001, lr_scheduler='cosine_restart', lr_scheduler_params=None, augmentation_config=None, grad_clip=5.0, early_stop=100, use_sam=False, sam_type='asam', use_lookahead=False, use_ema=False, use_multiscale=True, use_ohem=False, ohem_ratio=0.5, use_focal_geo=True, focal_gamma=2.0, resume_from=None, val_interval=1, num_workers=0, log_collage=True, device=None)[source]
Train EAST model on custom datasets.
- Parameters:
train_images (str, Path or sequence of paths) – Path(s) to training image folders.
train_anns (str, Path or sequence of paths) – Path(s) to COCO-format JSON annotation files corresponding to train_images.
val_images (str, Path or sequence of paths) – Path(s) to validation image folders.
val_anns (str, Path or sequence of paths) – Path(s) to COCO-format JSON annotation files corresponding to val_images.
experiment_root (str, optional) – Base directory where experiment folders will be created. Default is "./experiments".
model_name (str, optional) – Folder name inside experiment_root for logs and checkpoints. Default is "resnet_quad".
backbone_name ({"resnet50", "resnet101"}, optional) –
Backbone architecture to use. Options:
- "resnet50": ResNet-50 (faster, fewer parameters)
- "resnet101": ResNet-101 (slower, more capacity)
Default is "resnet50".
pretrained_backbone (bool, optional) – Use ImageNet-pretrained backbone weights. Default True.
freeze_first (bool, optional) – Freeze lowest layers of the backbone. Default False.
target_size (int, optional) – Resize shorter side of images to this size. Default 1024.
score_geo_scale (float, optional) – Multiplier to recover original coordinates from score/geo maps. If None, automatically taken from the model. Default None.
score_map_shrink_ratio (float, optional) – Shrink ratio for score map. Default 0.3.
epochs (int, optional) – Number of training epochs. Default 500.
batch_size (int, optional) – Batch size per GPU. Default 3.
accumulation_steps (int, optional) –
Number of gradient accumulation steps. Effective batch size will be batch_size * accumulation_steps. Use this to train with larger effective batch sizes when GPU memory is limited. For example:
- batch_size=2, accumulation_steps=4 → effective batch size = 8
- batch_size=1, accumulation_steps=8 → effective batch size = 8
Default is 1 (no accumulation).
lr (float, optional) – Learning rate. Default 1e-4.
lr_scheduler ({"cosine_restart", "cosine", "linear", "step", "exponential", "plateau", "none"}, optional) – Learning rate scheduler type. Default "cosine_restart".
lr_scheduler_params (dict, optional) –
Extra scheduler parameters (depends on scheduler type). Common keys:
- cosine_restart: t0, t_mult, eta_min
- cosine: t_max, eta_min
- linear: final_factor
- step: step_size, gamma
- exponential: gamma
- plateau: factor, patience, min_lr
augmentation_config (dict, optional) –
Full augmentation configuration passed to EASTDataset. This is the preferred place for all augmentation settings. Any keys here override the corresponding defaults. Supported keys include:
- quad_source for COCO polygon-to-quad conversion: "auto" preserves 4-point polygons and falls back to "min_area_rect" for longer polygons, "as_is" accepts only 4-point polygons, and "min_area_rect" always fits the minimum-area rectangle
- flip_prob, color_jitter
- small_rotate_prob, small_rotate_deg
- perspective_prob, perspective_scale
- blur_prob, blur_ksize_range
- noise_prob, noise_std, salt_pepper_prob
- jpeg_prob, jpeg_quality_range
- shading_prob, shading_strength
- gamma_prob, gamma_range
- downscale_prob, downscale_range
- negative_prob
grad_clip (float, optional) – Gradient clipping value (L2 norm). Default 5.0.
early_stop (int, optional) – Patience (epochs without improvement) for early stopping. Default 100.
use_sam (bool, optional) – Enable SAM optimizer. Default False.
sam_type ({"sam", "asam"}, optional) – Variant of SAM to use. Default "asam".
use_lookahead (bool, optional) – Wrap optimizer with Lookahead. Default False.
use_ema (bool, optional) – Maintain EMA version of model weights. Default False.
use_multiscale (bool, optional) – Random multi-scale training. Default True.
use_ohem (bool, optional) – Online Hard Example Mining. Default False.
ohem_ratio (float, optional) – Ratio of hard negatives for OHEM. Default 0.5.
use_focal_geo (bool, optional) – Apply focal loss to geometry channels. Default True.
focal_gamma (float, optional) – Gamma for focal geometry loss. Default 2.0.
resume_from (str or Path, optional) – Resume training from a previous experiment or initialize from local weights: a) experiment directory, b) …/checkpoints/, c) direct path to full-state checkpoint (last_state.pt). If a weights-only file is provided (e.g. best.pth), model weights are loaded, but training starts in the current experiment_root/model_name directory. Default None.
val_interval (int, optional) – Run validation every N epochs. Default 1.
num_workers (int, optional) – Number of worker processes for data loading. Set to 0 for single-process loading (safer on Windows). Default 0.
log_collage (bool, optional) – Whether to generate and log validation collage images to TensorBoard. Disable to save memory on GPUs with limited VRAM. Default True.
device (torch.device, optional) – CUDA or CPU device. Auto-selects if None.
- Returns:
Best model weights (EMA if enabled, otherwise base model).
- Return type:
torch.nn.Module
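The effect of accumulation_steps can be shown with a framework-free sketch. It uses scalar stand-ins for gradients and the common pattern of scaling each micro-batch gradient by 1 / accumulation_steps; whether the library's training loop scales in exactly this way is an assumption, and both function names are hypothetical.

```python
def effective_batch_size(batch_size, accumulation_steps=1):
    """Number of samples contributing to each optimizer update."""
    return batch_size * accumulation_steps


def accumulate_and_step(micro_batch_grads, accumulation_steps):
    """Return the averaged gradient applied at each optimizer step.

    Scalars stand in for gradient tensors. Each micro-batch gradient is
    scaled by 1 / accumulation_steps so the accumulated sum is an average.
    """
    buffer = 0.0
    updates = []
    for i, grad in enumerate(micro_batch_grads, start=1):
        buffer += grad / accumulation_steps
        if i % accumulation_steps == 0:
            updates.append(buffer)  # optimizer.step() would happen here
            buffer = 0.0            # followed by optimizer.zero_grad()
    return updates


print(effective_batch_size(2, 4))
print(accumulate_and_step([1.0, 3.0, 5.0, 7.0], accumulation_steps=2))
```

Four micro-batches with accumulation_steps=2 produce two optimizer updates, each using the mean of two micro-batch gradients.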
Examples
Train on two datasets with validation:
>>> from manuscript.detectors import EAST
>>>
>>> train_images = [
...     "/data/archive/train_images",
...     "/data/ddi/train_images"
... ]
>>> train_anns = [
...     "/data/archive/train.json",
...     "/data/ddi/train.json"
... ]
>>> val_images = [
...     "/data/archive/test_images",
...     "/data/ddi/test_images"
... ]
>>> val_anns = [
...     "/data/archive/test.json",
...     "/data/ddi/test.json"
... ]
>>>
>>> best_model = EAST.train(
...     train_images=train_images,
...     train_anns=train_anns,
...     val_images=val_images,
...     val_anns=val_anns,
...     backbone_name="resnet50",
...     target_size=256,
...     epochs=20,
...     batch_size=4,
...     use_sam=False,
...     freeze_first=False,
...     val_interval=3,
... )
>>> print("Best checkpoint loaded:", best_model)
Configure augmentations via augmentation_config:
>>> aug_cfg = {
...     "quad_source": "auto",
...     "flip_prob": 0.02,
...     "color_jitter": (0.1, 0.1, 0.1, 0.05),
...     "small_rotate_prob": 0.15,
...     "small_rotate_deg": 1.5,
...     "perspective_prob": 0.08,
...     "perspective_scale": 0.015,
...     "blur_prob": 0.1,
...     "blur_ksize_range": (3, 5),
...     "noise_prob": 0.1,
...     "noise_std": 0.008,
...     "salt_pepper_prob": 0.0005,
...     "jpeg_prob": 0.1,
...     "jpeg_quality_range": (75, 95),
...     "shading_prob": 0.1,
...     "shading_strength": 0.1,
...     "gamma_prob": 0.2,
...     "gamma_range": (0.95, 1.05),
...     "downscale_prob": 0.1,
...     "downscale_range": (0.7, 0.95),
...     "negative_prob": 0.05,
... }
>>> best_model = EAST.train(
...     train_images=train_images,
...     train_anns=train_anns,
...     val_images=val_images,
...     val_anns=val_anns,
...     augmentation_config=aug_cfg,
... )
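The three quad_source modes documented for augmentation_config can be sketched in plain Python. This is an illustration, not the library's code: polygon_to_quad, min_area_rect, and convex_hull are hypothetical helpers, min_area_rect is a dependency-free stand-in for a minimum-area-rectangle fit, and returning None to signal a skipped polygon is an assumption.

```python
import math


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def min_area_rect(points):
    """Smallest enclosing rectangle, found by testing each hull edge direction."""
    hull = convex_hull([tuple(p) for p in points])
    best = None
    for i in range(len(hull)):
        (x1, y1), (x2, y2) = hull[i], hull[(i + 1) % len(hull)]
        length = math.hypot(x2 - x1, y2 - y1)
        if length == 0:
            continue
        ux, uy = (x2 - x1) / length, (y2 - y1) / length  # edge direction
        vx, vy = -uy, ux                                  # edge normal
        us = [px * ux + py * uy for px, py in hull]
        vs = [px * vx + py * vy for px, py in hull]
        area = (max(us) - min(us)) * (max(vs) - min(vs))
        if best is None or area < best[0]:
            best = (area, ux, uy, vx, vy, min(us), max(us), min(vs), max(vs))
    if best is None:
        raise ValueError("need at least two distinct points")
    _, ux, uy, vx, vy, u0, u1, v0, v1 = best
    # Convert the rectangle corners from (u, v) coordinates back to (x, y).
    return [(u * ux + v * vx, u * uy + v * vy)
            for u, v in ((u0, v0), (u1, v0), (u1, v1), (u0, v1))]


def polygon_to_quad(polygon, quad_source="auto"):
    """Convert a polygon to a 4-point quad; None means the polygon is skipped."""
    if quad_source == "as_is":
        return list(polygon) if len(polygon) == 4 else None
    if quad_source == "auto":
        return list(polygon) if len(polygon) == 4 else min_area_rect(polygon)
    if quad_source == "min_area_rect":
        return min_area_rect(polygon)
    raise ValueError(f"unknown quad_source: {quad_source!r}")


pentagon = [(0, 0), (4, 0), (4, 2), (2, 2), (0, 2)]
print(polygon_to_quad(pentagon, "auto"))
print(polygon_to_quad(pentagon, "as_is"))
```

The corner order produced by min_area_rect in this sketch depends on which hull edge wins, so real training code would likely normalize the vertex order afterwards.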
- static export(weights_path, output_path, backbone_name=None, input_size=1280, opset_version=14, simplify=True)[source]
Export EAST PyTorch model to ONNX format.
This method converts a trained EAST model from PyTorch to ONNX format, which can be used for faster inference with ONNX Runtime. The exported model can be loaded using EAST(weights="model.onnx").
- Parameters:
weights_path (str or Path) – Path to the PyTorch model weights file (.pth).
output_path (str or Path) – Path where the ONNX model will be saved (.onnx).
backbone_name ({"resnet50", "resnet101"}, optional) – Backbone architecture of the model. If None, will be automatically detected from the checkpoint. Must match the architecture used during training. Default is None (auto-detect).
input_size (int, optional) – Input image size (height and width). The model will accept images of shape (batch, 3, input_size, input_size). Default is 1280.
opset_version (int, optional) – ONNX opset version to use for export. Default is 14.
simplify (bool, optional) – If True, applies ONNX graph simplification using onnx-simplifier to optimize the model. Requires the onnx-simplifier package. Default is True.
- Returns:
The ONNX model is saved to output_path.
- Return type:
None
- Raises:
ImportError – If required packages (torch, onnx) are not installed.
FileNotFoundError – If weights_path does not exist.
ValueError – If backbone_name doesn’t match the checkpoint architecture.
Notes
The exported ONNX model has two outputs:
- score_map: Text confidence map with shape (batch, 1, H, W)
- geo_map: Geometry map with shape (batch, 8, H, W)
The model supports dynamic batch size and image dimensions through dynamic axes configuration.
Automatic Backbone Detection:
The method automatically detects the backbone architecture from the checkpoint by analyzing the number of parameters in layer4. This prevents mismatches between checkpoint and architecture that could lead to incorrect exports.
Examples
Export with automatic backbone detection:
>>> from manuscript.detectors import EAST
>>> EAST.export(
...     weights_path="east_resnet50.pth",
...     output_path="east_model.onnx"
... )
Auto-detected backbone: resnet50
Exporting to ONNX (opset 14)...
[OK] ONNX model saved to: east_model.onnx
Export with explicit backbone:
>>> EAST.export(
...     weights_path="custom_weights.pth",
...     output_path="custom_model.onnx",
...     backbone_name="resnet101",
...     input_size=1024,
...     simplify=False
... )
Use the exported model for inference (the constructor takes the ONNX file through its weights argument):
>>> detector = EAST(
...     weights="east_model.onnx",
...     device="cuda"
... )
>>> result = detector.predict("image.jpg")
See also
EAST.__init__: Initialize the EAST detector from ONNX weights via the weights argument.
- class manuscript.detectors.YOLO(weights=None, config=None, device=None, force_download=False, *, score_thresh=0.1, class_ids=None, target_size=None, axis_aligned_output=True, containment_threshold=0.9, **kwargs)[source]
Bases: BaseDetector
Initialize YOLO text detector with ONNX Runtime.
- Parameters:
weights (str or Path, optional) –
Path or identifier for ONNX model weights. Supports:
- Local file path: "path/to/model.onnx"
- HTTP/HTTPS URL: "https://example.com/model.onnx"
- GitHub release: "github://owner/repo/tag/file.onnx"
- Google Drive: "gdrive:FILE_ID"
- Preset name: "yolo26s_obb_text_g1" or "yolo26x_obb_text_g1"
- None: auto-downloads default preset (yolo26x_obb_text_g1)
The ONNX model may return either standard detections in xyxy, score, class_id format with shape [N, 6] / [1, N, 6] or oriented detections in cx, cy, w, h, score, class_id, angle format with shape [N, 7] / [1, N, 7].
config (str or Path, optional) – Path or identifier for model configuration YAML. Same URL schemes as weights. If None, attempts to infer a YAML file next to the weights or uses the preset config from config_registry.
device (str, optional) –
Compute device: "cuda", "coreml", or "cpu". If None, automatically selects CPU. For GPU/CoreML acceleration:
- CUDA (NVIDIA): pip install onnxruntime-gpu
- CoreML (Apple Silicon M1/M2/M3): pip install onnxruntime-silicon
Default is None (CPU).
score_thresh (float, optional) – Confidence threshold applied to model outputs after ONNX inference and before the additional containment cleanup pass. Default is 0.1.
class_ids (sequence of int or None, optional) – Optional whitelist of class IDs to keep. If None, all classes are kept. Default is None.
target_size (int or None, optional) – Square inference size used for letterbox preprocessing. Images are resized into (target_size, target_size) before ONNX inference. If None, the detector tries to read imgsz from a YAML config located next to the weights or downloaded from the preset registry. Unknown/custom weights without YAML fall back to 1280.
axis_aligned_output (bool, optional) – If True (default), OBB detections are converted to standard axis-aligned rectangles. If False, OBB detections are returned as rotated polygons via page/polygons and as cx, cy, w, h, score, class_id, angle rows in boxes. For non-OBB models this flag has no effect.
containment_threshold (float or None, optional) – Removes a smaller box when at least this fraction of its area is covered by a larger box. For example, 0.9 removes boxes that are contained by 90% or more. Set to None to disable this extra cleanup. Default is 0.9.
force_download (bool)
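The containment cleanup controlled by containment_threshold can be sketched for axis-aligned (x1, y1, x2, y2) boxes. This is a simplified illustration under assumptions: the helper names are hypothetical, "larger" is taken to mean strictly greater area, and the real detector may handle rotated boxes or ties differently.

```python
def box_area(box):
    """Area of an axis-aligned (x1, y1, x2, y2) box."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def covered_fraction(small, large):
    """Fraction of `small`'s area that lies inside `large`."""
    ix1, iy1 = max(small[0], large[0]), max(small[1], large[1])
    ix2, iy2 = min(small[2], large[2]), min(small[3], large[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = box_area(small)
    return inter / area if area > 0 else 0.0


def remove_contained(boxes, containment_threshold=0.9):
    """Drop boxes mostly covered by a strictly larger box."""
    if containment_threshold is None:
        return list(boxes)  # cleanup disabled
    kept = []
    for i, box in enumerate(boxes):
        contained = any(
            j != i
            and box_area(other) > box_area(box)
            and covered_fraction(box, other) >= containment_threshold
            for j, other in enumerate(boxes)
        )
        if not contained:
            kept.append(box)
    return kept


boxes = [(0, 0, 100, 100), (10, 10, 20, 20), (95, 0, 105, 10)]
print(remove_contained(boxes))
```

The second box lies entirely inside the first and is removed, while the third is only half covered and survives at the default threshold.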
Notes
The class provides one main public method:
- predict: run inference on a single image and return detections.
Available presets:
- "yolo26s_obb_text_g1": YOLO26-S OBB text detector
- "yolo26x_obb_text_g1": YOLO26-X OBB text detector
Methods
__call__(*args, **kwargs): Call self as a function.
predict(img_or_path): Run YOLO ONNX inference on a single image and return detected page structure.
runtime_providers(): Get ONNX Runtime execution providers based on device.
train(*args, **kwargs)
export
- default_target_size = 1280
- pretrained_registry: Dict[str, str] = {'yolo26s_obb_text_g1': 'https://github.com/konstantinkozhin/manuscript-ocr/releases/download/v0.1.0/yolo26s_obb_text_g1.raw.onnx', 'yolo26x_obb_text_g1': 'https://github.com/konstantinkozhin/manuscript-ocr/releases/download/v0.1.0/yolo26x_obb_text_g1.raw.onnx'}
- config_registry: Dict[str, str] = {'yolo26s_obb_text_g1': 'https://github.com/konstantinkozhin/manuscript-ocr/releases/download/v0.1.0/yolo26s_obb_text_g1.raw.yaml', 'yolo26x_obb_text_g1': 'https://github.com/konstantinkozhin/manuscript-ocr/releases/download/v0.1.0/yolo26x_obb_text_g1.raw.yaml'}
- __init__(weights=None, config=None, device=None, force_download=False, *, score_thresh=0.1, class_ids=None, target_size=None, axis_aligned_output=True, containment_threshold=0.9, **kwargs)[source]
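The two output row layouts described for weights, and the effect of axis_aligned_output, can be illustrated with a small sketch. It is not the library's decoder: parse_row and the other helpers are hypothetical, and the angle is assumed to be in radians, which this section does not confirm.

```python
import math


def obb_to_polygon(cx, cy, w, h, angle):
    """Four corners of a rotated box; `angle` is assumed to be in radians."""
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    half = ((-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2))
    return [(cx + dx * cos_a - dy * sin_a, cy + dx * sin_a + dy * cos_a)
            for dx, dy in half]


def polygon_to_xyxy(polygon):
    """Axis-aligned bounding rectangle of a polygon."""
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    return (min(xs), min(ys), max(xs), max(ys))


def parse_row(row, axis_aligned_output=True):
    """Interpret one detection row by its length (6 = xyxy, 7 = oriented)."""
    if len(row) == 6:
        x1, y1, x2, y2, score, class_id = row
        return {"box": (x1, y1, x2, y2), "score": score, "class_id": int(class_id)}
    if len(row) == 7:
        cx, cy, w, h, score, class_id, angle = row
        polygon = obb_to_polygon(cx, cy, w, h, angle)
        result = {"score": score, "class_id": int(class_id)}
        if axis_aligned_output:
            result["box"] = polygon_to_xyxy(polygon)
        else:
            result["polygon"] = polygon
        return result
    raise ValueError(f"unexpected row length: {len(row)}")


print(parse_row([50, 50, 20, 10, 0.9, 0, 0.0]))
print(parse_row([50, 50, 20, 10, 0.9, 0, math.pi / 2], axis_aligned_output=False))
```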
- predict(img_or_path)[source]
Run YOLO ONNX inference on a single image and return detected page structure.
- Parameters:
img_or_path (str or pathlib.Path or numpy.ndarray) – Path to an image file or an RGB image provided as a NumPy array with shape (H, W, 3) in uint8 format.
- Returns:
Parsed detection result as a Page object containing a single Block with a single Line of TextSpan objects.
- Return type:
Page
Examples
Run inference and get structured output:
>>> from manuscript.detectors import YOLO
>>> model = YOLO(weights="yolo26x_obb_text_g1")
>>> page = model.predict("page.jpg")
>>> first_text_span = page.blocks[0].lines[0].text_spans[0]
>>> print(first_text_span.detection_confidence)
EAST: Training Notes
The EAST detector in manuscript-ocr is based on the architecture proposed in
EAST: An Efficient and Accurate Scene Text Detector
(Zhou et al., CVPR 2017). The training procedure has been significantly reworked
compared to the original: the loss weighting scheme, augmentation pipeline,
quadrilateral annotation handling, and support for mixed annotations have all
been modified. Pretrained weights were produced by the project authors.
EAST Training Quads
EAST training expects quadrilateral targets. When loading COCO
segmentation polygons, use augmentation_config["quad_source"] in
EAST.train(...) to control how polygons are converted into 4-point
training quads:
- "auto" keeps existing 4-point polygons as-is and falls back to minAreaRect for longer polygons.
- "as_is" accepts only 4-point polygons and skips polygons with a different number of vertices.
- "min_area_rect" always fits the minimum-area rectangle and matches the legacy conversion path.