Correctors
==========

Character-level text corrector for post-processing OCR results.

``CharLM``
----------

.. autoclass:: manuscript.correctors.CharLM
   :members:
   :undoc-members:
   :show-inheritance:

Overview
~~~~~~~~

``CharLM`` is a character-level masked language model that uses a Transformer
architecture to correct OCR errors. It treats low-confidence characters as
correction candidates, predicts replacements from the surrounding characters,
and applies a replacement only when the model is sufficiently confident.

**Key features:**

- Character-level Transformer-based correction
- Configurable confidence thresholds
- Support for custom vocabularies and lexicons
- ONNX Runtime inference for fast correction
- Optional lexicon filtering to preserve known words

Available Presets
~~~~~~~~~~~~~~~~~

The following pretrained models are available:

.. list-table::
   :widths: 30 70
   :header-rows: 1

   * - Preset Name
     - Description
   * - ``prereform_charlm_g1``
     - Pre-reform Russian text (default)
   * - ``modern_charlm_g1``
     - Modern Russian text

Quick Example
~~~~~~~~~~~~~

Use ``create_page_from_text`` to try the corrector on plain text without running OCR:

.. code-block:: python

    from manuscript.utils import create_page_from_text
    from manuscript.correctors import CharLM

    # Create page from text with potential OCR errors
    page = create_page_from_text(["Привѣтъ міръ", "Тестовая строка"])

    # Apply correction (using default prereform model)
    corrector = CharLM()
    corrected = corrector.predict(page)

    # Extract corrected text
    for line in corrected.blocks[0].lines:
        text = " ".join(w.text for w in line.words)
        print(text)

Basic Usage
~~~~~~~~~~~

.. code-block:: python

    from manuscript import Pipeline
    from manuscript.correctors import CharLM

    # Create corrector with default preset
    corrector = CharLM()

    # Or create a corrector with a specific preset
    corrector = CharLM(weights="modern_charlm_g1")

    # Use in pipeline
    pipeline = Pipeline(corrector=corrector)
    result = pipeline.predict("document.jpg")

Advanced Configuration
~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

    from manuscript.correctors import CharLM

    # Fine-tune correction behavior
    corrector = CharLM(
        weights="prereform_charlm_g1",
        mask_threshold=0.05,      # Characters with confidence below this are correction candidates
        apply_threshold=0.95,     # Model must be at least this confident to apply a correction
        max_edits=2,              # Maximum edits per word
        min_word_len=4,           # Minimum word length to attempt correction
        device="cuda",            # Use GPU for inference
    )
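
One way to read these parameters is as a mask-then-apply gate. The sketch below
is purely illustrative and does not call ``CharLM``: the function
``plan_edits``, its input shapes, and the exact comparison rules are
assumptions made for this example, not the library's implementation.

.. code-block:: python

    def plan_edits(word, char_conf, proposals,
                   mask_threshold=0.05, apply_threshold=0.95,
                   max_edits=2, min_word_len=4):
        """Apply gated character substitutions to a single word.

        char_conf[i]  -- hypothetical confidence for word[i]
        proposals[i]  -- hypothetical (char, probability) suggested by a model
        """
        if len(word) < min_word_len:
            return word  # too short: leave the word untouched

        chars = list(word)
        edits = 0
        for i, conf in enumerate(char_conf):
            if edits >= max_edits:
                break  # edit budget for this word is used up
            if conf >= mask_threshold:
                continue  # confident enough: not a correction candidate
            new_char, prob = proposals[i]
            # Replace only when the suggestion is confident and differs
            if prob >= apply_threshold and new_char != chars[i]:
                chars[i] = new_char
                edits += 1
        return "".join(chars)

Under these assumptions, lowering ``mask_threshold`` makes fewer characters
eligible for correction, while raising ``apply_threshold`` makes the corrector
more conservative about the replacements it accepts.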

Using Custom Lexicon
~~~~~~~~~~~~~~~~~~~~

You can provide a lexicon (word list) so that known words are left uncorrected.
The lexicon can be a preset name, a path to a file, or a Python set:

.. code-block:: python

    from manuscript.correctors import CharLM

    # From preset
    corrector = CharLM(
        weights="prereform_charlm_g1",
        lexicon="prereform_words"  # Use preset lexicon
    )

    # From file
    corrector = CharLM(
        weights="prereform_charlm_g1",
        lexicon="path/to/words.txt"
    )

    # From set
    my_words = {"слово1", "слово2", "слово3"}
    corrector = CharLM(
        weights="prereform_charlm_g1",
        lexicon=my_words
    )

Training Custom Model
~~~~~~~~~~~~~~~~~~~~~

You can train ``CharLM`` on your own data, either from pairs of OCR output and
corrected text or from a plain word list:

.. code-block:: python

    from manuscript.correctors import CharLM

    # Train with OCR pairs dataset
    checkpoint_path = CharLM.train(
        pairs_path="ocr_pairs.csv",      # CSV with incorrect,correct columns
        charset_path="charset.txt",       # Allowed characters
        exp_dir="my_charlm_exp",
        epochs=50,
        batch_size=256,
    )

    # Train with word list (self-supervised)
    checkpoint_path = CharLM.train(
        words_path="words.txt",           # One word per line
        charset_path="charset.txt",
        exp_dir="my_charlm_exp",
    )
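
The input files for training can be prepared with the standard library. The
sketch below writes a pairs CSV and a word list; the helper
``write_training_files`` is hypothetical, and the header row
``incorrect,correct`` and the UTF-8 encoding are assumptions based on the
comments above. The format of ``charset.txt`` is not described here, so the
sketch does not generate it.

.. code-block:: python

    import csv
    from pathlib import Path

    def write_training_files(pairs, out_dir="."):
        """Write ocr_pairs.csv and words.txt from (incorrect, correct) pairs."""
        out = Path(out_dir)
        out.mkdir(parents=True, exist_ok=True)

        # Pairs file: one header row, then one (incorrect, correct) pair per row
        pairs_path = out / "ocr_pairs.csv"
        with pairs_path.open("w", encoding="utf-8", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["incorrect", "correct"])
            writer.writerows(pairs)

        # Word list: unique corrected words, one per line
        words = sorted({correct for _, correct in pairs})
        words_path = out / "words.txt"
        words_path.write_text("\n".join(words) + "\n", encoding="utf-8")

        return pairs_path, words_path

    pairs_path, words_path = write_training_files(
        [("мнръ", "міръ"), ("привѣтb", "привѣтъ")],
        out_dir="my_charlm_data",  # hypothetical directory name
    )

The returned paths could then be passed as ``pairs_path`` or ``words_path`` to
``CharLM.train``, provided the real file formats match these assumptions.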

Export to ONNX
~~~~~~~~~~~~~~

.. code-block:: python

    from manuscript.correctors import CharLM

    # Export trained model to ONNX
    CharLM.export(
        weights_path="exp/checkpoints/charlm_epoch_50.pt",
        vocab_path="exp/vocab.json",
        output_path="my_model.onnx",
    )