Markup Handling

HTML and rich-text documents contain markup (tags, attributes, inline formatting) that must not be translated. Falara uses a two-stage extraction and reassembly process to handle markup transparently.


The Problem

Sending raw HTML to a translation model risks:

  • Tags being modified or translated
  • Attribute values being altered
  • Structural elements being reordered or dropped

The Solution: Typed Placeholders

Before translation, the markup processor extracts all non-translatable markup and replaces it with typed Unicode placeholders delimited by the bracket characters ⟦ and ⟧ (U+27E6 / U+27E7).

Example:

Source HTML:

<p>This is <strong>important</strong> and <a href="#ref">linked</a>.</p>

After extraction, the translator receives:

This is ⟦1⟧important⟦/1⟧ and ⟦2⟧linked⟦/2⟧.

After translation into German:

Dies ist ⟦1⟧wichtig⟦/1⟧ und ⟦2⟧verlinkt⟦/2⟧.

After reassembly:

<p>Dies ist <strong>wichtig</strong> und <a href="#ref">verlinkt</a>.</p>
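The round trip above can be sketched in a few lines of Python. This is an illustrative sketch only, not Falara's actual markup processor: the function names `extract` and `reassemble` are hypothetical, it handles only paired inline tags, and it assumes the block-level wrapper (the <p> element in the example) has already been set aside.

```python
import re

# Matches a simple opening or closing tag, e.g. <strong>, </a>, <a href="#ref">.
TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^<>]*>")
# Matches a paired-tag placeholder such as ⟦1⟧ or ⟦/1⟧.
PLACEHOLDER_RE = re.compile(r"⟦(/?\d+)⟧")


def extract(inline_html):
    """Replace each paired inline tag with a numbered placeholder.

    Returns the translator-facing text and a mapping from placeholder
    key ("1", "/1", ...) to the original tag text.
    """
    tags = {}
    open_stack = []  # (tag name, id) for elements that are still open
    next_id = 0

    def replace(match):
        nonlocal next_id
        closing, name = match.group(1), match.group(2).lower()
        if closing:
            if not open_stack or open_stack[-1][0] != name:
                raise ValueError(f"unbalanced closing tag: {match.group(0)}")
            _, tag_id = open_stack.pop()
            key = f"/{tag_id}"
        else:
            next_id += 1
            open_stack.append((name, next_id))
            key = str(next_id)
        tags[key] = match.group(0)
        return f"⟦{key}⟧"

    text = TAG_RE.sub(replace, inline_html)
    if open_stack:
        raise ValueError("unclosed tag in segment")
    return text, tags


def reassemble(translated, tags):
    """Swap every placeholder back for the tag it stood in for."""
    return PLACEHOLDER_RE.sub(lambda m: tags[m.group(1)], translated)


source = 'This is <strong>important</strong> and <a href="#ref">linked</a>.'
text, tags = extract(source)
restored = reassemble("Dies ist ⟦1⟧wichtig⟦/1⟧ und ⟦2⟧verlinkt⟦/2⟧.", tags)
```

Because the tags are stored verbatim and never shown to the translator, attribute values such as href="#ref" cannot be altered during translation.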


Placeholder Types

Type            Description
Paired tags     Opening/closing element pairs: ⟦1⟧...⟦/1⟧
Void elements   Self-closing tags (e.g. <br>, <img>): ⟦br1⟧
Variables       Template variables or expressions, preserved verbatim
Entities        HTML entities, preserved and restored
Format markers  Whitespace / structural markers
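
As a rough illustration of how the documented placeholder shapes differ, the sketch below classifies a placeholder as an opening tag, a closing tag, or a void element. It is based only on the three forms shown above (⟦1⟧, ⟦/1⟧, ⟦br1⟧). The `classify` helper is hypothetical, and the placeholder syntax for variables, entities, and format markers is not documented here, so those fall through to "unknown".

```python
import re

# Optional slash, optional tag name, then a numeric id, e.g. ⟦1⟧, ⟦/1⟧, ⟦br1⟧.
TOKEN_RE = re.compile(r"⟦(/?)([A-Za-z]*)(\d+)⟧")


def classify(placeholder):
    """Return "open", "close", "void", or "unknown" for one placeholder."""
    match = TOKEN_RE.fullmatch(placeholder)
    if match is None:
        return "unknown"
    slash, name, _number = match.groups()
    if name:
        # Assumption: a named placeholder such as ⟦br1⟧ is a void element
        # and therefore never has a closing form.
        return "unknown" if slash else "void"
    return "close" if slash else "open"


kinds = [classify(p) for p in ("⟦1⟧", "⟦/1⟧", "⟦br1⟧")]
```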

Transparency

The placeholder process is fully transparent to API consumers:

  • POST /v1/jobs and POST /v1/jobs/file: you send the original content
  • GET /v1/jobs/{id}/result: the translation field contains clean text with placeholders removed
  • GET /v1/jobs/{id}/download: the file has markup fully reassembled

You never interact with placeholders directly.


Blocked Segments

If a segment's markup cannot be safely reassembled after translation (e.g. the translator dropped a placeholder), the segment is marked as blocked. The job reaches completed_with_blocks instead of completed. Blocked segments retain their source text in the output.
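
One way to picture the blocking rule is a placeholder consistency check. The sketch below is an assumption about how such a check could work, not a description of Falara's implementation: `resolve_segment` and `job_status` are hypothetical helpers, and the check only compares which placeholders appear, not their order or nesting.

```python
import re
from collections import Counter

# Any placeholder delimited by ⟦ and ⟧.
PLACEHOLDER_RE = re.compile(r"⟦[^⟦⟧]+⟧")


def placeholders(text):
    """Count each placeholder that occurs in the text."""
    return Counter(PLACEHOLDER_RE.findall(text))


def resolve_segment(source, translated):
    """Return (output_text, blocked) for one segment.

    If the translation dropped, duplicated, or invented a placeholder,
    the segment is blocked and the source text is kept.
    """
    if placeholders(source) != placeholders(translated):
        return source, True
    return translated, False


def job_status(segments):
    """Derive the job status from a list of (output_text, blocked) pairs."""
    if any(blocked for _text, blocked in segments):
        return "completed_with_blocks"
    return "completed"


src = "This is ⟦1⟧important⟦/1⟧."
ok = resolve_segment(src, "Dies ist ⟦1⟧wichtig⟦/1⟧.")
bad = resolve_segment(src, "Dies ist ⟦1⟧wichtig.")  # closing placeholder dropped
```

Here `bad` keeps the English source text and is flagged as blocked, so a job containing it would end in completed_with_blocks.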