Markup Handling
HTML and rich-text documents contain markup (tags, attributes, inline formatting) that must not be translated. Falara uses a two-stage extraction and reassembly process to handle markup transparently.
The Problem
Sending raw HTML to a translation model risks:
- Tags being modified or translated
- Attribute values being altered
- Structural elements being reordered or dropped
The Solution: Typed Placeholders
Before translation, the markup processor extracts all non-translatable markup and replaces it with typed Unicode placeholders using the bracket characters ⟦ and ⟧ (U+27E6 / U+27E7).
Example:
Source HTML:
After extraction, the translator receives:
After translation into German:
After reassembly:
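The four stages above can be illustrated with a minimal Python sketch. This is not Falara's actual markup processor: the function names `extract` and `reassemble`, the `VOID_TAGS` set, and the numbering scheme are assumptions made for illustration, and the sketch only handles simple, well-formed tags. The sample HTML and its German rendering are likewise invented.

```python
import re

# Hypothetical sketch, not Falara's implementation.
# Assumes simple, well-formed HTML without comments or scripts.
VOID_TAGS = {"br", "img", "hr", "input"}
TAG_RE = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)([^<>]*)>")
PLACEHOLDER_RE = re.compile(r"⟦[^⟦⟧]+⟧")


def extract(html):
    """Replace each tag with a placeholder; return (text, placeholder -> tag)."""
    mapping = {}
    open_stack = []   # ids of paired tags that are currently open
    void_count = {}   # per-tag counter for void elements, e.g. br -> 1
    next_id = 1

    def repl(match):
        nonlocal next_id
        closing, name, _attrs = match.groups()
        name = name.lower()
        if name in VOID_TAGS:
            void_count[name] = void_count.get(name, 0) + 1
            token = f"⟦{name}{void_count[name]}⟧"
        elif closing:
            if not open_stack:
                return match.group(0)  # unmatched close: leave untouched
            token = f"⟦/{open_stack.pop()}⟧"
        else:
            open_stack.append(next_id)
            token = f"⟦{next_id}⟧"
            next_id += 1
        mapping[token] = match.group(0)
        return token

    return TAG_RE.sub(repl, html), mapping


def reassemble(text, mapping):
    """Swap every placeholder back for the original tag it stands for."""
    return PLACEHOLDER_RE.sub(lambda m: mapping[m.group(0)], text)


source = '<p>Click <a href="/docs">here</a>.<br>Thanks</p>'
for_translator, tags = extract(source)
# Stand-in for the translation step (illustrative German text).
translated = "⟦1⟧Klicken Sie ⟦2⟧hier⟦/2⟧.⟦br1⟧Danke⟦/1⟧"
print(for_translator)
print(reassemble(translated, tags))
```

Because the attribute value `href="/docs"` lives only in the mapping, the translator never sees it and cannot alter it.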
Placeholder Types
| Type | Description |
|---|---|
| Paired tags | Opening/closing element pairs — ⟦1⟧...⟦/1⟧ |
| Void elements | Self-closing tags (e.g. <br>, <img>) — ⟦br1⟧ |
| Variables | Template variables or expressions — preserved verbatim |
| Entities | HTML entities — preserved and restored |
| Format markers | Whitespace / structural markers |
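As a rough illustration of the two placeholder shapes the table spells out, the sketch below classifies tokens by their form. The regular expression and the `classify` helper are assumptions inferred from the examples `⟦1⟧`, `⟦/1⟧`, and `⟦br1⟧`; the table does not show the syntax for variables, entities, or format markers, so those are reported as `"unknown"` here.

```python
import re

# Hypothetical pattern inferred from the documented examples only.
PLACEHOLDER_RE = re.compile(r"⟦(/?)([A-Za-z]*)(\d+)⟧")


def classify(token):
    """Return 'paired_open', 'paired_close', 'void', or 'unknown'."""
    m = PLACEHOLDER_RE.fullmatch(token)
    if m is None:
        return "unknown"
    slash, name, _number = m.groups()
    if name:
        # A tag name inside the brackets marks a void element, e.g. ⟦br1⟧.
        return "unknown" if slash else "void"
    return "paired_close" if slash else "paired_open"


def find_placeholders(text):
    """List the placeholders in a segment, in order of appearance."""
    return [m.group(0) for m in PLACEHOLDER_RE.finditer(text)]


for token in find_placeholders("⟦1⟧Hello⟦br1⟧world⟦/1⟧"):
    print(token, classify(token))
```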
Transparency
The placeholder process is fully transparent to API consumers:
- `POST /v1/jobs` and `POST /v1/jobs/file`: you send the original content
- `GET /v1/jobs/{id}/result`: the `translation` field contains clean text with placeholders removed
- `GET /v1/jobs/{id}/download`: the file has markup fully reassembled
You never interact with placeholders directly.
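From the consumer's side, reading a result might look like the sketch below. Only the endpoint paths and the `translation` field name come from this page; the base URL, the helper names, and the assumption that the response is a JSON object with `translation` at the top level are hypothetical, and authentication is omitted.

```python
import json
from urllib.parse import quote


def result_url(base_url, job_id):
    """Build the URL for GET /v1/jobs/{id}/result."""
    return f"{base_url.rstrip('/')}/v1/jobs/{quote(job_id, safe='')}/result"


def download_url(base_url, job_id):
    """Build the URL for GET /v1/jobs/{id}/download."""
    return f"{base_url.rstrip('/')}/v1/jobs/{quote(job_id, safe='')}/download"


def extract_translation(response_body):
    """Read the translated text from a result response.

    Assumes a JSON object with a top-level "translation" field.
    """
    return json.loads(response_body)["translation"]


# Example with a canned response body; no network call is made.
body = '{"translation": "Klicken Sie hier."}'
print(result_url("https://api.example.com", "job_123"))
print(extract_translation(body))
```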
Blocked Segments
If a segment's markup cannot be safely reassembled after translation (e.g. the translator dropped a placeholder), the segment is marked as blocked. The job reaches `completed_with_blocks` instead of `completed`. Blocked segments retain their source text in the output.
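One way to picture the blocking rule is a check that the translated segment still carries exactly the placeholders the source had. The sketch below is an assumption about how such a check could work, not Falara's actual validation: it compares placeholder counts only and ignores ordering and nesting. The helper names `finalize_segment` and `job_status` are hypothetical; the status strings are the ones named above.

```python
import re
from collections import Counter

PLACEHOLDER_RE = re.compile(r"⟦[^⟦⟧]+⟧")


def placeholder_counts(text):
    """Count each placeholder occurring in a segment."""
    return Counter(PLACEHOLDER_RE.findall(text))


def finalize_segment(source, translated):
    """Return (output_text, blocked).

    A segment is blocked when its placeholders differ from the source's;
    a blocked segment keeps its source text.
    """
    if placeholder_counts(source) != placeholder_counts(translated):
        return source, True
    return translated, False


def job_status(segment_results):
    """Derive the job status from a list of (output_text, blocked) pairs."""
    if any(blocked for _text, blocked in segment_results):
        return "completed_with_blocks"
    return "completed"


results = [
    finalize_segment("⟦1⟧Hello⟦/1⟧", "⟦1⟧Hallo⟦/1⟧"),
    finalize_segment("⟦1⟧World⟦/1⟧", "⟦1⟧Welt"),  # closing placeholder dropped
]
print(results)
print(job_status(results))
```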