Configuring the XML Filter
The XML Filter controls how GlobalSight presents data in XML files to translators.
For example, the filter can define which parts of the file are:
- Translatable, such as normal content
- Non-translatable, such as HTML markup
You can configure the XML filter using the settings described on this page.
Contents
- 1 Configuring the XML Filter
- 2 Using the Filter
- 3 Filter Settings
- 3.1 XML Rule
- 3.2 Convert HTML Entity
- 3.3 Extended Whitespace Characters
- 3.4 Placeholder Consolidation
- 3.5 Placeholder Trimming
- 3.6 Save non-ASCII Characters As
- 3.7 Whitespace Handling
- 3.8 Element post-filter
- 3.9 CDATA post-filter
- 3.10 SID Support
- 3.11 Check Well-formedness
- 3.12 Empty Tag Format
- 3.13 Generate Language Information
- 3.14 Tag Management
Configuring the XML Filter
To configure the XML filter:
- Log in as Project Manager
- Select Data Source->Filter Configuration
- Click Add in the Xml Filter row
- Configure the new XML filter
- Click Save to save this filter
- Create a file profile with an XML filter
Using the Filter
You can choose different options for using the filter:
- Using an XML rule
- Using XML filter settings
In this example XML file, only the ccc element, including its attribute value, should be translated when the file is imported into GlobalSight.
<?xml version="1.0"?>
<sample>
  <aaa>This is aaa text.</aaa>
  <bbb>This is bbb text.
    <bbb2>This is bbb2 text.</bbb2>
  </bbb>
  <ccc sampleAttribute="MySampleAttribute">
    This is ccc text.
  </ccc>
</sample>
Using an XML Rule
To use an XML rule:
- Select Data Source->XML Rule to create an XML Rule
- Add the following rule content:
<?xml version="1.0"?>
<schemarules>
  <ruleset schema="sample">
    <dont-translate path="/sample/aaa"/>
    <dont-translate path="/sample/bbb"/>
    <dont-translate path="/sample/bbb//*"/>
    <translate path="/sample/ccc/@*"/>
    <translate path="/sample/ccc" inline="yes"/>
  </ruleset>
</schemarules>
- Tie the XML Rule to an XML filter
- Select File Profile->Filter to import the sample XML
Using XML Filter Settings
To use XML filter settings:
- Select Data Source->Filter Configuration to create an XML filter
- Add aaa, bbb and bbb2 to Content Inclusion Tags with Exclude type
- Add ccc to Translatable Attribute Tags
- Select File Profile->Filter to configure the XML filter to import the sample XML file
Filter Settings
The XML filter contains several features that can be applied during configuration.
XML Rule
You can select an XML rule and tie it to a file profile. This rule is then applied to a job.
Convert HTML Entity
You can choose whether an HTML entity is kept as an entity or converted to a character when the job file is exported. For example:
<p>I am testing <b>Entity</b>.</p>
To keep the entity as an entity in the exported file, uncheck the option. The result is:
<p>I am testing <b>Entity</b>.</p>
To convert the entity to a character in the exported file, check the option. The result is:
<p>I am testing <b>Entity</b>.</p>
Extended Whitespace Characters
You can designate non-whitespace characters as being whitespace. For example, your content can designate a separator using:
<p>*****--*****</p>
To hide the separator from the translator, enter * - in the field.
By default, only standard whitespace characters such as tabs, new lines and spaces are considered whitespace. Whitespace is appended to the markup and is not shown as a separate text segment.
Characters that you add to this field are treated as whitespace in addition to the standard whitespace characters. Segments that consist only of these characters become whitespace segments, which are treated as markup and are ignored during translation.
The field should contain a set of characters separated by spaces. For example, > < ¤ - ).
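As a rough illustration of the behavior described above, the following Python sketch decides whether a segment counts as whitespace once extra characters are configured. The function names and the way the field value is parsed are assumptions made for this example; they are not GlobalSight code.

```python
def parse_extended_whitespace(field):
    """Split the space-separated field value into a set of extra characters."""
    return set(field.split())


def is_whitespace_segment(text, extra):
    """Return True if text contains only standard whitespace or extra characters."""
    return all(ch.isspace() or ch in extra for ch in text)


extra = parse_extended_whitespace("* -")
print(is_whitespace_segment("*****--*****", extra))       # True
print(is_whitespace_segment("Chapter 1 - Intro", extra))  # False
```

With "* -" in the field, the separator line is classified as whitespace and would be hidden from the translator, while ordinary text that merely contains a hyphen is not.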
Placeholder Consolidation
You can consolidate placeholders adjacent to each other into one placeholder, or treat each one separately. For example, the following content is treated as having four placeholders by default:
<b><i> testing, testing... </b></i>
When the placeholders are consolidated, the content is treated as having two placeholders, one for <b><i> and one for </b></i>.
Options for placeholder consolidation apply regardless of how the placeholders were generated. They can be either embeddable tags or parts of translatable tags. Entities never become placeholders. Select one of the following from the Placeholder Consolidation list:
- Do not consolidate: each embeddable tag in a sequence of adjacent tags is treated as a separate placeholder
- Consolidate adjacent: a sequence of adjacent embeddable tags is treated as one placeholder
- Consolidate adjacent ignore whitespace: a sequence of adjacent embeddable tags is treated as one placeholder, but also merges in the whitespace between the adjacent placeholders
Use the Consolidate Adjacent or Consolidate Adjacent Ignore Whitespace option depending on the format of the file. It is easier for translators when as few placeholders as possible are exposed, but do not ignore the whitespace when a translator needs to enter new text between the tags.
In this example, <b> is an embeddable tag:
This is a <b>nice</b> <b>green</b> hat.
is parsed as:
This is a [g1]nice[/g1] [g2]green[/g2] hat.
For this to be translated as:
This is a [g1]nice[/g1] and [g2]green[/g2] hat.
you cannot consolidate placeholders in a way that merges the whitespace between them.
In the following text, however, it is clear that the whitespace between the tags is markup:
See also <href><book><xref ref="ref">Open File</xref></book></href>
For this example, the best option is Consolidate adjacent ignore whitespace, which produces two placeholders instead of six.
In general, when the whitespace between tags is considered markup and can be ignored in all cases, use the Consolidate adjacent ignore whitespace option.
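The three consolidation options can be sketched as a simple placeholder counter. This is a minimal Python illustration under stated assumptions: every tag is treated as embeddable, tags are found with a naive regular expression, and the mode names "none", "adjacent" and "adjacent_ignore_ws" are hypothetical labels for the three list entries.

```python
import re

# A token is either a tag such as <b> or </b>, or a run of text.
TOKEN = re.compile(r"<[^>]+>|[^<]+")


def count_placeholders(content, mode="none"):
    """Count placeholders for 'none', 'adjacent' or 'adjacent_ignore_ws'."""
    count = 0
    in_run = False  # True while the previous token belongs to a run of tags
    for token in TOKEN.findall(content):
        if token.startswith("<"):
            if mode == "none" or not in_run:
                count += 1
            in_run = True
        elif mode == "adjacent_ignore_ws" and token.isspace():
            continue  # whitespace between tags is merged into the current run
        else:
            in_run = False
    return count


sample = "<b><i> testing, testing... </b></i>"
print(count_placeholders(sample, "none"))      # 4
print(count_placeholders(sample, "adjacent"))  # 2
```

For the sentence with two separate <b> elements, the "adjacent_ignore_ws" mode merges the closing and opening tags around the space into one placeholder, which is why that option is unsuitable when a translator needs to insert text there.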
Placeholder Trimming
You can merge leading and trailing embeddable markup into the adjacent markup segments. Use placeholder trimming when a translator should not add content before or after leading and trailing segment placeholders.
In this demonstration, <b> and <a> are embeddable tags and <p> is a markup tag.
Input:
<p><a href="#">Click_<b>here</b>.</a></p>
Output without placeholder trimming (the default):
Markup segment: <p>
Text segment: [g1]Click_[g2]here[/g2].[/g1]
Markup segment: </p>
Output with placeholder trimming:
Markup segment: <p><a href="#">
Text segment: Click_[g1]here[/g1].
Markup segment: </a></p>
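The trimming step can be pictured as peeling the leading and trailing tags off a text segment and handing them to the neighboring markup segments. The Python sketch below is an illustration only; the function name and the naive tag-matching expression are assumptions, and it works on raw tags rather than on numbered placeholders.

```python
import re

# A token is either a tag or a run of text.
TOKEN = re.compile(r"<[^>]+>|[^<]+")


def trim_placeholders(segment):
    """Split a segment into (leading markup, remaining text, trailing markup)."""
    tokens = TOKEN.findall(segment)
    start, end = 0, len(tokens)
    while start < end and tokens[start].startswith("<"):
        start += 1
    while end > start and tokens[end - 1].startswith("<"):
        end -= 1
    return "".join(tokens[:start]), "".join(tokens[start:end]), "".join(tokens[end:])


leading, text, trailing = trim_placeholders('<a href="#">Click_<b>here</b>.</a>')
print(leading)   # <a href="#">
print(text)      # Click_<b>here</b>.
print(trailing)  # </a>
```

Only the inner <b> element is left in the text, so the translator sees one placeholder pair instead of two.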
Save non-ASCII Characters As
You can specify whether to save non-ASCII characters as characters (the default) or as numeric entities. By default, the filter passes all characters through from source to target unless the character has an entity defined for it. If the encoding of the target file does not support a character, then the output file is invalid.
This is a useful option for writing content in Unicode but storing the output in a target encoding that does not support all of the Unicode characters. For example, this is the case when writing Chinese and storing the content in ASCII. To save double-byte characters in ISO-encoded files, choose numeric entity.
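The effect of the numeric entity option can be reproduced with Python's standard `xmlcharrefreplace` error handler, which replaces any character the target encoding cannot represent with a decimal character reference. The helper name below is hypothetical, and this is a sketch of the idea rather than the filter's own implementation.

```python
def save_as_numeric_entities(text, encoding="ascii"):
    """Encode text, writing unsupported characters as numeric entities."""
    return text.encode(encoding, errors="xmlcharrefreplace")


print(save_as_numeric_entities("中文 text"))  # b'&#20013;&#25991; text'
```

Without the error handler, encoding the same string as ASCII raises `UnicodeEncodeError`, which corresponds to the invalid output file described above.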
Whitespace Handling
In HTML and XML, whitespace is not significant by default. This means that parsers usually interpret any amount of whitespace as a single space. In HTML, this is why explicit constructs, such as a non-breaking space entity, are used for spacing.
Exceptions apply. For example, the xml:space="preserve" attribute indicates that whitespace should not be normalized, and some other software expects whitespace not to be normalized, that is, not collapsed. For more details on the former and similar cases, see the "Preserve Whitespace Tags" section. For more details on the latter, see the Preserve option button in this section.
For filtering, whitespace normalization is usually limited to trimming whitespace at the start and end of segments. This whitespace is then re-inserted when saving the target asset through a process called whitespace repair. Additional normalization of the whitespace occurs between words within a single segment, but this happens in TM leveraging, which is outside of the filter processing.
The way that GlobalSight filters handle whitespace affects what is presented for translation.
- In HTML and XML, whitespace normally does not matter.
- Normalizing whitespace makes ICE matching space-insensitive. Without normalization, an ICE match is downgraded to a 100% match if the formatting changes by as little as one added whitespace character
- Preserving the formatting on multi-line entries makes translation difficult. In the user interface, elements are inserted in edit boxes. If whitespace is not trimmed, the content can show as:
I am a long sentence
- Preserving the same formatting on the target side is nearly impossible, since the translator has to count the spaces and insert them accurately. If the translation requires the rearranging of words, preserving the whitespace is even more difficult
Option buttons: these specify whether the filter collapses multiple whitespace characters into a single whitespace element, or preserves the original amount of whitespace.
A character is considered a whitespace character if it is:
- A Unicode space character such as SPACE_SEPARATOR, LINE_SEPARATOR, or PARAGRAPH_SEPARATOR, that is not a non-breaking space ('\u00A0', '\u2007', '\u202F')
- One of the following:
- '\u0009', HORIZONTAL TABULATION
- '\u000A', LINE FEED
- '\u000B', VERTICAL TABULATION
- '\u000C', FORM FEED
- '\u000D', CARRIAGE RETURN
- '\u001C', FILE SEPARATOR
- '\u001D', GROUP SEPARATOR
- '\u001E', RECORD SEPARATOR
- '\u001F', UNIT SEPARATOR
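The definition above can be written down directly. The Python sketch below tests a character against the listed rules and then uses that test to collapse runs of whitespace into a single space, as the collapse option does. The function names are assumptions for illustration, and the mapping of SPACE_SEPARATOR, LINE_SEPARATOR and PARAGRAPH_SEPARATOR to the Unicode categories Zs, Zl and Zp is the standard one.

```python
import unicodedata

NON_BREAKING = {"\u00A0", "\u2007", "\u202F"}
CONTROL_WHITESPACE = {
    chr(code) for code in (0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x1C, 0x1D, 0x1E, 0x1F)
}


def is_whitespace(ch):
    """Apply the whitespace definition listed above to one character."""
    if ch in CONTROL_WHITESPACE:
        return True
    # Zs = SPACE_SEPARATOR, Zl = LINE_SEPARATOR, Zp = PARAGRAPH_SEPARATOR
    return unicodedata.category(ch) in ("Zs", "Zl", "Zp") and ch not in NON_BREAKING


def collapse_whitespace(text):
    """Replace every run of whitespace characters with a single space."""
    out = []
    previous_was_space = False
    for ch in text:
        if is_whitespace(ch):
            if not previous_was_space:
                out.append(" ")
            previous_was_space = True
        else:
            out.append(ch)
            previous_was_space = False
    return "".join(out)


print(collapse_whitespace("I  am \t\n a long   sentence"))  # I am a long sentence
```

Non-breaking spaces are deliberately left untouched, so they survive collapsing.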
Element post-filter
You can specify the filter used to process non-CDATA content. Select the post-filter that corresponds to the content type. For example, if your XML content contains escaped HTML code, select the HTML filter. If the Element post-filter option is set, the text content of all XML elements in the source document is passed on to the specified post-filter.
The post-filter choices are:
- Blank – by default. The content is treated as markup and is not exposed for translation. For all other choices, the content is passed through the specified filter
- HTML filter
- JavaScript filter
A post-filter always uses the default configuration for the filter you select to be the post-filter.
CDATA post-filter
You can specify the filter used to process CDATA containing other content types. Choose the post-filter that corresponds to the content type.
If the CDATA post-filter option is set, the contents of all CDATA blocks in the source XML document are passed on to the post-filter specified.
The post-filter options are:
- Blank – by default. The CDATA content is treated as markup and is not exposed for translation. For all other choices, the CDATA content is passed through the filter specified
- HTML filter
- JavaScript filter
A post-filter always uses the default configuration for the filter you select to be the post-filter.
SID Support
This allows the filter to associate an ID with a segment. The ID is used to find the most appropriate ICE match. The SID is assigned to all textual segments within the scope of the tag that matches the SID configuration. The matching process examines the SID and the content. The SID narrows the context of the match, but does not have to be unique.
Define an attribute on specific elements that are used to provide the segment identifier value (SID) for segments surrounded by this tag. Provide a tag name and an attribute name for SID support. Both names can be regular expressions.
For example, if the tag is p|li and the attribute is id.*, then whenever a tag named p or li has an attribute whose name starts with id, the value of that attribute is used as the SID value. If the XML source has IDs assigned for paragraphs and list items, specifying p|li for Tag Name and id.* for Attribute Name allows you to use the resulting SIDs to determine ICE matches. You can use regular expressions for SIDs on all tags that have IDs by specifying .* for Tag Name and ID for Attribute Name.
This important feature reduces the amount of material for translation by determining ICE matches using SIDs.
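To make the p|li and id.* example concrete, the Python sketch below collects the SID values that such a configuration would pick up. It is an illustration only: the function name and sample document are invented, and using a full match of the pattern against the tag and attribute names is an assumption about how the expressions are applied.

```python
import re
import xml.etree.ElementTree as ET


def find_sids(xml_text, tag_pattern, attr_pattern):
    """Return (tag, SID value) pairs for elements matching both patterns."""
    results = []
    for element in ET.fromstring(xml_text).iter():
        if not re.fullmatch(tag_pattern, element.tag):
            continue
        for name, value in element.attrib.items():
            if re.fullmatch(attr_pattern, name):
                results.append((element.tag, value))
                break  # the first matching attribute supplies the SID
    return results


sample = (
    '<doc><p id="intro">Hello</p>'
    '<li idref="x1">Item</li>'
    '<note id="n">Skip</note></doc>'
)
print(find_sids(sample, "p|li", "id.*"))  # [('p', 'intro'), ('li', 'x1')]
```

The note element is skipped because its tag name does not match, even though it carries an id attribute.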
Check Well-formedness
The XML filter checks that the source XML asset is well-formed before segmenting it. It also checks the target XML asset after saving it. In cases where the XML contains references to external entities, this check fails and prevents the filter from running. In this case, you can disable it by de-selecting this option. In most cases, leave this option enabled to verify that the target assets are created correctly.
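A well-formedness check of this kind can be approximated with any XML parser. The Python sketch below uses the standard library parser and a hypothetical helper name; it is not the check that GlobalSight runs, but it shows the same two outcomes, including the failure on an entity reference that the parser cannot resolve.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the text parses as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed("<a><b/></a>"))      # True
print(is_well_formed("<a><b></a>"))       # False, <b> is never closed
print(is_well_formed("<a>&custom;</a>"))  # False, the entity is not defined
```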
Empty Tag Format
Empty XML elements can be represented in the exported job file in two different ways. For example, an empty element is:
<msg name="1b1FontTimes"></msg>
When “Open(<tag></tag>)” is checked, the result is:
<msg name="1b1FontTimes"></msg>
When “Open(<tag/>)” is checked, the result is:
<msg name="1b1FontTimes"/>
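The two representations are equivalent XML, and most serializers can produce either one. As an illustration outside GlobalSight, Python's ElementTree exposes the same choice through its `short_empty_elements` parameter. Note that ElementTree writes a space before the closing slash, a detail of that library only.

```python
import xml.etree.ElementTree as ET

element = ET.fromstring('<msg name="1b1FontTimes"></msg>')

# Open form: separate start and end tags.
open_form = ET.tostring(element, encoding="unicode", short_empty_elements=False)

# Closed form: a single self-closing tag.
closed_form = ET.tostring(element, encoding="unicode", short_empty_elements=True)

print(open_form)    # <msg name="1b1FontTimes"></msg>
print(closed_form)  # <msg name="1b1FontTimes" />
```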
Generate Language Information
When checked, the filter inserts language information using the xml:lang attribute when generating the target documents. The value of the xml:lang attribute is determined by the target locale. The following rules are used:
- Add the attribute to the root element in the document
- If any elements in the document already have an xml:lang attribute, replace the value of xml:lang with the target locale if the content of that element is translatable
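The two rules can be sketched in a few lines of Python. This is a simplified illustration: the function name and the locale format fr-FR are assumptions, and the sketch treats every element as translatable, whereas the filter applies the second rule only to translatable content.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def set_target_language(xml_text, target_locale):
    """Set xml:lang on the root and update any existing xml:lang values."""
    root = ET.fromstring(xml_text)
    root.set(XML_LANG, target_locale)      # rule 1: always set on the root
    for element in root.iter():
        if XML_LANG in element.attrib:     # rule 2: replace existing values
            element.set(XML_LANG, target_locale)
    return ET.tostring(root, encoding="unicode")


print(set_target_language('<doc><p xml:lang="en-US">Hi</p><p>Bye</p></doc>', "fr-FR"))
```

In the result, the root element and the first paragraph carry the target locale, while the second paragraph, which had no xml:lang attribute, is left unchanged.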
Tag Management
To manage tags, select one of the following from the Select a section list.
Embeddable Tags
A set of tags that are treated as embeddable markup. Tags listed here become embeddable markup and are replaced by a placeholder when they are found within the textual context. Tags found within the markup are retained as markup.
To add an embeddable tag:
For tags without attributes, for example, tag <b>:
You should <b>never</b> install without first performing a backup.
Enter b as the tag name.
For tags with attributes and where you want text between the tags, fill the attribute value as a condition.
For example:
You can read <book name="Book One">Book1</book> and don't read <book name="Book Two">Book2</book>.
The <book> tag has different attributes. Only the tag with the attribute “Book One” is embeddable. The tag setting is:
Opening it in popup editor:
Embedded tags, or "inline" tags as they are also known, that are not in the list are treated as the start of a new segment. For example, the following sentence in XML:
You should <b>never</b> install without first performing a backup.
appears as follows when you open the asset in Popup editor:
After you add the <b> tag to the Embeddable Tags list, the same sentence appears as follows:
The sentence reads more easily as one segment, because the inline markup is now in the Embeddable Tags list.
Translatable attribute tags
A set of tags can have attributes that contain translatable content. Tags with at least one of the defined translatable attributes can be configured as textual content, surrounded by placeholders of the tag itself. They can also form a separate segment. This configuration allows for conditional attributes.
To add a translatable attribute tag:
For the Book One attribute in the sentence:
See characters <book name="Book One">Open File</book>.
Enter the tag name book and the translatable attribute “name”. The tag setting is:
Opened in Popup editor:
Content inclusion tags
A set of tags marking content to be excluded or included. By default, all content of the file is parsed, or included. Everything within the tag is excluded when in the exclude set. The parsing mode can be switched back on using an include tag. This configuration allows for attribute conditioning.
For example, using <exclude> as an exclude tag and <include> as an include tag:
Translate me <exclude> Skip me <include> But translate me </include> Skip me too </exclude> Translate me please
After the filtering process, this text is segmented as follows:
- Translatable text segment
Translate me
- Markup segment
<exclude>
Skip me
<include>
- Translatable text segment
But translate me
- Markup segment
</include>
Skip me too
</exclude>
- Translatable text segment
Translate me please
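The segmentation above can be reproduced with a small stack-based sketch in Python. It is a simplified illustration rather than the filter's algorithm: the function name is invented, tags are recognized with a naive regular expression, attribute conditions are ignored, and adjacent markup is merged into one segment.

```python
import re

# Group 1 holds the tag name for tags and is None for text runs.
TOKEN = re.compile(r"</?(\w+)[^>]*>|[^<]+")


def segment(content, exclude=("exclude",), include=("include",)):
    """Return a list of ('text' | 'markup', value) segments."""
    segments = []
    translating = [True]  # stack of parsing modes; the top is the current mode
    for match in TOKEN.finditer(content):
        token, name = match.group(0), match.group(1)
        if name is None and translating[-1] and token.strip():
            segments.append(("text", token.strip()))
            continue
        if name is not None and name in exclude + include:
            if token.startswith("</"):
                if len(translating) > 1:
                    translating.pop()  # leave the excluded or included region
            else:
                translating.append(name in include)
        # Everything else is markup; merge it with a preceding markup segment.
        if segments and segments[-1][0] == "markup":
            segments[-1] = ("markup", segments[-1][1] + token)
        else:
            segments.append(("markup", token))
    return segments


sample = ("Translate me <exclude> Skip me <include> But translate me "
          "</include> Skip me too </exclude> Translate me please")
for kind, value in segment(sample):
    print(kind, "|", value)
```

Running the sketch on the example yields three text segments separated by two markup segments, in the same order as the list above.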
A tag can be both excluded and embedded. This causes the content of the excluded tag to form an embeddable placeholder. For example, an index tag that is defined as excluded and embedded with the following content:
This section describes the <index>engine:booster</index>engine booster.
The text within the index tag is inserted into the placeholder. The content is parsed as follows:
This section describes the [g1][/g1] engine booster.
To add a content inclusion tag:
An example tag setting is:
Opened in popup editor:
Entities
An entity is a mapping between a special character (for example, <) and a name (for example, lt). Defining an entity allows the filter to:
- Convert the named entity (&lt;) to its mapped character during parsing
- Convert the mapped character to the named entity during saves
To add an entity tag:
An entity itself never needs to be translated. For example, with the following sentence:
This section describes the &copy; entity.
- When the entities are treated as “PlaceHolder”:
The sentence is parsed as: This section describes the [x1] entity.
- When the entity is treated as “Text” and saved as “Entity”, you need to enter the correct entity code:
The correct character “©” shows in the Character column in the Entities tag list. The entity remains an entity in the exported job file because the Save As value is Entity. The entity is protected as a placeholder so that the translator cannot change it by mistake.
- When the entity is treated as “Text” and saved as “Character”, you need to enter the correct entity code:
The sentence is parsed as:
This section describes the © entity.
The entity is exported as a character in the exported job file because the “Save As” value is “Character”.
Processing Instructions
By default, the filter treats processing instructions as markup, which forces segment breaks. For example, Epic editor inserts a processing instruction that indicates the last position of the cursor. XML parsers treat these instructions in a similar way to comments. You can change how GlobalSight handles them during segmentation.
To add a processing instruction:
All processing instructions, including the XML declaration, begin with <? and end with ?>. The name of the processing instruction follows the initial <?. In this example:
This <?piname key="value" ?> is an instruction processing.
The name of the processing instruction is “piname”.
- Using the handling mode “As Markup” works like an excluded tag. The setting is:
Opened in popup editor:
- Using the handling mode “As Embeddable Markup”, the sentence is parsed as:
This [x1] is an instruction processing.
- Using the handling mode “Remove from Target”, the sentence is parsed as:
This is an instruction processing.
The processing instruction is replaced by one space and removed from the exported file.
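The handling modes can be compared with a short Python sketch. The mode names "markup", "embeddable" and "remove" are hypothetical labels for the three options, and the regular expression is a simplification that assumes the instruction name consists of letters, digits, dots, hyphens or underscores.

```python
import re

# Group 1 captures the processing instruction name.
PI = re.compile(r"<\?([\w.-]+).*?\?>", re.DOTALL)


def handle_pi(text, name, mode):
    """Rewrite processing instructions called name according to mode."""
    counter = 0

    def replace(match):
        nonlocal counter
        if match.group(1) != name:
            return match.group(0)   # other instructions stay as they are
        if mode == "embeddable":
            counter += 1
            return f"[x{counter}]"  # shown to the translator as a placeholder
        if mode == "remove":
            return " "              # replaced by one space
        return match.group(0)       # "markup": left in place as markup

    return PI.sub(replace, text)


sentence = 'This <?piname key="value" ?> is an instruction processing.'
print(handle_pi(sentence, "piname", "embeddable"))
# This [x1] is an instruction processing.
```

In "remove" mode the instruction is replaced by a single space, so the surrounding spaces remain until whitespace is normalized.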
Preserve Whitespace Tags
Whitespace in element text is usually normalized to a single space during segmentation. To preserve the original whitespace for the contents of one or more elements, add them to the “Preserve Whitespace Tags” list. For example, the XML snippet:
<text>oddly    spaced      text</text>
is usually normalized to:
<text>oddly spaced text</text>
If the text element is added to Preserve Whitespace Tags, it is segmented as:
<text>oddly    spaced      text</text>
To preserve whitespace tags:
This is similar to the Whitespace Handling setting of the XML filter. The difference is that the Whitespace Handling setting applies to all translatable text content.
To handle other content whitespace as a single whitespace and still preserve the whitespace in the <text> element, the setting is:
Whitespace Handling is set to Collapse into Single Whitespace.
CDATA post-filter Tags
Special functions in CDATA that need special handling are set here. Regular expressions can be used for all of the CDATA content.
To add a CDATA post-filter tag:
This CDATA only contains a function and does not need to be translated.
<Value><![CDATA[ function selectPopulation() { … … } ]]></Value>
Using a regular expression rule, the CDATA content can be expressed as “\s*function\s+.*”. To use the feature correctly, a basic knowledge of regular expressions is needed. The setting is:
The CDATA is not extracted in the job.
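The rule can be tried out before it is entered in the filter. The Python sketch below finds the CDATA blocks whose whole content matches the expression. Requiring a full match and letting the dot match line breaks are assumptions about how the filter applies the rule, and the helper name and sample function body are invented for the example.

```python
import re

CDATA = re.compile(r"<!\[CDATA\[(.*?)\]\]>", re.DOTALL)
RULE = re.compile(r"\s*function\s+.*", re.DOTALL)


def cdata_to_skip(xml_text):
    """Return the CDATA bodies that match the rule and are not extracted."""
    return [body for body in CDATA.findall(xml_text) if RULE.fullmatch(body)]


sample = (
    "<Value><![CDATA[ function selectPopulation() { return 1; } ]]></Value>"
    "<Value><![CDATA[Translate me]]></Value>"
)
print(cdata_to_skip(sample))
# [' function selectPopulation() { return 1; } ']
```

Only the block that starts with a function definition is matched, so the second CDATA block would still be passed on for extraction.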