Removing HTML and Unicode

Removing HTML formatting

When you look at an XHTML report, you usually see only the part intended for humans: headings, paragraphs, numeric values, or even complex tables. The layout of a report is determined by the Hypertext Markup Language (HTML).

For example, HTML elements can be used to specify the size of a text or the color it should have. Tables with their rows and columns are also composed of various HTML elements, which are written using so-called HTML tags.

When tagging text blocks, it can easily happen that not only the actual value is tagged, but also part of the HTML formatting. This is even unavoidable when tagging multiple paragraphs or tables.

However, when checking the tagging, this formatting information is very distracting. From a technical point of view, it is part of the fact, but the actual value is often buried in a collection of formatting tags such as <div>, <span>, <td>, <tr> and many others.

In many places, the software tries to suppress the display of HTML formatting and show only the values. This happens mainly where values are displayed in tables or lists. The full tagged fact can be viewed via the "Show Full Fact Value" function in the context menu of the presentation view or the table view. If a fact contains HTML formatting, the display of that formatting can be toggled on and off in the window.


[Figures: a fact value including the HTML formatting; a fact value with the HTML formatting removed]
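The idea of showing only the value can be illustrated with a minimal Python sketch. It is not how the software itself removes formatting; the function name `strip_html` and the sample fact value are assumptions made for this example, and only the standard library's `html.parser` module is used.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment and ignores the tags."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into plain characters.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(fact_value: str) -> str:
    """Return the fact value without tags and with whitespace collapsed.

    Hypothetical helper for illustration only.
    """
    parser = _TextExtractor()
    parser.feed(fact_value)
    parser.close()
    return " ".join("".join(parser.parts).split())


# Made-up fact value, similar to what a tagged text block might contain.
raw = '<div><span style="font-weight:bold">Revenue</span> <span>1,234</span></div>'
print(strip_html(raw))  # Revenue 1,234
```

Note that this sketch joins neighbouring text nodes without a separator, so the contents of two adjacent table cells would run together. A real implementation would need a rule for such cases.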

Most HTML tags occur as a pair that encloses a value. This sometimes makes it difficult to remove HTML formatting completely: if only one half of a tag pair is included in the tagging, it is hard to distinguish formatting from value afterwards. If you still find HTML formatting in the program, please contact us.
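To make the problem of incomplete tag pairs more concrete, the following Python sketch reports which tags in a fragment have lost their partner. It is an illustration under assumptions, not the software's own logic: the function `unbalanced_tags`, the list of void elements, and the sample fragment are invented for this example.

```python
from html.parser import HTMLParser

# Elements that never have a separate closing tag (a partial list, chosen for this example).
VOID_ELEMENTS = {"br", "hr", "img", "input", "meta", "link", "col", "wbr"}


class _TagBalance(HTMLParser):
    """Tracks start tags without an end tag and end tags without a start tag."""

    def __init__(self):
        super().__init__()
        self.open_tags = []      # start tags still waiting for their end tag
        self.stray_closers = []  # end tags whose start tag is not in the fragment

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_ELEMENTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS:
            return
        if tag in self.open_tags:
            # Remove the most recently opened tag with this name.
            index = len(self.open_tags) - 1 - self.open_tags[::-1].index(tag)
            del self.open_tags[index]
        else:
            self.stray_closers.append(tag)


def unbalanced_tags(fragment: str):
    """Return (unclosed start tags, stray end tags) for a fragment."""
    parser = _TagBalance()
    parser.feed(fragment)
    parser.close()
    return parser.open_tags, parser.stray_closers


# A tagged span that starts in one table cell and ends in the next one.
unclosed, stray = unbalanced_tags("1,234</span></td><td><span>5,678")
print(unclosed)  # ['td', 'span']
print(stray)     # ['span', 'td']
```

A fragment that is cut in the middle of a tag, so that only part of the tag name remains, is not recognised by this approach and would be treated as ordinary text.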

Removing custom defined Unicode characters

Unicode is an international standard for the uniform electronic storage of text. Its character set contains not only the Latin alphabet used in Germany but also many others, for example the Hebrew alphabet or Japanese kanji. In total, the Unicode Standard includes about 145,000 different characters (as of version 14.0; the number grows with each release).

In addition, the Unicode Standard reserves sections, the Private Use Areas, whose characters can be freely defined by users. There you can "invent" your own characters, so to speak. Designers sometimes do this, for example to create a space that is a bit narrower or a dot that is larger.
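The freely definable sections are fixed ranges of code points, so a character can be checked against them directly. The following Python sketch shows one way to do this; the names `PRIVATE_USE_RANGES` and `is_private_use` are assumptions made for this example.

```python
# The three Private Use Areas defined by the Unicode Standard.
PRIVATE_USE_RANGES = (
    (0xE000, 0xF8FF),      # Private Use Area in the Basic Multilingual Plane
    (0xF0000, 0xFFFFD),    # Supplementary Private Use Area-A (plane 15)
    (0x100000, 0x10FFFD),  # Supplementary Private Use Area-B (plane 16)
)


def is_private_use(char: str) -> bool:
    """Return True if the single character lies in a Private Use Area."""
    code_point = ord(char)
    return any(low <= code_point <= high for low, high in PRIVATE_USE_RANGES)


print(is_private_use("\uE001"))  # True: a custom defined character
print(is_private_use("\u2009"))  # False: THIN SPACE is a regular Unicode character
```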

However, not every system can display every character. When a character cannot be displayed, a placeholder is usually shown instead, often some kind of rectangle. The software includes a function that replaces unknown Unicode characters in the tables and lists with spaces.
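A replacement of this kind could look roughly like the following Python sketch. It assumes that "unknown" means a character from a Private Use Area, which the standard library's `unicodedata` module reports as category `Co`; how the software actually decides which characters to replace is not described here, and the function name `replace_private_use` is invented for this example.

```python
import unicodedata


def replace_private_use(text: str, replacement: str = " ") -> str:
    """Replace every private-use character (category 'Co') with a space.

    Hypothetical helper; the category check is an assumption about what
    counts as an unknown character.
    """
    return "".join(
        replacement if unicodedata.category(char) == "Co" else char
        for char in text
    )


# U+E000 stands in for a designer's custom defined space.
print(replace_private_use("Total\uE000assets"))  # Total assets
```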

Usually you will not notice much of this because custom defined Unicode characters are very rare in financial reports.

See also: How to prove that XHTML table tags exist?