XmHTML's HTML parser is fairly powerful: it can repair even badly malformed HTML documents and convert a document that does not conform to HTML 3.2 into one that does. These document verification and repair capabilities exist only because XmHTML works exclusively with fully balanced HTML documents. A balanced HTML document is one in which each terminated HTML element has its opening and closing members at the same level.
typedef struct _XmHTMLObject{
    htmlEnum id;                /* ID for this element */
    String element;             /* element text */
    String attributes;          /* attributes for this element, if any */
    Boolean is_end;             /* true when this is a closing element */
    Boolean terminated;         /* true when element has a closing counterpart */
    int line;                   /* line number for this element */
    struct _XmHTMLObject *next;
    struct _XmHTMLObject *prev;
}XmHTMLObject;

The id field of this structure describes the type of element. The table at the end of this document lists all elements that XmHTML knows of.
When id is HT_ZTEXT, the element field contains plain text as read from the document (character escape sequences are not expanded). The attributes, is_end and terminated fields are meaningless in that case.
In all other cases, the element field contains the element name and the attributes field contains possible attributes for this element. When an element is terminated (that is, has a closing counterpart), the terminated field will be True, and the is_end field indicates whether the current element is an opening or a closing one. Only unterminated or opening elements can have attributes.
The element and attributes fields are contained in the same memory buffer, where the latter is separated from the former by a NUL character. When freeing an object, freeing the element field will also free the attributes field.
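The shared-buffer layout described above can be pictured with the following sketch. It is not XmHTML code: DemoObject is a cut-down stand-in for XmHTMLObject that keeps only the two string fields, and make_object and free_object are hypothetical helpers. It also assumes that the buffer comes from malloc and that attributes points just past the NUL that ends the element name.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Cut-down stand-in for XmHTMLObject: only the two string fields. */
typedef struct {
    char *element;     /* start of the shared buffer */
    char *attributes;  /* points into the same buffer, or NULL */
} DemoObject;

/* Hypothetical helper: store name and attributes in ONE allocation,
 * laid out as "name\0attributes\0". Returns 0 on success, -1 on failure. */
static int make_object(DemoObject *obj, const char *name, const char *attrs)
{
    size_t name_len, attr_len;

    assert(obj != NULL && name != NULL);
    name_len = strlen(name);
    attr_len = attrs != NULL ? strlen(attrs) : 0;

    obj->element = malloc(name_len + 1 + attr_len + 1);
    if (obj->element == NULL)
        return -1;

    memcpy(obj->element, name, name_len + 1);       /* copies the NUL too */
    obj->element[name_len + 1] = '\0';               /* empty attribute text */
    obj->attributes = NULL;
    if (attrs != NULL) {
        obj->attributes = obj->element + name_len + 1;
        memcpy(obj->attributes, attrs, attr_len + 1);
    }
    return 0;
}

/* One free() releases both strings, because they share a buffer. */
static void free_object(DemoObject *obj)
{
    free(obj->element);
    obj->element = NULL;
    obj->attributes = NULL;
}
```

Calling free on the attributes pointer of such an object would be an error, since that pointer is not the start of an allocation.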
The line field contains the line number in the source document where the element is located.
The objects field in the XmHTMLDocumentCallbackStruct contains the starting point of the parser tree.
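As an illustration of walking the parser tree from that starting point, the sketch below follows the next pointers and checks whether every terminated element is closed at the level where it was opened. The type definitions are stand-ins so that the fragment compiles without the XmHTML headers (only a few ids are listed and their numeric values are placeholders), and tree_is_balanced and MAX_DEPTH are hypothetical names, not part of XmHTML.

```c
#include <stddef.h>

/* Stand-ins so the sketch compiles without the XmHTML headers.
 * Only a few ids are listed; the real values come from XmHTML. */
typedef enum { HT_B, HT_BR, HT_P, HT_ZTEXT } htmlEnum;
typedef char *String;
typedef int   Boolean;

typedef struct _XmHTMLObject {
    htmlEnum id;
    String element;
    String attributes;
    Boolean is_end;
    Boolean terminated;
    int line;
    struct _XmHTMLObject *next;
    struct _XmHTMLObject *prev;
} XmHTMLObject;

#define MAX_DEPTH 256   /* arbitrary nesting limit for this sketch */

/* Returns 1 when every terminated element closes at the level it was
 * opened, 0 otherwise (or when nesting exceeds MAX_DEPTH). */
static int tree_is_balanced(const XmHTMLObject *objects)
{
    htmlEnum open[MAX_DEPTH];   /* ids of the elements currently open */
    int depth = 0;
    const XmHTMLObject *obj;

    for (obj = objects; obj != NULL; obj = obj->next) {
        if (obj->id == HT_ZTEXT)
            continue;           /* other fields are meaningless for text */
        if (!obj->terminated)
            continue;           /* elements without a closing counterpart */
        if (!obj->is_end) {
            if (depth == MAX_DEPTH)
                return 0;
            open[depth++] = obj->id;
        } else {
            if (depth == 0 || open[depth - 1] != obj->id)
                return 0;       /* closes something that is not innermost */
            depth--;
        }
    }
    return depth == 0;          /* nothing may be left open */
}
```

In a real callback the argument would be the objects field of the callback structure; on a tree that XmHTML reports as balanced, this check would be expected to agree.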
Programmers who want to use the generated parser tree for other purposes might be interested in some of the XmHTML private functions for extracting attribute values and expanding character escape sequences.
typedef struct {
    int reason;         /* the reason the callback was called */
    XEvent *event;      /* always NULL for XmNparserCallback */
    int no;             /* total error count until now */
    int line_no;        /* input line number where error was detected */
    int start_pos;      /* absolute index where error starts */
    int end_pos;        /* absolute index where error ends */
    parserError error;  /* type of error */
    int action;         /* suggested correction action */
    String err_msg;     /* error message */
}XmHTMLParserCallbackStruct, *XmHTMLParserCallbackStructPtr;

This table lists all possible values for the action field, together with a short description of how the parser will respond.
Action | Description |
---|---|
XmHTML_REMOVE | offending element will be removed |
XmHTML_INSERT | insert missing element |
XmHTML_SWITCH | switch offending and expected element |
XmHTML_KEEP | keep offending element |
XmHTML_IGNORE | ignore, proceed as if nothing happened |
XmHTML_TERMINATE | terminate parser |
Shown below are all possible values for the error field (default action is displayed in bold), allowed actions and the value of the err_msg field. When the action field is set to an action that is not allowed for an error, XmHTML will use the default action.
error | actions | err_msg |
---|---|---|
HTML_UNKNOWN_ELEMENT | XmHTML_REMOVE, XmHTML_TERMINATE | %s: unknown HTML identifier |
HTML_UNKNOWN_ESCAPE | XmHTML_REMOVE, XmHTML_TERMINATE | %s: unknown character escape sequence |
HTML_BAD | XmHTML_REMOVE, XmHTML_IGNORE, XmHTML_TERMINATE | Terrible HTML! element %s completely out of balance. |
HTML_OPEN_BLOCK | XmHTML_INSERT, XmHTML_REMOVE, XmHTML_KEEP | A new block level element (%s) was encountered while %s is still open. |
HTML_CLOSE_BLOCK | XmHTML_REMOVE, XmHTML_INSERT, XmHTML_KEEP, XmHTML_TERMINATE | A closing block level element (%s) was encountered while it was never opened. |
HTML_OPEN_ELEMENT | XmHTML_REMOVE, XmHTML_SWITCH, XmHTML_TERMINATE | Unbalanced terminator: got %s while %s is required. |
HTML_VIOLATION | XmHTML_REMOVE, XmHTML_KEEP, XmHTML_TERMINATE | %s may not occur inside %s |
HTML_INTERNAL | XmHTML_TERMINATE, XmHTML_IGNORE | Internal parser error |
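A parser callback that overrides the suggested action may want to check its choice against the allowed actions first. The sketch below transcribes the allowed actions listed above into a lookup table. The enum definitions are placeholders so that the fragment compiles on its own (the real ones come from the XmHTML headers and their numeric values may differ; the bit-flag encoding is purely a choice of this sketch), and action_allowed is a hypothetical helper, not an XmHTML function.

```c
#include <stddef.h>

/* Placeholder definitions; the real ones come from the XmHTML headers
 * and their numeric values may differ. */
typedef enum {
    HTML_UNKNOWN_ELEMENT, HTML_UNKNOWN_ESCAPE, HTML_BAD, HTML_OPEN_BLOCK,
    HTML_CLOSE_BLOCK, HTML_OPEN_ELEMENT, HTML_VIOLATION, HTML_INTERNAL
} parserError;

enum {
    XmHTML_REMOVE    = 1 << 0,
    XmHTML_INSERT    = 1 << 1,
    XmHTML_SWITCH    = 1 << 2,
    XmHTML_KEEP      = 1 << 3,
    XmHTML_IGNORE    = 1 << 4,
    XmHTML_TERMINATE = 1 << 5
};

/* Allowed actions per error, transcribed from the list above. */
static const int allowed_actions[] = {
    [HTML_UNKNOWN_ELEMENT] = XmHTML_REMOVE | XmHTML_TERMINATE,
    [HTML_UNKNOWN_ESCAPE]  = XmHTML_REMOVE | XmHTML_TERMINATE,
    [HTML_BAD]             = XmHTML_REMOVE | XmHTML_IGNORE | XmHTML_TERMINATE,
    [HTML_OPEN_BLOCK]      = XmHTML_INSERT | XmHTML_REMOVE | XmHTML_KEEP,
    [HTML_CLOSE_BLOCK]     = XmHTML_REMOVE | XmHTML_INSERT | XmHTML_KEEP
                             | XmHTML_TERMINATE,
    [HTML_OPEN_ELEMENT]    = XmHTML_REMOVE | XmHTML_SWITCH | XmHTML_TERMINATE,
    [HTML_VIOLATION]       = XmHTML_REMOVE | XmHTML_KEEP | XmHTML_TERMINATE,
    [HTML_INTERNAL]        = XmHTML_TERMINATE | XmHTML_IGNORE
};

/* Returns nonzero when the given single action is listed for the error. */
static int action_allowed(parserError error, int action)
{
    size_t n = sizeof allowed_actions / sizeof allowed_actions[0];

    if ((size_t)error >= n)
        return 0;               /* unknown error code */
    return (allowed_actions[error] & action) != 0;
}
```

A callback could call action_allowed(cbs->error, wanted), with cbs and wanted being its own variables, and store wanted in the action field only when the result is nonzero, leaving the suggested action untouched otherwise.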
typedef struct {
    int reason;             /* the reason the callback was called */
    XEvent *event;          /* always NULL for XmNdocumentCallback */
    Boolean html32;         /* True when document was HTML 3.2 conforming */
    Boolean verified;       /* True when document has been verified */
    Boolean balanced;       /* True when parser tree is balanced */
    int pass_level;         /* current parser level count. Starts at 0 */
    Boolean redo;           /* See below */
    XmHTMLObject *objects;  /* parser tree starting point */
}XmHTMLDocumentCallbackStruct;
extern Boolean _XmHTMLTagCheck(char *attributes, char *tag);
extern Boolean _XmHTMLTagCheckValue(char *attributes, char *tag, char *check);
extern char *_XmHTMLTagGetValue(char *attributes, char *tag);
extern int _XmHTMLTagGetNumber(char *attributes, char *tag, int def);
The following function searches and expands any character escape sequences in the given string:
extern void _XmHTMLExpandEscapes(char *string);

This function recognizes all escape sequences from the ISO 8859-1 character set, as well as all &# character escapes below 160. Escape sequences are not required to have a terminating semicolon.
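To make the behaviour described above concrete, here is a minimal sketch of in-place escape expansion. It is not XmHTML's implementation: expand_escapes is a hypothetical stand-in, the table holds only a handful of the ISO 8859-1 named escapes, and the numeric branch accepts decimal values from 1 to 159 only, mirroring the description.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Small subset of the ISO 8859-1 named escapes, for illustration only. */
static const struct { const char *name; unsigned char ch; } escapes[] = {
    { "amp", '&' }, { "lt", '<' }, { "gt", '>' },
    { "quot", '"' }, { "nbsp", 160 }, { "copy", 169 }
};

/* Hypothetical stand-in: expand escapes in place. This is safe because
 * an expansion is never longer than the escape it replaces. */
static void expand_escapes(char *s)
{
    char *out = s;

    while (*s != '\0') {
        size_t i, len = 0;
        int ch = -1;

        if (*s == '&' && s[1] == '#' && isdigit((unsigned char)s[2])) {
            char *end;
            long v = strtol(s + 2, &end, 10);
            if (v > 0 && v < 160) {         /* numeric escape below 160 */
                ch = (int)v;
                len = (size_t)(end - s);
            }
        } else if (*s == '&') {
            for (i = 0; i < sizeof escapes / sizeof escapes[0]; i++) {
                size_t n = strlen(escapes[i].name);
                if (strncmp(s + 1, escapes[i].name, n) == 0) {
                    ch = escapes[i].ch;
                    len = n + 1;            /* '&' plus the name */
                    break;
                }
            }
        }
        if (ch < 0) {                       /* not a known escape: copy */
            *out++ = *s++;
            continue;
        }
        s += len;
        if (*s == ';')                      /* the semicolon is optional */
            s++;
        *out++ = (char)ch;
    }
    *out = '\0';
}
```

Because the semicolon is optional, the sketch matches names by prefix, so text such as "&ampere" would come out as "&ere". Whether XmHTML resolves that ambiguity the same way is not stated here.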
id | Element | Terminated | id | Element | Terminated |
---|---|---|---|---|---|
HT_DOCTYPE | !doctype | False | HT_A | a | True |
HT_ADDRESS | address | True | HT_APPLET | applet | True |
HT_AREA | area | False | HT_B | b | True |
HT_BASE | base | False | HT_BASEFONT | basefont | False |
HT_BIG | big | True | HT_BLOCKQUOTE | blockquote | True |
HT_BODY | body | True | HT_BR | br | False |
HT_CAPTION | caption | True | HT_CENTER | center | True |
HT_CITE | cite | True | HT_CODE | code | True |
HT_DD | dd | True | HT_DFN | dfn | True |
HT_DIR | dir | True | HT_DIV | div | True |
HT_DL | dl | True | HT_DT | dt | True |
HT_EM | em | True | HT_FONT | font | True |
HT_FORM | form | True | HT_FRAME | frame | True |
HT_FRAMESET | frameset | True | HT_H1 | h1 | True |
HT_H2 | h2 | True | HT_H3 | h3 | True |
HT_H4 | h4 | True | HT_H5 | h5 | True |
HT_H6 | h6 | True | HT_HEAD | head | True |
HT_HR | hr | False | HT_HTML | html | True |
HT_I | i | True | HT_IMG | img | False |
HT_INPUT | input | False | HT_ISINDEX | isindex | False |
HT_KBD | kbd | True | HT_LI | li | True |
HT_LINK | link | False | HT_MAP | map | True |
HT_MENU | menu | True | HT_META | meta | False |
HT_NOFRAMES | noframes | True | HT_OL | ol | True |
HT_OPTION | option | True | HT_P | p | True |
HT_PARAM | param | False | HT_PRE | pre | True |
HT_SAMP | samp | True | HT_SCRIPT | script | True |
HT_SELECT | select | True | HT_SMALL | small | True |
HT_STRIKE | strike | True | HT_STRONG | strong | True |
HT_STYLE | style | True | HT_SUB | sub | True |
HT_SUP | sup | True | HT_TAB | tab | False |
HT_TABLE | table | True | HT_TD | td | True |
HT_TEXTAREA | textarea | True | HT_TH | th | True |
HT_TITLE | title | True | HT_TR | tr | True |
HT_TT | tt | True | HT_U | u | True |
HT_UL | ul | True | HT_VAR | var | True |
HT_ZTEXT | plain text | False |
©Copyright 1996-1997 by Ripley Software Development
Last update: September 19, 1997 by Koen