XmHTML Parser Description

Overview
Parser Tree
Document Verification
Document Repair
XmNparserCallback
XmNdocumentCallback
Private Functions
XmHTML Element Identifiers

Overview

This document describes XmHTML's HTML parser in detail and provides background information on the how and why of document verification and repair. It is targeted at programmers who want to make full use of the parser and document callback resources, as well as programmers who want to use the generated parser tree for other purposes.

XmHTML's HTML parser is fairly powerful: it can repair even badly malformed HTML documents and convert a document that does not conform to HTML 3.2 into one that does. These document verification and repair capabilities exist only because XmHTML works solely with fully balanced HTML documents. A balanced HTML document is a document in which each terminated HTML element has its opening and closing members at the same level.

Parser Tree

When a document is loaded into XmHTML, the parser translates it into a doubly linked list of objects (referred to as the Parser Tree). Each object contains either an HTML element (and its attributes) or plain text.

typedef struct _XmHTMLObject{
	htmlEnum id;		/* ID for this element */
	String element;		/* element text */
	String attributes;	/* attributes for this element, if any */
	Boolean is_end;		/* true when this is a closing element */
	Boolean terminated;	/* true when element has a closing counterpart */
	int line;		/* line number for this element */
	struct _XmHTMLObject *next;
	struct _XmHTMLObject *prev;
}XmHTMLObject;

The id field of this structure describes the type of element. The table at the end of this document lists all elements that XmHTML knows of.

When id is HT_ZTEXT, the element field contains plain text as read from the document (character escape sequences are not expanded). The attributes, is_end and terminated fields are meaningless in this case.

In all other cases, the element field contains the element name and the attributes field contains possible attributes for this element. When an element is terminated (that is, has a closing counterpart), the terminated field will be True, and the is_end field indicates whether the current element is an opening or a closing one. Only unterminated or opening elements can have attributes.
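
The is_end and terminated fields are enough to test whether a parser tree is balanced in the sense described in the Overview. What follows is a minimal sketch, not XmHTML's own verification code. The type definitions are stand-ins so that it compiles without the XmHTML headers, and tree_is_balanced, its text_id parameter (pass HT_ZTEXT) and MAX_DEPTH are hypothetical names introduced for illustration.

```c
#include <stddef.h>

/* Stand-in types so the sketch compiles without the XmHTML headers. */
typedef char *String;
typedef int Boolean;
typedef int htmlEnum;		/* the real type is an enumeration */

typedef struct _XmHTMLObject{
	htmlEnum id;
	String element;
	String attributes;
	Boolean is_end;
	Boolean terminated;
	int line;
	struct _XmHTMLObject *next;
	struct _XmHTMLObject *prev;
}XmHTMLObject;

#define MAX_DEPTH 256		/* arbitrary nesting limit for this sketch */

/*
 * Returns 1 when every terminated element is closed at the level at which
 * it was opened, 0 otherwise. text_id is the id used for plain text.
 */
int
tree_is_balanced(const XmHTMLObject *head, htmlEnum text_id)
{
	htmlEnum open[MAX_DEPTH];	/* ids of the currently open elements */
	int depth = 0;
	const XmHTMLObject *obj;

	for(obj = head; obj != NULL; obj = obj->next)
	{
		/* plain text and unterminated elements never open a level */
		if(obj->id == text_id || !obj->terminated)
			continue;
		if(!obj->is_end)
		{
			if(depth == MAX_DEPTH)
				return 0;	/* too deep for this sketch */
			open[depth++] = obj->id;
		}
		else if(depth == 0 || open[--depth] != obj->id)
			return 0;	/* closing element without a matching opener */
	}
	return depth == 0;	/* nothing may be left open */
}
```

A sequence such as b, i, /i, /b passes this test, while b, i, /b, /i does not.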

The element and attributes fields share a single memory buffer, in which the latter is separated from the former by a NUL character. Freeing the element field therefore also frees the attributes field; the attributes field must not be freed separately.
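
The shared-buffer layout can be illustrated as follows. This is a sketch under stated assumptions: make_element_buffer and free_element_buffer are hypothetical helpers, and malloc and free stand in for whatever allocator XmHTML uses internally.

```c
#include <stdlib.h>
#include <string.h>

typedef char *String;		/* stand-in for the Xt String type */

/*
 * Hypothetical helper: copies name and attrs into ONE allocation laid out
 * as "name\0attrs\0". On return *attributes points into that allocation,
 * or is NULL when attrs is NULL. Returns the element pointer, or NULL when
 * the allocation fails.
 */
String
make_element_buffer(const char *name, const char *attrs, String *attributes)
{
	size_t nlen = strlen(name);
	size_t alen = attrs ? strlen(attrs) : 0;
	char *buf = malloc(nlen + 1 + (attrs ? alen + 1 : 0));

	if(buf == NULL)
		return NULL;
	memcpy(buf, name, nlen + 1);	/* copies the separating NUL as well */
	if(attrs != NULL)
	{
		memcpy(buf + nlen + 1, attrs, alen + 1);
		*attributes = buf + nlen + 1;
	}
	else
		*attributes = NULL;
	return buf;
}

/* Releasing element releases attributes too; never free attributes itself. */
void
free_element_buffer(String element)
{
	free(element);
}
```

With name "img" the attributes pointer ends up exactly four bytes past the element pointer, which is why a single free is sufficient.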

The line field contains the line number in the source document where the element is located.

The objects field in the XmHTMLDocumentCallbackStruct contains the starting point of the parser tree.
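
Given that starting point, the tree can be walked with the next pointers. The sketch below prints one line per object. It uses stand-in type definitions so that it compiles without the XmHTML headers, and dump_tree and its text_id parameter (pass HT_ZTEXT) are hypothetical names, not part of XmHTML.

```c
#include <stdio.h>

/* Stand-in types so the sketch compiles without the XmHTML headers. */
typedef char *String;
typedef int Boolean;
typedef int htmlEnum;		/* the real type is an enumeration */

typedef struct _XmHTMLObject{
	htmlEnum id;
	String element;
	String attributes;
	Boolean is_end;
	Boolean terminated;
	int line;
	struct _XmHTMLObject *next;
	struct _XmHTMLObject *prev;
}XmHTMLObject;

/*
 * Walks the list from head to tail and writes one line per object to out.
 * Returns the number of objects visited.
 */
int
dump_tree(const XmHTMLObject *head, htmlEnum text_id, FILE *out)
{
	int count = 0;
	const XmHTMLObject *obj;

	for(obj = head; obj != NULL; obj = obj->next, count++)
	{
		if(obj->id == text_id)
			fprintf(out, "%d: text \"%s\"\n", obj->line, obj->element);
		else
			/* is_end only carries meaning for terminated elements */
			fprintf(out, "%d: <%s%s>\n", obj->line,
				(obj->terminated && obj->is_end) ? "/" : "",
				obj->element);
	}
	return count;
}
```

In a document callback this would be called with the objects field of the callback structure as head.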

Programmers who want to use the generated parser tree for other purposes might be interested in some of the XmHTML private functions for extracting attribute values and expanding character escape sequences.

Document Verification

Document Repair

XmNparserCallback

typedef struct
{
	int reason;		/* the reason the callback was called */
	XEvent *event;		/* always NULL for XmNparserCallback */
	int no;			/* total error count up to now */
	int line_no;		/* input line number where error was detected */
	int start_pos;		/* absolute index where error starts */
	int end_pos;		/* absolute index where error ends */
	parserError error;	/* type of error */
	int action;		/* suggested correction action */
	String err_msg;		/* error message */
}XmHTMLParserCallbackStruct, *XmHTMLParserCallbackStructPtr;

This table lists all possible values for the action field, together with a short description of what the parser response will be.

Action	Description
XmHTML_REMOVE	offending element will be removed
XmHTML_INSERT	insert missing element
XmHTML_SWITCH	switch offending and expected element
XmHTML_KEEP	keep offending element
XmHTML_IGNORE	ignore, proceed as if nothing happened
XmHTML_TERMINATE	terminate parser

Shown below are all possible values for the error field (default action is displayed in bold), allowed actions and the value of the err_msg field. When the action field is set to an action that is not allowed for an error, XmHTML will use the default action.

error: HTML_UNKNOWN_ELEMENT
actions: XmHTML_REMOVE, XmHTML_TERMINATE
err_msg: %s: unknown HTML identifier

error: HTML_UNKNOWN_ESCAPE
actions: XmHTML_REMOVE, XmHTML_TERMINATE
err_msg: %s: unknown character escape sequence

error: HTML_BAD
actions: XmHTML_REMOVE, XmHTML_IGNORE, XmHTML_TERMINATE
err_msg: Terrible HTML! element %s completely out of balance.

error: HTML_OPEN_BLOCK
actions: XmHTML_INSERT, XmHTML_REMOVE, XmHTML_KEEP
err_msg: A new block level element (%s) was encountered while %s is still open.

error: HTML_CLOSE_BLOCK
actions: XmHTML_REMOVE, XmHTML_INSERT, XmHTML_KEEP, XmHTML_TERMINATE
err_msg: A closing block level element (%s) was encountered while it was never opened.

error: HTML_OPEN_ELEMENT
actions: XmHTML_REMOVE, XmHTML_SWITCH, XmHTML_TERMINATE
err_msg: Unbalanced terminator: got %s while %s is required.

error: HTML_VIOLATION
actions: XmHTML_REMOVE, XmHTML_KEEP, XmHTML_TERMINATE
err_msg: %s may not occur inside %s

error: HTML_INTERNAL
actions: XmHTML_TERMINATE, XmHTML_IGNORE
err_msg: Internal parser error
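
A parser callback that overrides the suggested action may want to check first whether its choice is allowed for the reported error. The sketch below encodes the list above as a lookup table. The enumerations are stand-ins whose numeric values are assumptions (the values in XmHTML's headers may differ), and action_is_allowed is a hypothetical helper. It does not model default actions.

```c
/* Stand-in enumerations; the numeric values in XmHTML's headers may differ. */
typedef enum{
	HTML_UNKNOWN_ELEMENT, HTML_UNKNOWN_ESCAPE, HTML_BAD, HTML_OPEN_BLOCK,
	HTML_CLOSE_BLOCK, HTML_OPEN_ELEMENT, HTML_VIOLATION, HTML_INTERNAL
}parserError;

enum{
	XmHTML_REMOVE, XmHTML_INSERT, XmHTML_SWITCH,
	XmHTML_KEEP, XmHTML_IGNORE, XmHTML_TERMINATE
};

#define BIT(a) (1u << (a))

/* One bit mask of allowed actions per error, in parserError order. */
static const unsigned allowed[] = {
	/* HTML_UNKNOWN_ELEMENT */
	BIT(XmHTML_REMOVE) | BIT(XmHTML_TERMINATE),
	/* HTML_UNKNOWN_ESCAPE */
	BIT(XmHTML_REMOVE) | BIT(XmHTML_TERMINATE),
	/* HTML_BAD */
	BIT(XmHTML_REMOVE) | BIT(XmHTML_IGNORE) | BIT(XmHTML_TERMINATE),
	/* HTML_OPEN_BLOCK */
	BIT(XmHTML_INSERT) | BIT(XmHTML_REMOVE) | BIT(XmHTML_KEEP),
	/* HTML_CLOSE_BLOCK */
	BIT(XmHTML_REMOVE) | BIT(XmHTML_INSERT) | BIT(XmHTML_KEEP) |
		BIT(XmHTML_TERMINATE),
	/* HTML_OPEN_ELEMENT */
	BIT(XmHTML_REMOVE) | BIT(XmHTML_SWITCH) | BIT(XmHTML_TERMINATE),
	/* HTML_VIOLATION */
	BIT(XmHTML_REMOVE) | BIT(XmHTML_KEEP) | BIT(XmHTML_TERMINATE),
	/* HTML_INTERNAL */
	BIT(XmHTML_TERMINATE) | BIT(XmHTML_IGNORE)
};

/* Returns 1 when action is listed as allowed for error, 0 otherwise. */
int
action_is_allowed(parserError error, int action)
{
	if((unsigned)error > HTML_INTERNAL ||
		action < 0 || action > XmHTML_TERMINATE)
		return 0;
	return (allowed[error] & BIT(action)) != 0;
}
```

A callback could call action_is_allowed(cbs->error, wanted) and assign cbs->action only when the result is nonzero, leaving the suggested action in place otherwise.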

XmNdocumentCallback

typedef struct
{
	int reason;		/* the reason the callback was called */
	XEvent *event;		/* always NULL for XmNdocumentCallback */
	Boolean html32;		/* True when document was HTML 3.2 conforming */
	Boolean verified;	/* True when document has been verified */
	Boolean balanced;	/* True when parser tree is balanced */
	int pass_level;		/* current parser level count. Starts at 0 */
	Boolean redo;		/* See below */
	XmHTMLObject *objects;	/* parser tree starting point */
}XmHTMLDocumentCallbackStruct;

Private Functions

XmHTML uses a number of functions to extract values from the attributes field of the XmHTMLObject structures. This section gives a brief overview of these functions, along with the prototypes. The functions themselves are declared in the header file XmHTMLfuncs.h.

extern Boolean _XmHTMLTagCheck(char *attributes, char *tag);

Returns True when tag is present in the given attributes.

extern Boolean _XmHTMLTagCheckValue(char *attributes, char *tag, char *check);

Returns True when tag has the specified value check and False if not.

extern char *_XmHTMLTagGetValue(char *attributes, char *tag);

Returns the value of tag if found in the given attributes, NULL otherwise. The return value must be freed by the caller.

extern int _XmHTMLTagGetNumber(char *attributes, char *tag, int def);

Returns the numerical value of tag if found in the given attributes. def specifies the return value if tag is not found.
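
The ownership and default-value rules described above can be sketched without linking against XmHTML. The two functions below are simplified stand-ins written for this illustration, not XmHTML's implementation: demo_tag_get_value and demo_tag_get_number are hypothetical names, matching is exact and case-sensitive, and quoting is handled only in its simplest form.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/*
 * Simplified stand-in for _XmHTMLTagGetValue (an assumption, not XmHTML's
 * code): finds tag=value or tag="value" and returns a malloc'ed copy of
 * the value, or NULL when tag is absent. The caller frees the result.
 */
char *
demo_tag_get_value(const char *attributes, const char *tag)
{
	size_t tlen = strlen(tag);
	const char *p = attributes;

	if(tlen == 0)
		return NULL;
	while(p != NULL && (p = strstr(p, tag)) != NULL)
	{
		/* accept only a whole attribute name directly followed by '=' */
		if((p == attributes || isspace((unsigned char)p[-1])) &&
			p[tlen] == '=')
		{
			const char *start = p + tlen + 1;
			const char *end;
			size_t len;
			char *value;

			if(*start == '"')
			{
				start++;
				end = strchr(start, '"');
				if(end == NULL)
					end = start + strlen(start);
			}
			else
			{
				end = start;
				while(*end != '\0' && !isspace((unsigned char)*end))
					end++;
			}
			len = (size_t)(end - start);
			value = malloc(len + 1);
			if(value == NULL)
				return NULL;
			memcpy(value, start, len);
			value[len] = '\0';
			return value;
		}
		p += tlen;
	}
	return NULL;
}

/* Stand-in for _XmHTMLTagGetNumber: numeric value of tag, def if absent. */
int
demo_tag_get_number(const char *attributes, const char *tag, int def)
{
	char *value = demo_tag_get_value(attributes, tag);
	int number = def;

	if(value != NULL)
	{
		number = atoi(value);
		free(value);	/* the returned string is owned by the caller */
	}
	return number;
}
```

For the attribute string src="logo.gif" width=120 the first function yields a copy of logo.gif that the caller must free, and the second yields 120 for width and the supplied default for any attribute that is missing.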

The following function searches and expands any character escape sequences in the given string:

extern void _XmHTMLExpandEscapes(char *string);

This function recognizes all escape sequences from the ISO 8859-1 character set, as well as all &# character escapes below 160. Escape sequences are not required to have a terminating semicolon.
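
In-place expansion of this kind can be sketched as follows. This is an illustration only and not XmHTML's implementation: demo_expand_escapes is a hypothetical name, and it knows just four named escapes (amp, lt, gt, quot) plus numeric escapes below 160, whereas the real function covers the full ISO 8859-1 set.

```c
#include <ctype.h>
#include <string.h>

/* The few named escapes this sketch knows about. */
static const struct{
	const char *name;
	char ch;
}named[] = {
	{ "amp", '&' }, { "lt", '<' }, { "gt", '>' }, { "quot", '"' }
};

/*
 * Expands escapes in place. The result is never longer than the input, so
 * the write position can never overtake the read position.
 */
void
demo_expand_escapes(char *string)
{
	char *src = string, *dst = string;

	while(*src != '\0')
	{
		size_t i, used = 0;	/* used: length of a recognized escape */
		char out = '\0';

		if(*src == '&' && src[1] == '#' && isdigit((unsigned char)src[2]))
		{
			int code = 0;
			size_t n = 2;

			/* stop accumulating once the value leaves the valid range */
			while(isdigit((unsigned char)src[n]) && code < 160)
				code = code*10 + (src[n++] - '0');
			if(code > 0 && code < 160)
			{
				out = (char)code;
				used = n;
			}
		}
		else if(*src == '&')
		{
			for(i = 0; i < sizeof named/sizeof named[0]; i++)
			{
				size_t len = strlen(named[i].name);

				if(strncmp(src + 1, named[i].name, len) == 0)
				{
					out = named[i].ch;
					used = len + 1;
					break;
				}
			}
		}
		if(used > 0)
		{
			if(src[used] == ';')
				used++;	/* the terminating semicolon is optional */
			*dst++ = out;
			src += used;
		}
		else
			*dst++ = *src++;	/* not an escape: copy unchanged */
	}
	*dst = '\0';
}
```

Because the semicolon is optional, a named escape is matched as a prefix, so in this sketch text such as &ltfoo also expands the leading &lt. Unknown or out-of-range escapes are copied through unchanged.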

XmHTML Element Identifiers

This table lists the internal identifiers, the name of the corresponding HTML element and whether an element is terminated or not. It includes the complete set of HTML 3.2 elements, as well as a small number of extensions.

id	Element	Terminated	id	Element	Terminated
HT_DOCTYPE	!doctype	False	HT_A	a	True
HT_ADDRESS	address	True	HT_APPLET	applet	True
HT_AREA	area	False	HT_B	b	True
HT_BASE	base	False	HT_BASEFONT	basefont	False
HT_BIG	big	True	HT_BLOCKQUOTE	blockquote	True
HT_BODY	body	True	HT_BR	br	False
HT_CAPTION	caption	True	HT_CENTER	center	True
HT_CITE	cite	True	HT_CODE	code	True
HT_DD	dd	True	HT_DFN	dfn	True
HT_DIR	dir	True	HT_DIV	div	True
HT_DL	dl	True	HT_DT	dt	True
HT_EM	em	True	HT_FONT	font	True
HT_FORM	form	True	HT_FRAME	frame	True
HT_FRAMESET	frameset	True	HT_H1	h1	True
HT_H2	h2	True	HT_H3	h3	True
HT_H4	h4	True	HT_H5	h5	True
HT_H6	h6	True	HT_HEAD	head	True
HT_HR	hr	False	HT_HTML	html	True
HT_I	i	True	HT_IMG	img	False
HT_INPUT	input	False	HT_ISINDEX	isindex	False
HT_KBD	kbd	True	HT_LI	li	True
HT_LINK	link	False	HT_MAP	map	True
HT_MENU	menu	True	HT_META	meta	False
HT_NOFRAMES	noframes	True	HT_OL	ol	True
HT_OPTION	option	True	HT_P	p	True
HT_PARAM	param	False	HT_PRE	pre	True
HT_SAMP	samp	True	HT_SCRIPT	script	True
HT_SELECT	select	True	HT_SMALL	small	True
HT_STRIKE	strike	True	HT_STRONG	strong	True
HT_STYLE	style	True	HT_SUB	sub	True
HT_SUP	sup	True	HT_TAB	tab	False
HT_TABLE	table	True	HT_TD	td	True
HT_TEXTAREA	textarea	True	HT_TH	th	True
HT_TITLE	title	True	HT_TR	tr	True
HT_TT	tt	True	HT_U	u	True
HT_UL	ul	True	HT_VAR	var	True
HT_ZTEXT	plain text	False
