XML Schema Inference

Introduction

This article describes how the XML Schema Inference engine is designed and implemented. The implementation is available as System.Xml.Schema.XmlSchemaInference (in System.Xml.dll) in the .NET 2.0 API.

This article will help readers who want to know how XmlSchemaInference generates a set of schemas from an input XML document (supplied via XmlReader).

There are two general categories in the design: attribute inference and element inference. There are also some special topics such as data type inference, content type inference, and particle inference.

Our XmlSchemaInference design is largely equivalent to the one from Microsoft. It originated as "XSDInference", which was delivered as an external tool from Microsoft (it is now on gotdotnet). As far as I know, it consists of a single 60KB, 2000-line source file and decent documentation on how the inference is done. Here I write something similar, but also cover which features the XmlSchemaInference design cannot support, and why.

How to use XmlSchemaInference

Here is a minimum example:

using System;
using System.Xml;
using System.Xml.Schema;
 
public class Test
{
    public static void Main (string [] args)
    {
        XmlSchemaInference infer = new XmlSchemaInference ();
        foreach (string filename in args) {
            using (XmlReader reader = XmlReader.Create (filename)) {
                XmlSchemaSet ss = infer.InferSchema (reader);
                foreach (XmlSchema xs in ss.Schemas ())
                    xs.Write (Console.Out);
            }
        }
    }
}

This tiny driver is worth playing with against any XML document for a while :-)

Arguments and Options

XmlSchemaInference.InferSchema() takes an XmlReader and optionally an existing XmlSchemaSet as arguments and returns an XmlSchemaSet. In my experience, the MS implementation returns the same instance it received.

The input XmlSchemaSet is expected to be one that was generated by another XmlSchemaInference session, so as not to confuse the inference engine. However, since there is actually no way to detect the origin of an XmlSchemaSet, it accepts anything.

Before the actual inference starts, the input XmlSchemaSet is compiled. The purpose is mainly to acquire GlobalElements and GlobalAttributes, but what is equally important is that invalid schemas are rejected at this stage.

Besides the input arguments, there are two option flags:

  • Occurrence: restricted or relaxed. For relaxed inference, attributes and content item elements are always optional.
  • TypeInference: restricted or relaxed. For relaxed inference, it always regards text data types as xs:string, while restricted inference adds more specific datatypes.

Entrypoint

Next, the engine moves the XmlReader's cursor forward to the document element and processes it. It tries to find a global element whose QName matches that of the document element. If such a definition already exists, it is extended so that the definition and the XmlReader content do not contradict each other.

Example:

Instance:
 
<foo>ABC</foo>
 
Existing schema:
 
<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>
  <xs:element name='foo' type='xs:int' />
</xs:schema>
 
Result:
 
<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>
  <xs:element name='foo' type='xs:string' />
</xs:schema>

If a new element is to be defined, then the namespace is an important factor. The new XmlSchemaElement (xs:element) instance must have the same target namespace URI as the XmlSchema (xs:schema) that will contain it. And since XmlSchemaElement.QualifiedName is not populated until compilation, those new schema components cannot be acquired via XmlSchemaSet.GlobalElements etc. (the new items are not compiled yet).

For schema inference implementors: when implementing schema inference, you must always be careful that the property you are about to use does not depend on compilation. In .NET System.Xml.Schema, post-compilation information is available only after Compile().

Global and local components

By the way, the current XmlSchemaInference never infers non-document-element children as global components, unless those children are in other namespaces. Elements that have the same target namespace are inferred as complexType children. In contrast, since "external" element definitions (those with different target namespaces) can be used only via a reference in model groups, they must be defined globally.

Example:

Instance:
 
<products>
  <category>
    <product>foo</product>
    <product>bar</product>
  </category>
  <product>hoge</product>
  <product>fuga</product>
</products>
 
Result:
 
<xs:schema
  attributeFormDefault="unqualified"
  elementFormDefault="qualified"
  xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="products">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="category">
          <xs:complexType>
            <xs:sequence>
              <xs:element maxOccurs="unbounded"
                name="product" type="xs:string" />
            </xs:sequence>
          </xs:complexType>
        </xs:element>
        <xs:element maxOccurs="unbounded"
          name="product" type="xs:string" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>

There seems to be no particular reason why the Microsoft developers designed the inference engine this way (since those external elements must be defined as references anyway). They support the external case nicely, so there is no reason not to support such "global inference" for non-external elements as well.

Since two elements with an identical name might be inferred as different elements, you should always examine the results carefully.

(It might be supported in the future. However, it might not be easy for the current MS developers to support it, since they said they have not been able to improve it since 2004.)

Attributes

Attribute inference is an easy part of the inference (though element inference must come first).


Targets

The target of inference is limited. xmlns:* attributes are not inferred (they have no schema definitions anyway). Similarly, xsi:* attributes are ignored, though note that xsi:nil might be used during inference to set nillable="true".
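
For example, given the following hypothetical instance, only the "id" attribute results in an attribute definition; the namespace declaration and the xsi:* attribute are skipped:

<foo xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance'
  xsi:nil='true' id='a1' />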

The content of a definition is easy: name, occurrence (use="required" or "optional"), and type.

Generating xs:attribute

Attribute names are explicit. But if there is a namespace URI (i.e. there is a prefix) for the attribute, it becomes a global attribute. In that case, we must provide a definition for the corresponding global attribute and create an xs:attribute that uses the "ref" attribute to point to that definition. This is required even if the attribute's target namespace is the same as that of the containing element (and the rule that the attribute definition must reside in the corresponding schema applies here too).
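
For example, a prefixed attribute in namespace "urn:ext" would be inferred as a global attribute in its own schema and referenced via "ref". (This is a sketch based on the rules above; the exact serialization, attribute order, and inferred types may differ.)

Instance:

<foo xmlns:e='urn:ext' e:lang='ja' />

Result (sketch):

<xs:schema targetNamespace="urn:ext"
  xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:attribute name="lang" type="xs:string" />
</xs:schema>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:e="urn:ext">
  <xs:import namespace="urn:ext" />
  <xs:element name="foo">
    <xs:complexType>
      <xs:attribute ref="e:lang" />
    </xs:complexType>
  </xs:element>
</xs:schema>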

Related to the discussion above, for xml:* attributes the inference engine imports "http://www.w3.org/2001/xml.xsd" into the resulting XmlSchemaSet. This is not "inferred"; it is a predefined result.

use="required" or use="optional"?

Occurrence matters only when the XmlSchemaInference.Occurrence property is set to "Restricted" (when it is "Relaxed", attributes are always optional). use="required" is possible only in limited situations.

Suppose the engine is inferring an instance element "x" whose type is "e". If "x" carries an attribute "a", then the use of "a" becomes "required". If the next instance "y" of the same type does not carry "a", then "a" must become "optional". Similarly, if "y" carries an attribute "b" that was not in the existing definition of "e", then "b" must be defined with use="optional" (otherwise "x" would become invalid). That is, when inferring attributes, it matters significantly whether the container element is newly inferred or being expanded.

Example:

Input:
 
<foo b='value'/>
 
Existing schema:
 
<xs:schema attributeFormDefault="unqualified"
  elementFormDefault="qualified"
  xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="foo">
    <xs:complexType>
      <xs:attribute name="a" type="xs:string" use="required" />
    </xs:complexType>
  </xs:element>
</xs:schema>
 
Result:
 
<xs:schema attributeFormDefault="unqualified"
  elementFormDefault="qualified"
  xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="foo">
    <xs:complexType>
      <xs:attribute name="a" type="xs:string" use="optional" />
      <xs:attribute name="b" type="xs:string" use="optional" />
    </xs:complexType>
  </xs:element>
</xs:schema>

type

Type inference on attributes is simple, because an attribute type is always a simpleType. If the attribute is new, its type can be inferred only from the string value. If the attribute is not new, the value must be valid against the existing definition; if it is not, we must find a common base data type. I'll describe later how to determine the data type from value strings.


complexType

To define attributes, the element type must be complex (complexType). Since there might already be an element type definition, some kind of transformation might be needed:

  • If there was no type information (no "type" attribute and neither "simpleType" nor "complexType" children), then a "complexType" is simply created.
  • If the content is a "simpleType", then "complexType", "simpleContent" and "extension" are created in that order, and the "simpleType" is set as the base type (either in the "base" attribute or as a child of the "extension").
  • If there was already a "complexType", nothing needs to change.

Attributes are added either to the "complexType" itself or to the "extension" or "restriction" element of the complexType's child content model.
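
For example, the second transformation above could look like this. (A hypothetical sketch; the inferred use and data types may differ depending on the option settings.)

Instance:

<foo a='v'>XYZ</foo>

Existing schema:

<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>
  <xs:element name='foo' type='xs:string' />
</xs:schema>

Result (sketch):

<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>
  <xs:element name='foo'>
    <xs:complexType>
      <xs:simpleContent>
        <xs:extension base='xs:string'>
          <xs:attribute name='a' type='xs:string' />
        </xs:extension>
      </xs:simpleContent>
    </xs:complexType>
  </xs:element>
</xs:schema>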

Not supported inference: attributeGroup

It would still be possible to create attributeGroups and reference them from a complexType. However, it gets too complicated: those groups might be referenced in many places, and inference might require updating the attributes inside the attributeGroups. So if the input element has a reference to an attribute group, the engine simply rejects it (with an exception).

Not supported inference: anyAttribute

The wildcard component constraint is complicated: a specific QName must not be covered by wildcards. So if we tried to add a wildcard, we would have to be aware of all existing attributes. Even if that did not result in errors, we would still have to ignore all attributes covered by the wildcard, and the wildcard target could never be changed (there might be other attributes that were originally not covered, or conversely attributes that depend on the range being removed). Thus it is better to reject anyAttribute items.

Content Type

kind and particle

Unlike attributes, elements can be either simpleType or complexType. Here I write about the content kind part of "content type" (in XML Schema Structures speak); I'll visit the "content type particle" part later. Practically, there are four patterns, corresponding to the XmlSchemaContentType enumeration:

  • empty
  • text only
  • element only
  • mixed

Empty content

An empty type is inferred when the input element (instance) is empty (of course), i.e. when XmlReader.IsEmptyElement is true or EndElement is reached before any content occurs.

It is not difficult to "express" an empty content type. MS XmlSchemaInference sets no type information; in Mono, I set xs:string. I wonder which is better, but according to XML Schema Structures 3.3.2, if there is no type information the element type is the ur-type (anyType), so the schema allows any content; xs:string is therefore the more restrictive alternative. The better solution would be an empty sequence (I just didn't do that because it somewhat messes up the code).
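
To illustrate the difference (a sketch based on the behaviors described above; the actual output may vary):

Instance:

<foo/>

MS result (no type information):

<xs:element name='foo' />

Mono result:

<xs:element name='foo' type='xs:string' />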

Text content

When a Text node (including CDATA sections and significant whitespace) occurs, the type must allow simple content. If the existing type is:

  • empty, then the type must become xs:string (I'll describe later why it must be xs:string)
  • text only or mixed, then the content kind is unchanged (though its data type must be adjusted)
  • element only, then the content kind must become mixed (set mixed="true" on the complexType)
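
For example, the third case above could look like this (a hypothetical sketch):

Instance:

<foo>ABC<bar>x</bar></foo>

Existing schema fragment (element only):

<xs:element name='foo'>
  <xs:complexType>
    <xs:sequence>
      <xs:element name='bar' type='xs:string' />
    </xs:sequence>
  </xs:complexType>
</xs:element>

Result (sketch):

<xs:element name='foo'>
  <xs:complexType mixed='true'>
    <xs:sequence>
      <xs:element name='bar' type='xs:string' />
    </xs:sequence>
  </xs:complexType>
</xs:element>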

When inferring the type of text content, it must also do predefined type inference. However, there are some cases where it does not have to:

  • If the type is a complexType. Its kind is then mixed, so there is no chance to set a predefined type (there is no way to apply a predefined data type to text nodes that might be split).
  • If the element already existed and its content kind was empty. In that case we can only set xs:string, because no other (useful) type allows an empty string, even if the new text node is a simple integer.
  • If the type is a non-predefined simpleType. Currently(?) there is no System.Xml API that validates simple text against a simpleType (XmlSchemaSimpleType) while taking derivation (by restriction, list, union) into consideration. Thus it is impossible to examine whether the new text node is acceptable against the simpleType (I once suggested such functionality to MS, but they couldn't understand what it meant).

(In the last case it would still be possible to walk up the base types, but Mono just sets xs:string. MS.NET has much more limited support here; it never allows a custom simpleType.)

Other than in those cases, it must handle predefined type inference. If there is an existing predefined type, the engine first examines whether the input text value is valid against it. If it is, the type can be left as is. If not, the type must be "relaxed". See the predefined type inference section later.

Element content

When an element child occurs, the container element type must be transformed into a complexType that accepts complex content:

  • If the existing type was a simpleType, it is replaced with a new complexType whose content is a simpleContent restriction (the existing simpleType is copied into it)
  • If the existing type was a complexType with a simpleContent, then the simpleContent is replaced with a complexContent, and all existing attribute definitions in the simpleContent are copied over. The complexContent becomes mixed="true" and the simple type information is discarded.
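
For example, an element with an existing simple type that suddenly contains a child element could end up like this. (A hypothetical sketch; per the rules above, the simple type information is discarded and the type becomes mixed.)

Instance:

<foo><bar>x</bar></foo>

Existing schema fragment:

<xs:element name='foo' type='xs:int' />

Result (sketch):

<xs:element name='foo'>
  <xs:complexType mixed='true'>
    <xs:sequence>
      <xs:element name='bar' type='xs:string' />
    </xs:sequence>
  </xs:complexType>
</xs:element>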

Particle inference

Model groups and particles

OK, time to dive into the deepest part. Model groups and particles are both used to represent content models, but the terms differ somewhat. A model group (described in XML Schema Structures section 3.8) can occur as a child of "complexType", "extension" or "restriction". A particle (described in 3.9) can be either a model group or one of its children. Namely, xs:element and xs:any are not model groups and thus cannot appear as an immediate child of a complexType.


Supported patterns

It might sound surprising, but the set of patterns supported by XmlSchemaInference is very limited.

First, occurrence is limited to 0, 1 or unbounded (i.e. the same patterns as in a DTD). I don't find this inconvenient (on the contrary, I would find it inconvenient if inference resulted in, say, maxOccurs="12" after feeding just one example instance).

Next, the supported content model patterns are only two:

  • (A) a sequence that contains element particles.
  • (B) a sequence that contains a choice whose maxOccurs="unbounded" and which contains element particles.

A looks like:

<xs:sequence>
  <xs:element name="Foo" />
  <xs:element name="Bar" />
</xs:sequence>

B looks like:

<xs:sequence>
  <xs:choice maxOccurs="unbounded">
    <xs:element name="Foo" />
    <xs:element name="Bar" />
  </xs:choice>
</xs:sequence>

All other patterns are rejected. Concretely:

  • xs:group is not supported
  • xs:all is not supported
  • xs:any is not supported
  • xs:choice itself is not supported

Since XmlSchemaInference accepts only schema sets generated by another XmlSchemaInference session, those "other patterns" never arise anyway. But why are patterns other than the first two unsupported?

  • It is too complicated to expand xs:group references. Also, a whole schema set can be valid in the absence of the actual xs:group definition (see XML Schema Structures section 5.5.3, Missing Sub-components), so a model group containing a reference to xs:group is likely to become invalid once those model groups are provided.
  • It is not obvious to me why xs:all is not supported, but admittedly it is a less attractive schema component that costs complicated support. To support it, we would have to keep track of all the element content that has already occurred, and set minOccurs="0" for those children of xs:all that did not actually appear. It would still be possible to replace the xs:all with an xs:choice or xs:sequence once it becomes impossible for the xs:all to allow the input (e.g. when the same element occurs twice).
  • It is (at least practically, maybe theoretically) impossible to support xs:any. What happens if the current particle is xs:any and the input element is not acceptable to it? If the parent is xs:sequence, the element must still be accepted somehow. If we insert an xs:element in front of the xs:any, its occurrence must be minOccurs="0" (since it did not occur previously); but then the xs:any would have to be modified to exclude the element's namespace (to avoid violating Unique Particle Attribution, described in XML Schema Structures section 3.8.6), which is unacceptable (there might be instances allowed by that wildcard). If on the contrary we extend the xs:any to allow the instance, that might also cause a Unique Particle Attribution violation with respect to the sibling particles.
  • Supporting xs:choice itself is even more complicated (the descendants of the xs:choice would have to be tracked), and Unique Particle Attribution violations happen very easily.

... thus, it is realistic to support only two patterns described above.

("A sequence that contains a choice ..." looks roundabout, and I wonder why the MS developers designed it that way, but it was probably just the easiest way for them.)

Particle inference progress

The two patterns described above also determine how the inference progresses (only because it is easy). The basic approach is to start with (A), simple sequence inference, and once that becomes impossible, to switch to (B), sequence-of-choice inference. The (B) form is the final form and cannot be expanded any further.

In case (B), the engine first looks for a matching xs:element in the xs:choice. If one exists, that xs:element is expanded to allow the current instance element (a new xs:element cannot be defined; that would violate Unique Particle Attribution). If none exists, a new xs:element is added to the xs:choice.

Case (A) is more complicated. Since xs:sequence content must appear in order, the "current particle position" during inference matters, as does an occurrence flag that indicates whether the current particle has already appeared.

The engine looks for a matching xs:element (note that no other kind of particle is allowed) in the sequence, from the beginning up to the current particle minus one (for the first item this search is empty). If such an xs:element definition is found, the sequence can no longer stand. This is because xs:elements with the same qualified name must have the same type definition (according to XML Schema Structures 3.8.6), so if inference produced a different element definition, the schema would become invalid. ("Theoretically" we could keep a backup of the XmlSchemaSet, store the XmlReader content into an XmlDocument, use an XmlNodeReader instead, and roll back when it became invalid, but that is too messy.)

If there is no such xs:element, we compare the instance element's name with the "current particle" name.

  • If they match, the instance is inferred to be valid against the current particle. In that case, if the occurrence flag is OFF, it is set to ON; if it is already ON, maxOccurs="unbounded" is set. The current particle is then extended to accept the instance element.
  • If the input QName does not match and the occurrence flag is ON, then the current particle was already inferred successfully, so move to the next sequence item and continue the comparison (if there are no more particles, create a new element definition). If the flag is OFF, the last particle did not appear. There are two possible behaviors at that point: MS.NET apparently sets minOccurs="0" (optional) on the xs:element particle, while Mono simply regards the sequence inference as failed and switches to pattern (B) above.

If the input is exhausted and there are still particles remaining in the sequence, those remaining items become optional, i.e. minOccurs="0".
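
For example, feeding an instance in which an earlier element name recurs out of order forces the switch from pattern (A) to pattern (B). (A sketch; TypeInference is assumed Relaxed here so that everything becomes xs:string.)

Instance:

<root>
  <a>1</a>
  <b>2</b>
  <a>3</a>
</root>

Result (sketch):

<xs:element name="root">
  <xs:complexType>
    <xs:sequence>
      <xs:choice maxOccurs="unbounded">
        <xs:element name="a" type="xs:string" />
        <xs:element name="b" type="xs:string" />
      </xs:choice>
    </xs:sequence>
  </xs:complexType>
</xs:element>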

Element particle matching

When finding an element by QName, the name to match against might be XmlSchemaElement.RefName instead of XmlSchemaElement.Name, for elements that have a different target namespace. This means the current target namespace (the target namespace of the XmlSchema containing the XmlSchemaComplexType that contains the XmlSchemaSequence) must always be tracked. Since the XmlSchemaElement items might not be compiled, their QualifiedName property is not ready to be used.

Text data type (predefined type) inference

The final part is predefined type inference (XML Schema Datatypes). Well, I have already described most of what we need to pay attention to.

MS XmlSchemaInference rejects non-predefined simple types. There is no case where simple restriction, list or union types are inferred (the same holds for Mono XmlSchemaInference).

Thus, all XML Schema facets are useless here.

If all datatypes fail to validate the value, then the type becomes xs:string. The same applies during "merged inference": an inference that examines whether the existing predefined type allows the input and, if it does not, walks up the base types and continues validating.
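
For example, merged inference could proceed like this. (A hypothetical sketch; the exact resulting type depends on how the implementation walks the built-in type hierarchy.) xs:int rejects "3.14", so the engine walks up the base types (xs:int, xs:long, xs:integer, xs:decimal) until one accepts the value:

Instance:

<foo>3.14</foo>

Existing schema fragment:

<xs:element name='foo' type='xs:int' />

Result (sketch):

<xs:element name='foo' type='xs:decimal' />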

In Microsoft XmlSchemaInference, numeric inference is done very strictly. I think it would still be useful if it handled only integer, decimal and double, but the MS implementation starts from unsignedByte, byte, short... and float. If the instances never show a negative value, the resulting predefined type is always non-negative. I wonder how that makes sense.

By the way, the obsolete MS XSDInference had a bug in which the value texts "0" and "1" were inferred as xs:boolean. That is fixed in XmlSchemaInference. The reason such an inference is bad: if the next value is "2", it is not acceptable for xs:boolean, so the type must become xs:string (it cannot become xs:byte, because earlier instances might contain "true" or "false", which xs:boolean accepts while xs:byte does not).

Other than numeric types, xs:boolean, xs:dateTime and xs:duration could be inferred.

Types derived from xs:string are not inferred. No matter how valid the string values are for xs:token, it is not inferred; inferring such derived types makes little sense. Also, xs:anyURI and xs:QName are not inferred (the latter especially is quite impractical, since it depends on the in-scope namespaces).

That's all

XML Schema inference might sound like a difficult thesis, but it can be simple as long as we keep the rules simple. What we actually need is an explicit guideline on what we can expect these inference engines to produce.