XML Schema: SAPI

This schema describes the SAPI 5.0 TTS XML grammar format. Because the SAPI TTS XML schema is built into the TTS XML parser, you do not need to include the schema in the XML file when authoring a grammar.

NOTE: This schema is based on the Microsoft schema language and is not fully W3C compliant. It will be rewritten to comply with the W3C standard once that standard has been approved by the W3C.

This schema describes the following elements and attributes:

Elements

<BOOKMARK>
<CONTEXT>
<EMPH>
<LANG>
<PARTOFSP>
<PITCH>
<PRON>
<RATE>
<SAPI> (document element)
<SILENCE>
<SPELL>
<VOICE>
<VOLUME>

Attributes

ABSMIDDLE
ABSSPEED
ID
LANGID
LEVEL
MARK
MIDDLE
MSEC
OPTIONAL
PART
REQUIRED
SPEED
SYM

Element-specific Attributes

(none)

Document conventions:

  • [] - optional
  • []* - zero or more times
  • + - one or more times

SAPI Elements

<BOOKMARK>

Inserts a bookmark into the input stream. If an application has registered interest in bookmark events, it receives an event when synthesis has passed this element in the input stream. If the audio output destination supports event handling, the application receives the event once the synthesized speech up to the bookmark has been output. Otherwise, the application receives the event when the voice implementation has synthesized speech up to the bookmark.
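To illustrate where bookmark events fall relative to the surrounding text, here is a minimal Python sketch. It parses a sample fragment with the standard `xml.etree.ElementTree` module and records, for each BOOKMARK, how many characters of text precede it. The sample text, the function name `bookmark_offsets`, and the use of character offsets as a stand-in for synthesis progress are assumptions made for illustration; they are not part of SAPI.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample fragment using the element and attribute names
# from this schema.
SAMPLE = '<SAPI>Hello <BOOKMARK MARK="1"/>world <BOOKMARK MARK="2"/>again.</SAPI>'


def bookmark_offsets(xml_text):
    """Return the plain text and a list of (mark, characters_before_mark)."""
    root = ET.fromstring(xml_text)
    spoken = root.text or ""
    marks = []
    for child in root:
        if child.tag == "BOOKMARK":
            # The bookmark is reached once all preceding text is processed.
            marks.append((int(child.get("MARK")), len(spoken)))
        else:
            # Other elements contribute their scoped text.
            spoken += "".join(child.itertext())
        # Text that follows the child element, up to the next element.
        spoken += child.tail or ""
    return spoken, marks


if __name__ == "__main__":
    text, marks = bookmark_offsets(SAMPLE)
    print(text)
    print(marks)
```

For the sample fragment, bookmark 1 falls after the 6 characters of "Hello " and bookmark 2 after 12 characters.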

syntax:

<BOOKMARK
    MARK = int
/>

content:

empty

order:

many (default)

parents:

SAPI

children:

(none)

attributes:

MARK

model:

closed

source:

<ElementType name="BOOKMARK" content="empty" model="closed">
         <description>Inserts a bookmark into the input stream using the bookmark element. If an application specifies interest in bookmark events, it will receive an event when synthesis has passed this element in an input stream. If the audio output destination supports handling of events, then an application will receive this event once the synthesized speech up to this bookmark has been output. Otherwise, an application receives a bookmark event when the voice implementation has synthesized speech up to this bookmark. </description>
         <attribute type="MARK"/>
</ElementType>

<CONTEXT>

The context can specify the type of normalization rules which should be applied to the scoped text. SAPI does not guarantee any predefined contexts.

syntax:

<CONTEXT
    ID = string
>
    mixed content
</CONTEXT>

content:

mixed

order:

many (default)

parents:

SAPI

children:

(none)

attributes:

ID

model:

closed

source:

<ElementType name="CONTEXT" content="mixed" model="closed">
         <description>The context can specify the type of normalization rules which should be applied to the scoped text. SAPI does not guarantee any predefined contexts. </description>
         <attribute type="ID"/>
</ElementType>

<EMPH>

Places emphasis on the words contained by this element.

syntax:

<EMPH />

content:

empty

order:

many (default)

parents:

SAPI

children:

(none)

attributes:

(none)

model:

closed

source:

<ElementType name="EMPH" content="empty" model="closed">
         <description>Places emphasis on the words contained by this element. </description>
</ElementType>

<LANG>

Changes the LANGID of the scoped text. When the LANGID changes, SAPI tries to detect whether the current voice can handle the new language. If the voice does not speak the specified language, the engine must choose another language that it speaks as a best attempt. If this fallback is not desirable, it can be prevented by using the VOICE tag with the REQUIRED attribute.

syntax:

<LANG
    LANGID = int
>
    mixed content
</LANG>

content:

mixed

order:

many (default)

parents:

SAPI

children:

(none)

attributes:

LANGID

model:

closed

source:

<ElementType name="LANG" content="mixed" model="closed">
         <description>Changes the LANGID of the scoped text. When the LANGID is changed, SAPI will try to detect if the current voice can handle the new language. If voice does not speak the specified language, then an engine must choose another language it speaks as a best attempt. Using the VOICE tag and REQUIRED attribute, this fall back path can be prevented if not desirable. 
</description>
         <attribute type="LANGID"/>
</ElementType>

<PARTOFSP>

The part of speech of contained word(s). The PARTOFSP tag is used to force a particular pronunciation of a word (for example, the word record as a noun versus the word record as a verb).

syntax:

<PARTOFSP
    PART = enumeration: noun|verb|modifier|function|interjection|unknown
>
    mixed content
</PARTOFSP>

content:

mixed

order:

many (default)

parents:

SAPI

children:

(none)

attributes:

PART

model:

closed

source:

<ElementType name="PARTOFSP" content="mixed" model="closed">
         <description>The part of speech of contained word(s). The PARTOFSP tag is used to force a particular pronunciation of a word (for example, the word record as a noun versus the word record as a verb). </description>
         <attribute type="PART"/>
</ElementType>

<PITCH>

The scoped/global element PITCH modifies the underlying pitch value of a speech block. Relative attribute values, those preceded by a minus sign (-) or a plus sign (+), adjust the underlying value by the specified amount. SAPI-compliant engines may support only the guaranteed range of values, treating adjustments below -10 as -10 and adjustments above +10 as +10.
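The relative adjustment and optional clipping can be sketched in a few lines of Python. The function name `adjust_pitch` and the `clip` flag are hypothetical, and the sketch assumes that clipping applies to the resulting value rather than to the individual adjustment; an engine that supports a wider range would simply skip the clipping step.

```python
GUARANTEED_MIN = -10
GUARANTEED_MAX = 10


def adjust_pitch(current, delta, clip=True):
    """Apply a relative pitch adjustment to the current pitch value.

    With clip=True the result is limited to the guaranteed range,
    modelling an engine that supports only -10..+10 (an assumption).
    """
    value = current + delta
    if clip:
        value = max(GUARANTEED_MIN, min(GUARANTEED_MAX, value))
    return value


if __name__ == "__main__":
    print(adjust_pitch(0, 5))
    print(adjust_pitch(8, 5))
    print(adjust_pitch(8, 5, clip=False))
```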

syntax:

<PITCH
    [ ABSMIDDLE = int ]
    MIDDLE = int
>
    mixed content
</PITCH>

content:

mixed

order:

many (default)

parents:

SAPI

children:

(none)

attributes:

ABSMIDDLE, MIDDLE

model:

closed

source:

<ElementType name="PITCH" content="mixed" model="closed">
         <description>The scoped/global element PITCH modifies the underlying numerical values of a speech block. Relative attribute values, those preceded by a dash (-) or a plus sign (+), increment the underlying numerical value by the specified amount. SAPI compliant engines have the option of supporting only the guaranteed range of values and behaving as -10 for adjustments below -10 and behaving as +10 for values above +10.</description>
         <attribute type="MIDDLE"/>
         <attribute type="ABSMIDDLE"/>
</ElementType>

<PRON>

Pronounces the contained text (possibly empty) according to the provided Unicode string.

syntax:

<PRON
    SYM = char
>
    mixed content
</PRON>

content:

mixed

order:

many (default)

parents:

SAPI

children:

(none)

attributes:

SYM

model:

open

source:

<ElementType name="PRON" content="mixed" model="open">
         <description>Pronounces the contained text (possibly empty) according to the provided Unicode string. 
         </description>
         <attribute type="SYM"/>
</ElementType>

<RATE>

Sets the relative speed adjustment at which words are synthesized.

syntax:

<RATE
    [ ABSSPEED = int ]
    [ SPEED = int ]
>
    mixed content
</RATE>

content:

mixed

order:

many (default)

parents:

SAPI

children:

(none)

attributes:

ABSSPEED, SPEED

model:

closed

source:

<ElementType name="RATE" content="mixed" model="closed">
         <description>Set the relative speed adjustment at which words are synthesized.</description>
         <attribute type="SPEED"/>
         <attribute type="ABSSPEED"/>
</ElementType>

<SAPI>

At the beginning of the SAPI tag, the state of the voice is the same state as the insertion point of the SAPI tag. At the close of the SAPI tag, the voice returns to the same state as that of the insertion point. SAPI tags may be nested. When a nested SAPI tag is closed, the voice state returns to what it was at the insertion point of the nested tag.
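The save-and-restore behavior of nested SAPI tags maps naturally onto a stack. The following Python sketch is one way to model it; the `VoiceState` fields and the `VoiceStateStack` class are hypothetical and cover only a small subset of what a real voice tracks.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class VoiceState:
    # Hypothetical subset of voice state, for illustration only.
    rate: int = 0
    pitch: int = 0
    volume: int = 100


class VoiceStateStack:
    def __init__(self) -> None:
        self.current = VoiceState()
        self._saved: List[VoiceState] = []

    def open_sapi(self) -> None:
        # Remember the state at the insertion point of the SAPI tag.
        self._saved.append(self.current)

    def close_sapi(self) -> None:
        # Return to the state saved by the matching opening tag.
        self.current = self._saved.pop()

    def change(self, **fields: int) -> None:
        # Any state change made inside the current scope.
        self.current = replace(self.current, **fields)


if __name__ == "__main__":
    stack = VoiceStateStack()
    stack.open_sapi()
    stack.change(rate=5)
    stack.open_sapi()
    stack.change(volume=50)
    stack.close_sapi()
    print(stack.current)  # inner change undone, rate change still active
    stack.close_sapi()
    print(stack.current)  # back to the state before the outer tag
```

Each closing tag pops exactly one saved state, so changes made inside a nested tag never leak into the enclosing scope.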

syntax:

<SAPI>
    (many)
    <BOOKMARK>
    <SILENCE>
    <EMPH>
    <SPELL>
    <PARTOFSP>
    <PRON>
    <LANG>
    <VOICE>
    <RATE>
    <VOLUME>
    <PITCH>
    <CONTEXT>
    mixed content
</SAPI>

content:

mixed

order:

many (default)

parents:

No parents found. This is probably the document element.

children:

BOOKMARK, CONTEXT, EMPH, LANG, PARTOFSP, PITCH, PRON, RATE, SILENCE, SPELL, VOICE, VOLUME

attributes:

(none)

model:

open

source:

<ElementType name="SAPI" content="mixed" model="open">
         <description>At the beginning of the SAPI tag, the state of the voice is the same state as the insertion point of the SAPI tag. At the close of the SAPI tag, the voice returns to the same state as that of the insertion point. SAPI tags may be nested. When a nested SAPI tag is closed, the voice state returns to what it was at the insertion point of the nested tag. </description>
         <element type="BOOKMARK"/>
         <element type="SILENCE"/>
         <element type="EMPH">
                 <description> Place emphasis on the words contained by this element. It is up to the engine implementation to design what emphasis is for the engine. </description>
         </element>
         <element type="SPELL">
                 <description>Spell out words letter by letter contained by this element. NOTE: The engine should not normalize the text scoped in the SPELL tag.  This includes numbers, words, etc. Words which contain punctuation, such as “U.S.A” should spell out the letters as well as the punctuation scoped within the tag. </description>
         </element>
         <element type="PARTOFSP"/>
         <element type="PRON">
                 <description>String representing a phoneme for a language supported by the voice implementing synthesized speech. </description>
         </element>
         <element type="LANG"/>
         <element type="VOICE"/>
         <element type="RATE"/>
         <element type="VOLUME">
                 <description>0 to 100 (no overflow allowed)</description>
         </element>
         <element type="PITCH">
                 <description>Set the relative pitch adjustment of synthesized speech.</description>
         </element>
         <element type="CONTEXT"/>
</ElementType>

<SILENCE>

Writes a specified number of milliseconds of silence to the output audio stream.

syntax:

<SILENCE
    MSEC = int
/>

content:

empty

order:

many (default)

parents:

SAPI

children:

(none)

attributes:

MSEC

model:

closed

source:

<ElementType name="SILENCE" content="empty" model="closed">
         <description>Produces silence for a specified number of milliseconds to the output audio stream. </description>
         <attribute type="MSEC"/>
</ElementType>

<SPELL>

Spells out, letter by letter, the words contained by this element. Note: The engine should not normalize the text scoped in the SPELL tag; this applies to numbers, words, and so on. For words that contain punctuation, such as "U.S.A.", the engine should spell out the punctuation as well as the letters scoped within the tag.
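As a small illustration of spelling without normalization, the Python sketch below splits the scoped text into the individual characters an engine would speak, keeping digits and punctuation as they are. The function name `spell_out` is hypothetical, and skipping whitespace is an assumption; the schema does not say how whitespace is treated.

```python
def spell_out(text):
    """Return the characters to be spoken one at a time.

    Digits and punctuation are kept as-is (no normalization);
    whitespace is skipped, which is an assumption of this sketch.
    """
    return [ch for ch in text if not ch.isspace()]


if __name__ == "__main__":
    print(spell_out("U.S.A."))
    print(spell_out("42 ok"))
```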

syntax:

<SPELL />

content:

empty

order:

many (default)

parents:

SAPI

children:

(none)

attributes:

(none)

model:

closed

source:

<ElementType name="SPELL" content="empty" model="closed">
         <description>Spells out words letter by letter contained by this element. 
Note:  The engine should not normalize the text scoped in the SPELL tag. This includes numbers, words, etc. Words that contain punctuation, such as "U.S.A." should spell out the letters as well as the punctuation scoped within the tag. </description>
</ElementType>

<VOICE>

Sets which voice implementation is used for synthesis of associated input stream text. The best voice implementation given the required and optional attributes will be selected by SAPI.

syntax:

<VOICE
    [ OPTIONAL = string ]
    [ REQUIRED = string ]
>
    mixed content
</VOICE>

content:

mixed

order:

many (default)

parents:

SAPI

children:

(none)

attributes:

OPTIONAL, REQUIRED

model:

closed

source:

<ElementType name="VOICE" content="mixed" model="closed">
         <description>Sets which voice implementation is used for synthesis of associated input stream text. The best voice implementation given the required and optional attributes will be selected by SAPI. </description>
         <attribute type="REQUIRED"/>
         <attribute type="OPTIONAL"/>
</ElementType>

<VOLUME>

The scoped/global element VOLUME modifies the underlying volume value of a speech block. The underlying value can never be below zero or above 100: negative values result in zero, and values above 100 result in 100. VOLUME may also receive an absolute value (no '-' or '+' character), an integer between zero and 100.

syntax:

<VOLUME
    LEVEL = int
>
    mixed content
</VOLUME>

content:

mixed

order:

many (default)

parents:

SAPI

children:

(none)

attributes:

LEVEL

model:

closed

source:

<ElementType name="VOLUME" content="mixed" model="closed">
         <description>The scoped/global elements VOLUME modify the underlying numerical values of a speech block. The underlying value can never be below zero or exceed 100. All negative value entries will result in zero and all values above 100 will result in 100. VOLUME may also receive an absolute value (no '-' or '+' character) of an integer between zero and 100. </description>
         <attribute type="LEVEL"/>
</ElementType>

SAPI Attributes

<... ABSMIDDLE="">

The value can range from –10 to +10. A value of 0 sets a voice to speak at its default pitch. A value of –10 sets a voice to speak at three-fourths (3/4) of its default pitch. A value of +10 sets a voice to speak at four-thirds (4/3) of its default pitch. Values between –10 and +10 are logarithmically distributed, so that incrementing or decrementing by 1 multiplies or divides the pitch by the 24th root of 2 (about 1.03). Values more extreme than –10 and +10 are passed to the engine, but SAPI 5-compliant engines may not support such extremes and may instead clip the pitch to the maximum or minimum they support. Values of –24 and +24 must lower and raise the pitch by one octave, respectively, and every increment or decrement of 1 must multiply or divide the pitch by the 24th root of 2. When scoped, this attribute is absolute.
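The arithmetic behind these figures is easy to check. In the Python sketch below, the hypothetical helper `pitch_factor` converts a pitch value into a multiplier on the default pitch using the 24th-root-of-2 step described above.

```python
def pitch_factor(value):
    """Multiplier applied to the default pitch for a given pitch value.

    Each step of 1 multiplies the pitch by 2 ** (1 / 24), so 24 steps
    correspond to one octave.
    """
    return 2.0 ** (value / 24.0)


if __name__ == "__main__":
    for value in (-24, -10, 0, 1, 10, 24):
        print(value, round(pitch_factor(value), 4))
```

The results for –10 and +10 come out close to, but not exactly, 3/4 and 4/3, which is why the text gives those fractions as approximations of the endpoints.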

syntax:

[ ABSMIDDLE = int ]

required:

no (default)

datatype:

int

elements:

PITCH

source:

<AttributeType name="ABSMIDDLE" dt:type="int">
         <description> The value can range from –10 to +10. A value of 0 sets a voice to speak at its default pitch. A value of –10 sets a voice to speak at three-fourths (or 3/4) of its default pitch. A value of +10 sets a voice to speak at four-thirds (or 4/3) of its default pitch. Each increment between –10 and +10 is logarithmically distributed such that incrementing/decrementing by 1 is multiplying/dividing the pitch by the 24th root of 2 (about 1.03). Values more extreme than –10 and 10 will be passed to an engine but SAPI 5-compliant engines may not support such extremes and instead may clip the pitch to the maximum or minimum pitch it supports. Values of –24 and +24 must lower and raise pitch by 1 octave respectively. All incrementing/decrementing by 1 must multiply/divide the pitch by the 24th root of 2. When scoped, this attribute is absolute.</description>
</AttributeType>

<... ABSSPEED="">

The value can range from –10 to +10. A value of 0 sets a voice to speak at its default rate. A value of –10 sets a voice to speak at one-third (1/3) of its default rate. A value of +10 sets a voice to speak at 3 times its default rate. Values between –10 and +10 are logarithmically distributed, so that incrementing or decrementing by 1 multiplies or divides the rate by the 10th root of 3 (about 1.12). Values more extreme than –10 and +10 are passed to the engine, but SAPI 5-compliant engines may not support such extremes and may instead clip the rate to the maximum or minimum they support. When scoped, this attribute is absolute.
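To see what these values mean in practice, the Python sketch below turns a rate value into a words-per-minute figure. The helper names `rate_factor` and `scaled_rate` are hypothetical, and the default of 180 words per minute is an arbitrary example; each voice defines its own default rate.

```python
def rate_factor(value):
    """Multiplier applied to the default rate for a given rate value.

    Each step of 1 multiplies the rate by 3 ** (1 / 10), so 10 steps
    triple the rate.
    """
    return 3.0 ** (value / 10.0)


def scaled_rate(default_wpm, value):
    """Speaking rate in words per minute for a given rate value."""
    return default_wpm * rate_factor(value)


if __name__ == "__main__":
    # 180 words per minute is an example default, not a SAPI constant.
    for value in (-10, -5, 0, 5, 10):
        print(value, round(scaled_rate(180.0, value), 1))
```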

syntax:

[ ABSSPEED = int ]

required:

no (default)

datatype:

int

elements:

RATE

source:

<AttributeType name="ABSSPEED" dt:type="int">
         <description>The value can range from –10 to +10. A value of 0 sets a voice to speak at its default rate. A value of –10 sets a voice to speak at one-third (or 1/3) of its default rate. A value of +10 sets a voice to speak at 3 times its default rate. Each increment between –10 and +10 is logarithmically distributed such that incrementing/decrementing by 1 is multiplying/dividing the rate by the 10th root of 3 (about 1.12). Values more extreme than –10 and +10 will be passed to an engine, but SAPI 5-compliant engines may not support such extremes and instead may clip the rate to the maximum or minimum rate it supports. When scoped, this attribute is absolute.</description>
</AttributeType>

<... ID="">

This specifies the type of context. Refer to the SAPI documentation for the various context IDs.

syntax:

ID = string

required:

yes

datatype:

string

elements:

CONTEXT

source:

<AttributeType name="ID" dt:type="string" required="yes">
         <description>This specifies the type of context. Refer to the SAPI documentation for the various context IDs.</description>
</AttributeType>

<... LANGID="">

Language identifier. The language identifier is specified as a hexadecimal value. For example, the LANGID for English (US) expressed in the hexadecimal form is 409.
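Because the value is written in hexadecimal, it has to be parsed as base 16 before it can be compared with a numeric language identifier. The Python sketch below does that and also splits the result into its primary-language and sublanguage parts, following the standard Windows LANGID layout (low 10 bits and high 6 bits). The function names are hypothetical.

```python
def parse_langid(text):
    """Parse a LANGID attribute value written in hexadecimal, e.g. '409'."""
    return int(text, 16)


def split_langid(langid):
    """Split a Windows LANGID into (primary language, sublanguage)."""
    primary = langid & 0x3FF   # low 10 bits
    sublang = langid >> 10     # remaining high bits
    return primary, sublang


if __name__ == "__main__":
    langid = parse_langid("409")
    print(langid)
    print(split_langid(langid))
```

For English (US), "409" parses to 1033 in decimal, with primary language 0x09 (English) and sublanguage 0x01 (US).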

syntax:

LANGID = int

required:

yes

datatype:

int

elements:

LANG

source:

<AttributeType name="LANGID" dt:type="int" required="yes">
         <description>Language identifier. The language identifier is specified as a hexadecimal value. For example, the LANGID for English (US) expressed in the hexadecimal form is 409. </description>
</AttributeType>

<... LEVEL="">

This specifies the volume as a percentage of the maximum volume of the current voice. Each voice implementation has its own maximum volume. The value must be between 0 and 100, inclusive. Values above 100 or below 0 are clipped to 100 and 0, respectively.
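A minimal Python sketch of this rule follows. The function name `effective_volume` and the `voice_max` parameter (the voice's own maximum volume, in whatever units the engine uses) are hypothetical.

```python
def effective_volume(level, voice_max):
    """Volume produced for a LEVEL value, as a share of the voice maximum.

    LEVEL is clipped to 0..100 and then interpreted as a percentage of
    voice_max, a hypothetical per-voice maximum.
    """
    clipped = max(0, min(100, level))
    return voice_max * clipped / 100.0


if __name__ == "__main__":
    print(effective_volume(50, 0.8))
    print(effective_volume(150, 0.8))
    print(effective_volume(-5, 0.8))
```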

syntax:

LEVEL = int

required:

yes

datatype:

int

elements:

VOLUME

source:

<AttributeType name="LEVEL" dt:type="int" required="yes">
         <description> This specifies the volume as percent of the maximum volume of the current voice. Each voice implementation has its own maximum volume. This value must be between 0 and 100 inclusive. Values above 100 or below 0 are clipped to 100 and 0 respectively.</description>
</AttributeType>

<... MARK="">

The value of a bookmark may be any string or integer.

syntax:

MARK = int

required:

yes

datatype:

int

elements:

BOOKMARK

source:

<AttributeType name="MARK" dt:type="int" required="yes">
         <description>The value of a bookmark may be any string or integer. </description>
</AttributeType>

<... MIDDLE="">

The value can range from –10 to +10. A value of 0 sets a voice to speak at its default pitch. A value of –10 sets a voice to speak at three-fourths (3/4) of its default pitch. A value of +10 sets a voice to speak at four-thirds (4/3) of its default pitch. Values between –10 and +10 are logarithmically distributed, so that incrementing or decrementing by 1 multiplies or divides the pitch by the 24th root of 2 (about 1.03). Values more extreme than –10 and +10 are passed to the engine, but SAPI 5-compliant engines may not support such extremes and may instead clip the pitch to the maximum or minimum they support. Values of –24 and +24 must lower and raise the pitch by one octave, respectively, and every increment or decrement of 1 must multiply or divide the pitch by the 24th root of 2. When scoped, this attribute is relative.

syntax:

MIDDLE = int

required:

yes

datatype:

int

elements:

PITCH

source:

<AttributeType name="MIDDLE" dt:type="int" required="yes">
         <description>The value can range from –10 to +10. A value of 0 sets a voice to speak at its default pitch. A value of –10 sets a voice to speak at three-fourths (or 3/4) of its default pitch. A value of +10 sets a voice to speak at four-thirds (or 4/3) of its default pitch. Each increment between –10 and +10 is logarithmically distributed such that incrementing/decrementing by 1 is multiplying/dividing the pitch by the 24th root of 2 (about 1.03). Values more extreme than –10 and 10 will be passed to an engine but SAPI 5-compliant engines may not support such extremes and instead may clip the pitch to the maximum or minimum pitch it supports. Values of –24 and +24 must lower and raise pitch by 1 octave respectively. All incrementing/decrementing by 1 must multiply/divide the pitch by the 24th root of 2. When scoped, this attribute is relative.</description>
</AttributeType>

<... MSEC="">

Number of milliseconds, from zero to 65535, of silence. Value entries that exceed this range should be limited to 65535. Value entries that are below this range (negative values) should be set to zero.
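The range limits can be combined with a simple silence generator, as in the Python sketch below. It assumes mono signed PCM audio, where silence is all-zero samples; the sample rate, sample width, and function names are hypothetical choices for illustration and do not come from the schema.

```python
MAX_MSEC = 65535


def clamp_msec(msec):
    """Limit an MSEC value to the range 0..65535."""
    return max(0, min(MAX_MSEC, msec))


def silence_bytes(msec, sample_rate=22050, sample_width=2):
    """Return zero-filled PCM data lasting the clamped number of milliseconds.

    Assumes mono signed PCM; sample_width is in bytes per sample.
    """
    frames = sample_rate * clamp_msec(msec) // 1000
    return bytes(frames * sample_width)


if __name__ == "__main__":
    print(len(silence_bytes(1000)))
    print(len(silence_bytes(-5)))
    print(clamp_msec(100000))
```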

syntax:

MSEC = int

required:

yes

datatype:

int

elements:

SILENCE

source:

<AttributeType name="MSEC" dt:type="int" required="yes">
         <description>Number of milliseconds, from zero to 65535, of silence. Value entries that exceed this range should be limited to 65535. Value entries that are below this range (negative values) should be set to zero. </description>
</AttributeType>

<... OPTIONAL="">

The XML parser selects the first voice registered containing all of the specified attributes. A string that contains semicolon-delimited sub-strings is used to specify the attributes. The speak call will fail if the parser cannot find the required tags.

syntax:

[ OPTIONAL = string ]

required:

no (default)

datatype:

string

elements:

VOICE

source:

<AttributeType name="OPTIONAL" dt:type="string">
         <description>The XML parser selects the first voice registered containing all of the specified attributes. A string that contains semicolon-delimited sub-strings is used to specify the attributes. The speak call will fail if the parser cannot find the required tags. 
</description>
</AttributeType>

<... PART="">

String name of the part of speech. Valid SAPI parts of speech are noun, verb, modifier, function, interjection, and unknown.

syntax:

PART = enumeration: noun|verb|modifier|function|interjection|unknown

required:

yes

datatype:

enumeration

values:

noun|verb|modifier|function|interjection|unknown

elements:

PARTOFSP

source:

<AttributeType name="PART" dt:type="enumeration" dt:values="noun|verb|modifier|function|interjection|unknown" required="yes">
         <description> String name of part of speech. Valid SAPI parts of speech are noun, verb, modifier, function, interjection and unknown. </description>
</AttributeType>

<... REQUIRED="">

The XML parser selects the first voice registered containing all of the specified attributes. A string that contains semicolon-delimited sub-strings is used to specify the attributes. The speak call will fail if the parser cannot find the required tags.
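The selection rule can be sketched in Python as follows. The `Name=Value` form of each sub-string, the attribute names, and the `REGISTERED_VOICES` list are assumptions made for illustration; the schema itself only states that the string is semicolon-delimited. The sketch raises `LookupError` to stand in for the failing speak call.

```python
from typing import Dict, List


def parse_attributes(spec: str) -> Dict[str, str]:
    """Split a semicolon-delimited attribute string into name/value pairs."""
    pairs: Dict[str, str] = {}
    for part in spec.split(";"):
        part = part.strip()
        if not part:
            continue  # tolerate empty segments such as a trailing ';'
        name, _, value = part.partition("=")
        pairs[name.strip()] = value.strip()
    return pairs


def select_voice(voices: List[Dict[str, str]], required: str) -> Dict[str, str]:
    """Return the first registered voice that has every required attribute."""
    wanted = parse_attributes(required)
    for voice in voices:
        if all(voice.get(name) == value for name, value in wanted.items()):
            return voice
    raise LookupError("no registered voice matches: " + required)


# Hypothetical registry, listed in registration order.
REGISTERED_VOICES = [
    {"Name": "VoiceA", "Gender": "Male", "Language": "409"},
    {"Name": "VoiceB", "Gender": "Female", "Language": "409"},
    {"Name": "VoiceC", "Gender": "Female", "Language": "411"},
]

if __name__ == "__main__":
    print(select_voice(REGISTERED_VOICES, "Gender=Female;Language=409")["Name"])
```

Because the search stops at the first match, registration order decides which voice is chosen when several satisfy the same requirements.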

syntax:

[ REQUIRED = string ]

required:

no (default)

datatype:

string

elements:

VOICE

source:

<AttributeType name="REQUIRED" dt:type="string">
         <description>The XML parser selects the first voice registered containing all of the specified attributes. A string that contains semicolon-delimited sub-strings is used to specify the attributes. The speak call will fail if the parser cannot find the required tags. 
</description>
</AttributeType>

<... SPEED="">

The value can range from –10 to +10. A value of 0 sets a voice to speak at its default rate. A value of –10 sets a voice to speak at one-third (1/3) of its default rate. A value of +10 sets a voice to speak at 3 times its default rate. Values between –10 and +10 are logarithmically distributed, so that incrementing or decrementing by 1 multiplies or divides the rate by the 10th root of 3 (about 1.12). Values more extreme than –10 and +10 are passed to the engine, but SAPI 5-compliant engines may not support such extremes and may instead clip the rate to the maximum or minimum they support. When scoped, this attribute is relative.

syntax:

[ SPEED = int ]

required:

no (default)

datatype:

int

elements:

RATE

source:

<AttributeType name="SPEED" dt:type="int">
         <description>The value can range from –10 to +10. A value of 0 sets a voice to speak at its default rate. A value of –10 sets a voice to speak at one-third (or 1/3) of its default rate. A value of +10 sets a voice to speak at 3 times its default rate. Each increment between –10 and +10 is logarithmically distributed such that incrementing/decrementing by 1 is multiplying/dividing the rate by the 10th root of 3 (about 1.12). Values more extreme than –10 and +10 will be passed to an engine, but SAPI 5-compliant engines may not support such extremes and instead may clip the rate to the maximum or minimum rate it supports. When scoped, this attribute is relative.</description>
</AttributeType>

<... SYM="">

String representing a phoneme for a language supported by the voice that is synthesizing the speech. Refer to the SAPI Phoneme Spec.

syntax:

SYM = char

required:

yes

datatype:

char

elements:

PRON

source:

<AttributeType name="SYM" dt:type="char" required="yes">
         <description>String representing a phoneme for a language supported by the voice implementing synthesizing speech. Refer to SAPI Phoneme Spec.</description>
</AttributeType>

SAPI Source

<Schema name="SAPI" xmlns="urn:schemas-microsoft-com:xml-data" xmlns:dt="urn:schemas-microsoft-com:datatypes">
         <description> This schema describes the SAPI 5.0 TTS XML grammar format. The SAPI TTS XML schema is included in the TTS XML parser. Hence, it is not necessary to include the schema in the XML file when authoring a grammar. NOTE: This schema is based on the Microsoft schema language and is not fully W3C compliant. This schema will be rewritten and will be compliant with the W3C standard once it has been approved by the W3C.</description>
         <!-- Attribute definitions -->
         <AttributeType name="ID" dt:type="string" required="yes">
                 <description>This specifies the type of context. Refer to the SAPI documentation for the various context IDs.</description>
         </AttributeType>
         <AttributeType name="SYM" dt:type="char" required="yes">
                 <description>String representing a phoneme for a language supported by the voice implementing synthesizing speech. Refer to SAPI Phoneme Spec.</description>
         </AttributeType>
         <AttributeType name="LANGID" dt:type="int" required="yes">
                 <description>Language identifier. The language identifier is specified as a hexadecimal value. For example, the LANGID for English (US) expressed in the hexadecimal form is 409. </description>
         </AttributeType>
         <AttributeType name="LEVEL" dt:type="int" required="yes">
                 <description> This specifies the volume as percent of the maximum volume of the current voice. Each voice implementation has its own maximum volume. This value must be between 0 and 100 inclusive. Values above 100 or below 0 are clipped to 100 and 0 respectively.</description>
         </AttributeType>
         <AttributeType name="MARK" dt:type="int" required="yes">
                 <description>The value of a bookmark may be any string or integer. </description>
         </AttributeType>
         <AttributeType name="MIDDLE" dt:type="int" required="yes">
                 <description>The value can range from –10 to +10. A value of 0 sets a voice to speak at its default pitch. A value of –10 sets a voice to speak at three-fourths (or 3/4) of its default pitch. A value of +10 sets a voice to speak at four-thirds (or 4/3) of its default pitch. Each increment between –10 and +10 is logarithmically distributed such that incrementing/decrementing by 1 is multiplying/dividing the pitch by the 24th root of 2 (about 1.03). Values more extreme than –10 and 10 will be passed to an engine but SAPI 5-compliant engines may not support such extremes and instead may clip the pitch to the maximum or minimum pitch it supports. Values of –24 and +24 must lower and raise pitch by 1 octave respectively. All incrementing/decrementing by 1 must multiply/divide the pitch by the 24th root of 2. When scoped, this attribute is relative.</description>
         </AttributeType>
         <AttributeType name="MSEC" dt:type="int" required="yes">
                 <description>Number of milliseconds, from zero to 65535, of silence. Value entries that exceed this range should be limited to 65535. Value entries that are below this range (negative values) should be set to zero. </description>
         </AttributeType>
         <AttributeType name="OPTIONAL" dt:type="string">
                 <description>The XML parser selects the first voice registered containing all of the specified attributes. A string that contains semicolon-delimited sub-strings is used to specify the attributes. The speak call will fail if the parser cannot find the required tags. 
</description>
         </AttributeType>
         <AttributeType name="REQUIRED" dt:type="string">
                 <description>The XML parser selects the first voice registered containing all of the specified attributes. A string that contains semicolon-delimited sub-strings is used to specify the attributes. The speak call will fail if the parser cannot find the required tags. 
</description>
         </AttributeType>
         <AttributeType name="SPEED" dt:type="int">
                 <description>The value can range from –10 to +10. A value of 0 sets a voice to speak at its default rate. A value of –10 sets a voice to speak at one-third (or 1/3) of its default rate. A value of +10 sets a voice to speak at 3 times its default rate. Each increment between –10 and +10 is logarithmically distributed such that incrementing/decrementing by 1 is multiplying/dividing the rate by the 10th root of 3 (about 1.12). Values more extreme than –10 and +10 will be passed to an engine, but SAPI 5-compliant engines may not support such extremes and instead may clip the rate to the maximum or minimum rate it supports. When scoped, this attribute is relative.</description>
         </AttributeType>
         <AttributeType name="PART" dt:type="enumeration" dt:values="noun|verb|modifier|function|interjection|unknown" required="yes">
                 <description> String name of part of speech. Valid SAPI parts of speech are noun, verb, modifier, function, interjection and unknown. </description>
         </AttributeType>
         <AttributeType name="ABSMIDDLE" dt:type="int">
                 <description> The value can range from –10 to +10. A value of 0 sets a voice to speak at its default pitch. A value of –10 sets a voice to speak at three-fourths (or 3/4) of its default pitch. A value of +10 sets a voice to speak at four-thirds (or 4/3) of its default pitch. Each increment between –10 and +10 is logarithmically distributed such that incrementing/decrementing by 1 is multiplying/dividing the pitch by the 24th root of 2 (about 1.03). Values more extreme than –10 and 10 will be passed to an engine but SAPI 5-compliant engines may not support such extremes and instead may clip the pitch to the maximum or minimum pitch it supports. Values of –24 and +24 must lower and raise pitch by 1 octave respectively. All incrementing/decrementing by 1 must multiply/divide the pitch by the 24th root of 2. When scoped, this attribute is absolute.</description>
         </AttributeType>
         <AttributeType name="ABSSPEED" dt:type="int">
                 <description>The value can range from –10 to +10. A value of 0 sets a voice to speak at its default rate. A value of –10 sets a voice to speak at one-third (or 1/3) of its default rate. A value of +10 sets a voice to speak at 3 times its default rate. Each increment between –10 and +10 is logarithmically distributed such that incrementing/decrementing by 1 is multiplying/dividing the rate by the 10th root of 3 (about 1.12). Values more extreme than –10 and +10 will be passed to an engine, but SAPI 5-compliant engines may not support such extremes and instead may clip the rate to the maximum or minimum rate it supports. When scoped, this attribute is absolute.</description>
         </AttributeType>
         <!-- Definition of SAPI Element -->
         <ElementType name="SAPI" content="mixed" model="open">
                 <description>At the beginning of the SAPI tag, the state of the voice is the same state as the insertion point of the SAPI tag. At the close of the SAPI tag, the voice returns to the same state as that of the insertion point. SAPI tags may be nested. When a nested SAPI tag is closed, the voice state returns to what it was at the insertion point of the nested tag. </description>
                 <element type="BOOKMARK"/>
                 <element type="SILENCE"/>
                 <element type="EMPH">
                          <description> Place emphasis on the words contained by this element. It is up to the engine implementation to define what emphasis means for that engine. </description>
                 </element>
                 <element type="SPELL">
                          <description>Spell out, letter by letter, the words contained by this element. NOTE: The engine should not normalize the text scoped in the SPELL tag. This includes numbers, words, etc. Words that contain punctuation, such as "U.S.A.", should be spelled out with the letters as well as the punctuation scoped within the tag. </description>
                 </element>
                 <element type="PARTOFSP"/>
                 <element type="PRON">
                          <description>String representing a phoneme for a language supported by the voice implementing synthesized speech. </description>
                 </element>
                 <element type="LANG"/>
                 <element type="VOICE"/>
                 <element type="RATE"/>
                 <element type="VOLUME">
                          <description>0 to 100 (no overflow allowed)</description>
                 </element>
                 <element type="PITCH">
                          <description>Set the relative pitch adjustment of synthesized speech.</description>
                 </element>
                 <element type="CONTEXT"/>
         </ElementType>
         <!-- Definition of elements -->
         <!-- Definition of BOOKMARK Element -->
         <ElementType name="BOOKMARK" content="empty" model="closed">
                 <description>Inserts a bookmark into the input stream using the bookmark element. If an application specifies interest in bookmark events, it will receive an event when synthesis has passed this element in an input stream. If the audio output destination supports handling of events, then an application will receive this event once the synthesized speech up to this bookmark has been output. Otherwise, an application receives a bookmark event when the voice implementation has synthesized speech up to this bookmark. </description>
                 <attribute type="MARK"/>
         </ElementType>
         <!-- Definition of SILENCE Element -->
         <ElementType name="SILENCE" content="empty" model="closed">
                 <description>Produces silence for a specified number of milliseconds to the output audio stream. </description>
                 <attribute type="MSEC"/>
         </ElementType>
         <!-- Definition of EMPH Element -->
         <ElementType name="EMPH" content="empty" model="closed">
                 <description>Places emphasis on the words contained by this element. </description>
         </ElementType>
         <!-- Definition of SPELL Element -->
         <ElementType name="SPELL" content="empty" model="closed">
                 <description>Spells out, letter by letter, the words contained by this element. 
Note:  The engine should not normalize the text scoped in the SPELL tag. This includes numbers, words, etc. Words that contain punctuation, such as "U.S.A.", should be spelled out with the letters as well as the punctuation scoped within the tag. </description>
         </ElementType>
         <!-- Definition of PARTOFSP Element -->
         <ElementType name="PARTOFSP" content="mixed" model="closed">
                 <description>The part of speech of contained word(s). The PARTOFSP tag is used to force a particular pronunciation of a word (for example, the word record as a noun versus the word record as a verb). </description>
                 <attribute type="PART"/>
         </ElementType>
         <!--Definition of PRON Element-->
         <ElementType name="PRON" content="mixed" model="open">
                 <description>Pronounces the contained text (possibly empty) according to the provided Unicode string. 
         </description>
                 <attribute type="SYM"/>
         </ElementType>
         <!-- Definition of LANG Element -->
         <ElementType name="LANG" content="mixed" model="closed">
                 <description>Changes the LANGID of the scoped text. When the LANGID is changed, SAPI will try to detect whether the current voice can handle the new language. If the voice does not speak the specified language, then an engine must choose another language it speaks as a best attempt. If this fallback path is not desirable, it can be prevented by using the VOICE tag and the REQUIRED attribute. 
</description>
                 <attribute type="LANGID"/>
         </ElementType>
         <!-- Definition of VOICE Element -->
         <ElementType name="VOICE" content="mixed" model="closed">
                 <description>Sets which voice implementation is used for synthesis of associated input stream text. The best voice implementation given the required and optional attributes will be selected by SAPI. </description>
                 <attribute type="REQUIRED"/>
                 <attribute type="OPTIONAL"/>
         </ElementType>
         <!-- Definition of RATE Element -->
         <ElementType name="RATE" content="mixed" model="closed">
                 <description>Set the relative speed adjustment at which words are synthesized.</description>
                 <attribute type="SPEED"/>
                 <attribute type="ABSSPEED"/>
         </ElementType>
         <!-- Definition of VOLUME Element -->
         <ElementType name="VOLUME" content="mixed" model="closed">
                 <description>The scoped/global element VOLUME modifies the underlying numerical values of a speech block. The underlying value can never be below zero or exceed 100. All negative value entries will result in zero and all values above 100 will result in 100. VOLUME may also receive an absolute value (no '-' or '+' character) of an integer between zero and 100. </description>
                 <attribute type="LEVEL"/>
         </ElementType>
         <!-- Definition of PITCH Element -->
         <ElementType name="PITCH" content="mixed" model="closed">
                 <description>The scoped/global element PITCH modifies the underlying numerical values of a speech block. Relative attribute values, those preceded by a dash (-) or a plus sign (+), increment the underlying numerical value by the specified amount. SAPI-compliant engines have the option of supporting only the guaranteed range of values, behaving as -10 for adjustments below -10 and as +10 for adjustments above +10.</description>
                 <attribute type="MIDDLE"/>
                 <attribute type="ABSMIDDLE"/>
         </ElementType>
         <!-- Definition of CONTEXT Element -->
         <ElementType name="CONTEXT" content="mixed" model="closed">
                 <description>The context can specify the type of normalization rules which should be applied to the scoped text. SAPI does not guarantee any predefined contexts. </description>
                 <attribute type="ID"/>
         </ElementType>
</Schema>
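
The numeric rules in the ABSMIDDLE, ABSSPEED, and VOLUME descriptions can be checked with a few lines of arithmetic. The following is a minimal Python sketch, not part of SAPI: the function names and the clip parameter are hypothetical and only restate the formulas above. Note that 2 to the power 10/24 is about 1.335, so the "four-thirds" and "three-fourths" figures in the schema are rounded.

```python
# Illustrative helpers only; these names are not part of any SAPI API.

PITCH_STEP = 2 ** (1 / 24)  # one pitch step: 24th root of 2, about 1.03
RATE_STEP = 3 ** (1 / 10)   # one rate step: 10th root of 3, about 1.12


def pitch_multiplier(absmiddle: int, clip: bool = False) -> float:
    """Factor applied to the default pitch for an ABSMIDDLE value."""
    if clip:
        # An engine may support only the guaranteed range of -10..+10.
        absmiddle = max(-10, min(10, absmiddle))
    return PITCH_STEP ** absmiddle


def rate_multiplier(absspeed: int, clip: bool = False) -> float:
    """Factor applied to the default rate for an ABSSPEED value."""
    if clip:
        absspeed = max(-10, min(10, absspeed))
    return RATE_STEP ** absspeed


def clamp_volume(level: int) -> int:
    """VOLUME never goes below 0 or above 100."""
    return max(0, min(100, level))


if __name__ == "__main__":
    for value in (-24, -10, 0, 10, 24):
        print(value, round(pitch_multiplier(value), 4))
    for value in (-10, 0, 10):
        print(value, round(rate_multiplier(value), 4))
```

With clip=True the helpers mimic an engine that supports only the guaranteed range; without it they show the value an engine would use if it honored the more extreme setting.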

Schema Attributes Reference:

open model

The element can contain elements, attributes, and text not specified in the content model. This is the default value.

closed model

The element cannot contain elements, attributes, and text except for that specified in the content model. DTDs use a closed model.

textOnly content

The element can contain only text, not elements. Note that if the model attribute is set to "open", the element can contain text and additional elements.

eltOnly content

The element can contain only the elements, not free text. Note that if the model attribute is set to "open", the element can contain text and additional elements.

empty content

The element cannot contain text or elements. Note that if the model attribute is set to "open", the element can contain text and additional elements.

mixed content

The element can contain a mix of named elements and text. This is the default value.

one order

Permits only one of a set of elements.

seq order

Requires the elements to appear in the specified sequence.

many order

Permits the elements to appear (or not appear) in any order. This is the default.
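
The difference between the open and closed models can be illustrated with a small check. This is a simplified Python sketch under stated assumptions: unexpected_children is a hypothetical helper, NOTE is a made-up element that the schema does not declare, and the check looks only at child elements (it ignores attributes, text, and order, which a real validator would also examine).

```python
import xml.etree.ElementTree as ET


def unexpected_children(element, allowed, model="open"):
    """Return the tags of child elements the content model does not permit.

    allowed: set of child element names declared in the content model.
    model:   "open" accepts undeclared children; "closed" rejects them.
    """
    if model == "open":
        return []
    return [child.tag for child in element if child.tag not in allowed]


# PARTOFSP is declared with model="closed" and no child elements,
# so the undeclared (hypothetical) NOTE element is reported.
partofsp = ET.fromstring('<PARTOFSP PART="noun">record<NOTE/></PARTOFSP>')
print(unexpected_children(partofsp, allowed=set(), model="closed"))  # ['NOTE']

# SAPI is declared with model="open", so the same child is accepted.
sapi = ET.fromstring("<SAPI><NOTE/>hello</SAPI>")
print(unexpected_children(sapi, allowed={"BOOKMARK", "SILENCE"}, model="open"))  # []
```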

Datatype Reference:

bin.base64 datatype

MIME-style Base64 encoded binary BLOB.

bin.hex datatype

Hexadecimal digits representing octets.

boolean datatype

0 or 1, where 0 == "false" and 1 == "true".

char datatype

String, one character long.

date datatype

Date in a subset of ISO 8601 format, without the time data. For example: "1994-11-05".

dateTime datatype

Date in a subset of ISO 8601 format, with optional time and no optional zone. Fractional seconds can be as precise as nanoseconds. For example, "1988-04-07T18:39:09".

dateTime.tz datatype

Date in a subset of ISO 8601 format, with optional time and optional zone. Fractional seconds can be as precise as nanoseconds. For example: "1988-04-07T18:39:09-08:00".

entity datatype

Represents the XML ENTITY type.

entities datatype

Represents the XML ENTITIES type.

enumeration datatype

Represents an enumerated type (supported on attributes only).

fixed.14.4 datatype

Same as "number" but no more than 14 digits to the left of the decimal point, and no more than 4 to the right.

float datatype

Real number, with no limit on digits; can potentially have a leading sign, fractional digits, and optionally an exponent. Punctuation as in U.S. English. Values range from 1.7976931348623157E+308 to 2.2250738585072014E-308.

id datatype

Represents the XML ID type.

idref datatype

Represents the XML IDREF type.

idrefs datatype

Represents the XML IDREFS type.

int datatype

Number, with optional sign, no fractions, and no exponent.

nmtoken datatype

Represents the XML NMTOKEN type.

nmtokens datatype

Represents the XML NMTOKENS type.

notation datatype

Represents a NOTATION type.

number datatype

Number, with no limit on digits; can potentially have a leading sign, fractional digits, and optionally an exponent. Punctuation as in U.S. English. (Values have the same range as the most significant number type, r8: 1.7976931348623157E+308 to 2.2250738585072014E-308.)

string datatype

Represents a string type.

time datatype

Time in a subset of ISO 8601 format, with no date and no time zone. For example: "08:15:27".

time.tz datatype

Time in a subset of ISO 8601 format, with no date but optional time zone. For example: "08:15:27-05:00".

i1 datatype

Integer represented in one byte. A number, with optional sign, no fractions, no exponent. For example: "1, 127, -128".

i2 datatype

Integer represented in two bytes (one word). A number, with optional sign, no fractions, no exponent. For example: "1, 703, -32768".

i4 datatype

Integer represented in four bytes. A number, with optional sign, no fractions, no exponent. For example: "1, 703, -32768, 148343, -1000000000".

r4 datatype

Real number, with seven digit precision; can potentially have a leading sign, fractional digits, and optionally an exponent. Punctuation as in U.S. English. Values range from 3.40282347E+38F to 1.17549435E-38F.

r8 datatype

Real number, with 15 digit precision; can potentially have a leading sign, fractional digits, and optionally an exponent. Punctuation as in U.S. English. Values range from 1.7976931348623157E+308 to 2.2250738585072014E-308.

ui1 datatype

Unsigned integer, one byte. A number, unsigned, no fractions, no exponent. For example: "1, 255".

ui2 datatype

Unsigned integer, two bytes. A number, unsigned, no fractions, no exponent. For example: "1, 255, 65535".

ui4 datatype

Unsigned integer, four bytes. A number, unsigned, no fractions, no exponent. For example: "1, 703, 3000000000".

uri datatype

Uniform Resource Identifier (URI). For example, "urn:schemas-microsoft-com:Office9".

uuid datatype

Hexadecimal digits representing octets, optional embedded hyphens that are ignored. For example: "333C7BC4-460F-11D0-BC04-0080C7055A83".
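
Several of the datatypes above can be checked with standard library calls. The following Python sketch is an illustration, not the validation logic of any XML parser: is_valid and INT_RANGES are hypothetical names, the integer ranges are derived from the byte widths listed above, and only a handful of datatypes are covered. Python's int() and datetime parsers are somewhat more lenient than the descriptions (for example, int() tolerates surrounding whitespace, and datetime keeps only microsecond precision), so treat the results as approximate.

```python
import datetime
import uuid

# Inclusive ranges implied by the byte widths of the integer datatypes.
INT_RANGES = {
    "i1": (-2**7, 2**7 - 1),
    "i2": (-2**15, 2**15 - 1),
    "i4": (-2**31, 2**31 - 1),
    "ui1": (0, 2**8 - 1),
    "ui2": (0, 2**16 - 1),
    "ui4": (0, 2**32 - 1),
}


def is_valid(dt_type: str, text: str) -> bool:
    """Return True if text looks like a value of the given datatype."""
    try:
        if dt_type in INT_RANGES:
            low, high = INT_RANGES[dt_type]
            return low <= int(text) <= high
        if dt_type == "boolean":
            return text in ("0", "1")
        if dt_type == "date":
            datetime.date.fromisoformat(text)
            return True
        if dt_type == "dateTime.tz":
            datetime.datetime.fromisoformat(text)
            return True
        if dt_type == "time.tz":
            datetime.time.fromisoformat(text)
            return True
        if dt_type == "uuid":
            uuid.UUID(text)
            return True
    except ValueError:
        # Every parser used above signals malformed input with ValueError.
        return False
    raise NotImplementedError(dt_type)


if __name__ == "__main__":
    print(is_valid("i1", "128"))                 # False: above 127
    print(is_valid("ui4", "3000000000"))         # True
    print(is_valid("time.tz", "08:15:27-05:00"))  # True
```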