XML - Managing Data Exchange/XMLWebAudio

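Software already exists that can convert any text file into an audio file. The conversion can be done with tools built into the Windows and Mac operating systems or with inexpensive standalone software such as TextAloud. TextAloud lets the user change the voice, the reading speed, and other settings, and a free version is available online. Such systems can adjust the voice in several ways to suit a user's personal preferences. What they do not do is make the resulting files available over the Internet for users to search and listen to.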

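Potential

With the right combination of XML technology, mobile communication services, and existing software and hardware, the concept of Internet radio can be extended to a far larger body of content than it covers today. Most Internet radio currently takes the form of music files and broadcast program content. Its range could be extended to any existing text file, including news reports, government documents, educational material, and official records of all kinds. A business example is a salesperson who, while driving to a sales call, listens to a file detailing the customer's purchase history. Another example combines this with existing language translation software, allowing people in distant countries to listen to and learn about technology being developed elsewhere.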

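Requirements

Three areas must work together for this process to function:

1. XML technology must include an agreed set of XML tags for transferring files between content generators/distributors and users.
2. Mobile communication services must be able to deliver the data to the end user's system in a usable format.
3. Hardware and software must be able to consume the delivered documents and play them for the user. This includes further development of voice-enabled browsers.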

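The second and third requirements are beyond the scope of this chapter on XML, but work on them is under way. The W3C (World Wide Web Consortium) is currently running the Mobile Web Initiative, which will set standards for software vendors, content providers, hardware (handset) manufacturers, browser developers, and mobile service operators. One proposal under consideration is a maximum page weight of 10K (a typical magazine article fits within this limit). Whether advertising will be available, and in what form, is still under discussion. The delivery protocol is expected to be HTTP. Connections to mobile devices may be slow, but the audio files do not need to be streamed. Vendors currently participating include Nokia, Ericsson, HP, France Telecom, and Opera.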

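The first requirement calls for a set of XML tags that every generator of text content (such as news agencies, governments, educational institutions, and keepers of official records) can use to produce its content files. Their content could then be accessed, stored in searchable databases, and downloaded and played at any time, from anywhere, on a device with a mobile browser.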

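Existing tag set

A set of XML tags called SSML (Speech Synthesis Markup Language) already exists. It gives control over enough aspects of speech generation that users can produce and manipulate a personalized voice. Text-to-speech systems use these tags to take a text file and generate an audible rendering of the text.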

Document structure, text processing, and pronunciation elements and attributes

speak - root element

xml:lang - attribute

                indicates the natural language of the file, such as “en-US”. Changes of
                language within a file are preferably indicated only on the voice element, so
                that the voice does not change unexpectedly in the middle of a file.

xml:base - attribute

                base URI attribute (optional)

Example

<speak version="1.0"
        xmlns="http://www.w3.org/2001/10/synthesis"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                  http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
        xml:lang="en-US">
 ... the body ...
</speak>

lexicon - element

       an empty element that refers to an external pronunciation lexicon.

meta - element

       an empty element that holds a string of information about the document that
       follows. Through its http-equiv attribute it can supply HTTP header information
       for a file that arrives without header fields generated by the originating server.

metadata - element

       can provide broader information about the document, expressed in a metadata schema.
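
As a rough illustration of where these three elements sit, the sketch below places a lexicon, two meta declarations, and an empty metadata container at the top of a speak document, before any spoken text. The example.com URLs and the header value are placeholders chosen for illustration, not references to a real service.

```xml
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <!-- hypothetical external pronunciation lexicon -->
  <lexicon uri="http://www.example.com/lexicon.xml"/>
  <!-- header-style information carried inside the document itself -->
  <meta http-equiv="Cache-Control" content="no-cache"/>
  <!-- hypothetical pointer to further information about the document -->
  <meta name="seeAlso" content="http://www.example.com/metadata.xml"/>
  <metadata>
    <!-- elements from a metadata schema would go here -->
  </metadata>
  Today's headlines follow.
</speak>
```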

p - element

       text structure; represents a paragraph. It can contain only text and the following
       elements: audio, break, emphasis, mark, phoneme, prosody, say-as, sub, s, voice.

s - element

       text structure; represents a sentence. It can contain only text and the following
       elements: audio, break, emphasis, mark, phoneme, prosody, say-as, sub, voice.
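
A minimal sketch of the two structure elements used together is shown here; the sentences are invented sample text.

```xml
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <!-- one paragraph made up of two explicitly marked sentences -->
  <p>
    <s>The council approved the budget on Tuesday.</s>
    <s>The vote was unanimous.</s>
  </p>
</speak>
```

Marking sentences explicitly spares the synthesizer from guessing where they end, which can matter for text containing abbreviations or decimal numbers.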

say-as - element

       available attributes: interpret-as, format, and detail, with interpret-as being
       the only required one. The element may contain only text to be rendered by a voice
       synthesizer. It tells a browser more about the manner in which the enclosed text
       is to be voiced.

format - attribute

               gives additional hints about how the voiced text is to be rendered.

detail - attribute

               indicates the level of detail to be applied to the voiced text. An example
               would be a special form of emphasis such as the reading of computer code
               in a block of text.
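
The following sketch shows how say-as might be used to have a date read aloud as a date rather than as a string of digits. The values "date" and "mdy" are assumptions: SSML 1.0 does not itself define the permitted interpret-as and format values, so whether they are honored depends on the synthesis processor.

```xml
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <!-- "date" and "mdy" are assumed values; support varies by processor -->
  The report was filed on
  <say-as interpret-as="date" format="mdy">10/12/2006</say-as>.
</speak>
```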

phoneme - element

       a pronunciation indicator for the text-to-speech engine. The engine does not speak
       the contents of the element, so the element can be empty; it uses the attributes
       to handle language-specific pronunciation. However, any text inside the element
       will be rendered on screen in a visual browser for hearing-impaired users. This
       element can contain only text, no elements.
       alphabet - attribute
               optional; specifies the phonetic alphabet in which the ph value is written.
       ph - attribute
               required; specifies the string to be pronounced.

Example

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                  http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
        xml:lang="en-US">
 <phoneme alphabet="ipa" ph="təmei̥ɾou̥"> tomato </phoneme>
</speak>

sub - element

       specifies, in its “alias” attribute, the spoken form of the written text
       enclosed by the element. Example:

<sub alias="...">AARP</sub>
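
A complete document using sub might look like the sketch below. The abbreviation and its expansion are chosen purely for illustration: the written text "W3C" is what a visual display would show, while the alias is what the synthesizer would speak.

```xml
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <!-- spoken as "World Wide Web Consortium", written as "W3C" -->
  <sub alias="World Wide Web Consortium">W3C</sub> publishes the SSML recommendation.
</speak>
```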

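Prosody and style - Prosody covers aspects such as tone, intonation, the rhythm of conversation, pitch, loudness, the duration of sounds, and chunking (units of words, not necessarily sentences).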

voice - element

       indicates the type of voice to use. All the attributes are optional; however,
       specifying no attributes at all is considered an error. The “xml:lang” attribute
       takes precedence; all other attributes carry equal weight.

xml:lang - attribute

                for the voice element, indicates the language of the voice.

gender - attribute

age - attribute

variant - attribute

name - attribute

Example

<voice gender="male">Show me a person without a goal</voice>
<voice gender="male" variant="2">
 and I'll show you a stock clerk.
</voice>
<voice name="James">Show me a stock clerk with a goal and I'll show you someone who will change the world.</voice>

emphasis - element

       contains text to be emphasized by the speech processor (with stress or intensity). It
       has one attribute:

level - attribute

                indicates the degree of emphasis.

Example

Geniuses themselves don't talk about the gift of genius, they just talk about
<emphasis level="strong"> hard work and long hours. </emphasis>

The “emphasis” element can contain text and the following elements: audio, break, emphasis, mark, phoneme, prosody, say-as, sub, voice.

                Within audio, if the content is not speech then the “desc” element should
                be used to describe the content. This description can be used in a text
                output for the hearing impaired.

break - element

       wherever the element is used between words it indicates a pause in the reading of the
       text. Its attributes are “strength”, with the values none (meaning no pause, even if
       the system would normally insert one there), x-weak, weak, medium, strong, and
       x-strong; and “time”, with a value in either milliseconds (250ms) or seconds (2s).
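
The two attributes can be sketched as follows; the sentences are invented sample text.

```xml
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <!-- an explicit quarter-second pause -->
  Here is the first headline. <break time="250ms"/>
  <!-- a longer pause given in seconds -->
  Here is the second headline. <break time="2s"/>
  <!-- suppress a pause the processor might otherwise insert -->
  Prices rose by three <break strength="none"/> percent.
</speak>
```

The strength attribute leaves the actual pause length to the processor, whereas time fixes it exactly.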

prosody - element

       controls the pitch, speaking rate, and volume of a generated voice. Attributes
       are optional, but it is considered an error if no attributes are set.
       pitch - attribute
       contour - attribute
       range - attribute
       rate - attribute
       duration - attribute
       volume - attribute
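
A short sketch of prosody in use follows. It relies only on the named label values (such as "slow", "loud", and "high") rather than numeric settings, and the sentences are invented sample text.

```xml
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <!-- slower and louder than the default voice settings -->
  <prosody rate="slow" volume="loud">Please listen carefully.</prosody>
  <!-- raised pitch for a single sentence -->
  <prosody pitch="high">The price has changed.</prosody>
</speak>
```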

Other elements allow audio files to be inserted alongside the generated speech content.

audio - element

       may be empty; if it has content, that content should be the text that the speech
       generator can convert to a voice in place of the audio file.

Example

 <audio src="JCPennyQuote.au">Every business is built on friendship.</audio>

mark - element

       an empty element that places a named marker into the content. When the processor
       reaches a “mark” element, one of two things happens: either the processor is
       provided with the information needed to retrieve the desired position in the
       content, or an event is issued that includes the content at the desired position.
       It has one attribute:
       name - attribute
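
As a sketch, markers could be placed at the start of each item in a news file so that a player can report or return to a position. The marker names below are hypothetical.

```xml
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <!-- marker names are arbitrary labels chosen by the author -->
  <mark name="story1"/> The council approved the budget on Tuesday.
  <mark name="story2"/> Heavy rain is expected this weekend.
</speak>
```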

desc - element

       used inside audio to describe content that is not speech (see the note under
       emphasis above).

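Future potential of XML Web Audio

Additional tags could be introduced to carry the date, file title, author, source language, and other metadata about a file. Extending the existing tag set would allow files to be stored in, and searched for in, databases in a variety of ways. It would allow data of great value to potential users to be stored alongside the actual text/audio file. Users could search by a file's date of origin, its country of origin, or its subject or title.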
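
One possible way to carry such data without defining new tags is to place elements from an existing vocabulary inside SSML's own metadata element. The sketch below uses Dublin Core element names as an assumption; this chapter does not name any particular schema, and every value shown is an invented placeholder.

```xml
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:dc="http://purl.org/dc/elements/1.1/" xml:lang="en-US">
  <metadata>
    <!-- placeholder values: title, author, date of origin, source language -->
    <dc:title>Quarterly Budget Report</dc:title>
    <dc:creator>Example News Agency</dc:creator>
    <dc:date>2006-10-12</dc:date>
    <dc:language>en-US</dc:language>
  </metadata>
  <p><s>The council approved the budget on Tuesday.</s></p>
</speak>
```

A database could index these fields to support the searches by date, source, and title described above, while the speech processor simply ignores them when rendering the audio.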

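Conclusion

With SSML (an XML-based markup language), audio files can be generated from any text file, such as news reports, government documents, educational material, or official records. This content can be delivered over mobile communication services and the web, and played on devices with mobile browsers. The result would be a much larger Internet radio market than the existing one, which is limited strictly to music and program content. It would give mobile users on-demand access to a large number of information sources, opening up many uses.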