
How To Make A Speech Synthesis Editor — Smashing Magazine

About The Author

Knut Melvær is a humanities technologist currently working as Developer Advocate at Sanity.io. He has previously been a technology consultant and developer at …
More about Knut…

Voice assistants are on their way into people’s homes, wrists, and pockets. That means some of our content will be spoken out loud with the help of digital speech synthesis. In this tutorial, you’ll learn how to make a What You Get Is What You Hear (WYGIWYH) editor for speech synthesis using Sanity.io’s editor for Portable Text.

When Steve Jobs unveiled the Macintosh in 1984, it said “Hello” to us from the stage. Even at that point, speech synthesis wasn’t really a new technology: Bell Labs developed the vocoder as early as the late 30s, and the concept of a voice assistant computer made it into people’s awareness when Stanley Kubrick made the vocoder the voice of HAL 9000 in 2001: A Space Odyssey (1968).

It wasn’t before the introduction of Apple’s Siri, Amazon Echo, and Google Assistant in the mid-2010s that voice interfaces truly found their way into a broader public’s homes, wrists, and pockets. We’re still in an adoption phase, but it seems that these voice assistants are here to stay.

In other words, the web isn’t just passive text on a screen anymore. Web editors and UX designers have to get accustomed to making content and services that should be spoken out loud.

We’re already moving fast towards using content management systems that let us work with our content headlessly and through APIs. The final piece is to make editorial interfaces that make it easier to tailor content for voice. So let’s do just that!

What Is SSML

While web browsers use W3C’s specification for HyperText Markup Language (HTML) to visually render documents, most voice assistants use Speech Synthesis Markup Language (SSML) when generating speech.

A minimal example using the root element <speak>, and the paragraph (<p>) and sentence (<s>) tags:

<speak>
  <p>
    <s>This is the first sentence of the paragraph.</s>
    <s>Here’s another sentence.</s>
  </p>
</speak>

Where SSML gets really interesting is when we introduce tags for <emphasis> and <prosody> (pitch):

<speak>
  <p>
    <s>Put some <emphasis level="strong">extra weight on these words</emphasis></s>
    <s>And say <prosody pitch="high" rate="fast">this a bit higher and faster</prosody>!</s>
  </p>
</speak>


SSML has more features, but this is enough to get a feel for the basics. Now, let’s take a closer look at the editor that we’ll use to make the speech synthesis editing interface.
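One more tag worth knowing about, since it will turn up again in the serializer later in this tutorial, is <break>, which inserts a pause. A minimal sketch, with attribute values as described in the SSML specification:

```xml
<speak>
  <p>
    <s>Let me think about that.</s>
    <!-- pause for 800 milliseconds before the next sentence -->
    <break time="800ms" strength="strong"/>
    <s>Okay, I have an answer.</s>
  </p>
</speak>
```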

The Editor For Portable Text

To make this editor, we’ll use the editor for Portable Text that’s featured in Sanity.io. Portable Text is a JSON specification for rich text editing that can be serialized into any markup language, such as SSML. This means you can easily use the same text snippet in multiple places with different markup languages.

Sanity.io’s default editor for Portable Text (Large preview)

Installing Sanity

Sanity.io is a platform for structured content that comes with an open-source editing environment built with React.js. It takes two minutes to get it all up and running.

Type npm i -g @sanity/cli && sanity init into your terminal, and follow the instructions. Choose “empty” when you’re prompted for a project template.

If you don’t want to follow this tutorial and make this editor from scratch, you can also clone this tutorial’s code and follow the instructions in README.md.

When the editor is downloaded, you run sanity start in the project folder to start it up. It will start a development server that uses Hot Module Reloading to update changes as you edit its files.

How To Configure Schemas In Sanity Studio

Creating The Editor Files

We’ll start by making a folder called ssml-editor in the /schemas folder. In that folder, we’ll put some empty files:

/ssml-tutorial/schemas/ssml-editor
├── alias.js
├── emphasis.js
├── annotations.js
├── PreviewButton.js
├── prosody.js
├── sayAs.js
├── blocksToSSML.js
├── speech.js
├── SSMLeditor.css
└── SSMLeditor.js

Now we can add content schemas in these files. Content schemas are what define the data structure for the rich text, and what Sanity Studio uses to generate the editorial interface. They are simple JavaScript objects that mostly require just a name and a type.

We can also add a title and a description to make it a bit nicer for editors. For example, this is a schema for a simple text field for a title:

export default {
  name: 'title',
  type: 'string',
  title: 'Title',
  description: 'Titles should be short and descriptive'
}

The studio with our title field and the default editor (Large preview)

Portable Text is built on the idea of rich text as data. This is powerful because it lets you query your rich text, and convert it into practically any markup you want.

It is an array of objects called “blocks”, which you can think of as the “paragraphs”. In a block, there’s an array of children spans. Each block can have a style and a set of mark definitions, which describe data structures distributed on the children spans.

Sanity.io comes with an editor that can read and write to Portable Text, and it is activated by placing the block type inside an array field, like this:

// speech.js
export default {
  name: 'speech',
  type: 'array',
  title: 'SSML Editor',
  of: [
    { type: 'block' }
  ]
}

An array can be of multiple types. For an SSML editor, those could be blocks for audio files, but that falls outside of the scope of this tutorial.

The last thing we want to do is to add a content type where this editor can be used. Most assistants use a simple content model of “intents” and “fulfillments”:

  • Intents
    Usually a list of strings used by the AI model to delineate what the user wants to get done.
  • Fulfillments
    This happens when an “intent” is identified. A fulfillment often is, or at least comes with, some sort of response.
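As a hypothetical sketch of that model (the field names here are made up for illustration and don’t follow any specific assistant platform), an intent/fulfillment pair could look something like this:

```json
{
  "intent": {
    "name": "openingHours",
    "trainingPhrases": ["when are you open", "opening hours", "are you open today"]
  },
  "fulfillment": {
    "response": "<speak>We are open from 9 to 5 on weekdays.</speak>"
  }
}
```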

So let’s make a simple content type called fulfillment that uses the speech synthesis editor. Make a new file called fulfillment.js and save it in the /schemas folder:

// fulfillment.js
export default {
  name: 'fulfillment',
  type: 'document',
  title: 'Fulfillment',
  fields: [
    {
      name: 'title',
      type: 'string',
      title: 'Title',
      description: 'Titles should be short and descriptive'
    },
    {
      name: 'response',
      type: 'speech'
    }
  ]
}

Save the file, and open schema.js. Add it to your studio like this:

// schema.js
import createSchema from 'part:@sanity/base/schema-creator'
import schemaTypes from 'all:part:@sanity/base/schema-type'
import fulfillment from './fulfillment'
import speech from './speech'

export default createSchema({
  name: 'default',
  types: schemaTypes.concat([
    fulfillment,
    speech,
  ])
})

If you now run sanity start in your command line interface within the project’s root folder, the studio will start up locally, and you’ll be able to add entries for fulfillments. You can keep the studio running while we go on, as it will auto-reload with new changes when you save the files.

Adding SSML To The Editor

By default, the block type will give you a standard editor for visually oriented rich text with heading styles, decorator styles for emphasis and strong, annotations for links, and lists. Now we want to override those with the audial concepts found in SSML.

We begin by defining the different content structures, with helpful descriptions for the editors, that we’ll add to the block in SSMLeditorSchema.js as configurations for annotations. Those are “emphasis”, “alias”, “prosody”, and “say as”.

Emphasis

We start with “emphasis”, which controls how much weight is put on the marked text. We define it as a string with a list of predefined values that the user can choose from:

// emphasis.js
export default {
  name: 'emphasis',
  type: 'object',
  title: 'Emphasis',
  description:
    'The strength of the emphasis put on the contained text',
  fields: [
    {
      name: 'level',
      type: 'string',
      options: {
        list: [
          { value: 'strong', title: 'Strong' },
          { value: 'moderate', title: 'Moderate' },
          { value: 'none', title: 'None' },
          { value: 'reduced', title: 'Reduced' }
        ]
      }
    }
  ]
}

Alias

Sometimes the written and the spoken term differ. For instance, you may want to use the abbreviation of a phrase in a written text, but have the whole phrase read aloud. For example:

<s>This is a <sub alias="Speech Synthesis Markup Language">SSML</sub> tutorial</s>


The input field for the alias is a simple string:

// alias.js
export default {
  name: 'alias',
  type: 'object',
  title: 'Alias (sub)',
  description:
    'Replaces the contained text for pronunciation. This allows a document to contain both a spoken and written form.',
  fields: [
    {
      name: 'text',
      type: 'string',
      title: 'Replacement text',
    }
  ]
}

Prosody

With the prosody property we can control different aspects of how the text should be spoken, like pitch, rate, and volume. The markup for this can look like this:

<s>Say this with an <prosody pitch="x-low">extra low pitch</prosody>, and this <prosody rate="fast" volume="loud">loudly with a fast rate</prosody></s>


This input will have three fields with predefined string options:

// prosody.js
export default {
  name: 'prosody',
  type: 'object',
  title: 'Prosody',
  description: 'Control of the pitch, speaking rate, and volume',
  fields: [
    {
      name: 'pitch',
      type: 'string',
      title: 'Pitch',
      description: 'The baseline pitch for the contained text',
      options: {
        list: [
          { value: 'x-low', title: 'Extra low' },
          { value: 'low', title: 'Low' },
          { value: 'medium', title: 'Medium' },
          { value: 'high', title: 'High' },
          { value: 'x-high', title: 'Extra high' },
          { value: 'default', title: 'Default' }
        ]
      }
    },
    {
      name: 'rate',
      type: 'string',
      title: 'Rate',
      description:
        'A change in the speaking rate for the contained text',
      options: {
        list: [
          { value: 'x-slow', title: 'Extra slow' },
          { value: 'slow', title: 'Slow' },
          { value: 'medium', title: 'Medium' },
          { value: 'fast', title: 'Fast' },
          { value: 'x-fast', title: 'Extra fast' },
          { value: 'default', title: 'Default' }
        ]
      }
    },
    {
      name: 'volume',
      type: 'string',
      title: 'Volume',
      description: 'The volume for the contained text.',
      options: {
        list: [
          { value: 'silent', title: 'Silent' },
          { value: 'x-soft', title: 'Extra soft' },
          { value: 'medium', title: 'Medium' },
          { value: 'loud', title: 'Loud' },
          { value: 'x-loud', title: 'Extra loud' },
          { value: 'default', title: 'Default' }
        ]
      }
    }
  ]
}

Say As

The last one we want to include is <say-as>. This tag lets us exercise a bit more control over how certain information is pronounced. We can even use it to bleep out words if you need to redact something in voice interfaces. That’s @!%&© useful!

<s>Do I have to <say-as interpret-as="expletive">frakking</say-as> <say-as interpret-as="verbatim">spell</say-as> it out for you!?</s>


// sayAs.js
export default {
  name: 'sayAs',
  type: 'object',
  title: 'Say as…',
  description: 'Lets you indicate information about the type of text construct that is contained within the element. It also helps specify the level of detail for rendering the contained text.',
  fields: [
    {
      name: 'interpretAs',
      type: 'string',
      title: 'Interpret as…',
      options: {
        list: [
          { value: 'cardinal', title: 'Cardinal numbers' },
          {
            value: 'ordinal',
            title: 'Ordinal numbers (1st, 2nd, 3rd…)'
          },
          { value: 'characters', title: 'Spell out characters' },
          { value: 'fraction', title: 'Say numbers as fractions' },
          { value: 'expletive', title: 'Bleep out this word' },
          {
            value: 'unit',
            title: 'Adapt unit to singular or plural'
          },
          {
            value: 'verbatim',
            title: 'Spell out letter by letter (verbatim)'
          },
          { value: 'date', title: 'Say as a date' },
          { value: 'telephone', title: 'Say as a telephone number' }
        ]
      }
    },
    {
      name: 'date',
      type: 'object',
      title: 'Date',
      fields: [
        {
          name: 'format',
          type: 'string',
          description: 'The format attribute is a sequence of date field character codes. Supported field character codes in format are y, m, d for year, month, and day (of the month) respectively. If the field code appears once for year, month, or day then the number of digits expected are 4, 2, and 2 respectively. If the field code is repeated then the number of expected digits is the number of times the code is repeated. Fields in the date text may be separated by punctuation and/or spaces.'
        },
        {
          name: 'detail',
          type: 'number',
          validation: Rule =>
            Rule.required()
              .min(0)
              .max(2),
          description: "The detail attribute controls the spoken form of the date. For detail='1' only the day fields and one of month or year fields are required, although both may be supplied."
        }
      ]
    }
  ]
}

Now we will import these in an annotations.js file, which makes things a bit tidier.

// annotations.js
export { default as alias } from './alias'
export { default as emphasis } from './emphasis'
export { default as prosody } from './prosody'
export { default as sayAs } from './sayAs'

Now we can import these annotation types into our main schema:

// schema.js
import createSchema from 'part:@sanity/base/schema-creator'
import schemaTypes from 'all:part:@sanity/base/schema-type'
import fulfillment from './fulfillment'
import speech from './ssml-editor/speech'
import {
  alias,
  emphasis,
  prosody,
  sayAs
} from './ssml-editor/annotations'

export default createSchema({
  name: 'default',
  types: schemaTypes.concat([
    fulfillment,
    speech,
    alias,
    emphasis,
    prosody,
    sayAs
  ])
})

Finally, we will now add these to the editor like this:

// speech.js
export default {
  name: 'speech',
  type: 'array',
  title: 'SSML Editor',
  of: [
    {
      type: 'block',
      styles: [],
      lists: [],
      marks: {
        decorators: [],
        annotations: [
          { type: 'alias' },
          { type: 'emphasis' },
          { type: 'prosody' },
          { type: 'sayAs' }
        ]
      }
    }
  ]
}

Notice that we also added empty arrays to styles, lists, and decorators. This disables the default styles and decorators (like bold and emphasis) since they don’t make that much sense in this specific case.

Customizing The Look And Feel

Now we have the functionality in place, but since we haven’t specified any icons, each annotation will use the default icon, which makes the editor hard to actually use for authors. So let’s fix that!

With the editor for Portable Text it’s possible to inject React components both for the icons and for how the marked text should be rendered. Here, we’ll just let some emoji do the work for us, but you could obviously go far with this, making them dynamic and so on. For prosody we’ll even make the icon change depending on the volume selected. Note that I omitted the fields in these snippets for brevity; you shouldn’t remove them in your local files.

// alias.js
import React from 'react'

export default {
  name: 'alias',
  type: 'object',
  title: 'Alias (sub)',
  description: 'Replaces the contained text for pronunciation. This allows a document to contain both a spoken and written form.',
  fields: [
    /* all the fields */
  ],
  blockEditor: {
    icon: () => '?',
    render: ({ children }) => <span>{children} ?</span>,
  },
};

// emphasis.js
import React from 'react'

export default {
  name: 'emphasis',
  type: 'object',
  title: 'Emphasis',
  description: 'The strength of the emphasis put on the contained text',
  fields: [
    /* all the fields */
  ],
  blockEditor: {
    icon: () => '?',
    render: ({ children }) => <span>{children} ?</span>,
  },
};

// prosody.js
import React from 'react'

export default {
  name: 'prosody',
  type: 'object',
  title: 'Prosody',
  description: 'Control of the pitch, speaking rate, and volume',
  fields: [
    /* all the fields */
  ],
  blockEditor: {
    icon: () => '?',
    render: ({ children, volume }) => (
      <span>
        {children} {['x-loud', 'loud'].includes(volume) ? '?' : '?'}
      </span>
    ),
  },
};

// sayAs.js
import React from 'react'

export default {
  name: 'sayAs',
  type: 'object',
  title: 'Say as…',
  description: 'Lets you indicate information about the type of text construct that is contained within the element. It also helps specify the level of detail for rendering the contained text.',
  fields: [
    /* all the fields */
  ],
  blockEditor: {
    icon: () => '?',
    render: props => <span>{props.children} ?</span>,
  },
};

The editor with our custom SSML marks (Large preview)

Now you have an editor for editing text that can be used by voice assistants. But wouldn’t it be kind of useful if editors also could preview how the text actually will sound?

Adding A Preview Button Using Google’s Text-to-Speech

Native speech synthesis support is actually on its way for browsers. But in this tutorial, we’ll use Google’s Text-to-Speech API, which supports SSML. Building this preview functionality will also be a demonstration of how you serialize Portable Text into SSML in whatever service you want to use it for.

Wrapping The Editor In A React Component

We start by opening the SSMLeditor.js file and adding the following code:

// SSMLeditor.js
import React, { Fragment } from 'react';
import { BlockEditor } from 'part:@sanity/form-builder';

export default function SSMLeditor(props) {
  return (
    <Fragment>
      <BlockEditor {...props} />
    </Fragment>
  );
}

We’ve now wrapped the editor in our own React component. All the props it needs, including the data it contains, are passed down in real-time. To actually use this component, you have to import it into your speech.js file:

// speech.js
import React from 'react'
import SSMLeditor from './SSMLeditor.js'

export default {
  name: 'speech',
  type: 'array',
  title: 'SSML Editor',
  inputComponent: SSMLeditor,
  of: [
    {
      type: 'block',
      styles: [],
      lists: [],
      marks: {
        decorators: [],
        annotations: [
          { type: 'alias' },
          { type: 'emphasis' },
          { type: 'prosody' },
          { type: 'sayAs' },
        ],
      },
    },
  ],
}
When you save this and the studio reloads, it will look pretty much exactly the same, but that’s because we haven’t started tweaking the editor yet.

Convert Portable Text To SSML

The editor will save the content as Portable Text, an array of objects in JSON that makes it easy to convert rich text into whatever format you need it to be. When you convert Portable Text into another syntax or format, we call that “serialization”. Hence, “serializers” are the recipes for how the rich text should be converted. In this section, we’ll add serializers for speech synthesis.

You have already made the blocksToSSML.js file. Now we need to add our first dependency. Start by running the terminal command npm init -y inside the ssml-editor folder. This will add a package.json where the editor’s dependencies will be listed.

Once that’s done, you can run npm install @sanity/block-content-to-html to get a library that makes it easier to serialize Portable Text. We’re using the HTML library because SSML has the same XML syntax, with tags and attributes.

This is a bunch of code, so do feel free to copy-paste it. I’ll explain the pattern right below the snippet:

// blocksToSSML.js
import blocksToHTML, { h } from '@sanity/block-content-to-html'

const serializers = {
  marks: {
    prosody: ({ children, mark: { rate, pitch, volume } }) =>
      h('prosody', { attrs: { rate, pitch, volume } }, children),
    alias: ({ children, mark: { text } }) =>
      h('sub', { attrs: { alias: text } }, children),
    sayAs: ({ children, mark: { interpretAs } }) =>
      h('say-as', { attrs: { 'interpret-as': interpretAs } }, children),
    break: ({ children, mark: { time, strength } }) =>
      h('break', { attrs: { time: `${time}ms`, strength } }, children),
    emphasis: ({ children, mark: { level } }) =>
      h('emphasis', { attrs: { level } }, children)
  }
}

export const blocksToSSML = blocks => blocksToHTML({ blocks, serializers })

This code exports a function that takes the array of blocks and loops through them. Whenever a block contains a mark, it will look for a serializer for the type. If you have marked some text to have emphasis, it uses this function from the serializers object:

emphasis: ({ children, mark: { level } }) =>
  h('emphasis', { attrs: { level } }, children)

Perhaps you recognize the parameter from where we defined the schema? The h() function lets us define an HTML element; that is, here we “cheat” and make it return an SSML element called <emphasis>. We also give it the attribute level if that is defined, and place the children elements within it, which usually will be the text you have marked up with emphasis.

{
  "_type": "block",
  "_key": "f2c4cf1ab4e0",
  "style": "normal",
  "markDefs": [
    {
      "_type": "emphasis",
      "_key": "99b28ed3fa58",
      "level": "strong"
    }
  ],
  "children": [
    {
      "_type": "span",
      "_key": "f2c4cf1ab4e01",
      "text": "Say this strongly!",
      "marks": [
        "99b28ed3fa58"
      ]
    }
  ]
}

That’s how the above structure in Portable Text gets serialized to this SSML:

<emphasis level="strong">Say this strongly!</emphasis>
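If you want to see the mark-to-tag mapping in isolation, without the Sanity library, here is a minimal, self-contained sketch of the same idea (attrsToString and ssmlTag are made-up helpers for illustration, not part of @sanity/block-content-to-html):

```javascript
// Build an attribute string, skipping attributes that are undefined
const attrsToString = attrs =>
  Object.entries(attrs || {})
    .filter(([, value]) => value !== undefined)
    .map(([key, value]) => ` ${key}="${value}"`)
    .join('')

// Wrap already-serialized children in an SSML tag with its attributes
const ssmlTag = (tag, attrs, children) =>
  `<${tag}${attrsToString(attrs)}>${children}</${tag}>`

// A span marked with { _type: 'emphasis', level: 'strong' } serializes to:
console.log(ssmlTag('emphasis', { level: 'strong' }, 'Say this strongly!'))
// <emphasis level="strong">Say this strongly!</emphasis>
```

The real serializers above do the same thing, with h() producing element nodes instead of strings.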

If you want support for more SSML tags, you can add more annotations in the schema, and add the annotation types to the marks section in the serializers.

Now we have a function that returns SSML markup from our marked-up rich text. The last part is to make a button that lets us send this markup to a text-to-speech service.

Adding A Preview Button That Speaks Back To You

Ideally, we should have used the browser’s speech synthesis capabilities in the Web API. That way, we would have gotten away with less code and dependencies.

As of early 2019, however, native browser support for speech synthesis is still in its early stages. It looks like support for SSML is on the way, and there are proofs of concept of client-side JavaScript implementations for it.

Chances are that you’re going to use this content with a voice assistant anyway. Both Google Assistant and Amazon Echo (Alexa) accept SSML as responses in a fulfillment. In this tutorial, we’ll use Google’s Text-to-Speech API, which also sounds good and supports several languages.

Start by obtaining an API key by signing up for Google Cloud Platform (it will be free for the first 1 million characters you process). Once you’re signed up, you can make a new API key on this page.

Now you can open your PreviewButton.js file, and add this code to it:

// PreviewButton.js
import React from 'react'
import Button from 'part:@sanity/components/buttons/default'
import { blocksToSSML } from './blocksToSSML'

// You should be careful with sharing this key
// I put it here to keep the code simple
const API_KEY = '<yourAPIkey>'
const GOOGLE_TEXT_TO_SPEECH_URL = 'https://texttospeech.googleapis.com/v1beta1/text:synthesize?key=' + API_KEY

const speak = async blocks => {
  // Serialize blocks to SSML
  const ssml = blocksToSSML(blocks)
  // Prepare the Google Text-to-Speech configuration
  const body = JSON.stringify({
    input: { ssml },
    // Select the language code and voice name (A-F)
    voice: { languageCode: 'en-US', name: 'en-US-Wavenet-A' },
    // Use MP3 in order to play in browser
    audioConfig: { audioEncoding: 'MP3' }
  })
  // Send the SSML string to the API
  const res = await fetch(GOOGLE_TEXT_TO_SPEECH_URL, {
    method: 'POST',
    body
  }).then(res => res.json())
  // Play the returned audio with the browser's Audio API
  const audio = new Audio('data:audio/wav;base64,' + res.audioContent)
  audio.play()
}

export default function PreviewButton(props) {
  return <Button style={{ marginTop: '1em' }} onClick={() => speak(props.blocks)}>Speak text</Button>
}

I’ve kept this preview button code to a minimum to make it easier to follow this tutorial. Of course, you could build it out by adding state to show if the preview is processing, or make it possible to preview with the different voices that Google’s API supports.

Add the button to SSMLeditor.js:

// SSMLeditor.js
import React, { Fragment } from 'react';
import { BlockEditor } from 'part:@sanity/form-builder';
import PreviewButton from './PreviewButton';

export default function SSMLeditor(props) {
  return (
    <Fragment>
      <BlockEditor {...props} />
      <PreviewButton blocks={props.value} />
    </Fragment>
  );
}

Now you should be able to mark up your text with the different annotations, and hear the result when pushing “Speak text”. Cool, isn’t it?

You’ve Created A Speech Synthesis Editor, And Now What?

If you have followed this tutorial, you have been through how you can use the editor for Portable Text in Sanity Studio to make custom annotations and customize the editor. You can use these skills for all sorts of things, not only to make a speech synthesis editor. You have also been through how to serialize Portable Text into the syntax you need. Obviously, this is also useful if you’re building frontends in React or Vue. You can even use these skills to generate Markdown from Portable Text.

We haven’t covered how you actually use this together with a voice assistant. If you want to try, you can use much of the same logic as with the preview button in a serverless function, and set it as the API endpoint for a fulfillment using webhooks, e.g. with Dialogflow.
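As a rough sketch of that idea, assuming Dialogflow’s webhook response format for Actions on Google (and with fetchResponseBlocks as a hypothetical lookup function you would write yourself), a serverless handler could look something like this:

```javascript
// Hypothetical sketch: build a webhook fulfillment response carrying SSML.
// The response shape follows Dialogflow's webhook format for Actions on Google;
// fetchResponseBlocks and blocksToSSML are assumed to exist in your project.
const makeWebhookResponse = ssml => ({
  payload: {
    google: {
      expectUserResponse: false,
      richResponse: {
        items: [{ simpleResponse: { ssml } }]
      }
    }
  }
})

// In a serverless function, roughly:
// exports.handler = async (req, res) => {
//   const blocks = await fetchResponseBlocks(req.body.queryResult.intent.displayName)
//   res.json(makeWebhookResponse(blocksToSSML(blocks)))
// }
```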

If you’d like me to write a tutorial on how to use the speech synthesis editor with a voice assistant, feel free to give me a hint on Twitter or share in the comments section below.
