
Using Text Cleanup Rules to Improve Text to Speech

A Text Cleanup Rule performs a text search and replace on a document prior to the text to speech operation. Instead of performing a simple text match, advanced pattern matching (using regular expressions) can be used to specify exactly what text is matched.

The power of Text Cleanup Rules is best illustrated with some examples. Text2Go comes with a number of Text Cleanup Rules.

Text Cleanup Rules that come with Text2Go

 

Example 1 - Percentage Ranges

A common example of a percentage range is:

A recent study showed that between 60%-65% of people preferred the colour green over blue.

Without a Text Cleanup Rule this would be pronounced as follows.

A recent study showed that between 60 percent minus 65 percent of people preferred the colour green over blue.

This is clearly incorrect.

A Text Cleanup Rule can be written that will match any percentage range in the form nn%-mm% (where nn and mm are numbers between 0 and 100) and replace the text with 'nn to mm percent'. The example text becomes:

A recent study showed that between 60 to 65 percent of people preferred the colour green over blue.
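The idea behind such a rule can be sketched in Python. This is only an illustration: the pattern and the function name clean_percent_ranges are assumptions, not the rule that ships with Text2Go, and Python writes captured groups as \1 and \2 in the replacement.

```python
import re

# Hypothetical pattern for ranges such as "60%-65%".
# Each (\d{1,3}) captures one number. The pattern does not check that
# the numbers are between 0 and 100.
PERCENT_RANGE = re.compile(r"(\d{1,3})%-(\d{1,3})%")

def clean_percent_ranges(text: str) -> str:
    """Rewrite 'nn%-mm%' as 'nn to mm percent'."""
    return PERCENT_RANGE.sub(r"\1 to \2 percent", text)

sample = ("A recent study showed that between 60%-65% of people "
          "preferred the colour green over blue.")
print(clean_percent_ranges(sample))
```

A single percentage such as '5%' is left untouched because the pattern requires both halves of the range.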

 

Example 2 - Replacing Breaks in a Document with a Pause

Text Cleanup Rules are also useful for identifying breaks in a document and inserting a pause. For example, a row of ******** or ------------- is often used to denote a break in a document. By default these breaks would be pronounced as 'asterisk, asterisk, asterisk...' and 'minus, minus, minus...', which very quickly becomes tiresome.

Text2Go includes a rule to identify these breaks and replace them with a pause. A single rule handles both forms of break and matches two or more *'s or -'s, with or without spaces in between.
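A rule of this kind might look like the following Python sketch. The pattern is an assumption rather than the rule included with Text2Go, and PAUSE_MARKER is a hypothetical stand-in for whatever pause markup the speech engine actually accepts.

```python
import re

# Hypothetical placeholder for the engine's real pause markup.
PAUSE_MARKER = "[pause]"

# Two or more '*' characters, or two or more '-' characters,
# optionally separated by spaces or tabs.
BREAK_ROW = re.compile(r"\*(?:[ \t]*\*)+|-(?:[ \t]*-)+")

def replace_breaks(text: str) -> str:
    """Replace each row of asterisks or hyphens with a pause marker."""
    return BREAK_ROW.sub(PAUSE_MARKER, text)

print(replace_breaks("Part one\n********\nPart two\n- - - -\nPart three"))
```

A single hyphen or asterisk (as in 'well-known' or '2 * 3') is not matched, because each alternative needs at least two of the same character.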

 

Example 3 - Removing References from Research Papers

As a final example, Text Cleanup Rules can be used to remove references from a research paper. A research paper that's peppered with references becomes almost unbearable to listen to. There are a number of different reference formats. Here are a couple that Text2Go currently handles using Text Cleanup Rules.

The TUR syndrome can occur with other operations including transcervical resection of the endometrium (TCRE),15 28 72 TUR of bladder tumours,20 38 cystoscopy,109 127 arthroscopy,69 rectal tumour surgery, vesical ultrasonic lithotripsy and percutaneus nephrolithotripsy.19 24 115

Imatinib is a potent and specific inhibitor of the KIT protein-tyrosine kinase, which is constitutively activated in more than 90% of GISTs as a result of gain-of-function mutations in the KIT protooncogene [1115]. The presence of KIT is readily detected through reactivity with the CD117 antigen on immunohistochemical assay, a marker that establishes the diagnosis in a gastrointestinal tract mesenchymal neoplasm with characteristic histologic features [16].
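Both formats can be approached with regular expressions. The Python sketch below shows one possible pair of patterns; they are assumptions made for illustration, not the rules Text2Go uses, and a pattern this simple could also remove genuine numbers that happen to follow a comma or full stop without a space.

```python
import re

# Bracketed numeric citations such as " [16]" or " [11, 15]",
# removed together with the whitespace in front of them.
BRACKETED = re.compile(r"\s*\[\d+(?:\s*[,-]\s*\d+)*\]")

# Superscript-style citations flattened into the text: digits that directly
# follow a comma or full stop, e.g. "cystoscopy,109 127". The character before
# the punctuation must not be a digit, so "1,000" and "3.14" are left alone.
TRAILING = re.compile(r"(?<=[^\d\s][,.])\d+(?: \d+)*(?=\s|$)")

def strip_references(text: str) -> str:
    """Remove both citation styles from the text."""
    text = BRACKETED.sub("", text)
    return TRAILING.sub("", text)

print(strip_references("TUR of bladder tumours,20 38 cystoscopy,109 127 arthroscopy,69 rectal tumour surgery"))
print(strip_references("characteristic histologic features [16]."))
```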

 

Creating your Own Text Cleanup Rules

You can create your own Text Cleanup Rules, but to do so you need to understand regular expressions. Watch the following video to see a simple example of adding a Text Cleanup Rule.

Watch a video of Text Cleanup Rules in action

 

To add a new Text Cleanup Rule, select Text Cleanup Rules... from the Pronunciation menu.

Text Cleanup Rule Menu

Then click on the Add... button.

Adding or Editing a Text Cleanup Rule

The most important fields are the Find and Replace fields.

Find

In the Find field you must enter the text you wish to match. You can use regular expressions to create some very powerful search criteria. To help you create regular expressions, there is a Cheat Sheet available that lists the most common characters used to define regular expressions. Click on the Show Cheat Sheet button to show or hide the Cheat Sheet.

Find Options

Case Sensitive
  • If checked, a case-sensitive comparison is used to match the text, e.g. 'cat' will only match 'cat', NOT 'Cat', 'CAT', 'CaT', etc. This is useful when matching acronyms.
  • If not checked, a case-insensitive comparison is used to match the text, e.g. 'cat' will match 'cat', 'Cat', 'CAT', 'CaT', etc.
Singleline
  • If checked, single-line mode is used. This changes the meaning of the dot (.) so that it matches every character (instead of every character except \n).
Multiline
  • If checked, this changes the meaning of ^ and $ so that they match at the beginning and end, respectively, of any line, and not just the beginning and end of the entire string.
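The effect of these three options can be tried out in Python, whose re module has equivalent flags: re.IGNORECASE (the opposite of Case Sensitive), re.DOTALL (Singleline) and re.MULTILINE (Multiline). The sample text is made up for illustration, and the mapping to Text2Go's options is an assumption based on the descriptions above.

```python
import re

text = "Cat sat.\ncat ran."

# Case Sensitive checked (Python's default) versus unchecked (re.IGNORECASE).
case_sensitive = re.findall(r"cat", text)                    # ['cat']
case_insensitive = re.findall(r"cat", text, re.IGNORECASE)   # ['Cat', 'cat']

# Singleline (re.DOTALL): '.' also matches the newline, so the pattern
# can run from the first line into the second.
dot_default = re.search(r"sat.+ran", text)                   # None
dot_singleline = re.search(r"sat.+ran", text, re.DOTALL)     # a match

# Multiline (re.MULTILINE): '^' matches at the start of every line,
# not just at the start of the whole string.
start_default = re.findall(r"^\w+", text)                    # ['Cat']
start_multiline = re.findall(r"^\w+", text, re.MULTILINE)    # ['Cat', 'cat']
```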

Replace

Enter the text that should replace the matched text. You can leave this field blank if you just want to remove the matched text, rather than replacing it.

It's possible to include text that matches the find expression as part of the replace expression. In the above example we only want to change the # symbol into 'number'. We still want to speak the digits that follow as a number. Therefore our replacement expression is 'number $1'. The $1 instructs Text2Go to use the first captured group of text from the find expression. A captured group is any text matched by a part of the find expression that is enclosed in parentheses (). In the above example this is (\d+), which means one or more digits (e.g. 12, 1234, 768). If there were two capture groups in the find expression, you could refer to the second using $2.
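The same substitution can be sketched in Python. The find expression #(\d+) is an assumption built from the description above, and Python's re module writes the first captured group as \1 where Text2Go uses $1.

```python
import re

# Assumed find expression: '#' followed by one or more digits,
# with the digits captured as group 1.
FIND = r"#(\d+)"
# Equivalent of the replacement 'number $1' in Python's syntax.
REPLACE = r"number \1"

def apply_rule(text: str) -> str:
    """Replace '#<digits>' with 'number <digits>'."""
    return re.sub(FIND, REPLACE, text)

print(apply_rule("See item #12 and ticket #1234."))
```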

Original Text

It's vital that you provide a sample of text to test the Text Cleanup Rule on. The easiest way to do this is to copy and paste some text from the original text that needs cleaning up.

Tip. It's also a good idea to include some text that you don't want to match. In the above example you might want to include some text like 'This is the # symbol.' to ensure that your expression doesn't replace the # symbol with 'number' in this case.
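Outside Text2Go, the same habit of testing a rule against text that should and should not change can be sketched as a small Python helper. The function check_rule and the sample sentences are hypothetical, and the find expression is the assumed #(\d+) form written in Python's syntax.

```python
import re

def check_rule(find, replace, should_change, should_not_change):
    """Return the samples on which the rule behaves unexpectedly."""
    failures = []
    for sample in should_change:
        if re.sub(find, replace, sample) == sample:
            failures.append(sample)   # rule failed to match
    for sample in should_not_change:
        if re.sub(find, replace, sample) != sample:
            failures.append(sample)   # rule matched too much
    return failures

good = check_rule(r"#(\d+)", r"number \1",
                  ["Call #42 now."], ["This is the # symbol."])
too_greedy = check_rule(r"#", "number ",
                        ["Call #42 now."], ["This is the # symbol."])
print(good)        # empty list: the rule behaves as intended
print(too_greedy)  # lists the sample that should not have changed
```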

Updated Text

This field contains the text after the search and replace has been performed. It is updated whenever you change the Find, Replace, Find Options or Original Text.

You can listen to the text by clicking on the Listen to text button.

Restricting the Scope of a Text Cleanup Rule

The following fields can be used to restrict a Text Cleanup Rule to a specific voice in the same way that they are used to restrict pronunciation corrections. These options are probably less applicable to Text Cleanup Rules but have been provided for consistency and flexibility.

Voice(s)

Used to restrict a rule to a particular voice or voices. This allows you to clean up text that is mispronounced by a particular voice. By default, a text cleanup rule will be used for all voices. Note that you don't need to specify the entire voice name and case is ignored. You can also specify more than one voice.

e.g. 'karen' will restrict the entry to the voice 'RealSpeak Karen'.
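The matching behaviour described here (partial name, case ignored, several voices allowed) can be pictured with a short Python sketch. The function voice_matches is hypothetical, the second voice name is only sample data, and how Text2Go actually separates multiple voices in the field is not specified, so a list is assumed.

```python
from typing import Sequence

def voice_matches(restriction: Sequence[str], voice_name: str) -> bool:
    """Return True if a rule restricted to these name fragments applies.

    An empty restriction means the rule applies to all voices.
    """
    if not restriction:
        return True
    name = voice_name.lower()
    # A fragment matches if it appears anywhere in the voice name.
    return any(fragment.lower() in name for fragment in restriction)

print(voice_matches(["karen"], "RealSpeak Karen"))
print(voice_matches(["karen"], "RealSpeak Daniel"))
```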

Language Variant

Used to restrict a rule to a particular language variant. Current English language variants include:

ENU or en-US: US English
ENA or en-AU: Australian English
ENG or en-GB: UK English
ENI or en-GB: Indian English

By default, an entry will match all language variants. To restrict an entry to Australian English voices use

en-AU

Indian English is classified as en-GB (UK English), so en-GB cannot be used to single it out. To restrict an entry to Indian English voices, use the three-letter code instead:

ENI
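One way to picture why ENI is needed is the Python sketch below. The table comes from the list of variants above, but the function variant_matches and its matching logic are assumptions for illustration, not Text2Go's implementation.

```python
# Three-letter code -> locale, taken from the list of variants above.
VARIANTS = {"ENU": "en-US", "ENA": "en-AU", "ENG": "en-GB", "ENI": "en-GB"}

def variant_matches(restriction: str, voice_code: str) -> bool:
    """Return True if a rule restricted to `restriction` applies to a voice.

    `voice_code` is the voice's three-letter code. An empty restriction
    matches every variant.
    """
    if not restriction:
        return True
    wanted = restriction.strip().upper()
    code = voice_code.upper()
    return wanted == code or wanted == VARIANTS.get(code, "").upper()

print(variant_matches("en-GB", "ENG"), variant_matches("en-GB", "ENI"))
print(variant_matches("ENI", "ENG"), variant_matches("ENI", "ENI"))
```

Under these assumptions, en-GB applies to both UK and Indian English voices, while ENI applies to Indian English voices only.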

Vendor

Used to restrict a rule to a particular vendor/manufacturer of voices (e.g. RealSpeak). By default, a rule will match voices supplied by all vendors. For example, to restrict an entry to just RealSpeak voices use

RealSpeak

Website

Restrict a rule to a particular website. For example, if you only want the rule applied on slashdot.org, enter slashdot.org. You need only enter enough of the URL to uniquely identify the website. This example will match all pages on the Slashdot website. You can also restrict a rule to a subdomain (e.g. blog.text2go.com) or a particular area of a website (e.g. theage.com.au/technology).
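The behaviour described above is consistent with a simple 'URL contains this text' test, sketched below in Python. The function website_matches and the page URLs are hypothetical, and Text2Go may compare URLs differently.

```python
def website_matches(restriction: str, page_url: str) -> bool:
    """Return True if the rule applies to the page at `page_url`.

    An empty restriction means the rule applies to every website.
    """
    if not restriction:
        return True
    # Case is ignored; the restriction may appear anywhere in the URL.
    return restriction.lower() in page_url.lower()

print(website_matches("slashdot.org", "http://slashdot.org/story/123"))
print(website_matches("theage.com.au/technology",
                      "http://www.theage.com.au/sport/"))
```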

Rule Library

A 'User Text Cleanup Rule Library' is created the first time you go to create a Text Cleanup Rule. By default all new rules will be placed in this library. However it is possible to organise your rules into multiple libraries. This field lets you choose the library to place a new rule in or move an existing rule from one library to another.

Note: The Libraries that come with Text2Go are read-only. The rules within them cannot be modified or removed and new rules cannot be added. However they can be switched off. The reason for this is that these libraries are updated regularly and if you modified them you would lose all your changes the next time you received a new version from Text2Go.

Learning More About Regular Expressions

The best introduction to regular expressions can be found here.

The 30 Minute RegEx Tutorial

There is an excellent free program for creating and testing regular expressions called Expresso. We can't recommend this program highly enough. This is a great tool for learning and experimenting with regular expressions.

Sharing Text Cleanup Rules

Text2Go uses a service similar to automatic updates to share rules amongst Text2Go users. By default an update is done every 2 days. However, this can be changed on the Pronunciation Options tab.

Pronunciation Options

You can change how often you wish to check for updates or turn the service off completely. You can also decide whether you want to share your text cleanup rules with others. By default this is enabled, but once again you can turn it off. You can also decide whether you would like to contribute anonymously (the default) or identify your contributions with your email address.

Creating a New Rule Library

Sometimes you may want to place a certain set of rules into a separate library. You can create your own library to do this.

Select the Text Cleanup Rules... command from the Pronunciation menu.

Text Cleanup Rule Libraries

Click on Add... to create a new library.

Add Dictionary

Enter some details to identify your library and click OK.

You have now created a new library. You can add new rules to this library or move entries from an existing library into this library.

Ordering of Text Cleanup Rules

Text Cleanup Rules will be applied to a document in the order in which they are listed. Ideally all Text Cleanup Rules should work independently and the order should not matter.
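The following Python sketch shows why independence matters. The two rules are hypothetical and deliberately not independent: one rewrites percentage ranges and the other speaks every % sign as 'percent'.

```python
import re

def apply_rules(text, rules):
    """Apply (find, replace) pairs to the text in list order."""
    for find, replace in rules:
        text = re.sub(find, replace, text)
    return text

# Hypothetical rules, written in Python's replacement syntax.
percent_range = (r"(\d+)%-(\d+)%", r"\1 to \2 percent")
percent_sign = (r"%", " percent")

range_first = apply_rules("60%-65%", [percent_range, percent_sign])
sign_first = apply_rules("60%-65%", [percent_sign, percent_range])

print(range_first)  # 60 to 65 percent
print(sign_first)   # 60 percent-65 percent
```

When the general rule runs first it removes the % signs the range rule depends on, so the range is never rewritten.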

Text Cleanup Rule Libraries can be enabled or disabled, and their order changed, from the Text2Go Text Cleanup Rules window. This window is available from the Pronunciation -> Text Cleanup Rules... menu.

Managing Text Cleanup Rule Libraries

 

 

Copyright 2007-2010 by Tumbywood Software