Using Text Cleanup Rules to Improve Text to Speech
A Text Cleanup Rule performs a text search and replace on a document before the text to speech operation. Instead of a simple text match, advanced pattern matching (using regular expressions) can be used to specify exactly what text is matched.
The power of Text Cleanup Rules is best illustrated with some examples. Text2Go comes with a number of Text Cleanup Rules.

Example 1 - Percentage Ranges
A common example of a percentage range is:
A recent study showed that between 60%-65% of people preferred the colour green over blue.
Without a Text Cleanup Rule this would be pronounced as follows.
A recent study showed that between 60 percent minus 65 percent of people preferred the colour green over blue.
This is clearly incorrect.
A Text Cleanup Rule can be written that will match any percentage range in the form nn%-mm% (where nn and mm are numbers between 0 and 100) and replace the text with 'nn to mm percent'. The example text becomes:
A recent study showed that between 60 to 65 percent of people preferred the colour green over blue.
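As a rough illustration of how such a rule might look, here is a minimal Python sketch. The pattern and the function name clean_percent_ranges are assumptions made for this example, not the rule that ships with Text2Go. Note also that Python writes group references as \1 and \2, where the Text2Go Replace field uses $1 and $2.

```python
import re

# Assumed pattern: "nn%-mm%", capturing both numbers. One to three digits is a
# loose stand-in for "a number between 0 and 100".
PERCENT_RANGE = re.compile(r"\b(\d{1,3})%\s*-\s*(\d{1,3})%")


def clean_percent_ranges(text: str) -> str:
    """Rewrite 'nn%-mm%' as 'nn to mm percent'."""
    return PERCENT_RANGE.sub(r"\1 to \2 percent", text)


sample = ("A recent study showed that between 60%-65% of people "
          "preferred the colour green over blue.")
print(clean_percent_ranges(sample))
```

A single percentage such as '60%' is left untouched, because the pattern requires both halves of the range.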
Example 2 - Replacing Breaks in a Document with a Pause
Another common example where Text Cleanup Rules can be useful is in identifying breaks in a document and inserting a pause. For example, a row of ******** or ------------- is often used to denote a break in a document. By default these breaks would be pronounced as asterisk, asterisk, asterisk... and minus, minus, minus... This very quickly becomes tiresome.
Text2Go includes a rule to identify these breaks and replace them with a pause. A single rule can handle both forms of break and will match two or more *'s or -'s, with or without spaces in between.
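A pattern along these lines could be sketched in Python as follows. Both the expression and the '[pause]' marker are assumptions for illustration; the actual Text2Go rule and the way it represents a pause are not shown in this guide.

```python
import re

# Assumed pattern: two or more asterisks, or two or more hyphens,
# optionally separated by spaces or tabs.
BREAK = re.compile(r"\*(?:[ \t]*\*)+|-(?:[ \t]*-)+")

# Hypothetical placeholder standing in for whatever produces a pause.
PAUSE = "[pause]"


def mark_breaks(text: str) -> str:
    """Replace rows of * or - with the pause placeholder."""
    return BREAK.sub(PAUSE, text)


print(mark_breaks("End of part one.\n* * * * *\nStart of part two."))
print(mark_breaks("Intro\n-------------\nBody"))
```

A lone hyphen or asterisk (as in 'a - b') is not matched, since the pattern needs at least two symbols.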
Example 3 - Removing References from Research Papers
As a final example, Text Cleanup Rules can be used to remove references from a research paper. A research paper that's peppered with references becomes almost unbearable to listen to. There are a number of different reference formats. Here are a couple that Text2Go currently handles using Text Cleanup Rules.
The TUR syndrome can occur with other operations including transcervical resection of the endometrium (TCRE),15 28 72 TUR of bladder tumours,20 38 cystoscopy,109 127 arthroscopy,69 rectal tumour surgery, vesical ultrasonic lithotripsy and percutaneus nephrolithotripsy.19 24 115
Imatinib is a potent and specific inhibitor of the KIT protein-tyrosine kinase, which is constitutively activated in more than 90% of GISTs as a result of gain-of-function mutations in the KIT protooncogene [11–15]. The presence of KIT is readily detected through reactivity with the CD117 antigen on immunohistochemical assay, a marker that establishes the diagnosis in a gastrointestinal tract mesenchymal neoplasm with characteristic histologic features [16].
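The two reference styles shown above could be stripped with expressions like the ones in this Python sketch. The patterns are assumptions written for these two samples only; they are not the rules Text2Go ships, and the second one in particular is a heuristic that would also swallow a genuine number placed directly after a comma or full stop.

```python
import re

# Bracketed references such as [16] or [11–15], plus the space before them.
BRACKETED = re.compile(r"\s*\[\d+(?:\s*[–,-]\s*\d+)*\]")

# Bare reference numbers that follow a word (or closing parenthesis) and a
# comma or full stop with no space, e.g. "(TCRE),15 28 72" or "arthroscopy,69".
TRAILING_NUMBERS = re.compile(r"(?<=[A-Za-z)][,.])\d+(?: \d+)*")


def strip_references(text: str) -> str:
    """Remove both reference styles from the text."""
    text = BRACKETED.sub("", text)
    return TRAILING_NUMBERS.sub("", text)


print(strip_references("mutations in the KIT protooncogene [11–15]. The presence of KIT"))
print(strip_references("endometrium (TCRE),15 28 72 TUR of bladder tumours,20 38 cystoscopy,109 127 arthroscopy,69 rectal tumour surgery"))
```

Ordinary figures such as 'more than 90% of GISTs' are left alone because they are preceded by a space rather than by punctuation.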
Creating your Own Text Cleanup Rules
It's possible to create your own Text Cleanup Rules, but to do so you need to understand regular expressions. Watch the following video to see a simple example of adding a Text Cleanup Rule.
Watch a video of Text Cleanup Rules in action
To add a new Text Cleanup Rule, select Text Cleanup Rules... from the Pronunciation menu.

Then click on the Add... button.

The most important fields are the Find and Replace fields.
Find
In the Find field you must enter the text you wish to match. You can use regular expressions to create some very powerful search criteria. To help you create regular expressions, there is a Cheat Sheet available that lists the most common characters used to define regular expressions. Click on the button to show or hide the Cheat Sheet.
Find Options
| Case Sensitive | If checked, a case-sensitive comparison is used to match the text, e.g. 'cat' will only match 'cat', NOT 'Cat', 'CAT', 'CaT', etc. This is useful when matching acronyms. If not checked, a case-insensitive comparison is used, e.g. 'cat' will match 'cat', 'Cat', 'CAT', 'CaT', etc. |
| Singleline | Specifies single-line mode. Changes the meaning of the dot (.) so it matches every character (instead of every character except \n). |
| Multiline | If selected, changes the meaning of ^ and $ so they match at the beginning and end, respectively, of any line, and not just the beginning and end of the entire string. |
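The effect of these three options can be tried out in any regular expression engine. The Python sketch below uses Python's own flags as stand-ins: re.IGNORECASE for an unchecked Case Sensitive box, re.DOTALL for Singleline and re.MULTILINE for Multiline. This mapping is an assumption based on the descriptions above, and Text2Go's own engine may differ in detail.

```python
import re

text = "cat Cat CAT"
# Case Sensitive checked: only the exact-case word matches.
case_sensitive = re.findall(r"cat", text)
# Case Sensitive unchecked: all three spellings match.
case_insensitive = re.findall(r"cat", text, re.IGNORECASE)

two_lines = "first line\nsecond line"
# Without Singleline the dot stops at the newline, so there is no match.
dot_default = re.search(r"first.*second", two_lines)
# With Singleline the dot also matches \n, so the match spans both lines.
dot_singleline = re.search(r"first.*second", two_lines, re.DOTALL)

# Without Multiline, ^ only matches at the start of the whole string.
starts_default = re.findall(r"^\w+", two_lines)
# With Multiline, ^ matches at the start of every line.
starts_multiline = re.findall(r"^\w+", two_lines, re.MULTILINE)

print(case_sensitive, case_insensitive)
print(dot_default is None, dot_singleline is not None)
print(starts_default, starts_multiline)
```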
Replace
Enter the text that should replace the matched text. You can leave this field blank if you just want to remove the matched text, rather than replacing it.
It's possible to include text that matches the find expression as part of the replace expression. In the above example we only want to change the # symbol into 'number'. We still want to speak the digits that follow as a number. Therefore our replacement expression is 'number $1'. The $1 instructs Text2Go to use the first captured group of text from the find expression. Captured groups are any text that is enclosed in parentheses (). In the above example this is (\d+) which means one or more digits (e.g. 12, 1234, 768). If there were two capture groups in the find expression, you could refer to the second using $2.
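The same idea can be sketched in Python. The find expression '#(\d+)' is an assumption reconstructed from the description (only the (\d+) group is quoted in the text), and Python writes the group reference as \1 where Text2Go uses $1.

```python
import re

FIND = r"#(\d+)"          # assumed find expression: '#' followed by digits
REPLACE = r"number \1"    # Python equivalent of the 'number $1' replacement


def apply_rule(text: str) -> str:
    """Replace '#' followed by digits with 'number' and the same digits."""
    return re.sub(FIND, REPLACE, text)


print(apply_rule("See issue #12 and issue #1234."))
# A bare '#' has no digits after it, so it is left unchanged.
print(apply_rule("This is the # symbol."))
```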
Original Text
It's vital that you provide a sample of text to test the Text Cleanup Rule on. The easiest way to do this is to copy and paste some text from the original text that needs cleaning up.
Tip. It's also a good idea to include some text that you don't want to match. In the above example you might want to include some text like 'This is the # symbol.' to ensure that your expression doesn't replace the # symbol with 'number' in this case.
Updated Text
This field contains the text after the search and replace has been performed. It is updated whenever you change the Find, Replace, Find Options or Original Text.
You can listen to the text by clicking on the button.
Restricting the Scope of a Text Cleanup Rule
The following fields can be used to restrict a Text Cleanup Rule to a specific voice in the same way that they are used to restrict pronunciation corrections. These options are probably less applicable to Text Cleanup Rules but have been provided for consistency and flexibility.
Voice(s)
Used to restrict a rule to a particular voice or voices. This allows you to clean up text that is mispronounced by a particular voice. By default, a text cleanup rule will be used for all voices. Note that you don't need to specify the entire voice name and case is ignored. You can also specify more than one voice.
e.g. 'karen' will restrict the entry to the voice 'RealSpeak Karen'.
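The matching behaviour described here (partial names, case ignored, more than one voice allowed) might be modelled as in the Python sketch below. The function voice_matches and the idea of passing the restrictions as a list are assumptions for illustration; how Text2Go stores or separates multiple voices is not stated in this guide, and 'Example Voice' is a made-up name.

```python
from typing import Iterable


def voice_matches(restrictions: Iterable[str], voice_name: str) -> bool:
    """Return True if the rule should apply to the given voice."""
    wanted = [r.strip().lower() for r in restrictions if r.strip()]
    if not wanted:
        # No restriction entered: the rule is used for all voices.
        return True
    name = voice_name.lower()
    # A restriction matches if it appears anywhere in the voice name.
    return any(w in name for w in wanted)


print(voice_matches(["karen"], "RealSpeak Karen"))
print(voice_matches(["karen"], "Example Voice"))
print(voice_matches([], "Example Voice"))
```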
Language Variant
Used to restrict a rule to a particular language variant. Current English language variants include:
| ENU or en-US | US English |
| ENA or en-AU | Australian English |
| ENG or en-GB | UK English |
| ENI or en-GB | Indian English |
By default, an entry will match all language variants. To restrict an entry to Australian English voices use
en-AU
Indian English is classified as en-GB (UK English), so en-GB cannot single it out. To restrict an entry to Indian English voices only, use the three-letter code instead
ENI
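To see why the three-letter code is needed for Indian English, consider this small Python model of the table above. It is only a sketch: the function variant_matches and the assumption that a restriction matches either a voice's three-letter code or its locale code are illustrative, not a description of Text2Go's internals.

```python
# Three-letter codes and locale codes taken from the table above.
VARIANT_LOCALES = {
    "ENU": "en-US",
    "ENA": "en-AU",
    "ENG": "en-GB",
    "ENI": "en-GB",  # Indian English shares the en-GB locale code
}


def variant_matches(restriction: str, voice_code: str) -> bool:
    """Return True if a rule restricted to `restriction` applies to the voice."""
    wanted = restriction.strip().lower()
    if not wanted:
        # No restriction entered: matches all language variants.
        return True
    code = voice_code.upper()
    return wanted in (code.lower(), VARIANT_LOCALES[code].lower())


print(variant_matches("en-GB", "ENG"), variant_matches("en-GB", "ENI"))
print(variant_matches("ENI", "ENG"), variant_matches("ENI", "ENI"))
```

Under this model 'en-GB' selects both UK and Indian English voices, while 'ENI' selects Indian English voices only.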
Vendor
Used to restrict a rule to a particular vendor/manufacturer of voices (e.g. RealSpeak). By default, a rule will match voices supplied by all vendors. For example, to restrict an entry to just RealSpeak voices use
RealSpeak
Website
Restrict a rule to a particular website. For example, if you only want the rule applied on slashdot.org, enter slashdot.org. You need only enter enough of the URL to uniquely identify the website. This example will match all pages on the Slashdot website. You can also restrict a rule to a subdomain (e.g. blog.text2go.com) or a particular area of a website (e.g. theage.com.au/technology).
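One plausible reading of 'enough of the URL to uniquely identify the website' is a simple case-insensitive substring test, sketched below in Python. This is an assumption about the behaviour, not a documented algorithm, and the page URLs are made up for the example.

```python
def website_matches(restriction: str, page_url: str) -> bool:
    """Return True if a rule restricted to `restriction` applies to the page."""
    wanted = restriction.strip().lower()
    if not wanted:
        # No restriction entered: the rule applies on every website.
        return True
    return wanted in page_url.lower()


# Hypothetical page addresses used only to exercise the function.
print(website_matches("slashdot.org", "http://slashdot.org/story/example"))
print(website_matches("theage.com.au/technology", "http://www.theage.com.au/technology/example"))
print(website_matches("theage.com.au/technology", "http://www.theage.com.au/sport/example"))
```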
Rule Library
A 'User Text Cleanup Rule Library' is created the first time you go to create a Text Cleanup Rule. By default all new rules will be placed in this library. However it is possible to organise your rules into multiple libraries. This field lets you choose the library to place a new rule in or move an existing rule from one library to another.
Note: The Libraries that come with Text2Go are read-only. The rules within them cannot be modified or removed and new rules cannot be added. However they can be switched off. The reason for this is that these libraries are updated regularly and if you modified them you would lose all your changes the next time you received a new version from Text2Go.
Learning More About Regular Expressions
The best introduction to regular expressions can be found in The 30 Minute RegEx Tutorial.
There is an excellent free program for creating and testing regular expressions called Expresso. We can't recommend this program highly enough. This is a great tool for learning and experimenting with regular expressions.
Sharing Text Cleanup Rules
Text2Go uses an automatic-update-style service to share rules amongst Text2Go users. By default an update is done every 2 days, but this can be changed on the Pronunciation Options tab.

You can change how often you wish to check for updates or turn the service off completely. You can also decide whether you want to share your text cleanup rules with others. By default this is enabled, but once again you can turn it off. You can also decide whether you would like to contribute anonymously (the default) or identify your contributions with your email address.
Creating a New Rule Library
Sometimes you may want to place a certain set of rules into a separate library. You can create your own library to do this.
Select the Text Cleanup Rules... command from the Pronunciation menu.

Click on Add... to create a new library.

Enter some details to identify your library and click OK.
You have now created a new library. You can add new rules to this library or move entries from an existing library into this library.
Ordering of Text Cleanup Rules
Text Cleanup Rules will be applied to a document in the order in which they are listed. Ideally all Text Cleanup Rules should work independently and the order should not matter.
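The Python sketch below shows what can happen when two rules are not independent. Both rules are assumptions written for this example: a percentage range rule, and a hypothetical general rule that speaks any remaining '%' as 'percent'. Rules are modelled as (find, replace) pairs applied one after another.

```python
import re

# Assumed rule: 'nn%-mm%' becomes 'nn to mm percent'.
RANGE_RULE = (r"\b(\d{1,3})%\s*-\s*(\d{1,3})%", r"\1 to \2 percent")
# Hypothetical rule: any digit followed by '%' is spoken as 'percent'.
PERCENT_RULE = (r"(\d)%", r"\1 percent")


def apply_rules(text: str, rules) -> str:
    """Apply each (find, replace) rule in the order listed."""
    for find, replace in rules:
        text = re.sub(find, replace, text)
    return text


text = "between 60%-65% of people"
# Range rule first: the range is rewritten before the general rule sees it.
print(apply_rules(text, [RANGE_RULE, PERCENT_RULE]))
# General rule first: the '%' signs are gone, so the range rule never matches.
print(apply_rules(text, [PERCENT_RULE, RANGE_RULE]))
```

With the general rule listed first, the hyphen is left between the two percentages, which is the mispronunciation the range rule was meant to prevent.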
Text Cleanup Rule Libraries can be enabled, disabled and reordered from the Text2Go Text Cleanup Rules window. This window is available from the Pronunciation -> Text Cleanup Rules... menu.
