Nerd-Author Fun v2: Text Analysis

I promised this week that my blog post would be about some of my C# coding, which also happens to dovetail in beautifully with my writing. I’ve taken my earlier work and begun the super-charging process. That being said: this is just the beginning. In the future I plan to make it available, far more powerful and with a few of the bugs ironed out.

The general premise behind the program is that it can load your story from a text file, and then allow you to analyse it. At the moment it is sans-UI – which means it doesn’t have pretty user windows, checkboxes and other controls. I’m calling it Text Analysis Command Line (TAC). As it’s a command line program you have to type commands in to operate it.

So what can it do?

Like any good program it contains help – typing ‘?’ will give you a list and basic description of the available commands; typing ‘<command> ?’ will give you detailed options on that particular command.

wordcount-1

When you see a pipe symbol ‘|’ it means or. Square brackets (‘[‘, ‘]’) mean optional.

Most commands can either display output on the screen or save the results to a file. If using a single greater-than symbol (‘>’) the file will be saved (unless it already exists). Using the double option ‘>>’ will save the file, overwriting it if necessary.

Below is a description of all of the currently available commands. The results are based on processing Vengeance Will Come, my scifi/fantasy adventure (available now):

wordcount. You can display the frequency of every word used. Earlier in the year I bought Scrivener (left). For the most part it’s a great program but I was disappointed there was no way to export (or even easily query) word count data. The image of TAC (right) shows a snippet of both the textual version (default) and the ‘basic mode’ (using -b option). The basic mode is valuable if opening the file in Excel to do pretty graphs.

wordcount can also provide wordcount-word lengththe number of words which begin with given letters (-f) or the length of words (-l).

Just in case you’re curious the longest word at 20 characters is ‘uncharacteristically’. The three 16’s are: ‘conspiratorially’, ‘incontrovertible’ and ‘responsibilities’.

For the purpose of completeness, I’ll briefly mention the data command. At the moment it’s limited, a means to interrogate the data. In order to do all of this (and future) processing I painstakingly categorise every character of text into a type. Using the -expseg option outputs this information.

datacmd-2
-expseg option

 

At this stage the only two other data commands are -sen (output sentence). For example outputting the sentence at segment 128 is:

At first light they invade my mind, besieging it to the point of exhaustion.

And -block (output block) at 128:

“I wish that I didn’t know the future; that I couldn’t see the prophecies unfold before me. At first light they invade my mind, besieging it to the point of exhaustion. Even in my fitful sleep they haunt me as wild animals stalk the scent of blood, turning what little rest I get into an extension of my waking nightmare. I cannot escape.

The find command is powerful and will be leveraged heavily in future updates.

find-1

Unsurprisingly, find locates the occurrences of a specified word. Importantly the before and after options allow displaying the word in a variable level of context (e.g. want to see 10 words preceding the word, or only 5?).

find can also locate every instance of a specified type of punctuation. Want to know how often I use exclamation marks? Typing ‘find -p’ brings up a list of punctuation options from which a selection can be made.

find-2

The answer is of course 45 (as displayed on the screenshot). However, now I know exactly where they are (and in what context).

find-3.PNG
The first use of ! occurs at the 660th character in my novel.

I’m a big believer in not over-using the exclamation mark, so a tool like this would let me easily see how often I’ve used it in a given book (and calculate the amount of text between each usage). More importantly, it can also let me track down when I’ve used a ” instead of a “ which seems to happen no matter how careful I am.

This brings me to the end of the tour of TAC v0.0.1, I hope you liked it.

Advertisements

Got thoughts?

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s