I’ve spent a few days goofing off from writing. Well, kind of… it was writing-related.
I wrote a Java program that can load and process my novel. Having done that loading work, I’ll be able to add useful tools in the future, but for now I just did some basic word frequency analysis. Sounds like some nerd fun? It was.
First, technical stuff and then some results:
Loading it into the program turned out to be more difficult than I expected. Part of the difficulty was how I defined things on the page. When I was younger I’d have told you that anywhere there is a gap between blocks of text, you have a paragraph. In my mind, at least, the concept of a paragraph is stretched out of shape by the frequent carriage returns of dialogue.
I’m sure there’s a technical term (which I’m happy to be told), but I didn’t want to research it. So, I solved the problem like any fiction author: I just made words up.
For all time, until I find a better name, they shall be known as minor blocks (green) and major blocks (blue). The term paragraph may now be discontinued.
(I suspect I’m already in the process of changing my mind…)
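For the curious, here is a rough sketch of how that split could look in Java. It is not my actual loader. It assumes that a major block is anything separated from its neighbours by a blank line, and that a minor block is a single non-empty line inside a major block. The class name `BlockSplitter` and the method `split` are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: the names and the splitting rules are assumptions for illustration.
final class BlockSplitter {

    // Returns one inner list per major block; each inner list holds that block's minor blocks.
    // Assumption: blank lines separate major blocks, single line breaks separate minor blocks.
    static List<List<String>> split(String text) {
        List<List<String>> majorBlocks = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String line : text.split("\\R")) {   // \R matches \n, \r\n and \r
            if (line.isBlank()) {
                // A blank line closes the major block collected so far.
                if (!current.isEmpty()) {
                    majorBlocks.add(current);
                    current = new ArrayList<>();
                }
            } else {
                current.add(line.strip());
            }
        }
        if (!current.isEmpty()) {
            majorBlocks.add(current);             // last block has no trailing blank line
        }
        return majorBlocks;
    }

    public static void main(String[] args) {
        String sample = "\"Who goes there?\"\n\"A friend.\"\n\nThe gate creaked open.";
        List<List<String>> blocks = split(sample);
        for (int i = 0; i < blocks.size(); i++) {
            System.out.println("Major block " + (i + 1) + ": " + blocks.get(i).size() + " minor block(s)");
        }
    }
}
```

A line-by-line loop like this keeps a run of dialogue together as one major block, which is the whole reason I wanted two kinds of block in the first place.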
Before you peruse the results, you might wonder what possible good a tool like this might do. (Admittedly, at the moment there is too much information.) The tool could be used in the following ways:
- Some words are so peculiar or powerful that they should only be used once in a story. This tool will help locate those words. For example: gruesome (0), or horror (4). Wow, there’s a lot of cry (10) / crying (5) going on. I really need to check that… Point proven.
- There are also some words that mean nothing and should be replaced with more descriptive terms, like interesting (3).
- It could help expose word-use problems. For example, when my characters want to swear they say “frak”. If I find a “frack” or a “fak” then I know I’ve made a mistake.
- Nerdy pleasure (hey, it’s valid for me)
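The first three uses above boil down to "count every word, then look things up". Here is a minimal sketch of that idea, not my real code. It assumes a word is a run of letters, optionally joined by apostrophes, and it uses plain edit distance to flag anything one typo away from an intended word. The names `WordCheck`, `countWords` and `nearMisses` are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: names and tokenising rules are assumptions for illustration.
final class WordCheck {

    // Assumption: letters, optionally joined by a straight or curly apostrophe ("don't" stays whole).
    private static final Pattern WORD = Pattern.compile("\\p{L}+(?:['\u2019]\\p{L}+)*");

    // Counts each lower-cased word; a TreeMap keeps the words in alphabetical order.
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        Matcher m = WORD.matcher(text.toLowerCase(Locale.ROOT));
        while (m.find()) {
            counts.merge(m.group(), 1, Integer::sum);
        }
        return counts;
    }

    // Levenshtein distance: the number of single-character inserts, deletes or swaps.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            int[] cur = new int[b.length() + 1];
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            prev = cur;
        }
        return prev[b.length()];
    }

    // Words exactly one edit away from the intended word: candidates for typos.
    static List<String> nearMisses(Map<String, Integer> counts, String intended) {
        List<String> suspects = new ArrayList<>();
        for (String word : counts.keySet()) {
            if (editDistance(word, intended) == 1) {
                suspects.add(word);
            }
        }
        return suspects;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countWords("Frak! he said. Frack, fak, frak.");
        System.out.println("frak: " + counts.getOrDefault("frak", 0));
        System.out.println("gruesome: " + counts.getOrDefault("gruesome", 0));
        System.out.println("suspects: " + nearMisses(counts, "frak"));
    }
}
```

The near-miss list will also catch perfectly good words that happen to sit one letter away, so it is a list to eyeball, not one to trust blindly.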
When considering these results please note the following caveats:
- Not all bugs have been ironed out; give me a 5% margin for error.
- Contractions are included (so “don’t” and “do not” are counted as 2 words)
- There are no exclusions yet (“a”, “is”, etc. are included)
For a novel slightly over 86K words, I was surprised by the results.
- 8,443 unique words
- The top 10 most frequent words account for 18,624 words (the, to, and, a, of, he, you, was, his, I).
- Most frequent words per first letter: unsurprisingly, mostly character names. (A = and; B = be; C = could; D = Danyel; E = even; F = for; G = get; H = he; I = I; J = Jessica; K = Keeshar; L = like; M = Menas; N = not; O = of; P = people; Q = Queen; R = Regent; S = said; T = the; U = up; V = very; W = was; X = Xu; Y = you; Z = Zekkari).
- Everything above 15 characters long was a processing error 🙂
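If you want to pull the same kind of numbers out of your own word counts, here is a hedged sketch of how the summaries above could be computed. It assumes you already have a map of word to count. The class `FrequencyReport` and its methods are made-up names, and this is not the code that produced my figures.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical report helper: names are assumptions; input is any word-to-count map.
final class FrequencyReport {

    // The n most frequent words, highest count first; ties are broken alphabetically.
    static List<Map.Entry<String, Integer>> topN(Map<String, Integer> counts, int n) {
        Comparator<Map.Entry<String, Integer>> byCountThenWord = (x, y) -> {
            int c = Integer.compare(y.getValue(), x.getValue());
            return c != 0 ? c : x.getKey().compareTo(y.getKey());
        };
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(byCountThenWord);
        return entries.subList(0, Math.min(n, entries.size()));
    }

    // How many word occurrences the top n words account for.
    static int totalOfTopN(Map<String, Integer> counts, int n) {
        int total = 0;
        for (Map.Entry<String, Integer> e : topN(counts, n)) {
            total += e.getValue();
        }
        return total;
    }

    // The most frequent word for each first letter (on a tie, the first word seen wins).
    static Map<Character, String> topPerFirstLetter(Map<String, Integer> counts) {
        Map<Character, String> best = new TreeMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getKey().isEmpty()) {
                continue;                       // nothing to index on
            }
            char first = e.getKey().charAt(0);
            String current = best.get(first);
            if (current == null || e.getValue() > counts.get(current)) {
                best.put(first, e.getKey());
            }
        }
        return best;
    }

    // Words longer than maxLength: in my case, a quick way to spot processing errors.
    static List<String> longerThan(Map<String, Integer> counts, int maxLength) {
        List<String> longWords = new ArrayList<>();
        for (String word : counts.keySet()) {
            if (word.length() > maxLength) {
                longWords.add(word);
            }
        }
        longWords.sort(Comparator.naturalOrder());
        return longWords;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = Map.of("the", 5, "to", 3, "and", 3, "thequickbrownfoxjumps", 1);
        System.out.println("Unique words: " + counts.size());
        System.out.println("Top 2: " + topN(counts, 2) + " covering " + totalOfTopN(counts, 2) + " words");
        System.out.println("Per first letter: " + topPerFirstLetter(counts));
        System.out.println("Suspiciously long: " + longerThan(counts, 15));
    }
}
```

The unique word count is simply the size of the map, so it needs no method of its own.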