Count the Unique Words in War and Peace
This is a fun little example of line-oriented data processing.
Tolstoy’s War and Peace is sprawling and immense, and it uses a lot of words. But how many unique words does it contain?
Counting in Bash
Example
./countwords.sh < ./warandpeace.txt.gz
The first version of this program is a shell script.
cat - | gunzip | grep -o -E "(\\w|')+" | grep -v -P '^\d' | tr '[:upper:]' '[:lower:]' | sort | uniq | wc -l
Read from left to right:
- read stdin
- uncompress it, because it is compressed with gzip
- grep out the words (skip the numbers)
- make everything lower case
- find the unique words
- count them
This is what we are going to do in Deno.
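To make the mapping concrete, here is a minimal sketch of the same steps in TypeScript, applied to text that is already uncompressed and held in memory. The function name countUniqueWords is made up for illustration and is not taken from the countwords scripts. JavaScript's \w matches only ASCII letters, digits, and underscore, so the result may differ slightly from what grep reports under some locales.

```typescript
// Hypothetical helper: mirrors the shell pipeline stage by stage on a string.
function countUniqueWords(text: string): number {
  // grep -o -E "(\w|')+"  -> pull out runs of word characters and apostrophes
  const tokens: string[] = text.match(/(\w|')+/g) ?? [];
  // grep -v -P '^\d'      -> drop tokens that start with a digit
  const words = tokens.filter((token) => !/^\d/.test(token));
  // tr '[:upper:]' '[:lower:]'
  const lowered = words.map((word) => word.toLowerCase());
  // sort | uniq | wc -l   -> a Set keeps one copy of each word
  return new Set(lowered).size;
}

console.log(countUniqueWords("The the THE 1812 don't Don't peace")); // 3
```

A Set stands in for sort and uniq here because only the number of distinct words matters, not their order.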
Version #1: Streaming with No Shortcuts
Example
./countwords.ts < ./warandpeace.txt.gz
This version makes a great example because it does a little bit of everything, but it is starting to get cluttered and is a little hard to read. It demonstrates that you can stream a large amount of data through many different processing steps at the same time, with one of those steps running in your Deno process.
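The part that runs inside the Deno process has to cope with text arriving in chunks, where a word can be split across two chunks. The following is a rough sketch of one way to handle that. It is not the code from countwords.ts; the class name UniqueWordCounter and its methods are invented for this example.

```typescript
// Hypothetical incremental counter: accepts text one chunk at a time,
// so the full text never has to be held in memory.
class UniqueWordCounter {
  private seen = new Set<string>();
  private carry = ""; // trailing partial word held back from the previous chunk

  /** Feed the next chunk of uncompressed text. */
  push(chunk: string): void {
    const text = this.carry + chunk;
    // A word may be split across chunks, so hold back any trailing run of
    // word characters until more input (or finish) arrives.
    const tail = /[\w']*$/.exec(text)?.[0] ?? "";
    this.carry = tail;
    const complete = text.slice(0, text.length - tail.length);
    for (const token of complete.match(/[\w']+/g) ?? []) this.record(token);
  }

  /** Flush the held-back tail and return the number of unique words. */
  finish(): number {
    if (this.carry !== "") this.record(this.carry);
    this.carry = "";
    return this.seen.size;
  }

  private record(token: string): void {
    // Skip tokens that start with a digit; compare case-insensitively.
    if (!/^\d/.test(token)) this.seen.add(token.toLowerCase());
  }
}

const counter = new UniqueWordCounter();
for (const chunk of ["War and Pe", "ace, war AND", " peace in 1812"]) {
  counter.push(chunk);
}
console.log(counter.finish()); // 4
```

In a real script the chunks would come from a decoded stream rather than an array, with push called once per chunk. The set of unique words still grows with the vocabulary, but that is far smaller than the text itself.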
Version #2: Streaming with a Little Help from Bash
Example
./countwords2.ts < ./warandpeace.txt.gz
This version is more practical, easier to read, and quite a bit faster than the first (see code comments). Rather than shelling out individual commands, it shells out bash scripts that run several processes together.
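As an illustration of the idea, several stages can be joined into a single script string and handed to bash -c, so that bash wires up the pipes between them. This is only a sketch: the helper toBashScript is hypothetical, and the split into two scripts is a guess at one reasonable arrangement (here assuming the lowercasing stays in the Deno process), not a description of what countwords2.ts does.

```typescript
// Hypothetical helper: joins shell stages into one script for `bash -c`.
function toBashScript(stages: string[]): string {
  // pipefail makes the whole script fail if any stage fails,
  // not just the last one in the pipeline.
  return `set -o pipefail; ${stages.join(" | ")}`;
}

// Assumed split: one script before and one after the work done in Deno.
// String.raw keeps the backslashes exactly as bash and grep should see them.
const extractWords = toBashScript([
  "gunzip",
  String.raw`grep -o -E "(\w|')+"`,
  String.raw`grep -v -P '^\d'`,
]);
const countUnique = toBashScript(["sort", "uniq", "wc -l"]);

console.log(extractWords);
console.log(countUnique);
```

Running two scripts instead of six separate commands means fewer child processes for the Deno process to manage and fewer hops where data passes through it.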
Version #3: Non-Streaming
Example
./countwords3.ts < ./warandpeace.txt.gz
This version is written without streaming. It is more of a fun proof-of-concept than a practical solution. Each step runs to completion before the next begins, so the full output of every step is held in memory at once. It takes a lot longer to run than the other versions and uses more memory, but hey - it runs.
Bonus: A Text-to-Speech Reader
Example
zcat ./warandpeace.txt.gz | ./read.ts
I’ve never read War and Peace, but now my computer has! It may make your ears bleed, but this script will read out the entire series of four volumes - over a few days - reliably. It is a nice reminder that not everything that seems like a good idea is actually a good idea. Enjoy!