This script, which I found online and modified slightly, performs two operations:
- Extract all unique words contained in a text file, list them alphabetically, and save as a separate file.
- Count all the unique words in the file, determine their frequency of use, and save the top 100 (or a higher or lower number) in a list sorted by frequency.
Save the script as, e.g., wordlist.sh, copy a source text into a file and save it, and then run the following in the terminal:
bash wordlist.sh
#!/bin/bash
# Extracts all unique words from a text and creates an alphabetical list of all words used, and a separate frequency list of the 100 most used words in the source file.
# Assumes the file containing the text to be parsed is in the directory you run the script from; otherwise give a full path, e.g. */home/username/Desktop/filename.txt*.
# Fill in the name of the file that contains the source text, e.g. *original.txt*, as below.
# Fill in the output destinations for the two lists, e.g. *word_list_out.txt* and *word_freq_out.txt*; each file is created automatically if not present, and any existing contents are overwritten.
# Either of the two operations below can be disabled by putting a "#" at the start of each of its lines.
# Unique Word List:
cat original.txt | \
tr '[:upper:]' '[:lower:]' | \
grep -oE '[[:alpha:]]+' | \
sort --dictionary-order -u > word_list_out.txt
# Most Frequently Used Words List; change "100" to a different number if desired.
cat original.txt | \
tr '[:upper:]' '[:lower:]' | \
grep -oE '[[:alpha:]]+' | \
sort | uniq -c | \
sort -rn | head -n 100 > word_freq_out.txt
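To sanity-check the frequency pipeline, here is the same sequence of commands run on a short inline sample instead of a file (the sample string and the head count of 3 are just for illustration):

```shell
# Same pipeline as above, fed a small inline sample instead of original.txt.
sample='The cat sat. The cat ran. A dog sat.'
result=$(printf '%s\n' "$sample" | \
  tr '[:upper:]' '[:lower:]' | \
  grep -oE '[[:alpha:]]+' | \
  sort | uniq -c | \
  sort -rn | head -n 3)
# "the", "cat", and "sat" each appear twice, so all three top lines have count 2.
echo "$result"
```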
You could add an initial operation to auto-download the source text from a website, but personally I would prefer to find the exact block of text I want, then copy and paste it myself.
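If you did want to automate that download step, a minimal sketch with curl might look like this (the URL is a placeholder, and a real web page would usually need its HTML stripped before the word counts are meaningful):

```shell
# Placeholder URL; replace with the actual text source you want.
url="https://example.com/article.txt"
if curl -fsSL "$url" -o original.txt; then
  echo "saved source text to original.txt"
else
  echo "download failed; check the URL"
fi
```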
So, what would this be good for?
If, for example, you’re studying a foreign language, you can use this script to build a personalized vocabulary list from a particular article, piece of literature, or professional or technical document, focusing your study on source material you select yourself.
You can even import the generated lists into the Anki e-flashcard application so you can easily learn and review. And the Anki flatpak works on the Librem 5, so you can study on the go.
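Anki can import plain text files with one note per line, so one way to prepare the word list is to convert it into a tab-separated file (front = word, back left blank to fill in later). A sketch, using a stand-in list in place of the script's real output:

```shell
# Stand-in for word_list_out.txt; normally this is produced by the script above.
printf 'apple\nbanana\ncherry\n' > word_list_out.txt
# One note per line: the word, a tab, and an empty back field for the translation.
awk '{ print $0 "\t" }' word_list_out.txt > anki_import.txt
```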
(Thanks to: https://www.compciv.org/recipes/cli/reusable-shell-scripts/)