Script to Extract List of All Unique Words in a Text, plus Frequency

amarok · April 11, 2022, 5:38pm

This script, which I found online and then modified slightly, consists of two operations:

Extract all unique words contained in a text file, list them alphabetically, and save as a separate file.
Count all the unique words in the file, determine their frequency of use, and save the top 100 (or a higher or lower number) in a list sorted by frequency.

Save the script as, e.g., wordlist.sh, then paste your source text into a file and save it under the name the script expects (original.txt in the version below). Then run the following in the terminal:

bash wordlist.sh

#!/bin/bash

# Extracts all unique words from a text and creates an alphabetical list of all words used, and a separate frequency list of the 100 most used words in the source file.
# Assumes the file containing the text to be parsed is in the directory the script is run from; change the path if desired, e.g. */home/username/Desktop/filename.txt*, etc.
# Fill in name of file that contains the source text, e.g. *original.txt*, as below.
# For Unique Word List, fill in file destination of extracted word list, e.g. *word_list_out.txt*; any existing contents will be overwritten. File will be created automatically if not present.
# For a sorted list of only the 100 (etc.) of the most frequently used words in the source file, fill in file destination of extracted word list, e.g. *word_freq_out.txt*; any existing contents will be overwritten. File will be created automatically if not present.
# Either of the two operations below can be disabled with a "#" at the start of each line.

#Unique Word List:
cat original.txt | \
    tr [[:upper:]] [[:lower:]] | \
    grep -oE '[[:alpha:]]{1,}' | \
    sort | uniq -i | \
    sort --dictionary-order | cat > word_list_out.txt \
# Most Frequently Used Words List; change "100" to a different number if desired.
cat original.txt | \
    tr [[:upper:]] [[:lower:]] | \
    grep -oE '[[:alpha:]]{1,}' | \
    sort | uniq -c -i | \
    sort -rn | head -n 100 | cat > word_freq_out.txt \

You could add an initial operation to auto-download the source text from a website, but personally I would prefer to find the exact block of text I want, then copy and paste it myself.

So, what would this be good for?

If, for example, you’re studying a foreign language, you can use this script to personalize a vocabulary list for a particular article, piece of literature, professional or technical document, etc., so as to greatly focus your study, and with source material selected by you.

You can even import the generated lists into the Anki e-flashcard application so you can easily learn and review. And Anki flatpak works on the Librem 5, so you can study on-the-go.

(Thanks to: https://www.compciv.org/recipes/cli/reusable-shell-scripts/)

amarok · April 11, 2022, 5:45pm

Examples, based on several paragraphs of an Italian text I copied from a website:

Unique word list

è
È
a
abitanti
abruzzese
abruzzo
ad
agro
ai
al
albani
albano
alcune
all
alla
alle
allo
alti
alto
altri
alture
anche
andamento
aniene
anno
annuale
annui
antichità
anzi
anzio
appennini
appennino
arcipelago
area
ariccia
arriva
arrone
assiali
atlantiche
attorno
attraversare
attuale
aumentano
aurunci
ausoni
bacino
bagnato
bassa
bolsena
bonifica
bracciano
breve
c
cala
campania
cantari
capo
capoluogo
caratteristica
caratteristiche
caratterizza
castelli
castiglione
castro
catene
centrale
che
ciascuno
cicolano
cielo
cimini
circeo
città
civitavecchia
clima
colli
collinare
collinari
colloca
come
compatri
composto
comprendeva
compresa
comune
comuni
con
confina
confine
continentale
continua
corre
corrono
corsa
corso
costa
costiera
costiero
cui
d
da
dagli
dai
dal
dall
dalla
dalle
davanti
decine
degradano
dei
del
dell
della
delle
destra
di
diagonalmente
dimensioni
dintorni
direttamente
distinti
distribuite
dolcemente
dopo
dove
duchessa
due
e
ed
elementi
eliofania
enclave
era
erge
ernici
esposti
essere
est
estendendosi
estensione
estiva
eterogeneità
fascia
felice
finiscono
fino
fiora
fisiche
fiume
fiumi
fiumicino
foce
fonte
formia
foro
fossili
fra
fredda
freddi
frosinone
gabii
gaeta
garigliano
generale
genere
gennaio
geologici
gestite
giornate
giuturna
gli
golfo
gorzano
grottaferrata
gruppi
gruppo
i
idrografico
il
in
inferiore
inferiori
inoltre
interessare
intermedie
internamente
interno
invece
invernale
inverni
inverno
isolato
isole
italia
italiano
km
l
la
lacustre
laga
laghetti
laghi
lago
latina
latium
laziale
laziali
lazio
le
lepini
limitata
limite
linaro
linea
liri
lo
lombardia
loro
luglio
lungo
m
ma
maggior
maggiormente
mar
marche
marciana
mare
maremma
maremmana
mareografico
marta
massicci
massimi
media
mediamente
medie
medio
mentre
meridionale
meridione
meteorologiche
minimi
minori
mm
modeste
molise
molte
molto
monitorato
montalto
monte
monti
montuose
montuosi
nei
nel
nell
nella
nelle
nemi
nera
nettuno
nevicate
nevose
non
nona
nonostante
nord
notevole
notturne
numero
numerosi
o
occasioni
occupa
occupata
occupato
oltre
omogenee
operata
ordinario
ore
orientale
origine
ormai
ovest
paglia
palmarola
paludi
pantano
paralleli
parte
partendo
particolare
per
perturbazioni
più
piana
pianura
pianure
piccola
piccole
piega
piuttosto
pluviometrici
poi
pontine
pontino
ponza
ponziano
popolata
porci
porzione
possono
posti
prata
precipitazioni
presenta
presente
presenti
presenza
presso
prevalentemente
prevalenza
prima
principale
principali
promontorio
prosciugato
prossime
prossimità
provincia
punto
quali
quella
quelle
quello
questa
questi
questo
qui
quota
quote
raggiungendo
raggiungere
raggiungono
rarissime
reatina
reatini
regionale
regione
registrano
registrare
regolare
relativamente
remota
restante
ricoperto
ricorda
rieti
rigide
riguardo
rilievi
risulta
risultano
ritaglia
roma
romani
romano
sabatini
sabbiosa
sabini
sacco
san
scarse
scendono
scorrono
secca
secco
seconda
segnalato
seguendo
seguita
sei
sempre
separati
sereno
serie
settentrionale
settentrionali
sfondo
si
simbruini
sinistra
sole
soltanto
sono
spazio
sporadiche
sporgenze
stagione
stagioni
statuto
stazioni
su
sua
successione
sud
suddivisi
sui
sul
sullo
suo
superficie
superiori
temperatura
temperature
tempio
terra
territorio
testimoniata
tevere
tirrenico
tirreno
tolfa
toscana
tra
tramonto
tratta
tratto
tre
treia
tributari
trova
trovano
troviamo
tuscolo
tutta
tutte
tutto
ufficio
umbria
un
una
unica
uno
va
valle
valori
variabilità
variano
varie
vaticano
velletri
versante
verso
vesta
vi
vico
viterbo
volsci
volsini
vulcanica
zero
zona
zone

Top 100 list:

     44 di
     37 il
     34 e
     26 la
     23 con
     22 monti
     22 i
     15 del
     14 si
     14 a
     13 che
     12 in
     12 della
     11 regione
     11 dei
     10 è
     10 da
      8 sono
      8 più
      8 le
      8 l
      7 verso
      7 una
      7 sud
      7 nel
      7 al
      6 tra
      6 tevere
      6 roma
      6 nord
      6 montuosi
      6 lazio
      6 lago
      6 dell
      6 anche
      5 un
      5 troviamo
      5 parte
      5 ovest
      5 nella
      5 gruppi
      5 est
      5 ed
      5 dalla
      5 confine
      4 zona
      4 vulcanica
      4 valori
      4 tirreno
      4 sui
      4 quella
      4 prossimità
      4 per
      4 origine
      4 mm
      4 città
      4 campania
      3 zone
      3 valle
      3 trovano
      3 territorio
      3 suo
      3 sabini
      3 romano
      3 qui
      3 questi
      3 precipitazioni
      3 pianure
      3 nei
      3 monte
      3 mare
      3 mar
      3 loro
      3 italia
      3 gaeta
      3 dall
      3 dal
      3 corso
      3 clima
      3 annui
      3 ai
      3 agro
      2 volsini
      2 umbria
      2 trova
      2 tratta
      2 toscana
      2 simbruini
      2 settentrionale
      2 sacco
      2 sabatini
      2 risulta
      2 rilievi
      2 registrano
      2 reatini
      2 raggiungono
      2 promontorio
      2 presenta
      2 possono
      2 piuttosto

I could have used the -i option with uniq to make the script ignore case. (See uniq --help.) I’ve added this to the OP.
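
One caveat with uniq -i: uniq only merges adjacent lines, so the sort in front of it also has to place case variants next to each other. In the script above tr has already lowercased everything, so this only matters if that step is removed. A small sketch, assuming the C locale (chosen here only to make the sort order predictable) and made-up sample words:

```shell
#!/bin/bash
export LC_ALL=C   # assumption: byte-order sorting, so uppercase sorts before lowercase

# Plain sort puts "abc" between "Roma" and "roma", so uniq -i cannot merge them.
plain_count=$(printf 'Roma\nabc\nroma\n' | sort | uniq -i | grep -c '')

# sort -f ignores case while ordering, so the two variants end up adjacent.
folded_count=$(printf 'Roma\nabc\nroma\n' | sort -f | uniq -i | grep -c '')

echo "sort | uniq -i    -> $plain_count lines"
echo "sort -f | uniq -i -> $folded_count lines"
```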

JR-Fi · April 11, 2022, 6:34pm

It’s good that you thought of something positive to use it on. Clever way to target the most useful words. “Laziness” is one of the most powerful forces in humanity (hey, it’s why computers were built).

My thought pattern led me to “text fingerprinting”, where characteristics of a text (including choice of expressions and synonyms, frequency of “and” etc., common spelling errors and so on) are used to identify who wrote what and possibly identify a person (to a certain statistical accuracy or inaccuracy). You could also analyze a list of passwords (characters) and passphrases with a similar technique to find out what not to use, or how to target an attack (depending on what side you are on). Listing the most uncommon words might also reveal something about the text’s content and meaning. Case could be used to identify names and possibly create a draft of meta-information keywords: repeat for a few texts and then use some network-visualization tool or mind mapper to create a map of how everything connects (you could do that for foreign words too, I suppose). Some rudimentary algorithms use the number of certain words to determine the writer’s mood, as well as whether the text is happy/angry/etc. (“I’m sorry, Dave. You have used verbs so many times, I think you are angry. I can not let you continue until you have calmed down. … Dave, why are you writing gibberish with your forehead again so many times?”).

Gavaudan · April 11, 2022, 6:49pm

Just FYI, if your script starts with a shebang line (#!/bin/bash) and is marked executable (chmod +x), you don’t have to specify bash. Same goes for e.g. #!/bin/python.

amarok · April 11, 2022, 6:54pm

Thanks…still learning. So, like this?
./wordlist.sh

Gavaudan · April 11, 2022, 7:25pm

Yes indeed.

Edit: assuming, of course, you’re executing from the same location that the script is in.
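
For the ./wordlist.sh form to work, the script also has to be marked executable. A minimal sketch, using a throwaway script called hello.sh (a hypothetical stand-in for wordlist.sh) in a temporary directory:

```shell
#!/bin/bash
# Work in a scratch directory so nothing in the current directory is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# hello.sh is a hypothetical stand-in for wordlist.sh.
printf '%s\n' '#!/bin/bash' 'echo "hello from the script"' > hello.sh

chmod +x hello.sh            # mark the file executable so it can be run as ./hello.sh
via_shebang=$(./hello.sh)    # the #! line selects bash as the interpreter
via_bash=$(bash hello.sh)    # naming the interpreter explicitly also works

echo "$via_shebang"
echo "$via_bash"
```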

irvinewade · April 12, 2022, 1:16am

The arguments to tr should each be enclosed in single quotes, since otherwise the shell will treat those character matches to tr as things for potential expansion by the shell itself - and hence your command will malfunction randomly depending on your current directory, the files therein, and your shell ‘glob’ settings.

As an illustration, my current directory contains a file called t i.e. no doubt some junk temporary file created years ago and inadvertently not cleaned up by me.

ls -l [[:lower:]]
will give me the details of that file, and
echo [[:lower:]]
will just tell me its name (t).

Shell learning: Keep in mind all the punctuation characters that have special meaning in the shell (there are lots of them) and always choose the correct type of quotes (or backslash) to neuter that meaning when this is the desired behaviour.
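
The effect described above can be reproduced in a scratch directory. This is only a sketch: it assumes default bash glob settings and a directory whose single file is named t, as in the illustration.

```shell
#!/bin/bash
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch t                                   # a stray one-letter file, as in the post

unquoted=$(echo [[:lower:]])              # the shell expands the pattern to the file name
quoted=$(echo '[[:lower:]]')              # the quotes keep the pattern literal

# Unquoted, tr no longer receives [[:lower:]] as its second argument but the
# file name t, so the output is not the lowercased input.
broken=$(printf 'ABC\n' | tr [[:upper:]] [[:lower:]])
fixed=$(printf 'ABC\n' | tr '[[:upper:]]' '[[:lower:]]')

printf '%s\n' "unquoted echo: $unquoted" "quoted echo:   $quoted"
printf '%s\n' "unquoted tr:   $broken" "quoted tr:     $fixed"
```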

Also
cat filename | somecommand | ...
is usually equivalent to
somecommand <filename | ...

Occasionally, for readability, I would opt for the former, e.g. when somecommand is quite long and the input file name at the end would be hard to find, and sometimes for maintainability, e.g. when I am likely to insert further commands at the beginning of the pipeline.

Same issue with the cat > at the end.

Also don’t continue with \ the last part of a command.

Ciao.

Dwaff · April 12, 2022, 1:24am

With bash, you can always put a redirection at the beginning of the command. Or indeed, in the middle of the command too. For some reason, most people find it awkward though.

irvinewade · April 12, 2022, 1:29am

OK, so for the benefit of @amarok, also equivalent to
<filename somecommand | ...
Even better.
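
A quick check that the three forms produce the same result, sketched with a made-up sample file (sample.txt is a hypothetical name):

```shell
#!/bin/bash
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'Roma\nLazio\n' > sample.txt       # hypothetical sample input

with_cat=$(cat sample.txt | tr '[[:upper:]]' '[[:lower:]]')
redirect_after=$(tr '[[:upper:]]' '[[:lower:]]' <sample.txt)
redirect_first=$(<sample.txt tr '[[:upper:]]' '[[:lower:]]')

# All three variables hold the same two lowercased lines.
printf '%s\n' "$with_cat" "$redirect_after" "$redirect_first"
```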

amarok · April 12, 2022, 12:23pm

Thanks, @irvinewade && @Dwaff. I only copied the above from a tutorial and then mimicked it for the additional op. So what would be the preferred rewrite of this, based on what you said?

Dwaff · April 12, 2022, 2:37pm

No preference. The version with cat in front spawns one extra process, but in this particular case it does not matter at all. Take what you like best.

irvinewade · April 12, 2022, 10:45pm

<original.txt \
    tr '[[:upper:]]' '[[:lower:]]' | \
    grep -oE '[[:alpha:]]{1,}' | \
    sort | uniq -i | \
    sort --dictionary-order >word_list_out.txt

Then be a bit sceptical about the qualifications of the person who wrote the tutorial.

Putting the args to tr in quotes - there’s no “preference” about it. Without quotes, it is wrong and it will fail.

As for the trailing backslash: leaving it in does not currently cause a malfunction, but it is at high risk of doing so after some future edit. No real issue of preference there either.

Getting rid of the redundant cats - that’s just my preference.
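
The rewrite above covers only the unique-word operation. The frequency operation from the original post could be rewritten along the same lines. This is a sketch rather than a definitive version: the file names are those of the original post, and the sample text is created here purely for demonstration.

```shell
#!/bin/bash
workdir=$(mktemp -d)
cd "$workdir" || exit 1
# Sample text for demonstration only; use your own original.txt in practice.
printf 'Il lago e il monte.\nIl lago.\n' > original.txt

# Same steps as the original frequency list, with quoted tr arguments,
# no cat at either end, and no trailing backslash on the last line.
<original.txt \
    tr '[[:upper:]]' '[[:lower:]]' | \
    grep -oE '[[:alpha:]]{1,}' | \
    sort | uniq -c -i | \
    sort -rn | head -n 100 >word_freq_out.txt

cat word_freq_out.txt
```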

amarok · April 12, 2022, 11:12pm

A similar, potentially useful script for extracting entire lines that begin with a letter or symbol, in their original order, and excluding lines that start with digits:

#!/bin/bash

cat subtitles.txt | \
    grep -v '^[0-9]' | \
    cat > lines_out.txt \

As an additional foreign language study use-case:

Download a movie subtitle file from, e.g. opensubtitles[.]org, and save as a text file. The numerous entries will each look something like this excerpt with item number, time-stamps, and text (note the leading hyphens and leading diacritic capitals in some positions):

23
00:07:53,967 --> 00:07:57,084
Non fa quei corsi,
quella scuola serale? Eh!

24
00:07:58,367 --> 00:08:00,278
- Magari ha fatto tardi...
- Vignani!

25
00:08:00,767 --> 00:08:03,122
È arrivato, è arrivato!
Niente, grazie!

Run the script on the file and the result is cleaned up:

Non fa quei corsi,
quella scuola serale? Eh!

- Magari ha fatto tardi...
- Vignani!

È arrivato, è arrivato!
Niente, grazie!

Now it’s easy to import the output into Anki or just read the file itself for language study.

Internalizing whole sentences of dialog at once, instead of just single words, is more efficient and more natural than just reading literature to build language skills.

EDIT: See below for an improved, more concise script.

irvinewade · April 12, 2022, 11:39pm

Your script has lost its leading # though - and this is surely an example where the redundant cats should be removed, leaving you with just the grep command and no pipeline.

amarok · April 12, 2022, 11:45pm

Thanks. Fixed the #. (copy/paste error)

I guess I’m confused. How do I designate the destination file?
Please rewrite it for me so I know I’ve understood it.

Like this?
grep -v '^[0-9]' subtitles.txt > lines_out.txt

irvinewade · April 13, 2022, 12:29am

Yes, that’s fine. Or if you want to make it more obscure:

<subtitles.txt grep -v '^[0-9]' >lines_out.txt

FYI, in some cases you can do the -v differently e.g. grep '^[^0-9]' but be careful in any case to understand that the two occurrences of ^ have completely different meanings. man grep is your friend.
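
One practical difference between the two forms: grep -v '^[0-9]' keeps blank lines (they do not start with a digit), while grep '^[^0-9]' drops them, because [^0-9] has to match one actual character. A sketch using part of the subtitle excerpt from earlier in the thread:

```shell
#!/bin/bash
workdir=$(mktemp -d)
cd "$workdir" || exit 1

cat > subtitles.txt <<'EOF'
23
00:07:53,967 --> 00:07:57,084
Non fa quei corsi,
quella scuola serale? Eh!

24
00:07:58,367 --> 00:08:00,278
- Magari ha fatto tardi...
- Vignani!
EOF

# -c prints the number of selected lines instead of the lines themselves.
kept_with_v=$(grep -vc '^[0-9]' subtitles.txt)       # text lines plus the blank line
kept_with_class=$(grep -c '^[^0-9]' subtitles.txt)   # text lines only

echo "grep -v '^[0-9]' keeps $kept_with_v lines"
echo "grep '^[^0-9]' keeps $kept_with_class lines"
```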

amarok · April 13, 2022, 12:32am

Excellent, thanks! I really like this stuff, as I always loved building Excel macros to do all my tedious data chores at work.

tibfulv · April 18, 2022, 7:07am

As a future project, you can also let the user choose which list to make. Getopt would be involved. Implementation is left as an exercise for the reader.
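
One possible shape for that exercise, sketched with the shell's built-in getopts rather than the external getopt program. The function name make_lists and the option letters (-u for the unique list, -f for the frequency list, -n for its length) are hypothetical choices, not something from the thread.

```shell
#!/bin/bash
# Hypothetical interface: make_lists [-u] [-f] [-n N] [file]
make_lists() {
    local OPTIND=1 opt do_unique=0 do_freq=0 top=100
    while getopts 'ufn:' opt; do
        case "$opt" in
            u) do_unique=1 ;;
            f) do_freq=1 ;;
            n) top=$OPTARG ;;
            *) echo 'usage: make_lists [-u] [-f] [-n N] [file]' >&2; return 2 ;;
        esac
    done
    shift $((OPTIND - 1))
    local src=${1:-original.txt}          # default name taken from the original post

    if [ "$do_unique" -eq 1 ]; then
        <"$src" tr '[[:upper:]]' '[[:lower:]]' | grep -oE '[[:alpha:]]{1,}' |
            sort | uniq -i | sort --dictionary-order >word_list_out.txt
    fi
    if [ "$do_freq" -eq 1 ]; then
        <"$src" tr '[[:upper:]]' '[[:lower:]]' | grep -oE '[[:alpha:]]{1,}' |
            sort | uniq -c -i | sort -rn | head -n "$top" >word_freq_out.txt
    fi
}

# Demonstration on a throwaway sample file: frequency list only, top 2 words.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'Il lago e il monte.\nIl lago.\n' > original.txt
make_lists -f -n 2 original.txt
cat word_freq_out.txt
```

Because -u was not given in the demonstration, word_list_out.txt is not created; passing both -u and -f would produce both files.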

jemptymethod · November 22, 2024, 6:22pm

Sorry to necro this two and a half years later; I just joined the forum today after finding this post through a Google search. I find this very interesting, as I am planning to create a vocabulary from Italian text; the next step for me will be, for each word, to execute curl to access an online Italian-English dictionary. I won’t go that far here, but I am very thankful I found this post!