A similar, potentially useful script that extracts whole lines beginning with a letter or symbol, in their original order, and excludes lines that start with a digit:
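#!/bin/bash
# Drop every line that starts with a digit (entry numbers and time-stamps);
# all other lines pass through in their original order.
cat subtitles.txt | \
grep -v '^[0-9]' | \
cat > lines_out.txt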
An additional use case, for foreign-language study:
- Download a movie subtitle file from, e.g., opensubtitles[.]org and save it as a text file. Each of the many entries looks something like this excerpt, with an entry number, time-stamps, and text (note the leading hyphens and the accented capital letters at the start of some lines):
23
00:07:53,967 --> 00:07:57,084
Non fa quei corsi,
quella scuola serale? Eh!
24
00:07:58,367 --> 00:08:00,278
- Magari ha fatto tardi...
- Vignani!
25
00:08:00,767 --> 00:08:03,122
È arrivato, è arrivato!
Niente, grazie!
- Run the script on the file, and the output keeps only the dialog lines:
Non fa quei corsi,
quella scuola serale? Eh!
- Magari ha fatto tardi...
- Vignani!
È arrivato, è arrivato!
Niente, grazie!
- Now it’s easy to import the output into Anki or just read the file itself for language study.
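If one line per subtitle entry is preferred (for example, one card per entry), the text lines of each entry can be joined before import. The following is only a minimal sketch under stated assumptions: `join_cues` is a hypothetical helper name, not part of the original script, and it assumes, like the script above, that no line of dialog begins with a digit. It also strips a trailing carriage return in case the file was saved with Windows line endings.

```shell
#!/bin/bash
# Hypothetical helper (not from the original script): print each subtitle
# entry as a single line by joining its text lines with a space.
# Assumption: dialog lines never start with a digit, so any line that does
# (entry number or time-stamp) marks the boundary of an entry.
join_cues() {
  awk '
    { sub(/\r$/, "") }                       # strip a trailing carriage return, if any
    /^[0-9]/ || /^$/ {                       # entry number, time-stamp, or blank line
      if (cue != "") print cue               # emit the entry collected so far
      cue = ""
      next
    }
    { cue = (cue == "" ? $0 : cue " " $0) }  # append this text line to the entry
    END { if (cue != "") print cue }         # emit the last entry
  ' "$1"
}

# Usage (file names are examples):
#   join_cues subtitles.txt > cues_out.txt
```

With the excerpt above as input, the second entry would come out as the single line `- Magari ha fatto tardi... - Vignani!`.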
Internalizing whole sentences of dialog at once, rather than single words, can be a more efficient and more natural way to build language skills than reading literature alone.
EDIT: See below for an improved, more concise script.