Word clouds are becoming increasingly popular in data science and analytics. They are the successors, or better said the younger siblings, of tag clouds. The main difference is that tag clouds are usually built from tags written manually by users or chosen from a predefined list, while word clouds are based on text analytics performed mostly by software. Word clouds are used to extract the essence of a text corpus and represent it in graphic form. Words with a higher frequency are shown bigger, and those with a lower frequency are shown smaller or excluded altogether. This is the basic concept, but the actual algorithm is quite a bit more complex. Is it possible, then, to apply the word cloud algorithm to custom text without coding?
The short answer is yes. The long answer is a bit more complicated and requires installing Python and a word cloud package. There are a few open source Python packages you can use to generate word clouds from a text corpus. Most of them have multiple dependencies, which means they require other packages to be installed in order to function properly.
The reason for the multiple packages is that making a word cloud requires several preparatory steps on the corpus text, such as tokenization, lemmatization and stemming. Each step can be tweaked manually to produce the desired output, leading finally to the graphical representation of the word cloud. Therefore, to get the most out of your word cloud you will have to jump into the code and “get your hands dirty” at some point, because Python word cloud packages are meant to be used from your code and not as standalone applications.
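To make those preparatory steps more concrete, here is a minimal Python sketch of the basic idea: split the text into tokens, drop stopwords, count what is left and scale a font size by frequency. It is a simplified illustration under our own assumptions, not the algorithm of any particular package. The tiny stopword list, the function names `word_frequencies` and `font_sizes`, and the linear scaling are all made up for this example, and lemmatization and stemming are left out.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption); real packages ship longer ones.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "are"}


def word_frequencies(text, stopwords=STOPWORDS, max_words=200):
    """Tokenize, drop stopwords, and count the remaining words."""
    tokens = re.findall(r"[a-z']+", text.lower())  # crude tokenization
    counts = Counter(t for t in tokens if t not in stopwords and len(t) > 1)
    return counts.most_common(max_words)  # most frequent words first


def font_sizes(frequencies, min_size=10, max_size=80):
    """Scale each count linearly against the top count, never below min_size."""
    if not frequencies:
        return {}
    top = frequencies[0][1]
    return {
        word: max(min_size, round(max_size * count / top))
        for word, count in frequencies
    }


sample = "The cloud of words: clouds show words, and frequent words are bigger."
freqs = word_frequencies(sample)
sizes = font_sizes(freqs)
print(sizes)
```

Note that "cloud" and "clouds" are counted separately here. Merging such variants is exactly the kind of work that lemmatization, stemming or plural normalization takes care of in a real package.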
However, we will show you how to install a word cloud package that doesn’t require coding skills. It has a CLI (Command Line Interface), so once we install Python and the package and set up the environment, you’ll be able to generate word cloud images directly from the command line with some degree of configuration and customization. We will use an instance of Ubuntu Linux as the operating system for installing the word cloud CLI, but the package should work on other operating systems as well.
First let’s check whether Python is already installed. We will check for Python version 3.
$ python3 --version
Python 3.8.10
If you get output like this, you can skip the following installation steps.
$ sudo apt install software-properties-common
$ sudo add-apt-repository ppa:deadsnakes/ppa
$ sudo apt update
$ sudo apt install python3.8
$ sudo apt install python3-pip
Installing the software-properties-common package gives you the ability to add custom PPA repositories, like the deadsnakes repository, which offers newer Python releases than the standard Ubuntu repositories. You also need pip, the package installer for Python, so you can easily install the required Python packages.
Now you are ready to install the word cloud package.
$ pip install wordcloud
The word cloud package is now installed and ready to use. You can import it in your code, or you can use the CLI version as a standalone command, wordcloud_cli. But before we can use it as a command, we need to add the directory where pip places the installed command line scripts to the PATH environment variable.
$ echo "export PATH=\"`python3 -m site --user-base`/bin:\$PATH\"" >> ~/.bashrc
$ source ~/.bashrc
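To verify that the change took effect, a small Python check can help. The sketch below is our own helper, not part of the wordcloud package. It assumes a Linux system where pip installed the package into the per-user location, so that the scripts end up in the `bin` directory under the user base (the same directory the `python3 -m site --user-base` call above points to). The function names `user_bin_dir` and `is_on_path` are hypothetical.

```python
import os
import shutil
import site


def user_bin_dir():
    """Directory where a per-user pip install places scripts on Linux (assumed)."""
    return os.path.join(site.getuserbase(), "bin")


def is_on_path(directory):
    """Return True if `directory` is one of the entries in PATH."""
    entries = os.environ.get("PATH", "").split(os.pathsep)
    wanted = os.path.normpath(directory)
    return any(os.path.normpath(entry) == wanted for entry in entries if entry)


bin_dir = user_bin_dir()
print("user scripts directory:", bin_dir)
print("listed in PATH:", is_on_path(bin_dir))
# shutil.which returns None when the command cannot be found on PATH.
print("wordcloud_cli found at:", shutil.which("wordcloud_cli"))
```

If the last line prints None even though the directory is listed in PATH, the package was probably installed somewhere else, for example system-wide or inside a virtual environment.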
Once PATH is set properly, you can check the options of the wordcloud_cli command:
$ wordcloud_cli --help
usage: wordcloud_cli [-h] [--text file] [--regexp regexp] [--stopwords file]
                     [--imagefile file] [--fontfile path] [--mask file]
                     [--colormask file] [--contour_width width]
                     [--contour_color color] [--relative_scaling rs]
                     [--margin width] [--width width] [--height height]
                     [--color color] [--background color] [--no_collocations]
                     [--include_numbers] [--min_word_length min_word_length]
                     [--prefer_horizontal ratio] [--scale scale]
                     [--colormap map] [--mode mode] [--max_words N]
                     [--min_font_size size] [--max_font_size size]
                     [--font_step step] [--random_state seed]
                     [--no_normalize_plurals] [--repeat] [--version]

A simple command line interface for wordcloud module.

optional arguments:
  -h, --help            show this help message and exit
  --text file           specify file of words to build the word cloud
                        (default: stdin)
  --regexp regexp       override the regular expression defining what
                        constitutes a word
  --stopwords file      specify file of stopwords (containing one word per
                        line) to remove from the given text after parsing
  --imagefile file      file the completed PNG image should be written to
                        (default: stdout)
  --fontfile path       path to font file you wish to use (default:
                        DroidSansMono)
  --mask file           mask to use for the image form
  --colormask file      color mask to use for image coloring
  --contour_width width
                        if greater than 0, draw mask contour (default: 0)
  --contour_color color
                        use given color as mask contour color - accepts any
                        value from PIL.ImageColor.getcolor
  --relative_scaling rs
                        scaling of words by frequency (0 - 1)
  --margin width        spacing to leave around words
  --width width         define output image width
  --height height       define output image height
  --color color         use given color as coloring for the image - accepts
                        any value from PIL.ImageColor.getcolor
  --background color    use given color as background color for the image -
                        accepts any value from PIL.ImageColor.getcolor
  --no_collocations     do not add collocations (bigrams) to word cloud
                        (default: add unigrams and bigrams)
  --include_numbers     include numbers in wordcloud?
  --min_word_length min_word_length
                        only include words with more than X letters
  --prefer_horizontal ratio
                        ratio of times to try horizontal fitting as opposed to
                        vertical
  --scale scale         scaling between computation and drawing
  --colormap map        matplotlib colormap name
  --mode mode           use RGB or RGBA for transparent background
  --max_words N         maximum number of words
  --min_font_size size  smallest font size to use
  --max_font_size size  maximum font size for the largest word
  --font_step step      step size for the font
  --random_state seed   random seed
  --no_normalize_plurals
                        whether to remove trailing 's' from words
  --repeat              whether to repeat words and phrases
  --version             show program's version number and exit
As you can see, there are a whole lot of different options you can play with while creating word clouds. You should have your text corpus ready as a txt file. For testing purposes you can copy and paste some text from Wikipedia into a corpus.txt file and generate a word cloud like this:
$ wordcloud_cli --text corpus.txt --imagefile wordcloud.png
This will generate a wordcloud.png image of the corpus.txt text with default settings. To customize it a bit, you could make the image bigger with the --width and --height arguments, and you could reduce the number of words appearing in the word cloud from the default of 200 to 100 using the --max_words argument. You could also create a stopwords.txt file with some custom stopwords, one word per line. These are words you don’t want to appear in the word cloud. Now the command would look like this:
$ wordcloud_cli --text corpus.txt --stopwords stopwords.txt --max_words 100 --width 1000 --height 600 --imagefile wordcloud.png
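For readers who later want to move from the CLI into code, the command above corresponds roughly to the following Python sketch built on the package's `WordCloud` class. Treat it as an approximation under our own assumptions rather than a line-by-line copy of what wordcloud_cli does internally. The helper names `load_stopwords` and `render_cloud` are hypothetical, and the file names are the ones used in this tutorial.

```python
from pathlib import Path


def load_stopwords(path):
    """Read one stopword per line, ignoring blank lines and extra whitespace."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def render_cloud(text_path, stopwords_path, image_path):
    """Rough equivalent of the wordcloud_cli call with the options used above."""
    # Imported here so load_stopwords() stays usable without the package installed.
    from wordcloud import WordCloud

    cloud = WordCloud(
        width=1000,       # --width 1000
        height=600,       # --height 600
        max_words=100,    # --max_words 100
        stopwords=load_stopwords(stopwords_path),  # --stopwords stopwords.txt
    )
    cloud.generate(Path(text_path).read_text(encoding="utf-8"))
    cloud.to_file(image_path)  # --imagefile wordcloud.png
```

Calling `render_cloud("corpus.txt", "stopwords.txt", "wordcloud.png")` from a directory that contains both text files should then produce an image comparable to the one from the command line.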
Finally, if you are working on a remote server you will want to download your word cloud image to be able to see it. There are many ways to do that, such as fetching the file through scp or ftp, or putting the image in your web server’s public directory if you have one. We are going to show you how to do that in the next tutorial. However, there’s one more method that is especially useful for images, so it fits this purpose well. It’s the transfer.sh service, which offers file hosting of up to 10GB for 14 days, more than enough for testing, and lets you upload files easily from the command line.
$ curl --upload-file ./wordcloud.png https://transfer.sh/wordcloud.png
https://transfer.sh/YjUyWp/wordcloud.png
After a successful upload, the service will return the web URL of the wordcloud.png image, which you can then open or download using a web browser.