For my project, I created a word cloud showing how often different words appear in Shakespeare’s Romeo and Juliet. Shakespeare frequently uses repetition for emphasis, and as a Theater and Computer Science double major, I knew I wanted to apply my computer science skills to the most celebrated script in history, to see whether visualizing its repeated words would let me see the play in a more nuanced way. The dataset I used was simply the text of Romeo and Juliet. It needed a little cleaning: I created a txt file of stop words and trimmed the text so that it included only the script, not the publishing or licensing information. Then I wrote a program in Java that counts how often each word appears by sorting the words into a tree. Once the text has been analyzed, the counts are sent to a word cloud maker, which sizes each word according to its frequency, assigns it a random color from a color roster, and outputs an HTML file containing the word cloud. From there, I just had to embed that HTML in my subdomain, and here’s the final product:
Although I cleaned the text file, I still had some trouble getting it to work nicely. For example, each character’s lines in this text are marked by the first three letters of their name. I addressed this by adding the first three letters of most characters’ names to the list of stop words. However, I didn’t want to remove “cap” (Capulet), because it is also a word spoken in the script, and removing it would have altered the data too much. If I could go back and redo the cleaning process, I would find a way to strip the speaker’s name from the start of every line, so that we wouldn’t have a leftover “cap” right at the top, much bigger than it should be. I would also think about removing the stage directions (“exit” and “exeunt” appear only in stage directions and are never spoken in the play). However, there is a lot of debate about whether stage directions should be analyzed alongside the text of a play (I firmly believe that they should be), so I’m not sure what I would do! Aside from that, I am very happy with how my word cloud turned out, and I think it has plenty of potential for analysis. For example, “wife” is one of the more frequent words in this play but “husband” isn’t, which could support an analysis of how a woman’s worth in this play is determined by her marriage, while a man’s worth is determined by much more: “prince,” “sir,” and “man” all show up as frequently used words as well. All in all, I think my word cloud (or, more importantly, the program I wrote to create word clouds) worked really well, and it gave me an intuitive way to visualize the frequency of words in a text.
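For anyone curious about the counting and sizing steps, they could look roughly like the sketch below. This is a minimal illustration, not the actual program: the class name `WordCountSketch`, the method names, the tokenizing rule, and the linear font-size scaling are all assumptions, and a `TreeMap` stands in for whatever tree the real program uses.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical sketch: names and details are illustrative, not from the original program.
class WordCountSketch {

    // Count how often each non-stop word appears. TreeMap is a tree-backed map
    // keyed by word, so the counts come back in alphabetical order.
    static Map<String, Integer> countWords(List<String> lines, Set<String> stopWords) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            // Lowercase, then split on anything that is not a letter or an apostrophe
            // (a simplification; real text may need more careful tokenizing).
            for (String token : line.toLowerCase(Locale.ROOT).split("[^a-z']+")) {
                if (token.isEmpty() || stopWords.contains(token)) {
                    continue;
                }
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Scale a count linearly into a font size between minPx and maxPx (assumed scheme).
    static int fontSize(int count, int maxCount, int minPx, int maxPx) {
        if (maxCount <= 1) {
            return minPx;
        }
        double share = (count - 1) / (double) (maxCount - 1);
        return (int) Math.round(minPx + share * (maxPx - minPx));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "O Romeo, Romeo! wherefore art thou Romeo?",
            "Deny thy father and refuse thy name;");
        Set<String> stopWords = Set.of("o", "and", "thou", "thy", "art");

        Map<String, Integer> counts = countWords(lines, stopWords);
        int max = counts.values().stream().max(Integer::compare).orElse(1);
        counts.forEach((word, n) ->
            System.out.println(word + " " + n + " -> " + fontSize(n, max, 12, 48) + "px"));
    }
}
```

Linear scaling is only one option here; a square-root or logarithmic scale would keep one very frequent word from dwarfing everything else.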
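The cleaning fix described above, stripping the speaker's name from the start of each line instead of treating it as a stop word, might be sketched as follows. It assumes each speech starts with a short capitalized label followed by a period (for example `Cap.`); the exact format of the text file, the class name `SpeakerLabelSketch`, and the set of labels are all assumptions for illustration.

```java
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: the label format and all names here are assumptions.
class SpeakerLabelSketch {

    // Assumed format: optional indent, a capitalized label, a period, then the speech.
    private static final Pattern LABEL = Pattern.compile("^\\s*([A-Z][a-z]+)\\.\\s*");

    // Remove the label only if it opens the line and is a known speaker,
    // so a spoken word like "cap" elsewhere in the dialogue is left alone.
    static String stripSpeakerLabel(String line, Set<String> speakers) {
        Matcher m = LABEL.matcher(line);
        if (m.find() && speakers.contains(m.group(1))) {
            return line.substring(m.end());
        }
        return line;
    }

    public static void main(String[] args) {
        Set<String> speakers = Set.of("Rom", "Jul", "Cap");
        // The label is dropped and the speech is kept.
        System.out.println(stripSpeakerLabel("Cap. My sword, I say!", speakers));
        // A made-up line where "cap" is spoken: it is returned unchanged.
        System.out.println(stripSpeakerLabel("He doffs his cap.", speakers));
    }
}
```

Running each line through `stripSpeakerLabel` before counting would let "cap" come off the stop word list entirely, since only the spoken uses would remain.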