Elaina Boyle Midterm

For my project, I created a word cloud showing how often different words appear in Shakespeare’s Romeo and Juliet. Shakespeare frequently uses repetition for emphasis, and as a Theater and Computer Science double major, I wanted to apply my computer science skills to the most celebrated script in history, to see whether visualizing the frequency of repeated words would let me see the play in a more nuanced way. The dataset I used was simply the text of Romeo and Juliet. I had to do a little cleaning: I created a txt file of stop words and trimmed the text so that it included only the script, not the publishing or licensing information. Then I wrote a program in Java that counts how often each word appears by sorting the words into a tree. After the text has been analyzed, the counts are sent to a word cloud maker, which sizes each word according to how often it appears, assigns it a random color from a color roster, and outputs an HTML file containing the word cloud. From there, I just had to embed that HTML in my subdomain, and here’s the final product:

 tell   much   cap   word   art   again   prince   hath   montague   speak   give   both   two   day   i’ll   exeunt   like   think   night   must   young   house   call   till   let   being   gone   away   make   death   bid   hence   mother   par   enter   dead   marry   romeo   sir   hear   fair   shall   back   capulet   paris   scene   nurse   life   die   mine   father   juliet   dear   wife   hast   heart   time   watch   friar   tis   lord   too   light   hand   find   lady   part   sweet   come   ay   well   up   eyes   heaven   say   man   take   look   know   stand   good   face   see   go   old   very   therefore   now   stay   madam   love   exit   comes   tybalt   god   name   true   wilt   why   bed 
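As a rough illustration of the counting and sizing steps described above (this is not the actual program behind the cloud), here is a minimal Java sketch. It assumes the standard library's `TreeMap` stands in for the tree, that words are split on anything other than letters and apostrophes, and that font size scales linearly with frequency. The class name, method names, sample stop words, and pixel sizes are all made up for the example.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class WordFrequencySketch {

    // Counts how often each non-stop word occurs in the text.
    // TreeMap is a tree-backed map, so the words come out in alphabetical order.
    public static Map<String, Integer> countWords(String text, Set<String> stopWords) {
        Map<String, Integer> counts = new TreeMap<>();
        // Lowercase everything and turn curly apostrophes into straight ones.
        String normalized = text.toLowerCase().replace('\u2019', '\'');
        // Split on runs of characters that are neither letters nor apostrophes.
        for (String token : normalized.split("[^a-z']+")) {
            if (token.isEmpty() || stopWords.contains(token)) {
                continue;
            }
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    // Maps a count onto a font size between minPx and maxPx (assumed linear scale).
    public static int fontSize(int count, int maxCount, int minPx, int maxPx) {
        if (maxCount <= 1) {
            return minPx;
        }
        return minPx + (count - 1) * (maxPx - minPx) / (maxCount - 1);
    }

    public static void main(String[] args) {
        String sample = "O Romeo, Romeo! wherefore art thou Romeo?";
        Set<String> stopWords = Set.of("o", "thou"); // hypothetical stop-word list
        Map<String, Integer> counts = countWords(sample, stopWords);
        int max = Collections.max(counts.values());
        counts.forEach((word, count) ->
                System.out.println(word + " " + count + " " + fontSize(count, max, 12, 48) + "px"));
    }
}
```

In this sketch the most frequent word gets the largest size and a word that appears once gets the smallest; a real word cloud maker might use a different scale and would also handle color and HTML output.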

Though I did clean the text file, I had some trouble getting it to work nicely. For example, each character’s line in this script is marked with the first three letters of their name. I addressed this by adding the first three letters of most characters’ names to the list of stop words. However, I didn’t want to remove “cap” (Capulet), because it is also a word spoken in the script, and removing it would alter the data too much. If I could redo the cleaning process, I would find a way to remove the speaker’s name from the start of every line, so that we wouldn’t have a leftover “cap” right at the top, much bigger than it should be. I would also think about removing the stage directions (“exit” and “exeunt” appear only in the stage directions and are never spoken in the play). However, there is a lot of debate about whether stage directions should be analyzed alongside the text of a play (I thoroughly believe that they should be), so I’m not sure what I would do!

Aside from that, I am very happy with how my word cloud turned out, and I think it has plenty of potential for analysis! For example, the fact that “wife” is one of the more frequent words in this play but “husband” isn’t could be used to analyze how a woman’s worth in this play is determined by her marriage, while a man’s worth is determined by much more: “prince,” “sir,” and “man” all show up as frequently used words as well. All in all, I think my word cloud (or more importantly, the program I wrote to create word clouds) worked really well, and I created an intuitive way to visualize the frequency of words in a text.
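One possible way to do the speaker-name cleanup mentioned above is to strip the abbreviation only when it starts a line, so that a spoken “cap” elsewhere still gets counted. This is just a sketch under assumptions: it guesses that each speech starts with a capitalized abbreviation followed by a period (like `Cap.`), and the class name, method name, and list of abbreviations are hypothetical rather than taken from the actual text file.

```java
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SpeakerPrefixSketch {

    // Hypothetical speaker abbreviations; the real list depends on the edition used.
    static final Set<String> SPEAKERS = Set.of("Rom", "Jul", "Cap", "Par", "Tyb", "Mer", "Ben");

    // Optional indent, a capitalized word, a period, then whitespace, at the start of a line.
    static final Pattern PREFIX = Pattern.compile("^\\s*([A-Z][a-z]+)\\.\\s+");

    // Returns the line without its leading speaker label, or unchanged if it has none.
    public static String stripSpeaker(String line) {
        Matcher m = PREFIX.matcher(line);
        if (m.find() && SPEAKERS.contains(m.group(1))) {
            return line.substring(m.end());
        }
        return line;
    }

    public static void main(String[] args) {
        System.out.println(stripSpeaker("  Cap. What noise is this?"));
        System.out.println(stripSpeaker("Nay. I will not."));
    }
}
```

Checking the matched word against a list of known speakers keeps an ordinary sentence that happens to start a line with a short word and a period from being cut. Running each line through `stripSpeaker` before counting would mean the abbreviations no longer need to be in the stop-word list.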
