Textmining Friends: S1 E3 - The One with the Sentiment Analysis

This Friends textmining effort in R was my Saturday project during a series of very snowy Saturdays we had here in Edmonton in September. It makes heavy use of the tidyverse, Text Mining with R by Julia Silge and David Robinson (which is highly recommended), and piping from the magrittr package (which makes things so much nicer.) If you haven’t read the previous two episodes, they are:

  1. S1 E1, The One with all the Import and Cleanup, and
  2. S1 E2, The One with the Most Frequent Words.

As in Episode 2, you can find a tutorial by Rich Majerus on how to loop with ggplot2 here.

Disclaimer – I do not claim to be an expert in textmining. There may be faster, smarter, or nicer ways to achieve a certain thing. Still, maybe you’ll find something interesting for your own projects - or just some funny tidbit about your favourite show. In this third “episode,” we’ll look at sentiment analyses with different lexicons, and - lo and behold - it may not work too well on a show that makes heavy use of irony and sarcasm. But! It’s still fun.

Isabell Hubert 2018

website | twitter


Prep

We’ll first restore the session and load the dataframe created in Episode 1, then attach the libraries we need:

save.image()                           # snapshot the current workspace
load(".RData")                         # restore the saved session
friends <- readRDS("friends-df.rds")   # the cleaned-up dataframe from S1 E1

library(dplyr)
library(tidyr)
library(readr)
library(stringr)
library(tidytext)
library(magrittr)
library(ggplot2)

Sentiment Analysis

The tidytext package comes with a few lexicons that hold sentiment scores. From Text Mining with R:

The nrc lexicon categorizes words in a binary fashion (“yes”/“no”) into categories of positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust.

The bing lexicon categorizes words in a binary fashion into positive and negative categories.

The AFINN lexicon assigns words with a score that runs between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment.

The loughran lexicon, originally developed for financial documents, appears to use a binary positive/negative system similar to Bing’s.

get_sentiments("afinn")    # numerical score -5 to 5

## # A tibble: 2,476 x 2
##    word       score
##    <chr>      <int>
##  1 abandon       -2
##  2 abandoned     -2
##  3 abandons      -2
##  4 abducted      -2
##  5 abduction     -2
##  6 abductions    -2
##  7 abhor         -3
##  8 abhorred      -3
##  9 abhorrent     -3
## 10 abhors        -3
## # … with 2,466 more rows

get_sentiments("nrc")          # assigned to sentiment categories

## # A tibble: 13,901 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 abacus      trust    
##  2 abandon     fear     
##  3 abandon     negative 
##  4 abandon     sadness  
##  5 abandoned   anger    
##  6 abandoned   fear     
##  7 abandoned   negative 
##  8 abandoned   sadness  
##  9 abandonment anger    
## 10 abandonment fear     
## # … with 13,891 more rows

get_sentiments("bing")     # positive or negative

## # A tibble: 6,788 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faced     negative 
##  2 2-faces     negative 
##  3 a+          positive 
##  4 abnormal    negative 
##  5 abolish     negative 
##  6 abominable  negative 
##  7 abominably  negative 
##  8 abominate   negative 
##  9 abomination negative 
## 10 abort       negative 
## # … with 6,778 more rows

get_sentiments("loughran") # positive or negative

## # A tibble: 4,149 x 2
##    word         sentiment
##    <chr>        <chr>    
##  1 abandon      negative 
##  2 abandoned    negative 
##  3 abandoning   negative 
##  4 abandonment  negative 
##  5 abandonments negative 
##  6 abandons     negative 
##  7 abdicated    negative 
##  8 abdicates    negative 
##  9 abdicating   negative 
## 10 abdication   negative 
## # … with 4,139 more rows
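As a quick aside (not one of the original displays), tabulating each lexicon’s sentiment column shows at a glance which categories it uses and how many words fall into each:

get_sentiments("nrc") %>% count(sentiment, sort = TRUE)
get_sentiments("loughran") %>% count(sentiment, sort = TRUE)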

Most X Words

For example, using the NRC lexicon, we can find the most [sentiment] (joyous, angry, etc.) words by season.

The following function, topXwords.plot, accepts three arguments:

  1. the dataframe you would like to apply the function to;
  2. a text string denoting the sentiment you would like to analyze; and
  3. the number of top words you would like the function to plot (something between 5 and 10 works well; anything larger and things get too crowded.)

Let’s look at the top 5 most joyous words, as an example:

seasons <- 1:10
friendsNames <- c("Monica", "Rachel", "Chandler", "Joey", "Ross", "Phoebe")

topXwords.plot <- function(df, sent.input, top, na.rm = TRUE, ...){

  # words belonging to the requested NRC sentiment category
  # (this does not change per season, so it sits outside the loop)
  sent.cat <- get_sentiments("nrc") %>%
    filter(sentiment == sent.input)

  # create one plot per season
  for (i in seq_along(seasons)) {

    topXseason <- df %>%
      filter(season == seasons[i]) %>%
      inner_join(sent.cat) %>%
      count(word, sort = TRUE) %>%
      top_n(top)

    topXplot <- ggplot(topXseason, aes(reorder(word, n), n)) +
      geom_col(show.legend = FALSE, aes(fill = factor(n))) +
      labs(x = NULL, y = "Count",
           title = paste("Top", top, sent.input, "Words - Season", seasons[i])) +
      coord_flip()

    print(topXplot)
  }
}

topXwords.plot(friends, "joy", 5)

## Joining, by = "word"
## Selecting by n

(These two messages repeat once per season.)

We can already see that “god” - a word the characters often use as an exclamation when they are exasperated or frustrated - loads heavily on joy. This is a general problem with sentiment analysis (as is the fact that it cannot grasp satire or irony.)

Here are the top 5 words for the remaining NRC sentiments (at some point, this will become a nested loop… that point is not now):

topXwords.plot(friends, "trust", 5)
topXwords.plot(friends, "fear", 5)
topXwords.plot(friends, "negative", 5)
topXwords.plot(friends, "sadness", 5)
topXwords.plot(friends, "anger", 5)
topXwords.plot(friends, "surprise", 5)
topXwords.plot(friends, "positive", 5)
topXwords.plot(friends, "disgust", 5)
topXwords.plot(friends, "joy", 5)
topXwords.plot(friends, "anticipation", 5)

This is where we get more clues that sentiment analysis may not be the greatest tool in this context. For example: “God” loads heavily on fear, anticipation, joy, and positive; “wait” is negative; and “mother” is sad.
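If you’d rather see these mappings directly than read them off the plots, a quick lookup against the NRC lexicon does the trick:

get_sentiments("nrc") %>%
  filter(word %in% c("god", "wait", "mother"))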

Let’s move on to plus/minus.

Plus/Minus

How does the sentiment change across seasons and episodes?

Let’s first look at the entire show, using the Bing lexicon:

friends %>%
  inner_join(get_sentiments("bing")) %>%
  count(season, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative) %>%
ggplot(aes(season, sentiment)) +
  geom_col(show.legend = FALSE, alpha = 0.8, aes(fill = sentiment > 0)) +
    geom_smooth(method = "loess", color = "white", se = FALSE) +
    labs(title = "+/- across the Show", x = NULL, y = "+/-") +
    scale_x_continuous(breaks = seasons) +
    theme(legend.position = "none")

## Joining, by = "word"

Season 6 appears to be the “darkest,” and Season 10 the most positive. Given that Friends is a comedy show, the fact that the average sentiment by season is overwhelmingly negative supports the notion that sentiment analysis on a word level may not be too useful here.
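To put numbers on the “darkest season” impression, we can reuse the exact pipeline from above and simply arrange the seasons by their net score instead of plotting them - a quick check, not part of the original analysis:

friends %>%
  inner_join(get_sentiments("bing")) %>%
  count(season, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative) %>%
  arrange(sentiment)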

Let’s look at the +/- by episodes:

# create graphing function
plusm.s.viz <- function(df, na.rm = TRUE, ...){

  # create one plot per season
  for (i in seq_along(seasons)) {

    plot <- ggplot(subset(df, df$season == seasons[i]), aes(episode, sentiment)) +
      geom_col(aes(fill = sentiment > 0), show.legend = FALSE, alpha = 0.8) +
      geom_smooth(method = "loess", color = "white", se = FALSE) +
      labs(title = paste("Plus/Minus Sentiment by Season - Season", seasons[i]),
           x = NULL, y = "+/-") +
      coord_cartesian(ylim = c(-60, 40)) +
      theme(legend.position = "none")

    print(plot)
  }
}

# run graphing function
friends %>%
  inner_join(get_sentiments("bing")) %>%
  count(season, episode, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative) %>%
    plusm.s.viz

## Joining, by = "word"

Most episodes have a net negative sentiment; Season 10 has the most net positive episodes. Seasons 2 and 7 develop to be more positive towards the end.
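A rough way to quantify the “more positive towards the end” impression is to correlate episode number with the +/- score within each season - a sketch, with the caveat that a simple correlation only captures linear trends:

friends %>%
  inner_join(get_sentiments("bing")) %>%
  count(season, episode, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative) %>%
  group_by(season) %>%
  summarize(trend = cor(episode, sentiment))

Seasons with a clearly positive trend value are the ones whose +/- climbs as the season progresses.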

Outlier Episodes

There are two strong “Debbie Downer” episodes, one in Season 3, and the other in 6 - what are they?

friends %>%
  inner_join(get_sentiments("bing")) %>%
  count(season, episode, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative) %>%
    arrange(sentiment) %>%
    filter(season == 3 | season == 6) %>%
    top_n(-2)

## Joining, by = "word"

## Selecting by sentiment

## # A tibble: 2 x 5
##   season episode negative positive sentiment
##    <int>   <int>    <dbl>    <dbl>     <dbl>
## 1      6       4       86       29       -57
## 2      3      14       91       35       -56

The Season 3 episode is S3 E14, “The One with Phoebe’s Ex-Partner”:

Phoebe is reunited with her former singing partner who thinks “Smelly Cat” has hit potential; Chandler dates a woman with a prosthetic leg.

and

friends %>%
  inner_join(get_sentiments("bing")) %>%
  count(season, episode, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative) %>%
    arrange(sentiment) %>%
    filter(season == 6) %>%
    top_n(-2)

## Joining, by = "word"

## Selecting by sentiment

## # A tibble: 2 x 5
##   season episode negative positive sentiment
##    <int>   <int>    <dbl>    <dbl>     <dbl>
## 1      6       4       86       29       -57
## 2      6      11       81       42       -39

Season 6, Episode 4 is “The One Where Joey Loses His Insurance”:

Joey’s health benefits lapse just as he develops a hernia. Ross is given a professorship at NYU and, eager to impress his students, he puts on a fake English accent.

Which words make these episodes so “down”?

friends %>%
    filter(season == 3) %>%
    filter(episode == 14) %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup() %>%
  group_by(sentiment) %>%
  top_n(4) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment - Downer Episode S3E14",
       x = NULL) +
  coord_flip()

## Joining, by = "word"

## Selecting by n

Checking the transcript, it seems that Rachel uses “blah” 14 times in a row to mock Ross. And then, of course, “Smelly Cat” is being sung a whole lot…
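We can verify the “blah” barrage directly by counting who says it in that episode (assuming, as in the rest of this series, one row per word in the friends dataframe):

friends %>%
  filter(season == 3, episode == 14, word == "blah") %>%
  count(Friend, sort = TRUE)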

friends %>%
    filter(season == 6) %>%
    filter(episode == 4) %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup() %>%
  group_by(sentiment) %>%
  top_n(4) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment - Downer Episode S6E4",
       x = NULL) +
  coord_flip()

## Joining, by = "word"

## Selecting by n

“Wrong,” “die,” and “hell.” This makes sense, as Joey is bothered by a hernia.

What about the happiest episode in the entire show, in Season 9?

friends %>%
  inner_join(get_sentiments("bing")) %>%
  count(season, episode, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative) %>%
    filter(season == 9) %>%
    top_n(1)

## Joining, by = "word"

## Selecting by sentiment

## # A tibble: 1 x 5
##   season episode negative positive sentiment
##    <int>   <int>    <dbl>    <dbl>     <dbl>
## 1      9      18       36       79        43

It is Season 9, Episode 18, “The One with the Lottery”:

When everyone except Ross pools together to buy tickets for the lottery, Phoebe gets a prediction from her psychic that they will win. Ross, despite his skepticism, gets in on the action.

So, if we were to attempt an educated guess, this is probably about winning…?

friends %>%
    filter(season == 9) %>%
    filter(episode == 18) %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup() %>%
  group_by(sentiment) %>%
  top_n(4) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment - The Happiest Episode",
       x = NULL) +
  coord_flip()

## Joining, by = "word"

## Selecting by n

Indeed it is! Lots of positive words, barely any negative contributions.

There’s also a very happy episode in Season 4:

friends %>%
  inner_join(get_sentiments("bing")) %>%
  count(season, episode, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative) %>%
    filter(season == 4) %>%
    top_n(1)

## Joining, by = "word"

## Selecting by sentiment

## # A tibble: 1 x 5
##   season episode negative positive sentiment
##    <int>   <int>    <dbl>    <dbl>     <dbl>
## 1      4      17       33       68        35

It is Season 4, Episode 17: “The One with the Free Porn.”

Well.

friends %>%
    filter(season == 4) %>%
    filter(episode == 17) %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup() %>%
  group_by(sentiment) %>%
  top_n(3) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment - Super Happy S4E17",
       x = NULL) +
  coord_flip()

## Joining, by = "word"

## Selecting by n

“Love” and “free” contribute most to the positive sentiment. Followed by “whoa.” This did not disappoint.

Over Time

We can compute a “positivity score” by Friend on the episode level:

s.pm.updown <- friends %>%
    filter(Friend %in% friendsNames) %>%
  inner_join(get_sentiments("bing")) %>%
  count(season, episode, Friend, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)

## Joining, by = "word"

# create graphing function
loopPM.seasonF <- function(df, na.rm = TRUE, ...){

  # create one plot per Friend
  for (i in seq_along(friendsNames)) {

    plot <- ggplot(subset(df, df$Friend == friendsNames[i]), aes(episode, sentiment)) +
      geom_col(show.legend = FALSE, alpha = 0.8, aes(fill = sentiment > 0)) +
      geom_smooth(method = "loess", color = "white", se = FALSE) +
      facet_wrap(~season) +
      labs(title = paste("+/- by Season -", friendsNames[i]), x = NULL, y = "+/-") +
      theme(legend.position = "none") +
      coord_cartesian(ylim = c(-20, 10))

    print(plot)
  }
}

# run graphing function
loopPM.seasonF(s.pm.updown)

This might be too granular to derive any overall trends.

Let’s compare our Friends across the seasons:

friends %>%
    filter(Friend %in% friendsNames) %>%
  inner_join(get_sentiments("bing")) %>%
  count(season, Friend, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative) %>%

ggplot(aes(season, sentiment)) +
  geom_col(show.legend = FALSE, alpha = 0.8, aes(fill = sentiment > 0)) +
    geom_smooth(method = "loess", color = "white", se = FALSE) +
  facet_wrap(~Friend, scales = "free_x") +
    labs(title = "+/- by Season & Friend", x = NULL, y = "+/-") +
    scale_x_continuous(breaks = seasons)

## Joining, by = "word"

Our Friends predominantly have a negative average sentiment by season, with notable exceptions being:

  • Rachel’s Seasons 7 and 10 - she’s the only Friend heavily trending positive towards the end of the show;
  • Ross’s Season 4; and
  • Joey’s Season 7.

A few more observations:

  • Chandler is by far the most negative (see the quick check after this list), although he becomes less negative over time, trending in the same direction as Phoebe and Rachel.
  • Monica and Ross move in the opposite direction - Monica especially becomes A LOT more negative.
  • Joey is the most stable of all Friends with regard to his +/-.
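To back the “most negative” claim with a number, here is a quick sketch (not part of the original analysis) that totals the +/- per Friend across the whole show, reusing the s.pm.updown dataframe from above:

s.pm.updown %>%
  group_by(Friend) %>%
  summarize(total.pm = sum(sentiment)) %>%
  arrange(total.pm)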

The fact that one of the funniest shows in the history of humankind (if I may say so myself) generally has a negative average sentiment by Friend and season suggests that, as alluded to previously, a word-level sentiment analysis really may not be the right tool for this. You can still use it to hone your data viz skills though. :)

What contributed to the outlier seasons?

friends %>%
    filter(Friend == "Rachel") %>%
    filter(season == 3) %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, season, sort = TRUE) %>%
  ungroup() %>%
  group_by(sentiment) %>%
    top_n(5) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_grid(sentiment ~ season, scales = "free_y") +
  labs(y = "Contribution to sentiment",
       x = NULL) +
  coord_flip()

## Joining, by = "word"

## Selecting by n

“Break” and “blah”, plus some other words, contribute to Rachel’s negative sentiment in Season 3. Which is when Ross and Rachel were on a break, so…

friends %>%
    filter(Friend == "Rachel") %>%
    filter(season == 7) %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, season, sort = TRUE) %>%
  ungroup() %>%
  group_by(sentiment) %>%
    top_n(5) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_grid(sentiment ~ season, scales = "free_y") +
  labs(y = "Contribution to sentiment",
       x = NULL) +
  coord_flip()

## Joining, by = "word"

## Selecting by n

friends %>%
    filter(Friend == "Rachel") %>%
    filter(season == 10) %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, season, sort = TRUE) %>%
  ungroup() %>%
  group_by(sentiment) %>%
    top_n(5) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_grid(sentiment ~ season, scales = "free_y") +
  labs(y = "Contribution to sentiment",
       x = NULL) +
  coord_flip()

## Joining, by = "word"
## Selecting by n

Rachel’s Seasons 7 and 10 were positive due to “love”, “wow”, “fine”, “happy”, and “fun.”

What made Ross’s Season 4 so stellar?

friends %>%
    filter(Friend == "Ross") %>%
    filter(season == 4) %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, season, sort = TRUE) %>%
  ungroup() %>%
  group_by(sentiment) %>%
    top_n(5) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_grid(sentiment ~ season, scales = "free_y") +
  labs(y = "Contribution to sentiment",
       x = NULL) +
  coord_flip()

## Joining, by = "word"

## Selecting by n

“Fine”, “love”, and “fun.” Which is interesting, as he doesn’t actually use “fine” in a positive context when he goes “I’m fiiiiiineeeeee”, but be that as it may.

And Joey’s season 7?

friends %>%
    filter(Friend == "Joey") %>%
    filter(season == 7) %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, season, sort = TRUE) %>%
  ungroup() %>%
  group_by(sentiment) %>%
    top_n(5) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_grid(sentiment ~ season, scales = "free_y") +
  labs(y = "Contribution to sentiment",
       x = NULL) +
  coord_flip()

## Joining, by = "word"

## Selecting by n

“Wow” and “love.”

What makes Monica’s sentiment drop so much towards the end of the show?

friends %>%
    filter(Friend == "Monica") %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, season, sort = TRUE) %>%
  ungroup() %>%
  group_by(sentiment) %>%
    top_n(25) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_grid(sentiment ~ season, scales = "free_y") +
  labs(y = "Contribution to sentiment",
       x = NULL) +
  coord_flip()

## Joining, by = "word"

## Selecting by n

She starts using “sick”, “hard”, and “bad” quite a lot starting in Season 6.
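To check that impression against the raw counts, we can tabulate those three words per season for Monica (a quick check; the words were picked off the plot above):

friends %>%
  filter(Friend == "Monica", word %in% c("sick", "hard", "bad")) %>%
  count(season, word) %>%
  spread(word, n, fill = 0)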

Contributions by Season

Which words contributed most to the plus/minus values in the different seasons?

s.pm <- friends %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, season, sort = TRUE) %>%
  group_by(sentiment, season) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n))

## Joining, by = "word"

## Selecting by n

s.pm$sentiment <- as.factor(s.pm$sentiment)

# create graphing function
loopPM.season <- function(df, na.rm = TRUE, ...){

  # create one plot per season
  for (i in seq_along(seasons)) {

    plot <- ggplot(subset(df, df$season == seasons[i]),
                   aes(reorder(word, n), n, fill = sentiment)) +
      geom_col(show.legend = FALSE) +
      facet_wrap(~ sentiment, scales = "free_y") +
      labs(y = paste("Contribution to sentiments - Season", seasons[i]), x = NULL) +
      coord_flip()

    print(plot)
  }
}

# run graphing function
loopPM.season(s.pm)

Sentiment Categories

Now let’s look at the NRC sentiment categories in detail - the ones for which we developed a function earlier that would retrieve the top n words for us in each season.

Let’s get a very general overview of how these categories do over time:

friends %>%
  inner_join(get_sentiments("nrc")) %>%
  count(season, episode, sentiment) %>%
ggplot(aes(episode, n, color = sentiment, fill = sentiment)) +
  geom_point(show.legend = FALSE, alpha = 0.3) +
    geom_smooth() +
  facet_grid(sentiment~season, scales = "free_x")

## Joining, by = "word"

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

This isn’t too informative; everything is very small and cramped. Let’s loop through the ten seasons to get a better look.

sentCatSeason <- friends %>%
  inner_join(get_sentiments("nrc")) %>%
  count(season, episode, sentiment)

## Joining, by = "word"

# create graphing function
loopSentCats.season <- function(df, na.rm = TRUE, ...){

  # create one plot per season
  for (i in seq_along(seasons)) {

    plot <- ggplot(subset(df, df$season == seasons[i]),
                   aes(episode, n, color = sentiment, fill = sentiment)) +
      geom_point(show.legend = FALSE, alpha = 0.3) +
      geom_smooth(method = "loess", se = FALSE) +
      labs(title = paste("NRC Sentiment Categories across Episodes - Season", seasons[i]))

    print(plot)
  }
}

# run graphing function
loopSentCats.season(sentCatSeason)

Even though the bottom areas of these plots are fairly crowded, we can see that the sentiment in each season is mostly positive, with either negative (…) or anticipation following closely behind. More often than not, positivity rises towards the end of a season (see especially Seasons 2, 5, and 6.)

Instead of looping through seasons, let’s loop through the different sentiment categories:

# create graphing function
loopSentCats.sent <- function(df, na.rm = TRUE, ...){

  # character vector of NRC categories to loop along
  sentimentCats <- c("anger", "anticipation", "disgust", "fear", "joy",
                     "negative", "positive", "sadness", "surprise", "trust")

  # create one plot per sentiment
  for (i in seq_along(sentimentCats)) {

    plot <- ggplot(subset(df, df$sentiment == sentimentCats[i]), aes(season, n)) +
      geom_violin(show.legend = FALSE, alpha = 0.6,
                  aes(group = season, fill = factor(season)), color = "white") +
      geom_smooth(color = "white") +
      labs(title = paste("The", sentimentCats[i], "Sentiment across Seasons")) +
      scale_x_continuous(breaks = seasons) +
      coord_cartesian(ylim = c(10, 150)) +
      theme(legend.position = "none")

    print(plot)
  }
}

# run graphing function
loopSentCats.sent(sentCatSeason)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

(This message repeats once per sentiment category.)

As we plotted all sentiments on the same y-axis scale, we can visually determine that the positive, anticipation, joy, and trust sentiments dominate the show. They all rise towards the end of the show as well. These are also the categories with the largest variance in the show, as we can see from the height of the violin plots. Compare, for example, their tall, skinny violins to the short, stubby ones for disgust, surprise, or anger.
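If you’d rather see the violin heights as numbers, a quick sketch computes the spread of per-episode counts for each category (standard deviation here, as one reasonable choice among several):

sentCatSeason %>%
  group_by(sentiment) %>%
  summarize(spread = sd(n)) %>%
  arrange(desc(spread))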

Again, we see that the negative sentiment is also well represented, which is confusing (at best) given that Friends is a highly comedic show.

How do our Friends compare in terms of the sentiment categories?

friends %>%
  inner_join(get_sentiments("nrc")) %>%
    filter(Friend %in% friendsNames) %>%
  count(Friend, sentiment) %>%
ggplot(aes(Friend, n, color = Friend, fill = Friend)) +
  geom_col(show.legend = TRUE, alpha = 0.8) +
    facet_wrap(~sentiment) +
    labs(y = "% of Words", x = NULL) +
    theme(axis.text.x = element_blank(), axis.ticks = element_blank(), legend.position = "none")

## Joining, by = "word"

Note that these are absolute counts - Rachel utters the most positive, trusting, and anticipatory words - but that may be simply because she is a main character.

Let’s look at the categories as percentages of all words uttered by a character:

friends %>%
  inner_join(get_sentiments("nrc")) %>%
    filter(Friend %in% friendsNames) %>%
  count(Friend, sentiment) %>%
    group_by(Friend) %>%
    mutate(totalWords = sum(n), sentPerc = (n/totalWords)*100) %>%
ggplot(aes(Friend, sentPerc, color = Friend, fill = Friend)) +
  geom_col(show.legend = TRUE, alpha = 0.8) +
    labs(y = "% of Words", x = NULL) +
    facet_wrap(~sentiment) +
    theme(axis.text.x = element_blank(), axis.ticks = element_blank(), legend.position = "none")

## Joining, by = "word"

This is much flatter; there aren’t really any outliers.

Monica and Rachel appear to be the most anticipatory (?). Joey is the most surprised - and the most angry and negative. (Again, sentiment analysis here cannot make sense of irony or sarcasm, or of how a word is used in context.)

Overall, our Friends are very positive and high in anticipation (but also negative… see above).

How do those category scores develop over the seasons for each of them?

Subsetting data:

friendSent <- friends %>%
  inner_join(get_sentiments("nrc")) %>%
  count(Friend, season, sentiment) %>%
  group_by(Friend) %>%
  # totalWords is each Friend's sentiment-word total across all seasons,
  # so sentPerc expresses each season/sentiment cell relative to that total
  mutate(totalWords = sum(n), sentPerc = (n / totalWords) * 100)

## Joining, by = "word"

# create graphing function
friendsDevel <- function(df, na.rm = TRUE, ...){

  # create one plot per Friend
  for (i in seq_along(friendsNames)) {

    plot <- ggplot(subset(df, df$Friend == friendsNames[i]),
                   aes(season, sentPerc, color = sentiment, fill = sentiment)) +
      geom_point(show.legend = FALSE, alpha = 0.4) +
      geom_smooth(method = "loess") +
      facet_wrap(~sentiment) +
      labs(title = friendsNames[i], y = "% of Words", x = "Season") +
      theme(legend.position = "none") +
      scale_x_continuous(breaks = seasons)

    print(plot)
  }
}

# run graphing function
friendsDevel(friendSent)

Interestingly, flipping between Ross’s and Chandler’s plots, the two seem to be the “opposite” of each other - look at the positive and anticipatory sentiments, or trust. When Chandler is positive, Ross isn’t, and vice versa. This could suggest that the writers may have aimed for an equilibrium, so that episodes would be well-rounded. (Keep in mind that this whole sentiment analysis is sort of wonky and this is a somewhat wild extrapolation to make.)
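One way to probe the “opposites” impression - an exploratory sketch rather than a proper test - is to correlate Ross’s and Chandler’s positive percentages across seasons; a clearly negative correlation would be consistent with the mirror-image reading:

friendSent %>%
  ungroup() %>%
  filter(Friend %in% c("Chandler", "Ross"), sentiment == "positive") %>%
  select(Friend, season, sentPerc) %>%
  spread(Friend, sentPerc) %>%
  summarize(correlation = cor(Chandler, Ross))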

Stay tuned for Season 1 Episode 4, The One with the TF-IDF!

Isabell Hubert Lyall
PhD Candidate in Experimental Psycholinguistics | Vice-Chair ETS Advisory Board

PhD Candidate researching the influence of extra-linguistic information on language comprehension; affiliated with the Centre for Comparative Psycholinguistics at the University of Alberta. Also the Vice-Chair of the Edmonton Transit Service Advisory Board (ETSAB), which advises City Council on transit-related matters.
