Textmining Friends: S1 E5 - The One with Valence, Arousal, and Dominance [Work in Progress]

This Friends textmining effort in R was my Saturday project during a string of very snowy Saturdays we had here in Edmonton in September. It makes heavy use of the tidyverse, of Text Mining with R by Julia Silge and David Robinson (which is highly recommended), and of piping from the magrittr package (which makes things so much nicer). If you haven’t read the previous episodes, they are:

  1. S1 E1, The One with all the Import and Cleanup,
  2. S1 E2, The One with the Most Frequent Words,
  3. S1 E3, The One with the Sentiment Analysis, and
  4. S1 E4, The One with the TF-IDF’s.

The valence, arousal, and dominance values that we’ll use below (mean ratings on 1-9 scales, where higher means more positive, more arousing, or more dominant) were taken from Warriner, A.B., Kuperman, V., & Brysbaert, M. (2013). Norms of valence, arousal, and dominance for 13,915 English lemmas. Behavior Research Methods, 45, 1191-1207.

You can find a tutorial by Rich Majerus on how to loop with ggplot2 here.

NOTE: This “episode” is by far the most experimental (and not only because things kind of fall apart at the end). This piece represents the first time I’ve ever worked with VAD scores, and it is still a work in progress.

Also keep in mind that there may be faster, smarter, or nicer ways to achieve a certain thing. Still, maybe you’ll find something interesting for your own projects - or just some funny tidbit about your favourite show. In this fifth “episode,” we’ll score the Friends’ words along the valence, arousal, and dominance dimensions - and, lo and behold, it may not work too well on a show that makes heavy use of irony and sarcasm. But! It’s still fun.

Isabell Hubert 2018 www.isabellhubert.com


Prep

We’ll load the following libraries:

library(dplyr)
library(tidyr)
library(readr)
library(stringr)
library(tidytext)
library(magrittr)
library(ggplot2)

And define some useful character vectors:

friendsNames <- c("Monica", "Rachel", "Chandler", "Joey", "Ross", "Phoebe")
friendsExtended <- c("Monica", "Rachel", "Chandler", "Joey", "Ross", "Phoebe", "Janice", "Gunther")
seasons <- 1:10

Import

Then we import the data we’ll need:

friends <- readRDS("FriendsMining-df.rds")
head(friends)

##     Friend season episode lineNum  word
## 1     Joey      1       1       3 gotta
## 2     Joey      1       1       3 wrong
## 3 Chandler      1       1       4  joey
## 4 Chandler      1       1       4  nice
## 5 Chandler      1       1       4  hump
## 6 Chandler      1       1       4  hump

str(friends)

## 'data.frame':    151445 obs. of  5 variables:
##  $ Friend : chr  "Joey" "Joey" "Chandler" "Chandler" ...
##  $ season : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ episode: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ lineNum: int  3 3 4 4 4 4 4 5 5 5 ...
##  $ word   : chr  "gotta" "wrong" "joey" "nice" ...

tokens <- readRDS("FriendsMining-tokens.rds")
head(tokens)

##   Friend season episode lineNum    word
## 1 Monica      1       1       2 there's
## 2 Monica      1       1       2 nothing
## 3 Monica      1       1       2      to
## 4 Monica      1       1       2    tell
## 5 Monica      1       1       2    he's
## 6 Monica      1       1       2    just

str(tokens)

## 'data.frame':    634211 obs. of  5 variables:
##  $ Friend : chr  "Monica" "Monica" "Monica" "Monica" ...
##  $ season : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ episode: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ lineNum: int  2 2 2 2 2 2 2 2 2 2 ...
##  $ word   : chr  "there's" "nothing" "to" "tell" ...

vad.imp <- read_csv("Ratings_Warriner_et_al.csv")

## Warning: Missing column names filled in: 'X1' [1]

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   Word = col_character()
## )

## See spec(...) for full column specifications.

head(vad.imp)

## # A tibble: 6 x 65
##      X1 Word  V.Mean.Sum V.SD.Sum V.Rat.Sum A.Mean.Sum A.SD.Sum A.Rat.Sum
##   <dbl> <chr>      <dbl>    <dbl>     <dbl>      <dbl>    <dbl>     <dbl>
## 1     1 aard…       6.26     2.21        19       2.41     1.4         22
## 2     2 abal…       5.3      1.59        20       2.65     1.9         20
## 3     3 aban…       2.84     1.54        19       3.73     2.43        22
## 4     4 aban…       2.63     1.74        19       4.95     2.64        21
## 5     5 abbey       5.85     1.69        20       2.2      1.7         20
## 6     6 abdo…       5.43     1.75        21       3.68     2.23        22
## # … with 57 more variables: D.Mean.Sum <dbl>, D.SD.Sum <dbl>,
## #   D.Rat.Sum <dbl>, V.Mean.M <dbl>, V.SD.M <dbl>, V.Rat.M <dbl>,
## #   V.Mean.F <dbl>, V.SD.F <dbl>, V.Rat.F <dbl>, A.Mean.M <dbl>,
## #   A.SD.M <dbl>, A.Rat.M <dbl>, A.Mean.F <dbl>, A.SD.F <dbl>,
## #   A.Rat.F <dbl>, D.Mean.M <dbl>, D.SD.M <dbl>, D.Rat.M <dbl>,
## #   D.Mean.F <dbl>, D.SD.F <dbl>, D.Rat.F <dbl>, V.Mean.Y <dbl>,
## #   V.SD.Y <dbl>, V.Rat.Y <dbl>, V.Mean.O <dbl>, V.SD.O <dbl>,
## #   V.Rat.O <dbl>, A.Mean.Y <dbl>, A.SD.Y <dbl>, A.Rat.Y <dbl>,
## #   A.Mean.O <dbl>, A.SD.O <dbl>, A.Rat.O <dbl>, D.Mean.Y <dbl>,
## #   D.SD.Y <dbl>, D.Rat.Y <dbl>, D.Mean.O <dbl>, D.SD.O <dbl>,
## #   D.Rat.O <dbl>, V.Mean.L <dbl>, V.SD.L <dbl>, V.Rat.L <dbl>,
## #   V.Mean.H <dbl>, V.SD.H <dbl>, V.Rat.H <dbl>, A.Mean.L <dbl>,
## #   A.SD.L <dbl>, A.Rat.L <dbl>, A.Mean.H <dbl>, A.SD.H <dbl>,
## #   A.Rat.H <dbl>, D.Mean.L <dbl>, D.SD.L <dbl>, D.Rat.L <dbl>,
## #   D.Mean.H <dbl>, D.SD.H <dbl>, D.Rat.H <dbl>

str(vad.imp)

## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 13915 obs. of  65 variables:
##  $ X1        : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ Word      : chr  "aardvark" "abalone" "abandon" "abandonment" ...
##  $ V.Mean.Sum: num  6.26 5.3 2.84 2.63 5.85 5.43 4.48 2.42 2.05 5.52 ...
##  $ V.SD.Sum  : num  2.21 1.59 1.54 1.74 1.69 1.75 1.59 1.61 1.31 1.75 ...
##  $ V.Rat.Sum : num  19 20 19 19 20 21 23 19 19 21 ...
##  $ A.Mean.Sum: num  2.41 2.65 3.73 4.95 2.2 3.68 3.5 5.9 5.33 3.26 ...
##  $ A.SD.Sum  : num  1.4 1.9 2.43 2.64 1.7 2.23 1.82 2.57 2.2 2.22 ...
##  $ A.Rat.Sum : num  22 20 22 21 20 22 22 20 21 23 ...
##  $ D.Mean.Sum: num  4.27 4.95 3.32 2.64 5 5.15 5.32 2.75 3.02 5.33 ...
##  $ D.SD.Sum  : num  1.75 1.79 2.5 1.81 2.02 1.94 2.11 2.13 2.42 2.83 ...
##  $ D.Rat.Sum : num  15 22 22 28 25 27 19 24 48 18 ...
##  $ V.Mean.M  : num  6.18 5.71 2.45 2.17 4.75 5.3 4.36 2.38 2.67 5 ...
##  $ V.SD.M    : num  1.66 1.38 1.29 1.6 1.71 1.64 2.25 1.71 1.97 1.87 ...
##  $ V.Rat.M   : num  11 7 11 6 4 10 11 13 6 9 ...
##  $ V.Mean.F  : num  6 5.08 3.57 2.85 6.12 5.55 4.58 2.5 1.77 5.92 ...
##  $ V.SD.F    : num  2.94 1.71 1.81 1.82 1.63 1.92 0.67 1.52 0.83 1.62 ...
##  $ V.Rat.F   : num  7 13 7 13 16 11 12 6 13 12 ...
##  $ A.Mean.M  : num  3 2.56 4.5 4.3 4 3 2.6 5 5.25 4.45 ...
##  $ A.SD.M    : num  1.41 1.94 2.83 2.5 1.83 2 2.07 2.74 2.43 2.3 ...
##  $ A.Rat.M   : num  8 9 8 10 4 7 5 5 8 11 ...
##  $ A.Mean.F  : num  2.07 2.73 3.29 5.55 1.75 4 3.88 6.2 5.38 2.17 ...
##  $ A.SD.F    : num  1.33 1.95 2.16 2.73 1.39 2.33 1.71 2.54 2.14 1.53 ...
##  $ A.Rat.F   : num  14 11 14 11 16 15 16 15 13 12 ...
##  $ D.Mean.M  : num  4 5.33 4 3.09 5.56 4.76 5.83 3.08 3.47 5.6 ...
##  $ D.SD.M    : num  1.58 2.73 3.1 2.17 2.24 2.17 2.04 1.66 2.4 2.55 ...
##  $ D.Rat.M   : num  5 6 6 11 9 17 6 13 17 10 ...
##  $ D.Mean.F  : num  4.4 4.81 3.06 2.35 4.69 5.8 5.08 2.36 2.77 5 ...
##  $ D.SD.F    : num  1.9 1.38 2.29 1.54 1.89 1.32 2.18 2.62 2.43 3.3 ...
##  $ D.Rat.F   : num  10 16 16 17 16 10 13 11 31 8 ...
##  $ V.Mean.Y  : num  6.12 5 2.75 3.27 5.88 5.64 4.8 3.33 2.5 4.29 ...
##  $ V.SD.Y    : num  2.03 1.49 1.28 1.95 1.55 2.16 1.66 1.87 1.51 2.06 ...
##  $ V.Rat.Y   : num  8 10 8 11 8 11 15 9 10 7 ...
##  $ V.Mean.O  : num  6.36 5.78 2.91 1.75 5.83 5.2 3.88 1.6 1.56 6.23 ...
##  $ V.SD.O    : num  2.42 1.72 1.76 0.89 1.85 1.23 1.36 0.7 0.88 1.24 ...
##  $ V.Rat.O   : num  11 9 11 8 12 10 8 10 9 13 ...
##  $ A.Mean.Y  : num  2.56 2.75 3.78 4.55 2.44 3.4 3.75 6 5.78 4.09 ...
##  $ A.SD.Y    : num  1.74 1.91 2.28 2.66 1.67 3.36 1.91 2.65 1.3 2.3 ...
##  $ A.Rat.Y   : num  9 8 9 11 9 5 12 7 9 11 ...
##  $ A.Mean.O  : num  2.31 2.58 3.69 5.4 2 3.76 3.33 5.85 5 2.5 ...
##  $ A.SD.O    : num  1.18 1.98 2.63 2.67 1.79 1.92 1.8 2.64 2.7 1.93 ...
##  $ A.Rat.O   : num  13 12 13 10 11 17 9 13 12 12 ...
##  $ D.Mean.Y  : num  3.62 5.13 3.7 2.78 4.93 5.07 5 3 2.58 5.29 ...
##  $ D.SD.Y    : num  2 1.77 2.58 1.83 1.98 2.2 1 2.45 1.96 3.3 ...
##  $ D.Rat.Y   : num  8 15 10 18 14 14 9 11 26 7 ...
##  $ D.Mean.O  : num  5 4.57 3 2.4 5.09 5.23 5.6 2.54 3.55 5.36 ...
##  $ D.SD.O    : num  1.15 1.9 2.49 1.84 2.17 1.69 2.8 1.9 2.82 2.66 ...
##  $ D.Rat.O   : num  7 7 12 10 11 13 10 13 22 11 ...
##  $ V.Mean.L  : num  6.5 4.73 2.92 2.92 5.36 5.69 4.92 2.22 1.75 5.38 ...
##  $ V.SD.L    : num  2.5 1.56 1.88 1.61 1.21 2.02 1.51 1.92 0.89 2.1 ...
##  $ V.Rat.L   : num  12 11 12 13 11 13 12 9 8 13 ...
##  $ V.Mean.H  : num  5.86 6 2.71 2 6.44 5 4 2.6 2.27 5.75 ...
##  $ V.SD.H    : num  1.68 1.41 0.76 2 2.07 1.2 1.61 1.35 1.56 1.04 ...
##  $ V.Rat.H   : num  7 9 7 6 9 8 11 10 11 8 ...
##  $ A.Mean.L  : num  2.27 2.83 3.64 4.79 1.78 3.1 3.17 5.4 5.7 2.82 ...
##  $ A.SD.L    : num  1.56 1.95 2.8 2.72 1.39 2.73 1.99 2.67 1.49 2.04 ...
##  $ A.Rat.L   : num  11 12 11 14 9 10 12 10 10 11 ...
##  $ A.Mean.H  : num  2.55 2.38 3.82 5.29 2.55 4.17 3.9 6.4 5 3.67 ...
##  $ A.SD.H    : num  1.29 1.92 2.14 2.63 1.92 1.7 1.6 2.5 2.72 2.39 ...
##  $ A.Rat.H   : num  11 8 11 7 11 12 10 10 11 12 ...
##  $ D.Mean.L  : num  4.12 5.55 2.77 2.31 4.83 5.93 6.5 2.89 3.03 3.44 ...
##  $ D.SD.L    : num  1.64 2.21 2.09 1.45 2.18 1.9 1.52 2.52 2.39 2.24 ...
##  $ D.Rat.L   : num  8 11 13 16 18 14 6 9 30 9 ...
##  $ D.Mean.H  : num  4.43 4.36 4.11 3.08 5.43 4.31 4.77 2.67 3 7.22 ...
##  $ D.SD.H    : num  1.99 1.03 2.93 2.19 1.62 1.65 2.17 1.95 2.54 1.99 ...
##  $ D.Rat.H   : num  7 11 9 12 7 13 13 15 18 9 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   X1 = col_double(),
##   ..   Word = col_character(),
##   ..   V.Mean.Sum = col_double(),
##   ..   V.SD.Sum = col_double(),
##   ..   V.Rat.Sum = col_double(),
##   ..   A.Mean.Sum = col_double(),
##   ..   A.SD.Sum = col_double(),
##   ..   A.Rat.Sum = col_double(),
##   ..   D.Mean.Sum = col_double(),
##   ..   D.SD.Sum = col_double(),
##   ..   D.Rat.Sum = col_double(),
##   ..   V.Mean.M = col_double(),
##   ..   V.SD.M = col_double(),
##   ..   V.Rat.M = col_double(),
##   ..   V.Mean.F = col_double(),
##   ..   V.SD.F = col_double(),
##   ..   V.Rat.F = col_double(),
##   ..   A.Mean.M = col_double(),
##   ..   A.SD.M = col_double(),
##   ..   A.Rat.M = col_double(),
##   ..   A.Mean.F = col_double(),
##   ..   A.SD.F = col_double(),
##   ..   A.Rat.F = col_double(),
##   ..   D.Mean.M = col_double(),
##   ..   D.SD.M = col_double(),
##   ..   D.Rat.M = col_double(),
##   ..   D.Mean.F = col_double(),
##   ..   D.SD.F = col_double(),
##   ..   D.Rat.F = col_double(),
##   ..   V.Mean.Y = col_double(),
##   ..   V.SD.Y = col_double(),
##   ..   V.Rat.Y = col_double(),
##   ..   V.Mean.O = col_double(),
##   ..   V.SD.O = col_double(),
##   ..   V.Rat.O = col_double(),
##   ..   A.Mean.Y = col_double(),
##   ..   A.SD.Y = col_double(),
##   ..   A.Rat.Y = col_double(),
##   ..   A.Mean.O = col_double(),
##   ..   A.SD.O = col_double(),
##   ..   A.Rat.O = col_double(),
##   ..   D.Mean.Y = col_double(),
##   ..   D.SD.Y = col_double(),
##   ..   D.Rat.Y = col_double(),
##   ..   D.Mean.O = col_double(),
##   ..   D.SD.O = col_double(),
##   ..   D.Rat.O = col_double(),
##   ..   V.Mean.L = col_double(),
##   ..   V.SD.L = col_double(),
##   ..   V.Rat.L = col_double(),
##   ..   V.Mean.H = col_double(),
##   ..   V.SD.H = col_double(),
##   ..   V.Rat.H = col_double(),
##   ..   A.Mean.L = col_double(),
##   ..   A.SD.L = col_double(),
##   ..   A.Rat.L = col_double(),
##   ..   A.Mean.H = col_double(),
##   ..   A.SD.H = col_double(),
##   ..   A.Rat.H = col_double(),
##   ..   D.Mean.L = col_double(),
##   ..   D.SD.L = col_double(),
##   ..   D.Rat.L = col_double(),
##   ..   D.Mean.H = col_double(),
##   ..   D.SD.H = col_double(),
##   ..   D.Rat.H = col_double()
##   .. )

Cleanup

We’ll clean up the dataframe that holds the valence, arousal, and dominance values a bit by removing all columns from vad.imp except the word and the mean scores:

vad <- vad.imp %>%
    select(Word, V.Mean.Sum, A.Mean.Sum, D.Mean.Sum) %>%
    rename(word = Word,
           valence = V.Mean.Sum,
           arousal = A.Mean.Sum,
           dominance = D.Mean.Sum)
head(vad)

## # A tibble: 6 x 4
##   word        valence arousal dominance
##   <chr>         <dbl>   <dbl>     <dbl>
## 1 aardvark       6.26    2.41      4.27
## 2 abalone        5.3     2.65      4.95
## 3 abandon        2.84    3.73      3.32
## 4 abandonment    2.63    4.95      2.64
## 5 abbey          5.85    2.2       5   
## 6 abdomen        5.43    3.68      5.15

str(vad)

## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 13915 obs. of  4 variables:
##  $ word     : chr  "aardvark" "abalone" "abandon" "abandonment" ...
##  $ valence  : num  6.26 5.3 2.84 2.63 5.85 5.43 4.48 2.42 2.05 5.52 ...
##  $ arousal  : num  2.41 2.65 3.73 4.95 2.2 3.68 3.5 5.9 5.33 3.26 ...
##  $ dominance: num  4.27 4.95 3.32 2.64 5 5.15 5.32 2.75 3.02 5.33 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   X1 = col_double(),
##   ..   Word = col_character(),
##   ..   V.Mean.Sum = col_double(),
##   ..   V.SD.Sum = col_double(),
##   ..   V.Rat.Sum = col_double(),
##   ..   A.Mean.Sum = col_double(),
##   ..   A.SD.Sum = col_double(),
##   ..   A.Rat.Sum = col_double(),
##   ..   D.Mean.Sum = col_double(),
##   ..   D.SD.Sum = col_double(),
##   ..   D.Rat.Sum = col_double(),
##   ..   V.Mean.M = col_double(),
##   ..   V.SD.M = col_double(),
##   ..   V.Rat.M = col_double(),
##   ..   V.Mean.F = col_double(),
##   ..   V.SD.F = col_double(),
##   ..   V.Rat.F = col_double(),
##   ..   A.Mean.M = col_double(),
##   ..   A.SD.M = col_double(),
##   ..   A.Rat.M = col_double(),
##   ..   A.Mean.F = col_double(),
##   ..   A.SD.F = col_double(),
##   ..   A.Rat.F = col_double(),
##   ..   D.Mean.M = col_double(),
##   ..   D.SD.M = col_double(),
##   ..   D.Rat.M = col_double(),
##   ..   D.Mean.F = col_double(),
##   ..   D.SD.F = col_double(),
##   ..   D.Rat.F = col_double(),
##   ..   V.Mean.Y = col_double(),
##   ..   V.SD.Y = col_double(),
##   ..   V.Rat.Y = col_double(),
##   ..   V.Mean.O = col_double(),
##   ..   V.SD.O = col_double(),
##   ..   V.Rat.O = col_double(),
##   ..   A.Mean.Y = col_double(),
##   ..   A.SD.Y = col_double(),
##   ..   A.Rat.Y = col_double(),
##   ..   A.Mean.O = col_double(),
##   ..   A.SD.O = col_double(),
##   ..   A.Rat.O = col_double(),
##   ..   D.Mean.Y = col_double(),
##   ..   D.SD.Y = col_double(),
##   ..   D.Rat.Y = col_double(),
##   ..   D.Mean.O = col_double(),
##   ..   D.SD.O = col_double(),
##   ..   D.Rat.O = col_double(),
##   ..   V.Mean.L = col_double(),
##   ..   V.SD.L = col_double(),
##   ..   V.Rat.L = col_double(),
##   ..   V.Mean.H = col_double(),
##   ..   V.SD.H = col_double(),
##   ..   V.Rat.H = col_double(),
##   ..   A.Mean.L = col_double(),
##   ..   A.SD.L = col_double(),
##   ..   A.Rat.L = col_double(),
##   ..   A.Mean.H = col_double(),
##   ..   A.SD.H = col_double(),
##   ..   A.Rat.H = col_double(),
##   ..   D.Mean.L = col_double(),
##   ..   D.SD.L = col_double(),
##   ..   D.Rat.L = col_double(),
##   ..   D.Mean.H = col_double(),
##   ..   D.SD.H = col_double(),
##   ..   D.Rat.H = col_double()
##   .. )

Let’s gather() this and turn it into a long version:

vadl <- gather(vad, dimension, score, valence:dominance)
head(vadl)

## # A tibble: 6 x 3
##   word        dimension score
##   <chr>       <chr>     <dbl>
## 1 aardvark    valence    6.26
## 2 abalone     valence    5.3 
## 3 abandon     valence    2.84
## 4 abandonment valence    2.63
## 5 abbey       valence    5.85
## 6 abdomen     valence    5.43

str(vadl)

## Classes 'tbl_df', 'tbl' and 'data.frame':    41745 obs. of  3 variables:
##  $ word     : chr  "aardvark" "abalone" "abandon" "abandonment" ...
##  $ dimension: chr  "valence" "valence" "valence" "valence" ...
##  $ score    : num  6.26 5.3 2.84 2.63 5.85 5.43 4.48 2.42 2.05 5.52 ...
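A side note: gather() still works, but in tidyr 1.0+ it has been superseded by pivot_longer(), which does the same reshaping with more explicit arguments. A self-contained sketch (toy rows taken from the head(vad) output above):

```r
library(tidyr)

# toy stand-in for the wide vad table
vad_toy <- data.frame(
  word      = c("aardvark", "abandon"),
  valence   = c(6.26, 2.84),
  arousal   = c(2.41, 3.73),
  dominance = c(4.27, 3.32)
)

# gather(), as above, and its pivot_longer() equivalent
long_gather <- gather(vad_toy, dimension, score, valence:dominance)
long_pivot  <- pivot_longer(vad_toy, cols = valence:dominance,
                            names_to = "dimension", values_to = "score")
# same 6 rows in both, though pivot_longer() orders them word-by-word
```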

Then we’ll inner_join() both VAD dataframes with the friends data - that is, the transcripts split into individual words, with stop words removed:

friends.vad <- friends %>%
    inner_join(vad, by = "word")
head(friends.vad)

##     Friend season episode lineNum  word valence arousal dominance
## 1     Joey      1       1       3 wrong    3.24    5.29      3.39
## 2 Chandler      1       1       4  nice    6.95    3.53      6.47
## 3 Chandler      1       1       4  hump    4.75    5.16      3.77
## 4 Chandler      1       1       4  hump    4.75    5.16      3.77
## 5   Phoebe      1       1       5  wait    4.55    3.62      5.31
## 6   Phoebe      1       1       5   eat    7.10    4.38      7.26

str(friends.vad)

## 'data.frame':    87692 obs. of  8 variables:
##  $ Friend   : chr  "Joey" "Chandler" "Chandler" "Chandler" ...
##  $ season   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ episode  : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ lineNum  : int  3 4 4 4 5 5 5 8 8 8 ...
##  $ word     : chr  "wrong" "nice" "hump" "hump" ...
##  $ valence  : num  3.24 6.95 4.75 4.75 4.55 7.1 5 7.82 7.18 5.7 ...
##  $ arousal  : num  5.29 3.53 5.16 5.16 3.62 4.38 2.9 2.38 4.33 3.77 ...
##  $ dominance: num  3.39 6.47 3.77 3.77 5.31 7.26 5.48 6.94 5.11 5.4 ...

friends.vadl <- friends %>%
    inner_join(vadl, by = "word")
head(friends.vadl)

##     Friend season episode lineNum  word dimension score
## 1     Joey      1       1       3 wrong   valence  3.24
## 2     Joey      1       1       3 wrong   arousal  5.29
## 3     Joey      1       1       3 wrong dominance  3.39
## 4 Chandler      1       1       4  nice   valence  6.95
## 5 Chandler      1       1       4  nice   arousal  3.53
## 6 Chandler      1       1       4  nice dominance  6.47

str(friends.vadl)

## 'data.frame':    263076 obs. of  7 variables:
##  $ Friend   : chr  "Joey" "Joey" "Joey" "Chandler" ...
##  $ season   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ episode  : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ lineNum  : int  3 3 3 4 4 4 4 4 4 4 ...
##  $ word     : chr  "wrong" "wrong" "wrong" "nice" ...
##  $ dimension: chr  "valence" "arousal" "dominance" "valence" ...
##  $ score    : num  3.24 5.29 3.39 6.95 3.53 6.47 4.75 5.16 3.77 4.75 ...
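Note that inner_join() silently drops every token without an entry in the norms - here, 151,445 tokens shrink to 87,692 matched rows, so roughly 58% of tokens carry a VAD score. Coverage can be checked before joining with base R; a sketch on toy data (values borrowed from the outputs above, plus one unrated word):

```r
# toy stand-ins: a few tokens and a tiny "norms" lexicon
tokens_toy <- data.frame(word = c("wrong", "nice", "hump", "whoa"))
vad_toy    <- data.frame(word    = c("wrong", "nice", "hump"),
                         valence = c(3.24, 6.95, 4.75))

# share of tokens that would survive an inner_join() with the lexicon
coverage <- mean(tokens_toy$word %in% vad_toy$word)
coverage  # 0.75 here: "whoa" has no rating and would be dropped
```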

Analysis

Dimensions by Friend

First we’ll calculate valence, arousal, and dominance score sums for each Friend. To make the Friends comparable, we’ll normalize each sum by the Friend’s total word count:

friends.vad.agg <- friends.vadl %>%
    filter(Friend %in% friendsNames) %>%
    group_by(Friend) %>%
    mutate(wordCount  = n()) %>%
    ungroup() %>%
    group_by(Friend, dimension) %>%
    mutate(totalScore = sum(score)) %>%
    ungroup() %>%
    select(Friend, dimension, wordCount, totalScore) %>%
    distinct(Friend, dimension, totalScore, wordCount) %>%
    mutate(relativeScore = totalScore/wordCount)

head(friends.vad.agg)

## # A tibble: 6 x 5
##   Friend   dimension totalScore wordCount relativeScore
##   <chr>    <chr>          <dbl>     <int>         <dbl>
## 1 Joey     valence       70049.     36054          1.94
## 2 Joey     arousal       51079.     36054          1.42
## 3 Joey     dominance     66134.     36054          1.83
## 4 Chandler valence       73181.     37659          1.94
## 5 Chandler arousal       53843.     37659          1.43
## 6 Chandler dominance     69087.     37659          1.83

str(friends.vad.agg)

## Classes 'tbl_df', 'tbl' and 'data.frame':    18 obs. of  5 variables:
##  $ Friend       : chr  "Joey" "Joey" "Joey" "Chandler" ...
##  $ dimension    : chr  "valence" "arousal" "dominance" "valence" ...
##  $ totalScore   : num  70049 51079 66134 73181 53843 ...
##  $ wordCount    : int  36054 36054 36054 37659 37659 37659 34869 34869 34869 35649 ...
##  $ relativeScore: num  1.94 1.42 1.83 1.94 1.43 ...
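The mutate() / ungroup() / distinct() chain above can also be written more compactly with summarise(). One thing to keep in mind: in the long data each token appears once per dimension, so the n() above counts every word three times, and relativeScore works out to the mean score divided by three. A self-contained sketch on toy numbers (dplyr 1.0+, values taken from the head() outputs above):

```r
library(dplyr)

# toy long-format data: 2 tokens x 3 dimensions for one Friend
vadl_toy <- data.frame(
  Friend    = "Joey",
  dimension = rep(c("valence", "arousal", "dominance"), times = 2),
  score     = c(3.24, 5.29, 3.39, 6.95, 3.53, 6.47)
)

agg <- vadl_toy %>%
  group_by(Friend) %>%
  mutate(wordCount = n()) %>%          # counts every token in every dimension: 6 here
  group_by(Friend, dimension) %>%
  summarise(totalScore = sum(score),
            wordCount  = first(wordCount),
            .groups    = "drop") %>%
  mutate(relativeScore = totalScore / wordCount)
```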

Let’s visualize each Friend’s values along the three dimensions with a geom_point() plot:

ggplot(friends.vad.agg, aes(dimension, relativeScore, color = dimension, fill = dimension)) +
    geom_point(show.legend = TRUE, alpha = 0.8) +
    facet_wrap(~Friend) +
    theme(axis.text.x = element_blank(), axis.ticks = element_blank(), legend.position = "right")

This doesn’t tell us much - the dots all look like they are pretty much in the same place. Let’s try faceting by dimension instead:

ggplot(friends.vad.agg, aes(Friend, relativeScore, color = Friend, fill = Friend)) +
  geom_point(show.legend = TRUE, alpha = 0.8) +
    facet_wrap(~dimension) +
    #coord_cartesian(ylim = c(1.3, 2)) +
    theme(axis.text.x = element_blank(), axis.ticks = element_blank(), legend.position = "right")

This is a little bit better. It turns out the six Friends are all pretty equal in terms of their dominance, but there are some noticeable patterns for arousal and valence:

  • Joey has by far the lowest arousal score.
  • The female Friends and Chandler have the highest.
  • Monica and Ross are the most positive, and Phoebe the most negative.

Most X Words by Friend

Let’s see which words are the most valent, dominant, and arousing per Friend:

words.by.Friend <- friends.vadl %>%
    filter(Friend %in% friendsNames) %>%
    group_by(Friend) %>%
    mutate(wordCount  = n()) %>%
    ungroup() %>%
    distinct(word, Friend, dimension, score)
head(words.by.Friend)

## # A tibble: 6 x 4
##   word  Friend   dimension score
##   <chr> <chr>    <chr>     <dbl>
## 1 wrong Joey     valence    3.24
## 2 wrong Joey     arousal    5.29
## 3 wrong Joey     dominance  3.39
## 4 nice  Chandler valence    6.95
## 5 nice  Chandler arousal    3.53
## 6 nice  Chandler dominance  6.47

str(words.by.Friend)

## Classes 'tbl_df', 'tbl' and 'data.frame':    44415 obs. of  4 variables:
##  $ word     : chr  "wrong" "wrong" "wrong" "nice" ...
##  $ Friend   : chr  "Joey" "Joey" "Joey" "Chandler" ...
##  $ dimension: chr  "valence" "arousal" "dominance" "valence" ...
##  $ score    : num  3.24 5.29 3.39 6.95 3.53 6.47 4.75 5.16 3.77 4.55 ...

words.by.Friend

## # A tibble: 44,415 x 4
##    word  Friend   dimension score
##    <chr> <chr>    <chr>     <dbl>
##  1 wrong Joey     valence    3.24
##  2 wrong Joey     arousal    5.29
##  3 wrong Joey     dominance  3.39
##  4 nice  Chandler valence    6.95
##  5 nice  Chandler arousal    3.53
##  6 nice  Chandler dominance  6.47
##  7 hump  Chandler valence    4.75
##  8 hump  Chandler arousal    5.16
##  9 hump  Chandler dominance  3.77
## 10 wait  Phoebe   valence    4.55
## # … with 44,405 more rows

Which words score highest along each dimension? Let’s look at Joey first:

words.by.Friend %>%
    filter(Friend == "Joey") %>%
    group_by(dimension) %>%
    top_n(5, score) %>%
    ggplot(aes(reorder(word, score), score, fill = dimension)) +
    geom_col(show.legend = FALSE) +
    facet_wrap(~ dimension, scales = "free") +
    labs(y = "Contribution to Dimensions", x = NULL) +
    coord_flip()

Let’s loop this across all Friends, to visualize the most arousing/dominant/positive words for each Friend:

# create graphing function
loop.vad <- function(df, na.rm = TRUE, ...){


  # create for loop to produce ggplot2 graphs 
  for (i in seq_along(friendsNames)) { 

            use <- df %>%
                filter(Friend == friendsNames[i]) %>%
                group_by(dimension) %>%
                top_n(10, score)


            plot <- ggplot(use, aes(reorder(word, score), score, fill = dimension)) +
              geom_col(show.legend = FALSE) +
              facet_wrap(~ dimension, scales = "free") +
              labs(title = paste(friendsNames[i], "'s Most Arousing/Dominant/Valent Words"), x = NULL, y = NULL) +
                coord_flip()

    print(plot)
  }
}

# run graphing function
loop.vad(words.by.Friend)

There aren’t all that many differences between the Friends, mostly because this visualization doesn’t consider how often a Friend uttered a word - only whether they uttered it at all (even if just once). It would make more sense to take the number of times a word was used into account - for example, by weighting each word’s VAD score by how often a Friend produced it. Let’s try that:

friends.vadl$word <- as.factor(friends.vadl$word)

words2 <- friends.vadl %>%
    filter(Friend %in% friendsNames) %>%
    group_by(Friend) %>%
    mutate(totalWords  = n()) %>%
    ungroup() %>%
    group_by(Friend, dimension, season) %>%
    mutate(scoreSum = sum(score)) %>%
    ungroup() %>%
    group_by(word) %>%
    mutate(wordCount = n()) %>%
    ungroup() %>%
    select(Friend, word, dimension, wordCount, totalWords, scoreSum) %>%
    mutate(totalScore = (wordCount + scoreSum)/totalWords) %>%
    ungroup() %>%
    distinct() %>%
    filter(!word %in% c("god", "wait", "time", "love", "people", "fine"))

summary(words2)

##     Friend               word         dimension           wordCount     
##  Length:106608      bad    :   180   Length:106608      Min.   :   3.0  
##  Class :character   call   :   180   Class :character   1st Qu.:  24.0  
##  Mode  :character   day    :   180   Mode  :character   Median :  81.0  
##                     feel   :   180                      Mean   : 186.9  
##                     girl   :   180                      3rd Qu.: 234.0  
##                     guess  :   180                      Max.   :1566.0  
##                     (Other):105528                                      
##    totalWords       scoreSum      totalScore    
##  Min.   :34869   Min.   :3628   Min.   :0.1041  
##  1st Qu.:35649   1st Qu.:5563   1st Qu.:0.1546  
##  Median :37659   Median :6493   Median :0.1820  
##  Mean   :36902   Mean   :6485   Mean   :0.1808  
##  3rd Qu.:38232   3rd Qu.:7269   3rd Qu.:0.2029  
##  Max.   :38766   Max.   :9320   Max.   :0.2934  
## 

words2

## # A tibble: 106,608 x 7
##    Friend   word  dimension wordCount totalWords scoreSum totalScore
##    <chr>    <fct> <chr>         <int>      <int>    <dbl>      <dbl>
##  1 Joey     wrong valence         804      36054    5251.      0.168
##  2 Joey     wrong arousal         804      36054    3815.      0.128
##  3 Joey     wrong dominance       804      36054    4938.      0.159
##  4 Chandler nice  valence        1314      37659    7300.      0.229
##  5 Chandler nice  arousal        1314      37659    5399.      0.178
##  6 Chandler nice  dominance      1314      37659    6970.      0.220
##  7 Chandler hump  valence           9      37659    7300.      0.194
##  8 Chandler hump  arousal           9      37659    5399.      0.144
##  9 Chandler hump  dominance         9      37659    6970.      0.185
## 10 Phoebe   eat   valence         531      34869    4880.      0.155
## # … with 106,598 more rows

Note that this is a largely arbitrary measure, used purely for visualization purposes.
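For comparison, the straight multiplication proposed earlier (times-used × score, normalized by the Friend's total word count) can be sketched like this, on toy values borrowed from the head() outputs above:

```r
library(dplyr)

# toy token list: "wrong" uttered twice, "nice" once (per-token valence scores)
toks <- data.frame(
  word    = c("wrong", "wrong", "nice"),
  valence = c(3.24, 3.24, 6.95)
)

totalWords <- nrow(toks)

# count x score, normalized by the total number of tokens
word_weight <- toks %>%
  group_by(word) %>%
  summarise(wordCount = n(),
            score     = first(valence),
            .groups   = "drop") %>%
  mutate(weighted = wordCount * score / totalWords)
```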

This is Where Things Get Wonky

Let’s loop this across all Friends:

# create graphing function
loop.vad2 <- function(df, na.rm = TRUE, ...) {

  # loop over the Friends and produce one ggplot2 graph each
  for (i in seq_along(friendsNames)) {

    use <- df %>%
      filter(Friend == friendsNames[i]) %>%
      group_by(dimension) %>%
      top_n(20, totalScore) %>%
      arrange(totalScore)

    plot <- ggplot(use, aes(reorder(word, totalScore), totalScore, fill = dimension)) +
      geom_col(show.legend = FALSE) +
      facet_wrap(~ dimension, scales = "free_y") +
      labs(title = paste0(friendsNames[i], "'s Most Arousing/Dominant/Valent Words"),
           x = NULL, y = NULL) +
      coord_flip()

    print(plot)
  }
}

# run graphing function
loop.vad2(words2)

This produces a plot, but you will notice that the ordering of words on the y-axis is off. This is related to a known bug (or feature?) in ggplot2; I haven’t looked into it in detail, but there may be workarounds available here and/or here.
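One workaround I’m aware of (untested on this data) is reorder_within() from the tidytext package, paired with scale_x_reordered(), which reorders the bars separately inside each facet. A minimal sketch on toy values:

```r
library(ggplot2)
library(tidytext)  # reorder_within(), scale_x_reordered()

toy <- data.frame(
  word      = c("nice", "hump", "wrong", "eat"),
  dimension = c("valence", "valence", "arousal", "arousal"),
  score     = c(6.95, 4.75, 5.29, 4.38)
)

p <- ggplot(toy, aes(reorder_within(word, score, dimension), score,
                     fill = dimension)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ dimension, scales = "free") +
  scale_x_reordered() +  # strips the internal "word___facet" suffix from labels
  coord_flip()

p
```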

Isabell Hubert Lyall
PhD Candidate in Experimental Psycholinguistics | Vice-Chair ETS Advisory Board

PhD Candidate researching the influence of extra-linguistic information on language comprehension; affiliated with the Centre for Comparative Psycholinguistics at the University of Alberta. Also the Vice-Chair of the Edmonton Transit Service Advisory Board (ETSAB), which advises City Council in transit-related matters.
