Textmining Friends: S1 E1 - The One with all the Import and Cleanup

This Friends textmining effort in R was my Saturday project during a series of very snowy Saturdays we had here in Edmonton in September. It makes heavy use of the tidyverse, Text Mining with R by Julia Silge and David Robinson (which is highly recommended), and piping from the magrittr package (which makes things so much nicer).

Disclaimer – I do not claim to be an expert in textmining. There may be faster, smarter, or nicer ways to achieve a certain thing. Still, maybe you’ll find something interesting for your own projects - or just some funny tidbit about your favourite show. In this first “episode,” we’ll import the scripts and clean them up so that they lend themselves nicely to future use. Enjoy!

Isabell Hubert 2018



Prep

We’ll load the following libraries:

library(dplyr)
library(tidyr)
library(readr)
library(stringr)
library(tidytext)
library(magrittr)

Import

I got all of my scripts from https://fangj.github.io/friends/, and saved them in a subdirectory called transcripts. We first list all txt files in the directory:

# pattern is a regular expression, not a glob: match names ending in ".txt"
textFiles <- list.files(path = "transcripts", full.names = TRUE, pattern = "\\.txt$")
head(textFiles)

## [1] "transcripts/0101.txt" "transcripts/0102.txt" "transcripts/0103.txt"
## [4] "transcripts/0104.txt" "transcripts/0105.txt" "transcripts/0106.txt"

Before importing the scripts, make sure that the same apostrophe is used throughout. I have found that some transcripts use the typographic apostrophe ’, while others use the straight apostrophe '. The stop word list we will use uses ', so any words that contain ’ will not be caught. You can either replace them before import, or use a regex within R.
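For the regex route, a minimal sketch in base R could look like the following. The helper name normalize_apostrophes and the sample strings are made up for illustration; in the real workflow the function would be applied to the imported lines.

```r
# Hypothetical helper: replace the typographic apostrophe (U+2019)
# with the straight apostrophe used by the stop word list.
normalize_apostrophes <- function(x) {
  gsub("\u2019", "'", x, fixed = TRUE)
}

# Made-up sample input; the first string contains a typographic apostrophe
raw <- c("Monica: There\u2019s nothing to tell!", "Joey: C'mon!")
clean <- normalize_apostrophes(raw)
print(clean)
```

Strings that already use the straight apostrophe pass through unchanged, so it is safe to run this over every line.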

Now we import the content of all txt files as a nested list, and extract lines from them:

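# simplify = FALSE guarantees a named list, even if all files had the same length
fullText <- sapply(textFiles, readLines, simplify = FALSE)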
lines <- do.call(rbind, lapply(fullText, data.frame, stringsAsFactors = FALSE))
names(lines)[1] <- "line"

This is what lines looks like:

str(lines)

## 'data.frame':    69531 obs. of  1 variable:
##  $ line: chr  "[Scene: Central Perk, Chandler, Joey, Phoebe, and Monica are there.]" "Monica: There's nothing to tell! He's just some guy I work with!" "Joey: C'mon, you're going out with the guy! There's gotta be something wrong with him!" "Chandler: All right Joey, be nice.  So does he have a hump? A hump and a hairpiece?" ...

head(lines)

##                                                                                                          line
## transcripts/0101.txt.1                   [Scene: Central Perk, Chandler, Joey, Phoebe, and Monica are there.]
## transcripts/0101.txt.2                       Monica: There's nothing to tell! He's just some guy I work with!
## transcripts/0101.txt.3 Joey: C'mon, you're going out with the guy! There's gotta be something wrong with him!
## transcripts/0101.txt.4    Chandler: All right Joey, be nice.  So does he have a hump? A hump and a hairpiece?
## transcripts/0101.txt.5                                                       Phoebe: Wait, does he eat chalk?
## transcripts/0101.txt.6                                                             (They all stare, bemused.)

We’ll extract the row names, as they contain lots of information we’ll need later:

lines$rownames <- rownames(lines)
head(lines)

##                                                                                                          line
## transcripts/0101.txt.1                   [Scene: Central Perk, Chandler, Joey, Phoebe, and Monica are there.]
## transcripts/0101.txt.2                       Monica: There's nothing to tell! He's just some guy I work with!
## transcripts/0101.txt.3 Joey: C'mon, you're going out with the guy! There's gotta be something wrong with him!
## transcripts/0101.txt.4    Chandler: All right Joey, be nice.  So does he have a hump? A hump and a hairpiece?
## transcripts/0101.txt.5                                                       Phoebe: Wait, does he eat chalk?
## transcripts/0101.txt.6                                                             (They all stare, bemused.)
##                                      rownames
## transcripts/0101.txt.1 transcripts/0101.txt.1
## transcripts/0101.txt.2 transcripts/0101.txt.2
## transcripts/0101.txt.3 transcripts/0101.txt.3
## transcripts/0101.txt.4 transcripts/0101.txt.4
## transcripts/0101.txt.5 transcripts/0101.txt.5
## transcripts/0101.txt.6 transcripts/0101.txt.6

Then we’ll separate the rownames column at /, into folder name and whatever else is left over:

lines1 <- lines %>%
    separate(rownames, into = c('folder', 'rownames'), remove = TRUE, extra = "merge", sep = "/")
head(lines1)

##                                                                                                          line
## transcripts/0101.txt.1                   [Scene: Central Perk, Chandler, Joey, Phoebe, and Monica are there.]
## transcripts/0101.txt.2                       Monica: There's nothing to tell! He's just some guy I work with!
## transcripts/0101.txt.3 Joey: C'mon, you're going out with the guy! There's gotta be something wrong with him!
## transcripts/0101.txt.4    Chandler: All right Joey, be nice.  So does he have a hump? A hump and a hairpiece?
## transcripts/0101.txt.5                                                       Phoebe: Wait, does he eat chalk?
## transcripts/0101.txt.6                                                             (They all stare, bemused.)
##                             folder   rownames
## transcripts/0101.txt.1 transcripts 0101.txt.1
## transcripts/0101.txt.2 transcripts 0101.txt.2
## transcripts/0101.txt.3 transcripts 0101.txt.3
## transcripts/0101.txt.4 transcripts 0101.txt.4
## transcripts/0101.txt.5 transcripts 0101.txt.5
## transcripts/0101.txt.6 transcripts 0101.txt.6

Then we’ll extract the season number from the first two characters of rownames, and the episode number from characters three and four:

lines1$season <- str_sub(lines1$rownames, 1, 2)
lines1$episode <- str_sub(lines1$rownames, 3, 4)
head(lines1)

##                                                                                                          line
## transcripts/0101.txt.1                   [Scene: Central Perk, Chandler, Joey, Phoebe, and Monica are there.]
## transcripts/0101.txt.2                       Monica: There's nothing to tell! He's just some guy I work with!
## transcripts/0101.txt.3 Joey: C'mon, you're going out with the guy! There's gotta be something wrong with him!
## transcripts/0101.txt.4    Chandler: All right Joey, be nice.  So does he have a hump? A hump and a hairpiece?
## transcripts/0101.txt.5                                                       Phoebe: Wait, does he eat chalk?
## transcripts/0101.txt.6                                                             (They all stare, bemused.)
##                             folder   rownames season episode
## transcripts/0101.txt.1 transcripts 0101.txt.1     01      01
## transcripts/0101.txt.2 transcripts 0101.txt.2     01      01
## transcripts/0101.txt.3 transcripts 0101.txt.3     01      01
## transcripts/0101.txt.4 transcripts 0101.txt.4     01      01
## transcripts/0101.txt.5 transcripts 0101.txt.5     01      01
## transcripts/0101.txt.6 transcripts 0101.txt.6     01      01

Now we’ll extract the final number (after the period) from the original row name to be the line number:

split <- strsplit(lines1$rownames, "[.]")
split1 <- lapply(split, `[[`, 3)
lineNum <- as.data.frame(unlist(split1))
names(lineNum)[1] <- "lineNum"
head(lineNum)

##   lineNum
## 1       1
## 2       2
## 3       3
## 4       4
## 5       5
## 6       6

We’ll bind this to lines1:

lines2 <- bind_cols(lines1, lineNum)
head(lines2)

##                                                                                     line
## 1                   [Scene: Central Perk, Chandler, Joey, Phoebe, and Monica are there.]
## 2                       Monica: There's nothing to tell! He's just some guy I work with!
## 3 Joey: C'mon, you're going out with the guy! There's gotta be something wrong with him!
## 4    Chandler: All right Joey, be nice.  So does he have a hump? A hump and a hairpiece?
## 5                                                       Phoebe: Wait, does he eat chalk?
## 6                                                             (They all stare, bemused.)
##        folder   rownames season episode lineNum
## 1 transcripts 0101.txt.1     01      01       1
## 2 transcripts 0101.txt.2     01      01       2
## 3 transcripts 0101.txt.3     01      01       3
## 4 transcripts 0101.txt.4     01      01       4
## 5 transcripts 0101.txt.5     01      01       5
## 6 transcripts 0101.txt.6     01      01       6
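As an aside, the season, episode, and line number could also be pulled out of the row name in one pass with a single regular expression. This is only a sketch in base R, assuming every row name follows the SSEE.txt.N shape seen above; the sample ids (including the last one) are made up for illustration.

```r
# Made-up row names in the same shape as the rownames column
ids <- c("0101.txt.1", "0101.txt.12", "1017.txt.339")

# Two digits for the season, two for the episode, then the line number
pattern <- "^([0-9]{2})([0-9]{2})\\.txt\\.([0-9]+)$"

parsed <- data.frame(
  season  = as.integer(sub(pattern, "\\1", ids)),
  episode = as.integer(sub(pattern, "\\2", ids)),
  lineNum = as.integer(sub(pattern, "\\3", ids))
)
print(parsed)
```

Converting to integer right away also sidesteps the character and factor columns that need fixing at the end of this post.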

Clean-Up

Lines

We notice there are lines in line that start with opening brackets. What are those? Let’s start with square brackets:
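One way to pull out such lines for inspection is sketched here in base R. The vector below is a small stand-in for the line column: the "[Time Lapse]" entry is a made-up example, the others are taken from the output shown earlier.

```r
# Stand-in for lines2$line
line <- c(
  "[Scene: Central Perk, Chandler, Joey, Phoebe, and Monica are there.]",
  "Monica: There's nothing to tell! He's just some guy I work with!",
  "(They all stare, bemused.)",
  "[Time Lapse]"
)

# Lines that start with an opening square bracket
starts_square <- grepl("^\\[", line)
print(line[starts_square])

# Lines that start with an opening parenthesis
starts_round <- grepl("^\\(", line)
print(line[starts_round])
```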

They appear to be scene intros and time lapse indicators. Since nobody speaks in any of them, those lines can all be removed. We’ll do that later, once we know what the situation is with regular opening brackets.

Speaking of which: some of the lines containing ( consist entirely of a parenthesized remark, just like the square-bracket lines; those can go completely. However, sometimes only an initial phrase is in parentheses, one that sets the scene, for example, and then a character speaks after it. An example - line 56 in S1 E04:

Chandler: (looking) Oh, this is not that bad.

We cannot remove the entire line, as otherwise we’d get rid of a lot of utterances. So we should remove the bits in parentheses, and leave the other stuff intact:

lines2b <- lines2
lines2b$line <- gsub( " *\\(.*?\\) *", "", lines2$line)

Great! (One caveat: gsub() replaces every match on a line, but the pattern stops at the first closing bracket it finds, so it cannot cope with nested parentheses, and it skips brackets that are opened but never closed. We will learn later that some script transcribers use quite a number of parentheses per line, sometimes nested, to describe the situation, and that there are some typos of exactly this kind. However, this will be caught and fixed when we do a general inspection of our work.)
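Nested parentheses could also be handled in code rather than by hand. The following is a minimal sketch in base R, not what was done for this post: strip_parens is a hypothetical helper that repeatedly removes the innermost pair of parentheses until nothing changes. The last two example strings are made up.

```r
# Hypothetical helper: remove parenthesized remarks, including nested ones
strip_parens <- function(x) {
  repeat {
    # Remove innermost groups: "(" ... ")" with no parentheses in between
    stripped <- gsub("\\([^()]*\\)", " ", x)
    if (identical(stripped, x)) break
    x <- stripped
  }
  # Collapse the runs of spaces left behind and trim both ends
  # (note: this also collapses double spaces that were in the transcript)
  trimws(gsub(" +", " ", x))
}

examples <- c(
  "Chandler: (looking) Oh, this is not that bad.",
  "(They all stare, bemused.)",
  "Ross: (sighs (deeply)) Hi.",
  "Joey: (to Ross Hey!"
)
cleaned <- strip_parens(examples)
print(cleaned)
```

Lines that consisted only of a remark come back as empty strings, so the filter on empty lines still applies. A bracket that is opened but never closed is left untouched, which keeps such typos visible for manual fixing.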

Now we should remove those observations where line is empty:

lines2b <- lines2b %>%
    filter(line != "")

And all lines starting with [:

lines3 <- lines2b %>%
    filter(!str_detect(line, pattern = "^\\["))
head(lines3)

##                                                                                                                    line
## 1                                                      Monica: There's nothing to tell! He's just some guy I work with!
## 2                                Joey: C'mon, you're going out with the guy! There's gotta be something wrong with him!
## 3                                   Chandler: All right Joey, be nice.  So does he have a hump? A hump and a hairpiece?
## 4                                                                                      Phoebe: Wait, does he eat chalk?
## 5                               Phoebe: Just, 'cause, I don't want her to go through what I went through with Carl- oh!
## 6 Monica: Okay, everybody relax. This is not even a date. It's just two people going out to dinner and- not having sex.
##        folder   rownames season episode lineNum
## 1 transcripts 0101.txt.2     01      01       2
## 2 transcripts 0101.txt.3     01      01       3
## 3 transcripts 0101.txt.4     01      01       4
## 4 transcripts 0101.txt.5     01      01       5
## 5 transcripts 0101.txt.7     01      01       7
## 6 transcripts 0101.txt.8     01      01       8

Now that we cleaned up our lines pretty well, we can extract the speaker and the actual spoken line from line:

lines4 <- lines3 %>%
    separate(line, into = c('speaker', 'line'), remove = TRUE, extra = "merge", sep = ":") %>%
    select(-c(rownames, folder))

head(lines4)

##    speaker
## 1   Monica
## 2     Joey
## 3 Chandler
## 4   Phoebe
## 5   Phoebe
## 6   Monica
##                                                                                                             line
## 1                                                       There's nothing to tell! He's just some guy I work with!
## 2                               C'mon, you're going out with the guy! There's gotta be something wrong with him!
## 3                                      All right Joey, be nice.  So does he have a hump? A hump and a hairpiece?
## 4                                                                                       Wait, does he eat chalk?
## 5                                Just, 'cause, I don't want her to go through what I went through with Carl- oh!
## 6  Okay, everybody relax. This is not even a date. It's just two people going out to dinner and- not having sex.
##   season episode lineNum
## 1     01      01       2
## 2     01      01       3
## 3     01      01       4
## 4     01      01       5
## 5     01      01       7
## 6     01      01       8

This (1) introduced NAs wherever a line contains no colon, which we will investigate below; and (2) probably produced “speakers” that aren’t actually speakers.

Let’s look at what the most frequent speakers are:

lines4 %>% 
    group_by(speaker) %>%
    count() %>%
    arrange(desc(n))

We notice that some transcripts had speaker names in all-caps, so we will turn all speaker names into all lowercase to group them:

lines4$speaker <- tolower(lines4$speaker)

There are also abbreviated names (four letters max) for the Friends other than Joey and Ross, like “mnca” for “Monica” - we’ll fix that:

lines4$speaker[lines4$speaker == "mnca"] <- "monica"
lines4$speaker[lines4$speaker == "rach"] <- "rachel"
lines4$speaker[lines4$speaker == "chan"] <- "chandler"
lines4$speaker[lines4$speaker == "phoe"] <- "phoebe"
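The same replacement can be written with a named lookup vector, which scales a little better if more abbreviations turn up. A minimal sketch in base R; the speaker vector here is a made-up stand-in for lines4$speaker, and short_names is a hypothetical name.

```r
# Stand-in for lines4$speaker (already lowercased)
speaker <- c("mnca", "rach", "joey", "chan", "phoe", "ross")

# Abbreviation -> full name, using the four pairs fixed above
short_names <- c(mnca = "monica", rach = "rachel",
                 chan = "chandler", phoe = "phoebe")

# Replace only the entries that appear in the lookup table
is_short <- speaker %in% names(short_names)
speaker[is_short] <- unname(short_names[speaker[is_short]])
print(speaker)
```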

Now we’ll look at the NAs:

head(lines4[!complete.cases(lines4),])

##              speaker line season episode lineNum
## 109 commercial break <NA>     01      01     122
## 168 commercial break <NA>     01      01     190
## 285  closing credits <NA>     01      01     319
## 304              end <NA>     01      01     339
## 315  opening credits <NA>     01      02      12
## 415 commercial break <NA>     01      02     125

We see that there is no line for things like commercial breaks, closing credits, etc., which makes sense.

We’ll remove all observations where the “speaker” is not an actual speaker:

lines4 <- lines4 %>%
    filter(!speaker %in% c("commercial break", "end", "closing credits", "opening credits", "opening titles", "commercial", "closing titles", "credits", "commerical break", "ending credits", "time lapse", "timelapse", "the end", "fade out"))
head(lines4[!complete.cases(lines4),])

## [1] speaker line    season  episode lineNum
## <0 rows> (or 0-length row.names)

There were 410 rows that were not caught by the above procedure, where line was empty mostly because the actual line has ended up (incorrectly) as speaker. These were mostly songs by Phoebe and Joey; bracketing errors; and typos in “commercial break,” for example. This needed manual investigation and fixing directly in the source txts, which is not displayed here. I used Atom, where you can search/replace across an entire project - or you can simply investigate the rows in the output and navigate to the season, episode, and line to fix things manually.
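To find such rows in the first place, one option is to list every row whose line is NA and tally the offending “speakers”. This is a sketch in base R with entirely made-up data standing in for lines4; the actual rows and counts in the transcripts will differ.

```r
# Made-up stand-in for lines4, with two problem rows
lines4_demo <- data.frame(
  speaker = c("monica", "commercial braek", "smelly cat, smelly cat", "joey"),
  line    = c(" Hi!", NA, NA, " Hey!"),
  season  = c("01", "01", "02", "02"),
  episode = c("01", "01", "06", "06"),
  lineNum = c(2L, 122L, 45L, 46L),
  stringsAsFactors = FALSE
)

# Rows where no spoken line was extracted
problems <- lines4_demo[is.na(lines4_demo$line), ]

# Season, episode, and line number tell us where to look in the source files
print(problems[, c("speaker", "season", "episode", "lineNum")])

# Most frequent offending "speakers" first
problem_counts <- sort(table(problems$speaker), decreasing = TRUE)
print(problem_counts)
```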


Tokenize

Now we break the lines apart into separate words, all while retaining the other columns of information - a process that is called tokenization, or “unnesting”:

tokens <- lines4 %>%
    unnest_tokens(word, line)
head(tokens)

##     speaker season episode lineNum    word
## 1    monica     01      01       2 there's
## 1.1  monica     01      01       2 nothing
## 1.2  monica     01      01       2      to
## 1.3  monica     01      01       2    tell
## 1.4  monica     01      01       2    he's
## 1.5  monica     01      01       2    just

Final Bits

Some column types aren’t exactly what we’d like them to be:

str(tokens)

## 'data.frame':    634211 obs. of  5 variables:
##  $ speaker: chr  "monica" "monica" "monica" "monica" ...
##  $ season : chr  "01" "01" "01" "01" ...
##  $ episode: chr  "01" "01" "01" "01" ...
##  $ lineNum: Factor w/ 605 levels "1","10","100",..: 112 112 112 112 112 112 112 112 112 112 ...
##  $ word   : chr  "there's" "nothing" "to" "tell" ...

We’ll change them, and also fix the name of the speaker column:

tokens$speaker <- as.factor(tokens$speaker)
tokens$season <- as.integer(tokens$season)
tokens$episode <- as.integer(tokens$episode)
tokens$lineNum <- as.integer(as.character(tokens$lineNum))
names(tokens)[1] <- "Friend"
head(tokens)

##     Friend season episode lineNum    word
## 1   monica      1       1       2 there's
## 1.1 monica      1       1       2 nothing
## 1.2 monica      1       1       2      to
## 1.3 monica      1       1       2    tell
## 1.4 monica      1       1       2    he's
## 1.5 monica      1       1       2    just
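The detour through as.character() for lineNum matters: calling as.integer() directly on a factor returns the internal level codes, not the numbers shown. A tiny illustration in base R with a made-up factor:

```r
# A factor of numeric-looking strings; levels are sorted as text: "10", "100", "2"
f <- factor(c("2", "10", "100"))

codes  <- as.integer(f)                # level codes, not the values
values <- as.integer(as.character(f))  # the actual numbers

print(codes)
print(values)
```

This is also why the str() output above lists the levels in the order "1", "10", "100": they are sorted alphabetically, not numerically.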

We’ll also capitalize the first letter of each name:

tokens <- tokens %>%
    # sub() returns a character vector, so wrap it in factor() to keep Friend a factor
    mutate(Friend = factor(sub("(.)", "\\U\\1", Friend, perl = TRUE)))

To be safe, we’ll check if there are any words starting with an opening bracket left over:

tokens %>%
    filter(str_detect(word, pattern = "^\\("))

## [1] Friend  season  episode lineNum word   
## <0 rows> (or 0-length row.names)

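There aren’t any! (That is expected, since unnest_tokens() strips punctuation by default; a stricter check would look for leftover brackets in lines4$line before tokenizing.)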

We’ll clean up the workspace a little, and save our final df:

rm(list = ls(pattern = "^fu"))
rm(list = ls(pattern = "^li"))
rm(list = ls(pattern = "^sp"))
rm(list = ls(pattern = "^te"))
saveRDS(tokens, file = "FriendsMining-tokens.rds")
# row.names = FALSE keeps the "1", "1.1", ... row names out of the CSV
write.csv(tokens, file = "FriendsMining-tokens.csv", row.names = FALSE)

Next up: S1 E2, The One with the Most Frequent Words.

Isabell Hubert Lyall
PhD Candidate in Experimental Psycholinguistics | Vice-Chair ETS Advisory Board

PhD Candidate researching the influence of extra-linguistic information on language comprehension; affiliated with the Centre for Comparative Psycholinguistics at the University of Alberta. Also the Vice-Chair of the Edmonton Transit Service Advisory Board (ETSAB), which advises City Council in transit-related matters.
