layout: true

<div class="my-footer"><span>juliasilge.github.io/why-r-webinar/</span></div>

---

class: inverse, left, middle
background-image: url(figs/patrick-fore-0gkw_9fy0eQ-unsplash.jpg)
background-size: cover

# Understanding
# Word
# Embeddings

### Julia Silge | 4 June 2020

---

class: inverse, left, bottom
background-image: url(figs/patrick-fore-0gkw_9fy0eQ-unsplash.jpg)
background-size: cover

# Find me at...

<a href="http://twitter.com/juliasilge" style="color: white;"><i class="fa fa-twitter fa-fw"></i> @juliasilge</a><br>
<a href="http://github.com/juliasilge" style="color: white;"><i class="fa fa-github fa-fw"></i> @juliasilge</a><br>
<a href="https://juliasilge.com" style="color: white;"><i class="fa fa-link fa-fw"></i> juliasilge.com</a><br>
<a href="https://tidytextmining.com" style="color: white;"><i class="fa fa-book fa-fw"></i> tidytextmining.com</a><br>
<a href="mailto:julia.silge@gmail.com" style="color: white;"><i class="fa fa-paper-plane fa-fw"></i> julia.silge@gmail.com</a>

---

class: inverse, center, middle

# 📑 TEXT AS DATA 📊

---

# Text as data

Let's look at complaints submitted to the [United States Consumer Financial Protection Bureau (CFPB)](https://www.consumerfinance.gov/data-research/consumer-complaints/).

```r
library(tidyverse)

complaints <- read_csv("complaints.csv.gz")
names(complaints)
```

```
## [1] "complaint_id"                 "date_received"               
## [3] "product"                      "issue"                       
## [5] "company"                      "state"                       
## [7] "consumer_complaint_narrative"
```

---

# Text as data

```r
complaints %>%
  sample_n(10) %>%
  pull(consumer_complaint_narrative)
```

```
## [1] "Love, Beal, and Nixon attempts to collect from XXXX individuals with federally protected, exempt funds."
## [2] "I have been charged an \" credit life insurance fee '' of various amounts. The website does not even allow me to look at statements further than XX/XX/2018. It says access blocked which is absurd since it is my account. \n\nStarting from XXXX XXXX, I was charged {$26.00} XXXX XXXX - {$39.00} XXXX XXXX - {$33.00} XXXX XXXX - {$34.00} XXXX XXXX. - {$34.00} XXXX. XXXX - {$33.00}"
## [3] "For the last 2 plus years, I have not received a copy of my monthly billing statement nor have I been afforded the proper application of my monies I paid to this account. There is a huge discrepancy with my auto loan payments and I have also asked for the total amounts of interest paid since loan inception which began in XX/XX/2016 to current and the presidents office in Texas as well as the auto loan office in California failed to send me accurate account of my loan as last year it stated I paid over XXXX dollars and now my online account states I have only paid slightly over XXXX. They will not release my total payments towards principal vs interest for the years 2016 to current on their letterhead although I was informed it was mailed. I need this document as well for tax purposes. Additionally they are not reporting my account balance monthly to the credit bureau, I have spoke with Capital one on this as well, and no one seems to know how my payments are being applied, it is a simple interest loan and thats about all i get. if you look at my ledger, if i pay early they take a huge chunk of my payment and apply to interest, but if i pay post my due date which is the XXXX of the month, they apply a good amount towards principal and less towards interest. Doesnt add up and no one is talking. I advised if i do not receive my billing statements for the last 2 plus years, as i have called and msg them constantly and nothing. 
i also have been blocked from msg them inhouse as well as looking at my statements online and nothing received by mail. they have blocked me out of my account and play XXXX as nothing is going on with my account, with me having a prior background in banking and financials i know this is a bunch of malarkey. I advised if i do not receive a statement, no more monies will be paid to this account as i only purchased the vehicle for XXXX and i was to have a gap policy of XXXX added to the account but that is even wrong. I am done. I want all of my documents and fyi this started from the beginning in XX/XX/2016 where they refused to send the dealership and myself a copy of my loan terms and payment slips or information on how i was to pay after having the car 1 month. i have only missed a few payments the 2nd year of having car due to furlough but i caught up and paid in advance in lump sums so nothing is adding up and need your assistance as i am no longer talking to this lender it is pointless. Lastly, they refused to send me a copy of my title as well last year so that I could register my car this is how dirty these ppl operate and just like i paid off my cc, i know my car is paid for, how am i paying XXXX more than what i bought the car for its a XXXX car, payments XXXX have had already 3.5 years, not behind in payments, but you still saying i owe XXXX smh.the math even with the interest doesnt add up" ## [4] "Sounds like a boiler room operation, call all three of our cell phone # 's never leave message, only can hear lots of talking in background on messages. They repeat calls all day and night." ## [5] "Please help me to understand what made you close my case, Ditech has not corrected anything, in fact, it has gotten much worse. I sent by email responding to your closed case email, copies of the loan modification and 2 years of payment history showing no missed or late payments. I can send them again if you need. I respectfully request an investigation and recompense. If you are not able to help, kindly direct me to a Consumer Protection Agency that will look into this. Ditech continues to misappropriate my funds in unlawful ways. I have attempted to reason with them so many times, I have lost count. I am XXXX and this continues to be a tremendous hardship for me. Please, someone needs to look into this. Thank you for your time." ## [6] "I am filing this complaint with CFPB because in the last 60 days I have been unable to receive any response whatsoever to my online and written complaint mailed to the HSBC Executive Office on XX/XX/XXXX at HSBC Bank USA, N.A., Executive Office, XXXX XXXX XXXX, XXXX, NY XXXX concerning the following : In XX/XX/XXXX, I did a Balance Transfer of XXXX ( including transfer fee ) with a 0 % APR until XX/XX/XXXX. In XX/XX/XXXX, I received my first interest charge on a balance of {$790.00}. The interest computations seem correct ; however, on my statement it reads : \" Interest Charge on Cash Advances ''. By the way, when I look at my transaction history online at the HSBC website, it reads \" Finance Charge Cash Advance '', instead of \" Interest Charge on Cash Advances ''. Either way this is incorrect when listed as a Cash Advance. \nI noticed this error on XX/XX/XXXX when making my timely online payment. \nOn that same date, XX/XX/XXXX, I contacted HSBC via their chat line. I was told that the Balance Transfer showing as a Cash Advance was a computer error on the part of HSBC. 
I opened up complaint number XXXX and sent a written complaint to the HSBC Executive Office address listed above. \nIn my written complaint, I advised HSBC that I was entitled to an accurate statement reflecting MY transactions. Attached to my written complaint was the chat line conversation that had been emailed to me as documentation. This chat documentation also included the Executive Office address where I was directed to mail my complaint. \nAlso included in my complaint letter were portions of HSBC 's own credit card agreement showing that Purchases, Cash Advances, and Balance Transfers are distinctly different transactions. Additionally, according to HSBC 's own credit card agreement, I am entitled to a response to a written complaint within 30 days. So 30 days later when I again went to make my timely online payment on or about XX/XX/XXXX, I called HSBC to follow up on my complaint. I was advised that my complaint would be resubmitted. \nIt is now XX/XX/XXXX and I am once again making a timely online payment and, to-date, I have received no response whatsoever from HSBC. As a note, I always make my payments online, although the address to mail payments is : HSBC Bank USA, N.A. at XXXX XXXX XXXX, XXXX XXXX IL XXXX. \nAdditional note : In my original letter in the RE section, I indicated the complaint number as XXXX. The correct complaint number XXXX is in the chat documentation attached to my complaint. In my XX/XX/XXXX phone call to HSBC, I was advised of this error and added a notation to my copy of the complaint letter in the RE section. I did not send a new letter to HSBC because they had more than enough information to address and handle my complaint. I have attached a copy of my letter. The original letter was printed out, signed and mailed to HSBC to XX/XX/XXXX." ## [7] "I received a copy of my Equifax credit report and noticed that I have these inquiries. I have previously asked/disputed ( 2 ) times for these hard inquiries to be removed. I have not received any documentation providing any information regarding these inquiries and since this is this the case, I hereby request you method of verification for these inquires on my credit report ( both hard & soft inquiries ). You are required to have documentation providing my consent and permissible purpose. If you lack these documents, I ask you to delete these immediately. This is not a dispute ; this is a request for your method of verification.\n\n- Must I remind you that : According to the Fair Credit Reporting Act, Section 604, a creditor shouldnt have access to a consumer 's credit information unless the individual himself gives written permission, or unless credit access is court ordered or requested by a state or local government agency in relation to child support.\n\nIncorrect Personal Information Previously disputed. \n- I am requesting that you delete these inaccurate inquiries from my Credit Report immediately and forward an updated copy of my Credit Report after you have corrected this information. \n- By law, in reference to my rights under the Fair Credit Reporting Act, I deserve and expect a timely response from your Credit Bureau to my dispute. \nUnrecognized/ Authorized inquiries previously disputed. I now ask for your method of verification for these. Either furnish proper verification or delete these immediately. \n- I am requesting that you delete this inaccurate inquiry from my Credit Report immediately and forward an updated copy of my Credit Report after you have corrected this information. 
\n- By law, in reference to my rights under the Fair Credit Reporting Act, I deserve and expect a timely response from your Credit Bureau to my dispute. \nLastly, be sure to include the name and address of the Company or individuals who were directly contacted in regards to this matter. \n\nFormer Address 1 : XXXX XXXX XXXX XXXX XXXX XXXX XXXX, IN XXXX First Reported XX/XX/XXXX Last Reported XX/XX/XXXX Former Address 2 : XXXX XXXX XXXX XXXX XXXX, IN XXXX First Reported XX/XX/XXXX Last Reported XX/XX/XXXX This letter was sent Certified mail / Return receipt requested & Notarized. Litigation will be pending based on your actions per this letter. \nCertified Mail # XXXX XXXX XXXX XXXX XXXX XXXX Tracking # XXXX XXXX XXXX XXXX XXXX XXXX"
## [8] "I have been a victim of Identity Theft. The following accounts are fraudulent : To protect myself from further harm, I have filed a consumer protection and an Identity Theft report. ( See attached reports. ) I did not authorize these fraudulent accounts. I have also provided you with copies of my social, current and past address and license."
## [9] "This has been on going on several month. \n\ni am NOT able to obtain Credit Report from Transunion i called them several times to correct the issue but i just get run around and nothing gets resolved. \n\nthis is very frustrating and affecting my personal life. \n\nplease help!"
## [10] "I recently applied for a furnisher account long story short I was denied. I dont understand why I would be because I typically approved for everything I apply for. So I checked my credit report so that maybe I would get a better understanding. After doing so I notice some items on my credit report are not mine. Im unsure of how this happened."
```

---

# Text as data

What is a typical way to represent this text data for modeling?

```r
library(tidytext)
library(SnowballC)

complaints %>%
  unnest_tokens(word, consumer_complaint_narrative) %>%
  anti_join(get_stopwords()) %>%
  mutate(stem = wordStem(word)) %>%
  count(complaint_id, stem) %>%
  bind_tf_idf(stem, complaint_id, n) %>%
  cast_dfm(complaint_id, stem, tf_idf)
```

```
## Document-feature matrix of: 67,212 documents, 34,261 features (99.8% sparse).
##          features
## docs      account        auto       bank       call       charg      chase        dai       date
##   3113204      NA 0.007991838 0.09791573 0.04672194 0.015342414 0.02489768 0.05162948 0.05887824 0.024517773
##   3113208       0 0.001205381 0          0.02114071 0.006942125 0.01877614 0          0          0.003697929
##   3113805       0 0.003086795 0          0          0           0          0          0          0
##   3113811      NA 0.013890576 0          0          0           0          0          0          0
##   3113816       0 0           0          0          0           0          0          0          0
##   3113817       0 0           0          0.06435286 0           0.06858606 0          0          0
##          features
## docs           dollar
##   3113204 0.046466365
##   3113208 0.007008357
##   3113805 0
##   3113811 0
##   3113816 0
##   3113817 0
## [ reached max_ndoc ... 67,206 more documents, reached max_nfeat ... 34,251 more features ]
```

---

class: inverse, left, bottom
background-image: url(figs/patrick-fore-0gkw_9fy0eQ-unsplash.jpg)
background-size: cover

# This representation...

- .large[is incredibly sparse]
- .large[of high dimensionality]
- .large[with a huge number of features]

---

class: inverse, center, middle

# 📄 WORD EMBEDDINGS 📔

---

class: right, middle

<h1 class="fa fa-quote-left fa-fw"></h1>

<h1> You shall know a word by the company it keeps. 
</h1>

<h1 class="fa fa-quote-right fa-fw"></h1>

.large[John Rupert Firth]

---

class: inverse

# Modern word embeddings

--

- word2vec

--

- GloVe

--

- fastText

--

- deep contextualized language models like ULMFiT and ELMo

---

class: inverse, left, top
background-image: url(figs/patrick-fore-0gkw_9fy0eQ-unsplash.jpg)
background-size: cover

# We can determine word embeddings using...

- .large[word counts]
- .large[matrix factorization]

.footnote[
<a href="https://multithreaded.stitchfix.com/blog/2017/10/18/stop-using-word2vec/" style="color: white">Moody, Chris. "Stop using word2vec." MultiThreaded blog (2017).</a>
]

---

# Counting words

First, we tokenize and transform this dataset to a [tidy data structure](https://www.tidytextmining.com/).

```r
tidy_complaints <- complaints %>%
  select(complaint_id, consumer_complaint_narrative) %>%
  unnest_tokens(word, consumer_complaint_narrative) %>%
  group_by(word) %>%
  filter(n() >= 50) %>%   ## keep only words used at least 50 times
  ungroup()

tidy_complaints
```

---

# Counting words

First, we tokenize and transform this dataset to a [tidy data structure](https://www.tidytextmining.com/).

```
## # A tibble: 12,700,715 x 2
##    complaint_id word    
##           <dbl> <chr>   
##  1      3147294 on      
##  2      3147294 xx      
##  3      3147294 xx      
##  4      3147294 2019    
##  5      3147294 i       
##  6      3147294 have    
##  7      3147294 previous
##  8      3147294 sent    
##  9      3147294 in      
## 10      3147294 disputes
## # … with 12,700,705 more rows
```

---

# Counting words

Next, we can create nested dataframes.

```r
nested_words <- tidy_complaints %>%
  nest(words = c(word))

nested_words
```

```
## # A tibble: 67,186 x 2
##    complaint_id words             
##           <dbl> <list>            
##  1      3147294 <tibble [169 × 1]>
##  2      3189876 <tibble [111 × 1]>
##  3      3475559 <tibble [134 × 1]>
##  4      3134951 <tibble [510 × 1]>
##  5      3356633 <tibble [692 × 1]>
##  6      3347791 <tibble [86 × 1]>
##  7      3270110 <tibble [101 × 1]>
##  8      3323749 <tibble [119 × 1]>
##  9      3481028 <tibble [572 × 1]>
## 10      3239639 <tibble [33 × 1]>
## # … with 67,176 more rows
```

---

# Time to SLIDE ⚡

Let's identify **windows** in order to calculate the skipgram probabilities.

---

# Time to SLIDE ⚡

```r
slide_windows <- function(tbl, window_size) {
  ## each window holds `window_size` consecutive words
  skipgrams <- slider::slide(
    tbl, 
    ~.x, 
    .after = window_size - 1, 
    .step = 1, 
    .complete = TRUE
  )

  ## incomplete windows at the end are NULL, and mutate() errors on
  ## NULL, so wrap it with safely()
  safe_mutate <- safely(mutate)

  ## give each window its own id
  out <- map2(skipgrams,
              seq_along(skipgrams),
              ~ safe_mutate(.x, window_id = .y))

  ## keep only the successful results and row-bind them
  out %>%
    transpose() %>%
    pluck("result") %>%
    compact() %>%
    bind_rows()
}
```

---

class: inverse

# Window size? 🤔

--

- Determines what kind of semantic meaning the embeddings capture

--

- A smaller window size (3-4) focuses on how each word is used and learns which other words are functionally similar

--

- A larger window size (~10) captures the domain or topic of each word

---

class: inverse

# Point-wise mutual information

--

- How often do words occur on their own?

--

- How often do words occur together with other words?

--

- PMI is a measure of association that compares these

--

- PMI is the logarithm of the probability of finding two words together, normalized by the probability of finding each word alone: PMI(x, y) = log( p(x, y) / (p(x) p(y)) )

---

# Calculate PMI

We use PMI to measure which words occur together more often than expected, given how often they occur on their own.

```r
library(widyr)
library(furrr)

plan(multisession)  ## for parallel processing

tidy_pmi <- nested_words %>%
  mutate(words = future_map(words, slide_windows, 4)) %>%
  unnest(words) %>%
  unite(window_id, complaint_id, window_id) %>%
  pairwise_pmi(word, window_id)
```

---

# Calculate PMI

When PMI is high, the two words are associated with each other and likely to occur together.
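Printing `tidy_pmi` shows one PMI score for each pair of words:

```r
tidy_pmi
```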
```
## # A tibble: 3,415,366 x 3
##    item1      item2    pmi
##    <chr>      <chr>  <dbl>
##  1 xx         on    1.46  
##  2 2019       on    0.976 
##  3 i          on   -1.23  
##  4 have       on   -1.04  
##  5 previous   on   -0.445 
##  6 sent       on   -0.909 
##  7 in         on   -1.49  
##  8 disputes   on   -0.0725
##  9 to         on   -1.28  
## 10 transunion on    0.109 
## # … with 3,415,356 more rows
```

---

# Time for word vectors! 🎉

We determine word vectors using singular value decomposition.

```r
tidy_word_vectors <- tidy_pmi %>%
  widely_svd(
    item1, item2, pmi,
    nv = 100, maxit = 1000
  )
```

---

# Time for word vectors! 🎉

We determine word vectors using singular value decomposition.

```r
tidy_word_vectors
```

```
## # A tibble: 585,500 x 3
##   item1    dimension   value
##   <chr>        <int>   <dbl>
## 1 xx               1 -0.0577
## 2 2019             1 -0.0608
## 3 i                1 -0.0131
## 4 have             1 -0.0225
## 5 previous         1 -0.0252
## 6 sent             1 -0.0504
## 7 in               1 -0.0170
## 8 disputes         1 -0.0192
## 9 to               1 -0.0103
## # … with 585,491 more rows
```

---

class: inverse, left, bottom

## Each word can be represented as a numeric vector in this

- .large[new,]
- .large[dense,]
- .large[100-dimensional]

## feature space.

---

# Explore CFPB word embeddings

Which words are close to each other in this new feature space of word embeddings?

```r
nearest_neighbors <- function(df, token) {
  df %>%
    ## dot product of every word vector with the vector for `token`
    widely(~ . %*% (.[token, ]), 
           sort = TRUE, maximum_size = NULL)(item1, dimension, value) %>%
    select(-item2)
}
```

---

# Explore CFPB word embeddings

```r
tidy_word_vectors %>%
  nearest_neighbors("error")
```

```
## # A tibble: 5,855 x 2
##    item1      value
##    <chr>      <dbl>
##  1 error     0.0408
##  2 issue     0.0274
##  3 problem   0.0248
##  4 errors    0.0242
##  5 issues    0.0220
##  6 system    0.0220
##  7 mistake   0.0217
##  8 correct   0.0160
##  9 incorrect 0.0159
## 10 fraud     0.0144
## # … with 5,845 more rows
```

---

# Explore CFPB word embeddings

```r
tidy_word_vectors %>%
  nearest_neighbors("month")
```

```
## # A tibble: 5,855 x 2
##    item1     value
##    <chr>     <dbl>
##  1 month    0.0612
##  2 months   0.0374
##  3 year     0.0335
##  4 payment  0.0332
##  5 payments 0.0318
##  6 years    0.0311
##  7 amount   0.0280
##  8 pay      0.0275
##  9 days     0.0270
## 10 monthly  0.0261
## # … with 5,845 more rows
```

---

# Explore CFPB word embeddings

```r
tidy_word_vectors %>%
  nearest_neighbors("fee")
```

```
## # A tibble: 5,855 x 2
##    item1      value
##    <chr>      <dbl>
##  1 fee       0.0881
##  2 fees      0.0612
##  3 charge    0.0479
##  4 charged   0.0467
##  5 interest  0.0435
##  6 late      0.0340
##  7 charges   0.0333
##  8 overdraft 0.0332
##  9 charging  0.0287
## 10 per       0.0279
## # … with 5,845 more rows
```

---

# Explore CFPB word embeddings

```r
tidy_word_vectors %>%
  filter(dimension <= 8) %>%
  group_by(dimension) %>%
  top_n(12, abs(value)) %>%
  ungroup() %>%
  ggplot(aes(value, item1, fill = as.factor(dimension))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~dimension, scales = "free_y", ncol = 4)
```

---

class: center

<img src="index_files/figure-html/unnamed-chunk-18-1.png" width="95%" />

---

class: center

<img src="index_files/figure-html/unnamed-chunk-19-1.png" width="95%" />

---

# Embeddings in modeling

The classic and simplest approach is to treat each document as a collection of words and summarize the word embeddings into **document embeddings**.

```r
word_matrix <- tidy_complaints %>%
  count(complaint_id, word) %>%
  cast_sparse(complaint_id, word, n)

embedding_matrix <- tidy_word_vectors %>%
  cast_sparse(item1, dimension, value)

doc_matrix <- word_matrix %*% embedding_matrix

dim(doc_matrix)
```

```
## [1] 67186   100
```
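This matrix multiplication sums the embedding of each word in a document, weighted by how often the word appears. A minimal sketch if you would rather average than sum (the `Matrix` package usage and the `doc_means` name here are illustrative, not part of the original analysis):

```r
library(Matrix)

## divide each document's summed embedding by its word count,
## turning the sum into a mean
doc_means <- Diagonal(x = 1 / rowSums(word_matrix)) %*% doc_matrix
```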
---

class: inverse, center, middle

### 😠 WHAT IF YOUR DATASET IS TOO SMALL? 😩

---

# Try pre-trained word embeddings

```r
library(textdata)

glove6b <- embedding_glove6b(dimensions = 100)
```

---

# Try pre-trained word embeddings

```r
glove6b
```

```
## # A tibble: 400,000 x 101
##    token      d1      d2      d3      d4      d5     d6      d7      d8      d9
##    <chr>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>  <dbl>   <dbl>   <dbl>   <dbl>
##  1 "the" -0.0382 -0.245   0.728  -0.400   0.0832 0.0440 -0.391  0.334  -0.575 
##  2 ","   -0.108   0.111   0.598  -0.544   0.674  0.107   0.0389 0.355   0.0635
##  3 "."   -0.340   0.209   0.463  -0.648  -0.384  0.0380  0.171  0.160   0.466 
##  4 "of"  -0.153  -0.243   0.898   0.170   0.535  0.488  -0.588 -0.180  -1.36  
##  5 "to"  -0.190   0.0500  0.191  -0.0492 -0.0897 0.210  -0.550  0.0984 -0.201 
##  6 "and" -0.0720  0.231   0.0237 -0.506   0.339  0.196  -0.329  0.184  -0.181 
##  7 "in"   0.0857 -0.222   0.166   0.134   0.382  0.354   0.0129 0.225  -0.438 
##  8 "a"   -0.271   0.0440 -0.0203 -0.174   0.644  0.712   0.355  0.471  -0.296 
##  9 "\""  -0.305  -0.236   0.176  -0.729  -0.283 -0.256   0.266  0.0253 -0.0748
## 10 "'s"   0.589  -0.202   0.735  -0.683  -0.197 -0.180  -0.392  0.342  -0.606 
## # … with 399,990 more rows, and 91 more variables: d10 <dbl>, d11 <dbl>,
## #   d12 <dbl>, d13 <dbl>, d14 <dbl>, d15 <dbl>, d16 <dbl>, d17 <dbl>,
## #   d18 <dbl>, d19 <dbl>, …
```

---

# Try pre-trained word embeddings

To compare to the word embeddings we created ourselves, let's transform these to a tidy data structure.

```r
tidy_glove <- glove6b %>%
  pivot_longer(contains("d"), names_to = "dimension") %>%
  rename(item1 = token)
```

---

# Try pre-trained word embeddings

To compare to the word embeddings we created ourselves, let's transform these to a tidy data structure.

```
## # A tibble: 40,000,000 x 3
##    item1 dimension   value
##    <chr> <chr>       <dbl>
##  1 the   d1        -0.0382
##  2 the   d2        -0.245 
##  3 the   d3         0.728 
##  4 the   d4        -0.400 
##  5 the   d5         0.0832
##  6 the   d6         0.0440
##  7 the   d7        -0.391 
##  8 the   d8         0.334 
##  9 the   d9        -0.575 
## 10 the   d10        0.0875
## # … with 39,999,990 more rows
```

---

# Explore GloVe word embeddings

```r
tidy_glove %>%
  nearest_neighbors("error")
```

```
## # A tibble: 400,000 x 2
##    item1       value
##    <chr>       <dbl>
##  1 error        34.6
##  2 errors       28.1
##  3 data         19.8
##  4 inning       19.4
##  5 game         19.3
##  6 percentage   19.3
##  7 probability  19.2
##  8 unforced     19.1
##  9 fault        19.1
## 10 point        19.0
## # … with 399,990 more rows
```

---

# Explore GloVe word embeddings

```r
tidy_glove %>%
  nearest_neighbors("month")
```

```
## # A tibble: 400,000 x 2
##    item1     value
##    <chr>     <dbl>
##  1 month      32.4
##  2 year       31.2
##  3 last       30.6
##  4 week       30.5
##  5 wednesday  29.6
##  6 tuesday    29.5
##  7 monday     29.3
##  8 thursday   29.1
##  9 percent    28.9
## 10 friday     28.9
## # … with 399,990 more rows
```

---

# Explore GloVe word embeddings

```r
tidy_glove %>%
  nearest_neighbors("fee")
```

```
## # A tibble: 400,000 x 2
##    item1        value
##    <chr>        <dbl>
##  1 fee           39.8
##  2 fees          30.7
##  3 pay           26.6
##  4 $             26.4
##  5 salary        25.9
##  6 payment       25.9
##  7 £             25.4
##  8 tax           24.9
##  9 payments      23.8
## 10 subscription  23.1
## # … with 399,990 more rows
```

---

class: inverse, left, bottom
background-image: url(figs/patrick-fore-0gkw_9fy0eQ-unsplash.jpg)
background-size: cover

## Pre-trained word embeddings...

- encode rich semantic relationships

- can be less than ideal for specific tasks

---

class: inverse

# How can we integrate pre-trained word embeddings in modeling?
--

- Again, we can create simple document embeddings by summarizing

--

- The GloVe embeddings do not contain all the tokens in the CFPB complaints, and vice versa

---

```r
word_matrix <- tidy_complaints %>%
  inner_join(tidy_glove %>%
               distinct(item1) %>%
               rename(word = item1)) %>%
  count(complaint_id, word) %>%
  cast_sparse(complaint_id, word, n)

glove_matrix <- tidy_glove %>%
  inner_join(tidy_complaints %>%
               distinct(word) %>%
               rename(item1 = word)) %>%
  cast_sparse(item1, dimension, value)

doc_matrix <- word_matrix %*% glove_matrix

dim(doc_matrix)
```

```
## [1] 67182   100
```

---

class: inverse, left, bottom
background-image: url(figs/patrick-fore-0gkw_9fy0eQ-unsplash.jpg)
background-size: cover

# Fairness and
# Word
# Embeddings

---

# Fairness and word embeddings

--

- Embeddings are trained or learned from a large corpus of text data

--

- Human prejudice or bias in the corpus becomes imprinted into the embeddings

---

class: inverse

# Fairness and word embeddings

--

- African American first names are associated with more unpleasant feelings than European American first names

--

- Women's first names are more associated with family and men's first names are more associated with career

--

- Terms associated with women are more associated with the arts and terms associated with men are more associated with science

.footnote[
<a href="https://arxiv.org/abs/1608.07187" style="color: white">Caliskan, Bryson, and Narayanan. "Semantics derived automatically from language corpora contain human-like biases." Science 356.6334 (2017): 183–186.</a>
]

---

<img src="figs/turkish.png" style="display: block; margin: auto;" />

---

class: inverse, middle, center

## Bias is so ingrained in word embeddings that they can be used to quantify change in social attitudes over time

.footnote[
<a href="https://www.pnas.org/content/115/16/E3635" style="color: white">Garg, Nikhil, et al. "Word embeddings quantify 100 years of gender and ethnic stereotypes." Proceedings of the National Academy of Sciences 115.16 (2018): E3635-E3644.</a>
]

---

# Biased training data

--

- Embeddings are trained or learned from a large corpus of text data

--

- For example, consider the case of Wikipedia

--

- Wikipedia both reflects social/historical biases **and** generates bias

.footnote[
[Wagner, Claudia, et al. "Women through the glass ceiling: gender asymmetries in Wikipedia." EPJ Data Science 5.1 (2016): 5.](https://link.springer.com/article/10.1140/epjds/s13688-016-0066-4)
]

---

# Biased embeddings in models

Consider a straightforward sentiment analysis model trained to predict how positive text is.

**Compare:**

.pull-left[
"Let's go get Italian food!" 😊
]

.pull-right[
"Let's go get Mexican food!" 😕
]

.footnote[
[Speer, Robyn. "How to make a racist AI without really trying." ConceptNet blog (2017).](http://blog.conceptnet.io/posts/2017/how-to-make-a-racist-ai-without-really-trying/)
]

---

class: inverse, left, top
background-image: url(figs/patrick-fore-0gkw_9fy0eQ-unsplash.jpg)
background-size: cover

# Consider some options

--

- .large[Find your own embeddings]

--

- .large[Consider not using embeddings]

--

- .large[Can embeddings be debiased?]

---

class: inverse

# Can embeddings be debiased?
--

- Embeddings can be reprojected to mitigate a specific bias (such as gender bias) using specific sets of words

--

- Training data can be augmented with counterfactuals

--

- Other researchers suggest that fairness corrections occur at the point of decision or action, rather than in the embeddings themselves

--

- Evidence indicates that debiasing still allows stereotypes to seep back in

.footnote[
<a href="https://arxiv.org/abs/1903.03862" style="color: white">Gonen, Hila, and Yoav Goldberg. "Lipstick on a pig: Debiasing methods cover up systematic gender biases in word embeddings but do not remove them." arXiv preprint arXiv:1903.03862 (2019).</a>
]

---

class: inverse, left, bottom
background-image: url(figs/patrick-fore-0gkw_9fy0eQ-unsplash.jpg)
background-size: cover

## Word embeddings in the

# REAL WORLD

---

class: inverse, left
background-image: url(figs/patrick-fore-0gkw_9fy0eQ-unsplash.jpg)
background-size: cover

# Thanks!

<a href="http://twitter.com/juliasilge" style="color: white;"><i class="fa fa-twitter fa-fw"></i> @juliasilge</a><br>
<a href="http://github.com/juliasilge" style="color: white;"><i class="fa fa-github fa-fw"></i> @juliasilge</a><br>
<a href="https://juliasilge.com" style="color: white;"><i class="fa fa-link fa-fw"></i> juliasilge.com</a><br>
<a href="https://tidytextmining.com" style="color: white;"><i class="fa fa-book fa-fw"></i> tidytextmining.com</a><br>
<a href="mailto:julia.silge@gmail.com" style="color: white;"><i class="fa fa-paper-plane fa-fw"></i> julia.silge@gmail.com</a>

Slides created with <a href="http://remarkjs.com/" style="color: #5A6B8C;"><b>remark.js</b></a> and <a href="https://github.com/yihui/xaringan" style="color: #5A6B8C;"><b>xaringan</b></a>

Photo by <a href="https://unsplash.com/@patrickian4?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText" style="color: #5A6B8C;"><b>Patrick Fore</b></a> on <a href="https://unsplash.com/s/photos/letters?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText" style="color: #5A6B8C;"><b>Unsplash</b></a>