Modeling Intent in R and/or Python

Learning or experimenting with tidytext has been on my radar for at least a few years, but only recently did I have a need to pick it up. As with most learning, it leads you down more paths of knowledge (read: rabbit holes) than you foresaw. This post is a hat-tip to the resources I used, knitting them together in a sample use case, with an extension using parallel processing for the R implementation.

First mention must go to Manuel Amunategui for his post on intent modeling. Manuel shares a link to his Python code in the post, and I had some joy running his code after making a few changes. You can find my working notebook here. I used Windows and VS Code for the Python implementation. The general logic Manuel uses for modeling intent is to:

  1. Tag the transcripts to identify the most common verbs and nouns
  2. Cluster the observations based on the presence of these key verbs and key nouns
  3. Conduct n-gram analysis on the clusters to identify patterns of keyword sequences
  4. Use the keyword sequences to identify the relevant transcripts worth reviewing more closely to understand intent

It is such a simple yet powerful approach, made possible thanks to the spaCy package in Python for parts-of-speech tagging and, of course, the nltk package for n-gram analysis. For the R implementation I used spacyr and tidytext.
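For reference, the R session setup is roughly the following; this is a sketch that assumes spaCy and a small English model are already installed somewhere spacyr can find them.

```r
# Core packages for the R implementation
library(spacyr)    # parts-of-speech tagging and noun phrases via spaCy
library(tidytext)  # tokenization and n-gram analysis
library(dplyr)

# Start a spaCy session with a small English model
spacy_initialize(model = "en_core_web_sm")
```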

The Data

Manuel uses the Consumer Complaints Database from the Bureau of Consumer Financial Protection. It is available at data.gov, which looks like a great source of open data for experimenting with models. There are over 2.4 million chat transcripts; for the purposes of our analysis we look at just 200,000 of them, the number Manuel uses in his original notebook.
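As a sketch of that sampling step in R, assuming the CSV export from data.gov is saved locally as `complaints.csv` and that the narrative column holds the transcript text (both the file name and the column name are assumptions about the download):

```r
library(readr)
library(dplyr)

complaints <- read_csv("complaints.csv")

# Keep complaints that actually have a narrative, then sample 200k of them
set.seed(42)
transcripts <- complaints %>%
  filter(!is.na(`Consumer complaint narrative`)) %>%
  sample_n(200000) %>%
  pull(`Consumer complaint narrative`)
```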

What about R?

So Manuel’s Python code (after a few minor adjustments for my environment) runs swimmingly, which is great, and again, hat-tip to Manuel. However, when I originally set out on my intent modeling journey using the approach Manuel describes, I did so in R. For parts-of-speech tagging I used the spacyr package, and to speed up the tagging I wrapped it in a foreach “loop” using all my cores (see the sketch after the excerpt below). In addition to identifying key verbs and key nouns, I used noun phrases as well. What are noun phrases? From the spaCy docs:

Noun chunks are “base noun phrases” – flat phrases that have a noun as their head. You can think of noun chunks as a noun plus the words describing the noun – for example, “the lavish green grass” or “the world’s largest tech fund”.

https://spacy.io/usage/linguistic-features
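As a rough sketch, the parallel tagging mentioned above might look like the following. The spacyr, foreach and doParallel calls are real; the chunk size and the `transcripts` character vector are assumptions carried over from the data step.

```r
library(spacyr)
library(foreach)
library(doParallel)

registerDoParallel(cores = parallel::detectCores())

# Split the 200k transcripts into chunks so each core tags a chunk at a time
chunks <- split(transcripts, ceiling(seq_along(transcripts) / 10000))

parsed <- foreach(chunk = chunks,
                  .combine = rbind,
                  .packages = "spacyr") %dopar% {
  # Each worker needs its own spaCy session
  spacy_initialize(model = "en_core_web_sm")
  # POS tags plus noun phrase annotations for this chunk
  spacy_parse(chunk, pos = TRUE, lemma = FALSE, entity = FALSE,
              nounphrase = TRUE)
}
```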

The reason for using noun phrases was to increase the information available from tokenizing the transcripts and computing n-gram analyses. For example, the noun phrase “a legitimate claim” becomes “a_legitimate_claim”, and therefore a single token, when its words are joined with underscores. A tri-gram analysis where one or more of the tokens is a noun phrase should theoretically hold more information than one without. The tradeoff, of course, is a lower count for the individual words making up the noun phrase. Whether this tradeoff makes sense will depend on the use case.
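A rough sketch of that consolidation step in spacyr, assuming `parsed` is the tagged output from the block above (spacyr assigns consolidated phrases their own pos label, so double-check the label in your own output):

```r
library(spacyr)
library(dplyr)

# Collapse each multi-word noun phrase into a single underscore-joined token,
# e.g. "a legitimate claim" -> "a_legitimate_claim"
consolidated <- nounphrase_consolidate(parsed, concatenator = "_")

# Top 20 tokens per part of speech; verbs, nouns and the consolidated
# noun phrases each sit under their own pos label
consolidated %>%
  count(pos, token, sort = TRUE) %>%
  group_by(pos) %>%
  slice_max(n, n = 20, with_ties = FALSE) %>%
  ungroup()
```

With the noun phrases consolidated, let’s look at some results: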

We see the top 20 verbs, nouns and noun phrases.

Top 20 verbs
Top 20 nouns
Top 20 noun phrases

Note the masked personal data, which is there merely for Data Loss Prevention (DLP) purposes. It would be appropriate to remove those tokens, but for this toy example I felt it beneficial to point out the nuance.

As Manuel points out, the threshold for determining which verbs, nouns and, in our case, noun phrases to include as keywords will depend on your data. We use a threshold of 10,000 occurrences. There are 115 verbs, 140 nouns and 74 noun phrases that exceed this threshold in the 200k chat transcripts in our sample. We combine these keywords and one-hot encode the presence of each keyword in each of the 200k transcripts. Below is a snippet of the resulting matrix:

one-hot encoded keyword matrix
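That matrix might be built roughly as follows. This is a sketch that assumes `consolidated` is the token data frame from above and that consolidated phrases carry a "nounphrase" pos label; the 10k cutoff matches the threshold described.

```r
library(dplyr)
library(tidyr)

# Keywords = key verbs, nouns and noun phrases appearing more than 10,000 times
keywords <- consolidated %>%
  filter(pos %in% c("VERB", "NOUN", "nounphrase")) %>%
  count(token) %>%
  filter(n > 10000) %>%
  pull(token)

# One row per transcript, one 0/1 column per keyword
keyword_matrix <- consolidated %>%
  filter(token %in% keywords) %>%
  distinct(doc_id, token) %>%
  mutate(present = 1L) %>%
  pivot_wider(names_from = token, values_from = present, values_fill = 0L)
```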

At this point we can employ a `kmeans` clustering algorithm to cluster the observations based on the presence of the keywords. We follow the method used by Manuel and specify 50 centers; a rough sketch follows below. After replacing the noun phrases in the original text with their underscored equivalents, we can perform some n-gram analysis to understand the frequency of particular word sequences.
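A minimal sketch of that clustering step, using the `keyword_matrix` built above:

```r
library(dplyr)

set.seed(123)  # arbitrary seed for reproducibility

# Cluster on the 0/1 keyword columns only, keeping doc_id out of the features
km <- kmeans(select(keyword_matrix, -doc_id), centers = 50)

# Attach the cluster assignment back to each transcript
keyword_matrix <- keyword_matrix %>%
  mutate(cluster = km$cluster)
```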

Of course, we perform the n-gram analysis within each cluster. Looking at the number of observations per cluster, it appears cluster 15 has the most observations; we will use that cluster for illustrative purposes.

observations per cluster
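The per-cluster 4-gram counts come from tidytext. The sketch below assumes a data frame `transcripts_df` holding the underscored text in a `text` column along with the cluster assignment joined on from the k-means result; those object and column names are assumptions.

```r
library(dplyr)
library(tidytext)

# 4-grams for the largest cluster; underscored noun phrases count as one token
transcripts_df %>%
  filter(cluster == 15) %>%
  unnest_tokens(fourgram, text, token = "ngrams", n = 4) %>%
  count(fourgram, sort = TRUE) %>%
  slice_head(n = 20)
```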

We see the most frequent 4-gram from cluster 15 has to do with people complaining about identity theft. It should also be clear how tokenizing noun phrases gives us additional information: instead of a 4-gram spanning only 4 words, we see 5 words, thanks to spaCy and noun chunks.

Top 4-grams in cluster 15

At this point, it is possible to tie the n-gram back to the actual complaints and review them for a better sense of the issues related to identity theft.

Complaints including the term “a victim of identity theft”
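A filter along these lines (same assumed `transcripts_df`) would surface those complaints for a closer read:

```r
library(dplyr)
library(stringr)

# Strip the underscores before matching so the phrase is found regardless of
# which of its words were consolidated into a noun phrase token
identity_theft <- transcripts_df %>%
  filter(cluster == 15,
         str_detect(str_replace_all(text, "_", " "),
                    fixed("a victim of identity theft")))
```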

Well, that’s it. I hope you enjoyed reading. All resources are below, including links to my code. Cheers!

Resources
